DW Gym - 3 day python challenge

Today I reveived certyficate of finish 3 day python challenge organized by Data Workshop. The challenge was in december of previous year but I think this is a good idea to repeat what we trained. The main topic on challenge was Pandas library. I think it's the most popular library in Python to work with data.

In this post I will show all commands used in challenge exercises. Also, I will write how to start with Pandas and how to run your code with some data and try to execute these commands.

Firstly, we need some data and environment A lot of datasets we can find on kaggle.com. I choosed set with video games. To eaisly work with data we can choose Jupyter. The fastest way is using it with docker.

How to run it ? Just execute a command:

docker run --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v ${PWD}:/home/jovyan/work jupyter/datascience-notebook

Remember: ${PWD} evaluated to current directory. I execute command in directory when I save dataset file.

Working with Pandas

With short explanations.

Import library to using it. pd is only alias, but using very often and treat as good practice.

import pandas as pd

Load csv dataset. Of course datasets can be in other formats like xlsx. Typing only pd and executed shows result like print() function

df = pd.read_csv('./work/Video_Games_Sales_as_at_22_Dec_2016.csv')
df

To show some rows without printing everything

df.head()

The numbers of (rows, columns)

df.shape
    (16719, 17)

For each column we can display unique values. In this example for Platform column

df.Platform.unique()

Fuction to count how many rows have a unique value for column. In this example for Platform

df.Platform.value_counts()

Function apply on data frame alows to work with each row. In this example we not doing anything. Just returned row

df.apply(lambda row: row, axis=1)

The same like above but we returning keys and it's mean - column names

df.apply(lambda row: row.keys(), axis=1)

Choosing only one column from row

df.apply(lambda row: row['Platform'], axis=1)

Add statement to check data. It's evaluate to True and thats why we see only booleans on output

df.apply(lambda row: row['Global_Sales'] > 30, axis=1)

Mix using apply() and `valuecounts()`_

df['Best_Global_30'] = df.apply(lambda row: row['Global_Sales'] > 30, axis=1)
df['Best_Global_30'].value_counts()

Get `valuecounts()of column and choose only by statement. Pandas work on this data as aSeries` of data. Details in documentation._

genre_values = df['Genre'].value_counts()

top_ten_genre_values = genre_values[genre_values > 1000]
top_ten_genre_values

Using filtered only top values we can work again with main data

genre_norm = df["Genre"].map(lambda x: x if x in top_ten_genre_values else "other")
genre_norm.value_counts()

Grouping and aggregate functions. Pandas contains data structures like `pivottable`, and allows us to count values like minumum, maximum etc_

pd.pivot_table(df, values=["Global_Sales"], index=["Name"]).sort_values(by=("Global_Sales"), ascending=False)

(df[["Global_Sales", "Genre"]]
    .groupby("Genre")
    .agg(["mean", "median", "min", "max", "std", "size"])
)

Thanks for reading this. It was a small recap of workshop exercises.

Marcin