Pandas Cheat Sheet

This cheat sheet gives a quick summary of common Pandas commands.

Importing Pandas

import pandas as pd

Reading a CSV file into a data frame

df = pd.read_csv("filename.csv")
df.info() prints each column's name, non-null count, and data type.
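
As a minimal sketch of the two commands above, the following reads a small CSV and inspects it. The CSV text and its column names are made up for illustration, and io.StringIO stands in for a real file such as "filename.csv" so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for "filename.csv"
csv_text = "Date,Steps,Calories\n2024-01-01,8000,2100\n2024-01-02,12000,2500\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

df.info()         # prints column names, non-null counts, and dtypes
print(df.dtypes)  # just the dtypes, returned as a Series
```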

Renaming columns in a data frame

In many data sets, the column names contain spaces or special characters, or are very long and hard to remember. It helps to rename the columns up front and then use the simpler names from that point on.

df.rename(columns = {'OldName1':'NewName1', 'OldName2':'NewName2', ...}, inplace = True)
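
A small sketch of renaming, assuming a made-up data frame whose column names contain spaces and punctuation. Only the columns listed in the dictionary are renamed; any others keep their names.

```python
import pandas as pd

# Hypothetical data frame with awkward column names
df = pd.DataFrame({
    "Calories Burned": [2100, 2500],
    "Floors (count)": [5, 12],
    "Steps": [8000, 12000],
})

# Map old names to new names; "Steps" is not listed, so it is left alone
df.rename(columns={"Calories Burned": "Calories", "Floors (count)": "Floors"},
          inplace=True)

print(list(df.columns))
```

If you prefer not to modify the data frame in place, df = df.rename(columns={...}) without inplace gives the same result.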

Accessing a column within a data frame

df["ColName"]
df.ColName
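
The two forms above return the same column, but the dot form only works when the column name is a valid Python identifier and does not clash with an existing data frame attribute. A short sketch with made-up data:

```python
import pandas as pd

# Hypothetical data; one column name contains a space
df = pd.DataFrame({"Steps": [8000, 12000], "Calories Burned": [2100, 2500]})

steps_a = df["Steps"]  # bracket access works for any column name
steps_b = df.Steps     # dot access works because "Steps" is a valid identifier

# "Calories Burned" contains a space, so only bracket access is possible
calories = df["Calories Burned"]

print(steps_a.equals(steps_b))
```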

Selecting a subset of columns from a data frame

You can create a subset and use it immediately.
- df[["ColName1","ColName2", ...]]
- For example, this finds the pairwise correlations among three columns:
  df[["Calories","Floors", "Steps"]].corr()
Or you can create a named subset.

subset = df[["ColName1","ColName2", ...]]
For example, these two steps do the same thing as above:
subset = df[["Calories","Floors","Steps"]]
subset.corr()
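
The two approaches can be compared in a runnable sketch. The values below are invented for illustration, and the extra "Date" column is there only to show that it is left out of the subset.

```python
import pandas as pd

# Hypothetical activity data
df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "Calories": [2000, 2200, 2400, 2600],
    "Floors": [2, 4, 6, 8],
    "Steps": [5000, 7000, 6000, 9000],
})

# Use the subset immediately
corr_inline = df[["Calories", "Floors", "Steps"]].corr()

# Or name the subset first, then use it
subset = df[["Calories", "Floors", "Steps"]]
corr_named = subset.corr()

print(corr_named)
```

Note the double brackets: the inner pair is a Python list of column names, and the result is a data frame rather than a single column.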

Observing a data frame

df.head()
df.tail()
df.nsmallest(n, "ColName")
df.nlargest(n, "ColName")
df.sample(n=Number)
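
A brief sketch of these commands on made-up data. head() and tail() show five rows unless you pass a different number; random_state is an optional argument used here only so the sample is repeatable.

```python
import pandas as pd

# Hypothetical step counts for one week of weekdays
df = pd.DataFrame({
    "Day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "Steps": [8000, 12000, 3000, 9500, 7000],
})

first_rows = df.head(3)              # first 3 rows
last_rows = df.tail(2)               # last 2 rows
lowest = df.nsmallest(2, "Steps")    # 2 rows with the fewest steps
highest = df.nlargest(2, "Steps")    # 2 rows with the most steps
picked = df.sample(n=2, random_state=0)  # 2 randomly chosen rows

print(lowest)
print(highest)
```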

Handling missing data

Once you have observed the data, you may discover that some rows are missing important information. Missing data often shows up as NaN. The following command drops every row that contains a NaN in any column.

df.dropna(inplace=True)
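
Before dropping rows, it can help to count how many values are missing. The sketch below uses invented data and the non-inplace form of dropna so the original data frame stays available for comparison; the subset argument is an optional way to drop rows only when particular columns are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps
df = pd.DataFrame({
    "Steps": [8000, np.nan, 3000],
    "Calories": [2100, 2500, np.nan],
})

missing_per_column = df.isna().sum()      # count of NaN values in each column
cleaned = df.dropna()                     # drop rows with a NaN in any column
steps_known = df.dropna(subset=["Steps"]) # drop rows only where Steps is NaN

print(missing_per_column)
print(len(df), len(cleaned), len(steps_known))
```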

Summarizing a data frame

df.describe()
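
By default, describe() summarizes the numeric columns and returns a data frame with one row per statistic. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical activity data
df = pd.DataFrame({
    "Steps": [8000, 12000, 3000, 9500],
    "Calories": [2100, 2500, 1800, 2300],
})

summary = df.describe()
print(summary)

# Individual statistics can be looked up by row and column label
print(summary.loc["max", "Calories"])
```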

Summarizing a column within a data frame

df["ColName"].sum()
df["ColName"].count()
df["ColName"].min()
df["ColName"].max()
df["ColName"].mean()
df["ColName"].median()
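
The sketch below applies these methods to a made-up column that contains one missing value, to show that count() and the other statistics skip NaN by default. The agg() call at the end is an optional shortcut that computes several statistics at once.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({"Steps": [8000, 12000, 3000, np.nan]})

total = df["Steps"].sum()       # NaN is skipped
n_values = df["Steps"].count()  # counts non-missing values only
middle = df["Steps"].median()

# Several statistics in one call
stats = df["Steps"].agg(["sum", "count", "min", "max", "mean", "median"])

print(total, n_values, middle)
print(stats)
```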

Sorting a data frame

df.sort_values("ColName")
df.sort_values("ColName", ascending = False)
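
Note that sort_values returns a new, sorted data frame and leaves the original unchanged, so assign the result if you want to keep it. A short sketch with made-up data:

```python
import pandas as pd

# Hypothetical step counts
df = pd.DataFrame({
    "Day": ["Mon", "Tue", "Wed"],
    "Steps": [8000, 12000, 3000],
})

low_to_high = df.sort_values("Steps")
high_to_low = df.sort_values("Steps", ascending=False)

print(high_to_low)
print(df)  # the original row order is unchanged
```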

Querying a data frame

df.query('ConditionalExpression')
For example: df.query('State=="PA" and City=="Philadelphia"')
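
To illustrate, the sketch below runs the example query on a small invented data frame and shows the equivalent boolean-mask form. The "Orders" column and the variable name target_state are hypothetical; the @ prefix lets a query refer to a Python variable.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "State": ["PA", "PA", "NY"],
    "City": ["Philadelphia", "Pittsburgh", "New York"],
    "Orders": [12, 7, 20],
})

# Query string, as in the example above
result = df.query('State=="PA" and City=="Philadelphia"')

# The same filter written as a boolean mask
same_rows = df[(df["State"] == "PA") & (df["City"] == "Philadelphia")]

# Referring to a Python variable inside the query with @
target_state = "PA"
pa_rows = df.query("State == @target_state")

print(result)
```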

Correlations

df.corr()
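This only works on data frames consisting entirely of numeric data; in pandas 2.0 and later, a non-numeric column causes an error. You can select a subset of numeric columns (see above) and then compute correlations, or pass numeric_only=True to skip the non-numeric columns.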
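
Two ways to restrict the calculation to numeric columns are sketched below on made-up data. This assumes a pandas version in which corr() accepts the numeric_only argument (1.5 or later, as far as I know); select_dtypes is an alternative that does not depend on that argument.

```python
import pandas as pd

# Hypothetical data with one non-numeric column
df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "Calories": [2000, 2200, 2400, 2600],
    "Steps": [5000, 7000, 6000, 9000],
})

# Keep only the numeric columns, then correlate
numeric_corr = df.select_dtypes(include="number").corr()

# Or ask corr() to ignore non-numeric columns
same_corr = df.corr(numeric_only=True)

print(numeric_corr)
```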

Plotting

Scatter plot: df.plot.scatter(x="ColName1", y="ColName2")
Histogram: df["ColName"].plot.hist()
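
A sketch of both plots side by side, assuming matplotlib is installed (pandas uses it for plotting by default). The data is invented, and passing ax= is an optional way to control where each plot is drawn; without it, each command draws on its own or the current axes.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical activity data
df = pd.DataFrame({
    "Steps": [5000, 7000, 6000, 9000, 12000, 3000],
    "Calories": [2000, 2200, 2100, 2600, 2900, 1800],
})

# One figure with two side-by-side axes
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot of two columns on the left axes
df.plot.scatter(x="Steps", y="Calories", ax=ax_left)

# Histogram of one column on the right axes
df["Steps"].plot.hist(ax=ax_right)

fig.tight_layout()
# plt.show()  # uncomment when running in an interactive session
```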