Pandas Cheat Sheet
This cheat sheet gives a quick summary of commonly used pandas commands.
Importing Pandas
- import pandas as pd
Reading a CSV file into a data frame
- df = pd.read_csv("filename.csv")
- df.info() shows you all column names and types
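As a minimal sketch of the two commands above, the example below reads CSV data and inspects it. The CSV text and its column names are hypothetical stand-ins for "filename.csv"; an in-memory buffer is used only so the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "filename.csv".
csv_text = """Date,Calories,Steps
2024-01-01,2100,8000
2024-01-02,1950,6500
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

df.info()        # prints column names, non-null counts, and dtypes
print(df.shape)  # (number of rows, number of columns)
```

With a real file, the call would simply be pd.read_csv("filename.csv").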
Renaming columns in a data frame
In many data sets, the column names contain spaces or special
characters, or are long and hard to remember. It is helpful to
rename the columns up front and then use the simpler names from
that point on.
- df.rename(columns = {'OldName1':'NewName1', 'OldName2':'NewName2', ...}, inplace = True)
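A small sketch of the rename step, assuming a hypothetical data frame whose original column names contain spaces and special characters:

```python
import pandas as pd

# Hypothetical data frame with awkward column names.
df = pd.DataFrame({
    "Calories Burned (kcal)": [2100, 1950],
    "Floors Climbed": [10, 4],
})

# Map old names to new names; columns not in the mapping keep their names.
df.rename(
    columns={"Calories Burned (kcal)": "Calories", "Floors Climbed": "Floors"},
    inplace=True,
)

print(list(df.columns))
```

Without inplace = True, rename returns a new data frame and leaves df unchanged, so the result would need to be assigned to a variable.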
Accessing a column within a data frame
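A single column is typically accessed with square brackets and the column name, the same form used by the column summary commands later in this sheet. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Calories": [2100, 1950, 2300], "Steps": [8000, 6500, 9100]})

# Square brackets with one column name return a Series.
steps = df["Steps"]

print(type(steps).__name__)
print(steps.tolist())
```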
Selecting a subset of columns from a data frame
- You can create a subset and use it immediately.
- df[["ColName1","ColName2", ...]]
- For example, this computes the pairwise correlations among the three columns:
df[["Calories","Floors", "Steps"]].corr()
- Or you can create a named subset.
- subset = df[["ColName1","ColName2", ...]]
- For example, these two steps do the same thing as above:
subset = df[["Calories","Floors","Steps"]]
subset.corr()
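The subset-then-correlate pattern above can be sketched end to end as follows. The data values are made up for illustration; only the column names come from the example above.

```python
import pandas as pd

# Hypothetical activity data; "Date" is a non-numeric column we want to leave out.
df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "Calories": [2100, 1950, 2300, 2000],
    "Floors": [10, 4, 12, 6],
    "Steps": [8000, 6500, 9100, 7000],
})

# A list of names inside the brackets returns a data frame, not a Series.
subset = df[["Calories", "Floors", "Steps"]]

# Pairwise correlations between the selected columns (3 x 3 matrix).
corr = subset.corr()
print(corr.round(2))
```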
Observing a data frame
- df.head()
- df.tail()
- df.nsmallest(n, "ColName")
- df.nlargest(n, "ColName")
- df.sample(n=Number)
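A short sketch of the observation commands above on a hypothetical seven-row data frame. The random_state argument is an addition here, used only to make the random sample repeatable.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "Day": list("ABCDEFG"),
    "Steps": [8000, 6500, 9100, 7000, 12000, 3000, 10500],
})

first = df.head(3)                  # first 3 rows (default is 5)
last = df.tail(2)                   # last 2 rows (default is 5)
lowest = df.nsmallest(2, "Steps")   # the 2 rows with the smallest Steps
highest = df.nlargest(2, "Steps")   # the 2 rows with the largest Steps
random_rows = df.sample(n=3, random_state=0)  # 3 randomly chosen rows

print(highest)
```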
Handling missing data
Once you have observed the data, you may discover that some rows are missing important information. Missing data often shows up as NaN. The following command drops every row that contains a NaN. It returns a new data frame, so assign the result to a variable to keep it.
- df.dropna()
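One common way to drop rows containing NaN is the dropna method. The sketch below assumes a hypothetical data frame with two incomplete rows and also counts the missing values per column before dropping anything.

```python
import pandas as pd

nan = float("nan")

# Hypothetical data: rows 1 and 2 each have one missing value.
df = pd.DataFrame({
    "Calories": [2100, nan, 2300],
    "Steps": [8000, 6500, nan],
})

# Count missing values in each column before deciding what to drop.
print(df.isna().sum())

# dropna returns a new data frame; df itself is left unchanged.
clean = df.dropna()
print(clean)
```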
Summarizing a data frame
Summarizing a column within a data frame
- df["ColName"].sum()
- df["ColName"].count()
- df["ColName"].min()
- df["ColName"].max()
- df["ColName"].mean()
- df["ColName"].median()
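The column summaries above can be tried on a small hypothetical column. One value is deliberately missing to show that these methods skip NaN by default and that count() reports only non-missing entries.

```python
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"Steps": [8000, 6500, None, 9100]})
steps = df["Steps"]

total = steps.sum()       # sum of the non-missing values
count = steps.count()     # number of non-missing values
smallest = steps.min()
largest = steps.max()
mean = steps.mean()       # total divided by count
median = steps.median()

print(total, count, smallest, largest, mean, median)
```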
Sorting a data frame
- df.sort_values("ColName")
- df.sort_values("ColName", ascending = False)
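A brief sketch of both sort directions, using hypothetical data. Note that sort_values returns a sorted copy and leaves the original row order in place.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Day": ["A", "B", "C"], "Steps": [8000, 6500, 9100]})

ascending = df.sort_values("Steps")                    # smallest first
descending = df.sort_values("Steps", ascending=False)  # largest first

print(descending)
```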
Querying a data frame
- df.query('ConditionalExpression')
- For example: df.query('State=="PA" and City=="Philadelphia"')
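The query example above can be sketched on a hypothetical data frame. The Stores column and its values are made up for illustration. The equivalent boolean-mask form is shown alongside for comparison.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "State": ["PA", "PA", "NJ"],
    "City": ["Philadelphia", "Pittsburgh", "Trenton"],
    "Stores": [3, 2, 1],
})

# Column names are used directly inside the expression string.
philly = df.query('State == "PA" and City == "Philadelphia"')

# The same filter written with boolean masks.
same = df[(df["State"] == "PA") & (df["City"] == "Philadelphia")]

print(philly)
```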
Correlations
- df.corr()
- This works only when every column is numeric; in pandas 2.0 and later, a non-numeric column causes an error. Either select a subset of numeric columns first (see above) or pass numeric_only = True.
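Two possible ways to restrict the correlation to numeric columns are sketched below on hypothetical data. The sketch assumes pandas 1.5 or later, where corr accepts the numeric_only argument; select_dtypes is an alternative to listing the numeric columns by hand.

```python
import pandas as pd

# Hypothetical data with one non-numeric column ("Day").
df = pd.DataFrame({
    "Day": ["Mon", "Tue", "Wed", "Thu"],
    "Calories": [2100, 1950, 2300, 2000],
    "Steps": [8000, 6500, 9100, 7000],
})

# Option 1: keep only the numeric columns, then correlate.
numeric = df.select_dtypes(include="number")
corr_a = numeric.corr()

# Option 2: ask corr to ignore non-numeric columns (assumes pandas >= 1.5).
corr_b = df.corr(numeric_only=True)

print(corr_a.round(2))
```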
Plotting
- Scatter plot: df.plot.scatter(x="ColName1", y="ColName2")
- Histogram: df["ColName"].plot.hist()
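A minimal plotting sketch, assuming matplotlib is installed (pandas uses it for plotting by default). The data, the bins value, and the explicit axes are illustrative choices; the Agg backend is selected only so the script also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "Steps": [8000, 6500, 9100, 7000, 12000, 3000, 10500],
    "Calories": [2100, 1950, 2300, 2000, 2600, 1700, 2450],
})

# Two side-by-side axes so the plots do not draw on top of each other.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

df.plot.scatter(x="Steps", y="Calories", ax=ax1)  # scatter plot of two columns
df["Steps"].plot.hist(bins=5, ax=ax2)             # histogram of one column

fig.tight_layout()
```

In a notebook the figure is shown automatically; in a script, plt.show() or fig.savefig with a file name of your choice displays or saves it.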