Lab 3: Exploring Datasets
For this lab we will be working with real-world datasets
from CORGIS
(an acronym for the Collection of Really Great,
Interesting, Situated Datasets).
You may find it helpful to refer to this
short Pandas Cheat Sheet as you
work through this lab.
Biases in real-world data
- It is important to be aware of what any real-world data
actually represents, and the power that you have as a data
scientist when you interpret real-world data.
- Remember that, as we have been discussing in this class,
technology is not value-neutral.
- The process of collecting and curating data is done by
humans (with limited resources and limited time), and as a
result the data we see typically have embedded biases that are
difficult or impossible to identify simply from looking at the
data.
- The process of analyzing data also involves making choices
about what aspects of the data to include and what to
ignore, which again can be affected by biases.
Part A: Explore the Hospitals data set
This data set contains information about hospitals throughout
the United States with the goal of helping consumers make
informed choices about which are most cost effective and have
the best ratings.
- Go to
the CORGIS
website and find the link to the data set about Hospitals (the
data sets are listed in alphabetical order).
- Click on the link and read through the description of the
Hospitals data set.
- Click on the link to download the data set, which is called
"hospitals.csv". Your browser will likely save it to
the Downloads folder on your Mac. Use the Finder app on
your Mac to move this file into your S3P folder.
- Download the Jupyter notebook ExploreHospitalsData.ipynb and save it in the S3P folder.
- Open a terminal window and type: cd Desktop/S3P to move into your S3P directory.
- In the terminal type: python3 -m notebook to start up Jupyter notebook.
- Double-click on the file named ExploreHospitalsData.ipynb.
- Read through this notebook and complete the exercises that are given.
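Before starting the exercises, it can help to take a quick first look at the file you downloaded. The following is only a minimal sketch, not the contents of ExploreHospitalsData.ipynb: it assumes the code runs from the S3P folder where hospitals.csv was saved, and the helper name summarize is made up for illustration. It makes no assumptions about the column names in the file.

```python
from pathlib import Path

import pandas as pd


def summarize(df):
    """Return a few basic facts about a DataFrame (hypothetical helper)."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        # Number of missing values in each column.
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
    }


if __name__ == "__main__":
    # Assumption: the working directory is the S3P folder.
    csv_path = Path("hospitals.csv")
    if csv_path.exists():
        hospitals = pd.read_csv(csv_path)
        print(hospitals.head())       # first five rows
        print(summarize(hospitals))   # size, column names, missing values
    else:
        print(f"{csv_path} not found; move it into this folder first.")
```

Counting the missing values in each column is one small, concrete way to start asking what the data does and does not represent.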
Part B: Find a data set of interest to you
Now it's time for you to become a data scientist! Go back to
the CORGIS
website and explore what data sets are available. Read through the
descriptions and think about what is most interesting to you.
- Once you select a data set to focus on, follow the same
steps as above to download its CSV file onto your computer and
move it into your S3P folder.
- Download the Jupyter notebook ExploreMyData.ipynb and save it in the S3P
folder.
- Open a terminal window and type: cd Desktop/S3P to
move into your S3P directory.
- In the terminal type: python3 -m notebook to start
up Jupyter notebook.
- Double-click on the file named ExploreMyData.ipynb.
- This notebook provides a template for you to begin exploring
the data set of your choice. Feel free to add or remove cells in
this notebook as needed.
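One possible way to begin exploring an unfamiliar data set is to separate its numeric columns from the rest, then summarize each group differently. The sketch below is an illustration only and is not part of ExploreMyData.ipynb: the helper split_columns is hypothetical, and the tiny table is made up to stand in for whichever CSV file you chose (you would load yours with pd.read_csv instead).

```python
import pandas as pd


def split_columns(df):
    """Split column names into numeric and non-numeric (hypothetical helper)."""
    numeric = list(df.select_dtypes(include="number").columns)
    other = [col for col in df.columns if col not in numeric]
    return numeric, other


# Made-up stand-in data; replace with pd.read_csv("your_file.csv").
sample = pd.DataFrame({
    "state": ["CA", "CA", "NY"],
    "rating": [3, 4, 5],
})

numeric_cols, text_cols = split_columns(sample)

# Numeric columns: count, mean, min, max, and quartiles.
print(sample[numeric_cols].describe())

# Non-numeric columns: how often each value appears.
for col in text_cols:
    print(sample[col].value_counts())
```

If a column you expected to be numeric ends up in the non-numeric group, that is often a sign that some of its values were recorded as text, which is worth investigating before you analyze it.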
You may try one data set and discover it isn't quite what you
expected, or that it doesn't yield many interesting insights. Feel
free to try another until you find one that works for you. The
goal is that the data set that you choose will become the focus of
your final poster.