# Exploring a CORGIS data set about Hospitals

For the first part of this lab, you will go through a set of steps similar to those in the last lab to explore a new data set. However, this is a larger and more complex data set about hospitals in the United States.

I have provided a structure to help you remember the steps that you typically want to take, but you'll need to fill in most of the commands yourself. Remember that you can refer to the [Pandas Cheat Sheet](https://www.cs.swarthmore.edu/~meeden/s3p/summer23/labs/CheatSheet.html) for help.

NOTE: Remember to always do **Shift-Return** on every cell in the notebook to execute it.

#### Import pandas

In [1]:
import pandas as pd

#### Read in the CSV file

In [2]:
df = pd.read_csv("hospitals.csv")

#### Find key info about data

Use the **info()** command to find the name and type of each column.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4772 entries, 0 to 4771
Data columns (total 24 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   Facility.Name                    4772 non-null   object
 1   Facility.City                    4772 non-null   object
 2   Facility.State                   4772 non-null   object
 3   Facility.Type                    4772 non-null   object
 4   Rating.Overall                   4772 non-null   int64 
 5   Rating.Mortality                 4772 non-null   object
 6   Rating.Safety                    4772 non-null   object
 7   Rating.Readmission               4772 non-null   object
 8   Rating.Experience                4772 non-null   object
 9   Rating.Effectiveness             4772 non-null   object
 10  Rating.Timeliness                4772 non-null   object
 11  Rating.Imaging                   4772 non-null   object
 12  Procedure.Heart Attack.Cost      4

#### Rename the columns

Columns that contain a period or a space are going to be problematic later, so let's rename them now. Remember that the old name is on the left and the new name is on the right.

The following command will only work if you name your data frame **df**.

In [4]:
df.rename(columns={'Facility.Name':'Name',
                   'Facility.City':'City',
                   'Facility.State':'State',
                   'Facility.Type':'Type',
                   'Rating.Overall':'Rating',
                   'Rating.Mortality':'Mortality',
                   'Rating.Safety':'Safety',
                   'Rating.Readmission':'Readmission',
                   'Rating.Experience':'Experience',
                   'Rating.Effectiveness':'Effectiveness',
                   'Rating.Timeliness':'Timeliness',
                   'Rating.Imaging':'Imaging',
                   'Procedure.Heart Attack.Cost':'HeartAttackCost',
                   'Procedure.Heart Attack.Quality':'HeartAttackQuality',
                   'Procedure.Heart Attack.Value':'HeartAttackValue',
                   'Procedure.Heart Failure.Cost':'HeartFailureCost',
                   'Procedure.Heart Failure.Quality':'HeartFailureQuality',
                   'Procedure.Heart Failure.Value':'HeartFailureValue',
                   'Procedure.Pneumonia.Cost':'PneumoniaCost',
                   'Procedure.Pneumonia.Quality':'PneumoniaQuality',
                   'Procedure.Pneumonia.Value':'PneumoniaValue',
                   'Procedure.Hip Knee.Cost':'HipKneeCost',
                   'Procedure.Hip Knee.Quality':'HipKneeQuality',
                   'Procedure.Hip Knee.Value':'HipKneeValue'
                  }, inplace=True)
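One thing worth knowing about **rename()**: by default it silently skips any old name that does not match an existing column, so a misspelled key leaves that column unchanged without any warning. The sketch below illustrates this on a small made-up data frame (the name `toy` and its contents are assumptions for illustration, not part of the hospital data) and shows the `errors="raise"` option, which makes pandas report unknown names instead.

```python
import pandas as pd

# Toy frame with dotted column names (made-up data, not from hospitals.csv).
toy = pd.DataFrame({"Facility.Name": ["A", "B"], "Rating.Overall": [3, 4]})

# "Rating.Overal" is deliberately misspelled: by default the key is skipped
# and the column keeps its original name.
skipped = toy.rename(columns={"Rating.Overal": "Rating"})
print(list(skipped.columns))

# With errors="raise", pandas raises a KeyError for the unknown name.
try:
    toy.rename(columns={"Rating.Overal": "Rating"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```

Redoing **info()** after a rename is a simple way to confirm that every column picked up its new name.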

Let's redo the **info()** command to see what we have now.

#### Observe the data

Sample 20 rows from the data frame.
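As a reminder of how sampling behaves, here is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration). The same call should work on your own data frame.

```python
import pandas as pd

# Made-up frame with 100 rows, standing in for a larger data set.
toy = pd.DataFrame({"Name": [f"Hospital {i}" for i in range(100)],
                    "Rating": [i % 5 + 1 for i in range(100)]})

# 20 randomly chosen rows; a different set each time the cell runs.
picked = toy.sample(20)

# Passing random_state makes the sample repeatable across runs.
repeatable = toy.sample(20, random_state=0)
print(len(picked), len(repeatable))
```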

#### Cleaning the data

Your sample is likely to reveal that some of the data in the data set is incomplete. Notice that some of the columns contain **NaN**, which indicates that this data is missing.

Pandas provides an easy way to eliminate all of the incomplete data using: **df.dropna(inplace=True)**. Try this below.
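The effect of **dropna** can be seen on a small made-up data frame (`toy` is an assumption for illustration). Note that with `inplace=True` the data frame itself is modified and the call returns `None`, so there is nothing to assign.

```python
import pandas as pd

# Made-up frame where one row has a missing cost.
toy = pd.DataFrame({"Name": ["A", "B", "C"],
                    "Cost": [100.0, float("nan"), 300.0]})

rows_before = len(toy)
result = toy.dropna(inplace=True)  # modifies toy directly and returns None
rows_after = len(toy)

print(rows_before, rows_after)  # the row with NaN has been removed
```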

The original data set contained 4772 entries. Let's redo the **info()** command to see how many entries remain after cleaning up the missing data.

Let's do another sample of size 20 to make sure the data looks good now.

#### Summarize the data

1. What is the maximum cost for heart attacks?
2. And the max for Pneumonia? 
3. What are the mean costs for heart attacks?
4. And the mean for Pneumonia?
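Questions like these can be answered by selecting one column and calling a summary method on it. A minimal sketch on made-up numbers (the values below are invented for illustration, not taken from the hospital data):

```python
import pandas as pd

# Made-up costs; only the column names mirror the renamed hospital columns.
toy = pd.DataFrame({"HeartAttackCost": [20000, 25000, 23000],
                    "PneumoniaCost": [15000, 18000, 16500]})

max_heart_attack = toy["HeartAttackCost"].max()   # largest value in the column
mean_pneumonia = toy["PneumoniaCost"].mean()      # average of the column

print(max_heart_attack, mean_pneumonia)
```

Calling `describe()` on the data frame reports the count, mean, min, max, and quartiles for every numeric column at once.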

#### Sort the data

Try sorting the data by state.
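A minimal sketch of sorting by a text column, using a made-up data frame (`toy` is an assumption for illustration). By default **sort_values** returns a new, sorted data frame and leaves the original untouched.

```python
import pandas as pd

# Made-up rows in no particular order.
toy = pd.DataFrame({"Name": ["X", "Y", "Z"],
                    "State": ["PA", "AL", "NY"]})

# Sort alphabetically by state; pass ascending=False to reverse the order.
by_state = toy.sort_values("State")
print(by_state)
```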

#### Query the data

Find a hospital in your hometown. If there isn't one there, find a hospital in a nearby town.
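One way to query is with a boolean condition inside square brackets. The sketch below uses a made-up data frame (`toy` and its values are assumptions). Because many towns share a name, it combines a city test with a state test. Check a sample of your own data first to see how city names are capitalized, since the comparison is case-sensitive.

```python
import pandas as pd

# Made-up hospitals; two different cities share the same name.
toy = pd.DataFrame({"Name": ["General", "Mercy", "County"],
                    "City": ["SPRINGFIELD", "SPRINGFIELD", "DOVER"],
                    "State": ["IL", "MA", "DE"]})

# Each condition needs its own parentheses when combined with &.
matches = toy[(toy["City"] == "SPRINGFIELD") & (toy["State"] == "IL")]
print(matches)
```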

#### Explore correlations

Find an interesting correlation within the data. Remember that correlations only work on numeric columns, so grab a subset that includes all of the numeric columns.

Note: There should be 5 numeric columns: the rating and the costs of the four procedures.
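As a sketch of the idea, the example below builds a numeric subset of a made-up data frame and computes its correlation table. It uses **select_dtypes** to pick the numeric columns automatically, which is one possible approach; listing the column names explicitly, as in `toy[["Rating", "CostA", "CostB"]]`, works just as well. All names and values here are invented for illustration.

```python
import pandas as pd

# Made-up data: CostA rises with Rating, CostB falls as CostA rises.
toy = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                    "Rating": [1, 2, 3, 4],
                    "CostA": [10.0, 20.0, 30.0, 40.0],
                    "CostB": [40.0, 30.0, 20.0, 10.0]})

# Keep only the numeric columns (the text column Name is dropped).
numeric = toy.select_dtypes(include="number")

# Pairwise correlations: values near 1 or -1 indicate a strong linear relationship.
corr = numeric.corr()
print(corr)
```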

#### Plot a correlation

Make a scatter plot of a correlation you found.
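A minimal scatter plot sketch on made-up data (`toy` and its columns are assumptions for illustration). It relies on pandas' built-in plotting, which requires matplotlib to be installed.

```python
import pandas as pd

# Made-up pairs of values to plot against each other.
toy = pd.DataFrame({"CostA": [10.0, 20.0, 30.0, 40.0],
                    "CostB": [12.0, 19.0, 33.0, 38.0]})

# x and y name the columns to place on each axis.
ax = toy.plot.scatter(x="CostA", y="CostB")
```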

#### Make a histogram

Make a histogram of the column **Rating**.
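A minimal histogram sketch on made-up ratings (the values are invented for illustration). As with other pandas plots, this requires matplotlib to be installed.

```python
import pandas as pd

# Made-up ratings between 1 and 5.
toy = pd.DataFrame({"Rating": [1, 2, 2, 3, 3, 3, 4, 4, 5]})

# bins controls how many bars the value range is divided into.
ax = toy["Rating"].plot.hist(bins=5)
```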