CS21 Lab 9: Filtering and Sorting
Due Saturday, November 23, before midnight
Goals
-
Extend sorting beyond a list of integers
-
Learn to filter and reduce dataset sizes before sorting
-
Continue practicing top-down design (TDD)
-
Manage real world data
Please read through the entire lab before starting!
Federal Election Commission Data
The Federal Election Commission (FEC) collects fundraising data on candidates running for federal elections. You can interactively browse much of the data online, but we can also use our CS skills to explore the data interactively in python. As candidates must itemize individual campaign contributions in excess of $200, there is a lot of public data to sift through and explore.
We provide a number of cleaned up and occasionally aggregated data files that each have records in the following format.
P00010520|HICKENLOOPER, JOHN W.|DENVER|CO|01|03|2019|32800
Fields are separated by a vertical pipe |
, and contain the following information:
-
A unique ID for a particular 2020 presidential campaign.
P00010520
in this case is the campaign for John Hickenlooper, whose 2020 campaign ended in August 2019. -
The name of the donor, e.g., HICKENLOOPER, JOHN W., who donated to his own campaign.
-
The city of the donor, e.g., DENVER.
-
The two letter state abbreviation of the donor, e.g., CO
-
The two digit month of the donation, e.g., 01
-
The two digit day of the donation, e.g., 03
-
The two digit year of the donation, e.g., 2019
-
The amount of the donation in US dollars.
In the directory /usr/local/doc/
we placed the file 2020all.txt
which has itemized individual campaign contributions for every 2020 presidential candidate. Contributions may span from 2016 to the present, and there are over 1.6 million total records in this file. Working with this file may be challenging so we also provide some smaller summary files in the same /usr/local/doc/
folder.
-
2020candidates.txt
(105 records) This file has one entry per candidate per calendar year containing all the contributions for that year. The donor name, city and location are replaced with a dummy|Summary|Summary|US
, and the month and day are replaced with01|01
.P00011866|Summary|Summary|US|01|01|2019|2015652 P00011312|Summary|Summary|US|01|01|2019|869686 P00010793|Summary|Summary|US|01|01|2019|12020868 P00006486|Summary|Summary|US|01|01|2019|2383887 P00011239|Summary|Summary|US|01|01|2019|740304 P00011999|Summary|Summary|US|01|01|2019|3475776 P00011338|Summary|Summary|US|01|01|2019|1318045 P00012716|Summary|Summary|US|01|01|2019|48199335 P00006213|Summary|Summary|US|01|01|2019|1461829 P00009910|Summary|Summary|US|01|01|2019|2781635
-
2020months.txt
(497 records) This file has one entry per candidate per calendar year per month containing all the contributions for that month. It is similar in format to2020candidates.txt
but we do not replace the month with01
-
2020states.txt
(9,951 records) Similar to2020months.txt
, but monthly aggregates are further subdivided by state, so the state column is unmodified and not replaced byUS
as it was in the2020months.txt
file -
2020donors.txt
(308,816 records) This file has one entry per candidate per donor per year. Occasionally a donor will make multiple campaign contributions to the same candidate. These are reported as individual transactions in the2020all.txt
file, but are aggregated into a single yearly transaction with a calendar date of 12/31 for each donor in the2020donors.txt
file
Searching the Dataset
Your task is to write a program that:
-
Prompts the user with a menu of six choices:
Please select one of the following choices: 0. Quit 1. Load file 2. Print top donations 3. Filter by campaign ID 4. Filter by state 5. Filter by city Number of current records: 0 Choice?
-
Repeatedly asks the user for a valid choice
-
Completes the selected task until the user quits (option 0)
Each option aside from 0 is described in more detail below.
Load file
For option 1, load file, you must:
-
Ask the user to select one of the valid filenames mentioned above
-
add the path
/usr/local/doc/
to the provided name -
read in the appropriate file of donation records from a file
-
return a new list of
Donation
records -
print the number of records returned
You should print a small sub-menu of the valid filenames and have the user select the filename by number,
$ python3 campaign.py 0. Quit 1. Load file 2. Print top donations 3. Filter by campaign ID 4. Filter by state 5. Filter by city Number of current records: 0 Your choice? 1 Data files: 1. Grouped by candidates (2020candidates.txt) 2. Grouped by month (2020months.txt) 3. Grouped by state (2020states.txt) 4. Grouped by donor (2020donors.txt) 5. All records (2020all.txt) Your choice? 1 0. Quit 1. Load file 2. Print top donations 3. Filter by campaign ID 4. Filter by state 5. Filter by city Number of current records: 105
The electiondata
module provides a small Donation
class that you can use to store donor info.
Donation(campaign, name, city, state, month, day, year, amt)
Given a line in the donor file, you could create a Donation
object as follows:
info = line.strip().split("|") d = Donation(info[0],info[1],info[2],info[3], int(info[4]), int(info[5]), int(info[6]), int(info[7]))
Print top donations
For option 2, print top donations, you must:
-
prompt the user for the number of records to view,
n
. -
sort the current list of donations by the donation amount.
-
print the top
n
donations in decreasing order with the highest amount first. -
you should print the name of the campaign associated with each donation.
If there are fewer than n
records in the current list of donations, you should print all the donations. The hints below will show you how you use the get_candidates()
function in the electiondata
library to map the campaign ID to a candidate name.
You must implement your own sorting algorithm to sort the records. It can be a variant of any of the quadratic runtime sort algorithms we discussed in class. You will need to adapt the algorithm to sort Donation
objects by comparing the amount of two separate donations and swapping their positions in the list if necessary.
A sample run on the candidate summary data file is shown below.
$ python3 campaign.py 0. Quit 1. Load file 2. Print top donations 3. Filter by campaign ID 4. Filter by state 5. Filter by city Number of current records: 0 Your choice? 1 Data files: 1. Grouped by candidates (2020candidates.txt) 2. Grouped by month (2020months.txt) 3. Grouped by state (2020states.txt) 4. Grouped by donor (2020donors.txt) 5. All records (2020all.txt) Your choice? 1 0. Quit 1. Load file 2. Print top donations 3. Filter by campaign ID 4. Filter by state 5. Filter by city Number of current records: 105 Your choice? 2 How many?: 8 STEYER, TOM: 48199335 P00012716|Summary|Summary|US|01|01|2019|$ 48199335 BUTTIGIEG, PETE: 33943835 P00010298|Summary|Summary|US|01|01|2019|$ 33943835 SANDERS, BERNARD: 30523241 P60007168|Summary|Summary|US|01|01|2019|$ 30523241 HARRIS, KAMALA D.: 22208134 P00009423|Summary|Summary|US|01|01|2019|$ 22208134 BIDEN, JOSEPH R JR: 20317082 P80000722|Summary|Summary|US|01|01|2019|$ 20317082 WARREN, ELIZABETH: 15329719 P00009621|Summary|Summary|US|01|01|2019|$ 15329719 KLOBUCHAR, AMY J.: 13165462 P80006117|Summary|Summary|US|01|01|2019|$ 13165462 TRUMP, DONALD J.: 12945866 P80001571|Summary|Summary|US|01|01|2019|$ 12945866
Filter by campaign ID
For option 3, filter by campaign ID, you must:
-
Prompt the user for a campaign ID, e.g.,
P00009621
. -
Filter the current list of donation records by creating a new list containing only the records whose campaign ID matches the given ID and replacing the current list of records with this new list.
-
Print the number of records returned.
Note it is perfectly fine to return an empty list if the user specifies an invalid campaign ID.
You will probably want to write a function that takes the list of current donors and returns a new list.
donations = filter_by_campaign(donations, id)
Below is list of campaign ID for some of the top fundraising campaigns. You can find more by using your program and looking at the 2020candidates.txt
file.
P00012716|STEYER, TOM P00010298|BUTTIGIEG, PETE P60007168|SANDERS, BERNARD P00009423|HARRIS, KAMALA D. P80001571|TRUMP, DONALD J. P80000722|BIDEN, JOSEPH R JR P00009621|WARREN, ELIZABETH P80006117|KLOBUCHAR, AMY J. P00009795|BOOKER, CORY A. P00010793|O'ROURKE, ROBERT BETO P00009290|GILLIBRAND, KIRSTEN P00011833|BENNET, MICHAEL F. P00010454|INSLEE, JAY R P00010520|HICKENLOOPER, JOHN W. P00009092|CASTRO, JULIAN P00011999|BULLOCK, STEVE P00006213|DELANEY, JOHN K. P00006486|YANG, ANDREW MR. P00009183|GABBARD, TULSI P00009910|WILLIAMSON, MARIANNE P00012054|DE BLASIO, BILL
Filter by state
For option 4, filter by state, you must :
-
Prompt the user for a two letter state abbreviation, e.g.,
PA
. -
Filter the current list of donation records by creating a new list containing only the records whose state matches the state and replacing the current list of records with this new list.
-
Print the number of records returned.
This is very similar to filter by campaign ID, but just filtering by a different criteria. Note since each filter replaces the current list of donations, applying successive filters can further reduce the list size. For example, filtering first by campaign P00009621
and then by state PA
should result in a list of donations to the P00009621
(Elizabeth Warren) campaign from PA
.
Filter by city
For option 5, filter by city, you must :
-
Prompt the user for a city name, e.g.,
Springfield
. -
Filter the current list of donation records by creating a new list containing only the records whose city name matches the city and replacing the current list of records with this new list.
-
Print the number of records returned.
The city match only matches by city name, so for city names that exist in multiple states, you would need to apply a city filter followed by a state filter to narrow down your results. This would be the user’s responsibility, not the programmers responsibility, so to search for Springfied, PA
the user would filter by city==Springfield
then state==PA
. Also note that the data in the data files are in all caps. You may want to use the .upper()
method on the user input to convert the user’s response to something that is easier to match in the list of records.
Sample output
See the sample output page for examples of how the program might run.
Provided Library
The electiondata
library has some helpful tools for managing donation
records in your program.
Donation Class
The electiondata
library provides a Donation
class, which encapsulates all the information you’ll
need to keep track of one donation record. You can create an instance object of this
class by calling the Donation()
constructor with each piece of information
from one row of your data file passed in as a separate parameter.
Each number is an integer, and all other fields are strings.
You can call the following methods on a Donation
object to retrieve information about it:
-
get_candidate()
: Retrieve a unique campaign identifier. This isn’t easy for humans to remember, but we can match it to a real candidate with theCandidate
class described below. -
get_name()
: Retrieve the name of the donor. -
get_city()
: Retrieve the city string. -
get_state()
: Retrieve the state string (abbreviated, e.g.,PA
). -
get_month()
: Retrieve the donation month as an integer. -
get_day()
: Retrieve the donation day as an integer. -
get_year()
: Retrieve the donation year as an integer. -
get_amt()
: Retrieve the donation amount as an integer.
For example:
>>> from electiondata import * >>> record = Donation("P00010520", "HICKENLOOPER, JOHN W.", "DENVER", "CO", 01, 03, 2019, 32800) >>> print(record.get_year()) 2019 >>> print(record.get_state()) CO >>> print(record) P00010520|HICKENLOOPER, JOHN W.|DENVER|CO|01|03|2019|$ 32800
Candidate class
Donations are listed by a unique campaign identifier, but this ID does not easily describe a candidate. A separate Candidate
class stores information about candidates, including their campaign ID
-
get_candidate_id()
: Retrieve the campaign ID associated with this candidate as it appears in the donor files. -
get_name()
: Retrieve real name of candidate. -
get_committee()
: Retrieve name of fundraising committee for candidate. -
get_party()
: Retrieve political party abbreviation of candidate. -
get_state()
: Retrieve state of registration for this candidate.
A text file containing info about the 2020 presidential candidates can be found in /usr/local/doc/2020presinfo.txt
, but the function get_candidates()
in the electiondata
library will read this file for you and return a list of Candidate
objects, so you should not need to parse this file on your own.
>>> from electiondata import * >>> cand = get_candidates() >>> print(cand[1]) P00010298|C00697441|BUTTIGIEG, PETE|PETE FOR AMERICA, INC.|DEM|IN >>> print(cand[4]) P80001571|C00580100|TRUMP, DONALD J.|DONALD J. TRUMP FOR PRESIDENT, INC.|REP|NY >>> print(cand[5]) P80000722|C00703975|BIDEN, JOSEPH R JR|BIDEN FOR PRESIDENT|DEM|DC >>> print(cand[7]) P80006117|C00696419|KLOBUCHAR, AMY J.|AMY FOR AMERICA|DEM|MN
When printing your result in option 2, you can pass in a list of candidates in addition to your donor info when displaying the results. You can use this list of candidates to search for a matching candidate_id
and lookup the candidate name. Since there are only 74 candidates in the 2020 candidates list, a basic linear search will be fast enough for most reporting.
Requirements, Hints, and Tips
-
You should practice good top-down design, incrementally implement and test your solution, and document your code with comments. Start with the main menu and get a skeleton of your code working first. Then add each option from the menu one at a time (start with quit as it is easy and will allow you to exit your program when you are testing it). While much of the design is up to you, the requirements below are designed to avoid some headaches in the initial design.
-
You should only need to read a donation file when the user selects option 1. For all other options, you should be using a list that was previously loaded from a file.
-
To print an instance object of the
Donation
class, you can callprint()
directly on the object:donation = alldonations[0] #alldonations is list of Donation objects print(donation)
The class knows how to format itself into a string, so you can avoid writing a nasty formatting string.
-
Validate menu selections. If the user doesn’t enter a valid number (1-6), let the user know it wasn’t a valid selection and prompt them again for a new menu item.
-
When reading file input,
split()
only generates a list of strings. Some of the strings may need to be converted to integers.
Extra Challenges
This will not affect your grade or gain extra points. It is included just as an extra challenge, if you are interested (and have already finished the entire assignment as described above).
There are numerous ways to extend this lab to make it even more useful. Here are some challenges you may want to try. If you decide to try them, add any extra menu options to the end of the list so that you do not change the default order.
-
Allow users to filter by city and state in a single step instead of applying two filters.
-
Allow an undo option, where selecting undo will undo the most recent filter and replace the current list of records with the previous list. This could be handy if you wanted to search for one state, undo, and search for a second, without reloading the entire file. For an extra, extra challenge, allow multiple levels of undo, e.g., allow users to undo up to the previous three filters.
-
Add an exclude filter option where you can return a list of records that do not match a query. For example, if you wanted to remove candidates that had dropped out of the race already, you could apply a filter on campaign ID that returns all records except a given campaign ID to remove a candidate.
-
Allow a default menu option so that users can e.g. load a default filename simply by pressing enter instead of typing a number. You can choose the default name. Your program should still allow users to select a file other than the default.
Answer the Questionnaire
Please edit
the Questions-09.txt
file in your cs21/labs/09
directory
and answer the questions in that file.
Once you’re done with that, run handin21
again.
Turning in your labs….
Remember to run handin21
to turn in your lab files!
You may run handin21
as many times as you want. Each time it will
turn in any new work. We recommend running handin21
after
you complete each program or after you complete significant
work on any one program.