CS21 Lab 08: Searching
Due Saturday, November 18, by 11:59pm
Please read through the entire lab before starting!
This is a one week lab. For the previous lab, you used a full week to practice the TDD component and a second week to implement your solution. Since the design process should be similar to the previous lab, you will be doing the design and implementation for this lab in a single week. If you found the open-ended nature of the last lab challenging, you will want to start this lab early.
All of the standard requirements from previous labs such as using good variable names, including function comments, etc. still apply.
Programming Tips
As you write programs, use good programming practices:
-
Use a comment at the top of the file to describe the purpose of the program (see example).
-
All programs should have a
main()
function (see example). -
Use variable names that describe the contents of the variables.
-
Write your programs incrementally and test them as you go. This is really crucial to success: don’t write lots of code and then test it all at once! Write a little code, make sure it works, then add some more and test it again.
-
Don’t assume that if your program passes the sample tests we provide that it is completely correct. Come up with your own test cases and verify that the program is producing the right output on them.
-
Avoid writing any lines of code that exceed 80 columns.
-
Always work in a terminal window that is 80 characters wide (resize it to be this wide)
-
In
vscode
, at the bottom right in the window, there is an indication of both the line and the column of the cursor.
-
Function Comments
All functions should have a top-level comment! Please see our function example page if you are confused about writing function comments.
Goals
-
Write a program that uses linear search
-
Write a program that uses binary search
-
Continue practicing Top-Down Design (TDD)
-
Connect CS topics with real-world data
1. Searching the USDA Economic Research Service data
1.1. USDA Economic Research Service data set
For this this assignment, we will be working with a real-world dataset from the USDA Economic Research Service. We have merged four datasets together to provide you with data that includes the population, education levels, unemployment rates, poverty rates and the median household income of every county in the United States from the year 2021. For this lab, we’ll be focusing on just a few of these features, but if you find the data interesting, you’re encouraged to download the full data sets and explore them on your own.
This data we will be working with is stored in CSV (Comma Separated Value)
format and is located at: /data/cs21/usda/usda.csv
there are actually two versions of the data file; the primary one is:
/data/cs21/usda/usda.csv
but there’s also a copy that’s had most of the lines removed, which may be
easier to use for testing your program, particularly in the early stages:
/data/cs21/usda/usda_small.csv
|
Here are the first three lines and last three lines from the usda.csv
file:
Abbeville County,SC,24258,83.7,18.2,4.5,14.8,49485 Acadia Parish,LA,57262,80.2,13.6,5.1,20.6,44648 Accomack County,VA,33364,82.2,21.0,4.3,16.2,50949 ... Zapata County,TX,13893,68.3,11.9,11.7,28.9,37474 Zavala County,TX,9487,75.6,14.8,12.4,25.4,35046 Ziebach County,SD,2400,89.2,14.6,3.9,38.1,37066
Note that the file is sorted by the the name of the county.
Each line contains 8 features. The following table provides information about these features.
Feature index |
Type |
Description |
0 |
string |
county |
1 |
string |
two-letter state abbreviation |
2 |
int |
population |
3 |
float |
percent of the population who completed high school (at a minimum) |
4 |
float |
percent of the population who completed a four-year college degree (at a minimum) |
5 |
float |
|
6 |
float |
|
7 |
int |
the median household income |
In this lab we will focus on searching counties, states, and population.
You will implement a program called search_usda.py
, which will allow the user
to search through this dataset and investigate the information contained in it.
1.2. Trigger warning and notes about real-world data
While this data is provided as demographics (i.e. population numbers) and does not provide information on specific individuals, it is actual data collected about the real world relating to access to education, income inequality, and unemployment. If you are not comfortable working with this data, please reach out to one of the instructors about the possibility of completing an alternate version of this lab.
Even if you are not triggered by this data, it is worth being aware of what it represents, and the power that algorithms can give you. Technology is not value-neutral; even if the search algorithms you implement are mathematically indifferent to the meaning of the data, the data you are searching are produced by a world which is not. Thus, the results of the search will not be bias-free even if the search algorithm is "fair".
In addition to the unfairness explicitly being represented and highlighted in this data set (e.g. the unequal levels of poverty in the country), always keep in mind that data collected about the world is not a perfect representation of that world. The process of collecting and curating data is done by humans, and as a result the data we see typically have embedded biases that are difficult or impossible to identify simply from looking at the numbers.
In this lab, we’ll explore the use of algorithms that might help us use data to identify unfairness in the world, with the idea that doing so would be a first step towards trying to remedy the problem. Even with good intentions, however, it is important to keep in mind not only the strengths but also the limitations of your data and algorithms.
2. Program Requirements
-
Reads in stored data from a file, storing a list of records. The data is stored in the file:
/data/cs21/usda/usda.csv
Note: don’t copy this file to your
labs/08
directory, just use"/data/cs21/usda/usda.csv"
as the file name in your python program.As you read in the data, each line of the file corresponds to one record, which your program should represent using a Usda object and store in a list.
-
Repeatedly presents a menu to the user with three choices: to search by state name, to search by county name, to search by median income, or to quit. For example:
Please select one of the following choices: 1. Search by state 2. Search by county 3. Search by median income 0. Quit Choice?
Depending on the user’s selection, prompt them for additional information and then do one of:
-
Search the dataset by state name, using linear search.
-
Search the dataset by county name, using binary search.
-
Search the dataset by median income, using linear search.
-
Exit the program.
-
-
When searching strings, your program should compare all data in lowercase format (so that user input is case insensitive) but displays results with their original case.
-
You should ensure that the user provides valid input for menu choices.
-
Uses linear search to find state matches. See Adapting linear search for state searches
-
Uses binary search to find county matches. See Adapting binary search for county searches
-
Uses linear search to find median household incomes. See Adapting linear search for median income searches
-
Uses formatted printing to show search results in a table format that fits in an 80-character wide terminal window. See Displaying matches
2.1. Example Output
-
Here are some examples of searching:
-
Here are some examples of handling valid/invalid input:
2.2. Tips
2.2.1. Continue to use TDD
We strongly recommend that you create a top-down design with a fully fleshed
out main
and stubs for all other functions. However, you may not work with a
partner on this lab. Also, you don’t need to get approval from an instuctor for
your design, though we’d be glad to review it with you in lab, office hours, or
ninja sessions. You should only begin implementing the other functions after
you have a clear plan in place for the structure of main
.
2.2.2. Create modular, reusable functions
Avoid duplicating similar code in multiple places throughout the program. Think about how you could re-use the same function. For example, we are doing two types of searchs (by name and by state), but the results of those searches are displayed in exactly the same way. Consider having a function that takes in a list of search results and displays them. Then you could call this function for either type of search.
3. The usda_data
library
The lab comes with a usda_lab
library that gives you several helpful
tools intended to make this lab easier. To use the library, we’ve put
this line at the top of your file:
from usda_lab import Usda, look_forward_backward
3.1. Usda class
The usda_lab
library provides a Usda
class, which encapsulates all the features
needed to store one row of the data set. You can create an object of this class by
calling the Usda
constructor with each field of information passed in
as separate parameters.
The Usda
constructor takes 8 parameters shown in the table below (ignore
the Getter Method column for now):
Parameter |
Type |
Description |
Getter Method |
0 |
string |
County name |
|
1 |
string |
State abbreviation |
|
2 |
int |
Population |
|
3 |
float |
High school degree |
|
4 |
float |
College degree |
|
5 |
float |
Unemployment rate |
|
6 |
float |
Poverty rate |
|
7 |
int |
Median income |
|
Each line of the file contains these 8 parameters in the same order. For example, the line describing Delaware County, PA (where Swarthmore is located) looks like this in the file (stored as a string):
Delaware County,PA,576772,93.4,40.5,5.9,10.1,78194
To create a Usda
object that contains the data about Delaware County, you
would write:
delco = Usda('Delaware County', 'PA', 576772, 93.4, 40.5, 5.9, 10.1, 78194)
Once you have a Usda
object, you can use the methods shown above (in the
Getter Method column) to extract the information you need. For example,
to print the population and unemployment rates in Delaware County given the
delco
object above, you could say:
population = delco.get_population()
unemployment = delco.pct_unemployment()
county_name = delco.get_county()
print("%s has a population of %d and an unemployment rate of %f.",
county_name, population, unemployment)
3.2. Storing the file as a list of Usda
objects
Notice that in the file, the data is stored as a string. But when you create
the Usda
object, the county and state are strings, the population and
median income are integers, and the other 4 fields are floats.
In order to create a Usda
object for each row in the file, you will
perform these steps:
-
read in one line from the file and save it in a variable, e.g.
line
. -
split this line on comma using the
split
method, e.g.fields = line.split(',')
-
each of the items in this list will be strings, but we need some of them to be
int
orfloat
, so convert individual items as necessary, e.g.fields[4] = float(fields[4])
-
use the values in
fields
to create a newUsda
object:data = Usda(fields[0], fields[1], fields[2], fields[3], fields[4], fields[5], fields[6], fields[7])
-
accumulate each
Usda
object into a list ofUsda
objects that will represent the entire dataset.
4. Displaying matches
Once you have read in the data, you’re going to need to display them nicely in the terminal. For example, the table below is taken from the last result in Example 2:
=========================================================================
| County | State |Total Pop| H.S. | Coll |Unemp |Pvrty | Income
-------------------------------------------------------------------------
| Santa Barbara C | CA | 437434 | 81.9 | 35.1 | 5.8 | 15.2 | 83185
| Santa Clara Cou | CA | 1886595 | 89.2 | 54.4 | 4.8 | 6.9 | 141161
| Santa Cruz Coun | AZ | 48034 | 77.7 | 23.0 | 8.8 | 20.4 | 49216
| Santa Cruz Coun | CA | 260495 | 88.3 | 41.9 | 6.9 | 10.6 | 90370
| Santa Fe County | NM | 155404 | 90.1 | 41.1 | 6.4 | 12.3 | 67311
| Santa Rosa Coun | FL | 193783 | 92.3 | 29.1 | 3.6 | 8.9 | 80837
-------------------------------------------------------------------------
Notice that each column has a fixed width even if the data in that column have different widths. For example, the total population of the counties shown take up different amounts of space, but the column remains well formatted regardless of the actual population being displayed. Similarly, the poverty rate has all of its values aligned on the decimal point.
The column for the county names is more complicated. Some county names in the data set are quite long. Most of the counties names shown above are too wide to fit into the "County" column: only Santa Fe County fits in its entirety. When county names are too long, we can use slicing to truncate these strings to a fixed maximum length.
4.1. Slicing
This section of the online textbook explains how to slice a string. For instance, the following example shows how to slice out the first 10 characters from a string:
>>> alphabet = 'abcdefghijklmnopqrstuvwxyz'
>>> alphabet[0:10]
'abcdefghij'
>>> alphabet
'abcdefghijklmnopqrstuvwxyz'
This also works even if we provide an upper bound that is longer than the string we’re trying to slice:
>>> short_string = 'xyz'
>>> short_string[0:10]
'xyz'
4.2. String formatting
Another tool that you will need to display the output results nicely is string
formatting. Recall that string formatting allows us to specify the width of the output
when printing. For example, print("%10s" % ("hi"))
will print "hi" with 8
spaces in front of it so that it takes up a total of 10 places on the screen.
If it’s helpful, you can also use the -
flag to get left-justified printing
(the default is right-justified). For example:
$ python3
>>> print("|%-10s|%10s|" % ("left", "right"))
|left | right|
You can also specify the width when displaying integers and floats. When displaying floats, we typically specify not just the width but also the number of digits that should be allowed after the decimal point. For example:
$ python3
>>> print("| %5d | %5d | %5.0f | %5.1f |" % (10, 31415, 23.2, 523.1))
| 10 | 31415 | 23 | 523.1 |
By combining these ideas, you can make your output look like a nice table where everything is aligned, as demonstrated in the example outputs.
If you are unsure about exactly how string formatting handles different values, try them out in the interpreter just like the two examples shown above. |
5. Adapting the search algorithms
5.1. Adapting linear search for state searches
The version of linear search that we discussed in class stops as soon as it finds the item it is looking for, and returns the index of that item or -1 if not found. For this problem, your linear search will need to accumulate a list of matching results and return that list, instead of returning the index where the item was found.
In addition, the basic version of linear search we discussed in class showed
how to search lists of integers or strings. In this lab, we will be searching
lists of objects, so will need to adapt the linear search function we
discussed in class to use method syntax shown in the section
describing the Usda
class to get the data within each object.
5.2. Adapting binary search for county searches
Because the data is sorted by county name, we can use binary search to very quickly locate a particular county. However, names can be somewhat long. Additionally, the data may contain multiple counties with similar name prefixes.
For example, there are several counties names that start with "East":
...
Early County <- this has a different prefix
East Baton Rouge Parish
East Carroll Parish <- suppose binary search finds this one
East Feliciana Parish
Eastland County
Eaton County <- this has a different prefix
...
5.2.1. Finding the first match
First write a binary search function that searches for a partial match to a
county name and returns the index of the first match that it finds (or -1
otherwise). Make sure your binary search function uses startswith
(see
here)
rather than ==
so it doesn’t miss any partial matches (e.g. using "East" to
match on "East Carroll Parish").
However, there might be more matches that are just before or just after that initial index you locate as shown in the example above. So after you find one match, you’ll need to search in the vicinity of that match to find all of the other potential matches. In the example shown above, if your search finds "East Carroll Parish", you need to look before that item to find "East Baton Rouge Parish", and you need to look after that item to find both "East Feliciana Parish" and "Eastland County".
To do this, you will use the provided look_forward_backward
function described below.
5.2.2. Finding nearby matches
We have provided you with a function to help with finding nearby
matches based on an initial seed. This is called look_forward_backward()
and
you it is also imported from the usda_lab
library. This function takes four
parameters and returns a list of matching items that it found looking either
forward or backwards. Here is the function definition and a block comment
explaining the arguments and return type:
def lookForwardBackward(data, index, prefix, direction):
"""
This function uses linear search to look forwards (or backwards) in
a list of Usda items for counties that start with the specified prefix.
It begins the linear search at the index specified. All matching objects
are accumulated into a list and returned. The item at the specified index
is *not* included in the accumulated results.
Args:
data (list): the Usda records to search over
index (int): the index to start linear search from (this is the index
that your binary search function returned)
prefix (str): the prefix of the county to search for
direction (int): This is represents whether we should search forward
(direction = 1) or backward (direction = -1)
Returns:
list: a list of Usda items that match the country prefix
(in the specified direction)
"""
You will need to call it twice for each binary search (once to get the forward matches and again to get the backward matches).
A call to the look_forward_backward()
function might look something like this:
backward_matches = look_forward_backward(all_data, current_index, search_term, -1)
Given the example shown above where the user searched for East
, let’s
assume that "East Carroll Parish" is at index 839. We could look for
forward matches using the look_forward_backward
function as follows:
forward_matches = look_forward_backward(all_data, 839, "East", 1)
The returned list of forward_matches
would contain the Usda
objects for
"East Feliciana Parish" and "Eastland County".
5.3. Adapting linear search for median income searches
When searching by median income, your search should find any county whose median income is up to $2000 higher or lower than the value you are searching for. For example, if you search for a median income of $31000, you will return a search results list containing all counties whose median income is between $29000 and $33000.
6. Extra Challenges
Here are some optional extensions! Neither is required, but either or both might be interesting and rewarding.
6.1. Optional: Finding the county with the highest and lowest poverty rates.
After completing a search by state, inform the user which county has the lowest poverty rate and the which county has the highest poverty rate.
Here is an example run showing how this extra challenge should work:
You might also think about adding something similar for unemployment rates or high school and college graduation rates.
6.2. Optional: Extending your own binary search
Rather than importing the provided look_forward_backward
function, try writing
your own. This should work the same way as the provided version: it will look
linearly forwards or linearly backwards based on an initial seed index that it
takes as a parameter. The function will also need to take the search string as
an input parameter (as well as potentially other information depending on your
design). This function should return a list of all the partial matches based on
that initial seed index.
The program can then add that list together with the entry at the initial seed to compute the total list of partial matches which are then displayed to the user.
Answer the Questionnaire
Each lab will have a short questionnaire at the end. Please edit
the Questions-08.txt
file in your cs21/labs/08
directory
and answer the questions in that file.
Once you’re done with that, you should run handin21
again.
Submitting lab assignments
Remember to run handin21
to turn in your lab files! You may run handin21
as many times as you want. Each time it will turn in any new work. We
recommend running handin21
after you complete each program or after you
complete significant work on any one program.
Logging out
When you’re done working in the lab, you should log out of the computer you’re using.
First quit any applications you are running, like the browser and the terminal. Then click on the logout icon ( or ) and choose "log out".
If you plan to leave the lab for just a few minutes, you do not need to log out. It is, however, a good idea to lock your machine while you are gone. You can lock your screen by clicking on the lock icon. PLEASE do not leave a session locked for a long period of time. Power may go out, someone might reboot the machine, etc. You don’t want to lose any work!