CS21 Lab 08: Searching
Due Saturday, November 16, by 11:59pm
Please read through the entire lab before starting!
This is a one week lab. For the previous lab, you used a full week to practice the TDD component and a second week to implement your solution. Since the design process should be similar to the previous lab, you will be doing the design and implementation for this lab in a single week. If you found the open-ended nature of the last lab challenging, you will want to start this lab early.
All of the standard style guidelines from previous labs still apply:
As you write programs, use good programming practices:
-
Use a comment at the top of the file to describe the purpose of the program (see example).
-
All programs should have a
main()
function (see example). -
Use variable names that describe the contents of the variables.
-
Write your programs incrementally and test them as you go. This is really crucial to success: don’t write lots of code and then test it all at once! Write a little code, make sure it works, then add some more and test it again.
-
Don’t assume that if your program passes the sample tests we provide that it is completely correct. Come up with your own test cases and verify that the program is producing the right output on them.
-
Avoid writing any lines of code that exceed 80 columns.
-
Always work in a terminal window that is 80 characters wide (resize it to be this wide)
-
In
vscode
, at the bottom right in the window, there is an indication of both the line and the column of the cursor.
-
Function Comments
All functions should have a top-level comment! Please see our function example page if you are confused about writing function comments.
Goals
-
Write a program that uses linear search
-
Write a program that uses binary search
-
Continue practicing Top-Down Design (TDD)
-
Connect CS topics with real-world data
1. Searching the Top 500 "Greatest" Novels database
1.1. The Top 500 "Greatest" Novels database
For this this assignment, we will be working with a real-world dataset from the Responsible Datasets in Context website. We downloaded the Top 500 "Greatest" Novels database that you will be using for this lab and edited out a few columns to make the dataset more manageable.
We encourage you to visit the link so you can read more about the dataset. That website says that "[t]his dataset contains information on the top 500 novels most widely held in libraries, according to OCLC, a library organization with over 16,000 member libraries in over 100 countries."
The data we will be working with are stored in a
delimiter-separated
value format and is located at: /data/cs21/novels/novels.dsv
there are actually two versions of the data file; the primary one is:
/data/cs21/novels/novels.dsv but there’s also a copy that’s had most of the
lines removed, which may be easier to use for testing your program, particularly
in the early stages: /data/cs21/novels/novels_small.dsv . That file only
contains novels published since 2008 and contains only 30 novels.
|
Each line in the file contains 7 pieces of information about a novel summarized below:
Feature index |
Type |
Description |
0 |
str |
title |
1 |
str |
author |
2 |
int |
publication year |
3 |
str |
the gender of the author (see note) |
4 |
int |
rank of the novel by how often it is held in libraries |
5 |
float |
the average rating on Goodreads |
6 |
int |
the number of ratings on Goodreads |
Note: As explained on the website where we downloaded the data, the gender of the author is according to an external data source that only contained binary gender categories. Therefore, this data may not accurately reflect gender for authors who identified differently.
Here are the first few lines and the last few lines of the file.
Each column is separated using the symbol |
(called the "pipe symbol") as shown below. The pipe symbol is located above the "Enter" or "Return" key on most keyboards.
The Tale of Genji|Murasaki Shikibu|1021|female|138|3.72|13275 The Decameron|Giovanni Boccaccio|1351|male|85|3.88|39870 Le Morte D'Arthur|Thomas Malory|1485|male|285|3.92|37568 ... Diary of a Wimpy Kid: The Long Haul|Jeff Kinney|2014|male|249|4.45|22 Go Set a Watchman|Harper Lee|2015|female|481|3.31|272837 The Girl on the Train|Paula Hawkins|2015|female|402|3.96|2924281
Note that the file is sorted by the third column, the year the novel was published.
In this lab we will focus on searching by title, by year of publication, and by the average rating on Goodreads.
You will implement a program called search_novels.py
, which will allow the user
to search through this dataset and investigate the information contained in it.
2. Implementation
Your program should do each of the following:
2.1. Read in the data
-
Your program should read in stored data from a file, storing a list of novels from the dataset. The data is stored in the following file:
/data/cs21/novels/novels.dsv
Note: don’t copy this file to your
labs/08
directory, just use"/data/cs21/novels/novels.dsv"
as the file name in your program.As you read in the data, each line of the file corresponds to one novel, which your program should represent using a
Novel
object and store in a list.
2.2. Present a menu
-
Your program should repeatedly present a menu to the user with three choices: to search by title, to search by year of publication, to search by average rating, or to quit. For example:
Please select one of the following choices: 1. Search by title 2. Search by publication year 3. Search by average rating 0. Quit Choice?
2.3. Search the data set
-
Depending on the user’s selection from the menu, your program should prompt the user for additional information and then do one of:
-
Search the dataset by title, using linear search. See Adapting linear search for title searches
-
When searching by title, your program should compare all data in lowercase format (so that user input is case insensitive) but displays results with their original case.
-
-
Search the dataset by publication year, using binary search. See Adapting binary search for publication year searches
-
Search the dataset by average rating, using linear search. See Adapting linear search for average rating searches
-
Exit the program.
-
2.4. Validate all input
-
You should ensure that the user provides valid input at all prompts.
-
For validating menu options, the user can only type in options that you provide them.
-
For validating publication year, any positive
int
up to 2024 is valid. -
For validating average rating, any
float
between 0.0 and 5.0 is valid.-
We have provided you with a function called
is_numeric
that you can use to check if a string is afloat
. This function is imported from thenovels_lab
library. The function works as follows:is_numeric("3.14") # returns True is_numeric("-3") # returns True is_numeric("3.1a") # returns False is_numeric("a") # returns False
-
-
2.5. Display the data
-
Uses formatted printing to show search results in a table format that fits in an 80-character wide terminal window. See Displaying matches
2.6. Example Output
-
Here are some examples of searching:
-
Here are some examples of handling valid/invalid input:
2.7. Tips
2.7.1. Continue to use TDD
We strongly recommend that you create a top-down design with a fully fleshed
out main
and stubs for all other functions. However, you may not work with a
partner on this lab. You don’t need to get approval from an instuctor for
your design, though we’d be glad to review it with you in lab, office hours, or
ninja sessions. You should only begin implementing the other functions after
you have a clear plan in place for the structure of main
.
2.7.2. Create modular, reusable functions
Avoid duplicating similar code in multiple places throughout the program. Think about how you could re-use the same function. For example, we are doing two types of searches (linears search and binary search), but the results of those searches are displayed in exactly the same way. Consider having a function that takes in a list of search results and displays them. Then you could call this function for either type of search.
3. The novels_lab
library
The lab comes with a novels_lab
library that gives you several helpful
tools intended to make this lab easier. To use the library, we’ve put
this line at the top of your file:
from novels_lab import Novel, look_forward_backward, is_numeric
3.1. Novel class
The novels_lab
library provides a Novel
class, which encapsulates all the features
needed to store one row of the data set. You can create an object of this class by
calling the Novel
constructor with each field of information passed in
as separate parameters.
The Novel
constructor takes 7 parameters shown in the table below (ignore
the Getter Method column for now):
Parameter |
Type |
Description |
Getter Method |
0 |
str |
title |
|
1 |
str |
author |
|
2 |
int |
publication year |
|
3 |
str |
the gender of the author |
|
4 |
int |
rank of the novel by how often it is held in libraries |
|
5 |
float |
the average rating on Goodreads |
|
6 |
int |
the number of ratings on Goodreads |
|
Each line of the file contains these 7 parameters in the same order. For example, the line describing "Persuasion" by Jane Austen looks like this in the file (stored as a string):
Persuasion|Jane Austen|1818|female|89|4.15|697703
To create a Novel
object that contains the data about this novel, you
would write:
book = Novel('Persuasion', 'Jane Austen', 1818, 'female', 89, 4.15, 697703)
Once you have a Novel
object, you can use the methods shown above (in the
Getter Method column) to extract the information you need. For example,
to print the author, title and average rating of the book
object above,
you could say:
author = book.get_author()
title = book.get_title()
rating = book.get_avg_rating()
print('"%s", by %s, has an average rating of %f' % (title, author, rating))
3.2. Storing the file as a list of Novel
objects
Notice that in the file, the data is stored as a string. But when you create
the Novel
object, only the title, author, and gender are strings. The year,
rank, and number of ratings are integers, and the average rating is a float.
In order to create a Novel
object for each row in the file, you will
perform these steps:
-
read in one line from the file and save it in a variable, e.g.
line
. -
split this line on the pipe symbol using the
split
method, e.g.fields = line.split('|')
-
each of the items in this list will be strings, but we need some of them to be
int
orfloat
, so convert individual items as necessary, e.g.fields[2] = int(fields[2])
-
use the values in
fields
to create a newNovel
object:data = Novel(fields[0], fields[1], fields[2], fields[3], fields[4], fields[5], fields[6])
-
accumulate each
Novel
object into a list ofNovel
objects that will represent the entire dataset.
4. Displaying matches
Once you have read in the data, you’re going to need to display them nicely in
the terminal. For example, the table below shows the results in a search for the
string "little"
.
==============================================================================
| Title and Author | Year | Gender | Rank | Rate | Ratings
------------------------------------------------------------------------------
| Little Dorrit
| Charles Dickens | 1857 | male | 169 | 4.00 | 47690
------------------------------------------------------------------------------
| Little Women
| Louisa May Alcott | 1868 | female | 16 | 4.15 | 2218654
------------------------------------------------------------------------------
| Little House In the Big Woods
| Laura Ingalls Wilder | 1932 | female | 214 | 4.20 | 268115
------------------------------------------------------------------------------
| Little House on the Prairie
| Laura Ingalls Wilder | 1935 | female | 239 | 4.21 | 289969
------------------------------------------------------------------------------
| Little Town on the Prairie
| Laura Ingalls Wilder | 1941 | female | 462 | 4.19 | 89010
------------------------------------------------------------------------------
| The Little Prince
| Antoine de Saint-Exupéry | 1943 | male | 35 | 4.32 | 2095592
------------------------------------------------------------------------------
| Stuart Little
| E.B. White | 1945 | male | 241 | 3.90 | 125340
------------------------------------------------------------------------------
Notice that each column has a fixed width even if the data in that column have different widths. For example, the titles shown take up different amount of space, but the column remains well formatted regardless of the title being displayed. Similarly, the average rating has all of its values shown with two decimal points and the data is aligned on the decimal point.
The title is displayed above the other data fields on a line of its own, ensuring that you can fit all of the data in an 80-column terminal window.
4.1. String formatting
Recall that string formatting allows us to specify the width of the output
when printing. For example, print("%10s" % ("hi"))
will print "hi" with 8
spaces in front of it so that it takes up a total of 10 places on the screen.
If it’s helpful, you can also use the -
flag to get left-justified printing
(the default is right-justified). For example:
$ python3
>>> print("|%-10s|%10s|" % ("left", "right"))
|left | right|
You can also specify the width when displaying integers and floats. When displaying floats, we typically specify not just the width but also the number of digits that should be allowed after the decimal point. For example:
$ python3
>>> print("| %5d | %5d | %5.1f | %5.2f |" % (31415, 10, 23.2, 23.1))
| 31415 | 10 | 23.2 | 23.10 |
By combining these ideas, you can make your output look like a nice table where everything is aligned, as demonstrated in the example outputs.
If you are unsure about exactly how string formatting handles different values, try them out in the interpreter just like the two examples shown above. |
5. Adapting the search algorithms
5.1. Adapting linear search for title searches
The version of linear search that we discussed in class stops as soon as it finds the item it is looking for, and returns the index of that item or -1 if not found. For this problem, your linear search will need to accumulate a list of matching results and return that list, instead of returning the index where the item was found.
In addition, the basic version of linear search we discussed in class showed
how to search lists of integers or strings. In this lab, we will be searching
lists of objects, so will need to adapt the linear search function we
discussed in class to use method syntax shown in
the section describing the Novel
class to get the data within each object.
For title searches, whenever your search string is a substring of the actual title, that is considered a match. Your searches are also case-insensitive. For example, if the user searches for "house", your search will match titles such as "The House of the Seven Gables" and "To the Lighthouse".
See Example 1 for samples results using this search.
5.2. Adapting binary search for publication year searches
Because the data is sorted by publication year, we can use binary search to very quickly locate novels published in a particular year. However, the data may contain multiple novels with the same publication year and we will need to make sure that our binary search function returns all of the matches.
5.2.1. Finding the first match
First write a binary search function that searches for a publication year
and returns the index of the first match that it finds (or -1
otherwise). This function should be similar to the binary search function
we wrote in class but adapted for use with the Novel
class.
For example, there are 5 novels that were published in 1970. When you use binary search to search for the publication year of 1970, you will find one of the novels from 1970, but not all of them. For example, your search might have yielded something like this (showing only the first three columns of the data):
...
Sounder|William H. Armstrong|1969 <- different year
The Summer of the Swans|Betsy Cromer Byars|1970
The Trumpet of the Swan|E.B. White|1970
Are You There God? It's Me, Margaret|Judy Blume|1970
Jonathan Livingston Seagull|Richard Bach|1970 <- suppose binary search finds this one
The Bluest Eye|Toni Morrison|1970
Mrs. Frisby and the Rats of NIMH|Robert C. O'Brien|1971 <- different year
...
In the example above, binary search found the Richard Bach novel, "Jonathan Livingston Seagull", which was published in 1970. But there are four other novels from 1970 that we also need to include in our search results. Since the novels are sorted by year, we know that all of the novels from 1970 will be grouped together in our data.
In the example shown above, if your search finds the Richard Bach novel, you need to look before that novel to find the novels by Judy Blume, E.B. White and Betsy Cromer Byars; and you need to look after the Richard Bach novel to find the Toni Morrison novel.
To do this, you will use the provided look_forward_backward
function described below.
5.2.2. Finding nearby matches
We have provided you with a function to help with finding nearby
matches based on an initial seed. This is called look_forward_backward()
and
you it is also imported from the novels_lab
library. This function takes four
parameters and returns a list of matching items that it found looking either
forward or backwards. Here is the function definition and a block comment
explaining the arguments and return type:
def look_forward_backward(data, index, year, direction):
"""
This function uses linear search to look forwards (or backwards) in
a list of Novel items for a matching year. It begins the linear search at
the index specified. All matching objects are accumulated into a list and
returned. The item at the specified index is *not* included in the
accumulated results.
Args:
data (list): the list of Novel objects to search over
index (int): the index to start linear search from (this is the index
that your binary search function returned)
year (int): the year to search for
direction (int): This is represents whether we should search forward
(direction = 1) or backward (direction = -1)
Returns:
list: a list of Novels items that match the year
(in the specified direction)
"""
You will need to call this function twice for each binary search (once to get the forward matches and again to get the backward matches).
Given the example shown above where the user searched for 1970
(search_year
), let’s assume that "Jonathan Livingston Seagull" is at
index 317 (index
).
A call to get the forward and backward matches might look something like this:
forward_matches = look_forward_backward(all_data, index, search_year, 1)
backward_matches = look_forward_backward(all_data, index, search_year, -1)
The returned list of forward_matches
would contain a Novel
object
for "The Bluest Eye" since it is the only Novel from 1970 that comes
after "Jonathan Livingston Seagull" in the list of Novels.
The returned list of backward_matches
would contain the Novel
objects for "The Summer of the Swans", "The Trumpet of the Swan", and
"Are You There God? It’s Me, Margaret", since those Novels are from
1970 and come before "Jonathan Livingston Seagull" in the list of
Novels.
To combine the results of your forward_matches
and backward_matches
:
result = backward_matches + [all_data[index]] + forward_matches
See Example 2 for the results of this search.
5.3. Adapting linear search for average rating searches
When searching by average rating, your search should find any Novel whose rating is within 0.015 of the rating the user is searching for. For example, if the user searches for a rating of 3.43, your search should return all Novels whose rating is between 3.415 and 3.445.
See Example 3 for the results of this search.
6. Requirements
The code you submit for labs is expected to follow good style practices, and to meet one of the course standards, you’ll need to demonstrate good style on six or more of the lab assignments across the semester. To meet the good style expectations, you should:
In addition, you’ll need to demonstrate good top-down design practices on two or more lab assignments across the semester. To meet the top-down design expectations, you should:
|
Your program should meet the following requirements:
-
Your program must read the novels data set from the
.dsv
file and store the data in a list of Novel objects. -
Your program must only accept valid input as described above.
-
When the user selects search by title, your program must search the novels data set by title using linear search as described above.
-
When the user selects search by publication year, your program must search the novels data set by publication year using binary search as described above.
-
When the user selects search by average rating, your program must search the novels data set by average rating using linear search as described above.
-
After performing a search, your program must display the results in a tabular format that fits in an 80-character wide terminal window and uses string formatting to align the data, similar to the format shown in the Displaying matches section. (If no results are found, your program must display a message indicating that no results were found.)
-
After each search, your program must return to the main menu and allow the user to perform another search or quit the program.
7. Extra Challenges
Here is an optional extension that you may find interesting.
7.1. Optional: Extending your own binary search
Rather than importing the provided look_forward_backward
function, try writing
your own. This should work the same way as the provided version: it will look
linearly forwards or linearly backwards based on an initial seed index that it
takes as a parameter. The function will also need to take the search string as
an input parameter (as well as potentially other information depending on your
design). This function should return a list of all the partial matches based on
that initial seed index.
The program can then add that list together with the entry at the initial seed to compute the total list of partial matches which are then displayed to the user.
Answer the Questionnaire
After each lab, please complete the short Google Forms questionnaire. Please select the right lab number (Lab 08) from the dropdown menu on the first question.
Once you’re done with that, you should run handin21
again.
Submitting lab assignments
Remember to run handin21
to turn in your lab files! You may run handin21
as many times as you want. Each time it will turn in any new work. We
recommend running handin21
after you complete each program or after you
complete significant work on any one program.
Logging out
When you’re done working in the lab, you should log out of the computer you’re using.
First quit any applications you are running, including your vscode editor, the browser and the terminal. Then click on the logout icon ( or ) and choose "log out".
If you plan to leave the lab for just a few minutes, you do not need to log out. It is, however, a good idea to lock your machine while you are gone. You can lock your screen by clicking on the lock icon. PLEASE do not leave a session locked for a long period of time. Power may go out, someone might reboot the machine, etc. You don’t want to lose any work!