CS21 Lab 08: Searching

Due Saturday, November 16, by 11:59pm

Please read through the entire lab before starting!

This is a one week lab. For the previous lab, you used a full week to practice the TDD component and a second week to implement your solution. Since the design process should be similar to the previous lab, you will be doing the design and implementation for this lab in a single week. If you found the open-ended nature of the last lab challenging, you will want to start this lab early.

All of the standard style guidelines from previous labs still apply:

As you write programs, use good programming practices:

Use a comment at the top of the file to describe the purpose of the program (see example).
All programs should have a main() function (see example).
Use variable names that describe the contents of the variables.
Write your programs incrementally and test them as you go. This is really crucial to success: don’t write lots of code and then test it all at once! Write a little code, make sure it works, then add some more and test it again.
Don’t assume that if your program passes the sample tests we provide that it is completely correct. Come up with your own test cases and verify that the program is producing the right output on them.
Avoid writing any lines of code that exceed 80 columns.
- Always work in a terminal window that is 80 characters wide (resize it to be this wide)
- In vscode, at the bottom right in the window, there is an indication of both the line and the column of the cursor.

Function Comments

All functions should have a top-level comment! Please see our function example page if you are confused about writing function comments.

Goals

Write a program that uses linear search
Write a program that uses binary search
Continue practicing Top-Down Design (TDD)
Connect CS topics with real-world data

1. Searching the Top 500 "Greatest" Novels database

1.1. The Top 500 "Greatest" Novels database

For this this assignment, we will be working with a real-world dataset from the Responsible Datasets in Context website. We downloaded the Top 500 "Greatest" Novels database that you will be using for this lab and edited out a few columns to make the dataset more manageable.

We encourage you to visit the link so you can read more about the dataset. That website says that "[t]his dataset contains information on the top 500 novels most widely held in libraries, according to OCLC, a library organization with over 16,000 member libraries in over 100 countries."

The data we will be working with are stored in a delimiter-separated value format and is located at: /data/cs21/novels/novels.dsv

there are actually two versions of the data file; the primary one is: /data/cs21/novels/novels.dsv but there’s also a copy that’s had most of the lines removed, which may be easier to use for testing your program, particularly in the early stages: /data/cs21/novels/novels_small.dsv. That file only contains novels published since 2008 and contains only 30 novels.

Each line in the file contains 7 pieces of information about a novel summarized below:

Table 1. Features of 500 Novels dataset
Feature index	Type	Description
0	str	title
1	str	author
2	int	publication year
3	str	the gender of the author (see note)
4	int	rank of the novel by how often it is held in libraries
5	float	the average rating on Goodreads
6	int	the number of ratings on Goodreads

Note: As explained on the website where we downloaded the data, the gender of the author is according to an external data source that only contained binary gender categories. Therefore, this data may not accurately reflect gender for authors who identified differently.

Here are the first few lines and the last few lines of the file. Each column is separated using the symbol | (called the "pipe symbol") as shown below. The pipe symbol is located above the "Enter" or "Return" key on most keyboards.

The Tale of Genji|Murasaki Shikibu|1021|female|138|3.72|13275
The Decameron|Giovanni Boccaccio|1351|male|85|3.88|39870
Le Morte D'Arthur|Thomas Malory|1485|male|285|3.92|37568
...
Diary of a Wimpy Kid: The Long Haul|Jeff Kinney|2014|male|249|4.45|22
Go Set a Watchman|Harper Lee|2015|female|481|3.31|272837
The Girl on the Train|Paula Hawkins|2015|female|402|3.96|2924281

Note that the file is sorted by the third column, the year the novel was published.

In this lab we will focus on searching by title, by year of publication, and by the average rating on Goodreads.

You will implement a program called search_novels.py, which will allow the user to search through this dataset and investigate the information contained in it.

2. Implementation

Your program should do each of the following:

2.1. Read in the data

Your program should read in stored data from a file, storing a list of novels from the dataset. The data is stored in the following file:

/data/cs21/novels/novels.dsv

Note: don’t copy this file to your labs/08 directory, just use "/data/cs21/novels/novels.dsv" as the file name in your program.

As you read in the data, each line of the file corresponds to one novel, which your program should represent using a Novel object and store in a list.

2.2. Present a menu

Your program should repeatedly present a menu to the user with three choices: to search by title, to search by year of publication, to search by average rating, or to quit. For example:
```
Please select one of the following choices:
1. Search by title
2. Search by publication year
3. Search by average rating
0. Quit

Choice?
```

2.3. Search the data set

Depending on the user’s selection from the menu, your program should prompt the user for additional information and then do one of:
1. Search the dataset by title, using linear search. See Adapting linear search for title searches
  - When searching by title, your program should compare all data in lowercase format (so that user input is case insensitive) but displays results with their original case.
2. Search the dataset by publication year, using binary search. See Adapting binary search for publication year searches
3. Search the dataset by average rating, using linear search. See Adapting linear search for average rating searches
4. Exit the program.

2.4. Validate all input

You should ensure that the user provides valid input at all prompts.
- For validating menu options, the user can only type in options that you provide them.
- For validating publication year, any positive int up to 2024 is valid.
- For validating average rating, any float between 0.0 and 5.0 is valid.
  1. We have provided you with a function called is_numeric that you can use to check if a string is a float. This function is imported from the novels_lab library. The function works as follows:
    
    is_numeric("3.14") # returns True is_numeric("-3") # returns True is_numeric("3.1a") # returns False is_numeric("a") # returns False

2.5. Display the data

Uses formatted printing to show search results in a table format that fits in an 80-character wide terminal window. See Displaying matches

2.6. Example Output

Here are some examples of searching:
Here are some examples of handling valid/invalid input:
1. Example 4: Handling (in)valid input

2.7. Tips

2.7.1. Continue to use TDD

We strongly recommend that you create a top-down design with a fully fleshed out main and stubs for all other functions. However, you may not work with a partner on this lab. You don’t need to get approval from an instuctor for your design, though we’d be glad to review it with you in lab, office hours, or ninja sessions. You should only begin implementing the other functions after you have a clear plan in place for the structure of main.

2.7.2. Create modular, reusable functions

Avoid duplicating similar code in multiple places throughout the program. Think about how you could re-use the same function. For example, we are doing two types of searches (linears search and binary search), but the results of those searches are displayed in exactly the same way. Consider having a function that takes in a list of search results and displays them. Then you could call this function for either type of search.

3. The `novels_lab` library

The lab comes with a novels_lab library that gives you several helpful tools intended to make this lab easier. To use the library, we’ve put this line at the top of your file:

from novels_lab import Novel, look_forward_backward, is_numeric

3.1. Novel class

The novels_lab library provides a Novel class, which encapsulates all the features needed to store one row of the data set. You can create an object of this class by calling the Novel constructor with each field of information passed in as separate parameters.

The Novel constructor takes 7 parameters shown in the table below (ignore the Getter Method column for now):

Table 2. Features of 500 Novels dataset with Getter Methods
Parameter	Type	Description	Getter Method
0	str	title	`get_title()`
1	str	author	`get_author()`
2	int	publication year	`get_year()`
3	str	the gender of the author	`get_gender()`
4	int	rank of the novel by how often it is held in libraries	`get_rank()`
5	float	the average rating on Goodreads	`get_avg_rating()`
6	int	the number of ratings on Goodreads	`get_num_ratings()`

Each line of the file contains these 7 parameters in the same order. For example, the line describing "Persuasion" by Jane Austen looks like this in the file (stored as a string):

Persuasion|Jane Austen|1818|female|89|4.15|697703

To create a Novel object that contains the data about this novel, you would write:

book = Novel('Persuasion', 'Jane Austen', 1818, 'female', 89, 4.15, 697703)

Once you have a Novel object, you can use the methods shown above (in the Getter Method column) to extract the information you need. For example, to print the author, title and average rating of the book object above, you could say:

author = book.get_author()
title = book.get_title()
rating = book.get_avg_rating()
print('"%s", by %s, has an average rating of %f' % (title, author, rating))

3.2. Storing the file as a list of `Novel` objects

Notice that in the file, the data is stored as a string. But when you create the Novel object, only the title, author, and gender are strings. The year, rank, and number of ratings are integers, and the average rating is a float.

In order to create a Novel object for each row in the file, you will perform these steps:

read in one line from the file and save it in a variable, e.g. line.
split this line on the pipe symbol using the split method, e.g. fields = line.split('|')
each of the items in this list will be strings, but we need some of them to be int or float, so convert individual items as necessary, e.g. fields[2] = int(fields[2])

use the values in fields to create a new Novel object:

data = Novel(fields[0], fields[1], fields[2], fields[3], fields[4], fields[5], fields[6])

accumulate each Novel object into a list of Novel objects that will represent the entire dataset.

4. Displaying matches

Once you have read in the data, you’re going to need to display them nicely in the terminal. For example, the table below shows the results in a search for the string "little".

==============================================================================
| Title and Author               | Year | Gender | Rank | Rate | Ratings
------------------------------------------------------------------------------
| Little Dorrit
| Charles Dickens                | 1857 |   male |  169 | 4.00 |   47690
------------------------------------------------------------------------------
| Little Women
| Louisa May Alcott              | 1868 | female |   16 | 4.15 | 2218654
------------------------------------------------------------------------------
| Little House In the Big Woods
| Laura Ingalls Wilder           | 1932 | female |  214 | 4.20 |  268115
------------------------------------------------------------------------------
| Little House on the Prairie
| Laura Ingalls Wilder           | 1935 | female |  239 | 4.21 |  289969
------------------------------------------------------------------------------
| Little Town on the Prairie
| Laura Ingalls Wilder           | 1941 | female |  462 | 4.19 |   89010
------------------------------------------------------------------------------
| The Little Prince
| Antoine de Saint-Exupéry       | 1943 |   male |   35 | 4.32 | 2095592
------------------------------------------------------------------------------
| Stuart Little
| E.B. White                     | 1945 |   male |  241 | 3.90 |  125340
------------------------------------------------------------------------------

Notice that each column has a fixed width even if the data in that column have different widths. For example, the titles shown take up different amount of space, but the column remains well formatted regardless of the title being displayed. Similarly, the average rating has all of its values shown with two decimal points and the data is aligned on the decimal point.

The title is displayed above the other data fields on a line of its own, ensuring that you can fit all of the data in an 80-column terminal window.

4.1. String formatting

Recall that string formatting allows us to specify the width of the output when printing. For example, print("%10s" % ("hi")) will print "hi" with 8 spaces in front of it so that it takes up a total of 10 places on the screen.

If it’s helpful, you can also use the - flag to get left-justified printing (the default is right-justified). For example:

$ python3
>>> print("|%-10s|%10s|" % ("left", "right"))
|left      |     right|

You can also specify the width when displaying integers and floats. When displaying floats, we typically specify not just the width but also the number of digits that should be allowed after the decimal point. For example:

$ python3
>>> print("| %5d | %5d | %5.1f | %5.2f |" % (31415, 10, 23.2, 23.1))
| 31415 |    10 |  23.2 | 23.10 |

By combining these ideas, you can make your output look like a nice table where everything is aligned, as demonstrated in the example outputs.

If you are unsure about exactly how string formatting handles different values, try them out in the interpreter just like the two examples shown above.

5. Adapting the search algorithms

5.1. Adapting linear search for title searches

The version of linear search that we discussed in class stops as soon as it finds the item it is looking for, and returns the index of that item or -1 if not found. For this problem, your linear search will need to accumulate a list of matching results and return that list, instead of returning the index where the item was found.

In addition, the basic version of linear search we discussed in class showed how to search lists of integers or strings. In this lab, we will be searching lists of objects, so will need to adapt the linear search function we discussed in class to use method syntax shown in the section describing the Novel class to get the data within each object.

For title searches, whenever your search string is a substring of the actual title, that is considered a match. Your searches are also case-insensitive. For example, if the user searches for "house", your search will match titles such as "The House of the Seven Gables" and "To the Lighthouse".

See Example 1 for samples results using this search.

5.2. Adapting binary search for publication year searches

Because the data is sorted by publication year, we can use binary search to very quickly locate novels published in a particular year. However, the data may contain multiple novels with the same publication year and we will need to make sure that our binary search function returns all of the matches.

5.2.1. Finding the first match

First write a binary search function that searches for a publication year and returns the index of the first match that it finds (or -1 otherwise). This function should be similar to the binary search function we wrote in class but adapted for use with the Novel class.

For example, there are 5 novels that were published in 1970. When you use binary search to search for the publication year of 1970, you will find one of the novels from 1970, but not all of them. For example, your search might have yielded something like this (showing only the first three columns of the data):

...
Sounder|William H. Armstrong|1969                       <- different year
The Summer of the Swans|Betsy Cromer Byars|1970
The Trumpet of the Swan|E.B. White|1970
Are You There God? It's Me, Margaret|Judy Blume|1970
Jonathan Livingston Seagull|Richard Bach|1970           <- suppose binary search finds this one
The Bluest Eye|Toni Morrison|1970
Mrs. Frisby and the Rats of NIMH|Robert C. O'Brien|1971 <- different year
...

In the example above, binary search found the Richard Bach novel, "Jonathan Livingston Seagull", which was published in 1970. But there are four other novels from 1970 that we also need to include in our search results. Since the novels are sorted by year, we know that all of the novels from 1970 will be grouped together in our data.

In the example shown above, if your search finds the Richard Bach novel, you need to look before that novel to find the novels by Judy Blume, E.B. White and Betsy Cromer Byars; and you need to look after the Richard Bach novel to find the Toni Morrison novel.

To do this, you will use the provided look_forward_backward function described below.

5.2.2. Finding nearby matches

We have provided you with a function to help with finding nearby matches based on an initial seed. This is called look_forward_backward() and you it is also imported from the novels_lab library. This function takes four parameters and returns a list of matching items that it found looking either forward or backwards. Here is the function definition and a block comment explaining the arguments and return type:

def look_forward_backward(data, index, year, direction):
    """
    This function uses linear search to look forwards (or backwards) in
    a list of Novel items for a matching year. It begins the linear search at
    the index specified. All matching objects are accumulated into a list and
    returned. The item at the specified index is *not* included in the
    accumulated results.

    Args:
        data (list): the list of Novel objects to search over
        index (int): the index to start linear search from (this is the index
                     that your binary search function returned)
        year (int): the year to search for
        direction (int): This is represents whether we should search forward
                         (direction = 1) or backward (direction = -1)

    Returns:
        list: a list of Novels items that match the year
              (in the specified direction)
    """

You will need to call this function twice for each binary search (once to get the forward matches and again to get the backward matches).

Given the example shown above where the user searched for 1970 (search_year), let’s assume that "Jonathan Livingston Seagull" is at index 317 (index).

A call to get the forward and backward matches might look something like this:

forward_matches = look_forward_backward(all_data, index, search_year, 1)
backward_matches = look_forward_backward(all_data, index, search_year, -1)

The returned list of forward_matches would contain a Novel object for "The Bluest Eye" since it is the only Novel from 1970 that comes after "Jonathan Livingston Seagull" in the list of Novels.

The returned list of backward_matches would contain the Novel objects for "The Summer of the Swans", "The Trumpet of the Swan", and "Are You There God? It’s Me, Margaret", since those Novels are from 1970 and come before "Jonathan Livingston Seagull" in the list of Novels.

To combine the results of your forward_matches and backward_matches:

result = backward_matches + [all_data[index]] + forward_matches

See Example 2 for the results of this search.

5.3. Adapting linear search for average rating searches

When searching by average rating, your search should find any Novel whose rating is within 0.015 of the rating the user is searching for. For example, if the user searches for a rating of 3.43, your search should return all Novels whose rating is between 3.415 and 3.445.

See Example 3 for the results of this search.

6. Requirements

The code you submit for labs is expected to follow good style practices, and to meet one of the course standards, you’ll need to demonstrate good style on six or more of the lab assignments across the semester. To meet the good style expectations, you should:

Write a comment that describes the program as a whole using a comment at the top of the file.
All programs should have a main() function.
Use descriptive variable names.
Write code with readable line length, avoiding writing any lines of code that exceed 80 columns.
Add comments to explain complex code segments (as applicable).
Write a comment for each function (except main) that describes its purpose, parameters and return value.

In addition, you’ll need to demonstrate good top-down design practices on two or more lab assignments across the semester. To meet the top-down design expectations, you should:

Have a main() function that serves as an outline for your program.
Your main() function should call other functions that implement subtasks.
The functions that are called by main() should be well-named and have good comments explaining their purpose, parameters (including their types), and return value (including its type).

Your program should meet the following requirements:

Your program must read the novels data set from the .dsv file and store the data in a list of Novel objects.
Your program must only accept valid input as described above.
When the user selects search by title, your program must search the novels data set by title using linear search as described above.
When the user selects search by publication year, your program must search the novels data set by publication year using binary search as described above.
When the user selects search by average rating, your program must search the novels data set by average rating using linear search as described above.
After performing a search, your program must display the results in a tabular format that fits in an 80-character wide terminal window and uses string formatting to align the data, similar to the format shown in the Displaying matches section. (If no results are found, your program must display a message indicating that no results were found.)
After each search, your program must return to the main menu and allow the user to perform another search or quit the program.

7. Extra Challenges

Here is an optional extension that you may find interesting.

7.1. Optional: Extending your own binary search

Rather than importing the provided look_forward_backward function, try writing your own. This should work the same way as the provided version: it will look linearly forwards or linearly backwards based on an initial seed index that it takes as a parameter. The function will also need to take the search string as an input parameter (as well as potentially other information depending on your design). This function should return a list of all the partial matches based on that initial seed index.

The program can then add that list together with the entry at the initial seed to compute the total list of partial matches which are then displayed to the user.

Answer the Questionnaire

After each lab, please complete the short Google Forms questionnaire. Please select the right lab number (Lab 08) from the dropdown menu on the first question.

Once you’re done with that, you should run handin21 again.

Submitting lab assignments

Remember to run handin21 to turn in your lab files! You may run handin21 as many times as you want. Each time it will turn in any new work. We recommend running handin21 after you complete each program or after you complete significant work on any one program.

Logging out

When you’re done working in the lab, you should log out of the computer you’re using.

First quit any applications you are running, including your vscode editor, the browser and the terminal. Then click on the logout icon ( or other logout icon ) and choose "log out".

If you plan to leave the lab for just a few minutes, you do not need to log out. It is, however, a good idea to lock your machine while you are gone. You can lock your screen by clicking on the lock xlock icon. PLEASE do not leave a session locked for a long period of time. Power may go out, someone might reboot the machine, etc. You don’t want to lose any work!