CS21 Lab 9: Google's Did you mean...

Due by 11:59 pm, Tuesday, Nov 8

Run update21 to create a directory for lab09. You will hand in two files for this lab, design.py, to be completed first, containing your top-down design, and google.py containing your implemented program. Run handin21 to submit these files.


Introduction

For Lab 9 and Lab 10, you will be writing a program that implements something similar to the algorithm Google uses for its Did you mean feature. This feature is used to suggest alternative search terms when you type in a misspelled search term. For example, if you go to Google and enter a slightly misspelled search term, such as coputer. Google will suggest that perhaps you meant to type computer (NOTE: Google recently changed this feature. It now shows results for computer, and provide a link if you *really* wanted to search for coputer. You will be implementing the older approach). Our version of the algorithm will not return exactly the same results that Google does, but it will be a reasonable approximation.


General Program Requirements

To accomplish this task you will create a list of lists of word-frequency information (referred to as a word-frequency list for the rest of this document). The words and how often they occur (i.e., their frequency), will be read in from the file /usr/local/doc/wordcts.txt (see Word Frequencies below for details). When the user enters a word to search, you will first check if the word appears in your word-frequency list. If so, you should report that it was found. If not, you will generate a new list of possible alternative spellings based on the typical kinds of typos that a user could make (see Generating Alternatives for details). Using this list of alternative spellings, you should search the word-frequency list to find all alternatives that are valid words (with their frequencies). Once all alternative valid words are found, you will output this list as suggestions for the correct spelling of the user-entered word (see Suggesting Alternatives for details). Sometimes the search word will be so badly misspelled that your program won't be able to find any viable alternatives. In this case, you should just report that it was not found. The program should keep asking the user for more words until they type QUIT.

Here is a sample interaction:

Welcome to "Did You Mean?".  This program will search a list of known
words to see if your search query is a valid word.  If not, it will
suggest alternate spellings that are valid words.

Enter a word or QUIT to exit: stell
Did you mean:
     tell 68765376
     sell 64152506
Enter a word or QUIT to exit: spell
Found!
Enter a word or QUIT to exit: spl
No entries found
Enter a word or QUIT to exit: ebase
Did you mean:
    base 55148877
    ease 10558329
Enter a word or QUIT to exit: QUIT

The next few sections provide some tips and hints. Please read through these sections before starting your design.


Requirements for Lab 9
  1. Top-Down Design - In a file design.py, implement the top-down design for your solution. After reading through the rest of this document, spend some time sketching out the main function and the other functions you'll need to complete this program. Remember to not get bogged down in all of the details of implementing your solutions.

    For your design, remember to focus on the interfaces. What parameters will each function need? What will each function return? How will you call the functions from main? Also, add calls to your created functions (control flow) in your design to show the order in which your program will call functions.

    The design must be completed first. It does not need to be perfect but should indicate that you have thought about the problem and the structure of your solution. In addition, while certain functionalities do not need to be implemented until Lab 10 your design must take into account all main functional pieces of both your final Lab 9 and 10 solutions.

  2. Your design must contain at least 8 functions.
  3. Bottom-up Implementation - Once you complete design.py, copy your design over to the file google.py. You will do bottom-up implementation and unit testing in google.py. You can copy your design to google.py like this:
           cp design.py google.py
    

  4. Your program should not stop running until the user enters the word QUIT
  5. All functions listed as to be implemented in Lab 10 should be stubbed-out (i.e., prototyped) for Lab 9, including comments and a correct return type. See Lab 7 for details on function stubs.
  6. You must use a list of lists to store the word-frequency list read in from the file wordcts.txt.
  7. Your program must use binary search to find entries in your word-frequency list (i.e., your lists of lists).
  8. HINT: when keeping track of valid alternate words, keep in mind that you will need to output both the word and frequency. Thus, your alternate word list will look like a smaller version of the word-frequency list.

    Also, remember to look at the "Useful Tips" section of this page often and if you get stuck trying to figure out how to implement parts of this lab...there may be a useful tip there to help you get un-stuck.


Word frequencies

We have provided you with a file, wordcts.txt, that contains most words in the English language and the frequency with which each word occurred in a very large document collection. Here are the first 10 lines of that file:

a 7841087012
aardvark 93030
aardvarks 11762
abaci 5391
aback 272920
abacus 149548
abacuses 4473
abaft 19188
abalone 237386
abalones 6544
Frequently used words, such as a, occurred very often (7,841,087,012 times). Common words such as abacus occurred more frequently (149,548 times) than uncommon words such as abaci (5,391 times). We will use the word frequencies in Lab 10 to rank the alternatives, choosing the most frequent word as the best alternative. For this lab, concentrate on how to keep track of all of this information.

Your program should read the file /usr/local/doc/wordcts.txt into a python list, where each item in the list is itself a list of a word and it's frequency. For example, the first few elements in the "list of lists" will look like this:

[['a', 7841087012],['aardvark', 93030],['aardvarks', 11762], ...]
Notice that for each word-frequency list, the word is a string and the frequency is an integer.

You do not, and you should not, make a copy of the wordcts.txt file in your lab 09 subdirectory. Instead, just open it by listing the full path name as an argument to open (you can open any file anywhere in the file system by giving the path name to it):

infile = open("/usr/local/doc/wordcts.txt", "r")
In addition, you can always make a smaller example file to test out your code. Just use vim to create a new file (you could call it practicewords.txt) that has contents that are formated the same as the big word file, but just contains a handful of word-frequency entries (make sure the entries are in sorted order by word):
artsy  23
bat  45
cat  13
dog  39
grandpa 33
hello   20
movie   50
piano   99
viola   3
zero    8
Don't leave any blank lines at the end of your small test file.

Then just change your call to open to open "practicewords.txt" to test out your program's sub-functionality. For example, this smaller file of word-frequency data is small enough to print out your full list of lists to verify you are creating it correctly. Just make sure to change the call to open back to the big list after doing unit testing of functions on your smaller practice file.

Generating Alternatives

There are four different methods you need to use to create alternative spellings: deletion, transposition, substitution, and insertion. Each is explained below. We'll use the word speling as our running example. This word is not in the word-frequency list, so we need to propose alternatives.

For Lab 9, only Deletions needs to be implemented. The other three functions will be implemented for Lab 10. However, you should still have function stubs for each in your Lab 9 design/solutions.

Deletions

To be designed and implemented for Lab 9 We form alternatives of the word by generating all possible single-letter deletions. Using our example, speling, we would generate: peling (deleted the first letter), seling (deleted the second letter), spling, speing, spelng, spelig, and spelin (deleted the last letter). Each alternative differs from the original word by only one letter. For a word of length n, there are n alternatives we can form by deleting a single character; here, the word (speling) was 7 letters long and there are 7 alternatives we can form using deletion.

Transpositions

Form alternatives of the word by using transpositions (swapping of letters within the word). Details will be provided in Lab 10.

Substitutions

Form alternatives of the word by generating all possible single-letter substitutions. Details will be provided in Lab 10.

Insertions

Form alternatives of the word by inserting single-letters in the word. Details will be provided in Lab 10.


Suggesting Alternatives

Lab 10 will address this problem in detail. For Lab 9, once all alternative spellings of a word are considered, your program should print to the screen all valid alternative spellings and their term frequencies values.

Note that, for some misspelled words, none of our generated alternatives will occur in the word-frequency list. For example, if you type in telefone, there are no alternatives which occur in the word list. In cases like this, simply tell the user "Not found".


Useful Tips

Lists and strings will probably be needed in this program. See our documentation about using strings and lists as objects for guidance on advanced operations for strings and lists.

Be sure to re-examine in class examples for programs that use lists of lists as well as file input/output.

Also, your "suggesting alternatives" solution may need to use string operations that we learned early in the semester, including concatenation, substring (or slicing), and repetitions. Here are some examples:

s1 = "hello there"
s2 = "goodbye"
newstr = s1 + s2   # newstr is "hello theregoodbye"
newstr = s1[3:10]  # newtr is "l0 ther"
newstr = s1*2      # newsr is "goodbyegoodbye"

Lastly, strings can be compared using all relational operators learned in class (i.e., <, <=, > > ==, !=). Strings are ordered lexographically, exactly like in the dictonary or phonebook. For example:

name1 = "Ameet"
name2 = "Tia"

if name1 < name2:
  print name1
else:
  print name2
outputs the value Ameet since A comes before T in the alphabet. If the first letter of both strings is the same, we compare the second letter in each just as we would in the dictionary. For example, Ameet is greater than Alex. Python continues evaluating position-by-position until one string ends. In that case, the shorter word is less than the longer; for example, Amy is less than Amylia.
Submit

Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.