Run update21 to create a directory for lab09. You will hand in two files for this lab, design.py, to be completed first, containing your top-down design, and google.py containing your implemented program. Run handin21 to submit these files.
For Lab 9 and Lab 10, you will be writing a program that implements something similar to the algorithm Google uses for its Did you mean feature. This feature is used to suggest alternative search terms when you type in a misspelled search term. For example, if you go to Google and enter a slightly misspelled search term, such as coputer. Google will suggest that perhaps you meant to type computer (NOTE: Google recently changed this feature. It now shows results for computer, and provide a link if you *really* wanted to search for coputer. You will be implementing the older approach). Our version of the algorithm will not return exactly the same results that Google does, but it will be a reasonable approximation.
To accomplish this task you will create a list of lists of word-frequency information (referred to as a word-frequency list for the rest of this document). The words and how often they occur (i.e., their frequency), will be read in from the file /usr/local/doc/wordcts.txt (see Word Frequencies below for details). When the user enters a word to search, you will first check if the word appears in your word-frequency list. If so, you should report that it was found. If not, you will generate a new list of possible alternative spellings based on the typical kinds of typos that a user could make (see Generating Alternatives for details). Using this list of alternative spellings, you should search the word-frequency list to find all alternatives that are valid words (with their frequencies). Once all alternative valid words are found, you will output this list as suggestions for the correct spelling of the user-entered word (see Suggesting Alternatives for details). Sometimes the search word will be so badly misspelled that your program won't be able to find any viable alternatives. In this case, you should just report that it was not found. The program should keep asking the user for more words until they type QUIT.
Here is a sample interaction:
Welcome to "Did You Mean?". This program will search a list of known words to see if your search query is a valid word. If not, it will suggest alternate spellings that are valid words. Enter a word or QUIT to exit: stell Did you mean: tell 68765376 sell 64152506 Enter a word or QUIT to exit: spell Found! Enter a word or QUIT to exit: spl No entries found Enter a word or QUIT to exit: ebase Did you mean: base 55148877 ease 10558329 Enter a word or QUIT to exit: QUIT
The next few sections provide some tips and hints. Please read through these sections before starting your design.
For your design, remember to focus on the interfaces. What parameters will each function need? What will each function return? How will you call the functions from main? Also, add calls to your created functions (control flow) in your design to show the order in which your program will call functions.
The design must be completed first. It does not need to be perfect but should indicate that you have thought about the problem and the structure of your solution. In addition, while certain functionalities do not need to be implemented until Lab 10 your design must take into account all main functional pieces of both your final Lab 9 and 10 solutions.
cp design.py google.py
Also, remember to look at the "Useful Tips" section of this page often and if you get stuck trying to figure out how to implement parts of this lab...there may be a useful tip there to help you get un-stuck.
We have provided you with a file, wordcts.txt, that contains most words in the English language and the frequency with which each word occurred in a very large document collection. Here are the first 10 lines of that file:
a 7841087012 aardvark 93030 aardvarks 11762 abaci 5391 aback 272920 abacus 149548 abacuses 4473 abaft 19188 abalone 237386 abalones 6544Frequently used words, such as a, occurred very often (7,841,087,012 times). Common words such as abacus occurred more frequently (149,548 times) than uncommon words such as abaci (5,391 times). We will use the word frequencies in Lab 10 to rank the alternatives, choosing the most frequent word as the best alternative. For this lab, concentrate on how to keep track of all of this information.
Your program should read the file /usr/local/doc/wordcts.txt into a python list, where each item in the list is itself a list of a word and it's frequency. For example, the first few elements in the "list of lists" will look like this:
[['a', 7841087012],['aardvark', 93030],['aardvarks', 11762], ...]Notice that for each word-frequency list, the word is a string and the frequency is an integer.
You do not, and you should not, make a copy of the wordcts.txt file in your lab 09 subdirectory. Instead, just open it by listing the full path name as an argument to open (you can open any file anywhere in the file system by giving the path name to it):
infile = open("/usr/local/doc/wordcts.txt", "r")In addition, you can always make a smaller example file to test out your code. Just use vim to create a new file (you could call it practicewords.txt) that has contents that are formated the same as the big word file, but just contains a handful of word-frequency entries (make sure the entries are in sorted order by word):
artsy 23 bat 45 cat 13 dog 39 grandpa 33 hello 20 movie 50 piano 99 viola 3 zero 8Don't leave any blank lines at the end of your small test file.
Then just change your call to open to
open "practicewords.txt" to test out your program's sub-functionality.
For example, this smaller file of word-frequency data
is small enough to print out your full list of lists to verify you are
creating it correctly. Just make sure to change
the call to open back to the big list after doing unit testing of
functions on your smaller practice file.
There are four different methods you need to use to create alternative spellings: deletion, transposition, substitution, and insertion. Each is explained below. We'll use the word speling as our running example. This word is not in the word-frequency list, so we need to propose alternatives.
For Lab 9, only Deletions needs to be implemented. The other three functions will be implemented for Lab 10. However, you should still have function stubs for each in your Lab 9 design/solutions.
Form alternatives of the word by generating all possible single-letter substitutions. Details will be provided in Lab 10.
Form alternatives of the word by inserting single-letters in the word. Details will be provided in Lab 10.
Lab 10 will address this problem in detail. For Lab 9, once all alternative spellings of a word are considered, your program should print to the screen all valid alternative spellings and their term frequencies values.
Note that, for some misspelled words, none of our generated alternatives will occur in the word-frequency list. For example, if you type in telefone, there are no alternatives which occur in the word list. In cases like this, simply tell the user "Not found".
Lists and strings will probably be needed in this program. See our documentation about using strings and lists as objects for guidance on advanced operations for strings and lists.
Be sure to re-examine in class examples for programs that use lists of lists as well as file input/output.
Also, your "suggesting alternatives" solution may need to use string operations that we learned early in the semester, including concatenation, substring (or slicing), and repetitions. Here are some examples:
s1 = "hello there" s2 = "goodbye" newstr = s1 + s2 # newstr is "hello theregoodbye" newstr = s1[3:10] # newtr is "l0 ther" newstr = s1*2 # newsr is "goodbyegoodbye"
Lastly, strings can be compared using all relational operators learned in class (i.e., <, <=, > > ==, !=). Strings are ordered lexographically, exactly like in the dictonary or phonebook. For example:
name1 = "Ameet" name2 = "Tia" if name1 < name2: print name1 else: print name2outputs the value Ameet since A comes before T in the alphabet. If the first letter of both strings is the same, we compare the second letter in each just as we would in the dictionary. For example, Ameet is greater than Alex. Python continues evaluating position-by-position until one string ends. In that case, the shorter word is less than the longer; for example, Amy is less than Amylia.
Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.