Run update21, to create the cs21/labs/08 directory (and some additional files). Then cd into your cs21/labs/08 directory and create the python programs for lab 8 in this directory
Here's an example:
$ cat letter.txt Dear Sam, How are you? I'm fine, but I misss seeing you more ofetn. Hope alk is well with your family, espceially your brother. I'll be stoping by next Saturddy when I get home from Swarthmore. Love, Alice $ python spellchecker.py Enter name of file to spell check: letter.txt Unknown words and suggestions: misss : miss ofetn : often alk : all espceially : especially stoping : storing Saturddy : Saturday Swarthmore : Swarthmore
For the dictionary of valid words, use the file /usr/share/dict/words that we used in class. Read this file to generate a list of all valid words. Don't worry about removing proper nouns or contractions. Just strip any whitespace at the begin or end.
With the list of words from your text file and the list of dictionary words, generate a list of unknown and possibly misspelled words. For each word in the original text file, you should search for the word using binary search in the list of valid words. Once you have identified the misspelled words, you can move on to the next part of suggesting corrections.
Let's begin by looking at how we generate alternatives for misspelled words. We'll use "seplling" as the running example. This word is not in the dictionary, so we need to propose alternatives. Our alternatives will be generated using three different methods: deletion, transposition, and substitution. Each is explained below:
We form alternatives of the word by generating all possible single-letter substitutions. For each letter in the original word, we substitute a letter from the alphabet (abcd...xyz) to replace it. There are too many possible substitutions to list them all. Using our example, "seplling", we would generate:
aeplling | saplling | sealing | sepaling | seplaing | sepllang | seplliag | sepllina |
beplling | sbplling | sebling | sepbling | seplbing | sepllbng | sepllibg | sepllinb |
ceplling | scplling | secling | sepcling | seplcing | sepllcng | sepllicg | sepllinc |
deplling | sdplling | sedling | sepdling | seplding | seplldng | sepllidg | sepllind |
eeplling | seplling | seeling | sepeling | sepleing | seplleng | sepllieg | seplline |
... | ... | ... | ... | ... | ... | ... | ... |
yeplling | syplling | seyling | sepyling | seplying | sepllyng | seplliyg | seplliny |
zeplling | szplling | sezling | sepzling | seplzing | sepllzng | sepllizg | sepllinz |
You program should contain one or more functions that generate possible alternatives for a given word and check which of the alternatives are valid words.
For the "seplling" example, we find that only 2 of the alternatives are legal words in our dictionary: selling, and spelling. Your program should suggest only one of these two alternatives, by selecting which word is more frequently used.
To help answer this question, we have provided you with a new dictionary (/usr/local/doc/wordcts.txt) which contains each of the words from /usr/share/dict/words augmented with a frequency with which this word occurred in a very large document collection. Here are the first 10 lines of that file:
a 7841087012 aardvark 93030 aardvarks 11762 abaci 5391 aback 272920 abacus 149548 abacuses 4473 abaft 19188 abalone 237386 abalones 6544Very frequently used words such as "a" occurred very often (7,841,087,012 times). Common words such as "abacus" occurred more frequently (149,548 times) than uncommon words such as "abaci" (5,391 times). We will use the frequency with which words occurred to rank the alternatives and we will choose the most frequently occurring word as the proposed alternative.
In our running example, we want to choose the alternative of "seplling" that is most frequent. Consulting wordcts.txt, we find that "selling" is the most frequent:
selling 29092601 spelling 5373561Therefore our program should choose selling as the proposed alternative.
To implement this ranking algorithm, you will first need to read the file /usr/local/doc/wordcts.txt into a list of [word, frequency] lists:
[['a', 7841087012], ['aardvark', 93030], ['aardvarks', 11762], ...]You will then need to search this list of [word, frequency] pairs for each of your proposed alternatives. This means for "seplling" you will need to perform two searches in this list. You will want to use binary search since the list of words is in sorted order. You will then need to choose the word that has the highest frequency of all the proposed alternatives.
Note that each item in this list is a [word, frequency] pair. When searching for selling, you cannot directly compare the word "selling" to the list ["selling", 29092601]. You will need to write a separate binary search function that is very similar to the search used in the first part of this lab for searching /usr/share/dict/words. Since we are only interested in the frequency of the word, you can return the frequency, e.g., 29092601 instead of the position in this search, and return -1 if the word is not found.
Combine this new binary search code with the rest of your program so that for each misspelled word, your program suggests the best alternative word. If there are two or more alternative words with the same highest frequency, you can return any of the matching valid words with this frequency
Note that for some misspelled words, there will be no alternatives that occur in the dictionary. For example, if you type in "Swarthmore", there are no alternatives which occur in the word list. In cases like this, simply return the original word, "Swarthmore".
from string import punctuation punx = punctuation print punx '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
These are not part of the lab assignment, but just "fun" extra things you can add to your program. Please don't try these until you have completed the above assignment. And they are just for fun -- no bonus points or extra credit.
If you attempt the extra challenges, please save your changes in spellcheck_extra.py instead of spellcheck.py
$ cp spellcheck.py spellcheck_extra.py $ vim spellcheck_extra.py
Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.