CS91R Lab 05: Spelling Correctors
Due Wednesday, March 26, before midnight
Goals
The goals for this lab assignment are:
-
Close read a programming pearl
-
Create a program for spelling correction
-
employ edit distance
-
consider novel words
-
consider context in spelling correction (e.g., bigrams)
-
Cloning your repository
Log into the CS91R-S25 github
organization for our class and find your git repository for Lab 05,
which will be of the format lab05-user1-user2
, where user1
and user2
are you and your partner’s usernames.
You can clone your repository using the following steps while connected to the CS lab machines:
# cd into your cs91r/labs sub-directory and clone your lab05 repo
$ cd ~/cs91r/labs
$ git clone git@github.swarthmore.edu:CS91R-S25/lab05-user1-user2.git
# change directory to list its contents
$ cd ~/cs91r/labs/lab05-user1-user2
# ls should list the following contents
$ ls
spell.py README.md ngrams.py uspell.py
Answers to written questions should be included in the README.md file included in your repository.
Spelling Correction à la Norvig
First, read Norvig’s
article "How to Write a Spelling Corrector".
You can find a copy of this code in the spell.py
file in your repository.
Develop a theory of the program for his spelling corrector.
You can run python3 -i spell.py to test out individual functions
after the code has finished running. After running the program a few times,
you may find that commenting out the unit tests allows you to explore more
quickly.
|
-
What does the program do?
-
Which function accomplishes the spelling correction?
-
What are a few misspelled words his program correctly fix?
-
What are a few misspelled words his program incorrectly fixes?
-
-
How does the program work?
-
How many functions are there?
-
Run each function independently, make sure you understand what each is supposed to do.
-
Which functions are responsible for the selection mechanism, candidate model, language model, and the error model parts of spelling correction?
-
-
Why did he design the program the way he did:
-
Why did he use a
Counter
, aset
,list comprehensions
andgenerator expressions
? -
Rewrite
known
to use afor
loop andset.add()
rather than a generator expression. -
Rewrite
edits2
to use afor
loop and have the function return a list. -
Rewrite the
candidates
function in a style you are more familiar with.
-
-
What would you do differently if the correction task changed slightly (e.g., rely on a different text corpus or fix command lines)?
Next, further reflect on his approach:
-
In the clab, Norvig uses unit tests in this style: unit tests. That is one way to write unit tests. Come up with your own evaluation plan. How do you know if your spelling corrector works well?
-
Write a few sentences about some possible improvements you could make.
Norvig’s other Spell Corrector
In clab this week, we read the first section of Norvig’s "Natural Language Corpus Data" article. In the final section of that article, he outlines another approach to spelling correction.
For simplicity of discussion, let’s call the spell corrector in the previous section "web version" and the corrector in the PDF the "pdf version".
The "pdf version" uses a few features of python that have
changed slightly since the article was written. We have provided you
with a version that closely matches the original code but updated for the
latest version of python. You can find this file, ngrams.py
in your
repository.
The "pdf version" also uses a different data set (count_1w.txt
) than
the "web version" (big.txt
). We have provided you with (count_1w-big.txt
)
which you can use in place of count_1w.txt
if you’d like to test the
behavior of the "pdf version" on the same dataset.
-
How does the set of words returned by
edits
in the "pdf version" compare to theedits2
function in the "web version"? -
What does
edits
from the "pdf version" return that theedits2
function doesn’t? -
What does the
Pedit
function in the "pdf version" do? -
How is
Pedit
used bycorrect
in the "pdf version"? -
OPTIONAL: How could you modify
Pedit
in the "pdf version" to give you the same behavior as the "web version"?
Spelling Correction à la You
Finally, implement your own spelling corrector (uspell.py
); some possible approaches:
-
improve upon the first Norvig spell corrector; some ideas:
-
Use a different text corpus.
-
Improve upon his
words
function using the word tokenization from our earlier labs. -
Write an
edits3
function that considers edit distance of three. -
Use bigram or trigram context.
-
-
Combine ideas from Norvig’s second spell corrector into the first one.
-
incorporate some ideas from Jurasky and Martin
-
come up with your own approach
How to turn in your solutions
Edit the README.md
file that we provided to answer each of the questions
and add any discussion you think we’d like to know about.
Be sure to commit and push all changes to your python files.
If you think it would be helpful, use asciinema
to record
a terminal session and include it your README.md
.