CS91R Lab 05: Spelling Correctors

Due Wednesday, March 26, before midnight

Goals

The goals for this lab assignment are:

  • Close read a programming pearl

  • Create a program for spelling correction

    • employ edit distance

    • consider novel words

    • consider context in spelling correction (e.g., bigrams)

Cloning your repository

Log into the CS91R-S25 github organization for our class and find your git repository for Lab 05, which will be of the format lab05-user1-user2, where user1 and user2 are you and your partner’s usernames.

You can clone your repository using the following steps while connected to the CS lab machines:

# cd into your cs91r/labs sub-directory and clone your lab05 repo
$ cd ~/cs91r/labs
$ git clone git@github.swarthmore.edu:CS91R-S25/lab05-user1-user2.git

# change directory to list its contents
$ cd ~/cs91r/labs/lab05-user1-user2

# ls should list the following contents
$ ls
spell.py  README.md ngrams.py uspell.py

Answers to written questions should be included in the README.md file included in your repository.


Spelling Correction à la Norvig

First, read Norvig’s article "How to Write a Spelling Corrector". You can find a copy of this code in the spell.py file in your repository. Develop a theory of the program for his spelling corrector.

You can run python3 -i spell.py to test out individual functions after the code has finished running. After running the program a few times, you may find that commenting out the unit tests allows you to explore more quickly.
  1. What does the program do?

    • Which function accomplishes the spelling correction?

    • What are a few misspelled words his program correctly fix?

    • What are a few misspelled words his program incorrectly fixes?

  2. How does the program work?

    • How many functions are there?

    • Run each function independently, make sure you understand what each is supposed to do.

    • Which functions are responsible for the selection mechanism, candidate model, language model, and the error model parts of spelling correction?

  3. Why did he design the program the way he did:

    • Why did he use a Counter, a set, list comprehensions and generator expressions?

    • Rewrite known to use a for loop and set.add() rather than a generator expression.

    • Rewrite edits2 to use a for loop and have the function return a list.

    • Rewrite the candidates function in a style you are more familiar with.

  4. What would you do differently if the correction task changed slightly (e.g., rely on a different text corpus or fix command lines)?

Next, further reflect on his approach:

  1. In the clab, Norvig uses unit tests in this style: unit tests. That is one way to write unit tests. Come up with your own evaluation plan. How do you know if your spelling corrector works well?

  2. Write a few sentences about some possible improvements you could make.

Norvig’s other Spell Corrector

In clab this week, we read the first section of Norvig’s "Natural Language Corpus Data" article. In the final section of that article, he outlines another approach to spelling correction.

For simplicity of discussion, let’s call the spell corrector in the previous section "web version" and the corrector in the PDF the "pdf version".

The "pdf version" uses a few features of python that have changed slightly since the article was written. We have provided you with a version that closely matches the original code but updated for the latest version of python. You can find this file, ngrams.py in your repository.

The "pdf version" also uses a different data set (count_1w.txt) than the "web version" (big.txt). We have provided you with (count_1w-big.txt) which you can use in place of count_1w.txt if you’d like to test the behavior of the "pdf version" on the same dataset.

  • How does the set of words returned by edits in the "pdf version" compare to the edits2 function in the "web version"?

  • What does edits from the "pdf version" return that the edits2 function doesn’t?

  • What does the Pedit function in the "pdf version" do?

  • How is Pedit used by correct in the "pdf version"?

  • OPTIONAL: How could you modify Pedit in the "pdf version" to give you the same behavior as the "web version"?

Spelling Correction à la You

Finally, implement your own spelling corrector (uspell.py); some possible approaches:

  • improve upon the first Norvig spell corrector; some ideas:

    • Use a different text corpus.

    • Improve upon his words function using the word tokenization from our earlier labs.

    • Write an edits3 function that considers edit distance of three.

    • Use bigram or trigram context.

  • Combine ideas from Norvig’s second spell corrector into the first one.

  • incorporate some ideas from Jurasky and Martin

  • come up with your own approach

How to turn in your solutions

Edit the README.md file that we provided to answer each of the questions and add any discussion you think we’d like to know about.

Be sure to commit and push all changes to your python files.

If you think it would be helpful, use asciinema to record a terminal session and include it your README.md.