Note about a bug fix
IMPORTANT NOTE: The sample output in interaction.py is no longer valid. Please see the bottom of this page for what I hope is the correct output!
Introduction
In this week’s lab, you will build on last week’s edit distance finding code to implement a spell-checker that a) generates suggested spelling corrections and b) automatically fixes spelling errors.
Answers to written questions should be added to a file called Writeup.md
in your repository.
EditDistanceFinder
This week’s starter code includes an EditDistance.py file that is the same as the one you wrote last week, but with a few additions:
- There’s a prob method that returns the log likelihood of one string being converted to another.
- Laplace smoothing has been added.
- An argparse interface has been added.
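As a point of reference, here is a minimal, generic sketch of add-alpha (Laplace) smoothing over a table of counts. It is not the EditDistance.py implementation (the function name and arguments below are illustrative only); it just shows the idea the first question below asks about: unseen events still receive a small, finite probability.

import math

def laplace_log_prob(count, total, vocab_size, alpha=1.0):
    # Generic add-alpha (Laplace) smoothed log probability.
    # count: observations of this outcome; total: all observations in this context;
    # vocab_size: number of possible outcomes; alpha=1 gives classic add-one smoothing.
    return math.log((count + alpha) / (total + alpha * vocab_size))

print(laplace_log_prob(0, 100, 26))   # unseen outcome: finite (about -4.8), not -inf
print(laplace_log_prob(5, 100, 26))   # observed outcome: less negative (about -3.0)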
Questions
- In Writeup.md, explain how Laplace smoothing works in general and how it is implemented in the EditDistance.py file. Why is Laplace smoothing needed in order to make the prob method work? In other words, the prob method wouldn’t work properly without smoothing – why?
- Describe the command-line interface for EditDistance.py. What command should you run to generate a model from /data/spelling/wikipedia_misspellings.txt and save it to ed.pkl?
LanguageModel
This lab’s starter code also includes a file called LanguageModel.py
that defines an n-gram language model. Read through the code for the LanguageModel
class, then answer the following questions:
- What n-gram orders are supported by the given LanguageModel class?
- How does the given LanguageModel class deal with the problem of 0-counts?
- What behavior does the __contains__() method of the LanguageModel class provide?
- Spacy uses a lot of memory if it tries to load a very large document. To avoid that problem, LanguageModel limits the amount of text that’s processed at once with the get_chunks method. Explain how that method works.
- Describe the command-line interface for LanguageModel.py. What command should you run to generate a model from /data/gutenberg/*.txt and save it to lm.pkl if you want an alpha value of 0.1 and a vocabulary size of 40000?
The language model takes a bit of time to train – on the order of 20 minutes or so depending on what machine you use. You may want to start training the LanguageModel in another window before you continue reading the lab writeup.
Required Part (Everyone Does the Same Thing)
Your job for this week will be to write a SpellChecker
that uses the EditDistanceFinder
class as the error (channel) model and the provided LanguageModel
as the language model to implement spelling correction.
You will be using spacy
again in this lab, making use of the built-in part-of-speech tagger and parser. To initialize spacy
for the lab, use the line below. You will probably want the nlp
variable to be an instance variable in your class.
nlp = spacy.load("en", pipeline=["tagger", "parser"])
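As a quick illustration of the tokenization and sentence segmentation you’ll need in check_text, autocorrect_line, and suggest_text, here is a sketch using the nlp object above (doc.sents is available because the parser is in the pipeline):

doc = nlp("this is won sentence. this is anether.")
sentences = [[token.text for token in sent] for sent in doc.sents]
# sentences is a list of token lists, roughly:
# [['this', 'is', 'won', 'sentence', '.'], ['this', 'is', 'anether', '.']]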
Your class should have the following member functions (a rough skeleton of the class appears after this list):
- __init__(channel_model=None, language_model=None, max_distance=2), which should take an EditDistanceFinder, a LanguageModel, and an int as input, and should initialize your SpellChecker.
- load_channel_model(fp), which should take a file pointer as input, and should initialize the SpellChecker object’s channel_model data member to a default EditDistanceFinder and then load the stored edit distance model (e.g. ed.pkl) from fp into that data member.
- load_language_model(fp), which should take a file pointer as input, and should initialize the SpellChecker object’s language_model data member to a default LanguageModel and then load the stored language model (e.g. lm.pkl) from fp into that data member.
- bigram_score(prev_word, focus_word, next_word), which should take three words as input (a “previous” word, a “focus” word, and a “next” word), and should return the average of the bigram scores of the bigrams (prev_word, focus_word) and (focus_word, next_word) according to the LanguageModel.
- unigram_score(word), which should take a word as input, and should return the unigram probability of the word according to the LanguageModel.
- cm_score(error_word, corrected_word) (“channel model score”), which should take an error word and a possible correction as input, and should return the EditDistanceFinder’s probability of the corrected word having been transformed into the error word. Be careful about the order of the arguments that you pass to the EditDistanceFinder: because of how you’ve trained the probability model, P(error_word|corrected_word) may not equal P(corrected_word|error_word).
- inserts(word), which should take a word as input and return a list of potential words that are within one insert of word.
- deletes(word), which should take a word as input and return a list of potential words that are within one deletion of word.
- substitutions(word), which should take a word as input and return a list of potential words that are within one substitution of word.
- generate_candidates(word), which should take a word as input and return a list of candidate words (that are in the LanguageModel) that are within self.max_distance edits of word, found by calling inserts, deletes, and substitutions. To find all words that are edit distance 1 away, just call inserts, deletes, and substitutions and concatenate those results together. To generate candidate words that are distance 2 away, first generate all the candidates that are 1 away, then generate all the 1-edit-distance-away candidates for each of those. Continue in this fashion for distance 3, etc.
- check_sentence(sentence, fallback=False), which should take a list of words as input and return a list of lists. Each sublist in the return value corresponds to a single word in the input sentence. Words in the sentence that are in the language model will be represented as a sublist containing just that word. Words in the sentence that are not in the language model will be represented as a sublist of possible corrections: the result of calling generate_candidates on that word, with the candidates sorted by the combination of LanguageModel score and EditDistanceFinder score. If no candidates are found and fallback is True, then non-words should be represented by a sublist with just the original word (the same representation as correctly-spelled words).
- check_text(text, fallback=False), which should take a string as input, tokenize and sentence-segment it with spacy, and then return the concatenation of the results of calling check_sentence on all of the resulting sentence objects.
- autocorrect_sentence(sentence), which should take a tokenized sentence (as a list of words) as input, call check_sentence on the sentence with fallback=True, and return a new list of tokens where each non-word has been replaced by its most likely spelling correction.
- autocorrect_line(line), which should take a string as input, tokenize and sentence-segment it with spacy, and then return the concatenation of the results of calling autocorrect_sentence on all of the resulting sentence objects.
- suggest_sentence(sentence, max_suggestions), which should take a tokenized sentence (as a list of words) as input, call check_sentence on the sentence, and return a new list where:
  - Real words are just strings in the list
  - Non-words are lists of up to max_suggestions suggested spellings, ordered by your model’s preference for them.
- suggest_text(text, max_suggestions), which should take a string as input, tokenize and sentence-segment it with spacy, and then return the concatenation of the results of calling suggest_sentence on all of the resulting sentence objects.
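Here is a rough skeleton of that interface, offered as a sketch only. The module names in the imports (EditDistance, LanguageModel) and the methods called on the channel and language models (the load, prob, and bigram/unigram probability methods) are assumptions for illustration; check the actual starter code for the real names.

import spacy

from EditDistance import EditDistanceFinder   # assumed module/class names
from LanguageModel import LanguageModel


class SpellChecker:
    def __init__(self, channel_model=None, language_model=None, max_distance=2):
        self.channel_model = channel_model      # EditDistanceFinder (error model)
        self.language_model = language_model    # n-gram LanguageModel
        self.max_distance = max_distance
        self.nlp = spacy.load("en", pipeline=["tagger", "parser"])

    def load_channel_model(self, fp):
        self.channel_model = EditDistanceFinder()
        self.channel_model.load(fp)             # hypothetical load method

    def load_language_model(self, fp):
        self.language_model = LanguageModel()
        self.language_model.load(fp)            # hypothetical load method

    def bigram_score(self, prev_word, focus_word, next_word):
        # Average of the two bigram log probabilities around the focus word.
        left = self.language_model.bigram_prob(prev_word, focus_word)    # hypothetical name
        right = self.language_model.bigram_prob(focus_word, next_word)   # hypothetical name
        return (left + right) / 2

    def unigram_score(self, word):
        return self.language_model.unigram_prob(word)                    # hypothetical name

    def cm_score(self, error_word, corrected_word):
        # Check which argument order EditDistance.py's prob expects:
        # we want P(error_word | corrected_word).
        return self.channel_model.prob(error_word, corrected_word)

    # inserts, deletes, substitutions, generate_candidates, check_sentence,
    # check_text, autocorrect_sentence, autocorrect_line, suggest_sentence,
    # and suggest_text go here (see the hints and sketches below).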
Hints and additional information about some of these functions follow:
- For checking bigram probabilities of the first or last word in a sentence, you’ll want to make use of ‘<s>’ (start-of-sentence token) and ‘</s>’ (end-of-sentence token); the language model is trained to know what they are.
- For ranking suggestions, I suggest (see the sketch after these hints):
  - Taking an evenly-weighted linear combination (.5, .5) of the unigram and bigram probabilities for the language model
  - Evenly weighting the language model and channel model (since we’re in log space, that means just taking their sum).
- When you are generating candidate corrections, you may find the constant string.ascii_lowercase helpful.
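A minimal, standalone sketch of these two hints follows; it is not tied to the starter code’s APIs, and the real inserts/deletes/substitutions methods should additionally keep only strings that are in the LanguageModel.

import string

def one_edit_strings(word):
    # All strings within one insertion, deletion, or substitution of word.
    letters = string.ascii_lowercase
    inserts = [word[:i] + c + word[i:] for i in range(len(word) + 1) for c in letters]
    deletes = [word[:i] + word[i + 1:] for i in range(len(word))]
    subs = [word[:i] + c + word[i + 1:] for i in range(len(word)) for c in letters if c != word[i]]
    return inserts + deletes + subs

def combined_score(unigram_logprob, bigram_logprob, channel_logprob):
    # 50/50 mix of the two language model scores, then add the channel model
    # score (adding log probabilities corresponds to multiplying probabilities).
    language = 0.5 * unigram_logprob + 0.5 * bigram_logprob
    return language + channel_logprob

print(len(one_edit_strings("yb")))          # 78 inserts + 2 deletes + 50 substitutions = 130
print(combined_score(-8.0, -6.0, -10.0))    # -17.0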
Sample Interaction
The file interaction.py gives a sample interaction with the SpellChecker class. If you call interaction.py from the command line with the language and edit distance models created above, it should use them to check (and optionally autocorrect) sentences.
Evaluation
In /data/spelling/ there are two files:
- reddit_comments.txt, which is an aggressively-filtered set of comments from Reddit, based on this Kaggle set
- reddit_ispell.txt, which is the output we got by autocorrecting the comment file with ispell
For a variety of reasons, labeled corpora of spelling errors are hard to come by. You can perform a noisy evaluation of your system by comparing it to the ispell output.
The file autocorrect.py
will use your spell checker, language model, and edit distance class to auto-correct every sentence in every line that is passed to it. Use your SpellChecker to autocorrect the reddit_comments.txt file, then use the diff tool (or the small Python sketch after these questions) to compare your output to reddit_ispell.txt. Based on a hand analysis of a reasonable subset of differences, answer the following questions:
- How often did your spell checker do a better job of correcting than ispell? Conversely, how often did ispell do a better job than your spell checker?
- Can you characterize the types of errors your spell checker tended to do best at, and the types of errors ispell tended to do best at?
- Comment on anything else you notice that is interesting about spell checking – either for your model or for ispell.
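If you’d rather do that hand analysis in Python than read raw diff output, here is a minimal sketch; the file name my_output.txt is a hypothetical name for your autocorrected copy of reddit_comments.txt, and the sketch assumes the two outputs line up line by line.

with open("my_output.txt") as mine, open("/data/spelling/reddit_ispell.txt") as ispell:
    my_lines = mine.readlines()
    ispell_lines = ispell.readlines()

# Keep only the lines where your spell checker and ispell disagree.
differences = [(m, i) for m, i in zip(my_lines, ispell_lines) if m != i]
print(len(differences), "lines differ")
for mine_line, ispell_line in differences[:20]:   # a small subset for hand analysis
    print("yours: ", mine_line.strip())
    print("ispell:", ispell_line.strip())
    print()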
Optional Part (Pick One or More)
Once you have your spell checker working to correct non-words, you should add one of the following:
Phonetic Suggestions
Expand your generate_candidates to also suggest words whose pronunciation is within an edit distance of self.max_distance of each error word. Your solution should use the metaphone code that is included with the lab; a sketch of one possible approach follows the list below. In Writeup.md, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
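One possible shape for the phonetic extension, as a sketch only: metaphone and edit_distance below are stand-ins for the metaphone code shipped with the lab and for a distance function built on your EditDistanceFinder, and vocabulary stands in for the words in your LanguageModel.

def phonetic_candidates(word, vocabulary, metaphone, edit_distance, max_distance=2):
    # Vocabulary words whose pronunciation (metaphone code) is within
    # max_distance edits of the error word's pronunciation.
    target = metaphone(word)
    return [w for w in vocabulary
            if edit_distance(metaphone(w), target) <= max_distance]

# These phonetic candidates would then be merged with the character-level
# candidates from generate_candidates before scoring.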
Real-Word Correction
Add a new member function to your SpellChecker class called check_words() that generates suggested corrections for real-word spelling errors. Your check_spelling() function should call check_words after check_sentence_words, so functions like autocorrect_sentence and suggest_sentence should work off of the combination of the two.
You should feel free to use the simplifying assumption of at most one real-word spelling error in a sentence if it makes your task easier.
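A sketch of one way check_words could work under that one-error-per-sentence simplification; generate_candidates is the method described earlier, and score_sentence is a hypothetical helper that scores a whole tokenized sentence with your language and channel models.

def check_words(sentence, generate_candidates, score_sentence):
    # For each position, try swapping in each nearby candidate and keep any
    # replacement that scores better than the original sentence.
    suggestions = [[word] for word in sentence]            # default: keep the word
    base_score = score_sentence(sentence)
    for i, word in enumerate(sentence):
        better = []
        for candidate in generate_candidates(word):
            trial = sentence[:i] + [candidate] + sentence[i + 1:]
            if score_sentence(trial) > base_score:
                better.append(candidate)
        if better:
            suggestions[i] = better + [word]               # original word stays as a fallback
    return suggestions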
In Writeup.md
, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
Transpositions
Extend your model to handle character transpositions, where two characters are “swapped,” resulting in spelling errors like “teh.”
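The candidate-generation half of this might look like the sketch below (written in the same style as inserts/deletes/substitutions; filtering against the LanguageModel vocabulary, and extending the EditDistanceFinder so it can assign probabilities to transpositions, are left out):

def transpositions(word):
    # All strings formed by swapping one pair of adjacent characters.
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)
            if word[i] != word[i + 1]]

print(transpositions("teh"))   # ['eth', 'the']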
In Writeup.md
, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
Other Extensions
With instructor approval, you are encouraged to come up with other ways to expand your spell checker. Some ideas:
- Add one or more features that would make your spell checker work with another language.
- Change the error model or language model underlying your spell checking system. For example, how could vector semantics be included?
- Explore the bias inherent in spell checkers. Find and report on research related to whose language is represented in spell checkers, and how the way spell checkers are implemented might unequally impact different people.
- Add a way for your system to learn when new words should be added to your dictionary.
Some good places to start looking for relevant research:
In Writeup.md
, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
Bug fix
The original interaction.py
file contained incorrect sample output. Below is the correct sample output.
>>> print(s.channel_model.prob("hello", "hello"))
-0.6520393913851943
>>> print(s.channel_model.prob("hellp", "hello"))
-10.655417526736118
>>> print(s.channel_model.prob("hllp", "hello"))
-12.889127454847866
>>> print(s.check_text("they did not yb any menas"))
[[['they'], ['did'], ['not'], ['be', 'by', 'my', 'you', 'i', 'in', 'ye', 'b', 'y', 'rib', 'yet',
'ay', 'if', 'job', 'ob', 'yo', 'jib', 'of', 'ly', 'on', 'ab', 'o', 'rob', 'orb', 'jub', 'it',
'ty', 'bo', 'is', 'a', 'yea', 'mob', 'cab', 'web', 'sob', 'to', 'up', 'yon', 'yew', 'yes', 'cob',
'an', 'obi', 'ebb', 'nob', 'do', 'iv', 'alb', 'bab', 'eye', 'tob', 'yaw', 'v', 'abi', 'mab', 'at',
'he', 'go', 'as', 'x', 'rub', 'gob', 'lye', 'sub', 'or', 'ix', 'aye', 'd', 'lbs', 'cub', 'pub',
'tub', 'z', 'so', 'dab', 'bob', 'we', 'l', 'dye', 'k', 'pmb', 'n', 'xv', 'ho', 'hye', 'il', 'yer',
'wo', 'yee', 'ex', 'bye', 'yis', 'vp', 'ox', 'rye', 'oh', 'w', 'io', 'en', 'm', 'ed', 'h', 'me',
'am', 'xx', 'el', 'us', 'no', 'fye', 'eh', 't', 'qu', 'ii', 'r', 'e', 'c', 'ah', 'ha', 's', 'lo',
'al', 'uz', 'em', 'ad', 'ao', 'ow', 'og', 'vs', 'er', 'ir', 'et', 'mr', 'un', 'hm', 'th', 'ji',
'ai', 'xi', 'je', 'hi', 'ze', 'co', 'wm', 'ee', 'au', 'ou', 'ar', 'ca', 'um', 'ro', 'vi', 'de',
'dr', 'fa', 'va', 'sh', 'la', 'nt', 'tm', 'ma', 'gr', 'ur', 'di', 're', 'st', 'tu', 'da', 'ms',
'le', 'pi', 'si', 'se'], ['any'], ['men', 'means', 'mens', 'meals', 'mes', 'mans', 'meanes',
'meats', 'meat', 'menials', 'omens', 'mean', 'mene', 'mines', 'enos', 'menace', 'mend', 'meads',
'zenas', 'kenaz', 'menan', 'seas', 'ment', 'jonas', 'mess', 'mead', 'medes', 'medals', 'enan',
'monks', 'minus', 'ends', 'mews', 'fens', 'minds', 'dens', 'meal', 'midas', 'eras', 'amends',
'pens', 'hena', 'hens', 'tens', 'vedas', 'meres', 'mental', 'lens', 'peas', 'lena', 'meah',
'medad', 'venus', 'arenas', 'aeneas', 'metals', 'enam', 'medan', 'demas', 'teas', 'zenan',
'kenan', 'meets', 'sends', 'merab', 'texas', 'tents', 'bends', 'melts', 'metal', 'tends', 'penal',
'dents', 'lends', 'cents', 'rents', 'annas']]]
>>> print(s.autocorrect_line("they did not yb any menas"))
[['they'], ['did'], ['not'], ['be'], ['any'], ['men']]
>>> print(s.suggest_text("they did not yb any menas", max_suggestions=2))
[['they'], ['did'], ['not'], ['be', 'by'], ['any'], ['men', 'means']]
In addition, you may find this to be helpful:
>>> text = """This should take a list of words as input and return a list of lists.
Each sublist in the return value corresponds to a single word in the input
sentence. Words in the sentence that are in the language model will be represented
as a sublist containing just that word. Words in the sentence that are not in the
language model will be represented as a sublist of possible corrections. This sublist
of possible corrections should be, for each word in the sentence not in the language
model, the result of calling generate_candidates with each of the candidates in the
list and then sorting these candidates by the combination of LanguageModel score and
EditDistance score. If no candidates are found and fallback is True, then non-words
should be represented by a sublist with just the original word (the same
representation as correctly-spelled words).""".lower()
>>> result = sp.autocorrect_line(text)
>>> print(' '.join([x[0] for x in result]))
this should take a list of words as put and return a list of lists . each subtlest in the
return value corresponds to a single word in the put sentence . words in the sentence that
are in the language model will be represented as a subtlest containing just that word .
words in the sentence that are not in the language model will be represented as a subtlest
of possible corrections . this subtlest of possible corrections should be , for each word
in the sentence not in the language model , the result of calling generate_candidates with
each of the candidates in the list and then sorting these candidates by the combination
of languagemodel score and editdistance score . if no candidates are found and fallacy is
true , then non - words should be represented by a subtlest with just the original word
( the same representation as correctly - spilled words ) .