Note about a bug fix
IMPORTANT NOTE: The sample output in interaction.py is no longer valid. Please see the bottom of this page for what I hope is the correct output!
Introduction
In this week’s lab, you will build on last week’s edit distance finding code to implement a spell-checker that a) generates suggested spelling corrections and b) automatically fixes spelling errors.
Answers to written questions should be added to a file called Writeup.md
in your repository.
EditDistanceFinder
This week’s starter code includes an EditDistance.py file that is the same as the one you wrote last week, but with a few additions:
- There’s a prob method that returns the log likelihood of one string being converted to another.
- Laplace smoothing has been added.
- An argparse interface has been added.
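As a point of reference, here is a minimal, generic sketch of add-alpha (Laplace) smoothing over a table of counts. It is not the EditDistance.py implementation (the function name and arguments below are illustrative only); it just shows the idea the first question below asks about: unseen events still receive a small, finite probability.

import math

def laplace_log_prob(count, total, vocab_size, alpha=1.0):
    # Generic add-alpha (Laplace) smoothed log probability.
    # count: observations of this outcome; total: all observations in this context;
    # vocab_size: number of possible outcomes; alpha=1 gives classic add-one smoothing.
    return math.log((count + alpha) / (total + alpha * vocab_size))

print(laplace_log_prob(0, 100, 26))   # unseen outcome: finite (about -4.8), not -inf
print(laplace_log_prob(5, 100, 26))   # observed outcome: less negative (about -3.0)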
Questions
- In Writeup.md, explain how Laplace smoothing works in general and how it is implemented in the EditDistance.py file. Why is Laplace smoothing needed in order to make the prob method work? In other words, the prob method wouldn’t work properly without smoothing – why?
- Describe the command-line interface for EditDistance.py. What command should you run to generate a model from /data/spelling/wikipedia_misspellings.txt and save it to ed.pkl?
LanguageModel
This lab’s starter code also includes a file called LanguageModel.py
that defines an n-gram language model. Read through the code for the LanguageModel
class, then answer the following questions:
- What n-gram orders are supported by the given LanguageModel class?
- How does the given LanguageModel class deal with the problem of 0-counts?
- What behavior does the __contains__() method of the LanguageModel class provide?
- Spacy uses a lot of memory if it tries to load a very large document. To avoid that problem, LanguageModel limits the amount of text that’s processed at once with the get_chunks method. Explain how that method works.
- Describe the command-line interface for LanguageModel.py. What command should you run to generate a model from /data/gutenberg/*.txt and save it to lm.pkl if you want an alpha value of 0.1 and a vocabulary size of 40000?
The language model takes a bit of time to train – on the order of 20 minutes or so depending on what machine you use. You may want to start training the LanguageModel in another window before you continue reading the lab writeup.
Required Part (Everyone Does the Same Thing)
Your job for this week will be to write a SpellChecker
that uses the EditDistanceFinder
class as the error (channel) model and the provided LanguageModel
as the language model to implement spelling correction.
You will be using spacy
again in this lab, making use of the built-in part-of-speech tagger and parser. To initialize spacy
for the lab, use the line below. You will probably want the nlp
variable to be an instance variable in your class.
nlp = spacy.load("en", pipeline=["tagger", "parser"])
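As a quick illustration of the tokenization and sentence segmentation you’ll need in check_text, autocorrect_line, and suggest_text, here is a sketch using the nlp object above (doc.sents is available because the parser is in the pipeline):

doc = nlp("this is won sentence. this is anether.")
sentences = [[token.text for token in sent] for sent in doc.sents]
# sentences is a list of token lists, roughly:
# [['this', 'is', 'won', 'sentence', '.'], ['this', 'is', 'anether', '.']]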
Your class should have the following member functions (a rough skeleton of the class appears after this list):
- __init__(channel_model=None, language_model=None, max_distance=2), which should take an EditDistanceFinder, a LanguageModel, and an int as input, and should initialize your SpellChecker.
- load_channel_model(fp), which should take a file pointer as input, and should initialize the SpellChecker object’s channel_model data member to a default EditDistanceFinder and then load the stored edit distance model (e.g. ed.pkl) from fp into that data member.
- load_language_model(fp), which should take a file pointer as input, and should initialize the SpellChecker object’s language_model data member to a default LanguageModel and then load the stored language model (e.g. lm.pkl) from fp into that data member.
- bigram_score(prev_word, focus_word, next_word), which should take three words as input (a “previous” word, a “focus” word, and a “next” word), and should return the average of the bigram scores of the bigrams (prev_word, focus_word) and (focus_word, next_word) according to the LanguageModel.
- unigram_score(word), which should take a word as input, and should return the unigram probability of the word according to the LanguageModel.
- cm_score(error_word, corrected_word) (“channel model score”), which should take an error word and a possible correction as input, and should return the EditDistanceFinder’s probability of the corrected word having been transformed into the error word. Be careful about the order of the arguments that you pass to the EditDistanceFinder: because of how you’ve trained the probability model, P(error_word|corrected_word) may not equal P(corrected_word|error_word).
- inserts(word), which should take a word as input and return a list of potential words that are within one insert of word.
- deletes(word), which should take a word as input and return a list of potential words that are within one deletion of word.
- substitutions(word), which should take a word as input and return a list of potential words that are within one substitution of word.
- generate_candidates(word), which should take a word as input and return a list of candidate words (that are in the LanguageModel) that are within self.max_distance edits of word, found by calling inserts, deletes, and substitutions. To find all words that are edit distance 1 away, just call inserts, deletes, and substitutions and concatenate those results together. To generate candidate words that are distance 2 away, first generate all the candidates that are 1 away, then generate all the 1-edit-distance-away candidates for each of those. Continue in this fashion for distance 3, etc.
- check_sentence(sentence, fallback=False), which should take a list of words as input and return a list of lists. Each sublist in the return value corresponds to a single word in the input sentence. Words in the sentence that are in the language model will be represented as a sublist containing just that word. Words in the sentence that are not in the language model will be represented as a sublist of possible corrections: the result of calling generate_candidates on that word, with the candidates sorted by the combination of LanguageModel score and EditDistanceFinder score. If no candidates are found and fallback is True, then non-words should be represented by a sublist with just the original word (the same representation as correctly-spelled words).
- check_text(text, fallback=False), which should take a string as input, tokenize and sentence-segment it with spacy, and then return the concatenation of the results of calling check_sentence on all of the resulting sentence objects.
- autocorrect_sentence(sentence), which should take a tokenized sentence (as a list of words) as input, call check_sentence on the sentence with fallback=True, and return a new list of tokens where each non-word has been replaced by its most likely spelling correction.
- autocorrect_line(line), which should take a string as input, tokenize and sentence-segment it with spacy, and then return the concatenation of the results of calling autocorrect_sentence on all of the resulting sentence objects.
- suggest_sentence(sentence, max_suggestions), which should take a tokenized sentence (as a list of words) as input, call check_sentence on the sentence, and return a new list where:
  - Real words are just strings in the list
  - Non-words are lists of up to max_suggestions suggested spellings, ordered by your model’s preference for them.
- suggest_text(text, max_suggestions), which should take a string as input, tokenize and sentence-segment it with spacy, and then return the concatenation of the results of calling suggest_sentence on all of the resulting sentence objects.
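Here is a rough skeleton of that interface, offered as a sketch only. The module names in the imports (EditDistance, LanguageModel) and the methods called on the channel and language models (the load, prob, and bigram/unigram probability methods) are assumptions for illustration; check the actual starter code for the real names.

import spacy

from EditDistance import EditDistanceFinder   # assumed module/class names
from LanguageModel import LanguageModel


class SpellChecker:
    def __init__(self, channel_model=None, language_model=None, max_distance=2):
        self.channel_model = channel_model      # EditDistanceFinder (error model)
        self.language_model = language_model    # n-gram LanguageModel
        self.max_distance = max_distance
        self.nlp = spacy.load("en", pipeline=["tagger", "parser"])

    def load_channel_model(self, fp):
        self.channel_model = EditDistanceFinder()
        self.channel_model.load(fp)             # hypothetical load method

    def load_language_model(self, fp):
        self.language_model = LanguageModel()
        self.language_model.load(fp)            # hypothetical load method

    def bigram_score(self, prev_word, focus_word, next_word):
        # Average of the two bigram log probabilities around the focus word.
        left = self.language_model.bigram_prob(prev_word, focus_word)    # hypothetical name
        right = self.language_model.bigram_prob(focus_word, next_word)   # hypothetical name
        return (left + right) / 2

    def unigram_score(self, word):
        return self.language_model.unigram_prob(word)                    # hypothetical name

    def cm_score(self, error_word, corrected_word):
        # Check which argument order EditDistance.py's prob expects:
        # we want P(error_word | corrected_word).
        return self.channel_model.prob(error_word, corrected_word)

    # inserts, deletes, substitutions, generate_candidates, check_sentence,
    # check_text, autocorrect_sentence, autocorrect_line, suggest_sentence,
    # and suggest_text go here (see the hints and sketches below).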
Hints and additional information about some of these functions follow:
- For checking bigram probabilities of the first or last word in a sentence, you’ll want to make use of ‘<s>’ (start-of-sentence token) and ‘</s>’ (end-of-sentence token); the language model is trained to know what they are.
- For ranking suggestions, I suggest (see the sketch after these hints):
  - Taking an evenly-weighted linear combination (.5, .5) of the unigram and bigram probabilities for the language model
  - Evenly weighting the language model and channel model (since we’re in log space, that means just taking their sum).
- When you are generating candidate corrections, you may find the constant string.ascii_lowercase helpful.
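A minimal, standalone sketch of these two hints follows; it is not tied to the starter code’s APIs, and the real inserts/deletes/substitutions methods should additionally keep only strings that are in the LanguageModel.

import string

def one_edit_strings(word):
    # All strings within one insertion, deletion, or substitution of word.
    letters = string.ascii_lowercase
    inserts = [word[:i] + c + word[i:] for i in range(len(word) + 1) for c in letters]
    deletes = [word[:i] + word[i + 1:] for i in range(len(word))]
    subs = [word[:i] + c + word[i + 1:] for i in range(len(word)) for c in letters if c != word[i]]
    return inserts + deletes + subs

def combined_score(unigram_logprob, bigram_logprob, channel_logprob):
    # 50/50 mix of the two language model scores, then add the channel model
    # score (adding log probabilities corresponds to multiplying probabilities).
    language = 0.5 * unigram_logprob + 0.5 * bigram_logprob
    return language + channel_logprob

print(len(one_edit_strings("yb")))          # 78 inserts + 2 deletes + 50 substitutions = 130
print(combined_score(-8.0, -6.0, -10.0))    # -17.0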
Sample Interaction
The file interaction.py gives a sample interaction with the SpellChecker class. If you call interaction.py from the command line with the language and edit distance models created above, it should use them to check (and optionally autocorrect) sentences.
Evaluation
In /data/spelling/ there are two files:
- reddit_comments.txt, which is an aggressively-filtered set of comments from Reddit, based on this Kaggle set
- reddit_ispell.txt, which is the output we got by autocorrecting the comment file with ispell
For a variety of reasons, labeled corpora of spelling errors are hard to come by. You can perform a noisy evaluation of your system by comparing it to the ispell output.
The file autocorrect.py
will use your spell checker, language model, and edit distance class to auto-correct every sentence in every line that is passed to it. Use your SpellChecker to autocorrect the reddit_comments.txt file, then use the diff tool (or the small Python sketch after these questions) to compare your output to reddit_ispell.txt. Based on a hand analysis of a reasonable subset of differences, answer the following questions:
- How often did your spell checker do a better job of correcting than ispell? Conversely, how often did ispell do a better job than your spell checker?
- Can you characterize the types of errors your spell checker tended to do best at, and the types of errors ispell tended to do best at?
- Comment on anything else you notice that is interesting about spell checking – either for your model or for ispell.
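If you’d rather do that hand analysis in Python than read raw diff output, here is a minimal sketch; the file name my_output.txt is a hypothetical name for your autocorrected copy of reddit_comments.txt, and the sketch assumes the two outputs line up line by line.

with open("my_output.txt") as mine, open("/data/spelling/reddit_ispell.txt") as ispell:
    my_lines = mine.readlines()
    ispell_lines = ispell.readlines()

# Keep only the lines where your spell checker and ispell disagree.
differences = [(m, i) for m, i in zip(my_lines, ispell_lines) if m != i]
print(len(differences), "lines differ")
for mine_line, ispell_line in differences[:20]:   # a small subset for hand analysis
    print("yours: ", mine_line.strip())
    print("ispell:", ispell_line.strip())
    print()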
Optional Part (Pick One or More)
Once you have your spell checker working to correct non-words, you should add one of the following:
Phonetic Suggestions
Expand your generate_candidates to also suggest words whose pronunciation is within an edit distance of self.max_distance of each error word. Your solution should use the metaphone code that is included with the lab; a sketch of one possible approach follows the list below. In Writeup.md, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
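One possible shape for the phonetic extension, as a sketch only: metaphone and edit_distance below are stand-ins for the metaphone code shipped with the lab and for a distance function built on your EditDistanceFinder, and vocabulary stands in for the words in your LanguageModel.

def phonetic_candidates(word, vocabulary, metaphone, edit_distance, max_distance=2):
    # Vocabulary words whose pronunciation (metaphone code) is within
    # max_distance edits of the error word's pronunciation.
    target = metaphone(word)
    return [w for w in vocabulary
            if edit_distance(metaphone(w), target) <= max_distance]

# These phonetic candidates would then be merged with the character-level
# candidates from generate_candidates before scoring.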
Real-Word Correction
Add a new member function to your SpellChecker class called check_words() that generates suggested corrections for real-word spelling errors. Your check_spelling() function should call check_words after check_sentence_words, so functions like autocorrect_sentence and suggest_sentence should work off of the combination of the two.
You should feel free to use the simplifying assumption of at most one real-word spelling error in a sentence if it makes your task easier.
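A sketch of one way check_words could work under that one-error-per-sentence simplification; generate_candidates is the method described earlier, and score_sentence is a hypothetical helper that scores a whole tokenized sentence with your language and channel models.

def check_words(sentence, generate_candidates, score_sentence):
    # For each position, try swapping in each nearby candidate and keep any
    # replacement that scores better than the original sentence.
    suggestions = [[word] for word in sentence]            # default: keep the word
    base_score = score_sentence(sentence)
    for i, word in enumerate(sentence):
        better = []
        for candidate in generate_candidates(word):
            trial = sentence[:i] + [candidate] + sentence[i + 1:]
            if score_sentence(trial) > base_score:
                better.append(candidate)
        if better:
            suggestions[i] = better + [word]               # original word stays as a fallback
    return suggestions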
In Writeup.md
, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
Transpositions
Extend your model to handle character transpositions, where two characters are “swapped,” resulting in spelling errors like “teh.”
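The candidate-generation half of this might look like the sketch below (written in the same style as inserts/deletes/substitutions; filtering against the LanguageModel vocabulary, and extending the EditDistanceFinder so it can assign probabilities to transpositions, are left out):

def transpositions(word):
    # All strings formed by swapping one pair of adjacent characters.
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)
            if word[i] != word[i + 1]]

print(transpositions("teh"))   # ['eth', 'the']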
In Writeup.md
, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
Other Extensions
With instructor approval, you are encouraged to come up with other ways to expand your spell checker. Some ideas:
- Add one or more features that would make your spell checker work with another language.
- Change the error model or language model underlying your spell checking system. For example, how could vector semantics be included?
- Explore the bias inherent in spell checkers. Find and report on research related to whose language is represented in spell checkers, and how the way spell checkers are implemented might unequally impact different people.
- Add a way for your system to learn when new words should be added to your dictionary.
Some good places to start looking for relevant research:
In Writeup.md
, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
Bug fix
The original interaction.py
file contained incorrect sample output. Below is the correct sample output.
>>> print(s.channel_model.prob("hello", "hello"))
-0.6520393913851943
>>> print(s.channel_model.prob("hellp", "hello"))
-10.655417526736118
>>> print(s.channel_model.prob("hllp", "hello"))
-12.889127454847866
>>> print(s.check_text("they did not yb any menas"))
[[['they'], ['did'], ['not'], ['be', 'by', 'my', 'you', 'i', 'in', 'ye', 'b', 'y', 'rib', 'yet',
'ay', 'if', 'job', 'ob', 'yo', 'jib', 'of', 'ly', 'on', 'ab', 'o', 'rob', 'orb', 'jub', 'it',
'ty', 'bo', 'is', 'a', 'yea', 'mob', 'cab', 'web', 'sob', 'to', 'up', 'yon', 'yew', 'yes', 'cob',
'an', 'obi', 'ebb', 'nob', 'do', 'iv', 'alb', 'bab', 'eye', 'tob', 'yaw', 'v', 'abi', 'mab', 'at',
'he', 'go', 'as', 'x', 'rub', 'gob', 'lye', 'sub', 'or', 'ix', 'aye', 'd', 'lbs', 'cub', 'pub',
'tub', 'z', 'so', 'dab', 'bob', 'we', 'l', 'dye', 'k', 'pmb', 'n', 'xv', 'ho', 'hye', 'il', 'yer',
'wo', 'yee', 'ex', 'bye', 'yis', 'vp', 'ox', 'rye', 'oh', 'w', 'io', 'en', 'm', 'ed', 'h', 'me',
'am', 'xx', 'el', 'us', 'no', 'fye', 'eh', 't', 'qu', 'ii', 'r', 'e', 'c', 'ah', 'ha', 's', 'lo',
'al', 'uz', 'em', 'ad', 'ao', 'ow', 'og', 'vs', 'er', 'ir', 'et', 'mr', 'un', 'hm', 'th', 'ji',
'ai', 'xi', 'je', 'hi', 'ze', 'co', 'wm', 'ee', 'au', 'ou', 'ar', 'ca', 'um', 'ro', 'vi', 'de',
'dr', 'fa', 'va', 'sh', 'la', 'nt', 'tm', 'ma', 'gr', 'ur', 'di', 're', 'st', 'tu', 'da', 'ms',
'le', 'pi', 'si', 'se'], ['any'], ['men', 'means', 'mens', 'meals', 'mes', 'mans', 'meanes',
'meats', 'meat', 'menials', 'omens', 'mean', 'mene', 'mines', 'enos', 'menace', 'mend', 'meads',
'zenas', 'kenaz', 'menan', 'seas', 'ment', 'jonas', 'mess', 'mead', 'medes', 'medals', 'enan',
'monks', 'minus', 'ends', 'mews', 'fens', 'minds', 'dens', 'meal', 'midas', 'eras', 'amends',
'pens', 'hena', 'hens', 'tens', 'vedas', 'meres', 'mental', 'lens', 'peas', 'lena', 'meah',
'medad', 'venus', 'arenas', 'aeneas', 'metals', 'enam', 'medan', 'demas', 'teas', 'zenan',
'kenan', 'meets', 'sends', 'merab', 'texas', 'tents', 'bends', 'melts', 'metal', 'tends', 'penal',
'dents', 'lends', 'cents', 'rents', 'annas']]]
>>> print(s.autocorrect_line("they did not yb any menas"))
[['they'], ['did'], ['not'], ['be'], ['any'], ['men']]
>>> print(s.suggest_text("they did not yb any menas", max_suggestions=2))
[['they'], ['did'], ['not'], ['be', 'by'], ['any'], ['men', 'means']]
In addition, you may find this to be helpful:
>>> text = """This should take a list of words as input and return a list of lists.
Each sublist in the return value corresponds to a single word in the input
sentence. Words in the sentence that are in the language model will be represented
as a sublist containing just that word. Words in the sentence that are not in the
language model will be represented as a sublist of possible corrections. This sublist
of possible corrections should be, for each word in the sentence not in the language
model, the result of calling generate_candidates with each of the candidates in the
list and then sorting these candidates by the combination of LanguageModel score and
EditDistance score. If no candidates are found and fallback is True, then non-words
should be represented by a sublist with just the original word (the same
representation as correctly-spelled words).""".lower()
>>> result = sp.autocorrect_line(text)
>>> print(' '.join([x[0] for x in result]))
this should take a list of words as put and return a list of lists . each subtlest in the
return value corresponds to a single word in the put sentence . words in the sentence that
are in the language model will be represented as a subtlest containing just that word .
words in the sentence that are not in the language model will be represented as a subtlest
of possible corrections . this subtlest of possible corrections should be , for each word
in the sentence not in the language model , the result of calling generate_candidates with
each of the candidates in the list and then sorting these candidates by the combination
of languagemodel score and editdistance score . if no candidates are found and fallacy is
true , then non - words should be represented by a subtlest with just the original word
( the same representation as correctly - spilled words ) .