This week, you’ll be writing a class from scratch that can find the weighted minimum edit distance between two strings, and can also display the resulting character alignments. It is important that you follow the API below, as the code for Lab 05 will assume your class is written as described.
When you’re done, you’ll be able to run code like the following:
```python
my_aligner = EditDistanceFinder()
my_aligner.train("/data/spelling/wikipedia_misspellings.txt")

dist, alignments = my_aligner.align("cought", "caught")
print("Distance between 'cought' and 'caught' is", dist)
my_aligner.show_alignment(alignments)
print()

dist, alignments = my_aligner.align("caugt", "caught")
print("Distance between 'caugt' and 'caught' is", dist)
my_aligner.show_alignment(alignments)
```
which should generate the output:

```
Distance between 'cought' and 'caught' is 0.98870
Observed Word: c o u g h t
Intended Word: c a u g h t

Distance between 'caugt' and 'caught' is 0.83541
Observed Word: c a u g % t
Intended Word: c a u g h t
```
Your `EditDistanceFinder` class should have one data member: `probs`, a default dictionary of default dictionaries whose values are floats, which will store the probability of each observation given the intended character. You are welcome to include other data members if they help with your implementation, but a user of your class should not need to know about them.
Over the course of the assignment, you will implement the following member functions:

- `__init__`, which gives starting values to all data members,
- `train_alignments`, which takes a list of misspellings and uses it to determine the most likely set of errors to have generated each,
- `train_costs`, which takes a list of alignments of misspellings and uses it to determine the probability of different types of errors,
- `train`, which iteratively calls `train_alignments` and `train_costs` until the two converge,
- `align`, which takes two words and returns the distance and the alignments corresponding to the minimum weighted edit distance between them,
- `del_cost`, a helper function that takes a character as input and returns the cost of deleting that character,
- `ins_cost`, a helper function that takes a character as input and returns the cost of inserting that character,
- `sub_cost`, a helper function that takes two characters as input and returns the cost of substituting the second character for the first,
- `show_alignment`, which takes a list of character alignments (of the type returned by `align`) and prints them out in a human-readable format.
More details about each method are included below. The methods are described in a suggested implementation order, but you should feel free to develop in another order if it makes more sense to you.
## `__init__`

In Python classes, `__init__` is run when an object is first created. In particular, for your `EditDistanceFinder`, `__init__` should initialize the `probs` variable to an empty `defaultdict` of `defaultdict`s of `float`s. You can read the documentation for `defaultdict` if you are unfamiliar with it. You will want to include the line `from collections import defaultdict` at the top of your program.
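A minimal sketch of this setup (the surrounding class skeleton is an assumption about your layout):

```python
from collections import defaultdict

class EditDistanceFinder:
    def __init__(self):
        # probs[intended][observed] is the probability of seeing `observed`
        # when `intended` was meant; unseen pairs default to 0.0.
        self.probs = defaultdict(lambda: defaultdict(float))
```

Note that the outer default factory must be callable with no arguments, which is why the `lambda` is needed: `defaultdict(defaultdict(float))` would not work.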
## `ins_cost(observed_char)`

This method should take a single character as input, and should return a cost (between 0 and 1) of inserting that character. You should model the cost of inserting character `c` as `1 - p(c)`, where `p(c)` is the probability of observing `c` when nothing was intended; this probability is stored as `probs['%'][c]`.
## `del_cost(intended_char)`

This method should take a single character as input, and should return a cost (between 0 and 1) of deleting that character. You should model the cost of deleting character `c` as `1 - p(c)`, where `p(c)` is the probability of observing nothing when `c` was intended; this probability is stored as `probs[c]['%']`.
## `sub_cost(observed_char, intended_char)`

This method should take two characters as input, and should return a cost (between 0 and 1) of observing the first character where the second was intended. If `observed_char == intended_char`, it should return 0. Otherwise, you should model the cost as `1 - p(observed_char|intended_char)`, where `p(observed_char|intended_char)` is the probability that `observed_char` is observed given that the original character was `intended_char`. This value should be stored as `probs[intended_char][observed_char]`.
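Taken together, the three cost helpers are thin lookups into `probs`. One possible sketch, assuming the `probs` member described above (so an untrained model charges cost 1 for every edit):

```python
from collections import defaultdict

class EditDistanceFinder:
    def __init__(self):
        self.probs = defaultdict(lambda: defaultdict(float))

    def ins_cost(self, observed_char):
        # 1 - p(observed_char observed when nothing was intended)
        return 1 - self.probs['%'][observed_char]

    def del_cost(self, intended_char):
        # 1 - p(nothing observed when intended_char was intended)
        return 1 - self.probs[intended_char]['%']

    def sub_cost(self, observed_char, intended_char):
        # Matching characters are free; otherwise 1 - p(observed | intended).
        if observed_char == intended_char:
            return 0
        return 1 - self.probs[intended_char][observed_char]
```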
## `align(observed_word, intended_word)`

This method is where the heart of the minimum edit distance functionality will live. It’s up to you whether all of the functionality lives entirely in this method, or you decide to break it into one or more helper functions, but users of your class should only need to know about the `align` method.
This method takes two words as input. It returns a distance (as a float) and the corresponding character alignments (as a list of tuples of characters).
Using the example from above, before training costs, `my_aligner.align("caugt", "caught")` should return `(1.0, [('c', 'c'), ('a', 'a'), ('u', 'u'), ('g', 'g'), ('%', 'h'), ('t', 't')])`, since the misspelling `caugt` was the result of the deletion of a single `h` from the intended string. Note that we will use the percent sign to indicate an empty character, so deleting a `c` will show up in the alignment as `('%', 'c')` and inserting a `c` will show up as `('c', '%')`.

**Careful:** Deleting the letter `'h'` shows up in the alignment as `('%', 'h')`, but is stored in the probability matrix as `probs['h']['%']`, since `'h'` was intended and `'%'` was observed.
I recommend that you use a `numpy` matrix to store your cost table. You can initialize an `M`-by-`N` matrix of zeros with `numpy.zeros((M, N))`. You may want to use a second `numpy` matrix to store your backtraces, but that design decision is up to you.
This is likely to be the most difficult part of this week’s assignment, and it’s a place where it’s easy to make off-by-one type errors. I recommend that you draw pictures/figures/diagrams to check indices, step through simple examples, and use any other strategies you’re familiar with to build your solution in a structured manner.
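To make the indexing concrete, here is one possible shape for the dynamic program and its backtrace. The cost helpers are repeated in their untrained form only so the example runs on its own; the exact structure (a separate backtrace matrix, helper functions, tie-breaking order) is up to you.

```python
import numpy as np
from collections import defaultdict

class EditDistanceFinder:
    def __init__(self):
        self.probs = defaultdict(lambda: defaultdict(float))

    def ins_cost(self, observed_char):
        return 1 - self.probs['%'][observed_char]

    def del_cost(self, intended_char):
        return 1 - self.probs[intended_char]['%']

    def sub_cost(self, observed_char, intended_char):
        if observed_char == intended_char:
            return 0
        return 1 - self.probs[intended_char][observed_char]

    def align(self, observed_word, intended_word):
        M, N = len(observed_word), len(intended_word)
        table = np.zeros((M + 1, N + 1))

        # Base cases: aligning a prefix against the empty string.
        for i in range(1, M + 1):
            table[i, 0] = table[i - 1, 0] + self.ins_cost(observed_word[i - 1])
        for j in range(1, N + 1):
            table[0, j] = table[0, j - 1] + self.del_cost(intended_word[j - 1])

        # Fill the table: each cell is the cheapest of sub/ins/del.
        for i in range(1, M + 1):
            for j in range(1, N + 1):
                table[i, j] = min(
                    table[i - 1, j - 1]
                    + self.sub_cost(observed_word[i - 1], intended_word[j - 1]),
                    table[i - 1, j] + self.ins_cost(observed_word[i - 1]),
                    table[i, j - 1] + self.del_cost(intended_word[j - 1]),
                )

        # Recover the alignment by walking back from the bottom-right cell.
        alignments = []
        i, j = M, N
        while i > 0 or j > 0:
            if i > 0 and j > 0 and np.isclose(
                table[i, j],
                table[i - 1, j - 1]
                + self.sub_cost(observed_word[i - 1], intended_word[j - 1]),
            ):
                alignments.append((observed_word[i - 1], intended_word[j - 1]))
                i, j = i - 1, j - 1
            elif j > 0 and np.isclose(
                table[i, j], table[i, j - 1] + self.del_cost(intended_word[j - 1])
            ):
                alignments.append(('%', intended_word[j - 1]))  # deletion
                j -= 1
            else:
                alignments.append((observed_word[i - 1], '%'))  # insertion
                i -= 1
        alignments.reverse()
        return table[M, N], alignments
```

With no training, this returns a distance of `1.0` for `align("caugt", "caught")`, with `('%', 'h')` marking the deleted `h`.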
## `show_alignment(alignments)`

This method should take the alignments returned by `align` and print them in a friendly way. The first line should contain “Observed Word:” followed by all of the first characters in the alignment, separated by spaces. The second line should contain “Intended Word:” followed by all of the second characters in the alignment, separated by spaces.
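A compact sketch, written as a standalone function here (inside your class it becomes a method taking `self`):

```python
def show_alignment(alignments):
    # Each pair in `alignments` is (observed_char, intended_char).
    print("Observed Word:", " ".join(observed for observed, _ in alignments))
    print("Intended Word:", " ".join(intended for _, intended in alignments))

show_alignment([('c', 'c'), ('a', 'a'), ('u', 'u'), ('g', 'g'), ('%', 'h'), ('t', 't')])
# Observed Word: c a u g % t
# Intended Word: c a u g h t
```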
Once this is done, I recommend that you pause and write several test cases for your `align` method, using `show_alignment` to visually check the result of your alignment algorithm.
## `train`, `train_costs`, and `train_alignments`

These methods all interact with each other, so it’s a bit tricky to decide which one to implement first.

We will start with the `train` method, which should take the name of a file (in our case, `/data/spelling/wikipedia_misspellings.txt`, which came from this list). Each line of the file contains a common observed misspelling, a comma, and the intended spelling. `train` should read in the file and split it into a list of tuples, e.g. `[(observed1, intended1), (observed2, intended2), ...]`.

`train` will then iteratively call `train_alignments` and `train_costs`, but for now, just have it call `train_alignments` with the list of misspellings that you read in from the file.
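A first pass at `train` might look like this sketch, written as a plain function taking `self` so it can be pasted into your class (the local names are illustrative):

```python
def train(self, file_name):
    # Each line is "misspelling,intended"; collect the pairs as tuples.
    misspellings = []
    with open(file_name) as f:
        for line in f:
            observed, intended = line.strip().split(",")
            misspellings.append((observed, intended))
    # For now, just hand the list to train_alignments.
    self.train_alignments(misspellings)
```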
Now turn to `train_alignments`, which should take a list of misspellings like the one you just created. The method should call `align` on each of the `(observed, intended)` pairs, and should return a single list with all of the character alignments from all of the pairs.
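That description translates almost directly into code; a sketch (again written as a plain function taking `self`):

```python
def train_alignments(self, misspellings):
    # Accumulate the character alignments from every pair into one list.
    all_alignments = []
    for observed, intended in misspellings:
        _, alignments = self.align(observed, intended)
        all_alignments.extend(alignments)
    return all_alignments
```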
Go back to your `train` method and save the result of `train_alignments` to a local variable. Pass that list of alignments to a call to `train_costs`.
Now turn to `train_costs`, which takes a list of character alignments and uses it to estimate the likelihood of different types of errors. You will want to count (the class `collections.Counter` may be useful here) the number of times that each character `intended_char` is aligned to the character `observed_char`. To update your `probs` variable, each of those counts should be normalized by the total number of times that each character was intended, which will ensure that we have valid probability distributions.
**Careful:** Be sure to build a brand-new `self.probs` inside `train_costs`, so that any entry that should be zero after the update doesn’t keep a “leftover” value from the previous iteration.
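One way to sketch `train_costs` under the assumptions above (counting pairs, then normalizing per intended character, and rebuilding `probs` from scratch):

```python
from collections import defaultdict, Counter

def train_costs(self, alignments):
    # Count how often each observed character appears for each intended one.
    counts = defaultdict(Counter)
    for observed, intended in alignments:
        counts[intended][observed] += 1
    # Rebuild probs from scratch so no stale values survive from the
    # previous iteration.
    self.probs = defaultdict(lambda: defaultdict(float))
    for intended, observed_counts in counts.items():
        total = sum(observed_counts.values())
        for observed, count in observed_counts.items():
            self.probs[intended][observed] = count / total
```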
Finally, go back to the `train` method and update it to repeatedly call `train_alignments` and `train_costs` until your model converges. You’ll know the model has converged when the alignments don’t change from one iteration to the next; that shouldn’t take more than 10 iterations for our data.
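The finished `train` might then look something like this sketch (the iteration cap of 10 mirrors the convergence claim above; the local names are illustrative):

```python
def train(self, file_name):
    with open(file_name) as f:
        misspellings = [tuple(line.strip().split(",")) for line in f]
    previous_alignments = None
    for _ in range(10):
        alignments = self.train_alignments(misspellings)
        if alignments == previous_alignments:
            break  # converged: the alignments stopped changing
        self.train_costs(alignments)
        previous_alignments = alignments
```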
## Test Cases
Once all of your methods are written, update and add to any test cases you wrote along the way to demonstrate the performance of your alignment class.
## Questions

In `Writeup.md`, answer the following questions:
- Explore the behavior of the alignments you get for a variety of word lengths and types of errors. Comment on what your model does well, and what it still doesn’t do very well.
- Which character(s) have the highest probability of being inserted? Does this surprise you?
- Which character(s) have the highest probability of being deleted? Does this surprise you?
- Which character(s) have the highest probability of being substituted for something other than itself or `'%'`? Is there a letter `x` that stands out? Think of this in both directions, e.g. `probs[x][y]` and `probs[y][x]`. Does this surprise you? Why?
- One common type of misspelling is related to how close two keys are on the keyboard. By examining your substitution probabilities, what evidence can you find to support or refute that as a source of the errors in our training data?
- What limitations does the model you trained have? There are multiple kinds of spelling mistakes that your model cannot accurately account for. Which ones? Why? What would you need to add to your system in order to handle them?
- In fact, the model you trained vastly overestimates the probability of insertions, deletions, and substitutions. What is it about the data we used to train the model that would result in this overestimate? What might be a more “fair” source of data for estimating the probabilities? What barriers might you run into if you wanted to train in a more justified way?