Overview
This lab has three starter files:

- `evaluate.py`
- `segmenter.py`
- `tokenizer.py`

Answers to written questions should be included in your repository in a file called `Writeup.md`. You will also create (and should add to your repository) a file called `ngrams.py` during this assignment.
Docker and lxml
To process XML files, we’ll work with the `lxml` library this week. We’ve updated the Docker image to have everything you need, but you’ll need to run `docker pull jmedero/nlp:fall2018` to get the most recent version on your own machine.
Spacy
This week, we’ll get to use the `spacy` Python library for the first time. Spacy is designed to make it easy to use pre-trained models to analyze very large sets of data. At the beginning of the semester, we’ll be using spacy as a skeleton for building our own NLP algorithms. Later in the semester, you’ll get a chance to use more of its built-in functionality to build larger systems.
Sentence Segmentation
In the first part of the lab, you will write a simple sentence segmenter. The `/data/brown` directory includes three text files taken from the Brown Corpus:

- `/data/brown/editorial.txt`
- `/data/brown/adventure.txt`
- `/data/brown/romance.txt`
The files do not indicate where one sentence ends and the next begins. In the data set you are working with, sentences can only end with one of five characters: period, colon, semi-colon, exclamation point, or question mark.
However, there is a catch: not every period represents the end of a sentence, since many abbreviations (`U.S.A.`, `Dr.`, `Mon.`, `etc.`, etc.) can appear in the middle of a sentence, where their periods do not indicate the end of a sentence. The text also has many examples where a colon is not the end of the sentence. The other three punctuation marks are all nearly unambiguously the ends of sentences. Yes, even semi-colons.
For each of the above files, I have also provided a file listing the line numbers (counting from 0) of the actual ends of sentences:

- `/data/brown/editorial-eos.txt`
- `/data/brown/adventure-eos.txt`
- `/data/brown/romance-eos.txt`
Your job is to write a sentence segmenter, and to add that segmenter to spacy’s processing pipeline.
Part 1a
The given `segmenter.py` has some starter code, but it can’t be run from the command line. We want it to be executable, though, and when it’s called from the command line, it should take one required argument and one optional argument:
```
$ python3 ./segmenter.py --help
usage: segmenter.py [-h] --textfile FILE [--hypothesis_file FILE]

Predict sentence segmentation for a file.

optional arguments:
  -h, --help            show this help message and exit
  --textfile FILE, -t FILE
                        Path to the unlabeled text file.
  --hypothesis_file FILE, -y FILE
                        Write hypothesized boundaries to FILE (default stdout)
```
Write the command-line interface for `segmenter.py` using the `argparse` module in Python for processing command-line arguments. In addition to the module documentation, you may also find the argparse tutorial useful.
Both arguments to `segmenter.py` should be `FileType` arguments.
As in Lab 01, all `print` statements should be in your `main()` function, which should only be called if `segmenter.py` is run from the command line.
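As a rough sketch (not the required solution; the help strings are just illustrative), the interface above can be built like this. Note how `argparse.FileType` hands `main()` already-open file objects:

```python
import argparse
import sys


def main(args):
    # All print statements belong here; args.textfile and
    # args.hypothesis_file are already-open file objects.
    pass


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Predict sentence segmentation for a file.")
    parser.add_argument("--textfile", "-t", metavar="FILE", required=True,
                        type=argparse.FileType("r"),
                        help="Path to the unlabeled text file.")
    parser.add_argument("--hypothesis_file", "-y", metavar="FILE",
                        type=argparse.FileType("w"), default=sys.stdout,
                        help="Write hypothesized boundaries to FILE (default stdout)")
    main(parser.parse_args())
```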
Part 1b
This week’s starter code includes a file called `tokenizer.py` that defines two things:

- Our `tokenize` function from last week
- A class called `MyTokenizer` that extends the `Tokenizer` class in spacy
In the `segmenter.py` file, there’s one function called `create_doc` that takes a readable file pointer (the type of the `textfile` argument to `segmenter.py`) and returns a spacy document.
Stop now and make sure that the `MyTokenizer` class and the `create_doc` function make sense to you. They both use Python and/or spacy components that are likely new to you; some places you can read to better understand how they work include:
- Section 9.5 of this tutorial, which talks about inheritance in Python.
- The documentation (and especially the description of the `__init__` method) for the `Doc` class in spacy.
- The documentation (and especially the description of the `__init__` method) for the `Tokenizer` class in spacy.
- The documentation for the `Language` class in spacy.
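To make the inheritance pattern concrete, here is a minimal sketch of what a `Tokenizer` subclass can look like in spacy. This is *not* the starter code, just an illustration of the pattern; `ExampleTokenizer` and `split_words` are hypothetical names:

```python
from spacy.tokenizer import Tokenizer
from spacy.tokens import Doc


def split_words(text):
    # Hypothetical stand-in for a real tokenize function.
    return text.split()


class ExampleTokenizer(Tokenizer):
    def __call__(self, text):
        # Override __call__ so spacy uses our word-splitting rule,
        # then build a Doc from the resulting word list.
        return Doc(self.vocab, words=split_words(text))
```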
Questions
- Explain the `MyTokenizer` class in your own words.
- Explain the `create_doc` function in your own words.
- How many tokens are in the file `/data/brown/editorial.txt`? (Hint: You can get the number of tokens in a spacy `Doc` by calling `len()` on the `Doc` object.)
Part 1c
Next, write a function called `baseline_segmenter` that takes a spacy `Doc` as its only argument. We’ll add this function to our spacy pipeline after tokenization, so you can assume that the `Doc` you get is already word-tokenized.
Your function should iterate through all of the tokens in the `Doc` (`for token in doc:`) and predict which ones are the ends of sentences. Instead of keeping track of the ends of sentences, though, spacy keeps track of the beginnings of sentences. In particular, the first token in each sentence in a spacy `Doc` has its `is_sent_start` attribute set to `True`. So, for every token that you predict corresponds to the end of a sentence, you should set the `is_sent_start` attribute of the *next* token to `True`.
Remember that every sentence in our data set ends with one of the five tokens `['.', ':', ';', '!', '?']`. Since it’s a baseline approach, `baseline_segmenter` should predict that every instance of one of these characters is the end of a sentence. You can access the text content of a `Token` in spacy through its `.text` attribute:
```python
>>> my_token = doc[0]
>>> type(my_token)
<class 'spacy.tokens.token.Token'>
>>> my_token.text
'The'
>>> type(my_token.text)
<class 'str'>
```
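Putting those pieces together, a minimal sketch of the baseline (assuming the component receives and returns the `Doc`, as spacy pipeline components do) might look like this:

```python
END_PUNCT = {'.', ':', ';', '!', '?'}


def baseline_segmenter(doc):
    # Skip the last token: there is no "next token" after it.
    for token in doc[:-1]:
        if token.text in END_PUNCT:
            # spacy marks sentence *starts*, so flag the next token.
            doc[token.i + 1].is_sent_start = True
    return doc
```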
Next, add your `baseline_segmenter` function to the pipeline of tools that will be called on every `Doc` that is created with your `create_doc` function. To do that, you’ll want to look at the `add_pipe` function.
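In the spacy 2.x version on our Docker image, `add_pipe` accepts the component function directly; a one-line sketch, assuming your `Language` object is named `nlp`:

```python
# Run our segmenter as the first pipeline step after tokenization.
nlp.add_pipe(baseline_segmenter, first=True)
```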
Finally, update your `main()` function to write out the token numbers corresponding to the predicted sentence boundaries to `hypothesis_file`. Be sure to write out the last token number as a sentence boundary: since spacy only keeps track of the starts of new sentences, the final sentence is never explicitly marked. You can access a list of the sentences in a spacy `Doc` with its `sents` attribute:
```python
>>> doc = nlp("The cat in the hat came back, wrecked a lot of havoc on the way.")
>>> print(len(list(doc.sents)))
1
```
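One way to write the boundaries, as a sketch: the last token of each sentence in `doc.sents` is a predicted end-of-sentence, and its position in the `Doc` is `sent[-1].i`. Iterating over `doc.sents` also naturally covers the final token of the `Doc`:

```python
def write_boundaries(doc, out_file):
    # Write the index of each sentence-final token, one per line.
    for sent in doc.sents:
        print(sent[-1].i, file=out_file)
```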
Confirm that when you run your `baseline_segmenter` on the file `/data/brown/editorial.txt`, it predicts 3278 sentence boundaries.
Part 1d
To evaluate your system, I am providing you a program called `evaluate.py` that compares your hypothesized sentence boundaries with the ground-truth boundaries. This program will report to you the true positives, true negatives, false positives, and false negatives (as well as precision, recall, and F-measure, which we haven’t talked about in class just yet). You can run `evaluate.py` with the `-h` option to see all of the command-line options that it supports.
A sample run with the output of your baseline segmenter from above stored in `editorial.hyp` would be:

```
$ python3 evaluate.py -d /data/brown/ -c editorial -y editorial.hyp
```
Run the evaluation on your baseline system’s output for the editorial category, and confirm that you get the following before moving on:

```
TP: 2719   FN: 0
FP: 559    TN: 60055
PRECISION: 82.95%  RECALL: 100.00%  F: 90.68%
```
Part 1e
Now it’s time to improve the baseline sentence segmenter. We don’t have any false negatives (since we’re predicting that every instance of the possibly-end-of-sentence punctuation marks is, in fact, the end of a sentence), but we have quite a few false positives.
Make a copy of your `baseline_segmenter` function called `my_best_segmenter`. Change your `create_doc` function to call your new segmenter instead of the baseline one.
You can see the types of tokens that your system is mis-characterizing by setting the `verbosity` of `evaluate.py` to something greater than 0. Setting it to 1 will print out all of the false positives and false negatives so you can work to improve your `my_best_segmenter` function.
To test your segmenter, I will run it on a hidden text you haven’t seen. It may not make sense to spend a lot of time fixing obscure cases that show up in the three texts I am providing, since those cases may never appear in the hidden text. But it is important to handle cases that occur multiple times, and even some cases that appear only once if you suspect they could recur. Write your code to be as general as possible so that it works well on the hidden text without producing too many false positives.
NGrams
In class, we talked about the problem of 0’s in language modeling. If you were to train a unigram language model on the editorial category of the Brown corpus and then try to calculate the probability of generating the adventure category, you’d end up with a probability of 0, because there are words that occur in the adventure category that don’t appear in the editorial category (e.g. “badge”).
In this part of the assignment, we’ll explore that problem in more depth. To do that, we’ll start working with the dataset that you’ll use for your final project.
You should put your code for this part in a file called `ngrams.py`.
Part 2a: Extracting data from XML files
A sample of the dataset we’ll use for your final project is in `/data/semeval`. There’s a single XML file that contains all of the articles we’ll look at this week. Each article has a label that gives its bias: “left”, “right”, or “least”.
The data file you’ll use for your final project is big (the most recent release is around 3.6G), so it’s best not to store the whole thing in memory at once if we can help it. Fortunately, the lxml library gives us a way to iteratively parse through an XML file, dealing with one node at a time. Here’s sample code that opens a file called `myfile.xml` and calls a function called `my_func` on every `article` node:
```python
from lxml import etree

fp = open("myfile.xml", "rb")
for event, element in etree.iterparse(fp, events=("end",)):
    if element.tag == "article":
        my_func(element)
        element.clear()
```
Take a minute now to look at the contents of some of the articles in `/data/semeval` and skim the documentation for lxml. Then, write a function called `get_xml_contents()` that takes a file object and a spacy language object and returns a dictionary. The dictionary’s keys should be labels (that is, “left,” “right,” and “least”) and its values should be lists of spacy Documents, each containing the plain text contents of a single `article`. Hint: You might want to take a look at the `itertext()` method of etree `Element`s.
You should use your `MyTokenizer` from Part 1. We’ll ignore sentence boundaries for this part, so you don’t need to add your `my_best_segmenter`.
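As a very rough sketch of the shape this function might take (the `bias` attribute name is an assumption; check the actual XML in `/data/semeval` for where each article’s label really lives):

```python
from collections import defaultdict

from lxml import etree


def get_xml_contents(fp, nlp):
    # Map each bias label ("left", "right", "least") to a list of
    # spacy Docs, one Doc per article.
    docs_by_label = defaultdict(list)
    for event, element in etree.iterparse(fp, events=("end",)):
        if element.tag == "article":
            label = element.get("bias")  # assumed attribute name
            # itertext() yields all of the text nested inside the element.
            text = "".join(element.itertext())
            docs_by_label[label].append(nlp(text))
            element.clear()
    return docs_by_label
```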
Questions
- When you looked at the content of the articles, you probably noticed a lot of question marks. What’s your hypothesis about how those ended up in the file? In other words, what did the Task organizers not do correctly in preparing this data release? (The data release includes a note about the problem, and it should be fixed by the time we get to our final project. Alas, we will have to accept it the way it is for this week.)
- What percentage of the tokens that appear in the ‘left’ articles don’t appear in the ‘least’ articles?
- What percentage of tokens that appear in the ‘right’ articles don’t appear in the ‘least’ articles?
- What if you look at types instead of tokens?
- Are you surprised by these results? Why or why not?
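If the token/type distinction is hazy: *tokens* count every occurrence, while *types* count distinct token texts. A sketch of one way to compute the token percentage, assuming the lists of Docs come from `get_xml_contents`:

```python
def missing_token_percentage(test_docs, train_docs):
    # Distinct token texts seen anywhere in the training articles.
    seen = {tok.text for doc in train_docs for tok in doc}
    test_tokens = [tok.text for doc in test_docs for tok in doc]
    missing = sum(1 for t in test_tokens if t not in seen)
    return 100.0 * missing / len(test_tokens)
```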
Part 2c: Bigram analysis
What happens when you move to higher order n-gram models like bigrams and trigrams?
Write a function that takes as input a spacy Document, and returns a list of all of the bigrams in the document.
Write a function that takes as input a spacy Document, and returns a list of all of the trigrams in the document.
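A minimal sketch of both, via a shared n-gram helper (representing each n-gram as a tuple of token texts is one convenient choice, not a requirement):

```python
def ngrams(doc, n):
    # Slide a window of width n across the token texts.
    tokens = [tok.text for tok in doc]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigrams(doc):
    return ngrams(doc, 2)


def trigrams(doc):
    return ngrams(doc, 3)
```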
Questions
- What percentage of the bigrams (tokens, not types) that appear in the ‘left’ articles don’t appear in the ‘least’ articles? What percentage of the bigrams that appear in the ‘right’ articles don’t appear in the ‘least’ articles? Are you surprised by these results? Why or why not?
- What percentage of the trigrams (tokens, not types) that appear in the ‘left’ articles don’t appear in the ‘least’ articles? What percentage of the trigrams that appear in the ‘right’ articles don’t appear in the ‘least’ articles? Are you surprised by these results? Why or why not?
Part 2d: Collecting statistics on pairs of categories
Instead of collecting statistics on one set of articles (“training”) and seeing how closely these statistics match another set of articles (“testing”), what if you trained your language model on two of the sets (e.g. ‘left’ and ‘least’) and tested it on the third category (e.g. ‘right’)?
Questions
- Does that change your results? Why? Try for each of the three combinations. Which worked best? Why? Your Writeup.md file should include a table of your results, similar to the table below. Report percentages, not raw counts.
| Train | Test | Missing Tokens | Missing Types | Missing Bigrams | Missing Trigrams |
|---|---|---|---|---|---|
| Left, Right | Least | | | | |
| Least, Right | Left | | | | |
| Left, Least | Right | | | | |
Part 2e: Train on equal-sized “chunks”
Instead of training on one set of articles and testing on another, suppose we break each of the sets into 4 equally-sized chunks and then combine chunks across all three sets. For simplicity, let’s call the first chunk of each category “chunk A”, the second “chunk B”, etc. That means that “chunk A” will contain the first 25% of ‘left’ articles and the first 25% of ‘right’ articles and the first 25% of ‘least’ articles. “Chunk B” contains the next 25% of each, etc.:
| all text | = | left | + | least | + | right |
|---|---|---|---|---|---|---|
| chunk A | = | 1st 25% of left | + | 1st 25% of least | + | 1st 25% of right |
| chunk B | = | 2nd 25% of left | + | 2nd 25% of least | + | 2nd 25% of right |
| chunk C | = | 3rd 25% of left | + | 3rd 25% of least | + | 3rd 25% of right |
| chunk D | = | 4th 25% of left | + | 4th 25% of least | + | 4th 25% of right |
Now, combine chunks A, B and C and use that as your training data. Use chunk D as your test data.
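A minimal sketch of one way to build the chunks, assuming `docs_by_label` is the dictionary returned by `get_xml_contents`:

```python
def make_chunks(docs_by_label, n_chunks=4):
    # chunks[k] holds the k-th quarter of each category's articles.
    chunks = [[] for _ in range(n_chunks)]
    for docs in docs_by_label.values():
        size = len(docs) // n_chunks
        for k in range(n_chunks):
            # The final chunk absorbs any leftover articles.
            end = (k + 1) * size if k < n_chunks - 1 else len(docs)
            chunks[k].extend(docs[k * size:end])
    return chunks
```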
Questions
- Try training on A,B,C and testing on D; then train on A, B, D and test on C; then train on A, C, D and test on B; and finally train on B, C, D and test on A. How are your results different from the previous question? Why? Your Writeup.md file should include a table of your results, similar to the table below. Report percentages, not raw counts. Evaluating your system in this way is called cross-validation. In this case, since you are breaking your data into 4 distinct test sets, you are performing 4-fold cross-validation.
| Train | Test | Missing Tokens | Missing Types | Missing Bigrams | Missing Trigrams |
|---|---|---|---|---|---|
| A, B, C | D | | | | |
| A, B, D | C | | | | |
| A, C, D | B | | | | |
| B, C, D | A | | | | |