CS91R Lab 04: Segmentation
Due Friday, March 7, before midnight
Goals
The goals for this lab assignment are:
- Learn how to use python virtual environments to install libraries
- Learn to use spacy, a natural language processing (NLP) library
- Use argparse for command-line argument processing
- Experimentally design and improve a sentence segmentation algorithm
- Learn to evaluate algorithms using false positives and negatives
- Learn to extract data from XML documents
- Learn about unigrams and bigrams
- Explore the relationship between training and testing data and issues of data sparsity
Cloning your repository
Log into the CS91R-S25 github organization for our class and find your git repository for Lab 04, which will be of the format lab04-user1-user2, where user1 and user2 are your and your partner’s usernames.
You can clone your repository using the following steps while connected to the CS lab machines:
# cd into your cs91r/labs sub-directory and clone your lab04 repo
$ cd ~/cs91r/labs
$ git clone git@github.swarthmore.edu:CS91R-S25/lab04-user1-user2.git
# change into your lab04 repo directory
$ cd ~/cs91r/labs/lab04-user1-user2
# ls should list the following contents
$ ls
evaluate.py bias.py README.md segmenter.py mytokenizer.py
Answers to written questions should be included in the README.md file included in your repository.
Virtual environments
You normally work in pairs in lab, but both you and your partner will need to complete these steps. You might want to both follow along at the same time on your own computers.
A python virtual environment, or venv, is a way for you to install python packages that you need for various projects you might be working on. The following four bullet points are taken from python’s venv page:
A virtual environment is (amongst other things):
- Used to contain a specific Python interpreter and software libraries and binaries which are needed to support a project (library or application). These are by default isolated from software in other virtual environments and Python interpreters and libraries installed in the operating system.
- Not checked into source control systems such as Git.
- Considered as disposable – it should be simple to delete and recreate it from scratch. You don’t place any project code in the environment.
- Not considered as movable or copyable – you just recreate the same environment in the target location.
virtualenvwrapper
One-time setup for virtualenvwrapper
We will be using a tool called virtualenvwrapper to help manage the virtual environments we create. The virtualenvwrapper tool requires some one-time setup.
Open your ~/.bashrc file in a text editor and add the following lines to the end of the file:
export PYTHONPATH=
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3
export WORKON_HOME=/scratch/userid1/venvs  # REPLACE userid1 with your user id
source /usr/local/bin/virtualenvwrapper.sh
Save your ~/.bashrc file. You now need to either restart all of your terminal windows or run source ~/.bashrc at each open terminal’s prompt to incorporate these changes. When you restart or run the source command, you will get some output indicating that virtualenvwrapper has created a bunch of files for you. The output should look something like this:
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/premkproject
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/postmkproject
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/initialize
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/premkvirtualenv
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/postmkvirtualenv
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/prermvirtualenv
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/postrmvirtualenv
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/predeactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/postdeactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/preactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/postactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/get_env_details
You shouldn’t see this output again in the future, and you should not need to repeat these setup steps.
Creating a venv using virtualenvwrapper
Every venv has a name. You can call it anything you’d like, but we will call the one we make cs91r, both in the examples below and throughout the course when referring to the venv we’ve built.
To create a venv using virtualenvwrapper, you will use the mkvirtualenv script. You should get output similar to that shown below:
stork[~]$ mkvirtualenv cs91r
created virtual environment CPython3.12.3.final.0-64 in 3520ms
creator CPython3Posix(dest=/scratch/userid1/venvs/cs91r, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, via=copy, app_data_dir=/home/userid1/.local/share/virtualenv)
added seed packages: pip==24.0
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/cs91r/bin/predeactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/cs91r/bin/postdeactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/cs91r/bin/preactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/cs91r/bin/postactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/cs91r/bin/get_env_details
(cs91r) stork[~]$
Notice that the prompt changed. When you are running inside of a virtual environment, your prompt will change to indicate the venv you are using.
Exiting and starting a venv using virtualenvwrapper
Since you are now in a venv, let’s exit it. To do that, run the deactivate script at the prompt. Notice that your prompt reverts to its original state:
(cs91r) stork[~]$ deactivate
stork[~]$
To start a venv, you use the workon script followed by the name of your venv. You will know it is successful because your prompt will change back to indicate you are using the venv:
stork[~]$ workon cs91r
(cs91r) stork[~]$
Installing python packages
Now that we have our virtual environment, we will install some python packages. For this week, we will need to install the python packages spacy and lxml, along with some supporting packages. The commands in this section are specific to installing spacy and lxml. Once you’ve done that for this venv, you won’t need to do them again.
These commands will produce a lot of output and may take 60-90 seconds to complete. I will only show a few lines of the output here:
$ pip install spacy lxml
...
Successfully installed MarkupSafe-3.0.2 annotated-types-0.7.0 blis-1.2.0 catalogue-2.0.10 certifi-2025.1.31 charset-normalizer-3.4.1 click-8.1.8 cloudpathlib-0.21.0 confection-0.1.5 cymem-2.0.11 idna-3.10 jinja2-3.1.6 langcodes-3.5.0 language-data-1.3.0 lxml-5.3.1 marisa-trie-1.2.1 markdown-it-py-3.0.0 mdurl-0.1.2 murmurhash-1.0.12 numpy-2.2.3 packaging-24.2 preshed-3.0.9 pydantic-2.10.6 pydantic-core-2.27.2 pygments-2.19.1 requests-2.32.3 rich-13.9.4 setuptools-76.0.0 shellingham-1.5.4 smart-open-7.1.0 spacy-3.8.4 spacy-legacy-3.0.12 spacy-loggers-1.0.5 srsly-2.5.1 thinc-8.3.4 tqdm-4.67.1 typer-0.15.2 typing-extensions-4.12.2 urllib3-2.3.0 wasabi-1.1.3 weasel-0.4.1 wrapt-1.17.2
$ python -m spacy download en_core_web_sm
...
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
In future weeks we may install more packages, but for this week, we don’t need any more.
Spacy
This week, we are going to use the spacy python library for the first time. Spacy is designed to make it easy to use pre-trained models to analyze very large sets of data. In this lab, we’ll be using spacy as a skeleton for building our own tools.
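If you want a quick sanity check that the install worked (purely optional, not part of the lab code), something like this should run inside your cs91r venv:
import spacy

nlp = spacy.load("en_core_web_sm")        # the model downloaded above
doc = nlp("The cat in the hat came back.")
print([token.text for token in doc])      # prints the individual tokens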
1. Sentence segmentation
1.1 Overview
In the first part of the lab, you will write a simple sentence segmenter.
The /data/cs91r-s25/brown/ directory includes three text files taken from the Brown Corpus:
- editorial.txt
- adventure.txt
- romance.txt
The files do not indicate where one sentence ends and the next begins. In the data set you are working with, sentences can only end with one of 5 characters: period, colon, semi-colon, exclamation point and question mark.
However, there is a catch: not every period represents the end of a sentence, since there are many abbreviations (U.S.A., Dr., Mon., etc., etc.) that can appear in the middle of a sentence, where the periods do not indicate the end of a sentence. The text also has many examples where a colon is not the end of the sentence. The other three punctuation marks are all nearly unambiguously the ends of a sentence. Yes, even semi-colons.
For each of the above files, we have also provided a file listing the actual locations of the ends of sentences (as token numbers, counting from 0):
- editorial-eos.txt
- adventure-eos.txt
- romance-eos.txt
Your job is to write a sentence segmenter. We will add that segmenter to spacy’s processing pipeline.
1.2 Handling command-line arguments
We have provided you with some starter code in segmenter.py. Don’t worry about what the existing code does just yet.
Eventually, we’re going to want to run this program, which means we’re going to need main to do something other than pass. (Note: pass is a python statement that does nothing, but can sometimes be useful when python expects you to have an indented block of code and you don’t want to put anything there. In that case, using pass as a placeholder is useful.) You’ll want to remove pass once you have something better to put there.
We’re going to want to be able to provide the segmenter.py program with command-line arguments, taking one required argument (a textfile) and one optional argument (the OUTPUT_FILE specified using -o or --outfile). If the user does not specify an OUTPUT_FILE, the program should output to sys.stdout.
$ python3 ./segmenter.py --help
usage: segmenter.py [-h] [-o OUTFILE] textfile
Predict sentence segmentation for a file.
positional arguments:
textfile path to the unlabeled text file
options:
-h, --help show this help message and exit
-o OUTFILE, --outfile OUTFILE
write boundaries to file
Write the command-line interface for segmenter.py using the argparse module in python for processing command-line arguments. In addition to the module documentation, you may also find the argparse tutorial useful. We’ve already included import argparse at the top of the file to get you started.
Both arguments to segmenter.py will be files. The first file will be one that you read and the second file will be one that you write. The type of both arguments should be FileType arguments.
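As a starting point, here is a minimal sketch of what that argparse setup could look like; treat it as an illustration of the FileType idea rather than the required code (the helper name parse_args is just a suggestion):
import argparse
import sys

def parse_args():
    # one required text file to read, one optional output file to write
    parser = argparse.ArgumentParser(
        description="Predict sentence segmentation for a file.")
    parser.add_argument("textfile", type=argparse.FileType("r"),
                        help="path to the unlabeled text file")
    parser.add_argument("-o", "--outfile", type=argparse.FileType("w"),
                        default=sys.stdout, help="write boundaries to file")
    return parser.parse_args()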
You can print out the arguments you get to verify that things are working, but you don’t need to do anything with the arguments just yet.
As in Lab 2, all print statements should be in your main()
function, which should only be called if segmenter.py
is run from the command line.
1.3 Creating a spacy Doc
Now that your segmenter.py
function takes command-line arguments, you
can use them to create a spacy Doc from the textfile
. Pass the textfile
argument into the create_doc
function we’ve provided. The create_doc
function
returns a spacy Doc.
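For example, assuming you stored your parsed arguments in a hypothetical variable named args, the call might look like this:
    # in main(): build a Doc from the input file and count its tokens
    doc = create_doc(args.textfile)
    print(len(doc))   # number of tokens in the Doc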
Questions
- Create a spacy Doc using the create_doc function described in the section above. How many tokens are in the file editorial.txt? You can get the number of tokens in a spacy Doc by calling len() on the Doc object.
1.4 Understanding the tokenizer
The mytokenizer.py file contains some code that you do not need to modify in any way. However, it would be helpful if you understood what it does at a high level, even if the specifics aren’t clear.
The mytokenizer.py file defines two things:
- a tokenize function, which uses the same regular expression to split strings as you’ve already seen in previous labs, and
- a class called MyTokenizer that extends the Tokenizer class in spacy
The MyTokenizer class allows you to define a piece of spacy’s processing pipeline. In spacy, a pipeline is defined as the steps that spacy takes to convert raw text into a format that is ready for various text processing tasks.
In this step, we are replacing spacy’s default English tokenizer — a component of spacy’s processing pipeline — with our tokenizer. We are writing our own tokenizer because otherwise spacy would do much of the sentence segmentation for us. Since the goal of this lab is for you to write the segmenter, we are replacing the default tokenizer with a tokenizer that simply separates the non-alphabetic text (punctuation, dollar signs, etc.) from the letters and numbers in the text.
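To give a rough sense of what that means, a tokenizer like that could be as simple as the following sketch; the regular expression here is an assumption for illustration, not the actual code in mytokenizer.py:
import re

def tokenize(text):
    # keep runs of letters/digits together; every other non-space character
    # becomes its own token (the real pattern in mytokenizer.py may differ)
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text)

# tokenize("Dr. Smith paid $40.") -> ['Dr', '.', 'Smith', 'paid', '$', '40', '.']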
Optional deeper dive
The MyTokenizer class might be confusing to you since it likely uses python and/or spacy components that are new to you. If you’re curious, here are sources that will help you understand what’s going on.
- Section 9.5 of the python tutorial on classes, which talks about inheritance in python.
- A high-level explanation of the __call__ method from realpython.com.
- The documentation (and especially the description of the __init__ method) for the Doc class in spacy.
- The documentation (and especially the description of the __init__ method) for the Tokenizer class in spacy.
- The documentation for the Language class in spacy.
Questions
- Briefly explain the MyTokenizer class in your own words. (You don’t need to fully understand it, but use the web page text and the comments in the file to help.)
1.5 Baseline segmenter
In the segmenter.py file, write a function called baseline_segmenter that takes a spacy Doc as its only argument. This function will use a naive algorithm for determining when we’ve reached the end of each sentence.
This is called a baseline segmenter, because we are expecting that this very simple segmenter can get pretty good results without much work on our part. By setting the bar pretty high, we’re expecting our improvements to be even better than this baseline. We will be able to compare our improved algorithm against this baseline to see how much better it performs.
The baseline segmenter will iterate through all of the tokens in the Doc (using for token in doc:) and predict which ones are the ends of sentences. Confusingly, instead of keeping track of the ends of sentences, spacy keeps track of the beginnings of sentences. In particular, the first token in each sentence in a spacy Doc has its is_sent_start attribute set to True.
This means that for every token that you predict corresponds to the end of a sentence, you should set the is_sent_start attribute to True for the next token.
Recall that every sentence in our data set ends with one of the five tokens ['.', ':', ';', '!', '?']. Our baseline_segmenter will predict that every instance of one of these characters is the end of a sentence.
As shown below, you can access the text content of a Token in spacy through its .text attribute to determine if it matches one of these five tokens:
>>> my_token = doc[0]
>>> type(my_token)
<class 'spacy.tokens.token.Token'>
>>> my_token.text
'The'
>>> type(my_token.text)
<class 'str'>
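Putting these pieces together, a baseline component might look roughly like the sketch below. Treat it as an illustration of the shape, not the required code; the registered name in the decorator (and whether the decorator is already in your starter code) may differ, but it must match the string you pass to add_pipe in the next step.
from spacy.language import Language   # likely already imported in the starter code

END_TOKENS = ['.', ':', ';', '!', '?']

@Language.component("baseline_segmenter")
def baseline_segmenter(doc):
    # predict a sentence boundary after every candidate punctuation token
    for i, token in enumerate(doc):
        if token.text in END_TOKENS and i + 1 < len(doc):
            doc[i + 1].is_sent_start = True
    return doc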
Next, add your baseline_segmenter function to the pipeline of tools that will be called on every Doc that is created with your create_doc function. To do that, add the following line to the create_doc function just below where you set up your tokenizer (the string you pass to add_pipe must match the name registered with the @Language.component decorator):
nlp.tokenizer=MyTokenizer(nlp.vocab) # (existing line)
nlp.add_pipe("segmenter") # <--- ADD THIS LINE
Finally, update your main() function to write out the token numbers corresponding to the predicted sentence boundaries to outfile, if it was specified as a command-line option, or print the token numbers to the screen if the outfile argument was omitted. You should write the token numbers one per line. You can view /data/cs91r-s25/brown/editorial-eos.txt as an example.
Be sure to write out the last token number as a sentence boundary. Since spacy is keeping track of the starts of new sentences, the end of the final sentence is never explicitly marked. You can access a list of the sentences in a spacy Doc with its sents attribute:
>>> doc = nlp("The cat in the hat came back, wrecked a lot of havoc on the way.")
>>> print(len(list(doc.sents)))
1
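One way to produce that output, assuming args holds your parsed arguments and outfile defaults to sys.stdout as described above, is to write the index of the last token of each sentence (this is a sketch, not the only approach):
    # in main(): one predicted boundary per line; spacy Span.end is exclusive,
    # so sent.end - 1 is the last token of each sentence (including the final one)
    for sent in doc.sents:
        print(sent.end - 1, file=args.outfile)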
Confirm that when you run your baseline_segmenter on the file editorial.txt, it predicts 3278 sentence boundaries.
Optional deeper dive
You can read more about the add_pipe function if you are interested.
Questions
- Briefly explain the create_doc function in your own words. (You don’t need to fully understand it, but use the web page text and the comments in the file to help.)
1.6 Evaluate your segmenter
To evaluate your system, we are providing you with a program called evaluate.py that compares your hypothesized sentence boundaries with the ground-truth boundaries. This program will report the true positives, true negatives, false positives and false negatives (as well as precision, recall and F-measure, which we haven’t talked about in class just yet). You can run evaluate.py with the -h option to see all of the command-line options that it supports.
A sample run, with the output of your baseline segmenter from above stored in bl_editorial-eos.txt, would be:
python3 evaluate.py -d /data/cs91r-s25/brown/ -c editorial -y bl_editorial-eos.txt
Run the evaluation on your baseline system’s output for the editorial category, and confirm that you get the following before moving on:
TP: 2719 FN: 0
FP: 559 TN: 60055
PRECISION: 82.95% RECALL: 100.00% F: 90.68%
1.7 Improving your segmenter
Now it’s time to improve the baseline sentence segmenter. We don’t have any false negatives (since we’re predicting that every instance of the possibly-end-of-sentence punctuation marks is, in fact, the end of a sentence), but we have quite a few false positives.
Make a copy of your baseline_segmenter function called my_best_segmenter. Be sure to copy the line that says @Language.component("baseline_segmenter") and change it to say @Language.component("my_best_segmenter"). In the create_doc function, change the add_pipe line to use my_best_segmenter instead of baseline_segmenter, so that the pipeline uses your new segmenter instead of the baseline one.
You can see the tokens that your system is mis-characterizing by setting the verbosity of evaluate.py to something greater than 0. Setting it to 1 will print out all of the false positives and false negatives so you can work to improve your my_best_segmenter function.
python3 evaluate.py -v 1 -d /data/cs91r-s25/brown/ -c editorial -y bl_editorial-eos.txt
To test your segmenter, we will run it on a hidden text you haven’t seen yet. It may not make sense for you to spend a lot of time trying to fix obscure cases that show up in the three texts we are providing you because these cases may never show up in the hidden text that you haven’t seen. But it is important to handle cases that occur multiple times and even some cases that appear only once if you suspect they could appear in the hidden text. You want to write your code to be as general as possible so that it works well on the hidden text without leading to too many false positives.
Questions
- Describe the improvements you made in my_best_segmenter to increase performance.
- For each of the three texts (editorial, adventure, romance), what was the final precision, recall and F-measure?
- If you didn’t get 100% precision and 100% recall, why did you stop improving your segmenter?
2. Exploring bias / using XML
In Lab 03, we used smoothing to deal with words that didn’t occur in a corpus, leading to the probability of a word (or ngram) being set to 0. As we start to use language for more complex tasks, related problems will arise.
For example, imagine you created a tool like ChatGPT that could generate newspaper articles. If you built this tool by only showing it the editorial.txt file, it might be able to generate text that looks like a newspaper editorial. However, if you then asked it to generate the kinds of stories in adventure.txt, it would be unable to do so because there are lots of words in adventure.txt that never show up in editorial.txt, such as "bottle" and "window".
In this part of the assignment, we’ll explore this problem in more depth. To do that, we’ll start working with a larger dataset that contains news articles that have been classified as being "hyperpartisan". The data we will be working with in this lab was used as part of SemEval 2019, Task 4.
Each article will be labeled as either hyperpartisan ('true') or not ('false'). In addition, each 'true' hyperpartisan article will be labeled as having a "left" or "right" bias. Each 'false' hyperpartisan article will be labeled as having a "left-center", "right-center", or "least" (no particular) bias.
You should put your code for this part in the file called bias.py.
2.1 Extracting data from XML files
The dataset we will be using is in /data/cs91r-s25/semeval-19-04/.
The file articles.xml contains all of the articles we’ll look at this week. The file labels.xml contains the labeling of each article (is it hyperpartisan or not, and what is its bias).
These files have 600,000 articles and 600,000 labels, respectively, so they are pretty large and hard to work with while you’re developing. Therefore, we’ve provided small_articles.xml and small_labels.xml, which contain only 12 articles and their labels. You can work with them while you work out the bugs.
Once you switch to the full data set, you’ll have to deal with the fact that the data file is big (around 2.6G), so we won’t store the whole thing in memory at once if we can help it. Fortunately, the lxml library gives us a way to iteratively parse through an xml file, dealing with one node (in our case, article) at a time. Here’s sample code that opens a file called myfile.xml and processes every article node (for example, by calling a function like my_func on it):
from lxml import etree

fp = open("myfile.xml", "rb")
for event, element in etree.iterparse(fp, events=("end",)):
    if element.tag == "article":
        # process the element (see below)
        element.clear()  # optimization to free up memory
As needed, refer back to the lxml tutorial we saw in clab. You won’t need to use the event variable, but the element variable will be a node in the XML tree. You will want to get the .items() of the element (which contains the attributes of the node such as id and title). And you will need to use the .itertext() method to get the text of the node.
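For instance, inside the if element.tag == "article": branch above, the processing might look roughly like this (the id and title attribute names come from the articles file; everything else is an assumption about how you want to store things):
        # continue inside the loop shown above
        attributes = dict(element.items())        # e.g. {'id': ..., 'title': ...}
        article_id = attributes["id"]
        text = "".join(element.itertext())        # all nested text, tags stripped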
Do not use findall or find. There are two reasons: (1) the dataset is very large and findall will be slow, and (2) this data has a lot of embedded HTML tags, so finding all of the text that is nested inside <p> and <a> tags is doable but non-trivial.
Before we continue, take a look at the contents of the articles.xml file and the labels.xml file. Look at the structure of the documents, especially at the start of each <article>.
$ less /data/cs91r-s25/semeval-19-04/articles.xml # or small_articles.xml
$ less /data/cs91r-s25/semeval-19-04/labels.xml # or small_labels.xml
Questions
- If the articles are in one file and the labels are in the other file, how do you know how each article is labeled?
2.2 Get the XML contents
Write two functions, one called get_labels() and one called get_articles().
The get_labels function should take a filename (e.g. /data/cs91r-s25/semeval-19-04/labels.xml) and it should return a dictionary mapping article IDs to their bias ("left", "right", etc).
The get_articles function should take a filename (e.g. /data/cs91r-s25/semeval-19-04/articles.xml) and should return a dictionary that maps article IDs to a spacy Doc containing the news article.
When creating spacy documents, don’t use your MyTokenizer from Part 1. We’ll be ignoring sentence boundaries for this part, so you don’t need to add your my_best_segmenter either. To initialize the spacy pipeline, you can just use this line:
nlp = English()
One problem we will need to worry about is that xml files contain HTML entities. These files in particular include a lot of the entity &nbsp;, a non-breaking space. In order to clean these up, we can use a function in the html library before creating our spacy document:
import html
...
# the string storing the article is stored in the variable text
doc = nlp(html.unescape(text))
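Putting these pieces together, a minimal sketch of get_articles might look like the following; the names and details are illustrative, not required:
import html
from lxml import etree
from spacy.lang.en import English

def get_articles(filename):
    # map article IDs to spacy Docs, handling one <article> node at a time
    nlp = English()
    articles = {}
    with open(filename, "rb") as fp:
        for event, element in etree.iterparse(fp, events=("end",)):
            if element.tag == "article":
                article_id = element.get("id")
                text = "".join(element.itertext())
                articles[article_id] = nlp(html.unescape(text))
                element.clear()
    return articles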
Once you have both the dictionary of labels and the dictionary articles, create another new dictionary whose keys are labels (e.g. "left", "right", etc). Each key should map to a list of all the articles (as spacy documents) labeled with that key.
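A hypothetical group_by_label helper could build that dictionary like this:
def group_by_label(labels, articles):
    # labels: article ID -> bias label; articles: article ID -> spacy Doc
    by_label = {}
    for article_id, doc in articles.items():
        by_label.setdefault(labels[article_id], []).append(doc)
    return by_label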
Questions
- What percentage of the tokens that appear in the 'left' articles don’t appear in the 'least' articles? Put your answer in the table shown below, which is also included in your README.md file.
- What percentage of the tokens that appear in the 'right' articles don’t appear in the 'least' articles? Put your answer in the table shown below.
- What if you considered types instead of tokens (for left vs least and right vs least)? Put your answers in the table shown below.
- Are you surprised by these results? Why or why not?
2.3 Bigram and trigram analysis
A bigram is a pair of words that occur together. For example, in the sentence "The cat sat on the mat", the bigrams are: "The cat", "cat sat", "sat on", "on the", "the mat". A trigram is a triplet of words that occur together. The trigrams are: "The cat sat", "cat sat on", "sat on the", "on the mat".
- Write a function that takes as input a spacy Document, and returns a list of all of the bigrams in the document.
- Write a function that takes as input a spacy Document, and returns a list of all of the trigrams in the document.
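As a rough illustration of one possible approach (not required code), both can be written as short comprehensions over token positions:
def get_bigrams(doc):
    # consecutive pairs of token texts: ('The', 'cat'), ('cat', 'sat'), ...
    return [(doc[i].text, doc[i + 1].text) for i in range(len(doc) - 1)]

def get_trigrams(doc):
    # consecutive triples of token texts
    return [(doc[i].text, doc[i + 1].text, doc[i + 2].text)
            for i in range(len(doc) - 2)]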
Questions
The table shown below is included in your README.md file. When a question asks you to fill in the table, fill in the matching table in the README.md file. Be sure to put your answers to Questions 8, 9 and 10 in the table.
- What percentage of the bigrams (tokens, not types) that appear in the 'left' articles don’t appear in the 'least' articles? What percentage of the bigrams that appear in the 'right' articles don’t appear in the 'least' articles? Report in the table shown below. Are you surprised by these results? Why or why not?
- What percentage of the trigrams (tokens, not types) that appear in the 'left' articles don’t appear in the 'least' articles? What percentage of the trigrams that appear in the 'right' articles don’t appear in the 'least' articles? Report in the table shown below. Are you surprised by these results? Why or why not?
Train | Test | Missing tokens | Missing types | Missing bigrams | Missing trigrams |
---|---|---|---|---|---|
Left | Least | Q8 | Q10 | Q12 | Q13 |
Right | Least | Q9 | Q10 | Q12 | Q13 |
2.4 Collecting statistics on pairs of categories
Instead of collecting statistics on one set of articles ("training") and seeing how closely these statistics match another set of articles ("testing"), what if you trained your tool on two of the sets (e.g. 'left' and 'least') and tested it on the third category (e.g. 'right')?
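Concretely, "training" on two of the sets just means pooling their tokens (or bigrams, or trigrams) before making the same missing-item comparison as before. A sketch, reusing the hypothetical by_label dictionary from earlier:
def missing_percentage(items, reference_items):
    # percentage of items (tokens, bigrams, trigrams, ...) in `items` whose
    # type never appears anywhere in `reference_items`
    reference_types = set(reference_items)
    missing = sum(1 for item in items if item not in reference_types)
    return 100 * missing / len(items)

# pool two categories' tokens before comparing against the third category
pooled = [tok.text for doc in by_label["left"] + by_label["least"] for tok in doc]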
Questions
- Does that change your results? Why? Try each of the three combinations. Which worked best? Why? Your README.md includes the blank table shown below. Fill that table in with your results. Report percentages, not raw counts.
Train | Test | Missing tokens | Missing types | Missing bigrams | Missing trigrams |
---|---|---|---|---|---|
Left, Right | Least | | | | |
Least, Right | Left | | | | |
Least, Left | Right | | | | |
How to turn in your solutions
Edit the README.md file that we provided to answer each of the questions and add any discussion you think we’d like to know about.
Be sure to commit and push all changes to your python files.
If you think it would be helpful, use asciinema to record a terminal session and include it in your README.md.