CS91R Lab 04: Segmentation

Due Friday, March 7, before midnight

Goals

The goals for this lab assignment are:

  • Learn how to use python virtual environments to install libraries

  • Learn to use spacy, a natural language processing (NLP) library

  • Use argparse for command-line argument processing

  • Experimentally design and improve a sentence segmentation algorithm

  • Learn to evaluate algorithms using false positives & negatives

  • Learn to extract data from XML documents

  • Learn about unigrams and bigrams

  • Explore the relationship between training and testing data and issues of data sparsity

Cloning your repository

Log into the CS91R-S25 github organization for our class and find your git repository for Lab 04, which will have a name of the form lab04-user1-user2, where user1 and user2 are your and your partner's usernames.

You can clone your repository using the following steps while connected to the CS lab machines:

# cd into your cs91r/labs sub-directory and clone your lab04 repo
$ cd ~/cs91r/labs
$ git clone git@github.swarthmore.edu:CS91R-S25/lab04-user1-user2.git

# change into the repository directory and list its contents
$ cd ~/cs91r/labs/lab04-user1-user2

# ls should list the following contents
$ ls
evaluate.py  bias.py  README.md  segmenter.py  mytokenizer.py

Answers to written questions should be included in the README.md file included in your repository.


Virtual environments

You normally work in pairs in lab, but both you and your partner will need to complete these steps. You might want to follow along at the same time on your own computers.

A python virtual environment, or venv, is a way for you to install python packages that you need for various projects you might be working on. The following four points are taken from python’s venv page:

A virtual environment is (amongst other things):

  1. Used to contain a specific Python interpreter and software libraries and binaries which are needed to support a project (library or application). These are by default isolated from software in other virtual environments and Python interpreters and libraries installed in the operating system.

  2. Not checked into source control systems such as Git.

  3. Considered as disposable – it should be simple to delete and recreate it from scratch. You don’t place any project code in the environment.

  4. Not considered as movable or copyable – you just recreate the same environment in the target location.

virtualenvwrapper

One-time setup for virtualenvwrapper

We will be using a tool called virtualenvwrapper to help manage the virtual environments we create. The virtualenvwrapper tool requires some one-time setup.

Open your ~/.bashrc file in a text editor and add the following lines to the end of the file:

export PYTHONPATH=
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3
export WORKON_HOME=/scratch/userid1/venvs    # REPLACE userid1 with your user id
source /usr/local/bin/virtualenvwrapper.sh

Save your ~/.bashrc file. You now need to either restart all of your terminal windows or run source ~/.bashrc at each open terminal’s prompt to incorporate these changes. When you restart or run the source command, you will get some output indicating that virtualenvwrapper has created a bunch of files for you. The output should look something like this:

virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/premkproject
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/postmkproject
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/initialize
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/premkvirtualenv
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/postmkvirtualenv
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/prermvirtualenv
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/postrmvirtualenv
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/predeactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/postdeactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/preactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/postactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/get_env_details

You shouldn’t see this output again, and you should not need to redo these setup steps.

Creating a venv using virtualenvwrapper

Every venv has a name. You can call it anything you’d like, but we will call the one we make cs91r, both in the examples below and throughout the course when referring to the venv we’ve built.

To create a venv using virtualenvwrapper, you will use the mkvirtualenv script. You should get output similar to that shown below:

stork[~]$ mkvirtualenv cs91r
created virtual environment CPython3.12.3.final.0-64 in 3520ms
  creator CPython3Posix(dest=/scratch/userid1/venvs/cs91r, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, via=copy, app_data_dir=/home/userid1/.local/share/virtualenv)
    added seed packages: pip==24.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/cs91r/bin/predeactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/cs91r/bin/postdeactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/cs91r/bin/preactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/cs91r/bin/postactivate
virtualenvwrapper.user_scripts creating /scratch/userid1/venvs/cs91r/bin/get_env_details
(cs91r) stork[~]$

Notice that the prompt changed. When you are running inside of a virtual environment, your prompt will change to indicate the venv you are using.

Exiting and starting a venv using virtualenvwrapper

Since you are now in a venv, let’s exit it. To do that, run the deactivate script at the prompt. Notice that your prompt reverts to its original state:

(cs91r) stork[~]$ deactivate
stork[~]$

To start a venv, you use the workon script followed by the name of your venv. You will know it is successful because your prompt will change to indicate that you are using the venv:

stork[~]$ workon cs91r
(cs91r) stork[~]$

Installing python packages

Now that we have our virtual environment, we will install some python packages. For this week, we will need to install the python packages spacy and lxml, along with some supporting packages. The commands in this section only need to be run once: after you’ve installed these packages in this venv, you won’t need to run them again.

These commands will produce a lot of output and may take 60-90 seconds to complete. I will only show a few lines of the output here:

$ pip install spacy lxml
...
Successfully installed MarkupSafe-3.0.2 annotated-types-0.7.0 blis-1.2.0 catalogue-2.0.10 certifi-2025.1.31 charset-normalizer-3.4.1 click-8.1.8 cloudpathlib-0.21.0 confection-0.1.5 cymem-2.0.11 idna-3.10 jinja2-3.1.6 langcodes-3.5.0 language-data-1.3.0 lxml-5.3.1 marisa-trie-1.2.1 markdown-it-py-3.0.0 mdurl-0.1.2 murmurhash-1.0.12 numpy-2.2.3 packaging-24.2 preshed-3.0.9 pydantic-2.10.6 pydantic-core-2.27.2 pygments-2.19.1 requests-2.32.3 rich-13.9.4 setuptools-76.0.0 shellingham-1.5.4 smart-open-7.1.0 spacy-3.8.4 spacy-legacy-3.0.12 spacy-loggers-1.0.5 srsly-2.5.1 thinc-8.3.4 tqdm-4.67.1 typer-0.15.2 typing-extensions-4.12.2 urllib3-2.3.0 wasabi-1.1.3 weasel-0.4.1 wrapt-1.17.2

$ python -m spacy download en_core_web_sm
...
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')

In future weeks we may install more packages, but for this week, we don’t need any more.

Spacy

This week, we are going to use the spacy python library for the first time. Spacy is designed to make it easy to use pre-trained models to analyze very large sets of data. In this lab, we’ll be using spacy as a skeleton for building our own tools.

1. Sentence segmentation

1.1 Overview

In the first part of the lab, you will write a simple sentence segmenter.

The /data/cs91r-s25/brown/ directory includes three text files taken from the Brown Corpus:

  1. editorial.txt

  2. adventure.txt

  3. romance.txt

The files do not indicate where one sentence ends and the next begins. In the data set you are working with, sentences can only end with one of 5 characters: period, colon, semi-colon, exclamation point and question mark.

However, there is a catch: not every period represents the end of a sentence, since many abbreviations (U.S.A., Dr., Mon., etc., etc.) can appear in the middle of a sentence, where their periods do not mark sentence boundaries. The text also has many examples where a colon is not the end of a sentence. The other three punctuation marks are nearly always unambiguously the ends of sentences. Yes, even semi-colons.

For each of the above files, we have also provided a file listing the line numbers (counting from 0) of the actual ends of sentences:

  1. editorial-eos.txt

  2. adventure-eos.txt

  3. romance-eos.txt

Your job is to write a sentence segmenter. We will add that segmenter to spacy’s processing pipeline.

1.2 Handling command-line arguments

We have provided you with some starter code in segmenter.py. Don’t worry about what the existing code does just yet.

Eventually, we’re going to want to run this program, which means we’re going to need main to do something other than pass. (Note: pass is a python statement that does nothing; it is useful as a placeholder when python expects an indented block of code and you don’t have anything to put there yet.) You’ll want to remove pass once you have something better to put there.

We’re going to want to be able to provide the segmenter.py program with command-line arguments: one required argument (a textfile) and one optional argument (the output file, specified using -o or --outfile). If the user does not specify an output file, the program should output to sys.stdout.

$ python3 ./segmenter.py --help
usage: segmenter.py [-h] [-o OUTFILE] textfile

Predict sentence segmentation for a file.

positional arguments:
  textfile              path to the unlabeled text file

options:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        write boundaries to file

Write the command-line interface for segmenter.py using the argparse module in python for processing command-line arguments. In addition to the module documentation, you may also find the argparse tutorial useful. We’ve already included import argparse at the top of the file to get you started.

Both arguments to segmenter.py will be files: the first is one that you read and the second is one that you write. Both should be declared using argparse’s FileType.

You can print out the arguments you get to verify that things are working, but you don’t need to do anything with the arguments just yet.
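As a rough sketch (not the only way to structure it, and the parse_args helper name is just an illustration), the argument setup might look something like this:

import argparse
import sys

def parse_args():
    parser = argparse.ArgumentParser(
        description="Predict sentence segmentation for a file.")
    parser.add_argument("textfile", type=argparse.FileType("r"),
                        help="path to the unlabeled text file")
    parser.add_argument("-o", "--outfile", type=argparse.FileType("w"),
                        default=sys.stdout,
                        help="write boundaries to file")
    return parser.parse_args()

Using FileType means argparse opens the files for you and, because of the default, args.outfile falls back to sys.stdout when -o is omitted.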

As in Lab 2, all print statements should be in your main() function, which should only be called if segmenter.py is run from the command line.

1.3 Creating a spacy Doc

Now that your segmenter.py program takes command-line arguments, you can use them to create a spacy Doc from the textfile. Pass the textfile argument into the create_doc function we’ve provided. The create_doc function returns a spacy Doc.
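For example (a sketch; check the provided create_doc to see whether it expects the open file object or a filename, and note that parse_args is the hypothetical helper from the sketch above):

args = parse_args()              # or however your main() gets the arguments
doc = create_doc(args.textfile)  # create_doc is provided in the starter code
print(len(doc))                  # number of tokens in the Doc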

Questions

  1. Create a spacy Doc using the create_doc function described in the section above. How many tokens are in the file editorial.txt? You can get the number of tokens in a spacy Doc by calling len() on the Doc object.

1.4 Understanding the tokenizer

The mytokenizer.py file contains some code that you do not need to modify in any way. However, it would be helpful if you understood what it does at a high level, even if the specifics aren’t clear.

The mytokenizer.py file defines two things:

  1. a tokenize function, which uses the same regular expression to split strings as you’ve already seen in previous labs, and

  2. a class called MyTokenizer that extends the Tokenizer class in spacy

The MyTokenizer class allows you to define a piece of spacy’s processing pipeline. In spacy, a pipeline is the sequence of steps spacy uses to convert raw text into a format that is ready for various text processing tasks.

In this step, we are replacing spacy’s default English tokenizer — a component of spacy’s processing pipeline — with our tokenizer. We are writing our own tokenizer because otherwise spacy would do much of the sentence segmentation for us. Since the goal of this lab is for you to write the segmenter, we are replacing the default tokenizer with a tokenizer that simply separates the non-alphabetic text (punctuation, dollar signs, etc.) from the letters and numbers in the text.

Optional deeper dive

The MyTokenizer class might be confusing to you since it likely uses python and/or spacy components that are new to you. If you’re curious, here are sources that will help you understand what’s going on.

  1. Section 9.5 of the python tutorial on classes, which talks about inheritance in python.

  2. A high-level explanation of the __call__ method from realpython.com.

  3. The documentation (and especially the description of the __init__ method) for the Doc class in spacy.

  4. The documentation (and especially the description of the __init__ method) for the Tokenizer class in spacy.

  5. The documentation for the Language class in spacy.

Questions

  1. Briefly explain the MyTokenizer class in your own words. (You don’t need to fully understand it, but use the web page text and the comments in the file to help.)

1.5 Baseline segmenter

In the segmenter.py file, write a function called baseline_segmenter that takes a spacy Doc as its only argument. This function will use a naive algorithm for determining when we’ve reached the end of each sentence.

This is called a baseline segmenter because we expect this very simple segmenter to get pretty good results without much work on our part. It sets a bar that our improvements will have to clear: we will compare our improved algorithm against this baseline to see how much better it performs.

The baseline segmenter will iterate through all of the tokens in the Doc (using for token in doc:) and predict which ones are the ends of sentences. Confusingly, instead of keeping track of the ends of sentences, spacy keeps track of the beginnings of sentences. In particular, the first token in each sentence in a spacy Doc has its is_sent_start attribute set to True. This means that for every token that you predict corresponds to the end of a sentence, you should set the is_sent_start attribute of the next token to True.

Recall that every sentence in our data set ends with one of the five tokens ['.', ':', ';', '!', '?']. Our baseline_segmenter will predict that every instance of one of these characters is the end of a sentence.

As shown below, you can access the text content of a Token in spacy through its .text attribute to determine if it matches one of these five tokens:

>>> my_token = doc[0]
>>> type(my_token)
<class 'spacy.tokens.token.Token'>
>>> my_token.text
'The'
>>> type(my_token.text)
<class 'str'>

Next, add your baseline_segmenter function to the pipeline of tools that will be called on every Doc that is created with your create_doc function. To do that, add the following line to the create_doc function just below where you set up your tokenizer:

nlp.tokenizer=MyTokenizer(nlp.vocab)  # (existing line)
nlp.add_pipe("segmenter")             # <--- ADD THIS LINE

Finally, update your main() function to write the token numbers corresponding to the predicted sentence boundaries to outfile, if it was specified as a command-line option, or print them to the screen if the outfile argument was omitted. Write the token numbers one per line. You can view /data/cs91r-s25/brown/editorial-eos.txt as an example.

Be sure to write out the last token number as a sentence boundary. Since spacy keeps track of the starts of new sentences, the end of the final sentence is never explicitly marked. You can iterate over the sentences in a spacy Doc using its sents attribute:

>>> doc = nlp("The cat in the hat came back, wreaked a lot of havoc on the way.")
>>> print(len(list(doc.sents)))
1
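One way to write the boundaries (a sketch, assuming the open output file is available as args.outfile, as in the earlier argparse sketch): iterate over doc.sents and print the index of each sentence’s last token. Because the final sentence runs to the end of the Doc, this automatically includes the last token number.

for sent in doc.sents:
    # sent[-1] is the last token of the sentence; .i is its index in the Doc
    print(sent[-1].i, file=args.outfile)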

Confirm that when you run your baseline_segmenter on the file editorial.txt, it predicts 3278 sentence boundaries.

Optional deeper dive

You can read more about the add_pipe function if you are interested.

Questions

  1. Briefly explain the create_doc function in your own words. (You don’t need to fully understand it, but use the web page text and the comments in the file to help.)

1.6 Evaluate your segmenter

To evaluate your system, we are providing you a program called evaluate.py that compares your hypothesized sentence boundaries with the ground truth boundaries. This program will report to you the true positives, true negatives, false positives and false negatives (as well as precision, recall and F-measure, which we haven’t talked about in class just yet). You can run evaluate.py with the -h option to see all of the command-line options that it supports.

A sample run with the output of your baseline segmenter from above stored in bl_editorial-eos.txt would be:

python3 evaluate.py -d /data/cs91r-s25/brown/ -c editorial -y bl_editorial-eos.txt

Run the evaluation on your baseline system’s output for the editorial category, and confirm that you get the following before moving on:

TP:    2719	FN:       0
FP:     559	TN:   60055

PRECISION: 82.95%	RECALL: 100.00%	F: 90.68%

1.7 Improving your segmenter

Now it’s time to improve the baseline sentence segmenter. We don’t have any false negatives (since we’re predicting that every instance of the possibly-end-of-sentence punctuation marks is, in fact, the end of a sentence), but we have quite a few false positives.

Make a copy of your baseline_segmenter function called my_best_segmenter. Be sure to copy the line that says @Language.component("baseline_segmenter") and change it to say @Language.component("my_best_segmenter"). In the create_doc function, change the add_pipe line to use "my_best_segmenter" instead of "baseline_segmenter" so that the new segmenter runs instead of the baseline one.

You can see the tokens that your system is mis-characterizing by setting the verbosity of evaluate.py to something greater than 0. Setting it to 1 will print out all of the false positives and false negatives so you can work to improve your my_best_segmenter function.

python3 evaluate.py -v 1 -d /data/cs91r-s25/brown/ -c editorial -y bl_editorial-eos.txt

To test your segmenter, we will run it on a hidden text you haven’t seen. It may not make sense to spend a lot of time fixing obscure cases that show up in the three texts we are providing, because these cases may never show up in the hidden text. But it is important to handle cases that occur multiple times, and even some cases that appear only once if you suspect they could appear in the hidden text. Write your code to be as general as possible so that it works well on the hidden text without producing too many false positives.
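As one possible starting point (a sketch only: the abbreviation set below is illustrative, not tuned, and the right treatment of colons is left to you):

# Illustrative only: not a complete or tuned abbreviation list.
ABBREVIATIONS = {"Dr", "Mr", "Mrs", "Ms", "Mon", "etc"}

@Language.component("my_best_segmenter")
def my_best_segmenter(doc):
    for token in doc:
        if token.i + 1 >= len(doc) or token.text not in ['.', ':', ';', '!', '?']:
            continue
        prev = doc[token.i - 1].text if token.i > 0 else ""
        if token.text == '.' and (prev in ABBREVIATIONS or len(prev) == 1):
            # Probably an abbreviation or an initial ("J. Smith"), not a boundary.
            continue
        doc[token.i + 1].is_sent_start = True
    return doc

Heuristics like the single-character check will introduce some false negatives; use the verbose output of evaluate.py to decide which rules help on balance.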

Questions

  1. Describe the improvements you made in my_best_segmenter to increase performance.

  2. For each of the three texts (editorial, adventure, romance), what was the final precision, recall and F-measure?

  3. If you didn’t get 100% precision and 100% recall, why did you stop improving your segmenter?

2. Exploring bias / using XML

In Lab 03, we used smoothing to deal with words that didn’t occur in a corpus and would otherwise be assigned a probability of 0 (as a word or as part of an ngram). As we start to use language for more complex tasks, related problems will arise.

For example, imagine you created a tool like ChatGPT that could generate newspaper articles. If you built this tool by only showing it the editorial.txt file, it might be able to generate text that looks like a newspaper editorial. However, if you then asked it to generate the kinds of stories in adventure.txt, it would be unable to do so because there are lots of words in adventure.txt, such as "bottle" and "window", that never show up in editorial.txt.

In this part of the assignment, we’ll explore this problem in more depth. To do that, we’ll start working with a larger dataset that contains news articles that have been classified as being "hyperpartisan". The data we will be working with in this lab was used as part of SemEval 2019, Task 4.

Each article will be labeled as either hyperpartisan ('true') or not ('false'). In addition, each 'true' hyperpartisan article will be labeled as having a "left" or "right" bias. Each 'false' hyperpartisan article will be labeled as having a "left-center", "right-center", or "least" (no particular) bias.

You should put your code for this part in the file called bias.py.

2.1 Extracting data from XML files

The dataset we will be using is in /data/cs91r-s25/semeval-19-04/. The file articles.xml contains all of the articles we’ll look at this week, and the file labels.xml contains the labeling of each article (is it hyperpartisan or not, and what is its bias). These files have 600,000 articles and 600,000 labels, respectively, so they are pretty large and hard to work with while you’re developing. Therefore, we’ve provided small_articles.xml and small_labels.xml, which contain only 12 articles and their labels. You can work with these while you work out the bugs.

Once you switch to the full data set, you’ll have to deal with the fact that the data file is big (around 2.6G) so we won’t store the whole thing in memory at once if we can help it. Fortunately, the lxml library gives us a way to iteratively parse through an xml file, dealing with one node (in our case, article) at a time. Here’s sample code that opens a file called myfile.xml and calls a function called my_func on every article node:

from lxml import etree

fp = open("myfile.xml", "rb")
for event, element in etree.iterparse(fp, events=("end",)):
    if element.tag == "article":
        my_func(element)    # process the article node (see below)
        element.clear()     # optimization to free up memory

As needed, refer back to the lxml tutorial we saw in clab. You won’t need to use the event variable, but the element variable will be a node in the XML tree. You will want to get the .items() of the element (which contains the attributes of the node such as id and title). And you will need to use the .itertext() method to get the text of the node.
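For example, inside my_func (or directly in the loop), you might pull out the attributes and the text something like this; the exact attribute names are whatever the file actually uses, so check with less first:

def my_func(element):
    attrs = dict(element.items())       # node attributes, e.g. attrs.get("id")
    text = "".join(element.itertext())  # all text nested inside this article node
    # ... do something with attrs and text ...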

Do not use findall or find. There are two reasons: (1) the dataset is very large and findall will be slow, and (2) this data has a lot of embedded HTML tags, so finding all of the text that is nested inside <p> and <a> tags is doable but non-trivial.

Before we continue, take a look at the contents of the articles.xml file and the labels.xml file. Look at the structure of the documents, especially at the start of each <article>.

$ less /data/cs91r-s25/semeval-19-04/articles.xml # or small_articles.xml
$ less /data/cs91r-s25/semeval-19-04/labels.xml   # or small_labels.xml

Questions

  1. If the articles are in one file and the labels are in the other file, how do you know how each article is labeled?

2.2 Get the XML contents

Write two functions, one called get_labels() and one called get_articles(). The get_labels function should take a filename (e.g. /data/cs91r-s25/semeval-19-04/labels.xml) and it should return a dictionary mapping article IDs to their bias ("left", "right", etc). The get_articles function should take a filename (e.g. /data/cs91r-s25/semeval-19-04/articles.xml) and should return a dictionary that maps article IDs to a spacy Doc containing the news article.

When creating spacy documents, don’t use your MyTokenizer from Part 1. We’ll be ignoring sentence boundaries for this part, so you don’t need to add your my_best_segmenter either. To initialize the spacy pipeline, you can just use the following:

from spacy.lang.en import English

nlp = English()

One problem we will need to worry about is that these XML files contain HTML entities. These files in particular include a lot of the entity &#160;, a non-breaking space. In order to clean these up, we can use a function in the html library before creating our spacy document:

import html
...
# the article text is stored in the variable text
doc = nlp(html.unescape(text))

Once you have both the dictionary of labels and the dictionary of articles, create another new dictionary whose keys are labels (e.g. "left", "right", etc). Each key should map to a list of all the articles (as spacy documents) labeled with that key.
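A hedged sketch of get_labels and the grouping step follows; it assumes the label nodes are <article> tags with id and bias attributes, so check the actual files (start with small_labels.xml) to confirm the tag and attribute names. get_articles would follow the same iterparse pattern, building a spacy Doc from the unescaped article text.

from lxml import etree

def get_labels(filename):
    labels = {}                       # article ID -> bias label
    with open(filename, "rb") as fp:
        for event, element in etree.iterparse(fp, events=("end",)):
            if element.tag == "article":
                attrs = dict(element.items())
                labels[attrs["id"]] = attrs["bias"]
                element.clear()
    return labels

def group_by_label(labels, articles):
    # label (e.g. "left") -> list of spacy Docs with that label
    by_label = {}
    for article_id, doc in articles.items():
        by_label.setdefault(labels[article_id], []).append(doc)
    return by_label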

Questions

  1. What percentage of the tokens that appear in the 'left' articles don’t appear in the 'least' articles? Put your answer in the table shown below, which is also included in your README.md file.

  2. What percentage of tokens that appear in the 'right' articles don’t appear in the 'least' articles? Put your answer in the table shown below.

  3. What if you considered types instead of tokens (for left vs. least and right vs. least)? Put your answers in the table shown below.

  4. Are you surprised by these results? Why or why not?
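To make the token/type distinction concrete, here is one hedged sketch of a helper for the token-level percentage; for Question 1 above, the target would be the 'left' articles and the reference the 'least' articles. A type-level version would compare sets of distinct token texts instead of counting every occurrence.

def percent_missing(target_docs, reference_docs):
    # Distinct token texts that appear anywhere in the reference articles.
    reference_types = set()
    for doc in reference_docs:
        reference_types.update(token.text for token in doc)

    # Count every token occurrence in the target articles that is unseen.
    total = missing = 0
    for doc in target_docs:
        for token in doc:
            total += 1
            if token.text not in reference_types:
                missing += 1
    return 100 * missing / total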

2.3 Bigram and trigram analysis

A bigram is a pair of words that occur together. For example, in the sentence "The cat sat on the mat", the bigrams are: "The cat", "cat sat", "sat on", "on the", "the mat". A trigram is a triplet of words that occur together. The trigrams are: "The cat sat", "cat sat on", "sat on the", "on the mat".

  • Write a function that takes as input a spacy Doc, and returns a list of all of the bigrams in the document.

  • Write a function that takes as input a spacy Doc, and returns a list of all of the trigrams in the document. (A minimal sketch of both functions is shown below.)
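A minimal sketch of both functions (the names, and the choice of tuples versus joined strings, are up to you):

def get_bigrams(doc):
    # Adjacent pairs of token texts.
    return [(doc[i].text, doc[i + 1].text) for i in range(len(doc) - 1)]

def get_trigrams(doc):
    # Adjacent triples of token texts.
    return [(doc[i].text, doc[i + 1].text, doc[i + 2].text)
            for i in range(len(doc) - 2)]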

Questions

The table shown below is included in your README.md file. When a question asks you to fill in the table, fill in the matching table in the README.md file. Be sure to record your answers in the cells marked Q8, Q9, Q10, Q12, and Q13.

  1. What percentage of the bigrams (tokens, not types) that appear in the 'left' articles don’t appear in the 'least' articles? What percentage of the bigrams that appear in the 'right' articles don’t appear in the 'least' articles? Report in the table shown below. Are you surprised by these results? Why or why not?

  2. What percentage of the trigrams (tokens, not types) that appear in the 'left' articles don’t appear in the 'least' articles? What percentage of the trigrams that appear in the 'right' articles don’t appear in the 'least' articles? Report in the table shown below. Are you surprised by these results? Why or why not?

Train   Test    Missing tokens   Missing types   Missing bigrams   Missing trigrams
Left    Least   Q8               Q10             Q12               Q13
Right   Least   Q9               Q10             Q12               Q13

2.4 Collecting statistics on pairs of categories

Instead of collecting statistics on one set of articles ("training") and seeing how closely these statistics match another set of articles ("testing"), what if you trained your tool on two of the sets (e.g. 'left' and 'least') and tested it on the third category (e.g. 'right')?
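Assuming the by_label grouping and the percent_missing helper sketched earlier, combining two categories for training is just list concatenation:

train_docs = by_label["left"] + by_label["least"]   # statistics collected from these
test_docs = by_label["right"]                       # held-out category
print(percent_missing(test_docs, train_docs))       # % of test tokens unseen in training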

Questions

  1. Does that change your results? Why? Try each of the three combinations shown below. Which worked best? Why? Your README.md includes the blank table shown below; fill it in with your results. Report percentages, not raw counts.

Train          Test    Missing tokens   Missing types   Missing bigrams   Missing trigrams
Left, Right    Least
Least, Right   Left
Least, Left    Right

How to turn in your solutions

Edit the README.md file that we provided to answer each of the questions and add any discussion you think we’d like to know about.

Be sure to commit and push all changes to your python files.

If you think it would be helpful, use asciinema to record a terminal session and include it in your README.md.