CS91R Lab 02: Counting
Due Wednesday, February 12, before midnight
Cloning your repository
Log into the CS91R-S25 github organization for our class and find your git repository for Lab 02, which will be of the format lab02-user1-user2, where user1 and user2 are the usernames for you and your partner.
You can clone your repository using the following steps while connected to the CS lab machines:
# cd into your cs91r/labs sub-directory and clone your lab02 repo
$ cd ~/cs91r/labs
$ git clone git@github.swarthmore.edu:CS91R-S25/lab02-user1-user2.git
# change directory to list its contents
$ cd ~/cs91r/labs/lab02-user1-user2
# ls should list the following contents
$ ls
README.md ngrams.py tokenizer.py tqdm_ex.py
1. Warmups
Put all of your code for part 1 and part 2 in a single python file called tokenizer.py. Demonstrate that your functions work by writing a main function.
The answers to each of the questions in the lab writeup should be placed in the README.md file.
You should call your main function using the following pattern, which will allow you to import this file in future labs without problems.
if __name__ == '__main__':
    main()
The main function should be the only place where you print anything.
1.1 get_words
Write a function called get_words that takes a string s as its only argument. The function should return a list of the words in the same order as they appeared in s. Note that in this question a "word" is defined as a "space-separated item". For example:
>>> get_words('The cat in the hat ate the rat in the vat')
['The', 'cat', 'in', 'the', 'hat', 'ate', 'the', 'rat', 'in', 'the', 'vat']
Hint: If you don’t know how to approach this problem, read about str.split().
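If you want to check your approach, here is a minimal sketch of get_words built on str.split() (one possible solution, not the only one):

def get_words(s):
    # with no arguments, str.split() splits on any run of whitespace
    return s.split()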
1.2 count_words
Write a function called count_words that takes a list of words as its only argument and returns a dictionary that maps each word to the number of times it occurred in the list. Use the output of the get_words function as the input to this function.
>>> s = 'The cat in the hat ate the rat in the vat'
>>> words = get_words(s)
>>> count_words(words)
{'The': 1, 'cat': 1, 'in': 2, 'the': 3, 'hat': 1, 'ate': 1, 'rat': 1, 'vat': 1}
Notice that this is somewhat unsatisfying because 'the' is counted separately from 'The'. We can easily fix this by lower-casing all of the words before counting them using str.lower:
>>> words = get_words(s.lower())
>>> count_words(words)
{'the': 4, 'cat': 1, 'in': 2, 'hat': 1, 'ate': 1, 'rat': 1, 'vat': 1}
Hints
- If you don’t have experience using dictionaries, read about python’s dict data structure. Be sure that you are confident in your ability to use dictionaries in python (as much as is needed for this lab) as we will be using them a lot in this class.
- If you have experience using python dictionaries, try using collections.defaultdict, an advanced kind of dictionary that can simplify your code.
- If you are confident about dictionaries, try using collections.Counter. And if you aren’t confident yet, come back and try this later because Counters are really useful!
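As a rough sketch (assuming the get_words output above), count_words can be written with a plain dict; collections.Counter would make it even shorter:

def count_words(words):
    counts = {}
    for word in words:
        # dict.get returns 0 the first time we see a word
        counts[word] = counts.get(word, 0) + 1
    return counts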
1.3 words_by_frequency
Write a function called words_by_frequency that takes a list of words as its only argument. The function should return a list of (word, count) tuples sorted by count such that the first item in the list is the most frequent item. The order of items with the same frequency does not matter (but you could try to sort such items alphabetically if you wanted to try that). You should use your count_words function to get the counts first.
>>> words_by_frequency(words)
[('the', 4), ('in', 2), ('cat', 1), ('hat', 1), ('ate', 1), ('rat', 1), ('vat', 1)]
Hints
To sort a list of tuples by the second field, use the operator.itemgetter function:
>>> import operator # be sure to import operator
>>> tupleList = [('cat', 9), ('hat', 5), ('rat', 12)]
>>> tupleList.sort(key=operator.itemgetter(1), reverse=True)
>>> tupleList
[('rat', 12), ('cat', 9), ('hat', 5)]
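Putting the hint together with count_words, a minimal sketch of words_by_frequency might look like this (one reasonable approach among several):

import operator

def words_by_frequency(words):
    counts = count_words(words)
    # counts.items() yields (word, count) pairs; sort by the count field
    return sorted(counts.items(), key=operator.itemgetter(1), reverse=True)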
2. Tokenization
In part 2 of the lab, you will explore some files from Project Gutenberg and improve on the code you just wrote by adding to and/or modifying the tokenizer.py
program you began in the previous section.
2.1 Finding the most frequent words
In the /data/cs91r-s25/gutenberg/trimmed/
directory there are a number of .txt
files containing texts found in the Project Gutenberg collection.
In python, read the contents of Lewis Carroll’s "Alice’s Adventures in Wonderland", which is stored in the file carroll-alice.txt. Use your words_by_frequency and count_words functions from Section 1 to explore the text. Assuming that you wrote your code properly in Section 1, and assuming that you lower-cased all of the words in the text, you should find that the five most frequent words in the text are:
the 1604
and 765
to 706
a 614
she 518
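As a sketch of one way to do this exploration (assuming the get_words and words_by_frequency functions from Section 1):

# read the whole text, lowercase it, and print the five most frequent words
with open('/data/cs91r-s25/gutenberg/trimmed/carroll-alice.txt', 'r') as infile:
    text = infile.read()

words = get_words(text.lower())
for word, count in words_by_frequency(words)[:5]:
    print(word, count)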
Questions
- Use your count_words function to find out how many times the word alice occurs in the text.
- The word alice actually appears 397 times in the text, though this is not the answer you got for the previous question. Why? Examine the data to see if you can figure it out before continuing.
2.2 Getting more accurate counts
We now see that there is a deficiency in how we implemented the get_words
function. When we are counting words, we probably don’t care whether the word was adjacent to a punctuation mark. For example, the word hatter
appears in the text 57 times, but if we queried the count_words
dictionary, we would see it only appeared 24 times. However, it also appeared numerous times adjacent to a punctuation mark so those instances got counted separately:
>>> word_freq = words_by_frequency(words)
>>> for (word, freq) in word_freq:
... if 'hatter' in word:
... print('%-10s %3d' % (word, freq))
...
hatter 24
hatter. 13
hatter, 10
hatter: 6
hatters 1
hatter's 1
hatter; 1
hatter." 1
Our get_words function would be better if it separated punctuation from words. We can accomplish this by using the re.split function. We used the re library last lab to create our first regular expressions. The re.split function is more powerful than python’s str.split function because it allows you to split a string based on a regular expression match rather than just a literal string.
Be sure to import re at the top of your file to make re.split work. Below is a small example that demonstrates how str.split works on a small text and compares it to using re.split:
>>> import re
>>> text = '"Oh no, no," said the little Fly, "to ask me is in vain."'
>>> text.split()
['"Oh', 'no,', 'no,"', 'said', 'the', 'little', 'Fly,', '"to', 'ask', 'me', 'is',
'in', 'vain."']
>>> rgx = re.compile(r'(\W)')
>>> rgx.split(text)
['', '"', 'Oh', ' ', 'no', ',', '', ' ', 'no', ',', '', '"', '', ' ', 'said', ' ', 'the',
' ', 'little', ' ', 'Fly', ',', '', ' ', '', '"', 'to', ' ', 'ask', ' ', 'me', ' ', 'is',
' ', 'in', ' ', 'vain', '.', '', '"', '']
Note that this is not exactly what we want, but it is a lot closer. In the resulting list, we find empty strings and spaces, but we have also successfully separated the punctuation from the words.
Using the above example as a guide, write and test a function called tokenize
that takes a string as an input and returns a list of words and punctuation, but not extraneous spaces and empty strings. You don’t need to modify the re.split
line we provided: just process the resulting list to remove spaces and empty strings.
>>> tokenize(text.lower())
['"', 'oh', 'no', ',', 'no', ',', '"', 'said', 'the', 'little',
'fly', ',', '"', 'to', 'ask', 'me', 'is', 'in', 'vain', '.', '"']
>>> print(' '.join(tokenize(text.lower())))
" oh no , no , " said the little fly , " to ask me is in vain . "
Explaining the re.split line
In the example above, the regular expression we used was r'(\W)'. The regular expression here is (\W). This incorporates two things we have not yet learned, and one thing that is worth emphasizing:
- In regular expression syntax, there are numerous predefined sets of characters that you can match against. For example, if you want to match against a number, you can use the set notated by \d. If you want to match any alphanumeric character (letters, numbers, or underscores), you can use the set notated by \w. If you want to match any whitespace (space, tab, newline, etc.), you can use the set notated by \s. Notice that these sets use lowercase letters. It is common practice to specify the inverse set using the respective capital letter. Therefore, to match anything that is not a digit, you would use \D. To match anything that is not whitespace, you would use \S. To match anything that is not alphanumeric, you would use \W, which is used in the regular expression in this example.
- In regular expression syntax, we can "capture" the part of the pattern that matched by putting the part we want to capture in parentheses. This allows us to refer to the specific section of the string that matched the portion of the regular expression in parentheses. We will see more examples of this later in the semester. In this context, capturing the non-alphanumeric characters tells the re.split function to include the punctuation in the list that it returns. If we try to run the line in the example above without the parentheses, we’ll see that it omits all punctuation from the returned list.

>>> rgx = re.compile(r'\W')
>>> rgx.split(text)
['', 'Oh', 'no', '', 'no', '', '', 'said', 'the', 'little', 'Fly', '', '', 'to', 'ask', 'me', 'is', 'in', 'vain', '', '']
- The regular expression here is not written in a normal python string. Notice that the first quotation mark is preceded by the letter r. This tells python that you are typing a raw string instead of a normal string. For this specific example, the raw string isn’t necessary, but it’s useful to think about why we might need it. Normally, python uses the \ symbol in strings to mean something special. Think about \n, the newline symbol. Perhaps you’ve also seen \t, the tab symbol. Python also uses \\, \', \", \r, \b, \f, \v and \x to mean special things inside of strings. If we actually wanted to print out \n in python, we would have two options:

>>> print('\\n') # escape the slash so it just means "slash" followed by n
\n
>>> print(r'\n') # use a raw string so python doesn't interpret \n as special
\n
It is common practice to precede the string containing a regular expression with r in case there are special symbols in your regular expression that python might unintentionally treat in a way you weren’t expecting.
Catching underscores
In the above example, you used this regular expression to split the text:
rgx = re.compile(r'(\W)')
It works really well, right?! The problem is that the set of
characters in \w
includes the _
(underscore) character. If
you try to run your program with anything that contains an underscore,
you’ll spot the error. You’ll see it a bunch of times in Alice in
Wonderland, such as: "There was nothing so _very_ remarkable
in that;" When you tokenize the text with the rgx
above, you’ll end up with _very_ as a word counted separately from very.
To fix this, include the underscore character in the set of characters
you want to split on. Just like other regular expressions you’ve seen
already, you can include \W in a set to say that you want to match \W or _:
rgx = re.compile(r'([\W_])')
In order to get the exact counts shown below, you will need to modify your regular expression to match the one above.
2.3 Finding the most frequent words, again
Use your tokenize function in conjunction with your words_by_frequency function to list the top 5 most frequent words in carroll-alice.txt. You should get this:
, 2425 <-- comma
the 1643
“ 1118 <-- open quotation mark
” 1114 <-- close quotation mark
. 988 <-- period
2.4 Filter out non-words
Write a function called filter_nonwords that takes a list of strings as input and returns a new list of strings that excludes anything that isn’t entirely alphabetic. You can use the str.isalpha method to determine whether a string consists of only alphabetic characters.
If you are more daring, try using a regular expression containing the special
alphanumeric set represented by \w
(and explained above).
>>> text = '"Oh no, no," said the little Fly, "to ask me is in vain."'
>>> tokens = tokenize(text)
>>> filter_nonwords(tokens)
['Oh', 'no', 'no', 'said', 'the', 'little', 'Fly', 'to', 'ask', 'me',
'is', 'in', 'vain']
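A minimal sketch of filter_nonwords using str.isalpha (one possible approach):

def filter_nonwords(tokens):
    # keep only strings made up entirely of alphabetic characters
    return [tok for tok in tokens if tok.isalpha()]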
Use this function to list the top 5 most frequent words in carroll-alice.txt:
the 1643
and 872
to 729
a 632
it 595
2.5 Comparing the most frequent words across corpora
Iterate through all of the files in the /data/cs91r-s25/gutenberg/trimmed/
directory and print out the top 5 words for each. To get a list of all the files
in a directory, use the os.listdir
function:
>>> import os
>>> from pathlib import Path
>>> directory = Path('/data/cs91r-s25/gutenberg/trimmed/')
>>> files = os.listdir(directory)
>>> files
['shakespeare-caesar.txt', 'whitman-leaves.txt', 'chesterton-thursday.txt', 'austen-sense.txt', 'chesterton-brown.txt', 'shakespeare-macbeth.txt', 'austen-persuasion.txt', 'milton-paradise.txt', 'burgess-busterbear.txt', 'melville-moby_dick.txt', 'README', 'austen-emma.txt', 'blake-poems.txt', 'bryant-stories.txt', 'bible-kjv.txt', 'chesterton-ball.txt', 'edgeworth-parents.txt', 'carroll-alice.txt', 'shakespeare-hamlet.txt']
>>> infile = open(directory / files[0], 'r')
This example also uses the pathlib
library that you might want to read about.
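A rough sketch of this loop, assuming the tokenize, filter_nonwords, and words_by_frequency functions from earlier (the exact output formatting is up to you):

import os
from pathlib import Path

directory = Path('/data/cs91r-s25/gutenberg/trimmed/')
for filename in os.listdir(directory):
    if not filename.endswith('.txt'):
        continue   # skip the README and anything else that isn't a text file
    with open(directory / filename, 'r') as infile:
        tokens = filter_nonwords(tokenize(infile.read().lower()))
    print(filename)
    for word, count in words_by_frequency(tokens)[:5]:
        print(' ', word, count)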
Questions
- Loop through all the files in the /data/cs91r-s25/gutenberg/trimmed/ directory that end in .txt. Is 'the' always the most common word? If not, what are some other words that show up as the most frequent word?
- If you don’t lowercase all the words before you count them, how does this result change, if at all?
Sidenote: tqdm
When you create a program that takes a long time to run, it’s often helpful to get some kind of status report on its progress. That way, you have some estimate for how long the program will take to run, and you can make sure you aren’t sitting around waiting for a program that will take thousands of years to finish (or never, if you’re in an infinite loop). That’s where the tqdm library comes in.
Here’s a simple python program that sleeps between each iteration of a for loop. It shows you three possible ways to write such a loop: no progress, printing progress, and using tqdm to show progress. This example should help you see how tqdm works and is included in your repository with the name tqdm_ex.py:
from tqdm import tqdm
from time import sleep

print("don't show any progress")
for step in range(10):
    sleep(0.5)

print("show progress using a print statement")
for step in range(10):
    print(f'{step=}')
    sleep(0.5)

print("show progress using tqdm")
for step in tqdm(range(10)):
    sleep(0.5)
To use tqdm, you simply need to wrap the sequence in your for loop with the function tqdm(…). This syntax works for any sequence that you can use in a for loop. When python can determine the length of the sequence by calling len, tqdm will provide a time estimate for completion. For sequences where the length is unknown (e.g. reading from sys.stdin), tqdm is unable to provide a time estimate but will still provide updates to let you know your program is still running.
In the next section, you will be reading from multiple very large files. Each file may take 60-90 seconds to process, and there are 24 files. Having some indication that your program is making progress is really helpful!
Sidenote: /scratch directory
The CS network has a lot of disk storage. However, we set quotas on the total
amount of disk storage each student can use because of the time and cost
of backing up the data, not storing the original data. There will be many
occasions where you’d like to store large files, but you don’t want to store
them in your own directory because of the disk quota limitations. That’s
where the /scratch directory comes in. The CS department already has a lot of your "disposable" files set up to be stored in /scratch: your Downloads directory, your vscode configuration files, and other files that don’t get backed up. But you can also use the /scratch directory to store large files that you create in your classes.
To access your own /scratch directory, just cd to it:
$ cd /scratch/userid1 # replace userid1 with your user id
$ ls
You may see some files we’ve been storing there. You can now make a directory just for this class, and even directories for each lab if you wanted. You could even clone your lab repos here (provided you remember to push regularly since this directory won’t be backed up.)
In the next section, you will be creating a large output file (between 150MB and 300MB).
It would be good to store that output file in your /scratch
directory!
3. Google Ngrams
Have you ever used Google’s Books Ngram Viewer? NO!? Go do that now! When you’re done playing with it, come back here.
Google has a very large collection of digitized books. They have gone through each book and counted how many times each token appeared in the collection. Single tokens in isolation are called unigrams, sometimes written as 1-grams. Google also counted how often words occurred next to each other in text (e.g. "the chair", "New York"), called bigrams or 2-grams. In general, a sequence of n words is called an n-gram. We use the words unigram, bigram and trigram to talk about 1-grams, 2-grams and 3-grams. Beyond 3, we just say n-gram (e.g. 4-gram or 5-gram).
Google makes the data they collected available for download. We have saved the unigram data in /data/cs91r-s25/ngrams/1gms. The files are compressed using gzip. Do not unzip the files! You can view a single file using either the zcat command in combination with less, or the zless command:
$ cd /data/cs91r-s25/ngrams/1gms
$ zcat 1-00023-of-00024.gz | less
$ zless 1-00023-of-00024.gz
In this file, the first line counts how often the token www.socialstudies appears. The second line has counts for πόνοι and the third line has counts for yellowness. The format of each line is as follows:
- The word we are counting is the first element on the line.
- Then, separated by tabs, each block of three numbers represents a year, the number of times that ngram appeared that year, and the number of unique books that ngram was found in that year.
So, using the line for yellowness as an example, we find that the word yellowness appeared one time total, in one book, in the year 1481. In the year 1785, the word yellowness appeared 23 total times spread across 17 different books. In 2019, the word appeared 1358 total times across 827 different books.
yellowness 1481,1,1 ... 1785,23,17 ... 2019,1358,827
Questions
- Read the man pages on zcat and zless. Describe some interesting features of both programs that perhaps you didn’t expect.

$ man zcat
$ man zless
Summarizing the counts
The goal of this part of the lab is to summarize those compressed files into the output below, where we have only saved the word and the total number of times the word occurred across all 24 files. (You do not need to keep track of how many volumes each word occurred in, only the total counts.)
...
yellowneck 192
yellownecked 3264
yellownecks 146
yellowneedles 253
yellownefs 295
yellownes 260
yellowness 293469
yellownessblueness 46
...
Write your solution to this part of the lab in ngrams.py. Hints and requirements are below.
Reading a gzip file
Your goal is to read these files in python without decompressing the files first. To do that, see the python gzip library documentation. Try to print out the first 3 lines of the file 1-00023-of-00024.gz in python using the gzip library to open the file. In the documentation, note that one of the options when you open the file is the mode. Be sure to set the mode to rt (read, text); otherwise python will interpret the data incorrectly.
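As a minimal sketch (using the filename mentioned above), this is one way to open the compressed file and print its first 3 lines:

import gzip

path = '/data/cs91r-s25/ngrams/1gms/1-00023-of-00024.gz'
# mode 'rt' decompresses the data and decodes it as text
with gzip.open(path, mode='rt') as infile:
    for i, line in enumerate(infile):
        print(line.rstrip())
        if i == 2:   # stop after the first 3 lines
            break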
Note that these compressed files are very large and they take a while to read over the file system and decompress. When your program is complete, it may take up to 30 minutes to process all 24 files. Therefore, you should definitely be testing your code step-by-step to be sure things are working!
Command-line interface
Your program should have the following interface:
$ python3 ngrams.py file1 [file2 file3 file4]...
That means that when you run the program, you will specify at least one parameter (that will be stored in sys.argv[1]), but you can specify as many files as you’d like on the command-line. You can use the length of sys.argv to figure out how many parameters you’ve passed to your program.
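A rough sketch of this command-line handling (the process_file helper here is hypothetical; it stands in for the real work described below):

import sys

def process_file(filename):
    # hypothetical placeholder: open the file and do the counting described below
    print('processing', filename)

def main():
    if len(sys.argv) < 2:
        print('usage: python3 ngrams.py file1 [file2 file3 file4]...')
        sys.exit(1)
    for filename in sys.argv[1:]:
        process_file(filename)

if __name__ == '__main__':
    main()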
Try out your interface using just one file and output the first 3 lines of the file as you did before. Next, provide multiple files as parameters and output the first 3 lines of each file you provide.
Filtering the words
You will notice that there is a lot of junk in these files. By junk, we mean things that aren’t words. You can view some of the "words" Google has counted by looking at the first column of one of the files:
$ cd /data/cs91r-s25/ngrams/1gms
$ zcat 1-00023-of-00024.gz | cut -f1 | less
Many of these are not "words" that we would care about (e.g. strings with characters that aren’t English letters or strings that have numbers and punctuation in them). You only want to count words that meet the following criteria:
- the word must start with one or more letters A-Z (lowercase or uppercase) or a hyphen.
- after the letters and hyphens, a word may have:
  - no additional characters, e.g. yellowness
  - a single underscore followed by one or more capital letters, e.g. third-order_NOUN
  - a single period followed by one or more numbers, e.g. unresolved.64
  - both a single period followed by numbers and a single underscore followed by capital letters, e.g. tracts.78_NUM
- For every word that meets these criteria, you will remove all characters after (and including) the first period or underscore. So, from the examples above, third-order_NOUN will become third-order, unresolved.64 will become unresolved, and tracts.78_NUM will become tracts; the word yellowness remains unchanged. (One way to express these criteria as a regular expression is sketched after this list.)
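Here is one possible way to express the criteria and the cleanup step with a single regular expression. This is an assumption on my part (the pattern and the clean_word name are not part of the assignment), and you may need to adjust it to reproduce the expected output exactly:

import re

# letters/hyphens, then optionally .digits, then optionally _CAPITALS
WORD_RGX = re.compile(r'^([A-Za-z-]+)(\.[0-9]+)?(_[A-Z]+)?$')

def clean_word(token):
    """Return the cleaned word if token meets the criteria, otherwise None."""
    match = WORD_RGX.match(token)
    if match is None:
        return None
    # keep only the part before the first period or underscore
    return match.group(1)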
Run your program using the 1-00023-of-00024.gz file as the only parameter and print out all of the words in the order you find them. Do not print anything except for the word (the first column of text in the file). Only include the words that meet the criteria above and remove extraneous characters (e.g. .78_NUM) as described above. You should get the following output for file 00023:
$ python3 ngrams.py /data/cs91r-s25/ngrams/1gms/1-00023-of-00024.gz > out.txt
$ head out.txt
yellowness
tkep
xxvii
ukupnom
thyreoarytaenoideus
thishis
theOther
wineswilling
tracts
unterschiedne
It’s important to notice that before you removed the extraneous characters,
each word only appeared once in the file. But now, words like yellowness
appear multiple times:
$ grep '^yellowness$' out.txt | sort | uniq -c
5 yellowness
This is because the word yellowness appears in 1-00023-of-00024.gz as yellowness, yellowness_ADJ, yellowness_X, yellowness_VERB, and yellowness_NOUN. This will be important in the next step.
Summarizing the counts
For each line that meets the criteria, you want to count the total number of times that the word occurred across all years. Do not count the number of volumes that the word appeared in. If you total up each line and print it out, you should get this output:
yellowness 146775
tkep 86
xxvii 153
ukupnom 335
thyreoarytaenoideus 149
...
However, remember that words can appear multiple times, as we saw with yellowness above. If we grep for the appearances of yellowness after calculating the totals, we’d see:
yellowness 146775
yellowness 8342
yellowness 244
yellowness 2346
yellowness 135762
So, rather than printing the totals out as soon as you calculate them, save the counts in a dictionary (or collections.Counter). Then, after you’ve gone through all of the files on the command-line, print out the results. Here’s a sample of my output where you’ll notice we are now summing all the appearances of yellowness:
$ python3 ngrams.py /data/cs91r-s25/ngrams/1gms/1-00023-of-00024.gz > out.txt
$ grep yellowness out.txt
yellowness 293469
yellownesses 122
yellownesse 489
yellownessblueness 46
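A rough sketch of this accumulation step, assuming the hypothetical clean_word helper sketched earlier and a gzip reader like the one shown above:

import gzip
from collections import Counter

def count_words_in_files(filenames):
    totals = Counter()
    for filename in filenames:
        with gzip.open(filename, mode='rt') as infile:
            for line in infile:
                fields = line.rstrip('\n').split('\t')
                word = clean_word(fields[0])   # clean_word sketched above
                if word is None:
                    continue
                # each remaining field is "year,count,volumes"; sum the counts
                totals[word] += sum(int(block.split(',')[1]) for block in fields[1:])
    return totals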
Note: Most words appear in only one file, but a very small number of words, once you’ve removed the extraneous characters, appear in multiple files. One such example is Kids, which appears in file 00010 and file 00011. You can test to be sure your program handles those cases correctly by verifying that if you run the program only on file 00010 you should count 3643441 appearances of Kids; if you run only on file 00011 you should count 3639068 appearances of Kids; and if you run it on both files you should count 7282509, the sum of both files.
Running on all of the gzip files
Remember, each of these files has a lot of words, and decompressing them over the file system and then processing them will take a fair amount of time. Do not run on all 24 files until you are fairly certain your code is working. You should expect this step to take about 30 minutes. Of course, your particular implementation may make this slower or faster, but be patient!
Don’t forget to use tqdm!
You can run on all 24 files by using file globbing to grab all gzip files in the directory. Notice below that the output is being directed to a file in the /scratch directory. You can store it anywhere you’d like in /scratch, but you’ll need to replace userid1 with your own user id.
$ python3 ngrams.py /data/cs91r-s25/ngrams/1gms/*.gz > /scratch/userid1/cs91r/ngrams.txt
gzip the output
You’ll notice that the output file is quite large (nearly 300MB). You should compress that file, too, using gzip:
$ cd /scratch/userid1/cs91r/
$ gzip ngrams.txt # this will take about 20 seconds
Or you could gzip it directly as you run your program (shown without the /scratch directory to prevent word-wrapping):
$ python3 ngrams.py /data/cs91r-s25/ngrams/1gms/*.gz | gzip > ngrams.txt.gz
Questions
- Using the ngrams.txt file (or ngrams.txt.gz file), what are the most frequent words in the ngram data set and what are their counts? Do the ten most frequent words match what you found in the Project Gutenberg novels from the previous section?
- How did you find the most frequent words from this file?
  - a. If you used a python program to do it, put the source code of that file in a code block.
  - b. If you performed this step on the command-line, put the command-line you used in a code block.
How to turn in your solutions
Edit the markdown file that we provided to answer each of the questions and add any discussion you think we’d like to know about.
Be sure to commit and push all changes to your python files.
If you think it would be helpful, use asciinema to record a terminal session and include it in your markdown. See Lab 1 for a reminder of how to use asciinema.