CS91R Lab 02: Counting
Due Wednesday, February 12, before midnight
Cloning your repository
Log into the CS91R-S25 github organization for our class and find your git repository for Lab 02, which will be of the format lab02-user1-user2, where user1 and user2 are the usernames for you and your partner.
You can clone your repository using the following steps while connected to the CS lab machines:
# cd into your cs91r/labs sub-directory and clone your lab02 repo
$ cd ~/cs91r/labs
$ git clone git@github.swarthmore.edu:CS91R-S25/lab02-user1-user2.git
# change directory to list its contents
$ cd ~/cs91r/labs/lab02-user1-user2
# ls should list the following contents
$ ls
README.md ngrams.py tokenizer.py tqdm_ex.py
1. Warmups
Put all of your code for part 1 and part 2 in a single python file called tokenizer.py. Demonstrate that your functions work by writing a main function.
The answers to each of the questions in the lab writeup should be placed in the README.md file.
You should call your main function using the following pattern, which will allow you to import this file in future labs without problems.
if __name__ == '__main__':
    main()
The main function should be the only place where you print anything.
1.1 get_words
Write a function called get_words that takes a string s as its only argument. The function should return a list of the words in the same order as they appeared in s. Note that in this question a "word" is defined as a "space-separated item". For example:
>>> get_words('The cat in the hat ate the rat in the vat')
['The', 'cat', 'in', 'the', 'hat', 'ate', 'the', 'rat', 'in', 'the', 'vat']
Hint: If you don’t know how to approach this problem, read about str.split().
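If you want to check your approach, here is a minimal sketch of get_words built on str.split() (one possible solution, not the only one):

def get_words(s):
    # with no arguments, str.split() splits on any run of whitespace
    return s.split()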
1.2 count_words
Write a function called count_words that takes a list of words as its only argument and returns a dictionary that maps each word to the number of times it occurred in the list. Use the output of the get_words function as the input to this function.
>>> s = 'The cat in the hat ate the rat in the vat'
>>> words = get_words(s)
>>> count_words(words)
{'The': 1, 'cat': 1, 'in': 2, 'the': 3, 'hat': 1, 'ate': 1, 'rat': 1, 'vat': 1}
Notice that this is somewhat unsatisfying because 'the' is counted separately from 'The'. We can easily fix this by lower-casing all of the words before counting them using str.lower:
>>> words = get_words(s.lower())
>>> count_words(words)
{'the': 4, 'cat': 1, 'in': 2, 'hat': 1, 'ate': 1, 'rat': 1, 'vat': 1}
Hints
- If you don’t have experience using dictionaries, read about python’s dict data structure. Be sure that you are confident in your ability to use dictionaries in python (as much as is needed for this lab) as we will be using them a lot in this class.
- If you have experience using python dictionaries, try using collections.defaultdict, an advanced kind of dictionary that can simplify your code.
- If you are confident about dictionaries, try using collections.Counter. And if you aren’t confident yet, come back and try this later because Counters are really useful!
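As a rough sketch (assuming the get_words output above), count_words can be written with a plain dict; collections.Counter would make it even shorter:

def count_words(words):
    counts = {}
    for word in words:
        # dict.get returns 0 the first time we see a word
        counts[word] = counts.get(word, 0) + 1
    return counts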
1.3 words_by_frequency
Write a function called words_by_frequency that takes a list of words as its only argument. The function should return a list of (word, count) tuples sorted by count such that the first item in the list is the most frequent item. The order of items with the same frequency does not matter (but you could try to sort such items alphabetically if you wanted to try that). You should use your count_words function to get the counts first.
>>> words_by_frequency(words)
[('the', 4), ('in', 2), ('cat', 1), ('hat', 1), ('ate', 1), ('rat', 1), ('vat', 1)]
Hints
To sort a list of tuples by the second field, use the operator.itemgetter function:
>>> import operator # be sure to import operator
>>> tupleList = [('cat', 9), ('hat', 5), ('rat', 12)]
>>> tupleList.sort(key=operator.itemgetter(1), reverse=True)
>>> tupleList
[('rat', 12), ('cat', 9), ('hat', 5)]
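Putting the hint together with count_words, a minimal sketch of words_by_frequency might look like this (one reasonable approach among several):

import operator

def words_by_frequency(words):
    counts = count_words(words)
    # counts.items() yields (word, count) pairs; sort by the count field
    return sorted(counts.items(), key=operator.itemgetter(1), reverse=True)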
2. Tokenization
In part 2 of the lab, you will explore some files from Project Gutenberg and improve on the code you just wrote by adding to and/or modifying the tokenizer.py
program you began in the previous section.
2.1 Finding the most frequent words
In the /data/cs91r-s25/gutenberg/trimmed/
directory there are a number of .txt
files containing texts found in the Project Gutenberg collection.
In python, read the contents of Lewis Carroll’s "Alice’s Adventures in Wonderland", which is stored in the file carroll-alice.txt. Use your words_by_frequency and count_words functions from Section 1 to explore the text. Assuming that you wrote your code properly in Section 1, and assuming that you lower-cased all of the words in the text, you should find that the five most frequent words in the text are:
the 1604
and 765
to 706
a 614
she 518
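As a sketch of one way to do this exploration (assuming the get_words and words_by_frequency functions from Section 1):

# read the whole text, lowercase it, and print the five most frequent words
with open('/data/cs91r-s25/gutenberg/trimmed/carroll-alice.txt', 'r') as infile:
    text = infile.read()

words = get_words(text.lower())
for word, count in words_by_frequency(words)[:5]:
    print(word, count)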
Questions
- Use your count_words function to find out how many times the word alice occurs in the text.
- The word alice actually appears 397 times in the text, though this is not the answer you got for the previous question. Why? Examine the data to see if you can figure it out before continuing.
2.2 Getting more accurate counts
We now see that there is a deficiency in how we implemented the get_words
function. When we are counting words, we probably don’t care whether the word was adjacent to a punctuation mark. For example, the word hatter
appears in the text 57 times, but if we queried the count_words
dictionary, we would see it only appeared 24 times. However, it also appeared numerous times adjacent to a punctuation mark so those instances got counted separately:
>>> word_freq = words_by_frequency(words)
>>> for (word, freq) in word_freq:
... if 'hatter' in word:
... print('%-10s %3d' % (word, freq))
...
hatter 24
hatter. 13
hatter, 10
hatter: 6
hatters 1
hatter's 1
hatter; 1
hatter." 1
Our get_words function would be better if it separated punctuation from words. We can accomplish this by using the re.split function. We used the re library last lab to create our first regular expressions. The re.split function is more powerful than python’s str.split function because it allows you to split a string based on a regular expression match rather than just a literal string.
Be sure to import re at the top of your file to make re.split work. Below is a small example that demonstrates how str.split works on a small text and compares it to using re.split:
>>> import re
>>> text = '"Oh no, no," said the little Fly, "to ask me is in vain."'
>>> text.split()
['"Oh', 'no,', 'no,"', 'said', 'the', 'little', 'Fly,', '"to', 'ask', 'me', 'is',
'in', 'vain."']
>>> rgx = re.compile(r'(\W)')
>>> rgx.split(text)
['', '"', 'Oh', ' ', 'no', ',', '', ' ', 'no', ',', '', '"', '', ' ', 'said', ' ', 'the',
' ', 'little', ' ', 'Fly', ',', '', ' ', '', '"', 'to', ' ', 'ask', ' ', 'me', ' ', 'is',
' ', 'in', ' ', 'vain', '.', '', '"', '']
Note that this is not exactly what we want, but it is a lot closer. In the resulting list, we find empty strings and spaces, but we have also successfully separated the punctuation from the words.
Using the above example as a guide, write and test a function called tokenize
that takes a string as an input and returns a list of words and punctuation, but not extraneous spaces and empty strings. You don’t need to modify the re.split
line we provided: just process the resulting list to remove spaces and empty strings.
>>> tokenize(text.lower())
['"', 'oh', 'no', ',', 'no', ',', '"', 'said', 'the', 'little',
'fly', ',', '"', 'to', 'ask', 'me', 'is', 'in', 'vain', '.', '"']
>>> print(' '.join(tokenize(text.lower())))
" oh no , no , " said the little fly , " to ask me is in vain . "
Explaining the re.split line
In the example above, the regular expression we used was r'(\W)'. The regular expression here is (\W). This incorporates two things we have not yet learned, and one thing that is worth emphasizing:
- In regular expression syntax, there are numerous predefined sets of characters that you can match against. For example, if you want to match against a number, you can use the set notated by \d. If you want to match any alphanumeric character (letters, numbers, or underscores), you can use the set notated by \w. If you want to match any whitespace (space, tab, newline, etc.), you can use the set notated by \s. Notice that these sets use lowercase letters. It is common practice to specify the inverse set using the respective capital letter. Therefore, to match anything that is not a digit, you would use \D. To match anything that is not whitespace, you would use \S. To match anything that is not alphanumeric, you would use \W, which is used in the regular expression in this example.
- In regular expression syntax, we can "capture" the part of the pattern that matched by putting the part we want to capture in parentheses. This allows us to refer to the specific section of the string that matched the portion of the regular expression in parentheses. We will see more examples of this later in the semester. In this context, capturing the non-alphanumeric characters tells the re.split function to include the punctuation in the list that it returns. If we try to run the line in the example above without the parentheses, we’ll see that it omits all punctuation from the returned list.

>>> rgx = re.compile(r'\W')
>>> rgx.split(text)
['', 'Oh', 'no', '', 'no', '', '', 'said', 'the', 'little', 'Fly', '', '', 'to', 'ask', 'me', 'is', 'in', 'vain', '', '']
- The regular expression here is not written in a normal python string. Notice that the first quotation mark is preceded by the letter r. This tells python that you are typing a raw string instead of a normal string. For this specific example, the raw string isn’t necessary, but it’s useful to think about why we might need it. Normally, python uses the \ symbol in strings to mean something special. Think about \n, the newline symbol. Perhaps you’ve also seen \t, the tab symbol. Python also uses \\, \', \", \r, \b, \f, \v and \x to mean special things inside of strings. If we actually wanted to print out \n in python, we would have two options:

>>> print('\\n') # escape the slash so it just means "slash" followed by n
\n
>>> print(r'\n') # use a raw string so python doesn't interpret \n as special
\n
It is common practice to precede the string containing a regular expression with r in case there are special symbols in your regular expression that python might unintentionally treat in a way you weren’t expecting.
Catching underscores
In the above example, you used this regular expression to split the text:
rgx = re.compile(r'(\W)')
It works really well, right?! The problem is that the set of
characters in \w
includes the _
(underscore) character. If
you try to run your program with anything that contains an underscore,
you’ll spot the error. You’ll see it a bunch of times in Alice in
Wonderland, such as: "There was nothing so _very_ remarkable
in that;" When you tokenize the text with the rgx
above, you’ll end up with _very_ as a word counted separately from very.
To fix this, include the underscore character in the set of characters
you want to split on. Just like other regular expressions you’ve seen
already, you can include \W in a set to say that you want to match \W or _:
rgx = re.compile(r'([\W_])')
In order to get the exact counts shown below, you will need to modify your regular expression to match the one above.
2.3 Finding the most frequent words, again
Use your tokenize function in conjunction with your words_by_frequency function to list the top 5 most frequent words in carroll-alice.txt. You should get this:
, 2425 <-- comma
the 1643
“ 1118 <-- open quotation mark
” 1114 <-- close quotation mark
. 988 <-- period
2.4 Filter out non-words
Write a function called filter_nonwords that takes a list of strings as input and returns a new list of strings that excludes anything that isn’t entirely alphabetic. You can use the str.isalpha method to determine whether a string consists of only alphabetic characters.
If you are more daring, try using a regular expression containing the special
alphanumeric set represented by \w
(and explained above).
>>> text = '"Oh no, no," said the little Fly, "to ask me is in vain."'
>>> tokens = tokenize(text)
>>> filter_nonwords(tokens)
['Oh', 'no', 'no', 'said', 'the', 'little', 'Fly', 'to', 'ask', 'me',
'is', 'in', 'vain']
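A minimal sketch of filter_nonwords using str.isalpha (one possible approach):

def filter_nonwords(tokens):
    # keep only strings made up entirely of alphabetic characters
    return [tok for tok in tokens if tok.isalpha()]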
Use this function to list the top 5 most frequent words in carroll-alice.txt:
the 1643
and 872
to 729
a 632
it 595
2.5 Comparing the most frequent words across corpora
Iterate through all of the files in the /data/cs91r-s25/gutenberg/trimmed/
directory and print out the top 5 words for each. To get a list of all the files
in a directory, use the os.listdir
function:
>>> import os
>>> from pathlib import Path
>>> directory = Path('/data/cs91r-s25/gutenberg/trimmed/')
>>> files = os.listdir(directory)
>>> files
['shakespeare-caesar.txt', 'whitman-leaves.txt', 'chesterton-thursday.txt', 'austen-sense.txt', 'chesterton-brown.txt', 'shakespeare-macbeth.txt', 'austen-persuasion.txt', 'milton-paradise.txt', 'burgess-busterbear.txt', 'melville-moby_dick.txt', 'README', 'austen-emma.txt', 'blake-poems.txt', 'bryant-stories.txt', 'bible-kjv.txt', 'chesterton-ball.txt', 'edgeworth-parents.txt', 'carroll-alice.txt', 'shakespeare-hamlet.txt']
>>> infile = open(directory / files[0], 'r')
This example also uses the pathlib
library that you might want to read about.
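A rough sketch of this loop, assuming the tokenize, filter_nonwords, and words_by_frequency functions from earlier (the exact output formatting is up to you):

import os
from pathlib import Path

directory = Path('/data/cs91r-s25/gutenberg/trimmed/')
for filename in os.listdir(directory):
    if not filename.endswith('.txt'):
        continue   # skip the README and anything else that isn't a text file
    with open(directory / filename, 'r') as infile:
        tokens = filter_nonwords(tokenize(infile.read().lower()))
    print(filename)
    for word, count in words_by_frequency(tokens)[:5]:
        print(' ', word, count)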
Questions
- Loop through all the files in the /data/cs91r-s25/gutenberg/trimmed/ directory that end in .txt. Is 'the' always the most common word? If not, what are some other words that show up as the most frequent word?
- If you don’t lowercase all the words before you count them, how does this result change, if at all?
Sidenote: tqdm
When you create a program that takes a long time to run, it’s often helpful to get some kind of status report on its progress. That way, you have some estimate for how long the program will take to run, and you can make sure you aren’t sitting around waiting for a program that will take thousands of years to finish (or never, if you’re in an infinite loop). That’s where the tqdm library comes in.
Here’s a simple python program that sleeps between each iteration of a for loop. It shows you three possible ways to write such a loop: no progress, printing progress, and using tqdm to show progress. This example should help you see how tqdm works and is included in your repository with the name tqdm_ex.py:
from tqdm import tqdm
from time import sleep

print("don't show any progress")
for step in range(10):
    sleep(0.5)

print("show progress using a print statement")
for step in range(10):
    print(f'{step=}')
    sleep(0.5)

print("show progress using tqdm")
for step in tqdm(range(10)):
    sleep(0.5)
To use tqdm, you simply need to wrap the sequence in your for loop with the function tqdm(…). This syntax works for any sequence that you can use in a for loop. When python can determine the length of the sequence by calling len, tqdm will provide a time estimate for completion. For sequences where the length is unknown (e.g. reading from sys.stdin), tqdm is unable to provide a time estimate but will still provide updates to let you know your program is still running.
In the next section, you will be reading from multiple very large files. Each file may take 60-90 seconds to process, and there are 24 files. Having some indication that your program is making progress is really helpful!
Sidenote: /scratch directory
The CS network has a lot of disk storage. However, we set quotas on the total
amount of disk storage each student can use because of the time and cost
of backing up the data, not storing the original data. There will be many
occasions where you’d like to store large files, but you don’t want to store
them in your own directory because of the disk quota limitations. That’s
where the /scratch directory comes in. The CS department already has a lot of your "disposable" files set up to be stored in /scratch: your Downloads directory, your vscode configuration files, and other files that don’t get backed up. But you can also use the /scratch directory to store large files that you create in your classes.
To access your own /scratch directory, just cd to it:
$ cd /scratch/userid1 # replace userid1 with your user id
$ ls
You may see some files we’ve been storing there. You can now make a directory just for this class, and even directories for each lab if you wanted. You could even clone your lab repos here (provided you remember to push regularly since this directory won’t be backed up.)
In the next section, you will be creating a large output file (between 150MB and 300MB).
It would be good to store that output file in your /scratch
directory!
3. Google Ngrams
Have you ever used Google’s Books Ngram Viewer? NO!? Go do that now! When you’re done playing with it, come back here.
Google has a very large collection of digitized books. They have gone through each book and counted how many times each token appeared in the collection. Single tokens in isolation are called unigrams, sometimes written as 1-grams. Google also counted how often words occurred next to each other in text (e.g. "the chair", "New York"), called bigrams or 2-grams. In general, a sequence of n words is called an n-gram. We use the words unigram, bigram and trigram to talk about 1-grams, 2-grams and 3-grams. Beyond 3, we just say n-gram (e.g. 4-gram or 5-gram).
Google makes the data they collected available for download. We have saved the unigram data in /data/cs91r-s25/ngrams/1gms. The files are compressed using gzip. Do not unzip the files! You can view a single file using either the zcat command in combination with less, or the zless command:
$ cd /data/cs91r-s25/ngrams/1gms
$ zcat 1-00023-of-00024.gz | less
$ zless 1-00023-of-00024.gz
In this file, the first line counts how often the token www.socialstudies appears. The second line has counts for πόνοι and the third line has counts for yellowness. The format of each line is as follows:
- The word we are counting is the first element on the line.
- Then, separated by tabs, each block of three numbers represents a year, the number of times that ngram appeared that year, and the number of unique books that ngram was found in that year.
So, using the line for yellowness as an example, we find that the word yellowness appeared one time total, in one book, in the year 1481. In the year 1785, the word yellowness appeared 23 total times spread across 17 different books. In 2019, the word appeared 1358 total times across 827 different books.
yellowness 1481,1,1 ... 1785,23,17 ... 2019,1358,827
Questions
- Read the man pages on zcat and zless. Describe some interesting features of both programs that perhaps you didn’t expect.

$ man zcat
$ man zless
Summarizing the counts
The goal of this part of the lab is to summarize those compressed files into the output below, where we have only saved the word and the total number of times the word occurred across all 24 files. (You do not need to keep track of how many volumes each word occurred in, only the total counts.)
...
yellowneck 192
yellownecked 3264
yellownecks 146
yellowneedles 253
yellownefs 295
yellownes 260
yellowness 293469
yellownessblueness 46
...
Write your solution to this part of the lab in ngrams.py. Hints and requirements are below.
Reading a gzip file
Your goal is to read these files in python without decompressing the files first. To do that, see the python gzip library documentation. Try to print out the first 3 lines of the file 1-00023-of-00024.gz in python using the gzip library to open the file. In the documentation, note that one of the options when you open the file is the mode. Be sure to set the mode to rt (read, text); otherwise python will interpret the data incorrectly.
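As a minimal sketch (using the filename mentioned above), this is one way to open the compressed file and print its first 3 lines:

import gzip

path = '/data/cs91r-s25/ngrams/1gms/1-00023-of-00024.gz'
# mode 'rt' decompresses the data and decodes it as text
with gzip.open(path, mode='rt') as infile:
    for i, line in enumerate(infile):
        print(line.rstrip())
        if i == 2:   # stop after the first 3 lines
            break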
Note that these compressed files are very large and they take a while to read over the file system and decompress. When your program is complete, it may take up to 30 minutes to process all 24 files. Therefore, you should definitely be testing your code step-by-step to be sure things are working!
Command-line interface
Your program should have the following interface:
$ python3 ngrams.py file1 [file2 file3 file4]...
That means that when you run the program, you will specify at least one parameter (that will be stored in sys.argv[1]), but you can specify as many files as you’d like on the command-line. You can use the length of sys.argv to figure out how many parameters you’ve passed to your program.
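A rough sketch of this command-line handling (the process_file helper here is hypothetical; it stands in for the real work described below):

import sys

def process_file(filename):
    # hypothetical placeholder: open the file and do the counting described below
    print('processing', filename)

def main():
    if len(sys.argv) < 2:
        print('usage: python3 ngrams.py file1 [file2 file3 file4]...')
        sys.exit(1)
    for filename in sys.argv[1:]:
        process_file(filename)

if __name__ == '__main__':
    main()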
Try out your interface using just one file and output the first 3 lines of the file as you did before. Next, provide multiple files as parameters and output the first 3 lines of each file you provide.
Filtering the words
You will notice that there is a lot of junk in these files. By junk, we mean things that aren’t words. You can view some of the "words" Google has counted by looking at the first column of one of the files:
$ cd /data/cs91r-s25/ngrams/1gms
$ zcat 1-00023-of-00024.gz | cut -f1 | less
Many of these are not "words" that we would care about (e.g. strings with characters that aren’t English letters or strings that have numbers and punctuation in them). You only want to count words that meet the following criteria:
- the word must start with one or more letters A-Z (lowercase or uppercase) or a hyphen.
- after the letters and hyphens, a word may have:
  - no additional characters, e.g. yellowness
  - a single underscore followed by one or more capital letters, e.g. third-order_NOUN
  - a single period followed by one or more numbers, e.g. unresolved.64
  - both a single period followed by numbers and a single underscore followed by capital letters, e.g. tracts.78_NUM
- For every word that meets these criteria, you will remove all characters after (and including) the first period or underscore. So, from the examples above, third-order_NOUN will become third-order, unresolved.64 will become unresolved, and tracts.78_NUM will become tracts; the word yellowness remains unchanged. (One way to express these criteria as a regular expression is sketched after this list.)
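Here is one possible way to express the criteria and the cleanup step with a single regular expression. This is an assumption on my part (the pattern and the clean_word name are not part of the assignment), and you may need to adjust it to reproduce the expected output exactly:

import re

# letters/hyphens, then optionally .digits, then optionally _CAPITALS
WORD_RGX = re.compile(r'^([A-Za-z-]+)(\.[0-9]+)?(_[A-Z]+)?$')

def clean_word(token):
    """Return the cleaned word if token meets the criteria, otherwise None."""
    match = WORD_RGX.match(token)
    if match is None:
        return None
    # keep only the part before the first period or underscore
    return match.group(1)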
Run your program using the 1-00023-of-00024.gz file as the only parameter and print out all of the words in the order you find them. Do not print anything except for the word (the first column of text in the file). Only include the words that meet the criteria above and remove extraneous characters (e.g. .78_NUM) as described above. You should get the following output for file 00023:
$ python3 ngrams.py /data/cs91r-s25/ngrams/1gms/1-00023-of-00024.gz > out.txt
$ head out.txt
yellowness
tkep
xxvii
ukupnom
thyreoarytaenoideus
thishis
theOther
wineswilling
tracts
unterschiedne
It’s important to notice that before you removed the extraneous characters,
each word only appeared once in the file. But now, words like yellowness
appear multiple times:
$ grep '^yellowness$' out.txt | sort | uniq -c
5 yellowness
This is because the word yellowness appears in 1-00023-of-00024.gz as yellowness, yellowness_ADJ, yellowness_X, yellowness_VERB, and yellowness_NOUN. This will be important in the next step.
Summarizing the counts
For each line that meets the criteria, you want to count the total number of times that the word occurred across all years. Do not count the number of volumes that the word appeared in. If you total up each line and print it out, you should get this output:
yellowness 146775
tkep 86
xxvii 153
ukupnom 335
thyreoarytaenoideus 149
...
However, remember that words can appear multiple times, as we saw with yellowness above. If we grep for the appearances of yellowness after calculating the totals, we’d see:
yellowness 146775
yellowness 8342
yellowness 244
yellowness 2346
yellowness 135762
So, rather than printing the totals out as soon as you calculate them, save the counts in a dictionary (or collections.Counter). Then, after you’ve gone through all of the files on the command-line, print out the results. Here’s a sample of my output where you’ll notice we are now summing all the appearances of yellowness:
$ python3 ngrams.py /data/cs91r-s25/ngrams/1gms/1-00023-of-00024.gz > out.txt
$ grep yellowness out.txt
yellowness 293469
yellownesses 122
yellownesse 489
yellownessblueness 46
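A rough sketch of this accumulation step, assuming the hypothetical clean_word helper sketched earlier and a gzip reader like the one shown above:

import gzip
from collections import Counter

def count_words_in_files(filenames):
    totals = Counter()
    for filename in filenames:
        with gzip.open(filename, mode='rt') as infile:
            for line in infile:
                fields = line.rstrip('\n').split('\t')
                word = clean_word(fields[0])   # clean_word sketched above
                if word is None:
                    continue
                # each remaining field is "year,count,volumes"; sum the counts
                totals[word] += sum(int(block.split(',')[1]) for block in fields[1:])
    return totals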
Note: Most words appear in only one file, but a very small number of words, once you’ve removed the extraneous characters, appear in multiple files. One such example is Kids, which appears in file 00010 and file 00011. You can test to be sure your program handles those cases correctly by verifying that if you run the program only on file 00010 you should count 3643441 appearances of Kids; if you run only on file 00011 you should count 3639068 appearances of Kids; and if you run it on both files you should count 7282509, the sum of both files.
Running on all of the gzip files
Remember, each of these files has a lot of words, and decompressing them over the file system and then processing them will take a fair amount of time. Do not run on all 24 files until you are fairly certain your code is working. You should expect this step to take about 30 minutes. Of course, your particular implementation may make this slower or faster, but be patient!
Don’t forget to use tqdm!
You can run on all 24 files by using file globbing to grab all gzip files in the directory. Notice below that the output is being directed to a file in the /scratch directory. You can store it anywhere you’d like in /scratch, but you’ll need to replace userid1 with your own user id.
$ python3 ngrams.py /data/cs91r-s25/ngrams/1gms/*.gz > /scratch/userid1/cs91r/ngrams.txt
gzip the output
You’ll notice that the output file is quite large (nearly 300MB). You should compress that file, too, using gzip:
$ cd /scratch/userid1/cs91r/
$ gzip ngrams.txt # this will take about 20 seconds
Or you could gzip it directly as you run your program (shown without the /scratch directory to prevent word-wrapping):
$ python3 ngrams.py /data/cs91r-s25/ngrams/1gms/*.gz | gzip > ngrams.txt.gz
Questions
- Using the ngrams.txt file (or ngrams.txt.gz file), what are the most frequent words in the ngram data set and what are their counts? Do the ten most frequent words match what you found in the Project Gutenberg novels from the previous section?
- How did you find the most frequent words from this file?
  - a. If you used a python program to do it, put the source code of that file in a code block.
  - b. If you performed this step on the command-line, put the command-line you used in a code block.
How to turn in your solutions
Edit the markdown file that we provided to answer each of the questions and add any discussion you think we’d like to know about.
Be sure to commit and push all changes to your python files.
If you think it would be helpful, use asciinema to record a terminal session and include it in your markdown. See Lab 1 for a reminder of how to use asciinema.