CS91R Lab 07: Document vectors

Due Wednesday, April 9, before midnight

Goals

The goals for this lab assignment are:

  • Understand how to represent documents as multidimensional vectors

  • Explore the use of sparse matrices for efficiently representing document collections

  • Use cosine similarity to compare document similarity

  • Use tf-idf to downweight frequent words

  • Learn how to treat a word’s context across a large corpus as a document

Cloning your repository

Log into the CS91R-S25 github organization for our class and find your git repository for Lab 07, which will be of the format lab07-user1-user2, where user1 and user2 are the usernames of you and your partner.

You can clone your repository using the following steps while connected to the CS lab machines:

# cd into your cs91r/labs sub-directory and clone your lab07 repo
$ cd ~/cs91r/labs
$ git clone git@github.swarthmore.edu:CS91R-S25/lab07-user1-user2.git

# change into your lab07 repo directory
$ cd ~/cs91r/labs/lab07-user1-user2

# ls should list the following contents
$ ls
README.md

Answers to written questions should be included in the README.md file included in your repository.


Background

For further details on the content of this lab, refer to Jurafsky and Martin, Chapter 6, in particular, sections 6.3, 6.4 and 6.5.

Document Retrieval

In the first part of the lab, we will build a document retrieval system that allows us to quickly search a large collection of documents using keyword searches. For each search the user performs, our system will provide a ranking of all of the documents in the collection. The user can then easily browse through all of the documents in sorted order from best match to worst match.

We will use the entire English text corpus of Project Gutenberg as our document collection.

  1. Explore the Project Gutenberg dataset in /data/cs91r-s25/gutenberg/mirror.

    • Describe the contents of the four subfolders found in /data/cs91r-s25/gutenberg/mirror/data.

    • How many downloaded texts are there?

    • In the /research/data/cs91r-s25/gutenberg/mirror/metadata folder is a file called metadata.csv. Use grep or another command-line tool to find out how many English texts are in the corpus. Some texts are in both English and another language: your count should include only the texts that are solely in English.

    • There is also a file called metadata_small.csv that contains a subset of the books that we strongly recommend you use for testing purposes in later parts of the lab. How many English-only texts are in this file?

Examining the data

In the first part of the lab, we will convert the Project Gutenberg corpus into a term-document matrix. Each row of the matrix will represent a single document in the corpus. Each column will represent a unique word in the corpus. For example, in the version of Hamlet that we have been using (PG1524), the word "castle" appears 30 times. If the matrix is storing the text PG1524 in row i and the word "castle" in column j, then matrix[i][j] = 30.

  1. From Question 1, you found out that there are approximately 60K English texts. (Later in the lab you will find out the number is closer to 50K: see the note in the next paragraph. We’ll use 50K from now on as our approximation.) If we lowercase all the words in those 50K texts, there are approximately 70K unique words. If each integer in the matrix were stored as a 4-byte int, how much memory would you need to store the entire term-document matrix?

(Note: A fair number of texts listed in the metadata file were not actually downloaded. In some cases it is because the text is no longer available on Project Gutenberg. In other cases, the text is only available in a non-text format such as a PDF or audio file, and we’ve only downloaded the .txt files. Later in the lab you will be able to calculate this value more precisely.)

Background: the pickle library

As we work through this lab, we’re going to be reading through a lot of files. You might want to work with the metadata_small.csv subset during development and debugging.

It may take a few minutes to read through the entire Project Gutenberg corpus, and you don’t want to have to re-process the corpus over and over again. To help out, we’re going to take advantage of python’s pickle library. This library allows us to serialize our data. Practically, this means that we can create large data structures (e.g. our term-document matrix) in one program and save them directly to a file. In another program, we can read our data directly from the file with very little effort.

Here is a small example demonstrating how the pickle library works:

$ python3
>>> import pickle                  # import the pickle library
>>> names = ['Rich', 'Keith']      # create a list of names
>>> f = open('my_list.pkl', 'wb')  # open a file (write, binary mode)
>>> pickle.dump(names, f)          # save the list to the file
>>> f.close()                      # close the file
>>>
>>>
$ ls
my_list.pkl
$ python3
>>> f = open('my_list.pkl', 'rb')  # open the file (read, binary mode)
>>> data = pickle.load(f)          # load list from the file
>>> f.close()                      # close the file
>>> data                           # show that the data is loaded
['Rich', 'Keith']
>>>

Figuring out the dimensions of the matrix

Once we create the matrix, if we see the word "castle" in Hamlet, how do we know which cell in the matrix this goes in? In order to figure that out, we need to assign a row number to each text, and we need to assign a column number to each word. And in order to do this assignment, we need a list of all of the texts and another list of all the words in all of these texts.

Write a program called setup.py that does the following (a minimal sketch appears after this list):

  • Read the file /usr/share/dict/words. This file contains a large number of English words and many proper nouns. Lowercase all the words. When you read in token counts in the bullet points below, you will only store words that appear in this file.

  • Read the metadata.csv file (mentioned in Question 1). You should use the csv library to read the file and make a list of the ids of the texts written only in English.

  • For each English-only text in your list of fileids, open that file and read the token counts for that text. The token counts are stored in the /data/cs91r-s25/gutenberg/mirror/data/counts directory. You should not store the counts — simply keep track of all of the UNIQUE words across ALL of the texts. Using a set is helpful here.

  • NOTE: Some files may exist but not have any counts in them! Be sure your program doesn’t crash on those.

  • Create a dictionary that maps file ids to consecutive integers starting from 0. These integers will become the ROW numbers in your matrix, so each integer must appear only once, starting from 0, and increase by 1 for each next file id. The exact order of the file ids doesn’t matter as long as each book has a unique integer.

  • Create another dictionary that maps words to consecutive integers starting from 0. These integers will become the COLUMN numbers in your matrix, so again, each integer must only appear once, starting from 0. The order of the words doesn’t matter.

  • Use the pickle library to save the book dictionary to books.pkl

  • Similarly, save the words dictionary to words.pkl.
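
Here is one possible skeleton for setup.py, shown only as a sketch. The metadata column names ("id" and "language"), the exact language value, and the assumption that each counts file is named <fileid>_counts.txt with one "word count" pair per line are guesses that you should verify against the actual files.

import csv
import pickle

DICT_PATH = "/usr/share/dict/words"
METADATA = "/research/data/cs91r-s25/gutenberg/mirror/metadata/metadata.csv"
COUNTS_DIR = "/data/cs91r-s25/gutenberg/mirror/data/counts"

# Lowercased dictionary words: only these are eligible to become columns.
with open(DICT_PATH) as f:
    dict_words = set(line.strip().lower() for line in f)

# Collect the ids of the English-only texts from the metadata file.
fileids = []
with open(METADATA) as f:
    for row in csv.DictReader(f):
        if row["language"] == "['en']":     # assumption: check the real column name and value format
            fileids.append(row["id"])       # assumption: check the real column name

# Collect every unique dictionary word seen across all of the texts.
vocab = set()
for fileid in fileids:
    try:
        with open(f"{COUNTS_DIR}/{fileid}_counts.txt") as f:
            for line in f:
                fields = line.split()
                if len(fields) != 2:        # skip empty or malformed lines
                    continue
                word = fields[0].lower()
                if word in dict_words:
                    vocab.add(word)
    except FileNotFoundError:               # some listed texts were never downloaded
        continue

# Assign consecutive integers: rows for file ids, columns for words.
book_ids = {fileid: i for i, fileid in enumerate(fileids)}
word_ids = {word: i for i, word in enumerate(sorted(vocab))}

with open("books.pkl", "wb") as f:
    pickle.dump(book_ids, f)
with open("words.pkl", "wb") as f:
    pickle.dump(word_ids, f)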

Background: Sparse matrices

  1. How many types are there in the PG1524_counts.txt file? Use a command-line tool to answer this.

  2. In Question 2, we said there were about 70K unique types across all of the English Project Gutenberg texts. If we were to fill in the Hamlet (PG1524) row with how often each type appeared, what percentage of the types would be storing a non-zero value?

If we assume that the percentage you got in the previous question is approximately true for all of the texts in Project Gutenberg, it’s pretty easy to see that a very large portion of the matrix we’re creating will be used to store the value 0, which seems like a huge waste of space. So instead of using a list of lists to store our matrix, we will use a sparse matrix. By default, these sparse matrices will assume that any unspecified values are 0, saving us quite a lot of memory.

There are many implementations of sparse matrices, and we will be using two of them. One implementation will be useful when we need to build the matrix, and one implementation will be useful when we need to do computations using it. Both implementations are in the scipy.sparse library.

We will start with the coo_matrix, or "coordinate matrix". The coo_matrix stores coordinates and their associated values. For example, if Hamlet was stored in row 4688, the word "castle" was stored in column 1105, and the word "castle" appeared 30 times in Hamlet, then the coo_matrix would store the fact that cell (4688, 1105) had the value 30.

We will use this constructor to create our coo_matrix (adapted from the scipy reference documentation for coo_matrix):

coo_matrix((data, (r, c)), dtype=<datatype>)
    to construct from three arrays:
            data[:] the entries of the matrix, in any order
            r[:] the row indices of the matrix entries
            c[:] the column indices of the matrix entries

    Where A[r[k], c[k]] = data[k].
    dtype specifies the data type of the matrix.

To use the csr_matrix, coo_matrix, and np.int32 (described below), you will need to import the following:

import numpy as np
from scipy.sparse import csr_matrix, coo_matrix

  1. Write python code that creates a coo_matrix called matrix to store the data below using the constructor shown above. By default, the dtype is set to a double-precision floating point number. We would like it to store a 32-bit integer. To do that, set dtype=np.int32.

0, 0, 2, 0, 0
0, 2, 0, 3, 0
4, 0, 1, 0, 0
0, 0, 0, 1, 5

If you saved your matrix in the variable matrix, you can verify you are correct by printing matrix.toarray().

The coo_matrix is optimized for constructing sparse matrices. But it is not a good format to use when you actually want to inspect values or perform arithmetic operations. Instead, once we create the coo_matrix, we will want to convert it to a csr_matrix. Given the coo_matrix you created, you can convert it to a csr_matrix as follows: matrix_csr = matrix.tocsr(). As before, you can verify this worked by printing matrix_csr.toarray().
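
For example, here is a tiny made-up matrix (different from the one in the question above) built as a coo_matrix and then converted to CSR:

import numpy as np
from scipy.sparse import coo_matrix

# The 2x3 matrix [[5, 0, 0], [0, 0, 7]] has only two nonzero entries.
data = [5, 7]     # the nonzero values
r    = [0, 1]     # their row indices
c    = [0, 2]     # their column indices

example = coo_matrix((data, (r, c)), dtype=np.int32)
example_csr = example.tocsr()
print(example_csr.toarray())    # [[5 0 0]
                                #  [0 0 7]]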

Building the term-document matrix

Write a program called termdoc.py that creates the term-document matrix; a minimal sketch appears after this list. Your program should:

  • Read in the two pickle files you created in setup.py so that you have the mapping of file ids to integers and the mapping of words to integers.

  • For every file id, read in the corresponding token counts. Recall that we need to store these token counts in a coo_matrix, and our goal is to use the coo_matrix constructor shown above to create it. Therefore, for each token count you read in, append the count to data, the row number to r, and the column number to c. You can find the row number in the dictionary that maps file ids to integers, and the column number in the dictionary that maps words to integers.

  • After reading in the token counts for all books and creating data, r, and c, create a coo_matrix.

  • Convert your coo_matrix to a csr_matrix.

  • Save your csr_matrix to a pickle file called matrix.pkl.
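
A minimal sketch of termdoc.py, again under the assumption that the counts files are named <fileid>_counts.txt with one "word count" pair per line (verify this against the real files):

import pickle
import numpy as np
from scipy.sparse import coo_matrix

COUNTS_DIR = "/data/cs91r-s25/gutenberg/mirror/data/counts"

with open("books.pkl", "rb") as f:
    book_ids = pickle.load(f)      # file id -> row number
with open("words.pkl", "rb") as f:
    word_ids = pickle.load(f)      # word -> column number

data, r, c = [], [], []
for fileid, row in book_ids.items():
    try:
        with open(f"{COUNTS_DIR}/{fileid}_counts.txt") as f:
            for line in f:
                fields = line.split()
                if len(fields) != 2:
                    continue
                word, count = fields[0].lower(), int(fields[1])
                if word in word_ids:
                    data.append(count)
                    r.append(row)
                    c.append(word_ids[word])
    except FileNotFoundError:
        continue

# shape= is optional here, but it keeps the dimensions correct even if
# the last row or column happens to be entirely empty.
matrix = coo_matrix((data, (r, c)),
                    shape=(len(book_ids), len(word_ids)),
                    dtype=np.int32)
matrix_csr = matrix.tocsr()

with open("matrix.pkl", "wb") as f:
    pickle.dump(matrix_csr, f)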

Background: Cosine similarity

Now that we have our term-document matrix saved, we can move on to finding documents that match a user’s search terms. To do this, we are going to use cosine similarity.

Recall that each row of the matrix represents the counts of the words we found in a single document. We will treat this row as a multidimensional vector (J&M section 6.3.1). You have likely seen two- and three-dimensional vectors, but in this matrix, since there are about 70K words in our vocabulary, each vector has approximately 70K dimensions.

Even though our vectors have a ridiculously large number of dimensions, we can still use standard mathematical techniques to determine how similar two vectors are. In particular, we will try to find the angle between the two vectors. If the angle between two vectors is small, then they are more similar; if the angle between two vectors is large, then they are less similar.

The value of each dimension in our vector is a word count. Each of these word counts has a non-negative value: it is either zero if the word did not occur in the document, or it is a positive value equal to the number of times it did occur. Therefore, the angle between any two vectors must be between 0 and 90 degrees.

Rather than find the angle directly, we will find the cosine of the angle between the two vectors because it is easy to calculate! When we use the cosine to determine how close two vectors are, we call this the "cosine similarity". The cosine similarity will be close to 1 when the two vectors are very similar, and close to 0 when the two vectors are very dissimilar. (See J&M section 6.4.)
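
As a small sanity check of the idea, here is cosine similarity computed directly for a few made-up three-dimensional count vectors:

import numpy as np

a = np.array([2, 1, 0])    # made-up word counts
b = np.array([4, 2, 0])    # same proportions as a, just a longer document
c = np.array([0, 0, 3])    # shares no words with a

def cosine(u, v):
    # cos(angle) = (u . v) / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, b))    # ~1.0: the vectors point in the same direction
print(cosine(a, c))    #  0.0: the vectors are perpendicular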

Querying the document collection

Write a program called query.py that allows the user to enter a query (space-separated words) and returns to the user a ranked list of the documents that match the query.

This program will take one command-line argument which will help you distinguish between the three main ways you will run this program. For now, the command-line argument will always be "1" since this is part 1. (This will make more sense later, but for now, just make sure your query program reads and verifies that this is part 1.)

To do this, we will ask the user to enter a space separated list of words, e.g. "flower pollen", which we will call the "query". We will convert that query into a multidimensional vector with a 1 in the "flower" dimension, a 1 in the "pollen" dimension, and a 0 in every other dimension.

To create this query vector, we will first create a vector with the same number of columns as the matrix we built and we will fill all of the entries with 0. In order to make our computations efficient, we will use the numpy.zeros function to create an empty query vector.

Then, for every word in the query, we will set the corresponding column value to 1. If the user includes a word in their query that is not in your words dictionary, just ignore it.

Here is an example of how to use the numpy.zeros function and how to set values in the resulting vector:

import numpy as np  # at the top of your program

# assumes you already have your matrix created and that it is called 'matrix'
num_rows, num_columns = matrix.shape    # find the size of your matrix
query = np.zeros((1, num_columns))      # create a vector of all zeros

# assuming "castle" was dimension 1105 and you wanted to store the value 1
# in that dimension, you would say:
query[0, 1105] = 1     # note the unusual indexing!

Now that you have your query vector, you can compare that query vector to all of the vectors in your matrix and then sort them from most similar to least similar:

from sklearn.metrics.pairwise import cosine_similarity # top of your file
similarities = cosine_similarity(query, matrix).flatten()
ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

The list ranked contains tuples where the first element is the row number and the second element is the similarity value. For example, the highest match might be (22345, 0.15) which means that whichever book was in row 22345 has a cosine similarity of 0.15 between it and your query vector.

At this point, complete your query.py so that it does the following (a minimal sketch of the main loop appears after these steps):

  • Reads in your words.pkl, books.pkl and matrix.pkl files that you created in previous programs.

  • Repeatedly performs the following steps:

    • Asks the user to enter a query (space-separated words)

    • Converts the query into a query vector

    • Uses cosine similarity to compare the query to every row in the matrix

    • Prints the cosine similarity, book id (e.g. "PG1524"), and the book’s title, which you can get from the metadata.csv file, for the top 5 best matches.
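
Here is a rough sketch of the part-1 main loop; the title lookup from metadata.csv is omitted, and the prompt string is just a placeholder:

import sys
import pickle
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

assert sys.argv[1] == "1"            # part 1 for now

with open("books.pkl", "rb") as f:
    book_ids = pickle.load(f)        # file id -> row number
with open("words.pkl", "rb") as f:
    word_ids = pickle.load(f)        # word -> column number
with open("matrix.pkl", "rb") as f:
    matrix = pickle.load(f)

# Reverse mapping so we can turn row numbers back into file ids.
row_to_book = {row: fileid for fileid, row in book_ids.items()}

num_rows, num_columns = matrix.shape
while True:
    query_str = input("query: ")
    query = np.zeros((1, num_columns))
    for word in query_str.lower().split():
        if word in word_ids:         # silently ignore unknown words
            query[0, word_ids[word]] = 1

    similarities = cosine_similarity(query, matrix).flatten()
    ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)
    for row, sim in ranked[:5]:
        # also look up and print the title from metadata.csv here (omitted in this sketch)
        print(f"{sim:.4f}  {row_to_book[row]}")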

  1. For the top five best matches to the query "flower pollen", display the cosine similarity, the book’s id, and the book’s title (which you can get from the metadata.csv file).

    • Look through the titles of the books returned, as well as the text of the books which you can find in /data/cs91r-s25/gutenberg/mirror/data/text/. Do your results seem reasonable? Why or why not?

  2. Repeat the previous question using the query "flower". Do the results seem reasonable? Why or why not?

  3. Repeat the previous question using the query "the flower". Compare to the results you got for "flower". Do the results seem reasonable? Why or why not?

  4. Try on a few other queries of your choosing. Which queries did you try? Were the results reasonable? Why or why not?

TF-IDF

Not all query terms are equally important, are they? That is, when you searched for "flower", it was important to find documents that contained the word "flower". But when you searched for "the flower", the cosine similarity algorithm treated "the" as equally important to "flower". That means that a document that has the word "the" a lot of times but doesn’t contain the word "flower" at all can still rank very highly. (In fact, it does! Did you find that when you answered the question above?)

To fix this, we will use a technique called tf-idf that will automatically downweight words that occur in a lot of documents. (See J&M section 6.5.)

At a high level, we are going to scale all of the term frequencies so that words that occur in many documents (the "document frequency") have less impact than words that occur in fewer documents.

  • tf_w is the term frequency of the word: how often it appears in a document

  • df_w is the document frequency of a word: how many documents a word appears in

  • N is the number of documents in your collection

  • ln is the natural log function

The formulation for calculating the adjusted tf value that we will use is:

\[adjusted_w = tf_w \times \ln\frac{N + 1}{df_w + 1}\]

Let’s try to adjust the term frequency for the word "castle" in Hamlet. The word "castle" appears 30 times in Hamlet, so tf_w = 30. In the Project Gutenberg collection, surprisingly, there are 21204 English texts that contain the word "castle", so df_w = 21204. There are approximately 50K documents in Project Gutenberg, so N = 50000.

\[adjusted_w = 30 \times \ln\frac{50000 + 1}{21204 + 1} = 25.734\]

We say that the adjusted term frequency is 25.734 instead of 30.
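
You can reproduce this calculation with a couple of lines of Python:

import math

tf, df, N = 30, 21204, 50000
adjusted = tf * math.log((N + 1) / (df + 1))    # math.log is the natural log
print(round(adjusted, 3))                       # 25.734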

  1. For each of the three calculations below, assume that there are 50K texts.

    • The word "flower" appears in 26651 texts. If it’s term frequency in a particular document is 30, what is the adjusted term frequency after using tf-idf?

    • The word "pollen" appears in 1781 texts. If it’s term frequency in a particular document is 30, what is the adjusted term frequency after using tf-idf?

    • The word "the" appears in all but 7 texts. If it’s term frequency in a particular document is 30, what is the adjusted term frequency after using tf-idf?

    • Reflect on the results from these calculations.

You should be sure you understand how tf-idf works! However, we will not write the code to implement tf-idf; rather, we will use the scikit-learn library to do it for us.

from sklearn.feature_extraction.text import TfidfTransformer

In the first step, we will use the TfidfTransformer to calculate the document frequency values for each word and then recalculate each value in the matrix according to the tf-idf formula.

# create the tfidf_transformer
tfidf_transformer = TfidfTransformer()

# find the df values ("fit")
tfidf_transformer.fit(matrix)

# adjust the tfidf values to use the formulation we used in class
tfidf_transformer.idf_ -= 1

# apply tfidf to the matrix and save as tfidf_matrix
tfidf_matrix = tfidf_transformer.transform(matrix)

In the second step, after the user enters a query, we need to apply tf-idf to the query vector we made. However, we don’t need to recalculate the df values from the original matrix, so we just perform the "transform" step:

# apply tfidf to the query_vector and save as tfidf_query
tfidf_query = tfidf_transformer.transform(query_vector)

Add this code to your query.py program. Be sure to only apply tf-idf to your matrix once. You will need to apply tf-idf to each query that the user enters. When computing the cosine_similarity, use the tfidf_query and tfidf_matrix.
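
In outline, reusing the variable names from above, the pieces fit together like this: the matrix is fit and transformed once when the program starts, and only the query transform happens inside the loop.

# once, right after loading matrix.pkl
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(matrix)                  # learn the df values
tfidf_transformer.idf_ -= 1                    # match the formulation shown above
tfidf_matrix = tfidf_transformer.transform(matrix)

# inside the query loop, for every query vector the user enters
tfidf_query = tfidf_transformer.transform(query)
similarities = cosine_similarity(tfidf_query, tfidf_matrix).flatten()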

  1. Compare the results you get when querying for "flower" vs "the flower". Are the results improved now that you implemented tf-idf? Why or why not?

OPTIONAL: Removing stop words

One common way to improve performance even further is to remove "stop words". These are very common words that (normally) provide very little information when performing these types of queries. You can find a list of common English stopwords here: /usr/local/share/nltk_data/corpora/stopwords/english. Add code to your setup.py and termdoc.py programs so that any word in that stopword list is ignored: stopwords should not be stored in your words.pkl file, and no counts for them should be added to your matrix. Compare your results for the queries "flower" vs "the flower".
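
If you try this, loading the stopword list could look like the following sketch (the NLTK stopword file contains one lowercase word per line):

# load the English stopword list into a set for fast membership tests
with open("/usr/local/share/nltk_data/corpora/stopwords/english") as f:
    stopwords = set(line.strip() for line in f)

# then, wherever setup.py and termdoc.py decide whether to keep a word,
# keep it only if it is in the dictionary AND not in stopwords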

Matching words to a query

We are now on part 2, so the command-line argument to your program will change to be "2" instead of "1".

I know this lab seems long, but the good news is you’re way more than half done!

In the second part of the lab, we’re going to build a different kind of co-occurrence matrix. In the previous part, we built a term-document matrix where rows were documents, columns were the words in our vocabulary, and each cell stored the count of how many times each word appeared in each document.

In this part of the lab, we will build a term-term matrix, where rows and columns will both be the words in our vocabulary (J&M section 6.3.3). Each cell will store the count of how many times we’ve seen one word near another word in a large text corpus.

We will use data from the 2012 Google 5-grams dataset to figure out which words appear near other words. For example, the following 5-grams appear in this dataset:

1 large banana , peeled 88
1 medium banana , peeled        77
1 ripe banana , peeled  157
1 small banana , peeled 78
a peeled banana . _END_ 238
an unpeeled banana . _END_      42
large ripe banana , peeled      50
peeled a banana . _END_ 166
peeled a banana and ate 46
peeled the banana . _END_       62

In the 5-grams shown above, the word "banana" is the center word. Around the center word in each 5-gram are 4 other words. We will treat "banana" as a document whose contents are the frequencies of the other words that appear in these contexts. For example, the word "ripe" appears in two 5-grams. One 5-gram appeared 157 times, the other 50 times. So in the "banana" document, we will say that we saw the word "ripe" appear 207 times. Similarly, the word "peeled" appears 962 times in the "banana" document.
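
Purely for illustration, here is roughly how those context counts could be accumulated from lines like the ones above, assuming the center word is always the middle token and the count is the last whitespace-separated field (the filename is hypothetical):

from collections import Counter

context_counts = Counter()
with open("banana_5grams.txt") as f:     # hypothetical file holding the lines above
    for line in f:
        fields = line.split()
        if len(fields) < 2:
            continue
        *tokens, count = fields          # the last field is the 5-gram's count
        for i, token in enumerate(tokens):
            if i != 2:                   # position 2 is the center word ("banana")
                context_counts[token.lower()] += int(count)

print(context_counts["ripe"])            # 207 for the sample lines above
print(context_counts["peeled"])          # 962 for the sample lines above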

Processing 600 million Google 5-grams to create this matrix is tedious and slow. You’ve already done things like this in class, so we won’t make you do it again. Instead, we are providing you with 3 pickle files:

  • matrix.pkl: the term-term matrix that contains word co-occurrences derived from the Google 5-gram corpus.

  • books.pkl: the mapping for the words that form the ROWS of the matrix (analogous to the book dictionary from part 1). The word "banana" was the word we were building the vector for in the example above.

  • words.pkl: the mapping for the context words that form the COLUMNS of the matrix. The words "ripe" and "peeled" were the context words in the example above.

We chose these names to be maximally similar to the previous part of the lab.

We filtered the set of words so that only words that appeared in /usr/share/dict/words were included (or if there was a lowercase match). This means that many proper nouns will be excluded, but it makes the matrix a much more reasonable size to work with.

Note that because this is a term-term matrix, the set of words in books.pkl and the set of words in words.pkl could in principle be the same, but because of the particulars of how the Google 5-gram set is constructed, they are slightly different.

Querying the matrix

If your code from the previous part works according to the specifications above, you should be able to run your query.py program with two minor modifications:

  • If your command-line argument is "1", use the pickle files that you created in the previous part, but if your command-line argument is "2", use the three pickle files stored in /data/cs91r-s25/ngrams/pickles.

  • In part 1, when you printed results, you printed both the book dictionary key (e.g. "PG1524") and the title from metadata.csv. In part 2, when your command-line argument is 2, you will only print out the book dictionary key (e.g. "spaghetti").

Now when you enter a query, e.g. "spaghetti", you will get a ranked list of the words that most frequently appeared near "spaghetti".

  1. What are the 5 words that appear frequently with the word "spaghetti"?

  2. What are the 5 words that appear frequently with days of the week? Try a bunch of them and see if there are any interesting results. (Note: all of the words in the matrix are lowercased, so use "monday", "tuesday", etc.)

  3. What are the 5 words that appear frequently with "hamlet"? What about with the two words "hamlet castle"? What about "hamlet castle denmark"?

  4. Try some other words or phrases. Did you find anything interesting? If you got results you didn’t expect, can you explain why you got these results?

Comparing words to each other

ALMOST DONE! There isn’t a lot of code left to write!

This is the last part of the lab, so your command-line argument is now 3.

In this part of the lab, we want you to try something a little bit different. The good news is that almost all of the code you wrote in the previous part of the lab will work here. The only difference is that instead of building a query vector and comparing it to the matrix, we will use one row of the matrix itself as the query and compare it to all of the other rows.

You will ask the user to enter a single word, such as "hamlet". The word "hamlet" is represented by a single row in the matrix. You can find out which row it is by looking it up in the dictionary you read from books.pkl. Now, instead of creating a query vector, you will use the actual vector in the matrix as your query vector.

In the first part, you transformed the query using tf-idf before running cosine similarity. In this part, though, since the query is already in the matrix and the matrix has already been transformed, you will not transform the query vector again.

When you produce the ranked list of results, the most similar result will be the query word itself. So, when you print out the most similar other words, be sure not to include the query word itself.

Modifications to your original program

In the first two parts (command-line argument 1 or 2), you asked the user to enter a query which could be composed of any number of words. You then created a query vector from that string and applied tfidf to it. Finally, you compared that query vector to every vector in the matrix and printed the results.

In the final part (command-line argument 3), you will use the same pickle files as you did in part 2. But instead of asking for a query string as you did in the previous parts, you will ask the user for a single word. Do not create a new query vector: instead, your query vector will be the row in the matrix that corresponds to the word the user entered:

tfidf_query = tfidf_matrix[row_index]

Finally, compare that tfidf_query to every vector in the matrix and print the results. When printing the results, do not print the match between the query and its own row in the matrix.
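
Sketched with the same variable names as before (row_to_book is the reverse mapping from the earlier sketch), the part-3 comparison and printing might look like:

tfidf_query = tfidf_matrix[row_index]           # the user's word, already tf-idf weighted
similarities = cosine_similarity(tfidf_query, tfidf_matrix).flatten()
ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

shown = 0
for row, sim in ranked:
    if row == row_index:                        # skip the query word itself
        continue
    print(f"{sim:.4f}  {row_to_book[row]}")
    shown += 1
    if shown == 5:
        break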

Hopefully you can make just a few small changes to your existing code to make this work!

Explore

  1. Using the word "spaghetti", how do your results compare between mode 2 and mode 3? Reflect on the differences that you find.

  2. Try some other words in modes 2 and 3 (maybe run in different terminals at the same time) to compare the results you get and discuss. Which version seemed "better"? What does it mean to be better? Is it subjective? Do some kinds of words work better one way and others work the other way? What insights do you think might explain the differences you are seeing?

OPTIONAL: Comparing novels

You can do the same thing you did in part 3 with the novels without too much difficulty. Add a mode 4 that compares rows in the same way that mode 3 does, but use the pickle files from mode 1.

  1. OPTIONAL: What did you find!? Share your results and thoughts.

How to turn in your solutions

Edit the README.md file that we provided to answer each of the questions and add any discussion you think we’d like to know about.

Be sure to commit and push all changes to your python files.

If you think it would be helpful, use asciinema to record a terminal session and include it in your README.md.