CS91R Lab 07: Document vectors
Due Wednesday, April 9, before midnight
Goals
The goals for this lab assignment are:
- Understand how to represent documents as multidimensional vectors
- Explore the use of sparse matrices for efficiently representing document collections
- Use cosine similarity to compare document similarity
- Use tf-idf to downweight frequent words
- Learn how to treat a word's context across a large corpus as a document
Cloning your repository
Log into the CS91R-S25 github organization for our class and find your git repository for Lab 07, which will be of the format lab07-user1-user2, where user1 and user2 are your and your partner's usernames.
You can clone your repository using the following steps while connected to the CS lab machines:
# cd into your cs91r/labs sub-directory and clone your lab07 repo
$ cd ~/cs91r/labs
$ git clone git@github.swarthmore.edu:CS91R-S25/lab07-user1-user2.git
# change into your lab07 repo directory
$ cd ~/cs91r/labs/lab07-user1-user2
# ls should list the following contents
$ ls
README.md
Answers to written questions should be included in the README.md file included in your repository.
Background
For further details on the content of this lab, refer to Jurafsky and Martin, Chapter 6, in particular, sections 6.3, 6.4 and 6.5.
Document Retrieval
In the first part of the lab, we will build a document retrieval system that allows us to quickly search a large collection of documents using keyword searches. For each search the user performs, our system will provide a ranking of all of the documents in the collection. The user can then easily browse through all of the documents in sorted order from best match to worst match.
We will use the entire English text corpus of Project Gutenberg as our document collection.
- Explore the Project Gutenberg dataset in /data/cs91r-s25/gutenberg/mirror.
  - Describe the contents of the four subfolders found in /data/cs91r-s25/gutenberg/mirror/data.
  - How many downloaded texts are there?
  - In the /research/data/cs91r-s25/gutenberg/mirror/metadata folder is a file called metadata.csv. Use grep or another command-line tool to find out how many English texts are in the corpus. There are some texts that are in English and another language: your count should include only the English-only texts.
  - There is also a file called metadata_small.csv that contains a subset of the books that we strongly suggest you use for testing purposes in later parts of the lab. How many English-only texts are in this file?
Examining the data
In the first part of the lab, we will convert the Project Gutenberg corpus
into a term-document matrix. Each row of the matrix will represent a single
document in the corpus. Each column will represent a unique word in the corpus.
For example, in the version of Hamlet that we have been using (PG1524), the
word "castle" appears 30 times. If the matrix is storing the text PG1524 in
row i
and the matrix is storing the word "castle" in column j
, then matrix[i][j] = 30.
- From Question 1, you found out that there are approximately 60K English texts. (Later in the lab you will find out the number is closer to 50K: see the note in the next paragraph. We'll use 50K from now on as our approximation.) If we lowercase all the words in those 50K texts, there are approximately 70K unique words. If each integer in the matrix were stored as a 4-byte int, how much memory would you need to store the entire term-document matrix?
(Note: A fair number of texts listed in the metadata file were not actually downloaded. In some cases it is because the text is no longer available on Project Gutenberg. In other cases, the text is only available in a non-text format such as a PDF or audio file, and we’ve only downloaded the .txt files. Later in the lab you will be able to calculate this value more precisely.)
Background: the pickle library
As we work through this lab, we’re going to be reading through a lot
of files. You might want to work with the metadata_small.csv
subset during development and debugging.
It may take a few minutes to read through
the entire Project Gutenberg corpus, and you don’t want to have to
re-process the corpus over and over again. To help out, we’re going to
take advantage of python’s
pickle
library. This
library allows us to
serialize
our
data. Practically, this means that we can create large data structures
(e.g. our term-document matrix) in one program and save them directly
to a file. In another program, we can read our data directly from the
file with very little effort.
Here is a small example demonstrating how the pickle
library works:
$ python3
>>> import pickle # import the pickle library
>>> names = ['Rich', 'Keith'] # create a list of names
>>> f = open('my_list.pkl', 'wb') # open a file (write, binary mode)
>>> pickle.dump(names, f) # save the list to the file
>>> f.close() # close the file
>>>
>>>
$ ls
my_list.pkl
$ python3
>>> f = open('my_list.pkl', 'rb') # open the file (read, binary mode)
>>> data = pickle.load(f) # load list from the file
>>> f.close() # close the file
>>> data # show that the data is loaded
['Rich', 'Keith']
>>>
Figuring out the dimensions of the matrix
Once we create the matrix, if we see the word "castle" in Hamlet, how do we know which cell in the matrix this goes in? In order to figure that out, we need to assign a row number to each text, and we need to assign a column number to each word. And in order to do this assignment, we need a list of all of the texts and another list of all the words in all of these texts.
Write a program called setup.py that does the following:
- Read the file /usr/share/dict/words. This file contains a large number of English words and many proper nouns. Lowercase all the words. When you read in token counts in the bullet points below, you will only store words that appear in this file.
- Read the metadata.csv file (mentioned in Question 1). You should use the csv library to read the file and make a list of the ids of each text written only in English.
- For each English-only text in your list of file ids, open that file and read the token counts for that text. The token counts are stored in the /data/cs91r-s25/gutenberg/mirror/data/counts directory. You should not store the counts; simply keep track of all of the UNIQUE words across ALL of the texts. Using a set is helpful here.
- NOTE: Some files may exist but not have any counts in them! Be sure your program doesn't crash on those.
- Create a dictionary that maps file ids to consecutive integers starting from 0. These integers will become the ROW numbers in your matrix, so each integer must appear only once, starting from 0, and increase by 1 for each subsequent file id. The exact order of the file ids doesn't matter as long as each book has a unique integer. (A small sketch of this kind of mapping follows this list.)
- Create another dictionary that maps words to consecutive integers starting from 0. These integers will become the COLUMN numbers in your matrix, so again, each integer must only appear once, starting from 0. The order of the words doesn't matter.
- Use the pickle library to save the book dictionary to books.pkl.
- Similarly, save the words dictionary to words.pkl.
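If it helps, here is a minimal sketch of the kind of index mapping we have in mind (the variable names file_ids and vocab below are placeholders for whatever you build in setup.py):
# Sketch: building the id-to-row and word-to-column mappings (placeholder names).
file_ids = ["PG1524", "PG11", "PG2701"]   # e.g. the English-only ids you collected
vocab = {"castle", "whale", "rabbit"}     # e.g. the set of unique words you collected

# enumerate() hands back consecutive integers starting at 0,
# which is exactly what we need for row and column numbers.
books = {fid: i for i, fid in enumerate(file_ids)}
words = {w: j for j, w in enumerate(sorted(vocab))}

print(books)   # {'PG1524': 0, 'PG11': 1, 'PG2701': 2}
print(words)   # {'castle': 0, 'rabbit': 1, 'whale': 2}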
Background: Sparse matrices
- How many types are there in the PG1524_counts.txt file? Use a command-line tool to answer this.
- In Question 2, we said there were about 70K unique types across all of the English Project Gutenberg texts. If we were to fill in the Hamlet (PG1524) row with how often each type appeared, what percentage of the types would be storing a non-zero value?
If we assume that the percentage you got in the previous question is approximately true for all of the texts in Project Gutenberg, it’s pretty easy to see that a very large portion of the matrix we’re creating will be used to store the value 0, which seems like a huge waste of space. So instead of using a list of lists to store our matrix, we will use a sparse matrix. By default, these sparse matrices will assume that any unspecified values are 0, saving us quite a lot of memory.
There are many implementations of sparse matrices, and we will be using
two of them. One implementation will be useful when we need to build the
matrix, and one implementation will be useful when we need to do computations
using it. Both implementations are in the scipy.sparse
library.
We will start with the coo_matrix, or "coordinate matrix". The coo_matrix stores coordinates and their associated values. For example, if Hamlet was stored in row 4688, the word "castle" was stored in column 1105, and the word "castle" appeared 30 times in Hamlet, then the coo_matrix would store the fact that cell (4688, 1105) had the value 30.
We will use this constructor to create our coo_matrix
(adapted from
the reference page linked above):
coo_matrix((data, (r, c)), dtype=<datatype>)
to construct from three arrays:
data[:] the entries of the matrix, in any order
r[:] the row indices of the matrix entries
c[:] the column indices of the matrix entries
Where A[r[k], c[k]] = data[k].
dtype specifies the data type of the matrix.
To use the csr_matrix
, coo_matrix
, and np.int32
(described
below), you will need to import the following:
import numpy as np
from scipy.sparse import csr_matrix, coo_matrix
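As a quick illustration of how the constructor's pieces fit together (using the imports just above and made-up values, not the matrix in the question below), consider a tiny 2x3 matrix with only two nonzero entries:
# Sketch: a 2x3 coo_matrix with nonzero entries at (0, 2) and (1, 0).
data = [7, 9]   # the nonzero values
r = [0, 1]      # their row indices
c = [2, 0]      # their column indices

m = coo_matrix((data, (r, c)), shape=(2, 3), dtype=np.int32)
print(m.toarray())
# [[0 0 7]
#  [9 0 0]]
Here we pass shape explicitly; if it is omitted, the shape is inferred from the largest row and column indices.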
- Write python code that creates a coo_matrix called matrix to store the data below using the constructor shown above. By default, the dtype is set to a double-precision floating point number. We would like it to store a 32-bit integer. To do that, set dtype=np.int32.
0, 0, 2, 0, 0
0, 2, 0, 3, 0
4, 0, 1, 0, 0
0, 0, 0, 1, 5
If you saved your matrix in the variable matrix
, you can verify you are
correct by printing matrix.toarray()
.
The coo_matrix
is optimized for constructing sparse matrices. But it
is not a good format to use when you actually want to inspect values or
perform arithmetic operations. Instead, once we create the coo_matrix
, we
will want to convert it to a csr_matrix
.
Given the coo_matrix
you created, you can convert it to a csr_matrix
as
follows: matrix_csr = matrix.tocsr()
.
As before, you can verify this worked by printing matrix_csr.toarray()
.
Building the term-document matrix
Write a program called termdoc.py
that creates the term-document
matrix. Your program should:
- Read in the two pickle files you created in setup.py so that you have the mapping of file ids to integers and the mapping of words to integers.
- For every file id, read in the corresponding token counts. Recall that we need to store these token counts in a coo_matrix, and our goal is to use the coo_matrix constructor we used above to create it. Therefore, for each token count you read in, you'll need to append the token count to data, you'll need to append the row number to r, and you'll need to append the column number to c. You can find the row number from the dictionary that maps file ids to integers, and you can find the column number from the dictionary that maps words to integers.
- After reading in the token counts for all books and creating data, r, and c, create a coo_matrix.
- Convert your coo_matrix to a csr_matrix.
- Save your csr_matrix to a pickle file called matrix.pkl.
Background: Cosine similarity
Now that we have our term-document matrix saved, we can move on to finding documents that match a user’s search terms. To do this, we are going to use cosine similarity.
Recall that each row of the matrix represents the counts of the words we found in a single document. We will treat this row as a multidimensional vector (J&M section 6.3.1). You have likely seen two- and three-dimensional vectors, but in this matrix, since there are about 70K words in our vocabulary, each vector has approximately 70K dimensions.
Even though our vectors have a ridiculously large number of dimensions, we can still use standard mathematical techniques to determine how similar two vectors are. In particular, we will try to find the angle between the two vectors. If the angle between two vectors is small, then they are more similar; if the angle between two vectors is large, then they are less similar.
The value of each dimension in our vector is a word count. Each of these word counts has a non-negative value: it is either zero if the word did not occur in the document, or it is a positive value equal to the number of times it did occur. Therefore, the angle between any two vectors must be between 0 and 90 degrees.
Rather than find the angle directly, we will find the cosine of the angle between the two vectors because it is easy to calculate! When we use the cosine to determine how close two vectors are, we call this the "cosine similarity". The cosine similarity will be close to 1 when the two vectors are very similar, and close to 0 when the two vectors are very dissimilar. (See J&M section 6.4.)
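To make this concrete, here is a tiny sketch (with made-up 3-dimensional count vectors, not real document vectors) showing that cosine similarity is just the dot product of two vectors divided by the product of their lengths:
import numpy as np

a = np.array([2, 0, 1])   # pretend word counts for document A
b = np.array([4, 0, 2])   # document B points in the same direction as A
c = np.array([0, 3, 0])   # document C shares no words with A

def cosine(u, v):
    # dot product divided by the product of the vector lengths
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, b))   # ~1.0 (angle of 0 degrees: very similar)
print(cosine(a, c))   # 0.0  (angle of 90 degrees: very dissimilar)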
Querying the document collection
Write a program called query.py
that allows the user to enter
a query (space-separated words) and returns to the user a ranked
list of the documents that match the query.
This program will take one command-line argument which will help you distinguish between the three main ways you will run this program. For now, the command-line argument will always be "1" since this is part 1. (This will make more sense later, but for now, just make sure your query program reads and verifies that this is part 1.)
To do this, we will ask the user to enter a space separated list of words, e.g. "flower pollen", which we will call the "query". We will convert that query into a multidimensional vector with a 1 in the "flower" dimension, a 1 in the "pollen" dimension, and a 0 in every other dimension.
To create this query vector, we will first create a vector with the same
number of columns as the matrix we built and we will fill all of the
entries with 0. In order to make our computations efficient,
we will use the numpy.zeros
function to create an empty query vector.
Then, for every word in the query, we will set the corresponding column value to 1. If the user includes a word in their query that is not in your words dictionary, just ignore it.
Here is an example of how to use the numpy.zeros
function and how to set
values in the resulting vector:
import numpy as np # at the top of your program
# assumes you already have your matrix created and that it is called 'matrix'
num_rows, num_columns = matrix.shape # find the size of your matrix
query = np.zeros((1, num_columns)) # create a vector of all zeros
# assuming "castle" was dimension 1105 and you wanted to store the value 1
# in that dimension, you would say:
query[0, 1105] = 1 # note the unusual indexing!
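Putting those pieces together, a sketch of the query-to-vector conversion might look like the following (assuming words is the word-to-column dictionary you loaded from words.pkl, and num_columns comes from matrix.shape as shown above):
# Sketch: turn the user's query string into a query vector.
query_text = input("query: ")
query = np.zeros((1, num_columns))
for w in query_text.lower().split():
    if w in words:                 # ignore words not in the vocabulary
        query[0, words[w]] = 1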
Now that you have your query vector, you can compare that query vector to all of the vectors in your matrix and then sort them from most similar to least similar:
from sklearn.metrics.pairwise import cosine_similarity # top of your file
similarities = cosine_similarity(query, matrix).flatten()
ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)
The list ranked
contains tuples where the first element is the row
number and the second element is the similarity value. For example,
the highest match might be (22345, 0.15)
which means that whichever
book was in row 22345 has a cosine similarity of 0.15 between it and your
query vector.
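To turn a row number back into a book id when printing, one option (a sketch, assuming books is the dictionary you loaded from books.pkl) is to invert that dictionary once and look rows up in the inverted version:
# Sketch: invert the book-id -> row-number dictionary so we can go
# from a row number back to a book id when printing results.
row_to_book = {row: book_id for book_id, row in books.items()}
for row, sim in ranked[:5]:
    print(sim, row_to_book[row])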
At this point, complete your query.py
so that it:
- Reads in your words.pkl, books.pkl, and matrix.pkl files that you created in previous programs.
- Repeatedly performs the following steps:
  - Asks the user to enter a query (space-separated words)
  - Converts the query into a query vector
  - Uses cosine similarity to compare the query to every row in the matrix
  - Prints the cosine similarity, book id (e.g. "PG1524"), and the book's title, which you can get from the metadata.csv file, for the top 5 best matches.
- For the top five best matches to the query "flower pollen", display the cosine similarity, the book's id, and the book's title (which you can get from the metadata.csv file).
  - Look through the titles of the books returned, as well as the text of the books, which you can find in /data/cs91r-s25/gutenberg/mirror/data/text/. Do your results seem reasonable? Why or why not?
- Repeat the previous question using the query "flower". Do the results seem reasonable? Why or why not?
- Repeat the previous question using the query "the flower". Compare to the results you got for "flower". Do the results seem reasonable? Why or why not?
- Try a few other queries of your choosing. Which queries did you try? Were the results reasonable? Why or why not?
TF-IDF
Not all query terms are equally important, are they? That is, when you searched for "flower", it was important to find documents that contained the word "flower". But when you searched for "the flower", the cosine similarity algorithm treated "the" as equally important to "flower". That means that a document that has the word "the" a lot of times but doesn’t contain the word "flower" at all can still rank very highly. (In fact, it does! Did you find that when you answered the question above?)
To fix this, we will use a technique called tf-idf that will automatically downweight words that occur in a lot of documents. (See J&M section 6.5.)
At a high level, we are going to scale all of the term frequencies so that words that occur in many documents (the "document frequency") have less impact than words that occur in fewer documents.
- tf_w is the term frequency of the word: how often it appears in a document
- df_w is the document frequency of a word: how many documents a word appears in
- N is the number of documents in your collection
- ln is the natural log function
The formulation for calculating the adjusted tf value that we will use is:
adjusted tf_w = tf_w * ln(N / df_w)
Let’s try to adjust the term frequency for the word "castle" in Hamlet.
The word "castle" appears 30 times in Hamlet, so tf_w = 30
.
In the Project Gutenberg collection, surprisingly, there are 21204 English texts that contain the word "castle", so df_w = 21204. There are approximately 50K documents in Project Gutenberg, so N = 50000.
Plugging these values into the formula, the adjusted term frequency is approximately 25.7 instead of 30.
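If you want to check this kind of calculation in Python, a one-line sketch like the following works (math.log is the natural log):
import math
# adjusted tf for "castle" in Hamlet: tf_w = 30, df_w = 21204, N = 50000
print(30 * math.log(50000 / 21204))   # ~25.7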
- For each of the following three sub-parts, assume that there are 50K texts.
  - The word "flower" appears in 26651 texts. If its term frequency in a particular document is 30, what is the adjusted term frequency after using tf-idf?
  - The word "pollen" appears in 1781 texts. If its term frequency in a particular document is 30, what is the adjusted term frequency after using tf-idf?
  - The word "the" appears in all but 7 texts. If its term frequency in a particular document is 30, what is the adjusted term frequency after using tf-idf?
  - Reflect on the results from these calculations.
You should be sure you understand how tf-idf works! However, we will not write the code to implement tf-idf; rather, we will use the sklearn library to do it for us.
from sklearn.feature_extraction.text import TfidfTransformer
In the first step, we will use the
TfidfTransformer
to calculate the document frequency values for each word and then recalculate each
value in the matrix according to the tf-idf formula.
# create the tfidf_transformer
tfidf_transformer = TfidfTransformer()
# find the df values ("fit")
tfidf_transformer.fit(matrix)
# adjust the idf values to match the formulation we used in class
tfidf_transformer.idf_ -= 1
# apply tfidf to the matrix and save as tfidf_matrix
tfidf_matrix = tfidf_transformer.transform(matrix)
In the second step, after the user enters a query, we need to apply tf-idf to the query vector we made. However, we don’t need to recalculate the df values from the original matrix, so we just perform the "transform" step:
# apply tfidf to the query_vector and save as tfidf_query
tfidf_query = tfidf_transformer.transform(query_vector)
Add this code to your query.py
program. Be sure to only apply tf-idf to
your matrix once. You will need to apply tf-idf to each query that the user
enters. When computing the cosine_similarity
, use the tfidf_query
and tfidf_matrix
.
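In outline (a sketch of the control flow, not a full program, using the imports shown above), the tf-idf pieces slot in like this:
# Sketch: fit and transform the matrix exactly once, before the query loop.
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(matrix)                        # learn the df values once
tfidf_transformer.idf_ -= 1                          # match the class formulation
tfidf_matrix = tfidf_transformer.transform(matrix)   # reweight the whole matrix once

# Then, inside the query loop, transform each new query vector:
#   tfidf_query = tfidf_transformer.transform(query_vector)
#   similarities = cosine_similarity(tfidf_query, tfidf_matrix).flatten()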
- Compare the results you get when querying for "flower" vs "the flower". Are the results improved now that you implemented tf-idf? Why or why not?
OPTIONAL: Removing stop words
One common way to improve performance even further is to remove "stop words". These are very common words that (normally) provide very little information when performing these types of queries. You can find a list of common English stopwords here: /usr/local/share/nltk_data/corpora/stopwords/english.
Add code to your setup.py
and termdoc.py
programs so that any word
in that stopword list is ignored. Stopwords are not stored in your
words.pkl
file and you don’t add any words to your matrix if they are
in that list. Compare your results for the queries "flower" vs "the flower".
Matching words to a query
We are now on part 2, so the command-line argument to your program will change to be "2" instead of "1".
I know this lab seems long, but the good news is you’re way more than half done!
In the second part of the lab, we're going to build a different kind of cooccurrence matrix. In the previous part, we built a term-document matrix where rows were documents, columns were the words in our vocabulary, and each cell stored the count of how many times each word appeared in each document.
In this part of the lab, we will build a term-term matrix, where rows and columns will both be the words in our vocabulary (J&M section 6.3.3). Each cell will store the count of how many times we’ve seen one word near another word in a large text corpus.
We will use data from the 2012 Google 5-grams dataset to figure out which words appear near other words. For example, the following 5-grams appear in this dataset:
1 large banana , peeled 88
1 medium banana , peeled 77
1 ripe banana , peeled 157
1 small banana , peeled 78
a peeled banana . _END_ 238
an unpeeled banana . _END_ 42
large ripe banana , peeled 50
peeled a banana . _END_ 166
peeled a banana and ate 46
peeled the banana . _END_ 62
In the 5-grams shown above, the word "banana" is the center word. Around the center word in each 5-gram are 4 other words. We will say that the "banana" document records the frequency of each other word in these contexts. For example, the word "ripe" appears in two 5-grams. One 5-gram appeared 157 times, the other 50 times. So in the "banana" document, we will say that we saw the word "ripe" appear 207 times. Similarly, the word "peeled" appears 962 times in the "banana" document.
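For the curious, here is a rough sketch of the kind of counting involved (you do not need to write this; the variable names are hypothetical, and it ignores the word filtering described below):
from collections import defaultdict

# Sketch: accumulate context counts for the center word of each 5-gram.
# Each input line looks like: "1 ripe banana , peeled 157"
counts = defaultdict(lambda: defaultdict(int))   # counts[center][context] = total
line = "1 ripe banana , peeled 157"
*gram, freq = line.split()            # the five tokens and their frequency
freq = int(freq)
center = gram[2]                      # the middle word of the 5-gram
for position, context in enumerate(gram):
    if position != 2:                 # every word except the center is a context word
        counts[center][context] += freq

print(counts["banana"]["ripe"])       # 157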
Processing 600 million Google 5-grams to create this matrix is tedious and slow. You’ve already done things like this in class, so we won’t make you do it again. Instead, we are providing you with 3 pickle files:
- matrix.pkl: the term-term matrix that contains word cooccurrences derived from the Google 5-gram corpus.
- books.pkl: the list of word vectors (ROWS) included in the matrix. The word "banana" was the word we were building the vector for in the example above.
- words.pkl: the list of context words (COLUMNS) included in the matrix. The words "ripe" and "peeled" were the context words in the example above.
We chose these names to be maximally similar to the previous part of the lab.
We filtered the set of words so that only words that appeared in
/usr/share/dict/words
were included (or if there was a lowercase match).
This means that many proper nouns will be excluded, but it makes the matrix
a much more reasonable size to work with.
Note that because this is a term-term matrix, it could be the case that the
list of words in books.pkl
and the list of words in words.pkl
would be the
same, but because of the particulars of how the Google 5-gram set is constructed,
they are slightly different.
Querying the matrix
If your code from the previous part works according to the specifications
above, you should be able to run your query.py
program with two minor
modifications:
- If your command-line argument is "1", use the pickle files that you created in the previous part, but if your command-line argument is "2", use the three pickle files stored in /data/cs91r-s25/ngrams/pickles.
- In part 1, when you printed results, you printed both the book dictionary key (e.g. "PG1524") and the title from metadata.csv. In part 2, when your command-line argument is 2, you will only print out the book dictionary key (e.g. "spaghetti").
Now when you enter a query, e.g. "spaghetti", you will get a ranked list of the words that "spaghetti" appears with most frequently.
- What are the 5 words that appear most frequently with the word "spaghetti"?
- What are the 5 words that appear most frequently with days of the week? Try a bunch of them and see if there are any interesting results. (Note: all of the words in the matrix are lowercased, so use "monday", "tuesday", etc.)
- What are the 5 words that appear most frequently with "hamlet"? What about with the two words "hamlet castle"? What about "hamlet castle denmark"?
- Try some other words or phrases. Did you find anything interesting? If you got results you didn't expect, can you explain why you got these results?
Comparing words to each other
ALMOST DONE! There isn’t a lot of code left to write!
This is the last part of the lab, so your command-line argument is now 3.
In this part of the lab, we want you to try something a little bit different. The good news is that almost all of the code you wrote in the previous part of the lab will work here. The only difference is that instead of comparing a query vector to the matrix, we will compare two rows of the matrix to each other.
You will ask the user to enter a single word, such as "hamlet". The word "hamlet" is represented by a single row in the matrix. You can find
out which row it is by looking it up in the dictionary you read from
books.pkl
. Now, instead of creating a query vector, you will use the
actual vector in the matrix as your query vector.
In the first part, you transformed the query using tf-idf before running cosine similarity. In this part, though, since the query is already in the matrix and the matrix has already been transformed, you will not transform the query vector again.
When you produce the ranked list of results, the most similar result will be the query word itself. So, when you print out the most similar other words, be sure not to include the query word itself.
Modifications to your original program
In the first two parts (command-line argument 1 or 2), you asked the user to enter a query which could be composed of any number of words. You then created a query vector from that string and applied tfidf to it. Finally, you compared that query vector to every vector in the matrix and printed the results.
In the final part (command-line argument 3), you will use the same pickle files as you did in part 2. But instead of asking for a query string as you did in the previous parts, you will ask the user for a single word. Do not create a new query vector: instead, your query vector will be the row in the matrix that corresponds to the word the user entered:
tfidf_query = tfidf_matrix[row_index]
Finally, compare that tfidf_query
to every vector in the matrix and
print the results. When printing the results, do not print the match between
the query and its own row in the matrix.
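One simple way to do that (a sketch, using the ranked and row_index names from earlier; row_to_word is a hypothetical inverted version of the books.pkl dictionary) is to filter the query's own row out of the ranked list before printing:
# Sketch: drop the query word's own row, then print the top matches.
ranked = [(row, sim) for row, sim in ranked if row != row_index]
for row, sim in ranked[:5]:
    print(sim, row_to_word[row])   # row_to_word: row number -> word (hypothetical name)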
Hopefully you can make just a few small changes to your existing code to make this work!
Explore
- Using the word "spaghetti", how do your results compare between mode 2 and mode 3? Reflect on the differences that you find.
- Try some other words in modes 2 and 3 (maybe run in different terminals at the same time) to compare the results you get and discuss. Which version seemed "better"? What does it mean to be better? Is it subjective? Do some kinds of words work better one way and others work the other way? What insights do you think might explain the differences you are seeing?
OPTIONAL: Comparing novels
You can do the same thing you did in part 3 with the novels without too much difficulty. Add a mode 4 that compares rows in the same way that mode 3 did, but use the pickle files from mode 1.
- OPTIONAL: What did you find!? Share your results and thoughts.
How to turn in your solutions
Edit the README.md
file that we provided to answer each of the questions
and add any discussion you think we’d like to know about.
Be sure to commit and push all changes to your python files.
If you think it would be helpful, use asciinema to record a terminal session and include it in your README.md.