Lab 7: N-Gram Viewer
Due on Wednesday, April 17 at 11:59 PM. This is a team lab. You and your assigned lab partner(s) will complete this lab together. Make sure that you are familiar with the Partner Etiquette guidelines. You may discuss the concepts of this lab with other classmates, but you may not share your code with anyone other than course staff and your lab partner(s). Do not look at solutions written by students other than your team. If your team needs help, please post on the Piazza forum or contact the instructor or ninjas. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.
We will be using Teammaker to form teams. You can log in to that site to indicate your preferred partner. Once you and your partner have specified each other, a GitHub repository will be created for your team. If you have any trouble using Teammaker, contact your instructor.
- Overview
- Starting Code
- Part I: N-Gram Viewer Using a Max Heap
- Part II: Written Lab
- Summary of Requirements
Overview
For this lab you will use AVL-tree and max-heap data structures to construct an n-gram viewer, a way of summarizing common phrases in large collections of documents. You will also implement the bubbleUp and bubbleDown operations for a max heap. Finally you will complete some written work to test your understanding of balanced BSTs and heaps.
Starting Code
Your starting code can be found in the appropriate repository for your team. The following is a description of the repository’s initial contents; bolded files are those which you must change in completing the lab.
adts/
andtest_data/
: As usual, you have simlinks to the ADT interface declarations and testing data.maxHeap.h
,maxHeap-private-inl.h
,maxHeap-inl.h
: The max-heap class you will be implementing.nGramViewer.h
,nGramViewer.cpp
: The class that does the work of parsing documents and counting n-grams.main.cpp
: The user interface to the n-gram viewer.nGramUtils.h
,nGramUtils.cpp
: These provide functions for reading files and converting between string and vector representations of n-grams.tests.cpp
andtest*.cpp
: The unit testing program you will use to confirm that your code is correct.view_*.sh
: scripts to simplify running your program on large data sets.manualTests.cpp
: A sandbox test program for you to use while you develop your program.WrittenLab.tex
andwritten-trees/
, where you will complete written questions about AVL trees and heaps.
Part I: N-Gram Viewer Using a Max Heap
Your first task for this lab is to complete the implementation of a max heap.
Code for all of the public methods has been provided for you, but those methods call the private bubbleUp
and bubbleDown
methods, which you will need to implement.
The MaxHeap
class represents data internally using a vector
of priority/value pairs.
An item’s index in the vector corresponds to its position in a level-order traversal of the tree.
As a result, the index of any node’s parent or children can be computed from its index.
You will implement helper functions for computing these parent/child indices.
It is up to you whether to implement bubbleUp
and bubbleDown
recursively or iteratively.
The parentIndex
, leftChildIndex
, and rightChildIndex
functions should take no more than a couple of lines. You should not move on to the the main application until you have thoroughly tested your MaxHeap
.
N-Grams
Once you have implemented and tested your max heap functions, you can use your max heap along with an AVL tree to implement an n-gram viewer. In computational linguistics, an n-gram is a phrase consisting of n words. Because it can be hard for computational models to determine where meaningful phrases start and end, a common analysis technique is to extract all sequences of n words from a collection of documents as the first step in an analysis. Perhaps surprisingly, just by extracting sequences of n words and counting how many times each of those sequences appear, we can often learn a lot about a collection of documents.
Probably the most famous use of n-grams is the Google Ngram Viewer, which allows users to plot the frequency of various phrases over time. This data set was created by extracting n-grams from the entire collection of books Google has scaned from libraries around the world. In this lab, we will be creating an n-gram data set from a much smaller corpus of documents, but we will perform a similar operation of extracting all phrases of length n and counting how many times they occur.
Your task is to implement an n-gram viewer that can read documents, count the occurrences of n-grams in those documents, and identify the most common n-grams in the corpus.
In the NGramViewer
class, the addDocument
method should read input files and add their contents to the words
data field.
Once documents have been read, we want to count the n-grams in those documents with the buildNGrams
method.
This method constructs an AVL tree with n-grams as keys and their occurrence counts as values.
After this tree has been built, the occurrences
method allows the user to look up specific n-grams, and the getTopNGrams
method returns the most common n-grams in the corpus (and their counts).
Be sure to examine nGramUtils.h
before implementing this class; we have provided helper methods for e.g., parsing files and converting vectors to strings.
If the corpus of documents contains w words and u unique ngrams, then your NGramViewer
methods should have the following runtimes:
buildNGrams
should take O(w * log(u) ) time to scan the words and construct an AVL tree of unique n-grams.occurrences
should take O( log(u) ) time to perform a lookup in the AVL tree.getTopNGrams
should take O( u + k * log(u) ) time to find the top k n-grams. To achieve this, you must use the heapify operation that constructs a heap in linear time.
The main.cpp
program should implement a user interface to the NGramViewer
class.
This interface should take as command line arguments n (the phrase size) and a list of (at least one) filenames of documents to be read.
Your program should then:
- Print the number of files read using the
getFilenames
method. - Print the number of words read using the
wordCount
method. - Ask the user how many top n-grams they want to view, and print information about the top k n-grams using the
getTopNGrams
method. - Allow the user to look up other n-grams using the
occurrences
method, repeating until the user enters an empty string (""
).
When reading n-grams from cin
, the getline
method that we’ve seen before for reading files is helpful:
string phrase;
getline(cin, phrase);
vector<string> ngram = extract_ngram(phrase);
Note that if you have used cin >> k
previously to get other input, like the number of top n-grams, you may need to call getline
twice: once to clear the newline that followed the prior input, and once to get the line with the phrase.
Your main program should not crash if the occurrences
method generates a runtime error.
This can be accomplished with a try/catch block:
try{
//code that might generate an exception
}catch(...){
//code that runs if the exception is generated
}
If at any time during the execution of the try
an exception is generated, execution will immediately jump to the closest enclosing catch
.
Example output from my main function can be found here.
Note that the scripts view_*.sh
provide a convenient way to run the main program on a large number of files.
Reminder: The vector
STL Class
We will be making much more extensive use of the vector<T>
class this week, including vectors of strings to represent n-grams and a vector of priority/value pairs inside the max heap data structure.
Here is a reminder of how some of the vector operations relate to list operations we’re more familiar with:
Operation | List code |
vector code |
---|---|---|
Insert at front | myList.insertAtHead(value) |
no simple equivalent |
Insert at back | myList.insertAtTail(value) |
myVector.push_back(value) |
Determine size | myList.getSize() |
myVector.size() |
Get element by index | myList.get(index) |
myVector.at(index) |
Set element by index | no equivalent | myVector[index] = value |
Remove from back | myList.removeTail() |
myVector.pop_back() |
Note the myVector.at(index)
syntax is different from what we listed last lab.
This is because indexing with myVector[index]
doesn’t work well if you have a pointer to the vector.
Also remember that the pop_back
method is void
; you must retrieve the last element yourself if you want to see that element before it is removed, and that vector
s can be copied just like pair
s. Consider the following code:
List<int>* listPointer1 = new STLList<int>(); // create a pointer to a new List
List<int>* listPointer2 = listPointer1; // copy the pointer
listPointer1->insertAtTail(4); // add an element
cout << listPointer2->getSize() << endl; // prints 1; they point to the same List
List<int> list1; // create a statically-allocated list
List<int> list2 = list1; // illegal! Lists doesn't know how to copy themselves!
vector<int> vector1; // create a statically-allocated
vector1.push_back(4);
vector1.push_back(6);
vector<int> vector2 = vector1; // vectors do know how to copy themselves
cout << vector2.size() << endl; // prints 2 (since both elements were copied)
vector1[0] = 2; // assigns 2 to the first position of vector 1
cout << vector2[0] << endl; // prints 4; vector 2 is a different vector
Testing
We have provided a number of tests for the MaxHeap
and NGramViewer
classes.
To use the tests, you will need to make
and then run ./tests heap
or ./tests viewer
.
There are also larger tests you can run with ./tests bigData
, but this will take at least 20 seconds, and is not ideal for development, but rather to verify that you have implemented your program efficiently.
You are also strongly encouraged to implement your own testing in manualTests.cpp
and to add to the unit tests in either testMaxHeap.cpp
or testNGramViewer.cpp
.
Implementation Strategy
This lab involves writing a lot of code, so it is useful to plan an approach to your work. Below is a strategy which should organize your efforts; you are not required to proceed in this order, but it might help.
-
In
maxHeap-private-inl.h
, implement the indexing functions for finding parent/child nodes first, then work onbubbleUp
andbubbleDown
. At this point you should be able to pass./tests heap
. You should also test the heapify operation (constructing a heap from avector
of priority/value pairs). -
Implement and test the
NGramViewer
class. TheaddDocument
method should take advantage of theload_document
function declared innGramUtils.h
. ThebuildNGrams
method should use theSTLBST
class defined inadts
. ThegetTopNGrams
method should use theMaxHeap
class you implemented. -
Implement and test the
main
program. The scriptsview_small_corpus.sh
, etc. are useful for testing on large data sets.
Memory Errors
For this lab, your program is required to run without memory errors or leaks. You should use valgrind
as you proceed in your testing to track memory errors. When you have a complete first draft of your implementation:
- Commit and push what you have (in case something goes wrong).
- Run
valgrind ./tests
to make sure that the test program does not cause memory errors. - Once memory errors are cleared, run
valgrind --leak-check=full ./tests
to identify and correct memory leaks.
Coding Style Requirements
You are required to observe good coding style, as detailed in prior labs.
Part II: Written Lab
For this part of your assignment, you will give written answers much in the same way as you did in previous labs. Your submission must be in a typeset PDF format.
textree.py
This part of the lab involves drawing trees. This is difficult in LaTeX (and in most diagram software), so we’ve given you a Makefile
rule that will make things much easier. This rule calls the script textree.py
, which has been provided for you. Between this and the provided WrittenLab.tex
, all you need to do is draw some ASCII art trees in text files and your PDF will be built for you.
In the written-trees
directory, there are a series of files named e.g. problem1.1.tree
. The textree.py
script turns these tree files into .tex
code which will draw those trees in the PDF. To complete the written part of the assignment, you just need edit the .tree
files to contain the right trees. The following rules apply in these .tree
files:
- Blank lines and lines starting with a
#
are a comment (like in Python) and are ignored. - Lines starting with a
:
are treated as LaTeX text. You can write your explanation of what’s happening on those lines. - Lines containing only numbers, letters, and spaces are taken to be a binary tree. Each line is a level of the tree.
- Lines containing only symbols are ignored. This allows you to include
/
and\
symbols if you want to draw your tree in the text file.
For instance, this example diagram is produced by the following .tree
file.
: This is a sample BST.
2
1 3
# This line is a comment.
: I can put slashes into my tree if I want; they don't do anything.
2
/ \
1 3
: Each node decides its parent by the text that is closest to it. For instance,
: the 3 below is the left child of 4 (and not the right child of 1) because the
: 4 is closer to the three. The blank lines are ignored just like lines with
: symbols are.
2
1 4
3
AVL Tree Questions
Show the right rotation of the root of the following tree. Be sure to specify the X, Y, and Z subtrees used in the rotation.
The following tree has an imbalance at 13. Show the rotations that fix this imbalance. Be sure to specify the X, Y, and Z subtrees used in the rotation.
Using the AVL tree algorithm, insert the key 70 into the following tree. Show the tree before and after rebalancing.
Using the AVL tree algorithm, remove the key 220 from the following tree. Show the tree before and after each rebalancing.
Heap Questions
Note that these questions rely upon the discussion of heaps in lecture.
Show the addition of element 22 to max-heap below. First, show the addition of 22 to the tree; then, show each bubbling step.
Show the removal of the top element of this max-heap. First, show the swap of the root node; then, show each bubbling step.
Consider the sequence of elements[7,4,2,6,9,1,3]. Using the representation discussed in class, show the tree to which this sequence corresponds. Then, show the heapification of this tree; that is, show how this tree is transformed into a heap. Demonstrate each bubbling step.
Summary of Requirements
When you are finished, you should have
- Completed implementations of the
MaxHeap
andNGramViewer
classes - A main program that provides a user interface to the n-gram viewer
- No memory errors
- No memory leaks
- Code which conforms to the style requirements
- A typeset PDF file containing your answers to the written questions
When you are finished with your lab, please fill out the post-lab survey. You should do this independently of your partner.