CS35: Data Structures and Algorithms

Lab 7: N-Gram Viewer

Due on Wednesday, April 17 at 11:59 PM. This is a team lab. You and your assigned lab partner(s) will complete this lab together. Make sure that you are familiar with the Partner Etiquette guidelines. You may discuss the concepts of this lab with other classmates, but you may not share your code with anyone other than course staff and your lab partner(s). Do not look at solutions written by students other than your team. If your team needs help, please post on the Piazza forum or contact the instructor or ninjas. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

We will be using Teammaker to form teams. You can log in to that site to indicate your preferred partner. Once you and your partner have specified each other, a GitHub repository will be created for your team. If you have any trouble using Teammaker, contact your instructor.

Overview
Starting Code
Part I: N-Gram Viewer Using a Max Heap
Part II: Written Lab
Summary of Requirements

Overview

For this lab you will use AVL-tree and max-heap data structures to construct an n-gram viewer, a way of summarizing common phrases in large collections of documents. You will also implement the bubbleUp and bubbleDown operations for a max heap. Finally you will complete some written work to test your understanding of balanced BSTs and heaps.

Starting Code

Your starting code can be found in the appropriate repository for your team. The following is a description of the repository’s initial contents; bolded files are those which you must change in completing the lab.

adts/ and test_data/: As usual, you have simlinks to the ADT interface declarations and testing data.
maxHeap.h, maxHeap-private-inl.h, maxHeap-inl.h: The max-heap class you will be implementing.
nGramViewer.h, nGramViewer.cpp: The class that does the work of parsing documents and counting n-grams.
main.cpp: The user interface to the n-gram viewer.
nGramUtils.h, nGramUtils.cpp: These provide functions for reading files and converting between string and vector representations of n-grams.
tests.cpp and test*.cpp: The unit testing program you will use to confirm that your code is correct.
view_*.sh: scripts to simplify running your program on large data sets.
manualTests.cpp: A sandbox test program for you to use while you develop your program.
WrittenLab.tex and written-trees/, where you will complete written questions about AVL trees and heaps.

Part I: N-Gram Viewer Using a Max Heap

Your first task for this lab is to complete the implementation of a max heap. Code for all of the public methods has been provided for you, but those methods call the private bubbleUp and bubbleDown methods, which you will need to implement. The MaxHeap class represents data internally using a vector of priority/value pairs. An item’s index in the vector corresponds to its position in a level-order traversal of the tree. As a result, the index of any node’s parent or children can be computed from its index. You will implement helper functions for computing these parent/child indices.

It is up to you whether to implement bubbleUp and bubbleDown recursively or iteratively. The parentIndex, leftChildIndex, and rightChildIndex functions should take no more than a couple of lines. You should not move on to the the main application until you have thoroughly tested your MaxHeap.

N-Grams

Once you have implemented and tested your max heap functions, you can use your max heap along with an AVL tree to implement an n-gram viewer. In computational linguistics, an n-gram is a phrase consisting of n words. Because it can be hard for computational models to determine where meaningful phrases start and end, a common analysis technique is to extract all sequences of n words from a collection of documents as the first step in an analysis. Perhaps surprisingly, just by extracting sequences of n words and counting how many times each of those sequences appear, we can often learn a lot about a collection of documents.

Probably the most famous use of n-grams is the Google Ngram Viewer, which allows users to plot the frequency of various phrases over time. This data set was created by extracting n-grams from the entire collection of books Google has scaned from libraries around the world. In this lab, we will be creating an n-gram data set from a much smaller corpus of documents, but we will perform a similar operation of extracting all phrases of length n and counting how many times they occur.

Your task is to implement an n-gram viewer that can read documents, count the occurrences of n-grams in those documents, and identify the most common n-grams in the corpus. In the NGramViewer class, the addDocument method should read input files and add their contents to the words data field. Once documents have been read, we want to count the n-grams in those documents with the buildNGrams method. This method constructs an AVL tree with n-grams as keys and their occurrence counts as values. After this tree has been built, the occurrences method allows the user to look up specific n-grams, and the getTopNGrams method returns the most common n-grams in the corpus (and their counts).
Be sure to examine nGramUtils.h before implementing this class; we have provided helper methods for e.g., parsing files and converting vectors to strings.

If the corpus of documents contains w words and u unique ngrams, then your NGramViewer methods should have the following runtimes:

buildNGrams should take O(w * log(u) ) time to scan the words and construct an AVL tree of unique n-grams.
occurrences should take O( log(u) ) time to perform a lookup in the AVL tree.
getTopNGrams should take O( u + k * log(u) ) time to find the top k n-grams. To achieve this, you must use the heapify operation that constructs a heap in linear time.

The main.cpp program should implement a user interface to the NGramViewer class. This interface should take as command line arguments n (the phrase size) and a list of (at least one) filenames of documents to be read. Your program should then:

Print the number of files read using the getFilenames method.
Print the number of words read using the wordCount method.
Ask the user how many top n-grams they want to view, and print information about the top k n-grams using the getTopNGrams method.
Allow the user to look up other n-grams using the occurrences method, repeating until the user enters an empty string ("").

When reading n-grams from cin, the getline method that we’ve seen before for reading files is helpful:

string phrase;
getline(cin, phrase);
vector<string> ngram = extract_ngram(phrase);

Note that if you have used cin >> k previously to get other input, like the number of top n-grams, you may need to call getline twice: once to clear the newline that followed the prior input, and once to get the line with the phrase.

Your main program should not crash if the occurrences method generates a runtime error. This can be accomplished with a try/catch block:

try{
  //code that might generate an exception
}catch(...){
  //code that runs if the exception is generated
}

If at any time during the execution of the try an exception is generated, execution will immediately jump to the closest enclosing catch.

Example output from my main function can be found here. Note that the scripts view_*.sh provide a convenient way to run the main program on a large number of files.

Reminder: The `vector` STL Class

We will be making much more extensive use of the vector<T> class this week, including vectors of strings to represent n-grams and a vector of priority/value pairs inside the max heap data structure. Here is a reminder of how some of the vector operations relate to list operations we’re more familiar with:

Operation	`List` code	`vector` code
Insert at front	`myList.insertAtHead(value)`	no simple equivalent
Insert at back	`myList.insertAtTail(value)`	`myVector.push_back(value)`
Determine size	`myList.getSize()`	`myVector.size()`
Get element by index	`myList.get(index)`	`myVector.at(index)`
Set element by index	no equivalent	`myVector[index] = value`
Remove from back	`myList.removeTail()`	`myVector.pop_back()`

Note the myVector.at(index) syntax is different from what we listed last lab. This is because indexing with myVector[index] doesn’t work well if you have a pointer to the vector. Also remember that the pop_back method is void; you must retrieve the last element yourself if you want to see that element before it is removed, and that vectors can be copied just like pairs. Consider the following code:

List<int>* listPointer1 = new STLList<int>(); // create a pointer to a new List
List<int>* listPointer2 = listPointer1;  // copy the pointer
listPointer1->insertAtTail(4);           // add an element
cout << listPointer2->getSize() << endl; // prints 1; they point to the same List
List<int> list1;                         // create a statically-allocated list
List<int> list2 = list1;                 // illegal! Lists doesn't know how to copy themselves!

vector<int> vector1;                     // create a statically-allocated
vector1.push_back(4);
vector1.push_back(6);
vector<int> vector2 = vector1;           // vectors do know how to copy themselves
cout << vector2.size() << endl;          // prints 2 (since both elements were copied)
vector1[0] = 2;                          // assigns 2 to the first position of vector 1
cout << vector2[0] << endl;              // prints 4; vector 2 is a different vector

Testing

We have provided a number of tests for the MaxHeap and NGramViewer classes. To use the tests, you will need to make and then run ./tests heap or ./tests viewer. There are also larger tests you can run with ./tests bigData, but this will take at least 20 seconds, and is not ideal for development, but rather to verify that you have implemented your program efficiently. You are also strongly encouraged to implement your own testing in manualTests.cpp and to add to the unit tests in either testMaxHeap.cpp or testNGramViewer.cpp.

Implementation Strategy

Source: OpenClipArt.org

This lab involves writing a lot of code, so it is useful to plan an approach to your work. Below is a strategy which should organize your efforts; you are not required to proceed in this order, but it might help.

In maxHeap-private-inl.h, implement the indexing functions for finding parent/child nodes first, then work on bubbleUp and bubbleDown. At this point you should be able to pass ./tests heap. You should also test the heapify operation (constructing a heap from a vector of priority/value pairs).
Implement and test the NGramViewer class. The addDocument method should take advantage of the load_document function declared in nGramUtils.h. The buildNGrams method should use the STLBST class defined in adts. The getTopNGrams method should use the MaxHeap class you implemented.
Implement and test the main program. The scripts view_small_corpus.sh, etc. are useful for testing on large data sets.

Memory Errors

For this lab, your program is required to run without memory errors or leaks. You should use valgrind as you proceed in your testing to track memory errors. When you have a complete first draft of your implementation:

Commit and push what you have (in case something goes wrong).
Run valgrind ./tests to make sure that the test program does not cause memory errors.
Once memory errors are cleared, run valgrind --leak-check=full ./tests to identify and correct memory leaks.

Coding Style Requirements

You are required to observe good coding style, as detailed in prior labs.

Part II: Written Lab

For this part of your assignment, you will give written answers much in the same way as you did in previous labs. Your submission must be in a typeset PDF format.

`textree.py`

This part of the lab involves drawing trees. This is difficult in LaTeX (and in most diagram software), so we’ve given you a Makefile rule that will make things much easier. This rule calls the script textree.py, which has been provided for you. Between this and the provided WrittenLab.tex, all you need to do is draw some ASCII art trees in text files and your PDF will be built for you.

In the written-trees directory, there are a series of files named e.g. problem1.1.tree. The textree.py script turns these tree files into .tex code which will draw those trees in the PDF. To complete the written part of the assignment, you just need edit the .tree files to contain the right trees. The following rules apply in these .tree files:

Blank lines and lines starting with a # are a comment (like in Python) and are ignored.
Lines starting with a : are treated as LaTeX text. You can write your explanation of what’s happening on those lines.
Lines containing only numbers, letters, and spaces are taken to be a binary tree. Each line is a level of the tree.
Lines containing only symbols are ignored. This allows you to include / and \ symbols if you want to draw your tree in the text file.

For instance, this example diagram is produced by the following .tree file.

Example `.tree` File

: This is a sample BST.

 2
1 3

# This line is a comment.

: I can put slashes into my tree if I want; they don't do anything.

  2
 / \
1   3

: Each node decides its parent by the text that is closest to it.  For instance,
: the 3 below is the left child of 4 (and not the right child of 1) because the
: 4 is closer to the three.  The blank lines are ignored just like lines with
: symbols are.

  2

1   4

   3

AVL Tree Questions

Show the right rotation of the root of the following tree. Be sure to specify the X, Y, and Z subtrees used in the rotation.
The following tree has an imbalance at 13. Show the rotations that fix this imbalance. Be sure to specify the X, Y, and Z subtrees used in the rotation.
Using the AVL tree algorithm, insert the key 70 into the following tree. Show the tree before and after rebalancing.
Using the AVL tree algorithm, remove the key 220 from the following tree. Show the tree before and after each rebalancing.

Heap Questions

Note that these questions rely upon the discussion of heaps in lecture.

Show the addition of element 22 to max-heap below. First, show the addition of 22 to the tree; then, show each bubbling step.
Show the removal of the top element of this max-heap. First, show the swap of the root node; then, show each bubbling step.
Consider the sequence of elements[7,4,2,6,9,1,3]. Using the representation discussed in class, show the tree to which this sequence corresponds. Then, show the heapification of this tree; that is, show how this tree is transformed into a heap. Demonstrate each bubbling step.

Summary of Requirements

When you are finished, you should have

Completed implementations of the MaxHeap and NGramViewer classes
A main program that provides a user interface to the n-gram viewer
No memory errors
No memory leaks
Code which conforms to the style requirements
A typeset PDF file containing your answers to the written questions

When you are finished with your lab, please fill out the post-lab survey. You should do this independently of your partner.