Data Structures and Algorithms

Lab 8: King James Programming

Due on Sunday, April 3rd at 11:59 PM. This is a team lab. You and your assigned lab partner(s) will complete this lab together. Make sure that you are familiar with the Partner Etiquette guidelines. You may discuss the concepts of this lab with other classmates, but you may not share your code with anyone other than course staff and your lab partner(s). Do not look at solutions written by students other than your team. If your team needs help, please post on the Piazza forum or contact the instructor or ninjas. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

Overview

Source: OpenClipArt.org

In this lab, you will modify an implementation of a binary search tree to create an AVL tree. Operations on AVL trees have better worst-case time complexity than operations on simple binary search trees. We will make use of these operations to construct a generator which, given a collection of input files, can create novel-looking, fairly realistic sentences. This lab is inspired by and named after the single-topic blog King James Programming, which uses similar (albeit more sophisticated) techniques to achieve a similar goal.

In comparison to previous labs, the number of lines of code you must add to the starting repository to get a completed, correct submission is very small. Those lines of code are conceptually weighty, however, and you can expect to spend a fair amount of time debugging your AVL tree data structure. Make sure to test thoroughly!

Your team’s repository can be found in the usual location: git@github.swarthmore.edu:cs35-s16/lab08-<team-name>.

Starting Code

The starting files for this lab include a complete implementation of a binary search tree. The following is a description of each file in the starter code; files you will need to modify are shown in bold.

Existing data structures. As with the last lab, you are provided implementations of List, Queue, and Pair.
- circular_array_list.h
- circular_array_queue.h
- list.h
- ordered_collection.h
- pair.h
- queue.h
The AVL tree. Although the files have been renamed to e.g. avl_tree.h (as opposed to linked_bst.h), these files contain an implementation for a simple binary search tree. Your job will be to update these files to establish and maintain the AVL tree balance properties.
- bst.h: The declaration of the BST interface.
- avl_tree.h: The declaration of the AVLTree template class. You’re welcome to add private methods to this class if you like, but you don’t have to do so.
- avl_tree_node.h: The declaration of the AVLTreeNode template class. You’re also allowed to add private methods to this class.
- avl_tree__definitions.h: The implementation of the AVLTree methods.
- avl_tree_node__definitions.h: The implementation of the AVLTreeNode methods. You’ll spend most of your time writing code here.
A simple trigram model. We’ll use a trigram model to pick words to form our sentences. See below for further explanation.
- trigram_model.h/trigram_model.cpp: The declaration and definition of the TrigramModel class.
A place to implement your program.
- kjp.h/kjp.cpp: A library which can aggregate and generate sentences using a TrigramModel.
- kjp_main.h/kjp_main.cpp: A place for your kjp_main function.
- program.cpp: The file in which the “real” main function is stored (which we keep separate for testing).
A program for your tests.
- test_utils.h/test_utils.cpp
- tests.cpp

Using Your Previous Code

Source: OpenClipArt.org

If you prefer, you may use the code you submitted for Lab 7 in this assignment. If you do so, you should bear the following in mind:

Your files from the previous lab are named things like linked_bst.h. This assignment requires them to be renamed to e.g. avl_tree.h. This also goes for the classes within those files: LinkedBSTNode would have to be renamed to AVLTreeNode, for instance.
The AVL tree skeleton provided here has declared some extra private methods (such as rotate_left). You’ll need to add those new methods to your AVLTree and AVLTreeNode classes. In particular, the check_invariants methods are a requirement.
If anything is wrong with your Lab 7 code, those bugs could also affect your Lab 8 grade.

For most groups, it’s probably easier to read over the correct implementation of the BST provided in this lab than it is to adapt Lab 7 code for this assignment.

Implementation Strategy: The AVL Tree

The implementation of this lab can be accomplished in the following steps:

Source: OpenClipArt.org

Add a height field to the AVLTreeNode class to track nodes’ heights. * You’ll need to modify the constructor and the methods of AVLTreeNode to use this field and to keep it up to date. * The new height-related helper methods (such as left_subtree_height and recalculate_height) will be helpful here. * Specifically, recalculate_height should be called on any node whose height may have changed. * subtree_height should get much simpler! * Make sure to test your new height implementation. You should be able to copy tests from Lab 7 to help you make sure that e.g. subtree_height still works correctly.
Add calls to rebalance_self to each place in the AVLTree where nodes are added or removed. * rebalance_self should be called on every node between the root and the modified node. This includes the places where e.g. subtree_put recurses. * For now, just assume that rebalance_self does the right thing: it rebalances the node if the node needs rebalancing. You’ll implement that method in a moment.
Implement rotate_left and rotate_right. Then, implement rebalance_self. * Write this code very carefully; it may help to draw out diagrams of what you think is happening in each case. * Remember to consider the different cases of rebalancing. For each line of code, you should be able to describe which cases it handles. Are you in the LL case? The RL case? * Make sure to test this behavior thoroughly! It’s key to making the data structure perform correctly.
- Write out valid trees.
- Determine how they need to be rebalanced once a mapping is added or removed.
- Write code to create AVLTreeNodes in the shape of that tree and perform the addition or removal.
- Write tests to make sure the resulting AVLTreeNodes have the right keys and children in the right places.
- Don’t forget to test cases in which the tree doesn’t rebalance!
Implement check_invariants. * This method does nothing but check to make sure that the tree obeys the invariants we have established. * Call it from within your tests to verify that nothing has gone wrong with your tree operations!

Generating Sentences

The AVL tree implementation is the majority of the work in this lab due to how sensitive the implementation is. We’ll now use that structure to do something interesting: generating sentences from example data. We’ll accomplish this by reading several text files which have examples of sentences (most of which were obtained from Project Gutenberg) and teaching the computer to recognize basic patterns in the sentences it sees.

Trigrams

Source: OpenClipArt.org

Broadly speaking, a trigram is a group of three ordered elements. Here, a trigram is three words that have been discovered in order in a document. For instance, the sentence

We have a lot of fun in the CS department.

contains trigrams like “have a lot”, “lot of fun”, and “the CS department.” Since we’re focusing on sentence generation, we’ll also imagine that the start and end of the sentence contain invisible symbols that begin or end sentences; that is, the above text also contains the trigrams “[[START]] We have” and “CS department. [[END]]”. We’ll use these invisible symbols to know where to start when generating sentences and when to stop adding more words. In our code, we’ll use the variables START_SENTENCE and END_SENTENCE rather than writing these strings by hand.

Generating Sentences with Trigrams

We can use trigrams to get a rough idea of which words can follow which other words in a sentence. From the above, we learn that the word “lot” can follow the words “have a” and that “department.” can follow the words “the CS”. Imagine that we combined the trigrams from the above sentence with the trigrams in the following sentence:

I saw a lot of penguins in the newspaper.

This tells us that “penguins” can follow “lot of” and that “in” can follow “of penguins”. We also knew from the previous sentence that “CS” can follow “in the”. Combining everything that we know about the trigrams in these two sentences gives us reason to believe that we can write

We have a lot of penguins in the CS department.

since every trigram in that new sentence can be found in one of the previous two sentences. We can visualize how each pair of words leads to a third word as follows:

The `TrigramModel` Class

To keep track of trigrams, you’ll implement the TrigramModel class that has been provided for you. This class has two important fields. One keeps track of “starting words”: those words we have seen immediately follow the “[[START]]” invisible symbol. We know that each starting word may start a sentence. This is merely a list of strings.

The other field keeps track of “following words”: those words which may follow two other words. To keep track of this, we use an AVL tree where keys are pairs of strings (the first two words in the trigram) and values are lists of strings (the possibilities for the third word in the trigram). We encapsulate this reasoning within the TrigramModel class so that we don’t have to remember how the TrigramModel keeps track of all of this information later.

Training the `TrigramModel` Class

After you implement the TrigramModel class, you will need to “train” it. Training is the process of providing information to this object so that it can be used to generate sentences.

You have been provided a function, read_strings_from_file, in kjp.h/kjp.cpp which will produce a list of all of the strings the named file. It will also add START_SENTENCE and END_SENTENCE symbols wherever it believes that sentences start or end. (This function is not perfect, but it’s sufficient for our uses in this lab.) Once you’ve read the list of strings from a file, you only need to call the add_sequence method of your TrigramModel for each trigram that appears in that list. This should cause your TrigramModel to learn which words can follow which other words.

Generating Sentences from a `TrigramModel`

Source: OpenClipArt.org

Once you’ve trained your TrigramModel, the algorithm for generating a sentence is fairly easy:

Create an empty string accumulator.
Pick a random first word. Add it to the string accumulator.
Remember this first word as the second position in a trigram; the first position of the trigram is initially START_SENTENCE.
Pick a new random word for the third position of your trigram.
As long as your trigram’s third position is not END_SENTENCE: 1. Add that symbol to your sentence accumulator. 2. Move each word of the trigram one position to the left (discarding the first word). 3. Pick a new random word for the third position.

Your Command-Line Interface

On the command-line, your program should accept a number of sentences to generate and any number of data files. (The user must provide at least one.) For instance, the command

./kjp 4 test_data/Moby_Dick.txt

should train a TrigramModel on the contents of Moby Dick and then generate four sentences. The command

./kjp 4 test_data/Moby_Dick.txt test_data/Crime_and_Punishment.txt

will train a single TrigramModel on both Moby Dick and Crime and Punishment and then generate four sentences. In order to convert the string "4" to the integer value 4, you can use the stoi function. This function throws an exception if the number is invalid, which you should be sure to catch so you can give an appropriate error message and exit your kjp_main function gracefully.

The test_data directory contains a variety of documents for you to use, although some excellent results can be found by combining Pride and Prejudice with Fairy Tales by the Brothers Grimm:

$ ./kjp 3 test_data/Pride_and_Prejudice.txt test_data/Fairy_Tales_by_The_Brothers_Grimm.txt

Then she asked the true golden bird; and it was an inexhaustible subject of Bingley.

His reception, however, was dead.

"I beg your pardon, madam, for your wife.'

Bear in mind that the technique we are using is not a clever one and so you’ll often wind up with grammatically incorrect or otherwise meaningless sentences:

$ ./kjp 3 test_data/Pride_and_Prejudice.txt test_data/Fairy_Tales_by_The_Brothers_Grimm.txt

On the way she was already dark night.

Collins.

Why then does this place him!"

Don’t worry about this; you can learn more sophisticated techniques in other classes. If you want to test your sentence generation logic, try combining the two very simple files in your test_data directory and see if you get their different permutations:

$ ./kjp 8 test_data/simple_document_1.txt test_data/simple_document_2.txt

Here is a language.

C++ is a language.

Here is a sentence.

C++ is a sentence.

Here is a sentence.

Here is a language.

C++ is a language.

C++ is a language.

Testing

You are not required to test your program with e.g. RUN_MAIN as in previous labs, but you should still write tests that exercise the functionality of TrigramModel. Some tests have been provided for you which should do this. You are also required to have a rich set of tests for your AVLTree and AVLTreeNode classes; indeed, those tests should be written before you move on to the sentence generation.

Memory

Your program is required to run without any memory leaks or errors; you must free up all of the memory that you allocate and you must only use memory that you have allocated and initialized. We will use valgrind to determine if you have done this correctly, so you should as well. You should also consider using valgrind to find bugs in your program as you proceed; it’s one of two very powerful C++ debugging tools that we’ve discussed.

Coding Style Requirements

You are required to observe some good coding practices:

Source: OpenClipArt.org

You should pick meaningful variable names.

// Good
int* pixels = new int[size];
// Bad
int* p = new int[size];

You should use correct and consistent indentation. Lines of code within a block (that is, surrounded by { and }) should be indented four spaces further than the lines surrounding them.
```
// Good
if (condition) {
    cout << "Test" << endl;
}
// Bad
if (condition) {
cout << "Test" << endl;
    }
```
You should use a block whenever possible, even if it’s not necessary. This helps you avoid subtle or messy bugs in the future.
```
// Good
if (condition) {
    cout << "Something" << endl;
}
// Bad
if (condition)
    cout << "Something" << endl;
```
Any new methods or fields in your header files should have comments explaining their purpose and behavior. You are permitted to omit documentation for methods that are inherited from other classes; that is, if your class has a foo method because its superclass has a foo method, you don’t need to document that method.
```
// Good
public:
    /**
     * Saves this image object to a file.
     * @param file The already-open stream into which this image should be written.
     * @return The number of pixels in the saved image.
     * @throws runtime_error If the image could not be saved due to an I/O error.
     */
    int save(ofstream& file);

// Bad
public:
    int save(ofstream& file);
```
Your method/field documentation does not have to be in the format above, but you must describe the method’s behavior, its parameters, its return value, and any exceptions that it may throw. (If you’re indifferent, the above syntax is a good one to know; it’s a de facto standard used by Javadoc, Doxygen, and other tools that automatically process source code comments into other formats like searchable webpages.)

Peer Review

As this is a team assignment, you are required to complete a peer review of your lab partner or partners. You must do this even if everything is fine. You should complete this peer review after you have turned in your assignment. The peer review comes in the form of a simple Google Form and can be accessed here; in most cases, it will likely take less than a minute to complete. If you have trouble accessing this peer review, please make sure that you are logged into Google using your Swarthmore credentials.

Your peer review will be kept private; only course staff will have access to this information. You can also update your peer review after you have submitted it. However, if you do not submit a peer review, you will lose points on your lab. Please don’t forget!

Summary of Requirements

When you are finished, you should have

A completed implementation of AVLTree
A kjp_main which allows us to generate sentences from existing test files using the command-line
The ability to handle bad command-line arguments
Appropriate tests for these features
No memory errors!
Code which conforms to the style requirements above
A completed peer review

Lab 8: King James Programming

Overview

Starting Code

Using Your Previous Code

Implementation Strategy: The AVL Tree

Generating Sentences

Trigrams

Generating Sentences with Trigrams

The TrigramModel Class

Training the TrigramModel Class

Generating Sentences from a TrigramModel

Your Command-Line Interface

Testing

Memory

Coding Style Requirements

Peer Review

Summary of Requirements

The `TrigramModel` Class

Training the `TrigramModel` Class

Generating Sentences from a `TrigramModel`