Lab 8: King James Programming
Due on Sunday, April 3rd at 11:59 PM. This is a team lab. You and your assigned lab partner(s) will complete this lab together. Make sure that you are familiar with the Partner Etiquette guidelines. You may discuss the concepts of this lab with other classmates, but you may not share your code with anyone other than course staff and your lab partner(s). Do not look at solutions written by students other than your team. If your team needs help, please post on the Piazza forum or contact the instructor or ninjas. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.
Overview
In this lab, you will modify an implementation of a binary search tree to create an AVL tree. Operations on AVL trees have better worst-case time complexity than operations on simple binary search trees. We will make use of these operations to construct a generator which, given a collection of input files, can create novel-looking, fairly realistic sentences. This lab is inspired by and named after the single-topic blog King James Programming, which uses similar (albeit more sophisticated) techniques to achieve a similar goal.
In comparison to previous labs, the number of lines of code you must add to the starting repository to get a completed, correct submission is very small. Those lines of code are conceptually weighty, however, and you can expect to spend a fair amount of time debugging your AVL tree data structure. Make sure to test thoroughly!
Your team’s repository can be found in the usual location: git@github.swarthmore.edu:cs35-s16/lab08-<team-name>
.
Starting Code
The starting files for this lab include a complete implementation of a binary search tree. The following is a description of each file in the starter code; files you will need to modify are shown in bold.
- Existing data structures. As with the last lab, you are provided implementations of
List
,Queue
, andPair
.circular_array_list.h
circular_array_queue.h
list.h
ordered_collection.h
pair.h
queue.h
- The AVL tree. Although the files have been renamed to e.g.
avl_tree.h
(as opposed tolinked_bst.h
), these files contain an implementation for a simple binary search tree. Your job will be to update these files to establish and maintain the AVL tree balance properties.bst.h
: The declaration of theBST
interface.avl_tree.h
: The declaration of theAVLTree
template class. You’re welcome to add private methods to this class if you like, but you don’t have to do so.avl_tree_node.h
: The declaration of theAVLTreeNode
template class. You’re also allowed to add private methods to this class.avl_tree__definitions.h
: The implementation of theAVLTree
methods.avl_tree_node__definitions.h
: The implementation of theAVLTreeNode
methods. You’ll spend most of your time writing code here.
- A simple trigram model. We’ll use a trigram model to pick words to form our sentences. See below for further explanation.
trigram_model.h
/trigram_model.cpp
: The declaration and definition of theTrigramModel
class.
- A place to implement your program.
kjp.h
/kjp.cpp
: A library which can aggregate and generate sentences using aTrigramModel
.kjp_main.h
/kjp_main.cpp
: A place for yourkjp_main
function.program.cpp
: The file in which the “real”main
function is stored (which we keep separate for testing).
- A program for your tests.
test_utils.h
/test_utils.cpp
tests.cpp
Using Your Previous Code
If you prefer, you may use the code you submitted for Lab 7 in this assignment. If you do so, you should bear the following in mind:
- Your files from the previous lab are named things like
linked_bst.h
. This assignment requires them to be renamed to e.g.avl_tree.h
. This also goes for the classes within those files:LinkedBSTNode
would have to be renamed toAVLTreeNode
, for instance. - The AVL tree skeleton provided here has declared some extra private methods (such as
rotate_left
). You’ll need to add those new methods to yourAVLTree
andAVLTreeNode
classes. In particular, thecheck_invariants
methods are a requirement. - If anything is wrong with your Lab 7 code, those bugs could also affect your Lab 8 grade.
For most groups, it’s probably easier to read over the correct implementation of the BST provided in this lab than it is to adapt Lab 7 code for this assignment.
Implementation Strategy: The AVL Tree
The implementation of this lab can be accomplished in the following steps:
- Add a
height
field to theAVLTreeNode
class to track nodes’ heights. * You’ll need to modify the constructor and the methods ofAVLTreeNode
to use this field and to keep it up to date. * The new height-related helper methods (such asleft_subtree_height
andrecalculate_height
) will be helpful here. * Specifically,recalculate_height
should be called on any node whose height may have changed. *subtree_height
should get much simpler! * Make sure to test your newheight
implementation. You should be able to copy tests from Lab 7 to help you make sure that e.g.subtree_height
still works correctly. - Add calls to
rebalance_self
to each place in theAVLTree
where nodes are added or removed. *rebalance_self
should be called on every node between the root and the modified node. This includes the places where e.g.subtree_put
recurses. * For now, just assume thatrebalance_self
does the right thing: it rebalances the node if the node needs rebalancing. You’ll implement that method in a moment. - Implement
rotate_left
androtate_right
. Then, implementrebalance_self
. * Write this code very carefully; it may help to draw out diagrams of what you think is happening in each case. * Remember to consider the different cases of rebalancing. For each line of code, you should be able to describe which cases it handles. Are you in the LL case? The RL case? * Make sure to test this behavior thoroughly! It’s key to making the data structure perform correctly.- Write out valid trees.
- Determine how they need to be rebalanced once a mapping is added or removed.
- Write code to create
AVLTreeNode
s in the shape of that tree and perform the addition or removal. - Write tests to make sure the resulting
AVLTreeNode
s have the right keys and children in the right places. - Don’t forget to test cases in which the tree doesn’t rebalance!
- Implement
check_invariants
. * This method does nothing but check to make sure that the tree obeys the invariants we have established. * Call it from within your tests to verify that nothing has gone wrong with your tree operations!
Generating Sentences
The AVL tree implementation is the majority of the work in this lab due to how sensitive the implementation is. We’ll now use that structure to do something interesting: generating sentences from example data. We’ll accomplish this by reading several text files which have examples of sentences (most of which were obtained from Project Gutenberg) and teaching the computer to recognize basic patterns in the sentences it sees.
Trigrams
Broadly speaking, a trigram is a group of three ordered elements. Here, a trigram is three words that have been discovered in order in a document. For instance, the sentence
We have a lot of fun in the CS department.
contains trigrams like “have a lot”, “lot of fun”, and “the CS department.” Since we’re focusing on sentence generation, we’ll also imagine that the start and end of the sentence contain invisible symbols that begin or end sentences; that is, the above text also contains the trigrams “[[START]] We have” and “CS department. [[END]]”. We’ll use these invisible symbols to know where to start when generating sentences and when to stop adding more words. In our code, we’ll use the variables START_SENTENCE
and END_SENTENCE
rather than writing these strings by hand.
Generating Sentences with Trigrams
We can use trigrams to get a rough idea of which words can follow which other words in a sentence. From the above, we learn that the word “lot” can follow the words “have a” and that “department.” can follow the words “the CS”. Imagine that we combined the trigrams from the above sentence with the trigrams in the following sentence:
I saw a lot of penguins in the newspaper.
This tells us that “penguins” can follow “lot of” and that “in” can follow “of penguins”. We also knew from the previous sentence that “CS” can follow “in the”. Combining everything that we know about the trigrams in these two sentences gives us reason to believe that we can write
We have a lot of penguins in the CS department.
since every trigram in that new sentence can be found in one of the previous two sentences. We can visualize how each pair of words leads to a third word as follows:
The TrigramModel
Class
To keep track of trigrams, you’ll implement the TrigramModel
class that has been provided for you. This class has two important fields. One keeps track of “starting words”: those words we have seen immediately follow the “[[START]]” invisible symbol. We know that each starting word may start a sentence. This is merely a list of strings.
The other field keeps track of “following words”: those words which may follow two other words. To keep track of this, we use an AVL tree where keys are pairs of strings (the first two words in the trigram) and values are lists of strings (the possibilities for the third word in the trigram). We encapsulate this reasoning within the TrigramModel
class so that we don’t have to remember how the TrigramModel
keeps track of all of this information later.
Training the TrigramModel
Class
After you implement the TrigramModel
class, you will need to “train” it. Training is the process of providing information to this object so that it can be used to generate sentences.
You have been provided a function, read_strings_from_file
, in kjp.h
/kjp.cpp
which will produce a list of all of the strings the named file. It will also add START_SENTENCE
and END_SENTENCE
symbols wherever it believes that sentences start or end. (This function is not perfect, but it’s sufficient for our uses in this lab.) Once you’ve read the list of strings from a file, you only need to call the add_sequence
method of your TrigramModel
for each trigram that appears in that list. This should cause your TrigramModel
to learn which words can follow which other words.
Generating Sentences from a TrigramModel
Once you’ve trained your TrigramModel
, the algorithm for generating a sentence is fairly easy:
- Create an empty string accumulator.
- Pick a random first word. Add it to the string accumulator.
- Remember this first word as the second position in a trigram; the first position of the trigram is initially
START_SENTENCE
. - Pick a new random word for the third position of your trigram.
- As long as your trigram’s third position is not
END_SENTENCE
: 1. Add that symbol to your sentence accumulator. 2. Move each word of the trigram one position to the left (discarding the first word). 3. Pick a new random word for the third position.
Your Command-Line Interface
On the command-line, your program should accept a number of sentences to generate and any number of data files. (The user must provide at least one.) For instance, the command
./kjp 4 test_data/Moby_Dick.txt
should train a TrigramModel
on the contents of Moby Dick and then generate four sentences. The command
./kjp 4 test_data/Moby_Dick.txt test_data/Crime_and_Punishment.txt
will train a single TrigramModel
on both Moby Dick and Crime and Punishment and then generate four sentences. In order to convert the string "4"
to the integer value 4
, you can use the stoi
function. This function throws an exception if the number is invalid, which you should be sure to catch so you can give an appropriate error message and exit your kjp_main
function gracefully.
The test_data
directory contains a variety of documents for you to use, although some excellent results can be found by combining Pride and Prejudice with Fairy Tales by the Brothers Grimm:
$ ./kjp 3 test_data/Pride_and_Prejudice.txt test_data/Fairy_Tales_by_The_Brothers_Grimm.txt
Then she asked the true golden bird; and it was an inexhaustible subject of Bingley.
His reception, however, was dead.
"I beg your pardon, madam, for your wife.'
Bear in mind that the technique we are using is not a clever one and so you’ll often wind up with grammatically incorrect or otherwise meaningless sentences:
$ ./kjp 3 test_data/Pride_and_Prejudice.txt test_data/Fairy_Tales_by_The_Brothers_Grimm.txt
On the way she was already dark night.
Collins.
Why then does this place him!"
Don’t worry about this; you can learn more sophisticated techniques in other classes. If you want to test your sentence generation logic, try combining the two very simple files in your test_data
directory and see if you get their different permutations:
$ ./kjp 8 test_data/simple_document_1.txt test_data/simple_document_2.txt
Here is a language.
C++ is a language.
Here is a sentence.
C++ is a sentence.
Here is a sentence.
Here is a language.
C++ is a language.
C++ is a language.
Testing
You are not required to test your program with e.g. RUN_MAIN
as in previous labs, but you should still write tests that exercise the functionality of TrigramModel
. Some tests have been provided for you which should do this. You are also required to have a rich set of tests for your AVLTree
and AVLTreeNode
classes; indeed, those tests should be written before you move on to the sentence generation.
Memory
Your program is required to run without any memory leaks or errors; you must free up all of the memory that you allocate and you must only use memory that you have allocated and initialized. We will use valgrind
to determine if you have done this correctly, so you should as well. You should also consider using valgrind
to find bugs in your program as you proceed; it’s one of two very powerful C++ debugging tools that we’ve discussed.
Coding Style Requirements
You are required to observe some good coding practices:
-
You should pick meaningful variable names.
// Good int* pixels = new int[size]; // Bad int* p = new int[size];
-
You should use correct and consistent indentation. Lines of code within a block (that is, surrounded by
{
and}
) should be indented four spaces further than the lines surrounding them.// Good if (condition) { cout << "Test" << endl; } // Bad if (condition) { cout << "Test" << endl; }
-
You should use a block whenever possible, even if it’s not necessary. This helps you avoid subtle or messy bugs in the future.
// Good if (condition) { cout << "Something" << endl; } // Bad if (condition) cout << "Something" << endl;
-
Any new methods or fields in your header files should have comments explaining their purpose and behavior. You are permitted to omit documentation for methods that are inherited from other classes; that is, if your class has a
foo
method because its superclass has afoo
method, you don’t need to document that method.// Good public: /** * Saves this image object to a file. * @param file The already-open stream into which this image should be written. * @return The number of pixels in the saved image. * @throws runtime_error If the image could not be saved due to an I/O error. */ int save(ofstream& file); // Bad public: int save(ofstream& file);
Your method/field documentation does not have to be in the format above, but you must describe the method’s behavior, its parameters, its return value, and any exceptions that it may throw. (If you’re indifferent, the above syntax is a good one to know; it’s a de facto standard used by Javadoc, Doxygen, and other tools that automatically process source code comments into other formats like searchable webpages.)
Peer Review
As this is a team assignment, you are required to complete a peer review of your lab partner or partners. You must do this even if everything is fine. You should complete this peer review after you have turned in your assignment. The peer review comes in the form of a simple Google Form and can be accessed here; in most cases, it will likely take less than a minute to complete. If you have trouble accessing this peer review, please make sure that you are logged into Google using your Swarthmore credentials.
Your peer review will be kept private; only course staff will have access to this information. You can also update your peer review after you have submitted it. However, if you do not submit a peer review, you will lose points on your lab. Please don’t forget!
Summary of Requirements
When you are finished, you should have
- A completed implementation of
AVLTree
- A
kjp_main
which allows us to generate sentences from existing test files using the command-line - The ability to handle bad command-line arguments
- Appropriate tests for these features
- No memory errors!
- Code which conforms to the style requirements above
- A completed peer review