Lab 7: Word Count
Due on Sunday, March 27th at 11:59 PM. This is a team lab. You and your assigned lab partner(s) will complete this lab together. Make sure that you are familiar with the Partner Etiquette guidelines. You may discuss the concepts of this lab with other classmates, but you may not share your code with anyone other than course staff and your lab partner(s). Do not look at solutions written by students other than your team. If your team needs help, please post on the Piazza forum or contact the instructor or ninjas. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.
Overview
For this lab, you will implement binary search tree (BST) data structure; you will then use it to implement a word-counting tool. Like the last team lab, your repository URL will be git@github.swarthmore.edu:cs35-s16/lab07-<team-name>
. You can see a list of all of your repositories on the Swarthmore GitHub in the lower-right side of the main page; make sure to switch your view to the organization for this class to see them.
Starting Code
This lab contains several starting files. The bolded filenames below are the files you will need to change to complete the lab.
- Existing data structures. You have already implemented a queue and a circular array list as part of your previous labs, so these data structures are provided to you here. You won’t need to change these files.
circular_array_list.h
circular_array_queue.h
list.h
ordered_collection.h
queue.h
- A data structure for heterogeneous pairs. This data structure is useful in building a BST.
pair.h
- A place to implement your BST.
bst.h
: The declaration of the BST interface. You should not change this file.linked_bst.h
: A declaration of a BST using linked nodes to form a tree.linked_bst__definitions.h
: The implementation of the methods ofLinkedBST
, separated into a different file for organizational purposes.linked_bst_node.h
: A declaration of aLinkedBSTNode
class for use in theLinkedBST
.linked_bst_node__definitions.h
: The implementation of the methods ofLinkedBSTNode
, separated into a different file for organizational purposes.
- A place to implement your program.
wordcount.h
/wordcount.cpp
: A library which can count up the number of words in a given file. You will write this code.wordcount_main.h
/wordcount_main.cpp
: A place for yourwordcount_main
function.program.cpp
: The file in which the “real”main
function is stored (which we keep separate for testing).
- A program for your tests.
test_utils.h
/test_utils.cpp
tests.cpp
Implementing a Binary Search Tree
Most of the work of this lab is implementing the binary search tree, which will happen in linked_bst__definition.h
. The LinkedBST
class is a template class with two templated type names: K
(the type of the keys in the BST) and V
(the type of the values). The LinkedBST
class itself tracks only two pieces of data:
size
, which contains the number of nodes in the BSTroot
, which contains theLinkedBSTNode
representing the root node of the BST
A single node stores information about its key, its value, and its left and right children. If a node does not have a child, then the corresponding pointer variable will be set to NULL.
Many of the methods in the binary search tree are natural to implement recursively. For this reason, there are a great many private methods in the starter code which word over some arbitrary subtree (represented by the top node in that subtree) which can call themselves. For instance, the height of a node is equal to one more than the largest of the heights of its two children. (The max
function can be found in the C++ algorithm
library.) Most of the public methods of LinkedBST
will be single-line calls to these private, recursive methods to begin the operation at the BST’s root node.
Implementation Strategy
Although most of the methods in this lab are fairly simple, there’s quite a lot to do. Below is an approach to writing these methods which should organize your efforts fairly well. You’re not required to use this approach, but it might help.
- Get started on
LinkedBSTNode
. 1. Implement the constructors and destructor ofLinkedBSTNode
. Also implement the getters and setters. This should be very simple and easy to complete quickly. 2. Implement thesubtree_height
method ofLinkedBSTNode
. This will be a good introduction to writing recursive helper methods. 3. See if the provided unit tests forsubtree_height
method pass now. Make sure to read over those provided unit tests; they build a tree ofLinkedBSTNode
objects and then check properties of them. You’ll need to write several such methods of your own. 4. Implement thesubtree_min_key
andsubtree_max_key
methods. Write tests for them. 5. Implement thefind_node
method. Usefind_node
to implementsubtree_contains
andsubtree_get
. 6. Write unit tests forsubtree_contains
andsubtree_get
. Don’t implement any more methods until these are working. 7. Implement and testsubtree_put
. Note that this method is not as simple as callingfind_node
: the key might not exist anywhere in the tree yet. You’ll have to find the appropriate place to put the data and, if necessary, create a newLinkedBSTNode
for it. - Switch to
LinkedBST
and use the methods you have written. 1. Implement the constructor, destructor, andget_size
andis_empty
methods. These should be trivial. 2. Implementget_height
,get_min_key
,get_max_key
,contains
, andget
. These should be easy to write and very similar to each other; each just checks to see if the BST is empty and either performs a default action (if the BST is empty) or calls one of the methods ofLinkedBSTNode
(if the BST is not empty). 3. Implement theput
method. This is similar to the above, but be sure to use the return value of the helper function to update your BST’ssize
appropriately. 4. Write tests for the above methods. Unlike yourLinkedBSTNode
tests, you should now be able to create theLinkedBST
within your unit tests, using the e.g.put
method to create theLinkedBST
(instead of creating the tree by hand). - Write the remaining methods simultaneously.
1. Implement
find_parent_node
onLinkedBSTNode
and write some simple tests for it. Test it with several keys: the root’s key, other keys in the tree, and keys that are not in the tree. 2. Implementsubtree_remove
onLinkedBSTNode
. This is probably the hardest method to implement in this lab; make sure to consider each of the different cases of removal (key not found, key found and node has one child, key found and node has two children, etc.). 3. Testsubtree_remove
. Make sure you write code that triggers each of the different removal cases. 4. Implementremove
onLinkedBST
usingsubtree_remove
. There isn’t a lot of work to do here, but you have to handle the empty tree and singleton tree cases specially. 5. Testremove
onLinkedBST
. 6. Implement each of the traversal methods onLinkedBSTNode
. 7. Implement and test each of the traversal methods onLinkedBST
. 8. Implement the level-order traversal ofLinkedBST
. (There is noLinkedBSTNode
helper method to go along with it.) - Now that you’ve implemented the
LinkedBST
, use it to write your word-counting program.
Traversals
The LinkedBST
class must provide implementations of traversal methods: methods which allow us to see every piece of data in the tree. A traversal is given by a list of key-value pairs; we will represent such a pair as a Pair
object. A Pair
is just a container that stores two pieces of data. For instance, a Pair<int,string>
is just an object with two fields (an int
and a string
) and appropriate getter and setter methods. Since a traversal is a list of such pairs, we write the return value of the traversal methods as List< Pair<K,V> >*
(where K
and V
are the key and value of the LinkedBST
).
The BST
interface supports four different kinds of traversal:
- A pre-order traversal puts the current node first, followed by its left subtree and then its right subtree. For the example here, the pre-order traversal would be the sequence
F
,B
,A
,D
,C
,E
,H
,G
,I
. - A post-order traversal instead puts the current node after the left and right subtrees. The post-order traversal of this example is the sequence
A
,C
,E
,D
,B
,G
,I
,H
,F
. - An in-order traversal puts the current node between the left and right subtrees. The in-order traversal of this example is the sequence
A
,B
,C
,D
,E
,F
,G
,H
,I
. (The fact that this is sorted is not a coincidence!) - A level-order traversal reads the nodes from left to right and top to bottom, the same way we read English test. For this tree, the level-order traversal is
F
,B
,H
,A
,D
,G
,I
,C
,E
.
The first three forms of traversal are quite similar and are natural to define recursively. The level-order traversal is somewhat different and requires a breadth-first search approach similar to that which you used in the previous lab.
Wikipedia has some nice pictures of these traversals for reference. Note that your program is adding the elements to a list rather than “displaying” them as described on that Wikipedia page.
A note about nested template types
You may have observed that the traversal type above was written List< Pair<K,V> >*
with spaces around the “Pair
” type argument. This is intentional; if we remove the spaces (“List Pair<K,V>>*
”), the C++ parser can get confused about whether >>
is two closing angle brackets or a single >>
operator (as with ifstream
). We include the spaces to make sure this doesn’t happen.
Counting Words
The interface for your word-counting program will be quite simple:
wordcount <file-to-read> <result-file> <traversal>
Here, <traversal>
is one of pre
, post
, in
, or level
. Your program should read each word in the input file, use the LinkedBST
to tally up how many of each word it contains, and then write the resulting traversal to an output file with one pairing per line. For instance, suppose document.txt
contains the following:
It is easier to ask forgiveness than it is to get permission.
If we run
wordcount document.txt results.txt in
then we would expect results.txt
to contain:
1 ask
1 easier
1 forgiveness
1 get
2 is
1 it
1 It
1 permission
1 than
2 to
Note that it
and It
count as different spellings; you should not change capitalization for this assignment. Note also that permission
does not include a period at the end of the word; you should eliminate any non-alphabetic characters from the beginning and end of each word as you read it from the file. You can use the substr
method of string
to accomplish this.
Testing
As usual, you have been provided RUN_MAIN
and CHECK_FILES_EQUAL
. You should write tests in tests.cpp
for your LinkedBST
data structure as well as for your wordcount
program.
Memory
Your program is required to run without any memory leaks or errors; you must free up all of the memory that you allocate and you must only use memory that you have allocated and initialized. We will use valgrind
to determine if you have done this correctly, so you should as well. You should also consider using valgrind
to find bugs in your program as you proceed; it’s one of two very powerful C++ debugging tools that we’ve discussed.
Coding Style Requirements
You are required to observe some good coding practices:
-
You should pick meaningful variable names.
// Good int* pixels = new int[size]; // Bad int* p = new int[size];
-
You should use correct and consistent indentation. Lines of code within a block (that is, surrounded by
{
and}
) should be indented four spaces further than the lines surrounding them.// Good if (condition) { cout << "Test" << endl; } // Bad if (condition) { cout << "Test" << endl; }
-
You should use a block whenever possible, even if it’s not necessary. This helps you avoid subtle or messy bugs in the future.
// Good if (condition) { cout << "Something" << endl; } // Bad if (condition) cout << "Something" << endl;
-
Any new methods or fields in your header files should have comments explaining their purpose and behavior. You are permitted to omit documentation for methods that are inherited from other classes; that is, if your class has a
foo
method because its superclass has afoo
method, you don’t need to document that method.// Good public: /** * Saves this image object to a file. * @param file The already-open stream into which this image should be written. * @return The number of pixels in the saved image. * @throws runtime_error If the image could not be saved due to an I/O error. */ int save(ofstream& file); // Bad public: int save(ofstream& file);
Your method/field documentation does not have to be in the format above, but you must describe the method’s behavior, its parameters, its return value, and any exceptions that it may throw. (If you’re indifferent, the above syntax is a good one to know; it’s a de facto standard used by Javadoc, Doxygen, and other tools that automatically process source code comments into other formats like searchable webpages.)
Peer Review
As this is a team assignment, you are required to complete a peer review of your lab partner or partners. You must do this even if everything is fine. You should complete this peer review after you have turned in your assignment. The peer review comes in the form of a simple Google Form and can be accessed here; in most cases, it will likely take less than a minute to complete. If you have trouble accessing this peer review, please make sure that you are logged into Google using your Swarthmore credentials.
Your peer review will be kept private; only course staff will have access to this information. You can also update your peer review after you have submitted it. However, if you do not submit a peer review, you will lose points on your lab. Please don’t forget!
Summary of Requirements
When you are finished, you should have
- A completed implementation of
LinkedBST
- A
wordcount_main
which provides a command-line interface to count words using a BST - The ability to handle bad command-line arguments
- Appropriate tests for these features
- No memory errors!
- Code which conforms to the style requirements above
- A completed peer review