CS35: Lab 08

The main goals for this lab are to gain experience working with hash tables and to evaluate real-world data structure performance by running experiments. To that end, you will:

Implement a hash table using separate chaining.
Apply your hash table implementation to create a simple program to assist Scrabble players.
Compare the performance of your hash table implementation to that of a BST by running a series of experiments.

As with most assignments for the rest of the semester, you will be working with a partner. Both partners should be present and working on the code together. You will both be responsible for understanding all concepts, so dividing and conquering is not an option. The academic integrity policy applies to the entire pair; you cannot work or share code with anyone outside your partner.

You and your lab partner will share the same git repository, which is named lab08-<user1>-<user2>.git. Please access your lab by cloning the appropriate git repository, e.g.:

$ git clone git@github.swarthmore.edu:CS35-S18/lab08-lmeeden1-mgagne1.git

You can also get the ssh git url from CS35 github org.

Starting Code

Below is an overview of the files required for submitting this lab. Those highlighted in blue will require implementation on your part. Those highlighted in black are complete and should not be modified except for comments.

CS35 ADTs folder: a directory containing the C++ declarations of all the ADTs seen in CS35 so far, including Dictionary.h.
Makefile: the build instructions for this lab.
hashTable.h, hashTable-inl.h: the hash table implementation files.
hashFunctions.*: Simple hash functions for integer and string keys, similar to hashes we saw in class. If you decide to introduce new hash functions, add them here.
tests.cpp: Unit tests for your hash table implementation.
main.cpp: a simple Scrabble Assistant application.
scrabbleAssistant.*: a C++ class encapsulating the functionality of the Scrabble assistant.
WrittenLab.tex The $\LaTeX$ source for your experimental analysis portion of the lab.
experiment.cpp A program that runs experiments comparing performance of different dictionary implementations (hash table, binary search tree). We've coded the experiment framework for you. You just need to implement a single function in:
experimentUtil.* This file contains a single function called getExperimentKeys(int num_keys). The experiment framework calls this function to get the keys to put into the dictionaries during testing.
experimentClasses.*, experimentHashes.*: special C++ classes (and hash functions for those classes) you'll use to run your experiments.
manualTests.cpp: A "sandbox" test program that you can use while you develop your program.

Part I: Hash Table Implementation.

For the first part of this lab assignment, you will implement a hash table, by modifying hashTable.h and hashTable-inl.h. Unit tests have been provided; once your hash table implementation is complete, these unit tests should pass. Below are some implementation details and suggestions:

Your hash table implementation must use separate chaining.
Consider using an array of statically allocated vectors for the buckets in your hash table. You could also use a List or even a balanced BST. However, the difference in performance should be minimal because you should aim to have $O(1)$ elements in each bucket on average.
Similar to an ArrayList, you will need to expand capacity when there are too many elements in the hash table. However, what counts as "too many elements" is different with hash tables. Define the load factor of a hash table as the number of elements in the table, divided by its capacity (i.e., the numer of elements in the array). Your implementation should expand capacity whenever the load factor exceeds some maximum load factor threshold.
The HashTable class has two constructors. One allows the user to select a maximum load factor; the other should provide a reasonable default value (say 0.75).
In either case, select a reasonable default table capacity, say capacity=5.

Expanding Capacity

If the load factor ever exceeds the maximum load factor of your hash table, you will need to expand the table. This is conceptually similar to expanding capacity for ArrayList or a BinaryHeap. There are a couple of differences. One is that when to expand differs here -- it depends on the load factor, not on whether the array is full. A more subtle difference is that it's not enough to just copy table entries over. Each key/value pair in the hash table needs to be rehashed, because which bucket it goes into depends on the capacity of the table itself! Below we itemize what tasks need to be done to properly expand capacity.

First, copy the oldCapacity and oldTable somewhere safe
Second, double the capacity, and allocate a new table with the new capacity.
Third, go through each bucket in the oldTable (if only you had saved a pointer to this somewhere safe!)
For each bucket, look at each entry in that bucket's vector. Rehash the key with the new capacity, and insert into the corresponding bucket in the new table.
Finally, delete the oldTable.

Remember, your hash table should never have load factor higher than the maximum load factor. As soon as this happens, you should expand capacity.

Removing From a vector

The vector class provides a means to remove an element from the end of a list (pop_back) but doesn't give a clear interface for removing elements from the middle. We suggest two possible approaches.

In the first approach, you can use pop_back, by swapping the item at the end of the vector with the item at index i that you want to remove. Then simply call pop_back to remove the item that is now at the end.

The second approach uses some advanced features of the STL vector class. Vectors to provide a general method called erase which can delete entire contiguous regions:

vector myVector = ...;
myVector.erase(myVector.begin()+start_index, myVector.begin()+end_index);

The above function works much like a Python slice: the start index is included, but the end index is not, e.g., erasing with start/end indices (1,3) would remove the second and third indices (i.e. indices 1 and 2) from the vector. If you want to remove a single element from the vector, you can do something like:

vector myVector = ...; // whatever defines/initializes the vector
int index = ...;            // the index of the element you want to remove
myVector.erase(myVector.begin() + index, myVector.begin() + index+1);

For more information, consult the vector documentation in cplusplus.com.

Part II: Scrabble Assistant.

Once you have a working hash table, you can start writing your Scrabble Assistant. We'll design the application in two parts: a ScrabbleAssistant class and a separate main program. The main program (which we've written for you) provides a simple menu interface and makes calls to the ScrabbleAssistant, which encapsulates all of the Scrabble Assistant features. The ScrabbleAssistant class should load an English-language dictionary (allow the user to specify a dictionary, but use /usr/share/dict/words as a default) into a hash table and support the following functions:

spellCheck: given a word, this function should return true if and only iff the word is in the dictionary. Implementing this should be straightforward and should run in O(1) time.
anagramsOf: given a string of letters, this function returns a vector of all anagrams (i.e., permutations of these letters) which represent valid words For example, "cat" and "act" contain exactly the same letters. This function should also take constant time (aside from the cost of creating or copying the vector of anagrams).
findWords: given a string of letters, this function returns a vector of all legal words the player could spell with the given letters.

When we say that anagramsOf should take constant time, we mean that its running time should not depend on the size of the English-language dictionary you are using. The running time is however allowed to depend on the length of the string.

Implementing anagramsOf and findWords will require some algorithm design. You're welcome to include any data members you need for the ScrabbleAssistant, as well as any private helper functions you'd like.

In order to make anagramsOf run in constant time, it will be helpful to preprocess the word list in the ScrabbleAssistant constructor and maintain a table of anagrams. First, write a helper function to sort the letters in a string. (e.g. sortString("cat") should return "act"). You don't have to do this yourself; instead, use the std::sort function provided in the algorithm library.

#include <algorithm>
string sortLetters(string s) {
  sort(s.begin(), s.end());
  return s;
}

Then, create a dictionary that maps each sorted string to a vector of all words that sort to that string.

anagramsOf should provide all anagrams of a given set of letters, but won't be enough to provide all legal plays given a set of letters. To do this, you'll also need to evaluate all legal plays using fewer letters as well. Implementing findWords will require some clever design. One way to implement findWords is to break down the task into the following subtasks:

Compute the powerSet: given a string, this should return the power set of the string; that is, it should return the set of all subsets of the given characters. The following pseudocode implements powerSet recursively:

vector<string> stringPowerSet(word):
  if word length is 1:
    /* return a vector containing
       the empty string "" and the word
     */
    return ["", word]

  firstChar = word[0]
  /* substring is original word excluding the first char */
  substring = word.substr(1)
  smallerSet = stringPowerSet(substring)
  result = [] // an empty vector of strings
  for each word in smallerSet:
    add word to result
    add (firstChar+word) to result
  return result

Eliminating Duplicate Strings: If you use the powerSet function, you will likely find at some point that you have a vector of words that contain several duplicates. It might be helpful to create a removeDuplicates function that eliminates any extra copies of a word. There are several methods of accomplishing this:
1. Use a double-for loop that compares all pairs of words in the vector, looking for pairs that are the same. This will run in $O(n^2)$ time.
2. First, sort the vector of words. Then, make a linear scan over the list, eliminating adjacent duplicates. This will take $O(n \log n)$ time using an efficient sorting algorithm.
3. Create a hash table with words as keys. Then, insert each word in the hash table assuming it isn't already there. This should take $O(1)$ time per word, or $O(n)$ overall time.
Once you've written your powerSet and removeDuplicates function, you can combine them to find all legal words as follows:
1. Find the power set of the given string
2. Get all anagrams of each string in the power set
3. Accumulate all the anagrams in a single vector
4. Remove Duplicates

Once the above code is complete, your ScrabbleAssistant will be complete!

Part III: Dictionary Experiments.

The Experiment

For the final part of this lab, you'll run some experiments comparing the performance of several dictionary implementations, including a BST, your hash table using a good hash function, and your hash table using a suboptimal hash function. The experiment will add some number of elements to each dictionary and then retrieve them. It reports the average runtime for each insert and get operation.

There are two classes used in this experiment, both defined in experimentClasses. The StringIntGoodHash and StringIntBadHash classes both contain a string and an int data member. These classes are identical except for the hash functions, given in experimentHashes. It is your responsability to understand how these hash functions differ and exploit these differences in the experiment.

Your Tasks

Most of this code has been written for you in experiment.cpp. Your only implementation task is to implement getExperimentKeys function in experimentUtil.cpp. This function is used by the experiment to determine which keys to add to the dictionary. Your task is to create a set of keys tailor-made to cause several collisions for StringIntBadHash. Note that every key returned by getExperimentKeys must be distinct; the experiment will not run if you create duplicate keys.

Once you have written your getExperimentKeys function, use the experiment program to produce a graph much as you did for the mystery functions program from Lab03. This program does not require any arguments, so you can run it as e.g.:

$ ./experiment | gnuplot

This will both save a file called experiment.png as well as open a new window showing you the results of your experiment. If you successfully exploited the bad hash function, the runtime of the bad hash experiment should be comparitively high.

After completing your experiment, create a brief written summary that contains:

Your experimental graph
A description of how your getExperimentKeys function exploits the bad hash function.
A brief discussion of why the graph looks like it does, in terms of big-O complexity of the different dictionary implementations.

A couple of paragraphs should be sufficient. Please include a .tex or .pdf of your writeup, as well as the png file you created. If you provide .tex only and the grader cannot compile your writeup, you will receive no credit for the writeup.

Memory Management

Your program is required to run without memory errors; run valgrind to eliminate them. For this lab, memory leaks are not a priority. Small memory leaks are acceptable and will not significantly affect your grade. While "lost" memory should not occur in large amounts (e.g. millions of bytes), we encourage you to not spend hours and hours tracking down every last memory leak.

Commenting and Coding Style Requirements

Graders will assign minor penalties for poor commenting and coding style.
Please review the CS35 Coding Style for good C++ style.

Survey

When you have completed your lab, answer a few short survey questions in the file README.md.

Summary

When you have completed your lab, you should have

A working hash table implementation using separate chaining.
A working Scrabble Assistant.
A submitted PDF of your final WrittenLab, containing analysis of the experiment(s) you ran.
No memory errors!
Little to no memory leaks!
Code which conforms to the style guide above.
A completed survey in README.md

Submit

Once you are satisfied with your code, hand it in using git. Remember the add, commit, push development cycle. You can push as many times as you like, but only the most recent submission will be graded. You may want to run git status to confirm all modifications have been pushed.

CS35 Lab 08: Scrabble Assistant

Expanding Capacity

Removing From a vector

The Experiment

Your Tasks