git@github.swarthmore.edu:cs35-f20/lab08-<your-teamname>
Due on Wednesday, November 18th (by the end of the day, U.S. Eastern Daylight Time).
In this lab, you will
Implement a linear dictionary data structure.
Use the linear dictionary to implement a hash table data structure.
Use the hash table together with a provided priority queue to allow a user to find popular tweets in a database.
The program you will write in this lab loads a database of tweets from a text file. It then allows a user to identify the most popular tweets in the data file either from a particular user or on a particular day. To do this efficiently, we will make use of both hash tables and the top-k algorithm from priority queues.
Your starting code can be found in the appropriate repository for your team. The following is a description of the repository’s initial comments; bolded files are those which you must change in completing the lab.
adts/
: As with your last lab, the ADT interface declarations are stored here. The most notable addition for this lab is stlMaxPriorityQueue.h
.
linearDictionary.h
/linearDictionary-inl.h
: The LinearDictionary
class you will be implementing to store collisions within your HashTable.
hashFunctions.h
/hashFunctions.cpp
: The functions used to hash values of different types.
hashTable.h
/hashTable-inl.h
: The HashTable
class you will be implementing.
ioUtils.h
/ioUtils.cpp
: A utility library which will read tweets from a provided data file.
main.cpp
: The entry point for the query tool program.
manualTests.cpp
: A sandbox test program for you to use while you develop your program.
tests.cpp
: The unit testing program you will use to confirm that your LinearDictionary
and HashTable
are correct.
tweet.h
/tweet.cpp
: A definition of a Tweet
class; each instance holds the data for a single tweet.
The URL for your repository will be
git@github.swarthmore.edu:cs35-f20/lab08-<your-teamname>
where your-teamname
has been selected at random. You should be able to find your team repository by visiting the course organization on GitHub.
LinearDictionary
We recommend that you implement a LinearDictionary
as a vector of key, value pairs. All of the operations should run in \(O(n)\) time. You may be thinking: "Why would I want such an inefficient implementation of the Dictionary
ADT?" The answer is that, for very small dictionaries, simpler implementations perform much better as they have a smaller constant overhead.
We will use the LinearDictionary
to represent the chain of collisions in our HashTable
. Because we expect each chain to be relatively short, the LinearDictionary
provides a nice abstraction that will be efficient enough for our purposes.
Unit tests have been provided for this lab; once you have completed your LinearDictionary
implementation, the appropriate unit tests should pass.
vector
In order to implement the remove
operation of your LinearDictionary
, you’ll need to learn about how to remove from a vector
.
The vector
class provides a means to remove an element from the end of a list (pop_back
) but doesn’t give a clear interface for removing elements e.g. from an arbitrary index. Vectors provide a general function called erase
which can delete entire contiguous regions:
vector<int> vec = ...;
vec.erase(vec.begin() + start_index, vec.begin() + end_index);
The above function works much like a Python slice: the start index is included but the end index is not. Thus, vec.erase(vec.begin() + 1, vec.begin() + 3)
would remove the second and third elements (indices 1 and 2) from the vector. If you want to delete a single element from a vector, you can do something like this:
vector<int> vec = ...; // whatever defines this vector
int index = ...; // e.g. 4
vec.erase(vec.begin() + index, vec.begin() + index + 1);
HashTable
Once your LinearDictionary
implementation is complete, you can begin implementing a chaining hash table as we discussed in lecture. You will find a HashTable
class declared in hashTable.h
. Here are some important details about this class:
We recommend that you use a dynamically allocated array of statically allocated LinearDictionary
objects to represent the buckets of your HashTable
. This is not a requirement, but the static allocation of the LinearDictionary
objects will prevent you from having to allocate and deallocate each bucket yourself. You can also use a dynamically allocated array of dynamically allocated LinearDictionary
objects if you prefer.
One of the constructors allows the user to specify a maximum load factor; the other should use a sensible default value (e.g. 0.75
).
All operations on your HashTable
(other than getKeys
and getItems
) must run in \(O(1)\) average time.
The HashTable
must maintain a reasonable load factor in order to operate efficiently. You will likely want to create some private helper methods to assist with increasing the table’s capacity when the load factor becomes too high. (Don’t forget to document those methods according to the Coding Style Requirements below!)
Note that hashing functions have been provided in hashFunctions.h
and hashFunctions.cpp
for keys of type int
and string
.
Once you have completed your HashTable
implementation, the appropriate unit tests should pass.
Once your implementations of the LinearDictionary
and HashTable
are complete, you’ll be ready to implement the main program. Your program should take as its only command-line argument the name of a file containing a collection of tweets. We have provided such a file in test_data
called tweets.txt
which contains tweets in a specific format: there are nine lines per tweet containing the tweet’s ID, post date, username, screenname, text, reply count, retweet count, like count, and its URL. You do not need to write code to read this file. You will find an appropriate function declared in ioUtils.h
which will return to you a List
of Tweet
objects. Note that these Tweet
objects behave like the pair
template class in that they can be directly copied (rather than needing to be dynamically allocated).
Your program should then enter a loop to allow the user to query the database of tweets. One user query consists of two choices. First, the user chooses to search either by username or date. Then, the user chooses to rank by retweet count or like count. Once you’ve been given those two pieces of information, you will show the top five tweets ranked according to the user’s selection among the tweets with the matching search criteria.
For example, a user might choose to search by username and provide the username "probirdrights". The user might then choose to rank by like count. The query tool should then show the top five tweets from that user ranked by likes. Once it has displayed those tweets, the program will then ask the user for another query. Here’s an example of this scenario:
Please choose from the following options: A) Search by username B) Search by date C) Quit ? A Please enter your search username: probirdrights Search by (r)etweet count or (l)ike count? l @probirdrights (2020-03-14) WASH FREQUENT. MANTAIN SOCIAL DISTANCE. https://t.co/3L2xPK8qOk Rpl: 170 Rtw: 27654 Lik: 75033 URL: https://twitter.com/ProBirdRights/status/1238931210610434048 @probirdrights (2017-06-27) https://t.co/82WcpSEVZf Rpl: 406 Rtw: 20907 Lik: 53170 URL: https://twitter.com/ProBirdRights/status/879840865740480513 @probirdrights (2019-04-07) why birds decide to go from being big dinosaucers to small puff u ask??? to. fit. inside. dortitos. bag. Rpl: 120 Rtw: 9464 Lik: 45640 URL: https://twitter.com/ProBirdRights/status/1115086578051719168 @probirdrights (2020-01-08) WHO PUT A RED SOCK IN THE WASHINGMACHINE HOW I'M GONNA GO TO WORK LIKE THIS https://t.co/ito2P6mOui Rpl: 99 Rtw: 9000 Lik: 42027 URL: https://twitter.com/ProBirdRights/status/1215056643286634496 @probirdrights (2013-08-16) I am feel uncomfortable when we are not about me? Rpl: 182 Rtw: 19106 Lik: 35247 URL: https://twitter.com/ProBirdRights/status/368542088897372161
Note that your output doesn’t have to look exactly like this, but it should contain the same information.
Your query program must satisfy the following complexity requirements:
You must be able to search for the tweets which match the user’s criteria (e.g. username is "ProBirdRights") in average case \(O(1)\) time once the query is given.
You must be able to print the top five tweets in \(O(n)\) time, where \(n\) is the number of tweets that meet the search criteria.
Here are some algorithmic tips to accomplish the above. You can take as much time as you need to set up, so your program will likely proceed according to a few steps:
Read in all the tweets using the helper function declared in ioUtils.h
.
Create some dictionaries to organize the tweets. One dictionary, for instance, could map usernames to a List
of all of the tweets by that user. Another could map dates to all of the tweets on that date.
Enter a query loop to use those dictionaries. Now that your data is stored in these dictionaries, you can get to that data in average case constant time.
When selecting the top five tweets, be careful how you handle the lists. There is a toVector
method on STLList
which runs in linear time. You can then create a second vector containing the priorities and values. Make sure not to sort the list (which is \(O(n \log n)\) time) and definitely don’t iterate through the list (which, since it’s a linked list, would be \(O(n^2)\) time!).
The second criterion above makes use of heapification as we discussed in lecture. You can use the vector
constructor of the STLMaxPriorityQueue
class to heapify the vector into a priority queue. Note that your list from the dictionary will contain Tweet
objects while the STLMaxPriorityQueue
requires a vector
containing pair
s between the tweet’s priority and the tweet itself. It’s okay to create a whole new vector
full of pairs, though, because copying everything from your List
into this new vector
only takes linear time!
Note that creating the dictionaries at the start of the program may take a couple seconds. It’s worth the effort, though, because your program will run pretty quickly once the user issues queries to you. A smaller data file, test_data/small.txt
, is available if you find that the normal test_data/tweets.txt
is slowing down your testing process.
Here’s an example run of the program with multiple queries. Again: your output doesn’t have to look exactly like this, but it should contain the same information.
$ ./queryTool test_data/tweets.txt Loading tweets... Processing tweets... File test_data/tweets.txt loaded. Welcome to the tweet database query tool! Please choose from the following options: A) Search by username B) Search by date C) Quit ? A Please enter your search username: bigbenclock No tweets with username "bigbenclock" found. Please choose from the following options: A) Search by username B) Search by date C) Quit ? A Please enter your search username: big_ben_clock Search by (r)etweet count or (l)ike count? l @big_ben_clock (2020-10-02) BONG BONG BONG BONG Rpl: 3 Rtw: 41 Lik: 232 URL: https://twitter.com/big_ben_clock/status/1312044682004836353 @big_ben_clock (2020-10-07) BONG BONG BONG Rpl: 3 Rtw: 48 Lik: 220 URL: https://twitter.com/big_ben_clock/status/1313841776571146240 @big_ben_clock (2020-10-15) BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG Rpl: 1 Rtw: 18 Lik: 208 URL: https://twitter.com/big_ben_clock/status/1316876783292014593 @big_ben_clock (2020-10-03) BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG Rpl: 3 Rtw: 33 Lik: 178 URL: https://twitter.com/big_ben_clock/status/1312528375265681408 @big_ben_clock (2020-10-20) BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG Rpl: 3 Rtw: 30 Lik: 166 URL: https://twitter.com/big_ben_clock/status/1318688460866244609 Please choose from the following options: A) Search by username B) Search by date C) Quit ? A Please enter your search username: dog_feelings Search by (r)etweet count or (l)ike count? l @dog_feelings (2019-04-01) i wasn’t a good dog today april fools. i was so good Rpl: 1046 Rtw: 54176 Lik: 430999 URL: https://twitter.com/dog_feelings/status/1112883529279393792 @dog_feelings (2020-03-10) the human has been working from home the last couple days. and every so often. they let me participate in the video calls. all the other humans cheer when they see me. i am the only thing holding their company together Rpl: 1770 Rtw: 64640 Lik: 405958 URL: https://twitter.com/dog_feelings/status/1237451266508218368 @dog_feelings (2020-04-01) today. i wasn’t a very good dog april fools. i was so good Rpl: 1001 Rtw: 45354 Lik: 404346 URL: https://twitter.com/dog_feelings/status/1245501343126695939 @dog_feelings (2018-06-12) sometimes. the human presses their noggin against mine. to figure out what i’m thinking. so i just think really hard. about how much i love them. and hope they figure it out Rpl: 1638 Rtw: 80307 Lik: 397607 URL: https://twitter.com/dog_feelings/status/1006707052624990209 @dog_feelings (2018-04-01) today. i wasn’t a very good dog april fools. i was so good Rpl: 603 Rtw: 69469 Lik: 362870 URL: https://twitter.com/dog_feelings/status/980620017732636672 Please choose from the following options: A) Search by username B) Search by date C) Quit ? B Please enter your search date: 2020-02-01 Search by (r)etweet count or (l)ike count? l @catfoodbreath (2020-02-01) Wishing you a Saturday night at home with your cat and a great blanket. And cheesy snacks. Rpl: 13 Rtw: 30 Lik: 326 URL: https://twitter.com/CatFoodBreath/status/1223796805181345793 @suethetrex (2020-02-01) Important @DustinGrowick fan art. Rpl: 3 Rtw: 24 Lik: 280 URL: https://twitter.com/SUEtheTrex/status/1223618509881647106 @catfoodbreath (2020-02-01) Thing One: Hon, why is the cat meowing at the closet? Thing Two: How should I know? *Thing One looks in closet* Thing One: Your blankie's in here? Let's put it back on the couch. Thing Two: No, we need the red one for the game. Thing One: Cats before football. Thing Two: *sigh* Rpl: 9 Rtw: 13 Lik: 219 URL: https://twitter.com/CatFoodBreath/status/1223781298726166528 @catfoodbreath (2020-02-01) There better not be any sports yelling this weekend. I have big couch plans. Rpl: 21 Rtw: 12 Lik: 172 URL: https://twitter.com/CatFoodBreath/status/1223632753243164678 @catfoodbreath (2020-02-01) Thing Two just put Yellow Blanket in the closet and put a red blanket on the couch. ????????????????????? Rpl: 25 Rtw: 7 Lik: 166 URL: https://twitter.com/CatFoodBreath/status/1223742503733596161 Please choose from the following options: A) Search by username B) Search by date C) Quit ? X Invalid selection. Please choose from the following options: A) Search by username B) Search by date C) Quit ? C Goodbye! $
For this lab, your program is required to run without memory errors or leaks. You should use valgrind
as you proceed in your testing to track memory errors. When you have a complete first draft of your implementation:
Commit and push what you have (in case something goes wrong).
Run valgrind ./tests
to make sure that the test program does not cause memory errors.
Once memory errors are cleared, run valgrind --leak-check=full ./tests
to identify and correct memory leaks.
Note: valgrind
may take several minutes to run on the unit tests for this lab.
As usual, you will also be required to observe good coding practices:
Your C++ code must have proper and consistent indentations.
You must have proper and consistent usage of spacing and braces for if
, else
, for
, and while
conditions as well as class definitions and code blocks.
When you are finished, you should have
A completed implementation of LinearDictionary
A completed implementation of HashTable
A completed query tool program
Your program must conform to the complexity requirements above by using hashtables and heapification
Your program must handle errors without crashing
No memory errors
No memory leaks
Code which conforms to the style requirements
Remember to document your newly-declared fields and methods!
A completed lab assignment survey from each student