CPSC 035: Lab 8: Query Tool

Due on Wednesday, November 18th (by the end of the day, U.S. Eastern Daylight Time).

Overview

In this lab, you will

  1. Implement a linear dictionary data structure.

  2. Use the linear dictionary to implement a hash table data structure.

  3. Use the hash table together with a provided priority queue to allow a user to find popular tweets in a database.

The program you will write in this lab loads a database of tweets from a text file. It then allows a user to identify the most popular tweets in the data file either from a particular user or on a particular day. To do this efficiently, we will make use of both hash tables and the top-k algorithm from priority queues.

Starting Code

Your starting code can be found in the appropriate repository for your team. The following is a description of the repository’s initial comments; bolded files are those which you must change in completing the lab.

  • adts/: As with your last lab, the ADT interface declarations are stored here. The most notable addition for this lab is stlMaxPriorityQueue.h.

  • linearDictionary.h/linearDictionary-inl.h: The LinearDictionary class you will be implementing to store collisions within your HashTable.

  • hashFunctions.h/hashFunctions.cpp: The functions used to hash values of different types.

  • hashTable.h/hashTable-inl.h: The HashTable class you will be implementing.

  • ioUtils.h/ioUtils.cpp: A utility library which will read tweets from a provided data file.

  • main.cpp: The entry point for the query tool program.

  • manualTests.cpp: A sandbox test program for you to use while you develop your program.

  • tests.cpp: The unit testing program you will use to confirm that your LinearDictionary and HashTable are correct.

  • tweet.h/tweet.cpp: A definition of a Tweet class; each instance holds the data for a single tweet.

The URL for your repository will be

git@github.swarthmore.edu:cs35-f20/lab08-<your-teamname>

where your-teamname has been selected at random. You should be able to find your team repository by visiting the course organization on GitHub.

Part I. Implementing LinearDictionary

We recommend that you implement a LinearDictionary as a vector of key, value pairs. All of the operations should run in \(O(n)\) time. You may be thinking: "Why would I want such an inefficient implementation of the Dictionary ADT?" The answer is that, for very small dictionaries, simpler implementations perform much better as they have a smaller constant overhead.

We will use the LinearDictionary to represent the chain of collisions in our HashTable. Because we expect each chain to be relatively short, the LinearDictionary provides a nice abstraction that will be efficient enough for our purposes.

Unit tests have been provided for this lab; once you have completed your LinearDictionary implementation, the appropriate unit tests should pass.

Removing from a vector

In order to implement the remove operation of your LinearDictionary, you’ll need to learn about how to remove from a vector.

The vector class provides a means to remove an element from the end of a list (pop_back) but doesn’t give a clear interface for removing elements e.g. from an arbitrary index. Vectors provide a general function called erase which can delete entire contiguous regions:

vector<int> vec = ...;
vec.erase(vec.begin() + start_index, vec.begin() + end_index);

The above function works much like a Python slice: the start index is included but the end index is not. Thus, vec.erase(vec.begin() + 1, vec.begin() + 3) would remove the second and third elements (indices 1 and 2) from the vector. If you want to delete a single element from a vector, you can do something like this:

vector<int> vec = ...; // whatever defines this vector
int index = ...; // e.g. 4
vec.erase(vec.begin() + index, vec.begin() + index + 1);

Part II. Implementing HashTable

Once your LinearDictionary implementation is complete, you can begin implementing a chaining hash table as we discussed in lecture. You will find a HashTable class declared in hashTable.h. Here are some important details about this class:

  • We recommend that you use a dynamically allocated array of statically allocated LinearDictionary objects to represent the buckets of your HashTable. This is not a requirement, but the static allocation of the LinearDictionary objects will prevent you from having to allocate and deallocate each bucket yourself. You can also use a dynamically allocated array of dynamically allocated LinearDictionary objects if you prefer.

  • One of the constructors allows the user to specify a maximum load factor; the other should use a sensible default value (e.g. 0.75).

  • All operations on your HashTable (other than getKeys and getItems) must run in \(O(1)\) average time.

  • The HashTable must maintain a reasonable load factor in order to operate efficiently. You will likely want to create some private helper methods to assist with increasing the table’s capacity when the load factor becomes too high. (Don’t forget to document those methods according to the Coding Style Requirements below!)

  • Note that hashing functions have been provided in hashFunctions.h and hashFunctions.cpp for keys of type int and string.

Once you have completed your HashTable implementation, the appropriate unit tests should pass.

Part III. Query Tool

Once your implementations of the LinearDictionary and HashTable are complete, you’ll be ready to implement the main program. Your program should take as its only command-line argument the name of a file containing a collection of tweets. We have provided such a file in test_data called tweets.txt which contains tweets in a specific format: there are nine lines per tweet containing the tweet’s ID, post date, username, screenname, text, reply count, retweet count, like count, and its URL. You do not need to write code to read this file. You will find an appropriate function declared in ioUtils.h which will return to you a List of Tweet objects. Note that these Tweet objects behave like the pair template class in that they can be directly copied (rather than needing to be dynamically allocated).

Your program should then enter a loop to allow the user to query the database of tweets. One user query consists of two choices. First, the user chooses to search either by username or date. Then, the user chooses to rank by retweet count or like count. Once you’ve been given those two pieces of information, you will show the top five tweets ranked according to the user’s selection among the tweets with the matching search criteria.

For example, a user might choose to search by username and provide the username "probirdrights". The user might then choose to rank by like count. The query tool should then show the top five tweets from that user ranked by likes. Once it has displayed those tweets, the program will then ask the user for another query. Here’s an example of this scenario:

Please choose from the following options:
A) Search by username
B) Search by date
C) Quit
? A
Please enter your search username: probirdrights
Search by (r)etweet count or (l)ike count? l

@probirdrights    (2020-03-14)
WASH FREQUENT. MANTAIN SOCIAL DISTANCE.  https://t.co/3L2xPK8qOk
Rpl: 170    Rtw: 27654    Lik: 75033    URL: https://twitter.com/ProBirdRights/status/1238931210610434048

@probirdrights    (2017-06-27)
 https://t.co/82WcpSEVZf
Rpl: 406    Rtw: 20907    Lik: 53170    URL: https://twitter.com/ProBirdRights/status/879840865740480513

@probirdrights    (2019-04-07)
why birds decide to go from being big dinosaucers to small puff u ask??? to. fit. inside. dortitos. bag.
Rpl: 120    Rtw: 9464    Lik: 45640    URL: https://twitter.com/ProBirdRights/status/1115086578051719168

@probirdrights    (2020-01-08)
WHO PUT A RED SOCK IN THE WASHINGMACHINE HOW I'M GONNA GO TO WORK LIKE THIS  https://t.co/ito2P6mOui
Rpl: 99    Rtw: 9000    Lik: 42027    URL: https://twitter.com/ProBirdRights/status/1215056643286634496

@probirdrights    (2013-08-16)
I am feel uncomfortable when we are not about me?
Rpl: 182    Rtw: 19106    Lik: 35247    URL: https://twitter.com/ProBirdRights/status/368542088897372161

Note that your output doesn’t have to look exactly like this, but it should contain the same information.

Complexity Requirements

Your query program must satisfy the following complexity requirements:

  • You must be able to search for the tweets which match the user’s criteria (e.g. username is "ProBirdRights") in average case \(O(1)\) time once the query is given.

  • You must be able to print the top five tweets in \(O(n)\) time, where \(n\) is the number of tweets that meet the search criteria.

Here are some algorithmic tips to accomplish the above. You can take as much time as you need to set up, so your program will likely proceed according to a few steps:

  1. Read in all the tweets using the helper function declared in ioUtils.h.

  2. Create some dictionaries to organize the tweets. One dictionary, for instance, could map usernames to a List of all of the tweets by that user. Another could map dates to all of the tweets on that date.

  3. Enter a query loop to use those dictionaries. Now that your data is stored in these dictionaries, you can get to that data in average case constant time.

  4. When selecting the top five tweets, be careful how you handle the lists. There is a toVector method on STLList which runs in linear time. You can then create a second vector containing the priorities and values. Make sure not to sort the list (which is \(O(n \log n)\) time) and definitely don’t iterate through the list (which, since it’s a linked list, would be \(O(n^2)\) time!).

The second criterion above makes use of heapification as we discussed in lecture. You can use the vector constructor of the STLMaxPriorityQueue class to heapify the vector into a priority queue. Note that your list from the dictionary will contain Tweet objects while the STLMaxPriorityQueue requires a vector containing pairs between the tweet’s priority and the tweet itself. It’s okay to create a whole new vector full of pairs, though, because copying everything from your List into this new vector only takes linear time!

Note that creating the dictionaries at the start of the program may take a couple seconds. It’s worth the effort, though, because your program will run pretty quickly once the user issues queries to you. A smaller data file, test_data/small.txt, is available if you find that the normal test_data/tweets.txt is slowing down your testing process.

Example Run

Here’s an example run of the program with multiple queries. Again: your output doesn’t have to look exactly like this, but it should contain the same information.

$ ./queryTool test_data/tweets.txt
Loading tweets...
Processing tweets...
File test_data/tweets.txt loaded.
Welcome to the tweet database query tool!

Please choose from the following options:
A) Search by username
B) Search by date
C) Quit
? A
Please enter your search username: bigbenclock
No tweets with username "bigbenclock" found.

Please choose from the following options:
A) Search by username
B) Search by date
C) Quit
? A
Please enter your search username: big_ben_clock
Search by (r)etweet count or (l)ike count? l

@big_ben_clock    (2020-10-02)
BONG BONG BONG BONG
Rpl: 3    Rtw: 41    Lik: 232    URL: https://twitter.com/big_ben_clock/status/1312044682004836353

@big_ben_clock    (2020-10-07)
BONG BONG BONG
Rpl: 3    Rtw: 48    Lik: 220    URL: https://twitter.com/big_ben_clock/status/1313841776571146240

@big_ben_clock    (2020-10-15)
BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG
Rpl: 1    Rtw: 18    Lik: 208    URL: https://twitter.com/big_ben_clock/status/1316876783292014593

@big_ben_clock    (2020-10-03)
BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG
Rpl: 3    Rtw: 33    Lik: 178    URL: https://twitter.com/big_ben_clock/status/1312528375265681408

@big_ben_clock    (2020-10-20)
BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG BONG
Rpl: 3    Rtw: 30    Lik: 166    URL: https://twitter.com/big_ben_clock/status/1318688460866244609

Please choose from the following options:
A) Search by username
B) Search by date
C) Quit
? A
Please enter your search username: dog_feelings
Search by (r)etweet count or (l)ike count? l

@dog_feelings    (2019-04-01)
i wasn’t a good dog today  april fools. i was so good
Rpl: 1046    Rtw: 54176    Lik: 430999    URL: https://twitter.com/dog_feelings/status/1112883529279393792

@dog_feelings    (2020-03-10)
the human has been working from home the last couple days. and every so often. they let me participate in the video calls. all the other humans cheer when they see me. i am the only thing holding their company together
Rpl: 1770    Rtw: 64640    Lik: 405958    URL: https://twitter.com/dog_feelings/status/1237451266508218368

@dog_feelings    (2020-04-01)
today. i wasn’t a very good dog  april fools. i was so good
Rpl: 1001    Rtw: 45354    Lik: 404346    URL: https://twitter.com/dog_feelings/status/1245501343126695939

@dog_feelings    (2018-06-12)
sometimes. the human presses their noggin against mine. to figure out what i’m thinking. so i just think really hard. about how much i love them. and hope they figure it out
Rpl: 1638    Rtw: 80307    Lik: 397607    URL: https://twitter.com/dog_feelings/status/1006707052624990209

@dog_feelings    (2018-04-01)
today. i wasn’t a very good dog  april fools. i was so good
Rpl: 603    Rtw: 69469    Lik: 362870    URL: https://twitter.com/dog_feelings/status/980620017732636672

Please choose from the following options:
A) Search by username
B) Search by date
C) Quit
? B
Please enter your search date: 2020-02-01
Search by (r)etweet count or (l)ike count? l

@catfoodbreath    (2020-02-01)
Wishing you a Saturday night at home with your cat and a great blanket.  And cheesy snacks.
Rpl: 13    Rtw: 30    Lik: 326    URL: https://twitter.com/CatFoodBreath/status/1223796805181345793

@suethetrex    (2020-02-01)
Important @DustinGrowick fan art.
Rpl: 3    Rtw: 24    Lik: 280    URL: https://twitter.com/SUEtheTrex/status/1223618509881647106

@catfoodbreath    (2020-02-01)
Thing One: Hon, why is the cat meowing at the closet? Thing Two: How should I know? *Thing One looks in closet* Thing One: Your blankie's in here?  Let's put it back on the couch. Thing Two: No, we need the red one for the game. Thing One: Cats before football.  Thing Two: *sigh*
Rpl: 9    Rtw: 13    Lik: 219    URL: https://twitter.com/CatFoodBreath/status/1223781298726166528

@catfoodbreath    (2020-02-01)
There better not be any sports yelling this weekend.  I have big couch plans.
Rpl: 21    Rtw: 12    Lik: 172    URL: https://twitter.com/CatFoodBreath/status/1223632753243164678

@catfoodbreath    (2020-02-01)
Thing Two just put Yellow Blanket in the closet and put a red blanket on the couch.   ?????????????????????
Rpl: 25    Rtw: 7    Lik: 166    URL: https://twitter.com/CatFoodBreath/status/1223742503733596161

Please choose from the following options:
A) Search by username
B) Search by date
C) Quit
? X
Invalid selection.

Please choose from the following options:
A) Search by username
B) Search by date
C) Quit
? C

Goodbye!
$

Memory

For this lab, your program is required to run without memory errors or leaks. You should use valgrind as you proceed in your testing to track memory errors. When you have a complete first draft of your implementation:

  • Commit and push what you have (in case something goes wrong).

  • Run valgrind ./tests to make sure that the test program does not cause memory errors.

  • Once memory errors are cleared, run valgrind --leak-check=full ./tests to identify and correct memory leaks.

  • Note: valgrind may take several minutes to run on the unit tests for this lab.

Coding Style Requirements

As usual, you will also be required to observe good coding practices:

  • Your C++ code must have proper and consistent indentations.

  • You must have proper and consistent usage of spacing and braces for if, else, for, and while conditions as well as class definitions and code blocks.

Summary of Requirements

When you are finished, you should have

  • A completed implementation of LinearDictionary

  • A completed implementation of HashTable

  • A completed query tool program

  • Your program must conform to the complexity requirements above by using hashtables and heapification

  • Your program must handle errors without crashing

  • No memory errors

  • No memory leaks

  • Code which conforms to the style requirements

  • Remember to document your newly-declared fields and methods!

  • A completed lab assignment survey from each student