This lab is the third part of a larger project to build a web browser with a search engine for a limited portion of the web. In the first part you summarized the content of a web page by creating a binary search tree that contained the word frequencies of all words that appeared in the page that were not in a given file of words to ignore. In the second part you read in a list of URLs and constructed word frequency trees to summarize their content. Next you prompted the user for a query, searched for each query word in the saved word frequency trees, and then used a priority queue to report the most relevant pages for the query.
In the third part of this project, you will make the search for relevant pages more efficient by caching search results in a dictionary. Each time a query is entered, you will first check a dictionary to see if this query has been answered before. If so, you can get the results directly from the dictionary without any further processing. If not, you will create the results as you did in last wee's lab and add them to the dictionary.
We'd like to be able to use the cache as much as possible to make
our web browser fast. To this end we will remove all ignore words
from the query, sort the remaining words, and concatenate them
together to make a string key for our dictionary. In this way the
queries "computer science department" and "department of computer
science" will both end up with the same key "computer department
science", assuming "of" is one of the ignore words. Using this
approach will allow us to find saved results more frequently, and take
the most advantage of the cache.
Your work for this lab consists of three parts:
As usual, you can get the initial files for this lab by running update35. The new files are listed here, and the files that you will need to edit are highlighted in blue:
$ cd $ cd cs35/labs/09 $ cp queryprocessor* sortedSearch.cpp binaryHeap.inl ../10
You should start by completing and testing the HashTable class. Our HashTable uses an array of CircularArrayLists to store the data within the hash table, using chaining to resolve collisions. Note that this is not a CircularArrayList of CircularArrayList. table is a standard C++ array, where each index contains an actual CircularArrayList object (not a pointer). You should be able to draw this out as a memory diagram.
Each CircularArrayList stores key-value pairs; for keys of type K and values of type V each key-value pair therefore has type Pair<K,V>.
Please complete each function marked with a TODO comment in hashtable.inl. You must test each HashTable function you implement using the testHash program.
Once you've completed and tested the HashTable class, continue by implementing and testing the BinaryHeap copy constructor. A copy constructor is a constructor that creates a copy of a class from another instance of the same class, in this case a BinaryHeap. (The parameter for a copy constructor actually needs to be a const reference to an instance of the same class, but this is a C++ detail that you don't need to learn.)
To copy an object you can usually directly copy (by value) most data from the source object, but you must avoid copying some data (such as a pointer to a memory location) that would make the source object and the copy dependent on each other. For an example of a copy constructor, see the newly-provided CircularArrayList class.
The copy constructor is invoked just as any other constructor, but the argument is itself another instance of the class. So, for example:
BinaryHeap < int,int > heap1; //Invokes default constructor //Insert elements into heap1 BinaryHeap < int,int > cpy (heap1); //Invokes copy constructor, copying // all of the data from heap1 over to cpyThis works exactly the same way with dynamic memory except we need to dereference pointers since the actual object needs to be sent in:
BinaryHeap < int,int >* heap1 = new BinaryHeap< int,int > //Invokes default constructor //Insert elements into heap1 BinaryHeap < int,int >* cpy = new BinaryHeap< int,int >(*heap1);
If your sortedSearch program from Lab 09 was well-designed, you can complete this portion of the lab by editing only your queryprocessor.h and queryprocessor.inl implementation, without modifying sortedSearch.cpp.
As described in the introduction above, your modified QueryProcessor should first check a dictionary to learn if this query has been answered before; if so, it can return a copy of the result (using the BinaryHeap's copy constructor!) from the dictionary without any further processing. (It's important to return a copy -- and not a value from the cache itself -- to prevent the main program from modifying the BinaryHeaps inside the query cache.) If the query has not been answered before, you compute the query result as in Lab 09, save the results in the dictionary, and then return them.
Your query result cache should be a hash table using query strings as keys and PriorityQueue pointers (BinaryHeap pointers) as values. To maximize the chance that a query string uses the cache, you should first convert the query string into a canonical form before using it as the cache key. To create the cache key from a search query, do the following:
string word = "Hello"; string lower = ""; for(int i = 0; i < word.length(); i++) lower += tolower(word[i]);
For example, if "of" is an ignored word, the search query "department of computer science" would be converted to the cache key "computer department science". Similarly, "Computer Science department" would also be converted to "computer department science", which would allow the cache to return a saved query result instead of recomputing the result.
Please print output to tell us when a query result is newly-computed and when a query result was used from the cache. For example:
$ ./sortedSearch inputs/urls.txt inputs/ignore.txt Enter your search query: tree Searching for "tree": Cache miss. www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 33 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 21 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9 Enter your search query: node Searching for "node": Cache miss. www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 12 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 10 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6 www.cs.swarthmore.edu/~newhall/index.php: 1 Enter your search query: tree node Searching for "tree node": Cache miss. www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 45 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 31 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6 www.cs.swarthmore.edu/~newhall/index.php: 1 Enter your search query: tree node Searching for "tree node": Returning query result from cache for "node tree": www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 45 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 31 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6 www.cs.swarthmore.edu/~newhall/index.php: 1 Enter your search query: A Node Tree Searching for "A Node Tree": Returning query result from cache for "node tree": www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 45 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 31 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6 www.cs.swarthmore.edu/~newhall/index.php: 1 Enter your search query: the a nOdE trEE the a Searching for "the a nOdE trEE the a": Returning query result from cache for "node tree": www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 45 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 31 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6 www.cs.swarthmore.edu/~newhall/index.php: 1 Enter your search query: <CTRL-D> $
Once you are satisfied with your program, hand it in by typing handin35 at the unix prompt.
You may run handin35 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them.