A skeleton version of the programs will appear in your cs35/labs/08 directory when you run update35. The program handin35 will only submit files in this directory.
I encourage you to work with a partner on this lab and the
entire project.
This lab is the third part of a larger project to build a web browser with a search engine for a limited portion of the web. In the first part you summarized the content of a web page by creating a binary search tree that contained the word frequencies of all words that appeared in the page that were not in a given file of words to ignore. In the second part you read in a list of URLs and constructed word frequency trees to summarize their content. Next you prompted the user for a query, searched for each query word in the saved word frequency trees, and then used a priority queue to report the most relevant pages for the query.
In the third part of this project, you will make the search for relevant pages more efficient by caching search results in a dictionary. Each time a query is entered, you will first check a dictionary to see if this query has been answered before. If so, you can get the results directly from the dictionary without any further processing. If not, you will create the results as you did in part 2 and add them to the dictionary.
We'd like to be able to use the cache as much as possible to make
our web browser fast. To this end we will remove all ignore words
from the query, sort the remaining words, and concatenate them
together to make a string key for our dictionary. In this way the
queries "computer science department" and "department of computer
science" will both end up with the same key "computer department
science", assuming "of" is one of the ignore words. Using this
approach will allow us to find saved results more frequently, and take
the most advantage of the cache.
As before, your program will take two command-line arguments: the filename of a list of URLs stored in the Computer Science domain and the filename of a list of words to ignore during the analysis. For example:
% ./part3 urls.txt ignore.txt
This will output:
Scanning the ignore file...done. Summarized 13 urls Enter a query or type 'quit' to end Search for: analysis of algorithms ignoring of query key: algorithms analysis Relevant pages: www.cs.swarthmore.edu/~adanner : 7 www.cs.swarthmore.edu/~richardw : 6 www.cs.swarthmore.edu/~meeden/cs35/f09/lab3.php : 4 www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 4 www.cs.swarthmore.edu/~cfk : 4 www.cs.swarthmore.edu/~newhall : 2 www.cs.swarthmore.edu/~meeden : 1 www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1 Added to cache Search for: algorithms analysis query key: algorithms analysis Found in cache! Relevant pages: www.cs.swarthmore.edu/~adanner : 7 www.cs.swarthmore.edu/~richardw : 6 www.cs.swarthmore.edu/~meeden/cs35/f09/lab3.php : 4 www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 4 www.cs.swarthmore.edu/~cfk : 4 www.cs.swarthmore.edu/~newhall : 2 www.cs.swarthmore.edu/~meeden : 1 www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1 Search for: computer science department query key: computer department science Relevant pages: www.cs.swarthmore.edu/~newhall : 35 www.cs.swarthmore.edu/~cfk : 24 www.cs.swarthmore.edu/~richardw : 20 www.cs.swarthmore.edu/~adanner : 17 www.cs.swarthmore.edu/~meeden : 16 www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 5 www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1 Added to cache Search for: department of computer science ignoring of query key: computer department science Found in cache! Relevant pages: www.cs.swarthmore.edu/~newhall : 35 www.cs.swarthmore.edu/~cfk : 24 www.cs.swarthmore.edu/~richardw : 20 www.cs.swarthmore.edu/~adanner : 17 www.cs.swarthmore.edu/~meeden : 16 www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 5 www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1 Search for: quit Goodbye
A number of classes are provided for you. Classes, programs or files from the previous lab that you need to copy into the current lab directory are underlined below. Classes, programs or files that you have to complete for this lab appear in bold below.
1- | BST | Class | Abstract data type for Binary Search Trees |
1- | KVPair | Class | General KVPair class for key value pairs |
1- | LinkedBST | Class | Linked implementation of BST |
1- | Scanner | Class | Parser for web pages and text files |
1- | URLContent | Class | Stores filename, URL, and word frequency tree for a web page |
1- | ignore.txt | File | A list of words to ignore when parsing web pages |
2- | PQ | Class | Abstract data type for Priority Queues |
2- | HeapPQ | Class | Heap implementation of PQ |
2- | ProcessQuery | Class | Stores summarized web pages and handles a search query |
2- | urls.txt | File | List of URLs to find and summarize |
3- | Dictionary | Class | Abstract data type for Dictionaries |
3- | HashTableDictionary | Class | Hash table implementation of Dictionary |
3- | SavedResult | Class | Stores query answer and best matching URL |
3- | part3.cpp | Program | The main program for part 3 of the project |
char line[80]; string answer; answer = "Relevant pages:\n"; loop until PQ is empty removeMin from the PQ of urls sprintf(line, "%60s : %d\n", url.c_str(), frequency); answer += line;It uses sprintf, which does formatted printing into a string, to construct the result in the variable answer.
If this idea of saving the hit count and url of all matching webpages as string seems a bit artificial, look at the optional extensions below for a better alternative.
char query[100]; cout << "Search for: "; cin.getline(query, 100); string q = query; //convert char* buffer to stringUsing SavedResult objects as a dictionary value may cause some unexpected errors. Here are some common errors and their solution.
This error appears in the keyValuePair constructor when the VALUE type (in this case a SavedResult object) does not have a default constructor. The quick fix is to just a add a dummy default constructor in the SavedResult class. Add SavedResult(); as public method in SavedResult.h and add the following implementation in SavedResult.cpp.
SavedResult::SavedResult() { string answer = ""; string bestURL = ""; }A better fix is to use Initialization lists in keyValuePair.inl and change the constructor to
template <typename KEY, typename VALUE> KVPair<KEY, VALUE>::KVPair(KEY k, VALUE v): key(k), value(v) { }but initializations lists are a bit beyond the scope of cs35 and default constructors may come in handy at other times.
This problem results because C++ does not know how to compare two SavedResult objects for equality. This can be fixed using operator overloading and defining == for the SavedResult class. Add bool operator==(SavedResult& other); to SavedResult.h and implement as follows in SavedResult.cpp
bool SavedResult::operator==(SavedResult& other){ return (other.getBestURL()==bestURL && other.getAnswer()==answer); }
These additional exercises are not required. Only attempt them once you have successfully completed the requirements described above.