CS35: Lab 8

A skeleton version of the programs will appear in your cs35/labs/08 directory when you run update35. The program handin35 will only submit files in this directory.

I encourage you to work with a partner on this lab and the entire project.

Introduction

This lab is the third part of a larger project to build a web browser with a search engine for a limited portion of the web. In the first part you summarized the content of a web page by creating a binary search tree that contained the word frequencies of all words that appeared in the page that were not in a given file of words to ignore. In the second part you read in a list of URLs and constructed word frequency trees to summarize their content. Next you prompted the user for a query, searched for each query word in the saved word frequency trees, and then used a priority queue to report the most relevant pages for the query.

In the third part of this project, you will make the search for relevant pages more efficient by caching search results in a dictionary. Each time a query is entered, you will first check a dictionary to see if this query has been answered before. If so, you can get the results directly from the dictionary without any further processing. If not, you will create the results as you did in part 2 and add them to the dictionary.

We'd like to be able to use the cache as much as possible to make our web browser fast. To this end we will remove all ignore words from the query, sort the remaining words, and concatenate them together to make a string key for our dictionary. In this way the queries "computer science department" and "department of computer science" will both end up with the same key "computer department science", assuming "of" is one of the ignore words. Using this approach will allow us to find saved results more frequently, and take the most advantage of the cache.

Sample Run

As before, your program will take two command-line arguments: the filename of a list of URLs stored in the Computer Science domain and the filename of a list of words to ignore during the analysis. For example:

% ./part3 urls.txt ignore.txt

This will output:

Scanning the ignore file...done.
Summarized 13 urls

Enter a query or type 'quit' to end
Search for: analysis of algorithms
ignoring of
query key: algorithms analysis 
Relevant pages:
                              www.cs.swarthmore.edu/~adanner : 7
                             www.cs.swarthmore.edu/~richardw : 6
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab3.php : 4
            www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 4
                                  www.cs.swarthmore.edu/~cfk : 4
                              www.cs.swarthmore.edu/~newhall : 2
                               www.cs.swarthmore.edu/~meeden : 1
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1
Added to cache

Search for: algorithms analysis
query key: algorithms analysis
Found in cache!
Relevant pages:
                              www.cs.swarthmore.edu/~adanner : 7
                             www.cs.swarthmore.edu/~richardw : 6
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab3.php : 4
            www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 4
                                  www.cs.swarthmore.edu/~cfk : 4
                              www.cs.swarthmore.edu/~newhall : 2
                               www.cs.swarthmore.edu/~meeden : 1
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1

Search for: computer science department
query key: computer department science
Relevant pages:
                              www.cs.swarthmore.edu/~newhall : 35
                                  www.cs.swarthmore.edu/~cfk : 24
                             www.cs.swarthmore.edu/~richardw : 20
                              www.cs.swarthmore.edu/~adanner : 17
                               www.cs.swarthmore.edu/~meeden : 16
            www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 5
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1
Added to cache

Search for: department of computer science
ignoring of
query key: computer department science
Found in cache!
Relevant pages:
                              www.cs.swarthmore.edu/~newhall : 35
                                  www.cs.swarthmore.edu/~cfk : 24
                             www.cs.swarthmore.edu/~richardw : 20
                              www.cs.swarthmore.edu/~adanner : 17
                               www.cs.swarthmore.edu/~meeden : 16
            www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 5
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1

Search for: quit
Goodbye

Classes, Programs, and Files Used

A number of classes are provided for you. Classes, programs or files from the previous lab that you need to copy into the current lab directory are underlined below. Classes, programs or files that you have to complete for this lab appear in bold below.

1-	BST	Class	Abstract data type for Binary Search Trees
1-	KVPair	Class	General KVPair class for key value pairs
1-	LinkedBST	Class	Linked implementation of BST
1-	Scanner	Class	Parser for web pages and text files
1-	URLContent	Class	Stores filename, URL, and word frequency tree for a web page
1-	ignore.txt	File	A list of words to ignore when parsing web pages
2-	PQ	Class	Abstract data type for Priority Queues
2-	HeapPQ	Class	Heap implementation of PQ
2-	ProcessQuery	Class	Stores summarized web pages and handles a search query
2-	urls.txt	File	List of URLs to find and summarize
3-	Dictionary	Class	Abstract data type for Dictionaries
3-	HashTableDictionary	Class	Hash table implementation of Dictionary
3-	SavedResult	Class	Stores query answer and best matching URL
3-	part3.cpp	Program	The main program for part 3 of the project

Getting Started

Copy the inclass modifications to HashTableDictionary.* to your labs/08 folder.
Be sure that your program only scans the ignore file once. You can accomplish this by creating a static variable in the URLContent class to store a data structure of the ignore words, and making sure you have a check to only create the ignore tree when neccessary
You'll need a way to save the entire query result as a string. Below is a mix of pseudocode and C++ code that demonstrates a method for accomplishing this.
```
char line[80];
string answer;

answer = "Relevant pages:\n";
loop until PQ is empty
   removeMin from the PQ of urls
   sprintf(line, "%60s : %d\n", url.c_str(), frequency);
   answer += line;
```
It uses sprintf, which does formatted printing into a string, to construct the result in the variable answer.
If this idea of saving the hit count and url of all matching webpages as string seems a bit artificial, look at the optional extensions below for a better alternative.
Complete the implementation of the SavedResult class. Instances of this class will serve as the elements that will be stored in the dictionary. When creating a new instance of this class be sure to store the url that has the highest word frequency count. We'll use this later in our GUI to bring up the most relevant page immediately.
You will need to modify your ProcessQuery class to include the new functionality needed for this part of the project, such as removing ignore words from the query, sorting the query words, and creating a key from the sorted word for use with the dictionary. Because the query is typically quite short, any sort–even an n-squared sort, could be used to put query words in order. If you #include <algorithm>, you can even try STLs built in sort. Or use your quicksort from inclass after modifying it to sort objects of the appropriate type (See comments in optional extensions below)
Create the main program in part3.cpp. Your program should display which words of the query are ignored, the query key, and should report when results are found in the cache and added to the cache.

Tips/Tricks

To read multiple words in from the user on a single line, use cin.getline(...); Here's an example

 char query[100];
 cout << "Search for: ";
 cin.getline(query, 100);
 string q = query; //convert char* buffer to string

Using SavedResult objects as a dictionary value may cause some unexpected errors. Here are some common errors and their solution.

keyValuePair.inl:7: error: no matching function for call to 'SavedResult::SavedResult()
This error appears in the keyValuePair constructor when the VALUE type (in this case a SavedResult object) does not have a default constructor. The quick fix is to just a add a dummy default constructor in the SavedResult class. Add SavedResult(); as public method in SavedResult.h and add the following implementation in SavedResult.cpp.
```
SavedResult::SavedResult() {
  string answer = "";
  string bestURL = "";
}
```
A better fix is to use Initialization lists in keyValuePair.inl and change the constructor to
```
template <typename KEY, typename VALUE>
KVPair<KEY, VALUE>::KVPair(KEY k, VALUE v): key(k), value(v) {
}
```
but initializations lists are a bit beyond the scope of cs35 and default constructors may come in handy at other times.
Another possible error is something similar to HashTableDictionary.inl:123: error: no match for 'operator==' in 'KVPair<KEY, VALUE>::getValue() [with KEY = string, VALUE = SavedResult]() == value
This problem results because C++ does not know how to compare two SavedResult objects for equality. This can be fixed using operator overloading and defining == for the SavedResult class. Add bool operator==(SavedResult& other); to SavedResult.h and implement as follows in SavedResult.cpp
```
bool SavedResult::operator==(SavedResult& other){
  return (other.getBestURL()==bestURL && other.getAnswer()==answer);
}
```

Optional additions

These additional exercises are not required. Only attempt them once you have successfully completed the requirements described above.

Saving all search results as a string in SavedResult is a bit unsatisfying. We lose a lot of structure by packing everything into a single string. Modify SavedResult to save all results as a vector of KVPair<int,string> objects. You should not have to modify much code.
Use your quicksort code inclass to sort an array of query words. For an added challenge, modify your quicksort code to use a vector instead of an array. You will need to modify the type of item being sorted.
By removing ignore words from the query and sorting the remaining words we are more likely to find saved query results in the cache. However we could do even better. Rather than saving the results for an entire query, we could instead save the results for each individual word in the query. For example if a user asked for pages on "artificial christmas trees", we would find all pages relevant for the word "artificial", then for the word "christmas", and finally for the word "trees" and store these in the cache. We would then combine all of these results into a single list for the entire query. Later, if the user asked for pages about "artificial intelligence" we would already have relevant pages for "artificial" and would only need to go find relevant pages for "intelligence" and then combine them for the complete answer. Implement this approach to caching.
Search engines like google often recognize misspellings and respond with "did you mean...?" and provide an appropriate correction. In CS21 we have done a lab related to this. Modify your search engine to incorporate the ability to recognize misspelled words in queries and suggest reasonable alternatives. You could use techniques similar to those described in the CS21 lab or you could use the word frequency trees in some way.

Submit

Once you are satisfied with your code, hand it in by typing handin35. This will copy the code from your cs35/labs/08 to my grading directory. You may run handin35 as many times as you like, and only the most recent submission will be recorded. If you worked with a partner, only one of you needs to handin the code.

CS35 Lab 8:

Caching query results with a dictionary