CS35 Lab 8:

Caching query results with a dictionary

Due by 11:59pm Wednesday, November 10

A skeleton version of the programs will appear in your cs35/labs/08 directory when you run update35. The program handin35 will only submit files in this directory.

I encourage you to work with a partner on this lab and the entire project.

Introduction

This lab is the third part of a larger project to build a web browser with a search engine for a limited portion of the web. In the first part you summarized the content of a web page by creating a binary search tree that contained the word frequencies of all words that appeared in the page that were not in a given file of words to ignore. In the second part you read in a list of URLs and constructed word frequency trees to summarize their content. Next you prompted the user for a query, searched for each query word in the saved word frequency trees, and then used a priority queue to report the most relevant pages for the query.

In the third part of this project, you will make the search for relevant pages more efficient by caching search results in a dictionary. Each time a query is entered, you will first check a dictionary to see if this query has been answered before. If so, you can get the results directly from the dictionary without any further processing. If not, you will create the results as you did in part 2 and add them to the dictionary.

We'd like to be able to use the cache as much as possible to make our web browser fast. To this end we will remove all ignore words from the query, sort the remaining words, and concatenate them together to make a string key for our dictionary. In this way the queries "computer science department" and "department of computer science" will both end up with the same key "computer department science", assuming "of" is one of the ignore words. Using this approach will allow us to find saved results more frequently, and take the most advantage of the cache.

Sample Run

As before, your program will take two command-line arguments: the filename of a list of URLs stored in the Computer Science domain and the filename of a list of words to ignore during the analysis. For example:

% ./part3 urls.txt ignore.txt

This will output:

Scanning the ignore file...done.
Summarized 13 urls

Enter a query or type 'quit' to end
Search for: analysis of algorithms
ignoring of
query key: algorithms analysis 
Relevant pages:
                              www.cs.swarthmore.edu/~adanner : 7
                             www.cs.swarthmore.edu/~richardw : 6
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab3.php : 4
            www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 4
                                  www.cs.swarthmore.edu/~cfk : 4
                              www.cs.swarthmore.edu/~newhall : 2
                               www.cs.swarthmore.edu/~meeden : 1
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1
Added to cache

Search for: algorithms analysis
query key: algorithms analysis
Found in cache!
Relevant pages:
                              www.cs.swarthmore.edu/~adanner : 7
                             www.cs.swarthmore.edu/~richardw : 6
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab3.php : 4
            www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 4
                                  www.cs.swarthmore.edu/~cfk : 4
                              www.cs.swarthmore.edu/~newhall : 2
                               www.cs.swarthmore.edu/~meeden : 1
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1

Search for: computer science department
query key: computer department science
Relevant pages:
                              www.cs.swarthmore.edu/~newhall : 35
                                  www.cs.swarthmore.edu/~cfk : 24
                             www.cs.swarthmore.edu/~richardw : 20
                              www.cs.swarthmore.edu/~adanner : 17
                               www.cs.swarthmore.edu/~meeden : 16
            www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 5
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1
Added to cache

Search for: department of computer science
ignoring of
query key: computer department science
Found in cache!
Relevant pages:
                              www.cs.swarthmore.edu/~newhall : 35
                                  www.cs.swarthmore.edu/~cfk : 24
                             www.cs.swarthmore.edu/~richardw : 20
                              www.cs.swarthmore.edu/~adanner : 17
                               www.cs.swarthmore.edu/~meeden : 16
            www.cs.swarthmore.edu/~meeden/cs35/f09/index.php : 5
             www.cs.swarthmore.edu/~meeden/cs35/f09/lab6.php : 1

Search for: quit
Goodbye
Classes, Programs, and Files Used

A number of classes are provided for you. Classes, programs or files from the previous lab that you need to copy into the current lab directory are underlined below. Classes, programs or files that you have to complete for this lab appear in bold below.

1-BSTClassAbstract data type for Binary Search Trees
1-KVPairClassGeneral KVPair class for key value pairs
1-LinkedBSTClassLinked implementation of BST
1-ScannerClassParser for web pages and text files
1-URLContentClassStores filename, URL, and word frequency tree for a web page
1-ignore.txtFileA list of words to ignore when parsing web pages
2-PQClassAbstract data type for Priority Queues
2-HeapPQClassHeap implementation of PQ
2-ProcessQueryClassStores summarized web pages and handles a search query
2-urls.txtFileList of URLs to find and summarize
3-DictionaryClassAbstract data type for Dictionaries
3-HashTableDictionaryClassHash table implementation of Dictionary
3-SavedResultClassStores query answer and best matching URL
3-part3.cppProgramThe main program for part 3 of the project

Getting Started
  1. Copy the inclass modifications to HashTableDictionary.* to your labs/08 folder.
  2. Be sure that your program only scans the ignore file once. You can accomplish this by creating a static variable in the URLContent class to store a data structure of the ignore words, and making sure you have a check to only create the ignore tree when neccessary
  3. You'll need a way to save the entire query result as a string. Below is a mix of pseudocode and C++ code that demonstrates a method for accomplishing this.
    char line[80];
    string answer;
    
    answer = "Relevant pages:\n";
    loop until PQ is empty
       removeMin from the PQ of urls
       sprintf(line, "%60s : %d\n", url.c_str(), frequency);
       answer += line;
    
    It uses sprintf, which does formatted printing into a string, to construct the result in the variable answer.

    If this idea of saving the hit count and url of all matching webpages as string seems a bit artificial, look at the optional extensions below for a better alternative.

  4. Complete the implementation of the SavedResult class. Instances of this class will serve as the elements that will be stored in the dictionary. When creating a new instance of this class be sure to store the url that has the highest word frequency count. We'll use this later in our GUI to bring up the most relevant page immediately.
  5. You will need to modify your ProcessQuery class to include the new functionality needed for this part of the project, such as removing ignore words from the query, sorting the query words, and creating a key from the sorted word for use with the dictionary. Because the query is typically quite short, any sort–even an n-squared sort, could be used to put query words in order. If you #include <algorithm>, you can even try STLs built in sort. Or use your quicksort from inclass after modifying it to sort objects of the appropriate type (See comments in optional extensions below)
  6. Create the main program in part3.cpp. Your program should display which words of the query are ignored, the query key, and should report when results are found in the cache and added to the cache.
Tips/Tricks
To read multiple words in from the user on a single line, use cin.getline(...); Here's an example
 char query[100];
 cout << "Search for: ";
 cin.getline(query, 100);
 string q = query; //convert char* buffer to string
Using SavedResult objects as a dictionary value may cause some unexpected errors. Here are some common errors and their solution.
Optional additions

These additional exercises are not required. Only attempt them once you have successfully completed the requirements described above.

Submit
Once you are satisfied with your code, hand it in by typing handin35. This will copy the code from your cs35/labs/08 to my grading directory. You may run handin35 as many times as you like, and only the most recent submission will be recorded. If you worked with a partner, only one of you needs to handin the code.