CS35 Lab7:

Processing user queries with priority queues

Due 11:59pm Wednesday, November 3

A skeleton version of the programs will appear in your cs35/labs/07 directory when you run update35. The program handin35 will only submit files in this directory.

I encourage you to work with a partner on this lab and the entire project.

Introduction

This lab is the second part of a larger project to build a web browser with a search engine for a limited portion of the web. In the first part of this project you summarized the content of a web page by creating a binary search tree that contained the word frequencies of all words that appeared in the page that were not in a given file of words to ignore.

In the second part of this project, you will read in a file of URLs and a file of words to ignore, convert the URLs into filenames, and summarize the content of each file as in the previous lab. Then you will repeatedly prompt the user for a search query. You will determine how often each of the query words appear in the summarized web pages. For example, suppose a user enters the query "computer science department". For each web page, your program should determine the sum of the number of times the individual words "computer", "science", and "department" appear in a page. You will convert this sum into a priority and associate it with the URL and store them in a priority queue. Once all the pages have been checked, you will report the relevant pages in priority order.

Sample Run

Your program will take two command-line arguments: the filename of a list of URLs stored in the Computer Science domain and the filename of a list of words to ignore during the analysis. For example:

% ./part2 urls.txt ignore.txt

This will output:

Web Search Engine
Summarized 13 URLs

Enter a query or type 'quit' to end
Search for: natural language processing
Relevant pages:
                            www.cs.swarthmore.edu/~richardw/ : 17
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 4
            www.cs.swarthmore.edu/~adanner/cs35/s08/lab7.php : 2
           www.cs.swarthmore.edu/~adanner/cs35/s10/index.php : 2
                             www.cs.swarthmore.edu/~newhall/ : 1

Search for: Billings Montana
No relevant pages found

Search for: developmental robotics
Relevant pages:
                               www.cs.swarthmore.edu/~meeden : 16
             www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 2

Search for: maze       
Relevant pages:
            www.cs.swarthmore.edu/~adanner/cs35/s08/lab4.php : 38
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab4.php : 33
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 1
            www.cs.swarthmore.edu/~adanner/cs35/s08/lab7.php : 1

Search for: analysis of algorithms
Relevant pages:
                             www.cs.swarthmore.edu/~adanner/ : 8
           www.cs.swarthmore.edu/~adanner/cs35/s08/index.php : 6
                            www.cs.swarthmore.edu/~richardw/ : 6
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab3.php : 4
           www.cs.swarthmore.edu/~adanner/cs35/s10/index.php : 4
                                 www.cs.swarthmore.edu/~cfk/ : 4
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 3
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab7.php : 2
            www.cs.swarthmore.edu/~adanner/cs35/s08/lab3.php : 1
                              www.cs.swarthmore.edu/~meeden/ : 1

Search for: quit
Goodbye

Notice that only web pages that contain at least one word in common with the query are shown, and that for some queries there may be no relevant pages in our limited search domain.

Classes, Programs, and Files Used

A number of classes are provided for you. Classes, programs or files from the previous lab that you need to copy into the current lab directory are underlined below. Classes, programs or files that you have to complete for this lab appear in bold below.

KVPairClassKey/Value Pairs
BSTClassAbstract data type for Binary Search Trees
BSTNodeClassGeneral node for storing BST key and element
LinkedBSTClassLinked implementation of BST
ScannerClassParser for web pages and text files
URLContentClassStores filename, URL, and word frequency tree for a web page
ignore.txtFileA list of words to ignore when parsing web pages
PQClassAbstract data type for Priority Queues
HeapPQClassHeap implementation of PQ, you need to write removeMin and bubbleDown
ProcessQueryClassStores summarized web pages and handles a search query
urls.txtFileList of URLs to find and summarize, you may add more URLs to the initial list
part2.cppProgramThe main program for part 2 of the project

Getting Started
  1. Complete the implementation of the HeapPQ class by writing the methods bubbleDown and getMinIndex. Test this class thorougly before continuing.
  2. You'll need to read in the URLs from the file urls.txt and convert them into the appropriate filenames as shown below.
    URL:  www.cs.swarthmore.edu/~userName
    file: /home/userName/public_html/index.html or index.php
      
    URL:  www.cs.swarthmore.edu/~userName/dirName/.../fileName
    file: /home/userName/public_html/dirName/.../fileName
    
    Look at readFile.cpp to see a demonstration of how to read in a file one line at a time. Look at the C++ string reference to see useful methods for manipulating strings.
  3. You'll need to read in the query and break it up into a vector of individual words. You can use cin to read in an entire line and then convert it into a C++ string as follows:
    char line[100];
    cin.getline(line, 100);
    string query = (const char *)line;
    
    The above code assumes that the user will never enter a line longer than 100 characters. Once you have the entire query as a string, you can then use C++ string methods to break it up into individual words.
  4. Design and implement the ProcessQuery class. This should handle most of the computation necessary to complete this lab. Note, you should only construct this class one time in your program. It should scan all the URLs once and store the resulting URLContent objects in a data structure to be re-used for each query search.
  5. Write the main program in part2.cpp.
  6. Verify that your seach engine returns the correct results.
Tips
It may be helpful to check if a file exists. Below is some sample code to do just that
#include <fstream>
#include <string>

using namespace std;

bool file_exists(string filename){ 
  ifstream inp;
  inp.open(filename.c_str(), ifstream::in);
  inp.close();
  return !inp.fail();
}
Here's some starter code for removeMin in the PQ. You still have to implement bubbleDown.
template <typename KEY, typename VALUE>
KVPair<KEY,VALUE> HeapPQ<KEY,VALUE>::removeMin() {
  if (isEmpty()){
    throw runtime_error("priority queue is empty, cannot getMin");
  }
  KVPair<KEY,VALUE> kv = *array[1];
  delete array[1]; array[1]=NULL;
  swap(n, 1);
  n--;
  bubbleDown(1);
  return kv;
}
Submit
Once you are satisfied with your code, hand it in by typing handin35. This will copy the code from your cs35/labs/07 to my grading directory. You may run handin35 as many times as you like, and only the most recent submission will be recorded. If you worked with a partner, only one of you needs to handin the code.