CS35: Lab 8

A skeleton version of the programs will appear in your cs35/labs/08 directory when you run update35. The program handin35 will only submit files in this directory.

I encourage you to work with a partner on this lab and the entire project.

Introduction

This lab is the second part of a larger project to build a web browser with a search engine for a limited portion of the web. In the first part of this project you summarized the content of a web page by creating a binary search tree that contained the word frequencies of all words that appeared in the page that were not in a given file of words to ignore.

In the second part of this project, you will read in a file of URLs and a file of words to ignore, convert the URLs into filenames, and summarize the content of each file as in the previous lab. Then you will repeatedly prompt the user for a search query. You will determine how often each of the query words appear in the summarized web pages. For example, suppose a user enters the query "computer science department". For each web page, your program should determine the sum of the number of times the individual words "computer", "science", and "department" appear in a page. You will convert this sum into a priority and associate it with the URL and store them in a priority queue. Once all the pages have been checked, you will report the relevant pages in priority order.

Sample Run

Your program will take two command-line arguments: the filename of a list of URLs stored in the Computer Science domain and the filename of a list of words to ignore during the analysis. For example:

% ./part2 urls.txt ignore.txt

This will output:

Web Search Engine
Summarized 13 URLs

Enter a query or type 'quit' to end
Search for: natural language processing
Relevant pages:
                            www.cs.swarthmore.edu/~richardw/ : 17
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 4
            www.cs.swarthmore.edu/~adanner/cs35/s08/lab7.php : 2
           www.cs.swarthmore.edu/~adanner/cs35/s10/index.php : 2
                             www.cs.swarthmore.edu/~newhall/ : 1

Search for: Billings Montana
No relevant pages found

Search for: developmental robotics
Relevant pages:
                               www.cs.swarthmore.edu/~meeden : 16
             www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 2

Search for: maze       
Relevant pages:
            www.cs.swarthmore.edu/~adanner/cs35/s08/lab4.php : 38
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab4.php : 33
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 1
            www.cs.swarthmore.edu/~adanner/cs35/s08/lab7.php : 1

Search for: analysis of algorithms
Relevant pages:
                             www.cs.swarthmore.edu/~adanner/ : 8
           www.cs.swarthmore.edu/~adanner/cs35/s08/index.php : 6
                            www.cs.swarthmore.edu/~richardw/ : 6
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab3.php : 4
           www.cs.swarthmore.edu/~adanner/cs35/s10/index.php : 4
                                 www.cs.swarthmore.edu/~cfk/ : 4
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 3
            www.cs.swarthmore.edu/~adanner/cs35/s10/lab7.php : 2
            www.cs.swarthmore.edu/~adanner/cs35/s08/lab3.php : 1
                              www.cs.swarthmore.edu/~meeden/ : 1

Search for: quit
Goodbye

Notice that only web pages that contain at least one word in common with the query are shown, and that for some queries there may be no relevant pages in our limited search domain.

Classes, Programs, and Files Used

A number of classes are provided for you. Classes, programs or files from the previous lab that you need to copy into the current lab directory are underlined below. Classes, programs or files that you have to complete for this lab appear in bold below.

BST	Class	Abstract data type for Binary Search Trees
BSTNode	Class	General node for storing BST key and element
LinkedBST	Class	Linked implementation of BST
LinkedBSTNode	Class	Linked specialization of BSTNode
Scanner	Class	Parser for web pages and text files
URLContent	Class	Stores filename, URL, and word frequency tree for a web page
ignore.txt	File	A list of words to ignore when parsing web pages
PQ	Class	Abstract data type for Priority Queues
PQNode	Class	Node for storing PQ key and element
HeapPQ	Class	Heap implementation of PQ, you need to write `removeMin` and `bubbleDown`
ProcessQuery	Class	Stores summarized web pages and handles a search query
urls.txt	File	List of URLs to find and summarize, you may add more URLs to the initial list
part2.cpp	Program	The main program for part 2 of the project

Getting Started

Complete the implementation of the HeapPQ class by writing the methods bubbleDown and getMinIndex. Test this class thorougly before continuing. Note that you will need to make a slight modification to bubbleDown. You should not always swap and continue to bubbleDown as indicated. You need to add an if statement.
You'll need to read in the URLs from the file urls.txt and convert them into the appropriate filenames as shown below.
```
URL:  www.cs.swarthmore.edu/~userName
file: /home/userName/public_html/index.html or index.php
  
URL:  www.cs.swarthmore.edu/~userName/dirName/.../fileName
file: /home/userName/public_html/dirName/.../fileName
```
Look at readFile.cpp to see a demonstration of how to read in a file one line at a time. Look at the C++ string reference to see useful methods for manipulating strings.
You'll need to read in the query and break it up into a vector of individual words. You can use cin to read in an entire line and then convert it into a C++ string as follows:
```
char line[100];
cin.getline(line, 100);
string query = (const char *)line;
```
The above code assumes that the user will never enter a line longer than 100 characters. Once you have the entire query as a string, you can then use C++ string methods to break it up into individual words.
Design and implement the ProcessQuery class. This should handle most of the computation necessary to complete this lab. Note, you should only construct this class one time in your program. It should scan all the URLs once and store the resulting URLContent objects in a data structure to be re-used for each query search.
Write the main program in part2.cpp.
Verify that your seach engine returns the correct results.

Tips

It may be helpful to check if a file exists. Below is some sample code to do just that

#include <fstream>
#include <string>

using namespace std;

bool file_exists(string filename){ 
  ifstream inp;
  inp.open(filename.c_str(), ifstream::in);
  inp.close();
  return !inp.fail();
}

Submit

Once you are satisfied with your code, hand it in by typing handin35. This will copy the code from your cs35/labs/08 to my grading directory. You may run handin35 as many times as you like, and only the most recent submission will be recorded. If you worked with a partner, only one of you needs to handin the code.

CS35 Lab8:

Processing user queries with priority queues