CS35 Project: Part 2

Part 2: Processing User Queries to Find the Most Relevant Web Pages

Problem Description
Getting Started
Java Classes
What to Hand In

PROBLEM DESCRIPTION

For this program you will implement part of a web search engine that orders web pages based on how well they match a search query. The best match is the web page with the highest word frequency counts for the words in the query string. Your main class for this assignment should be called ProcessQueries and will be called as follows:

java ProcessQueries urlListFile ignoreFile

The urlListFile should contain a list of URLs, one per line. These URLs have to correspond to files that you can open locally (URLs whose corresponding .html file is stored on our file system). An example urlListFile might contain:

 
www.cs.swarthmore.edu/~cfk
www.cs.swarthmore.edu/~eroberts
www.cs.swarthmore.edu/~knerr
www.cs.swarthmore.edu/~marshall
www.cs.swarthmore.edu/~meeden
www.cs.swarthmore.edu/~newhall
www.cs.swarthmore.edu/~newhall/cs35/cs35.html
www.cs.swarthmore.edu/~newhall/cs21/f00/cs21.html

The ignoreFile should contain a list of words that you would like to ignore as you count word frequencies in html files (just as you did in the last assignment).

In order to process queries from a user, you'll need to create a new class that joins together a URL string with a WordFrequencyTree representing that web page's content. Call this class URLContent. Your program should create a list of URLContent objects, one for each URL that appears in the urlListFile.
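One possible shape for URLContent is sketched below. The accessor names (getURL, getTree) are our own choice for illustration, not something the assignment requires, and the empty WordFrequencyTree class is only a placeholder so the example compiles on its own; in your solution you would use the real WordFrequencyTree from the last assignment.

```java
import java.util.ArrayList;
import java.util.List;

// Placeholder only: stands in for the WordFrequencyTree class from assignment 6.
class WordFrequencyTree { }

// Pairs a URL string with the word-frequency tree built from that page.
class URLContent {
    private final String url;
    private final WordFrequencyTree tree;

    URLContent(String url, WordFrequencyTree tree) {
        this.url = url;
        this.tree = tree;
    }

    String getURL() { return url; }              // hypothetical accessor name
    WordFrequencyTree getTree() { return tree; } // hypothetical accessor name
}

class URLContentDemo {
    public static void main(String[] args) {
        // One URLContent per URL in the urlListFile.
        List<URLContent> pages = new ArrayList<>();
        pages.add(new URLContent("www.cs.swarthmore.edu/~meeden", new WordFrequencyTree()));
        System.out.println(pages.get(0).getURL());
    }
}
```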

Once you have processed all the URLs in the list (you should gracefully handle invalid URLs), your program will enter a loop as shown below, which prompts the user to enter a search query (or -1 to quit), and then lists all URLs that match the query, ordered from best match to worst match. Include each result URL's priority in parentheses after the URL. URLs of web pages that do not contain any of the words in the query should not appear in the result list.

Enter a query or -1 to quit.

Search for: neural networks
Relevant pages:
www.cs.swarthmore.edu/~meeden    (priority = x)
www.cs.swarthmore.edu/~marshall  (priority = y)

Search for: evolutionary computation
Relevant pages:
www.cs.swarthmore.edu/~meeden    (priority = x)
www.cs.swarthmore.edu/~marshall  (priority = y)
www.cs.swarthmore.edu/~cfk       (priority = z)

Search for: -1
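
The prompt loop in the sample session might be structured roughly as follows. This is a sketch under assumptions: it reads lines with java.io.BufferedReader in place of the course's SimpleIO class, the name readQueries is made up for the example, and it only collects the queries where a real solution would rank and print the matching pages.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class QueryLoopSketch {
    // Prompts until the user enters -1 (or input ends); returns the queries seen.
    static List<String> readQueries(BufferedReader in) {
        List<String> queries = new ArrayList<>();
        System.out.println("Enter a query or -1 to quit.");
        try {
            while (true) {
                System.out.print("Search for: ");
                String line = in.readLine();
                if (line == null || line.trim().equals("-1")) {
                    break;                 // end of input or the quit value
                }
                queries.add(line.trim());  // a real solution would rank and print here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Canned input so the sketch runs without a keyboard.
        BufferedReader in = new BufferedReader(
            new StringReader("neural networks\nevolutionary computation\n-1\n"));
        System.out.println(readQueries(in));
    }
}
```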

To find the results of the query in order, you will process each WordFrequencyTree in the list of URLContent objects, create a priority queue element for it, and add it to a priority queue for the search. Then use the priority queue to print out the matching URLs in order. The priority value is based on how well the web page matches the words in the query. Remember that in a priority queue, low values correspond to high priority.
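
Because low keys come out of the queue first, a page with more matches needs a smaller key. The sketch below shows one way to arrange that, by negating the match count; the assignment does not fix a priority formula, so treat the negation as an assumption. It also uses java.util.PriorityQueue in place of the HeapPriorityQueue class you are given, and the Entry class and rank method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class RankingSketch {
    // Hypothetical queue element: a URL paired with its key (lower key = better match).
    static class Entry implements Comparable<Entry> {
        final String url;
        final int key;
        Entry(String url, int key) { this.url = url; this.key = key; }
        public int compareTo(Entry other) { return Integer.compare(key, other.key); }
    }

    // urls[i] contains matchCounts[i] total occurrences of the query words.
    static List<String> rank(String[] urls, int[] matchCounts) {
        PriorityQueue<Entry> pq = new PriorityQueue<>();
        for (int i = 0; i < urls.length; i++) {
            if (matchCounts[i] > 0) {                         // skip pages with no matches
                pq.add(new Entry(urls[i], -matchCounts[i]));  // high count -> low key
            }
        }
        List<String> ordered = new ArrayList<>();
        while (!pq.isEmpty()) {
            Entry e = pq.poll();                              // smallest key first
            ordered.add(e.url + "    (priority = " + e.key + ")");
        }
        return ordered;
    }

    public static void main(String[] args) {
        String[] urls = { "www.cs.swarthmore.edu/~marshall",
                          "www.cs.swarthmore.edu/~knerr",
                          "www.cs.swarthmore.edu/~meeden" };
        int[] counts = { 3, 0, 7 };  // made-up counts for illustration
        for (String line : rank(urls, counts)) {
            System.out.println(line);
        }
    }
}
```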

GETTING STARTED

Much of this assignment will be figuring out how to use some of the classes that we give you. Once you have run the test programs for these classes, and understand how they work, then you can start implementing code.

Start by implementing the insert method in the HeapPriorityQueue class. Test that this works before moving on to the next part.
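
As a reminder of the general technique, here is a minimal sketch of insertion into an array-based min-heap holding plain int keys. It is not the HeapPriorityQueue class you are given: the class name, fields, and methods here are assumptions, and the real class has its own representation that you will need to follow.

```java
import java.util.Arrays;

class MinHeapSketch {
    private int[] keys = new int[8];
    private int size = 0;

    // Add the key at the next free leaf, then bubble it up toward the root.
    void insert(int key) {
        if (size == keys.length) {
            keys = Arrays.copyOf(keys, size * 2);  // grow when the array is full
        }
        int i = size++;
        keys[i] = key;
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (keys[parent] <= keys[i]) {
                break;                             // heap order restored
            }
            int tmp = keys[parent];                // swap child with its parent
            keys[parent] = keys[i];
            keys[i] = tmp;
            i = parent;
        }
    }

    int min() { return keys[0]; }  // only meaningful when size() > 0
    int size() { return size; }

    public static void main(String[] args) {
        MinHeapSketch heap = new MinHeapSketch();
        for (int k : new int[] { 5, 3, 8, 1 }) {
            heap.insert(k);
        }
        System.out.println(heap.min());
    }
}
```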

Next, implement the part of your program that processes the urlListFile. For each URL read in, create the corresponding file name according to the following rules, and then calculate the word frequencies for that file.

URL:  www.cs.swarthmore.edu/~user_name
File: /home/user_name/public_html/index.html

URL:  www.cs.swarthmore.edu/~user_name/dir_name/file_name.html
File: /home/user_name/public_html/dir_name/file_name.html
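
The two rules above could be coded along these lines. The class and method names are hypothetical, and returning null for a URL that does not fit either rule is just one way to flag an invalid URL; treating a trailing slash as index.html is also an assumption that the rules do not cover.

```java
class UrlToPath {
    static final String HOST_PREFIX = "www.cs.swarthmore.edu/~";

    // Returns the local file name for a URL, or null if the URL fits neither rule.
    static String toLocalPath(String url) {
        if (!url.startsWith(HOST_PREFIX)) {
            return null;
        }
        // rest is "user_name" or "user_name/dir_name/file_name.html"
        String rest = url.substring(HOST_PREFIX.length());
        int slash = rest.indexOf('/');
        String user = (slash < 0) ? rest : rest.substring(0, slash);
        String path = (slash < 0) ? "" : rest.substring(slash + 1);
        if (user.isEmpty()) {
            return null;
        }
        if (path.isEmpty()) {
            path = "index.html";  // first rule (and, by assumption, a trailing slash)
        }
        return "/home/" + user + "/public_html/" + path;
    }

    public static void main(String[] args) {
        System.out.println(toLocalPath("www.cs.swarthmore.edu/~meeden"));
        System.out.println(toLocalPath("www.cs.swarthmore.edu/~newhall/cs35/cs35.html"));
    }
}
```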

Next, implement the part that reads in a search query and builds a priority queue by inserting (URLContent, key) pairs, where the key is the priority of the URL's WordFrequencyTree based on how well it matches the query string. Then print out the matching URLs in order from best to worst match.

Your program should handle multiple word queries, and return the best matches based on all words in the query. For example, the query "computer science department" should search each URL's WordFrequencyTree for all three words to determine the URL's priority.
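
One simple way to combine several query words is to add up each word's frequency count on the page. The sketch below does that, with a java.util.Map standing in for the page's WordFrequencyTree; summing the counts, lowercasing the query, and the method name totalMatches are all assumptions for illustration, not requirements of the assignment.

```java
import java.util.HashMap;
import java.util.Map;

class QueryScoreSketch {
    // Sums the page's frequency counts over every word in the query.
    static int totalMatches(Map<String, Integer> counts, String query) {
        int total = 0;
        for (String word : query.trim().toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;                          // happens for an empty query
            }
            total += counts.getOrDefault(word, 0); // words not on the page add 0
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up counts standing in for one page's WordFrequencyTree.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("computer", 4);
        counts.put("science", 2);
        System.out.println(totalMatches(counts, "computer science department"));
    }
}
```

A page whose total is 0 contains none of the query words and would be left out of the result list.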

CLASSES

For this and all remaining homework assignments, you will need to copy the .java and .class files that we give you from our directories.

The .java files for a homework assignment will be in:
~newhall/public/cs35/hw#/classes/
Our .class file solutions for a homework assignment will be in:
~newhall/public/cs35/hw#/solution/
And documentation for the classes that we give you is in files named class_name.html in:
~newhall/public/cs35/hwdoc/

Classes you'll need for this assignment include all the classes for assignment 6 plus the following (these can be copied from ~newhall/public/cs35/hw07/classes/):

PriorityQueue interface
HeapPriorityQueue class
Scanner class: scans either an input string or a file (same version as hw6).
You can use this to parse the query string once it is read in by SimpleIO's readString method.
SimpleIO class: same version as hw4.
ReadStream class: reads simple types from a file (like the SimpleIO class, but for file I/O).
You can use this to read in each URL from the list of URLs. To create a new ReadStream object from a String representing a file name:
```
	// url_list is a String holding the urlListFile name; File and FileInputStream are in java.io
	ReadStream r = new ReadStream(new FileInputStream(new File(url_list)));
```
Then just enter a loop that reads in the next URL (use the readLine method) until eof() is true.
TryHeap class: a simple program that tests the HeapPriorityQueue class.
TryScanner class: a simple program that tests the Scanner class. It demonstrates how to use the Scanner class to parse both files and strings.
Makefile: a sample Makefile for this assignment. You may need to modify it to work with your solution. If you are not doing so already, get into the habit of using a Makefile for all remaining assignments.

In addition, if you did not complete assignment 6, you can use our solution as a starting point for assignment 7.
The .java files that we gave you as a starting point can be copied from ~newhall/public/cs35/hw06/classes/.
Our solution .class files for the following classes can be copied from ~newhall/public/cs35/hw06/solution/:

LinkedBinarySearchTree class documentation is here (these are accessible only locally)
WordFrequencyTree class documentation is here
WordFrequencyObject class documentation is here

HAND IN

Using cs35handin, hand in a single tar file containing:

All .java files necessary for compiling your code (include any of the classes that we give you that you use in your solution).
A Makefile for building your code
A README file with:
1. Your name (and your partner's name if you had one)
2. The name of the class containing your main method

If you work with a partner, only one of you should submit your joint solution using cs35handin.
