CS35: Lab 7

A skeleton version of the programs will appear in your cs35/lab/07 directory when you run update35. The program handin35 will only submit files in this directory.

I encourage you to work with a partner on this lab and the entire project.

Introduction

This lab is the first part of a larger project to build a web browser with a search engine for a limited portion of the web. By the end of the project, your browser and search engine will have the following capabilities:

Display a web page given a URL
Find relevant web pages given a query entered by a user
Display relevant URLs in order of best match to worst
Automatically display the best matching URL result of a search

The first step to building a search engine is to be able to analyze web pages based on their content. This will eventually allow us to rate how relevant a particular page is for a given user request. The content of a page is obviously correlated with the words it contains. For this lab you will use a binary search tree to help you calculate and store the frequencies of words found in a given web page. Then you will print out the most frequent words found.

Sample Run

Your program will take three command-line arguments: the filename of a web page, a minimum frequency number to report, and a filename for a file containing a list of words to ignore during the analysis. For example:

$ ./part1 /home/adanner/public_html/index.php 6 ignore.txt

This will output a list of words in alphabetic order that appeared at least 6 times on the given web page and were not in the given ignore file.

algorithms: 7
arge: 8
bib: 8
computer: 8
cs: 10
danner: 10
data: 13
i/o-efficient: 6
nbsp: 30
pdf: 9
science: 6

Note that any user that has a web page on our system typically has a file named /home/username/public_html/index.php (it may also be called index.html). You may test your program on any of these files.

Classes, Programs, and Files Used

A number of classes are provided for you. Classes, programs or files that you have to complete for this lab appear in bold below.

BST	Class	Abstract data type for Binary Search Trees
KVPair	Class	General class for storing Key/Value pairs
BSTNode	Class	General node for storing BST key and element
LinkedBST	Class	Linked implementation of BST
LLRBTree	Class	Left Leaning Red Black Tree implementation of BST
Scanner	Class	Parser for web pages and text files
URLContent	Class	Stores filename, URL, and word frequency tree for a web page
ignore.txt	File	A list of words to ignore when parsing web pages
part1.cpp	Program	The main program for part 1 of the project

Getting Started

Experiment with the Scanner class.

Review the Scanner.h file to see the interface available for parsing. Next, review the ScannerTest.cpp file to see an example of using this class. You can do make ScannerTest, and then ./ScannerTest to see how it works on the given test file called ScannerFile.txt. Also test the scanner on an actual web page.
If you'd like to make your own web page as a test for the Scanner and this week's lab do the following.
- In your top-level directory, make a directory to hold the web page: mkdir public_html
- Make the directory readable and executable: chmod a+rx public_html
- In the public_html directory, edit a file called index.html. As a start you can just write "hello web" in it and add more later.
- After you save the file, make it readable: chmod a+r index.html
- In a web browser go to: www.cs.swarthmore.edu/~yourUserName
You should be able to view your own web page.
Create an initial list of words to ignore.

Edit the file ignore.txt. Remove the comment at the top and add one word per line for words you think should be ignored when analyzing a web page. Common words such as "a", "and", or "the" are good candidates. It makes sense to ignore words that do not add much information about what the page is actually about.
Experiment with handling command-line arguments.

In order to process command-line arguments in C++ you need to add standard parameters called argc and argv to the main program. The parameter argc is an integer representing the number of arguments entered. The parameter argv is an array of strings, one string for each argument. Below is a sample program that will report on the command-line arguments entered.
```
int main(int argc, char *argv[]) {
  cout << "Number of command-line arguments: " << argc << endl;
  for (int i=0; i!=argc; i++) {
    printf("Argument %d was %s\n", i, argv[i]);
  }
}
```
When you compile and test this program you'll see the following results.
```
% g++ test.cpp
% ./a.out try 21 and 35
Number of command-line arguments: 5
Argument 0 was ./a.out
Argument 1 was try
Argument 2 was 21
Argument 3 was and
Argument 4 was 35
```
Note that each of the arguments is stored as a string. To convert an argument from a string into an integer you can use the function atoi that is part of the cstdlib library. Try converting argv[2] ("21") and argv[4] ("35") in the example above from strings into integers.
Implement the URLContent class.

Review the URLContent.h file which specifies the public interface for this class. Feel free to add additional private helper methods to this class declaration as needed.
The constructor of this class will initiate most of the computation necessary to complete this lab. It will open the given web page file, parse it using an instance of the Scanner class, and create a word frequency tree that summarizes the content of the page. The constructor includes a variable called urlName that will be used in subsequent parts; you can ignore it for now. The word frequency tree is a binary search tree where the key is a string representing a unique token and the element is an integer representing how frequently that token appears in the page. You should only include tokens that are:
- longer than length 1
  Use the size() method of strings to determine length.
- are not html tags
  You can assume that anything that begins with a less than sign is a tag.
- not in the ignore file
  You may want to use the Scanner class to read in the ignore words; store them in an appropriate data structure so that you can efficiently find them.
Complete the main program in part1.cpp.

The main program should be quite brief. It should process the command-line arguments, create an instance of the URLContent class, and then report the results of analyzing the web page.
Fine tune the ignore list stored in the ignore.txt file.

Based on tests of many different web pages on our system, update the ignore list appropriately.

Submit

Once you are satisfied with your code, hand it in by typing handin35. This will copy the code from your cs35/lab/07 to my grading directory. You may run handin35 as many times as you like, and only the most recent submission will be recorded. If you worked with a partner, only one of you needs to handin the code.

CS35 Lab7:

Analyzing web content with binary search trees