CS35 Lab6:

Analyzing web content with binary search trees

Due 11:59pm Wednesday, October 27th

A skeleton version of the programs will appear in your cs35/lab/06 directory when you run update35. The program handin35 will only submit files in this directory.

I encourage you to work with a partner on this lab and the entire project.

Introduction
This lab is the first part of a larger project to build a web browser with a search engine for a limited portion of the web. By the end of the project, your browser and search engine will have the following capabilities:

The first step to building a search engine is to be able to analyze web pages based on their content. This will eventually allow us to rate how relevant a particular page is for a given user request. The content of a page is obviously correlated with the words it contains. For this lab you will use a binary search tree to help you calculate and store the frequencies of words found in a given web page. Then you will print out the most frequent words found.

Sample Run

Your program will take three command-line arguments: the filename of a web page, a minimum frequency number to report, and a filename for a file containing a list of words to ignore during the analysis. For example:

$ ./part1 /home/adanner/public_html/index.php 6 ignore.txt

This will output a list of words in alphabetic order that appeared at least 6 times on the given web page and were not in the given ignore file.

algorithms: 7
arge: 8
bib: 9
computer: 8
cs: 8
danner: 11
data: 13
i/o-efficient: 6
pdf: 10
science: 6

Note that any user that has a web page on our system typically has a file named /home/username/public_html/index.php (it may also be called index.html). You may test your program on any of these files.

Classes, Programs, and Files Used

A number of classes are provided for you. Classes, programs or files that you have to complete for this lab appear in bold below.

BSTClassAbstract data type for Binary Search Trees
KVPairClassGeneral class for storing Key/Value pairs
BSTNodeClassGeneral node for storing BST key and element
LinkedBSTClassLinked implementation of BST
ScannerClassParser for web pages and text files
URLContentClassStores filename, URL, and word frequency tree for a web page
ignore.txtFileA list of words to ignore when parsing web pages
part1.cppProgramThe main program for part 1 of the project

Getting Started
  1. Experiment with the Scanner class.

    Review the Scanner.h file to see the interface available for parsing. Next, review the ScannerTest.cpp file to see an example of using this class. You can do make ScannerTest, and then ./ScannerTest to see how it works on the given test file called ScannerFile.txt. Also test the scanner on an actual web page.
  2. If you'd like to make your own web page as a test for the Scanner and this week's lab do the following.
    • In your top-level directory, make a directory to hold the web page: mkdir public_html
    • Make the directory readable and executable: chmod a+rx public_html
    • In the public_html directory, edit a file called index.html. As a start you can just write "hello web" in it and add more later.
    • After you save the file, make it readable: chmod a+r index.html
    • In a web browser go to: www.cs.swarthmore.edu/~yourUserName
    You should be able to view your own web page.
  3. Create an initial list of words to ignore.

    Edit the file ignore.txt. Remove the comment at the top and add one word per line for words you think should be ignored when analyzing a web page. Common words such as "a", "and", or "the" are good candidates. It makes sense to ignore words that do not add much information about what the page is actually about.
  4. Experiment with handling command-line arguments.

    In order to process command-line arguments in C++ you need to add standard parameters called argc and argv to the main program. The parameter argc is an integer representing the number of arguments entered. The parameter argv is an array of strings, one string for each argument. Below is a sample program that will report on the command-line arguments entered.
    int main(int argc, char *argv[]) {
      cout << "Number of command-line arguments: " << argc << endl;
      for (int i=0; i!=argc; i++) {
        printf("Argument %d was %s\n", i, argv[i]);
      }
    }
    
    When you compile and test this program you'll see the following results.
    % g++ test.cpp
    % ./a.out try 21 and 35
    Number of command-line arguments: 5
    Argument 0 was ./a.out
    Argument 1 was try
    Argument 2 was 21
    Argument 3 was and
    Argument 4 was 35
    
    Note that each of the arguments is stored as a string. To convert an argument from a string into an integer you can use the function atoi that is part of the cstdlib library. Try converting argv[2] ("21") and argv[4] ("35") in the example above from strings into integers.
  5. Implement the URLContent class.

    Review the URLContent.h file which specifies the public interface for this class. Feel free to add additional private helper methods to this class declaration as needed.

    The constructor of this class will initiate most of the computation necessary to complete this lab. It will open the given web page file, parse it using an instance of the Scanner class, and create a word frequency tree that summarizes the content of the page. The constructor includes a variable called urlName that will be used in subsequent parts; you can ignore it for now. The word frequency tree is a binary search tree where the key is a string representing a unique token and the element is an integer representing how frequently that token appears in the page. You should only include tokens that are:

    • longer than length 1
      Use the size() method of strings to determine length.
    • are not html tags
      You can assume that anything that begins with a less than sign is a tag.
    • not in the ignore file
      You may want to use the Scanner class to read in the ignore words; store them in an appropriate data structure so that you can efficiently find them.
  6. Complete the main program in part1.cpp.

    The main program should be quite brief. It should process the command-line arguments, create an instance of the URLContent class, and then report the results of analyzing the web page.
  7. Fine tune the ignore list stored in the ignore.txt file.

    Based on tests of many different web pages on our system, update the ignore list appropriately.
Submit
Once you are satisfied with your code, hand it in by typing handin35. This will copy the code from your cs35/lab/06 to my grading directory. You may run handin35 as many times as you like, and only the most recent submission will be recorded. If you worked with a partner, only one of you needs to handin the code.