A skeleton version of the programs will appear in your cs35/lab/07 directory when you run update35. The program handin35 will only submit files in this directory.
I encourage you to work with a partner on this lab and the
entire project.
The first step to building a search engine is to be able to analyze
web pages based on their content. This will eventually allow us to
rate how relevant a particular page is for a given user request. The
content of a page is obviously correlated with the words it
contains. For this lab you will use a binary search tree to help you
calculate and store the frequencies of words found in a given web
page. Then you will print out the most frequent words found.
Your program will take three command-line arguments: the filename of a web page, a minimum frequency number to report, and a filename for a file containing a list of words to ignore during the analysis. For example:
$ ./part1 /home/adanner/public_html/index.php 6 ignore.txt
This will output a list of words in alphabetic order that appeared at least 6 times on the given web page and were not in the given ignore file.
algorithms: 7 arge: 8 bib: 9 computer: 9 cs: 10 danner: 11 data: 13 i/o-efficient: 6 pdf: 10 science: 6
Note that any user that has a web page on our system typically has
a file named /home/username/public_html/index.php (it may
also be called index.html). You may test your program on any
of these files.
A number of classes are provided for you. Classes, programs or files that you have to complete for this lab appear in bold below.
BST | Class | Abstract data type for Binary Search Trees |
KVPair | Class | General class for storing Key/Value pairs |
BSTNode | Class | General node for storing BST key and element |
LinkedBST | Class | Linked implementation of BST |
AVLTree | Class | Balanced AVL implementation of BST |
Scanner | Class | Parser for web pages and text files |
URLContent | Class | Stores filename, URL, and word frequency tree for a web page |
ignore.txt | File | A list of words to ignore when parsing web pages |
part1.cpp | Program | The main program for part 1 of the project |
int main(int argc, char *argv[]) { cout << "Number of command-line arguments: " << argc << endl; for (int i=0; i!=argc; i++) { printf("Argument %d was %s\n", i, argv[i]); } }When you compile and test this program you'll see the following results.
% g++ test.cpp % ./a.out try 21 and 35 Number of command-line arguments: 5 Argument 0 was ./a.out Argument 1 was try Argument 2 was 21 Argument 3 was and Argument 4 was 35Note that each of the arguments is stored as a string. To convert an argument from a string into an integer you can use the function atoi that is part of the cstdlib library. Try converting argv[2] ("21") and argv[4] ("35") in the example above from strings into integers.
The constructor of this class will initiate most of the computation necessary to complete this lab. It will open the given web page file, parse it using an instance of the Scanner class, and create a word frequency tree that summarizes the content of the page. The constructor includes a variable called urlName that will be used in subsequent parts; you can ignore it for now. The word frequency tree is a binary search tree where the key is a string representing a unique token and the element is an integer representing how frequently that token appears in the page. You should only include tokens that are: