Your work for this lab consists of two parts:
As usual, you can get the initial files for this lab by running update35. When you run update35 you will obtain a large number of files for this lab, but you need to edit only four of them to complete this lab. The files that you will need to edit are highlighted below. Many of the files from previous labs are in the library subdirectory and are not listed here:
In the next section we introduce our AVLTree implementation. We then describe your main task (the search.cpp program) and several specific tools you'll need to complete it.
You should start by completing and testing the AVLTree implementation. The AVLTree is very similar to the LinkedBST from Lab 07, except it maintains the AVL balance property to guarantee that the tree is balanced. (The AVL balance property is that, for each node in the tree, the height of that node's left and right subtrees differs by at most one.)
To efficiently maintain the AVL property, each AVLTreeNode stores its current height in addition to the other data stored by a LinkedBSTNode. In sum, each AVLTreeNode contains:
Our algorithms define an empty child node to have a height of -1 for simplicity, but an empty child node is not an actual node. It is the pointer NULL, so we have been very careful to avoid segmentation faults in our code. This makes the code a little messier than the discussion in class, but you should be able to follow the methods. You will need to be careful to check that children nodes are not NULL before attempting to access information such as a child node's height or children.
To complete the AVLTree, you should complete each of the rotation functions marked TODO in AVLTree.inl. There is one rotation function for each of the four cases discussed in class:
You should test the AVLTree implementation using the testAVL program. Ideally, your tests should confirm that each of your rotation functions behaves as expected for known examples of the tree, and also stress test the AVLTree implementation for a large number of random insertions and deletions.
You should write whatever tests are needed to evaluate your work, but you do not need to test other AVLTree functions that we provided for you.
Your main work for this lab is to write a search program, inside search.cpp that analyzes www.cs.swarthmore.edu web pages and allows a user to find pages that match a search query. This program should take two command-line arguments:
Your program should do the following:
$ ./search inputFiles/urls.txt inputFiles/ignore.txt Enter your earch query: lab Search results for "lab": www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab02.php: 5 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab05.php: 6 www.cs.swarthmore.edu/~newhall/index.php: 2 www.cs.swarthmore.edu/~aviv/index.html: 1 www.cs.swarthmore.edu/~adanner/index.php: 3 www.cs.swarthmore.edu/~jwaterman/index.php: 2 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab02.php: 4 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab01.php: 5 www.cs.swarthmore.edu/~richardw/index.html: 2 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab04.php: 6 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab03.php: 8 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab03.php: 2 www.cs.swarthmore.edu/~soni/cs21/f11/index.php: 20 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab06.php: 7 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab07.php: 17 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab02.php: 5 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab01.php: 8 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab06.php: 5 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab05.php: 9 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab04.php: 5 www.cs.swarthmore.edu/~soni/cs21/s12/index.php: 19 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 7 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab01.php: 7 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab05.php: 12 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab04.php: 14 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab03.php: 2 www.cs.swarthmore.edu/~soni/cs35/f12/index.php: 23 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab07.php: 11 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab06.php: 14 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php: 16 www.cs.swarthmore.edu/~soni/index.php: 2 Enter search query: the Search results for "the": No results found Enter search query: intelligence Search results for "intelligence": www.cs.swarthmore.edu/~meeden/index.html: 3 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php: 2 Enter search query: systems Search results for "systems": www.cs.swarthmore.edu/~newhall/index.php: 11 www.cs.swarthmore.edu/~aviv/index.html: 2 www.cs.swarthmore.edu/~adanner/index.php: 6 www.cs.swarthmore.edu/~meeden/index.html: 5 www.cs.swarthmore.edu/~richardw/index.html: 3 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php: 2 Enter search query: machine Search results for "machine": www.cs.swarthmore.edu/~richardw/index.html: 1 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php: 2 www.cs.swarthmore.edu/~soni/index.php: 4 Enter search query: database Search results for "database": www.cs.swarthmore.edu/~newhall/index.php: 1 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php: 2 Enter search query: algorithms Search results for "algorithms": www.cs.swarthmore.edu/~newhall/index.php: 2 www.cs.swarthmore.edu/~adanner/index.php: 7 www.cs.swarthmore.edu/~meeden/index.html: 1 www.cs.swarthmore.edu/~richardw/index.html: 1 www.cs.swarthmore.edu/~soni/cs21/f11/index.php: 2 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab07.php: 1 www.cs.swarthmore.edu/~soni/cs21/s12/index.php: 2 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab05.php: 1 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab04.php: 2 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab03.php: 4 www.cs.swarthmore.edu/~soni/cs35/f12/index.php: 6 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab07.php: 1 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab06.php: 9 www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php: 1 www.cs.swarthmore.edu/~soni/index.php: 2 Enter your search query: <CTRL-D> $
Note that the order of your results may differ depending on how you process your search, but the content should be the same.
For each URL you should build a tree to store the frequency
of each word in the web page; each of these trees will probably have a
type such as AVLTree<string,int>. But wait, there's more!
You will also need a method to associate each URL with its word-frequency
tree. One way to accomplish this is...with another tree! For instance,
you could use each URL as a key, and a pointer to the URL's word-frequency tree
as the value. This might have the type
Below we give some specific advice for this lab, including how to read command-line arguments in C++, how to use C++ file streams and the SwatHTMLStream, and how to create your own web page for testing purposes.
To process command-line arguments in C++ you need to add two arguments to your main function:
int main(int argc, char** argv) { // ... return 0; }argc is an int representing the number of arguments entered (including the program name itself), and argv is an array of character arrays, with each character array containing one argument to the program. (These variables can be called anything you want, but it's a long-standing convention to call them argc and argv.) You can then use argc and argv inside your program to access the command-line arguments. For example, if you run your program:
$ ./search inputFiles/urls.txt inputFiles/ignore.txtargc has the value 3 (there are three parameters). That means that argv is a c-style array with three values. argv[0] is the value "search", argv[1] is "inputFiles/urls.txt" and argv[2] is "inputFiles/ignore.txt". argc is useful to check if the user entered the correct number of arguments. For example, if the user forgets the arguments you can print an error:
$ ./search Usage: search <url-file> <ignore-file> Required options <url-file>: a file of urls to index and search <ignore-file>: a file of words to ignore
We've been using cin and cout to get input from and display output to the user, but cin and cout are just two example input and output streams in C++. In this lab you will create and use an input stream to read from a file, much like we've read input from cin. For example:
#include <iostream> #include <string> #include <fstream> using namespace std; int main(int argc, char** argv) { ifstream myFile("inputFiles/urls.txt"); if (!myFile.good()) { cerr << "Whoops! We couldn't open inputFiles/urls.txt.\n"; return 1; } string url; myFile >> url; while (!myFile.eof()) { cout << url << endl; myFile >> url; } myFile.close(); return 0; }To use a file stream you must include the fstream library. This program creates a file input stream, myFile, to read from the file urls.txt. It then checks that the input stream is valid (i.e., that the file exists and is readable), and then uses the input stream to read each input from the file. At the end of the file, the program closes the file and exits.
This program also demonstrates cerr, another output stream like cout. The usual convention is to use cout for standard output when the program is normally executing, but to use cerr for unusual output such as error messages.
For this lab we have also provided SwatHTMLStream, a stream-like class that allows you to read HTML files hosted by the local CS web server, www.cs.swarthmore.edu. To use the SwatHTMLStream you must be logged into a Swarthmore Computer Science lab computer. You can use the SwatHTMLStream much like a file stream:
#include <iostream> #include <string> #include "swatHTMLStream.h" using namespace std; int main(int argc, char** argv) { SwatHTMLStream myFile("www.cs.swarthmore.edu/~soni/index.html"); if (!myFile.good()) { cerr << "Whoops! We couldn't open the url.\n"; return 1; } string word; myFile >> word; while (!myFile.eof()) { cout << word << endl; myFile >> word; } myFile.close(); return 0; }The SwatHTMLStream is not a fully-featured input stream, but can read strings of words (and number-strings) from a given URL. It skips HTML tags (for correctly-formatted HTML) and most punctuation characters, and converts all strings to lowercase.
$ mkdir ~/public_html $ chmod a+rx ~/public_html
$ chmod a+r ~/public_html/index.htmlYou can start with a simple text file (e.g., Hello world!) and add more to it later. (See other users' web pages for examples of what you can do in HTML.)
Once you are satisfied with your program, hand it in by typing handin35 at the unix prompt.
You may run handin35 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them.