Your work for this lab consists of two parts:
As usual, you can get the initial files for this lab by running update35. When you run update35 you will obtain a large number of files for this lab, but you need to edit only four of them to complete this lab. The files that you will need to edit are highlighted below. Many of the files are from Lab 07 and are not listed here:
In the next section we introduce our AvlTree implementation. We then describe your main task (the search.cpp program) and several specific tools you'll need to complete it.
You should start by completing and testing the AvlTree implementation. The AvlTree is very similar to the LinkedBst from Lab 07, except it maintains the AVL balance property to guarantee that the tree is balanced. (The AVL balance property is that, for each node in the tree, the height of that node's left and right subtrees differs by at most one.)
To efficiently maintain the AVL property, each AvlTreeNode stores its current height in addition to the other data stored by a LinkedBstNode. In sum, each AvlTreeNode contains:
Our AvlTree has one representational oddity: instead of representing an empty tree as a NULL pointer we define a special node, AvlTreeNode<K,V>::EMPTY_TREE, to represent an empty tree. The EMPTY_TREE is just a node with height -1 and left and right child pointers back to itself:
To complete the AvlTree, you should complete each of the rotation functions marked TODO in avltree.inl. There is one rotation function for each of the four cases discussed in class:
You should test the AvlTree implementation using the testAvl program. Ideally, your tests should confirm that each of your rotation functions behaves as expected for known examples of the tree, and also stress test the AvlTree implementation for a large number of random insertions and deletions.
You should write whatever tests are needed to evaluate your work, but you do not need to test other AvlTree functions that we provided for you.
Your main work for this lab is to write a search program, inside search.cpp that analyzes www.cs.swarthmore.edu web pages and allows a user to find pages that match a search query. This program should take two command-line arguments:
Your program should do the following:
$ ./search inputFiles/urls.txt inputFiles/ignore.txt Enter your search query: biology Searching for "biology": www.cs.swarthmore.edu/~meeden/index.html: 1 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab06.php: 1 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab04.php: 2 www.cs.swarthmore.edu/~soni/cs21/s12/index.php: 3 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 1 www.cs.swarthmore.edu/~soni/index.php: 6 Enter your search query: database Searching for "database": www.cs.swarthmore.edu/~charlie/index.html: 1 www.cs.swarthmore.edu/~newhall/index.php: 1 Enter your search query: machine Searching for "machine": www.cs.swarthmore.edu/~richardw/index.html: 1 www.cs.swarthmore.edu/~soni/index.php: 4 Enter your search query: intelligence Searching for "intelligence": www.cs.swarthmore.edu/~meeden/index.html: 3 Enter your search query: systems Searching for "systems": www.cs.swarthmore.edu/~meeden/index.html: 5 www.cs.swarthmore.edu/~adanner/index.php: 5 www.cs.swarthmore.edu/~charlie/index.html: 7 www.cs.swarthmore.edu/~richardw/index.html: 3 www.cs.swarthmore.edu/~newhall/index.php: 9 Enter your search query: geographic Searching for "geographic": www.cs.swarthmore.edu/~adanner/index.php: 5 Enter your search query: lab Searching for "lab": www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab01.php: 5 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab02.php: 4 www.cs.swarthmore.edu/~meeden/index.html: 2 www.cs.swarthmore.edu/~adanner/index.php: 3 www.cs.swarthmore.edu/~richardw/index.html: 2 www.cs.swarthmore.edu/~newhall/index.php: 1 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab01.php: 5 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab02.php: 5 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab06.php: 7 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab04.php: 6 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab03.php: 8 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab05.php: 6 www.cs.swarthmore.edu/~soni/cs21/f11/index.php: 20 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab07.php: 17 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab01.php: 8 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab06.php: 5 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab04.php: 5 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab03.php: 2 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab05.php: 9 www.cs.swarthmore.edu/~soni/cs21/s12/index.php: 20 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 7 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 16 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab03.php: 2 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab02.php: 6 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 13 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab06.php: 13 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab05.php: 10 www.cs.swarthmore.edu/~soni/cs35/s12/index.php: 19 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 15 www.cs.swarthmore.edu/~soni/index.php: 3 Enter your search query: the Searching for "the": Enter your search query: twilight Searching for "twilight": Enter your search query: <CTRL-D> $
Well, that's comforting. Searching research areas seems to match what we know each professor works on at Swarthmore. Also, we like to give you lots of labs and reinforce that this is a lab. Lastly, notice what happens you search for ignore words or crappy movies.
For each URL you should build a tree to store the frequency
of each word in the web page; each of these trees will probably have a
type such as AvlTree<string,int>. But wait, there's more!
You will also need a method to associate each URL with its word-frequency
tree. One way to accomplish this is...with another tree! For instance,
you could use each URL as a key, and a pointer to the URL's word-frequency tree
as the value. This might have the type
Below we give some specific advice for this lab, including how to read command-line arguments in C++, how to use C++ file streams and the SwatHtmlStream, and how to create your own web page for testing purposes.
To process command-line arguments in C++ you need to add two arguments to your main function:
int main(int argc, char** argv) { // ... return 0; }argc is an int representing the number of arguments entered (including the program name itself), and argv is an array of character arrays, with each character array containing one argument to the program. (These variables can be called anything you want, but it's a long-standing convention to call them argc and argv.) You can then use argc and argv inside your program to access the command-line arguments.
We've been using cin and cout to get input from and display output to the user, but cin and cout are just two example input and output streams in C++. In this lab you will create and use an input stream to read from a file, much like we've read input from cin. For example:
#include <iostream> #include <string> #include <fstream> using namespace std; int main(int argc, char** argv) { ifstream myFile("inputFiles/urls.txt"); if (!myFile.good()) { cerr << "Whoops! We couldn't open inputFiles/urls.txt.\n"; return 1; } string url; myFile >> url; while (!myFile.eof()) { cout << url << endl; myFile >> url; } myFile.close(); return 0; }To use a file stream you must include the fstream library. This program creates a file input stream, myFile, to read from the file urls.txt. It then checks that the input stream is valid (i.e., that the file exists and is readable), and then uses the input stream to read each input from the file. At the end of the file, the program closes the file and exits.
This program also demonstrates cerr, another output stream like cout. The usual convention is to use cout for standard output when the program is normally executing, but to use cerr for unusual output such as error messages.
For this lab we have also provided SwatHtmlStream, a stream-like class that allows you to read HTML files hosted by the local CS web server, www.cs.swarthmore.edu. To use the SwatHtmlStream you must be logged into a Swarthmore Computer Science lab computer. You can use the SwatHtmlStream much like a file stream:
#include <iostream> #include <string> #include "swathtmlstream.h" using namespace std; int main(int argc, char** argv) { SwatHtmlStream myFile("www.cs.swarthmore.edu/~soni/index.html"); if (!myFile.good()) { cerr << "Whoops! We couldn't open the url.\n"; return 1; } string word; myFile >> word; while (!myFile.eof()) { cout << word << endl; myFile >> word; } myFile.close(); return 0; }The SwatHtmlStream is not a fully-featured input stream, but can read strings of words (and number-strings) from a given URL. It skips HTML tags (for correctly-formatted HTML) and most punctuation characters, and converts all strings to lowercase.
$ mkdir ~/public_html $ chmod a+rx ~/public_html
$ chmod a+r ~/public_html/index.htmlYou can start with a simple text file (e.g., Hello world!) and add more to it later. (See other users' web pages for examples of what you can do in HTML.)
Once you are satisfied with your program, hand it in by typing handin35 at the unix prompt.
You may run handin35 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them.