A skeleton version of the programs will appear in your
cs35/lab/06 directory when you run update35. The
program handin35 will only submit files in this
directory.
I encourage you to work with a partner on this lab and the
entire project.
Introduction
This lab is the first part of a larger project to build a web browser
with a search engine for a limited portion of the web. By the end of
the project, your browser and search engine will have the following
capabilities:
- Display a web page given a URL
- Find relevant web pages given a query entered by a user
- Display relevant URLs in order of best match to worst
- Automatically display the best matching URL result of a
search
The first step to building a search engine is to be able to analyze
web pages based on their content. This will eventually allow us to
rate how relevant a particular page is for a given user request. The
content of a page is obviously correlated with the words it
contains. For this lab you will use a binary search tree to help you
calculate and store the frequencies of words found in a given web
page. Then you will print out the most frequent words found.
Sample Run
Your program will take three command-line arguments: the filename
of a web page, a minimum frequency number to report, and a filename
for a file containing a list of words to ignore during the analysis.
For example:
$ ./part1 /home/adanner/public_html/index.php 6 ignore.txt
This will output a list of words in alphabetic order that appeared
at least 6 times on the given web page and were not in the given
ignore file.
algorithms: 7
arge: 8
bib: 9
computer: 8
cs: 8
danner: 11
data: 13
i/o-efficient: 6
pdf: 10
science: 6
Note that any user that has a web page on our system typically has
a file named /home/username/public_html/index.php (it may
also be called index.html). You may test your program on any
of these files.
Classes, Programs, and Files Used
A number of classes are provided for you. Classes, programs or
files that you have to complete for this lab appear in bold below.
BST | Class | Abstract data type for Binary Search Trees |
KVPair | Class | General class for storing Key/Value pairs |
BSTNode | Class | General node for storing BST key and element |
LinkedBST | Class | Linked implementation of BST |
Scanner | Class | Parser for web pages and text
files |
URLContent | Class | Stores filename, URL,
and word frequency tree for a web page |
ignore.txt | File | A list of words to ignore when
parsing web pages |
part1.cpp | Program | The main program for part 1 of the project |
Getting Started
- Experiment with the Scanner class.
Review the Scanner.h file to see the interface available for
parsing. Next, review the ScannerTest.cpp file to see an
example of using this class. You can do make ScannerTest,
and then ./ScannerTest to see how it works on the given test
file called ScannerFile.txt. Also test the scanner on an
actual web page.
-
If you'd like to make your own web page as a test for the Scanner and
this week's lab do the following.
- In your top-level directory, make a directory to hold the web
page: mkdir public_html
- Make the directory readable and executable: chmod a+rx
public_html
- In the public_html directory, edit a file called index.html. As a
start you can just write "hello web" in it and add more later.
- After you save the file, make it readable: chmod a+r index.html
- In a web browser go to: www.cs.swarthmore.edu/~yourUserName
You should be able to view your own web page.
- Create an initial list of words to ignore.
Edit the file ignore.txt. Remove the comment at the top and
add one word per line for words you think should be ignored when
analyzing a web page. Common words such as "a", "and", or "the" are
good candidates. It makes sense to ignore words that do not add much
information about what the page is actually about.
- Experiment with handling command-line arguments.
In
order to process command-line arguments in C++ you need to add
standard parameters called argc and argv to the main
program. The parameter argc is an integer representing the
number of arguments entered. The parameter argv is an array
of strings, one string for each argument. Below is a sample program
that will report on the command-line arguments entered.
int main(int argc, char *argv[]) {
cout << "Number of command-line arguments: " << argc << endl;
for (int i=0; i!=argc; i++) {
printf("Argument %d was %s\n", i, argv[i]);
}
}
When you compile and test this program you'll see the following
results.
% g++ test.cpp
% ./a.out try 21 and 35
Number of command-line arguments: 5
Argument 0 was ./a.out
Argument 1 was try
Argument 2 was 21
Argument 3 was and
Argument 4 was 35
Note that each of the arguments is stored as a string. To convert an
argument from a string into an integer you can use the function
atoi that is part of the cstdlib library. Try
converting argv[2] ("21") and argv[4] ("35") in the
example above from strings into integers.
- Implement the URLContent class.
Review the
URLContent.h file which specifies the public interface for
this class. Feel free to add additional private helper methods to
this class declaration as needed.
The constructor of this class will initiate most of the computation
necessary to complete this lab. It will open the given web page file,
parse it using an instance of the Scanner class, and create a word
frequency tree that summarizes the content of the page. The
constructor includes a variable called urlName that will be
used in subsequent parts; you can ignore it for now. The word
frequency tree is a binary search tree where the key is a string
representing a unique token and the element is an integer representing
how frequently that token appears in the page. You should only
include tokens that are:
- longer than length 1
Use the size() method of strings to determine length.
- are not html tags
You can assume that anything that begins with a less than sign is a tag.
- not in the ignore file
You may want to use the Scanner class
to read in the ignore words; store them in an appropriate data
structure so that you can efficiently find them.
-
Complete the main program in part1.cpp.
The main program should be quite brief. It should process the
command-line arguments, create an instance of the URLContent class,
and then report the results of analyzing the web page.
-
Fine tune the ignore list stored in the ignore.txt file.
Based on tests of many different web pages on our system, update the
ignore list appropriately.
Submit
Once you are satisfied with your code, hand it in by typing
handin35. This will copy the code from your
cs35/lab/06 to my grading directory. You may run
handin35 as many times as you like, and only the most recent
submission will be recorded. If you worked with a partner, only one of
you needs to handin the code.