CS35 Project: Part 1
The first step to building a search engine is to be able to analyze web
pages based on their content. This will eventually allow us to rate how
relevant a particular page is for a given user request. The content of a page
is obviously correlated with the words it contains. For this assignment
you will use a binary search tree to help you calculate and store the
frequencies of words found in a given web page. Then you will print
out the most frequent words found.
Begin by experimenting with the Scanner class and its associated test
class provided below. The Scanner class is written to scan either a file
or a string. For this assignment you will use the Scanner class for scanning
a file. The constructor for the Scanner class requires a filename. It then
opens the given file and will return the next token that it finds when the
method getNextToken() is called. For our purposes a token will be a single
non-alphabetic and non-numeric character, such as punctuation, or a string of
contiguous alphabetic and numeric characters up until white space is
encountered or the end of file.
Once you understand how the Scanner class works, modify the
TestScanner class to read in a filename from the command line, and test that
it works (see the section on Command Line Arguments below).
Next, you need to implement the
Your program will take two or three command line arguments:
After processing the input file. Your program should print out all words
in the WordFrequencyTree in alphabetical order that occur at
least
Command Line Arguments
WordFrequencyTree Class
The following are examples of what your command line should look like, and
what your program should do for different calls
Part of this assignment includes developing a good ignorelist file for html
files. As you test your program for different .html input files, develop a
list of words that should be ignored from html files, and use this list as
the optional third command line argument to your program. You will submit
your html_ignore_list file as part of your homework solution.
GETTING STARTED
insertElement
,
findMinimumNode
,
and findMaximumNode
methods of the LinkedBinarySearchTree
class. Make sure that your insertElement
does not allow
duplicate elements to be inserted in the tree. Once you have tested these
changes to the LinkedBinaryTree class, you can start implementing the main
part of this assignment.
% java WordFrequencyTree input_file_name min_freq_num [ignore_file_name]
The first is an input file that your program will process using the Scanner
class. You should use .html input files. Any user that has a webpage on
our system has a file named /home/username/public_html/index.html. You can
test your WordFrequencyTree by parsing these files. The second argument,
min_freq_num, is used to determine which elements in your WordFrequency
tree your program prints out (only words with a
frequency count >= min_freq_num). The optional third argument is an ignore
list file containing words
that should be ignored in the input file. If there is an ignore list
command line argument, then your program should process the ignore list first
and create a data structure for storing the list of tokens to ignore. Next,
your program should process the input file. For all tokens in the input
file that are not on the ignore list, you will create a WordFrequency object
and add it to a binary search tree sorted alphabetically. Only unique
words should be added to the tree. When a duplicate word is encountered,
just update the frequency count of its WordFrequency object that is already
in the tree.
min_freq_num
times.
Command line arguments are passed into a main method as an array of String:
public static void main(String args[]) {
To get the number of command line args, check value of args.length
and to get a particular command line argument access the corresponding
args array entry (arg[0]
is the first command line argument
as a String). To convert a command line argument from its String form to
another type, use the valueOf
methods of the Integer, Float, etc.
classes. For example:
# if this is the command line
#
% java MatrixMult 10 15 matrix_contents
# this is how to convert the command line args 10 and 15 to int
#
int m = (Integer.valueOf(args[0])).intValue(); // arg[0] is "10"
int n = (Integer.valueOf(args[1])).intValue(); // arg{1] is "15"
String matrixFile = args[2];
You should create a WordFrequencyTree class that implements the
BinarySearchTree interface. This class will have a LinkedBinaryTree data
member (it may need additional data members too), and its implementation of
the BinarySearchTree interface methods just invoke the corresponding method on
its LinkedBinaryTree data member. In addition it should have at least three
constructors: a default constructor, a constructor that takes an input file
argument and a constructor that takes an input file and an ignore file argument.
The second two constructors will process the file(s) and create the
initial word frequency tree as specified above.
# program will add WordFrequency objects for each token
# in the index.html input file that is not on the html_ignore_list
# and will print out in alphabetical order all words that occur
# at least 3 times in the tree
#
% java WordFrequencyTree index.html 3 html_ignore_list
#
# program will add WordFrequency objects for each token
# in the index.html input file and will print out in alphabetical
# order all words that occur at least 4 times in the tree
#
% java WordFrequencyTree index.html 4
#
# program will print a usage message to stdout and exit
# (ex) usage: WordFrequencyTree inputfile min_freq_num
# or: WordFrequencyTree inputfile min_freq_num ignorelist
#
% java WordFrequencyTree index.html
JAVA CLASSES
Classes that you should use as a starting point for this assignment
can be copied from ~newhall/public/cs35/hw06/classes/
These classes include the following:
testscannerfile
HAND IN
Using cs35handin, hand in a single tar file containing:
If you work with a partner, please only one of you submit your joint solution
using cs35handin.