A skeleton version of the programs will appear in your cs35/labs/08 directory when you run update35. The program handin35 will only submit files in this directory.
I encourage you to work with a partner on this lab and the
entire project.
This lab is the second part of a larger project to build a web browser with a search engine for a limited portion of the web. In the first part of this project you summarized the content of a web page by creating a binary search tree that contained the word frequencies of all words that appeared in the page that were not in a given file of words to ignore.
In the second part of this project, you will read in a file of URLs
and a file of words to ignore, convert the URLs into filenames, and
summarize the content of each file as in the previous lab. Then you
will repeatedly prompt the user for a search query. You will
determine how often each of the query words appear in the summarized
web pages. For example, suppose a user enters the query "computer
science department". For each web page, your program should
determine the sum of the number of times the individual words
"computer", "science", and "department"
appear in a page. You will convert this sum into a priority and
associate it with the URL and store them in a priority queue. Once
all the pages have been checked, you will report the relevant pages in
priority order.
Your program will take two command-line arguments: the filename of a list of URLs stored in the Computer Science domain and the filename of a list of words to ignore during the analysis. For example:
% ./part2 urls.txt ignore.txt
This will output:
Web Search Engine Summarized 13 URLs Enter a query or type 'quit' to end Search for: natural language processing Relevant pages: www.cs.swarthmore.edu/~richardw/ : 17 www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 4 www.cs.swarthmore.edu/~adanner/cs35/s08/lab7.php : 2 www.cs.swarthmore.edu/~adanner/cs35/s10/index.php : 2 www.cs.swarthmore.edu/~newhall/ : 1 Search for: Billings Montana No relevant pages found Search for: developmental robotics Relevant pages: www.cs.swarthmore.edu/~meeden : 16 www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 2 Search for: maze Relevant pages: www.cs.swarthmore.edu/~adanner/cs35/s08/lab4.php : 38 www.cs.swarthmore.edu/~adanner/cs35/s10/lab4.php : 33 www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 1 www.cs.swarthmore.edu/~adanner/cs35/s08/lab7.php : 1 Search for: analysis of algorithms Relevant pages: www.cs.swarthmore.edu/~adanner/ : 8 www.cs.swarthmore.edu/~adanner/cs35/s08/index.php : 6 www.cs.swarthmore.edu/~richardw/ : 6 www.cs.swarthmore.edu/~adanner/cs35/s10/lab3.php : 4 www.cs.swarthmore.edu/~adanner/cs35/s10/index.php : 4 www.cs.swarthmore.edu/~cfk/ : 4 www.cs.swarthmore.edu/~adanner/cs35/s10/lab8.php : 3 www.cs.swarthmore.edu/~adanner/cs35/s10/lab7.php : 2 www.cs.swarthmore.edu/~adanner/cs35/s08/lab3.php : 1 www.cs.swarthmore.edu/~meeden/ : 1 Search for: quit Goodbye
Notice that only web pages that contain at least one word in common
with the query are shown, and that for some queries there may be no
relevant pages in our limited search domain.
A number of classes are provided for you. Classes, programs or files from the previous lab that you need to copy into the current lab directory are underlined below. Classes, programs or files that you have to complete for this lab appear in bold below.
BST | Class | Abstract data type for Binary Search Trees |
BSTNode | Class | General node for storing BST key and element |
LinkedBST | Class | Linked implementation of BST |
LinkedBSTNode | Class | Linked specialization of BSTNode |
Scanner | Class | Parser for web pages and text files |
URLContent | Class | Stores filename, URL, and word frequency tree for a web page |
ignore.txt | File | A list of words to ignore when parsing web pages |
PQ | Class | Abstract data type for Priority Queues |
PQNode | Class | Node for storing PQ key and element |
HeapPQ | Class | Heap implementation of PQ, you need to write removeMin and bubbleDown |
ProcessQuery | Class | Stores summarized web pages and handles a search query |
urls.txt | File | List of URLs to find and summarize, you may add more URLs to the initial list |
part2.cpp | Program | The main program for part 2 of the project |
URL: www.cs.swarthmore.edu/~userName file: /home/userName/public_html/index.html or index.php URL: www.cs.swarthmore.edu/~userName/dirName/.../fileName file: /home/userName/public_html/dirName/.../fileNameLook at readFile.cpp to see a demonstration of how to read in a file one line at a time. Look at the C++ string reference to see useful methods for manipulating strings.
char line[100]; cin.getline(line, 100); string query = (const char *)line;The above code assumes that the user will never enter a line longer than 100 characters. Once you have the entire query as a string, you can then use C++ string methods to break it up into individual words.
#include <fstream> #include <string> using namespace std; bool file_exists(string filename){ ifstream inp; inp.open(filename.c_str(), ifstream::in); inp.close(); return !inp.fail(); }