Your work for this lab consists of two parts:
So far in this course we have been using very general data structures to organize the data for our programs, but we have been just using functions to organize our program's main computation. In addition to the BinaryHeap implementation and slight extensions to Lab 08, one of the goals of this lab is to improve your program design. Specifically, you must reorganize most of the code from your search program into a separate QueryProcessor class, an object that indexes the web pages and then supports web queries. When you are done, your main program should be a relatively simple program that mostly just processes command-line arguments, constructs a QueryProcessor, and then calls a QueryProcessor function for each user query.
As usual, you can get the initial files for this lab by running update35. The new files are listed here, and the files that you will need to edit are highlighted in blue:
In the next section we introduce our BinaryHeap implementation. We then describe your main task (the sortedSearch.cc program) and several specific tools you'll need to complete it.
You should start by implementing and testing the BinaryHeap class. Be sure to map out a memory diagram of an array representation of a heap. When implementing functions, ask what your data looks like. If there is a child for a current position, what index is it at? How do I know if a index is a leaf, or specifcally a child? HINT: You don't need to actually look at the data in the nodes to figure this out.
We have defined the necessary data and many functions in binaryheap.h; you may edit this file if you want but it is probably not necessary. You should not add any additional public methods, unless for temporary debugging purposes. You should implement each function defined in the binaryheap.inl implementation file.
The main data in the heap is represented as an array of priority-value pairs. For priorities of type P and values of type V, each pair therefore has type Pair<P,V>. This array stores a level-order traversal of the binary heap, as described in lecture. Note that these are not pointers to Pair values, so you can't delete or allocate them for memory mangagement. The BinaryHeap is internally responsible for the memory management of this array, much like the data array in the ArrayList class from week 04.
You must test each BinaryHeap function you write. For private internal functions, you should use the BinaryHeap's public interface in a way to ensure that each private function is executed and tested. Your testHeap.cpp file will be analyzed when grading, and you will be required to show your testing strategy when receiving help from a ninja or myself.
The purpose of the QueryProcessor class is to encapsulate all the functionality of web indexing and search. In many ways its interface is similar to the main search program from Lab 08:
Once your QueryProcessor is complete, your main sortedSearch program can merely process the command-line arguments and user input, using each user-input query as an argument to the QueryProcessor::executeQuery function and printing the highest-ranked URLs.
For example:
$ ./sortedSearch urls.txt ignore.txt Enter your search query: tree Searching for "tree": www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 33 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 21 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9 Enter your search query: node Searching for "node": www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 12 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 10 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6 www.cs.swarthmore.edu/~newhall/index.php: 1 Enter your search query: tree node Searching for "tree node": www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 45 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 31 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6 www.cs.swarthmore.edu/~newhall/index.php: 1 Enter your search query: sleep time Searching for "sleep time": www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab06.php: 12 www.cs.swarthmore.edu/~soni/cs21/f11/index.php: 6 www.cs.swarthmore.edu/~soni/cs21/s12/index.php: 5 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab06.php: 4 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab02.php: 4 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 4 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab03.php: 4 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab05.php: 3 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 3 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab06.php: 2 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab07.php: 2 www.cs.swarthmore.edu/~soni/cs35/s12/index.php: 1 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab01.php: 1 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 1 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab04.php: 1 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab05.php: 1 www.cs.swarthmore.edu/~richardw/index.html: 1 Enter your search query: <CTRL-D> $Note that in this lab, you can now query multiple-word phrases. You will want to obtain the entire user phrase at once (e.g., using getline()) and then query each word individually (HINT: see the instructions below on how to split a string). The final result is the union of all occurences of each word in the query. So, for example, tree occurred 21 times on the lab08 page and node occurred 10 times on that page. The resulting priority for tree node, as a result, is simply the sum of those two individual frequences (i.e., 31).
The requirements for the QueryProcessor are the same as for last week's
lab.. That is, you should handle files properly including in the case
that an incorrect file name is given. And you must properly deallocate any
heap memory belonging to the object.
Parsing user input in C++ can be less pleasant than in other languages. In the QueryProcessor we've provided a private function, split, that behaves much like the split function in Python. Specifically, given a string as input, it returns a pointer to a heap-allocated list of the whitespace-separated tokens from that string. For example,
split("This is a test");would return a pointer to list containing "This", "is", "a", and "test". The caller is responsible for freeing the memory referred to by that pointer.
A helpful tutorial is available via Prof. Newhall's tutorial. You can test valgrind out by running it in last week's lab. To do so, type the program name valgrind at the Unix prompt followed by the exact execution call for your program. For example if your program was run last week by typing:
$ ./search inputFiles/urls.txt inputFiles/ignore.txtthen to run valgrind:
$ valgrind ./search inputFiles/urls.txt inputFiles/ignore.txt ==27852== Memcheck, a memory error detector ==27852== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al. ==27852== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info ==27852== Command: ./search inputFiles/urls.txt inputFiles/ignore.txt ==27852== Enter your search query: time Searching for "time": www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab01.php: 1 www.cs.swarthmore.edu/~richardw/index.html: 1 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab02.php: 4 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab06.php: 11 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab04.php: 1 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab05.php: 2 www.cs.swarthmore.edu/~soni/cs21/f11/index.php: 6 www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab07.php: 2 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab06.php: 4 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab05.php: 1 www.cs.swarthmore.edu/~soni/cs21/s12/index.php: 5 www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 3 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 4 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab03.php: 4 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 1 www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab06.php: 2 www.cs.swarthmore.edu/~soni/cs35/s12/index.php: 1 Enter your search query: <CTRL-D> ==27852== ==27852== HEAP SUMMARY: ==27852== in use at exit: 859,137 bytes in 26,330 blocks ==27852== total heap usage: 157,727 allocs, 131,397 frees, 4,854,882 bytes allocated ==27852== ==27852== LEAK SUMMARY: ==27852== definitely lost: 768 bytes in 32 blocks ==27852== indirectly lost: 858,369 bytes in 26,298 blocks ==27852== possibly lost: 0 bytes in 0 blocks ==27852== still reachable: 0 bytes in 0 blocks ==27852== suppressed: 0 bytes in 0 blocks ==27852== Rerun with --leak-check=full to see details of leaked memory ==27852== ==27852== For counts of detected and suppressed errors, rerun with: -v ==27852== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)Uh-oh! I didn't take care of all my memory. To get more details, add the --leak-check=full flag to get a stack trace of where all lost heap memory originated:
$ valgrind --leak-check=full ./search inputFiles/urls.txt inputFiles/ignore.txt
Once you are satisfied with your program, hand it in by typing handin35 at the unix prompt.
You may run handin35 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them.