Last week, you implemented a mechanism for quickly indexing and searching documents in the hopes of detecting plagiarized content. While functional, this implementation was limiting in many ways. For this lab, you will address two shortcomings:
In the next section we set up your lab directory, including transferring over your tree implementations. Then, we introduce the BinaryHeap implementation. Lastly, we describe your main task (the cheatDetector.cpp program) and several specific tools you'll need to complete it.
As usual, you can get the initial files for this lab by running update35. The new files are listed here, and the files that you will need to edit are highlighted in blue:
$ cd ~/cs35/labs/09
$ cp ../08/AVLTree-private-inl.h library/
$ cp ../08/linkedBST-private-inl.h library/
NOTE 1: If you modified the header files (i.e., AVLTree.h, linkedBST.h) you must copy those over as well.
NOTE 2: Since you are reusing your tree implementations, you are responsible for making any corrections to errors from Lab 7 and Lab 8.
You should start by implementing and testing the BinaryHeap class. Be sure to map out a memory diagram of an array representation of a heap. When implementing functions, ask what your data looks like. If the current position has a child, what index is it at? How do I know whether an index is a leaf or has children? HINT: You don't need to actually look at the data in the nodes to figure this out.
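For instance, with a 0-indexed array, the index arithmetic alone answers these questions. The sketch below is one possibility; the helper names (getParent, getLeft, getRight, isLeaf) are illustrations, not names the lab requires:

// All of these rely only on indices, never on the stored data.
int getParent(int i) { return (i - 1) / 2; }  // parent of position i
int getLeft(int i)   { return 2 * i + 1; }    // left child of position i
int getRight(int i)  { return 2 * i + 2; }    // right child of position i

// A position is a leaf exactly when its left child index falls past the
// last occupied slot (size is the number of pairs currently stored).
bool isLeaf(int i, int size) { return getLeft(i) >= size; }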
Unlike previous labs, you will need to define the entire class from scratch. This includes:
The main data in the heap is represented as an array of priority-value pairs. For priorities of type P and values of type V, each pair therefore has type Pair<P,V>. This array stores a level-order traversal of the binary heap, as described in lecture. Note that these are not pointers to Pair values, so you cannot delete or allocate individual pairs for memory management. The BinaryHeap is internally responsible for the memory management of this array, much like the data array in the ArrayList class from week 04.
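As a rough sketch, the private data might look something like the following; the member names here (capacity, size, items) and the header location are illustrative assumptions, not requirements:

#include "pair.h"  // assumed location of the provided Pair<P,V> class

template <typename P, typename V>
class BinaryHeap {
  private:
    int capacity;      // number of slots allocated in the array
    int size;          // number of pairs currently stored
    Pair<P,V>* items;  // level-order contents live in items[0..size-1]
    // ...public interface and private helper methods go here...
};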
You must test each BinaryHeap function you write. For private internal functions, you should use the BinaryHeap's public interface in a way that ensures each private function is executed and tested. Your testPQ.cpp file will be analyzed when grading, and you will be required to show your testing strategy when receiving help from a ninja or myself. You should follow the testing-strategy requirements from the past few labs, including using asserts, modular testing, and exception handling.
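For example, one modular test might look like the sketch below; the method names and return behavior (insert, getSize, getMin, removeMin) are assumptions about a typical min-heap interface, so adapt them to your actual design:

#include <cassert>
#include <string>
#include "binaryHeap.h"  // assumed header name
using std::string;

void testInsertAndRemoveMin() {
  BinaryHeap<int, string> heap;
  heap.insert(5, "banana");
  heap.insert(2, "apple");
  heap.insert(9, "cherry");
  assert(heap.getSize() == 3);
  assert(heap.getMin() == "apple");   // smallest priority sits at the root
  heap.removeMin();
  assert(heap.getMin() == "banana");  // indirectly exercises bubbleDown
}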
Finally, note that you must implement a mechanism for expanding the capacity of the underlying array. You are required to use an amortized O(1) method. Specifically, you should start with a non-zero initial capacity (e.g., 10) and then multiply the capacity by a constant factor every time you run out of room (e.g., double it).
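A doubling-based expansion might look roughly like this; the method name expandCapacity and the members items, size, and capacity are the same illustrative assumptions used in the sketch above:

template <typename P, typename V>
void BinaryHeap<P,V>::expandCapacity() {
  capacity *= 2;                               // multiply; never add a constant
  Pair<P,V>* bigger = new Pair<P,V>[capacity]; // allocate the larger array
  for (int i = 0; i < size; i++) {
    bigger[i] = items[i];                      // copy the existing pairs over
  }
  delete [] items;                             // free the old array
  items = bigger;
}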
The purpose of the CompareEngine class is to encapsulate all of the functionality of document indexing and comparison. In many ways its interface is similar to the main cheatDetector program from Lab 08. In fact, if you had a good design for Lab 8, translating between the two labs will require a minimal amount of effort.
Your main program has a reduced role. Specifically, it should:
We have provided a portion of the CompareEngine interface. Your job is to complete the CompareEngine, using your Lab 8 cheatDetector design as guidance. At a minimum, your CompareEngine must include:
A common problem in parsing natural language is accounting for common terms. The occurrence of commonly-used phrases in two documents is likely to lead to false-positives (i.e., flagged cases of plagiarism that arose by chance). To account for this, we will use a basic algorithm to remove common phrases from documents (or, "prune our trees" if you will).
Your algorithm (implemented in removeCommonPhrases()) should accomplish the following:
In all, you will conduct roughly N^2/2 document comparisons, where N is the number of documents. Rather than output the maximum match for each document as in Lab 8, you will report the top N scores overall. This means that some documents won't appear in the results at all, while others may appear often. Your output should be in sorted order. So, for example, if you have 1000 documents, you will perform roughly 500,000 comparisons and report the top 1000 overall scores.
Note that you want to sort by the most common phrases. But we implemented a min priority queue, so what do we do?!? Well, what happens if you negate all the values? A min priority queue prioritizes the most negative values. Thus, when you removeMin, you are getting the negative of the maximum value. Voila, you have simulated a max priority queue! You can use this trick to sort by the number of matches between essays.
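Concretely, the trick might look something like the sketch below. The heap method names (insert, isEmpty, removeMin), the Pair accessors (first, second), and the variables matchCount, docA, and docB are all assumptions for illustration, not part of the provided code:

// When a comparison between docA and docB finishes, store the negated count:
//   results.insert(-matchCount, docA + " vs " + docB);

// Later, drain the heap to report scores from largest to smallest:
while (!results.isEmpty()) {
  Pair<int, string> best = results.removeMin();            // assumes removeMin returns the removed pair
  printf("%6d  %s\n", -best.first, best.second.c_str());   // undo the negation on the way out
}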
To create formatted output, you can use a C routine known as printf. Read this tutorial to learn about its usage in detail. Your result does not need to match ours exactly, but each field should line up nicely.
At a high level, it works like print formatting in Python. As an example, if I want to print someone's name and age I could do the following:
string name1 = "Ameet", name2 = "Eric";
int age1 = 31, age2 = 2;
printf("%-8s is %3d years old\n", name1.c_str(), age1);
printf("%-8s is %3d years old\n", name2.c_str(), age2);

This results in:
Ameet    is  31 years old
Eric     is   2 years old

%s is a template for strings, %d for integers. %8s means the string has 8 spaces to fill; if the string is shorter, white space is added to the left. %-8s is similar, with the white space added to the right.
Unix has a built-in command called time that can tell you how long your program takes to run. To get a better feel for how using a LinkedBST and an AVLTree differ in terms of actual run-time, we can use this as a method for experimentally testing our options.
To use time, simply add the word time before the command you wish to run. For example:
time ./cheatDetector inputs/bigList.txt 3 0

You will see your normal output, and at the end three statistics will pop out:
real    17m41.461s
user    17m29.498s
sys     0m10.989s

Those provide the total wall time, followed by the division of that time into how much was used by the actual program versus system calls.
What run times do you get when using a LinkedBST vs. an AVLTree? You should see a significant speed-up. Report this in your README file. You should think about how this difference compares to our worst-case analysis. What factors mitigate the difference? What variables hide the true speed-up between the two types of trees?
Take a look at the results on the big test. How many of those results seem worth investigating? One of the papers is called catchmeifyoucan.txt, a planted case of cheating. It seems to have cheated a little from a lot of papers. We may not have noticed this in last week's results due to our strategy for reporting. What are the shortcomings of our approach? What are some improvements we could make? Think both about how we could improve finding common phrases and how we measure overlaps. Note that your results may swap the ordering of documents, but you should ensure that you only compare two documents once.
Once you are satisfied with your program, hand it in by typing handin35 at the unix prompt.
You may run handin35 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them.