CS35 Lab 8A: Plagiarism Detector

Due by 11:59 p.m., Sunday, November 10, 2013

The writeup is broken up into two parts that will need to be completed. Here, I will describe your task to complete the main program portion of the assignment. Part B will describe the development of an AVL Tree data structure for creating a more-efficient index of documents. That will be released on Monday once you have completed the LinkedBST implementation.


Introduction

This week's lab will further the development of a plagiarism detection algorithm. Given a set of documents your algorithm will identify significant matches between pairs of documents that indicate shared use of the same material. Specifically, we will test your program on a set of essays submitted by high school students. You can peruse the data set here in

/home/soni/public/cs35/cheatDetector/data/
Beware, the writing quality may induce blood-filled tears.

Your main program will be implemented in cheatDetector.cpp. For each document you will report the essay with largest number of hits (i.e., matching phrases). The design of your main program will be largely left up to you, but we have broken down the requirements into the following parts:

  1. Starting files: your starting files are a mixture of reusing lab 7 material and obtaining new files.
  2. Input: Your program will begin by obtaining inputs. However, you will not perform any user input (i.e., cin). Instead, you will use command-line arguments, discussed below.
  3. Load one document: your program will create a searchable version of one essay by representing all of its phrases in a binary search tree.
  4. Storing all documents: once you have determined how to load one document, you will need to determine how to keep track of all documents in a data structure.
  5. Detect top matches: for each essay, you will compare against all other essays by utilizing your binary search tree representation. Each essay will report the highest match score.
  6. Output statistics: in part B you will implement an AVL Tree. To compare the performance of a balanced vs unbalanced binary search tree, you will report the heights of your trees.

Ensure your design is easily modified to use either an AVLTree or LinkedBST to represent one document. While you have not implemented the AVLTree, a good design will make it very easy to incorporate the second representation in part B. The user will specify which they want to use at run-time. If you understand polymorphism, this is pretty simple to accomplish.


Getting Started

TODO:

First, run update35 to create your labs/08 directory and obtain the central file cheatDetector.cpp. You'll also get test files in the input directory and Makefile that accounts for the AVLTree. You'll also want to copy over the files from Lab 07 for use this week (except the Makefile). To do so, follow this script:

$ cd ~/cs35/labs/08/
$ cp -r ../07/* .
cp: overwrite `./Makefile'? no
This will let you use your LinkedBST implementation this week. Note that you should be sure to fix any errors in that implementation otherwise your main program may not work properly. Once you start adding in elements for the AVLTree you'll have to modify the Makefile; you'll find instructions in Part B.

Input via the Command Line

TODO:

To avoid user-interaction during run-time, a common too for obtaining input is by using command-line arguments. These are values given when executing from the command-line. To illustrate by example, our usual execution without command-line arguments:

$ ./cheatDetector
An example run for this lab would instead be:
$ ./cheatDetector inputs/smallList.txt 3 0

To process command-line arguments in C++ you need to add two arguments to your main function:

  int main(int argc, char* argv[]) 
argc is an int representing the number of arguments entered (including the program name itself), and argv is an array of character arrays, with each character array containing one argument to the program. You can then use argc and argv inside your program to access the command-line arguments. For example, in the run above, argc has the value 4 meaning that argv is an array with four c-string values. argv[0] is always the program being run ("cheatDetector"), argv[1] is "inputs/smallList.txt", argv[2] is "3" and argv[3] is "0". argc is useful to check if the user entered the correct number of arguments. For example, if the user forgets the arguments you can print an error:
$ ./cheatDetector 
Incorrect number of arguments
Usage: cheatDetector file-list phrase-size useAVL

We have provided the central set up already; the provided code converts argv[2] to the phraseSize (using atoi which converts c-strings to ints) and argv[3] to the boolean useAVL. You are responsible for obtaining the name of the file listing essays. Remember, argv is a c-string array, so if you want to store the file name as a string, use the string constructor. For example, if we wanted to store the third parameter as a string, we'd use string useAVL(argv[3]);


Storing One Document as a BST

TODO:

One naive solution to this lab would be to compare all phrases in one document against all phrases in all other documents. That can be extremely costly (O(w^2) for w words in the corpus!). Instead, we will index web pages into a binary search tree to quickly query phrases. The next section will discuss how to load all documents. For now, assume you are attempting to load one document, e.g., data/tyc6.txt. To do so, first check that the file exists, exiting if it does not.

Given the file exists, you will index the document by loading all phrases into a binary search tree. That is, your key will be an n-word phrase string (where n is specified on the command-line) and the value is the number of times that phrase occurs in the document. You should overlap all overlapping phrases. For example, for two word phrases, the sentence: "Computer Science 35 rules" leads to the 3 following keys being inserted into the BST: "Computer Science","Science 35","35 rules" .

In addition to allowing phrases of different size, you should determine whether to use a LinkedBST or a AVLTree depending on the boolean value useAVL given as a command-line argument.


Storing All Documents

The provided command-line argument is a list of the locations of all essays. Each line specifies the location of one document. You can read the entire line in as one long string.

For each document, (1) check that the file exists, (2) load the document into its own tree (see above) and (3) store the BST into some meta-data structure. That last part is up to you to figure out. One way to accomplish this is...with another tree! For instance, you could use each file name as a key, and a pointer to the essay's phrase index tree as the value. This might have the type AVLTree< string, AVLTree< string,int >* > There are other reasonable options if this data type is scarier than your Halloween costume.


Detecting Matches

At this point, your well-designed program has produced a set of search-trees, each indexing all of the phrases within one essay. Now, we want to detect pairs of essays with significant overlap in content. To do this, we make the following assumption: essays guilty of plagiarism will have significant overlap in the phrases used. Your program will use this assumption to detect potential cheaters. You will use the following pseudocode:
For each essay i:
  For each other essay j:
    Count the number of matches between i and j
    If this is the most matches to i, save the result
  Output i and the j with maximum overlap
You'll also want to report the size of the overlap for the optimal hit.

For counting matches, you'll want to take every phrase in essay i and determine if it is in essay j's tree. For every match, increase the total number of matches for that pair.


Outputting Statistics

After your program has completed finding and outputting the largest overlap for each essay, you should output three statistics to give us an understanding of the efficiency difference between AVLTrees and LinkedBSTs

Sample Runs and README questions

Sample runs:

README

Answer the following questions the README file in your directory:
  1. Your destructor is implemented using a helper function traverseAndDelete. What type of traversal algorithm is this method using? Why did we choose this type of traversal as opposed to any of the others?

  2. For the following applications,which of the 4 traversal algorithms in class would best provide a solution (i.e., either Level, Pre-Order, Post-Order, or In-Order). You do not need to explain.


  3. Given the hierarchical outline for a book (in tree format), print the Table of Contents. For example, the root of the tree would be the title, all Level 1 nodes are the chapters, all Level 2 nodes are the sections of the respective chapters. A Table of Contents prints a chapter followed by all of its sections before going to the next chapter and doing the same.

  4. For each directory on a disk, calculate the amount of disk usage. For, example, running the du command on your labs directory produces:
    $ du -h cs35/labs/
    472K	cs35/labs/06
    6.0M	cs35/labs/01/input
    6.0M	cs35/labs/01
    2.2M	cs35/labs/04
    44K	cs35/labs/03
    48K	cs35/labs/02
    16K	cs35/labs/05/inputs
    120K	cs35/labs/05
    28K	cs35/labs/07/library
    20K	cs35/labs/07/testFiles
    304K	cs35/labs/07
    9.2M	cs35/labs/
    
  5. Any mathematical formula can be set up as a parse tree, where the internal nodes are operators (e.g., *,+,-,/) and the leaves are the operands. For example, the formula (3*5+1)*2 is the tree:
            *
          /   \
         +      2
        / \
       *   1
      / \
     3   5
    
    What traversal would produce the correct order of operations?


Submitting your lab
Be sure to complete Part B of the lab as well. Submit using handin35. Remember to indicate your parter using the 'p' option.