CS35 Midterm Project:

External merge sort

Due 11:59pm Monday, 19 November 2007
Since this is part of your midterm exam you must work on this assignment by yourself. If you have questions about the project, please contact me. In this project, you will implement a version of external merge sort—a sorting algorithm that scales to very large input sizes. In general, the built in sort routines of Java work fine on a thousands or even a few million elements, but for multi-gigabyte data sets containing a billion elements, the built in sorts either run out of memory and crash or take a very long time to run. While we will not be experimenting with multi-gigabyte data sets, we can still see how external merge sort works by artificially constraining resources. We constrain the problem by not allowing java to sort more than m elements at one time. If we want to sort n > m elements we will need to be more clever.

Overview of external merge sort
External merge sort works in two stages. Given a sequence of n items stored in a file, the first stage partitions the input data into a set of sorted runs. More specifically the algorithm reads in data from the file into a list m items at a time. The algorithm then sorts these m items and writes them out in sorted order to a temporary output file. Such a list of m sorted items is called a run. The algorithm continues to read and sort m items at a time until it has formed n/m sorted runs, each stored in a separate temporary file.

In the second stage sorted runs are merged together until there is only one sorted run remaining. This final run is the sorted output. At most k sorted runs are merged together at a time. To merge k runs, the algorithm reads the first item from each of the k sorted runs and inserts the k items into a priority queue. Then the algorithm removes the smallest of the k items from the priority queue and writes it to a new output run. If the smallest element came from run i, the next item in sorted order from run i (if there are elements remaining in the run) is inserted into the priority queue. The process continues until all elements have been read from the k runs and written to a single temporary file in sorted order.

The merge stage works in one or more passes. In the first pass, runs created from the partition and sort stage are merged k runs at a time. If there are more than k initial runs, then multiple merge passes are needed. Each successive pass merges runs created in the previous pass. Note that since the number of runs reduces by a factor of k after each merge pass, there will eventually be one run remaining. At this point the entire data set is sorted. A example where two merge passes are needed is shown below.

File I/O
In this assignment you will be reading data from input files, and writing data to temporary output files. The file TestIO.java in your homework/07 shows a small example of how to open a file for writing and write values to it using the PrintWriter class. Take a look at the sample code and determine how the PrintWriter works. You should save your files in the /local path of the machine you are currently working on. Change the name adanner to your login name or some other prefix. The method deletefile can remove a file given its name. Be careful when using this method. If you do not delete files from within java, be sure to delete the files you create in /local manually. You can do this using rm /local/adanner*.txt. If you are repeatedly asked to confirm deletion, you can use rm -f /local/adanner*.txt. Practice reading and writing some data of your own format. Before you logout of the machine, please remember to remove all your temp files from /local so that the space is available for others.

Partition and sort
You can write all your code in a single class that has a main method that sorts a sample input file. Begin by writing code that partitions and sorts an input file into runs with at most size m, where m is a parameter. A sample method signature might be partition(String filename, int m). Use temporary files in the /local directory to store your runs. In addition to partitioning this input file, the partition should store a list of filenames containing sorted runs. To sort, you may use the Java collections sort methods. The file /usr/local/doc/elevations.txt has a list of x,y,z elevation data points measured in North Carolina using a laser scanner mounted to a plane. In total, this file contains almost two million points (~66MB). You should sort this data by increasing z coordinate. When doing your initial testing do not try to read in all the data in at once. Instead see if you can handle 10 elements, then a 1000, then 100000, and so on. Test your partition code using different values of m before proceeding to the next part. The value m should never exceed 100,000.

one pass k-way merging
Implement a one-pass k-way merger that performs a single pass over a list of filenames, merging at most k files at a time into new output files. You can use the heap priority queue from class to implement your priority queue. You will need to think about they keys and values you store in the queue. Recall if the minimum element in the priority queue came from run i, you want to read the next element from run i. The merger should store the new output filenames and remove the old runs from the system. Your implementation must accept k as a parameter and it should never exceed 100. Once your one-pass merger is implemented, you should be able to sort the entire data set in one pass using an appropriate k and m. Test incrementally on a subset of the data before attempting to sort the entire data set. Note that m=50,000 and k=40 would be sufficient to sort all the data in one pass.

multi-pass k-way merging
Once your one pass k-way merger is working, implement code that can make multiple passes until a single output file remains. If the one-pass merger is implemented well, the multi-pass merger is a short loop. Adjust m and k such that two or three passes are needed to sort the data. Again, test incrementally using a subset of the data and small values of m and k.
Along with your Java source code, you should hand in a README file, and your Makefile if you used one. These files will be imported automatically via handin35. In your readme, discuss the run time of the partition and sort phase and a single k-way merge. Summarize the experiments you did and document any bugs you may have. Can you estimate the number of passes needed for a given n,m and k?