This lab will continue to build on the themes from last week's lab:
You may work with one partner on this lab. Be sure to indicate who you worked with in your submission (i.e., in the README, in the program header, and by using the 'p' option for handin68).
To get started, run update68 to obtain your starting files. You should see the following files (those marked in blue will be modified):
As with the previous lab, you will create a library of classifiers that you can call interchangeably in any application. This file, at a minimum, should implement general algorithms including:
You are to implement the standard k-nearest neighbors algorithm discussed in lecture. To guide your algorithm design, follow these tips/requirements:
You should implement the main program in leukemiaPredictor.py. This program should
A user will input two arguments to run your program in the following order:
$ python leukemiaPredictor.py 3 input/simple.csv

I originally designed this lab to have you utilize a tuning set to pick k, but cut that out for brevity. If you choose to implement that as an extension, you can omit the first command-line argument. Please state that you did this at the top of your program.
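If it helps to see the k-nearest neighbors step laid out, here is a minimal sketch using Euclidean distance and a majority vote. The names euclideanDistance and classifyKNN are only illustrative, and the tie-breaking rule is an assumption rather than a lab requirement:

import math

def euclideanDistance(example1, example2):
    # sum squared differences over the feature columns only
    # (the last column is assumed to be the class label)
    total = 0.0
    for i in range(len(example1) - 1):
        diff = example1[i] - example2[i]
        total += diff * diff
    return math.sqrt(total)

def classifyKNN(testExample, trainingData, k):
    # pair each training example's distance with its class label
    distances = []
    for trainExample in trainingData:
        dist = euclideanDistance(testExample, trainExample)
        distances.append((dist, trainExample[-1]))
    distances.sort()   # closest examples first
    # majority vote over the k nearest labels, which are -1 or 1
    voteSum = sum(label for (dist, label) in distances[:k])
    return 1 if voteSum > 0 else -1   # ties fall to -1 here; adjust as your tips require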
The provided leukemia data set comes from
the paper referenced in class for using expression analysis to differentiate
cancer patients:
T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439):531-537, 1999.
The data is in the lab's input directory in the file titled
leukemia.csv. Each line describes the expression values of one
patient. Each column (except the last) is the expression value
for one gene. Details can be found
here.
If you are curious, the descriptors for the columns are also available
in /home/soni/public/cs68/human_genes.txt. The
last column is the class for the sample: -1 for acute lymphoblastic
leukemia (ALL) and 1 for acute myeloid leukemia (AML).
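Before any classification happens, the program will need to read this file into memory. A minimal sketch, assuming comma-separated numeric values with no header row and the class label in the last column (the name loadData is just illustrative):

def loadData(filename):
    # each row becomes a list of floats; the last entry is the class label (-1 or 1)
    data = []
    with open(filename) as infile:
        for line in infile:
            line = line.strip()
            if line == "":
                continue
            row = [float(value) for value in line.split(",")]
            data.append(row)
    return data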
To provide an accurate estimate of classifier error, we must validate the algorithm on a separate set of data from the one used for training. As discussed in class, a typical methodology that accomplishes this while utilizing all of the data is n-fold cross-validation. You should use 5 folds for this lab. You do not need to worry about using a tuning set, as we will assume k is set by some external piece of information.
To perform proper cross-validation, each example should appear in the test set in exactly one fold, no more, no less. When not in the test set, it should be in the training set for that fold. In addition, the relative size of each training/test split should be about the same for each fold (since 72 is not evenly divisible by 5, 3 of your test sets will have 14 data points and 2 will have 15). Some methodologies also balance the ratio of positive to negative examples in the test set, but we'll skip that for this lab.
In Python, you should utilize the random.shuffle function. The following pseudocode should provide guidance. I assume the entire data set is in a two-dimensional list origData.
generateFolds(numFolds, origData):
    indexList = list of numbers from 0 to N-1
    random.shuffle(indexList)              # list of integers all shuffled up
    shuffledData = empty list
    for each index in indexList:
        append origData[index] to shuffledData   # this rearranges the rows
    foldSize = N/numFolds                  # use float division
    for each fold i:
        leftEnd = i*foldSize               # this needs to be an int (round down)
        rightEnd = (i+1)*foldSize          # this needs to be an int (round down)
        testFolds[i] = shuffledData[leftEnd:rightEnd]
        trainFolds[i] = rest of shuffledData   # items before leftEnd and after rightEnd
    return [trainFolds, testFolds]

Note that you'll have to deal with floats and ints for the fold size and index-range formulas. Also, be careful of Python's negative indexing for lists. You will only need to run this function once, as it generates all of the folds for you. There are many ways to perform cross-validation; this is just one. Be sure to implement this function in classifiers.py.
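As a rough sketch of how the main program might consume these folds (assuming numFolds, origData, and k have already been read in, and classifyKNN stands in for whatever prediction function you write):

trainFolds, testFolds = generateFolds(numFolds, origData)
for i in range(numFolds):
    for testExample in testFolds[i]:
        prediction = classifyKNN(testExample, trainFolds[i], k)
        actual = testExample[-1]
        # compare prediction to actual and update the TP/TN/FP/FN counts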
You should maintain the cumulative counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) across the folds. As a reminder, a true positive is a positive example that is correctly classified as positive, while a false positive is a negative example incorrectly predicted to be positive. To aggregate for five-fold cross-validation, simply sum the number of true positives across folds to obtain the final true positive count (and likewise for the other measures). The output of your program should include the following measures of error, where N is the number of total predictions:
In your directory, you'll find a toy data set titled simple.csv. A sample run with k=2 and 5-fold cross-validation produces the following output:
$ python leukemiaPredictor.py 2 simple.csv
Classification using k-nearest neighbors with k=2 and 5 folds

Results:
Accuracy: 0.80
Sensitivity: 0.67
Specificity: 1.00

To help with debugging, take a look at the intermediate values produced by the algorithm. Note that randomization may produce a different ordering of your folds. However, since 5-fold cross-validation is the same as leave-one-out in this example, you should get the exact same results, just in a shuffled order of folds.
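If your cumulative counts look right but the summary numbers are off, it may help to double-check against the standard definitions of these measures. A minimal sketch, assuming the cumulative counts are in tp, tn, fp, and fn (the name reportResults is just illustrative):

def reportResults(tp, tn, fp, fn):
    n = tp + tn + fp + fn                 # total number of predictions
    accuracy = (tp + tn) / float(n)       # fraction of all predictions that were correct
    sensitivity = tp / float(tp + fn)     # fraction of actual positives predicted positive
    specificity = tn / float(tn + fp)     # fraction of actual negatives predicted negative
    print("Accuracy: %.2f" % accuracy)
    print("Sensitivity: %.2f" % sensitivity)
    print("Specificity: %.2f" % specificity)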
A sample run with the leukemia data set would look similar to:
$ python leukemiaPredictor.py 5 leukemia.csv
Classification using k-nearest neighbors with k=5 and 5 folds

Results:
Accuracy: 0.93
Sensitivity: 0.86
Specificity: 0.98

Your results will differ as the folds are determined randomly, but the error rates should be similar.
When you are finished with your lab, submit your code using handin68.