The objectives for this week's lab are to provide a sampling of common tasks that fall under traditional machine learning. Specifically, you will gain experience with and an understanding of:
For this portion of the lab, you will implement k-means clustering and test it on an expression data set from the yeast genome. So that it can be reused in the future, create a library file clusterAlgos.py containing a general k-means implementation as well as functions to evaluate clusters based on various criteria. You should also design a distance function. Your main program should be in a file named yeastCluster.py.
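One reasonable choice for such a distance function is Euclidean distance over the expression vectors; a minimal sketch (the function name here is a suggestion, not part of the assignment):

```python
import math

def euclideanDist(geneA, geneB):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(geneA, geneB)))
```

Keeping this in clusterAlgos.py, separate from the main program, makes it easy to swap in a different metric (e.g. correlation-based) later without touching the clustering code.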
For your main test, refer to the following paper:
Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS 1998 95 (25) 14863-14868
In particular, take a look at the publicly available data set and associated key for column headers. I have provided a pre-processed
sample of the data in your update directories. small-yeast.csv contains
the expression values for 52 genes, one gene per line, while full-yeast.csv contains the expression values for all 2467 genes in the original data set. Each file has a companion gene identification file that contains the gene name and the annotated functional assignment of that gene, to help validate clusters.
These two files (full-yeast-names.txt, sample-yeast-names.txt) contain the names of each gene, which will help you identify the genes in a cluster. The list of gene names is passed as the optional command-line argument. It is not used in the actual algorithm, but is used when printing out cluster members to make the results easier to interpret.
You should implement the standard k-means clustering algorithm discussed in lecture. Additional requirements/design choices:
Your program should output the following:
For both the small and full yeast data sets, test your program on a reasonable range of possible values for k. Then, in a separate file (results.{doc, pdf}), provide the following:
$ python yeastCluster.py 2 class.csv
Clustering using k-means with k=2
SSE: 7.33
AIC: 15.33
Silhouette: 0.85

The output file, called kmeans.out, should look like this. For the yeast example:
$ python yeastCluster.py 2 sample-yeast.csv sample-yeast-names.txt
Clustering using k-means with k=2
SSE: 37115.38
AIC: 37431.38
Silhouette: 0.36

and the corresponding output file kmeans.out.
NOTE: since k-means is locally optimal and there is randomness in the starting point, your results may vary. You should get similar results if you run your method a few times.
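The standard algorithm alternates between assigning each point to its nearest centroid and recomputing centroids as cluster means; a minimal sketch of that loop plus the SSE computation follows. The function names, the random-sample initialization, and the empty-cluster handling are illustrative choices, not requirements, and the randomness in the starting centroids is exactly why results vary from run to run:

```python
import random

def kmeans(data, k, maxIters=100):
    """Plain k-means on a list of numeric vectors.

    Returns (centroids, assignments). Centroids start as k random
    data points, so results can differ between runs.
    """
    centroids = [list(p) for p in random.sample(data, k)]
    assignments = None
    for _ in range(maxIters):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        newAssign = [min(range(k),
                         key=lambda c: sum((x - y) ** 2
                                           for x, y in zip(data[i], centroids[c])))
                     for i in range(len(data))]
        if newAssign == assignments:   # converged: assignments stopped changing
            break
        assignments = newAssign
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [data[i] for i in range(len(data)) if assignments[i] == c]
            if members:   # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

def sse(data, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sum((x - y) ** 2
                   for x, y in zip(data[i], centroids[assignments[i]]))
               for i in range(len(data)))
```

Because the result depends on the starting centroids, running the sketch several times and keeping the clustering with the lowest SSE is a common way to smooth out the local-optimum behavior noted above.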
For this portion of the lab, you will implement the supervised learning algorithm k-nearest neighbors, and employ appropriate experimental methodology to estimate the accuracy of your algorithm. As with the above problem, you should use good design skills by creating a library for classification algorithms in the file classifiers.py and put your main program in the file leukemiaClassifier.py.
A user will input two arguments to run your program in the following order:
Your data set comes from
the paper referenced in class for using expression analysis to differentiate
cancer patients:
T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439):531-537, 1999.
The data has been made available in your labs directory in the file titled
leukemia.csv. Each line describes the expression values of one
patient. Each column (except the last) is the expression value
for one gene. Details can be found here. The
last column is the class for the sample: -1 for acute lymphoblastic
leukemia (ALL) and 1 for acute myeloid leukemia (AML).
You are to implement the standard k-nearest neighbors algorithm discussed in lecture. To guide your algorithm design, follow these tips/requirements:
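As a starting point, the core prediction step can be sketched as follows; the helper name is hypothetical, squared Euclidean distance is an assumed metric, and ties in the vote fall to whichever label Counter happens to list first:

```python
from collections import Counter

def knnClassify(trainData, trainLabels, query, k):
    """Predict a label for query by majority vote of its k nearest
    training examples (squared Euclidean distance)."""
    # Sort training indices by distance to the query point.
    nearest = sorted(range(len(trainData)),
                     key=lambda i: sum((a - b) ** 2
                                       for a, b in zip(trainData[i], query)))
    # Majority vote among the k closest labels.
    votes = Counter(trainLabels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]
```

With the leukemia data, trainLabels would hold the -1 (ALL) and 1 (AML) values from the last column; using an odd k avoids most voting ties.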
To provide an accurate estimate of classifier error, we must validate the algorithm on a separate set of data from that used for training. As discussed in class, a typical methodology that accomplishes this while utilizing all of the data is n-fold cross-validation. You should use 5 folds for this lab. You do not need to worry about using a tuning set, as we will assume k is set by some external piece of information.
To perform proper cross-validation, each example should appear in the test set in exactly one fold, no more no less. When not in the test set, it should be in the training set for the fold. In addition, the relative size of each training/test set split should be about the same for each fold (since 72 is not evenly divisible by 5, 3 of your test sets will have 14 data points and 2 will have 15). Some methodologies also balance the ratio of positive to negative examples in the test set, but we'll skip that for this lab.
In Python, you should utilize the random.shuffle function. The following pseudocode should provide guidance. I assume the entire data set is in a two-dimensional list titled origData.
generateCVData(numFolds, origData):
    indexList = list of numbers from 0 to N-1
    random.shuffle(indexList)   # list of integers all shuffled up
    shuffledData = empty list
    for each index in indexList:
        append origData[index] to shuffledData
    foldSize = N/numFolds
    for each fold k:
        leftEnd = k*foldSize
        rightEnd = (k+1)*foldSize
        testFold[k] = shuffledData[leftEnd:rightEnd]
        trainFold[k] = rest of shuffledData   # items before leftEnd and after rightEnd
    return [trainFolds, testFolds]

Note that you'll have to deal with floats and ints in the fold-size and index-range formulas. Also, be careful of Python's negative indexing for lists. You will only need to run this function once, as it generates all of the folds for you. There are many ways to perform cross-validation; this is just one.
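Translated into runnable Python, the pseudocode above might look like the following sketch. The rounded fold boundaries are one assumed way to handle the float/int issue: they spread the remainder so test-set sizes differ by at most one (e.g. 72 examples and 5 folds gives test sets of 14 or 15), and slicing with explicit boundaries sidesteps the negative-indexing pitfall:

```python
import random

def generateCVData(numFolds, origData):
    """Shuffle the data once, then split it into numFolds roughly
    equal test folds; each fold's training set is everything else.

    Returns (trainFolds, testFolds), each a list of numFolds lists.
    """
    n = len(origData)
    indexList = list(range(n))
    random.shuffle(indexList)                        # integers, all shuffled up
    shuffledData = [origData[i] for i in indexList]
    # Fold boundaries: rounding spreads the remainder, so with
    # n=72 and numFolds=5 the test sets have 14 or 15 examples.
    bounds = [round(k * n / numFolds) for k in range(numFolds + 1)]
    trainFolds, testFolds = [], []
    for k in range(numFolds):
        left, right = bounds[k], bounds[k + 1]
        testFolds.append(shuffledData[left:right])
        trainFolds.append(shuffledData[:left] + shuffledData[right:])
    return trainFolds, testFolds
```

As the lab notes, one call produces all of the folds; each example lands in exactly one test fold and in the training set of every other fold.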
You should maintain the cumulative counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) across the folds. That is, for five-fold cross-validation, sum the number of true positives for each fold to obtain the final true positive count. The output of your program should include the following measures of error:
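From the cumulative counts, the reported measures follow the standard confusion-matrix formulas; a small sketch (the helper name is hypothetical):

```python
def reportMetrics(tp, tn, fp, fn):
    """Standard error measures from cumulative confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction correct overall
    sensitivity = tp / (tp + fn)                 # true positive rate (recall)
    specificity = tn / (tn + fp)                 # true negative rate
    return accuracy, sensitivity, specificity
```

Summing the counts across folds before dividing (rather than averaging per-fold rates) keeps the estimates well defined even when a fold happens to contain few or no positives.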
In your directory, you'll find a toy data set titled simple.csv. A sample run with k=2 and 5-fold cross-validation produces the following output:
$ python leukemiaClassifier.py 2 simple.csv
Classification using k-nearest neighbors with k=2 and 5 folds
Results:
Accuracy: 0.80
Sensitivity: 0.67
Specificity: 1.00

To help with debugging, take a look at the intermediate values produced by the algorithm. Note that randomization may produce a different ordering of your folds. However, since 5-fold cross-validation is the same as leave-one-out in this example, you should get exactly the same results, just in a shuffled order of folds.
A sample run with the leukemia data set would look similar to:
$ python leukemiaClassifier.py 5 leukemia.csv
Classification using k-nearest neighbors with k=5 and 5 folds
Results:
Accuracy: 0.93
Sensitivity: 0.86
Specificity: 0.98

Your results will differ, as the folds are determined randomly.
When you are finished with your lab, submit your code using handin68.