The objectives for the next two labs are to provide a sampling of common tasks that fall under traditional machine learning. Specifically, you will gain experience with and an understanding of:
You may work with one partner on this lab. Be sure to indicate who you worked with in your submission (i.e., in the README, in the program header, and by using the 'p' option for handin68). As usual, obtain this week's starting files using update68. You will hand in an implementation of k-means clustering, as well as an analysis of your results in the results.pdf file. Labs will be due Sunday, November 2.
To get started, run update68 to obtain your starting files. Below is the lab organization, with files to modify highlighted in blue:
You will implement your clustering algorithm in clustering.py. It is common to build a library of core algorithms that you can easily reference from a main program. For example, your clustering.py file should allow you to implement Gaussian Mixture Models without having to reimplement shared methods (e.g., distance functions, Silhouette scores, AIC). Your clustering.py file should contain functions for:
You should implement the standard k-means clustering algorithm discussed in lecture. Additional requirements/design choices:
Add functions for calculating the sum of squared errors (SSE), AIC, and Silhouette values as well as any other metrics you choose. You will call these methods from your main function (see below).
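As a sketch of what a few of these helpers might look like in clustering.py (the function names are illustrative, and the 2*k*m penalty in the AIC helper is an assumption; use whatever form was given in lecture):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sse(data, assignments, centers):
    """Sum of squared distances from each point to its assigned center."""
    total = 0.0
    for point, c in zip(data, assignments):
        total += euclidean_distance(point, centers[c]) ** 2
    return total

def aic(sse_value, k, m):
    """AIC as SSE plus a complexity penalty. The 2*k*m term (k centers,
    each with m free parameters) is an assumption -- check it against
    the definition from lecture before relying on it."""
    return sse_value + 2 * k * m
```

Keeping the distance function separate lets both cluster assignment and the metric functions share it without duplication.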
You should implement the main program in run_kmeans.py. This file should:
A user will enter up to three arguments on the command line (the gene-name file is optional):
$ python run_kmeans.py 2 class.csv

For the yeast data set, however, we want to pair gene names with the profiles:
$ python run_kmeans.py 2 sample-yeast.csv sample-yeast-names.txt

If there are any errors in usage or with the files, your program should output a message and exit.
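One way to structure the argument checking in run_kmeans.py is sketched below (the helper name and message wording are illustrative, not required by the lab):

```python
import os
import sys

def parse_args(argv):
    """Validate command-line arguments: k, a data file, and an optional
    gene-name file. Prints a usage message and exits on any error."""
    if len(argv) not in (3, 4):
        print("usage: python run_kmeans.py k data.csv [names.txt]")
        sys.exit(1)
    try:
        k = int(argv[1])
    except ValueError:
        print("error: k must be an integer")
        sys.exit(1)
    for path in argv[2:]:
        if not os.path.isfile(path):
            print("error: cannot open %s" % path)
            sys.exit(1)
    names_file = argv[3] if len(argv) == 4 else None
    return k, argv[2], names_file
```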
The data file will contain one instance per line and will be in comma separated format. There is a sample taken from class in class.csv.
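A minimal loader for this format might look like the following (the function name is illustrative):

```python
def load_data(path):
    """Read a comma-separated data file: one instance per line,
    each value parsed as a float."""
    data = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                data.append([float(v) for v in line.split(",")])
    return data
```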
For your main test, refer to the following paper:
Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS 1998 95 (25) 14863-14868
In particular, take a look at the publicly available data set and associated key for column headers. You have been provided two versions
of this data in your lab directories. small-yeast.csv contains
the expression values for 52 genes, one gene per line. You should
use this for intermediate testing. full-yeast.csv contains the expression values for all 2467 genes in the original data set. Each data file has a companion gene identification file that lists the gene name and the annotated functional assignment of that gene to help validate clusters.
These two files (full-yeast-names.txt, sample-yeast-names.txt) contain the names of each gene, which will help you identify the genes in a cluster. The list of gene names is an example of the optional command-line argument: it is not used in the actual algorithm, but is used when printing out cluster members to better understand the results.
A detail omitted in class is that complexity penalization methods, such as AIC, assume the data is normalized. For this lab, you should normalize values by doing feature-length normalization. That is, for each column in your data, the length of the vector should equal 1. Doing this in practice is not difficult. You can follow this pseudocode:
Given: Data matrix D with n rows (one for each gene) and m columns (one for each experiment/feature)
Do:    Length-normalize features

for j = 1..m                        # for each feature
    sum = 0
    for i = 1..n                    # for each gene
        sum += D[i][j]**2           # square each value
    length = sqrt(sum)              # length of the column vector, same as Euclidean distance from the origin
    for i = 1..n
        D[i][j] = D[i][j] / length  # normalize values
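The pseudocode above translates directly into Python; the sketch below follows it step for step (the guard against a zero-length column is an addition, to avoid dividing by zero):

```python
import math

def length_normalize(D):
    """Scale each column of D (a list of equal-length rows, modified
    in place) so that the column vector has length 1."""
    n = len(D)
    m = len(D[0])
    for j in range(m):                  # for each feature
        length = math.sqrt(sum(D[i][j] ** 2 for i in range(n)))
        if length > 0:                  # guard: skip all-zero columns
            for i in range(n):
                D[i][j] /= length       # normalize values
    return D
```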
Your program should output the following:
$ python run_kmeans.py 2 input/class.csv
K-means with k=2
SSE: 0.07
AIC: 8.07
Silhouette: 0.86

The output file called kmeans.out should look like this. For the small yeast example:
$ python run_kmeans.py 2 sample-yeast.csv sample-yeast-names.txt
K-means with k=2
SSE: 2629.30
AIC: 2945.30
Silhouette: 0.28

and the corresponding output file kmeans.out. For 5 clusters on the same data set:
$ python run_kmeans.py 5 sample-yeast.csv sample-yeast-names.txt
K-means with k=5
SSE: 1977.77
AIC: 2667.77
Silhouette: 0.33

and the output file kmeans.out.
NOTE: since k-means converges only to a local optimum and there is randomness in the starting point, your results may vary. You should get similar results if you run your method a few times.
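Because of this randomness, one common practice (not required by the lab) is to run the algorithm several times and keep the run with the lowest SSE. A minimal sketch, assuming your k-means routine returns an (sse, clustering) pair:

```python
def best_of_restarts(run_once, trials=5):
    """Call a clustering routine several times and keep the run with the
    lowest SSE. `run_once` should return an (sse, clustering) pair; the
    name and interface here are illustrative, not part of the lab spec."""
    best = None
    for _ in range(trials):
        candidate = run_once()
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best
```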
For both the small and full yeast data sets, test your program on a reasonable range of possible values for k. Then, in a separate file (experiments.pdf), provide the following:
When you are finished with your lab, submit your code using handin68.