The objectives for the next two labs are to provide a sampling of common tasks that fall under traditional machine learning. Specifically, you will gain experience with and an understanding of:
Clone your git repo with the lab 4 starting point files into your labs directory:
$ cd cs68/labs
$ git clone [the ssh url to your repo]

Then cd into your Lab4-id1_id2 subdirectory. You will have the following files (those in blue require your modification):
We will separate the general algorithms (e.g., EM, k-means, AIC evaluation, etc.) from the application-specific tasks (parsing files; interpreting results). You will implement your clustering algorithm in clustering.py. The clustering.py file should contain functions for:
Recall that you will need to estimate the parameters of the model - a mean, covariance, and mixing probability for each cluster. To initialize each of these:
sigma = numpy.diag(numpy.cov(data.T))

Here, data has the rows as instances and the columns as features. First, data.T transposes the data, then numpy.cov() returns a 2D covariance matrix, and finally numpy.diag() keeps only the diagonal of that matrix. The output is a 1D array containing the variance of each feature in the data set. This is the only time sigma needs to be calculated.
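As a concrete starting point, initialization might look like the following sketch (the helper name init_params and the strategy of seeding the means with randomly chosen data points are assumptions, not requirements):

import numpy as np

def init_params(data, k):
    """Initialize GMM parameters: one mean per cluster, a shared diagonal
    covariance, and uniform mixing probabilities. (Hypothetical helper.)"""
    n, m = data.shape
    # Seed each cluster mean with a distinct, randomly chosen data point
    # (one common strategy; an assumption, not a requirement of the lab).
    means = data[np.random.choice(n, size=k, replace=False)]
    # Per-feature variances: the diagonal of the covariance matrix,
    # computed once as described above.
    sigma = np.diag(np.cov(data.T))
    # Start each cluster with an equal mixing probability.
    mix = np.full(k, 1.0 / k)
    return means, sigma, mix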
from scipy.stats import multivariate_normal

prob_x = multivariate_normal.pdf(x, mu, sigma)

where x is your data point, mu is the cluster mean, and sigma is the covariance. This is equivalent to N(x|mu, sigma) from your lecture notes.
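For instance, the E-step responsibilities could be computed along these lines (a minimal sketch, assuming the parameter layout from the initialization sketch above; e_step is a hypothetical name):

import numpy as np
from scipy.stats import multivariate_normal

def e_step(data, means, sigma, mix):
    """Compute the responsibility of each cluster for each point,
    i.e., P(cluster j | x_i). (Hypothetical helper.)"""
    n, k = data.shape[0], len(mix)
    resp = np.zeros((n, k))
    for j in range(k):
        # Unnormalized responsibility: mix[j] * N(x | mu_j, sigma).
        resp[:, j] = mix[j] * multivariate_normal.pdf(data, means[j], np.diag(sigma))
    # Normalize each row so the responsibilities for a point sum to 1.
    resp /= resp.sum(axis=1, keepdims=True)
    return resp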
Add functions for calculating the sum of squared errors (SSE), AIC, and (optionally) Silhouette values as well as any other metrics you choose. You will call these methods from your main function (see below).
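As a sketch, SSE and AIC might look like this (the SSE-based form AIC = SSE + 2*k*m is an assumption that matches the sample outputs below; confirm the exact formula against your lecture notes):

import numpy as np

def sse(data, means, assignments):
    """Sum of squared distances from each point to its assigned cluster mean."""
    diffs = data - means[assignments]
    return np.sum(diffs ** 2)

def aic(sse_value, k, m):
    """Penalized score: SSE plus twice the number of fitted mean parameters
    (k clusters x m features). Assumed formulation; verify against your notes."""
    return sse_value + 2 * k * m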
You should implement the main program in gene_cluster.py. This file should:
A user will enter three arguments on the command-line:
$ python gene_cluster.py 2 kmeans_class 0

Your program sets k=2 and opens input/kmeans_class.csv for the data. There is no index file, so the identifier for row 1 is x1, etc. For the yeast data set, we want to pair gene names with the profiles:
$ python gene_cluster.py 2 sample-yeast 1

This will load the gene profiles from input/sample-yeast.csv and labels (gene names/functions) from input/sample-yeast.names. If there are any errors in usage or with the file names, your program should output a message and exit.
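One way to handle the arguments (a minimal sketch; the function name parse_args and the exact error messages are assumptions):

import os
import sys

def parse_args():
    """Read k, the input file stem, and the names-file flag; print a
    message and exit on bad usage or missing files. (Hypothetical helper.)"""
    if len(sys.argv) != 4 or not sys.argv[1].isdigit():
        print("Usage: python gene_cluster.py <k> <file-stem> <names-flag: 0|1>")
        sys.exit(1)
    k = int(sys.argv[1])
    data_file = os.path.join("input", sys.argv[2] + ".csv")
    names_file = os.path.join("input", sys.argv[2] + ".names") if sys.argv[3] == "1" else None
    for path in (data_file, names_file):
        if path is not None and not os.path.exists(path):
            print("Error: cannot open %s" % path)
            sys.exit(1)
    return k, data_file, names_file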
The data file will contain one instance per line and will be in comma-separated format. The data from class is in kmeans_class.csv.
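With numpy, loading such a file is a one-liner (a sketch; assumes the file has no header row):

import numpy as np

# Each row is one instance; each column is one feature.
data = np.loadtxt("input/kmeans_class.csv", delimiter=",")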
For your main test, refer to the following paper:
Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS, 95(25):14863-14868, 1998.
In particular, take a look at the publicly available data set and the associated key for column headers. You have been provided two versions of this data in your lab directories. sample-yeast.csv contains the expression values for 52 genes, one gene per line; use it for intermediate testing. full-yeast.csv contains the expression values for all 2467 genes in the original data set. Each data file has a companion gene identification file containing the gene name and the annotated functional assignment of that gene, to help validate clusters.
These two files (full-yeast.names, sample-yeast.names) contain the name of each gene, which will help you identify the genes in a cluster. The names file is what the optional third command-line argument loads: line 1 is the name/function of the gene represented in line 1 of the gene profile data (.csv), and so on. It is not used in the clustering algorithm itself, but it makes the printed results easier to interpret.
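For example, when printing results you might pair each row with its name like this (a sketch; assignments is a hypothetical array of cluster labels, one per gene):

# Line i of the .names file describes row i of the .csv profile data.
with open("input/sample-yeast.names") as f:
    names = [line.strip() for line in f]

for i, label in enumerate(assignments):  # assignments: hypothetical cluster labels
    print("cluster %d: %s" % (label, names[i]))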
A detail omitted in class is that complexity-penalization methods, such as AIC, assume the data is normalized. For this lab, you must normalize values using feature-length normalization. That is, for each column in your data, the length (Euclidean norm) of the column vector should equal 1. You can follow this pseudocode:
Given: Data matrix, D, with n rows (one for each gene) and
       m columns (one for each experiment/feature)
Do:    Length-normalize features

for j = 1..m                 # for each feature
    sum = 0
    for i = 1..n             # for each gene
        sum += D[i][j]**2    # square each value
    length = sqrt(sum)       # length of the feature vector, same as Euclidean distance
    for i = 1..n
        D[i][j] = D[i][j] / length   # normalize values
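A vectorized numpy equivalent of this pseudocode (a sketch; assumes rows are genes and columns are features, as above):

import numpy as np

def length_normalize(D):
    """Scale each column of D so its Euclidean length equals 1."""
    lengths = np.sqrt((D ** 2).sum(axis=0))  # per-column vector lengths
    return D / lengths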
Your program should output the following:
$ python gene_cluster.py 2 kmeans_class 0
Fitting Gaussian Mixture Model using EM with k=2
SSE: 0.09
AIC: 8.09
Silhouette: 0.86

The output file called kmeans_class.out should look like this. (Note: due to the small data set size, you may hit a corner case where both clusters converge to the same mean. You can just rerun and get a better result.) To help with debugging, use these intermediate values for the E-step and M-step after each iteration. For the small yeast example:
$ python gene_cluster.py 2 sample-yeast 1
Fitting Gaussian Mixture Model using EM with k=2
SSE: 2629.30
AIC: 2945.30
Silhouette: 0.28

and the corresponding output file sample-yeast.out. For 5 clusters on the same data set:
$ python gene_cluster.py 5 sample-yeast 1
Fitting Gaussian Mixture Model using EM with k=5
SSE: 1852.38
AIC: 2642.38
Silhouette: 0.36

and the output file sample-yeast.out.
NOTE: since EM converges only to a local optimum and the starting point is random, your results may vary. If you run your method a few times and pick the best run, you should get results similar to those above.
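One simple way to automate the restarts (a sketch; run_em is a hypothetical driver that fits the model once and returns it along with its SSE):

best_model, best_sse = None, float("inf")
for trial in range(5):  # the number of restarts is an arbitrary choice
    model, sse_value = run_em(data, k)  # hypothetical EM driver
    if sse_value < best_sse:  # keep the run with the lowest SSE
        best_model, best_sse = model, sse_value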
For the small yeast data set, test your program on a reasonable range of possible values for k. Then, in a separate file (experiments.pdf), provide the following: