The goal of this week's lab is to explore the use of Markov models for the task of finding genes in a large genome. Specifically, you will implement several variations of Markov models discussed in class and then use your models to test for genes in an E. coli genome. This lab is more complex from a design perspective than previous assignments. As such, I strongly recommend you work with a partner on this lab. When your lab is complete, you will hand it in using handin68.
Your program will construct the following models:
Your program will need to be well-designed, so read this document thoroughly before beginning to program. Your program, at a high-level, will have the following phases:
Your main program should be called geneFinder (or geneFinder.py). You should also create supplementary files for any classes or functions you define - for example, a Markov class to represent your Markov chain models
Your program will take the following four arguments on the command line:
The training data for your two Markov models are provided in ecoliStrainM54.txt and ecoliGeneIndex.txt. The first file contains the complete E. coli genome sequence (M54 is a popular laboratory strain of E. coli). If using Python, be careful what methods you use when loading the sequence (and reversing it). Since strings are immutable, certain operators can be very inefficient. It should not take more than a few seconds to run your whole program. See the class email for details.
The gene index file contains the start and stop index of known genes. Each line represent one gene. It is not a complete list, and only covers the first 223,048 base pairs in the sequence. The first column of each line is an identifier for the gene, and the second column lists the type of gene. Both of these can be ignored. The last two indices represent the start and stop index of the sequence so that you can extract genes from the original sequence. The last character is the orientation and represents which strand the gene is found. A > indicates that the gene is the sequence in the file. A < means the gene is on the complementary strand, so you will need to take the reverse-complement of the sequence. Some things to keep in mind when processing the genes:
In addition, your program should output at the end the accuracy of your program in terms of precision and recall. (See the Classification section for details).
Once you have loaded the input files, you will need to construct two Markov models - one for non-coding regions (a second-order homogenous model) and for coding regions (either a second or fifth-order inhomogenous model). As discussed in class, you will need train your model to detect these two specific positions by using parameter estimation. For this lab, you will use Laplace estimates for estimating the parameters. Be sure to use double precision to represent probabilities (python does this for you, C++ requires you to use double instead of float for the type). Your models do not need to model the probability of beginning or ending with certain characters. You should estimate your 0th order probabilities using all characters in the appropriate subsequence.
To train your gene model, extract all genes from the sequence using the genes identified in the gene-index input file. Be careful to use the correct strand (main strand or complement) as discussed above.
To train your non-coding mode, use all non-gene sequences on both strands. You should not use any sequence past the last gene. You should also not use any sequence if there is a gene in the region on the opposite strand. As example, assume we have a sequence of length 1000, with the first 100 designated as training. Assume the only gene boundary is (5,10) on the original strand. Bases in position 5 through 10 should be used to estimate the coding model, while the sequences (1,4), (11,100) and the reverse complement of these two should all be used to estimate parameters in the non-coding model.
This process is fairly simple if designed well. An example design is to store all genes in a list and store the right-most index of the previous gene as a cursor. When you load a gene, first train on the non-coding model on the sequence from the cursor to the left index on both the original and complementary strand. Then, train your coding model on the sequence from the left index through right index only on the appropriate strand. Be sure to update the cursor only if it increases in value (some genes overlap, so you don't want to backtrack the cursor.
You will use the test file provided to evaluate your models. For each subsequence defined in the test file, you will:
P(x) = P(x1)P(x2|x1)....P(xn|xn-1) log P(X) = log P(x1) + log P(x2|x1) + .... + P(xn|xn-1)When outputting your classifications, you will output the information from the file in the first three columns and then either a + or - depending on your programs decision of whether the sequence is a gene or not. You should also output the log probabilities for each model (non-coding first).
In addition, you will report the cumulative statistics of precision and recall. To determine those value, you will keep track of sum of accuracy of your individual predictions using four metrics. If you predict + and the correct answer from the file is +, that counts as a true positive. If it was supposed to be negative, that is a false positive. Conversely, if you predict a - and it is -, that is a true negative while an incorrect prediction is a false negative.
Precision is the ratio of correctly identified genes versus total genes predicted to be true. Recall is the number of genes correctly found versus the total number of genes in the test file:
precision = true positives -------------------------------- true positives + false positives recall = true positives -------------------------------- true positives + false negatives
For example, let us say you have a test file of 200 test sequences containing 50 actual genes and 150 non-coding sequences. If your program correctly identifies 25 of those genes, your recall is 50% (25 true positives, 25 false negatives for the genes not found). In addition, if you incorrectly predicted 10 non-coding regions as genes, your precision is 25/(25+10) = 71.4%
In the labs directory, I have added a sample toy set. A sample run would like as follows:
$ python geneFinder.py 2 smallSequence.txt smallGeneIndext.txt smallTestIndex.txt 15 18 - - -5.48063892334 -5.48063892334 23 27 - + -7.33693691371 -6.64378973315 25 31 + + -10.1095256359 -5.83731386728 Precision: 0.5 Recall: 1.0 Overall Accuracy: 0.666666666667More detailed outputs of the counts after training (including Laplace estimates) can be found here.
Update: I have posted the classification results for the first 10 test cases for the 2nd order coding Markov model and 5th order model. You are responsible for coming up with test cases to fully evaluate your methods.
When you are finished with your lab, submit your code using handin68. You should submit, at a minimum: