The goal of this week's lab is to obtain experience implementing probabilistic sequence models. Specifically, you will implement a profile HMM to obtain a multiple sequence alignment for a family of proteins. As with the previous lab, there is a good deal of complexity to this lab. I strongly recommend that you work with a partner on this lab and that you spend significant time thinking about the design of your implementation. When your lab is complete, you will hand it in using handin68.
Your program will need to be well-designed, so read this document thoroughly before beginning to program. Your program, at a high-level, will have the following phases:
Your main program should be called pHMM-MSA (or pHMM-MSA.py). You should also create supplementary files for any classes or functions you define - for example, a ProfileHMM class to represent your profile hidden Markov models
Your program will take the following two arguments on the command line:
The training data for your HMM is provided in a FASTA file with the following format:
>sequence 1 name XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX >sequence 2 name YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYwhere the repeating X and Y is replaced with the actual protein sequence.
Your outputs should occur in two sections: 1) Forward and Backward values for each sequence for each iteration of Baum-Welch and 2) the final results in the format of a multiple sequence alignment.
The first outputs allows me to see your intermediate calculations. You should format as follows. First, at the beginning of each iteration of Baum-Welch, output the current iteration number. Second, after completing the calculation of forward and backwards calculations for the E step, output the P(x) according to both algorithms (they should be the same). For example, your output will look similar to:Iteration 1: P(x_1) == P(x_2) = = ... Iteration 2: P(x_1) = = P(x_2) = = ...
Once Baum-Welch is complete, your program should output the resulting MSA of the input sequences. Each line should begin with sequence name, followed by the alignment of that sequence. To keep things nicely formatted, you can use the following string formatting print state for Python:
for i in range(len(sequences)): print names[i] print "\t\t" + viterbi[i]where names is a list of all sequence names and viterbi is a list of all Viterbi results for each sequence. Details on creating the Viterbi trace are below.
Once you have loaded the input files, you should construct your HMM. Assume the number of match state is equal to the number of amino acids in the first sequence. Using the profile HMM construction from class, you should also add the appropriate insert state, delete states, and begin/end state.
Your HMM will be a sparse graph; that is, you should only train transitions specified in the standard profile HMM diagram. For example, you should not have a connection from the begin state to match state 4.
You should initialize your transition parameters as follows:
You will use Baum-Welch to train your model for the specified number of iterations provided in the input. You should use Laplace estimates estimates. Be careful to not allow transitions that are not in the graph
After training the parameters of your model, your program should compute a multiple sequence alignment for the given protein sequence. You should implement Viterbi to do this and use sum of log probabilities instead of multiple probabilities.
When creating the Viterbi trace, you should follow the points from the END state at time step N until you reach the BEGIN state at time step 0. Along the way, you will accumulate a string with the following pattern depending on what state you are in:
This following paragraph is now optional:
In addition, to obtain a proper format, if a sequence uses an insert state
at position i, all sequences that did not also use an insert
at that position should insert a gap.
For the short test example (short.fasta) run using 2 iterations:
$ ./pHMM-MSA.py short.fasta 2
There is a larger example in globin.fasta which is taken from the Pfam list of globin proteins. Pfam utilizes a profile HMM similar to the one you constructed to identify protein families. I have taken 34 proteins identified as being characteristic of the globin family. The results after 10 iterations of Baum-Welch is available here. Note that I have implemented the optional alignment fix for inserts so your final alignment may have fewer gap characters.
When you are finished with your lab, submit your code using handin68. If you know your lab is not correct, also submit a README file describing what parts of your algorithm do not work and what you attempted to try to solve the problem.