This week's lab will give you practice with common bioinformatic tools
for multiple sequence alignment. Specifically, you will complete
a tutorial on using ClustalX. Second, you will complete a homework
set of problems that will ask you to think more closely about the
algorithms we have discussed in class. You may work closely in discussing
these problems with one other person, but your answers must be submitted
individually. Please write the name of your co-worker at the top
of the assignment. You should hand in a paper copy of your solution
in class, Thursday, February 21. Supplementary materials (e.g., code)
should be submitted using handin68.
ClustalX
To begin, read this tutorial on using ClustalX. I have placed two sets of sequences you can use to
test out Clustal in your labs directory: DnaB.txt which is a small
set of medium-length protein sequences, and sequences12.txt - a large
set of myosin heavy chain proteins. You will run ClustalX by entering
the following command:
$ /usr/swat/bin/clustalx-2.1/clustalx
You will probably want to increase the font size open starting the program up
as the sequences are hard to read.
You can begin by using the quick start guide on your DnaB.txt file.
Then, read the rest of the document testing out different parameters to see
how they alter your results.
Once you complete the document and analyze both sequences, answer the
following questions:
- Briefly describe the trade-offs between the Fast and Slow approaches.
Which option is suitable for your DnaB.txt example? How about
the sequences12.txt file? Thinking of the intent of a user
interested in asking biological questions,
why might you want to combine both uses? That is - attempt a fast run
followed by a slow run.
- After producing an alignment, how do you know if you have a good
alignment, or the best alignment? Is that possible to know? Describe
the author's suggestion for a strategy to answer these questions
- What extra information about the sequences
can ClustalX utilize to improve alignment? Briefly
describe why this extra domain knowledge can help aid in determining
an alignment
- Read the EBI's description of output results. As a biologist, how does the color
coding aid you in analysis? Specifically, think of a column which has
varied content (i.e., medium to high entropy) yet consistently stays the same
color. How might this alter your assessment of the "conservation" of a column.
- After running ClustalX one one of the input sets, head on over
to the WebLogo page for
producing representations of the entropy of a sequence. Load the alignment
from one of your examples and choose a range of approximately 25 amino acids.
Your range should include example of highly conserved regions and low-conserved regions. Save the output file and print it or electronically submit it
with your solution. If you electronically submit, name the file logo.png.
- As noted in class, ClustalW is the most widely used technique, but is not
necessarily the most advanced method on the market. Another popular program
is T-coffee. Read the original paper
outlining the algorithm, available here. Discuss the paper
with at least one other person in the class - I don't expect you to understand
all of the details but I want you to get an understanding of the problems
still around with multiple sequence alignment. With your homework,
submit a 2 paragraph (about 1 page double spaced) summary. You should discuss
the generaly pitfalls identified in current approaches, the high-level
tactic takes to address these issues, the advantages and disadvantages of
T-coffee. That is, under what scenarios would you consider using T-coffee?
ClustalW?
Problem Set
- You are given the following sequences:
- CTATTAATAC
- TATTAATAC
- CTATTAT
- CTATTAATC
- ATTAATAC
Use the Star algorithm to identify a multiple sequence alignment. Pick the
string that has the maximal average alignment score to select the center.
To perform your alignments, use a global alignment algorithm with linear
gap penalty. Matches should have a score of +1, substitutions -1, and gaps -3. You can do this by hand or write a short
program. Show your work if done by hand, or submit your code as
star.py. At a minimum, you should show the average/sum of
alignment scores for each sequence and the full multiple sequence alignment.
- Do the same for the Feng-Doolittle algorithm. You should be able to use
your intermediate results from the Star Alignment. Be sure to submit your
initial distance matrix (use the equation in the book to convert your
alignment scores to similarity scores) and also your guide tree. Use the
following values for a random alignment:
Srand(0,1) = -6
Srand(0,2) = -7
Srand(0,3) = -4
Srand(0,4) = -7
Srand(1,2) = -8
Srand(1,3) = -6
Srand(1,4) = -3
Srand(2,3) = -8
Srand(2,4) = -6
Srand(3,4) = -5
For S_max, I incorrectly stated that you should use the maximum
score in the matrix. If you did this, no worries just note this on your
results. As the book points out, though, the maximum score is the
maximum possible score for the alignment. To calculate S_max for
a pair of sequences x and y, perform Needleman-Wunsch
for an alignment of x to x, and then for y to y, the finally take the
average of these two scores as S_max. The max scores for
the 5 sequences are respectively 10, 9, 7, 9, 8.
When updating the distance matrix, it is acceptable to either take the average
distance of all pairs or to pick the minimum. So, for example, if you merge
sequences x and y into one cluster, the distance between {x,y} and z is the either the average or minimum of the individual pairwise
distances d_xz and d_yz. I'll assume you are using
the minimum, simply state otherwise if you use the average. Also,
logs normally are calculated using the natural log. If you used a different
log base, just state which you used.
- In class, we discussed two scoring metrics: sum of pairs (SP) and
minimum entropy. Consider a column m_i that contains the following
characters: A,A,A,A and another that contains A,A,A,D.
Using the BLOSUM62 matrix, compute the sum of pairs and minimum
entropy scores for each column. Note that I have given columns
and not sequences. s(A,A) = 4 and s(A,D)=-2.
- Do the same for column A,A,A,A,A versus column A,A,A,A,D.
Compare your results to the previous problem. How does this support our
discussion on the relative advantages/disadvantages to the scoring metrics.
- In class, we discussed DNA as a linear sequence of nucleotides. Not all
DNA exists in this double helix form, however. The genomes of many bacteria
and even some eukaryotes (e.g., mitochondrial DNA in humans) exist in
ciruclar form. That is, there is no "start" or "end" to the sequence.
Devise an efficient algorithm to perform optimal local alignment on two
circular DNA sequences. This problem is conceptually difficult because
there is not a 0th character in the sequence.
A O(nm) algorithm exists, but if you cannot
think of it, describe your best attempt and its O() run time. As
a hint, your solution should build off of your existing local/global alignment
algorithms from class.
- Come up with a solution for global alignment of circular DNA. This
is more difficult so it is okay if you do not solve it, but you should
be able to make progress on the problem when discussing with your lab partner.
Submitting your work
You must hand in an individual solution in class on the due date. You
must also write the name of the individual you discussed problems in detail
with. You may talk about high-level concepts with as many individuals as
you like (for example, the paper on T-Coffee). But you are only allowed
to work on the specific problems and their solutions with one person in the
class. Supplementary material should be handed in using handin68.