The goals of this week's lab:
- Obtain experience using implementions of multiple sequence
alignment tools (ClustalW)
- Connect biology to algorithmic theory by interpreting algorithms
parameters and results
- Obtain experience generating and evaluating data visualization
approaches
- Analyzing research literature for methods that improve upon the
base algorithms covered in class
- Practice using algorithms from class to construct multiple sequence
alignments
To accomplish these goals, the lab is broken into three parts:
- A tutorial on using ClustalX (a GUI version of ClustalW).
- An analysis and short 1-page response to a research paper
on the multiple sequence alignment algorithm, MUSCLE.
- A problem set to practice and analyze algorithms we have
discussed in class.
You may work closely in discussing these problems
with one other person, but your answers
must be submitted individually.
Please write the name of your co-worker at the top of the assignment.
You may hand in your solution either (neatly) written or typed or
electronically as a PDF. In both cases, be sure to fill and
submit the README file in this week's directory. You should also submit
any supplementary materials (e.g., code) using
handin68.
ClustalX
To begin, read this tutorial on using ClustalX. I have placed two sets of sequences you can use to
test out ClustalX in your labs directory: DnaB.txt which is a small
set of medium-length protein sequences, and sequences12.txt - a large
set of myosin heavy chain proteins. You will run ClustalX by entering
the following command:
$ /usr/swat/bin/clustalx-2.1/clustalx
Note: you will probably want to increase the font size after starting
the program so the sequences/results are easier to read.
You can begin by using the quick start guide on your DnaB.txt file.
Then, read the rest of the document testing out different parameters to see
how they alter your results.
Once you complete the document and analyze both sequences, answer the
following questions:
- Briefly describe the trade-offs between the Fast and Slow approaches.
Which option is suitable for your DnaB.txt example? How about
the sequences12.txt file? Thinking of the intent of a user
interested in asking biological questions;
why might you want to combine both options? That is - attempt a fast run
followed by a slow run.
- After producing an alignment, one must assess the quality.
After one run, can you determine that you have the best alignment
possible? What strategies do the authors suggest for a) assessing quality and b) finding the best alignment (HINT: read the section
"Assessing the quality of alignment")?
- Attempt to interpret your results. As a biologist, how does the color
coding aid you in analysis? Specifically, think of a column which has
varied content (i.e., medium to high entropy) yet consistently stays the same
color. How might this alter your assessment of the "conservation" of a column.
- After running ClustalX on one of the input sets, head on over
to the WebLogo page for
producing representations of the entropy of a sequence. Load the alignment
from one of your examples and choose a range of approximately 25 amino acids.
Your range should include example of highly conserved regions and low-conserved
regions. Save the output file and either included it in your written
solutions or add it as a file named logo.png in this week's lab
directory and submit using handin68.
Further Reading and Response Paper
As noted in class, ClustalW is one of the most widely used
technique, but is not
necessarily the most advanced method available. Another popular program
is MUSCLE. Read the
brief introduction of the paper, available
here. Discuss the paper with
your classmates (at least one other person) and reflect on the following
aspects:
- How does the method relate to our discussion in class, in terms
of the type of multiple alignment approaches?
- What problem is the paper addressing, other than the general
problem of multiple sequence alignment? Is there a specific
set of criticisms of available methods that authors aim to solve?
- What similarities are there between MUSCLE and other
methods discussed in class?
- What new algorithmic approaches does the paper introduce that
we have not seen before?
- In what ways do the authors evaluate their method? Is there a
particular type of data set? What other methods do they consider? Is
their assessment "fair"? What metrics do they consider?
With your homework, submit a short paper on one interesting
aspect of the paper. The paper should be short -
less than 1 page
single-spaced. Rather than addressing all of the questions above,
your paper should provide a
detailed analysis on a specific aspect of
the paper .
This should stem from your discussion above
(e.g., the use of distance functions; the relationship to phylogenetic trees;
the choice of statistical methods for evaluation, etc.) or can explore some
other aspect you find interesting.
You should a) summarize the aspect of the MUSCLE paper
you are
discussing, b) provide a detailed analysis of this aspect
(e.g., compare and contrast to in-class methods; explain the relationship
between the biology and the algorithm, etc.) and c) pose questions of
further inquiry that the paper did not cover. The provided
paper is a shortened version of
the full detailed algorithm. You may want to examine this longer version
to get more details for your written response.
Problem Set
- You are given the following sequences:
- CTATTAATAC
- TATTAATAC
- CTATTAT
- CTATTAATC
- ATTAATAC
Use the Star algorithm to identify a multiple sequence alignment. Pick the
string that has the maximal average alignment score to select the center.
To perform your alignments, use a global alignment algorithm with linear
gap penalty. Matches should have a score of +1, substitutions -1, and gaps -3. You can do this by hand or write a short
program. Show your work if done by hand, or submit your code as
star.py. At a minimum, you should show the average/sum of
alignment scores for each sequence and the full multiple sequence alignment.
- Do the same for the Feng-Doolittle algorithm. (HINT: You should be able
to use your intermediate results from the Star Alignment problem above).
Be sure to submit your
initial distance matrix (use the equation in the book to convert your
alignment scores to distance scores) and also your guide tree. Use the
following values for a random alignment:
Srand(0,1) = -6
Srand(0,2) = -7
Srand(0,3) = -4
Srand(0,4) = -7
Srand(1,2) = -8
Srand(1,3) = -6
Srand(1,4) = -3
Srand(2,3) = -8
Srand(2,4) = -6
Srand(3,4) = -5
The S_max for the 5 sequences are respectively 10, 9, 7, 9, 8.
When updating the distance matrix, it is acceptable to either take the average
distance of all pairs or to pick the minimum. So, for example, if you merge
sequences x and y into one cluster, the distance between {x,y} and
z is the either the average or minimum of the individual pairwise
distances d_xz and d_yz. Use the minimum distance
between the two groups.
Also, please use natural log when calculating distances to make
grading a bit easier.
- In class, we discussed two scoring metrics: sum of pairs (SP) and
minimum entropy. Consider a column m_i that contains the following
characters: A,A,A,A and another that contains A,A,A,D.
Using the BLOSUM62 matrix, compute the sum of pairs and minimum
entropy scores for each column. Note that I have given columns
and not sequences. s(A,A) = 4 and s(A,D)=-2.
- Do the same for column A,A,A,A,A versus column A,A,A,A,D.
Compare your results to the previous problem. How does this support our
discussion on the relative advantages/disadvantages to the scoring metrics.
- In class, we discussed DNA as a linear sequence of nucleotides. Not all
DNA exists in this double helix form, however. The genomes of many bacteria
and even some eukaryotes (e.g., mitochondrial DNA in humans) exist in
ciruclar form. That is, there is no "start" or "end" to the sequence.
Devise an efficient algorithm to perform optimal local alignment on two
circular DNA sequences. This problem is conceptually difficult because
there is not a 0th character in the sequence.
A O(nm) algorithm exists, but if you cannot
think of it, describe your best attempt and its O() run time. As
a hint, your solution should build off of your existing local/global alignment
algorithms from class.
Submitting your work
You must hand in an individual solution by the due date. You
must also write the name of the individual you discussed problems in detail
with. You may talk about high-level concepts with as many individuals as
you like (for example, the paper on MUSCLE). But you are only allowed
to work on the specific problem solutions with one person in the
class. Supplementary material should be handed in using handin68. Do not forgot to submit your README for this week, as well.