The goals of this week's lab:
- Reinforce the general algorithmic concept of dynamic programming
- Apply algorithms for pairwise sequence alignment to both
global alignment with linear gap penalty (Needleman-Wunsch) and
local alignment with linear gap penalty (Smith-Waterman) to sample
sequences
- Connect biology to theory through interpretation of algorithm
parameters and results
- Implement Smith-Waterman in Python
- Analyze new research directions beyond the lecture material
This week's lab is broken into two segments: a problem set,
followed by a programming assignment to implement the local pairwise
alignment algorithm from class. Please see instructions below for
submitting solutions for each portion. You may work with one
partner for this lab. As usual, use
update68 to obtain
starting point files.
Problem Set
Answer the following questions, which should be handed in on paper.
Algorithms
The minimum coin change problem is as follows: given
the denomination of coins for a currency and a particular amount of cents for
change, produce the minimum number of coins necessary to make exact change.
For example, if a customer pays a bill and expects 77 cents in change, the
teller's task is to produce 77 cents using as few coins as possible.
For US currency (i.e., 1, 5, 10, 25), there is a known greedy solution
that is correct. For example, to produce change for 77 cents, simply subtract
the largest denomination (25 cents) three times, and then you are left with just2 pennies for an answer of 5 coins. For an
arbitrary set of denominations, however, the greedy solution is not optimal
so do not use it below.
For this problem, assume a simple scenario where there are three coins
in the currency:
1, 3, and 4 cents. Note that for this problem,
you are trying to determine the number of coins
in the optimal solution, not the actual coins needed.
- Using either pseudocode or a recurrence relation, define a simple recursive
solution that is guaranteed to provide the optimal solution.
Assume the change amount M and a function
minNumCoins(M) that returns the minimum number of coins needed to make
change for M. HINT: if you start with 77 cents, there are three
possible scenarios: use 1 cent (with 76 cents remaining), 3 cent (74 cents),
or 4 cent (73 cents). What subproblems result? You do not need to define the base case here.
- Design an efficient dynamic programming solution for this problem
using pseudocode. Your solution can be either top-down or bottom-up. What
is the run time of your solution in big-O?
- Solve the min-coin change problem for 10 cents using your
dynamic programming solution.
How many coins are needed? Show the dynamic programming array for solving
the problem with 10 cents.
Practice: Pairwise Sequence Alignment
- Perform Needleman-Wunsch to solve the optimal global alignment with linear
gap penalty on the sequences AACGTTAC and CGATAAC. Use a
scoring function of gap penalty -1, and a substitution function
of +1 for matches and -1 for mismatches. Show your final
array including pointers, the optimal alignment score, and the
possible traceback (i.e., alignments) for the optimal score using a highroad strategy
to break ties.
- For the above problem, perform Smith-Waterman for local alignment
with linear gap penalty with the same sequences and scoring parameters.
If there are multiple maximum alignments, return any of them.
Interpretation and Analysis
- Amino acids D, E, and K are all charged molecules while V, I, and L are
hydrophobic (i.e., afraid of water). Using the BLOSUM50 scoring matrix
(in your book or in lecture slides), determine the average substitution score
within each group (i.e., average of D to E, E to K, and D to K, etc.). Then,
calculate the average score for all pairs between the two groups (D to V, D to I,...). You should notice a pattern, explain what the pattern is and why
it is expected.
- Head over to the BLAST web page and examine
the various inputs and parameter options:
- If one has a gene of interest, but wants to restrict comparisons to a
single organism (e.g., mouse), what parameter should the user tweak?
- If the user wants to discover, in particular, very distant homologs, which
subsitution matrix should she pick? Read about the BLOSUM and PAM matrix
here and then
specify which of the available PAM, and which of the available BLOSUM matrices
the user should chose.
- Complete this question once you have finished the programming assignment.
Hemoglobin
and myoglobin are
evolutionary relatives. One major difference is that hemoglobin tends to form
into tetramers (groups of 4) while myoglobin prefers to remain by itself.
Despite this, they have similar functions - hemoglobin is used to transport oxygen while myoglobin stores oxygen. In humans, we can
align the two proteins to see the significant amount of divergence using your
implementation.
Use your local alignment algorithm to align human hemoglobin beta hemoglobin_beta.txt with human myoglobin, myoglobin.txt. How do your
results differ as you increase the gap penalty? Specifically, measure
the identity (exact matches) and gaps as you vary the gap penalty. Do the
identity and number of gaps increase proportionally? Pick a gap penalty that
provides good results and justify your choice in 1 or 2 sentences.
- Analysis of research:
You should have received an email assigning you one of the research papers
related to sequence alignment. Your job is to prepare
for an in-lab discussion the following week (Sept 22) followed by a short
write-up detailing your findings for one of two papers:
- Alignment of Whole Genomes by
Delcher et al., Nucleic Acids Research, 1999.
- Gapped BLAST and PSI-BLAST: A New
Generation of Protein Database Search Programs, Nucleic Acids
Research, 1997.
The discussion questions will be posted to Piazza under "September 22 Discussion
Questions". Your paper should 1-2 pages
about any algorithmic component of your paper you chose. Specific ideas
for what you can write about are posted under "Lab 1: discussion paper questions".
Programming Assignment: Local alignment
with linear gap penalty
Implement a program that uses dynamic programming to compute a pairwise local
alignment using a linear gap penalty function. The program should work
on protein sequences (you can generalize it to DNA sequences as well, but this
is not required).
Usage
Your program should take in 5 command-line arguments in this specific order:
- The name of a file containing the first sequence to align
- The name of a file containing the second sequence to align
- The name of a file containing the substitution matrix
- the gap penalty (g)
- The name of the output file to save the alignment
For example:
python localAlign.py input/sample1.fasta input/sample2.fasta blosum-62 -2 sampleAlign.txt
If the user specifies an incorrect number of arguments, print a usage
statement and exit the program. You should also error check all parameters
to ensure that the files exist and the gap penalty is negative. If anything
is amiss, you can simply exit the program. For example:
$ python localAlign.py
Error, incorrect number of arguments
Usage: python localAlign.py seq1 seq2 matrixFile g output
seq1, seq2 - fasta files containing two sequences to align (1 per file)
matrixFile - file containing substitution matrix
g - integer specifying penalty for a gap, must be negative
output - name of output file for raw alignment
Breaking Ties
To be consistent, and obtain the same output as I have below,
break ties as follows:
- End alignment (if 0)
- insert gap in y (i.e., output character in first sequence, gap in second)
- match
- insert gap in x
If there are multiple substrings with the same maximum score, output them all. You should worry about this requirement after you get your program
working for just one maximum.
Input
I have provided three sample files in your directory:
- test - a simple trivial pair of proteins in test1.fasta
and test2.fasta
- calcium - a pair of related calcium ion proteins - human
calmodulin and calpain
- kinases - a pair of yeast kinase proteins - STE7 and MKK1
Input files will follow a similar format to lab 0: the FASTA format where
the first line is a comment, and then each line after that
is part of the sequence.
Output
At a minimum, your output should contain:
- The maximal alignment score
- The optimal alignments of one subsequence to the other, including gaps,
on standard output. You should break ties within one alignment as listed
above. If there is a tie for optimal score at different locations in the
score matrix (F), you should output them all on separate lines
- The (1-based) index of the substrings making up the alignment
from the original sequences.
That is, the location of the corresponding amino acids (not counting gaps),
for each sequence. As an example, the alignment for MSTYG to
AXSTXYA results in:
2 ST-Y 4
|| |
3 STXY 6
indicating the first sequence, characters 2 to 4, align with the second
sequence, characters 3 through 6.
- The alignment should be printed to the given output file, with no formatting necessary, 1 sequence per line. For example, your file for the above alignment should just have:
ST-Y
STXY
Do not print scores and indices
While these are the minimum, part of your final grade will involve going beyond
the minimum to present visually appealing results. My sample output shows
an example - you do not need to conform to the exact examples. Take a look
at tools online, such as BLAST and EMBOSS to get ideas on visualizing
alignments. One suggestion is to break them alignment into smaller segments
(50 in my sample output). Another is to use the matplotlib library to
visualize the scoring matrix of optimal scores.
Sample output
Below our the outputs from the provided examples.
My standard output of the alignment includes outputting 50 characters per line
for each sequence, with the start and ending character indices indicated as well
as (optional) markers to indicate perfect alignments (
'|') and
substitutions (
'.').
Below are the outputs for several of the examples:
<
Additional requirements
- Your program should follow good design practices - effective top-down
design, modularity, commenting for functions and non-trivial code, etc.
- You may submit as many additional files as you wish. For example,
any classes you define can be defined in a separate file. But the
main program should be in localAlign.py
- Practice defensive programming; e.g., did the user provide enough
arguments on the command line? Are the files formatted properly? You do
not need to check if the contents of the file make sense.
- Do not interact with the user for your program - if an incorrect filename
is given, simply exit the program with a message; do not prompt for a new
file.
- Clean up any debugging print statements. It will be useful to print
out matrices when debugging, but be sure to remove this when handing in.
Tips and Hints
Command-line arguments
To use command-line arguments in Python, first, import the
sys library.
from sys import *
You can now access a list
argv containing all of the arguments.
Recall
that
argv[0] is the name of the program, while
argv[1]
is the first argument provided by the user.
Substitution Matrix
Reading in the substitution matrix will require some careful thought and
testing. One method is to simply store the values as a 20 by 20
two-dimensional array of integers and then have a separate function that
maps amino acid characters to an index.
Another technique is to use a
two-dimensional dictionary. Each row in the file will translate into a
dictionary. They key is the character at the beginning of the line.
The value it maps to is itself a dictionary containing all of the
(column,value) pairs for that row. If I do this in python for just the first
two rows and columns and print the resulting dictionary of dictionaries,
I get the following value:
{'A': {'A': 4, 'R': -1}, 'R': {'A': -1, 'R': 5}}
The first pair in the dictionary has a key
'A' and value which itself
is a dictionary
{'A': 4, 'R': -1}.
If I want the substitution value for replacing an
A with
an
R, I simple call:
>>> print s['A']['R']
-1
where
s is the dictionary structure I stored the values in.
It's a bit more complicated to think about, but easier to use since you
will not need to map the amino acids to integers.
A third approach is to
stick with the two-dimensional array, but place it in a class
SubstitutionMatrix
such that you can create an object and call a getScore method that
does all of the mapping for you. Either technique is acceptable, but your
code should be readable and well designed. If using C++, either of the two-
dimensional array options are probably easier to implement.
Submitting your work
The problem set and paper response should be handed in on paper. You
can hand them in class or place them in the mailbox outside my door by the
due date.
Once you are satisfied with your program for local alignment,
hand it in by typing
handin68 at the unix prompt. Remember to select your partner
using the 'p' option - you only need to do this once per lab.
You may run handin68 as many times as you like, and only the
most recent submission will be recorded. This is useful if you realize
after handing in some programs that you'd like to make a few more
changes to them. Only files in this week's lab directory will be submitted.