The goals of this week's lab:
- Reinforce the general concept of dynamic programming
- Apply various algorithms for pairwise sequence alignment on a pair of sample sequences
- Connect biology to theory through interpretation of algorithm
parameters and results
- Implement the Smith-Waterman algorithm
- Analyze new research directions beyond the lecture material
This week's lab is broken into two segments: a problem set,
followed by a programming assignment to implement the local pairwise
alignment algorithm from class. Please see instructions below for
submitting solutions for each portion. You may work with one
partner for this lab.
Problem Set
Problem set questions are available here, and your solutions should be handed in on paper.
Programming Assignment: Local alignment
with linear gap penalty
Implement a program that uses dynamic programming to compute a pairwise local
alignment using a linear gap penalty function. The program should work
on protein sequences (you can generalize it to DNA sequences as well, but this
is not required).
Getting Started
Find your git repo for this lab assignment with name format of
Lab1-id1_id2 (id1 being replaced by the
user id of you and your partner):
CS68-S17
Clone your git repo with the lab 1 starting point files into
your labs directory:
$ cd cs68/labs
$ git clone [the ssh url to your your repo)
Then cd into your
Lab1-id1_id2 subdirectory. You will have the following files (those in blue require your modification):
- README - questionaire to be completed for your final submission.
- localAlign.py - the main program for implementing your sequence alignment algorithm
- inputs - sample sequences to align including one toy example and 3 gene pairs
- blosum-62 - contains valeus for the BLOSUM-62 matrix to score substitutions
You are allowed to add additional python files as for importing into
localAlign.py.
Usage
Your program should take in 5 command-line arguments in this specific order:
- The name of a file containing the first sequence to align
- The name of a file containing the second sequence to align
- The name of a file containing the substitution matrix
- the gap penalty (g)
- The name of the output file to save the alignment
For example:
python localAlign.py input/sample1.fasta input/sample2.fasta blosum-62 -2 sampleAlign.txt
If the user specifies an incorrect number of arguments, print a usage
statement and exit the program. You should also error check all parameters
to ensure that the files exist and the gap penalty is negative. If anything
is amiss, you can simply exit the program. For example:
$ python localAlign.py
Error, incorrect number of arguments
Usage: python localAlign.py seq1 seq2 matrixFile g output
seq1, seq2 - fasta files containing two sequences to align (1 per file)
matrixFile - file containing substitution matrix
g - integer specifying penalty for a gap, must be negative
output - name of output file for raw alignment
Breaking Ties
To be consistent, and obtain the same output as I have below,
break ties using the high road protocol:
- End alignment (if 0)
- Insert gap in y (i.e., output character in first sequence, gap in second)
- Match the characters in x and y
- Insert gap in x
If there are multiple substrings with the same maximum score, output them all. You should worry about this requirement after you get your program
working for just one maximum.
Input
I have provided three sample files in your directory:
- test - a simple trivial pair of proteins in test1.fasta
and test2.fasta
- calcium - a pair of related calcium ion proteins - human
calmodulin and calpain
- kinases - a pair of yeast kinase proteins - STE7 and MKK1
Input files will follow a similar format to lab 0: the FASTA format where
the first line is a comment, and then each line after that
is part of the sequence.
Output
At a minimum, your output should contain:
- The maximal alignment score
- The optimal alignments of one subsequence to the other, including gaps,
on standard output. You should break ties within one alignment as listed
above. If there is a tie for optimal score at different locations in the
score matrix (F), you should output them all on separate lines
- The (1-based) index of the substrings making up the alignment
from the original sequences.
That is, the location of the corresponding amino acids (not counting gaps),
for each sequence. As an example, the alignment for MSTYG to
AXSTXYA results in:
2 ST-Y 4
|| |
3 STXY 6
indicating the first sequence, characters 2 to 4, align with the second
sequence, characters 3 through 6.
- The alignment should be printed to the given output file, with no formatting necessary, 1 sequence per line. For example, your file for the above alignment should just have:
ST-Y
STXY
Do not write scores and indices to the file.
While these are the minimum, part of your final grade will involve going beyond
the minimum to present visually appealing results. My sample output shows
an example - you do not need to conform to the exact formatting if you have diffferent preferences for visualizing results. Take a look
at tools online, such as BLAST and EMBOSS to get ideas on visualizing
alignments. One suggestion is to break the alignment into smaller segments
(50 in my sample output). Another is to use the matplotlib library to
visualize the scoring matrix of optimal scores (see images below).
Sample output
Below our the outputs from the provided examples.
My standard output of the alignment includes outputting 50 characters per line
for each sequence, with the start and ending character indices indicated as well
as (optional) markers to indicate perfect alignments (
'|') and
substitutions (
'.').
Below are the outputs for several of the examples:
Additional requirements
- Your program should follow good design practices - effective top-down
design, modularity, commenting for functions and non-trivial code, etc.
- You may submit as many additional files as you wish. For example,
any classes you define can be defined in a separate file. But the
main program should be in localAlign.py
- Practice defensive programming; e.g., did the user provide enough
arguments on the command line? Do the files exist? You do
not need to check if the contents are correct.
- Do not interact with the user for your program - if an incorrect filename
is given, simply exit the program with a message; do not prompt for a new
file.
- Clean up any debugging print statements. It will be useful to print
out matrices when debugging, but be sure to remove this when handing in.
Tips and Hints
Command-line arguments
To use command-line arguments in Python, first, import the
sys library.
from sys import *
You can now access a list
argv containing all of the arguments.
Recall
that
argv[0] is the name of the program, while
argv[1]
is the first argument provided by the user.
Substitution Matrix
Reading in the substitution matrix will require some careful thought and
testing. One method is to simply store the values as a 20 by 20
two-dimensional array of integers and then have a separate function that
maps amino acid characters to an index.
Another technique is to use a
two-dimensional dictionary. Each row in the file will translate into a
dictionary. They key is the character at the beginning of the line.
The value it maps to is itself a dictionary containing all of the
(column,value) pairs for that row. If I do this in python for just the first
two rows and columns and print the resulting dictionary of dictionaries,
I get the following value:
{'A': {'A': 4, 'R': -1}, 'R': {'A': -1, 'R': 5}}
The first pair in the dictionary has a key
'A' and value which itself
is a dictionary
{'A': 4, 'R': -1}.
If I want the substitution value for replacing an
A with
an
R, I simple call:
>>> print s['A']['R']
-1
where
s is the dictionary structure I stored the values in.
It's a bit more complicated to think about, but easier to use since you
will not need to map the amino acids to integers.
A third approach is to
stick with the two-dimensional array, but place it in a class
SubstitutionMatrix
such that you can create an object and call a getScore method that
does all of the mapping for you. Either technique is acceptable, but your
code should be readable and well designed. If using C++, either of the two-
dimensional array options are probably easier to implement.
Submitting your work
The problem set and paper response should be handed in on paper. You
can hand them in class or place them in the mailbox outside my door by the
due date.
For the programming portion, be sure to commit your work often to prevent lost data. Only your final
pushed solution will be graded.