CS68: Lab 1

The goals of this week's lab:

Reinforce the general concept of dynamic programming
Apply various algorithms for pairwise sequence alignment on a pair of sample sequences
Connect biology to theory through interpretation of algorithm parameters and results
Implement the Smith-Waterman algorithm
Analyze new research directions beyond the lecture material

This week's lab is broken into two segments: a problem set, followed by a programming assignment to implement the local pairwise alignment algorithm from class. Please see instructions below for submitting solutions for each portion. You may work with one partner for this lab.

Problem Set

Problem set questions are available here, and your solutions should be handed in on paper.

Programming Assignment: Local alignment with linear gap penalty

Implement a program that uses dynamic programming to compute a pairwise local alignment using a linear gap penalty function. The program should work on protein sequences (you can generalize it to DNA sequences as well, but this is not required).

Getting Started

Find your git repo for this lab assignment with name format of Lab1-id1_id2 (id1 being replaced by the user id of you and your partner): CS68-S17

Clone your git repo with the lab 1 starting point files into your labs directory:

$ cd cs68/labs
$ git clone [the ssh url to your your repo)

Then cd into your Lab1-id1_id2 subdirectory. You will have the following files (those in blue require your modification):

README - questionaire to be completed for your final submission.
localAlign.py - the main program for implementing your sequence alignment algorithm
inputs - sample sequences to align including one toy example and 3 gene pairs
blosum-62 - contains valeus for the BLOSUM-62 matrix to score substitutions

You are allowed to add additional python files as for importing into localAlign.py.

Usage

Your program should take in 5 command-line arguments in this specific order:

The name of a file containing the first sequence to align
The name of a file containing the second sequence to align
The name of a file containing the substitution matrix
the gap penalty (g)
The name of the output file to save the alignment

For example:

python localAlign.py input/sample1.fasta input/sample2.fasta blosum-62 -2 sampleAlign.txt

If the user specifies an incorrect number of arguments, print a usage statement and exit the program. You should also error check all parameters to ensure that the files exist and the gap penalty is negative. If anything is amiss, you can simply exit the program. For example:

$ python localAlign.py
Error, incorrect number of arguments
Usage: python localAlign.py seq1 seq2 matrixFile g output
  seq1, seq2 - fasta files containing two sequences to align (1 per file)
  matrixFile - file containing substitution matrix
  g - integer specifying penalty for a gap, must be negative
  output - name of output file for raw alignment

Breaking Ties

To be consistent, and obtain the same output as I have below, break ties using the high road protocol:

End alignment (if 0)
Insert gap in y (i.e., output character in first sequence, gap in second)
Match the characters in x and y
Insert gap in x

If there are multiple substrings with the same maximum score, output them all. You should worry about this requirement after you get your program working for just one maximum.

Input

I have provided three sample files in your directory:

test - a simple trivial pair of proteins in test1.fasta and test2.fasta
calcium - a pair of related calcium ion proteins - human calmodulin and calpain
kinases - a pair of yeast kinase proteins - STE7 and MKK1

Input files will follow a similar format to lab 0: the FASTA format where the first line is a comment, and then each line after that is part of the sequence.

Output

At a minimum, your output should contain:

The maximal alignment score
The optimal alignments of one subsequence to the other, including gaps, on standard output. You should break ties within one alignment as listed above. If there is a tie for optimal score at different locations in the score matrix (F), you should output them all on separate lines
The (1-based) index of the substrings making up the alignment from the original sequences. That is, the location of the corresponding amino acids (not counting gaps), for each sequence. As an example, the alignment for MSTYG to AXSTXYA results in:
```
	2 ST-Y 4
	  || |
	3 STXY 6
	
```
indicating the first sequence, characters 2 to 4, align with the second sequence, characters 3 through 6.
The alignment should be printed to the given output file, with no formatting necessary, 1 sequence per line. For example, your file for the above alignment should just have:
```
ST-Y
STXY
```
Do not write scores and indices to the file.

While these are the minimum, part of your final grade will involve going beyond the minimum to present visually appealing results. My sample output shows an example - you do not need to conform to the exact formatting if you have diffferent preferences for visualizing results. Take a look at tools online, such as BLAST and EMBOSS to get ideas on visualizing alignments. One suggestion is to break the alignment into smaller segments (50 in my sample output). Another is to use the matplotlib library to visualize the scoring matrix of optimal scores (see images below).

Sample output

Below our the outputs from the provided examples. My standard output of the alignment includes outputting 50 characters per line for each sequence, with the start and ending character indices indicated as well as (optional) markers to indicate perfect alignments ('|') and substitutions ('.'). Below are the outputs for several of the examples:

Test Example, gap penalty of -3: Standard output, File output, and Visual Plot
Kinase proteins, gap penalty of -5: Standard output, File output, and Visual Plot
Calcium pair, gap penalty of -7: Standard output, File output

Additional requirements

Your program should follow good design practices - effective top-down design, modularity, commenting for functions and non-trivial code, etc.
You may submit as many additional files as you wish. For example, any classes you define can be defined in a separate file. But the main program should be in localAlign.py
Practice defensive programming; e.g., did the user provide enough arguments on the command line? Do the files exist? You do not need to check if the contents are correct.
Do not interact with the user for your program - if an incorrect filename is given, simply exit the program with a message; do not prompt for a new file.
Clean up any debugging print statements. It will be useful to print out matrices when debugging, but be sure to remove this when handing in.

Tips and Hints

Command-line arguments

To use command-line arguments in Python, first, import the sys library.

from sys import *

You can now access a list argv containing all of the arguments. Recall that argv[0] is the name of the program, while argv[1] is the first argument provided by the user.

Substitution Matrix

Reading in the substitution matrix will require some careful thought and testing. One method is to simply store the values as a 20 by 20 two-dimensional array of integers and then have a separate function that maps amino acid characters to an index.

Another technique is to use a two-dimensional dictionary. Each row in the file will translate into a dictionary. They key is the character at the beginning of the line. The value it maps to is itself a dictionary containing all of the (column,value) pairs for that row. If I do this in python for just the first two rows and columns and print the resulting dictionary of dictionaries, I get the following value:

{'A': {'A': 4, 'R': -1}, 'R': {'A': -1, 'R': 5}}

The first pair in the dictionary has a key 'A' and value which itself is a dictionary {'A': 4, 'R': -1}. If I want the substitution value for replacing an A with an R, I simple call:

>>> print s['A']['R']
-1

where s is the dictionary structure I stored the values in. It's a bit more complicated to think about, but easier to use since you will not need to map the amino acids to integers.

A third approach is to stick with the two-dimensional array, but place it in a class SubstitutionMatrix such that you can create an object and call a getScore method that does all of the mapping for you. Either technique is acceptable, but your code should be readable and well designed. If using C++, either of the two- dimensional array options are probably easier to implement.

Submitting your work

The problem set and paper response should be handed in on paper. You can hand them in class or place them in the mailbox outside my door by the due date.

For the programming portion, be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded.

CS68 Lab 1: Dynamic Programming and Pairwise Sequence Alignment