CS68: Lab 1

The goals of this week's lab:

Reinforce the general algorithmic concept of dynamic programming
Apply algorithms for pairwise sequence alignment to both global alignment with linear gap penalty (Needleman-Wunsch) and local alignment with linear gap penalty (Smith-Waterman) to sample sequences
Connect biology to theory through interpretation of algorithm parameters and results
Implement Smith-Waterman in Python
Analyze new research directions beyond the lecture material

This week's lab is broken into two segments: a problem set, followed by a programming assignment to implement the local pairwise alignment algorithm from class. Please see instructions below for submitting solutions for each portion. You may work with one partner for this lab. As usual, use update68 to obtain starting point files.

Problem Set

Answer the following questions, which should be handed in on paper.

Algorithms

The minimum coin change problem is as follows: given the denomination of coins for a currency and a particular amount of cents for change, produce the minimum number of coins necessary to make exact change. For example, if a customer pays a bill and expects 77 cents in change, the teller's task is to produce 77 cents using as few coins as possible.

For US currency (i.e., 1, 5, 10, 25), there is a known greedy solution that is correct. For example, to produce change for 77 cents, simply subtract the largest denomination (25 cents) three times, and then you are left with just2 pennies for an answer of 5 coins. For an arbitrary set of denominations, however, the greedy solution is not optimal so do not use it below.

For this problem, assume a simple scenario where there are three coins in the currency: 1, 3, and 4 cents. Note that for this problem, you are trying to determine the number of coins in the optimal solution, not the actual coins needed.

Using either pseudocode or a recurrence relation, define a simple recursive solution that is guaranteed to provide the optimal solution. Assume the change amount M and a function minNumCoins(M) that returns the minimum number of coins needed to make change for M. HINT: if you start with 77 cents, there are three possible scenarios: use 1 cent (with 76 cents remaining), 3 cent (74 cents), or 4 cent (73 cents). What subproblems result? You do not need to define the base case here.

Design an efficient dynamic programming solution for this problem using pseudocode. Your solution can be either top-down or bottom-up. What is the run time of your solution in big-O?

Solve the min-coin change problem for 10 cents using your dynamic programming solution. How many coins are needed? Show the dynamic programming array for solving the problem with 10 cents.

Practice: Pairwise Sequence Alignment

Perform Needleman-Wunsch to solve the optimal global alignment with linear gap penalty on the sequences AACGTTAC and CGATAAC. Use a scoring function of gap penalty -1, and a substitution function of +1 for matches and -1 for mismatches. Show your final array including pointers, the optimal alignment score, and the possible traceback (i.e., alignments) for the optimal score using a highroad strategy to break ties.

For the above problem, perform Smith-Waterman for local alignment with linear gap penalty with the same sequences and scoring parameters. If there are multiple maximum alignments, return any of them.

Interpretation and Analysis

Amino acids D, E, and K are all charged molecules while V, I, and L are hydrophobic (i.e., afraid of water). Using the BLOSUM50 scoring matrix (in your book or in lecture slides), determine the average substitution score within each group (i.e., average of D to E, E to K, and D to K, etc.). Then, calculate the average score for all pairs between the two groups (D to V, D to I,...). You should notice a pattern, explain what the pattern is and why it is expected.

Head over to the BLAST web page and examine the various inputs and parameter options:
1. If one has a gene of interest, but wants to restrict comparisons to a single organism (e.g., mouse), what parameter should the user tweak?
2. If the user wants to discover, in particular, very distant homologs, which subsitution matrix should she pick? Read about the BLOSUM and PAM matrix here and then specify which of the available PAM, and which of the available BLOSUM matrices the user should chose.

Complete this question once you have finished the programming assignment. Hemoglobin and myoglobin are evolutionary relatives. One major difference is that hemoglobin tends to form into tetramers (groups of 4) while myoglobin prefers to remain by itself. Despite this, they have similar functions - hemoglobin is used to transport oxygen while myoglobin stores oxygen. In humans, we can align the two proteins to see the significant amount of divergence using your implementation. Use your local alignment algorithm to align human hemoglobin beta hemoglobin_beta.txt with human myoglobin, myoglobin.txt. How do your results differ as you increase the gap penalty? Specifically, measure the identity (exact matches) and gaps as you vary the gap penalty. Do the identity and number of gaps increase proportionally? Pick a gap penalty that provides good results and justify your choice in 1 or 2 sentences.

Analysis of research: You should have received an email assigning you one of the research papers related to sequence alignment. Your job is to prepare for an in-lab discussion the following week (Sept 22) followed by a short write-up detailing your findings for one of two papers:
1. Alignment of Whole Genomes by Delcher et al., Nucleic Acids Research, 1999.
2. Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs, Nucleic Acids Research, 1997.
The discussion questions will be posted to Piazza under "September 22 Discussion Questions". Your paper should 1-2 pages about any algorithmic component of your paper you chose. Specific ideas for what you can write about are posted under "Lab 1: discussion paper questions".

Programming Assignment: Local alignment with linear gap penalty

Implement a program that uses dynamic programming to compute a pairwise local alignment using a linear gap penalty function. The program should work on protein sequences (you can generalize it to DNA sequences as well, but this is not required).

Usage

Your program should take in 5 command-line arguments in this specific order:

The name of a file containing the first sequence to align
The name of a file containing the second sequence to align
The name of a file containing the substitution matrix
the gap penalty (g)
The name of the output file to save the alignment

For example:

python localAlign.py input/sample1.fasta input/sample2.fasta blosum-62 -2 sampleAlign.txt

If the user specifies an incorrect number of arguments, print a usage statement and exit the program. You should also error check all parameters to ensure that the files exist and the gap penalty is negative. If anything is amiss, you can simply exit the program. For example:

$ python localAlign.py 
Error, incorrect number of arguments
Usage: python localAlign.py seq1 seq2 matrixFile g output
  seq1, seq2 - fasta files containing two sequences to align (1 per file)
  matrixFile - file containing substitution matrix
  g - integer specifying penalty for a gap, must be negative
  output - name of output file for raw alignment

Breaking Ties

To be consistent, and obtain the same output as I have below, break ties as follows:

End alignment (if 0)
insert gap in y (i.e., output character in first sequence, gap in second)
match
insert gap in x

If there are multiple substrings with the same maximum score, output them all. You should worry about this requirement after you get your program working for just one maximum.

Input

I have provided three sample files in your directory:

test - a simple trivial pair of proteins in test1.fasta and test2.fasta
calcium - a pair of related calcium ion proteins - human calmodulin and calpain
kinases - a pair of yeast kinase proteins - STE7 and MKK1

Input files will follow a similar format to lab 0: the FASTA format where the first line is a comment, and then each line after that is part of the sequence.

Output

At a minimum, your output should contain:

The maximal alignment score
The optimal alignments of one subsequence to the other, including gaps, on standard output. You should break ties within one alignment as listed above. If there is a tie for optimal score at different locations in the score matrix (F), you should output them all on separate lines
The (1-based) index of the substrings making up the alignment from the original sequences. That is, the location of the corresponding amino acids (not counting gaps), for each sequence. As an example, the alignment for MSTYG to AXSTXYA results in:
```
	2 ST-Y 4
	  || |
	3 STXY 6
	
```
indicating the first sequence, characters 2 to 4, align with the second sequence, characters 3 through 6.
The alignment should be printed to the given output file, with no formatting necessary, 1 sequence per line. For example, your file for the above alignment should just have:
```
ST-Y
STXY
```
Do not print scores and indices

While these are the minimum, part of your final grade will involve going beyond the minimum to present visually appealing results. My sample output shows an example - you do not need to conform to the exact examples. Take a look at tools online, such as BLAST and EMBOSS to get ideas on visualizing alignments. One suggestion is to break them alignment into smaller segments (50 in my sample output). Another is to use the matplotlib library to visualize the scoring matrix of optimal scores.

Sample output

Below our the outputs from the provided examples. My standard output of the alignment includes outputting 50 characters per line for each sequence, with the start and ending character indices indicated as well as (optional) markers to indicate perfect alignments ('|') and substitutions ('.'). Below are the outputs for several of the examples: <

Test Example, gap penalty of -3: Standard output, File output, and Visual Plot
Kinase proteins, gap penalty of -5: Standard output, File output, and Visual Plot
Calcium pair, gap penalty of -7: Standard output, File output

Additional requirements

Your program should follow good design practices - effective top-down design, modularity, commenting for functions and non-trivial code, etc.
You may submit as many additional files as you wish. For example, any classes you define can be defined in a separate file. But the main program should be in localAlign.py
Practice defensive programming; e.g., did the user provide enough arguments on the command line? Are the files formatted properly? You do not need to check if the contents of the file make sense.
Do not interact with the user for your program - if an incorrect filename is given, simply exit the program with a message; do not prompt for a new file.
Clean up any debugging print statements. It will be useful to print out matrices when debugging, but be sure to remove this when handing in.

Tips and Hints

Command-line arguments

To use command-line arguments in Python, first, import the sys library.

from sys import *

You can now access a list argv containing all of the arguments. Recall that argv[0] is the name of the program, while argv[1] is the first argument provided by the user.

Substitution Matrix

Reading in the substitution matrix will require some careful thought and testing. One method is to simply store the values as a 20 by 20 two-dimensional array of integers and then have a separate function that maps amino acid characters to an index.

Another technique is to use a two-dimensional dictionary. Each row in the file will translate into a dictionary. They key is the character at the beginning of the line. The value it maps to is itself a dictionary containing all of the (column,value) pairs for that row. If I do this in python for just the first two rows and columns and print the resulting dictionary of dictionaries, I get the following value:

{'A': {'A': 4, 'R': -1}, 'R': {'A': -1, 'R': 5}}

The first pair in the dictionary has a key 'A' and value which itself is a dictionary {'A': 4, 'R': -1}. If I want the substitution value for replacing an A with an R, I simple call:

>>> print s['A']['R']
-1

where s is the dictionary structure I stored the values in. It's a bit more complicated to think about, but easier to use since you will not need to map the amino acids to integers.

A third approach is to stick with the two-dimensional array, but place it in a class SubstitutionMatrix such that you can create an object and call a getScore method that does all of the mapping for you. Either technique is acceptable, but your code should be readable and well designed. If using C++, either of the two- dimensional array options are probably easier to implement.

Submitting your work

The problem set and paper response should be handed in on paper. You can hand them in class or place them in the mailbox outside my door by the due date.

Once you are satisfied with your program for local alignment, hand it in by typing handin68 at the unix prompt. Remember to select your partner using the 'p' option - you only need to do this once per lab.

You may run handin68 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them. Only files in this week's lab directory will be submitted.

CS68 Lab 1: Dynamic Programming and Pairwise Sequence Alignment