CS68: Lab 2

This week's lab is broken into two segments: a set of problems you will answer, followed by a programming assignment to implement the local pairwise alignment algorithm from class.

The goal of this week's lab is to obtain an understanding of dynamic programming. You will devise solutions to general problems and analyze the efficiency gains dynamic programming provides over recursive solutions. The second goal is learn how to implement a pairwise sequence alignment algorithm using the dynamic programming algorithm discussed in class.

You may hand in your problem set manually or through handin68. Hard copies must be turned in the mail slot outside my office door. Solutions to the programming assignment should be handed in using handin68. All solutions, whether electronic or hard copy, are due Wednesday, February 13 by midnight. Your problem set is to be done individually. For the programming assignment, you make work in pairs.

Problem Set

This part of the lab should be completed individually

Answer the following questions, which should be handed in either electronically or on paper. If handing in a hard copy, ensure that it is in the mail slot outside my office door by the assignment deadline.

Coin Problem: The minimum coin change problem is as follows: given the denomination of coins for a currency and a particular amount of cents for change, produce the minimum number of coins necessary to make exact change. For example, if a customer pays a bill and expects 77 cents in change, the teller's task is to produce 77 cents using as few coins as possible.

For US currency, there is a known greedy solution that is correct. For an abitrary set of denominations, however, the greedy solution is not optimal. For this problem, assume a simple scenario where there are three coins in the currency: 1, 3, and 7 cents. Note that for this problem, you are trying to determine the number of coins in the optimal solution, not the actual coins needed.

Using either pseudocode or a recurrence relation, define a simple recursive solution. Assume you change amount M and a function minNumCoins(M) that returns the minimum number of coins needed to make change for M. HINT: if you start with 77 cents, there are three possible scenarios: use 1 cent (with 76 cents remaining), 3 cent (74), or 7 cent (70). What is the subproblem that you need to call and what value do you return? You can ignore base cases for your solution.

Design a dynamic programming solution for this problem using pseudocode.

Solve the problem for 9 cents using your dynamic programming solution. How many coins are needed? Show the dynamic programming array for solving the problem with 9 cents.

Amino acids D, E, and K are all charged molecules while V, I, and L are hydrophobic (i.e., afraid of water). Using the BLOSUM50 scoring matrix (in your book or in lecture slides), determine the average substitution score within each group (i.e., average of D to E, E to K, and D to K, etc.). Then, calculate the average score for all pairs between the two groups (D to V, D to I,...). You should notice a pattern, explain what the pattern is and why it is expected.

Calculate the dynamic programming matrix for the optimal global alignment of the DNA sequences GAATTC and GATTA. Use +2 for matches, -1 for mismatches, and a linear gap penalty of g = -2.

Using the BLOSUM50 matrix, calculate the scores for the two given alignments in Figure 2.1b. Assume an affine gap penalty where h = -12 and g = -2

Programming Assignment: Local alignment with affine gap penalty

Implement a program that uses dynamic programming to compute a pairwise local alignment using an affine gap penalty function. The program should work on protein sequences (you can generalize it to DNA sequences as well, but this is not required).

You may work with one individual on this assignment. Please ensure that both partner's names are at the top of each file submitted and that only one individual in the pair submit the assignment.

You are allowed to complete your assignment in either C++ or Python, whichever you are more comfortable with. In either case, however, the execution style should meet the requirements below to making testing easier. If handing in C++ files, ensure that you submit an executable named alignment alongside your code. For Python, your main program should be in alignment.py. You should make this file executable (see the instructions in Tips and Hints).

Your program should take in 4 command-line arguments in this specific order:

The name of a file containing the two sequences to align
The name of a file containing the substitution matrix
the gap initiation penalty (h)
the gap extension penalty (g)

If the user specifies an incorrect number of arguments, print a usage statement and exit the program.

Breaking Ties

To be consistent, and obtain the same output as I have below, break ties using a high-road approach. That is, if a cell obtains the same value by two various preceeding alignments, choose based on this ranking:

If there are multiple substrings with the same maximum score, choose the location with the higher row index. If there are ties within the same row, use the one with the higher column index. For example, if M(10,100) and M(11,80) have the same value, choose M(11,80).

Input and Output

I have provided three sample files in your directory:

example - a simple trivial pair of proteins
calcium - a pair of related calcium ion proteins
kinases - a pair of kinase proteins

Input files will follow a similar format: sequence x on line 1, an empty line, then sequence y, followed by comments.

To report your results, output the maximum alignment score followed by the local alignment. For example, a sample run with the simple example and a affine penalty of h = -10, g = -1 would look like:

$ ./alignment.py wisc/example blosum-62 -10 -1
Alignment Score: 30

Alignment:
MQR---GGDEQ
MQRGGGGGDEQ

You can/should have extra output byt those results should appear at the end of the file. For debugging purposes, I have provided the final value of all three matrices for this example (-8888 refers to negative infinity) in addition to the results for the more complex examples:

Tips and Hints

Command-line arguments

You should recall how to use command-line arguments from CS35. If you need a refresher, head here. For Python, it is very similar. First, import the sys library.

from sys import *

You can now access a list argv containing all of the arguments. Recall that for both languages, argv[0] is the name of the program.

Directly executable python code

The output compilation in C++ creats an executable than can be directly called. Python interprets programs on the fly, so there is no equivalent file available to the user. To make your program executable, simple add the following to the top of your python file:

#! /usr/bin/env python

Next, make your file executable in unix:

$ chmod +x alignment.py

Then, when you execute, you can simply call the program directly without invoking python:

$ ./alignment.py pair blosum -12 -4

Substitution Matrix

Reading in the substitution matrix will require some careful thought and testing. One method is to simply store the values as a 20 by 20 two-dimensional array of integers and then have a separate function that maps amino acid characters to an index.

Another technique is to use a two-dimensional dictionary. Each row in the file will translate into a dictionary. They key is the character at the beginning of the line. The value it maps to is itself a dictionary containing all of the (column,value) pairs for that row. If I do this in python for just the first two rows and columns and print the resulting dictionary of dictionaries, I get the following value:

{'A': {'A': 4, 'R': -1}, 'R': {'A': -1, 'R': 5}}

The first pair in the dictionary has a key 'A' and value which itself is a dictionary {'A': 4, 'R': -1}. If I want the substitution value for replacing an A with an R, I simple call:

>>> print s['A']['R']
-1

where s is the dictionary structure I stored the values in. It's a bit more complicated to think about, but easier to use since you will not need to map the amino acids to integers.

A third approach is to stick with the two-dimensional array, but place it in a class SubstitutionMatrix such that you can create an object and call a getScore method that does all of the mapping for you. Either technique is acceptable, but your code should be readable and well designed. If using C++, either of the two- dimensional array options are probably easier to implement.

Initialization and Infinity

Initializing your matrices was defined in class. From a practical perspective, however, you should initialize all values to negative infinity before starting the initialization steps to 0. Furthermore, infinity is not exactly easy to represent. It is acceptable to use some large number instead, such as -8888. For your pointers, be sure to be consistent in how you represent a non-existing parent. For example, in Python you can use None. You can use NULL in C++ if using actual pointers, otherwise designate some flag, like -1, if using some mapping from ints.

Submitting your work

Once you are satisfied with your program, hand it in by typing handin68 at the unix prompt.