CS 21: Lab 09

Run update21, if you haven't already, to create the cs21/labs/09 directory. Then cd into your cs21/labs/09 directory and create the python program for lab 09 in this directory (handin21 looks for your lab 09 assignments in your cs21/labs/09 directory).

$ update21
$ cd cs21/labs/09

Part 1. Draw a triangle fractal pattern

In triangle.py, write a recursive function:

  fracTriangle(window, top, left, right, color, n)

that takes as parameters a GraphWin window, three Points (top, left, and right), a color that will be "black" or "white", and an integer n which is the recursive depth. The fracTriangle function should draw a triangle of color color in the graphics window and then, depending on the depth n, either return or recursively use the fracTriangle function to draw smaller triangles as described below.

The base case: n = 0

Recall that every recursive function needs a base case where the recursion ends. For your fracTriangle function, the base case should occur when n is zero; your fracTriangle function should use top, left, and right to draw a single triangle of the given color, and then return. For example, with color="black" and n=0, your function should produce something like the single triangle below. (Your image should not contain the text labels "top", "left", and "right". We just put them in the image here to clarify how the triangle is drawn.)

The recursive case: n > 0

If n is greater than zero, your function should use top, left, and right to draw a triangle of the appropriate color and then use three recursive calls (with depth n-1) to draw smaller triangles inside the given triangle. The three recursive calls should draw triangles of the opposite color defined by the points illustrated below:

top, midLeft, and midRight
midLeft, left, and midBottom
midRight, midBottom, and right

For reference, the diagram below was produced with depth n=1 and color="white". Your program should not produce any text labels; we just put the labels here to clarify the various points you should use when recursively calling the fracTriangle function. Also note that the center area (marked here in French) was not drawn using the fracTriangle function; it is just the white area that was left when the other three black triangles were drawn.

The left image below was drawn with n = 2 and color = "black", and the right image was drawn with n = 3 and color = "white":

Using larger depths can lead to attractive patterns, but can take a long time! The following diagram was drawn with color="white" and n=7:

The main function

In main(), your program should read in a recursive depth n and initial color (which must be "black" or "white") and draw the pattern of the appropriate recursive depth.

To summarize, here are some hints for the recursion:

For the base case (when n equals 0), fracTriangle should draw a single triangle and make no recursive calls.
For non-base cases, fracTriangle should draw a single triangle and make three recursive calls. Each recursive call should:
- correspond to one of the triangles (using midLeft, midRight, and midBottom) described above
- reverse the color
- use a recursion depth that is one smaller than the current recursion depth
Our example triangles were drawn using an 800x400 pixel window and the initial points top as Point(600,50), left as Point(50,350), and right as Point(750,350).
Your main function should call fracTriangle just once. All other triangles should be drawn using recursive calls.

Part 2. Global Sequence Alignment

Introduction

In Lab 07, we explored the idea of clustering a set of related genetics sequences to form a hierarchical representation of the relationship between a set of species. Underlying this task is the biological concept known as homology; that is, similarity due to descent from a common ancestor. The assumptions is that the high-level of similarity between the genetic sequences of two species indicates an evolutionary relationship. Importantly, this thesis allows us to propose solutions to important questions, such as: given a novel genetic sequence (e.g., a new virus), do we already know any similar sequence? If so, we may be able to infer the function or mechanism of the novel gene. A separate question we can ask is, given a genetic sequence and a family of similar sequences, which two species are most related (similar to lab 07)?

The computational question arises: how do we detect similarity between sequences? In Lab 07, you accomplished this task by aligning sequences in several different configurations and calculating an optimal similarity score. We assumed sequences could have gaps, but only at the beginning or end of the sequence. This is a simplistic assumption, and in this lab we will answer a more difficult quesiton: how do we align two protein sequences (i.e., strings) given mutations could not only change the value of one amino acid (i.e., one character), but also amino acids could be removed or inserted anywhere in the sequence.

There are many different ways to model sequence similarity. In this lab, we will assume we are working with small sequences and are looking for a global alignment; that is, the optimal alignment of all characters in two sequences. For example, let us say that two sequences align optimally in this fashion:

CG--GATTCGAAT
CGCCGATT---CT

Dashes represent gaps, indicating that the matching characters in the other sequence were removed or inserted from a common ancestor. Blue pairs indicate a similarity shared between the two genes, while reds indicate mismatches through either mutation or insertion/removal.

Recursion using Needleman-Wunsch

Your lab will calculate the optimal global sequence alignment between two sequences using the Needleman-Wunsch algorithm.

At a high level, Needleman-Wunsch defines the optimal alignment of two sequences seq1, seq2 of length x and y respectively, as a function Fit(x,y). The value of Fit(x,y) is defined recursively to be the best alignment of the prefix of both strings plus the alignment of the last characters. This is similar to defining the reverse word example in class. The optimal alignment is defined to be the maximum of three possibilities:

Fit(x-1, y-1) + match(seq1[x-1], seq2[y-1])
(i.e., the match between the last character of both sequences plus the best possible score for aligning all characters before x and y).

Fit(x-1, y) + gapPenalty
(i.e., the score for inserting gap into seq2 plus the score for the alignment of the rest of the sequence).

Fit(x, y-1) + gapPenalty
(i.e., similar, but inserting a gap into sequence seq1).

Note, these are not the function definitions in Python, but a mathemtical definition of the recursion. Read below for tips on converting this into a Python function. Here, gapPenalty is the cost of introducing a gap, similar to lab 7. This should be set by asking the user for a negative value. match is the likelihood that we would line up two amino acids, and is the calculated by using the blosum80 matrix discussed in lab 7.

As you can see, we need to calcuate the function recursively in three possible ways. The algorithm recurses, trying to find the optimal alignment for all prefixes until one of the sequences is empty. If you trace through a few steps, you'll notice they we calculate the same prefixes many times over under this recursion. As discussed in class with the Fibonacci sequence, dynamic programming can be used to prevent recalculating the same subproblem multiple times.

More concretely, in globalAlign.py, define a recursive function that corresponds to the definition of Fit(x,y) above:

   needlemanWunsch(sequence1, sequence2, fitMatrix, gapPenalty)

that takes as parameters two protein sequences, sequence1 and sequence2, the gap penalty, and a list of lists, fitMatrix, that stores our solved subproblem solutions. For example, the value of fitMatrix[x][y] stores the optimal alignment score between the first x characters of sequence1 and first y characters of sequence2 This function should recursive calculate the optimal alignment score between the two sequences using the definition above.

Notice that your parameters are similar to the Fit but not exactly the same. Instead of sending in positions x and y, you are sending in the substring of the original proteins sequences as in the reverse.py example from class. So, corresponding to the recursive call to Fit(x-1,y) is a call to needlemanWunsch where sequence2 is unchanged but only the first x-1 characters are sent for sequence1.

The base case: n = 0

There are a few base cases to deal with. The first is that the subproblem has already been solved. Just as with the Fibonacci sequence solution, we know the subproblem is not solved if our matrix has a value of None. That is, if our sequences have length x and y, we should check to see if fitMatrix[x][y] has a value.

In terms of the recursion itself, we have base cases for either sequence being empty. For example, if sequence1 has 0 characters left, then we cannot recurse further. The defined alignment is simply y gaps. For example, if sequence2 is "TGCL" and sequence1 is empty (i.e., ""), then the alignment must be:

   ----
   TGCL

which has a score of 4 gap penalties (i.e., -32 if the gap penalty is -8). The other base case occurs in the flip scenario, where sequence2 is empty.

The recursive case: n > 0

If the problem has not been solved before and neither sequence is empty, we have three recursive calls to implement. For case 1 above, we are calclulating the score for matching the two characters of interest. As an example, let's align the sequences "TGCL" and "AGCT". The three possibilities are 1) matching the "L" with the "T",

   needlemanWunsch( "TGC", "AGC", matrix) + blosum80("L", "T")

Notice the change in the two sequences (the last characters have been removed) and the characters that we send into to the match function. 2) Having the"L" of the first sequence lining up with a gap (assume the gap penalty is -8):

   needlemanWunsch( "TGC", "AGCT", matrix) + -8

And 3) a gap in the first sequence lining up with the "T",

   needlemanWunsch( "TGCL", "AGC", matrix) + -8

You must then pick the maximum of these three possibilities and don't forget to save your result before returning it!

The main function

Your main() method should be very simple:

Add the following lines at the top of the program:

    from genetics import blosum80
    from path import printPath

Ask the user for a file name to open (you can assume it is a valid name)
Open the file and read in the two sequences
Ask the user for a gap penalty (gaps should never be given a reward, ensure the user gives a legal value)
Initialize the dynamic programming matrix. You should define this as its own function and it will look similar to:
```
  matrix = [None] * (x+1)
  for i in range(x+1):
    matrix[i] = [None] * (y+1)
```
where x and y are the lengths of your two sequences.
Make a call to your global sequence alignment function
Print out the optimal alignment score (not the matrix, the actual final score)
Call the already defined printPath method to print out the actual optimal alignment. A sample looks similar to:
```
    printPath(sequence1, sequence1, gapPenalty, matrix)
```
taking in the two sequences, a gap penalty (negative value) and the final scores matrix.
Print out the entire matrix in a nice format to see all of the final values. You should define a function for this purpose.

printPath has been defined for you, but feel free to try a crack at implementing it yourself. You start at the bottom right corner, and until you reach the origin (0,0), you determine what that maximum value was at every step. If it was case 1, you append the two characters to the current alignment. If it was a gap, you append a dash to the gap sequence and the aligned character to the other. Feel free to email me for some tips! A sample run of your program:

$ python globalAlign.py 

    This program will run the Needleman-Wunsch algorithm to determine
    the optimal global alignment between two protein sequences.
  
Enter file location: sequences.txt 
Enter gap penalty: -8 

################ RESULTS ################

Optimal Score: 25
Optimal Alignment:
	AGACTAGTTAC
	AGAC---TTAC

Scores matrix:
         A    G    A    C    T    T    A    C 
     0   -8  -16  -24  -32  -40  -48  -56  -64
A   -8    5   -3  -11  -19  -27  -35  -43  -51
G  -16   -3   11    3   -5  -13  -21  -29  -37
A  -24  -11    3   16    8    0   -8  -16  -24
C  -32  -19   -5    8   25   17    9    1   -7
T  -40  -27  -13    0   17   30   22   14    6
A  -48  -35  -21   -8    9   22   30   27   19
G  -56  -43  -29  -16    1   14   22   30   23
T  -64  -51  -37  -24   -7    6   19   22   29
T  -72  -59  -45  -32  -15   -2   11   19   21
A  -80  -67  -53  -40  -23  -10    3   16   18
C  -88  -75  -61  -48  -31  -18   -5    8   25

Tips

While you don't need a large top down design, it is still a good idea to think about this lab modularly. Implement small portions and build up.

For the recursive case, you may want to first implement the solution with no dynamic program and no gaps. Can you create a recursive function that calculates the alignment score with no gaps? If so, you may get an alignment that prints out something similar to:

-24 -25 -27 -22 -26 -21 -16 -11 -2

Next, add in the matrix and see if you can produce:

################ RESULTS ################

Optimal Score: -2

Scores matrix:
         A    G    A    C    T    T    A    C 
  None None None None None None None None None
A None None None None None None None None None
G None None None None None None None None None
A  -24 None None None None None None None None
C None  -25 None None None None None None None
T None None  -27 None None None None None None
A None None None  -22 None None None None None
G None None None None  -26 None None None None
T None None None None None  -21 None None None
T None None None None None None  -16 None None
A None None None None None None None  -11 None
C None None None None None None None None   -2

Here, the gap penalty was -8 using the input sequence.

You may want to figure out the base cases first. Try calling your function with empty sequences, and then try with small sequences of one or two characters.
To print the matrix, be sure to use string formatting. You do not have to print the actual sequence characters on the side, but it sure makes it look pretty if you do!
Want another example? You should have a file zinc.txt if you run update21. Here is an alignment of two human zinc finger proteins:
```
$ python globalAlign.py 

    This program will run the Needleman-Wunsch algorithm to determine
        the optimal global alignment between two protein sequences.

Enter file location: zinc.txt 
Enter gap penalty: -8 

################ RESULTS ################

Optimal Score: -21
Optimal Alignment:
        TYHMCQFHCRYVNNHSGEKLYECNERSKAFSCPSHLQCHKRRQIGEKTHEHNQCGKAFPT
        -YE-CN-QC-------G-KAF-A-QH----S--S-LKCHYRTHIGEKPYECNQCGKAFSK
```
I turned off print the matrix since it is too big to fit on the screen in a nice format. Hmm, that alignment seems pretty bad. There is a whole field of algorithms for performing local alignment - that is, finding the optimal regional alignment of two protein sequences. In fact, the most highly cited paper for all of the 1990s was for the BLAST algorithm, a powerful local-alignment alignment that is utilized in almost every biology laboratory in the world.

Submit

Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.

CS21 Lab 9: Global Sequence Alignment

Part 1. Draw a triangle fractal pattern

Part 2. Global Sequence Alignment

Introduction

Tips