Run update21, if you haven't already, to create the cs21/labs/09 directory. Then cd into your cs21/labs/09 directory and create the python program for lab 09 in this directory (handin21 looks for your lab 09 assignments in your cs21/labs/09 directory).
$ update21 $ cd cs21/labs/09
In triangle.py, write a recursive function:
fracTriangle(window, top, left, right, color, n)that takes as parameters a GraphWin window, three Points (top, left, and right), a color that will be "black" or "white", and an integer n which is the recursive depth. The fracTriangle function should draw a triangle of color color in the graphics window and then, depending on the depth n, either return or recursively use the fracTriangle function to draw smaller triangles as described below.
Recall that every recursive function needs a base case where the
recursion ends. For your fracTriangle function, the base
case should occur when n is zero; your fracTriangle
function should use top, left, and right to draw
a single triangle of the given color, and then return.
For example, with color="black" and n=0, your
function should produce something like the single triangle below.
(Your image should not contain the text labels "top", "left", and "right".
We just put them in the image here to clarify how the triangle is
drawn.)
If n is greater than zero, your function should use top, left, and right to draw a triangle of the appropriate color and then use three recursive calls (with depth n-1) to draw smaller triangles inside the given triangle. The three recursive calls should draw triangles of the opposite color defined by the points illustrated below:
The left image below was drawn with n = 2 and
color = "black", and the right image was drawn with
n = 3 and color = "white":
Using larger depths can lead to attractive patterns, but can take a long
time! The following diagram was drawn with color="white" and
n=7:
In main(), your program should read in a recursive depth n and initial color (which must be "black" or "white") and draw the pattern of the appropriate recursive depth.
To summarize, here are some hints for the recursion:
In Lab 07, we explored the idea of clustering a set of related genetics sequences to form a hierarchical representation of the relationship between a set of species. Underlying this task is the biological concept known as homology; that is, similarity due to descent from a common ancestor. The assumptions is that the high-level of similarity between the genetic sequences of two species indicates an evolutionary relationship. Importantly, this thesis allows us to propose solutions to important questions, such as: given a novel genetic sequence (e.g., a new virus), do we already know any similar sequence? If so, we may be able to infer the function or mechanism of the novel gene. A separate question we can ask is, given a genetic sequence and a family of similar sequences, which two species are most related (similar to lab 07)?
The computational question arises: how do we detect similarity between sequences? In Lab 07, you accomplished this task by aligning sequences in several different configurations and calculating an optimal similarity score. We assumed sequences could have gaps, but only at the beginning or end of the sequence. This is a simplistic assumption, and in this lab we will answer a more difficult quesiton: how do we align two protein sequences (i.e., strings) given mutations could not only change the value of one amino acid (i.e., one character), but also amino acids could be removed or inserted anywhere in the sequence.
There are many different ways to model sequence similarity. In this lab, we will assume we are working with small sequences and are looking for a global alignment; that is, the optimal alignment of all characters in two sequences. For example, let us say that two sequences align optimally in this fashion:
CG--GATTCGAAT CGCCGATT---CTDashes represent gaps, indicating that the matching characters in the other sequence were removed or inserted from a common ancestor. Blue pairs indicate a similarity shared between the two genes, while reds indicate mismatches through either mutation or insertion/removal.
Your lab will calculate the optimal global sequence alignment between two sequences using the Needleman-Wunsch algorithm.
At a high level, Needleman-Wunsch defines the optimal alignment of two sequences seq1, seq2 of length x and y respectively, as a function Fit(x,y). The value of Fit(x,y) is defined recursively to be the best alignment of the prefix of both strings plus the alignment of the last characters. This is similar to defining the reverse word example in class. The optimal alignment is defined to be the maximum of three possibilities:
Note, these are not the function definitions in Python, but a mathemtical definition of the recursion. Read below for tips on converting this into a Python function. Here, gapPenalty is the cost of introducing a gap, similar to lab 7. This should be set by asking the user for a negative value. match is the likelihood that we would line up two amino acids, and is the calculated by using the blosum80 matrix discussed in lab 7.
As you can see, we need to calcuate the function recursively in three possible ways. The algorithm recurses, trying to find the optimal alignment for all prefixes until one of the sequences is empty. If you trace through a few steps, you'll notice they we calculate the same prefixes many times over under this recursion. As discussed in class with the Fibonacci sequence, dynamic programming can be used to prevent recalculating the same subproblem multiple times.
More concretely, in globalAlign.py, define a recursive function that corresponds to the definition of Fit(x,y) above:
needlemanWunsch(sequence1, sequence2, fitMatrix, gapPenalty)that takes as parameters two protein sequences, sequence1 and sequence2, the gap penalty, and a list of lists, fitMatrix, that stores our solved subproblem solutions. For example, the value of fitMatrix[x][y] stores the optimal alignment score between the first x characters of sequence1 and first y characters of sequence2 This function should recursive calculate the optimal alignment score between the two sequences using the definition above.
Notice that your parameters are
similar to the Fit but not exactly the same. Instead of sending in
positions x and y, you are sending in the substring of the
original proteins sequences as in the reverse.py example from
class. So, corresponding to the recursive call to Fit(x-1,y) is a
call to needlemanWunsch where sequence2 is unchanged but only
the first x-1 characters are sent for sequence1.
There are a few base cases to deal with. The first is that the subproblem has already been solved. Just as with the Fibonacci sequence solution, we know the subproblem is not solved if our matrix has a value of None. That is, if our sequences have length x and y, we should check to see if fitMatrix[x][y] has a value.
In terms of the recursion itself, we have base cases for either sequence being empty. For example, if sequence1 has 0 characters left, then we cannot recurse further. The defined alignment is simply y gaps. For example, if sequence2 is "TGCL" and sequence1 is empty (i.e., ""), then the alignment must be:
---- TGCLwhich has a score of 4 gap penalties (i.e., -32 if the gap penalty is -8). The other base case occurs in the flip scenario, where sequence2 is empty.
If the problem has not been solved before and neither sequence is empty, we have three recursive calls to implement. For case 1 above, we are calclulating the score for matching the two characters of interest. As an example, let's align the sequences "TGCL" and "AGCT". The three possibilities are 1) matching the "L" with the "T",
needlemanWunsch( "TGC", "AGC", matrix) + blosum80("L", "T")Notice the change in the two sequences (the last characters have been removed) and the characters that we send into to the match function. 2) Having the"L" of the first sequence lining up with a gap (assume the gap penalty is -8):
needlemanWunsch( "TGC", "AGCT", matrix) + -8And 3) a gap in the first sequence lining up with the "T",
needlemanWunsch( "TGCL", "AGC", matrix) + -8You must then pick the maximum of these three possibilities and don't forget to save your result before returning it!
from genetics import blosum80 from path import printPath
matrix = [None] * (x+1) for i in range(x+1): matrix[i] = [None] * (y+1)where x and y are the lengths of your two sequences.
printPath(sequence1, sequence1, gapPenalty, matrix)taking in the two sequences, a gap penalty (negative value) and the final scores matrix.
$ python globalAlign.py This program will run the Needleman-Wunsch algorithm to determine the optimal global alignment between two protein sequences. Enter file location: sequences.txt Enter gap penalty: -8 ################ RESULTS ################ Optimal Score: 25 Optimal Alignment: AGACTAGTTAC AGAC---TTAC Scores matrix: A G A C T T A C 0 -8 -16 -24 -32 -40 -48 -56 -64 A -8 5 -3 -11 -19 -27 -35 -43 -51 G -16 -3 11 3 -5 -13 -21 -29 -37 A -24 -11 3 16 8 0 -8 -16 -24 C -32 -19 -5 8 25 17 9 1 -7 T -40 -27 -13 0 17 30 22 14 6 A -48 -35 -21 -8 9 22 30 27 19 G -56 -43 -29 -16 1 14 22 30 23 T -64 -51 -37 -24 -7 6 19 22 29 T -72 -59 -45 -32 -15 -2 11 19 21 A -80 -67 -53 -40 -23 -10 3 16 18 C -88 -75 -61 -48 -31 -18 -5 8 25
-24 -25 -27 -22 -26 -21 -16 -11 -2Next, add in the matrix and see if you can produce:
################ RESULTS ################ Optimal Score: -2 Scores matrix: A G A C T T A C None None None None None None None None None A None None None None None None None None None G None None None None None None None None None A -24 None None None None None None None None C None -25 None None None None None None None T None None -27 None None None None None None A None None None -22 None None None None None G None None None None -26 None None None None T None None None None None -21 None None None T None None None None None None -16 None None A None None None None None None None -11 None C None None None None None None None None -2Here, the gap penalty was -8 using the input sequence.
$ python globalAlign.py This program will run the Needleman-Wunsch algorithm to determine the optimal global alignment between two protein sequences. Enter file location: zinc.txt Enter gap penalty: -8 ################ RESULTS ################ Optimal Score: -21 Optimal Alignment: TYHMCQFHCRYVNNHSGEKLYECNERSKAFSCPSHLQCHKRRQIGEKTHEHNQCGKAFPT -YE-CN-QC-------G-KAF-A-QH----S--S-LKCHYRTHIGEKPYECNQCGKAFSKI turned off print the matrix since it is too big to fit on the screen in a nice format. Hmm, that alignment seems pretty bad. There is a whole field of algorithms for performing local alignment - that is, finding the optimal regional alignment of two protein sequences. In fact, the most highly cited paper for all of the 1990s was for the BLAST algorithm, a powerful local-alignment alignment that is utilized in almost every biology laboratory in the world.
Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.