CS68: Lab 3

This week's lab will give you practice with common bioinformatic tools for multiple sequence alignment. Specifically, you will complete a tutorial on using ClustalX. Second, you will complete a homework set of problems that will ask you to think more closely about the algorithms we have discussed in class. You may work closely in discussing these problems with one other person, but your answers must be submitted individually. Please write the name of your co-worker at the top of the assignment. You should hand in a paper copy of your solution in class, Thursday, February 21. Supplementary materials (e.g., code) should be submitted using handin68.

ClustalX

To begin, read this tutorial on using ClustalX. I have placed two sets of sequences you can use to test out Clustal in your labs directory: DnaB.txt which is a small set of medium-length protein sequences, and sequences12.txt - a large set of myosin heavy chain proteins. You will run ClustalX by entering the following command:

$ /usr/swat/bin/clustalx-2.1/clustalx

You will probably want to increase the font size open starting the program up as the sequences are hard to read.

You can begin by using the quick start guide on your DnaB.txt file. Then, read the rest of the document testing out different parameters to see how they alter your results.

Once you complete the document and analyze both sequences, answer the following questions:

Briefly describe the trade-offs between the Fast and Slow approaches. Which option is suitable for your DnaB.txt example? How about the sequences12.txt file? Thinking of the intent of a user interested in asking biological questions, why might you want to combine both uses? That is - attempt a fast run followed by a slow run.

After producing an alignment, how do you know if you have a good alignment, or the best alignment? Is that possible to know? Describe the author's suggestion for a strategy to answer these questions

What extra information about the sequences can ClustalX utilize to improve alignment? Briefly describe why this extra domain knowledge can help aid in determining an alignment

Read the EBI's description of output results. As a biologist, how does the color coding aid you in analysis? Specifically, think of a column which has varied content (i.e., medium to high entropy) yet consistently stays the same color. How might this alter your assessment of the "conservation" of a column.

After running ClustalX one one of the input sets, head on over to the WebLogo page for producing representations of the entropy of a sequence. Load the alignment from one of your examples and choose a range of approximately 25 amino acids. Your range should include example of highly conserved regions and low-conserved regions. Save the output file and print it or electronically submit it with your solution. If you electronically submit, name the file logo.png.

As noted in class, ClustalW is the most widely used technique, but is not necessarily the most advanced method on the market. Another popular program is T-coffee. Read the original paper outlining the algorithm, available here. Discuss the paper with at least one other person in the class - I don't expect you to understand all of the details but I want you to get an understanding of the problems still around with multiple sequence alignment. With your homework, submit a 2 paragraph (about 1 page double spaced) summary. You should discuss the generaly pitfalls identified in current approaches, the high-level tactic takes to address these issues, the advantages and disadvantages of T-coffee. That is, under what scenarios would you consider using T-coffee? ClustalW?

Problem Set

You are given the following sequences:
1. CTATTAATAC
2. TATTAATAC
3. CTATTAT
4. CTATTAATC
5. ATTAATAC
Use the Star algorithm to identify a multiple sequence alignment. Pick the string that has the maximal average alignment score to select the center. To perform your alignments, use a global alignment algorithm with linear gap penalty. Matches should have a score of +1, substitutions -1, and gaps -3. You can do this by hand or write a short program. Show your work if done by hand, or submit your code as star.py. At a minimum, you should show the average/sum of alignment scores for each sequence and the full multiple sequence alignment.

Do the same for the Feng-Doolittle algorithm. You should be able to use your intermediate results from the Star Alignment. Be sure to submit your initial distance matrix (use the equation in the book to convert your alignment scores to similarity scores) and also your guide tree. Use the following values for a random alignment:
```
Srand(0,1) = -6
Srand(0,2) = -7
Srand(0,3) = -4
Srand(0,4) = -7
Srand(1,2) = -8
Srand(1,3) = -6
Srand(1,4) = -3
Srand(2,3) = -8
Srand(2,4) = -6
Srand(3,4) = -5
```
For S_max, I incorrectly stated that you should use the maximum score in the matrix. If you did this, no worries just note this on your results. As the book points out, though, the maximum score is the maximum possible score for the alignment. To calculate S_max for a pair of sequences x and y, perform Needleman-Wunsch for an alignment of x to x, and then for y to y, the finally take the average of these two scores as S_max. The max scores for the 5 sequences are respectively 10, 9, 7, 9, 8.
When updating the distance matrix, it is acceptable to either take the average distance of all pairs or to pick the minimum. So, for example, if you merge sequences x and y into one cluster, the distance between {x,y} and z is the either the average or minimum of the individual pairwise distances d_xz and d_yz. I'll assume you are using the minimum, simply state otherwise if you use the average. Also, logs normally are calculated using the natural log. If you used a different log base, just state which you used.

In class, we discussed two scoring metrics: sum of pairs (SP) and minimum entropy. Consider a column m_i that contains the following characters: A,A,A,A and another that contains A,A,A,D. Using the BLOSUM62 matrix, compute the sum of pairs and minimum entropy scores for each column. Note that I have given columns and not sequences. s(A,A) = 4 and s(A,D)=-2.

Do the same for column A,A,A,A,A versus column A,A,A,A,D. Compare your results to the previous problem. How does this support our discussion on the relative advantages/disadvantages to the scoring metrics.

In class, we discussed DNA as a linear sequence of nucleotides. Not all DNA exists in this double helix form, however. The genomes of many bacteria and even some eukaryotes (e.g., mitochondrial DNA in humans) exist in ciruclar form. That is, there is no "start" or "end" to the sequence. Devise an efficient algorithm to perform optimal local alignment on two circular DNA sequences. This problem is conceptually difficult because there is not a 0th character in the sequence. A O(nm) algorithm exists, but if you cannot think of it, describe your best attempt and its O() run time. As a hint, your solution should build off of your existing local/global alignment algorithms from class.

Come up with a solution for global alignment of circular DNA. This is more difficult so it is okay if you do not solve it, but you should be able to make progress on the problem when discussing with your lab partner.

Submitting your work

You must hand in an individual solution in class on the due date. You must also write the name of the individual you discussed problems in detail with. You may talk about high-level concepts with as many individuals as you like (for example, the paper on T-Coffee). But you are only allowed to work on the specific problems and their solutions with one person in the class. Supplementary material should be handed in using handin68.

CS68 Lab 3: Multiple Sequence Alignment