The goal of this week's lab is to implement two popular algorithms for tree inference, Neighbor Joining (a distance-based method) and Weighted Parsimony (a character-based method), and to visualize the results. You may work with a lab partner on this lab and may use either Python or C++ for your implementation.
We will be doing the following problems in-lab, using the matrix titled inclass.dist in your labs/NJTree directory:
Neighbor Joining is the most popular distance-based method for inferring tree structures, largely because it can handle non-ultrametric data. You will write a program NJTree that outputs a list of the edges formed in the tree. Your program should take a single command-line argument: the name of a file containing your distance values. An example run for a Python program would be:
$ ./NJTree.py distances

Your distance file will be in the format:

A B distance_AB

with one line for each pair of taxa in the set. Since distances are symmetric, the ordering of the sequences is not important.
To name your clusters, simply merge the names of the two sub-clusters with a hyphen in between, e.g. merging A and B yields A-B.
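For example, here is a minimal sketch of one way to read the distance file into a nested dictionary keyed by cluster name, together with the hyphen-merging rule above. The function names load_distances and merge_name and the dictionary layout are my own choices, not requirements:

def load_distances(filename):
    """Read 'name1 name2 distance' lines into a symmetric nested dict."""
    D = {}
    with open(filename) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            a, b, d = parts[0], parts[1], float(parts[2])
            D.setdefault(a, {})[b] = d
            D.setdefault(b, {})[a] = d  # distances are symmetric
    return D

def merge_name(a, b):
    """Name a merged cluster by joining the two sub-cluster names with a hyphen."""
    return a + "-" + b

With this layout, D["A"]["B"] gives the distance between clusters A and B, and the same dictionary can continue to hold the merged clusters as the algorithm runs.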
Note: The expected output has changed since the original posting. Your program should output a list of the edges in your tree, in a format similar to the distance file:

A A-B distance_A,A-B
B A-B distance_B,A-B

In addition, your program should output a visualization of your tree using the instructions below.
To visualize your tree, you will use the pygraphviz package. Import it with:

import pygraphviz as pgv

To create a graph G and add nodes and edges:
G = pgv.AGraph()  # constructs a graph object
G.add_node("A")
G.add_node("B")
G.add_edge("A", "B", label="1.0", len=1.0)

You will add a node for each sequence in the set. Then, every time a cluster is created by merging two smaller clusters, you will create a node for that cluster and add edges to the two merged clusters. For each edge, both the label and the len should be the distance d_ik between the two clusters i and k. I suggest setting the len field to a minimum value of 1.0 to keep nodes from overlapping. When you finish the main loop, you will merge your last two clusters by adding an edge. Lastly, you will output your graph to a file using the command:
G.draw("tree.png", prog="neato")

You can use your favorite image viewer to view your results.
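If you like, a small helper can keep the graph bookkeeping out of your main loop. This is only a sketch of the approach described above, and the name add_merge_to_graph is my own:

def add_merge_to_graph(G, left, right, merged, d_left, d_right):
    """Add the merged cluster node and its two edges to the pygraphviz graph.

    label records the true branch length; len is clamped to at least 1.0,
    as suggested above, so that neato does not overlap the nodes.
    """
    G.add_node(merged)
    G.add_edge(left, merged, label=str(d_left), len=max(d_left, 1.0))
    G.add_edge(right, merged, label=str(d_right), len=max(d_right, 1.0))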
Using the in-lab example (run update68 to obtain inclass.dist):

$ ./NJTree.py inclass.dist
A A-B 6.0
B A-B 1.0
A-B A-B-C 3.0
C A-B-C 2.0
D A-B-C 5.0

The resulting image: In-class Example
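For reference, here is a minimal sketch of the neighbor joining main loop under the assumptions above (nested-dict distances from load_distances, hyphenated cluster names from merge_name). The function and variable names are my own, and your own structure may differ:

def neighbor_join(D):
    """Run neighbor joining on a nested-dict distance matrix.

    D is modified in place; returns a list of (child, parent, length) edges.
    """
    edges = []
    while len(D) > 2:
        n = len(D)
        # Total distance from each cluster to all of the others.
        r = {i: sum(D[i].values()) for i in D}
        # Choose the pair (i, j) minimizing the NJ criterion
        # Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
        i, j = min(((a, b) for a in D for b in D if a < b),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # Branch lengths from i and j to the new internal node.
        d_i = 0.5 * D[i][j] + (r[i] - r[j]) / (2.0 * (n - 2))
        d_j = D[i][j] - d_i
        new = merge_name(i, j)
        edges.append((i, new, d_i))
        edges.append((j, new, d_j))
        # Distances from the new cluster to every remaining cluster.
        D[new] = {}
        for k in list(D):
            if k in (i, j, new):
                continue
            d = 0.5 * (D[i][k] + D[j][k] - D[i][j])
            D[new][k] = d
            D[k][new] = d
            del D[k][i], D[k][j]
        del D[i], D[j]
    # Join the last two clusters with a single edge.
    a, b = list(D)
    edges.append((a, b, D[a][b]))
    return edges

Each (child, parent, length) tuple can then be printed as one line of output and handed to the graph helper above.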
For the next set of questions, take a look at the following image produced by the neighbor joining algorithm for studying HIV. The numbers in the nodes map to various strains of HIV (human immunodeficiency virus) and SIV (simian immunodeficiency virus). You can take a look at the mapping and original distance matrix.
AIDS (Acquired Immune Deficiency Syndrome) is caused by two variants of HIV in humans: HIV-1 and HIV-2. You will use the tree to answer questions about the evolution of these two variants, focusing in particular on their relationship with various strains of SIV.
For the second part of your lab, you will implement the Weighted Parsimony method detailed in class. Your program will take in a tree, a scoring function, and the sequences of the leaf nodes, and will output a score for the tree and the labels inferred for the ancestral nodes.
$ ./weightedParsimony.py tree substitution_matrix leaf_sequences
Leaf Sequences: Contains a list of all of the K sequences that appear at the leaves of the phylogenetic tree. Each line will be in the format:
identifier: sequence

The identifier should be the same name you use when defining the tree (below). Using the example we have from class, I have labeled each leaf sequence as L1 through L4 in weightedParsimony/inclass.seq:
L1: A
L2: C
L3: T
L4: G
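One possible way to read this file (the name load_sequences is my own choice):

def load_sequences(filename):
    """Read 'identifier: sequence' lines into a dict of leaf sequences."""
    seqs = {}
    with open(filename) as f:
        for line in f:
            if ":" not in line:
                continue  # skip blank lines
            name, seq = line.split(":", 1)
            seqs[name.strip()] = seq.strip()
    return seqs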
Tree File: The tree file will define the topology of the phylogenetic tree you are attempting to score. The first line specifies the root. Each line after that is in the format:
child parent

For example, the simple tree from class (weightedParsimony/inclass.tree) would be:
N1
N2 N1
N3 N1
L1 N2
L2 N2
L3 N3
L4 N3

Note that the root has been labeled N1. It has two children, N2 and N3. The leaves connect to either N2 or N3.
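Here is a sketch of one way to read the tree file, returning the root plus a child-to-parent map, and then inverting that map to get each node's children (load_tree and children_of are my own names):

def load_tree(filename):
    """Read the tree file; return the root name and a child -> parent dict."""
    with open(filename) as f:
        lines = [line.split() for line in f if line.strip()]
    root = lines[0][0]                      # first line names the root
    parent = {child: par for child, par in lines[1:]}
    return root, parent

def children_of(parent):
    """Invert the child -> parent map into a parent -> [children] map."""
    kids = {}
    for child, par in parent.items():
        kids.setdefault(par, []).append(child)
    return kids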
Weight Matrix: Specifies the substitution value for each pair of states possible at each position in the sequence. This has the exact same format as the distance matrix in the Neighbor Joining algorithm; that is, each line contains a pair of characters followed by their substitution score. You can find the example from class in weightedParsimony/inclass.wgt:
A C 9
A G 4
A T 3
C G 4
C T 4
G T 2

The diagonal is not defined in the file, but you should set it to 0 when you create your substitution function.
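A minimal sketch of a substitution-cost lookup built from this file, with the diagonal filled in as 0 (load_weights is my own name, and an A/C/G/T alphabet is assumed):

def load_weights(filename):
    """Read 'x y score' lines into a symmetric cost dict with a zero diagonal."""
    cost = {}
    with open(filename) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue
            x, y, w = parts[0], parts[1], float(parts[2])
            cost[(x, y)] = w
            cost[(y, x)] = w
    for x in "ACGT":
        cost[(x, x)] = 0.0  # diagonal is not in the file; substituting a state for itself is free
    return cost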
Output: Your program should report the score of the tree and the inferred label for each node, one per line, in the format:

identifier: sequence

similar to the input sequence file. The in-lecture example would appear as follows (I have shown intermediate values for debugging purposes; you are only required to output the data below the asterisks):
$ ./weightedParsimony.py inclass.tree inclass.wgt inclass.seq
L4 :
  A : (100000.0, None, None)
  C : (100000.0, None, None)
  T : (100000.0, None, None)
  G : (0.0, None, None)
L2 :
  A : (100000.0, None, None)
  C : (0.0, None, None)
  T : (100000.0, None, None)
  G : (100000.0, None, None)
L3 :
  A : (100000.0, None, None)
  C : (100000.0, None, None)
  T : (0.0, None, None)
  G : (100000.0, None, None)
L1 :
  A : (0.0, None, None)
  C : (100000.0, None, None)
  T : (100000.0, None, None)
  G : (100000.0, None, None)
N1 :
  A : (14.0, A, T)
  C : (15.0, C, T)
  T : (9.0, T, T)
  G : (10.0, G, G)
N2 :
  A : (9.0, A, C)
  C : (9.0, A, C)
  T : (7.0, A, C)
  G : (8.0, A, C)
N3 :
  A : (7.0, T, G)
  C : (8.0, T, G)
  T : (2.0, T, G)
  G : (2.0, T, G)
*********************************
Score: 9.0
Inferred States:
L4: G
L2: C
L3: T
L1: A
N1: T
N2: T
N3: T
$ ./weightedParsimony.py long.tree long.wgt long.seq
Score: 11.0
Inferred States:
Aardvark: CAGGTA
Chimp: CGGGTA
Dog: TGCACT
Bison: CAGACA
Elephant: TGCGTA
C3: CGGGTA
C2: TGCGTA
C1: CAGGTA
C4: CGCGTA

An output of intermediate values can be found here.
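For the dynamic program itself, here is a minimal sketch of weighted parsimony (the Sankoff algorithm) for one character position, plus a wrapper that sums scores across positions. It assumes the parsing sketches above (kids maps each internal node to its two children, seqs maps names to sequences, cost is the substitution lookup); the function names are my own, and the per-state table mirrors the (cost, left_choice, right_choice) triples shown in the debugging output:

INF = 100000.0   # the "infinity" used for impossible leaf states in the class example
STATES = "ACGT"  # sketch assumes a DNA alphabet and binary internal nodes

def sankoff_position(root, kids, chars, cost):
    """Weighted parsimony (Sankoff) for a single character position.

    chars maps each leaf name to its character at this position.
    Returns (score, {node: inferred state}).
    """
    table = {}  # node -> {state: (cost, left_choice, right_choice)}

    def fill(node):
        if node not in kids:  # leaf: cost 0 for the observed state, INF otherwise
            table[node] = {s: (0.0 if s == chars[node] else INF, None, None)
                           for s in STATES}
            return
        left, right = kids[node]
        fill(left)
        fill(right)
        table[node] = {}
        for s in STATES:
            bl = min(STATES, key=lambda x: cost[(s, x)] + table[left][x][0])
            br = min(STATES, key=lambda x: cost[(s, x)] + table[right][x][0])
            total = (cost[(s, bl)] + table[left][bl][0]
                     + cost[(s, br)] + table[right][br][0])
            table[node][s] = (total, bl, br)

    fill(root)

    # Trace the minimizing choices back down from the root.
    assign = {root: min(STATES, key=lambda s: table[root][s][0])}
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in kids:
            continue
        left, right = kids[node]
        _, bl, br = table[node][assign[node]]
        assign[left], assign[right] = bl, br
        stack.extend([left, right])
    return table[root][assign[root]][0], assign

def weighted_parsimony(root, kids, seqs, cost):
    """Sum the per-position scores and concatenate the inferred labels."""
    length = len(next(iter(seqs.values())))
    total, labels = 0.0, {}
    for pos in range(length):
        chars = {name: seq[pos] for name, seq in seqs.items()}
        score, assign = sankoff_position(root, kids, chars, cost)
        total += score
        for node, state in assign.items():
            labels[node] = labels.get(node, "") + state
    return total, labels

Because each position is scored independently, the total score is just the sum of the per-position minima, and the stored triples in table correspond to the intermediate values printed above, which can help with debugging.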
When you are finished with your lab, submit your code using handin68. You should submit, at a minimum: