Compgen Tool Suite

Problem Set #5

mcmillan / Version 16

Comp 590-087/790-087: BioAlgorithms -- Fall 2011

Problem Set #5

Issued: 11/21/2011 Due: In class 12/6/2010

Homework Information: Some of the problems are probably too long to attempt the night before the due date, so plan accordingly. No late homework will accepted. However, your lowest homework will be dropped. Feel free to work with others, but the work you hand in should be your own.

Doublet starred (**) parts of questions are optional and worth up to 10% extra credit.

Question 1. Clustering

Consider the following distance matrix.

	A	B	C	D	E
A	0	22	16	12	16
B	22	0	18	26	10
C	16	18	0	20	12
D	12	26	20	0	20
E	16	10	12	20	0

Is this distance matrix additive? Justify your answer.
Apply the ADDITIVEPHYLOGENY algorithm descibed on page 364 to the given matrix. Draw the final tree and the step-by-step results in a manner similar to Figure 10.14 (Page 363).
Apply UPGMA to the given distance matrix and draw the resulting dendrogram. How does it compare to the tree derived in part B?
Describe an efficient algorithm for determining the value of δ in step 5 of ADDITIVEPHYLOGENY.

Question 2. Tree construction

A. How many canonical single-character Small Parsimony Problems are there for the following tree?

(Hint: Consider that each problem falls into one of 4 classes. It has either 4, 3, 2, or 1 distinct leaf labels. Notice that the order of labels in the 1st and 2nd positions does not matter. Likewise, for the 3rd and 4th positions. Also swapping the 1st and 2nd labels with the 3rd and 4th does not change the problem)

B. Give an example of each canonical Small Parsimony Problem from part A, and find its solution.

C. A parisimony score of at least n-1 is required to generate n distinct leafs. How many of the canonical Small Parsimony Problems for this tree have higher parsimony scores?

Question 3. Hidden Markov Models

Assume that a genome is composed of two types of segments, normal sequence and CG islands with the following base compostions and mean segment sizes:

	A	T	C	G	Mean length
Normal (N)	0.3	0.3	0.2	0.2	10
CG-island (I)	0.1	0.1	0.4	0.4	5

Draw an HMM diagram encapsulating the given data.

Given the observed sequence TGCC, what is the most likely series of hidden-segment states?

Given the observed sequence TGCCG, what is the most likely series of hidden-segment states?

What is the most likely hidden-segment state at the first 'C' in the sequence 'TGCCG'?