Compgen Tool Suite

Problem Set #4

mcmillan / Version 12

Comp 590-087/790-087: BioAlgorithms -- Fall 2011

Problem Set #4

Issued: 11/10/2011 Due: In class 11/29/2010

Homework Information: Some of the problems are probably too long to attempt the night before the due date, so plan accordingly. No late homework will accepted. However, your lowest homework will be dropped. Feel free to work with others, but the work you hand in should be your own.

Doublet starred (**) parts of questions are optional and worth up to 10% extra credit.

Question 1. In a Sequencing By Hybridization (SBH) study the following spectrum of 3-mers is collected:

S = { 'aca', 'att', 'cat', 'ctg', 'ctt', 'gct', 'tct', 'tga', 'tgc', 'ttc', 'ttg', 'ttt' }

A. Draw the Hamiltonian graph for S.

B. Draw the Eulerian graph for S.

C. Give all possible DNA sequences whose spectrum is S.

Question 2.

A. Estimate a theoretical mass spectrum for the protein WARNING using the following mass approximations (you can ignore ionization):

Amino Acid	Abbreviation	Mass
Alanine	A	89
Glycine	G	75
Isoleucine	I	131
Asparagine	N	132
Arginine	R	174
Tryptophan	W	204

Contrast your spectrum in part A with the following spectrums:

S₁ = {75, 204, 207, 293, 338, 467, 542, 671, 716, 802, 805, 934, 1009}

S₂ = {75, 204, 207, 293, 338, 425, 512, 599, 644, 730, 733, 862, 937}

B. How does your spectrum compare to the two given in terms of shared peak counts?

C. Use spectral convolution to estimate the number of sequence differences between your theoretical spectrum and the two given.

D. Estimate the residue substitutions and their position that explain the differences seen in S1 and S2.

Question 3. Compute a suffix array for the megabase DNA segment at this link. Use it, to answer the following questions about the sequence:

A. What is the shortest unique substring (excluding any substring ending with the special end-of-string character)?

B. What is the longest repeated substring (overlaps are allowed)?

**C. Find the longest match to a given target string, s, with at most one mismatch (you can assume that the length s is at least 20).

Feel free to start with the suffix array code given in class. You are welcome to speed it up if you like. It took about 40 mins for the argsort() to complete on my machine. The performance of findFirst() and findLast() were fine. Turn in your code to implement parts A and B as well as an answer. If you choose to do part C for extra credit, turn in your code and several examples.

Posted by leonardo on 2011-11-16 02:43:35 (references Version 9)

In Question 3.A, what does "excluding any string at the end of the segment" mean? For example, if the DNA fragment is AACCCTTTG, does that mean that we should output AA as the shortest unique substring, instead of G? What is the purpose of that?

Posted by mcmillan on 2011-11-22 02:04:34 (references Version 12)

Recall a special end-of-string character, '$', is used to terminate every string. This leads to a series of short substrings all ending with '$' in the suffix array. The point of this question is to find the shortest nonrepeated substring that is not made distinct from the substrings immedidately preceding or following it in the suffix array by this '$'. Thus, the 'G' in the previous comment is the shortest nonrepeated substring of AACCCTTTG$, because it does not rely on the '$' to make it unique. However, 'T' is not the shortest nonrepeated substring of AAACCCTTT$; instead either of {AC, CT} are.

Comp 590-087/790-087: BioAlgorithms -- Fall 2011 Problem Set #4Issued: 11/10/2011 Due: In class 11/29/2010

Posted by leonardo on 2011-11-16 02:43:35 (references Version 9)

Posted by mcmillan on 2011-11-22 02:04:34 (references Version 12)

Comp 590-087/790-087: BioAlgorithms -- Fall 2011

Problem Set #4

Issued: 11/10/2011 Due: In class 11/29/2010