Compgen Tool Suite

Problem set 1

kemal / Version 19

Comp 555: BioAlgorithms -- Fall 2013

Problem Set #1

Issued: 9/3/2013 Due: In class 9/26/2013

Homework Information: Some of the problems are probably too long to attempt the night before the due date, so plan accordingly. No late homework will accepted. However, your lowest homework will be dropped. Feel free to work with others, but the work you hand in should be your own.

Answers to Questions 1-5 should to be submitted as hardcopy at classtime on 9/26/2013. Answer to programming question should be emailed to kemal@cs.unc.edu with the subject "COMP 555 PS1"

Question 1. What is the expected length of shortest subsequence that appears nowhere else in the human chromosome of 1 which is 249 million bp long? What are the assumptions of your answer, if any?

How many times would one expect to see a given 12-base subsequence appear in the same chromosome, assuming all bases are independent and equally likely?

Question 2. Calculate the entropy of all possible dinucleotides in the sequence in this file. What are the most and least frequent dinucleotides? Ungraded optional question: What might be the reason for this imbalance?

Question 3. Design an algorithm for computing n-th Fibonacci number that requires less than O(n) time. Hint: you probably want use the close form expression but note that computing it naively still requires O(n) time since each multiplication is a single operation.

Question 4. Reconsider the incorrect "Change Problem" algorithm from Lecture 3. In how many of the 99 possible inputs for M does it given an incorrect answer for the denominations c = (35, 25, 10, 5, 1)?

Suppose that you are desigining coinage for a new nation, but you are limited to minting only 5 denominations. How do you choose them such that the average number of coins is minimized for all values of change from 1 to 99? Will the greedy algorithm work with your proposed coinage?

Question 5. Design a brute force algorithm for Probed Partial Digest Problem described below and suggest a branch-and-bound approach to improve its performance

Probed Partial Digest Problem: In this method DNA is partially digested with a restriction enzyme, thus generating a collectionof DNA fragments between every two cutting sites. After this, a labeled probe that attaches tothe DNA between two cutting sites is hybridized to the partially digested DNA, and the sizesof fragments to which the probe hybridizes are measured. In contrast to the Partial Digest Problem, where the input consists of all partial fragments, the input for the PPDP consists of all partial fragments thatcontain a given point (this point corresponds to the position of the labeled probe). The problemis to reconstruct the positions of the sites from the multiset of measured lengths.

In the PPDP, we assume that the labeled probe is hybridized at position 0 and that A is the setof restriction sites with negative coordinates while B is the set of restriction sites with positivecoordinates. The probed partial digest experiment provides the multiset {b−a : a ∈ A, b ∈ B}and the problem is to ﬁnd A and B given this set.

Programming Problem. Proteins are sequence of aminoacids. We know that consecutive aminoacids in a protein are interacting with each other. For instance for a protein (of aminoacid sequence) ARTGW, A and R, R and T, T and G, G and W are pairs of aminoacids that interact with each other. Non-consecutive aminoacids might be interacting with each other. For instance, R and W can be interacting with each other in the last mentioned protein. Suppose that we are given all pairs of aminoacids that interact with each other in a specific protein. We are also given list of patterns which are defined to be chains of interacting aminoacids. Since these patterns give us clues about the protein function, we want to find all patterns that exists in the given protein. Design and implement an efficient algorithm for this problem.

Input file 1:(sample file) Aminoacid sequence followed by 1-based position of interacting pairs of aminoacids. i'th line starts with number i-1 denoting aminocid i-1. Other numbers in the same line denotes the positions of aminoacids that this aminoacid (i-1) interacts with. You can assume that this file include interactions between consecutive aminoacids as well as other ones.

Input file 2 (sample file): List of patterns. One pattern in each line

Output: List of patterns that exist in the given protein. One pattern in each line