Compgen Tool Suite

Problem Set #4

mcmillan / Version 15

Comp 555: BioAlgorithms -- Fall 2013

Problem Set #4

Issued: 11/12/2013 Due: In class 11/21/2013

Homework Information: Some of the problems are probably too long to attempt the night before the due date, so plan accordingly. No late homework will accepted. However, your lowest homework will be dropped. Feel free to work with others, but the work you hand in should be your own.

Question 1. Given a a genomic sequence, G, of length n, explain how to find the number of occurrences of a given k-mer using each of the following data structures and quantify its asymptotic performance.

a) A Hash-table of the positions of all k/2 sized substrings in G

b) A Suffix Tree of G.

c) A Suffix Array of G.

Question 2. Consider the following sets of 3D points:

(0.39, 3.3, 2.06), (0.09, 4.5, 1.63), (2.64, 3.3, 5.24), (1.64, 1.2, 3.82), (3.59, 7.5, 6.58),
(1.94, 5.6, 4.25), (3.09, 6.5, 5.87), (0.54, 4.6, 2.27), (2.04, 4.5, 4.39), (-0.16, 3.8, 1.28)

a) Compute a distance matrix between all pairs of points. Assume that Euclidean distance is used.

b) Assume that the distance between two clusters is defined as the maximal distance of any pair of their elements. Draw the cluster tree constructed by the hierarchical clustering algorithm (in page 345 of the book)

c) Assume that the distance between two clusters is defined as the average distance over all pairs. Draw the hierarchical cluster tree again. Write your observations on (b) and (c)

d) Run the K-means algorithm with k=2 using the first two points as the initial cluster representatives.

e) Run the K-means algorithm with k=2, using the last two points as the initial cluster representatives. Write your observations on (d) and (e)

Programming Problem. (Please submit code by emailing kemal@cs.unc.edu):

Write a Python function, compress(string), to run-length encode a string as follows: Replace all runs of 2 or more characters with a decimal string of the number of repeats followed by the character. Unrepeated chracters are just copied to the output. For example compress("AACGCCC$AAAAAAAAAAT") would return "2ACG3C$10AT".

Use the same genomic sequence from in the last problem set, MusChr01.fa, and write code to do the following:

Select 1000 random substrings of length 1000 from the genomic sequence, which contains no 'N's. Compress the original substring, S[i] and its Burrows-Wheeler transform BWT(S[i]+'$'). Plot a scatter plot of the points (len(compress(S[i]), len(BWT(compress(S[i]+'$'))).
Repeat the task above for substrings of length 10000.

You can make your plots in one of two ways. Using the Python package "Matplotlib" or by generating a comma separated list of points and plotting them using any spreadsheet package (Excel, etc.).

Comp 555: BioAlgorithms -- Fall 2013Problem Set #4Issued: 11/12/2013 Due: In class 11/21/2013

Comp 555: BioAlgorithms -- Fall 2013

Problem Set #4

Issued: 11/12/2013 Due: In class 11/21/2013