Compgen Tool Suite

Problem set #3

mcmillan / Version 15

Comp 555: BioAlgorithms -- Fall 2013

Problem Set #3

Issued: 10/3/2013 Due: In class 10/24/2013

Homework Information: Some of the problems are probably too long to attempt the night before the due date, so plan accordingly. No late homework will accepted. However, your lowest homework will be dropped. Feel free to work with others, but the work you hand in should be your own.

Question 1. Consider the following k-mer spectrum:

{'ACATG', 'AGGCC', 'ATGCC', 'CAGGC', 'CATGC', 'CCGTG', 'CCTGC', 'CCTGT',
'CGTGT', 'CTACA', 'CTGCT', 'CTGTG', 'GCCGT', 'GCCTG', 'GCTAC', 'GGCCT',
'GTGCC', 'TACAT', 'TCAGG', 'TGCCG', 'TGCCT', 'TGCTA', 'TGTGC'}

Draw the graph where each k-mer in the given spectrum is a vertex and with a directed edges where the k-1 suffix of a source vertex matches the k-1 prefix of the destination vertex. Within this graph find a minimal-length path or paths that includes every vertex at least once. For each path found, give the sequence or sequences that correspond.

Question 2. Use the set of k-mers given in question 1 to construct a graph whose edges-- rather than vertices -- correspond to the given k-mers as follows. In this new graph the vertices correspond to the (k-1)-mers, with a directed edge for each k-mer whose k-1 prefix matches the source vertex sequence and k-1 suffix matches the destination vertex.

Find a Eulerian path or paths in the graph that you constructed, and give the corresponding sequence for each path. Compare these answers to those found in question 1.

What are the advantages/disadvantages of two graphs constructions and path finding problems described in question 1 and question 2?

Programming Problem. Download the following FASTA sequence MusChr01.fa. A FASTA file contains one or more sequences in a plain text file. Prior to each sequence is a header line that begins with the character '>'. This line conatins a series of fields decribing the seqeunce. The following lines are the actual sequence data with returns inserted to increase reability. The following Python code can be used to parse a FASTA file containing a single sequence.


def readFasta(filename):
    """ Reads a sequence in Fasta format """
    fp = open(filename, 'rb')
    header = ""
    seq = ""
    while True:
        line = fp.readline()
        if (line == ""):
            break
        if (line.startswith('>')):
            header = line[1:].strip()
        else:
            seq += line.strip().upper()
    fp.close()
    return (header, seq)

Addendum: It has been reported that the code given above is slow on windows machines. If you experience slow code, try the following alternative:


def readFasta(filename):
    """ Reads a sequence in Fasta format """
    fp = open(filename, 'rb')
    header = ""
    seq = ""
    while True:
        line = fp.readline()
        if (line == ""):
            break
        if (line.startswith('>')):
            header = line[1:].strip()
        else:
            seq = fp.read().replace('\n','')
            seq = seq.replace('\r','')          # for windows
            break
    fp.close()
    return (header, seq)

Write a program to do the following two tasks using Hash Tables (Python dictionarys). Find the shortest k-mer or k-mers that appear exactly once in the given sequence. Find the longest repeated k-mer in the given sequence.

Note: In addtion to 'A', 'C', 'G' and 'T', the sequence may contain 'N's. In your nswers above, do not consider any k-mers that containing one or more 'N's.

Comp 555: BioAlgorithms -- Fall 2013Problem Set #3Issued: 10/3/2013 Due: In class 10/24/2013

Comp 555: BioAlgorithms -- Fall 2013

Problem Set #3

Issued: 10/3/2013 Due: In class 10/24/2013