Finding TFBS Motifs in our Lifetime

  • Recall from last time that the Brute Force approach for finding a common 10-mer motif common to 10 sequences of length 80 bases was going to take up roughly 30,000 years
  • Today well consider alternative and non-obvious approaches for solving this problem
  • We will trade one old man (us) for another (an Oracle)

Note: Our Python/Jupyter study session will be held in SN014 on 1/28 from 5:00pm to 6:30pm

1


Recall from last Lecture

  • The following set of 10 sequences have an embedded noisy motif, TAGATCCGAA.

 1 tagtggtcttttgagtgTAGATCTGAAgggaaagtatttccaccagttcggggtcacccagcagggcagggtgacttaat
 2 cgcgactcggcgctcacagttatcgcacgtttagaccaaaacggagtTGGATCCGAAactggagtttaatcggagtcctt
 3 gttacttgtgagcctggtTAGACCCGAAatataattgttggctgcatagcggagctgacatacgagtaggggaaatgcgt
 4 aacatcaggctttgattaaacaatttaagcacgTAAATCCGAAttgacctgatgacaatacggaacatgccggctccggg
 5 accaccggataggctgcttatTAGGTCCAAAaggtagtatcgtaataatggctcagccatgtcaatgtgcggcattccac
 6 TAGATTCGAAtcgatcgtgtttctccctctgtgggttaacgaggggtccgaccttgctcgcatgtgccgaacttgtaccc
 7 gaaatggttcggtgcgatatcaggccgttctcttaacttggcggtgCAGATCCGAAcgtctctggaggggtcgtgcgcta
 8 atgtatactagacattctaacgctcgcttattggcggagaccatttgctccactacaagaggctactgtgTAGATCCGTA
 9 ttcttacacccttcttTAGATCCAAAcctgttggcgccatcttcttttcgagtccttgtacctccatttgctctgatgac
10 ctacctatgtaaaacaacatctactaacgtagtccggtctttcctgatctgccctaacctacaggTCGATCCGAAattcg

2


Consensus Scoring Function

  • We developed a consensus scoring function to overcome the noise, but we needed to apply it an exponential number, $O(N^t)$ of times!
  • Here's the scoring function...
In [103]:
def Score(s, DNA, k):
    """ 
        compute the consensus SCORE of a given k-mer 
        alignment given offsets into each DNA string.
            s = list of starting indices, 1-based, 0 means ignore
            DNA = list of nucleotide strings
            k = Target Motif length
    """
    score = 0
    for i in xrange(k):
        # loop over string positions
        cnt = dict(zip("acgt",(0,0,0,0)))
        for j, sval in enumerate(s):
            # loop over DNA strands
            base = DNA[j][sval+i] 
            cnt[base] += 1
        score += max(cnt.itervalues())
    return score

3


And here's the Score we're looking for...

In [104]:
seqApprox = [
    'tagtggtcttttgagtgtagatctgaagggaaagtatttccaccagttcggggtcacccagcagggcagggtgacttaat',
    'cgcgactcggcgctcacagttatcgcacgtttagaccaaaacggagttggatccgaaactggagtttaatcggagtcctt',
    'gttacttgtgagcctggttagacccgaaatataattgttggctgcatagcggagctgacatacgagtaggggaaatgcgt',
    'aacatcaggctttgattaaacaatttaagcacgtaaatccgaattgacctgatgacaatacggaacatgccggctccggg',
    'accaccggataggctgcttattaggtccaaaaggtagtatcgtaataatggctcagccatgtcaatgtgcggcattccac',
    'tagattcgaatcgatcgtgtttctccctctgtgggttaacgaggggtccgaccttgctcgcatgtgccgaacttgtaccc',
    'gaaatggttcggtgcgatatcaggccgttctcttaacttggcggtgcagatccgaacgtctctggaggggtcgtgcgcta',
    'atgtatactagacattctaacgctcgcttattggcggagaccatttgctccactacaagaggctactgtgtagatccgta',
    'ttcttacacccttctttagatccaaacctgttggcgccatcttcttttcgagtccttgtacctccatttgctctgatgac',
    'ctacctatgtaaaacaacatctactaacgtagtccggtctttcctgatctgccctaacctacaggtcgatccgaaattcg']
In [105]:
print Score([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], seqApprox, 10)
89
In [106]:
%timeit Score([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], seqApprox, 10)
10000 loops, best of 3: 50.2 µs per loop

So even at a blazing $50.6 \mu s$ we'll need many lifetimes to compute the $70^{10}$ scores

4


Pruning Trees

  • One method for reducing the computational cost of a search algorithm is to prune the space of permutations that could not possibly lead to a better answer than the current best answer.
  • Pruning decisions are based on solutions to subproblems that appear early on and offer no hope
  • How does this apply to our Motif finding problem?

Consider any permutation of offsets that begins with the indices [25, 63, 10, 43, ....]. Just based on the first 4 indices the largest possible score is 17 + 6*10 = 87, which assumes that all 6 remaining strings match perfectly at all 10 positions.

 DNA[0][25:35]        a  a  g  g  g  a  a  a  g  t
 DNA[1][63:73]        g  t  t  t  a  a  t  c  g  g
 DNA[2][10:20]        a  g  c  c  t  g  g  t  t  a
 DNA[3][43:53]        t  t  g  a  c  c  t  g  a  t
                   a [2, 1, 0, 1, 1, 2, 1, 1, 1, 1]
       Profile     c [0, 0, 1, 1, 1, 1, 0, 1, 0, 0]
                   g [1, 1, 2, 1, 1, 1, 1, 1, 2, 1]
                   t [1, 2, 1, 1, 1, 0, 2, 1, 1, 2]
                     [2, 2, 2, 1, 1, 2, 2, 1, 2, 2] Score = 17

If the best answer so far is 89, there is no need to consider the 706 offset permuations that start with these 4 indices.

5


Search Trees

  • Our standard method for enumerating permutations can be considered as a traversal of leaf nodes in a search tree
  • Suppose after checking the first few offsets could know already that any score of children nodes could not beat the best score seen so far?

6


Branch-and-Bound Motif Search

  • Since each level of the tree goes deeper into search, discarding a prefix discards all following branches
  • This saves us from looking at $(N – k + 1)^{t-depth}$ leaves
  • Note our enumeration of tree-branches is depth-first
  • We'll formulate of trimming algorithm as a recursive algorithm

5


A Recursive Exploration of a Search Tree

In [71]:
bestAlignment = []
prunedPaths = 0

def exploreMotifs(DNA,k,path,bestScore):
    """ Recursively finds a k-length motif in the
        list of DNA sequences by exploring all paths
        in a search tree. Each recursive call extends 
        path by one index, once the path reaches the 
        number of DNA strings a score is computed. """
    global bestAlignment, prunedPaths
    depth = len(path)
    t = len(DNA)
    if (depth == t):
        s = Score(path,DNA,k)
        if (s > bestScore):
            bestAlignment = [p for p in path]
            return s
        else:
            return bestScore
    else:
        # Before going down this path further let's 
        # consider if an optimistic best score can 
        # possibly beat the best score so far
        if (depth > 1):
            OptimisticScore = k*(t-depth) + Score(path,DNA,k)
        else:
            OptimisticScore = k*t
        if (OptimisticScore < bestScore):
            prunedPaths = prunedPaths + 1
            return bestScore
        else:
            for s in xrange(len(DNA[depth])-k+1):
                newPath = tuple([i for i in path] + [s])
                bestScore = exploreMotifs(DNA,k,newPath,bestScore)
            return bestScore

def BranchAndBoundMotifSearch(DNA, k):
    """ Finds a k-length motif within a list of DNA sequences"""
    global bestAlignment, prunedPaths
    bestAlignment = []
    prunedPaths = 0
    bestScore = 0
    for s in xrange(len(DNA[0])-k+1):
        bestScore = exploreMotifs(DNA,k,[s],bestScore)
    print bestAlignment, bestScore, prunedPaths

%time BranchAndBoundMotifSearch(seqApprox[0:6], 10)
[17, 47, 18, 33, 21, 0] 53 8615931
CPU times: user 6min 35s, sys: 4.61 s, total: 6min 40s
Wall time: 6min 39s

7


Observations

  • For our problem instance, Branch-and-Bound Motif finding is significantly faster
    • It found a motif in the first 6 strings in less time than the Brute Force approach found a solution in the first 4 strings
    • More than $70^2 \approx 5000$ times faster
    • It did so by trimming more than 8 Million paths
    • Trimming added extra calls to Score (basically doubling the worst-case number of calls), but ended up saving even more hopeless calls along longer paths.
    • In practice, Branch-and-Bound, significantly improved the average performance
  • Does this improve the worst-case performance from $O(tkN^t)$?
    • What if all of our motifs were found at the end of each DNA string?
    • How do we avoid these worse case data sets?
    • Randomize the search-tree tranversal order

8


We Need A Fresh New Approach

  • Enumerating every possible permuation of motif positions is not getting us the speed we want.
  • Let's try another tried and tested apprach to algorithm design, mixing up the problem
    • Suppose that some Oracle could tell us what the motif is
    • How long would it take us to find its position in each string?
    • We could compute the Hamming Distance from our given motif to the k-mer at every position of each DNA sequence and keep track of the smallest distance and its position on each string.
    • These positions are our best guess of where the motif can be found on each string
  • Let's this approach to scanning-and-scoring a given motif.

9


Scanning-and-Scoring a Motif

In [73]:
def ScanAndScoreMotif(DNA, motif):
    totalDist = 0
    bestAlignment = []
    k = len(motif)
    for seq in DNA:
        minHammingDist = k+1
        for s in xrange(len(seq)-k+1):
            HammingDist = sum([1 for i in xrange(k) if motif[i] != seq[s+i]])
            if (HammingDist < minHammingDist):
                bestS = s
                minHammingDist = HammingDist
        bestAlignment.append(bestS)
        totalDist += minHammingDist
    return bestAlignment, totalDist    
In [74]:
print ScanAndScoreMotif(seqApprox, "tagatccgaa")
%timeit ScanAndScoreMotif(seqApprox, "tagatccgaa")
([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11)
1000 loops, best of 3: 1.76 ms per loop

Wow, we can test over 500 motifs per second!

10


Scan-and-Score Motif Performance

  • There are $t (N - k + 1)$ positions to test the motif, and each test requires $k$ tests.
  • So each scan is $O(t N k)$.
  • So where where do we get candidate motifs?
  • Can we try all of them? There are $4^{10} = 1048576$ in our example.
    • Do the math, 1048576 motifs × 2 mS ≈ 35 mins
    • Not fast, but not a lifetime
  • This approach is called a Median String Motif Search
  • Recall from last Lecture that a string that minimizes Hamming distance is like finding a middle or median string that is closer to all instances than the instances are to each other.

11


Let's Do It

In [75]:
import itertools

def MedianStringMotifSearch(DNA,k):
    """ Consider all possible 4**k motifs"""
    bestAlignment = []
    minHammingDist = k*len(DNA)
    kmer = ''
    for pattern in itertools.product('acgt', repeat=k):
        motif = ''.join(pattern)
        align, dist = ScanAndScoreMotif(DNA, motif)
        if (dist < minHammingDist):
            bestAlignment = [s for s in align]
            minHammingDist = dist
            kmer = motif
    return bestAlignment, minHammingDist, kmer

%time MedianStringMotifSearch(seqApprox,10)
CPU times: user 31min 46s, sys: 16.1 s, total: 32min 2s
Wall time: 32min 18s
Out[75]:
([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')

Should we declare victory and move on? Do you find anything uncomfortable about an algorithm that requires $O(t N k 4^k)$ steps?

12


Notes on Median String Motif Search

  • There are similarities between finding and alignment with minimal Hamming Distance and maximizing a Motif's consensus score.
  • In fact, if instead of counting mismatches as in the code fragment:
    HammingDist = sum([1 for i in xrange(k) if motif[i] != seq[s+i]])
    we had counted matches
    Matches = sum([1 for i in xrange(k) if motif[i] == seq[s+i]])
    and found the maximum(TotalMatches) instead of the min(TotalHammingDistance) we would be using the same measure as Score().
  • Thus, we expect MedianStringMotifSearch() to give the same answer as either BruteForceMotifSearch() or BranchAndBoundMotifSearch().
  • However, the $4^k$ term raises some concerns. If k were instead 20, then we'd have to Scan-and-Score more than $10^12$ times. Another not-in-my-lifetime algorithm
  • We can also apply the Branch-and-Bound approach to the Median string method, but, as before it would only improve the average case.

13


Other ways to guess the Motif?

  • If we knew that the motif that we are looking for was contained somewhere in our DNA sequences we could test the $(N-k+1)t$ motifs from our DNA, giving a $O(N^2t^2)$ algorithm.
    • Unfortunately, as you may recall our motif did not appear actually appear in our data.
  • You could keep track of a few good motif candidates using a manageable and perhaps random subsets of the given DNA sequences, and use them as your candidate motifs.

14


Let's try considering only Motifs seen in the DNA

In [87]:
def ContainedMotifSearch(DNA,k):
    """ Consider only motifs from the given DNA sequences"""
    motifSet = set()
    for seq in DNA:
        for i in xrange(len(seq)-k+1):
            motifSet.add(seq[i:i+k])
    print "%d Motifs in our set" % len(motifSet)
    bestAlignment = []
    minHammingDist = k*len(DNA)
    kmer = ''
    for motif in motifSet:
        align, dist = ScanAndScoreMotif(DNA, motif)
        if (dist < minHammingDist):
            bestAlignment = [s for s in align]
            minHammingDist = dist
            kmer = motif
    return bestAlignment, minHammingDist, kmer

%time ContainedMotifSearch(seqApprox,10)
709 Motifs in our set
CPU times: user 1.2 s, sys: 41.6 ms, total: 1.24 s
Wall time: 1.22 s
Out[87]:
([17, 31, 18, 33, 21, 0, 46, 70, 16, 65], 17, 'tagatccaaa')

Not exactly the motif we were looking for (off by a 'g'),

[17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa'
but boy was it fast! Where's a good Oracle when you need one?

16


Perhaps the Consensus Score matrix can provide insight

If we call Score([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], seqApprox, 10)


DNA[0][17:27]    t  a  g  a  t  c  t  g  a  a
DNA[1][31:41]    t  a  g  a  c  c  a  a  a  a
DNA[2][18:28]    t  a  g  a  c  c  c  g  a  a
DNA[3][33:43]    t  a  a  a  t  c  c  g  a  a
DNA[4][21:31]    t  a  g  g  t  c  c  a  a  a
DNA[5][ 0:10]    t  a  g  a  t  t  c  g  a  a
DNA[6][46:56]    c  a  g  a  t  c  c  g  a  a
DNA[7][70:80]    t  a  g  a  t  c  c  g  t  a
DNA[8][16:26]    t  a  g  a  t  c  c  a  a  a
DNA[9][65:75]    t  c  g  a  t  c  c  g  a  a
              a [0, 9, 1, 9, 0, 0, 1, 3, 9,10]
              c [1, 1, 0, 0, 2, 9, 8, 0, 0, 0]
              g [0, 0, 9, 1, 0, 0, 0, 7, 0, 0]
              t [9, 0, 0, 0, 8, 1, 1, 0, 1, 0]
                [9, 9, 9, 9, 8, 9, 8, 7, 9,10]  Score = 87
Consensus        t  a  g  a  t  c  c  g  a  a   Our motif!

Any Ideas?

17


Contained Consensus Motif Search

In [97]:
def Consensus(s, DNA, k):
    """ 
        compute the consensus k-Motif of an alignment 
        given offsets into each DNA string.
            s = list of starting indices, 1-based, 0 means ignore
            DNA = list of nucleotide strings
            k = Target Motif length
    """
    consensus = ''
    for i in xrange(k):
        # loop over string positions
        cnt = dict(zip("acgt",(0,0,0,0)))
        for j, sval in enumerate(s):
            # loop over DNA strands
            base = DNA[j][sval+i] 
            cnt[base] += 1
        consensus += max(cnt.iteritems(), key=lambda tup: tup[1])[0]
    return consensus

def ContainedConsensusMotifSearch(DNA,k):
    bestAlignment, minHammingDist, kmer = ContainedMotifSearch(DNA,k)
    motif = Consensus(bestAlignment,DNA,k)
    newAlignment, HammingDist = ScanAndScoreMotif(DNA, motif)
    return newAlignment, HammingDist, motif

%time ContainedConsensusMotifSearch(seqApprox,10)
709 Motifs in our set
CPU times: user 1.26 s, sys: 20.8 ms, total: 1.28 s
Wall time: 1.29 s
Out[97]:
([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')

Now we're cooking!

18


Dad, are we there yet?

We got the answer that we were looking for, but

  • How can we be sure it will always give the correct answer?
    • Our other methods were exhaustive, they examined every possibility
    • This method considers only a subset of solutions, picks the best one in a greedy fashion
      • What it there had been ties amoung the candidate motifs?
      • What if the consensus score (87% matches) had been lower
      • Would we, should we, be satisfied?
  • It's one thing to be greedy, and another to be both greedy and biased
    • Our method is greedy in that it considers only the best contained motif, greedy methods are subject to falling into local minimums
    • Since consider only subsequences as motifs we introduce bias
  • Note that Consensus can generate motif not seen in our data

19


A randomized approach to Motif finding

  • One way to avoid bias and local minima is to introduce randomness
  • We can generate candidate motifs from our data by treating it as distribution
    • likely motif candidates from this distribution are those generated by Consensus
    • Consensus strings can be tested by Scan-and-Score and their alignments lead to new consensus strings
    • Eventually, we should converge to some local minimal answer
  • To avoid finding a local minimum, we try several random starts, and search for the best score amongst all these starts.
  • A randomized algorithm does not guarantee an optimal solution. Instead it promises a good/plausible answer on average, and it is not susceptible to a worse-case data sets as our greedy/biased method was.

20


Code for Randomized Motif Search

In [102]:
import random

def RandomizedMotifSearch(DNA,k):
    """ Searches for a k-length motif that appears 
    in all given DNA sequences. It begins with a
    random set of candidate consensus motifs 
    derived from the data. It refines the motif
    until a true consensus emerges."""
    
    # Seed motifs from random alignments
    motifSet = set()
    for i in xrange(500):
        randomAlignment = [random.randint(0,len(DNA[j])-k) for j in xrange(len(DNA))]
        motif = Consensus(randomAlignment, DNA, k)
        motifSet.add(motif)

    bestAlignment = []
    minHammingDist = k*len(DNA)
    kmer = ''
    testSet = motifSet.copy()
    while (len(testSet) > 0):
        print len(motifSet),
        nextSet = set()
        for motif in testSet:
            align, dist = ScanAndScoreMotif(DNA, motif)
            # add new motifs based on these alignments
            newMotif = Consensus(align, DNA, k)
            if (newMotif not in motifSet):
                nextSet.add(newMotif)
            if (dist < minHammingDist):
                bestAlignment = [s for s in align]
                minHammingDist = dist
                kmer = motif
        testSet = nextSet.copy()
        motifSet = motifSet | nextSet
    return bestAlignment, minHammingDist, kmer

%time RandomizedMotifSearch(seqApprox,10)    
500 754 835 854 860CPU times: user 1.65 s, sys: 38.5 ms, total: 1.69 s
Wall time: 1.69 s
Out[102]:
([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')

Randomized algorithms should be restarted multiple times to insure a stable solution.

In [101]:
for i in xrange(10):
    print RandomizedMotifSearch(seqApprox,10)
500 760 830 848 850 ([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')
500 735 806 819 820 ([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')
500 750 817 835 839 841 ([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')
500 735 821 838 839 ([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')
500 734 804 814 818 ([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')
500 739 802 817 820 ([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')
500 749 824 840 844 845 ([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')
500 749 821 834 837 ([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')
500 746 820 837 839 ([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')
499 738 816 832 834 835 ([17, 47, 18, 33, 21, 0, 46, 70, 16, 65], 11, 'tagatccgaa')

21


Lessons Learned

  • We can find Motifs in our lifetime
    • Practical exhaustive search algorithm for small k, MedianStringMotifSearch()
    • Practical fast algorthim RandomizedMotifSearch(DNA,k)
  • Three algorithm design approaches "Branch-and-Bound", "Greedy", and "Randomized"
  • Reversing the objective, pretending that you know the answer, and validating it
  • The power of randomness
    • Not susceptable to worse case data
    • Avoids local minimums that plague some greedy algorithms

22


In [ ]: