Comparing Sequences

By Miguel Andrade at English Wikipedia

1


Sequence Similarity

  • A common problem in Biology
Insulin Protein Sequence
Human
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
Dog
MALWMRLLPLLALLALWAPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEDLQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLENYCN
Cat
MAPWTRLLPLLALLSLWIPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEDLQGKDAELGEAPGAGGLQPSALEAPLQKRGIVEQCCASVCSLYQLEHYCN
Pig
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAENPQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN
  • All similar, but how similar, and how do you measure similarity?
  • Uses
    • To establish a phylogeny
    • To identify functional or conserved components of the sequence

2


Hand Alignments

  • Not that long ago, many aligments were done by hand
Human : MALWMRLLPLLALLALWGPdPAaAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQ_____________GSLQPLALEGs_LQKRGIVEQCCTSICSLYQLENYCN
        ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||             |||||||||||||||||||||||||||||||||||||
  Dog : MALWMRLLPLLALLALWAPAPtRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREvEDLQvrDVELaG_APGeGGLQPLALEGA_LQKRGIVEQCCTSICSLYQLENYCN
        ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||  |||| | ||| ||||||||| | |||||||||||||||||||||||||
  Cat : MApWtRLLPLLALLsLWiPAPtRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEDLQgkDaEL_GeAPGaGGLQPsALE_APLQKRGIVEQCCaSvCSLYQLEHYCN
        ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||  |||| | ||  |||||||||||| ||||||||||||||||||||||||
  Pig : MALWtRLLPLLALLAlWAPAPAqAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEnpQagaVEL_Gggl__GGLQaLALEGpP_QKRGIVEQCCTSICSLYQLENYCN
                               AFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAE                             QKRGIVEQCC SICSLYQLENYCN
  • Long conserved regions are shown below
  • Solution strategy?
  • Is this a well defined problem?
    • Is there an optimal or best solution
    • Did we find it
  • By the way, this is an easy case. Within vertebrates, the amino acid sequence of insulin is strongly conserved.

3


The Alignment Game

Let's Simplify the problem a bit by considering only 2 sequences, and make the problem statement more precise by estabishing rules as if it were a game.

  • Rules:

    • You must remove all characters from both sequences
    • There are 3 possible moves at any point in the game.
    • Each move removes at least one character from one of the two given strings
    • Pressing [Match] removes one left-most character from both sequences
      • You get 1 point if the characters match, otherwise you get 0 points
    • Pressing [Del] removes the left-most character from the top sequence
      • You lose 1 point
    • Pressing [Ins] removes the left-most character from the bottom sequence
      • You lose 1 point
    • Your point total is allowed to go negative
  • Objective: Get the most points

In [155]:
%%javascript

var Insulin = {
    "Human" : "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN",    
    "Dog" : "MALWMRLLPLLALLALWAPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEDLQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLENYCN",
    "Cat" : "MAPWTRLLPLLALLSLWIPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEDLQGKDAELGEAPGAGGLQPSALEAPLQKRGIVEQCCASVCSLYQLEHYCN",
    "Pig" : "MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAENPQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN"
};

var score = 0;
var topIndex = 0;
var botIndex = 0;

function Match() {
    var top = Insulin[document.getElementById("top").value];
    var bot = Insulin[document.getElementById("bot").value];
    score += (top[topIndex] == bot[botIndex]) ? 1 : 0;
    topIndex += 1;
    botIndex += 1;    
    document.getElementById("topseq").innerHTML = top.substring(topIndex,topIndex+24);
    document.getElementById("botseq").innerHTML = bot.substring(botIndex,botIndex+24);
    document.getElementById("scoreboard").innerHTML = score;    
}

function Delete() {
    var top = Insulin[document.getElementById("top").value];
    score -= 1;
    topIndex += 1;    
    document.getElementById("topseq").innerHTML = top.substring(topIndex,topIndex+24);
    document.getElementById("scoreboard").innerHTML = score;    
}

function Insert() {
    var bot = Insulin[document.getElementById("bot").value];
    score -= 1;
    botIndex += 1;    
    document.getElementById("botseq").innerHTML = bot.substring(botIndex,botIndex+24);
    document.getElementById("scoreboard").innerHTML = score;    
}


function onStart() {
    var row, cell, button;
    var board = document.getElementById("board");
    var table = document.createElement("table");
    var tbody = document.createElement("tbody");
    
    topIndex = 0;
    botIndex = 0;
    var topseq = document.createElement("pre");
    topseq.id = "topseq";
    topseq.style = "color: blue; font-size: 50px; line-height: 100%;"
    topseq.innerHTML = Insulin[document.getElementById("top").value].substring(topIndex,topIndex+24);
    
    var botseq = document.createElement("pre");
    botseq.id = "botseq";
    botseq.style = "color: blue; font-size: 50px; line-height: 100%;"
    botseq.innerHTML = Insulin[document.getElementById("bot").value].substring(topIndex,topIndex+24);
    
    row = document.createElement("tr");
    cell = document.createElement("td");
    cell.style = "padding: 5px;";
    cell.rowSpan = 2;
    button = document.createElement("input");
    button.style = "height: 90px;";
    button.type = "button";
    button.value = "Match";
    button.addEventListener("click", Match, false);
    cell.appendChild(button);
    row.appendChild(cell);
    cell = document.createElement("td");
    button = document.createElement("input");
    button.style = "height: 40px;";
    button.type = "button";
    button.value = "DEL";
    button.addEventListener("click", Delete, false);
    cell.appendChild(button);
    row.appendChild(cell);
    cell = document.createElement("td");
    cell.style = "padding-left: 20px;"
    cell.appendChild(topseq);
    row.appendChild(cell);
    tbody.appendChild(row);

    row = document.createElement("tr");
    cell = document.createElement("td");
    button = document.createElement("input");
    button.style = "height: 40px;";
    button.type = "button";
    button.value = "INS";
    button.addEventListener("click", Insert, false);
    cell.appendChild(button);
    row.appendChild(cell);
    cell = document.createElement("td");
    cell.appendChild(botseq);
    cell.style = "padding-left: 20px;"
    row.appendChild(cell);
    tbody.appendChild(row);
    table.appendChild(tbody);

    board.innerHTML = "";
    board.appendChild(table);
    score = 0;
    document.getElementById("scoreboard").innerHTML = score;
}

function selection(name, check) {
    var element = document.createElement("select");
    element.id = name;
    for (var key in Insulin) {
        var option = document.createElement("option");
        option.value = key;
        option.appendChild(document.createTextNode(key));
        option.selected = (key == check);
        element.appendChild(option);
    }
    return element;
}

element.append(selection("top", "Human"));
element.append("   ");
element.append(selection("bot", "Dog"));
element.append("   ");

var start = document.createElement("input");
start.type = "button";
start.value = "Start";
start.addEventListener("click", onStart, false);
element.append(start);

element.append("&nbsp;&nbsp;&nbsp; <b>Score:</b> ");
var scoreboard = document.createElement("span");
scoreboard.id = "scoreboard";
scoreboard.innerHTML = score;
element.append(scoreboard);

var board = document.createElement("div");
board.id = "board";
board.style = "margin: 20px; height: 160px;"
element.append(board);

// Hidden-Cell

4


How do you get the highest possible score?

  • The solution my not be unique

  • How many presses?

    • Minimum moves = Max(len(top), len(bot))
    • Maximum moves = len(top) + len(bot)
  • How many possible moves?

    • Less than $O(3^{len(top) + len(bot)})$

  • How big are these for our problem instance?

    • len(Human) = 98, len(dog) = 110
    • $3^{208} \approx 1.73 \times 10^{99}$, almost a googol (not a google)
  • What algorithm is solves this problem

    • Make each move by considering only a short horizon following the current aligment thus far

5


There is an effcient solution

  • It relies on a rather suprising idea

    • The best score can be found for the len(top) and len(bot) strings by finding the best score for every pair of substrings len(top[0:n]) and len(bot[0:m]) for all values of n up to len(top) and m up to len(bot)
    • Finding this solution requires only $O(len(top)len(bot))$ steps
    • It also requires a table of size $Max(len(top),len(bot))$
  • But before we solve this problem, let's look at another related related problem

  • Finding a best city tour on a Manhattan grid

6


Manhattan Tourist Problem (MTP)

Imagine seeking a path from a given source to given destination in a Manhattan-like city grid that maximizes the number of attractions (\*) passed. With the following caveat– at every step you must make progress towards the goal. We treat the city map as a graph, with a vertices at each intersection, and weighted edges along each block. The weights are the number of attractions along each block.

7


Manhattan Tourist Game

Goal: Find the maximum weighted path in a grid.

Input: A weighted grid G with two distinct vertices, one labeled source and the other labeled destination

Output: A shortest path in G from source to destination with the greatest weight

  • There are many shortest paths that go north 7 block and east 5 blocks
  • Of those paths, which sees the most sites?

8


MTP: A Greedy Algorithm Is Not Optimal

9


MTP: Observations

  • There are limited number of ways to reach any destination

    • For example, in our grid, one can reach the desitination node, (n,m), from either the north, (n,m-1), or the west (n-1,m).
    • for each of those routes there is a known number of sites to see, so the best path is:

    $$Score(n,m) = Max(Score(n-1,m)+Edge(n-1,m), Score(n,m-1)+Edge(n,m-1))$$

  • Why is there only one edge per intersection? Because only one direction makes progress to our goal
  • This rule applies recursively with the base case

    $$Score(0,0) = 0$$

  • We could write this strategu as a recursive algorithm, but it would still not be effcient. Why?

10


A New Solution Strategy

Dynamic Programming is a technique for computing recurrence relations efficiently by storing partial or intermediate results

Three keys to constructing a dynamic programming solution:

  1. Formulate the answer as a recurrence relation
  2. Consider all instances of the recurrence at each step (In our cae this means all paths that lead to a vertex).
  3. Order evaluations so you will always have precomputed the needed partial results

11


MTP Dynamic Program Solution

The solution may not be unique, but it will have the best possible score

12


MTP Dynamic Programming as a Strategy

  • Instead of solving the Manhattan Tourist problem directly, (i.e. the path from (0,0) to (n,m)) we will solve a more general problem: find the longest path from (0,0) to any arbitrary vertex (i,j).

  • If the longest path from (0,0) to (n,m) passes through some vertex (i,j), then the path from (0,0) to (i,j) must be the longest. Otherwise, you could increase the weight along your path by changing it.

13


MTP: Dynamic Program

  • Calculate optimal path score for every vertex in the graph between our source and destination
  • Each vertex’s score is the maximum of the prior vertices score plus the weight of the connecting edge in between

14


MTP: Dynamic Program Continued

15


MTP: Dynamic Program Continued

16


MTP: Dynamic Program Continued

17


MTP: Dynamic Program Continued

18


MTP: Dynamic Program Continued

  • Once the destination node (intersection) is reached, we’re done.
  • Our table will have the answer of the maximum number of attractions stored in the entry associated with the destination.
  • We use the links back in the table to recover the path. (Backtracking)

19


MTP: Recurrence

Computing the score for a point (i,j) by the recurrence relation:

The running time is n x m for a n by m grid

  • (You visit all intersections once, and performed 2 tests)

(n = # of rows, m = # of columns)

20


Manhattan Is Not A Perfect Grid

  • Easy to fix. Just adds more recursion cases.
  • The score at point B is given by:

21


Other Ways to Traversing the Manhattan Grid

  • We chose to evaluate our table in a particular order. Uniform distances from the source (all points one block away, then 2 blocks, etc.)
  • Other strategies:
    • Column by column
    • Row by row
    • Along diagonals
  • This choice can have performance implications

22


Next Time

  • Return to biology
  • Solving sequence alignments using Dynamic Programming

23


In [145]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
    if (code_show){
        $("div.input:contains('Hidden-Cell')").hide();
    } else {
        $("div.input:contains('Hidden-Cell')").show();
    }
    code_show = !code_show;
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Toggle Hidden Cells"></form>''')
Out[145]: