1
2
u: ATATATAT v: TATATATA
3
u: ATATATAT- v: -TATATATA
4
Levenshtein (1966) introduced the notion of an “edit distance” between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) to transform one string into the other. (But, he gave no solution)
d(v,w) = Minimum number of elementary operations to transform v → v
Computing Hamming distance is a trivial task
Computing edit distance is non-trivial
5
TGCATAT → ATCCGAT in 5 steps TGCATAT → (DELETE last T) TGCATA → (DELETE last A) TGCAT → (INSERT A at front) ATGCAT → (SUBSTITUTE C for 3rd G) ATCCAT → (INSERT G before last A) ATCCGAT (Done)
What is the edit distance? 5?
6
TGCATAT → ATCCGAT in 4 steps TGCATAT → (INSERT A at front) ATGCATAT → (DELETE 6th T) ATGCAAT → (SUBSTITUTE G for 5th A) ATGCGAT → (SUBSTITUTE C for 3rd G) ATCCGAT (Done)
Is 4 the minimum edit distance? 3?
7
8
9
0 1 2 2 3 4 5 6 7 7
v A T _ G T T A T _
w A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7
Path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
10
0 1 2 2 3 4 5 6 6 7
v A T _ G T T A _ T
w A T C G _ T A C _
0 1 2 3 4 4 5 6 7 7
Path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,4), (5,5), (6,6), (6,7), (7,7)
11
0 0 0 0 0 0 1 2 3 4 5 6 7 7
v _ _ _ _ _ A T G T T A T _
w A T C G T A _ _ _ _ _ _ T
0 1 2 3 4 5 6 6 6 6 6 6 6 7
Path: (0,0), (0,1), (0,2), (0,3), (0,4), (0,5), (1,6),
(2,6), (3,6), (4,6), (5,6), (6,6), (7,6), (7,7)
12
13
Initialize 1st row and 1st column to all zeroes.
Evaluate recursion for next row and/or next column
Continue recursion for next row and/or next column
Then one more row and/or column
And so on...
And so on...
Getting closer
Until we reach the last row and column
We reach the end, which corresponds to an LCS of length 5
Our answer includes both an optimal score, and a path back to find the alignment
14
Let's see how well the code matches the approach we sketched out...
from numpy import *
def findLCS(v, w):
score = zeros((len(v)+1,len(w)+1), dtype="int32")
backt = zeros((len(v)+1,len(w)+1), dtype="int32")
for i in xrange(1,len(v)+1):
for j in xrange(1,len(w)+1):
# find best score at each vertex
if (v[i-1] == w[j-1]):
score[i,j], backt[i,j] = max((score[i-1,j-1] + 1, 3),
(score[i-1,j],1),
(score[i,j-1],2))
else:
score[i,j], backt[i,j] = max((score[i-1,j],1),
(score[i,j-1],2))
return (score, backt)
s, b = findLCS("ATGTTAT","ATCGTAC")
print "score =\n", s
print "\nbacktrack =\n", b
15
In our example we used arrows {↓, →, \}, which were represented in our matrix as {1,2,3} respectively. This numbering is arbitrrary, except that it does break ties in our implementation (matches > w deletions > w insertions).
Next we need code that finds a path from the end of our strings to the beginning using our arrow matrix
16
We can write a simple recursive routine to return along the path of arrows that led to our best score.
def LCS(b,v,i,j):
if ((i == 0) and (j == 0)):
return ''
if (b[i,j] == 3):
result = LCS(b,v,i-1,j-1)
result = result + v[i-1]
return result
else:
if (b[i,j] == 2):
return LCS(b,v,i,j-1)
else:
return LCS(b,v,i-1,j)
print LCS(b,"ATGTTAT",b.shape[0]-1,b.shape[1]-1)
Perhaps we would like to get more than just the LCS; for example, the correpsonding alignment.
def Alignment(b,v,w,i,j):
if ((i == 0) and (j == 0)):
return ['','']
if (b[i,j] == 3):
result = Alignment(b,v,w,i-1,j-1)
result[0] += v[i-1]
result[1] += w[j-1]
return result
else:
if (b[i,j] == 2):
result = Alignment(b,v,w,i,j-1)
result[0] += "-"
result[1] += w[j-1]
return result
else:
result = Alignment(b,v,w,i-1,j)
result[0] += v[i-1]
result[1] += "-"
return result
align = Alignment(b,"ATGTTAT","ATCGTAC",b.shape[0]-1,b.shape[1]-1)
print "v =", align[0]
print "w =", align[1]
17
Example:
18
Amino acid changes that tend to preserve the electro-chemical properties of the original residue
Common Amino acid substitution matrices
DNA substitution matrices
19
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ... A R N D C Q E G H I L K ... Ala A 13 6 9 9 5 8 9 12 6 8 6 7 ... Arg R 3 17 4 3 2 5 3 2 6 3 2 9 Asn N 4 4 6 7 2 5 6 4 6 3 2 5 Asp D 5 4 8 11 1 7 10 5 6 3 2 5 Cys C 2 1 1 1 52 1 1 2 2 2 1 1 Gln Q 3 5 5 6 1 10 7 3 7 2 3 5 ... Trp W 0 2 0 0 0 0 0 0 1 0 1 0 Tyr Y 1 1 2 1 3 1 1 1 3 2 2 1 Val V 7 4 4 4 4 4 4 4 5 4 15 10
20
21
The Global Alignment Problem tries to find the highest scoring path between vertices (0,0) and (n,m) in the edit graph.
The Local Alignment Problem tries to find the highest scoring subpath between all vertex pairs $(i_1,j_1)$ and $(i_2, j_2)$ in the edit graph where $i_2 > i_1$ and $j_2 > j_1$.
In an edit graph with negatively-weighted scores, a Local Alignment may score higher than a Global Alignment
Example:
Global Alignment
--T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC
| || | || | | | ||| || | | | | |||| |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C
Local Alignment finds longer conserved segment
tccCAGTTATGTCAGgggacacgagcatgcagagac
||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
22
Local Alignment Problem:
23
Find the best global alignment amoung all blocks $(i_1,j_1,i_2,j_2)$
Long run time $O(n^4)$:
This can be remedied by giving free rides
25
26
27
28