Advanced Sequence Alignment¶

Midterm on Wednesday
- Covers up to and including Lecture 11
- Online; it can be downloaded at the start of class, in the same Jupyter Notebook format as the Problem Sets
- Open Computer, Open Notes
- You can add extra cells for scratch work, but only the indicated answer cells will be graded
- Mix of short answer, multiple choice, and writing code fragments
- Hard deadline for submission! Bank versions as the time limit approaches.

Recall Local Alignment¶

The zero is our free ride that allows the node to restart with a score of 0 at any point
- What does this imply?
After solving for the entire score matrix, we search for the entry $s_{i,j}$ with the highest score; its coordinate is $(i_2,j_2)$
We follow our backtracking matrix from there until we reach a score of 0, whose coordinate becomes $(i_1,j_1)$
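
The procedure recalled above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function name `local_alignment` and the default scores (match = 1, mismatch = -1, indel = -1) are assumptions chosen for the example.

```python
def local_alignment(v, w, match=1, mismatch=-1, indel=-1):
    """Smith-Waterman sketch: returns (best score, aligned piece of v, aligned piece of w)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]         # score matrix, borders stay 0
    back = [[None] * (m + 1) for _ in range(n + 1)]   # backtracking matrix
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            choices = [
                (0, None),                      # the "free ride": restart with score 0
                (diag, "diag"),
                (s[i - 1][j] + indel, "up"),
                (s[i][j - 1] + indel, "left"),
            ]
            s[i][j], back[i][j] = max(choices, key=lambda c: c[0])
            if s[i][j] > best:                  # remember (i2, j2)
                best, best_pos = s[i][j], (i, j)
    # Backtrack from (i2, j2) until a cell with score 0 is reached, which is (i1, j1)
    i, j = best_pos
    piece_v, piece_w = [], []
    while back[i][j] is not None:
        move = back[i][j]
        if move == "diag":
            piece_v.append(v[i - 1]); piece_w.append(w[j - 1]); i -= 1; j -= 1
        elif move == "up":
            piece_v.append(v[i - 1]); piece_w.append("-"); i -= 1
        else:
            piece_v.append("-"); piece_w.append(w[j - 1]); j -= 1
    return best, "".join(reversed(piece_v)), "".join(reversed(piece_w))


print(local_alignment("TTTACGTTT", "GGACGGG"))
```

Because the zero option is listed first, ties are resolved in favor of restarting, so the backtrack stops exactly where the score first drops to 0.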

Smith-Waterman Local Alignment¶

A Local Alignment Example¶

A Local Alignment Example - continued¶

Scoring Indels: Naive Approach¶

A fixed penalty σ is given to every indel:
- -σ for 1 indel,
- -2σ for 2 consecutive indels
- -3σ for 3 consecutive indels, etc.
This can be too severe a penalty for a series of 100 consecutive indels
- large insertions or deletions might result from a single event

Affine Gap Penalties¶

In nature, a series of k indels often comes as a single event rather than a series of k single-nucleotide events.

Accounting for Gaps¶

Gaps: contiguous sequences of indels in one of the rows
Modify the scoring for a gap of length x to be:
```
                 -(ρ + σx)
```
where ρ + σ > 0 is the penalty for introducing a gap (the gap opening penalty) and σ is the cost of extending it further (the gap extension penalty). We choose ρ + σ ≫ σ because you do not want to add too much of a penalty for further extending the gap once it is opened.

Affine Gap Penalties¶

Gap penalties:
- -ρ - σ when there is 1 indel
- -ρ - 2σ when there are 2 indels
- -ρ - 3σ when there are 3 indels, etc.
- -ρ - x·σ (-gap opening - x gap extensions)
Runs of horizontal and vertical edges thus receive somewhat reduced penalties (as compared to naïve scoring)
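
To make the difference concrete, the two penalty schemes can be compared directly. The numbers below are illustrative assumptions only (ρ = 10 and σ = 1 for the affine scheme, and a naive per-indel penalty of 11 so that a single indel costs the same under both).

```python
def naive_gap_penalty(x, sigma):
    """Fixed penalty sigma for each of the x consecutive indels."""
    return -sigma * x


def affine_gap_penalty(x, rho, sigma):
    """Gap opening rho paid once, gap extension sigma paid for each of the x indels."""
    return -(rho + sigma * x)


for x in (1, 2, 3, 100):
    print(x, naive_gap_penalty(x, sigma=11), affine_gap_penalty(x, rho=10, sigma=1))
```

Under these assumed values a gap of length 100 costs -1100 with naive scoring but only -110 with the affine penalty, which is the behavior wanted when a long gap results from a single event.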

Adding Affine Gap Penalties to our Graph¶

To reflect affine gap penalties we have to add “long” horizontal and vertical edges to the edit graph.
Each such edge of length x should have weight -ρ - x·σ
There are many such edges!
Adding them to the graph increases the running time of the alignment algorithm by a factor of n (where n is the length of the sequences, since each vertex gains up to O(n) incoming long edges)
So the complexity increases from $O(n^2)$ to $O(n^3)$

Adding Two More Tables¶

Affine Gap penalties can be expressed in terms of 3 recurrences
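
The slide's figure is not reproduced in this text. One common way to write the three recurrences is shown below, under some assumed notation: $s^{\downarrow}$ is the table for gaps in w (vertical edges), $s^{\rightarrow}$ is the table for gaps in v (horizontal edges), $s$ is the main table, and $\delta(v_i, w_j)$ is the match/mismatch score.

$$
\begin{aligned}
s^{\downarrow}_{i,j} &= \max \begin{cases} s^{\downarrow}_{i-1,j} - \sigma & \text{(extend a gap in } w\text{)} \\ s_{i-1,j} - (\rho + \sigma) & \text{(open a gap in } w\text{)} \end{cases} \\
s^{\rightarrow}_{i,j} &= \max \begin{cases} s^{\rightarrow}_{i,j-1} - \sigma & \text{(extend a gap in } v\text{)} \\ s_{i,j-1} - (\rho + \sigma) & \text{(open a gap in } v\text{)} \end{cases} \\
s_{i,j} &= \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) & \text{(match or mismatch)} \\ s^{\downarrow}_{i,j} & \text{(end a gap in } w\text{)} \\ s^{\rightarrow}_{i,j} & \text{(end a gap in } v\text{)} \end{cases}
\end{aligned}
$$

Each table entry is still computed in constant time, so the running time stays $O(n^2)$, at the cost of storing three tables instead of one.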

A 3-level Manhattan Grid¶

The three recurrences for the scoring algorithm create a 3-layered graph.
The top level creates/extends gaps in sequence w.
The bottom level creates/extends gaps in sequence v.
The middle level extends matches and mismatches.

Switching between 3 Layers¶

Levels:
- The main level is for diagonal edges
- The lower level is for horizontal edges
- The upper level is for vertical edges
A jumping penalty is assigned to moving from the main level to either the upper level or the lower level (-ρ - σ)
There is a gap extension penalty for each continuation on a level other than the main level (-σ)
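
The three levels can be sketched as three score tables. The code below is a hedged illustration rather than a definitive implementation: the function name `affine_global_score`, the table names `main`, `upper`, and `lower`, and the default scores (match = 1, mismatch = -1, ρ = 2, σ = 1) are all assumptions made for the example. It computes only the score of a global alignment, not the alignment itself.

```python
def affine_global_score(v, w, match=1, mismatch=-1, rho=2, sigma=1):
    """Best global alignment score of v and w with affine gap penalties."""
    NEG = float("-inf")
    n, m = len(v), len(w)
    main = [[NEG] * (m + 1) for _ in range(n + 1)]    # diagonal edges
    upper = [[NEG] * (m + 1) for _ in range(n + 1)]   # vertical edges (gaps in w)
    lower = [[NEG] * (m + 1) for _ in range(n + 1)]   # horizontal edges (gaps in v)
    main[0][0] = 0
    for i in range(1, n + 1):                         # first column: one gap of length i
        upper[i][0] = -(rho + sigma * i)
        main[i][0] = upper[i][0]
    for j in range(1, m + 1):                         # first row: one gap of length j
        lower[0][j] = -(rho + sigma * j)
        main[0][j] = lower[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # stay on the level (-sigma) or jump from the main level (-rho - sigma)
            upper[i][j] = max(upper[i - 1][j] - sigma, main[i - 1][j] - (rho + sigma))
            lower[i][j] = max(lower[i][j - 1] - sigma, main[i][j - 1] - (rho + sigma))
            diag = main[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # returning to the main level is free
            main[i][j] = max(diag, upper[i][j], lower[i][j])
    return main[n][m]


print(affine_global_score("ACGTACGT", "ACGT"))          # affine: one gap of length 4
print(affine_global_score("ACGTACGT", "ACGT", rho=0))   # rho = 0 reduces to naive scoring
```

Setting `rho=0` recovers the naive scheme, which makes it easy to compare the two on the same pair of sequences.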

Multiple Alignment versus Pairwise Alignment¶

Up until now we have only tried to align two sequences.
What about more than two? And what for?
A faint similarity between two sequences becomes significant if present in many
Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

Generalizing Pairwise Alignment¶

Alignment of 2 sequences is represented as a 2-row matrix
In a similar way, we represent alignment of 3 sequences as a 3-row matrix
```
    A T _ G C G _
    A _ C G T _ A
    A T C A C _ A
```

Score: more conserved columns, better alignment

Three-D Alignment Paths¶

An alignment of 3 sequences: ATGC, AATC, ATGC

Resulting path in (x,y,z) space:
(0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)
Is there a better one?
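
The correspondence between an alignment and its path can be sketched as follows. The gapped rows used here (`A-TGC`, `AAT-C`, `-ATGC`) are inferred from the path listed above, since the slide's figure is not reproduced in this text, and the function name `alignment_to_path` is an assumption.

```python
def alignment_to_path(rows):
    """Convert aligned rows (equal length, '-' for gaps) into a path through the grid."""
    position = [0] * len(rows)
    path = [tuple(position)]
    for column in zip(*rows):
        # a row advances along its axis only when the column holds a symbol, not a gap
        position = [p + (symbol != "-") for p, symbol in zip(position, column)]
        path.append(tuple(position))
    return path


print(alignment_to_path(["A-TGC", "AAT-C", "-ATGC"]))
```

Each column of the alignment becomes one edge of the path, and the coordinate along each axis counts how many symbols of that sequence have been consumed so far.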

Aligning Three Sequences¶

Same strategy as aligning two sequences
Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align
For global alignments, go from source to sink

2-sequence vs 3-sequence Alignment¶

A 2-D cell versus a 3-D Alignment Cell¶

2-D [(i-1,j-1), (i-1,j), (i,j-1)] → (i,j)
3-D [(i-1,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1), (i,j-1,k-1), (i-1,j,k-1), (i-1,j-1,k)] → (i,j,k)
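
The same pattern holds in any number of dimensions: every coordinate either steps back by one or stays put, and the step where nothing moves is excluded. A small sketch (the function name `predecessors` is an assumption) enumerates the incoming edges of a cell:

```python
from itertools import product


def predecessors(cell):
    """All cells with an edge into `cell` in a k-dimensional alignment grid."""
    preds = []
    for step in product((0, 1), repeat=len(cell)):
        if any(step):                                  # skip the all-zero step
            pred = tuple(c - d for c, d in zip(cell, step))
            if min(pred) >= 0:                         # stay inside the grid
                preds.append(pred)
    return preds


for cell in [(1, 1), (1, 1, 1), (1, 1, 1, 1)]:
    print(len(cell), len(predecessors(cell)))
```

An interior cell in k dimensions therefore has $2^k - 1$ predecessors: 3 for two sequences and 7 for three.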

Structure of a 3-D Alignment Cell¶

Multiple Alignment: Recursion Relation¶

Multiple Alignment: Running Time¶

For 3 sequences of length n, the run time is proportional to $7n^3$, which is $O(n^3)$
For k sequences, build a k-dimensional Manhattan grid, with run time proportional to $(2^k-1)(n^k)$, which is $O(2^kn^k)$
Conclusion: dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to exponential running time

Multiple Alignment Induces Pairwise Alignments¶

Every multiple alignment induces pairwise alignments

          x:    AC-GCGG-C        
          y:    AC-GC-GAG
          z:    GCCGC-GAG

Induces:

    x: ACGCGG-C;  x: AC-GCGG-C;  y: AC-GCGAG
    y: ACGC-GAG;  z: GCCGC-GAG;  z: GCCGCGAG
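
Inducing a pairwise alignment amounts to keeping two rows of the multiple alignment and dropping every column in which both rows hold a gap. A minimal sketch (the function name `induced_pairwise` is an assumption):

```python
from itertools import combinations


def induced_pairwise(row_a, row_b):
    """Project two rows of a multiple alignment onto a pairwise alignment."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


rows = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
for first, second in combinations(rows, 2):
    aligned_first, aligned_second = induced_pairwise(rows[first], rows[second])
    print(f"{first}: {aligned_first}")
    print(f"{second}: {aligned_second}")
    print()
```

The projection never reorders columns, which is why every multiple alignment induces exactly one pairwise alignment for each pair of rows.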

Inverse Problem¶

Do Pairwise Alignments imply a Multiple Alignment?

Given 3 arbitrary pairwise alignments:

  x: ACGCTGG-C;  x: AC-GCTGG-C;  y: AC-GC-GAG
  y: ACGC--GAG;  z: GCCGCA-GAG;  z: GCCGCAGAG

Can we construct a multiple alignment that induces them?
```
  NOT ALWAYS
```
Why? Because pairwise alignments may be arbitrarily inconsistent

Combining Optimal Pairwise Alignments¶

In some cases we can combine pairwise alignments into a single multiple alignment

But, in others we cannot because one alignment makes a choice that is inconsistent with the overall best choice

      AAAATTTT--------                  ----AAAATTTT----
      ----TTTTGGGG----       -OR-       --------TTTTGGGG
      --------GGGGAAAA                  GGGGAAAA--------

Is there another way?

Multiple Alignment from Pairwise Alignments¶

From an optimal multiple alignment, we can infer pairwise alignments between all pairs of sequences, but they are not necessarily optimal
It is difficult to infer a “good” multiple alignment from optimal pairwise alignments between all sequences
Are we stuck, or is there some other trick?

Multiple Alignment using Profile Scores¶

We used profile scores earlier when we discussed Motif finding

        -  A  G  G  C  T  A  T  C  A  C  C  T  G
        T  A  G  -  C  T  A  C  C  A  -  -  -  G
        C  A  G  -  C  T  A  C  C  A  -  -  -  G
        C  A  G  -  C  T  A  T  C  A  C  -  G  G
        C  A  G  -  C  T  A  T  C  G  C  -  G  G

  A     0  5  0  0  0  0  5  0  0  4  0  0  0  0    
  C     3  0  0  0  5  0  0  2  5  0  3  1  0  0
  G     0  0  5  1  0  0  0  0  0  1  0  0  2  5
  T     1  0  0  0  0  5  0  3  0  0  0  0  1  0
  -     1  0  0  4  0  0  0  0  0  0  2  4  2  0

Thus far we have aligned sequences against other sequences
Can we align a sequence against a profile?
Can we align a profile against a profile?
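
The profile shown above is just a column-by-column count of symbols. A short sketch that rebuilds it from the five aligned rows (the function name `build_profile` is an assumption, and gaps are written as `-`):

```python
from collections import Counter


def build_profile(rows, alphabet="ACGT-"):
    """Count how often each symbol occurs in every column of an alignment."""
    columns = [Counter(column) for column in zip(*rows)]
    return {symbol: [counts[symbol] for counts in columns] for symbol in alphabet}


alignment = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
for symbol, counts in build_profile(alignment).items():
    print(symbol, *counts)
```

Every column of the resulting table sums to the number of sequences (5 here), so dividing by that number turns the counts into per-column frequencies.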

Aligning Alignments¶

A more general version of the multi-alignment problem:

Given two alignments, can we align them?

  x: GGGCACTGCAT
  y: GGTTACGTC--    Alignment 1 
  z: GGGAACTGCAG

  w: GGACGTACC--    Alignment 2
  v: GGACCT-----

Idea: don’t use the sequences, but align their profiles

  x: GGGCAC=TGCAT
  y: GGTTAC=GTC-- 
  z: GGGAAC=TGCAG     Combined Alignment
     ||  || | |
  w: GG==ACGTACC--    
  v: GG==ACCT-----

Profile-Based Multiple Alignment: A Greedy Approach¶

Choose the most similar pair of strings and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.
This is a heuristic greedy method

Example¶

Consider these 4 sequences

        s₁:    GATTCA
        s₂:    GTCTGA
        s₃:    GATATT
        s₄:    GTCAGC

with the scoring matrix: {Match = 1, Mismatch = -1, Indel = -1}

Example (continued)¶

There are ${4 \choose 2} = 6$ possible pairwise alignments

    s₂:  GTCTGA                          s₁:  GATTCA--
    s₄:  GTCAGC (score = 2)              s₄:  G-T-CAGC (score = 0)

    s₁:  GAT-TCA                         s₂:  G-TCTGA
    s₂:  G-TCTGA (score = 1)             s₃:  GATAT-T (score  = -1)

    s₁:  GAT-TCA                         s₃:  GAT-ATT
    s₃:  GATAT-T (score  = 1)            s₄:  G-TCAGC (score = -1)

The best pairwise score, 2, is between s₂ and s₄
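
The first greedy step can be checked with a small global alignment scorer using the scoring matrix given above (Match = 1, Mismatch = -1, Indel = -1). This is a sketch under assumed names (`global_score`, `seqs`, `scores`); it computes scores only and keeps just one row of the dynamic programming table at a time.

```python
from itertools import combinations


def global_score(v, w, match=1, mismatch=-1, indel=-1):
    """Global alignment score, keeping only the previous row of the table."""
    prev = [j * indel for j in range(len(w) + 1)]
    for i in range(1, len(v) + 1):
        curr = [i * indel]
        for j in range(1, len(w) + 1):
            diag = prev[j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]


seqs = {"s1": "GATTCA", "s2": "GTCTGA", "s3": "GATATT", "s4": "GTCAGC"}
scores = {(a, b): global_score(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)}
best_pair = max(scores, key=scores.get)
print(scores)
print(best_pair, scores[best_pair])
```

The pair with the highest score is the one the greedy method merges into a profile first.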

Example (continued)¶

Combine s₂ and s₄:

    s₂:  G T C T G A
         | | |   |         →      s_2,4:  G T C t/a G a/c
    s₄:  G T C A G C

Giving a set of three sequences:

        s₁:    G  A  T  T  C  A
        s₃:    G  A  T  A  T  T
        s_2,4:    G  T  C t/a G a/c

Repeat for ${3 \choose 2} = 3$ possible pairwise alignments

    s₁:  GAT-TCA
    s₃:  GATAT-T (score  = 1 + 1 + 1 - 1 + 1 - 1 - 1 = 1)

    s₁:  GAT-TCA
    s_2,4:  G-TCtGa (score  = 2 - 2 + 2 - 2 + 1 - 1 + 1 = 1)

    s₃:  GATAT-T
    s_2,4:  G-TCtGa (score  = 2 - 2 + 2 - 2 + 1 - 1 - 1 = -1)

Progressive Alignment¶

Progressive alignment is a variation of a greedy profile alignment algorithm with a somewhat more intelligent strategy for choosing the order of alignments.
Progressive alignment works well for close sequences, but deteriorates for distant sequences
- Once a gap appears in a consensus string it is permanent
- Uses profiles to compare sequences
- Example: Clustal Omega

Clustal Omega¶

A popular multiple alignment tool commonly used today; it is the successor to ClustalW
The ‘W’ in ClustalW stands for ‘weighted’ (different parts of the alignment are weighted differently).
The ClustalW approach is a three-step process
1. Construct pairwise alignments
2. Build Guide Tree
3. Progressive Alignment guided by the tree

ClustalW's First Step¶

Pairwise alignment

Align each sequence against all others giving a similarity matrix
Similarity = exact matches / sequence length (percent identity)
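
The similarity measure can be sketched as below. The slide does not say which length goes in the denominator, so this example assumes the length of the pairwise alignment (gap columns included); the function name `percent_identity` is also an assumption.

```python
def percent_identity(aligned_a, aligned_b):
    """Fraction of alignment columns in which both rows hold the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return matches / len(aligned_a)


print(percent_identity("GTCTGA", "GTCAGC"))    # 4 matches in 6 columns
print(percent_identity("GAT-TCA", "G-TCTGA"))  # 4 matches in 7 columns
```

Computing this value for every pair of sequences fills the similarity matrix that the next step uses to build the guide tree.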

ClustalW's Second Step¶

Create Guide Tree using the similarity matrix
- ClustalW uses the neighbor-joining method
  (we will discuss this later in the course, in the section on clustering)
- Guide tree roughly reflects evolutionary relations

ClustalW's Third Step¶

Start by aligning the two most similar sequences
Following the guide tree, add in the next sequences, aligning to the existing alignment
Insert gaps as necessary

Next Time¶

Midterm on Wednesday
Covers material up to Lecture 11
When we return from spring break we'll finish Sequence Alignment
