DNA sequences are a biological system's hard drive
DNA sequences vary in size
How can we read off the sequence of DNA?
Sanger method (1977): Labeled ddNTPs terminate DNA copying at random points.
Gilbert method (1977): Chemical method to cleave DNA at specific points (G, G+A, T+C, C).
Both methods generate labeled fragments of varying lengths that are then separated by electrophoresis.
In 1990, a moonshot-like project began to sequence the entire human genome.
It was a $3 billion, NIH-funded public effort led by Francis Collins with a 15-year plan.
It distributed the work across several labs in a community effort by assigning primers
to groups on a first-come basis. New sequencing results yielded new primers, so the project
required central coordination.
In 1998 a private company, Celera, led by Craig Venter, suggested it could beat the public effort by dispensing with primers. It would just randomly fragment DNA and sequence each fragment with no idea of how the sequenced fragments would fit together. In other words, Celera was going to rely on computer science to assemble its reads algorithmically.
The result was that, despite tensions, the groups ended up sharing data and technologies. And the competition led to a completed draft 5 years ahead of schedule.
Since the Human Genome Project there has been an explosion of sequenced genomes. Initially, the focus was on model organisms, then favorites, then all of human diversity, and finally a catalog of life's diversity.
Next-generation sequencing machines have revolutionized the DNA sequencing process. They work in various ways: massively parallel single-base extension methods, captured DNA polymerases whose activity reveals the base being replicated, and tiny pores (nanopores) that only a single DNA molecule can pass through, where the bases are determined by detectable charge differences.
In a way, the *genome moonshot* was far more successful than the real moonshot. Both the rate at which genomes can be sequenced and the cost per base have seen unprecedented improvements, outpacing even Moore's Law.
Some important differences
You'd look for fragments that fit together based on some overlapping context that they share,
and then build upon those to assemble a more complete picture.
This leads us to a computational analogy called a graph
One can devise both representations for, and algorithms that operate on, graphs.
Let's rethink our DNA assembly problem as a graph problem.
For the moment let's imagine that reads are like k-mers from a sequence, as they do tend to be uniform in length.
GACGGCGGCGCACGGCGCAA - Our toy sequence
GACGG ACGGC CGGCG GGCGG GCGGC CGGCG GGCGC GCGCA - The complete set of 16 5-mers
CGCAC GCACG CACGG ACGGC CGGCG GGCGC GCGCA CGCAA
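The 5-mer list can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `kmers` is an illustrative choice, not something defined in these notes.

```python
def kmers(seq, k):
    """Return every length-k substring of seq, in order, keeping repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

toy = "GACGGCGGCGCACGGCGCAA"
reads = kmers(toy, 5)

print(len(reads))       # a sequence of length n has n - k + 1 k-mers
print(" ".join(reads))
```

Note that repeats such as CGGCG are kept; each occurrence plays the role of a separate read.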
Now we can construct a graph where:
The read-overlap graph for the 5-mers from:
GACGGCGGCGCACGGCGCAA
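One way to build such a graph in code is sketched below. It assumes the usual overlap rule: every 5-mer is a node (repeated 5-mers are kept as separate nodes, identified by their position in the read list), and there is an edge from read a to read b when the last k-1 bases of a equal the first k-1 bases of b. The function name `overlap_graph` is hypothetical.

```python
def overlap_graph(reads):
    """Map each read index to the indices of reads it overlaps by k-1 bases."""
    graph = {}
    for i, a in enumerate(reads):
        # a[1:] is the (k-1)-suffix of a, b[:-1] is the (k-1)-prefix of b
        graph[i] = [j for j, b in enumerate(reads) if j != i and a[1:] == b[:-1]]
    return graph

toy = "GACGGCGGCGCACGGCGCAA"
reads = [toy[i:i + 5] for i in range(len(toy) - 4)]
graph = overlap_graph(reads)

for i, successors in graph.items():
    print(reads[i], "->", [reads[j] for j in successors])
```

Using indices rather than the 5-mer strings as node names keeps the three copies of CGGCG distinct, which matters when every read must be used exactly once.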
The problem: how can we infer the original sequence from this graph?
A version of Hamilton's game: find a path that visits every node exactly once.
Our desired sequence:
GACGGCGGCGCACGGCGCAA is indeed a path in this graph
How can we write a program to solve Hamilton's puzzle?
Is the solution unique?
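Both questions can be explored with a brute-force search. The sketch below is one possible approach (plain backtracking, not necessarily the method used later in the course): it tries every path that uses each read exactly once and collects the distinct sequences those paths spell. The function names are illustrative, and the overlap rule (suffix of length k-1 equals prefix of length k-1) is an assumption.

```python
def overlap_graph(reads):
    """Edges i -> j when the (k-1)-suffix of read i equals the (k-1)-prefix of read j."""
    return {i: [j for j, b in enumerate(reads) if j != i and a[1:] == b[:-1]]
            for i, a in enumerate(reads)}

def hamiltonian_sequences(reads):
    """Return the set of sequences spelled by paths that visit every read once."""
    graph = overlap_graph(reads)
    n = len(reads)
    found = set()

    def extend(path, visited):
        if len(path) == n:
            # Spell the sequence: first read, then the last base of each later read
            found.add(reads[path[0]] + "".join(reads[i][-1] for i in path[1:]))
            return
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for start in range(n):
        extend([start], {start})
    return found

toy = "GACGGCGGCGCACGGCGCAA"
reads = [toy[i:i + 5] for i in range(len(toy) - 4)]
solutions = hamiltonian_sequences(reads)

for s in sorted(solutions):
    print(s, "(original)" if s == toy else "")
```

For the toy reads the search recovers the original sequence, but it also finds GACGGCGCACGGCGGCGCAA, which has exactly the same 5-mers, so the solution is not unique. The search is also exponential in the worst case, which makes it impractical for realistic numbers of reads.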
GACGG has a prefix GACG and a suffix ACGG
This rather odd graph is called a "De Bruijn" graph, named after a famous mathematician.
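A De Bruijn graph for the toy 5-mers can be sketched as follows, assuming the standard construction: the nodes are the 4-mer prefixes and suffixes, and each 5-mer contributes one directed edge from its prefix to its suffix. The function name `de_bruijn` is hypothetical.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the list of suffixes it points to.

    Repeated k-mers produce repeated (parallel) edges.
    """
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

toy = "GACGGCGGCGCACGGCGCAA"
reads = [toy[i:i + 5] for i in range(len(toy) - 4)]
graph = de_bruijn(reads)

for prefix in sorted(graph):
    print(prefix, "->", graph[prefix])
```

Here the k-mers are edges rather than nodes, so using every read exactly once means crossing every edge exactly once.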
The problem: how can we infer the original sequence from this graph?
Leonhard Euler
A version of Euler's game:
Bridges of Königsberg: find a city tour that crosses every bridge just once
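Applied to the De Bruijn graph, Euler's game becomes: find a path that crosses every edge (every 5-mer) exactly once. The sketch below uses Hierholzer's algorithm, a standard technique for this task that these notes do not themselves name. It assumes the graph has an Eulerian path, and all function names are illustrative.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    out_edges = {node: list(succ) for node, succ in graph.items()}  # private copy
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, succ in out_edges.items():
        balance[node] += len(succ)
        for nxt in succ:
            balance[nxt] -= 1
    # Start at the node with one extra outgoing edge, if there is one
    starts = [node for node, b in balance.items() if b == 1]
    start = starts[0] if starts else next(iter(out_edges))

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges.get(node):
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: node is finished
    return path[::-1]

toy = "GACGGCGGCGCACGGCGCAA"
reads = [toy[i:i + 5] for i in range(len(toy) - 4)]
path = eulerian_path(de_bruijn(reads))
assembled = path[0] + "".join(node[-1] for node in path[1:])
print(assembled)
```

Unlike the brute-force search for Hamilton's game, this runs in time proportional to the number of edges. It returns one valid reconstruction; when several sequences share the same 5-mers, which one is returned depends on the order in which edges are followed.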