2. What to Sequence and Why?
• De novo whole genome sequencing
– requires de novo whole genome assembly
• Polymorphism discovery (distinct from genotyping!)
– Targeted approaches
– Whole genome
– SNPs, copy number variations, insertions, deletions, etc.
• Expressed sequence discovery
– ESTs
– cDNAs
– miRNAs, etc
• Functional genomics
– ChIP
– Expression profiling
– Nucleosome positioning
Structure
Function
3. Three Fundamentally Different Types of Chemistries
for DNA Sequencing
• Maxam-Gilbert
– chemical degradation of DNA
– infrequently used due to toxic compounds and difficulty
• Sequencing by synthesis (“SBS”)
– uses DNA polymerase in a primer extension reaction
– Very common approach
– Sanger developed it (“Sanger sequencing”)
– 454, Illumina, Pacific Biosciences
• Ligation-based
– sequencing using short probes that hybridize to the template
– novel approach
– SOLiD, Complete Genomics
“Chemistry”: Jargon; meaning the specific molecular biology and chemistry utilized in the sequencing reactions
4. 5’ and 3’
Adapted From Berg et
al: Biochemistry 5th ed.
Freeman+Co, 2002
Base plus sugar
“nucleoside”
Adenine Adenosine
Guanine Guanosine
Cytosine Cytidine
Thymine Thymidine
in DNA: “deoxyadenosine”
plus triphosphate
“deoxynucleotide”
“2’-deoxyadenosine 5’-
triphosphate” = dATP
PO3 OH
3’
5’
base base base
PO
3
5’
OH 3’
If I throw in DNA polymerase and free
nucleotide, which end gets extended?
Antiparellel
base base base
base base
5’
3’ Deoxyribose
5. 5’ 3’
Adapted From Berg et
al: Biochemistry 5th ed.
Freeman+Co, 2002
Watson 5’ T A G C G T C A G C T G 3’
Crick 3’ A T C G C A G T C G A C 5’
PO3 OH
3’
5’
base base base
PO
3
5’
OH 3’
base base base
base base
6. Sanger Sequencing Templates
Plasmid
backbone
Insert
seq
primer
site
Plasmid “Clone”
PCR product
seq primer site
In Sanger sequencing, Crick is the template and Watson’s synthesis starts at the primer’s 3’OH
Watson 5’ .. T A G C G T C A G C T .. 3’
Crick 3’ .. A T C G C A G T C G A .. 5’
Primer T A G C G
3’ .. A T C G C A G T C G A C .. 5’
5’ 3’
7. The Chain Terminator
Adapted From Berg et
al: Biochemistry 5th ed.
Freeman+Co, 2002
dideoxy
H
3’
5’
base base base
PO
3
5’
3’
base base base base
• Dideoxy nucleotides cannot be further extended, and so terminate the sequence chain
8. Recombinant DNA: Genes and Genomes.
3rd Edition (Dec06). WH Freeman Press.
Expose gel to x-ray film (to
make an “auto-radiogram”)
Original Sanger Sequencing with Radioactive
Signal
Template (Crick)
A nested series of
DNA fragments
ending in the base
specified by the
terminator-ddNTP
Watsons
very low
concentration
of ddNTPs
compared to
dNTPs
9. This is great but…
Wouldn’t it be great to run everything in one
lane?
Save space and time, more efficient
Fluorescently label the ddNTPs so that
they each appear a different color
C
A
A
G
T
G
T
C
T
T
A
A
C
10. Recombinant DNA: Genes and Genomes.
3rd Edition (Dec06). WH Freeman Press.
Each of the 4 ddNTPs is labeled with a different fluorescent dye (instead of radioactivity)
Fluorescent Sanger Sequencing: “Dye-
terminators”
11. Fluorescent Sanger Sequencing
One-tube sequencing reaction
(note: cycle sequencing with modified Taq Polymerase)
dGTP
dATP
dTTP
dCTP
+
Direction
of electro-
phoresis
Load on gel
(modern machines use
capillaries, not slab gels)
12. Recombinant DNA: Genes and Genomes.
3rd Edition (Dec06). WH Freeman Press.
Fluorescent Sanger Sequencing Trace
Lane signal
Trace
(Real fluorescent signals from a lane/capillary are much uglier than this).
A bunch of magic to boost signal/noise, correct for dye-effects, mobility
differences, etc, generates the ‘final’ trace (for each capillary of the run)
13. Sanger Base Calling
... 0 3 20 25 40 88 95 99 99 99 99 99 ... 10 0 0 ...
Quality score = -10 * log(probability of error)
For Q20, probability of error = 1/100
For Q99, probability of error ~10-10
ugly over here ugly over there
... 44 45 46 47 48 49 50 51 52 53 54 55 ... 718 719 720 ...
... N A G C G T T C C G C G ... A N N ...
Base Caller (Phred)
14. • Algorithm based on ideas about what might go wrong in a
sequencing reaction and in electrophoresis
• Tested the algorithm on a huge dataset of “gold standard”
sequences (finished human and C. elegans sequences generated by highly-redundant
sequencing)
• Compared the results of phred with the ABI Basecaller
• Phred was considerably more accurate (40-50% fewer
errors), particularly for indels and particularly for the
higher quality sequences
(Ewing et al., 1998, Gen. Research 8: 175-185; Ewing and Green 1998, Gen. Research 8: 186-194)
Phred: The base-calling program
15. Radioactive polyacrylamide slab
gel electrophoresis
Low throughput, labor-intensive
1 grad student could do
10 runs a day, 100 bp/run =
1000 bp/day
Progress of Sanger Sequencing Technology
16. AB slab gel sequencers
Fluorescent sequencing
Per machine:
6 runs/day
96 reads/run
500 bp/read
288,000 bp/day
Progress of Sanger Sequencing Technology
17. AB capillary sequencers
Per machine:
24 runs/day
96 reads/run
550 – 1,000 bp/read
1-2 million
bp/day
Progress of Sanger Sequencing Technology
~1,000-fold increase in throughput since 1985 accomplished by
incremental improvements of the same underlying technology
Novel Disruptive Sequencing Technologies have 100-1000x more throughput:
454 Pyrosequencing, Solexa, SOLiD
20. Rationale for Hierarchical Strategy
• Better for a repeat-rich genome
– less misassembly of finished genome
• long-range misassembly largely eliminated and short-range reduced
• Better for an outbred organism
– each clone from an individual and no polymorphisms in the
final sequence.
– (Added bonus: get SNPs from regions of overlapping clones)
• Better if there are cloning biases
– use minimum tiling path,so the same coverage for each
region
• Easier to identify and fill gaps (from unclonable regions) sooner
BUT
• Time consuming and expensive to make minimum tiling path
21. De Novo Whole Genome Sequencing
sequencing
primer
"forward
read"
"reverse
read"
sequencing
primer
Plasmid "backbone"
Insert
Make millions of random
clones: “Shotgunning”
22. Some Key Concepts in Assembly
• Mate pairs ( = paired end reads)
• Contigs
• Supercontigs ( = Scaffolds)
• Sequence coverage
• Physical coverage
30. Draft Assemblies are not Perfect
• Sequence coverage may vary (Ciona is 12x, dog is 6x, shrew is 2x)
– missing regions
– strong fragmentation
• Some regions don’t clone well
– results in low sequence coverage
– which causes gaps in assembly
• Some regions don’t sequence well
– extreme GC content
– homopolymeric or otherwise low-complexity runs
• Some regions don’t assemble well
– mobile elements
• high identity, large copy number
– segmental duplications
• Polymorphism
31. 454 Sequencing
• Most viable “UHTS” method for de novo
genome sequencing
• Still sequencing by synthesis, but
fundamentally different
– Uses pyrosequencing (light produced as a
by product of base extension)
– No cloning required
– Sequences ~1,000,000 reads
simultaneously
– Each read is ~400bp long
– Takes ~10 hours for machine run
32. 454 Library Generation (1)
Genomic DNA
Fragment (shear or restrict)
DNA Fragments
Add adaptors
33. 454 Library Generation (2)
Adaptor DNA adapter ends plus beads
~ 1 DNA molecule and 1 bead per droplet
Make water-in-oil emulsion
emPCR
Most beads with with multiple copies
of individual DNA molecules
36. Sequencing Strategy with 454
• High Coverage (~30x) of single end
reads
• One tenth coverage with paired end
reads (provides greater physical
coverage)
• Assemble genome using both sets of
data – paired end reads provide long
range continuity.
37. 454 Advantages,
Limitations and Future
• Advantages
– Fast, accurate
– Great for small, simple genomes
• Disadvantages
– Reads still relatively short (~400bp)
– Doesn’t yet work well for de novo sequencing of
large genomes
– Homopolymer stretches (8+) are difficult to read
• Future
– Improved chemistry is expected to lead to read
lengths > 1000 bp in 2010.
38. The Players
• Commercially available now:
– Solexa (Illumina)
– SOLiD (Applied Biosystems/Life Technologies)
– Helicos
– (Complete Genomics)
• Next next generation approaches
– two or three potentially viable ones are in R+D phase
(E.g. Pacific Biosciences, Nabys, Visigen)
– may become commercially available soon
39. How UHTS Differs from Sanger
• Data
– Amount of data generated per day or per run (“throughput”)
– Type of data
– Quality of data (error)
• Process of Data Generation:
1. Molecular Biology: Library construction
2. Molecular Biology: Sample prep
3. Sequencing: Chemistry
4. Sequencing: Detection
5. Analysis: Image Processing
6. Analysis: Read Analysis
In other words: it’s a completely different beast
40. Read Length
0 100 200 300 400 500 600
700 bases of genome, 1x sequence coverage
Sanger
454
Sol*
700
[Consider DNA sequence variation vs functional genomics]
41. Further Important Differences
Parameter
Sanger
(AB 3730)
454
(Roche)
Solexa
(Illumina)
SOLiD
(AB)
Read L (bp) 700 400 35-100 50
Number of
reads per run
[per day]
96
[2000]
1,000,000
[2,000,000]
1,000,000,000
[600,000,000]
1,000,000,000
[200,000,000]
Mb per day 1.4 80 Up to 25,000 Up to 12,500
SNP error rate low low high (~1%) high (>1%)
Indel error rate low high low low
42. How UHTS Differs from Sanger
• Data
– Amount of data generated per day or per run (“throughput”)
– Type of data
– Quality of data (error)
• Process of Data Generation:
1. Molecular Biology: Library construction
2. Molecular Biology: Sample prep
3. Sequencing: Chemistry
4. Sequencing: Detection
5. Analysis: Image Processing
6. Analysis: Read Analysis
43. “Fragment” Library
Seq Primer “Insert”
Left amplification handle Right amplification handle
Solexa, SOLiD: L <200 bp
44. Making Paired Tags
Conceptually Simple ... Practically Somewhat Difficult, though now
fairly standard
long gDNA fragment
molecular biology magic
... on to Amplification and Sequencing
45. Generating DiTag Libraries
• Paired Tag for clonal amplification-based sequencing
Sheared gDNA
(Long .. > 2kb!)
Adapter with EcoP15I sites
>snip< with
EcoP15I
Seq Primer 1 Tag 1
Left single molecule
amplification handle
Right single molecule
amplification handle
Seq Primer 2 Tag 2
Add linkers
46. How UHTS Differs from Sanger
• Data
– Amount of data generated per day or per run (“throughput”)
– Type of data
– Quality of data (error)
• Process of Data Generation:
1. Molecular Biology: Library construction
2. Molecular Biology: Sample prep
3. Sequencing: Chemistry
4. Sequencing: Detection
5. Analysis: Image Processing
6. Analysis: Read Analysis
47. Sample Preparation
• Some approaches use a single DNA molecule as a sequencing
template do exist (Helicos, Pacific Biosciences). Challenges include:
– Keeping single molecules stable during insults of sequencing
– Signal to noise ratio in base detection
– BUT
– Avoid amplification biases
• Next-best Solution: Clonal Amplification of Single Molecules
– Single molecule only briefly needed as a template, not during the entire
time of the sequencing reaction
– Thousands of identical molecules boost signal
– Two different methods
• Emulsion PCR
– 454 and Applied Biosystems (SOLiD)
• Bridge amplification of molecules immobilized on surface
– Solexa system (Illumina)
48. DNA
(0.1-1.0 ug)
Single molecule array
Sample
preparation Cluster growth
5’
5’
3’
G
T
C
A
G
T
C
A
G
T
C
A
C
A
G
T
C
A
T
C
A
C
C
T
A
G
C
G
T
A
G
T
1 2 3 7 8 9
4 5 6
Image acquisition Base calling
T G C T A C G A T …
Sequencing
Illumina Sequencing Technology
Robust Reversible Terminator Chemistry Foundation
50. How UHTS Differs from Sanger
• Data
– Amount of data generated per day or per run (“throughput”)
– Type of data
– Quality of data (error)
• Process of Data Generation:
1. Molecular Biology: Library construction
2. Molecular Biology: Sample prep
3. Sequencing: Chemistry
4. Sequencing: Detection
5. Analysis: Image Processing
6. Analysis: Read Analysis
51. Detection, Chemistry
• Massively Parallel Detection
On immobilized “molecular colonies”
Means you have to measure (image) every cycle
Instead of the Sanger model (letting reaction go to completion and then
separating products by size)
• Requires specially designed ‘chemistry’
– Solexa: reversible dye-terminators (polymerase)
– AB: ligation-based (ligase; 3’->5’ !)
52. O
PPP
HN
N
O
O
cleavage site
fluorophore
3’
3’ OH is blocked
Solexa Sequencing: Reversible Terminators
Detection
O
HN
N
O
O
3’
DNA
O
Incorporate
Ready for Next Cycle
O
DNA
HN
N
O
O
3’
O
free 3’ end
OH
Deblock
and Cleave
off Dye
53. C
T
A G
C
T
A
3’- …-5’
G
T
First base incorporated
Cycle 1: Add sequencing reagents
Remove unincorporated bases
Detect signal
Cycle 2-n: Add sequencing reagents and repeat
G T C
T A G
T C
T G C
T A
G
A
Sequencing by Synthesis, One Base at a Time
54. How UHTS Differs from Sanger
• Data
– Amount of data generated per day or per run (“throughput”)
– Type of data
– Quality of data (error)
• Process of Data Generation:
1. Molecular Biology: Library construction
2. Molecular Biology: Sample prep
3. Sequencing: Chemistry
4. Sequencing: Detection
5. Analysis: Image Processing
6. Analysis: Read Analysis
56. Flow Cells and Imaging
Solexa: single 8-channel
Reagents flowed in here
Out to waste
Image for each
channel is
actually tiled (four
images per panel,
one for each
color)
57. Amount of Image Data
Solexa: single 8-channel
First chemistry, then imaging
SOLiD: dual flow cell, 8 quadrants
alternating chemistry and imaging
~1 TB image data per run ~3 TB image data per run
Number of image panels in
each channel:
Number of image panels in
each quadrant:
27 columns X 17 rows
3672 for both flow cells
3 columns X 100 rows
2400 for each flow cell
times 4 images
times 35 cycles
336,000 images
times 4 images
times 35 cycles
514,080 images
58. Image Processing, Base Calling
• Image processing algorithms find
signals in each panel, align signals from
different panels, etc
– Machines ship with server or small cluster
that does image analysis while run is
happening
• Sequence data after base calling much
reduced in size (tens of gigabytes) =>
more manageable but still large
59. How UHTS Differs from Sanger
• Data
– Amount of data generated per day or per run (“throughput”)
– Type of data
– Quality of data (error)
• Process of Data Generation:
1. Molecular Biology: Library construction
2. Molecular Biology: Sample prep
3. Sequencing: Chemistry
4. Sequencing: Detection
5. Analysis: Image Processing
6. Analysis: Read Analysis
60. Placement (Mapping) of Short Reads
Requires reference genome (hence “placement” or
“mapping”; meaning alignment of read to genome)
Read Placement is a significant computational challenge because of
the sheer amount of data
Need lots of disk space
Need a cluster
Need lots of RAM
Commercial vendors (Illumina, AB) have decided on fast algorithms
that *only* tolerate mismatches (and not many)
Example: 3 Gigabases of read data in 35 bp chunks (1x
mammalian genome coverage) takes 30 CPU hours to align with
the fastest mismatch-only software …
Placement can be ambiguous because reads are short and
error-prone (error rates matter greatly!)
... whereas it takes 125 days with Mike Brudno’s SHRIMP (a superfast
Smith-Waterman implementation explicitly developed for short reads)
To make the most of the data you
may need fast and robust
algorithms typically written by
computer scientists.
61. Useful Papers
Technology:
• Sanger, F., Nicklen, S. and Coulson, A.R. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad
Sci U S A. 74, 5463-7.
• Ronaghi, M., Uhlén, M. and Nyrén, P. (1998). A sequencing method based on real-time pyrophosphate. Science
281, 363-365.
• Margulies, M. et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-
80.
• Eid, J. et al. (2009). Real-time DNA sequencing from single polymerase molecules. Science. 323, 133-8.
• Drmanac, R. et al. (2010). Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA
Nanoarrays. Science 327, 78-81.
Quality:
• Ewing, B., Hillier, L., Wendl, M.C. and Green, P. (1988). Base-calling of automated sequencer traces using phred. I.
Accuracy assessment. Genome Res. 8, 175-85.
• Ewing, B. and Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities.
Genome Res. 8, 186-94.
Assembly:
• Batzoglou, S., Jaffe, D.B., Stanley, K., Butler, J., Gnerre, S., Mauceli, E., Berger, B., Mesirov, J.P. and Lander, E.S.
(2002). ARACHNE: a whole-genome shotgun assembler. Genome Res. 12, 177-89.
• Jaffe, D.B., Butler, J., Gnerre, S., Mauceli, E., Lindblad-Toh, K., Mesirov, J.P., Zody, M.C. and Lander, E.S. (2003).
Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91-6.
62.
63. SOLiD Emulsion PCR on Beads
From Dressman et al. (2003) Proc. Natl.
Acad. Sci. USA 100, 8817-8822
Purify
~2x108
Beads
Spread
Beads on
Flow Cell
For
Seq’ing
64. 4-6 M Emulsions with 1 M Beads (SOLiD)
Courtesy of Gina
Costa (AB)
reactors containing a
bead are false-colored
65. Flow Cells with “Molecular Colonies”
• SOLiD
– enrich for beads that productively amplified
– spread out beads randomly on microscope slide to which the beads
stick via covalent bond ( <- important; beads must not move during
sequencing)
• Solexa
– flow cell with randomly spaced molecular clusters
– spacing depends on initial seeding of the single molecules onto the
flow cell