2. Overview
Real Time Single Molecule Sequencing
• Mechanism/Motivation
• Computational Challenges
• Example: Sequence Alignment, probabilistic model
Sequencing on a Chip
• Mechanism/Motivation
• Computational Challenges
• Example: Sequence Alignment, probabilistic model
4. Real Time Single Molecule Sequencing
[Figure: schematic with labels Polymerase, Patient’s DNA sequence, Quantum dot, FRET, Emitted light, Glass plate, and nucleotides G, C, A, T]
FRET: fluorescence resonance energy transfer
-> Energy is transferred from the quantum dot to the dye, resulting in a color shift
5. Raw time series of one molecule (live)
[Figure: raw signal trace of one molecule; x-axis: Time (seconds); called bases: T A A G G]
6. Motivation
• Single reads of 5,000 bp or more
• Multiple polymerases on one strand of DNA
-> ultra-long reads of up to 250,000 bp
• (It’s very cool :-)
7. Computational Challenges in Sequencing
Base Calling
• Determine the sequence of G, A, T, C
-> Signal detection (noise)
-> Quality of base call
Mapping
• Locate billion(s) of sequences on the genome
-> Efficient mapping strategies
-> Dealing with errors in reads
SNP/CNV Detection
• Does the patient have cancer?
-> (Multiple) sequence alignment
-> Correct probabilistic model is critical
[Figure: pipeline outputs; labels: T A C G T A C G T C T G A G C A, “Reads”, “Sequences”, Assembled genome, Genotype/CNV/SNP]
8. Maximum likelihood alignment: Dynamic programming
Confidential and Proprietary—DO NOT
DUPLICATE
[Figure: dynamic-programming alignment matrix. Reference sequence: G A A G T A. Read: G A G A A. Annotations on the alignment path: missed base call, mismatch, missed base call, NSB]
Correct model and transition probabilities are crucial
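The alignment idea on this slide can be sketched as a small dynamic program over log-probabilities. This is a minimal illustration, not the model used in the talk: the function name `max_loglik_alignment` and all four step probabilities are hypothetical placeholders, and a real model would fit its transition probabilities to data.

```python
import math

# Hypothetical per-step probabilities (assumptions, not values from the slides).
LOG_MATCH = math.log(0.90)     # read base agrees with the reference base
LOG_MISMATCH = math.log(0.02)  # read base differs from the reference base
LOG_MISSED = math.log(0.05)    # reference base with no base call in the read
LOG_EXTRA = math.log(0.03)     # base call in the read with no reference base


def max_loglik_alignment(reference: str, read: str) -> float:
    """Return the log-likelihood of the best global alignment path."""
    n, m = len(reference), len(read)
    neg_inf = float("-inf")
    # score[i][j]: best log-likelihood of aligning reference[:i] with read[:j]
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                same = reference[i - 1] == read[j - 1]
                step = LOG_MATCH if same else LOG_MISMATCH
                score[i][j] = max(score[i][j], score[i - 1][j - 1] + step)
            if i > 0:  # consume a reference base only: missed base call
                score[i][j] = max(score[i][j], score[i - 1][j] + LOG_MISSED)
            if j > 0:  # consume a read base only: extra base call
                score[i][j] = max(score[i][j], score[i][j - 1] + LOG_EXTRA)
    return score[n][m]


# Reference and read as shown on the slide.
print(max_loglik_alignment("GAAGTA", "GAGAA"))
```

Changing the step probabilities changes which path wins, which is the point of the slide's remark that the model and its transition probabilities are crucial.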
10. Polymerase: event detection
• For a detection limit of 10 ms, we will miss ~7.5% of all insertion events
• To see more events we could:
– Slow down the polymerase insertion rate
– Lower the insertion detection limit (increase frame rate and lower noise)
Detection Limit    % Events Missed
5 ms               3.9%
10 ms              7.5%
20 ms              18%
30 ms              ~30%
Does not include blinking
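To make the relation between detection limit and missed events concrete, here is a rough sketch that assumes exponentially distributed event durations. The slides do not state which distribution was used, so this is an assumption, and the names `missed_fraction`, `mean_dwell_from_row` and `MEAN_DWELL_MS` are hypothetical.

```python
import math


def missed_fraction(detection_limit_ms: float, mean_dwell_ms: float) -> float:
    """Fraction of events shorter than the detection limit (exponential durations)."""
    return 1.0 - math.exp(-detection_limit_ms / mean_dwell_ms)


def mean_dwell_from_row(detection_limit_ms: float, missed: float) -> float:
    """Invert the formula: mean event duration implied by one table row."""
    return -detection_limit_ms / math.log(1.0 - missed)


# Calibrate on the 10 ms row of the table (7.5% missed).
MEAN_DWELL_MS = mean_dwell_from_row(10.0, 0.075)

for limit_ms in (5, 10, 20, 30):
    pct = 100 * missed_fraction(limit_ms, MEAN_DWELL_MS)
    print(f"{limit_ms:>2} ms -> {pct:.1f}% missed")
```

Calibrated this way, the sketch gives roughly 3.8% at 5 ms, close to the table, but only about 14% and 21% at 20 ms and 30 ms. The table's larger values suggest the original numbers come from a different duration model than a single exponential.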
11. Quantum Dot Blinking: Power law
[Figure: “real” blinking signal (light “on”/“off”) and the sub-sampled time series (dt·dpix); bell curve vs. power law]
Normal Distribution                     Power Law
Finite mean/variance                    Infinite mean/variance possible
Height/weight distribution of people    Financials: stock market, foreign exchange rates
IQ of people                            Hedge fund risks
Roulette, blackjack, etc.               File sizes, download times, city sizes, book sales
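The contrast in the table can be illustrated by sampling. The sketch below draws from a normal distribution and from a power law p(x) ~ x^(-alpha) using inverse-transform sampling. The exponent 1.5, the helper name `sample_power_law`, and the normal parameters are illustrative assumptions, not values from the slides.

```python
import random
import statistics


def sample_power_law(rng: random.Random, alpha: float, x_min: float = 1.0) -> float:
    """Draw from p(x) ~ x^(-alpha) for x >= x_min via inverse-transform sampling."""
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids raising 0 to a negative power
    return x_min * u ** (-1.0 / (alpha - 1.0))


rng = random.Random(0)
normal = [rng.gauss(1.0, 0.2) for _ in range(100_000)]
# For alpha <= 2 the mean of this distribution is infinite.
power = [sample_power_law(rng, alpha=1.5) for _ in range(100_000)]

print("normal:    mean", statistics.fmean(normal), "max", max(normal))
print("power law: mean", statistics.fmean(power), "max", max(power))
print("power law median", statistics.median(power))
```

The normal sample stays within a few standard deviations of its mean. The power-law sample mean is dominated by a handful of huge values and does not settle as the sample grows, while its median stays well behaved.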
13. Challenges of SMS
• Enzyme kinetics (too fast to see, branching ratio)
• Quantum dot blinking: random switching between ON and OFF states, common to nanoscale emitters -> power law
• Photobleaching (destruction of dot)
• Single molecule -> stochastic behavior
Accuracy
14. Sequencing on a Semiconductor Chip: The Chip is the Machine™
Ion Torrent
[Figure: sensor cross-section; labels: DNA on bead, H+, ∆ pH, Sensing Layer, ∆ V, Sensor Plate, Drain, Source, Bulk, Silicon Substrate]
15. Method/Motivation
[Figure: sensor cross-section (labels: H+, ∆ pH, Sensing Layer, ∆ V, Sensor Plate, Drain, Source, Bulk, Silicon Substrate) and DNA schematic (labels: DNA, Patient’s DNA sequence, nucleotides G, C, A, T)]
• Natural chemistry: natural polymerase and nucleotides
• Fast: one PGM run 1.5-7.5 hours, entire workflow 8-23 hours
• Cost: 50k for a PGM, 300.- for a sample with chip
• Simplicity: direct measurement, no cameras involved -> high accuracy
• Read length: approx. 400 bp
16. Raw signal of one base call
Incorporations release hydrogen ions (dV):
• one nucleotide
• two nucleotides
• three nucleotides
• four nucleotides
17. Flowgram of one read
[Figure: flowgram; y-axis: Nr bases inserted (1, 2); bar labels as extracted: T, 2T, C CA AGG]
Nucleotide: T  A  C  G  T  A  C  G  T  C  T  G  A  G  C  A
Flow:       1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
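A flowgram records how many bases are inserted in each nucleotide flow. The sketch below computes the ideal flowgram of a sequence for the 16-flow order shown on the slide, assuming perfect chemistry with no phasing errors. The function name `ideal_flowgram` and the example sequence are hypothetical.

```python
def ideal_flowgram(sequence: str, flow_order: str) -> list[int]:
    """Number of bases incorporated in each flow, assuming perfect chemistry."""
    counts = []
    pos = 0  # index of the next base to be incorporated
    for nucleotide in flow_order:
        n = 0
        # A flow incorporates the whole homopolymer run of its nucleotide.
        while pos < len(sequence) and sequence[pos] == nucleotide:
            n += 1
            pos += 1
        counts.append(n)
    return counts


FLOW_ORDER = "TACGTACGTCTGAGCA"  # the 16 flows shown on the slide
print(ideal_flowgram("TTCAG", FLOW_ORDER))
# -> [2, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

A homopolymer such as TT shows up as a single flow with value 2, which is why the signal amplitude has to be resolved in integer steps.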
18. Dynamic programming
[Figure: dynamic-programming matrix with axes Read and Reference. Flows 1-6: A, G, T, C, A, T. Bases 1-4: A, T, T, A. Annotated transitions: carry forward, incomplete extension]
Correct model and transition probabilities are crucial
19. Phasing Model
We have one bead per well and there are many copies of DNA on a bead.
During a run the DNA copies get out of sync with each other.
Phasing Parameter       Typical Values   Description
Carry Forward           0.2% - 1.0%      % of polymerases that will incorporate when they shouldn’t
Incomplete Extension    0.2% - 1.0%      % of polymerases that will not incorporate when they should
Droop                   0.0% - 0.3%      % of polymerases that will stop working
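How the three phasing parameters blur a flowgram can be sketched with a small population model that tracks the fraction of DNA copies at each position. This is a simplified illustration under stated assumptions, not the production model: droop is applied once per flow, carry forward is modelled as leftover nucleotide from the previous flow only, and the names `simulate_flowgram`, `extend` and `homopolymer_run` are hypothetical. Default values are picked from within the typical ranges in the table.

```python
from collections import defaultdict


def homopolymer_run(sequence: str, pos: int, nucleotide: str) -> int:
    """Length of the run of `nucleotide` starting at `pos`."""
    end = pos
    while end < len(sequence) and sequence[end] == nucleotide:
        end += 1
    return end - pos


def extend(state, sequence, nucleotide, p_extend):
    """Move a fraction p_extend of the copies past their next run of `nucleotide`.

    Returns the new state and the signal (bases incorporated, weighted by fraction).
    """
    new_state = defaultdict(float)
    signal = 0.0
    for pos, frac in state.items():
        run = homopolymer_run(sequence, pos, nucleotide)
        moved = frac * p_extend if run else 0.0
        new_state[pos + run] += moved
        new_state[pos] += frac - moved
        signal += moved * run
    return new_state, signal


def simulate_flowgram(sequence, flow_order, cf=0.005, ie=0.005, droop=0.001):
    state = {0: 1.0}  # fraction of live copies, keyed by bases incorporated so far
    signals = []
    previous = None
    for nucleotide in flow_order:
        # Droop (simplified): a fraction of live polymerases stops in every flow.
        state = {pos: frac * (1.0 - droop) for pos, frac in state.items()}
        # Incomplete extension: only (1 - ie) of the copies that should extend do.
        state, signal = extend(state, sequence, nucleotide, 1.0 - ie)
        # Carry forward: a fraction cf uses leftover nucleotide of the previous flow.
        if previous is not None and previous != nucleotide:
            state, extra = extend(state, sequence, previous, cf)
            signal += extra
        signals.append(signal)
        previous = nucleotide
    return signals


print([round(s, 3) for s in simulate_flowgram("TTCAG", "TACGTACG")])
```

With all three parameters set to zero the result reduces to the ideal integer flowgram. With nonzero values, signal leaks into neighbouring flows and the total signal decays over the run.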
20. Temporal model: When will the polymerase incorporate?
[Figure: histogram of flows during which the 6th base (T) is incorporated; x-axis: flows (“time”) 1-20, flow order T A C G T A C G T C T G A G C A T C G A; “In phase” marked]
[Figure: histogram of flows during which the 50th base (A) is incorporated; x-axis: flows 102-125, flow order A C G T C T G A G C A T C G A T C G A T G T A C; “In phase” marked]
21. Goal of base calling (in principle)
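Find the base sequence for which the predicted flowgram is most similar to the observed flowgram.
[Figure: a candidate base sequence (T C A G T T G A C T) is turned into a predicted flowgram by the phasing model and compared with the observed flowgram by least squares; both flowgrams use the flow order T A C G T A C G T C T G A G C A]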
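The principle stated on this slide can be sketched as an exhaustive least-squares search. This is a toy illustration only: `predict_flowgram` is a hypothetical stand-in that ignores phasing and returns the ideal flowgram, the observed values are made up, and brute force over all sequences is feasible only for very short reads.

```python
from itertools import product


def predict_flowgram(sequence: str, flow_order: str) -> list[int]:
    """Stand-in for the phasing model: ideal, in-phase signal per flow."""
    counts, pos = [], 0
    for nucleotide in flow_order:
        n = 0
        while pos < len(sequence) and sequence[pos] == nucleotide:
            n += 1
            pos += 1
        counts.append(n)
    return counts


def squared_error(predicted, observed) -> float:
    """Sum of squared differences between two flowgrams."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def call_bases(observed, flow_order: str, max_length: int = 6) -> str:
    """Search all sequences up to max_length for the least-squares best fit."""
    best_seq = ""
    best_err = squared_error(predict_flowgram("", flow_order), observed)
    for length in range(1, max_length + 1):
        for bases in product("ACGT", repeat=length):
            candidate = "".join(bases)
            err = squared_error(predict_flowgram(candidate, flow_order), observed)
            if err < best_err:  # strict: ties keep the shorter, earlier candidate
                best_seq, best_err = candidate, err
    return best_seq


# Made-up noisy observation of an 8-flow run.
observed = [1.9, 0.1, 1.05, 0.0, 0.1, 0.9, 0.0, 1.1]
print(call_bases(observed, "TACGTACG"))  # -> TTCAG
```

Swapping `predict_flowgram` for a model that includes carry forward, incomplete extension and droop keeps the search unchanged, since only the prediction step depends on the chemistry.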
22. Regions, Exomes, Genomes and Beyond
[Figure: sequence output per run (axis: 10M, 100M, 1G, 10G, 100G) by chip: Ion 314, Ion 316™, Ion 318™, PI, PII, PIII; applications: Small to Large Gene Panels, Small Genome, Transcriptome, Exome, Human Genome]
From 1.2 million sensors … to 1.2 billion
23. About me…
Detection of propylthiouracil
(coffee, cabbage, grapefruit, green tea)
Muscle performance (ACTN3)
24. FAQ on SMS
• Quantum dot: nanocrystal made of semiconductor materials, small enough to exhibit quantum mechanical properties. Excitons (electron/hole pairs) are confined in all spatial dimensions. The frequency of emitted light increases as the size of the dot decreases.
• Blinking: caused by intercepted electrons, or by emission of an electron. On/off times follow a power law.
• FRET: energy transfer without emission of a photon. Efficiency falls off with the inverse 6th power of the distance (working range 10-100 Å). Donor emission and acceptor absorption spectra must overlap.
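The distance dependence of FRET is usually written with the Förster relation E = 1 / (1 + (r / R0)^6), which approaches the inverse 6th power at large distances. A small sketch follows; the Förster radius of 50 Å is a hypothetical value chosen for illustration, since R0 depends on the donor/acceptor pair.

```python
def fret_efficiency(distance_angstrom: float, forster_radius_angstrom: float) -> float:
    """Förster relation: E = 1 / (1 + (r / R0)^6)."""
    return 1.0 / (1.0 + (distance_angstrom / forster_radius_angstrom) ** 6)


R0 = 50.0  # hypothetical Förster radius in Å (assumption, pair-dependent)
for r in (10, 25, 50, 75, 100):
    print(f"r = {r:>3} Å  E = {fret_efficiency(r, R0):.3f}")
```

Efficiency is 50% at r = R0 and drops steeply beyond it, which is why the color shift only appears when the dye is very close to the quantum dot.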