Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Jared Simpson
!
Ontario Institute for Cancer Research
&
Department of Computer Science
University ofToronto
Error correcti...
Our collaboration
2
An overview of NGS assembly
• Illumina data: short reads, very accurate, very deep	

• nearly all Illumina assembly is bas...
Long read assembly pipeline
4
Error correction
Celera Assembler
Consensus
Input reads
Genome Assembly
Input Data
• First challenge is finding overlaps for reads with 15-20% errors
5
Overlap Detection
6
we use github.com/thegenemyers/daligner to compute overlaps
Partial Order Graphs
7
add read GCTACGAT that we want to correct to graph
Partial Order Graphs
8
add sequence GCTCGAT to graph
Partial Order Graphs
9
add sequence GCTCGATT to graph
Partial Order Graphs
10
maximum weight path GCTCGAT is the corrected read
Error Correction
11
Contig Assembly
12
Celera Assembler produces one contig at 98.5% identity
Assembly Polishing
• Consensus problem is viewed as choosing a sequence C’ that maximizes
the probability of the event dat...
Selecting a Consensus
14
Mutate
ACTACGATCGACTTACGA
CCTACGATCGACTTACGA
TCTACGATCGACTTACGA
...
-CTACGATCGACTTACGA
G-TACGATCG...
Selecting a Consensus
15
GCT-CGATCGACTTACGA
Mutate
A
C
T
...
-
G
GC
GCT
...
G
G
G
G
P(D|S)
-190
-187
-192
-176
-191
-193
-...
Pore Models
16
Generating Events
• What do we expect events from a given sequence to look like?
17
GCTACGATT
Sample Current ●●●
●
●●●●●
●...
Generating Events
• What do we expect events from a given sequence to look like?
18
GCTACGATT
Sample Current ●●●
●
●●●●●
●...
Generating Events
• What do we expect events from a given sequence to look like?
19
GCTACGATT
Sample Current ●●●
●
●●●●●
●...
Generating Events
• What do we expect events from a given sequence to look like?
20
GCTACGATT
Sample Current ●●●
●
●●●●●
●...
Generating Events
• What do we expect events from a given sequence to look like?
21
CTACGATT
Sample Current ●●●
●
●●●●●
●●...
Event Detection
22
●●●
●
●●●●●
●●
●
●
●●
●
●●●
●
●●
●
●●●●●
●
●
●
●●●●
●
●●
●●
●●●●●
●
●
●●●●
●
●●
●●
●
●
●
●●●
●●●
●●●●
●...
A simple model
• What is the probability of observing events E given a sequence S?	

• Assuming for the moment there are n...
Complications
24
●
●
●●
●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
...
Complications
25
●
●
●●
●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
...
Complications
26
●
●
●●
●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
...
Nanopore HMM
• must consider:	

• over segmentation	

• under segmentation	

• missed short events	

• HMM:	

• M states: ...
Transition Probabilities
• Probability of not observing an
event is a function of absolute
difference between (expected)
c...
Transition Probabilities
• Probability of not observing an
event is a function of absolute
difference between (expected)
c...
Transition Probabilities
30
●
●
●
●
●
●
●
●
●
●
●
● ● ● ●
●
●
●
● ● ● ● ●
● ●
● ● ●
●
●
●
●
● ●
●
●
●
●
● ●
0.1
0.2
0.3
0....
Assembly Accuracy
31
0
5000
10000
0 5000 10000
5 mer count in reference
5mercountindraftassembly
0
3000
6000
9000
12000
TT...
Assembly Accuracy
32
0
3000
6000
9000
12000
TTTTT AAAAA TTTTG CAAAA CTTTT AAAAG CCCCC GGGGG
kmer
count
draft
reference
900...
Aligning Events to a Reference
• HMM can also align events to a reference genome



!
!
!
!
!
• Read about it here:	

• ht...
Planned Improvements
• Model dwell duration to better call homopolymers	

!
!
!
• SNP calling/genotyping	

!
!
!
• Improve...
Acknowledgements & Code
• Collaborators:	

• Nick Loman, Josh Quick (Birmingham)	

• Jonathan Dursi (OICR)	

!
!
• Code:	
...
Upcoming SlideShare
Loading in …5
×

150514 jts london_calling

6,516 views

Published on

London Calling 2015

Published in: Science

150514 jts london_calling

  1. 1. Jared Simpson ! Ontario Institute for Cancer Research & Department of Computer Science University ofToronto Error correction, assembly and consensus algorithms for MinION data London Calling, May 14th, 2015
  2. 2. Our collaboration 2
  3. 3. An overview of NGS assembly • Illumina data: short reads, very accurate, very deep • nearly all Illumina assembly is based on exact matching algorithms • fragmented assemblies ! • Algorithms for Illumina data do not work for long, noisy reads • PacBio developed a pipeline (“HGAP”) to assemble their data • We used this recipe as a starting point but with custom components 3
  4. 4. Long read assembly pipeline 4 Error correction Celera Assembler Consensus Input reads Genome Assembly
  5. 5. Input Data • First challenge is finding overlaps for reads with 15-20% errors 5
  6. 6. Overlap Detection 6 we use github.com/thegenemyers/daligner to compute overlaps
  7. 7. Partial Order Graphs 7 add read GCTACGAT that we want to correct to graph
  8. 8. Partial Order Graphs 8 add sequence GCTCGAT to graph
  9. 9. Partial Order Graphs 9 add sequence GCTCGATT to graph
  10. 10. Partial Order Graphs 10 maximum weight path GCTCGAT is the corrected read
  11. 11. Error Correction 11
  12. 12. Contig Assembly 12 Celera Assembler produces one contig at 98.5% identity
  13. 13. Assembly Polishing • Consensus problem is viewed as choosing a sequence C’ that maximizes the probability of the event data 13 C0 = arg max S2C P(D|S) P(D|S) = rY k=1 P(ei,k, ei+1,k, ..., ej,k|S, ⇥) where
  14. 14. Selecting a Consensus 14 Mutate ACTACGATCGACTTACGA CCTACGATCGACTTACGA TCTACGATCGACTTACGA ... -CTACGATCGACTTACGA G-TACGATCGACTTACGA GC-ACGATCGACTTACGA GCT-CGATCGACTTACGA ... GACTACGATCGACTTACGA GCCTACGATCGACTTACGA GGCTACGATCGACTTACGA GTCTACGATCGACTTACGA P(D|S) -190 -187 -192 -176 -191 -193 -168 -198 -191 -195 -181 GCTACGATCGACTTACGA
  15. 15. Selecting a Consensus 15 GCT-CGATCGACTTACGA Mutate A C T ... - G GC GCT ... G G G G P(D|S) -190 -187 -192 -176 -191 -193 -168 -198 -191 -195 -181 Select new consensus GCT-CGATCGACTTACGA -168
  16. 16. Pore Models 16
  17. 17. Generating Events • What do we expect events from a given sequence to look like? 17 GCTACGATT Sample Current ●●● ● ●●●●● ●● ● ● ●● ● ●●● ● ●● ● ●●●●● ● ● ● ●●●● ● ●● ●● ●●●●● ● ● ●●●● ● ●● ●● ● ● ● ●●● ●●● ●●●● ●● ● ● ●●●● ● ● ● ● ●●● ●● ●●●●●●● ● ● ● ● ● ●●● ● ● ●●●●●●●● ● ● ● ●●●● ● ●●● ●●● ● ●● ●●● ●● ●● ●●● ● ●● ●●●●●●●● ● ●●●● ● ●●●●●● ●●●● ●●● ● ●●●● ● ● ●● ● ● ●●●●●●● ●●● ●●● ●●● ● ●●● ●● ●● ● ● ●●● ●● ●●● ●●●●●●● ● ● ● ●●●● ● ● ●●●●● ● ●●●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●●● ●● ●● ● ● ●● ● ● ●●●●● ● ● ●● ●●●● ● ● ●●● ● ● ●● ●● ●● ● ●● ● ●●●●●●●●●●● ●●● ● ●● ● ●●● ● ●●● ● ● ●●●●●●● ● ●● ● ●● ●●●●● ● ● ●● ● ● ●● ●● ● ● ●● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ●●● ● ●●●●●●●●●●●●●● ● ● ●●●●●● ●● ●● ● ●● ●●● ●●● ●●● ● ●● ● ● ●● ●● ●●●●● ●● ●●● ●●●● ● ● ●●● ● ●● ●●● ● ●●●●●● ● ● ●● ●●● ●●●● ● ● ●●● ● ● ● ●●●● ● ●●●●● ●● ● ● ●● ● ●●●● ●●● ● ●● ●● ●● ● ●●●●● ●●●●● ●● ● 0 25 50 75 100 0.00 0.25 0.50 0.75 1.00 1.25 time (s)Current(pA)
  18. 18. Generating Events • What do we expect events from a given sequence to look like? 18 GCTACGATT Sample Current ●●● ● ●●●●● ●● ● ● ●● ● ●●● ● ●● ● ●●●●● ● ● ● ●●●● ● ●● ●● ●●●●● ● ● ●●●● ● ●● ●● ● ● ● ●●● ●●● ●●●● ●● ● ● ●●●● ● ● ● ● ●●● ●● ●●●●●●● ● ● ● ● ● ●●● ● ● ●●●●●●●● ● ● ● ●●●● ● ●●● ●●● ● ●● ●●● ●● ●● ●●● ● ●● ●●●●●●●● ● ●●●● ● ●●●●●● ●●●● ●●● ● ●●●● ● ● ●● ● ● ●●●●●●● ●●● ●●● ●●● ● ●●● ●● ●● ● ● ●●● ●● ●●● ●●●●●●● ● ● ● ●●●● ● ● ●●●●● ● ●●●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●●● ●● ●● ● ● ●● ● ● ●●●●● ● ● ●● ●●●● ● ● ●●● ● ● ●● ●● ●● ● ●● ● ●●●●●●●●●●● ●●● ● ●● ● ●●● ● ●●● ● ● ●●●●●●● ● ●● ● ●● ●●●●● ● ● ●● ● ● ●● ●● ● ● ●● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ●●● ● ●●●●●●●●●●●●●● ● ● ●●●●●● ●● ●● ● ●● ●●● ●●● ●●● ● ●● ● ● ●● ●● ●●●●● ●● ●●● ●●●● ● ● ●●● ● ●● ●●● ● ●●●●●● ● ● ●● ●●● ●●●● ● ● ●●● ● ● ● ●●●● ● ●●●●● ●● ● ● ●● ● ●●●● ●●● ● ●● ●● ●● ● ●●●●● ●●●●● ●● ● ●● ●● ●●●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ●●●●●●● ● ●● ● ●●●● ●● ● ● ● ● ● ●● ● ● ● ●●●●● ● ●●● ● ● ● ●● ● ● ●●●●●● ●● ● ●●●● ●● ● ● ●●● ●●●● ● ● ●●●●● ●● ●● ● ● ●●●●●●● 0 25 50 75 100 0.00 0.25 0.50 0.75 1.00 1.25 time (s)Current(pA)
  19. 19. Generating Events • What do we expect events from a given sequence to look like? 19 GCTACGATT Sample Current ●●● ● ●●●●● ●● ● ● ●● ● ●●● ● ●● ● ●●●●● ● ● ● ●●●● ● ●● ●● ●●●●● ● ● ●●●● ● ●● ●● ● ● ● ●●● ●●● ●●●● ●● ● ● ●●●● ● ● ● ● ●●● ●● ●●●●●●● ● ● ● ● ● ●●● ● ● ●●●●●●●● ● ● ● ●●●● ● ●●● ●●● ● ●● ●●● ●● ●● ●●● ● ●● ●●●●●●●● ● ●●●● ● ●●●●●● ●●●● ●●● ● ●●●● ● ● ●● ● ● ●●●●●●● ●●● ●●● ●●● ● ●●● ●● ●● ● ● ●●● ●● ●●● ●●●●●●● ● ● ● ●●●● ● ● ●●●●● ● ●●●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●●● ●● ●● ● ● ●● ● ● ●●●●● ● ● ●● ●●●● ● ● ●●● ● ● ●● ●● ●● ● ●● ● ●●●●●●●●●●● ●●● ● ●● ● ●●● ● ●●● ● ● ●●●●●●● ● ●● ● ●● ●●●●● ● ● ●● ● ● ●● ●● ● ● ●● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ●●● ● ●●●●●●●●●●●●●● ● ● ●●●●●● ●● ●● ● ●● ●●● ●●● ●●● ● ●● ● ● ●● ●● ●●●●● ●● ●●● ●●●● ● ● ●●● ● ●● ●●● ● ●●●●●● ● ● ●● ●●● ●●●● ● ● ●●● ● ● ● ●●●● ● ●●●●● ●● ● ● ●● ● ●●●● ●●● ● ●● ●● ●● ● ●●●●● ●●●●● ●● ● ●● ●● ●●●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ●●●●●●● ● ●● ● ●●●● ●● ● ● ● ● ● ●● ● ● ● ●●●●● ● ●●● ● ● ● ●● ● ● ●●●●●● ●● ● ●●●● ●● ● ● ●●● ●●●● ● ● ●●●●● ●● ●● ● ● ●●●●●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ●● ●● ●● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ●● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● 0 25 50 75 100 0.00 0.25 0.50 0.75 1.00 1.25 time (s)Current(pA)
  20. 20. Generating Events • What do we expect events from a given sequence to look like? 20 GCTACGATT Sample Current ●●● ● ●●●●● ●● ● ● ●● ● ●●● ● ●● ● ●●●●● ● ● ● ●●●● ● ●● ●● ●●●●● ● ● ●●●● ● ●● ●● ● ● ● ●●● ●●● ●●●● ●● ● ● ●●●● ● ● ● ● ●●● ●● ●●●●●●● ● ● ● ● ● ●●● ● ● ●●●●●●●● ● ● ● ●●●● ● ●●● ●●● ● ●● ●●● ●● ●● ●●● ● ●● ●●●●●●●● ● ●●●● ● ●●●●●● ●●●● ●●● ● ●●●● ● ● ●● ● ● ●●●●●●● ●●● ●●● ●●● ● ●●● ●● ●● ● ● ●●● ●● ●●● ●●●●●●● ● ● ● ●●●● ● ● ●●●●● ● ●●●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●●● ●● ●● ● ● ●● ● ● ●●●●● ● ● ●● ●●●● ● ● ●●● ● ● ●● ●● ●● ● ●● ● ●●●●●●●●●●● ●●● ● ●● ● ●●● ● ●●● ● ● ●●●●●●● ● ●● ● ●● ●●●●● ● ● ●● ● ● ●● ●● ● ● ●● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ●●● ● ●●●●●●●●●●●●●● ● ● ●●●●●● ●● ●● ● ●● ●●● ●●● ●●● ● ●● ● ● ●● ●● ●●●●● ●● ●●● ●●●● ● ● ●●● ● ●● ●●● ● ●●●●●● ● ● ●● ●●● ●●●● ● ● ●●● ● ● ● ●●●● ● ●●●●● ●● ● ● ●● ● ●●●● ●●● ● ●● ●● ●● ● ●●●●● ●●●●● ●● ● ●● ●● ●●●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ●●●●●●● ● ●● ● ●●●● ●● ● ● ● ● ● ●● ● ● ● ●●●●● ● ●●● ● ● ● ●● ● ● ●●●●●● ●● ● ●●●● ●● ● ● ●●● ●●●● ● ● ●●●●● ●● ●● ● ● ●●●●●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ●● ●● ●● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ●● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ●● ● ● ●●● ● ● ●●● ●●● ● ●●● ● ● ● ●●● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ●● ●● 0 25 50 75 100 0.00 0.25 0.50 0.75 1.00 1.25 time (s)Current(pA)
  21. 21. Generating Events • What do we expect events from a given sequence to look like? 21 CTACGATT Sample Current ●●● ● ●●●●● ●● ● ● ●● ● ●●● ● ●● ● ●●●●● ● ● ● ●●●● ● ●● ●● ●●●●● ● ● ●●●● ● ●● ●● ● ● ● ●●● ●●● ●●●● ●● ● ● ●●●● ● ● ● ● ●●● ●● ●●●●●●● ● ● ● ● ● ●●● ● ● ●●●●●●●● ● ● ● ●●●● ● ●●● ●●● ● ●● ●●● ●● ●● ●●● ● ●● ●●●●●●●● ● ●●●● ● ●●●●●● ●●●● ●●● ● ●●●● ● ● ●● ● ● ●●●●●●● ●●● ●●● ●●● ● ●●● ●● ●● ● ● ●●● ●● ●●● ●●●●●●● ● ● ● ●●●● ● ● ●●●●● ● ●●●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●●● ●● ●● ● ● ●● ● ● ●●●●● ● ● ●● ●●●● ● ● ●●● ● ● ●● ●● ●● ● ●● ● ●●●●●●●●●●● ●●● ● ●● ● ●●● ● ●●● ● ● ●●●●●●● ● ●● ● ●● ●●●●● ● ● ●● ● ● ●● ●● ● ● ●● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ●●● ● ●●●●●●●●●●●●●● ● ● ●●●●●● ●● ●● ● ●● ●●● ●●● ●●● ● ●● ● ● ●● ●● ●●●●● ●● ●●● ●●●● ● ● ●●● ● ●● ●●● ● ●●●●●● ● ● ●● ●●● ●●●● ● ● ●●● ● ● ● ●●●● ● ●●●●● ●● ● ● ●● ● ●●●● ●●● ● ●● ●● ●● ● ●●●●● ●●●●● ●● ● ●● ●● ●●●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ●●●●●●● ● ●● ● ●●●● ●● ● ● ● ● ● ●● ● ● ● ●●●●● ● ●●● ● ● ● ●● ● ● ●●●●●● ●● ● ●●●● ●● ● ● ●●● ●●●● ● ● ●●●●● ●● ●● ● ● ●●●●●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ●● ●● ●● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ●● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ●● ● ● ●●● ● ● ●●● ●●● ● ●●● ● ● ● ●●● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ●●●● ●●● ● ● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ●●●●●● ● ● ● ● ●● ●●●● ● ●●● ●● ● ● ● ● ●●● ●● ● ● ● ●●● ● ● ● ●● ●● ● ● ● ● ●● ● ●● ●●●● ●●● ● ● ● ● ●● ●● ●●● ● ● ●● ●●● ● ● ● ●●● ●●●●● ● ● ● ● ● ● 0 25 50 75 100 0.00 0.25 0.50 0.75 1.00 1.25 time (s)Current(pA)
  22. 22. Event Detection 22 ●●● ● ●●●●● ●● ● ● ●● ● ●●● ● ●● ● ●●●●● ● ● ● ●●●● ● ●● ●● ●●●●● ● ● ●●●● ● ●● ●● ● ● ● ●●● ●●● ●●●● ●● ● ● ●●●● ● ● ● ● ●●● ●● ●●●●●●● ● ● ● ● ● ●●● ● ● ●●●●●●●● ● ● ● ●●●● ● ●●● ●●● ● ●● ●●● ●● ●● ●●● ● ●● ●●●●●●●● ● ●●●● ● ●●●●●● ●●●● ●●● ● ●●●● ● ● ●● ● ● ●●●●●●● ●●● ●●● ●●● ● ●●● ●● ●● ● ● ●●● ●● ●●● ●●●●●●● ● ● ● ●●●● ● ● ●●●●● ● ●●●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●●● ●● ●● ● ● ●● ● ● ●●●●● ● ● ●● ●●●● ● ● ●●● ● ● ●● ●● ●● ● ●● ● ●●●●●●●●●●● ●●● ● ●● ● ●●● ● ●●● ● ● ●●●●●●● ● ●● ● ●● ●●●●● ● ● ●● ● ● ●● ●● ● ● ●● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ●●● ● ●●●●●●●●●●●●●● ● ● ●●●●●● ●● ●● ● ●● ●●● ●●● ●●● ● ●● ● ● ●● ●● ●●●●● ●● ●●● ●●●● ● ● ●●● ● ●● ●●● ● ●●●●●● ● ● ●● ●●● ●●●● ● ● ●●● ● ● ● ●●●● ● ●●●●● ●● ● ● ●● ● ●●●● ●●● ● ●● ●● ●● ● ●●●●● ●●●●● ●● ● ●● ●● ●●●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ●●●●●●● ● ●● ● ●●●● ●● ● ● ● ● ● ●● ● ● ● ●●●●● ● ●●● ● ● ● ●● ● ● ●●●●●● ●● ● ●●●● ●● ● ● ●●● ●●●● ● ● ●●●●● ●● ●● ● ● ●●●●●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ●● ●● ●● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ●● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ●● ● ● ●●● ● ● ●●● ●●● ● ●●● ● ● ● ●●● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ●●●● ●●● ● ● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ●●●●●● ● ● ● ● ●● ●●●● ● ●●● ●● ● ● ● ● ●●● ●● ● ● ● ●●● ● ● ● ●● ●● ● ● ● ● ●● ● ●● ●●●● ●●● ● ● ● ● ●● ●● ●●● ● ● ●● ●●● ● ● ● ●●● ●●●●● ● ● ● ● ● ● 0 25 50 75 100 0.00 0.25 0.50 0.75 1.00 1.25 time (s)Current(pA) Event mean current (pA) current stdv duration (s) 1 60.3 0.7 0.521 2 40.6 1.0 0.112 3 52.2 2.0 0.356 4 54.1 1.2 0.291 5 49.5 1.5 0.141
  23. 23. A simple model • What is the probability of observing events E given a sequence S? • Assuming for the moment there are no missing or extra events: 23 P(e1, e2, ..., en|s1, s2, ..., sn, ⇥) = nY i=1 P(ei|si, µsi , si ) P(ei|k, µk, k) = N(µk, 2 k)
  24. 24. Complications 24 ● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ● ●●●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●●● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ●● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ●●● ● ● ● ●● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●● ●● ● ● ●●● ● ● ●●●● ● ● ●● ● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●●● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ●●● ● ● ● ● ● 30 40 50 60 70 0.0 0.2 0.4 0.6 time (s) Current(pA)
  25. 25. Complications 25 ● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ● ●●●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●●● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ●● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ●●● ● ● ● ●● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●● ●● ● ● ●●● ● ● ●●●● ● ● ●● ● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●●● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ●●● ● ● ● ● ● 30 40 50 60 70 0.0 0.2 0.4 0.6 time (s) Current(pA) Is this an event ?
  26. 26. Complications 26 ● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ● ●●●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●●● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ●● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ●●● ● ● ● ●● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●● ●● ● ● ●●● ● ● ●●●● ● ● ●● ● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●●● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ●●● ● ● ● ● ● 30 40 50 60 70 0.0 0.2 0.4 0.6 time (s) Current(pA) One event or two ?
  27. 27. Nanopore HMM • must consider: • over segmentation • under segmentation • missed short events • HMM: • M states: match event to 5-mers • E states: extra obs. of an event • K states: no event obs. for 5-mer 27 P(D|S) P(⇡, e1, e2, ..., en|S, ⇥) = nY i=1 P(ei|⇡i, µsi , si )P(⇡i|⇡i 1, S) P(e1, e2, ..., en|S, ⇥) = X ⇡ P(⇡, e1, e2, ..., en|S, ⇥)
  28. 28. Transition Probabilities • Probability of not observing an event is a function of absolute difference between (expected) current 28 ● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ● ●●●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●●● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ●● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ●●● ● ● ● ●● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●● ●● ● ● ●●● ● ● ●●●● ● ● ●● ● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●●● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ●●● ● ● ● ● ● 30 40 50 60 70 0.0 0.2 0.4 0.6 time (s) Current(pA)
  29. 29. Transition Probabilities • Probability of not observing an event is a function of absolute difference between (expected) current 29 ● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ● ●●●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●●● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ●● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ●●● ● ● ● ●● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●● ●● ● ● ●●● ● ● ●●●● ● ● ●● ● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●●● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ●●● ● ● ● ● ● 30 40 50 60 70 0.0 0.2 0.4 0.6 time (s) Current(pA)
  30. 30. Transition Probabilities 30 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.1 0.2 0.3 0.4 0.5 0 5 10 15 20 absolute difference (pA) SkipProbability
  31. 31. Assembly Accuracy 31 0 5000 10000 0 5000 10000 5 mer count in reference 5mercountindraftassembly 0 3000 6000 9000 12000 TTTTT AAAAA TTTTG CAAAA CTTTT AAAAG CCCCC GGGGG kmer count draft reference y 12000 A C B D 0 5000 0 5000 10000 5 mer count in reference 5mercou 0 5000 10000 0 5000 10000 5 mer count in reference 5mercountinpolishedassembly C D Draft: 98.5% accuracy Polished: 99.5% accuracy
  32. 32. Assembly Accuracy 32 0 3000 6000 9000 12000 TTTTT AAAAA TTTTG CAAAA CTTTT AAAAG CCCCC GGGGG kmer count draft reference 9000 12000 B D 0 0 5000 10000 5 mer count in reference 5mer 0 3000 TTTTT AAAAA TTTTG CAAAA CTTTT AAAAG CCCCC GGGGG kmer 0 5000 10000 0 5000 10000 5 mer count in reference 5mercountinpolishedassembly 0 3000 6000 9000 12000 TTTTT AAAAA TTTTG CAAAA CTTTT AAAAG CCCCC GGGGG kmer count polished reference C D
  33. 33. Aligning Events to a Reference • HMM can also align events to a reference genome
 
 ! ! ! ! ! • Read about it here: • http://simpsonlab.github.io/2015/04/08/eventalign/ 33
  34. 34. Planned Improvements • Model dwell duration to better call homopolymers ! ! ! • SNP calling/genotyping ! ! ! • Improve scalability to handle larger genomes • Use signal data during error correction 34 CTAAAAAAAAAAAAGTACA P(gi|D) = P(D|gi)P(gi) P(D)
  35. 35. Acknowledgements & Code • Collaborators: • Nick Loman, Josh Quick (Birmingham) • Jonathan Dursi (OICR) ! ! • Code: • github.com/jts/nanocorrect (error correction) • github.com/jts/nanopolish (signal-level algorithms) • github.com/jts/nanopore-paper-analysis (reproduce our paper) 35

×