1. Jared Simpson
!
Ontario Institute for Cancer Research
&
Department of Computer Science
University ofToronto
Error correction, assembly
and consensus algorithms
for MinION data
London Calling, May 14th, 2015
3. An overview of NGS assembly
• Illumina data: short reads, very accurate, very deep
• nearly all Illumina assembly is based on exact matching algorithms
• fragmented assemblies
!
• Algorithms for Illumina data do not work for long, noisy reads
• PacBio developed a pipeline (“HGAP”) to assemble their data
• We used this recipe as a starting point but with custom components
3
13. Assembly Polishing
• Consensus problem is viewed as choosing a sequence C’ that maximizes
the probability of the event data
13
C0
= arg max
S2C
P(D|S)
P(D|S) =
rY
k=1
P(ei,k, ei+1,k, ..., ej,k|S, ⇥)
where
23. A simple model
• What is the probability of observing events E given a sequence S?
• Assuming for the moment there are no missing or extra events:
23
P(e1, e2, ..., en|s1, s2, ..., sn, ⇥) =
nY
i=1
P(ei|si, µsi , si )
P(ei|k, µk, k) = N(µk, 2
k)
27. Nanopore HMM
• must consider:
• over segmentation
• under segmentation
• missed short events
• HMM:
• M states: match event to 5-mers
• E states: extra obs. of an event
• K states: no event obs. for 5-mer
27
P(D|S)
P(⇡, e1, e2, ..., en|S, ⇥) =
nY
i=1
P(ei|⇡i, µsi , si )P(⇡i|⇡i 1, S)
P(e1, e2, ..., en|S, ⇥) =
X
⇡
P(⇡, e1, e2, ..., en|S, ⇥)
31. Assembly Accuracy
31
0
5000
10000
0 5000 10000
5 mer count in reference
5mercountindraftassembly
0
3000
6000
9000
12000
TTTTT AAAAA TTTTG CAAAA CTTTT AAAAG CCCCC GGGGG
kmer
count
draft
reference
y
12000
A
C
B
D
0
5000
0 5000 10000
5 mer count in reference
5mercou
0
5000
10000
0 5000 10000
5 mer count in reference
5mercountinpolishedassembly
C D
Draft: 98.5% accuracy Polished: 99.5% accuracy
32. Assembly Accuracy
32
0
3000
6000
9000
12000
TTTTT AAAAA TTTTG CAAAA CTTTT AAAAG CCCCC GGGGG
kmer
count
draft
reference
9000
12000
B
D
0
0 5000 10000
5 mer count in reference
5mer
0
3000
TTTTT AAAAA TTTTG CAAAA CTTTT AAAAG CCCCC GGGGG
kmer
0
5000
10000
0 5000 10000
5 mer count in reference
5mercountinpolishedassembly
0
3000
6000
9000
12000
TTTTT AAAAA TTTTG CAAAA CTTTT AAAAG CCCCC GGGGG
kmer
count
polished
reference
C D
33. Aligning Events to a Reference
• HMM can also align events to a reference genome
!
!
!
!
!
• Read about it here:
• http://simpsonlab.github.io/2015/04/08/eventalign/
33
34. Planned Improvements
• Model dwell duration to better call homopolymers
!
!
!
• SNP calling/genotyping
!
!
!
• Improve scalability to handle larger genomes
• Use signal data during error correction
34
CTAAAAAAAAAAAAGTACA
P(gi|D) =
P(D|gi)P(gi)
P(D)