Karen Miga
03/28/19
GIAB Workshop
Generating high-quality human reference genomes
using PromethION nanopore sequencing
@khmiga
Broader Goal:

Improving Diploid
T2T Assemblies
One (haploid) genome reference assembly
Technology Bottleneck
Long read
Sequencing
Compute:
Assembly
+
PromethION
100 kb+ Reads
Scalable
Assembly Tools
Multi-flow Cells
Requirements for
Long Read Sequencing
Consistency in Assembly Quality
Capacity to Scale:
Parallelized Long-Read Sequencing
Comprehensive Genome
Representation
Sequencing 11 Reference Genomes
in 9 Days
Pipeline: Sequencing/Basecalling (Flip-Flop), Assembly (wtdbg2), Polishing (Racon, Medaka), Scaffolding (HiRise), FINISHED ASSEMBLY
Other diagram labels: 4x, HiC Data, Phasing
Sequencing strategy for
enrichment of UL-reads
https://www.circulomics.com/
Short Read Eliminator Kit (diagram labels): Centrifuge, Wash Step, Re-suspend, Size-selected HMW DNA, gDNA + buffer, x2
[Figure: Number of Bases (Mb) vs. Read Lengths (kb) for Standard HMW DNA Prep and Circulomics Short Read Eliminator Kit; annotations: Decrease, Increase]
[Figure: Enrichment of 100kb+ reads. Fold Enrichment (0 to 40) vs. Read Lengths (kb, 0 to 200)]
Boost in Overall Coverage of 100kb+
[Figure: Coverage (0 to 20) of >10 kb and 100 kb+ reads for HG020, HG02055, HG01243, HG01109, HG00733, GM24385, GM24149, GM24143]
[Figure: bases (Mb, 0 to 8000) by Read Len class: 100kb+, 10-100kb, <10kb]
[Figure: Number of Bases (Gb) vs. Read Length (kb)]
[Figure: per-sample values (axis 30 to 50) for 24143, 24149, 24385, 00733, 01109, 01243, 02055, 02080, 02723, 03098, 03492]
N50s: 44kb
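The read N50 and the share of bases in 100 kb+ reads quoted on these slides can be computed from a list of read lengths. The sketch below is a minimal illustration, not the speaker's code: the function names and the toy read lengths are hypothetical, and it uses the common definition of N50 (the length L such that reads of length L or longer contain at least half of all sequenced bases).

```python
def read_n50(lengths):
    """Length L such that reads >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def bases_in_reads_at_least(lengths, threshold):
    """Total bases contributed by reads of at least `threshold` length."""
    return sum(length for length in lengths if length >= threshold)


# Hypothetical toy data, read lengths in kb (not taken from the slides).
toy_reads_kb = [5, 10, 20, 44, 60, 120]

n50_kb = read_n50(toy_reads_kb)
ul_bases_kb = bases_in_reads_at_least(toy_reads_kb, 100)
print(n50_kb, ul_bases_kb)
```

On real data the lengths would come from the basecalled reads; the input format is left open here because the slides do not specify it.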
High-Throughput Runs
Diploid Genomes: GM24143, GM24149, GM24385, HG00733, HG01109, HG01243, HG02055, HG02080, HG02723, HG03098, HG03492
Flow Cell Throughput (Gb), labeled min/max (axis 0 to 90), values in slide order:
  62 79 80 71 68 74 79 81 71 107 98
  45 40 68 41 64 43 52 62 27 82 88
ave 69 Gb Per Flow Cell
Coverage: 48x, 54x, 69x, 52x, 61x, 57x, 61x, 74x, 47x, 83x, 85x
Total throughput (Gb): 159, 177, 227, 173, 201, 188, 201, 243, 156, 274, 280
100 kb+ Reads (ave 22Gb, 7.3x)
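The averages on this slide can be checked from the per-genome totals. The sketch below rests on two assumptions that are not stated on this slide: three flow cells per genome (taken from the "3 PromethION Flow Cells" bullet later in the deck) and a nominal 3.0 Gb genome size as the coverage divisor. The per-genome coverage labels do not all follow from that divisor, so treat the coverage helper as illustrative only.

```python
from statistics import mean

# Total throughput per genome (Gb), in the order listed on the slide.
total_gb = [159, 177, 227, 173, 201, 188, 201, 243, 156, 274, 280]

FLOW_CELLS_PER_GENOME = 3  # assumption: from the "3 PromethION Flow Cells" bullet
GENOME_SIZE_GB = 3.0       # assumption: nominal divisor, not given in the slides


def fold_coverage(bases_gb, genome_size_gb=GENOME_SIZE_GB):
    """Fold coverage = sequenced bases / genome size (both in Gb)."""
    return bases_gb / genome_size_gb


mean_total_gb = mean(total_gb)
mean_per_flow_cell_gb = mean_total_gb / FLOW_CELLS_PER_GENOME

print(round(mean_total_gb))             # average total throughput per genome
print(round(mean_per_flow_cell_gb))     # average throughput per flow cell
print(round(fold_coverage(207)))        # coverage for 207 Gb
print(round(fold_coverage(22), 1))      # coverage for 22 Gb of 100 kb+ reads
```

Under these assumptions the results agree with the slide's "ave 69 Gb Per Flow Cell" and "(ave 22Gb, 7.3x)".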
Evaluation of Read Accuracy
[Figure: Sequence Identity (0.5 to 1.0), Flip-flop vs. Non-flip flop, HG00733 Flow Cell Replicates]
Assembly Performance:
Base Accuracy
HG00733
Consensus Base Accuracy (GRCh38): 99.18%
2.76 Gb aligned

• Alignments are not phased
• Additional polishing steps
(Pilon/methylation-aware
polishing)
• Alignments are not to the
individual's genome
Complete BAC alignments
21 BACs: 3.1Mb
0.9976
NA12878 ONT (NBT 2018, update):
Nanopolish (x2), CpG methylation-mode
(Sergey Koren and Adam Phillippy)
*
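Accuracy figures like the 99.18% and 0.9976 values above are often restated on the Phred scale, Q = -10 log10(1 - accuracy), which makes small differences near 100% easier to compare. The slides do not report Q values, so this conversion is an added illustration, and the function name is hypothetical.

```python
import math


def phred_from_accuracy(accuracy):
    """Convert a fractional base accuracy (0 <= a < 1) to a Phred-scaled Q value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


# The two accuracy values that appear on the slide.
for accuracy in (0.9918, 0.9976):
    print(f"{accuracy:.4f} -> Q{phred_from_accuracy(accuracy):.1f}")
```

An accuracy of exactly 1.0 is rejected because the error rate would be zero and the logarithm undefined.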
• 6 mos (May-Oct)
• 62 MinION Flow Cells
• 155Gb (50X Coverage)
• N50s 70kb
• 44Gb 100kb+ (16.5x)
• 4 days
• 3 PromethION Flow Cells
• 207 Gb (69X Coverage)
• N50s 44 kb
• 22Gb 100kb+ (7x)
10 Reference Genome Assemblies
in 10 Days
Not yet running
at full capacity
Improving Assembly and Polishing:

Reduce cost, improve quality
Haplotype
Phasing
Acknowledgements
Benedict Paten, Mark Akeson, David Haussler
Simon Mayes
Vania Costa
Daniel Garalde
David Stoddart
Rosemary Dokos
Jon Pugh
Chris Seymour
Chris Wright
ONT
TEAM
Adam Novak
Glenn Hickey
Jordan Eizenga
Erik Garrison
Jean Monlong
Xian Chang
Miten Jain
Hugh Olsen
Kristof Tigyi
Marina Haukness
Ryan Lorig-Roach
Trevor Pesout
Joel Armstrong
Nicholas Maurer
Justin Zook, Nate Olson

New data from giab genomes promethion
