Generating high-quality human
reference genomes using PromethION
nanopore sequencing
Karen Miga
khmiga
One (haploid) genome reference
assembly for all of humanity
Deletion
Large Structural Variation
Insertion
SNV
Computational Pan-Genomics TEAM
Sampling to maximize
allelic diversity
Analysis: Adam Phillippy (NIH/NHGRI), Fritz Sedlazeck (Baylor), Benedict Paten (UCSC)
Top 10 Individuals Selected to Maximize Diversity
Emphasis on Phasing: Selection of Trios
Emphasis on Quality: Low Passage Cells
High-quality, phased reference genomes
Collaboration: Adam Phillippy (NIH/NHGRI)
High Coverage PacBio Genome Assemblies From the Same Individuals
Technology Bottleneck
Dramatic increase in the production of high-quality reference assemblies
Long-read Sequencing + Compute: Assembly
Long Read Sequencing: PromethION
• Comprehensive Genome Representation
• Consistency in Assembly Quality
• Capacity to Scale: Parallelized Long-Read Sequencing
100 kb+ Reads
Scalable Assembly Tools
Multi-flow Cells
Sequencing 11 Reference Genomes in 9 Days
Workflow:
Sequencing/Basecalling: Flip-Flop
Assembly: wtdbg2
Polishing: Racon, Medaka (4x)
Scaffolding: HiRise (HiC Data)
Phasing
FINISHED ASSEMBLY
Sequencing strategy for enrichment of long reads
Circulomics Short Read Eliminator Kit (https://www.circulomics.com/)
gDNA + buffer → Centrifuge → Wash Step (x2) → Re-suspend → Size-selected HMW DNA
[Figure: Number of Bases (Mb) vs. Read Lengths (kb), Standard HMW DNA Prep vs. Circulomics Short Read Eliminator Kit; annotations: Decrease, Increase]
Enrichment of 100kb+ reads
[Figure: Fold Enrichment (0 to 40) vs. Read Lengths (0 to 200 kb)]
Boost in Overall Coverage of 100kb+
[Figure: Coverage (0 to 20) per sample (HG020, HG02055, HG01243, HG01109, HG00733, GM24385, GM24149, GM24143), >10 kb vs. 100 kb+; annotations: 7x, 5x]
[Figure: Number of Bases (Mb) per sample (24143, 24149, 24385, 00733, 01109, 01243, 02055, 02080, 02723, 03098, 03492), binned by read length: <10kb, 10-100kb, 100kb+]
Read Length N50s: 44kb
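The read length N50 quoted above is a standard summary statistic. As a minimal sketch (the function name and the example lengths are hypothetical, not taken from the slides), it can be computed by sorting reads from longest to shortest and reporting the length at which the running total first reaches half of all sequenced bases:

```python
def read_n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no reads given")


# Hypothetical read lengths in kb (illustrative only, not data from the talk).
example_lengths_kb = [120, 80, 44, 30, 20, 10, 5]
example_n50 = read_n50(example_lengths_kb)
```

Because the statistic is weighted by bases rather than by read count, a small number of very long reads can raise the N50 noticeably, which is why size selection affects it.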
[Figure: Number of Bases (Gb, 0 to 4.0) vs. Read Length (0 to 220 kb) for GM24143, GM24149, GM24385, HG00733, HG01109, HG01243, HG02055, HG02080, HG02723, HG03098, HG03492]
Sequencing Throughput
Diploid Genomes; Flow Cell Throughput (Gb), ave 69 Gb Per Flow Cell

Sample    Flow cell min  Flow cell max  Total throughput (Gb)  Coverage
GM24143   45             62             159                    48x
GM24149   40             79             177                    54x
GM24385   68             80             227                    69x
HG00733   41             71             173                    52x
HG01109   64             68             201                    61x
HG01243   43             74             188                    57x
HG02055   52             79             201                    61x
HG02080   62             81             243                    74x
HG02723   27             71             156                    47x
HG03098   82             107            274                    83x
HG03492   88             98             280                    85x

100 kb+ Reads (ave 22Gb, 7.3x)
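The per-sample coverage values above follow from dividing total throughput by a genome size. The sketch below reproduces that arithmetic under two assumptions that are not stated on the slides: the divisor of 3.3 Gb is back-calculated from the listed totals and coverages, and the pairing of samples with values follows the order in which they appear.

```python
# Total throughput per sample in Gb, in the order listed on the slide
# (the sample-to-value pairing is an assumption based on that order).
total_gb = {
    "GM24143": 159, "GM24149": 177, "GM24385": 227, "HG00733": 173,
    "HG01109": 201, "HG01243": 188, "HG02055": 201, "HG02080": 243,
    "HG02723": 156, "HG03098": 274, "HG03492": 280,
}

# Assumption: not given in the source; inferred from throughput / coverage.
ASSUMED_GENOME_GB = 3.3


def fold_coverage(bases_gb, genome_gb=ASSUMED_GENOME_GB):
    """Fold coverage = sequenced bases divided by genome size."""
    return bases_gb / genome_gb


coverage = {sample: round(fold_coverage(gb)) for sample, gb in total_gb.items()}
```

With this divisor the rounded results agree with the coverage column for all eleven samples, which is the only basis for the 3.3 Gb figure used here.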
Evaluation of Read Accuracy
[Figures: Sequence Identity (GRCh38) by individual (02055, 01109, 24385, 03492, 24143, 02723, 01243, 24149, 02080, 00733, 03098); Flip-flop vs. Non-flip flop basecalling on HG00733 flow cell replicates]
Sequence Identity mode: 0.915, median: 0.88203
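The mode and median above summarize a distribution of per-read identities. A minimal sketch of how such summaries could be computed is shown here; the bin width, the helper name `binned_mode`, and the example identities are all assumptions for illustration, and the slides do not say how the reported mode was derived.

```python
from collections import Counter
from statistics import median


def binned_mode(values, bin_width=0.005):
    """Return the center of the most populated bin of width `bin_width`."""
    bins = Counter(int(v / bin_width) for v in values)
    top_bin, _count = bins.most_common(1)[0]
    return (top_bin + 0.5) * bin_width


# Hypothetical per-read identities (illustrative only, not data from the talk).
example_identities = [0.80, 0.86, 0.912, 0.913, 0.914, 0.95]
example_mode = binned_mode(example_identities)
example_median = median(example_identities)
```

A mode above the median, as reported on the slide, is what a left-skewed distribution produces: most reads cluster at high identity while a tail of lower-identity reads pulls the median down.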
Assembly Performance: Base Accuracy
HG00733
Consensus Base Accuracy (GRCh38): 99.18% (2.76 Gb aligned)
• Not phased alignments
• Additional polishing steps (pilon/methylation-aware polishing)
• Alignments are not to the individual's genome
Complete BAC alignments (21 BACs: 3.1 Mb): 0.9976
* NA12878 ONT (NBT 2018, update): Nanopolish (x2), CpG methylation-mode (Sergey Koren and Adam Phillippy)
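Consensus accuracies such as 99.18% and 0.9976 are often easier to compare on the Phred scale, QV = -10 log10(error rate). The conversion below is a sketch using that standard formula; the slides report accuracy only, so the QV values are derived here and are not quoted from the talk.

```python
import math


def phred_qv(accuracy):
    """Convert a fractional base accuracy to a Phred-scaled quality value."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0 for a finite QV")
    return -10.0 * math.log10(error)


qv_grch38 = phred_qv(0.9918)  # consensus accuracy against GRCh38
qv_bac = phred_qv(0.9976)     # accuracy over the complete BAC alignments
```

These work out to roughly QV 20.9 and QV 26.2, so the BAC-based estimate corresponds to about a third of the error rate measured against GRCh38.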
Assembly Performance: Contiguity
Workflow consistency
[Figure: Contig size (Mb)]
Assembly Performance: Contiguity + HiC
[Figure: Scaffold size (Mb)]
Assembly Performance: Spanning Repeats
Test against a benchmark set of GRCh38 segmental duplications (SD): 146 Sites
1. No assembly gaps (50kb+/-)
2. Flanking unique sequences (10 kb +/-)
Mid-Range (25kb-100kb): 121/122 Regions
Large (100kb+): 23/24 Regions
Assembly spans: 98.6%
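The 98.6% figure can be checked from the region counts. This sketch assumes the two counts pair with the size classes in the order they appear on the slide (121/122 mid-range, 23/24 large); the dictionary keys are hypothetical labels.

```python
# (spanned, total) regions per size class; the pairing of counts to classes
# is an assumption based on the order in which they appear on the slide.
sd_regions = {
    "mid_range_25kb_100kb": (121, 122),
    "large_100kb_plus": (23, 24),
}

spanned = sum(s for s, _ in sd_regions.values())
total = sum(t for _, t in sd_regions.values())
span_pct = 100.0 * spanned / total
```

The totals come to 144 of 146 sites, which rounds to the 98.6% stated.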
10 Reference Genome Assemblies in 10 Days
Production Cost Estimate: $10,000/Genome
Sequencing 10 Reference Genomes in 10 Days
Long-read Sequencing + Compute: Assembly
Not yet running at full capacity
Improvements in Assembly and Polishing: reduce cost, improve quality
Haplotype Phasing
Production of high-quality
reference assemblies
Benedict Paten
David Haussler
Ed Green
Mark Akeson
Sofie Salama
TEAM
Miten Jain
Hugh Olsen
Kristof Tigyi
Marina Haukness
Ryan Lorig-Roach
Trevor Pesout
Joel Armstrong
Nicholas Maurer
UCSC
Adam Novak
Glenn Hickey
Jordan Eizenga
Erik Garrison
Jean Monlong
Xian Chang
Adam Phillippy (NIH/NHGRI)
Fritz Sedlazeck (Baylor)
Simon Mayes
Vania Costa
Daniel Garalde
David Stoddart
Rosemary Dokos
Jon Pugh
Chris Seymour
Chris Wright
ONT
Kelvin Liu
Duncan Kilburn
Paolo Carnevali