FOR RESEARCH USE ONLY. Not for use in diagnostic procedures. © 10x Genomics, Inc. 2016
Everyday de novo diploid assembly
Deanna M. Church
Oct 2016
@deannachurch
Disclosures
Employee and Shareholder: 10x Genomics
Shareholder: Personalis
10x Genomics products described are for Research Use Only. Not for use in diagnostic procedures.
Acknowledgements
The entire team at 10x
David Jaffe
Neil Weisenfeld
Vijay Kumar
Preyas Shah
Patrick Marks
Agenda
• Why haven’t we always done de novo assembly on every sample?
• What are Linked-Reads?
• What does everyday de novo assembly enable today?
Why haven’t we always done de novo genome analysis?
Current approach: averaging over haplotypes
Averaging over haplotypes fails with increased diversity
doi:10.1038/nature20098
AK1
New technology reveals more information
doi:10.1038/nrg3933
Public human assemblies to date
https://www.ncbi.nlm.nih.gov/assembly/organism/9606/latest/
Composite genomes
• GRCh38
• Celera (2)
Hydatidiform moles (single haplotypes)
• CHM1 (9)
• CHM13 (5)
Individual genomes
• NA12878 (9)
• HX1
• A/J Son
• A/J Mother
• A/J Father
• NA18507
• YH1
• HS1011
• AK1
• HuRef

• Lots of labor
• Lots of time
• Lots of coverage
• Lots of money
What are Linked-Reads?
Unlinked-Reads: short range information
Linked-Reads: long range information
Start with long molecules
NA19240
Making Linked-Reads
Construct: P5 | 16bp BC | R1 | Nmer | gDNA Insert
Excess of sequenceable inserts randomly primed off each long input molecule (50 kb)
Standard reference-based analysis recommendations:
• 30x sequence
• ~35 fragments per long input molecule (50 kb)
• ~0.2x coverage per molecule
Supernova analysis recommendations:
• 56x sequence
• ~65 fragments per long input molecule (50 kb)
• ~0.4x coverage per molecule
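The fragment counts and per-molecule coverage figures for a 50 kb molecule (~35 fragments for ~0.2x, ~65 fragments for ~0.4x) can be checked with simple arithmetic. The sketch below assumes each fragment contributes a 2 x 150 bp read pair (300 bp of sequence); that read length is not stated on the slide, and the function name is made up for illustration.

```python
def per_molecule_coverage(fragments, molecule_bp=50_000, bases_per_fragment=300):
    """Sequence coverage of one long input molecule.

    bases_per_fragment=300 assumes a 2 x 150 bp read pair per fragment;
    this is an assumption, not a figure from the slides.
    """
    return fragments * bases_per_fragment / molecule_bp


for label, fragments in [("reference-based, 30x", 35), ("Supernova, 56x", 65)]:
    cov = per_molecule_coverage(fragments)
    print(f"{label}: {fragments} fragments -> ~{cov:.2f}x per molecule")
```

Under that assumption the two recommendations come out close to the ~0.2x and ~0.4x quoted for a 50 kb molecule.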
Synthetic Long Reads: less physical coverage
[Figure: items A, B, C; sequencing cost vs. physical coverage]
Linked-Reads: greater physical coverage
[Figure: items A, B, C; sequencing cost vs. physical coverage]
Linked-Reads allow for increased physical coverage
Chr13: BRCA2
• 150X avg physical coverage
• > 56X avg read coverage (assembly)
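One way to see why physical coverage exceeds read coverage: if reads cover only a fraction of each long molecule, the molecules themselves must span the genome more deeply than the reads do. The sketch below divides genome-wide read coverage by per-molecule read coverage. This relationship and the function name are illustrative assumptions, not a formula given in the talk.

```python
def physical_coverage(read_coverage, per_molecule_coverage):
    """Approximate depth at which long molecules span the genome.

    Assumes reads are spread evenly over molecules, so that
    read_coverage = physical_coverage * per_molecule_coverage.
    """
    if per_molecule_coverage <= 0:
        raise ValueError("per_molecule_coverage must be positive")
    return read_coverage / per_molecule_coverage


# Values taken from the Linked-Reads slides: 56x at ~0.4x, 30x at ~0.2x.
for reads, per_mol in [(56, 0.4), (30, 0.2)]:
    print(f"{reads}x reads at ~{per_mol}x per molecule "
          f"-> ~{physical_coverage(reads, per_mol):.0f}x physical")
```

With the slide values this gives roughly 140x and 150x, in line with the 150X average physical coverage quoted for BRCA2.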
Generating Linked-Reads
Start with:
• HMW gDNA, 100 kb+ molecules
• 1.0 ng input DNA = 300 copies of the genome
• 0.5 ng DNA = 150 copies of the genome, partitioned into > 1M GEMs
[Figure: workflow labeled DNA, Oil, Barcoded Primer Library, Enzyme, Collect]
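The copy-number figures follow from the mass of one haploid human genome. The sketch below uses approximate textbook constants (about 650 g/mol per base pair and about 3.1 Gb per haploid genome); both constants and the function name are assumptions introduced here, not values from the slides.

```python
AVOGADRO = 6.022e23          # molecules per mole
BP_MASS_G_PER_MOL = 650.0    # approximate mass of one base pair (assumption)
HAPLOID_GENOME_BP = 3.1e9    # approximate human haploid genome size (assumption)


def haploid_genome_copies(input_ng):
    """Number of haploid genome copies contained in input_ng of DNA."""
    genome_mass_g = HAPLOID_GENOME_BP * BP_MASS_G_PER_MOL / AVOGADRO
    return input_ng * 1e-9 / genome_mass_g


for ng in (1.0, 0.5):
    print(f"{ng} ng -> ~{haploid_genome_copies(ng):.0f} haploid copies")
```

Under these assumptions 1.0 ng is close to 300 copies and 0.5 ng close to 150.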
Assembly made easy
BCL → Supernova (de novo assembly) → FASTA
http://www.biorxiv.org/content/early/2016/08/19/070425
• NA19240
• 1 library
• 1 ng input
• 1200M reads
• 1 server (28 cores)
• 348 GB memory
• 2 days compute
[Figure: assembled scaffold containing three megabubbles]
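If the 1200M figure is read as a read count, it can be related to the 56x coverage used for Supernova. The sketch below assumes 150 bp reads and a 3.2 Gb genome; neither value is stated on the slide, and the function name is illustrative.

```python
def read_coverage(n_reads, read_length_bp=150, genome_bp=3.2e9):
    """Average sequence coverage from a read count.

    read_length_bp and genome_bp are assumed values, not taken from the talk.
    """
    return n_reads * read_length_bp / genome_bp


print(f"~{read_coverage(1200e6):.0f}x")
```

With these assumptions 1200M reads correspond to roughly 56x.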
Performance over multiple human samples
http://www.biorxiv.org/content/early/2016/08/19/070425
sample   ethnicity  sex  cov  frag  N50 contig (kb)  N50 scaffold (Mb)  N50 phase block (Mb)  Gap (%)
NA19238  YRI        F    56   115   114.6            18.7               8                     2.1
NA19240  YRI        F    56   125   118.8            16.4               9.3                   2.3
HG00733  PR         F    56   106   123.6            17.8               3.4                   2.0
HG00512  HAN        M    56   102   113.2            15.4               2.7                   2.2
NA24385  AJ         M    56   120   106.4            15.1               4.2                   2.6
HGP      EUR        M    56   139   120.2            18.6               4.5                   2.5
NA12878  EUR        F    56   92    118.5            16.4               2.8                   2.9
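The contig, scaffold, and phase block columns are all N50 values. As a reminder of what that statistic measures, here is a minimal sketch of the usual definition: the length L such that sequences of length L or longer contain at least half of the total bases. Supernova may apply extra conventions (for example, a minimum length cutoff) that this sketch does not model.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from longest to shortest until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


print(n50([2, 3, 4, 5, 6]))
```

In the toy input the total is 20, and the two longest sequences (6 and 5) already cover more than half of it, so the N50 is 5.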
High quality Assembly at lower coverage
[Figure: three panels plotting Contig N50 (kb), Scaffold N50 (Mb), and Phase Block N50 (Mb) against number of reads (millions), from 500 to 1,300]
De Novo Performance Drastically Improves with Increased DNA Length
[Figure: three panels plotting Contig N50, Scaffold N50 (Mb), and Phase Block N50 against DNA length, from 0 to 60,000]
Comparison to truth data
Assembly assessment
[Figure: percent GRCh37 100mers missing per assembly (haploid and diploid), Supernova 10x vs. other methods; assemblies shown in order: NA19238, NA19240, HG00733, HG00512, NA24385, HGP, NA12878, YH, NA12878, NA12878, NA12878, NA24385, NA24143]
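The assessment metric is the percentage of reference 100mers that are absent from an assembly. A minimal sketch of that idea follows, assuming k-mers are compared in canonical form so that strand does not matter. The function names are hypothetical, and in-memory sets like these would not scale to a human genome.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Set of k-mers in seq, each stored as the smaller of itself and its
    reverse complement so that either strand matches."""
    seq = seq.upper()
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(_COMPLEMENT)[::-1]
        out.add(min(kmer, rc))
    return out


def percent_missing(reference, assembly_seqs, k=100):
    """Percent of distinct reference k-mers not found in any assembly sequence."""
    ref = canonical_kmers(reference, k)
    if not ref:
        raise ValueError("reference is shorter than k")
    found = set()
    for seq in assembly_seqs:
        found |= canonical_kmers(seq, k) & ref
    return 100.0 * (len(ref) - len(found)) / len(ref)


# Toy example with k=4: the assembly contains 2 of the 5 reference k-mers.
print(percent_missing("AAAACCCC", ["AAAAC"], k=4))
```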
What does everyday de novo assembly enable?
Ideal: Complete genome information
doi:10.1038/nature09534
• SNVs
• Deletions
• Insertions
• Inversions
• Translocations
Areas in which assembly excels: diverse regions
AluY
[Figure: alignment tracks for Supernova (de novo), PacBio reads, Illumina reads]
Areas in which assembly excels: insertions
[Figure: alignment tracks for Supernova (de novo), PacBio reads, Illumina reads]
SHANK2
GRCh37: chr11
GRCh37.p13: chr11_fix_patch
35 kb
[Figure: chr11 (SHANK2) aligned against Hap1_scaffold 7938 and Hap2_scaffold 7939]
Assembly analysis: alignment work needed
SHANK2
[Figure: alignment tracks for Supernova (de novo), PacBio reads, Illumina reads]
Areas in which assembly excels: inversions
GRCh37 chrX:6137041-6138541 (NLGN4X)
[Figure: alignment tracks for Supernova (de novo), PacBio reads, Illumina reads]
Assembly analysis: alignment work needed
GRCh37 chrX:6137041-6138541 (NLGN4X)
Hap1_scaffold 5127
Hap2_scaffold 5128
FASTA is a lossy format
[Figure: many-Mb scaffold containing three megabubbles, with multi-Mb phase blocks]
Micro structure
• bubbles, often at indeterminate poly-A
• short gaps, often at poly-A
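To illustrate why a flat FASTA record loses information, the sketch below models a scaffold as a list of segments in which a bubble holds two alternative arms. This toy representation and the function names are hypothetical and are not Supernova's native graph format.

```python
# A plain string is unambiguous sequence; a tuple is a bubble with two arms.
# Here the bubble is a poly-A run whose length is uncertain (7 vs. 8 bases).
scaffold = ["ACGT", ("A" * 7, "A" * 8), "GGCC"]


def flatten(segments, arm=0):
    """Produce one linear sequence by choosing a single arm at every bubble."""
    return "".join(seg if isinstance(seg, str) else seg[arm] for seg in segments)


def to_fasta(name, seq, width=60):
    """Format a sequence as a single FASTA record."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


record = to_fasta("scaffold_1", flatten(scaffold))
print(record)
```

Once written this way, the record holds only the chosen arm. Neither the alternative arm nor the fact that the position was a bubble can be recovered from the FASTA text alone.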
Native formats have more information
[Figure labels: Supernova (de novo); Long Ranger (reference based)]
Conclusions
• Routine, de novo, diploid assembly of 1000s of samples is possible today!
• Early uses will be for better resolution of divergent regions and novel sequence
• A new generation of tools needs to be developed to fully utilize assembly data
