Haplotype resolved structural variation assembly with long reads
Mount Sinai: Ali Bashir, Oscar Rodriguez, Matthew Pendleton
Reed College: Anna Ritz, Alex Ledger
Overview
•  Background
   –  Automated Hybrid Assembly
•  Phased Diploid Assembly
   –  Existing limitations of assembly
   –  10X + PacBio
•  Issues with calling SVs and comparing datasets
Technologies of Greater Scale Needed to Study Human Complexity
PacBio: 44X, ~4.9 kb avg. length
BioNano: 80X, 278 kb mean span
Schematic for Hybrid Scaffolding
[Figure: workflow schematic. Credit: A. Pang, M. Pendleton]
Inputs: NGS; BioNano Raw Molecules
Steps: De Novo Assemble; In silico digestion; Align Genome Maps to Contigs / Contigs to Genome Maps; Scaffold Graph Construction and Layout
Intermediates: Sequence Contigs; Contig Maps; Genome Maps; Aligned Contig-Scaffold Pairs
Output: Hybrid Scaffolds
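The in silico digestion step of the schematic can be illustrated with a minimal sketch. It assumes the label motif is GCTCTTC (the recognition site usually quoted for the BspQI nickase) and reports motif start positions rather than exact nick positions; the function names are hypothetical and this is not the pipeline's actual code.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string; unknown characters become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def in_silico_digest(contig, motif="GCTCTTC"):
    """Return sorted 0-based start positions of the motif on either strand.

    The default motif is an assumption (BspQI recognition site).
    """
    contig = contig.upper()
    sites = set()
    for pattern in (motif.upper(), reverse_complement(motif)):
        start = contig.find(pattern)
        while start != -1:
            sites.add(start)
            start = contig.find(pattern, start + 1)
    return sorted(sites)


def label_intervals(sites):
    """Distances between consecutive label sites: the 'contig map'
    that can be compared against label spacing in a genome map."""
    return [b - a for a, b in zip(sites, sites[1:])]


contig = "AAGCTCTTCAAAAGAAGAGCTT"
sites = in_silico_digest(contig)
print(sites, label_intervals(sites))
```

The interval list, not the raw sequence, is what gets aligned to the genome maps, which is why the same contig can be digested separately for each enzyme.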
Schematic for Hybrid Scaffolding
Scaffolding is symmetric
Schematic for Hybrid Scaffolding
Inconsistencies can be flagged to break or eliminate Contigs or Genome Maps
Hybrid Assembly Between BioNano & PacBio Supercontigs
As anticipated, heterochromatic regions are most difficult to span
- Completes as many as 28 gaps in hg38
- Other methods and data types can be used to further resolve additional gaps in the assembly
Orthogonal Error Profiles enable dramatic improvements
Complex Structural Rearrangements can be Validated Relative to Current References
[Figure: chr1 in hg19 and hg38 aligned to the super scaffold, with hg19 gaps marked and a molecule pileup shown under the super scaffold]
Still resolves at least 28 gaps in hg38 assembly for >400 kb in predicted gap intervals
Many large structural variants predicted like the above. Are they real?
Very Large TR Expansions Detected via Optical Map
LPA	(Kringle	IV	Exon	Expansion	–	Each	Period	>5kb)
Example	Signatures	of	Complex	Events
Most	Inversions	From	the	1000	Genomes	
Project	Are	Not	Actually	Inversions!
Technology Continues to March Forward
GIAB AJ and 1000 Genomes Trios
•  GIAB
   –  Long-read sequencing of trios
      •  20-30X parents
      •  40-70X children
   –  10X Chromium data
   –  Ashkenazi Jewish
•  1kG
   –  Long-read sequencing
      •  ~20-25X parents
      •  ~45-50X children
   –  10X Chromium data
      •  Chinese, Puerto Rican, and Yoruban ancestry
Sub	RL	N50	=	11,087	bp	
Total	#	Bases	=	220	Gb	
#	of	Reads	=	27.4	M	reads
Current Genome Assembly Results on GIAB and 1kG Trios
Sample    Contigs  Average  N50      Max      Total Size
HG002     13231    230 kb   4.1 Mb   31.6 Mb  3.04 Gb
HG003     17873    172 kb   4.6 Mb   21.5 Mb  3.08 Gb
HG004     16487    185 kb   5.3 Mb   22.6 Mb  3.05 Gb
HG00512   23146    117 kb   369 kb   2.6 Mb   2.72 Gb
HG00513   18443    151 kb   401 kb   2.4 Mb   2.78 Gb
HG00514   11517    264 kb   7.2 Mb   61.1 Mb  3.04 Gb
HG00731   20811    132 kb   451 kb   3.8 Mb   2.74 Gb
HG00732   13672    214 kb   1.3 Mb   10.9 Mb  2.93 Gb
HG00733   11143    281 kb   11.4 Mb  57.4 Mb  3.14 Gb
NA19238   56480    39 kb    70 kb    645 kb   2.20 Gb
NA19239   73478    23 kb    40 kb    1.01 Mb  1.71 Gb
NA19240   15245    203 kb   3.8 Mb   20.1 Mb  3.09 Gb
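For readers less familiar with the columns above, here is a minimal sketch of how the Average and N50 statistics are conventionally computed from a list of contig lengths. The function names are illustrative, not taken from any assembler.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def average_length(lengths):
    """Mean contig length: total assembled bases divided by contig count."""
    return sum(lengths) / len(lengths)


toy = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy), average_length(toy))
```

As a sanity check on the table, 3.04 Gb spread over 13231 contigs is roughly 230 kb, which agrees with the Average reported for HG002. N50 sits far above the average because a few very long contigs carry most of the bases.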
Hybrid Assembly with multiple Enzymes dramatically improves contiguity and coverage of the genome
Sample                 Enzyme         Contigs      N50                 Total Size
HG00733 (Genome Map)   BspQI          2185         4.2 Mb              5.6 Gb
HG00733 (Genome Map)   BssSI          5038         1.39 Mb             5.1 Gb
HG00733 (Hybrid)       BspQI          133 / 10637  56.4 Mb / 52.2 Mb   2.8 Gb
HG00733 (Hybrid)       BssSI          234 / 10749  40.26 Mb / 29.7 Mb  2.8 / 3.2 Gb
HG00733 (Hybrid)       BspQI + BssSI  104 / 10590  72.6 Mb / 61.5 Mb   2.9 Gb / 3.2 Gb
Credit: Joyce Lee (BioNano Genomics)
Previous Work in NA12878: Hybrid Assembly/Variation Analysis Pipeline
Pendleton et al., Nature Methods 2015
[Figure credits: M. Pendleton, A. Pang, J. Chin]
Summary of 1kG/GIAB PacBio SV Calls
           Insertion                          Deletion                            Complex
Sample     Calls   TR calls  Alu  L1   SVA    Calls  TR calls*  Alu  L1   SVA
HG002      13471   5573      325  68   7      9639   6880       798  201  22      2493
HG003*     12947   5133      411  74   5      9692   6776       411  74   5       2580
HG004      12769   5066      475  160  96     9509   7233       971  282  33      2599
HG00512    9830    4164      366  75   67     7672   5781       768  275  23      2157
HG00513    9761    4175      351  86   79     7791   5936       770  258  27      2314
HG00514    1285    4866      212  42   3      9636   6770       767  222  26      2635
HG00731    9874    4322      357  76   75     7678   5797       790  256  17      2174
HG00732    11059   4884      400  85   85     8227   6274       813  271  24      2351
HG00733*   11769   5365      330  45   4      8848   6179       743  191  25      2313
NA19238    7512    2999      280  72   59     6320   4765       628  237  12      1910
NA19239    5909    2357      199  46   50     5061   3809       528  161  21      1468
NA19240*   13285   5185      345  78   7      9791   7596       911  275  23      2600
* Ran through a streamlined version of the SV pipeline; may not be comparable
Other Notes:
-  MEI calls use conservative parameters (likely undercalling insertions)
-  “Other” calls contain some improperly flagged insertions/deletions as well as complex events and inversions
Venn Diagrams Between all Trios: YRI/PUR/CHS Deletions
Not seeing all proband calls (blue) in the parents
- Under-calling of hets?
Haplotype Resolved Assembly
Partitioned assembly of PacBio reads with 10X Genomics phased variant calls
Assembly Sequences Often Mix Haplotype Information
[Figure: assembly graph with bubbles, from http://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/output/generating]
Bubbles represent divergent alleles
Blue = Maternal
Yellow = Paternal
A conservative assembly will NOT try to link across the black bubbles without some sort of scaffolding information
Assemblies often take a single path in this graph; this could mix maternal and paternal alleles
10X + PacBio: Haplotype Resolved SVs
Overview of Approach
[Figure: genome track with 10X partitioned SNVs and long reads]
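The partitioning idea can be sketched as follows: each long read votes for a haplotype according to the phased SNV alleles it carries. This is a simplified illustration, not the authors' implementation. It assumes all SNVs lie in one phase block, that read alleles have already been extracted from alignments, and the names (assign_haplotype, partition_reads, min_margin) are hypothetical.

```python
from collections import Counter


def assign_haplotype(read_alleles, phased_snvs, min_margin=2):
    """Assign a read to haplotype 1 or 2, or 0 if the evidence is weak.

    read_alleles: {position: base observed in the read}
    phased_snvs:  {position: (haplotype 1 base, haplotype 2 base)}
    min_margin:   assumed minimum vote difference required to assign
    """
    votes = Counter()
    for pos, base in read_alleles.items():
        if pos not in phased_snvs:
            continue
        hap1_base, hap2_base = phased_snvs[pos]
        if base == hap1_base:
            votes[1] += 1
        elif base == hap2_base:
            votes[2] += 1
    if votes[1] - votes[2] >= min_margin:
        return 1
    if votes[2] - votes[1] >= min_margin:
        return 2
    return 0


def partition_reads(reads, phased_snvs, min_margin=2):
    """Split reads into haplotype bins; bin 0 holds unassigned reads."""
    bins = {0: [], 1: [], 2: []}
    for name, alleles in reads.items():
        bins[assign_haplotype(alleles, phased_snvs, min_margin)].append(name)
    return bins


phased = {100: ("A", "G"), 200: ("C", "T"), 300: ("G", "A")}
reads = {
    "r1": {100: "A", 200: "C", 300: "G"},  # consistent with haplotype 1
    "r2": {100: "G", 200: "T"},            # consistent with haplotype 2
    "r3": {100: "A", 200: "T"},            # conflicting evidence
    "r4": {500: "A"},                      # covers no phased SNV
}
print(partition_reads(reads, phased))
```

Each haplotype bin can then be assembled or passed to an SV caller on its own. Reads in bin 0 are the hard cases: they cover no phased SNV or carry conflicting alleles.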
Revised Call Sets
* PacBio calls taken from “Sniffles”, developed at the Schatz Lab (Johns Hopkins)
** 10X calls provided using LongRanger from 10X Genomics
Hybrid Haplotype Separated Callsets Add Many SVs
Combined Calls Largely Contain Existing Calls
(Sniffles SV Detection)
Calls Remain That Are Unique To A Single Platform
(Sniffles SV Detection)
Assembly Calls “Missed” In Haplotype Calls
[Figure tracks: Partitioned Reads; Haplotype Contigs; Assembly Calls; Haplotype Calls]
Haplotype Calls “Missed” In Assembly Calls
[Figure tracks: Partitioned Reads; Haplotype Contigs; Assembly Calls; Haplotype Calls]
Repeats (like TRs) can shift boundaries for various callers
Example Workarounds
- Breakpoint calls (not assembled sequence):
  If L1 is within 10% of L2 size and L1 and L2 are both in R, we call it a “match”
- Assembly calls:
  Perform MSA to identify truly “homozygous” sequences
[Figure: calls L1 and L2 placed at different positions within the same Tandem Repeat]
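The breakpoint-call workaround can be written as a small predicate. This sketch assumes R is the tandem repeat interval, that a call is "in R" when its position falls inside that interval, and that "within 10%" is measured relative to the size of L2; the SVCall type and calls_match name are hypothetical.

```python
from typing import NamedTuple


class SVCall(NamedTuple):
    pos: int   # reported breakpoint position
    size: int  # reported SV length in bp


def calls_match(l1, l2, repeat, tolerance=0.10):
    """True if both calls fall inside the repeat interval and
    their sizes differ by at most `tolerance` of l2's size."""
    r_start, r_end = repeat
    in_repeat = all(r_start <= call.pos <= r_end for call in (l1, l2))
    similar_size = abs(l1.size - l2.size) <= tolerance * l2.size
    return in_repeat and similar_size


tandem_repeat = (1000, 2000)
a = SVCall(pos=1100, size=500)
b = SVCall(pos=1800, size=520)  # same repeat, similar size
c = SVCall(pos=1500, size=700)  # same repeat, size too different
d = SVCall(pos=2500, size=500)  # outside the repeat
print(calls_match(a, b, tandem_repeat),
      calls_match(a, c, tandem_repeat),
      calls_match(a, d, tandem_repeat))
```

Note that the rule deliberately ignores how far apart the two positions are, because inside a tandem repeat the same event can be placed at any repeat unit.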
Complex hets Are Often in “Hard regions”
[Figure tracks: Partitioned Reads; Haplotype Contigs; Assembly Calls; Haplotype Calls]
Assembly Provides Detailed Indications Of Quality
•  Provides sequence of breakpoint
•  Potentially provides co-located events
•  Potentially provides information on accuracy of the assembly in that region
Assemblies have additional information
[Figure: Ctg 33 mapped to Chr1, with Ctg 120 and the mis-assembly point marked]
Slide from Jason Chin at the SMRT Informatics Workshop
Ongoing/Future Work
•  GIAB is working on how to integrate assembly calls robustly
   –  Surprisingly poor overlap!
•  Typing calls
•  Integrating parental information
•  Providing full-length haplotype assemblies for genomes
   –  Can be done with trio phasing
   –  But it’s now possible with new technologies!
      •  Hi-C
      •  StrandSeq
•  Integrating graphs from 10X and PacBio
•  Pulling in very large SVs with BioNano
•  Moving into full indel resolution and tools for rapidly comparing datasets to alt haps
Acknowledgements	
•  Mount Sinai
–  Eric Schadt
–  Matt Pendleton
–  Ajay Ummat
–  Oscar Franzen
–  Gintaras Deikus
–  Robert Sebra
–  Oscar Rodriguez*
•  Reed College
–  Anna Ritz
–  Alex Ledger
•  UCSF
–  Pui Kwok
•  PacBio
–  Jason Chin
•  1000 Genomes SV Working Group
•  UW
–  Mark Chaisson
•  EMBL
–  Jan Korbel
–  Markus H.-Y. Fritz
–  Tobias Rausch
•  BioNano Genomics
–  Han Cao
–  Alex Hastie
–  Heng Dai
–  Andy Pang
–  Joyce Lee
•  10X Genomics
–  Patrick Marks
–  Deanna Church
–  Mike Schnall-Levin
–  Sofia Kyriazopoulou-Panagiotopoulou
