Robust benchmarking of small variant calls
Justin Zook1, Peter Krusche2, Michael Eberle2, Len Trigg3, Kevin Jacobs4, Brendan O’Fallon5, Marc Salit1,
the Global Alliance for Genomics and Health Benchmarking Team, and the Genome in a Bottle Consortium
(1) Genome-Scale Measurements Group, National Institute of Standards and Technology
(2) Illumina, Inc.; (3) Real Time Genomics; (4) Helix; (5) ARUP Laboratories

Introduction
• The Global Alliance for Genomics and Health Benchmarking Team has
developed a variety of resources for benchmarking germline small
variant calls:
• Standardized performance metric definitions (e.g., false positives,
false negatives, precision, recall/sensitivity, genotype error rate); an
illustrative sketch of these metrics follows this list
• Links to high-confidence calls and data for benchmark genomes
• Benchmarking Tools
• Integrate variant comparison tools into a single benchmarking
framework
• Enable stratification of performance by variant type and genome
context
• Sophisticated variant comparison tools are important to handle different
representations of complex variants
• Benchmarking tools have been used in PrecisionFDA Challenges
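As a minimal illustration of the standardized metric definitions (this is a sketch, not the GA4GH reference implementation; the counts below are placeholders, and the genotype error rate convention should be checked against doc/standards):

    # Illustrative placeholder counts, not real benchmarking output
    tp = 9950        # true positives: benchmark variants matched by the query callset
    fp = 50          # false positives: query calls with no match in the benchmark
    fn = 100         # false negatives: benchmark variants missed by the query
    gt_errors = 20   # matched sites where the query genotype disagrees with the benchmark

    precision = tp / (tp + fp)   # fraction of query calls that are correct
    recall = tp / (tp + fn)      # recall / sensitivity
    # one common convention; doc/standards gives the precise definition
    genotype_error_rate = gt_errors / (tp + gt_errors)

    print(precision, recall, genotype_error_rate)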
Resources
• GitHub site: https://github.com/ga4gh/benchmarking-tools
• In-progress benchmarking standards document: doc/standards
• Description of intermediate formats: doc/ref-impl
• Benchmark descriptions and downloads: resources/high-confidence-sets
• Stratification bed files and descriptions: resources/stratification-bed-files
• Python code for HTML reporting and running benchmarks: reporting/basic
• Please contribute / join the discussion! Email jzook@nist.gov if you’re interested
Public Benchmark Callsets/Genomes

Table 1: Genomes currently being characterized by GIAB by integrating data
from multiple technologies. Vials from a large homogeneous batch of DNA are
available as NIST Reference Materials (RMs).

Genome        PGP ID    Coriell ID  NIST ID  NIST RM #
CEPH Mother   N/A       GM12878     HG001    8398
AJ Son        huAA53E0  GM24385     HG002    8391 (son)/8392
AJ Father     hu6E4515  GM24149     HG003    8392 (trio)
AJ Mother     hu8E87A9  GM24143     HG004    8392 (trio)
Chinese Son   hu91BD69  GM24631     HG005    8393
Different variant representations change variant counts (an illustrative
sketch follows these examples).
MNP	(=>	one	TP	/	FP	/	FN):
chr1 16837188 TGC CGT
SNPs	(=>	two	TP	/	FP	/	FN):
chr1 16837188 T C
chr1 16837190 C T
Variant types can change when decomposing or recomposing variants:
Complex	variant:
chr1 201586350 CTCTCTCTCT CA
DEL	+	SNP:
chr1 201586350 CTCTCTCTCT C
chr1 201586359 T A
Variants cannot always be canonicalized uniquely:
Complex variant:
chr20 21221450 GCCC GCG
Decomposition 1:
chr20 21221450 GC G
chr20 21221452 C G
Decomposition 2:
chr20 21221452 C G
chr20 21221453 C <DEL>
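To make the MNP example above concrete, here is a minimal Python sketch (not part of the GA4GH tools; the 10 bp reference window and its flanking bases are assumed for illustration) showing that the MNP and the two SNPs edit the reference into the same haplotype sequence, which is why haplotype-aware comparison rather than record-by-record matching is needed:

    def apply_variants(ref_window, window_start, variants):
        """Apply (pos, ref, alt) variants (1-based VCF positions) to a reference window."""
        seq = ref_window
        for pos, ref, alt in sorted(variants, reverse=True):  # right-to-left keeps offsets valid
            offset = pos - window_start
            assert seq[offset:offset + len(ref)] == ref, "REF allele does not match the window"
            seq = seq[:offset] + alt + seq[offset + len(ref):]
        return seq

    window = "AAATGCAAAA"  # assumed 10 bp window starting at chr1:16837185 (flanking bases hypothetical)
    mnp  = [(16837188, "TGC", "CGT")]
    snps = [(16837188, "T", "C"), (16837190, "C", "T")]
    assert apply_variants(window, 16837185, mnp) == apply_variants(window, 16837185, snps)  # same haplotype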
Practical Implications of Benchmark Callsets

What is the effect of not mapping to the decoy?

BWA/GATK (no decoy)  vs. NIST v2.18  vs. NIST v3.3.2  vs. PG 2016-1.0
Precision            91%             67%              93%
Sensitivity          99.8%           99.4%            93%
Outside bed          91%             92%              78%

• The decoy was designed to capture mis-mapped reads that cause FPs – can we test this?
• v3.3.2 is best at identifying FP SNPs
• 43% of all FPs are in the decoy (only 0.5% of TPs)
• PG and v2.18 exclude most FP sites
• PG is best at identifying possible FN SNPs
• Clustered, unclear, difficult-to-map variants
What is the SNP sensitivity in coding exons?
• 97.98% sensitivity vs. PG
• FNs predominantly in low-MQ and/or segmental duplication regions
• ~80% of FNs supported by long or linked reads
• 99.96% sensitivity vs. NIST v3.3.2
• 62x lower FN rate than vs. PG
True accuracy is hard to estimate, especially in difficult regions.

[Figure: FN rate relative to average (0.3x to 30x scale) for indels of 11 to 50 bp
and 51 to 200 bp within 2 bp, 3 bp, and 4 bp unit tandem repeats]
What is the FP rate for compound heterozygous indels?
• 93% precision vs. PG
• 4/10 manually inspected putative FPs were errors in the test set
• 6/10 were correct in the test set (partial calls or missing in the PG VCF)
• 95% precision vs. NIST v3.3.2
• 9/10 manually inspected FPs were errors in the test set (1 error in v3.3.2)
Clinical Benchmarking Considerations

• Benchmark genomes may contain a limited number of variants in targeted
regions, particularly for non-SNPs, so always calculate confidence intervals
(a sketch follows this list)
• The variants that many clinical assays are most interested in detecting are
often enriched for difficult variants (e.g., indels, complex variants)
• Could we use the stratified performance on benchmark genomes to predict
the assay’s performance for clinically interesting variants?
• Other useful benchmarking approaches:
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time
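As a minimal sketch of the confidence-interval point above (an assumed workflow, not from the poster; the counts are placeholders and scipy is assumed to be available), an exact binomial (Clopper-Pearson) interval shows how uncertain sensitivity remains when a targeted region contains only a few benchmark variants:

    from scipy.stats import beta  # assumes scipy is installed

    def clopper_pearson(k, n, alpha=0.05):
        """Exact binomial confidence interval for k successes out of n trials."""
        lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
        upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
        return lower, upper

    # e.g. 48 of 50 benchmark indels in the targeted regions detected: 96% sensitivity,
    # but the 95% CI is roughly 86% to 99.5%, so the point estimate alone overstates certainty
    print(clopper_pearson(48, 50))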
Stratification by Variant Type and Context

• Accuracy often varies by variant type and genomic context
• Error rates for complex variants > indels > SNPs
• Error rates in tandem repeats and difficult-to-map regions are greater than
in non-repetitive regions
• The benchmarking team has made available a set of BED files describing
difficult and interesting regions (a stratification sketch follows this list)
• Different types of tandem repeats
• Low mappability regions
• Segmental duplications
• Coding regions
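As a minimal sketch of stratifying calls with one of these BED files (the file path is hypothetical; real benchmarking should use the GA4GH tools, which handle this at scale):

    def load_bed(path):
        """Read BED intervals (0-based, half-open) into per-chromosome lists."""
        regions = {}
        with open(path) as bed:
            for line in bed:
                chrom, start, end = line.split()[:3]
                regions.setdefault(chrom, []).append((int(start), int(end)))
        return regions

    def in_regions(regions, chrom, pos_1based):
        """True if a 1-based VCF position falls inside any interval on that chromosome."""
        pos0 = pos_1based - 1
        return any(start <= pos0 < end for start, end in regions.get(chrom, []))

    # hypothetical path under resources/stratification-bed-files in the GitHub repository
    repeats = load_bed("stratification-bed-files/tandem_repeats.bed")
    print(in_regions(repeats, "chr1", 201586350))  # stratify e.g. FPs/FNs by region membership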
PrecisionFDA Challenge Results Example (precision.fda.gov)
Table 2: Genomes with high-confidence calls from the Illumina Platinum Genomes
Project, generated by phasing the parents and 11 children and finding variants
inherited as expected.

Genome        PGP ID  Coriell ID  NIST ID  NIST RM #
CEPH Mother   N/A     GM12878     HG001    8398
CEPH Father   N/A     GM12877     N/A      N/A
Accounting for different representations of complex variants
• Complex variants (i.e., nearby SNPs and indels) can usually be correctly
represented in multiple ways
• GA4GH Benchmarking tools account for these differences in representation
(a usage sketch follows this list)
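As a minimal sketch of running a haplotype-aware comparison with hap.py from the GA4GH framework (the file names are placeholders, and flags can vary between hap.py versions, so check the repository documentation):

    import subprocess

    # placeholder file names: substitute a GIAB or Platinum Genomes benchmark,
    # its high-confidence BED, your query VCF, and the matching reference FASTA
    cmd = [
        "hap.py",
        "truth.vcf.gz",           # benchmark ("truth") callset
        "query.vcf.gz",           # callset being evaluated
        "-f", "confident.bed",    # high-confidence regions of the benchmark
        "-r", "reference.fasta",  # reference the calls were made against
        "-o", "benchmark_out",    # output prefix for metrics and stratified results
    ]
    subprocess.run(cmd, check=True)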
Sophisticated comparison tools (right) make a significant difference in
performance metrics compared to naïve tools (left).
Example Complex Variant where normalization alone (e.g., with vt) does not work
