SV benchmark v0.6 integration steps:
• Discovery: 498876 (296761 unique) calls >=50 bp and 1157458 (521360 unique) calls >=20 bp discovered in 30+ sequence-resolved callsets from 4 technologies for the AJ Trio
• Compare SVs: 128715 sequence-resolved SV calls >=50 bp after clustering sequence changes within 20% edit distance in the trio
• Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different, or BioNano/Nabsys support in the trio
• Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in the son
• Filter complex: 12745 SVs not within 1 kb of another SV
• Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly
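The filtering steps above form a funnel, and the retention at each step can be worked out directly from the counts on the poster. The sketch below does only that arithmetic; the stage labels and the function name `retention` are illustrative, and the raw Discovery counts are left out because they count calls rather than clustered SVs.

```python
# Counts copied from the SV integration steps listed above.
# Stage labels are shorthand for the poster's step names.
STAGES = [
    ("Compare SVs (clustered, >=50 bp)", 128715),
    ("Discovery support", 30062),
    ("Evaluate/genotype", 19748),
    ("Filter complex", 12745),
    ("Regions", 9641),
]


def retention(stages):
    """Return (name, count, fraction kept vs previous stage, fraction kept vs first stage)."""
    rows = []
    first = stages[0][1]
    prev = first
    for name, count in stages:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


if __name__ == "__main__":
    for name, count, step, overall in retention(STAGES):
        print(f"{name:<36}{count:>8}  step {step:6.1%}  overall {overall:6.1%}")
```

Read this way, the largest single reduction is the discovery-support requirement, and roughly 7.5% of the clustered calls end up in the final benchmark set.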
A new Genome in a Bottle small variant benchmark for genomic “dark matter”
J.M. Zook1, J. Wagner1, N. Olson1, L. Harris1, J. McDaniel1, A.M. Wenger2, W.J. Rowell2, A. Carroll3, I.T. Fiddes4, C. Xiao5, C-S Chin6, F. Sedlazeck7, M. Salit8, Genome in a Bottle Consortium
1) Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD; 2) Pacific Biosciences, Menlo Park, CA; 3) Google, Inc., Mountain View, CA; 4) 10x Genomics,
Pleasanton, CA; 5) National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD; 6) DNAnexus, Inc., Mountain View, CA; 7) Baylor College of
Medicine, Houston, TX; 8) Joint Initiative for Metrology in Biology, Stanford, CA
Introduction
NIST hosts the Genome in a Bottle (GIAB) Consortium, which develops metrology for
benchmarking human genome variant calling. Consortium products include:
• Characterization of seven broadly-consented human genomes, including 2 son-mother-father trios, released as NIST Reference Materials (RMs)
• Benchmarks for germline small variants, including a new v4.1 benchmark for HG002 that
uses linked and long reads to expand to more “dark” (difficult to map) regions
• Benchmark for sequence-resolved large insertions and deletions >50 bp for HG002
Integration data for HG002
Results from Adding Long and Linked Reads
Next steps for GIAB
Methods to Form Small Variant Benchmark
Benchmark includes more bases, variants, and segmental duplications in v4.1
Example use: Benchmark identifies new errors in short read WGS
• SNV FNs increase by a factor of >3, mostly due to new benchmark variants in difficult-to-map regions and segmental duplications
New variants in potentially medically-relevant genes
• v4.1 covers >90% of the MHC region (MHC described in bioRxiv 831792)
• Other genes with many more variants: TSPEAR (31), LAMA5 (28), FCGBP (18), TPSAB1 (15), HSPG2 (13)
• From ACMG59, new variants in PMS2, RET, SCN5A, and TNNI3
Long-range PCR + Sanger confirmation
• Confirmed all 86 covered variants in CYP21A2, CYP2D6, PMS2, TNXA, TNXB, C4A, C4B, DMBT1, STRC, and HSPG2
• Confirmed all 50 covered variants in 4 LINE1s with errors in v3.3.2
Dark and Camouflaged Genes
• v4.1 covers ~22% of “dark genes” for 100 bp reads
Join the Genome in a Bottle Consortium
Platform            Characteristics                   Alignment; Variant Calling
PacBio Sequel II    ~15-20 kb reads; ~50x coverage    minimap2; GATK4
                                                      minimap2; DeepVariant
10X Genomics        Linked reads; ~84x coverage       LongRanger Pipeline
[Figure: Methods to form the small variant benchmark. Input: PASS variants and callable regions from methods #1 and #2, with example genotypes 0/1 and 1/1. Output: benchmark calls and benchmark regions. Site categories: concordant; discordant unresolved; discordant arbitrated; concordant not callable.]
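The four site categories named in the methods figure (concordant, discordant unresolved, discordant arbitrated, concordant not callable) can be illustrated with a small two-method classifier. This is a minimal sketch under stated assumptions, not the GIAB integration pipeline: the decision rules, the function names `classify_site` and `benchmark_genotype`, and the choice to trust whichever single method is callable at a discordant site are all assumptions made for illustration.

```python
from typing import Optional


def classify_site(gt1: Optional[str], gt2: Optional[str],
                  callable1: bool, callable2: bool) -> str:
    """Assign one of the figure's four category labels to a site.

    gt1/gt2 are genotype strings such as "0/1" or "1/1" from methods #1 and #2
    (None means no PASS variant). callable1/callable2 say whether the site lies
    in that method's callable regions. The rules are ASSUMED, not from the poster.
    """
    if gt1 == gt2:
        # Both methods agree; require at least one to be callable here.
        if callable1 or callable2:
            return "concordant"
        return "concordant not callable"
    # Genotypes differ: arbitrate only if exactly one method is callable.
    if callable1 != callable2:
        return "discordant arbitrated"
    return "discordant unresolved"


def benchmark_genotype(gt1: Optional[str], gt2: Optional[str],
                       callable1: bool, callable2: bool) -> Optional[str]:
    """Return the genotype kept under the assumed rules, or None if the site is excluded.

    A returned None for a concordant site with no PASS call in either method
    would need separate handling in real use; it is not distinguished here.
    """
    category = classify_site(gt1, gt2, callable1, callable2)
    if category == "concordant":
        return gt1
    if category == "discordant arbitrated":
        # Trust the method for which the site is callable (assumption).
        return gt1 if callable1 else gt2
    return None
```

For example, `benchmark_genotype("0/1", "1/1", False, True)` keeps the genotype from method #2 because only that method is callable at the site, while a disagreement with both methods callable is left out of the benchmark.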
Coverage of ~190 difficult, medically relevant genes from Mandelker et al.
                    Variants    Bases covered
Benchmark v3.3.2    7,358       5.3 Mbp (52.1%)
Benchmark v4.1      12,395      8.5 Mbp (83.5%)
Difficult Region Description                                                          Bases Covered in GRCh37   Bases Covered in GRCh38
Tier 1 and Tier 2 calls from v0.6 SV Benchmark                                        48,876,992                49,291,167
Potential copy number variation                                                       28,315,423                44,938,752
Tandem Repeats > 10kb                                                                 5,731,885                 71,942,255
Highly similar and high depth segmental duplications                                  1,232,701                 2,094,143
Regions that are collapsed and expanded from GRCh37/38 Primary Assembly Alignments    17,979,597                N/A
VDJ                                                                                   3,482,644                 3,616,717
Inversions                                                                            2,454,472                 1,438,352
Modeled centromere and heterochromatin                                                N/A                       62,304,573
Subset                      FNs vs v3.3.2    FNs vs v4.1
All SNVs                    9,382            36,438
Low mappability+seg dups    4,753            30,571
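The claim that SNV false negatives rise by a factor of more than 3, mostly in difficult regions, can be checked from the two table rows above. The snippet below is a worked calculation only; the variable and function names are illustrative.

```python
# False-negative (FN) SNV counts from the table above: (vs v3.3.2, vs v4.1).
FNS = {
    "All SNVs": (9382, 36438),
    "Low mappability+seg dups": (4753, 30571),
}


def fold_increase(old: int, new: int) -> float:
    """Ratio of FNs against the newer benchmark to FNs against the older one."""
    return new / old


all_old, all_new = FNS["All SNVs"]
hard_old, hard_new = FNS["Low mappability+seg dups"]

added_total = all_new - all_old    # 27056 additional FNs overall
added_hard = hard_new - hard_old   # 25818 of them in low mappability + seg dups
share_hard = added_hard / added_total

print(f"All SNVs fold increase: {fold_increase(all_old, all_new):.2f}")
print(f"Low mappability+seg dups fold increase: {fold_increase(hard_old, hard_new):.2f}")
print(f"Share of additional FNs in low mappability+seg dups: {share_hard:.1%}")
```

The overall ratio is a little under 4, and about 95% of the additional false negatives fall in the low mappability and segmental duplication subset, which matches the poster's statement.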
• Focused characterization of potentially medically relevant genes missing from v4.1
• Improve benchmark for larger indels, homopolymers, and tandem repeats
• Develop benchmarks and standards for complex variants
• Pipeline for small and structural variant integration to make new benchmarks
• Generating benchmark variants from diploid assemblies (with Human Pangenome Project)
• Refine use of genome stratifications to understand strengths/limitations of any method
• Machine learning: Outlier detection, active learning
The input data for GIAB benchmark v3.3.2 consisted of Illumina, Complete Genomics, Ion,
10X, and SOLiD technologies. v4.1 includes 2 new datasets:
New members welcome! Sign up for newsletters at www.genomeinabottle.org
Email for questions about participating in GIAB: jzook@nist.gov
Regions still excluded from v4.1 benchmark:
                                   v3.3.2 GRCh37    v4.1 GRCh37    v3.3.2 GRCh38    v4.1 GRCh38
Bases                              2.36 Gbp         2.53 Gbp       2.35 Gbp         2.54 Gbp
Reference covered                  87.8%            94.1%          85.4%            92.2%
SNVs                               3.05M            3.35M          3.03M            3.36M
Indels                             466k             525k           477k             528k
Bases in Segmental Duplications    0.1 Mbp          72.5 Mbp       5.4 Mbp          83.9 Mbp
Arbitration Example
[Figure: arbitration example; image not recoverable]
[Figure: SNVs and INDELs specific to v3.3.2 or to v4.1. Extracted values, in extraction order: 376,653; 91,837; 91,719; 48,753. Extracted labels: PMS2; "Mostly in questionable reference regions"; "Mostly in low mappability and seg dups"]
Benchmark for large insertions and deletions
[Figure: SV size panels for 50 to 1000 bp and 1 kbp to 10 kbp, with Alu and LINE labels]
tinyurl.com/GIABSV06
Preprint on bioRxiv 664623
Benchmark Evaluations
• Asked experts in a variety of variant calling methods to benchmark their method against v4.1 and manually curate 100 FPs and FNs
  • Is GIAB correct? Is callset correct?
• Volunteers curated FPs/FNs from 11 callsets
  • Illumina (mapping and graph-based)
  • 10x (mapping and local assembly-based)
  • PacBio (mapping and assembly-based)
  • ONT (mapping/deep learning)
• Preliminary results
  • >90% of FPs and FNs are errors in callsets
  • Challenging to interpret results in MHC
  • v4.1 still has some dense variants in seg dups that might be due to SVs or CNVs
New draft benchmark for HG001/NA12878 helps identify mapping errors in phased pedigree calls
Evaluation Process
• Callset developer curates putative errors
  • Benchmark is correct: no further curation
  • Benchmark is wrong or questionable:
    • NIST curator agrees: classify source of potential error in benchmark
    • NIST curator disagrees: discuss with callset developer
v4.1 benchmark at tinyurl.com/GIABHG2v4-1
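The evaluation process is a small decision tree: the callset developer's verdict comes first, and the NIST curator's opinion only matters when the benchmark is flagged. A minimal sketch of that routing follows; the function name `next_step` and the verdict strings are hypothetical, and the returned text repeats the outcome labels from the poster.

```python
from typing import Optional


def next_step(developer_verdict: str, nist_agrees: Optional[bool] = None) -> str:
    """Route one curated site through the evaluation process.

    developer_verdict: "correct", "wrong", or "questionable" (the callset
    developer's judgement of the benchmark at a putative error).
    nist_agrees: whether the NIST curator agrees that the benchmark is wrong
    or questionable; only needed when the benchmark was flagged.
    """
    if developer_verdict == "correct":
        return "No further curation"
    if developer_verdict in ("wrong", "questionable"):
        if nist_agrees is None:
            raise ValueError("NIST curator review is required for flagged sites")
        if nist_agrees:
            return "Classify source of potential error in benchmark"
        return "Discuss with callset developer"
    raise ValueError(f"unknown verdict: {developer_verdict!r}")
```

For instance, `next_step("questionable", nist_agrees=False)` returns "Discuss with callset developer", the branch where the two curators disagree.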
