Giab agbt SVs_2019


AGBT poster describing structural variant (SV) benchmark

Published in: Health & Medicine

A new benchmark for human germline structural variant calls

Justin Zook (1), Lesley Chapman (1), Nancy Hansen (3), Fritz J. Sedlazeck (4), Aaron Wenger (5), Adam English (6), Chunlin Xiao (7), John Oliver (8), Joyce Lee (9), Alex Hastie (9), Ian Fiddes (10), Alvaro Barrio (10), Tobias Marschall (11), Mark Chaisson (12), John Farrell (13), Andrew Carroll (14), Paul C. Boutros (15, 16), Iman Hajirasouliha (17), Christopher E. Mason (17), Sayed Mohammad Ebrahim Sahraeian (18), Marc Salit (2), and many other members of the Genome in a Bottle Consortium

(1) National Institute of Standards and Technology; (2) Joint Initiative for Metrology in Biology; (3) NHGRI/NIH; (4) Baylor College of Medicine; (5) Pacific Biosciences; (6) Spiral Genetics; (7) NCBI/NIH; (8) Nabsys; (9) BioNano Genomics; (10) 10x Genomics; (11) Max Planck Institute; (12) University of Southern California; (13) Boston University Medical School; (14) Google; (15) University of California, Los Angeles; (16) Ontario Institute for Cancer Research; (17) Weill Cornell Medicine; (18) Roche Sequencing Solutions

Introduction

  • NIST has hosted the Genome in a Bottle Consortium to develop authoritatively-characterized, human genome Reference Materials that are an enduring resource for benchmarking variant calls

Ongoing and Future GIAB Work

  • Using long & linked reads in difficult-to-map regions
  • Improved benchmarks for homopolymers and long repeats
  • Complex and clustered variants
  • New collaborations to characterize difficult regions and variants in these genomes are welcome! Email

Crowd-sourced manual curation vs. benchmark set

Benchmark calls are strongly supported

Zook et al., Scientific Data, 2016.

Integrating data to form benchmark calls

  • Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
  • Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
  • Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
  • Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
  • Filter complex: 12745 SVs not within 1kb of another SV
  • Regions: 11869 SVs inside 2.69 Gbp benchmark regions supported by diploid assembly v0.6
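The "Filter complex" step in the integration process keeps only SVs that are not within 1kb of another SV. The poster does not describe how this is implemented, so the following is only a minimal sketch of one way such a filter could work. The `SV` record, the `filter_isolated` function, the half-open coordinate convention, and the choice to measure distance between interval edges (treating a gap of exactly 1000 bp as "within 1kb") are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """Hypothetical SV record; coordinates are 0-based, half-open."""
    chrom: str
    start: int
    end: int
    svtype: str


def filter_isolated(svs, window=1000):
    """Keep SVs with no other SV on the same chromosome within `window` bp.

    Distance is the gap between interval edges, so overlapping SVs have a
    negative gap and are always treated as clustered (an assumption).
    """
    by_chrom = defaultdict(list)
    for sv in svs:
        by_chrom[sv.chrom].append(sv)

    kept = []
    for chrom_svs in by_chrom.values():
        chrom_svs.sort(key=lambda s: (s.start, s.end))
        max_prev_end = None  # furthest end among all earlier SVs
        for i, sv in enumerate(chrom_svs):
            near_prev = (max_prev_end is not None
                         and sv.start - max_prev_end <= window)
            # After sorting, the next SV has the smallest later start.
            nxt = chrom_svs[i + 1] if i + 1 < len(chrom_svs) else None
            near_next = nxt is not None and nxt.start - sv.end <= window
            if not (near_prev or near_next):
                kept.append(sv)
            max_prev_end = sv.end if max_prev_end is None else max(max_prev_end, sv.end)
    return kept


# Made-up example calls, not GIAB data.
calls = [
    SV("1", 10_000, 10_300, "DEL"),
    SV("1", 10_900, 10_901, "INS"),  # 600 bp after the deletion above
    SV("1", 50_000, 50_120, "DEL"),
    SV("2", 10_500, 10_600, "DEL"),
]
isolated = filter_isolated(calls)
print([(sv.chrom, sv.start) for sv in isolated])
```

Tracking the furthest end seen so far, rather than only the previous call, means a long SV still counts as a neighbor of later calls that start well after it begins.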
2012
  • No human benchmark calls available
  • GIAB Consortium formed
2014
  • Small variant genotypes for ~77% of pilot genome NA12878
2015
  • NIST releases first human genome Reference Material
2016
  • 4 new genomes
  • Small variants for ~90% of 7 genomes for GRCh37/38
2018
  • Draft SV benchmark
  • Difficult to map regions
2019+
  • Characterizing difficult variants and regions
  • Assembly benchmarks
  • Cancer

Our benchmark sets are useful in evaluating multiple technologies

Benchmark set and README at

  • Goal: When comparing any callset to our vcf within the bed, most putative FPs and FNs should be errors in the tested callset
  • We benchmarked several callsets from assembly-based and non-assembly-based methods with short and long reads.
  • Upon manual curation, the majority of FPs and FNs were errors in the tested callset
  • Exception: FP insertions from pbsv, suggesting we may miss ~5% of true insertions
  • Exception: One FP insertion from BioNano was correctly larger than the benchmark call

[Figure: panels "50 to 1000 bp" and "1kbp to 10kbp"; labels: Alu, LINE]

  • Candidates examined by 11 curators on average
  • 627/635 consensus manual curations agreed with v0.6 genotype in benchmark regions
  • Most “discordant” sites related to inclusion of 20-49bp indels in curation

Public GIAB Data

  • Short reads: Illumina, Complete Genomics
  • Long reads: PacBio (raw and CCS), Oxford Nanopore
  • Linked reads: 10x Genomics, 6kb Mate-pair
  • Optical/electronic mapping: BioNano, Nabsys

Short reads have limitations for large insertions and SVs in tandem repeats

[Figure: axes Log10(BioNano Size) and Log10(Benchmark Size)]

Trio genotype counts, by parental genotypes and son genotype:

Father  Mother  Son 0/1  Son 1/1
0/0     0/0          14        0
0/0     0/1        1185        0
0/0     1/1         417        0
0/1     0/0        1143        0
0/1     0/1        1119      449
0/1     1/1         462      444
1/1     0/0         416        2
1/1     0/1         522      431
1/1     1/1          12     2748

Trio Mendelian genotype violation rate 28/9392 = 0.3% (Excludes X/Y and sites with no GT in a parent)

[Figure: panels "Support from long reads", "Support from short reads", "Support from optical mapping"; axis: Fraction of reads supporting SV; groups: Het, Hom]

SV discovery and genotyping methods have different strengths and weaknesses

  • More methods discover SVs that are deletions, not in tandem repeats, and smaller insertions
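The Mendelian violation count can be recomputed from the trio genotype table. The sketch below assumes the flattened table reads as nine father/mother genotype combinations, each with one count for son 0/1 and one for son 1/1. The names `counts` and `is_mendelian` are illustrative, not part of any GIAB tool. A son genotype is treated as consistent when one of his alleles could come from the father and the other from the mother. The tabulated cells give the poster's 28 violations but sum to 9364 sites rather than the stated 9392. The poster does not explain the difference, and either denominator rounds to 0.3%.

```python
# (father, mother) -> (count with son 0/1, count with son 1/1),
# transcribed from the trio table on the poster.
counts = {
    ("0/0", "0/0"): (14, 0),
    ("0/0", "0/1"): (1185, 0),
    ("0/0", "1/1"): (417, 0),
    ("0/1", "0/0"): (1143, 0),
    ("0/1", "0/1"): (1119, 449),
    ("0/1", "1/1"): (462, 444),
    ("1/1", "0/0"): (416, 2),
    ("1/1", "0/1"): (522, 431),
    ("1/1", "1/1"): (12, 2748),
}


def is_mendelian(father, mother, son):
    """True if the son's two alleles can be split as one per parent."""
    f = set(father.split("/"))
    m = set(mother.split("/"))
    a, b = son.split("/")
    return (a in f and b in m) or (b in f and a in m)


violations = 0
total = 0
for (father, mother), (n_het, n_hom) in counts.items():
    for son, n in (("0/1", n_het), ("1/1", n_hom)):
        total += n
        if not is_mendelian(father, mother, son):
            violations += n

rate_pct = 100 * violations / total
print(f"{violations} violations in {total} tabulated sites ({rate_pct:.1f}%)")
```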