
GIAB ASHG 2019 Small Variant poster



Published in: Health & Medicine

Using long and linked reads to generate a new Genome in a Bottle small variant benchmark

J. Wagner1, A. Carroll6, I.T. Fiddes3, A.M. Wenger2, W.J. Rowell2, N. Olson1, L. Harris1, J. McDaniel1, C. Xiao5, M. Salit4, J. Zook1, Genome in a Bottle Consortium

1) Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899
2) Pacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025
3) 10x Genomics, 7068 Koll Center Parkway, Pleasanton, CA 94566
4) Joint Initiative for Metrology in Biology, Stanford, CA 94305
5) National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20894
6) Google, Inc., Mountain View, CA

Overview

NIST hosts the Genome in a Bottle (GIAB) Consortium, which develops metrology infrastructure for characterizing human whole-genome variant detection. Consortium products include:
  • Characterization of seven broadly-consented human genomes, including 2 son-mother-father trios, released as Reference Materials (RMs)
  • Reference data associated with the RMs: benchmark variants and genomic regions covering, for example, 87.8% of assembled bases in chromosomes 1-22 in GRCh37 for the sample HG002

A limitation of the current GIAB benchmark is that it largely excludes genomic locations with high homology, such as segmental duplications and low-complexity, repeat-rich regions, where short-read variant callers perform poorly. We incorporated PacBio CCS long reads and 10x Genomics linked reads to generate a draft of a new GIAB benchmark. Initial results show that long and linked reads add more than 276,840 SNPs and 42,980 insertions/deletions to the benchmark, mostly in regions that are difficult to map with short reads.

Integration data for HG002

Platform            Characteristics                  Alignment; Variant Calling
PacBio Sequel II    ~11 kbp reads; ~32x coverage     minimap2; GATK4
                                                     minimap2; DeepVariant
10x Genomics        Linked reads; ~84x coverage      LongRanger Pipeline

The input data for GIAB benchmark v3.3.2 consisted of Illumina, Complete Genomics, Ion, 10x, and SOLiD technologies. The draft v4 benchmark incorporates new PacBio CCS and 10x Genomics linked-read data.

Integration Pipeline Process

[Figure: integration pipeline and arbitration example. Inputs: PASS variants and callable regions from methods #1 and #2, shown with example genotypes (0/1, 1/1). Outputs: benchmark calls and benchmark regions. Legend: concordant; discordant unresolved; discordant arbitrated; concordant not callable.]

Excluded all methods: The following regions are excluded from all technologies and methods:
  • Tandem Repeats < 51bp except GATK from Illumina PCR-free, Complete Genomics, and CCS DeepVariant
  • Tandem Repeats > 51bp and < 200bp except GATK from Illumina PCR-free and CCS DeepVariant
  • Tandem Repeats > 200bp except CCS DeepVariant
  • Homopolymers > 6bp except GATK from Illumina PCR-free, Complete Genomics, Ion Exome, CCS
  • Imperfect homopolymer > 10bp except GATK from Illumina PCR-free
  • Difficult to map regions for short reads except 10x and CCS
  • LINE:L1Hs > 500 except Illumina MatePair, 10x, and CCS
  • Segmental duplications except 10x and CCS

Difficult Region Description                                  Bases Covered in GRCh37   Bases Covered in GRCh38
v0.6 SV Benchmark                                             32,596,754                32,872,907
Potential copy number variation                               51,713,344                62,666,746
Tandem Repeats > 10kb                                         5,731,885                 71,942,255
Highly similar and high depth segmental duplications          1,232,701                 2,094,143
Regions that are collapsed and expanded from GRCh37/38
  Primary Assembly Alignments                                 17,979,597                N/A
Modeled centromere and heterochromatin                        N/A                       62,304,573

Results from Adding Long and Linked Reads

Benchmark includes more bases, variants, and segmental duplications in v4

                                        v4 draft GRCh37    v4 draft GRCh38
Base pairs                              2,504,027,936      2,509,269,277
Reference covered                       93.2%              91.03%
SNPs                                    3,323,773          3,314,941
Indels                                  519,152            519,494
Base pairs in Segmental Duplications    64,300,499         73,819,342

[Figure: percent of reference covered (axis 80% to 95%), and counts of SNPs and INDELs found only in v3.3.2 versus only in the v4 draft, for GRCh37 and GRCh38. Values as extracted, in source order: 343,358; 69,495; 77,324; 23,828; 376,653; 91,837; 91,719; 48,753.]

Comparison of Illumina RTG VCF against benchmark sets

Subset                    v3.3.2 FNs    v4 FNs
All SNPs                  8,594         30,229
Low mappability           6,708         25,295
Segmental duplications    1,429         14,008

  • SNP FNs increase by a factor of more than 3, mostly due to new benchmark variants in difficult-to-map regions and segmental duplications

Performance in medically-relevant genes in GRCh37

Variants in Medical Exome (genes from OMIM, HGMD, ClinVar, UniProt)
Benchmark Regions v3.3.2      8,209
Benchmark Regions v4 draft    9,527

  • The v4 draft covers more of the MHC region; see poster 1707W for details
  • Outside of the MHC updates, the 5 genes with the largest increase in variants from v3.3.2 to the v4 draft benchmark are TSPEAR (31), LAMA5 (28), FCGBP (18), TPSAB1 (15), and HSPG2 (13)
  • Among the ACMG59 genes, PMS2 has 2 more variants, and RET, SCN5A, and TNNI3 each have 1 more variant, covered in the v4 draft benchmark that are not in v3.3.2

Sanger sequencing
  • Performed long-range PCR before sequencing
  • Confirmed 12 variants in CYP21A2, a medically-relevant gene in the MHC region
  • Confirmed 6 variants in PMS2

Evaluation by GIAB collaborators

Compared the benchmark to callsets from a variety of technologies and variant calling methods, including:
  • Illumina PCR-free and DRAGEN
  • 10x Genomics and Aquila (variants from local diploid assembly)
  • PacBio CCS and GATK4
  • PacBio CCS and Clair (next generation of Clairvoyante)
  • PacBio CCS and DeepVariant
  • ONT PromethION and Clair

Preliminary results suggest that at a majority of FP and FN sites the benchmark is correct and the error is in the tested callset. More volunteers welcomed.

Ongoing and Future work
  • Refine use of genome stratifications
  • Add variant calls from raw PacBio and Oxford Nanopore reads
  • Improve the benchmark for larger indels, homopolymers, and tandem repeats
  • Improve normalization of complex variants
  • Generate benchmark variants from diploid assemblies
  • Machine learning: outlier detection, active learning

New members welcome! Sign up for newsletters at
Volunteer to evaluate draft benchmark by emailing:
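
The claim that SNP FNs rise by a factor of more than 3 between v3.3.2 and the v4 draft can be checked directly from the FN counts in the Illumina RTG comparison. The sketch below is only illustrative arithmetic on the numbers printed in the poster; the variable and function names are hypothetical, and it assumes the "Low mappability" and "Segmental duplications" subsets may overlap, so their shares of the added FNs are not expected to sum to 100%.

```python
# FN counts (v3.3.2, v4 draft) copied from the poster's
# "Comparison of Illumina RTG VCF against benchmark sets" table.
fn_counts = {
    "All SNPs": (8_594, 30_229),
    "Low mappability": (6_708, 25_295),
    "Segmental duplications": (1_429, 14_008),
}

# Medical exome variant counts inside benchmark regions (v3.3.2, v4 draft).
medical_exome = (8_209, 9_527)


def fold_change(old: int, new: int) -> float:
    """Ratio of the new count to the old count."""
    return new / old


def added(old: int, new: int) -> int:
    """Absolute increase from the old count to the new count."""
    return new - old


fold = {name: fold_change(*pair) for name, pair in fn_counts.items()}
extra = {name: added(*pair) for name, pair in fn_counts.items()}

# Share of the additional SNP FNs that fall in each difficult subset.
# Assumption: the subsets can overlap, so shares need not sum to 1.
share_of_added = {
    name: extra[name] / extra["All SNPs"]
    for name in ("Low mappability", "Segmental duplications")
}

medical_gain = added(*medical_exome)
medical_gain_pct = 100 * medical_gain / medical_exome[0]

if __name__ == "__main__":
    for name in fn_counts:
        print(f"{name}: {fold[name]:.2f}x FNs, +{extra[name]:,}")
    for name, share in share_of_added.items():
        print(f"{name}: {share:.0%} of added SNP FNs")
    print(f"Medical exome: +{medical_gain:,} variants ({medical_gain_pct:.1f}%)")
```

Under these numbers, most of the additional FNs fall in low-mappability regions, which is consistent with the poster's explanation that the increase comes from new benchmark variants in difficult regions rather than from a change in the tested callset.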