GRC GIAB Workshop ASHG 2019 Small Variant Benchmark

  1. Using long and linked reads to generate a new Genome in a Bottle small variant benchmark
     Justin Wagner, Andrew Carroll, Ian T. Fiddes, Aaron M. Wenger, William J. Rowell, Nathan Olson, Lindsey Harris, Jenny McDaniel, Xin Zhou, Sergey Aganezov, Melanie Kirsche, Bohan Ni, Samantha Zarate, Byunggil Yoo, Neil Miller, C. Xiao, Marc Salit, Justin Zook, Genome in a Bottle Consortium
     GRC/GIAB Workshop ASHG 2019
  2. Overview
     • v3.3.2 benchmark variants and regions cover 87.84% of assembled bases in chromosomes 1-22 in GRCh37 for the sample HG002
     • Short read variant callers perform poorly in genomic locations with high homology such as segmental duplications and low-complexity repeat-rich regions
     • Now utilizing PacBio CCS and 10X Genomics data to expand the GIAB benchmark regions and reduce errors in current regions
     • Long and linked reads add variants to the benchmark, mostly in regions difficult to map with short reads
       - GRCh37: 276,840 SNPs and 53,482 INDELs
       - GRCh38: 286,483 SNPs and 42,980 INDELs
  3. How the benchmark is generated
  4. When do we trust variants and regions from each method
     [Figure: diagram for Variant Calling Method X. Labels: Variants (PASS; filtered outliers), Callable regions, Low/high coverage or low MQ (or low GQ for gVCF), Difficult regions/SVs, TR; example sites (1), (2), (3) with genotypes 1/1 and 0/1]
  5. Arbitrating between variant calls in different methods
     [Figure: input methods (PASS variants #1, Callable regions #1, PASS variants #2, Callable regions #2) combined into Benchmark calls and Benchmark regions, with example genotypes 0/1 and 1/1]
     • (1) Concordant
     • (2) Discordant unresolved
     • (3) Discordant arbitrated
     • (4) Concordant not callable
  6. Sequencing data used in integration for HG002

     | Platform | Characteristics | Alignment; Variant Calling |
     |---|---|---|
     | Illumina | 150x150bp; ~300x coverage | Novoalign; GATK v3.5 |
     | CG | 26x26bp; ~100x coverage | Complete Genomics Pipeline |
     | Illumina | 150x150bp; ~300x coverage | Novoalign; Freebayes |
     | Illumina | 250x250bp; ~45x coverage | Novoalign; GATK v3.5 |
     | Illumina | 250x250bp; ~45x coverage | Novoalign; Freebayes |
     | Illumina | 6Kbp mate pair; ~13x coverage | bwa_mem; GATK v3.5 |
     | Illumina | 6Kbp mate pair; ~13x coverage | bwa_mem; Freebayes |
     | Ion | Exome; 1000x coverage | Torrent Suite v4.2; Torrent Variant Caller v4.4 |
     | Solid | 75bp; ~60x coverage | LifeScope v2.5.1; GATK v3.5 |
     | PacBio CCS | Sequel II ~11kb reads; ~32x coverage | minimap2; GATK4 |
     | PacBio CCS | Sequel II ~11kb reads; ~32x coverage | minimap2; DeepVariant v0.8 |
     | 10x Genomics | Linked reads; ~84x coverage | LongRanger Pipeline |
  7. Long and linked reads cover more variants and regions
     [Figure: same Variant Calling Method X diagram as slide 4, with sites (1), (2), (3) marked]
     10x Genomics and PacBio CCS data add new variants (1), regions with good coverage of high MQ reads (2), and access to difficult regions (3)
  8. How the benchmark is generated
  9. Difficult Regions Excluded from all Methods

     | Difficult Region Description | Bases Covered in GRCh37 | Bases Covered in GRCh38 |
     |---|---|---|
     | v0.6 SV GIAB Benchmark | 32,596,754 | 32,872,907 |
     | Potential copy number variation | 51,713,344 | 62,666,746 |
     | Tandem Repeats > 10kb | 5,731,885 | 71,942,255 |
     | Highly similar and high depth segmental duplications | 1,232,701 | 2,094,143 |
     | Regions that are collapsed and expanded from GRCh37/38 Primary Assembly Alignments | 17,979,597 | N/A |
     | Modeled centromere and heterochromatin | N/A | 62,304,573 |
  10. Difficult Regions Excluded by Method
      • Tandem Repeats < 51bp except GATK from Illumina PCR-Free, Complete Genomics, and CCS DeepVariant
      • Tandem Repeats > 51bp and < 200bp except GATK from Illumina PCR-Free and CCS DeepVariant
      • Tandem Repeats > 200bp except CCS DeepVariant
      • Homopolymers > 6bp except GATK from Illumina PCR-Free, Complete Genomics, Ion Exome, PacBio CCS
      • Imperfect homopolymer > 10bp except GATK from Illumina PCR-Free
      • Difficult to map regions for short reads except 10x and CCS
      • LINE:L1Hs > 500bp except Illumina MatePair, 10x, and CCS
      • Segmental duplications except 10x and CCS
  11. v4 draft benchmark includes variants found with haplotype-resolved assembly of MHC
      • Worked with a team from the March 2019 NCBI Pangenome Hackathon to generate a haplotype-resolved assembly of the MHC region (chr6:28,477,797-33,448,354 in GRCh37)
      • Use the assembly to call small variants
      • Small variants from the assembly are integrated with mapping-based calls in the MHC region for the v4 draft benchmark
      • v4 draft benchmark includes 23,229 variants in the MHC region
      • Covers most HLA genes and CYP21A2/TNXA/TNXB
  12. v4 draft benchmark includes more bases, variants, and segmental duplications

      | | v4 draft GRCh37 | v4 draft GRCh38 |
      |---|---|---|
      | Base pairs | 2,504,027,936 | 2,509,269,277 |
      | Reference covered | 93.2% | 91.03% |
      | SNPs | 3,323,773 | 3,314,941 |
      | Indels | 519,152 | 519,494 |
      | Base pairs in Segmental Duplications | 64,300,499 | 73,819,342 |

      [Chart: percent of reference covered, axis from 80% to 95%]
  13. Some variants and segmental duplications only covered in v3.3.2 or v4 draft
      [Figure: counts of SNPs, INDELs, and segmental duplication bases found only in v3.3.2 or only in v4 draft, for GRCh37 and GRCh38. The pairing of labels and values was lost in extraction; values are listed in the order extracted.]
      • SNPs and INDELs: 343,358; 69,495; 77,324; 23,828; 376,653; 91,837; 91,719; 48,753
      • Segmental Duplications: 25,445; 63,949,151; 1,928,353; 70,187,985
  14. v4 draft enables benchmarking in regions difficult for short reads
      Comparison of Illumina RTG VCF against benchmark sets
      • SNP FNs increase by a factor of more than 3, mostly due to new benchmark variants in difficult to map regions and segmental duplications
      • False negatives: variants present in the truth set, but missed in the query

      | Subset | v3.3.2 FNs | v4 draft FNs |
      |---|---|---|
      | All SNPs | 8,594 | 30,229 |
      | Low mappability | 6,708 | 25,295 |
      | Segmental duplications | 1,429 | 14,008 |
  15. v4 draft benchmark contains more medically-relevant variants
      • v4 draft covers more of the MHC region
      • Outside of the MHC updates, the 5 genes with the largest increase in variants from v3.3.2 to the v4 draft benchmark: TSPEAR (31), LAMA5 (28), FCGBP (18), TPSAB1 (15), HSPG2 (13)
      • PMS2 from ACMG59 has 2 more variants, and RET, SCN5A, and TNNI3 each have 1 more variant, covered in the v4 draft benchmark but not in v3.3.2

      | Variants in Medical Exome (genes from OMIM, HGMD, ClinVar, UniProt) | Count |
      |---|---|
      | Benchmark Regions v3.3.2 | 8,209 |
      | Benchmark Regions v4 draft | 9,527 |
  16. Sanger sequencing confirms medically-relevant variants
      • Performed long range PCR before sequencing
      • Confirmed 12 variants in CYP21A2, which is a medically-relevant gene in the MHC region
      • Confirmed 6 variants in PMS2
      • Confirmed 15 variants in 5 other genes
  17. Evaluation by GIAB collaborators
      Compared benchmark to callsets from a variety of technologies and variant calling methods including:
      • Illumina PCR-Free and Dragen
      • PacBio CCS and GATK4
      • PacBio CCS and DeepVariant
      • PacBio CCS and Clair (next generation of Clairvoyante)
      • ONT PromethION and Clair
      Preliminary results suggest that at a majority of FP and FN sites the benchmark is correct and the tested callset is in error
      More volunteers welcomed
  18. Manual curation by callset developers
      Process
      • Compare callset to benchmark using hap.py and/or vcfeval
      • Randomly select 5 FP SNPs, 5 FN SNPs, 5 FP indels and 5 FN indels, each from inside and outside the v3.3.2 benchmark bed, in GRCh37 and GRCh38 (5*4*2*2=80 total)
      • Use IGV with PCR-free Illumina, PacBio CCS, 10x, and ONT + difficult bed files
      Questions to ask
      • Are both alleles correct in the benchmark?
        - Yes/No/Unsure
      • Are both alleles correct in the callset being tested?
        - Yes/No/Unsure
      • If the benchmark is wrong or questionable, how did you make this determination?
      • Instructions: Be critical of the benchmark, and select unsure if the evidence does not strongly support the benchmark being correct
  19. Process for independent evaluations
      [Flowchart]
      • Callset developer curates putative errors
        - Benchmark is correct: No further curation
        - Benchmark is wrong or questionable:
          - NIST curator disagrees: Discuss with callset developer
          - NIST curator agrees: Classify source of potential error in benchmark
  20. Initial evaluation suggests that at a majority of FP and FN sites the benchmark is correct and the tested callset is in error

      | Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites |
      |---|---|---|---|---|---|
      | CCS with GATK GRCh37 FP | 19 | 1 | 0 | 19 | 20 |
      | CCS with GATK GRCh37 FN | 15 | 3 | 2 | 18 | 20 |
      | ONT with Clair GRCh37 FP | 33 | 1 | 0 | 34 | 34 |
      | ONT with Clair GRCh37 FN | 27 | 3 | 0 | 30 | 30 |
      | CCS with Clair GRCh37 FP | 7 | 13 | 0 | 6 | 20 |
      | CCS with Clair GRCh37 FN | 19 | 1 | 0 | 19 | 20 |
      | Illumina with Dragen GRCh37 FP | 14 | 6 | 0 | 11 | 20 |
      | Illumina with Dragen GRCh37 FN | 17 | 3 | 0 | 17 | 20 |
  21. Evaluation FPs – Inversions, LINEs
  22. Evaluation FPs – Complex SVs
  23. Evaluation FPs – Near SVs
  24. Evaluation FPs – Near low coverage
  25. Potential refinements identified for v4.1
      • Exclude VDJ
      • Exclude Inversions
      • Improve CNV coverage
        - Use ONT for excessive coverage
        - Explore smoothing on excessive coverage beds
        - Use new diploid assemblies to identify CNVs
      • MHC
        - Exclude CNVs in the MHC, partial repeats in MHC, small regions that are questionable in the DRB genes
      • Benchmark regions density
        - Regions with dense variation and many gaps in bed
        - Dense variants near SVs
      • Segmental duplications
        - Small region of duplication covered by benchmark
        - Containing an SV
  26. Conclusions
      • Long and linked reads add variants to the benchmark, mostly in regions difficult to map with short reads
        - GRCh37: 276,840 SNPs and 53,482 INDELs
        - GRCh38: 286,483 SNPs and 42,980 INDELs
      • v4 draft benchmark is available for GRCh37 and GRCh38
        - GRCh37 Percent Chromosomes 1-22 Covered: 93.2%
        - GRCh38 Percent Chromosomes 1-22 Covered: 91.03%
      • Initial evaluation suggests that at a majority of FP and FN sites the benchmark is correct and the tested callset is in error
        - More volunteers welcomed
      • Identified refinements for v4.1
  27. On-going and Future Work
      • Refine use of genome stratifications
      • Adding variant calls from raw PacBio and Oxford Nanopore
      • Improve benchmark for larger indels, homopolymers, and tandem repeats
      • Improve normalization of complex variants
      • Generating benchmark variants from diploid assemblies
      • Machine learning
        - Outlier detection, active learning
      • Generate v4 draft for other GIAB genomes
  28. Acknowledgements
      • Andrew Carroll
      • Ian T. Fiddes
      • Aaron M. Wenger
      • William J. Rowell
      • Nathan Olson
      • Lindsey Harris
      • Jenny McDaniel
      • Chunlin Xiao
      • Marc Salit
      • Justin Zook
      • Genome in a Bottle Consortium
      Draft Benchmark Evaluators
      • Xin Zhou
      • Sergey Aganezov
      • Melanie Kirsche
      • Bohan Ni
      • Samantha Zarate
      • Byunggil Yoo
      • Neil Miller
  29. Backup
  30. Initial evaluation suggests that at a majority of FP and FN sites the benchmark is correct and the tested callset is in error

      | Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites |
      |---|---|---|---|---|---|
      | CCS with GATK GRCh38 FP | 16 | 4 | 0 | 16 | 20 |
      | CCS with GATK GRCh38 FN | 17 | 3 | 0 | 16 | 20 |
      | ONT with Clair GRCh38 FP | 19 | 1 | 0 | 19 | 20 |
      | ONT with Clair GRCh38 FN | 14 | 6 | 0 | 19 | 20 |
      | CCS with Clair GRCh38 FP | 15 | 5 | 0 | 16 | 20 |
      | CCS with Clair GRCh38 FN | 18 | 2 | 0 | 20 | 20 |
      | Illumina with Dragen GRCh38 FP | 16 | 3 | 1 | 16 | 20 |
      | Illumina with Dragen GRCh38 FN | 18 | 2 | 0 | 18 | 20 |
  31. Integration Pipeline Process
      • Find sensitive variant calls and callable regions for each dataset, excluding difficult regions/SVs that are problematic for each type of data and variant caller
      • Find “consensus” calls with support from 2+ technologies (and no other technologies disagree) using callable regions
      • Use “consensus” calls to train simple one-class model for each dataset and find “outliers” that are less trustworthy for each dataset
      • Find benchmark calls by using callable regions and “outliers” to arbitrate between datasets when they disagree
      • Find benchmark regions by taking union of callable regions and subtracting uncertain variants
  32. Sanger sequencing results
  33. Initial evaluation shows that at a majority of FP and FN sites the benchmark is correct and the tested callset is in error

      | Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites |
      |---|---|---|---|---|---|
      | CCS with DeepVariant GRCh37 FP | 3 | 9 | 8 | (not given) | 20 |
      | CCS with DeepVariant GRCh37 FN | 17 | 3 | 0 | (not given) | 20 |
      | CCS with GATK GRCh37 FP | 19 | 1 | 0 | 19 | 20 |
      | CCS with GATK GRCh37 FN | 15 | 3 | 2 | 18 | 20 |
      | ONT with Clair GRCh37 FP | 33 | 1 | 0 | 34 | 34 |
      | ONT with Clair GRCh37 FN | 27 | 3 | 0 | 30 | 30 |
      | CCS with Clair GRCh37 FP | 7 | 13 | 0 | 6 | 20 |
      | CCS with Clair GRCh37 FN | 19 | 1 | 0 | 19 | 20 |
      | Illumina with Dragen GRCh37 FP | 14 | 6 | 0 | 11 | 20 |
      | Illumina with Dragen GRCh37 FN | 17 | 3 | 0 | 17 | 20 |
  34. Initial evaluation shows that at a majority of FP and FN sites the benchmark is correct and the tested callset is in error

      | Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites |
      |---|---|---|---|---|---|
      | CCS with DeepVariant GRCh38 FP | 6 | 7 | 7 | (not given) | 20 |
      | CCS with DeepVariant GRCh38 FN | 20 | 0 | 0 | (not given) | 20 |
      | CCS with GATK GRCh38 FP | 16 | 4 | 0 | 16 | 20 |
      | CCS with GATK GRCh38 FN | 17 | 3 | 0 | 16 | 20 |
      | ONT with Clair GRCh38 FP | 19 | 1 | 0 | 19 | 20 |
      | ONT with Clair GRCh38 FN | 14 | 6 | 0 | 19 | 20 |
      | CCS with Clair GRCh38 FP | 15 | 5 | 0 | 16 | 20 |
      | CCS with Clair GRCh38 FN | 18 | 2 | 0 | 20 | 20 |
      | Illumina with Dragen GRCh38 FP | 16 | 3 | 1 | 16 | 20 |
      | Illumina with Dragen GRCh38 FN | 18 | 2 | 0 | 18 | 20 |
  35. Initial evaluation shows that at a majority of FP and FN sites the benchmark is correct and the tested callset is in error

      | Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure/No | Number Callset Incorrect |
      |---|---|---|---|
      | CCS with GATK GRCh37 | 32 | 8 | 32 |
      | CCS with GATK GRCh38 | 33 | 7 | 32 |
      | ONT with Clair GRCh37 | 60 | 4 | 60 |
      | CCS with Clair GRCh37 | 26 | 14 | 24 |
      | CCS with Clair GRCh38 | 33 | 7 | 36 |
      | Illumina with Dragen GRCh37 | 31 | 9 | 28 |
      | Illumina with Dragen GRCh38 | 34 | 6 | 34 |
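
The integration process on slide 31 can be illustrated with a toy sketch. This is not the GIAB pipeline: it covers only the consensus step (support from 2+ technologies with no disagreement) and the final region arithmetic (union of callable regions minus uncertain sites). The one-class outlier model and the arbitration step are omitted. The data layout, the function names, the `pad` value, and the rule that a callable method with no call counts as homozygous reference are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def merge(intervals: List[Interval]) -> List[Interval]:
    """Union of intervals: sort, then fuse any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(intervals: List[Interval], holes: List[Interval]) -> List[Interval]:
    """Remove every hole from a list of merged intervals."""
    holes = merge(holes)
    kept: List[Interval] = []
    for start, end in intervals:
        cursor = start
        for h_start, h_end in holes:
            if h_end <= cursor or h_start >= end:
                continue  # hole does not touch what is left of this interval
            if h_start > cursor:
                kept.append((cursor, h_start))
            cursor = max(cursor, h_end)
        if cursor < end:
            kept.append((cursor, end))
    return kept


def is_callable(pos: int, regions: List[Interval]) -> bool:
    return any(start <= pos < end for start, end in regions)


def integrate(methods: Dict[str, dict], pad: int = 50):
    """Toy integration: consensus calls plus benchmark regions.

    Each method is {"tech": str, "callable": [Interval], "calls": {pos: genotype}}.
    The padding around uncertain sites is an illustrative assumption.
    """
    positions = sorted({pos for m in methods.values() for pos in m["calls"]})
    benchmark_calls: Dict[int, str] = {}
    uncertain: List[Interval] = []
    for pos in positions:
        # One genotype vote per technology, using only methods callable here;
        # a callable method with no call votes homozygous reference (assumption).
        votes: Dict[str, set] = {}
        for m in methods.values():
            if is_callable(pos, m["callable"]):
                genotype = m["calls"].get(pos, "0/0")
                votes.setdefault(genotype, set()).add(m["tech"])
        if not votes:
            continue  # outside every callable region, so never in the union
        if len(votes) == 1:
            genotype, techs = next(iter(votes.items()))
            if genotype == "0/0":
                continue  # callable methods agree on reference
            if len(techs) >= 2:
                benchmark_calls[pos] = genotype  # 2+ technologies, no dissent
                continue
        uncertain.append((pos - pad, pos + pad + 1))
    union = merge([iv for m in methods.values() for iv in m["callable"]])
    return benchmark_calls, subtract(union, uncertain)


# Hypothetical input: three methods with made-up regions and calls.
methods = {
    "illumina_gatk": {"tech": "Illumina", "callable": [(0, 1000)],
                      "calls": {100: "0/1", 400: "1/1"}},
    "ccs_deepvariant": {"tech": "PacBio CCS", "callable": [(0, 2000)],
                        "calls": {100: "0/1", 400: "0/1", 1500: "0/1"}},
    "10x_longranger": {"tech": "10x", "callable": [(500, 2000)],
                       "calls": {1500: "0/1"}},
}
calls, regions = integrate(methods, pad=50)
```

In this example the site at 1500 lies outside the Illumina callable region but is supported by both PacBio CCS and 10x, so it enters the benchmark, which mirrors the point on slide 7 that long and linked reads add variants in regions short reads cannot access. The discordant site at 400 is cut out of the benchmark regions together with its padding.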

Editor's Notes

  • Exclude tandem repeats approximately larger than the read length for each method

    Homopolymers are excluded from 10x and PacBio CCS

    Really long homopolymers are only included for GATK-based calls from PCR-Free data, because GATK gVCF gives a low genotype quality score when no reads totally encompass the homopolymer
    - Trust homopolymers most from PCR-Free short reads


  • Ongoing work includes checking if many are in regions that might be in potential CNVs as they could be errors in v3.3.2


  • False negatives (FN): variants present in the truth set, but missed in the query.
  • 3_79181930

    Add this from what Lindsey sent on Slack
  • Combine GRCh37 and GRCh38
  • Left is an inversion


    Right is likely a LINE-mediated inversion
    - If an inversion is near repetitive elements, then exclude the repetitive elements as well

    - Show just two LINEs and the inversion they flank

  • Left is likely a tandem duplication or large insertion or complex insertion

    Right is an inversion and then a deletion that is in the SV benchmark, likely a complex SV
  • Update this table – Includes Billy’s new results

    10x-Aquila_37: 16, 24, 16
    10x-Aquila_38: 22, 18, 17
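
The false-negative definition in the notes above can be made concrete with a minimal sketch. It uses exact matching on (chrom, pos, ref, alt) tuples, which is a simplification: tools such as hap.py and vcfeval, named on slide 18, also reconcile different representations of the same variant. The variants below are hypothetical, and both sets are assumed to be already restricted to the benchmark regions. The last line recomputes the SNP false-negative fold change from the slide 14 table.

```python
from typing import NamedTuple, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


class Counts(NamedTuple):
    tp: int  # present in both truth and query
    fp: int  # present in query only
    fn: int  # present in truth only (missed by the query)


def compare(truth: Set[Variant], query: Set[Variant]) -> Counts:
    """Exact-match comparison of two variant sets."""
    return Counts(tp=len(truth & query), fp=len(query - truth), fn=len(truth - query))


# Hypothetical variants, for illustration only.
truth = {("1", 1000, "A", "G"), ("1", 2000, "C", "T"), ("1", 3000, "G", "GA")}
query = {("1", 1000, "A", "G"), ("1", 2500, "T", "C")}

counts = compare(truth, query)
recall = counts.tp / (counts.tp + counts.fn)
precision = counts.tp / (counts.tp + counts.fp)

# SNP false negatives for the Illumina callset, v4 draft vs v3.3.2 (slide 14).
fn_fold_change = 30_229 / 8_594  # roughly 3.5, i.e. "a factor of more than 3"
```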
