Benchmarking with GIAB 220907

  1. Benchmarking with Genome In A Bottle
  2. GIAB Improves Confidence in Genome Sequencing and Variant Calling
     • Reference materials
     • Characterizations (benchmark sets)
     • Reference data
     • Benchmarking methods
  3. Genome Sequencing and Variant Calling
  4. GIAB Reference Materials
  5. GIAB has characterized variants in 7 human genomes
     • HG001* (NA12878)
     • AJ Trio: HG002*, HG003*, HG004*
     • Chinese Trio: HG005*, HG006, HG007
     *NIST RMs developed from large batches of DNA
  6. GIAB Reference Data
  7. Public Data Sources
     • NIH Hosted FTP Site: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/
     • NIH SRA: https://www.ncbi.nlm.nih.gov/bioproject/200694
     • HPRC S3 Bucket: https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0
  8. [No text on slide]
  9. GIAB Data Indexes on GitHub: https://github.com/genome-in-a-bottle/giab_data_indexes
  10. Work In Progress - Data Registry
      • Queryable database with pointers to publicly available GIAB data along with summary statistics
      • Data types: sample, FASTQs, BAMs, VCFs
      • Capturing methods and linking datasets for data provenance
  11. GIAB Characterizations
  12. [No text on slide]
  13. Small Variant Integration Process
  14. Design of GIAB benchmark: Reliable IDentification of Errors (RIDE)
      • Benchmark Variants and Benchmark Regions, for GRCh37 and GRCh38
      • Variants from any method
      • Matching variants assumed true positives
      • Reliably identifies false positives
      • Reliably identifies false negatives
      • Variants not assessed
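
The design on slide 14 can be illustrated with a minimal Python sketch. It assumes exact matching on (chrom, pos, ref, alt) tuples and a hypothetical `regions` dictionary of half-open intervals. Real comparison tools match variants in a haplotype-aware way, so this shows only the bookkeeping: variants inside the benchmark regions are classified as TP, FP, or FN, and query variants outside the regions are not assessed.

```python
from bisect import bisect_right

def in_regions(chrom, pos, regions):
    """Return True if pos lies in a half-open (start, end) interval for chrom.

    regions: dict mapping chrom -> sorted, non-overlapping list of (start, end).
    """
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

def classify(benchmark, query, regions):
    """Classify variants, given as (chrom, pos, ref, alt) tuples.

    Assumption: exact tuple equality stands in for variant matching.
    """
    bench = {v for v in benchmark if in_regions(v[0], v[1], regions)}
    qry = {v for v in query if in_regions(v[0], v[1], regions)}
    return {
        "TP": bench & qry,                 # in both, inside benchmark regions
        "FP": qry - bench,                 # query only, inside benchmark regions
        "FN": bench - qry,                 # benchmark only, inside benchmark regions
        "not_assessed": set(query) - qry,  # query variants outside the regions
    }
```
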
  15. v4.2.1 Small Variant Benchmark used Long and Linked Reads

      | Reference Build | Benchmark Set | Reference Coverage (%) | SNVs | Indels | Base pairs in Seg Dups and low mappability |
      |---|---|---|---|---|---|
      | GRCh37 | v3.3.2 | 87.8 | 3,048,869 | 464,463 | 57,277,670 |
      | GRCh37 | v4.2.1 | 94.1 | 3,353,881 | 522,388 | 133,848,288 |
      | GRCh38 | v3.3.2 | 85.4 | 3,030,495 | 475,332 | 65,714,199 |
      | GRCh38 | v4.2.1 | 92.2 | 3,367,208 | 525,545 | 145,585,710 |

      Wagner et al, https://doi.org/10.1101/2020.07.24.212712
  16. Structural Variant Benchmark Set
      Zook, J.M., Hansen, N.F., Olson, N.D. et al. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol 38, 1347–1355 (2020). https://doi.org/10.1038/s41587-020-0538-8
  17. GIAB Benchmarking Methods
  18. Small Variant Benchmarking Highlights (TLDR)
      • Best practices for benchmarking germline variant calling: https://rdcu.be/bVtIF
      • Supplemental Table 2 summarizes best practices
      • Hap.py: best practices implementation
        Command line: https://github.com/Illumina/hap.py
        Graphical interface: https://precision.fda.gov/
      • HappyR: R package for hap.py results, https://github.com/Illumina/happyR
      • www.slideshare.net/genomeinabottle
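
As a rough illustration of the hap.py command-line route on slide 18, the sketch below assembles an argument list in Python without running anything. All file names are hypothetical placeholders. The options used (`-f` for the benchmark regions BED, `-r` for the reference FASTA, `-o` for the output prefix, `--engine`, `--stratification`) are taken from the hap.py documentation as best recalled, so check them against the version you have installed.

```python
import shlex

def happy_command(truth_vcf, query_vcf, benchmark_bed, reference_fasta,
                  out_prefix, engine=None, stratification_tsv=None):
    """Build (but do not run) a hap.py invocation as an argument list."""
    cmd = [
        "hap.py", truth_vcf, query_vcf,
        "-f", benchmark_bed,    # benchmark (confident) regions
        "-r", reference_fasta,  # reference the VCFs were called against
        "-o", out_prefix,       # prefix for the output files
    ]
    if engine:                  # e.g. "vcfeval"; needs the matching tool installed
        cmd += ["--engine", engine]
    if stratification_tsv:      # list of stratification BED files
        cmd += ["--stratification", stratification_tsv]
    return cmd

# Hypothetical file names, for illustration only.
cmd = happy_command("benchmark.vcf.gz", "query.vcf.gz", "benchmark.bed",
                    "reference.fa", "happy_out", engine="vcfeval")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run` once the paths point at real files.
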
  19. Benchmarking Process
  20. Best Practices Summary
      • Benchmark Sets
      • Stringency of variant comparison
      • Variant comparison tools
      • Manual Curation
      • Metric Interpretation
      • Stratifications
      • Confidence Intervals
      • Additional Benchmarking Approaches
  21. Applying Best Practices
  22. Best Practices for Benchmarking Small Variants
      • https://github.com/ga4gh/benchmarking-tools
      • Paper: https://rdcu.be/bqpDT
      • https://precision.fda.gov/
  23. Stratified Performance Metrics
      • Plot metric on a phred scale for better separation of metric values > 99%.
      • Precision = TP / (TP + FP)
      • Recall = TP / (TP + FN)
      • Confidence intervals indicate uncertainty and help account for differences in number of variants per stratification.
      [Figure: Precision and Recall panels for INDEL and SNP, GIAB ID HG003 and HG004. Axes: Metric (% phred scale; ticks 99, 99.9, 99.99) by Genomic Context (Difficult, Homopol, Not in Difficult, TR and Homopol, CDS, chainSelf, lowmap and segdups, lowmap, SegDups, chainSelf >10kb, SegDups >10kb). Stratification Type: all, notin.]
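
The metrics on slide 23 can be written out as a short Python sketch. The precision and recall formulas come from the slide. The phred transform is assumed to be the usual -10·log10(1 − metric). The Wilson score interval is one common choice for a binomial confidence interval, picked here for illustration because the slides do not say which method was used.

```python
import math

def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)

def phred(metric):
    """Map a fraction to a phred scale: 0.99 -> 20, 0.999 -> 30."""
    if metric >= 1.0:
        return math.inf
    return -10.0 * math.log10(1.0 - metric)

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (assumed method; z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

# The same recall of 0.99 is far less certain in a small stratification.
small = wilson_interval(99, 100)
large = wilson_interval(9900, 10000)
print(small, large)
```

The interval for the 100-variant stratification is much wider than for the 10,000-variant one, which is why the slide recommends showing confidence intervals when strata differ in size.
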
  24. Pairwise callset comparison
      [Figure: Precision and Recall panels for INDEL and SNP comparing DeepVariant_PacBio and DeepVariant_ILL across stratifications (e.g. L1H, MHC, diTR 51−200bp, triTR 51−200bp, quadTR 51−200bp, quadTR >200bp, TR 201bp − 10kb, nonunique l250m0e0, Not in All Difficult). Axis ticks: 0, 90, 99, 99.9, 99.99. strat_group: All Diff, LowComplexity, Map and SegDups, mappability, Other Diff, SegDups, NA.]
  25. (Optional) Optimization: using stratifications to identify the biases responsible for performance differences.
  26. Benchmarking Take Home Messages
      • Krusche et al. is a great resource for germline small variant benchmarking.
      • Appropriate data visualizations are critical to interpreting benchmarking results.
      • Use manual curation to evaluate benchmarking results.
      • Resources are available for benchmarking small and structural variants against GRCh37 and GRCh38.
  27. Collaborating with FDA to use GIAB benchmark to inspire new methods: https://precision.fda.gov/challenges/10
  28. [No text on slide]
  29. Challenge Results
      • Received 64 submissions from 20 participants
      • Most submissions used deep-learning-based variant-calling methods
      • Submissions using multiple technologies outperformed single-technology submissions
      • Submission performance varied by genomic stratification
      [Figure: F1 % (axis ticks 0, 90, 99, 99.9) by Technology (ILLUMINA, MULTI, ONT, PACBIO) for three Genomic Regions (Difficult-to-Map Regions, All Benchmark Regions, MHC). Teams shown: Sentieon, Roche Sequencing Solutions, The Genomics Team in Google Health, DRAGEN, Seven Bridges Genomics, The UCSC CGL and Google Health, Wang Genomics Lab.]
  30. Results (cont.)
      • Updated stratifications enable comparison of method strengths
      • Graph-based variant calling enables high accuracy of short-read variant calls in the difficult MHC region
      • Improved benchmark sets and stratifications reveal significant progress in DNA sequencing and variant calling since the 2016 challenge
  31. Future of Genome In A Bottle
  32. DEvelopment Framework for Assembly Based Benchmarks (DEFRABB)
  33. Developing benchmarks on new references using assemblies
      • Telomere-to-Telomere Consortium generated a new reference, T2T-CHM13
      • Developed CMRG benchmark on T2T-CHM13 using the diploid assembly of HG002, similar to benchmarks on GRCh37 and GRCh38
  34. Assembly-Based Benchmark Process
  35. Assembly-Based Benchmark Process
      • Minimap2 for assembly-assembly alignment
      • Variants called and diploid assembled regions identified using dipcall v0.3
  36. Assembly-Based Benchmark Process: VCF formatting and modifications for use in benchmarking.
  37. Assembly-Based Benchmark Process: exclude regions from dip.bed (assembled regions) that are problematic for small variant calling and comparison due to SVs and gaps in reference or alignment.
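
The exclusion step on slide 37 amounts to interval subtraction on BED-style regions. Below is a minimal pure-Python sketch of that operation, assuming half-open (chrom, start, end) tuples. The function name and example coordinates are hypothetical, and this is not the DEFRABB implementation. In practice, a tool such as `bedtools subtract` is commonly used for this kind of operation.

```python
from collections import defaultdict

def subtract_intervals(regions, excluded):
    """Remove `excluded` from `regions`; both are lists of (chrom, start, end).

    Coordinates are half-open, as in BED. Returns sorted remaining pieces.
    """
    excl_by_chrom = defaultdict(list)
    for chrom, start, end in excluded:
        excl_by_chrom[chrom].append((start, end))
    for intervals in excl_by_chrom.values():
        intervals.sort()

    remaining = []
    for chrom, start, end in sorted(regions):
        cursor = start  # left edge of the part not yet excluded
        for ex_start, ex_end in excl_by_chrom.get(chrom, []):
            if ex_end <= cursor:
                continue            # exclusion lies entirely to the left
            if ex_start >= end:
                break               # exclusion lies entirely to the right
            if ex_start > cursor:
                remaining.append((chrom, cursor, ex_start))
            cursor = max(cursor, ex_end)
            if cursor >= end:
                break
        if cursor < end:
            remaining.append((chrom, cursor, end))
    return remaining

# Hypothetical assembled regions and problematic regions (e.g. around SVs or gaps).
dip_regions = [("chr1", 0, 100), ("chr2", 0, 50)]
problem_regions = [("chr1", 10, 20), ("chr1", 90, 120), ("chr3", 0, 10)]
print(subtract_intervals(dip_regions, problem_regions))
```
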
  38. Take-home messages
      • Reference materials available for 5 individuals
      • Small variant benchmark sets for 7 individuals for GRCh37 and GRCh38; SV benchmark for one individual for GRCh37
      • Best practices established for small variant benchmarking
      • Current efforts focus on developing small variant and structural variant benchmark sets using diploid assemblies
  39. Acknowledgment of many GIAB contributors
      • Government
      • Clinical Laboratories
      • Academic Laboratories
      • Bioinformatics developers
      • NGS technology developers
      • Reference samples
      • Funders
  40. Interested in getting involved?
      • www.genomeinabottle.org: sign up for general GIAB and Analysis Team Google groups
      • GIAB slides: www.slideshare.net/genomeinabottle
      • Public, unembargoed data: github.com/genome-in-a-bottle
      • We are hiring! Data Manager, machine learning, diploid assembly, cancer genomes, data science, other ‘omics, …
