GIAB for AMP GeT-RM Forum


  1. 1. November 5, 2019 How Well Can You Detect Difficult Variants? Benchmarking with Genome in a Bottle
  2. 2. Why start Genome in a Bottle? • A map of every individual’s genome will soon be possible, but how will we know if it is correct? • Diagnostics and precision medicine require high levels of confidence • Well-characterized, broadly disseminated genomes are needed to benchmark performance of sequencing O’Rawe et al, Genome Medicine, 2013
  3. 3. Human Genome Sequencing needed a new class of Reference Materials with billions of reference values By Russ London at English Wikipedia, CC BY-SA 3.0,
  4. 4. GIAB has characterized 7 human genomes • Pilot genome – NA12878 • PGP Human Genomes – Ashkenazi Jewish son – Ashkenazi Jewish trio – Chinese son • Parents also characterized
     National Institute of Standards & Technology Report of Investigation: Reference Material 8391, Human DNA for Whole-Genome Variant Assessment (Son of Eastern European Ashkenazim Jewish Ancestry)
     This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer (10 mM TRIS, 1 mM EDTA, pH 8.0).
     This material is intended for assessing performance of human genome sequencing variant calling by obtaining estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as functional or clinical interpretation.
     Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods similar to described in reference 1. An information value is considered to be a value that will be of interest and use to the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available. These data and genomic characterizations will be maintained over time as new data accrue and measurement and informatics methods become available. The information values are given as a variant call file (vcf) that contains the high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information (NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
  5. 5. Open consent enables secondary reference samples to meet specific clinical needs • >50 products now available based on broadly-consented, well-characterized GIAB PGP cell lines • Genomic DNA + DNA spike-ins • Clinical variants • Somatic variants • Difficult variants • Clinical matrix (FFPE) • Circulating tumor DNA • Stem cells (iPSCs) • Genome editing • …
  6. 6. Design of our human genome reference values Benchmark Variant Calls
  7. 7. Benchmark Regions – regions in which the benchmark contains (almost) all the variants Benchmark Variant Calls Design of our human genome reference values
  8. 8. Reference Values Benchmark Variant Calls Design of our human genome reference values Benchmark Regions
  9. 9. Variants from any method being evaluated Design of our human genome reference values Benchmark Regions Benchmark Variant Calls
  10. 10. Design of our human genome reference values: Benchmark Variant Calls, Benchmark Regions, Variants from any method being evaluated • Matching variants assumed to be true positives • Majority of variants unique to benchmark should be false negatives (FNs) • Majority of variants unique to method should be false positives (FPs) • Variants outside benchmark regions are not assessed
  11. 11. Design of our human genome reference values: Benchmark Variant Calls, Benchmark Regions, Query Variants • Matching variants assumed to be true positives • Majority of variants unique to benchmark should be false negatives (FNs) • Majority of variants unique to method should be false positives (FPs) • Variants outside benchmark regions are not assessed • This does not directly give the accuracy of the reference values, but rather shows that they are fit for purpose.
  12. 12. GIAB Recently Published Resources for “Easier” Small Variants
  13. 13. Now using linked and long reads for difficult variants and regions
      GIAB Public Data • Linked Reads – 10x Genomics – Complete Genomics/BGI stLFR – Hi-C – Strand-seq (underway) • Long Reads – PacBio Continuous Long Reads – PacBio Circular Consensus Seq – Oxford Nanopore “ultralong” – PromethION
      GIAB Use Cases • Develop structural variant benchmark – bioRxiv 664623 • Diploid assembly of difficult regions like MHC – On bioRxiv this week • Expand small variant benchmark – v4.0 draft available for testing
  14. 14. [Figure labels: SV size peaks at 50 to 1000 bp (Alu) and 1kbp to 10kbp (LINE)]
      Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
      Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
      Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
      Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
      Filter complex: 12745 SVs not within 1kb of another SV
      Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly
      v0.6
  15. 15. Diploid assembly of MHC Martin, et al., 2016 BioRxiv 085050. Chin and Khalak, 2019, BioRxiv 705616
  16. 16. Alignments of assembly to reference • Two haplotigs (no gap) span through whole MHC region
  17. 17. Integrating assembly- and mapping-based calls gives best MHC benchmark
      • MHC assembly-based bed includes 23187 variants in the MHC region, excluding:
        – CYP21A2 and pseudogene
        – Homopolymers >10bp
        – SVs in assembly
        – Very dense variants
      • v4.0 mapping-based bed includes 13964 variants in the MHC region
      • Only 11 differences between assembly and mapping based calls in both beds
        – 2 genotyping errors in assembly-based
        – 1 inaccurate complex allele and cluster of 8 missed variants in mapping-based
      • Merged benchmark includes 23229 variants in the MHC region Mbp
      • Covers most HLA genes and CYP21A2/TNXA/TNXB
      • These variants are fully phased through the MHC regions too!
      Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
      None       13899              13549          10         4          0.9993     0.9997       0.9995
  18. 18. v4.0 benchmark uses 10x and CCS to include more bases, variants, and segmental duplications
                                             v4.0 GRCh37      v4.0 GRCh38
      Base pairs                             2,504,027,936    2,509,269,277
      Reference covered                      93.2%            91.03%
      SNPs                                   3,323,773        3,314,941
      Indels                                 519,152          519,494
      Base pairs in Segmental Duplications   64,300,499       73,819,342
      [Bar chart: percent of reference covered, axis 80% to 95%, for GRCh37 v3.3.2, GRCh37 v4 draft, GRCh38 v3.3.2, GRCh38 v4 draft]
  19. 19. v4.0 enables benchmarking in regions difficult for short reads
      Example comparison of Illumina RTG VCF against benchmark sets
      Subset                   v3.3.2 FNs   v4 draft FNs
      All SNPs                 8,594        30,229
      Low mappability          6,708        25,295
      Segmental duplications   1,429        14,008
  20. 20. v4.0 benchmark contains more variants in potentially medically-relevant regions • v4.0 covers >90 % of the MHC region (CYP21A2 and all HLA genes except HLA-DRBx) • Additional coding variants in other medically relevant genes: TSPEAR (31), LAMA5 (28), FCGBP (18), TPSAB1 (15), HSPG2 (13) • From ACMG59, new variants in PMS2, RET, SCN5A, and TNNI3
      “Medical Exome” (exons from OMIM, HGMD, ClinVar, UniProt)
                         Variants   Bases covered
      Benchmark v3.3.2   8,209      12,821,160 (85.5 %)
      Benchmark v4.0     9,527      13,748,850 (91.7 %)
  21. 21. Long range PCR + Sanger sequencing confirms new difficult variants in clinically tested exons • Confirmed all 63 covered variants in CYP21A2, PMS2, TNXA, TNXB, C4A, C4B, DMBT1, STRC, and HSPG2
  22. 22. v4.0 covers most of PMS2
  23. 23. Now cover SMN1, but regions still excluded due to high CCS coverage
  24. 24. Some CR1 regions still excluded due to slightly high coverage
  25. 25. Should we make a targeted benchmark for difficult genes? v4.0 still only covers ~22 % of “dark genes” for 100bp reads (Ebbert et al) • Compare long read diploid assembly to mapping of short and long reads • Manually curate and resolve discordant sites • Which genes should we target? • Exons and introns?
  26. 26. The road ahead...
      2019 • Integration pipeline development for small and structural variants • Manuscripts for small and structural variants
      2020 • Difficult large variants • Somatic sample development • Germline samples from new ancestries • Diploid assembly
      2021+ • Somatic integration pipeline • Somatic structural variation • Large segmental duplications • Centromere/telomere • Diploid assembly benchmarking • ...
  27. 27. Acknowledgment of many GIAB contributors • Government • Clinical Laboratories • Academic Laboratories • Bioinformatics developers • NGS technology developers • Reference samples • Funders
  28. 28. For More Information - sign up for general GIAB and Analysis Team google groups GIAB slides, including 2019 Workshop slides: Public, Unembargoed Data: – – – Global Alliance Benchmarking Team – – Web-based implementation at – Best Practices at Public workshops – Next workshop planned for April 1-2, 2020 at Stanford University, CA, USA Justin Zook: NIST postdoc opportunities available! Diploid assembly, cancer genomes, other ‘omics, …
  29. 29. Germline Variant Calling Benchmarking Nathan Olson AMP Reference Material 11/5/2019
  30. 30. Small Variant Benchmarking Highlights (TLDR) Best practices for benchmarking germline variant calling • Supplemental Table 2 summarizes best practices - best practices implementation • Command line - • Graphical interface – HappyR – R package for results • Github
  31. 31. Benchmarking Process
  32. 32. Best Practices Summary • Benchmark Sets • Stringency of variant comparison • Variant comparison tools • Manual Curation • Metric Interpretation • Stratifications • Confidence Intervals • Additional Benchmarking Approaches
  33. 33. Applying Best Practices
  34. 34. Benchmarking Demonstration • Samples – GIAB AJ Trio • Sequencing • 2X150bp Illumina HiSeq • 60X Coverage • Variant Calling Pipeline* • Mapping – BWA • Variant Calling – GATK4 • Ref GRCh37 • Benchmarking with and GA4GH stratifications * Run on precisionFDA, see for method details
  35. 35. GA4GH Benchmarking Tool
  36. 36. Check that discrepancies are errors in the query callset.
  37. 37. Stratified Performance Metrics • Plot 1 minus the metric on a log10 scale for better separation. Here lower is better. • Precision = TP / (TP + FP) • Recall = TP / (TP + FN) • Confidence intervals indicate uncertainty and account for differences in number of variants per stratification.
  38. 38. Stratification Scatter Plot
  39. 39. (Optional) Optimization – Identifying biases responsible for poorly performing stratifications.
  40. 40. Take Home Messages • Krusche et al. (URL) is a great resource for germline small variant benchmarking. • GA4GH benchmarking tool available at and github URL • Appropriate data visualizations (EDA) are critical to interpreting benchmarking results. • Use manual curation to evaluate benchmarking results. • We are actively working on developing resources for benchmarking small variants against GRCh38 and Structural Variants.
  41. 41. Acknowledgements GA4GH Benchmarking Team Genome In A Bottle Consortium NIST GIAB Team Justin Zook Jennifer McDaniel Justin Wagner Questions -
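
The comparison design on slides 10 and 11 can be illustrated with a minimal Python sketch. It assumes exact matching on chromosome, position, and alleles, which is a simplification: real comparison tools also reconcile different representations of the same variant. The names `Variant`, `in_regions`, and `compare` are hypothetical and chosen only for this illustration.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position, as in a VCF record
    ref: str
    alt: str


def in_regions(variant, regions):
    """Return True if the variant lies inside a benchmark region.

    `regions` maps chromosome -> list of (start, end) intervals in BED
    convention (0-based start, exclusive end), so a 1-based position is
    covered when start < pos <= end.
    """
    return any(start < variant.pos <= end
               for start, end in regions.get(variant.chrom, []))


def compare(benchmark, query, regions):
    """Classify variants inside the benchmark regions only.

    Variants outside the regions are not assessed at all.
    """
    bench = {v for v in benchmark if in_regions(v, regions)}
    qry = {v for v in query if in_regions(v, regions)}
    return {
        "TP": bench & qry,  # matching variants, assumed true positives
        "FP": qry - bench,  # unique to the method being evaluated
        "FN": bench - qry,  # unique to the benchmark
    }


# Toy data (hypothetical, not GIAB calls)
regions = {"chr1": [(100, 200)]}
benchmark = [Variant("chr1", 150, "A", "G"),
             Variant("chr1", 180, "C", "T"),
             Variant("chr1", 500, "G", "A")]  # outside regions
query = [Variant("chr1", 150, "A", "G"),
         Variant("chr1", 190, "T", "C"),
         Variant("chr1", 900, "G", "C")]      # outside regions
result = compare(benchmark, query, regions)
```

In this toy case the variants at positions 500 and 900 fall outside the benchmark regions, so they count as neither false negatives nor false positives.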

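The summary metrics in the MHC comparison on slide 17 can be reproduced from the reported counts. The sketch below assumes the common convention that precision uses the true positives counted in the query representation (True-pos-call) and sensitivity uses those counted in the benchmark representation (True-pos-baseline); the function names are hypothetical.

```python
def precision(tp_call, fp):
    """Fraction of query calls in the benchmark regions that match the benchmark."""
    return tp_call / (tp_call + fp)


def sensitivity(tp_baseline, fn):
    """Fraction of benchmark variants recovered by the query (recall)."""
    return tp_baseline / (tp_baseline + fn)


def f_measure(p, r):
    """Harmonic mean of precision and sensitivity."""
    return 2 * p * r / (p + r)


# Counts from the slide 17 table
p = precision(tp_call=13549, fp=10)
r = sensitivity(tp_baseline=13899, fn=4)
f = f_measure(p, r)
print(f"{p:.4f} {r:.4f} {f:.4f}")  # prints "0.9993 0.9997 0.9995"
```
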
Editor's Notes

    Fast Assembly / Fast Iteration

  • False negatives (FN): variants present in the truth set but missed in the query.
  • 3_79181930

    Add this from what lindsey sent on slack
  • This is a good slide for 644:
    give a clinical anecdote
    Also numbers - attendance, publications, data, RM unit sales
    Reference sample distributors
    How much money from IAA?
    - sustained funding
    Quantify collaborators' input
    GIAB steering committee
    Examples of others contributing data, analyses
    How to describe emails
  • Why benchmark and when
    Validating and optimizing a measurement process - e.g. Clinical lab validating NGS pipeline
    NGS and bioinformatic pipeline development
    NGS process QC
  • Generate variant calls for a sample with benchmarks - start with DNA or publicly available datasets; the starting point depends on what you are benchmarking or optimizing
    Compare query variant calls to truth callset
    Evaluate results
    (Optional) Use results to optimize measurement process
  • Subset of FPs and FNs
    Multiple technologies
    Relevant annotations

  • Zoom in to show what is in the table
    Show and define metrics
  • Add metric definitions
    Recreate plot using binconf for uncertainties

    Plot on a 1 minus metric log10 scale for better separation. Here lower is better.
    Confidence intervals indicate uncertainty and account for differences in number of variants per stratification.

  • Help identify poor performing stratifications
    Plot 1-Metric on a log-scale for better separation of stratifications with metric values close to 1
  • Example IGV with hypothesis of error source
    Analysis to do - IGV with PCR and PCR-free data, CCS, 10X, variant calls, benchmark, relevant stratifications
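
The notes on metric uncertainty and on plotting 1 minus the metric can be sketched in Python as follows. This is an illustration under stated assumptions, not the code behind the slides: it uses the Wilson score interval as one common choice of binomial confidence interval, and the function names `wilson_interval` and `log10_one_minus` are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def log10_one_minus(metric):
    """Transform a metric near 1 for plotting; lower values are better."""
    if metric >= 1:
        return float("-inf")  # a perfect score has no finite log10(1 - metric)
    return math.log10(1 - metric)


# Recall for a stratification with 13899 TPs and 4 FNs (counts from slide 17)
tp, fn = 13899, 4
recall = tp / (tp + fn)
low, high = wilson_interval(tp, tp + fn)
y = log10_one_minus(recall)
```

A stratification with few variants gets a wide interval even when its point estimate looks good, which is why the intervals matter when comparing stratifications of very different sizes.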