GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511


  1. May 11, 2020: Genome in a Bottle Benchmarks for Structural Variants and Repetitive Regions (@GenomeinaBottle on Twitter)
  2. Why start Genome in a Bottle?
      • A map of every individual’s genome will soon be possible, but how will we know if it is correct?
      • Diagnostics and precision medicine require high levels of confidence
      • Well-characterized, broadly disseminated genomes are needed to benchmark performance of sequencing
      O’Rawe et al., Genome Medicine, 2013
  3. Human Genome Sequencing needed a new class of Reference Materials with billions of reference values
      Image credit: Russ London at English Wikipedia, CC BY-SA 3.0
  4. GIAB has characterized 7 human genomes
      • Pilot genome
        – NA12878
      • PGP Human Genomes
        – Ashkenazi Jewish son
        – Ashkenazi Jewish trio
        – Chinese son
      • Parents also characterized
      Excerpt shown on slide: National Institute of Standards & Technology, Report of Investigation, Reference Material 8391, Human DNA for Whole-Genome Variant Assessment (Son of Eastern European Ashkenazim Jewish Ancestry)
      This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer (10 mM TRIS, 1 mM EDTA, pH 8.0).
      This material is intended for assessing performance of human genome sequencing variant calling by obtaining estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as functional or clinical interpretation.
      Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods similar to described in reference 1. An information value is considered to be a value that will be of interest and use to the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available. These data and genomic characterizations will be maintained over time as new data accrue and measurement and informatics methods become available. The information values are given as a variant call file (vcf) that contains the high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called high-confidence. Information values cannot be used to establish metrological traceability.
      The files referenced in this report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information (NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
  5. Open consent enables secondary reference samples to meet specific clinical needs
      • >50 products now available based on broadly-consented, well-characterized GIAB PGP cell lines
      • Genomic DNA + DNA spike-ins
      • Clinical variants
      • Somatic variants
      • Difficult variants
      • Clinical matrix (FFPE)
      • Circulating tumor DNA
      • Stem cells (iPSCs)
      • Genome editing
      • …
  6. Design of our human genome reference values
      Diagram labels: Benchmark Variant Calls
  7. Design of our human genome reference values
      Diagram labels: Benchmark Variant Calls; Benchmark Regions – regions in which the benchmark contains (almost) all the variants
  8. Design of our human genome reference values
      Diagram labels: Reference Values; Benchmark Variant Calls; Benchmark Regions
  9. Design of our human genome reference values
      Diagram labels: Benchmark Variant Calls; Benchmark Regions; Variants from any method being evaluated
  10. Design of our human genome reference values
      Diagram labels: Benchmark Variant Calls; Benchmark Regions; Variants from any method being evaluated
      • Variants outside benchmark regions are not assessed
      • Majority of variants unique to method should be false positives (FPs)
      • Majority of variants unique to benchmark should be false negatives (FNs)
      • Matching variants assumed to be true positives
  11. Design of our human genome reference values
      Diagram labels: Benchmark Variant Calls; Benchmark Regions; Query Variants
      • Variants outside benchmark regions are not assessed
      • Majority of variants unique to method should be false positives (FPs)
      • Majority of variants unique to benchmark should be false negatives (FNs)
      • Matching variants assumed to be true positives
      This does not directly measure the accuracy of the reference values; rather, it shows that they are fit for purpose.
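
The comparison logic on slides 10 and 11 can be sketched in a few lines of Python. This is a minimal illustration, not the GIAB or GA4GH implementation: it assumes variants are already normalized so that identical variants have identical (chrom, pos, ref, alt) tuples, and it matches them exactly, whereas real benchmarking tools also compare genotypes and differing representations. The names `Variant`, `in_regions`, and `compare_to_benchmark`, the 0-based positions, and the half-open region convention are all assumptions made for the example.

```python
from typing import Iterable, NamedTuple, Tuple


class Variant(NamedTuple):
    chrom: str
    pos: int  # 0-based start; an assumed convention for this sketch
    ref: str
    alt: str


# (chrom, start, end), half-open like a BED interval (assumed convention)
Region = Tuple[str, int, int]


def in_regions(variant: Variant, regions: Iterable[Region]) -> bool:
    """True if the variant's start position falls inside any benchmark region."""
    return any(
        chrom == variant.chrom and start <= variant.pos < end
        for chrom, start, end in regions
    )


def compare_to_benchmark(benchmark, query, regions):
    """Count TP/FP/FN inside the benchmark regions using exact matching."""
    regions = list(regions)
    bench_in = {v for v in benchmark if in_regions(v, regions)}
    query_in = {v for v in query if in_regions(v, regions)}

    tp = len(bench_in & query_in)              # matching variants, assumed true positives
    fn = len(bench_in - query_in)              # unique to the benchmark
    fp = len(query_in - bench_in)              # unique to the method being evaluated
    not_assessed = len(set(query) - query_in)  # query calls outside benchmark regions

    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "not_assessed": not_assessed,
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
    }


# Toy data, invented purely for illustration.
benchmark_calls = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "C", "T"),
    Variant("chr1", 900, "G", "A"),  # outside the benchmark region below
]
query_calls = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 300, "T", "C"),
    Variant("chr1", 950, "A", "C"),  # outside the benchmark region below
]
benchmark_regions = [("chr1", 0, 500)]

result = compare_to_benchmark(benchmark_calls, query_calls, benchmark_regions)
print(result)
```

Restricting both call sets to the benchmark regions before comparing is what keeps calls outside those regions from being counted as errors in either direction.
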
  12. GIAB Recently Published Resources for “Easier” Small Variants
  13. Now using linked and long reads for difficult variants and regions
      GIAB/HPRC Public Data
      • Linked Reads
        – 10x Genomics
        – Complete Genomics/BGI stLFR
        – TELL-seq
        – Hi-C
        – Strand-seq
      • Long Reads
        – PacBio Continuous Long Reads
        – PacBio Circular Consensus Seq
        – Oxford Nanopore “ultralong”
        – PromethION
      GIAB Use Cases
      • Develop structural variant benchmark
        – bioRxiv 664623
      • Diploid assembly of difficult regions like MHC
        – bioRxiv 831792
        – New collaboration with
      • Expand small variant benchmark
        – v4.1 available, manuscript in prep
  14. Figure panels: 50 to 1000 bp (label: Alu); 1 kbp to 10 kbp (label: LINE)
      • Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
      • Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
      • Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
      • Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
      • Filter complex: 12745 SVs not within 1kb of another SV
      • Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly v0.6
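
To make the filtering funnel on slide 14 easier to read, the sketch below computes what fraction of SV calls survives each step, using only the counts quoted on the slide. The stage labels are shortened paraphrases, and `retention` is a hypothetical helper name. The steps change scope partway through (trio for the first two, son afterwards), so the fractions are a rough guide rather than a like-for-like comparison.

```python
# Counts copied from slide 14; labels are shortened paraphrases.
STAGES = [
    ("Clustered sequence-resolved SVs >=50bp (trio)", 128715),
    ("Supported by 2+ techs, 5+ callers, or BioNano/Nabsys (trio)", 30062),
    ("Consensus genotype from svviz (son)", 19748),
    ("Not within 1kb of another SV", 12745),
    ("Inside v0.6 benchmark regions", 9641),
]


def retention(stages):
    """Return (label, count, fraction kept relative to the previous stage)."""
    rows = []
    previous = None
    for label, count in stages:
        fraction = None if previous is None else count / previous
        rows.append((label, count, fraction))
        previous = count
    return rows


rows = retention(STAGES)
for label, count, fraction in rows:
    kept = "" if fraction is None else f"  ({fraction:.1%} of previous step)"
    print(f"{count:>7}  {label}{kept}")

overall = STAGES[-1][1] / STAGES[0][1]
print(f"Overall fraction of clustered calls in the benchmark: {overall:.1%}")
```
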
  15. New SV Benchmark Reliably Identifies FPs and FNs
      Figure: Count of curated FN and FP calls by Structural Variant Type (DEL, INS) for long-read and short-read callsets, colored by “Is GIAB Correct?” (No, Maybe, Partial, Yes)
  16. Diploid assembly of MHC
      Martin et al., 2016, bioRxiv 085050
      Chin and Khalak, 2019, bioRxiv 705616
      *Now dipcall
  17. Alignments of assembly to reference
      • Two haplotigs span through whole MHC region
      • New version correctly assembles 30kb seg dup
  18. New small variant benchmark includes more bases of human genome and variants
      Benchmark Set   GRCh38 Coverage   SNPs        INDELs
      v3.3.2          85.4%             3,028,458   476,514
      v4.1            92.2%             3,363,367   528,138
      Figure: Percent increase in v4.1 compared to v3.3.2, computed as 100*(v4.1 - v3.3.2)/v3.3.2, for GRCh38 reference bases covered, single nucleotide variants, and small indels
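
The percent-increase formula quoted on slide 18 can be applied directly to the table values. The sketch below does only that; the dictionary keys and the function name `percent_increase` are chosen for the example. For the coverage row the result is the relative increase of a percentage, which is larger than the 6.8 percentage-point difference between 92.2% and 85.4%.

```python
def percent_increase(new, old):
    """Relative increase as defined on the slide: 100 * (new - old) / old."""
    return 100 * (new - old) / old


# Values copied from the table on slide 18.
V3_3_2 = {"GRCh38 coverage (%)": 85.4, "SNPs": 3_028_458, "INDELs": 476_514}
V4_1 = {"GRCh38 coverage (%)": 92.2, "SNPs": 3_363_367, "INDELs": 528_138}

increases = {key: percent_increase(V4_1[key], V3_3_2[key]) for key in V3_3_2}

for key, value in increases.items():
    print(f"{key}: +{value:.1f}%")
```
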
  19. New benchmark covers more medically-relevant genes that are difficult to map for short reads
      v4.1 covers many more difficult, medically-relevant genes.
      Figure: Cumulative distribution for percent gene covered by benchmark regions for 193 difficult, medically-relevant genes.
      • Remaining regions to cover:
        – Very difficult seg dups
        – Structural variants
        – Large duplications
        – Some small complex variants
        – Some >15bp indels
        – Satellite DNA
  20. Performance with new benchmark demonstrates utility in regions that are difficult for short reads
      • Comparison of FNs from different sequencing technologies and variant calling methods against benchmark set
      • New benchmark identifies more SNP FNs across technologies, mostly due to new benchmark variants in difficult-to-map regions and segmental duplications
  21. Benchmark reliably identifies FPs and FNs across diverse callsets
  22. Germline Variant Calling Benchmarking
  23. GA4GH Benchmarking Tool
  24. Example of benchmarking a diploid assembly
      • Call variants with dipcall
      • Stratify performance by difficult regions
        – More errors in seg dups
        – More indel errors in long homopolymers
      • Can also separate genotyping errors from other FPs
      • Can subset Recall to regions covered by both haplotypes
      • Also gives fraction of variants not assessed because they were outside benchmark regions
      Type    Region               Recall (%)   Precision (%)
      SNV     All in benchmark     98.4         97.8
      SNV     SegDup               76.9         59.6
      SNV     Easy*                99.5         99.9
      Indel   All in benchmark     93.0         83.3
      Indel   Homopolymers >10bp   79.7         52.7
      Indel   Easy*                99.1         94.4
      *Easy := genome after excluding all homopolymers >6bp, tandem repeats, seg dups, and low mappability regions
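
One way to compare the stratified rows on slide 24 at a glance is to combine recall and precision into an F1 score, their harmonic mean. F1 does not appear on the slide; it is added here as an assumption about what a reader might want, and the function name `f1` and the row layout are chosen for the example. The recall and precision values are copied from the slide and treated as percentages.

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision (same units in, same units out)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (type, region, recall %, precision %) copied from slide 24.
ROWS = [
    ("SNV", "All in benchmark", 98.4, 97.8),
    ("SNV", "SegDup", 76.9, 59.6),
    ("SNV", "Easy", 99.5, 99.9),
    ("Indel", "All in benchmark", 93.0, 83.3),
    ("Indel", "Homopolymers >10bp", 79.7, 52.7),
    ("Indel", "Easy", 99.1, 94.4),
]

scores = {(vtype, region): f1(r, p) for vtype, region, r, p in ROWS}

for (vtype, region), score in scores.items():
    print(f"{vtype:<6}{region:<20}F1 = {score:.1f}")
```

Because the harmonic mean is pulled toward the lower of its two inputs, the difficult strata (seg dups, long homopolymers) separate from the easy regions more sharply than either metric alone would show.
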
  25. Small Variant Benchmarking Highlights
      Best practices for benchmarking germline variant calling
      • Supplemental Table 2 summarizes best practices
      Best practices implementation
      • Command line -
      • Graphical interface –
      • v2 stratification beds - in-a-bottle/genome-stratifications
      HappyR – R package for results
      • Github
  26. The road ahead...
      2020
      • New SV Benchmark for GRCh38 and other genomes
      • Small variant benchmark for other GIAB genomes
      • Focus on missing difficult clinical genes
      • Work with HPP on H2M variants
      • Somatic sample development
      2021+
      • Somatic benchmarking
      • Germline samples from new ancestries
      • Large segmental duplications
      • Centromere/telomere
      • Diploid assembly benchmarking
      • ...
  27.
  28. Acknowledgment of many GIAB contributors
      Government; Clinical Laboratories; Academic Laboratories; Bioinformatics developers; NGS technology developers; Reference samples
      * Funders
  29. For More Information
      - sign up for general GIAB and Analysis Team google groups
      GIAB slides:
      Public, Unembargoed Data:
      Global Alliance Benchmarking Team
        – Web-based implementation at
        – Best Practices at
      Public workshops
        – Join google groups for updates at
      Justin Zook:
      NIST postdoc opportunities available! Diploid assembly, cancer genomes, other ‘omics, …

Editor's Notes

    Fast Assembly / Fast Iteration

  • Have GRCh38 and include table
  • Add draft figure caption
  • True positives (TP) : variants/genotypes that match in truth and query
    False negatives (FN) : variants present in the truth set, but missed in the query
    False positives (FP) : variants that have mismatching genotypes or alt alleles, as well as query variant calls in regions a truth set would call confident hom-ref regions
  • Why benchmark and when
    Validating and optimizing a measurement process - e.g. Clinical lab validating NGS pipeline
    NGS and bioinformatic pipeline development
    NGS process QC
  • This is a good slide for 644:
    give a clinical anecdote
    Also numbers - attendance, publications, data, RM unit sales
    Reference sample distributors
    How much money from IAA?
    - sustained funding
    Quantify collaborators' input
    GIAB steering committee
    Examples of others contributing data, analyses
    How to describe emails