Giab for jax long read 190917


  1. 1. September 17, 2019 Genome in a Bottle: Developing Benchmarks for Challenging Variants With Long Reads www.slideshare.net/genomeinabottle
  2. 2. NIST Human Genomics Team
      • Purpose: Inspire trust in human genome measurements to enable
        – Technology innovation
        – Clinical translation
        – Science-based regulatory oversight
        – Human health
      • Values:
        – Understand stakeholder needs
        – Collaborate with experts and synthesize results
          • Sequencing technologies
          • Informatics developers
        – Open science
          • Open data
          • Open analyses
          • Open samples
  3. 3. Why start Genome in a Bottle?
      • A map of every individual’s genome will soon be possible, but how will we know if it is correct?
      • Diagnostics and precision medicine require high levels of confidence
      • Well-characterized, broadly disseminated genomes are needed to benchmark performance of sequencing
      • NIST and FDA funding for the work
      O’Rawe et al, Genome Medicine, 2013 https://doi.org/10.1186/gm432
  4. 4. Human Genome Sequencing needed a new class of Reference Materials with billions of reference values
      [Image credit: By Russ London at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=9923576]
  5. 5. Many diverse contributors to GIAB
      [Figure: contributor logos grouped as Government, Clinical Laboratories, Academic Laboratories, Bioinformatics developers, NGS technology developers, Reference samples; * marks Funders]
  6. 6. GIAB has characterized 7 human genomes
      • Pilot genome
        – NA12878
      • PGP Human Genomes
        – Ashkenazi Jewish son
        – Ashkenazi Jewish trio
        – Chinese son
      • Parents also characterized

      [Excerpt shown on slide] National Institute of Standards & Technology, Report of Investigation, Reference Material 8391: Human DNA for Whole-Genome Variant Assessment (Son of Eastern European Ashkenazim Jewish Ancestry)

      This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer (10 mM TRIS, 1 mM EDTA, pH 8.0).

      This material is intended for assessing performance of human genome sequencing variant calling by obtaining estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as functional or clinical interpretation.

      Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods similar to described in reference 1. An information value is considered to be a value that will be of interest and use to the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available. These data and genomic characterizations will be maintained over time as new data accrue and measurement and informatics methods become available. The information values are given as a variant call file (vcf) that contains the high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information (NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
  7. 7. Design of our human genome reference values
      [Figure labels: Benchmark variant calls (Reference Values); Variants from any method being evaluated; Benchmark regions (Reference Values)]
  8. 8. Goal for our human genome reference values
      [Figure labels: Benchmark variant calls (Reference Values); Variants from any method being evaluated; Benchmark regions (Reference Values)]
      • Variants outside benchmark regions are not assessed
      • Majority of variants unique to method should be false positives (FPs)
      • Majority of variants unique to benchmark should be false negatives (FNs)
      • Matching variants assumed to be true positives
  9. 9. Goal for our human genome reference values
      [Figure labels: Benchmark variant calls (Reference Values); Variants from any method being evaluated; Benchmark regions (Reference Values)]
      • Variants outside benchmark regions are not assessed
      • Majority of variants unique to method should be false positives
      • Majority of variants unique to benchmark should be false negatives
      This does not directly give the accuracy of the reference values, but rather shows that they are fit for purpose.
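
The comparison logic on the two "Goal" slides can be sketched in a few lines of Python. This is a minimal illustration and not the GIAB or GA4GH implementation: the names `Variant`, `in_regions`, and `compare` are hypothetical, variants are matched only by exact chromosome, position, and alleles, and the benchmark regions are assumed to be BED-style intervals (0-based start, exclusive end). Real benchmarking tools also reconcile different representations of the same variant, which this sketch ignores.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based position, as in a VCF record
    ref: str
    alt: str


def in_regions(v, regions):
    """Return True if the variant lies in any (chrom, start, end) interval.

    Intervals are assumed BED-style: 0-based start, end-exclusive, so a
    1-based position p is inside when start < p <= end.
    """
    return any(c == v.chrom and s < v.pos <= e for c, s, e in regions)


def compare(benchmark, query, regions):
    """Count TP, FP, and FN inside the benchmark regions only."""
    # Variants outside the benchmark regions are not assessed.
    b = {v for v in benchmark if in_regions(v, regions)}
    q = {v for v in query if in_regions(v, regions)}
    return {
        "TP": len(b & q),   # matching variants are assumed true positives
        "FP": len(q - b),   # unique to the method under evaluation
        "FN": len(b - q),   # unique to the benchmark
    }


# Toy data (hypothetical coordinates, for illustration only).
regions = [("chr1", 100, 200)]
benchmark = [
    Variant("chr1", 150, "A", "G"),
    Variant("chr1", 180, "C", "T"),
    Variant("chr1", 500, "G", "A"),   # outside regions: ignored
]
query = [
    Variant("chr1", 150, "A", "G"),
    Variant("chr1", 160, "T", "C"),
    Variant("chr1", 600, "C", "A"),   # outside regions: ignored
]
print(compare(benchmark, query, regions))  # {'TP': 1, 'FP': 1, 'FN': 1}
```

Restricting both sets to the benchmark regions before comparing is what keeps calls in uncharacterized parts of the genome from being counted as errors.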
  10. 10. GIAB Recently Published Resources for “Easier” Small Variants
  11. 11. GIAB has extensive public, unembargoed data
      • Short reads: BGISEQ, Complete Genomics, Illumina, Ion Torrent, SOLiD
      • Linked reads: 10x Genomics, BGISEQ stLFR, Illumina 6kb mate-pair, HiC
      • Long reads: PacBio, PacBio CCS, PromethION, Ultralong Oxford Nanopore
      • Optical/electronic mapping: BioNano, Nabsys
      ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/
  12. 12. Now using linked and long reads for difficult variants and regions
      GIAB Public Data
      • Linked Reads
        – 10x Genomics
        – Complete Genomics/BGI stLFR
      • Long Reads
        – PacBio Continuous Long Reads
        – PacBio Circular Consensus Seq
        – Oxford Nanopore “ultralong”
        – PromethION
      GIAB Use Cases
      • Expand small variant benchmark
      • Develop structural variant benchmark
      • Diploid assembly of difficult regions like MHC
  13. 13. Expand small variant benchmark set to difficult to map regions Justin Wagner, NIST
  14. 14. Long+Linked Reads expand small variant benchmark
      Reference assemblies: GRCh37, GRCh38 (one table each)

                                             v3.3.2           v4beta
      Base pairs                             2,353,170,731    2,509,269,277
      Reference covered                      85.4%            91.03%
      SNPs                                   3,028,458        3,314,941
      Indels                                 476,514          519,494
      Base pairs in Segmental Duplications   5,382,891        73,819,342

                                             v3.3.2           v4beta
      Base pairs                             2,358,060,765    2,504,027,936
      Reference covered                      87.8%            93.2%
      SNPs                                   3,046,933        3,323,773
      Indels                                 465,670          519,152
      Base pairs in Segmental Duplications   13,722,546       64,300,499

      Benchmark includes more bases, variants, and segmental duplications in v4α
  15. 15. Small variant performance metrics decrease vs. new benchmark
      Comparison of Illumina GATK4 VCF against benchmark sets
      • SNP FN rate increases by a factor of 10
        – almost entirely due to new benchmark variants in difficult to map regions

      Subset                  v3.3.2 Recall   v4 Recall   v3.3.2 Precision   v4 Precision
      All SNPs                0.9995          0.9914      0.9981             0.9941
      Difficult to map SNPs   0.9474          0.4916      0.8911             0.7171
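
The change in false-negative rate can be checked from the recall columns, since FN rate = 1 - recall. The sketch below uses only the rounded values shown on the slide; the helper names `fn_rate` and `fn_fold_change` are illustrative. With these rounded inputs the fold change comes out at roughly 17 for all SNPs and just under 10 for difficult-to-map SNPs, so "a factor of 10" is best read as an order-of-magnitude statement.

```python
def fn_rate(recall):
    """False-negative rate implied by a recall value."""
    return 1.0 - recall


def fn_fold_change(old_recall, new_recall):
    """How many times larger the FN rate is against the new benchmark."""
    return fn_rate(new_recall) / fn_rate(old_recall)


# (v3.3.2 recall, v4 recall) as printed on the slide, already rounded.
recalls = {
    "All SNPs": (0.9995, 0.9914),
    "Difficult to map SNPs": (0.9474, 0.4916),
}

for subset, (old, new) in recalls.items():
    print(f"{subset}: FN rate {fn_rate(old):.4f} -> {fn_rate(new):.4f} "
          f"({fn_fold_change(old, new):.1f}x)")
```

Because the v3.3.2 recall for all SNPs is so close to 1, its FN rate is sensitive to rounding in the last digit, which is one reason the computed ratio should not be over-interpreted.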
  16. 16. Want to help us evaluate the benchmark?
      • Compare your small variants to the v4 benchmark
      • Manually curate some FPs and FNs
        – Pre-configured IGV sessions available!
      • Are they actually FPs and FNs?
      https://groups.google.com/forum/#!forum/giab-analysis-team
  17. 17. Develop sequence-resolved structural variant benchmark set GIAB Analysis Team and Nate Olson, NIST
  18. 18. [Figure: SV size distributions for 50 to 1000 bp and 1kbp to 10kbp, with Alu and LINE peaks labeled]
      • Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
      • Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
      • Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
      • Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
      • Filter complex: 12745 SVs not within 1kb of another SV
      • Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly
      v0.6 tinyurl.com/GIABSV06
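
To see how strongly each step narrows the candidate set, the counts on this slide can be turned into stage-to-stage retention fractions. The sketch below is only arithmetic on the published counts; the function name `funnel` and the shortened stage labels are illustrative, and it starts from the clustered >=50bp calls because the raw discovery counts are not comparable one-to-one with clustered calls.

```python
def funnel(stages):
    """Return (name, count, fraction kept from the previous stage) rows.

    The first stage has no predecessor, so its fraction is None.
    """
    rows = []
    prev = None
    for name, count in stages:
        kept = None if prev is None else count / prev
        rows.append((name, count, kept))
        prev = count
    return rows


# Counts copied from the slide (SVs >=50bp).
stages = [
    ("Compare SVs (clustered)", 128715),
    ("Discovery support", 30062),
    ("Evaluate/genotype", 19748),
    ("Filter complex", 12745),
    ("Regions (v0.6 benchmark)", 9641),
]

for name, count, kept in funnel(stages):
    note = "" if kept is None else f"  ({kept:.1%} of previous stage)"
    print(f"{name:<26}{count:>8}{note}")

overall = stages[-1][1] / stages[0][1]
print(f"Overall retention: {overall:.1%}")
```

The largest single drop is at the discovery-support step, where fewer than a quarter of clustered calls have multi-technology, multi-caller, or mapping support.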
  19. 19. Our benchmark sets are useful in evaluating SVs from multiple technologies
      Goal: When comparing any callset to our vcf within the bed, most putative FPs and FNs should be errors in the tested callset
      github.com/spiralgenetics/truvari
      github.com/nhansen/SVanalyzer
  20. 20. Resolve MHC regions from HG002
      https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/MHC
      Justin Wenger, Justin Zook, Mikko Rautiainen, Jason Chin, Tobias Marschall, Qian Zeng, Erik Garrison, Shilpa Garg
      Mar. 25-27, UCSC, The Human Pangenomics Hackathon
  21. 21. Goals
      • Make the best haplotype-correct assemblies for the MHC regions of HG002 from all available data
      • Correct phasing for small and large variants
      • Create GIAB small and structural variant benchmarks for this complicated but medically important region
      • Used in latest v4.0 draft small variant benchmark
  22. 22. Integrating assembly- and mapping-based calls gives best MHC benchmark
      • MHC assembly-based bed includes 23187 variants in 4.64/4.97 Mbp, excluding:
        – CYP21A2 and pseudogene
        – Homopolymers >10bp
        – SVs in assembly
        – Very dense variants
      • v4.0 mapping-based bed includes 13964 variants in 4.16/4.97 Mbp, excluding:
        – Short read callsets
        – Conflicts between callers
        – SVs from all methods
        – Homopolymers >10bp
        – Many clusters of variants, including some HLA genes
      • Only 11 differences between assembly and mapping based calls in both beds
        – 2 genotyping errors in assembly-based
        – 1 inaccurate complex allele and cluster of 8 missed variants in mapping-based
      • Merged benchmark includes 23229 variants in 4.67/4.97 Mbp
        – Covers most HLA genes and CYP21A2/TNXA/TNXB
  23. 23. Open consent enables secondary reference samples to meet specific clinical needs
      • >50 products now available based on broadly-consented, well-characterized GIAB PGP cell lines
      • Genomic DNA + DNA spike-ins
        – Clinical variants
        – Somatic variants
        – Difficult variants
      • Clinical matrix (FFPE)
      • Circulating tumor DNA
      • Stem cells (iPSCs)
      • Genome editing
      • …
  24. 24. The road ahead...
      • 2019
        – Integration pipeline development for small and structural variants
        – Manuscripts for small and structural variants
      • 2020
        – Difficult large variants
        – Somatic sample development
        – Germline samples from new ancestries
        – Diploid assembly
      • 2021+
        – Somatic integration pipeline
        – Somatic structural variation
        – Large segmental duplications
        – Centromere/telomere
        – ...
  25. 25. Acknowledgment of many GIAB contributors
      [Figure: contributor logos grouped as Government, Clinical Laboratories, Academic Laboratories, Bioinformatics developers, NGS technology developers, Reference samples; * marks Funders]
  26. 26. For More Information
      • www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group
      • GIAB slides, including 2019 Workshop slides: www.slideshare.net/genomeinabottle
      • Public, Unembargoed Data:
        – http://www.nature.com/articles/sdata201625
        – ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
        – github.com/genome-in-a-bottle
      • Global Alliance Benchmarking Team
        – https://github.com/ga4gh/benchmarking-tools
        – Web-based implementation at precision.fda.gov
        – Best Practices at https://rdcu.be/bqpDT
      • Public workshops
        – Next workshop planned for April 1-2, 2020 at Stanford University, CA, USA
      Justin Zook: jzook@nist.gov
      NIST postdoc opportunities available! Diploid assembly, cancer genomes, other ‘omics, …

Editor's Notes

  • This is a good slide for 644:
    give a clinical anecdote
    Also numbers - attendance, publications, data, RM unit sales
    Reference sample distributors
    How much money from IAA?
    - sustained funding
    Quantify collaborators' input
    GIAB steering committee
    Examples of others contributing data, analyses
    How to describe emails
  • False negatives (FN): variants present in the truth set but missed in the query.
