Using accurate long reads to improve Genome in a Bottle Benchmarks 220923

  1. 1. Using accurate long reads to improve Genome in a Bottle Benchmarks Justin Zook, on behalf of the Genome in a Bottle Consortium National Institute of Standards and Technology (NIST) Human Genomics Team Sep 23, 2022
  2. 2. Motivation for Genome in a Bottle: Sequencing and analysis methods can give different answers, particularly in challenging, repetitive regions O’Rawe et al, Genome Medicine, 2013
  3. 3. GIAB has characterized variants in 7 human genomes National Institute of Standards & Technology Report of Investigation Reference Material 8391 Human DNA for Whole-Genome Variant Assessment (Son of Eastern European Ashkenazim Jewish Ancestry) This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer (10 mM TRIS, 1 mM EDTA, pH 8.0). This material is intended for assessing performance of human genome sequencing variant calling by obtaining estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available. These data and genomic characterizations will be maintained over time as new data accrue and measurement and informatics methods become available. The information values are given as a variant call file (vcf) that contains the high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information (NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is: Samples shown: Pilot Genome NA12878 (HG001*); AJ Trio (HG002*, HG003*, HG004*); Chinese Trio (HG005*, HG006, HG007). *NIST RMs developed from large batches of DNA
  4. 4. GIAB “Open Science” Virtuous Cycle Users analyze GIAB Samples Benchmark vs. GIAB data Critical feedback to GIAB Integrate new methods New benchmark data Method development, optimization, and demonstration Part of assay validation GIAB/NIST expands to more difficult regions
  5. 5. Design of our human genome reference values Benchmark Variant Calls
  6. 6. Benchmark Regions – regions in which the benchmark contains (almost) all the variants Benchmark Variant Calls Design of our human genome reference values
  7. 7. Variants from any method being evaluated Design of our human genome reference values Benchmark Regions Benchmark Variant Calls
  8. 8. Design of our human genome reference values: Reliable IDentification of Errors (RIDE). Matching variants assumed to be true positives. Majority of variants unique to method should be false positives (FPs). Majority of variants unique to benchmark should be false negatives (FNs). Variants outside benchmark regions are not assessed. Diagram labels: Benchmark Regions; Benchmark Variant Calls; Variants from any method being evaluated
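
The comparison logic on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the consortium's benchmarking pipeline: it assumes variants are already normalized to identical `(chrom, pos, ref, alt)` tuples with 0-based positions, and that benchmark regions are sorted, non-overlapping, half-open BED-style intervals. The names `in_regions` and `classify` are hypothetical. Real comparisons also have to reconcile different representations of the same variant, which exact tuple matching does not do.

```python
from bisect import bisect_right


def in_regions(regions, chrom, pos):
    """Return True if pos falls inside a benchmark interval on chrom.

    regions maps chrom -> sorted list of (start, end) half-open intervals.
    """
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


def classify(benchmark, query, regions):
    """Split variants into TP / FP / FN, ignoring anything outside the regions."""
    bench = {v for v in benchmark if in_regions(regions, v[0], v[1])}
    qry = {v for v in query if in_regions(regions, v[0], v[1])}
    return {
        "TP": bench & qry,   # matching variants
        "FP": qry - bench,   # unique to the method being evaluated
        "FN": bench - qry,   # unique to the benchmark
    }


# Toy data (made up for illustration).
regions = {"chr1": [(100, 200), (500, 600)]}
benchmark = [("chr1", 150, "A", "G"), ("chr1", 550, "C", "T"), ("chr1", 300, "G", "A")]
query = [("chr1", 150, "A", "G"), ("chr1", 160, "T", "C"), ("chr1", 700, "A", "T")]
result = classify(benchmark, query, regions)
```

In the toy data, the benchmark variant at position 300 and the query variant at position 700 lie outside the benchmark regions, so neither is counted.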
  9. 9.
  10. 10. Accurate long reads have been essential for improving GIAB benchmarks Small variants with mapping-based methods MHC with local de novo assembly Challenging medically relevant genes with trio de novo assembly (small var & isolated SVs) chrX/Y and whole genome with trio de novo assembly (small var + TRs + SVs)
  11. 11. v4.2.1 Small Variant Benchmark improved difficult to map regions with Long and Linked Reads (Wagner et al, Cell Genomics, 2022)

      Reference Build   Benchmark Set   Reference Coverage (%)   SNVs        Indels    Base pairs in Seg Dups and low mappability
      GRCh37            v3.3.2          87.8                     3,048,869   464,463   57,277,670
      GRCh37            v4.2.1          94.1                     3,353,881   522,388   133,848,288
      GRCh38            v3.3.2          85.4                     3,030,495   475,332   65,714,199
      GRCh38            v4.2.1          92.2                     3,367,208   525,545   145,585,710
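
To make the size of the improvement concrete, the following sketch computes the change from v3.3.2 to v4.2.1 for each reference build, using only the numbers on this slide. The dictionary layout and the function name `gain` are illustrative choices, not part of any GIAB tool.

```python
# (build, benchmark set) -> (coverage %, SNVs, indels, bp in seg dups / low mappability)
rows = {
    ("GRCh37", "v3.3.2"): (87.8, 3_048_869, 464_463, 57_277_670),
    ("GRCh37", "v4.2.1"): (94.1, 3_353_881, 522_388, 133_848_288),
    ("GRCh38", "v3.3.2"): (85.4, 3_030_495, 475_332, 65_714_199),
    ("GRCh38", "v4.2.1"): (92.2, 3_367_208, 525_545, 145_585_710),
}


def gain(build):
    """Difference between v4.2.1 and v3.3.2 for one reference build."""
    old = rows[(build, "v3.3.2")]
    new = rows[(build, "v4.2.1")]
    return {
        "coverage_points": round(new[0] - old[0], 1),  # percentage points
        "snv_gain": new[1] - old[1],
        "indel_gain": new[2] - old[2],
        "difficult_bp_fold": new[3] / old[3],          # fold change
    }


for build in ("GRCh37", "GRCh38"):
    print(build, gain(build))
```

On both builds the benchmark gains roughly 6 to 7 percentage points of reference coverage, and the number of benchmark bases in segmental duplications and low-mappability regions more than doubles.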
  12. 12. Collaborating with FDA to use GIAB benchmark to inspire new methods
  13. 13. The best-performing submissions were from new sequencing technologies and bioinformatics methods Olson et al, Cell Genomics, 2022
  14. 14. Stratification helps understand strengths of each technology/method (panels: INDELs, SNVs) Olson et al, Cell Genomics, 2022
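
Stratified evaluation amounts to computing the usual accuracy metrics separately for each genomic context. The sketch below shows the arithmetic, assuming TP/FP/FN counts per stratum are already available. The stratum names and all counts are invented for illustration and do not come from the challenge results.

```python
def metrics(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = float("nan")
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical per-stratum counts for one method and one variant type.
counts_by_stratum = {
    "all benchmark regions": (9_900, 50, 100),
    "low mappability and seg dups": (90, 10, 30),
}

by_stratum = {name: metrics(*c) for name, c in counts_by_stratum.items()}
for name, m in by_stratum.items():
    print(f"{name}: P={m['precision']:.3f} R={m['recall']:.3f} F1={m['f1']:.3f}")
```

Reporting the metrics per stratum rather than genome-wide keeps a method's weakness in a small, difficult context from being hidden by its performance in the much larger easy regions.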
  15. 15. Shortcomings in Medical Genes for v4.2.1 benchmark ● Mandelker et al. in 2016 created a list of medical genes with at least one exon that is difficult to map with short reads ● v4.2.1 improved coverage of these genes but many are still not fully covered
  16. 16. Generating a Benchmark for 273 Challenging Genes from Trio-based Long read diploid assembly Manually curated >1000 variants Wagner et al, Nature Biotech, 2022
  17. 17. Highlighting Genes in the New Benchmark – SMN1 Wagner et al, Nature Biotech, 2022
  18. 18. False duplications on GRCh38 can be fixed by masking Wagner et al, Nature Biotech, 2022
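
As a rough illustration of what masking means here, the sketch below replaces the bases of a falsely duplicated interval with `N` so that aligners no longer place reads on the extra copy. It assumes hard-masking with `N` and half-open, 0-based intervals. The function name `mask_sequence`, the toy sequence, and the coordinates are all hypothetical. It does not address collapsed duplications, which masking cannot fix.

```python
def mask_sequence(seq, intervals):
    """Return seq with every base inside the given (start, end) intervals set to 'N'."""
    masked = list(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        start = max(0, start)
        end = min(len(masked), end)
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


# Toy example: mask positions 2..4 of a 10-base sequence.
original = "ACGTACGTAC"
masked = mask_sequence(original, [(2, 5)])
print(masked)
```

Masking keeps the coordinates of the rest of the reference unchanged, so existing annotations remain valid outside the masked intervals.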
  19. 19. T2T also identified collapsed duplications in GRCh38 ● 203 regions affecting ~8 Mbp and 308 genes (including 48 protein coding genes) ● Includes several medically-relevant genes: ○ KCNJ18/KCNJ12 ○ KMT2C ○ MAP2K3
  20. 20. Modifying GRCh38 to fix false duplications and collapsed duplications
  21. 21. Work In Progress - Data Registry: Queryable database with pointers to publicly available GIAB data along with summary statistics. Data Types: Sample, FASTQs, BAMs, VCFs. Capturing methods and linking datasets for data provenance
  22. 22. DEvelopment Framework for Assembly Based Benchmarks (DEFRABB)
  23. 23. Assembly-Based Benchmark Process Credits: Nate Olson, Jennifer McDaniel, and GIAB team
  24. 24. Building new GIAB resources with long reads ● RNA-seq ○ Recently generated Illumina and PacBio RNA-seq from several GIAB lymphoblastoid cell lines and iPSCs ■ ONT RNA-seq planned as well ○ Planned analyses include isoforms, variants, gene annotation ○ Collaborations welcome! ● Tumor/normal ○ Working with MGH and others to develop the first broadly-consented tumor/normal cell line pairs ○ Starting characterization of first pancreatic cancer cell line ● Engineering variants into GIAB cell lines ○ Collaboration with Medical Device Innovation Consortium Somatic Reference Samples project
  25. 25. Take-home messages ● Ongoing improvement of benchmarks has been needed to drive technology and bioinformatics innovations, particularly for long reads ● Assembly methods using accurate long reads have advanced rapidly and are enabling characterization of increasingly challenging genome regions ● More work is needed to develop better benchmarks and benchmarking tools, particularly for complex SVs and tumor genomes
  26. 26. Acknowledgment of many GIAB contributors Government Clinical Laboratories Academic Laboratories Bioinformatics developers NGS technology developers Reference samples * Funders * *
  27. 27. Interested in getting involved? - sign up for general GIAB and Analysis Team google groups GIAB slides: Public, Unembargoed Data: a-bottle We are hiring! Cancer genomes, Data Manager, Machine learning, diploid assembly, other ‘omics, …