
Genome in a Bottle: Reference Materials to Benchmark Challenging Variants and Regions of the Human Genome



  1. Genome in a Bottle: Reference Materials to Benchmark Challenging Variants and Regions of the Human Genome. Justin Zook, on behalf of the Genome in a Bottle Consortium. National Institute of Standards and Technology (NIST) Human Genomics Team. Sept 30, 2021
  2. Motivation for Genome in a Bottle: Sequencing and analysis methods can give different answers, particularly in challenging, repetitive regions (O’Rawe et al., Genome Medicine, 2013)
  3. GIAB has characterized variants in 7 human genomes: HG001 (pilot genome NA12878); HG002, HG003, and HG004 (AJ trio); and HG005, HG006, and HG007 (Chinese trio).

     National Institute of Standards & Technology, Report of Investigation. Reference Material 8391: Human DNA for Whole-Genome Variant Assessment (Son of Eastern European Ashkenazim Jewish Ancestry)

     This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer (10 mM TRIS, 1 mM EDTA, pH 8.0).

     This material is intended for assessing performance of human genome sequencing variant calling by obtaining estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This genomic DNA is intended to be analyzed in the same way as any other extracted DNA sample a lab would process and analyze. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as functional or clinical interpretation.

     Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe and disseminate our best, most confident estimate of the genotypes using the data and methods currently available. These data and genomic characterizations will be maintained over time as new data accrue and measurement and informatics methods become available. The information values are given as a variant call file (vcf) that contains the high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information (NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high-confidence regions is:
  4. GIAB “Open Science” Virtuous Cycle: users analyze GIAB samples; benchmark vs. GIAB data; critical feedback to GIAB; integrate new methods; new benchmark data; method development, optimization, and demonstration; part of assay validation; GIAB/NIST expands to more difficult regions
  5. Design of our human genome reference values: Benchmark Variant Calls
  6. Design of our human genome reference values: Benchmark Variant Calls; Benchmark Regions (regions in which the benchmark contains (almost) all the variants)
  7. Design of our human genome reference values: Reference Values* consist of Benchmark Variant Calls and Benchmark Regions. *Currently no quality or confidence scores are associated with our reference values
  8. Design of our human genome reference values: Benchmark Variant Calls; Benchmark Regions; variants from any method being evaluated
  9. Design of our human genome reference values: variants from any method being evaluated are compared with the Benchmark Variant Calls inside the Benchmark Regions
     ● Variants outside benchmark regions are not assessed
     ● Matching variants are assumed to be true positives
     ● The majority of variants unique to the method should be false positives (FPs)
     ● The majority of variants unique to the benchmark should be false negatives (FNs)
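The comparison on slide 9 can be sketched in a few lines of Python. This is a minimal illustration, not the GIAB or GA4GH benchmarking tooling: the function names `in_regions` and `compare`, the tuple layout of a variant, and the toy coordinates are all assumptions made for the example. It also assumes two variants match only when position and alleles are identical, whereas real comparisons must reconcile different representations of the same variant.

```python
from bisect import bisect_right


def in_regions(regions, chrom, pos):
    """True if 1-based VCF position `pos` lies in a 0-based, half-open BED interval.

    `regions` maps chromosome -> sorted, non-overlapping (start, end) tuples.
    """
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= pos - 1.
    i = bisect_right(intervals, (pos - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos - 1 < intervals[i][1]


def compare(benchmark, query, regions):
    """Count TP/FP/FN inside the benchmark regions only (exact-match assumption)."""
    bench = {v for v in benchmark if in_regions(regions, v[0], v[1])}
    calls = {v for v in query if in_regions(regions, v[0], v[1])}
    tp, fp, fn = len(bench & calls), len(calls - bench), len(bench - calls)
    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }


# Hypothetical toy data: each variant is (chrom, pos, ref, alt).
regions = {"chr1": [(99, 200)]}
benchmark = {("chr1", 100, "A", "G"), ("chr1", 150, "C", "T"), ("chr1", 500, "G", "A")}
query = {("chr1", 100, "A", "G"), ("chr1", 180, "T", "C"), ("chr1", 900, "G", "C")}
result = compare(benchmark, query, regions)
print(result)
```

In the toy data, the variants at positions 500 and 900 fall outside the benchmark region, so they affect neither precision nor recall.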
  10. In 2019, GIAB and GA4GH Published Resources for “Easier” Small Variants
  11. First Structural Variant Benchmark Published
  12.
  13. v4.2.1 Small Variant Benchmark used Long and Linked Reads

      Reference  Benchmark  Reference     SNVs       Indels   Base pairs in Seg Dups
      Build      Set        Coverage (%)                      and low mappability
      GRCh37     v3.3.2     87.8          3,048,869  464,463   57,277,670
      GRCh37     v4.2.1     94.1          3,353,881  522,388  133,848,288
      GRCh38     v3.3.2     85.4          3,030,495  475,332   65,714,199
      GRCh38     v4.2.1     92.2          3,367,208  525,545  145,585,710

      Wagner et al,
  14. New benchmark includes challenging genes like PMS2. Figure track: segmental duplications
  15. Collaborating with FDA to use GIAB benchmark to inspire new methods
  16. The best-performing submissions were from new sequencing technologies and bioinformatics methods. Olson et al,
  17. Expanding the benchmark was important to demonstrate improved technologies and analysis methods for difficult genome regions. Olson et al,
  18. Stratification helps understand strengths of each technology/method. Panels: INDELs, SNVs. Olson et al,
  19. Shortcomings in Medical Genes for v4.2.1 benchmark
      ● Mandelker et al. in 2016 created a list of medical genes with at least one exon that is difficult to map with short reads
      ● v4.2.1 improved coverage of these genes, but many are still not fully covered
  20. Why Create a Medical Gene Benchmark for Genome in a Bottle?
      ● The HG002 v4.2.1 benchmark still excludes >10% of each of 395 medically relevant genes on chromosomes 1-22 on GRCh37 or GRCh38, due to structural variants, large segmental duplications, or other difficult regions
      ● Advances in diploid assembly enabled us to develop phased small variant and structural variant benchmarks in 273 of these 395 genes on both GRCh37 and GRCh38 for HG002
      Wagner et al,
      GIAB CMRG Team: Justin Wagner, Jason Chin, Fritz Sedlazeck
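The coverage criterion on slide 20 (a gene is flagged when more than 10% of it lies outside the benchmark regions) amounts to an interval-overlap calculation. The sketch below is only an illustration under stated assumptions: `covered_fraction` is a hypothetical helper, the coordinates are invented, intervals are 0-based and half-open as in BED files, and the benchmark regions are assumed not to overlap each other.

```python
def covered_fraction(gene_start, gene_end, regions):
    """Fraction of the gene span [gene_start, gene_end) inside benchmark regions.

    `regions` is a list of non-overlapping (start, end) intervals on the
    same chromosome as the gene.
    """
    covered = 0
    for start, end in regions:
        # Length of the overlap between this region and the gene, if any.
        overlap = min(end, gene_end) - max(start, gene_start)
        if overlap > 0:
            covered += overlap
    return covered / (gene_end - gene_start)


# Hypothetical gene spanning 1,000 bp with two benchmark regions overlapping it.
fraction = covered_fraction(1000, 2000, [(900, 1500), (1600, 1950)])
flagged = fraction <= 0.9  # more than 10% of the gene is excluded
print(fraction, flagged)
```

Here 850 of the 1,000 gene bases are covered, so the gene would be flagged as insufficiently covered.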
  21. Generating a Challenging Medical Gene Benchmark: trio-based diploid assembly
  22. Diploid Assembly Using PacBio HiFi reads
      ● Trio-hifiasm
        ○ Illumina reads for parents and PacBio HiFi reads for HG002
        ○ Best performance in Human Pangenome Reference Consortium diploid assembly bakeoff
      ● Called variants with dipcall
        ○ Outputs variant calls and confident regions
        ○ Confident regions: covered by exactly one contig from each haplotype
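The confident-region rule on slide 22 (covered by exactly one contig from each haplotype) can be illustrated with a small interval computation. This is a conceptual sketch and not dipcall's actual implementation: the helper names `depth_one` and `intersect`, and the contig alignment coordinates, are assumptions for the example. Intervals are half-open and lie on a single reference chromosome.

```python
def depth_one(intervals):
    """Return sorted intervals covered by exactly one of the input intervals."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a contig alignment begins
        events.append((end, -1))    # a contig alignment ends
    events.sort()
    result, depth, prev = [], 0, None
    for pos, delta in events:
        # The stretch [prev, pos) had constant depth; keep it if depth was 1.
        if depth == 1 and prev is not None and pos > prev:
            if result and result[-1][1] == prev:
                result[-1] = (result[-1][0], pos)  # extend an abutting interval
            else:
                result.append((prev, pos))
        depth += delta
        prev = pos
    return result


def intersect(a, b):
    """Intersect two sorted lists of non-overlapping intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical contig alignments for the two haplotypes.
hap1 = [(0, 100), (80, 200)]   # two contigs overlap in 80-100
hap2 = [(50, 150)]
confident = intersect(depth_one(hap1), depth_one(hap2))
print(confident)
```

The stretch 80-100 is dropped because two haplotype-1 contigs overlap there, and everything outside 50-150 is dropped because haplotype 2 has no contig.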
  23. New benchmark includes 273 challenging genes
      ● Curated each gene for accurate resolution by assembly in IGV
      ● Manually curated >1000 variant discrepancies and excluded errors in benchmark
      ● Most errors in homopolymers and/or highly homozygous regions
  24. The new CMRG small variant benchmark includes more challenging variants and identifies more false negatives
  25. Highlighting Genes in the New Benchmark: SMN1
  26. GRCh37 and GRCh38 contain different false duplications
      ● GRCh38 has an extra copy of some medically relevant genes like CBS, KCNE1, and CRYAA, causing mis-mapped reads
      ● Figure: gnomAD coverage of CBS on GRCh38 decreases for genome sequencing due to mapping ambiguity
      ● Figure: gnomAD coverage of CBS on GRCh37 is generally normal for genome (green) and exome (blue) samples
  27. False duplications on GRCh38 can be fixed by masking
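As a rough illustration of what masking means on slide 27, the sketch below replaces the bases of a falsely duplicated region with `N` so that reads can no longer align to the extra copy. It is an assumption-laden toy: `mask_sequence` is a hypothetical helper, the sequence and coordinates are invented, and the slides do not say which tool or coordinates were used in practice.

```python
def mask_sequence(seq, regions):
    """Return `seq` with every base in the 0-based, half-open `regions` set to 'N'."""
    bases = list(seq)
    for start, end in regions:
        # Clip the region to the sequence bounds before masking.
        start, end = max(start, 0), min(end, len(bases))
        if start < end:
            bases[start:end] = "N" * (end - start)
    return "".join(bases)


# Hypothetical 10-base reference with a false duplication at positions 2-5.
masked = mask_sequence("ACGTACGTAC", [(2, 5)])
print(masked)
```

Masking keeps all coordinates unchanged, which is why it can be applied to an existing reference without shifting downstream annotations.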
  28. T2T identified and fixed additional false duplications
      ● 12 regions affecting ~1.2 Mbp and 74 genes (including 22 protein coding genes)
      ● Most medically relevant genes included in 11 pairs of genes in 5 large duplicated regions on chr21
  29. Genes found to be falsely duplicated in CMRG and T2T work
  30. T2T also identified collapsed duplications in GRCh38
      ● 203 regions affecting ~8 Mbp and 308 genes (including 48 protein coding genes)
      ● Includes several medically relevant genes:
        ○ KCNJ18/KCNJ12
        ○ KMT2C
        ○ MAP2K3
  31. Which medical genes are still not >90% included in the benchmark?
      ● 110 on GRCh37 and 100 on GRCh38, plus all genes on chrX/chrY
      Progressively categorizing all 100 on GRCh38:
      ● 20 affected by gaps in the reference
      ● 38 had evidence of duplications in HG002 relative to GRCh38
        ○ Collapsed duplications in GRCh38 (e.g., KCNJ18)
        ○ Population copy number variability (e.g., LPA, KIR)
      ● 2 resolved on GRCh38 but not GRCh37
      ● 18 were >90% included by the dip.bed but had multiple contigs or a break in the assembly-assembly alignment
      ● 7 have a large deletion of part or all of the gene on one haplotype
      ● 4 have breaks or false duplications in the hifiasm assembly (e.g., SMN2)
      ● 2 are in the structurally variable immunoglobulin locus
      ● 6 resolved but excluded due to being previously assembled in the MHC
      ● 1 (TNNT3) has a structural error in GRCh38
  32. Plans for future assembly-based benchmarks
      ● Long-read assembly-based variants are reaching/surpassing the accuracy of our benchmarks (with some exceptions)
      ● Use T2T-HPRC’s assembly of HG002 chrX (and chrY?) to develop small variant and structural variant benchmark for genic and non-genic regions
      ● Use diploid assemblies of children in trios
  33. Exploring if AI can be used for Genomic Reference Material Development
      ● Exploring deep learning to assign uncertainty to genomic reference materials
      ● Exploring transparency for genomics AI (e.g., "model cards")
      ● Exploring explainability for AI-based reference materials
  34.
  35. 21st Century Cell Lines: Fully Consented and Characterized Cancer Tumor/Normal Cell Lines as Reference Materials
      ● Developing matched tumor/normal cell line pairs and donor normal tissue analyzed at early passages
        ○ Initial collaboration with Andrew Liss at MGH for pancreatic ductal adenocarcinoma (PDAC) cell lines
      ● Broadly consented for public release of genomic data and commercial use and redistribution
      ● Path to Cancer Genome in a Bottle
      Seeking collaborations for additional broadly consented tumor/normal cell lines
  36. Take-home messages
      ● Ongoing improvement of benchmarks has been needed to drive technology and bioinformatics innovations
      ● Assembly methods have advanced rapidly and are enabling characterization of increasingly challenging genome regions
      ● More work is needed to develop better benchmarks and benchmarking tools, particularly for tumor genomes
  37. Acknowledgment of many GIAB contributors: government; clinical laboratories; academic laboratories; bioinformatics developers; NGS technology developers; reference samples; funders
  38. Interested in getting involved?
      ● Sign up for general GIAB and Analysis Team google groups
      ● GIAB slides:
      ● Public, Unembargoed Data:
      ● Global Alliance Benchmarking Team
        ○ Web-based implementation at
        ○ Best Practices at
      ● GIAB Analysis Team Calls
        ○ Sign up for the google group to attend biweekly calls
      ● Justin Zook:
      We are hiring! Machine learning, diploid assembly, cancer genomes, data science, other ‘omics, …
