Genome in a Bottle: Reference Materials to Benchmark Challenging Variants and Regions of the Human Genome
Justin Zook, on behalf of the Genome in a Bottle Consortium
National Institute of Standards and Technology (NIST)
Human Genomics Team
Sept 30, 2021
Motivation for Genome in a Bottle: Sequencing and analysis methods can give
different answers, particularly in challenging, repetitive regions
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432
GIAB has characterized variants in 7 human genomes
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other extracted DNA sample a lab would process and
analyze. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
Pilot Genome: HG001 (NA12878)
AJ Trio: HG002, HG003, HG004
Chinese Trio: HG005, HG006, HG007
GIAB “Open Science” Virtuous Cycle
● Users analyze GIAB Samples
● Benchmark vs. GIAB data
● Critical feedback to GIAB
● Integrate new methods
● New benchmark data
● Method development, optimization, and demonstration
● Part of assay validation
● GIAB/NIST expands to more difficult regions
Design of our human genome reference values
● Benchmark Variant Calls
● Benchmark Regions: regions in which the benchmark contains (almost) all the variants
● Reference Values*
*Currently no quality or confidence scores associated with our reference values
● Variants from any method being evaluated
● Variants outside benchmark regions are not assessed
● Matching variants assumed to be true positives
● Majority of variants unique to method should be false positives (FPs)
● Majority of variants unique to benchmark should be false negatives (FNs)
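The comparison logic on these slides can be sketched in a few lines of Python. This is a minimal illustration, not GIAB's or GA4GH's tooling: the tuple representation of a variant, the function names, and the toy coordinates are all assumptions, and variants are matched only by exact position and alleles.

```python
def in_regions(regions, chrom, pos):
    """True if 0-based `pos` falls inside a half-open (start, end) benchmark region."""
    return any(start <= pos < end for start, end in regions.get(chrom, []))


def compare(benchmark, query, regions):
    """Count TP/FP/FN, ignoring every variant outside the benchmark regions."""
    # Variants are hypothetical (chrom, pos, ref, alt) tuples.
    bench = {v for v in benchmark if in_regions(regions, v[0], v[1])}
    qry = {v for v in query if in_regions(regions, v[0], v[1])}
    return {
        "TP": len(bench & qry),  # matching variants, assumed true positives
        "FP": len(qry - bench),  # unique to the method being evaluated
        "FN": len(bench - qry),  # unique to the benchmark
    }


# Toy data (made up for illustration only).
regions = {"chr1": [(100, 200)]}
benchmark = [("chr1", 110, "A", "G"), ("chr1", 150, "C", "T"), ("chr1", 300, "G", "A")]
query = [("chr1", 110, "A", "G"), ("chr1", 160, "T", "C"), ("chr1", 400, "A", "T")]
result = compare(benchmark, query, regions)
print(result)
```

Real comparison tools also have to reconcile different representations of the same variant, which exact matching like this does not attempt.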
In 2019, GIAB and GA4GH Published Resources for “Easier” Small Variants
First Structural Variant Benchmark Published
https://doi.org/10.1038/s41436-021-01187-w
v4.2.1 Small Variant Benchmark used Long and Linked Reads
Reference Build  Benchmark Set  Reference Coverage (%)  SNVs       Indels   Base pairs in Seg Dups and low mappability
GRCh37           v3.3.2         87.8                    3,048,869  464,463  57,277,670
GRCh37           v4.2.1         94.1                    3,353,881  522,388  133,848,288
GRCh38           v3.3.2         85.4                    3,030,495  475,332  65,714,199
GRCh38           v4.2.1         92.2                    3,367,208  525,545  145,585,710
Wagner et al, https://doi.org/10.1101/2020.07.24.2127
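To make the size of the v4.2.1 expansion concrete, the table values can be differenced per reference build. The sketch below only restates the numbers from the table; the dictionary layout and the `gain` helper are illustrative names, and it assumes the Reference Coverage column is a percentage.

```python
# (coverage %, SNVs, indels, bp in seg dups / low mappability), copied from the table.
rows = {
    ("GRCh37", "v3.3.2"): (87.8, 3_048_869, 464_463, 57_277_670),
    ("GRCh37", "v4.2.1"): (94.1, 3_353_881, 522_388, 133_848_288),
    ("GRCh38", "v3.3.2"): (85.4, 3_030_495, 475_332, 65_714_199),
    ("GRCh38", "v4.2.1"): (92.2, 3_367_208, 525_545, 145_585_710),
}


def gain(build):
    """Difference between v4.2.1 and v3.3.2 for one reference build."""
    old = rows[(build, "v3.3.2")]
    new = rows[(build, "v4.2.1")]
    return {
        "coverage_points": round(new[0] - old[0], 1),
        "extra_snvs": new[1] - old[1],
        "extra_indels": new[2] - old[2],
        "segdup_lowmap_fold": round(new[3] / old[3], 2),
    }


for build in ("GRCh37", "GRCh38"):
    print(build, gain(build))
```

On both builds the benchmarked bases in segmental duplications and low-mappability regions more than double, which is where most of the added difficulty lies.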
New benchmark includes challenging genes like PMS2
Segmental duplications
Collaborating with FDA to use GIAB
benchmark to inspire new methods
https://precision.fda.gov/challenges/10
The best-performing submissions were from new sequencing
technologies and bioinformatics methods
Olson et al, https://doi.org/10.1101/2020.11.13.380
Expanding the benchmark was important to demonstrate improved
technologies and analysis methods for difficult genome regions
Olson et al, https://doi.org/10.1101/2020.11.13.380
INDELs SNVs
Stratification helps understand strengths of each technology/method
Olson et al, https://doi.org/10.1101/2020.11.13.380
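Stratified results are usually read as precision, recall, and F1 computed separately for each genomic context. A minimal sketch follows; the stratum names and counts are hypothetical placeholders, not results from the challenge.

```python
def summarize(counts):
    """Return (precision, recall, F1) from a dict of TP/FP/FN counts."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = float("nan")
    return precision, recall, f1


# Hypothetical per-stratum counts for one method (illustration only).
strata = {
    "all benchmark regions": {"TP": 9900, "FP": 50, "FN": 100},
    "segmental duplications": {"TP": 90, "FP": 10, "FN": 30},
}

for name, counts in strata.items():
    p, r, f1 = summarize(counts)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

A method can look nearly perfect genome-wide while performing much worse in a difficult stratum, which is the reason for reporting each stratum separately.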
Shortcomings in Medical Genes for v4.2.1 benchmark
● Mandelker et al. in 2016
created a list of medical
genes with at least one
exon that is difficult to map
with short reads
● v4.2.1 improved coverage
of these genes but many
are still not fully covered
Why Create a Medical Gene Benchmark for Genome in a
Bottle?
● HG002 v4.2.1 benchmark still excludes >10% of 395 medically relevant
genes on chromosomes 1-22 on GRCh37 or GRCh38 due to structural
variants, large segmental duplications, or other difficult regions
● Advances in diploid assembly enabled us to develop phased small
variant and structural variant benchmarks in 273 of these 395 genes on
both GRCh37 and GRCh38 for HG002
Wagner et al, https://doi.org/10.1101/2021.06.07.444885
Justin Wagner
Jason Chin
Fritz Sedlazeck
GIAB CMRG Team
Generating a Challenging Medical Gene Benchmark
Trio-based
diploid
assembly
Diploid Assembly Using PacBio HiFi reads
● Trio-hifiasm
○ Illumina reads for parents and
PacBio HiFi reads for HG002
○ Best performance in Human
Pangenome Reference Consortium
diploid assembly bakeoff
● Called variants with dipcall
○ Outputs variant calls and confident
regions
○ Confident regions: covered by
exactly one contig from each
haplotype
https://github.com/lh3/dipcall
https://doi.org/10.1038/s41592-020-01056-5
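The confident-region criterion stated above (exactly one contig from each haplotype) can be illustrated with a small interval sweep. This is a sketch of the stated rule only, not dipcall's implementation, which may apply additional filters not described here; the function names and toy alignments are assumptions.

```python
def depth_one(intervals):
    """Half-open regions covered by exactly one of the given (start, end) intervals."""
    events = sorted([(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals])
    out, depth, prev = [], 0, None
    for pos, delta in events:
        if depth == 1 and pos > prev:
            if out and out[-1][1] == prev:
                out[-1] = (out[-1][0], pos)  # extend an adjacent segment
            else:
                out.append((prev, pos))
        depth += delta
        prev = pos
    return out


def intersect(a, b):
    """Intersection of two sorted, non-overlapping interval lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Toy contig alignments on one chromosome (made-up coordinates).
hap1 = [(0, 100), (80, 200)]  # two contigs overlap at 80-100
hap2 = [(50, 150)]
confident = intersect(depth_one(hap1), depth_one(hap2))
print(confident)
```

Positions where a haplotype has no contig, or more than one, drop out of the result, which is why assembly breaks and duplicated alignments shrink the confident regions.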
New benchmark
includes 273 challenging
genes
● Curated each gene for
accurate resolution by
assembly in IGV
● Manually curated >1000
variant discrepancies and
excluded errors in benchmark
● Most errors in homopolymers
and/or highly homozygous
regions
The new CMRG small variant benchmark includes more
challenging variants and identifies more false negatives
Highlighting Genes in the New Benchmark – SMN1
GRCh37 and GRCh38 contain different false duplications
• GRCh38 has an extra copy of some medically relevant genes
like CBS, KCNE1, and CRYAA, causing mis-mapped reads
https://gnomad.broadinstitute.org/gene/ENSG00000160200?dataset=gnomad_r2_1
gnomAD coverage of CBS on GRCh38 decreases for genome sequencing due to mapping ambiguity
gnomAD coverage of CBS on GRCh37 is generally normal for genome (green) and exome (blue) samples
False duplications on GRCh38 can be fixed by masking
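Masking here means replacing the bases of the falsely duplicated copy with N so that reads can only align to the remaining copy. A minimal sketch on an in-memory sequence is shown below; the function name and coordinates are hypothetical, and the actual regions to mask would come from the cited work.

```python
def mask(seq, regions):
    """Replace bases in each half-open (start, end) region with 'N'."""
    chars = list(seq)
    for start, end in regions:
        start = max(0, start)
        end = min(end, len(chars))  # clamp so the sequence length never changes
        for i in range(start, end):
            chars[i] = "N"
    return "".join(chars)


# Toy sequence and region (illustration only).
masked = mask("ACGTACGTAC", [(2, 5)])
print(masked)
```

Because the masked sequence keeps its length, coordinates elsewhere on the chromosome are unchanged and existing annotations remain valid.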
T2T identified and fixed additional false duplications
● 12 regions affecting ~1.2 Mbp and 74 genes (including 22 protein coding genes)
● Most medically relevant genes included in 11 pairs of genes in 5 large duplicated
regions on chr21
https://doi.org/10.1101/2021.07.12.452063
Genes found to be falsely
duplicated in CMRG and
T2T work
T2T also identified collapsed
duplications in GRCh38
● 203 regions affecting ~8 Mbp and 308 genes
(including 48 protein coding genes)
● Includes several medically-relevant genes:
○ KCNJ18/KCNJ12
○ KMT2C
○ MAP2K3
https://doi.org/10.1101/2021.07.12.452063
Which medical genes are still not >90% included?
● 110 on GRCh37 and 100 on GRCh38 + all genes on chrX/chrY
Progressively categorizing all 100 on GRCh38:
● 20 affected by gaps in the reference
● 38 had evidence of duplications in HG002 relative to GRCh38
○ Collapsed duplications in GRCh38 (e.g., KCNJ18)
○ Population copy number variability (e.g., LPA, KIR)
● 2 resolved on GRCh38 but not GRCh37
● 18 were >90% included by the dip.bed but had multiple contigs or a break in the
assembly-assembly alignment
● 7 have a large deletion of part or all of the gene on one haplotype
● 4 have breaks or false duplications in the hifiasm assembly (e.g., SMN2)
● 2 are in the structurally variable immunoglobulin locus
● 6 resolved but excluded due to being previously assembled in the MHC
● one (TNNT3) has a structural error in GRCh38
Plans for future assembly-based benchmarks
● Long-read assembly-based variants are reaching/surpassing the accuracy of
our benchmarks (with some exceptions)
● Use T2T-HPRC’s assembly of HG002 chrX (and chrY?) to develop small
variant and structural variant benchmark for genic and non-genic regions
● Use diploid assemblies of children in trios
Exploring if AI can be used for Genomic Reference Material
Development
● Exploring deep learning to assign
uncertainty to genomic reference
materials
● Exploring transparency for genomics AI
(e.g., "model cards")
● Exploring explainability for AI-based
reference materials
https://mdic.org/project/cancer-genomic-somatic-reference-samples/
21st Century Cell Lines: Fully Consented and
Characterized Cancer Tumor/Normal Cell Lines as
Reference Materials
● Developing matched tumor/normal cell line
pairs and donor normal tissue analyzed at early
passages
○ Initial collaboration with Andrew Liss at MGH for
pancreatic ductal adenocarcinoma (PDAC) cell
lines
● Broadly consented for public release of
genomic data and commercial use and
redistribution
● Path to Cancer Genome in a Bottle
Seeking collaborations for additional broadly-consented tumor/normal cell lines
Take-home messages
● Ongoing improvement of benchmarks has been needed to
drive technology and bioinformatics innovations
● Assembly methods have advanced rapidly and are
enabling characterization of increasingly challenging
genome regions
● More work is needed to develop better benchmarks and
benchmarking tools, particularly for tumor genomes
Acknowledgment of many GIAB contributors
Government
Clinical Laboratories Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders
Interested in getting involved?
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google groups
GIAB slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
• http://www.nature.com/articles/sdata201625
• ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
• github.com/genome-in-a-bottle
Global Alliance Benchmarking Team
• https://github.com/ga4gh/benchmarking-tools
• Web-based implementation at precision.fda.gov
• Best Practices at https://rdcu.be/bqpDT
GIAB Analysis Team Calls
• Sign up for the google group to attend biweekly calls
Justin Zook: jzook@nist.gov
We are hiring!
Machine learning,
diploid assembly,
cancer genomes,
data science,
other ‘omics, …
