Genome in a Bottle- reference materials to benchmark challenging variants and regions of the human genome 210930

Genome in a Bottle: Reference Materials to Benchmark Challenging Variants and Regions of the Human Genome
Justin Zook, on behalf of the Genome in a Bottle Consortium
National Institute of Standards and Technology (NIST)
Human Genomics Team
Sept 30, 2021

Motivation for Genome in a Bottle: Sequencing and analysis methods can give
different answers, particularly in challenging, repetitive regions
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432

GIAB has characterized variants in 7
human genomes
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
Pilot Genome: HG001 (NA12878)
AJ Trio: HG002, HG003, HG004
Chinese Trio: HG005, HG006, HG007

GIAB “Open Science” Virtuous Cycle
● Users analyze GIAB samples
● Benchmark vs. GIAB data
● Critical feedback to GIAB
● Integrate new methods
● New benchmark data
● Method development, optimization, and demonstration
● Part of assay validation
● GIAB/NIST expands to more difficult regions

Design of our human genome reference values
Reference Values* = Benchmark Variant Calls + Benchmark Regions
● Benchmark Regions: regions in which the benchmark contains (almost) all the variants
*Currently no quality or confidence scores associated with our reference values

Comparing variants from any method being evaluated against the Benchmark Variant Calls:
● Variants outside benchmark regions are not assessed
● Matching variants assumed to be true positives
● Majority of variants unique to method should be false positives (FPs)
● Majority of variants unique to benchmark should be false negatives (FNs)
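
The comparison logic on this slide can be sketched in a few lines of Python. This is only a minimal illustration: it assumes variants are already normalized to exact (chrom, pos, ref, alt) tuples and that benchmark regions are BED-style half-open intervals. The function names and toy data are hypothetical. Real comparison tools also reconcile different representations of the same variant and check genotypes, which this sketch ignores.

```python
from typing import Dict, List, Set, Tuple

Variant = Tuple[str, int, str, str]          # (chrom, 1-based pos, ref, alt)
Regions = Dict[str, List[Tuple[int, int]]]   # chrom -> 0-based half-open intervals (BED)


def in_regions(variant: Variant, regions: Regions) -> bool:
    """Return True if the variant's position lies inside a benchmark region."""
    chrom, pos, _ref, _alt = variant
    pos0 = pos - 1  # VCF positions are 1-based; BED starts are 0-based
    return any(start <= pos0 < end for start, end in regions.get(chrom, []))


def compare(benchmark: Set[Variant], query: Set[Variant], regions: Regions) -> Dict[str, int]:
    """Count TP, FP and FN, ignoring anything outside the benchmark regions."""
    bench_in = {v for v in benchmark if in_regions(v, regions)}
    query_in = {v for v in query if in_regions(v, regions)}
    return {
        "TP": len(bench_in & query_in),          # matching variants
        "FP": len(query_in - bench_in),          # unique to the method
        "FN": len(bench_in - query_in),          # unique to the benchmark
        "not_assessed": len(query - query_in),   # outside benchmark regions
    }


# Toy data (hypothetical, for illustration only)
regions = {"chr1": [(99, 200)]}
benchmark = {("chr1", 100, "A", "G"), ("chr1", 150, "C", "T")}
query = {("chr1", 100, "A", "G"), ("chr1", 180, "G", "A"), ("chr1", 500, "T", "C")}
counts = compare(benchmark, query, regions)
print(counts)
```

Exact tuple matching is the simplest possible rule and will undercount matches for complex variants, so it should be read as a description of the bookkeeping rather than a usable benchmarking tool.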

In 2019, GIAB and GA4GH Published
Resources for “Easier” Small Variants

First Structural Variant Benchmark Published

https://doi.org/10.1038/s41436-021-01187-w

v4.2.1 Small Variant Benchmark used Long and Linked Reads
Reference Build  Benchmark Set  Reference Coverage (%)  SNVs       Indels   Base pairs in Seg Dups and low mappability
GRCh37           v3.3.2         87.8                    3,048,869  464,463  57,277,670
GRCh37           v4.2.1         94.1                    3,353,881  522,388  133,848,288
GRCh38           v3.3.2         85.4                    3,030,495  475,332  65,714,199
GRCh38           v4.2.1         92.2                    3,367,208  525,545  145,585,710
Wagner et al, https://doi.org/10.1101/2020.07.24.2127
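
To make the size of the v4.2.1 expansion concrete, the following sketch computes the differences between the two benchmark versions from the numbers in the table above. The dictionary layout and the `gains` helper are hypothetical conveniences; only the input values come from the slide.

```python
# Values copied from the table: (SNVs, indels, bp in seg dups and low mappability)
table = {
    ("GRCh37", "v3.3.2"): (3_048_869, 464_463, 57_277_670),
    ("GRCh37", "v4.2.1"): (3_353_881, 522_388, 133_848_288),
    ("GRCh38", "v3.3.2"): (3_030_495, 475_332, 65_714_199),
    ("GRCh38", "v4.2.1"): (3_367_208, 525_545, 145_585_710),
}


def gains(build: str) -> dict:
    """Compare v4.2.1 against v3.3.2 for one reference build."""
    old = table[(build, "v3.3.2")]
    new = table[(build, "v4.2.1")]
    return {
        "added_snvs": new[0] - old[0],
        "added_indels": new[1] - old[1],
        "difficult_bp_fold": new[2] / old[2],
    }


for build in ("GRCh37", "GRCh38"):
    g = gains(build)
    print(f"{build}: +{g['added_snvs']:,} SNVs, +{g['added_indels']:,} indels, "
          f"{g['difficult_bp_fold']:.2f}x bases in seg dups and low mappability")
```

On both builds the benchmark gains roughly 300,000 SNVs and more than doubles the bases it covers in segmental duplications and low-mappability regions.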

New benchmark includes challenging genes like PMS2
Segmental duplications

Collaborating with FDA to use GIAB
benchmark to inspire new methods
https://precision.fda.gov/challenges/10

The best-performing submissions were from new sequencing
technologies and bioinformatics methods
Olson et al, https://doi.org/10.1101/2020.11.13.380

Expanding the benchmark was important to demonstrate improved
technologies and analysis methods for difficult genome regions

[Figure panels: INDELs, SNVs]
Stratification helps understand strengths of each technology/method
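
As a rough illustration of stratification, the sketch below groups per-variant outcomes by genomic context and reports precision and recall for each group. It assumes every variant has already been labeled TP, FP, or FN and assigned to one stratum. The stratum names and counts are invented for the example, and this is not how the GA4GH benchmarking tools are implemented.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def stratified_metrics(calls: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """calls: (stratum, outcome) pairs, with outcome in {"TP", "FP", "FN"}."""
    counts: Dict[str, Counter] = {}
    for stratum, outcome in calls:
        counts.setdefault(stratum, Counter())[outcome] += 1

    metrics = {}
    for stratum, c in counts.items():
        tp, fp, fn = c["TP"], c["FP"], c["FN"]
        metrics[stratum] = {
            # precision = TP / (TP + FP); recall = TP / (TP + FN)
            "precision": tp / (tp + fp) if tp + fp else float("nan"),
            "recall": tp / (tp + fn) if tp + fn else float("nan"),
        }
    return metrics


# Hypothetical outcomes for two strata
calls = (
    [("segdup", "TP")] * 6 + [("segdup", "FP")] * 2 + [("segdup", "FN")] * 4
    + [("homopolymer", "TP")] * 9 + [("homopolymer", "FP")] * 1 + [("homopolymer", "FN")] * 1
)
metrics = stratified_metrics(calls)
for stratum, m in metrics.items():
    print(f"{stratum}: precision={m['precision']:.2f} recall={m['recall']:.2f}")
```

Reporting metrics per stratum rather than genome-wide is what lets a method that is strong in one context and weak in another be seen as such.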

Shortcomings in Medical Genes for v4.2.1 benchmark
● Mandelker et al. in 2016 created a list of medical genes with at least one exon that is difficult to map with short reads
● v4.2.1 improved coverage of these genes but many are still not fully covered

Why Create a Medical Gene Benchmark for Genome in a Bottle?
● HG002 v4.2.1 benchmark still excludes >10% of each of 395 medically relevant genes on chromosomes 1-22 on GRCh37 or GRCh38 due to structural variants, large segmental duplications, or other difficult regions
● Advances in diploid assembly enabled us to develop phased small variant and structural variant benchmarks in 273 of these 395 genes on both GRCh37 and GRCh38 for HG002
Wagner et al, https://doi.org/10.1101/2021.06.07.444885
Justin Wagner
Jason Chin
Fritz Sedlazeck
GIAB CMRG Team

Generating a Challenging Medical Gene Benchmark
Trio-based diploid assembly

Diploid Assembly Using PacBio HiFi reads
● Trio-hifiasm
○ Illumina reads for parents and PacBio HiFi reads for HG002
○ Best performance in Human Pangenome Reference Consortium diploid assembly bakeoff
● Called variants with dipcall
○ Outputs variant calls and confident regions
○ Confident regions: covered by exactly one contig from each haplotype
https://github.com/lh3/dipcall
https://doi.org/10.1038/s41592-020-01056-5
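
The rule "covered by exactly one contig from each haplotype" can be illustrated with simple interval arithmetic. The sketch below is a simplified model for a single chromosome and is not dipcall's implementation: dipcall works from assembly-to-reference alignments and applies further filtering that is not modeled here. The function names and coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def depth_one(alignments: List[Interval]) -> List[Interval]:
    """Return merged intervals covered by exactly one contig alignment."""
    events = []
    for start, end in alignments:
        events.append((start, 1))    # coverage rises at an alignment start
        events.append((end, -1))     # and falls at its end
    events.sort()

    result: List[Interval] = []
    depth, prev = 0, None
    for pos, delta in events:
        if prev is not None and depth == 1 and pos > prev:
            if result and result[-1][1] == prev:
                result[-1] = (result[-1][0], pos)  # extend an adjacent interval
            else:
                result.append((prev, pos))
        depth += delta
        prev = pos
    return result


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, non-overlapping interval lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical contig alignments for each haplotype
hap1 = [(0, 100), (80, 150)]   # two contigs overlap at 80-100 (depth 2)
hap2 = [(10, 140)]             # one contig
confident = intersect(depth_one(hap1), depth_one(hap2))
print(confident)
```

In this toy case the overlap between the two haplotype 1 contigs and the ends not reached by haplotype 2 are excluded, leaving only positions where each haplotype is represented exactly once.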

New benchmark includes 273 challenging genes
● Curated each gene for accurate resolution by assembly in IGV
● Manually curated >1000 variant discrepancies and excluded errors in benchmark
● Most errors in homopolymers and/or highly homozygous regions

The new CMRG small variant benchmark includes more
challenging variants and identifies more false negatives

Highlighting Genes in the New Benchmark – SMN1

GRCh37 and GRCh38 contain different false duplications
• GRCh38 has an extra copy of some medically relevant genes
like CBS, KCNE1, and CRYAA, causing mis-mapped reads
https://gnomad.broadinstitute.org/gene/ENSG00000160200?dataset=gnomad_r2_1
gnomAD coverage of CBS on GRCh38 decreases for genome sequencing due to mapping ambiguity
gnomAD coverage of CBS on GRCh37 is generally normal for genome (green) and exome (blue) samples

False duplications on GRCh38 can be fixed by masking
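
One way to picture masking is hard-masking: the bases of the falsely duplicated copy are replaced with N so that reads can only align to the correct copy. The sketch below shows the idea on a toy sequence. The `mask_regions` helper, the contig name, and the coordinates are hypothetical; real masks would be defined by a BED file of the affected GRCh38 regions, which is not reproduced here.

```python
from typing import Dict, List, Tuple


def mask_regions(genome: Dict[str, str], bed: List[Tuple[str, int, int]]) -> Dict[str, str]:
    """Return a copy of the genome with BED intervals (0-based, half-open) replaced by N."""
    masked = {name: list(seq) for name, seq in genome.items()}
    for chrom, start, end in bed:
        seq = masked[chrom]
        end = min(end, len(seq))              # clip to the contig length
        seq[start:end] = "N" * (end - start)  # overwrite the bases with N
    return {name: "".join(seq) for name, seq in masked.items()}


# Toy reference and mask (hypothetical)
genome = {"chrA": "ACGTACGTAC"}
masked = mask_regions(genome, [("chrA", 2, 6)])
print(masked["chrA"])
```

Hard-masking keeps all coordinates unchanged, so existing annotations and benchmark files on the same reference build remain valid outside the masked intervals.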

T2T identified and fixed additional false duplications
● 12 regions affecting ~1.2 Mbp and 74 genes (including 22 protein coding genes)
● Most medically relevant genes included in 11 pairs of genes in 5 large duplicated regions on chr21
https://doi.org/10.1101/2021.07.12.452063

Genes found to be falsely duplicated in CMRG and T2T work

T2T also identified collapsed duplications in GRCh38
● 203 regions affecting ~8 Mbp and 308 genes (including 48 protein coding genes)
● Includes several medically-relevant genes:
○ KCNJ18/KCNJ12
○ KMT2C
○ MAP2K3
https://doi.org/10.1101/2021.07.12.452063

Which medical genes are still not >90% included?
● 110 on GRCh37 and 100 on GRCh38 + all genes on chrX/chrY
Progressively categorizing all 100 on GRCh38:
● 20 affected by gaps in the reference
● 38 had evidence of duplications in HG002 relative to GRCh38
○ Collapsed duplications in GRCh38 (e.g., KCNJ18)
○ Population copy number variability (e.g., LPA, KIR)
● 2 resolved on GRCh38 but not GRCh37
● 18 were >90% included by the dip.bed but had multiple contigs or a break in the assembly-assembly alignment
● 7 have a large deletion of part or all of the gene on one haplotype
● 4 have breaks or false duplications in the hifiasm assembly (e.g., SMN2)
● 2 are in the structurally variable immunoglobulin locus
● 6 resolved but excluded due to being previously assembled in the MHC
● 1 (TNNT3) has a structural error in GRCh38

Plans for future assembly-based benchmarks
● Long-read assembly-based variants are reaching/surpassing the accuracy of our benchmarks (with some exceptions)
● Use T2T-HPRC’s assembly of HG002 chrX (and chrY?) to develop small variant and structural variant benchmark for genic and non-genic regions
● Use diploid assemblies of children in trios

Exploring if AI can be used for Genomic Reference Material Development
● Exploring deep learning to assign uncertainty to genomic reference materials
● Exploring transparency for genomics AI (e.g., "model cards")
● Exploring explainability for AI-based reference materials

https://mdic.org/project/cancer-genomic-somatic-reference-samples/

21st Century Cell Lines: Fully Consented and Characterized Cancer Tumor/Normal Cell Lines as Reference Materials
● Developing matched tumor/normal cell line pairs and donor normal tissue analyzed at early passages
○ Initial collaboration with Andrew Liss at MGH for pancreatic ductal adenocarcinoma (PDAC) cell lines
● Broadly consented for public release of genomic data and commercial use and redistribution
● Path to Cancer Genome in a Bottle
Seeking collaborations for additional broadly-consented tumor/normal cell lines

Take-home messages
● Ongoing improvement of benchmarks has been needed to drive technology and bioinformatics innovations
● Assembly methods have advanced rapidly and are enabling characterization of increasingly challenging genome regions
● More work is needed to develop better benchmarks and benchmarking tools, particularly for tumor genomes

Acknowledgment of many GIAB contributors
Government
Clinical Laboratories
Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders

Interested in getting involved?
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google groups
GIAB slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
• http://www.nature.com/articles/sdata201625
• ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
• github.com/genome-in-a-bottle
Global Alliance Benchmarking Team
• https://github.com/ga4gh/benchmarking-tools
• Web-based implementation at precision.fda.gov
• Best Practices at https://rdcu.be/bqpDT
GIAB Analysis Team Calls
• Sign up for the google group to attend biweekly calls
Justin Zook: jzook@nist.gov
We are hiring! Machine learning, diploid assembly, cancer genomes, data science, other ‘omics, …
