Using accurate long reads to improve Genome in a Bottle Benchmarks 220923

Using accurate long reads to improve Genome in a Bottle Benchmarks
Justin Zook, on behalf of the Genome in a Bottle Consortium
National Institute of Standards and Technology (NIST)
Human Genomics Team
Sep 23, 2022

Motivation for Genome in a Bottle: Sequencing and analysis methods can give
different answers, particularly in challenging, repetitive regions
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432

GIAB has characterized variants in 7 human genomes
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be processed and analyzed in the same way as any other extracted DNA sample a lab would
handle. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
Pilot Genome: NA12878 (HG001*)
AJ Trio: HG002*, HG003*, HG004*
Chinese Trio: HG005*, HG006, HG007
*NIST RMs developed from large batches of DNA

GIAB “Open Science” Virtuous Cycle
● Users analyze GIAB Samples
● Benchmark vs. GIAB data
● Critical feedback to GIAB
● Integrate new methods
● New benchmark data
● Method development, optimization, and demonstration
● Part of assay validation
● GIAB/NIST expands to more difficult regions

Design of our human genome reference values
Benchmark Variant Calls
Benchmark Regions – regions in which the benchmark contains (almost) all the variants
Variants from any method being evaluated
● Variants outside benchmark regions are not assessed
● Matching variants assumed to be true positives
● Majority of variants unique to method should be false positives (FPs)
● Majority of variants unique to benchmark should be false negatives (FNs)
Reliable IDentification of Errors (RIDE)

https://doi.org/10.1038/s41436-021-01187-w
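
The comparison logic on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the GIAB benchmarking pipeline: the function names (`in_regions`, `compare_to_benchmark`), the variant tuples, and the coordinates are all hypothetical, and variants are matched by exact (chrom, pos, ref, alt) identity. Real benchmarking tools also reconcile different representations of the same variant and compare genotypes, which this sketch does not attempt.

```python
from bisect import bisect_right


def in_regions(regions, chrom, pos):
    """True if pos lies in a half-open [start, end) benchmark interval.

    Assumes regions[chrom] is a sorted list of non-overlapping (start, end) tuples.
    """
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= pos.
    idx = bisect_right(intervals, (pos, float("inf"))) - 1
    return idx >= 0 and intervals[idx][0] <= pos < intervals[idx][1]


def compare_to_benchmark(benchmark_calls, query_calls, regions):
    """Count TP/FP/FN inside benchmark regions; everything else is not assessed."""
    bench = {v for v in benchmark_calls if in_regions(regions, v[0], v[1])}
    query = {v for v in query_calls if in_regions(regions, v[0], v[1])}
    return {
        "TP": len(bench & query),   # matching variants
        "FP": len(query - bench),   # unique to the method being evaluated
        "FN": len(bench - query),   # unique to the benchmark
        "not_assessed": len(set(query_calls)) - len(query),
    }


# Hypothetical toy data: (chrom, pos, ref, alt)
regions = {"chr1": [(100, 200), (300, 400)]}
benchmark = [("chr1", 150, "A", "G"), ("chr1", 350, "C", "T"), ("chr1", 250, "G", "A")]
query = [("chr1", 150, "A", "G"), ("chr1", 320, "T", "C"), ("chr1", 500, "A", "T")]
result = compare_to_benchmark(benchmark, query, regions)
```

In the toy data, the query variant at position 500 falls outside the benchmark regions, so it is counted as not assessed rather than as a false positive.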

Accurate long reads have been essential for
improving GIAB benchmarks
Small variants with mapping-based methods
MHC with local de novo assembly
Challenging medically relevant genes with trio
de novo assembly (small var & isolated SVs)
chrX/Y and whole genome with trio de novo
assembly (small var + TRs + SVs)

v4.2.1 Small Variant Benchmark improved difficult-to-map regions with Long and Linked Reads
Reference Build   Benchmark Set   Reference Coverage (%)   SNVs        Indels    Base pairs in Seg Dups and low mappability
GRCh37            v3.3.2          87.8                     3,048,869   464,463   57,277,670
GRCh37            v4.2.1          94.1                     3,353,881   522,388   133,848,288
GRCh38            v3.3.2          85.4                     3,030,495   475,332   65,714,199
GRCh38            v4.2.1          92.2                     3,367,208   525,545   145,585,710
Wagner et al, Cell Genomics, 2022 https://doi.org/10.1016/j.xgen.2022.1
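
To make the size of the improvement concrete, the table values can be differenced directly. The snippet below is a small worked-arithmetic sketch using only the numbers from the table above; the `gains` helper and its key names are illustrative, not part of any GIAB tool.

```python
# (reference coverage %, SNVs, indels, bp in seg dups / low mappability)
TABLE = {
    ("GRCh37", "v3.3.2"): (87.8, 3_048_869, 464_463, 57_277_670),
    ("GRCh37", "v4.2.1"): (94.1, 3_353_881, 522_388, 133_848_288),
    ("GRCh38", "v3.3.2"): (85.4, 3_030_495, 475_332, 65_714_199),
    ("GRCh38", "v4.2.1"): (92.2, 3_367_208, 525_545, 145_585_710),
}


def gains(build):
    """Change from v3.3.2 to v4.2.1 for one reference build."""
    old = TABLE[(build, "v3.3.2")]
    new = TABLE[(build, "v4.2.1")]
    return {
        "coverage_points": round(new[0] - old[0], 1),    # percentage points
        "snvs_added": new[1] - old[1],
        "indels_added": new[2] - old[2],
        "difficult_bp_fold": round(new[3] / old[3], 2),  # fold increase
    }


for build in ("GRCh37", "GRCh38"):
    print(build, gains(build))
```

On both builds, the benchmark bases in segmental duplications and low-mappability regions more than double between v3.3.2 and v4.2.1.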

Collaborating with FDA to use GIAB
benchmark to inspire new methods
https://precision.fda.gov/challenges/10

The best-performing submissions were from new sequencing
technologies and bioinformatics methods
Olson et al, Cell Genomics, 2022 https://doi.org/10.1016/j.xgen.2022.10

Stratification helps understand strengths of each technology/method
(Figure panels: INDELs, SNVs)
Olson et al, Cell Genomics, 2022 https://doi.org/10.1016/j.xgen.2022.10

Shortcomings in Medical Genes for v4.2.1 benchmark
● Mandelker et al. in 2016 created a list of medical genes with at least one exon that is difficult to map with short reads
● v4.2.1 improved coverage of these genes but many are still not fully covered

Generating a Benchmark for 273 Challenging Genes from Trio-based Long-read Diploid Assembly
Manually curated >1000 variants
Wagner et al, Nature Biotech, 2022 https://rdcu.be/cGwVA

Highlighting Genes in the New Benchmark – SMN1

False duplications on GRCh38 can be fixed by masking
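
As a rough illustration of what masking means here, the sketch below replaces the bases of a falsely duplicated interval with `N`, presumably so that aligners can no longer place reads on the extra copy. It is an assumption-laden toy: the `mask_intervals` function, the sequence, and the coordinates are hypothetical, intervals are taken to be BED-style (0-based, half-open), and a real workflow would apply a published mask to the full reference FASTA with established tools.

```python
def mask_intervals(sequence, intervals):
    """Return a copy of sequence with each half-open [start, end) interval set to N."""
    masked = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are ignored.
        start = max(start, 0)
        end = min(end, len(masked))
        if end <= start:
            continue
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Hypothetical toy contig with one interval treated as a false duplication.
contig = "ACGTACGTAC"
masked_contig = mask_intervals(contig, [(2, 5)])
```

Masking keeps the coordinate system unchanged, so existing annotations on the rest of the reference still apply.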

T2T also identified collapsed duplications in GRCh38
● 203 regions affecting ~8 Mbp and 308 genes (including 48 protein coding genes)
● Includes several medically-relevant genes:
○ KCNJ18/KCNJ12
○ KMT2C
○ MAP2K3
https://doi.org/10.1126/science.abl3533

Modifying GRCh38 to fix false duplications and collapsed duplications

Work In Progress - Data Registry
Queryable database with pointers to publicly available GIAB data along with summary statistics
Data Types
Sample
FASTQs
BAMs
VCFs
Capturing methods and linking datasets for data provenance

DEvelopment Framework for Assembly Based Benchmarks (DEFRABB)

Assembly-Based Benchmark Process
Credits: Nate Olson, Jennifer McDaniel, and GIAB team

Building new GIAB resources with long reads
● RNA-seq
○ Recently generated Illumina and PacBio RNA-seq from several GIAB lymphoblastoid cell lines and iPSCs
■ ONT RNA-seq planned as well
○ Planned analyses include isoforms, variants, gene annotation
○ Collaborations welcome!
● Tumor/normal
○ Working with MGH and others to develop the first broadly-consented tumor/normal cell line pairs
○ Starting characterization of first pancreatic cancer cell line
● Engineering variants into GIAB cell lines
○ Collaboration with Medical Device Innovation Consortium Somatic Reference Samples project

Take-home messages
● Ongoing improvement of benchmarks has been needed to drive technology and bioinformatics innovations, particularly for long reads
● Assembly methods using accurate long reads have advanced rapidly and are enabling characterization of increasingly challenging genome regions
● More work is needed to develop better benchmarks and benchmarking tools, particularly for complex SVs and tumor genomes

Acknowledgment of many GIAB contributors
Government
Clinical Laboratories Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders

Interested in getting involved?
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google groups
GIAB slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data: github.com/genome-in-a-bottle
We are hiring! Cancer genomes, Data Manager, Machine learning, diploid assembly, other ‘omics, …
