Using accurate long reads to
improve Genome in a Bottle
Benchmarks
Justin Zook, on behalf of the Genome in a Bottle Consortium
National Institute of Standards and Technology (NIST)
Human Genomics Team
Sep 23, 2022
Motivation for Genome in a Bottle: Sequencing and analysis methods can give
different answers, particularly in challenging, repetitive regions
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432
GIAB has characterized variants in 7
human genomes
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
Pilot Genome NA12878: HG001*
AJ Trio: HG002*, HG003*, HG004*
Chinese Trio: HG005*, HG006, HG007
*NIST RMs developed from large batches of DNA
GIAB “Open Science” Virtuous Cycle
● Users analyze GIAB samples
● Benchmark vs. GIAB data
● Critical feedback to GIAB
● Integrate new methods
● New benchmark data
Method development, optimization, and demonstration
Part of assay validation
GIAB/NIST expands to more difficult regions
Design of our human genome reference values
● Benchmark Variant Calls
● Benchmark Regions: regions in which the benchmark contains (almost) all the variants
● Variants from any method being evaluated
● Variants outside benchmark regions are not assessed
● Matching variants assumed to be true positives
● Majority of variants unique to method should be false positives (FPs)
● Majority of variants unique to benchmark should be false negatives (FNs)
Reliable IDentification of Errors (RIDE)
https://doi.org/10.1038/s41436-021-01187-w
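The benchmark design above can be sketched in a few lines of Python. This is a minimal illustration, not the comparison method GIAB uses: it assumes variants are exact-match tuples of (chromosome, 0-based position, ref, alt), that benchmark regions are sorted, non-overlapping half-open intervals, and it ignores genotypes and differing representations of the same variant. The function names and the toy data are hypothetical.

```python
from bisect import bisect_right


def in_regions(regions, chrom, pos):
    """Return True if 0-based pos falls in a half-open [start, end) region.

    Assumes regions[chrom] is sorted by start and non-overlapping.
    """
    intervals = regions.get(chrom, [])
    starts = [start for start, _ in intervals]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < intervals[i][1]


def classify(benchmark_calls, query_calls, benchmark_regions):
    """Split calls into TP, FP, and FN, assessing only benchmark regions."""

    def assessed(calls):
        # Variants outside the benchmark regions are not assessed at all.
        return {v for v in calls if in_regions(benchmark_regions, v[0], v[1])}

    truth = assessed(benchmark_calls)
    query = assessed(query_calls)
    return {
        "TP": truth & query,  # matching variants
        "FP": query - truth,  # unique to the method being evaluated
        "FN": truth - query,  # unique to the benchmark
    }


# Toy data (hypothetical): two benchmark regions on chr1.
regions = {"chr1": [(100, 200), (300, 400)]}
benchmark = {
    ("chr1", 150, "A", "G"),
    ("chr1", 350, "C", "T"),
    ("chr1", 250, "G", "A"),  # outside the benchmark regions
}
query = {
    ("chr1", 150, "A", "G"),
    ("chr1", 360, "T", "C"),
    ("chr1", 500, "A", "T"),  # outside the benchmark regions
}

result = classify(benchmark, query, regions)
print({label: len(variants) for label, variants in result.items()})
```

Restricting both call sets to the benchmark regions before comparing is what makes the "unique to method" and "unique to benchmark" sets interpretable as likely FPs and FNs.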
Accurate long reads have been essential for
improving GIAB benchmarks
Small variants with mapping-based methods
MHC with local de novo assembly
Challenging medically relevant genes with trio
de novo assembly (small var & isolated SVs)
chrX/Y and whole genome with trio de novo
assembly (small var + TRs + SVs)
v4.2.1 Small Variant Benchmark improved difficult to map
regions with Long and Linked Reads
Reference Build  Benchmark Set  Reference Coverage (%)  SNVs       Indels   Base pairs in Seg Dups and low mappability
GRCh37           v3.3.2         87.8                    3,048,869  464,463  57,277,670
GRCh37           v4.2.1         94.1                    3,353,881  522,388  133,848,288
GRCh38           v3.3.2         85.4                    3,030,495  475,332  65,714,199
GRCh38           v4.2.1         92.2                    3,367,208  525,545  145,585,710
Wagner et al, Cell Genomics, 2022 https://doi.org/10.1016/j.xgen.2022.1
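To make the size of the improvement concrete, the short sketch below computes the change from v3.3.2 to v4.2.1 for each reference build, using only the numbers in the table above. The dictionary keys and the `gains` helper are illustrative names, not part of any GIAB tool.

```python
# Values copied from the table above.
table = {
    ("GRCh37", "v3.3.2"): {"coverage_pct": 87.8, "snvs": 3_048_869,
                           "indels": 464_463, "difficult_bp": 57_277_670},
    ("GRCh37", "v4.2.1"): {"coverage_pct": 94.1, "snvs": 3_353_881,
                           "indels": 522_388, "difficult_bp": 133_848_288},
    ("GRCh38", "v3.3.2"): {"coverage_pct": 85.4, "snvs": 3_030_495,
                           "indels": 475_332, "difficult_bp": 65_714_199},
    ("GRCh38", "v4.2.1"): {"coverage_pct": 92.2, "snvs": 3_367_208,
                           "indels": 525_545, "difficult_bp": 145_585_710},
}


def gains(build):
    """Return the change from v3.3.2 to v4.2.1 for one reference build."""
    old = table[(build, "v3.3.2")]
    new = table[(build, "v4.2.1")]
    return {
        "coverage_points": round(new["coverage_pct"] - old["coverage_pct"], 1),
        "added_snvs": new["snvs"] - old["snvs"],
        "added_indels": new["indels"] - old["indels"],
        # Fold change in benchmark bases within seg dups and low mappability.
        "difficult_bp_fold": new["difficult_bp"] / old["difficult_bp"],
    }


for build in ("GRCh37", "GRCh38"):
    print(build, gains(build))
```

On both builds the number of benchmark bases in segmental duplications and low-mappability regions more than doubles, which is the main point of the slide.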
Collaborating with FDA to use GIAB
benchmark to inspire new methods
https://precision.fda.gov/challenges/10
The best-performing submissions were from new sequencing
technologies and bioinformatics methods
Olson et al, Cell Genomics, 2022 https://doi.org/10.1016/j.xgen.2022.10
Stratification helps understand strengths of each technology/method
(Figure panels: INDELs, SNVs)
Olson et al, Cell Genomics, 2022 https://doi.org/10.1016/j.xgen.2022.10
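As a rough illustration of stratified benchmarking, the sketch below computes precision, recall, and F1 separately for each stratum from TP/FP/FN counts. The stratum names and all counts are made up for the example and are not results from the challenge; returning 0.0 for an undefined ratio is a convention chosen here for simplicity.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (a chosen convention)."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from TP, FP, and FN counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts per stratum: (TP, FP, FN).
counts_by_stratum = {
    "all benchmark regions": (990, 10, 10),
    "segmental duplications": (80, 20, 20),
}

stratified = {
    stratum: metrics(*counts) for stratum, counts in counts_by_stratum.items()
}

for stratum, values in stratified.items():
    print(stratum, {name: round(value, 3) for name, value in values.items()})
```

Reporting the same metrics per stratum, rather than one genome-wide number, is what exposes where a technology or method is strong or weak.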
Shortcomings in Medical Genes for v4.2.1 benchmark
● Mandelker et al. in 2016
created a list of medical
genes with at least one
exon that is difficult to map
with short reads
● v4.2.1 improved coverage
of these genes but many
are still not fully covered
Generating a Benchmark for 273 Challenging Genes from
Trio-based Long read diploid assembly
Manually
curated
>1000
variants
Wagner et al, Nature Biotech, 2022 https://rdcu.be/cGwVA
Highlighting Genes in the New Benchmark – SMN1
Wagner et al, Nature Biotech, 2022 https://rdcu.be/cGwVA
False duplications on GRCh38 can be fixed by masking
Wagner et al, Nature Biotech, 2022 https://rdcu.be/cGwVA
T2T also identified collapsed
duplications in GRCh38
● 203 regions affecting ~8 Mbp and 308 genes
(including 48 protein coding genes)
● Includes several medically-relevant genes:
○ KCNJ18/KCNJ12
○ KMT2C
○ MAP2K3
https://doi.org/10.1126/science.abl3533
Modifying GRCh38 to fix false duplications and
collapsed duplications
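Masking a false duplication amounts to replacing the extra copy in the reference with N bases so that reads no longer align to it. The sketch below shows the idea on an in-memory string. It assumes 0-based, half-open (start, end) intervals as in BED files; the `mask_regions` function and the toy sequence are hypothetical, and a real workflow would operate on the reference FASTA with the published mask coordinates.

```python
def mask_regions(sequence, regions):
    """Return a copy of sequence with each [start, end) interval set to N."""
    masked = list(sequence)
    for start, end in regions:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        start = max(0, start)
        end = min(len(masked), end)
        if start < end:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy example: mask bases 2, 3, and 4 of a ten-base sequence.
toy_reference = "ACGTACGTAC"
toy_masked = mask_regions(toy_reference, [(2, 5)])
print(toy_masked)  # prints ACNNNCGTAC
```

Masking keeps every coordinate in the reference unchanged, so existing annotations still apply. Collapsed duplications cannot be repaired this way, because the missing copy has to be added rather than hidden.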
Work In Progress - Data Registry
Queryable database with
pointers to publicly
available GIAB data
along with summary
statistics
Data Types
Sample
FASTQs
BAMs
VCFs
Capturing methods and
linking datasets for data
provenance
DEvelopment Framework for Assembly Based Benchmarks (DEFRABB)
Assembly-Based Benchmark Process
Credits: Nate Olson, Jennifer McDaniel, and GIAB team
Building new GIAB resources with long reads
● RNA-seq
○ Recently generated Illumina and PacBio RNA-seq from several GIAB lymphoblastoid cell lines
and iPSCs
■ ONT RNA-seq planned as well
○ Planned analyses include isoforms, variants, gene annotation
○ Collaborations welcome!
● Tumor/normal
○ Working with MGH and others to develop the first broadly-consented tumor/normal cell line
pairs
○ Starting characterization of first pancreatic cancer cell line
● Engineering variants into GIAB cell lines
○ Collaboration with Medical Device Innovation Consortium Somatic Reference Samples project
Take-home messages
● Ongoing improvement of benchmarks has been needed to
drive technology and bioinformatics innovations, particularly for
long reads
● Assembly methods using accurate long reads have advanced
rapidly and are enabling characterization of increasingly
challenging genome regions
● More work is needed to develop better benchmarks and
benchmarking tools, particularly for complex SVs and tumor
genomes
Acknowledgment of many GIAB contributors
Government
Clinical Laboratories Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders
Interested in getting involved?
www.genomeinabottle.org - sign up for general
GIAB and Analysis Team google groups
GIAB slides:
www.slideshare.net/genomeinabottle
Public, Unembargoed
Data:
github.com/genome-in-a-bottle
We are hiring!
Cancer genomes,
Data Manager,
Machine learning,
diploid assembly,
other ‘omics, …
