Genome in a bottle for amp GeT-RM 181030

Genome in a Bottle: Benchmark for Structural
Variant Calls and New Data
Justin Zook, on behalf of the GIAB Consortium
NIST Human Genomics Team
October 30, 2018

Why Genome in a Bottle?
• A map of every individual’s genome will soon be possible, but how will we know if it is correct?
• Diagnostics and precision medicine require high levels of confidence
• Well-characterized, broadly disseminated genomes are needed to benchmark performance of sequencing
• Open, transparent data/analyses
• Enable technology development, optimization, and demonstration
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432

GIAB is evolving with technologies
2012
• No human benchmark calls available
• GIAB Consortium formed
2014
• Small variant genotypes for ~77% of pilot genome NA12878
2015
• NIST releases first human genome Reference Material
2016
• 4 new genomes
• Small variants for 90% of 5 genomes for GRCh37/38
2017+
• Characterizing difficult variants
• Develop tumor samples

GIAB has characterized 7 human genomes
• Pilot genome
– NA12878
• PGP Human Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
Latest small variant characterization (New!): https://doi.org/10.1101/281006

Open consent enables secondary reference samples
• >30 products now available
based on broadly-consented,
well-characterized GIAB PGP cell
lines
• Genomic DNA + DNA spike-ins
– Clinical variants
– Somatic variants
– Difficult variants
• Clinical matrix (FFPE)
• Circulating tumor DNA
• Stem cells (iPSCs)
• Genome editing
• …

All data and analyses are open and public
51 authors
14 institutions
12 datasets
7 genomes
Data described in ISA-tab
New data on GIAB NCBI FTP

GIAB has extensive public, unembargoed data
Short reads
• BGISEQ
• Complete Genomics
• Illumina
• Ion Torrent
• SOLiD
Linked reads
• 10x Genomics
• BGISEQ stLFR
• Illumina 6kb mate-pair
Long reads
• PacBio
• PacBio CCS
• PromethION
• Ultralong Oxford Nanopore
Optical/electronic mapping
• BioNano
• Nabsys
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/

GIAB “Open Science” Virtuous Cycle
• Users analyze GIAB samples
• Benchmark vs. GIAB data
• Critical feedback to GIAB
• Integrate new methods
• New benchmark data
• Method development, optimization, and demonstration
• Part of assay validation
• GIAB/NIST expands to more difficult variants

Best Practices for Benchmarking Small Variants
https://github.com/ga4gh/benchmarking-tools
https://doi.org/10.1101/270157 https://precision.fda.gov/
• Describe public “Truth” VCFs with confident regions
• Enable stratification of performance in difficult regions
• Tools to compare different representations of complex variants
• Standardized VCF-I output of comparison tools
• Standardized output formats for performance metrics
• Web-based interface for performance metrics
• Standardized definitions of performance metrics based on matching stringency
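The metrics listed above come down to counts of true positives (TP), false positives (FP), and false negatives (FN), tallied separately for each stratum. The following is a minimal sketch, assuming a comparison tool has already labeled every call. The record layout, the function name `stratified_metrics`, and the stratum names are hypothetical and are not taken from the GA4GH tools.

```python
from collections import Counter

def stratified_metrics(records):
    """Compute precision and recall per stratum.

    records: iterable of (stratum, label) pairs, label in {"TP", "FP", "FN"}.
    This layout is an assumption for illustration only.
    """
    counts = {}
    for stratum, label in records:
        counts.setdefault(stratum, Counter())[label] += 1

    results = {}
    for stratum, c in counts.items():
        tp, fp, fn = c["TP"], c["FP"], c["FN"]
        # A metric is undefined (None) when its denominator is zero.
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        results[stratum] = {"precision": precision, "recall": recall}
    return results
```

Reporting each stratum separately keeps a large, easy stratum from hiding poor performance in a small, difficult one.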

What are we accessing and what is still
challenging?
Type of variant | Genome context | Fraction of variants called* | Number of variants missing* | How to improve?
Simple SNPs | Not repetitive | ~97% | >100k | Machine learning
Simple indels | Not repetitive | ~93% | >10k | Machine learning
All variants | Low mappability | <30% | >170k | Use linked reads and long reads
All variants | Regions not in GRCh37/38 | 0 | >>100k??? | De novo assembly; long reads
Small indels | Tandem repeats and homopolymers | <50% | >200k | STR/homopolymer callers; long reads; better handle complex and compound variants
Indels 15-50bp | All | <25% | >30k | Assembly-based callers; integrate larger variants differently; long reads
Indels >50bp | All | <1% | >20k |
* Approximate values based on fraction of variants in GATKHC or FermiKit that are inside v3.3.2 High-confidence regions
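The footnote's "fraction of variants inside high-confidence regions" can be estimated from a BED file and a list of variant positions. This is a minimal sketch, assuming standard BED coordinates (0-based, half-open), 1-based VCF positions, and non-overlapping BED intervals. The function names are hypothetical, and a real analysis would more likely use an established interval tool.

```python
import bisect

def load_regions(bed_lines):
    """Parse BED lines into sorted (start, end) intervals per chromosome."""
    regions = {}
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions

def in_regions(regions, chrom, pos):
    """True if the 1-based position lies in a region (intervals must not overlap)."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= the position.
    i = bisect.bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]

def fraction_inside(regions, variants):
    """variants: list of (chrom, pos) pairs. Returns the fraction inside regions."""
    if not variants:
        raise ValueError("no variants given")
    inside = sum(in_regions(regions, chrom, pos) for chrom, pos in variants)
    return inside / len(variants)
```

Only the variant start position is tested here, so an indel or SV that begins inside a region but extends beyond it would still count as inside.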

How can we extend our approach to structural
variants?
Similarities to small variants
• Collect callsets from multiple technologies
• Compare callsets to find calls supported by multiple technologies
Differences from small variants
• Callsets have limited sensitivity
• Variants are often imprecisely characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly standardized, especially when complex
• Comparison tools in infancy
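Because breakpoints and sizes are imprecise, SV callsets are usually compared with tolerances rather than by exact match. The sketch below is one way this could look. It assumes two calls match when they share a chromosome and type, start within a distance threshold, and have similar sizes. The `SV` record, the function names, and both thresholds are hypothetical and do not describe the GIAB integration or any specific tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SV:
    chrom: str
    pos: int      # start position
    svtype: str   # e.g. "DEL" or "INS"
    size: int     # length in bp

def calls_match(a, b, max_dist=1000, min_size_ratio=0.7):
    """Tolerant match; the thresholds are illustrative assumptions."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    small, large = sorted((a.size, b.size))
    return large > 0 and small / large >= min_size_ratio

def supported_by_multiple(callsets, min_techs=2, **match_kwargs):
    """callsets: dict mapping technology name -> list of SV.

    Returns (call, supporting_technologies) pairs for calls seen in at least
    min_techs technologies. Each technology's own version of a shared call is
    returned, so one event can appear more than once.
    """
    supported = []
    for tech, calls in callsets.items():
        for call in calls:
            techs = {tech}
            for other_tech, other_calls in callsets.items():
                if other_tech != tech and any(
                    calls_match(call, other, **match_kwargs) for other in other_calls
                ):
                    techs.add(other_tech)
            if len(techs) >= min_techs:
                supported.append((call, sorted(techs)))
    return supported
```

A fuller comparison would also merge the duplicate representations and, for sequence-resolved calls, compare the inserted or deleted sequence itself.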

Integration of diverse data types and analyses
• Data publicly available
– Deep short reads
– Linked reads
– Long reads
– Optical/nanopore mapping
• Analyses
– Small variant calling
– SV calling
– Local and global assembly
• Discover & refine sequence-resolved calls from multiple datasets & analyses
• Compare variant and genotype calls from different methods
• Evaluate/genotype calls with other data
• Identify features associated with reliability of calls from each method
• Form benchmark calls using heuristics & machine learning
• Compare benchmarks to high-quality callsets and examine differences

V0.6 SV Benchmark Set
• Tier 1 regions contain 2.68 Gbp with 11,869 isolated SVs >49bp
• Tier 1 calls meet the criteria:
– Discovered by 2+ techs or 5+ callers
– Confirmed and genotyped by long reads
– Not disproven by any technology
• Clusters of calls within 1000bp are excluded
• Regions around calls 20-49bp are excluded
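The Tier 1 criteria above can be read as a filter followed by an isolation step. This is a rough sketch under several assumptions: the dictionary keys are hypothetical, distance is measured between start positions only, "within 1000bp" is taken as a gap of 1000 bp or less, and clustering is checked after the criteria filter. The README linked below is the authoritative description of how v0.6 was built.

```python
def meets_tier1(call):
    """Apply the three listed criteria to one call (hypothetical field names)."""
    discovered = call["n_techs"] >= 2 or call["n_callers"] >= 5
    return discovered and call["long_read_confirmed"] and not call["disproven"]

def isolated_calls(calls, min_gap=1000):
    """Keep calls with no neighbor on the same chromosome within min_gap bp."""
    by_chrom = {}
    for call in calls:
        by_chrom.setdefault(call["chrom"], []).append(call)

    kept = []
    for chrom_calls in by_chrom.values():
        chrom_calls.sort(key=lambda c: c["pos"])
        for i, call in enumerate(chrom_calls):
            near_prev = i > 0 and call["pos"] - chrom_calls[i - 1]["pos"] <= min_gap
            near_next = (
                i + 1 < len(chrom_calls)
                and chrom_calls[i + 1]["pos"] - call["pos"] <= min_gap
            )
            if not (near_prev or near_next):
                kept.append(call)
    return kept

def tier1_isolated(calls, min_gap=1000):
    """Filter by the criteria, then drop clustered calls (order is an assumption)."""
    return isolated_calls([c for c in calls if meets_tier1(c)], min_gap=min_gap)
```

Excluding clusters trades coverage for reliability, since nearby calls are often alternative representations of one complex event.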
Benchmark set and README at tinyurl.com/GIABSV06
[Figure: SV calls in two size panels, 50 to 1000 bp (Alu labeled) and 1kbp to 10kbp (LINE labeled). Blue - clustered calls; red - isolated calls.]

Can you trust the SV benchmark results?
• Important to use sophisticated benchmarking tools
– github.com/spiralgenetics/truvari
– github.com/nhansen/SVanalyzer
• Volunteers compared to v0.6 Tier 1
– Stratified by variant type and overlap with tandem repeat
– Manually curated 10 random putative FPs and FNs from each category
• Short reads vs. v0.6
– >90% of putative FPs and FNs are errors from short reads
• Long reads vs. v0.6
– >90% of putative FNs are errors from long read methods
– ~50% of putative FP insertions appear to be real missed variants in v0.6
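Drawing a fixed number of random putative FPs and FNs from each category is a stratified random sample. A minimal sketch, assuming each discordant call is a dictionary with hypothetical keys `kind` ("FP" or "FN"), `svtype`, and `in_tandem_repeat`:

```python
import random

def sample_for_curation(discordant, per_category=10, seed=0):
    """Pick up to per_category random calls from each (kind, type, repeat) category.

    A fixed seed makes the selection reproducible, so different curators
    can be given the same sites.
    """
    rng = random.Random(seed)
    by_category = {}
    for call in discordant:
        key = (call["kind"], call["svtype"], call["in_tandem_repeat"])
        by_category.setdefault(key, []).append(call)

    return {
        key: rng.sample(calls, min(per_category, len(calls)))
        for key, calls in by_category.items()
    }
```

Sampling the same number from every category gives rare categories, such as large insertions in tandem repeats, as much curation attention as common ones.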
Draft SV calls: tinyurl.com/GIABSV06

SV Example: Heterozygous 2.8kb Deletion
[Figure: data tracks for PacBio CCS, Oxford Nanopore, and Illumina]

SV Example: Heterozygous 100bp Insertion
[Figure: data tracks for PacBio CCS, Oxford Nanopore, and Illumina]

SV Example: Homozygous 1300bp Insertion in Tandem Repeat
[Figure: data tracks for PacBio CCS, Oxford Nanopore, and Illumina]

SV Example: Complex and Compound SV Region
[Figure: data tracks for PacBio CCS, Oxford Nanopore, and Illumina]

SV Example: Complex and Compound SV Region
Credit: Joyce Lee, BioNano Genomics

Crowd-sourced manual curation agrees with SV benchmark
www.svcurator.com
● Candidates examined by 11 curators on average
● 627/635 consensus manual curations agreed with v0.6 genotype in benchmark regions
○ Most “discordant” sites related to inclusion of 20-49bp indels in curation
Credit: Lesley Chapman

Oxford Nanopore “Ultralong reads”
Noah Spies
David Catoe
Marc Salit
Matt Loose
Nick Loman
Josh Quick
Nate Olson
Miten Jain
Karen Miga
Hugh Olsen
Benedict Paten
• Second release:
– 16x total mapped
– 8x reads > 50kb
– 4x reads > 100kb
• Estimated 30x total in 2018
• Starting work with UCSC on PromethION sequencing of all GIAB genomes
• See Miten Jain talk from GRC/GIAB
Improving small variants with long reads
Raw ONT/PacBio long reads
• New methods use phasing to call accurate SNVs despite high error rate
• SNV precision and recall can be >99% vs. current benchmark
• Indels are still challenging
PacBio Long CCS
• New 10kb and 15kb reads with low error rate
• Enables SNV and indel calling with methods developed for short reads
• SNV precision & recall >99.9%
– Most errors in homopolymers
– Fixes some short read errors in LINEs
• Indel precision & recall >97%
– Most errors in homopolymers
Jana Ebler/Tobias Marschall
Trevor Pesout/Benedict Paten
Vikas Bansal
Ruibang Luo/Fritz Sedlazeck/Mike Schatz
Aaron Wenger/Billy Rowell/Luke Hickey
Jason Chin
Andrew Carroll, Pi-Chuan Chang, Mark DePristo, Alexey Kolesnikov
Data public without embargo: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/

Challenges in Benchmarking Variant Calling
• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)
• Best to benchmark only within high-confidence bed file, but…
• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
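Precision, recall, and concordance are proportions, so a binomial interval is one reasonable way to attach uncertainty to them. The sketch below uses the Wilson score interval. That choice is an assumption made for illustration, since the slides do not name a method, and the function name is hypothetical.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width

# Example with the curation result reported earlier in the deck:
# 627 of 635 consensus manual curations agreed with the v0.6 genotype.
low, high = wilson_interval(627, 635)
```

The interval widens quickly as the count shrinks, which matters most for small strata such as a handful of clinically relevant difficult sites.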

The road ahead...
2018
• Further automate integration
• Large variants
• Difficult small variants
• Phasing
2019
• Difficult small & large variants
• Somatic sample development
• Germline samples from new ancestries
2020+
• Diploid assembly
• Somatic structural variation
• Segmental duplications
• Centromere/telomere
• ...

Take-home Messages
• Genome in a Bottle is:
– “Open science”
– Authoritative characterization of human genomes
• Currently enables benchmarking of “easier” variants
– Clinical validation
– Technology development, optimization, and demonstration
• Now working on difficult variants and regions
– Draft SV benchmark set (>=50bp) + confident regions
– Many challenges remain and collaborations welcome!
Draft SV calls: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/

Acknowledgements
• NIST
– Lesley Chapman
– Nate Olson
– Justin Wagner
– Jenny McDaniel
– Lindsay Harris
• JIMB
– Marc Salit
– Noah Spies
– David Catoe
• FDA
• GA4GH Benchmarking Team
• Genome in a Bottle Consortium

Acknowledgment of GIAB
Government
Clinical Laboratories Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples

For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
Latest small variant benchmark: https://doi.org/10.1101/281006
Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://doi.org/10.1101/270157
Public workshops
– Next workshop planned for Spring 2019 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc opportunities available!
