GIAB Integrating multiple technologies to form benchmark SVs 180517

Genome in a Bottle:
Integrating Multiple Technologies to Form Benchmark
Structural Variants
Justin Zook, on behalf of the GIAB Consortium
NIST Genome-Scale Measurements Group
Joint Initiative for Metrology in Biology (JIMB)
May 17, 2018

Take-home Messages
• Genome in a Bottle is:
– “Open science”
– Authoritative characterization of human genomes
• Currently enables benchmarking of “easier” variants
– Clinical validation
– Technology development, optimization, and demonstration
• Now working on difficult variants and regions
– Draft variant calls >=20bp available and feedback requested
– Working on finalizing a tiered benchmark set >=50bp + confident regions
– New long and ultralong read data coming
– Many challenges remain and collaborations welcome!

Why Genome in a Bottle?
• A map of every individual’s genome will soon be possible, but how will we know if it is correct?
• Diagnostics and precision medicine require high levels of confidence
• Well-characterized, broadly disseminated genomes are needed to benchmark performance of sequencing
• Open, transparent data/analyses
• Enable technology development, optimization, and demonstration
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432

GIAB is evolving with technologies
2012
• No human benchmark calls available
• GIAB Consortium formed
2014
• Small variant genotypes for ~77% of pilot genome NA12878
2015
• NIST releases first human genome Reference Material
2016
• 4 new genomes
• Small variants for 90% of 5 genomes for GRCh37/38
2017+
• Characterizing difficult variants
• Develop tumor samples

GIAB has characterized 5 human genome RMs
• Pilot genome
– NA12878
• PGP Human Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
https://doi.org/10.1101/281006

Important characteristics of benchmark calls
What does “gold standard” mean?
1. Accurate
– high-confidence variants,
genotypes, haplotypes, and regions
– When results from any method are compared to the benchmark, the majority of differences (FPs/FNs) are errors in the method
2. Representative examples
– Different types of variants in
different genome contexts
3. Comprehensive characterization
– Many examples of different variant
types/genome contexts
– Eventually, diploid assembly
benchmarking

GIAB “Open Science” Virtuous Cycle
• Users analyze GIAB samples
• Benchmark vs. GIAB data
• Critical feedback to GIAB
• Integrate new methods
• New benchmark data
• Method development, optimization, and demonstration
• Part of assay validation
• GIAB/NIST expands to more difficult regions

Open consent enables secondary reference samples
• >30 products now available based on broadly-consented, well-characterized GIAB PGP cell lines
• Genomic DNA + DNA spike-ins
– Clinical variants
– Somatic variants
– Difficult variants
• Clinical matrix (FFPE)
• Circulating tumor DNA
• Stem cells (iPSCs)
• Genome editing

All data and analyses are open and public
51 authors
14 institutions
12 datasets
7 genomes
Data described in ISA-tab
New data on GIAB NCBI FTP

Best Practices for Benchmarking Small Variants
https://github.com/ga4gh/benchmarking-tools
https://doi.org/10.1101/270157 https://precision.fda.gov/
• Describe public “Truth” VCFs with confident regions
• Enable stratification of performance in difficult regions
• Tools to compare different representations of complex variants
• Standardized VCF-I output of comparison tools
• Standardized definitions of performance metrics based on matching stringency
• Web-based interface for performance metrics
• Standardized output formats for performance metrics
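The performance metrics mentioned above are usually derived from counts of true positives, false positives, and false negatives. A minimal sketch of the standard precision, recall, and F1 formulas follows; the function name and the example counts are hypothetical, and how calls are counted as matching (the matching stringency) is assumed to be decided upstream by the comparison tool.

```python
def performance_metrics(tp, fp, fn):
    """Return precision, recall, and F1 from benchmark comparison counts.

    tp, fp, fn are counts produced by some comparison tool (assumed input);
    this function does not decide what counts as a match.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
example = performance_metrics(tp=90, fp=10, fn=30)
```

Reporting these per stratum (for example, per difficult-region BED file) only requires calling the function once per set of counts.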

What are we accessing and what is still challenging?
Type of variant | Genome context | Fraction of variants called* | Number of variants missing* | How to improve?
Simple SNPs | Not repetitive | ~97% | >100k | Machine learning
Simple indels | Not repetitive | ~93% | >10k | Machine learning
All variants | Low mappability | <30% | >170k | Use linked reads and long reads
All variants | Regions not in GRCh37/38 | 0 | >>100k??? | De novo assembly; long reads
Small indels | Tandem repeats and homopolymers | <50% | >200k | STR/homopolymer callers; long reads; better handle complex and compound variants
Indels 15-50bp | All | <25% | >30k | Assembly-based callers; integrate larger variants differently; long reads
Indels >50bp | All | <1% | >20k |
* Approximate values based on fraction of variants in GATKHC or FermiKit that are inside v3.3.2 High-confidence regions
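The footnote's estimate is a fraction of calls that fall inside the high-confidence regions. The sketch below shows one way such a fraction could be computed, assuming calls are reduced to (chromosome, 0-based position) pairs and regions are non-overlapping, half-open BED intervals; the function name and the toy data are hypothetical and this is not the procedure used to produce the table.

```python
from bisect import bisect_right


def fraction_inside(calls, regions):
    """Fraction of call positions that fall inside BED-style regions.

    calls:   list of (chrom, pos) with 0-based positions
    regions: list of (chrom, start, end), half-open and non-overlapping
    """
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    inside = 0
    for chrom, pos in calls:
        intervals = by_chrom.get(chrom, [])
        # Index of the last interval whose start is <= pos.
        i = bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
            inside += 1
    return inside / len(calls) if calls else 0.0


# Toy data, for illustration only.
toy_regions = [("chr1", 100, 200), ("chr1", 300, 400)]
toy_calls = [("chr1", 150), ("chr1", 250), ("chr1", 399), ("chr2", 10)]
toy_fraction = fraction_inside(toy_calls, toy_regions)
```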

Integration of diverse data types and analyses
• Data publicly available
– Deep short reads
– Linked reads
– Long reads
– Optical/nanopore mapping
• Analyses
– Small variant calling
– SV calling
– Local and global assembly
• Integration process
– Discover & refine sequence-resolved calls from multiple datasets & analyses
– Compare variant and genotype calls from different methods
– Evaluate/genotype calls with other data
– Identify features associated with reliability of calls from each method
– Form benchmark calls using heuristics & machine learning
– Compare benchmarks to high-quality callsets and examine differences

How can we extend our approach to structural
variants?
Similarities to small variants
• Collect callsets from multiple
technologies
• Compare callsets to find calls
supported by multiple technologies
Differences from small variants
• Callsets have limited sensitivity
• Variants are often imprecisely
characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly
standardized, especially when complex
• Comparison tools in infancy
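Because breakpoints and sizes are often imprecise, comparing SV calls typically needs tolerances rather than exact matching. The following is a minimal sketch of such a comparison; the `SVCall` fields, the function name, and both thresholds (`max_dist`, `min_size_ratio`) are hypothetical choices for illustration, not values taken from the GIAB analyses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    chrom: str
    pos: int      # approximate breakpoint position
    svtype: str   # e.g. "DEL" or "INS"
    size: int     # variant size in bp, assumed >= 1


def calls_match(a, b, max_dist=1000, min_size_ratio=0.7):
    """Tolerant comparison of two SV calls (thresholds are assumptions).

    Two calls match if they share chromosome and type, their positions
    are within max_dist bp, and the smaller size is at least
    min_size_ratio of the larger size.
    """
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    return min(a.size, b.size) / max(a.size, b.size) >= min_size_ratio


# Toy calls, for illustration only.
call_a = SVCall("chr1", 10_000, "DEL", 300)
call_b = SVCall("chr1", 10_120, "DEL", 280)
call_c = SVCall("chr1", 10_120, "INS", 280)
```

Here `calls_match(call_a, call_b)` is true, while `calls_match(call_a, call_c)` is false because the types differ. A position-and-size rule like this ignores the inserted or deleted sequence, which is one reason sequence-resolved comparison is attractive.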

Evolution of SV calls for AJ Trio
v0.2.0
• Only deletions
• Overlap and size-based clustering
• Output sites with multitech support
v0.3.0
• New calling methods
• Deletions and insertions
• Sequence-resolved calls
• Sequence-based clustering
• Output sites with multitech support
v0.4.0
• Include some single tech calls
• Evaluate read support to remove some false positives
• Add genotypes for trio
v0.5.0
• Better calling methods, especially for large insertions
• Include more single tech calls
• Add some phasing info
Future
• Resolve clusters of differing calls
• Improve phasing
• Add new data types
• Improve sequence resolution
• High-confidence regions

Integrating Sequence-resolved Calls >=20bp
• >1 million calls from 30+ sequence-resolved callsets from 4 techs for AJ Trio
• >500k unique sequence-resolved calls
• 38k INS and 37k DEL with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support
• 33k INS and 35k DEL genotyped by svviz in 1+ individuals (v0.5.0)
Draft SV calls for feedback: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/
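The support rule above (2+ techs or 5+ callers predicting sequences <20% different, or optical/electronic mapping support) can be sketched as follows. This is an illustration under stated assumptions: sequence difference is taken here to be edit distance divided by the longer sequence length, and the function names and input layout are hypothetical; the actual clustering used for v0.5.0 is not reproduced.

```python
def edit_distance(a, b):
    """Levenshtein distance via a row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def seq_difference(a, b):
    """Edit distance as a fraction of the longer sequence (assumed metric)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def is_supported(candidate_seq, calls, max_diff=0.2,
                 min_techs=2, min_callers=5, mapping_support=False):
    """Apply the support rule to one candidate sequence.

    calls: list of (tech, caller, seq) tuples near the candidate,
           including the candidate's own call.
    mapping_support: True if BioNano/Nabsys evidence exists (assumed
           to be evaluated elsewhere).
    """
    similar = [(tech, caller) for tech, caller, seq in calls
               if seq_difference(candidate_seq, seq) < max_diff]
    techs = {tech for tech, _ in similar}
    callers = {caller for _, caller in similar}
    return (len(techs) >= min_techs
            or len(callers) >= min_callers
            or mapping_support)


# Toy cluster with made-up technology and caller labels.
toy_cluster = [
    ("techA", "caller1", "ACGTACGTAC"),
    ("techB", "caller2", "ACGTACGTAA"),  # 1 mismatch in 10 bp -> 10% different
]
```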

Size histograms for v0.5.0
Red: simple calls in v0.5.0
Blue: differing nearby calls in v0.5.0
Labels on histograms: Alu, LINE

Evaluation/genotyping suite of methods
Current approaches
• svviz – maps reads to REF or ALT alleles
– Short, linked, and long reads
– Haplotype-separated reads
• BioNano – compare size predictions
• Nabsys – evaluates large deletions
Future approaches
• Paternal|maternal haplotypes for svviz using whatshap
• Online manual curation of svviz, IGV, dotplots, etc.
– Volunteers needed starting ~May 8!
• PCR-Sanger targeted sequencing
– Collaborations welcome!

Outstanding challenges and future work
• Large sequence-resolved insertions
– Somewhat fewer multi-kb insertions than multi-kb deletions
– Much better than v0.4.0
• Dense calls
– ~1/3 v0.5.0 calls are within 1kb of another v0.5.0 call
– Sequence-resolved insertion size doesn’t always match BioNano
– Phasing will be important for these (e.g., with 10X, whatshap)
• Calls with inaccurate or incomplete sequence change
• Homozygous Reference calls
– Can we definitively state we call all SVs in some regions?
– E.g., using diploid assembly?
• Benchmarking tool development
– How to compare SVs to a benchmark?
– What performance metrics are important?
– New tools in development at: github.com/spiralgenetics/truvari

Proposed 2-tier call system
● Tier 1: Simple, sequence-resolved
○ v0.5.0 calls >49bp in size in HG002
○ Not within 1000bp of another >49bp call in HG002
○ ~14,000 calls
○ Benchmark variant type, breakpoint, size, sequence, genotype
● Tier 2: Confident SV but complex or no consensus sequence change
○ v0.5.0 calls that are within 1000bp of another >49bp call in HG002
■ ~6000 calls in ~2600 regions
○ Also analyze extra calls not tested as part of v0.5.0 process (not
discovered by 2+ techs or 4+ callsets and clustered)
■ ~9000 regions
○ Benchmark sensitivity to more challenging SVs
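The tier rule above can be illustrated with a short sketch. It assumes each call is a (chrom, start, end, size) tuple, treats "within 1000bp" as a gap of at most 1000 bp between call intervals, and only tiers calls of at least 50 bp; the function name and the toy calls are hypothetical, and the extra Tier 2 regions from calls outside v0.5.0 are not modeled.

```python
from collections import defaultdict


def assign_tiers(calls, min_size=50, window=1000):
    """Map each call of >= min_size bp to tier 1 (isolated) or 2 (clustered).

    calls: list of (chrom, start, end, size); for insertions end may equal start.
    """
    by_chrom = defaultdict(list)
    for call in calls:
        if call[3] >= min_size:
            by_chrom[call[0]].append(call)

    tiers = {}
    for chrom_calls in by_chrom.values():
        chrom_calls.sort(key=lambda c: c[1])
        near = [False] * len(chrom_calls)

        # Left neighbours: compare against the furthest end seen so far.
        max_end = None
        for i, (_, start, end, _) in enumerate(chrom_calls):
            if max_end is not None and start - max_end <= window:
                near[i] = True
            max_end = end if max_end is None else max(max_end, end)

        # Right neighbours: the next call by start is the closest one.
        for i in range(len(chrom_calls) - 1):
            if chrom_calls[i + 1][1] - chrom_calls[i][2] <= window:
                near[i] = True

        for call, is_near in zip(chrom_calls, near):
            tiers[call] = 2 if is_near else 1
    return tiers


# Toy calls, for illustration only.
toy_sv_calls = [
    ("chr1", 1000, 1100, 100),  # 400 bp from the next call -> tier 2
    ("chr1", 1500, 1500, 60),   # insertion near the first call -> tier 2
    ("chr1", 5000, 5300, 300),  # isolated -> tier 1
    ("chr1", 9000, 9020, 20),   # below 50 bp, not tiered
]
toy_tiers = assign_tiers(toy_sv_calls)
```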

Using assemblies to develop high-confidence bed
1. Call variants from each assembly
2. Exclude regions around long read assembly variants not in v0.5.0
3. Find regions for each assembly that are covered by 1 contig. Remove repeats longer than 75% of N50 read length
4. Find the number of assemblies covering each region (e.g., using bedtools merge)
5. High-confidence regions are regions in #4 covered by both haplotypes in a diploid assembly or at least x assemblies, minus the regions in #2
6. Subtract Tier 2 regions that don’t contain a Tier 1 call
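Steps 4 and 5 amount to counting how many assemblies cover each position and then subtracting excluded intervals. The sketch below illustrates only the "at least x assemblies" branch on a single chromosome, using plain Python instead of bedtools; the function names, the half-open interval convention, and the toy intervals are assumptions for illustration.

```python
def covered_by_at_least(assembly_regions, min_assemblies):
    """Intervals covered by >= min_assemblies assemblies (sweep line).

    assembly_regions: one list per assembly of half-open (start, end)
    intervals on a single chromosome, non-overlapping within an assembly.
    """
    events = []
    for regions in assembly_regions:
        for start, end in regions:
            events.append((start, 1))
            events.append((end, -1))
    # At equal positions process starts before ends so abutting
    # intervals are merged rather than split.
    events.sort(key=lambda e: (e[0], -e[1]))

    out, depth, open_start = [], 0, None
    for pos, delta in events:
        prev, depth = depth, depth + delta
        if prev < min_assemblies <= depth:
            open_start = pos
        elif depth < min_assemblies <= prev and pos > open_start:
            out.append((open_start, pos))
    return out


def subtract(regions, excluded):
    """Remove excluded intervals from regions (both sorted, non-overlapping)."""
    out = []
    for start, end in regions:
        cur = start
        for ex_start, ex_end in excluded:
            if ex_end <= cur or ex_start >= end:
                continue
            if ex_start > cur:
                out.append((cur, ex_start))
            cur = max(cur, ex_end)
            if cur >= end:
                break
        if cur < end:
            out.append((cur, end))
    return out


# Toy intervals: three assemblies and one excluded region (as in step 2).
toy_assemblies = [[(0, 100)], [(50, 150)], [(80, 200)]]
toy_confident = subtract(covered_by_at_least(toy_assemblies, 2), [(90, 110)])
```

The same `subtract` helper could be reused for step 6 once Tier 2 regions lacking a Tier 1 call have been identified.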

Web-based manual curation tools
http://www.svcurator.com/
● Volunteers needed to help
us establish benchmarks!
● Learn about challenges in
SV calling
Credit:
Lesley Chapman

GIAB Developing New Data
• 10X Genomics
– Chinese trio now available
• PacBio Sequel of Chinese trio with Mt Sinai
– Read insert N50: 16-18kb
– ~60x on son and ~30x on each parent
– Also additional 30x on AJ son/mother
– Data undergoing QC
• BioNano
– New DLS labeling method
• Complete Genomics/BGI
– stLFR linked reads
• Oxford Nanopore
– NIST/Birmingham/Nottingham Ultra-long reads
– In progress
– Very preliminarily 80-90kb N50
– Max reads >1Mb!
– Current throughput may give ~30-40x total on AJ trio
• Strand-seq
– Collaboration with Korbel lab

ONT “Ultralong reads”
Noah Spies
David Catoe
Matt Loose
Nick Loman
Josh Quick
• So far…
– ~4x total mapped
– ~2x > 50kb
– ~1x > 100kb
• Plan initial release soon
• Estimated ~30x total in 2018

New Samples
Additional ancestries
• Shorter term
– Use existing PGP individual samples
– Use existing integration pipeline
• Data-based selection
– Proportion of potential genomes from different ancestries
• 3 to 8 new samples
• Longer term
– Recruit large family
– Recruit trios from other ancestry groups
Cancer samples
• Longer term
• Make PGP-consented tumor and normal cell lines from same individual
• Select tumor with diversity of mutation types

The road ahead...
2018
• Large variants
• Difficult small variants
• Phasing
2019
• Difficult small & large variants
• Somatic sample development
• Germline samples from new ancestries
2020+
• Diploid assembly
• Somatic structural variation
• Segmental duplications
• Centromere/telomere
• ...

Take-home Messages
• Genome in a Bottle is:
– “Open science”
– Authoritative characterization of human genomes
• Currently enables benchmarking of “easier” variants
– Clinical validation
– Technology development, optimization, and demonstration
• Now working on difficult variants and regions
– Draft variant calls >=20bp available and feedback requested
– Working on finalizing a tiered benchmark set >=50bp + confident regions
– New long and ultralong read data coming
– Many challenges remain and collaborations welcome!
Draft SV calls for feedback: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/

Acknowledgements
• NIST/JIMB
– Marc Salit
– Jenny McDaniel
– Lindsay Harris
– David Catoe
– Lesley Chapman
– Noah Spies
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA

For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
Latest small variant benchmark: https://doi.org/10.1101/281006
Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://doi.org/10.1101/270157
Public workshops
– Next workshop tentatively January 2019 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc opportunities available!
