GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511

May 11, 2020
Genome in a Bottle Benchmarks
for Structural Variants and
Repetitive Regions
www.slideshare.net/genomeinabottle
@GenomeinaBottle on Twitter

Why start Genome in a Bottle?
• A map of every individual’s
genome will soon be possible, but
how will we know if it is correct?
• Diagnostics and precision
medicine require high levels of
confidence
• Well-characterized, broadly
disseminated genomes are needed
to benchmark performance of
sequencing
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432

Human Genome Sequencing needed a new class of
Reference Materials with billions of reference values
By Russ London at English Wikipedia, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=9923576

GIAB has characterized 7 human
genomes
• Pilot genome
– NA12878
• PGP Human
Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also
characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:

Open consent enables secondary reference samples to
meet specific clinical needs
• >50 products now available
based on broadly-consented,
well-characterized GIAB PGP cell
lines
• Genomic DNA + DNA spike-ins
• Clinical variants
• Somatic variants
• Difficult variants
• Clinical matrix (FFPE)
• Circulating tumor DNA
• Stem cells (iPSCs)
• Genome editing
• …

Design of our human genome reference values
Reference Values
• Benchmark Variant Calls
• Benchmark Regions – regions in which the benchmark contains (almost) all the variants
Query Variants
• Variants from any method being evaluated
• Variants outside benchmark regions are not assessed
• Matching variants assumed to be true positives
• Majority of variants unique to method should be false positives (FPs)
• Majority of variants unique to benchmark should be false negatives (FNs)
This does not directly give the accuracy of the reference values, but rather that they are fit for purpose.
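The comparison described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the GIAB or GA4GH implementation: the function names, the tuple representation of a variant as (chrom, pos, ref, alt), and the toy data are all assumptions, and matching here is exact, whereas real comparison tools also reconcile different representations of the same variant.

```python
from bisect import bisect_right

def in_regions(regions, chrom, pos):
    """True if 0-based pos lies in a half-open interval for chrom.

    Assumes regions[chrom] is a sorted list of non-overlapping (start, end).
    """
    intervals = regions.get(chrom, [])
    starts = [start for start, _ in intervals]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

def compare(benchmark_calls, query_calls, regions):
    """Classify variants as TP, FP, FN, or not assessed (hypothetical helper)."""
    bench = {v for v in benchmark_calls if in_regions(regions, v[0], v[1])}
    query_all = set(query_calls)
    query = {v for v in query_all if in_regions(regions, v[0], v[1])}
    tp = len(bench & query)   # matching variants, assumed true positives
    fp = len(query - bench)   # unique to the method being evaluated
    fn = len(bench - query)   # unique to the benchmark
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "not_assessed": len(query_all) - len(query),  # outside benchmark regions
        "recall": tp / (tp + fn) if tp + fn else None,
        "precision": tp / (tp + fp) if tp + fp else None,
    }

# Toy data, invented for illustration only.
regions = {"chr1": [(100, 200), (500, 600)]}
benchmark = [("chr1", 150, "A", "G"), ("chr1", 550, "C", "T")]
query = [("chr1", 150, "A", "G"), ("chr1", 180, "G", "GA"), ("chr1", 300, "T", "C")]
result = compare(benchmark, query, regions)
print(result)
```

In the toy data, the query call at position 300 falls outside the benchmark regions, so it is counted as not assessed rather than as a false positive.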

GIAB Recently Published Resources for
“Easier” Small Variants

Now using linked and long reads for
difficult variants and regions
GIAB/HPRC Public Data
• Linked Reads
– 10x Genomics
– Complete Genomics/BGI stLFR
– TELL-seq
– Hi-C
– Strand-seq
• Long Reads
– PacBio Continuous Long Reads
– PacBio Circular Consensus Seq
– Oxford Nanopore “ultralong”
– Promethion
GIAB Use Cases
• Develop structural variant
benchmark
– bioRxiv 664623
• Diploid assembly of difficult regions
like MHC
– bioRxiv 831792
– New collaboration with
www.humanpangenome.org
• Expand small variant benchmark
– v4.1 available, manuscript in prep

[Figure panels: 50 to 1000 bp (Alu labeled); 1kbp to 10kbp (LINE labeled)]
• Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
• Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
• Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
• Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
• Filter complex: 12745 SVs not within 1kb of another SV
• Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly
v0.6: tinyurl.com/GIABSV06
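The "Filter complex" step keeps only SVs that are not within 1kb of another SV. A minimal Python sketch of such a filter follows. It is an illustration under stated assumptions, not the v0.6 pipeline code: SVs are represented as hypothetical (chrom, start, end) tuples, "within 1kb" is taken to mean a gap of fewer than 1000 bp between intervals on the same chromosome, and the example coordinates are invented.

```python
def filter_isolated(svs, min_gap=1000):
    """Keep SVs at least min_gap bp from every other SV on the same chromosome.

    svs: iterable of (chrom, start, end) tuples (assumed representation).
    """
    by_chrom = {}
    for chrom, start, end in svs:
        by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, items in by_chrom.items():
        items.sort()
        max_end_before = None  # furthest end among SVs that start earlier
        for i, (start, end) in enumerate(items):
            # Closest SV on the left is the one with the largest end so far.
            left_ok = max_end_before is None or start - max_end_before >= min_gap
            # Closest SV on the right is the next one in start order.
            right_ok = i + 1 == len(items) or items[i + 1][0] - end >= min_gap
            if left_ok and right_ok:
                kept.append((chrom, start, end))
            max_end_before = end if max_end_before is None else max(max_end_before, end)
    return kept

# Invented coordinates for illustration only.
svs = [
    ("chr1", 1000, 1300),
    ("chr1", 1800, 1900),   # 500 bp from the previous SV, so both are dropped
    ("chr1", 5000, 5200),
    ("chr2", 100, 400),
]
isolated = filter_isolated(svs)
print(isolated)
```

Overlapping SVs produce a negative gap and are dropped as well, which matches the intent of excluding complex or clustered events from the benchmark.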

New SV Benchmark Reliably Identifies FPs and FNs
[Figure: Count by Structural Variant Type (DEL, INS) for FNs and FPs in Long Reads and Short Reads callsets; legend: Is GIAB Correct? (No, Maybe, Partial, Yes)]

Diploid assembly of MHC
Martin, et al., 2016
BioRxiv 085050.
Chin and Khalak, 2019,
BioRxiv 705616 *Now dipcall

Alignments of assembly to reference
Two haplotigs span
through whole MHC
region
New version
correctly assembles
30kb seg dup

New small variant benchmark includes more bases of human
genome and variants
Benchmark Set   GRCh38 Coverage   SNPs        INDELs
v3.3.2          85.4%             3,028,458   476,514
v4.1            92.2%             3,363,367   528,138
Percent increase in V4.1 compared to V3.3.2
(100*(V4.1 - V3.3.2)/V3.3.2) for GRCh38 reference bases covered, single
nucleotide variants, and small indels
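The percent-increase formula in the caption can be applied directly to the table values. The short Python sketch below does that arithmetic; the dictionary keys and function name are illustrative choices, and the numbers are copied from the table on this slide.

```python
def percent_increase(old, new):
    """100 * (new - old) / old, as in the slide caption."""
    return 100 * (new - old) / old

# Values from the benchmark comparison table on this slide.
v3_3_2 = {"GRCh38 coverage (%)": 85.4, "SNPs": 3_028_458, "INDELs": 476_514}
v4_1 = {"GRCh38 coverage (%)": 92.2, "SNPs": 3_363_367, "INDELs": 528_138}

increase = {key: percent_increase(v3_3_2[key], v4_1[key]) for key in v3_3_2}
for key, value in increase.items():
    print(f"{key}: {value:.1f}% increase in v4.1 over v3.3.2")
```

Note that the coverage figure computed this way is a relative increase (roughly 8%), which is different from the 6.8 percentage-point difference between 85.4% and 92.2%.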

New benchmark covers more medically-relevant genes that are
difficult to map for short reads
v4.1 covers many more difficult, medically-relevant genes. Cumulative distribution for
percent gene covered by benchmark regions for 193 difficult, medically-relevant genes.
• Remaining regions to cover:
• Very difficult seg dups
• Structural variants
• Large duplications
• Some small complex variants
• Some >15bp indels
• Satellite DNA

Performance with new benchmark demonstrates utility in regions that are difficult for short reads
• Comparison of FNs from different sequencing technologies and variant calling methods against benchmark set
• New benchmark identifies more SNP FNs across technologies, mostly due to new benchmark variants in difficult to map regions and segmental duplications

Benchmark reliably identifies FPs and FNs across
diverse callsets

Germline Variant
Calling Benchmarking

GA4GH
Benchmarking
Tool
https://rdcu.be/bVtIF

Example of benchmarking a diploid assembly
• Call variants with dipcall
• Stratify performance by difficult
regions
• More errors in seg dups
• More indel errors in long
homopolymers
• Can also separate genotyping errors
from other FPs
• Can subset Recall to regions
covered by both haplotypes
• Also gives fraction of variants not
assessed because they were
outside benchmark regions
Type    Region                 Recall (%)   Precision (%)
SNV     All in benchmark       98.4         97.8
SNV     SegDup                 76.9         59.6
SNV     Easy*                  99.5         99.9
Indel   All in benchmark       93.0         83.3
Indel   Homopolymers >10bp     79.7         52.7
Indel   Easy*                  99.1         94.4
*Easy := genome after excluding all homopolymers >6bp, tandem repeats, seg dups, and low mappability regions
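To make the stratified results easier to compare, the table above can be turned into error rates per stratum. The Python sketch below assumes the usual definitions, recall = TP/(TP+FN) and precision = TP/(TP+FP), so that 100 minus recall is the percentage of benchmark variants missed and 100 minus precision is the percentage of assessed query calls that are false positives. The variable and function names are illustrative; the values are copied from the table.

```python
# (type, region, recall %, precision %) copied from the table on this slide.
rows = [
    ("SNV", "All in benchmark", 98.4, 97.8),
    ("SNV", "SegDup", 76.9, 59.6),
    ("SNV", "Easy", 99.5, 99.9),
    ("Indel", "All in benchmark", 93.0, 83.3),
    ("Indel", "Homopolymers >10bp", 79.7, 52.7),
    ("Indel", "Easy", 99.1, 94.4),
]

def error_rates(recall, precision):
    """Return (missed %, false discovery %) rounded to one decimal place."""
    return round(100 - recall, 1), round(100 - precision, 1)

rates = {(vtype, region): error_rates(r, p) for vtype, region, r, p in rows}
for (vtype, region), (missed, false_disc) in rates.items():
    print(f"{vtype:5s} {region:20s} missed {missed:5.1f}%  false discoveries {false_disc:5.1f}%")
```

Read this way, the assembly-based callset misses about a quarter of benchmark SNVs in segmental duplications, compared with half a percent in the easy regions, which is the pattern the bullets on this slide describe.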

Small Variant
Benchmarking
Highlights
Best practices for benchmarking germline
variant calling
• https://rdcu.be/bVtIF
• Supplemental Table 2 summarizes best practices
Hap.py - best practices implementation
• Command line - https://github.com/Illumina/hap.py
• Graphical interface – https://precision.fda.gov/
• v2 stratification beds - https://github.com/genome-in-a-bottle/genome-stratifications
HappyR – R package for hap.py results
• Github https://github.com/Illumina/happyR

The road
ahead... 2020
New SV Benchmark for GRCh38 and
other genomes
Small variant benchmark for other
GIAB genomes
Focus on missing difficult clinical genes
Work with HPP on H2M variants
Somatic sample development
2021+
Somatic benchmarking
Germline samples from new ancestries
Large segmental duplications
Centromere/telomere
Diploid assembly benchmarking
...

https://precision.fda.gov/challenges/10

Acknowledgment of many GIAB contributors
Government
Clinical Laboratories Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders

For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google groups
GIAB slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
– github.com/genome-in-a-bottle
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://rdcu.be/bqpDT
Public workshops
– Join google groups for updates at www.genomeinabottle.org
Justin Zook: jzook@nist.gov
NIST postdoc
opportunities
available!
Diploid assembly,
cancer genomes,
other ‘omics, …
