Genome in a Bottle - Towards new benchmarks for the “dark matter” of the human genome 190502

May 2, 2019
Genome in a Bottle: Towards
new benchmarks for the “dark
matter” of the human genome

What’s Genome in a Bottle?
• Authoritative Characterization of Human
Genomes
– enduring commitment to resource availability
• Samples
• Data
– widely available open resources
– all data made available without embargo
• Enable technology and tool-building with benchmark
samples and methods for…
– development
– optimization
– demonstration
• Germline samples available now
• Developing capacity for somatic sample development

GIAB has characterized 7 human genomes
• Pilot genome
– NA12878
• PGP Human Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
New!

GIAB Recently Published Resources for
“Easier” Small Variants

Best Practices for Benchmarking Small Variants
https://github.com/ga4gh/benchmarking-tools
Paper: https://rdcu.be/bqpDT https://precision.fda.gov/
• Describe public “Truth” VCFs with confident regions
• Enable stratification of performance in difficult regions
• Tools to compare different representations of complex variants
• Standardized VCF-I output of comparison tools
• Standardized output formats for performance metrics
• Web-based interface for performance metrics
• Standardized definitions of performance metrics based on matching stringency

Best practice #1: Account for different representations
[Figure: equivalent VCF representations of the same variant. Record columns are CHROM POS REF ALT GT; the chromosome is named REF.]
(a) REF: CAAAG, ALT: CAAG
Representation 1: REF 1 CA C 0/1
Representation 2: REF 2 AA A 0/1
Representation 3: REF 3 AA A 0/1
(b) REF: AAC, ALT: CGG
Representation 1: REF 1 A C 0|1; REF 2 A G 0|1; REF 3 C G 0|1
Representation 2: REF 1 AAC CGG 0/1
(c) REF: ATGC, ALT: ATCTGTGC
Representation 1: REF 1 A ATC 0|1; REF 3 G GTG 0|1
Representation 2: REF 1 A ATCTG 0/1
(d) REF: GCCC, ALT: GCG
Representation 1: REF 1 GCCC GCG 0/1
Representation 2: REF 3 CC G 0/1
Representation 3: REF 1 GC G 0|1; REF 4 C G 0|1
Representation 4: REF 3 C G 0|1; REF 4 C <DEL> 0|1
• Complex variants are often represented in different
ways
• Normalization can help, but not always
• Phasing of nearby variants can affect interpretation
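
One way to see that the representations in panels (a) to (d) describe the same change is to apply each set of records to the reference and compare the resulting haplotypes. The following is a minimal sketch of that idea, not the comparison method used by the GA4GH tools. `apply_variants` is a hypothetical helper; it assumes all records in a representation sit on the same haplotype, and it treats `<DEL>` simply as deletion of the listed REF base, as drawn on the slide.

```python
def apply_variants(ref_seq, records):
    """Apply non-overlapping (pos, ref, alt) records to ref_seq; pos is 1-based."""
    out = []
    cursor = 0  # 0-based index of the next reference base not yet copied
    for pos, ref, alt in sorted(records):
        start = pos - 1
        if start < cursor:
            raise ValueError(f"record at {pos} overlaps an earlier record")
        if ref_seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele {ref!r} does not match reference at {pos}")
        out.append(ref_seq[cursor:start])          # unchanged bases before the record
        out.append("" if alt == "<DEL>" else alt)  # assumption: <DEL> removes the REF base
        cursor = start + len(ref)
    out.append(ref_seq[cursor:])                   # unchanged bases after the last record
    return "".join(out)


# Panel (d) from the slide: reference GCCC, alternate haplotype GCG.
panel_d = {
    "Representation 1": [(1, "GCCC", "GCG")],
    "Representation 2": [(3, "CC", "G")],
    "Representation 3": [(1, "GC", "G"), (4, "C", "G")],
    "Representation 4": [(3, "C", "G"), (4, "C", "<DEL>")],
}
haplotypes = {name: apply_variants("GCCC", recs) for name, recs in panel_d.items()}
for name, hap in haplotypes.items():
    print(name, hap)
```

Comparing haplotypes rather than records is what makes the four representations match; a record-by-record comparison would count most of them as different calls.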

Best practice #2: Stratify by variant type and genome context
• Performance metrics can be very different for different variant types and genome contexts
• GA4GH tools enable very granular stratification
• Also can see what the benchmark excludes
[Figure: FN rate vs. average (axis from 0.3x to 30x) for 2bp, 3bp, and 4bp unit repeats of 11to50bp and 51to200bp]
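
The "FN rate vs. average" comparison can be sketched as a ratio of each stratum's false-negative rate to the overall rate. This is an illustrative sketch only: `relative_fn_rates` is a hypothetical helper, the counts are made up for the example, and the strata are assumed to be disjoint so that their counts can be summed into an overall average.

```python
def relative_fn_rates(counts):
    """Return each stratum's FN rate divided by the overall FN rate.

    counts maps stratum name -> (true positives, false negatives).
    Assumes disjoint strata and at least one false negative overall.
    """
    total_tp = sum(tp for tp, fn in counts.values())
    total_fn = sum(fn for tp, fn in counts.values())
    average = total_fn / (total_tp + total_fn)
    return {name: (fn / (tp + fn)) / average for name, (tp, fn) in counts.items()}


# Hypothetical counts, not taken from any GIAB comparison.
example_counts = {
    "not in repeats": (9800, 100),
    "2bp unit repeat, 51to200bp": (50, 50),
}
relative = relative_fn_rates(example_counts)
for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x the average FN rate")
```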

Best practice #3: Manually curate FPs and FNs
• Helps to understand what is causing errors
• Sometimes, putative FPs and FNs are errors in the benchmark set
https://doi.org/10.1101/581264

GIAB has extensive public,
unembargoed data
Short reads
• BGISEQ
• Complete
Genomics
• Illumina
• Ion Torrent
• SOLiD
Linked reads
• 10x Genomics
• BGISEQ stLFR
• Illumina 6kb
mate-pair
Long reads
• PacBio
• PacBio CCS
• PromethION
• Ultralong Oxford
Nanopore
Optical/electronic
mapping
• BioNano
• Nabsys
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/

Now using linked and long reads for
difficult variants and regions
GIAB Public Data
• Linked Reads
– 10x Genomics
– Complete Genomics/BGI stLFR
• Long Reads
– PacBio Continuous Long Reads
– PacBio Circular Consensus Seq
– Oxford Nanopore “ultralong”
GIAB Use Cases
• Expand small variant
benchmark
• Develop structural variant
benchmark
• Diploid assembly of difficult
regions like MHC

Linked Reads
• Short reads, but
barcodes give long
range information
>100kb
• Most useful for:
– Phasing variants & reads
– Difficult-to-map regions
– De novo assembly
https://dx.doi.org/10.1038%2Fnbt.3432

PACBIO CIRCULAR CONSENSUS SEQUENCING (CCS)
• Double-stranded DNA
• Ligate adapters
• Anneal primer and bind DNA polymerase
• Sequence: subreads (passes), each with subread errors
• Generate consensus HiFi read
[Figure: Accuracy (Phred) vs. Passes. Read accuracy improves with more passes.]
Wenger, Peluso, et al. (2018). bioRxiv. doi:10.1101/519025
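
For readers less used to the Phred scale on the accuracy axis, the standard definition is Q = -10 * log10(error probability). The sketch below converts between read accuracy and Phred quality; the function names are hypothetical and the values are examples, not numbers read from the figure.

```python
import math


def accuracy_to_phred(accuracy):
    """Phred quality for a given accuracy (undefined for accuracy == 1)."""
    return -10 * math.log10(1 - accuracy)


def phred_to_accuracy(q):
    """Accuracy implied by a Phred quality value."""
    return 1 - 10 ** (-q / 10)


for acc in (0.90, 0.99, 0.999):
    print(f"accuracy {acc} is about Q{accuracy_to_phred(acc):.0f}")
```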

15X Coverage by reads > 100Kb
Oxford Nanopore Can Produce “Ultralong” Reads

Expand small variant
benchmark set to difficult to
map regions
Justin Wagner, NIST

Long+Linked Reads expand small
variant benchmark set
Benchmark includes more bases, variants, and segmental duplications in v4⍺
Metric                                  v3.3.2          v4⍺             In v4⍺ not in v3.3.2   In v3.3.2 not in v4⍺
Base pairs covered                      2,358,060,765   2,572,421,057   225,990,474            11,630,182
Percent of GRCh37 covered               87.84%          95.82%          8.42%                  0.43%
SNPs                                    3,046,933       3,432,698       385,765                25,219
Indels                                  465,670         537,035         71,365                 15,382
Base pairs in Segmental Duplications    13,722,546      116,687,703     103,466,431            501,274

Small variant performance metrics
decrease vs. new benchmark
Comparison of Illumina GATK4 VCF against benchmark sets
• SNP FN rate increases by a factor of 10
– almost entirely due to new benchmark variants in difficult to
map regions (lowmap) and segmental duplications (segdups)
Subset                      v3.3.2 Recall   v4⍺ Recall   v3.3.2 Precision   v4⍺ Precision
All SNPs                    0.9995          0.9914       0.9981             0.9941
Lowmap 100 bp               0.9799          0.7911       0.9623             0.8582
Lowmap 250 bp no mismatch   0.9474          0.4916       0.8911             0.7171
Segdups                     0.9982          0.9103       0.9910             0.9014
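
The change in FN rate can be worked out from the recall columns, since FN rate = 1 - recall. The sketch below uses the recall values from the slide; because they are rounded to four decimals, the fold changes are approximate and are best read as order-of-magnitude figures.

```python
# (v3.3.2 recall, v4 alpha recall) per subset, copied from the slide.
recall = {
    "All SNPs": (0.9995, 0.9914),
    "Lowmap 100 bp": (0.9799, 0.7911),
    "Lowmap 250 bp no mismatch": (0.9474, 0.4916),
    "Segdups": (0.9982, 0.9103),
}


def fn_rate(recall_value):
    """False-negative rate implied by a recall value."""
    return 1.0 - recall_value


fold_change = {
    subset: fn_rate(v4) / fn_rate(v332) for subset, (v332, v4) in recall.items()
}
for subset, fold in fold_change.items():
    print(f"{subset}: FN rate changes by a factor of about {fold:.1f}")
```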

Error in current benchmark excluded in new benchmark
[Figure: tracks shown for v4⍺, v3.3.2, Illumina, PacBio CCS, 10X, and ONT]

Develop sequence-resolved
structural variant benchmark set
GIAB Analysis Team

[Figure: SV size distributions for 50 to 1000 bp (Alu peaks) and 1kbp to 10kbp (LINE peaks)]
• Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
• Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
• Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
• Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
• Filter complex: 12745 SVs not within 1kb of another SV
• Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly
v0.6
tinyurl.com/GIABSV06
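
The "Filter complex" step (keeping SVs not within 1kb of another SV) can be sketched as an interval-distance filter. This is not the GIAB pipeline's code. `isolated_svs` is a hypothetical helper, and it assumes SVs are given as `(chrom, start, end)` tuples, that distance is measured between interval edges, and that "within 1kb" means a gap strictly below 1000 bp.

```python
from collections import defaultdict


def isolated_svs(svs, min_gap=1000):
    """Keep SVs with no other SV on the same chromosome closer than min_gap bp."""
    by_chrom = defaultdict(list)
    for chrom, start, end in svs:
        by_chrom[chrom].append((start, end))

    kept = []
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        max_end_before = None  # furthest end among SVs that start earlier
        for i, (start, end) in enumerate(intervals):
            near_left = max_end_before is not None and start - max_end_before < min_gap
            # After sorting, the next interval has the smallest later start.
            near_right = i + 1 < len(intervals) and intervals[i + 1][0] - end < min_gap
            if not (near_left or near_right):
                kept.append((chrom, start, end))
            max_end_before = end if max_end_before is None else max(max_end_before, end)
    return kept


# Hypothetical coordinates for illustration only.
example = [("1", 1000, 1100), ("1", 1500, 1600), ("1", 5000, 5300), ("2", 1200, 1250)]
print(isolated_svs(example))
```

In this example the first two SVs on chromosome 1 are 400 bp apart, so both are dropped, while the remaining two are kept.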

Reads support benchmark SV genotypes
[Figure: fraction of reads supporting SV for Het and Hom genotypes, with panels for support from long reads and support from short reads]

Sequence-resolved SV size supported by optical mapping
[Figure: Log10(BioNano Size) vs. Log10(Benchmark Size)]

High Mendelian Genotype Concordance
Father    0/0   0/0   0/0   0/1   0/1   0/1   1/1   1/1   1/1
Mother    0/0   0/1   1/1   0/0   0/1   1/1   0/0   0/1   1/1
Son 0/1   14    1185  417   1143  1119  462   416   522   12
Son 1/1   0     0     0     0     449   444   2     431   2748
Trio Mendelian genotype violation rate
28/9392 = 0.3%
(Excludes X/Y and sites with no GT in a parent)
Also, >627/635 genotypes concordant with crowd-sourced manual curations
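
The violation count can be recomputed from the trio table with a simple Mendelian consistency check: a son's genotype is consistent if it can be formed from one allele of each parent. The sketch below is an illustration with a hypothetical helper name, using the counts from the slide. The two son rows shown sum to 9364 rather than the 9392 used as the denominator on the slide; the difference presumably comes from sites not shown in the table (an assumption), and the rate rounds to 0.3% with either denominator.

```python
FATHER = ["0/0", "0/0", "0/0", "0/1", "0/1", "0/1", "1/1", "1/1", "1/1"]
MOTHER = ["0/0", "0/1", "1/1", "0/0", "0/1", "1/1", "0/0", "0/1", "1/1"]
SON_COUNTS = {
    "0/1": [14, 1185, 417, 1143, 1119, 462, 416, 522, 12],
    "1/1": [0, 0, 0, 0, 449, 444, 2, 431, 2748],
}


def is_mendelian(father, mother, child):
    """True if child can inherit one allele from each parent (unphased genotypes)."""
    child_alleles = sorted(child.split("/"))
    return any(
        sorted((f, m)) == child_alleles
        for f in father.split("/")
        for m in mother.split("/")
    )


violations = sum(
    count
    for son_gt, counts in SON_COUNTS.items()
    for father, mother, count in zip(FATHER, MOTHER, counts)
    if not is_mendelian(father, mother, son_gt)
)
total_shown = sum(sum(counts) for counts in SON_COUNTS.values())
print(violations, total_shown, f"{violations / total_shown:.1%}")
```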

Our benchmark sets are useful in evaluating SVs
from multiple technologies
Goal: When comparing any callset
to our vcf within the bed, most
putative FPs and FNs should be
errors in the tested callset
github.com/spiralgenetics/truvari
github.com/nhansen/SVanalyzer

Resolve MHC regions from
HG002
https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/MHC
Justin Wenger, Justin Zook, Mikko Rautiainen, Jason Chin, Tobias Marschall, Qian Zeng,
Erik Garrison, Shilpa Garg
Mar. 25-27, UCSC, The Human Pangenomics Hackathon

Goals
• Make the best haplotype-correct
assemblies for the MHC regions of
HG002 from all available data
• Fewest gaps
• Correct phasing for both SNPs and
SVs
• Provide the best genomic sequences
for future GIAB SNP and SV
benchmark for this complicated but
medically important region

MHC Diploid assembly process
[Figure: workflow diagram. Recoverable labels are listed below.]
• Inputs: ONT, CCS, 10X VCF, DV VCF
• Heterozygous SNPs from the 10X VCF and DV VCF
• WhatsHap: haplotype binned reads (ONT, CCS)
• Error corrected ONT reads (FALCON EC module)
• ONT for gap filling: identify ONT reads filling in regions missed by PacBio CCS reads
• Falcon / Peregrine
• HLA-ASM
• seqwish + odgi, GraphAligner
• MHC in GRCh37 / HG002 Assembly
• Compare to HLA-Typing Results
• Github: phasing-notes.md
• Github: assembly directory

Preliminary MHC Diploid Assembly Results
• Haplotype I (2 contigs spanning the MHC region)
• Haplotype II (3 contigs spanning the MHC region)
• A loop in the assembly graph: Missing Sequence?

Open consent enables secondary reference samples to
meet specific clinical needs
• >50 products now available
based on broadly-consented,
well-characterized GIAB PGP cell
lines
• Genomic DNA + DNA spike-ins
• Clinical variants
• Somatic variants
• Difficult variants
• Clinical matrix (FFPE)
• Circulating tumor DNA
• Stem cells (iPSCs)
• Genome editing
• …

The road ahead...
2019
• Integration pipeline development for small and structural variants
• Manuscripts for small and structural variants
2020
• Difficult large variants
• Somatic sample development
• Germline samples from new ancestries
• Diploid assembly
2021+
• Somatic integration pipeline
• Somatic structural variation
• Large segmental duplications
• Centromere/telomere
• ...

Acknowledgment of many GIAB contributors
Government
Clinical Laboratories Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders

For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group
GIAB slides, including 2019 Workshop slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
– github.com/genome-in-a-bottle
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://rdcu.be/bqpDT
Public workshops
– Next workshop planned for April 1-2, 2020 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc
opportunities
available!
Diploid assembly,
cancer genomes,
other ‘omics, …
