Genome in a Bottle:
Integrating Multiple Technologies to Form Benchmark
Structural Variants
Justin Zook, on behalf of the GIAB Consortium
NIST Genome-Scale Measurements Group
Joint Initiative for Metrology in Biology (JIMB)
May 17, 2018
Take-home Messages
• Genome in a Bottle is:
– “Open science”
– Authoritative characterization of human genomes
• Currently enable benchmarking of “easier” variants
– Clinical validation
– Technology development, optimization, and demonstration
• Now working on difficult variants and regions
– Draft variant calls >=20bp available and feedback requested
– Working on finalizing a tiered benchmark set >=50bp + confident regions
– New long and ultralong read data coming
– Many challenges remain and collaborations welcome!
Why Genome in a Bottle?
• A map of every individual’s genome will soon be possible, but how will we know if it is correct?
• Diagnostics and precision medicine require high levels of confidence
• Well-characterized, broadly disseminated genomes are needed to benchmark performance of sequencing
• Open, transparent data/analyses
• Enable technology development, optimization, and demonstration
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432
GIAB is evolving with technologies
2012
• No human benchmark calls available
• GIAB Consortium formed
2014
• Small variant genotypes for ~77% of pilot genome NA12878
2015
• NIST releases first human genome Reference Material
2016
• 4 new genomes
• Small variants for 90% of 5 genomes for GRCh37/38
2017+
• Characterizing difficult variants
• Develop tumor samples
GIAB has characterized 5 human genome RMs
• Pilot genome
– NA12878
• PGP Human Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
https://doi.org/10.1101/281006
Important characteristics of benchmark calls
What does “gold standard” mean?
1. Accurate
– High-confidence variants, genotypes, haplotypes, and regions
– When results from any method are compared to the benchmark, the majority of differences (FPs/FNs) are errors in the method
2. Representative examples
– Different types of variants in different genome contexts
3. Comprehensive characterization
– Many examples of different variant types/genome contexts
– Eventually, diploid assembly benchmarking
GIAB “Open Science” Virtuous Cycle
• Users analyze GIAB samples
• Benchmark vs. GIAB data
• Critical feedback to GIAB
• Integrate new methods
• New benchmark data
• Method development, optimization, and demonstration
• Part of assay validation
• GIAB/NIST expands to more difficult regions
Open consent enables secondary reference samples
• >30 products now available
based on broadly-consented,
well-characterized GIAB PGP cell
lines
• Genomic DNA + DNA spike-ins
– Clinical variants
– Somatic variants
– Difficult variants
• Clinical matrix (FFPE)
• Circulating tumor DNA
• Stem cells (iPSCs)
• Genome editing
All data and analyses are open and public
51 authors
14 institutions
12 datasets
7 genomes
Data described in ISA-tab
New data on GIAB NCBI FTP
Best Practices for Benchmarking Small Variants
https://github.com/ga4gh/benchmarking-tools
https://doi.org/10.1101/270157 https://precision.fda.gov/
• Describe public “Truth” VCFs with confident regions
• Enable stratification of performance in difficult regions
• Tools to compare different representations of complex variants
• Standardized VCF-I output of comparison tools
• Standardized definitions of performance metrics based on matching stringency
• Web-based interface for performance metrics
• Standardized output formats for performance metrics
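As a rough illustration of the kind of performance metrics being standardized, the sketch below computes precision, recall, and F1 from true-positive, false-positive, and false-negative counts. The `Counts` class and function names are hypothetical, and the single shared TP count is a simplification; this is not the GA4GH benchmarking-tools implementation.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical container for benchmark comparison counts."""
    tp: int  # calls matching a benchmark variant
    fp: int  # calls inside confident regions with no benchmark match
    fn: int  # benchmark variants that were not called


def precision(c: Counts) -> float:
    """Fraction of calls that match the benchmark."""
    called = c.tp + c.fp
    return c.tp / called if called else float("nan")


def recall(c: Counts) -> float:
    """Fraction of benchmark variants that were called (sensitivity)."""
    truth = c.tp + c.fn
    return c.tp / truth if truth else float("nan")


def f1(c: Counts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else float("nan")
```

The counts depend on how strictly a call must match the benchmark (for example allele match versus genotype match), so the same callset can yield different metrics at different matching stringencies.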
What are we accessing and what is still
challenging?
Type of variant | Genome context | Fraction of variants called* | Number of variants missing* | How to improve?
Simple SNPs | Not repetitive | ~97% | >100k | Machine learning
Simple indels | Not repetitive | ~93% | >10k | Machine learning
All variants | Low mappability | <30% | >170k | Use linked reads and long reads
All variants | Regions not in GRCh37/38 | 0 | >>100k??? | De novo assembly; long reads
Small indels | Tandem repeats and homopolymers | <50% | >200k | STR/homopolymer callers; long reads; better handle complex and compound variants
Indels 15-50bp | All | <25% | >30k | Assembly-based callers; integrate larger variants differently; long reads
Indels >50bp | All | <1% | >20k |
* Approximate values based on fraction of variants in GATKHC or FermiKit that are inside v3.3.2 High-confidence regions
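The footnote describes the estimate as the fraction of a callset's variants that fall inside the high-confidence regions. A minimal sketch of that calculation for a single chromosome follows, assuming BED-style half-open intervals `(start, end)` and 0-based variant positions; the function names are illustrative and not taken from any GIAB tool.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def fraction_inside(positions, regions):
    """Fraction of variant positions that lie inside any region."""
    if not positions:
        return float("nan")
    merged = merge_intervals(regions)
    starts = [s for s, _ in merged]
    inside = 0
    for pos in positions:
        # Rightmost region starting at or before pos, if any.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < merged[i][1]:
            inside += 1
    return inside / len(positions)
```

For example, `fraction_inside([99, 100, 299, 300, 550], [(100, 200), (150, 300), (500, 600)])` counts positions 100, 299, and 550 as inside.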
Integration of diverse data types and analyses
• Data publicly available
– Deep short reads
– Linked reads
– Long reads
– Optical/nanopore mapping
• Analyses
– Small variant calling
– SV calling
– Local and global assembly
1. Discover & refine sequence-resolved calls from multiple datasets & analyses
2. Compare variant and genotype calls from different methods
3. Evaluate/genotype calls with other data
4. Identify features associated with reliability of calls from each method
5. Form benchmark calls using heuristics & machine learning
6. Compare benchmarks to high-quality callsets and examine differences
How can we extend our approach to structural
variants?
Similarities to small variants
• Collect callsets from multiple
technologies
• Compare callsets to find calls
supported by multiple technologies
Differences from small variants
• Callsets have limited sensitivity
• Variants are often imprecisely
characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly
standardized, especially when complex
• Comparison tools in infancy
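To make "compare callsets to find calls supported by multiple technologies" concrete, here is a minimal sketch that clusters deletion-like calls by reciprocal overlap and keeps clusters seen by at least two technologies. The `Call` fields, the 50% overlap threshold, and the greedy clustering against the first call in each cluster are all assumptions for illustration, not the GIAB integration method. Insertions, which span no reference bases, would need a different comparison.

```python
from typing import NamedTuple


class Call(NamedTuple):
    chrom: str
    start: int   # 0-based, half-open interval on the reference
    end: int
    svtype: str  # e.g. "DEL"
    tech: str    # technology that produced the call


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Smaller of the two overlap fractions; 0.0 if the calls do not overlap."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def cluster_calls(calls, min_ro=0.5):
    """Greedily group calls that overlap a cluster's first call well enough."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start)):
        for cluster in clusters:
            rep = cluster[0]
            if (rep.chrom == call.chrom and rep.svtype == call.svtype
                    and reciprocal_overlap(rep, call) >= min_ro):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters


def multitech_clusters(clusters, min_techs=2):
    """Keep clusters supported by at least min_techs distinct technologies."""
    return [c for c in clusters if len({x.tech for x in c}) >= min_techs]
```

Because breakpoints and sizes are often imprecise, the overlap threshold trades off merging distinct nearby events against splitting one event into several clusters.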
Evolution of SV calls for AJ Trio
v0.2.0
• Only deletions
• Overlap and size-based clustering
• Output sites with multitech support
v0.3.0
• New calling methods
• Deletions and insertions
• Sequence-resolved calls
• Sequence-based clustering
• Output sites with multitech support
v0.4.0
• Include some single tech calls
• Evaluate read support to remove some false positives
• Add genotypes for trio
v0.5.0
• Better calling methods, especially for large insertions
• Include more single tech calls
• Add some phasing info
Future
• Resolve clusters of differing calls
• Improve phasing
• Add new data types
• Improve sequence resolution
• High-confidence regions
Integrating Sequence-resolved Calls >=20bp
• >1 million calls from 30+ sequence-resolved callsets from 4 techs for AJ Trio
• >500k unique sequence-resolved calls
• 38k INS and 37k DEL with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support
• 33k INS and 35k DEL genotyped by svviz in 1+ individuals (v0.5.0)
Draft SV calls for feedback: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/
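The "<20% different" criterion for predicted sequences can be illustrated with a normalized edit distance. The slides do not say which distance measure was used, so the Levenshtein distance divided by the longer sequence length below is an assumption, and both function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def sequences_match(a: str, b: str, max_diff: float = 0.20) -> bool:
    """True if the sequences differ by less than max_diff of the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) / longest < max_diff
```

This quadratic-time version is only practical for short sequences; multi-kb insertions would call for a banded or library-based aligner.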
Size histograms for v0.5.0
[Figure: size histograms of v0.5.0 calls. Red: simple calls in v0.5.0. Blue: differing nearby calls in v0.5.0. Annotated peaks: Alu and LINE.]
Evaluation/genotyping suite of methods
Current approaches
• svviz – maps reads to REF or ALT alleles
– Short, linked, and long reads
– Haplotype-separated reads
• BioNano – compare size predictions
• Nabsys – evaluates large deletions
Future approaches
• Paternal|maternal haplotypes for svviz
using whatshap
• Online manual curation of svviz, IGV,
dotplots, etc.
– Volunteers needed starting ~May 8!
• PCR-Sanger targeted sequencing
– Collaborations welcome!
Outstanding challenges and future work
• Large sequence-resolved insertions
• Somewhat fewer multi-kb
insertions than multi-kb deletions
• Much better than v0.4.0
• Dense calls
• ~1/3 v0.5.0 calls are within 1kb of
another v0.5.0 call
• Sequence-resolved insertion size
doesn’t always match BioNano
• Phasing will be important for
these (e.g., with 10X, whatshap)
• Calls with inaccurate or incomplete
sequence change
• Homozygous Reference calls
• Can we definitively state we call
all SVs in some regions?
• E.g., using diploid assembly?
• Benchmarking tool development
• How to compare SVs to a
benchmark?
• What performance metrics are
important?
• New tools in development at:
github.com/spiralgenetics/truvari
Proposed 2-tier call system
● Tier 1: Simple, sequence-resolved
○ v0.5.0 calls >49bp in size in HG002
○ Not within 1000bp of another >49bp call in HG002
○ ~14,000 calls
○ Benchmark variant type, breakpoint, size, sequence, genotype
● Tier 2: Confident SV but complex or no consensus sequence change
○ v0.5.0 calls that are within 1000bp of another >49bp call in HG002
■ ~6000 calls in ~2600 regions
○ Also analyze extra calls not tested as part of v0.5.0 process (not
discovered by 2+ techs or 4+ callsets and clustered)
■ ~9000 regions
○ Benchmark sensitivity to more challenging SVs
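The distance rule separating the two tiers can be sketched as follows. This covers only the ">49bp" and "within 1000bp" criteria; the `SV` record, the use of the gap between reference intervals as the distance, and treating a gap of exactly 1000bp as "within" are assumptions made for the example.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int  # reference interval; start == end for an insertion
    end: int
    size: int   # size of the sequence change in bp


def assign_tiers(calls, min_size=50, window=1000):
    """Map each call of at least min_size bp to tier 1 (isolated) or tier 2 (near another call)."""
    big = sorted((c for c in calls if c.size >= min_size),
                 key=lambda c: (c.chrom, c.start))
    tiers = {}
    max_prev_end = None
    for i, call in enumerate(big):
        if i == 0 or big[i - 1].chrom != call.chrom:
            max_prev_end = None  # new chromosome: forget earlier calls
        # Closest earlier call is the one reaching furthest right.
        near_prev = max_prev_end is not None and call.start - max_prev_end <= window
        # Closest later call is the next one in start order.
        nxt = big[i + 1] if i + 1 < len(big) else None
        near_next = (nxt is not None and nxt.chrom == call.chrom
                     and nxt.start - call.end <= window)
        tiers[call] = 2 if (near_prev or near_next) else 1
        max_prev_end = call.end if max_prev_end is None else max(max_prev_end, call.end)
    return tiers
```

Calls smaller than `min_size` are ignored entirely, so a 20bp call next to a large one does not push the large call into Tier 2.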
Using assemblies to develop high-confidence bed
1. Call variants from each assembly
2. Exclude regions around long read assembly variants not in
v0.5.0
3. Find regions for each assembly that are covered by 1 contig.
Remove repeats longer than 75% of N50 read length
4. Find the number of assemblies covering each region (e.g.,
using bedtools merge)
5. High confidence regions are regions in #4 covered by both
haplotypes in a diploid assembly or at least x assemblies minus
the regions in #2.
6. Subtract Tier 2 regions that don’t contain a Tier 1 call
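Steps 4 and 5 amount to interval arithmetic: count how many assemblies cover each stretch, keep stretches covered by at least x assemblies, then subtract the excluded regions from step 2. The sketch below does this in plain Python for one chromosome with half-open intervals, as a stand-in for a bedtools-based workflow. The function names are hypothetical, and the "both haplotypes in a diploid assembly" alternative is not modeled.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    out = []
    for s, e in sorted(intervals):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out


def coverage_depth(assembly_regions):
    """Sweep over interval endpoints, returning (start, end, n_assemblies) segments."""
    events = []
    for intervals in assembly_regions:
        for s, e in merge(intervals):  # each assembly counts once per base
            events.append((s, 1))
            events.append((e, -1))
    events.sort()
    segments, depth, prev = [], 0, None
    for pos, delta in events:
        if prev is not None and pos > prev and depth > 0:
            segments.append((prev, pos, depth))
        depth += delta
        prev = pos
    return segments


def subtract(regions, excluded):
    """Remove excluded intervals from regions."""
    excluded = merge(excluded)
    result = []
    for s, e in merge(regions):
        cur = s
        for xs, xe in excluded:
            if xe <= cur or xs >= e:
                continue  # no overlap with what remains of this region
            if xs > cur:
                result.append((cur, xs))
            cur = max(cur, xe)
        if cur < e:
            result.append((cur, e))
    return result


def high_confidence(assembly_regions, excluded, min_assemblies):
    """Regions covered by at least min_assemblies assemblies, minus exclusions."""
    covered = [(s, e) for s, e, d in coverage_depth(assembly_regions)
               if d >= min_assemblies]
    return subtract(covered, excluded)
```

With three assemblies covering `[(0, 1000)]`, `[(200, 800)]`, and `[(500, 1500)]`, a threshold of 2, and an excluded region `(600, 700)`, the result is `[(200, 600), (700, 1000)]`.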
Web-based manual curation tools
http://www.svcurator.com/
● Volunteers needed to help
us establish benchmarks!
● Learn about challenges in
SV calling
Credit:
Lesley Chapman
GIAB Developing New Data
• 10X Genomics
– Chinese trio now available
• PacBio Sequel of Chinese trio with
Mt Sinai
– Read insert N50: 16-18kb
– ~60x on son and ~30x on each
parent
– Also additional 30x on AJ
son/mother
– Data undergoing QC
• BioNano
– New DLS labeling method
• Complete Genomics/BGI
– stLFR linked reads
• Oxford Nanopore
– NIST/Birmingham/
Nottingham Ultra-long reads
• In progress
• Very preliminarily 80-90kb N50
– Max reads >1Mb!
• Current throughput may give
~30-40x total on AJ trio
• Strand-seq
– Collaboration with Korbel lab
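Several of the figures above are read-length N50 values. As a reminder of what that statistic measures, here is a minimal sketch: N50 is the length L such that reads of length L or longer contain at least half of all sequenced bases. The function name is illustrative.

```python
def n50(lengths):
    """Length L such that reads >= L contain at least half of all bases."""
    if not lengths:
        raise ValueError("n50 requires at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

Unlike the mean, N50 is weighted by bases, so a small number of very long reads can raise it substantially.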
ONT “Ultralong reads”
Noah Spies
David Catoe
Matt Loose
Nick Loman
Josh Quick
• So far…
• ~4x total mapped
• ~2x > 50kb
• ~1x > 100kb
• Plan initial release soon
• Estimated ~30x total in
2018
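Coverage figures such as "~2x > 50kb" can be approximated by summing the lengths of reads above each threshold and dividing by the genome size. The sketch below uses raw read lengths rather than mapped bases, and the genome size is whatever value the caller supplies (roughly 3.1 Gb would be a typical assumption for human); neither detail comes from the slides.

```python
def coverage_by_min_length(read_lengths, genome_size, thresholds=(0, 50_000, 100_000)):
    """Approximate fold coverage from reads longer than each threshold (in bp)."""
    return {
        t: sum(length for length in read_lengths if length > t) / genome_size
        for t in thresholds
    }
```

For instance, with reads of 10 kb, 60 kb, 120 kb, and 250 kb on a toy 100 kb genome, the totals are 440 kb overall, 430 kb above 50 kb, and 370 kb above 100 kb.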
New Samples
Additional ancestries
• Shorter term
– Use existing PGP individual samples
– Use existing integration pipeline
• Data-based selection
– Proportion of potential genomes from
different ancestries
• 3 to 8 new samples
• Longer term
– Recruit large family
– Recruit trios from other ancestry groups
Cancer samples
• Longer term
• Make PGP-consented tumor and
normal cell lines from same individual
• Select tumor with diversity of mutation
types
The road ahead...
2018
• Large variants
• Difficult small variants
• Phasing
2019
• Difficult small & large variants
• Somatic sample development
• Germline samples from new ancestries
2020+
• Diploid assembly
• Somatic structural variation
• Segmental duplications
• Centromere/telomere
• ...
Take-home Messages
• Genome in a Bottle is:
– “Open science”
– Authoritative characterization of human genomes
• Currently enable benchmarking of “easier” variants
– Clinical validation
– Technology development, optimization, and demonstration
• Now working on difficult variants and regions
– Draft variant calls >=20bp available and feedback requested
– Working on finalizing a tiered benchmark set >=50bp + confident regions
– New long and ultralong read data coming
– Many challenges remain and collaborations welcome!
Draft SV calls for feedback: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/
Acknowledgements
• NIST/JIMB
– Marc Salit
– Jenny McDaniel
– Lindsay Harris
– David Catoe
– Lesley Chapman
– Noah Spies
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
Latest small variant benchmark: https://doi.org/10.1101/281006
Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://doi.org/10.1101/270157
Public workshops
– Next workshop tentatively January 2019 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc
opportunities
available!
