Genome in a Bottle:
Integrating Multiple Technologies to Form Benchmark
Structural Variants
Justin Zook, on behalf of the GIAB Consortium
NIST Genome-Scale Measurements Group
Joint Initiative for Metrology in Biology (JIMB)
May 17, 2018
Take-home Messages
• Genome in a Bottle is:
– “Open science”
– Authoritative characterization of human genomes
• Currently enable benchmarking of “easier” variants
– Clinical validation
– Technology development, optimization, and demonstration
• Now working on difficult variants and regions
– Draft variant calls >=20bp available and feedback requested
– Working on finalizing a tiered benchmark set >=50bp + confident regions
– New long and ultralong read data coming
– Many challenges remain and collaborations welcome!
Why Genome in a Bottle?
• A map of every individual’s genome
will soon be possible, but how will
we know if it is correct?
• Diagnostics and precision medicine
require high levels of confidence
• Well-characterized, broadly
disseminated genomes are needed
to benchmark performance of
sequencing
• Open, transparent data/analyses
• Enable technology development,
optimization, and demonstration
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432
GIAB is evolving with technologies
2012
• No human benchmark calls available
• GIAB Consortium formed
2014
• Small variant genotypes for ~77% of pilot genome NA12878
2015
• NIST releases first human genome Reference Material
2016
• 4 new genomes
• Small variants for 90% of 5 genomes for GRCh37/38
2017+
• Characterizing difficult variants
• Develop tumor samples
GIAB has characterized 5 human genome RMs
• Pilot genome
– NA12878
• PGP Human Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
https://doi.org/10.1101/281006
Important characteristics of benchmark calls
What does “gold standard” mean?
1. Accurate
– high-confidence variants,
genotypes, haplotypes, and regions
– When results from any method are
compared to the benchmark, the
majority of differences (FPs/FNs)
are errors in the method
2. Representative examples
– Different types of variants in
different genome contexts
3. Comprehensive characterization
– Many examples of different variant
types/genome contexts
– Eventually, diploid assembly
benchmarking
GIAB “Open Science” Virtuous Cycle
• Users analyze GIAB Samples
• Benchmark vs. GIAB data
• Critical feedback to GIAB
• Integrate new methods
• New benchmark data
• Method development, optimization, and demonstration
• Part of assay validation
• GIAB/NIST expands to more difficult regions
Open consent enables secondary reference samples
• >30 products now available
based on broadly-consented,
well-characterized GIAB PGP cell
lines
• Genomic DNA + DNA spike-ins
– Clinical variants
– Somatic variants
– Difficult variants
• Clinical matrix (FFPE)
• Circulating tumor DNA
• Stem cells (iPSCs)
• Genome editing
All data and analyses are open and public
51 authors
14 institutions
12 datasets
7 genomes
Data described in ISA-tab
New data on GIAB NCBI FTP
Best Practices for Benchmarking Small Variants
https://github.com/ga4gh/benchmarking-tools
https://doi.org/10.1101/270157 https://precision.fda.gov/
• Describe public “Truth” VCFs with confident regions
• Enable stratification of performance in difficult regions
• Tools to compare different representations of complex variants
• Standardized VCF-I output of comparison tools
• Standardized definitions of performance metrics based on matching stringency
• Web-based interface for performance metrics
• Standardized output formats for performance metrics
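The performance metrics mentioned here are typically derived from counts of true positives, false positives, and false negatives. A minimal sketch in Python follows; the function name and the example counts are illustrative assumptions and are not taken from the GA4GH tools.

```python
import math


def performance_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from benchmark comparison counts.

    tp: query calls that match the benchmark
    fp: query calls inside confident regions with no benchmark match
    fn: benchmark calls that the query missed
    """
    # Guard against empty denominators by returning NaN.
    precision = tp / (tp + fp) if (tp + fp) else math.nan
    recall = tp / (tp + fn) if (tp + fn) else math.nan
    if math.isnan(precision) or math.isnan(recall) or (precision + recall) == 0:
        f1 = math.nan
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only (not real benchmark results).
metrics = performance_metrics(tp=90, fp=10, fn=30)
```

Counts should only include calls inside the confident regions, because calls outside them cannot be judged against the benchmark.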
What are we accessing and what is still challenging?
Type of variant | Genome context | Fraction of variants called* | Number of variants missing* | How to improve?
Simple SNPs | Not repetitive | ~97% | >100k | Machine learning
Simple indels | Not repetitive | ~93% | >10k | Machine learning
All variants | Low mappability | <30% | >170k | Use linked reads and long reads
All variants | Regions not in GRCh37/38 | 0 | >>100k??? | De novo assembly; long reads
Small indels | Tandem repeats and homopolymers | <50% | >200k | STR/homopolymer callers; long reads; better handle complex and compound variants
Indels 15-50bp | All | <25% | >30k | Assembly-based callers; integrate larger variants differently; long reads
Indels >50bp | All | <1% | >20k |
* Approximate values based on fraction of variants in GATKHC or FermiKit that are
inside v3.3.2 High-confidence regions
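The footnote describes the fraction of a callset's variants that fall inside the high-confidence regions. A minimal sketch of that calculation, assuming BED-style half-open, non-overlapping regions and zero-based variant positions (VCF positions would need to be converted first); all names here are hypothetical.

```python
from bisect import bisect_right


def build_index(regions):
    """Group (chrom, start, end) regions by chromosome, sorted by start."""
    index = {}
    for chrom, start, end in sorted(regions):
        starts, ends = index.setdefault(chrom, ([], []))
        starts.append(start)
        ends.append(end)
    return index


def in_regions(index, chrom, pos0):
    """True if the zero-based position lies in a half-open region [start, end)."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Find the last region that starts at or before the position.
    i = bisect_right(starts, pos0) - 1
    return i >= 0 and pos0 < ends[i]


def fraction_inside(regions, variants):
    """Fraction of (chrom, pos0) variants that fall inside the regions."""
    if not variants:
        return float("nan")
    index = build_index(regions)
    inside = sum(in_regions(index, chrom, pos) for chrom, pos in variants)
    return inside / len(variants)
```

For example, with regions `[("chr1", 100, 200)]`, a variant at zero-based position 150 counts as inside, while one at position 200 does not because the end coordinate is exclusive.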
Integration of diverse data types and analyses
• Data publicly available
– Deep short reads
– Linked reads
– Long reads
– Optical/nanopore mapping
• Analyses
– Small variant calling
– SV calling
– Local and global assembly
1. Discover & refine sequence-resolved calls from multiple datasets & analyses
2. Compare variant and genotype calls from different methods
3. Evaluate/genotype calls with other data
4. Identify features associated with reliability of calls from each method
5. Form benchmark calls using heuristics & machine learning
6. Compare benchmarks to high-quality callsets and examine differences
How can we extend our approach to structural
variants?
Similarities to small variants
• Collect callsets from multiple
technologies
• Compare callsets to find calls
supported by multiple technologies
Differences from small variants
• Callsets have limited sensitivity
• Variants are often imprecisely
characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly
standardized, especially when complex
• Comparison tools in infancy
Evolution of SV calls for AJ Trio
v0.2.0
• Only deletions
• Overlap and size-based clustering
• Output sites with multitech support
v0.3.0
• New calling methods
• Deletions and insertions
• Sequence-resolved calls
• Sequence-based clustering
• Output sites with multitech support
v0.4.0
• Include some single tech calls
• Evaluate read support to remove some false positives
• Add genotypes for trio
v0.5.0
• Better calling methods, especially for large insertions
• Include more single tech calls
• Add some phasing info
Future
• Resolve clusters of differing calls
• Improve phasing
• Add new data types
• Improve sequence resolution
• High-confidence regions
Integrating Sequence-resolved Calls >=20bp
• >1 million calls from 30+ sequence-resolved callsets from 4 techs for AJ Trio
• >500k unique sequence-resolved calls
• 38k INS and 37k DEL with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support
• 33k INS and 35k DEL genotyped by svviz in 1+ individuals
• v0.5.0
Draft SV calls for feedback: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/
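The support rule on this slide (2+ technologies or 5+ callers predicting sequences less than 20% different) can be sketched as follows. This is a hedged illustration only: the slides do not say how sequence difference was measured, so normalized edit distance is an assumption here, the BioNano/Nabsys branch is omitted, and all function and field names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def fraction_different(a: str, b: str) -> float:
    """Edit distance divided by the longer sequence length (assumed measure)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def is_supported(anchor, others, max_diff=0.20, min_techs=2, min_callers=5):
    """Check whether similar calls come from enough technologies or callers.

    Each call is a dict with the hypothetical keys "tech", "caller", "seq".
    """
    similar = [anchor] + [c for c in others
                          if fraction_different(anchor["seq"], c["seq"]) < max_diff]
    techs = {c["tech"] for c in similar}
    callers = {c["caller"] for c in similar}
    return len(techs) >= min_techs or len(callers) >= min_callers
```

Comparing every call against one anchor keeps the sketch short; a full integration would need a clustering step so that the result does not depend on which call is chosen as the anchor.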
Size histograms for v0.5.0
Red - simple calls in v0.5.0
Blue – differing nearby calls in v0.5.0
Histogram peaks labeled: Alu, LINE
Evaluation/genotyping suite of methods
Current approaches
• svviz – maps reads to REF or ALT alleles
– Short, linked, and long reads
– Haplotype-separated reads
• BioNano – compare size predictions
• Nabsys – evaluates large deletions
Future approaches
• Paternal|maternal haplotypes for svviz
using whatshap
• Online manual curation of svviz, IGV,
dotplots, etc.
– Volunteers needed starting ~May 8!
• PCR-Sanger targeted sequencing
– Collaborations welcome!
Outstanding challenges and future work
• Large sequence-resolved insertions
• Somewhat fewer multi-kb
insertions than multi-kb deletions
• Much better than v0.4.0
• Dense calls
• ~1/3 v0.5.0 calls are within 1kb of
another v0.5.0 call
• Sequence-resolved insertion size
doesn’t always match BioNano
• Phasing will be important for
these (e.g., with 10X, whatshap)
• Calls with inaccurate or incomplete
sequence change
• Homozygous Reference calls
• Can we definitively state we call
all SVs in some regions?
• E.g., using diploid assembly?
• Benchmarking tool development
• How to compare SVs to a
benchmark?
• What performance metrics are
important?
• New tools in development at:
github.com/spiralgenetics/truvari
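To make the comparison question concrete, here is one possible matching rule, sketched in Python. It is not the algorithm used by truvari or any other tool: the thresholds (500 bp, 0.7 size ratio), the `SV` record, and the greedy one-to-one matching are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    pos: int      # start position
    svtype: str   # e.g. "DEL" or "INS"
    size: int     # length of the sequence change in bp


def matches(a: SV, b: SV, max_dist: int = 500, min_size_ratio: float = 0.7) -> bool:
    """Hypothetical rule: same chromosome and type, nearby start, similar size."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    largest = max(a.size, b.size)
    return largest > 0 and min(a.size, b.size) / largest >= min_size_ratio


def compare(benchmark, query, **thresholds):
    """Greedy one-to-one matching; returns TP, FN, and FP counts."""
    unmatched = list(query)
    tp = 0
    for b in benchmark:
        for q in unmatched:
            if matches(b, q, **thresholds):
                unmatched.remove(q)  # each query call may match only once
                tp += 1
                break
    return {"TP": tp, "FN": len(benchmark) - tp, "FP": len(unmatched)}
```

Even this simple rule shows why metric definitions matter: loosening `max_dist` or `min_size_ratio` turns false positives and false negatives into true positives without any change in the calls themselves.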
Proposed 2-tier call system
● Tier 1: Simple, sequence-resolved
○ v0.5.0 calls >49bp in size in HG002
○ Not within 1000bp of another >49bp call in HG002
○ ~14,000 calls
○ Benchmark variant type, breakpoint, size, sequence, genotype
● Tier 2: Confident SV but complex or no consensus sequence change
○ v0.5.0 calls that are within 1000bp of another >49bp call in HG002
■ ~6000 calls in ~2600 regions
○ Also analyze extra calls not tested as part of v0.5.0 process (not
discovered by 2+ techs or 4+ callsets and clustered)
■ ~9000 regions
○ Benchmark sensitivity to more challenging SVs
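The size and distance criteria for the two tiers can be sketched as below. This is a simplified illustration: it assumes "within 1000bp" means the gap between call intervals is at most 1000 bp, it ignores the extra Tier 2 calls that were not part of the v0.5.0 process, and the dict keys are hypothetical.

```python
def interval_gap(a_start, a_end, b_start, b_end):
    """Gap in bp between two half-open intervals; 0 if they overlap or touch."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def assign_tiers(calls, min_size=50, window=1000):
    """Return (call, tier) pairs for calls of at least min_size bp.

    Each call is a dict with the hypothetical keys "chrom", "start", "end",
    and "size". Tier 1: no other large call within `window` bp.
    Tier 2: at least one other large call within `window` bp.
    """
    large = [c for c in calls if c["size"] >= min_size]
    tiered = []
    for c in large:
        crowded = any(
            o is not c
            and o["chrom"] == c["chrom"]
            and interval_gap(c["start"], c["end"], o["start"], o["end"]) <= window
            for o in large
        )
        tiered.append((c, 2 if crowded else 1))
    return tiered
```

The pairwise check is quadratic in the number of calls, which is acceptable for a sketch; sorting by position or using an interval tree would be the usual choice for tens of thousands of calls.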
Using assemblies to develop high-confidence bed
1. Call variants from each assembly
2. Exclude regions around long read assembly variants not in
v0.5.0
3. Find regions for each assembly that are covered by 1 contig.
Remove repeats longer than 75% of N50 read length
4. Find the number of assemblies covering each region (e.g.,
using bedtools merge)
5. High confidence regions are regions in #4 covered by both
haplotypes in a diploid assembly or at least x assemblies minus
the regions in #2.
6. Subtract Tier 2 regions that don’t contain a Tier 1 call
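Steps 4 and 5 above amount to interval arithmetic: count how many assemblies cover each position, keep regions covered by at least x of them, then subtract the excluded regions. A minimal pure-Python sketch for a single chromosome follows; it leaves out the diploid-assembly condition, and the function names are hypothetical stand-ins for what would normally be done with BED tooling.

```python
def coverage_at_least(region_sets, min_count):
    """Intervals covered by at least min_count assemblies.

    region_sets: one list per assembly of half-open (start, end) intervals
    on a single chromosome, non-overlapping within each assembly.
    """
    events = []
    for regions in region_sets:
        for start, end in regions:
            events.append((start, 1))
            events.append((end, -1))
    # At equal positions, process starts before ends so abutting regions join.
    events.sort(key=lambda e: (e[0], -e[1]))
    out, depth, open_at = [], 0, None
    for pos, delta in events:
        new_depth = depth + delta
        if depth < min_count <= new_depth:
            open_at = pos
        elif new_depth < min_count <= depth:
            if pos > open_at:  # skip zero-length intervals
                out.append((open_at, pos))
        depth = new_depth
    return out


def subtract(regions, excluded):
    """Remove excluded intervals from sorted, non-overlapping regions."""
    excluded = sorted(excluded)
    out = []
    for start, end in regions:
        cur = start
        for xs, xe in excluded:
            if xe <= cur or xs >= end:
                continue
            if xs > cur:
                out.append((cur, xs))
            cur = max(cur, xe)
            if cur >= end:
                break
        if cur < end:
            out.append((cur, end))
    return out
```

For example, `subtract(coverage_at_least(sets, 2), excluded)` would give candidate regions covered by at least two assemblies with the step 2 exclusions removed.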
Web-based manual curation tools
http://www.svcurator.com/
● Volunteers needed to help
us establish benchmarks!
● Learn about challenges in
SV calling
Credit:
Lesley Chapman
GIAB Developing New Data
• 10X Genomics
– Chinese trio now available
• PacBio Sequel of Chinese trio with
Mt Sinai
– Read insert N50: 16-18kb
– ~60x on son and ~30x on each
parent
– Also additional 30x on AJ
son/mother
– Data undergoing QC
• BioNano
– New DLS labeling method
• Complete Genomics/BGI
– stLFR linked reads
• Oxford Nanopore
– NIST/Birmingham/
Nottingham Ultra-long reads
• In progress
• Very preliminarily 80-90kb N50
– Max reads >1Mb!
• Current throughput may give
~30-40x total on AJ trio
• Strand-seq
– Collaboration with Korbel lab
ONT “Ultralong reads”
Noah Spies
David Catoe
Matt Loose
Nick Loman
Josh Quick
• So far…
• ~4x total mapped
• ~2x > 50kb
• ~1x > 100kb
• Plan initial release soon
• Estimated ~30x total in
2018
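Read-length summaries like the N50 values and the "~2x > 50kb" coverage figures can be computed from a list of read lengths. A minimal sketch, assuming raw read lengths rather than mapped bases (the slide reports mapped coverage) and a caller-supplied genome size:

```python
def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def coverage_above(lengths, min_length, genome_size):
    """Fold coverage contributed by reads longer than min_length."""
    return sum(length for length in lengths if length > min_length) / genome_size
```

With read lengths `[10, 20, 30, 40]`, `n50` returns 30 because the two longest reads (40 + 30) already hold at least half of the 100 total bases.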
New Samples
Additional ancestries
• Shorter term
– Use existing PGP individual samples
– Use existing integration pipeline
• Data-based selection
– Proportion of potential genomes from
different ancestries
• 3 to 8 new samples
• Longer term
– Recruit large family
– Recruit trios from other ancestry groups
Cancer samples
• Longer term
• Make PGP-consented tumor and
normal cell lines from same individual
• Select tumor with diversity of mutation
types
The road ahead...
2018
• Large variants
• Difficult small variants
• Phasing
2019
• Difficult small & large variants
• Somatic sample development
• Germline samples from new ancestries
2020+
• Diploid assembly
• Somatic structural variation
• Segmental duplications
• Centromere/telomere
• ...
Take-home Messages
• Genome in a Bottle is:
– “Open science”
– Authoritative characterization of human genomes
• Currently enable benchmarking of “easier” variants
– Clinical validation
– Technology development, optimization, and demonstration
• Now working on difficult variants and regions
– Draft variant calls >=20bp available and feedback requested
– Working on finalizing a tiered benchmark set >=50bp + confident regions
– New long and ultralong read data coming
– Many challenges remain and collaborations welcome!
Draft SV calls for feedback: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/
Acknowledgements
• NIST/JIMB
– Marc Salit
– Jenny McDaniel
– Lindsay Harris
– David Catoe
– Lesley Chapman
– Noah Spies
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
Latest small variant benchmark: https://doi.org/10.1101/281006
Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://doi.org/10.1101/270157
Public workshops
– Next workshop tentatively January 2019 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc opportunities available!

  • 2. Take-home Messages • Genome in a Bottle is: – “Open science” – Authoritative characterization of human genomes • Currently enable benchmarking of “easier” variants – Clinical validation – Technology development, optimization, and demonstration • Now working on difficult variants and regions – Draft variant calls >=20bp available and feedback requested – Working on finalizing a tiered benchmark set >=50bp + confident regions – New long and ultralong read data coming – Many challenges remain and collaborations welcome!
  • 3. Why Genome in a Bottle? • A map of every individual’s genome will soon be possible, but how will we know if it is correct? • Diagnostics and precision medicine require high levels of confidence • Well-characterized, broadly disseminated genomes are needed to benchmark performance of sequencing • Open, transparent data/analyses • Enable technology development, optimization, and demonstration O’Rawe et al, Genome Medicine, 2013 https://doi.org/10.1186/gm432
  • 4. GIAB is evolving with technologies 2012 • No human benchmark calls available • GIAB Consortium formed 2014 • Small variant genotypes for ~77% of pilot genome NA12878 2015 • NIST releases first human genome Reference Material 2016 • 4 new genomes • Small variants for 90% of 5 genomes for GRCh37/38 2017+ • Characterizing difficult variants • Develop tumor samples
  • 5. GIAB has characterized 5 human genome RMs • Pilot genome – NA12878 • PGP Human Genomes – Ashkenazi Jewish son – Ashkenazi Jewish trio – Chinese son • Parents also characterized National Institute of Standards & Technology Report of Investigation Reference Material 8391 Human DNA for Whole-Genome Variant Assessment (Son of Eastern European Ashkenazim Jewish Ancestry) This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer (10 mM TRIS, 1 mM EDTA, pH 8.0). This material is intended for assessing performance of human genome sequencing variant calling by obtaining estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods similar to described in reference 1. An information value is considered to be a value that will be of interest and use to the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available. These data and genomic characterizations will be maintained over time as new data accrue and measurement and informatics methods become available. The information values are given as a variant call file (vcf) that contains the high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information (NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is: https://doi.org/10.1101/281006
  • 6. Important characteristics of benchmark calls What does “gold standard” mean? 1. Accurate – high-confidence variants, genotypes, haplotypes, and regions – When results from any method are compared to the benchmark, the majority of differences (FPs/FNs) are errors in the method 2. Representative examples – Different types of variants in different genome contexts 3. Comprehensive characterization – Many examples of different variant types/genome contexts – Eventually, diploid assembly benchmarking
  • 7. GIAB “Open Science” Virtuous Cycle Users analyze GIAB Samples Benchmark vs. GIAB data Critical feedback to GIAB Integrate new methods New benchmark data Method development, optimization, and demonstration Part of assay validation GIAB/NIST expands to more difficult regions
  • 8. Open consent enables secondary reference samples • >30 products now available based on broadly-consented, well-characterized GIAB PGP cell lines • Genomic DNA + DNA spike-ins – Clinical variants – Somatic variants – Difficult variants • Clinical matrix (FFPE) • Circulating tumor DNA • Stem cells (iPSCs) • Genome editing
  • 9. All data and analyses are open and public 51 authors 14 institutions 12 datasets 7 genomes Data described in ISA-tab New data on GIAB NCBI FTP
  • 10. Best Practices for Benchmarking Small Variants https://github.com/ga4gh/benchmarking-tools https://doi.org/10.1101/270157 https://precision.fda.gov/ Describe public “Truth” VCFs with confident regions Enable stratification of performance in difficult regions Tools to compare different representations of complex variants Standardized VCF-I output of comparison tools Standardized definitions of performance metrics based on matching stringency Web-based interface for performance metrics Standardized output formats for performance metrics
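The performance metrics mentioned on slide 10 are typically derived from true-positive (TP), false-positive (FP), and false-negative (FN) counts. The following is a minimal Python sketch under the simplest definitions, recall = TP/(TP+FN) and precision = TP/(TP+FP). The function name `benchmark_metrics` is hypothetical and not part of the GA4GH tools, which apply more detailed matching rules than shown here.

```python
import math


def benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    """Return recall, precision, and F1 from benchmark comparison counts.

    Assumption: one TP count is shared by truth and query; real comparison
    tools may count matches on each side separately.
    """
    # Recall: fraction of benchmark variants that the method found.
    recall = tp / (tp + fn) if (tp + fn) else math.nan
    # Precision: fraction of the method's calls that match the benchmark.
    precision = tp / (tp + fp) if (tp + fp) else math.nan
    # F1: harmonic mean of precision and recall (NaN when undefined).
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else math.nan
    return {"recall": recall, "precision": precision, "f1": f1}


metrics = benchmark_metrics(tp=90, fp=10, fn=30)
print(metrics)
```

These counts are only meaningful inside the benchmark's confident regions, since calls outside them cannot be classified as errors.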
  • 11. What are we accessing and what is still challenging?
      Type of variant | Genome context | Fraction of variants called* | Number of variants missing* | How to improve?
      Simple SNPs | Not repetitive | ~97% | >100k | Machine learning
      Simple indels | Not repetitive | ~93% | >10k | Machine learning
      All variants | Low mappability | <30% | >170k | Use linked reads and long reads
      All variants | Regions not in GRCh37/38 | 0 | >>100k??? | De novo assembly; long reads
      Small indels | Tandem repeats and homopolymers | <50% | >200k | STR/homopolymer callers; long reads; better handle complex and compound variants
      Indels 15-50bp | All | <25% | >30k | Assembly-based callers; integrate larger variants differently; long reads
      Indels >50bp | All | <1% | >20k |
      * Approximate values based on fraction of variants in GATKHC or FermiKit that are inside v3.3.2 High-confidence regions
  • 12. Integration of diverse data types and analyses • Data publicly available – Deep short reads – Linked reads – Long reads – Optical/nanopore mapping • Analyses – Small variant calling – SV calling – Local and global assembly Discover & Refine sequence-resolved calls from multiple datasets & analyses Compare variant and genotype calls from different methods Evaluate/genotype calls with other data Identify features associated with reliability of calls from each method Form benchmark calls using heuristics & machine learning Compare benchmarks to high-quality callsets and examine differences
  • 13. How can we extend our approach to structural variants? Similarities to small variants • Collect callsets from multiple technologies • Compare callsets to find calls supported by multiple technologies Differences from small variants • Callsets have limited sensitivity • Variants are often imprecisely characterized – breakpoints, size, type, etc. • Representation of variants is poorly standardized, especially when complex • Comparison tools in infancy
  • 14. Evolution of SV calls for AJ Trio v0.2.0 • Only deletions • Overlap and size-based clustering • Output sites with multitech support v0.3.0 • New calling methods • Deletions and insertions • Sequence-resolved calls • Sequence-based clustering • Output sites with multitech support v0.4.0 • Include some single tech calls • Evaluate read support to remove some false positives • Add genotypes for trio v0.5.0 • Better calling methods, especially for large insertions • Include more single tech calls • Add some phasing info Future • Resolve clusters of differing calls • Improve phasing • Add new data types • Improve sequence resolution • High-confidence regions
  • 15. Integrating Sequence-resolved Calls >=20bp >1 million calls from 30+ sequence-resolved callsets from 4 techs for AJ Trio >500k unique sequence-resolved calls 38k INS and 37k DEL with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support 33k INS and 35k DEL genotyped by svviz in 1+ individuals v0.5.0 Draft SV calls for feedback: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/
  • 16. Size histograms for v0.5.0 Red - simple calls in v0.5.0 Blue – differing nearby calls in v0.5.0 Alu Alu LINE LINE
  • 17. Evaluation/genotyping suite of methods Current approaches • svviz – maps reads to REF or ALT alleles – Short, linked, and long reads – Haplotype-separated reads • BioNano – compare size predictions • Nabsys – evaluates large deletions Future approaches • Paternal|maternal haplotypes for svviz using whatshap • Online manual curation of svviz, IGV, dotplots, etc. – Volunteers needed starting ~May 8! • PCR-Sanger targeted sequencing – Collaborations welcome!
  • 18. Outstanding challenges and future work • Large sequence-resolved insertions • Somewhat fewer multi-kb insertions than multi-kb deletions • Much better than v0.4.0 • Dense calls • ~1/3 v0.5.0 calls are within 1kb of another v0.5.0 call • Sequence-resolved insertion size doesn’t always match BioNano • Phasing will be important for these (e.g., with 10X, whatshap) • Calls with inaccurate or incomplete sequence change • Homozygous Reference calls • Can we definitively state we call all SVs in some regions? • E.g., using diploid assembly? • Benchmarking tool development • How to compare SVs to a benchmark? • What performance metrics are important? • New tools in development at: github.com/spiralgenetics/truvari
  • 19. Proposed 2-tier call system ● Tier 1: Simple, sequence-resolved ○ v0.5.0 calls >49bp in size in HG002 ○ Not within 1000bp of another >49bp call in HG002 ○ ~14,000 calls ○ Benchmark variant type, breakpoint, size, sequence, genotype ● Tier 2: Confident SV but complex or no consensus sequence change ○ v0.5.0 calls that are within 1000bp of another >49bp call in HG002 ■ ~6000 calls in ~2600 regions ○ Also analyze extra calls not tested as part of v0.5.0 process (not discovered by 2+ techs or 4+ callsets and clustered) ■ ~9000 regions ○ Benchmark sensitivity to more challenging SVs
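The size and proximity rule on slide 19 can be sketched in Python as follows. This is an illustration under stated assumptions, not the GIAB pipeline: the `SVCall` record, the `gap` and `assign_tiers` helpers, and the reading of "within 1000bp" as a reference-coordinate gap of at most 1000 bp are all hypothetical. The sketch covers only the first Tier 2 criterion (nearby calls), not the extra calls outside the v0.5.0 process.

```python
from typing import List, NamedTuple, Tuple


class SVCall(NamedTuple):
    chrom: str
    start: int  # reference start of the call
    end: int    # reference end (equal to start for a pure insertion)
    size: int   # absolute size of the sequence change, in bp


def gap(a: SVCall, b: SVCall) -> int:
    """Distance in bp between two reference intervals (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def assign_tiers(calls: List[SVCall], min_size: int = 50,
                 window: int = 1000) -> List[Tuple[SVCall, int]]:
    """Label each call of at least min_size bp as Tier 1 (isolated) or Tier 2.

    A call is Tier 2 when another call of at least min_size bp lies on the
    same chromosome within `window` bp; smaller calls are ignored entirely.
    The pairwise scan is quadratic, which is fine for a sketch only.
    """
    large = [c for c in calls if c.size >= min_size]
    tiered = []
    for c in large:
        crowded = any(
            o is not c and o.chrom == c.chrom and gap(c, o) <= window
            for o in large
        )
        tiered.append((c, 2 if crowded else 1))
    return tiered


calls = [
    SVCall("1", 1000, 1100, 100),    # 400 bp from the next call
    SVCall("1", 1500, 1500, 300),    # insertion close to the deletion above
    SVCall("1", 10000, 10060, 60),   # only a <50bp call nearby
    SVCall("1", 10500, 10530, 30),   # below the size cutoff, not tiered
    SVCall("2", 1000, 1200, 200),    # alone on its chromosome
]
print([tier for _, tier in assign_tiers(calls)])  # prints [2, 2, 1, 1]
```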
  • 20. Using assemblies to develop high-confidence bed 1. Call variants from each assembly 2. Exclude regions around long read assembly variants not in v0.5.0 3. Find regions for each assembly that are covered by 1 contig. Remove repeats longer than 75% of N50 read length 4. Find the number of assemblies covering each region (e.g., using bedtools merge) 5. High confidence regions are regions in #4 covered by both haplotypes in a diploid assembly or at least x assemblies minus the regions in #2. 6. Subtract Tier 2 regions that don’t contain a Tier 1 call
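Steps 4 and 5 on slide 20 amount to counting how many assemblies cover each position and then subtracting excluded regions. The slide suggests bedtools for this; the pure-Python sketch below only illustrates the logic for a single chromosome. The function names `regions_with_min_support` and `subtract` are hypothetical. Intervals are assumed to be half-open (BED style) and non-overlapping within each assembly, and only the "at least x assemblies" branch of step 5 is covered, not the diploid-haplotype branch.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED


def regions_with_min_support(per_assembly: List[List[Interval]],
                             min_count: int) -> List[Interval]:
    """Return maximal regions covered by at least min_count assemblies."""
    events = []
    for intervals in per_assembly:
        for start, end in intervals:
            events.append((start, 1))
            events.append((end, -1))
    # Process starts before ends at the same position so that abutting
    # intervals do not cause a spurious dip in coverage depth.
    events.sort(key=lambda e: (e[0], -e[1]))
    out, depth, region_start = [], 0, 0
    for pos, delta in events:
        prev, depth = depth, depth + delta
        if prev < min_count <= depth:
            region_start = pos
        elif depth < min_count <= prev and pos > region_start:
            out.append((region_start, pos))  # skip zero-length regions
    return out


def subtract(regions: List[Interval], excluded: List[Interval]) -> List[Interval]:
    """Remove every excluded interval from each region."""
    out = []
    for start, end in regions:
        cur = start
        for ex_start, ex_end in sorted(excluded):
            if ex_end <= cur or ex_start >= end:
                continue  # no overlap with what remains of this region
            if ex_start > cur:
                out.append((cur, ex_start))
            cur = max(cur, ex_end)
        if cur < end:
            out.append((cur, end))
    return out


assemblies = [
    [(0, 200), (300, 500)],  # assembly A, single-contig regions
    [(100, 400)],            # assembly B
    [(150, 450)],            # assembly C
]
supported = regions_with_min_support(assemblies, min_count=2)
print(supported)                          # prints [(100, 450)]
print(subtract(supported, [(180, 220)]))  # prints [(100, 180), (220, 450)]
```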
  • 21. Web-based manual curation tools http://www.svcurator.com/ ● Volunteers needed to help us establish benchmarks! ● Learn about challenges in SV calling Credit: Lesley Chapman
  • 22. GIAB Developing New Data • 10X Genomics – Chinese trio now available • PacBio Sequel of Chinese trio with Mt Sinai – Read insert N50: 16-18kb – ~60x on son and ~30x on each parent – Also additional 30x on AJ son/mother – Data undergoing QC • BioNano – New DLS labeling method • Complete Genomics/BGI – stLFR linked reads • Oxford Nanopore – NIST/Birmingham/Nottingham Ultra-long reads • In progress • Very preliminarily 80-90kb N50 – Max reads >1Mb! • Current throughput may give ~30-40x total on AJ trio • Strand-seq – Collaboration with Korbel lab
  • 23. ONT “Ultralong reads” Noah Spies David Catoe Matt Loose Nick Loman Josh Quick • So far… • ~4x total mapped • ~2x > 50kb • ~1x > 100kb • Plan initial release soon • Estimated ~30x total in 2018
  • 24. New Samples Additional ancestries • Shorter term – Use existing PGP individual samples – Use existing integration pipeline • Data-based selection – Proportion of potential genomes from different ancestries • 3 to 8 new samples • Longer term – Recruit large family – Recruit trios from other ancestry groups Cancer samples • Longer term • Make PGP-consented tumor and normal cell lines from same individual • Select tumor with diversity of mutation types
  • 25. The road ahead... 2018 • Large variants • Difficult small variants • Phasing 2019 • Difficult small & large variants • Somatic sample development • Germline samples from new ancestries 2020+ • Diploid assembly • Somatic structural variation • Segmental duplications • Centromere/telomere • ...
  • 26. Take-home Messages • Genome in a Bottle is: – “Open science” – Authoritative characterization of human genomes • Currently enable benchmarking of “easier” variants – Clinical validation – Technology development, optimization, and demonstration • Now working on difficult variants and regions – Draft variant calls >=20bp available and feedback requested – Working on finalizing a tiered benchmark set >=50bp + confident regions – New long and ultralong read data coming – Many challenges remain and collaborations welcome! Draft SV calls for feedback: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/
  • 27. Acknowledgements • NIST/JIMB – Marc Salit – Jenny McDaniel – Lindsay Harris – David Catoe – Lesley Chapman – Noah Spies • Genome in a Bottle Consortium • GA4GH Benchmarking Team • FDA
  • 28. For More Information www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group github.com/genome-in-a-bottle – Guide to GIAB data & ftp www.slideshare.net/genomeinabottle Latest small variant benchmark: https://doi.org/10.1101/281006 Data: – http://www.nature.com/articles/sdata201625 – ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ Global Alliance Benchmarking Team – https://github.com/ga4gh/benchmarking-tools – Web-based implementation at precision.fda.gov – Best Practices at https://doi.org/10.1101/270157 Public workshops – Next workshop tentatively January 2019 at Stanford University, CA, USA Justin Zook: jzook@nist.gov NIST postdoc opportunities available!