Genome in a Bottle:
Developing benchmark sets for large indels and
structural variants
Justin Zook, Marc Salit, and the GIAB Consortium
NIST Genome-Scale Measurements Group
Joint Initiative for Metrology in Biology (JIMB)
Oct 16, 2017
Take-home Messages
• Genome in a Bottle is authoritatively characterizing human
genomes
• Current characterization enables benchmarking of “easier”
variants/regions in germline genomes
– Clinical validation
– Technology development, optimization, and demonstration
• Now working on difficult variants and regions
– Draft variant calls >=20bp available and feedback requested
– Many challenges remain and collaborations welcome!
Why are we doing this?
• Technologies evolving rapidly
• Different sequencing and
bioinformatics methods give
different results
• Now have concordance in easy
regions, but not in difficult
regions
• Challenge:
– How do we characterize 6 billion
bases in the genome with high
confidence?
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432
GIAB is evolving
2012
• No human benchmark calls available
• GIAB Consortium formed
2014
• Small variant genotypes for ~77% of pilot genome NA12878
2015
• NIST releases first human genome Reference Material
2016
• 4 new genomes
• Small variants for 90% of 5 genomes for GRCh37/38
2017+
• Characterizing difficult variants
Genome in a Bottle Consortium
Authoritative Characterization of Human Genomes
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference materials to
evaluate performance
• GIAB is developing:
– Reference materials
– Reference data
– Methods
– Tools to calculate performance
metrics
Generic measurement process
www.slideshare.net/genomeinabottle
Bringing Principles of Metrology
to the Genome
• Reference materials
– DNA in a tube from NIST
• Extensive state-of-the-art
characterization
• “Upgradable” as technology
develops
• Commercial innovation
– PGP genomes suitable for
commercial derived products
• Benchmarking tools and software
– with GA4GH
• Enhance new technologies
GIAB has characterized 5 human genome RMs
• Pilot genome
– NA12878
• PGP Human Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
Integration of diverse data types and analyses
• Data publicly available
– Deep short reads
– Linked reads
– Long reads
– Optical/nanopore mapping
• Analyses
– Small variant calling
– SV calling
– Local and global assembly
• Discover & refine sequence-resolved calls from multiple datasets & analyses
• Compare variant and genotype calls from different methods
• Evaluate/genotype calls with other data
• Identify features associated with reliability of calls from each method
• Form benchmark calls using heuristics & machine learning
• Compare benchmarks to high-quality callsets and examine differences
Paper describing data…
51 authors
14 institutions
12 datasets
7 genomes
Data described in ISA-tab
Evolution of high-confidence small variants
Calls | HC Regions | HC Calls | HC indels | Concordant with PG | NIST-only in beds | PG-only in beds | PG-only | Variants Phased
v2.19 | 2.22 Gb | 3153247 | 352937 | 3030703 | 87 | 404 | 1018795 | 0.3%
v3.2.2 | 2.53 Gb | 3512990 | 335594 | 3391783 | 57 | 52 | 657715 | 3.9%
v3.3 | 2.57 Gb | 3566076 | 358753 | 3441361 | 40 | 60 | 608137 | 8.8%
v3.3.2 | 2.58 Gb | 3691156 | 487841 | 3529641 | 47 | 61 | 469202 | 99.6%
5-7 errors in NIST
1-7 errors in NIST
~2 FPs and ~2 FNs per million NIST variants in PG and NIST bed files
Global Alliance for Genomics and Health Benchmarking Task
Team
• Developed standardized
definitions for performance
metrics like TP, FP, and FN.
• Developing sophisticated
benchmarking tools
• Integrated into a single framework
with standardized inputs and
outputs
• Standardized bed files with
difficult genome contexts for
stratification
https://github.com/ga4gh/benchmarking-tools
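As a rough illustration of how TP, FP, and FN counts turn into summary metrics, the sketch below computes precision, recall, and F1. It is a minimal example under simplifying assumptions: the `Counts` class and function names are hypothetical, and a single TP count is used for both precision and recall, whereas a benchmarking tool may count matches against the truth set and the query set separately.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical container for benchmark comparison counts."""
    tp: int  # calls that match the benchmark
    fp: int  # calls inside benchmark regions that do not match
    fn: int  # benchmark variants that were not called


def precision(c: Counts) -> float:
    """Fraction of query calls that are correct: TP / (TP + FP)."""
    total = c.tp + c.fp
    return c.tp / total if total else float("nan")


def recall(c: Counts) -> float:
    """Fraction of benchmark variants that were found: TP / (TP + FN)."""
    total = c.tp + c.fn
    return c.tp / total if total else float("nan")


def f1(c: Counts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else float("nan")


if __name__ == "__main__":
    example = Counts(tp=90, fp=10, fn=30)  # made-up numbers
    print(precision(example), recall(example), f1(example))
```

Stratifying by genome context then amounts to computing these same metrics on the subset of variants that fall inside each stratification bed file.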
Variant types can change when decomposing
or recomposing variants:
Complex variant:
chr1 201586350 CTCTCTCTCT CA
DEL + SNP:
chr1 201586350 CTCTCTCTCT C
chr1 201586359 T A
Credit: Peter Krusche, Illumina
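One way to see why comparison tools need to look past the written representation is to apply each representation to the reference and compare the resulting haplotype sequences. The sketch below does this on a made-up 12-base toy reference with 0-based coordinates; the sequence, coordinates, and function names are illustrative assumptions, not the chr1 locus above and not the internals of any particular tool.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str]  # (0-based start, REF allele, ALT allele)


def classify(ref: str, alt: str) -> str:
    """Assign a coarse variant type from the REF and ALT alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "DEL"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "INS"
    return "COMPLEX"


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Build the haplotype produced by non-overlapping variants."""
    pieces = []
    cursor = 0
    for start, ref, alt in sorted(variants):
        if start < cursor:
            raise ValueError("variants overlap")
        if reference[start:start + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference")
        pieces.append(reference[cursor:start])  # unchanged sequence
        pieces.append(alt)                      # substituted allele
        cursor = start + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


# Toy reference: A, then five CT repeats, then G (hypothetical).
toy_ref = "ACTCTCTCTCTG"
as_complex = [(1, "CTCTCTCTCT", "CA")]
as_del_plus_snp = [(1, "CTCTCTCTC", "C"), (10, "T", "A")]

if __name__ == "__main__":
    print([classify(r, a) for _, r, a in as_complex])       # one complex record
    print([classify(r, a) for _, r, a in as_del_plus_snp])  # a DEL and a SNP
    print(apply_variants(toy_ref, as_complex) ==
          apply_variants(toy_ref, as_del_plus_snp))
```

Both representations yield the same haplotype in this toy case even though the per-record variant types differ, which is why counts by variant type can shift depending on how variants are decomposed.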
GA4GH Benchmarking Team
Benchmarking Tools
Standardized comparison, counting, and stratification with
Hap.py + vcfeval
https://precision.fda.gov/
https://github.com/ga4gh/benchmarking-tools
What are we accessing and what is still
challenging?
Type of variant | Genome context | Fraction of variants called* | Number of variants missing* | How to improve?
Simple SNPs | Not repetitive | ~97% | >100k | Machine learning
Simple indels | Not repetitive | ~93% | >10k | Machine learning
All variants | Low mappability | <30% | >170k | Use linked reads and long reads
All variants | Regions not in GRCh37/38 | 0 | >>100k??? | De novo assembly; long reads
Small indels | Tandem repeats and homopolymers | <50% | >200k | STR/homopolymer callers; long reads; better handle complex and compound variants
Indels 15-50bp | All | <25% | >30k | Assembly-based callers; integrate larger variants differently; long reads
Indels >50bp | All | <1% | >20k |
* Approximate values based on fraction of variants in GATKHC or FermiKit that are inside v3.3.2 High-confidence regions
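The footnote's "fraction of variants inside high-confidence regions" can be sketched as an interval lookup. This is a minimal illustration, assuming BED-style 0-based half-open intervals that do not overlap within a chromosome and variants reduced to a single 0-based position (VCF POS minus 1); the function names are hypothetical, and multi-base variants that straddle a region boundary are not handled.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Tuple

Index = Dict[str, Tuple[List[int], List[int]]]  # chrom -> (starts, ends)


def build_index(bed_rows: Iterable[Tuple[str, int, int]]) -> Index:
    """Group BED intervals by chromosome and sort them by start."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in bed_rows:
        by_chrom.setdefault(chrom, []).append((start, end))
    index: Index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        index[chrom] = ([s for s, _ in intervals], [e for _, e in intervals])
    return index


def contains(index: Index, chrom: str, pos0: int) -> bool:
    """True if the 0-based position lies in some [start, end) interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos0) - 1  # last interval starting at or before pos0
    return i >= 0 and pos0 < ends[i]


def fraction_inside(index: Index, variants: List[Tuple[str, int]]) -> float:
    """Fraction of (chrom, 0-based position) variants inside the regions."""
    if not variants:
        return float("nan")
    inside = sum(contains(index, chrom, pos) for chrom, pos in variants)
    return inside / len(variants)


if __name__ == "__main__":
    regions = build_index([("chr1", 100, 200), ("chr1", 300, 400)])  # made-up
    calls = [("chr1", 100), ("chr1", 199), ("chr1", 200), ("chr1", 350), ("chr2", 10)]
    print(fraction_inside(regions, calls))
```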
How can we extend our approach to structural
variants?
Similarities to small variants
• Collect callsets from multiple
technologies
• Compare callsets to find calls
supported by multiple technologies
Differences from small variants
• Callsets have limited sensitivity
• Variants are often imprecisely
characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly
standardized, especially when complex
• Comparison tools in infancy
Our strategy
Collect many candidate calls for AJ Trio
• Gather candidate calls from a variety of
approaches
– Many technologies
• Short, linked, and long reads
• Optical and nanopore mapping
– Many approaches
• Small variant callers
• Structural variant callers
• Local and global de novo assemblies
• Community submitted >1 million calls
from 30+ methods using 5+ technologies
Refine/evaluate/genotype candidates
• Obtain sequence-resolved calls as
often as possible using assembly-based
approaches
• Compare sequence predictions of
candidate calls and merge similar calls
• Determine raw data’s support of each
sequence-resolved call and its
genotype
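To make "compare sequence predictions and merge similar calls" concrete, the sketch below clusters predicted sequences by normalized edit distance. It is only an illustration under stated assumptions: the greedy clustering, the function names, and the 0.2 threshold are hypothetical choices, and a real comparison would also account for position and reference context rather than the allele sequence alone.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def relative_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def merge_similar(sequences: List[str], max_rel_dist: float = 0.2) -> List[List[str]]:
    """Greedily place each sequence in the first cluster whose first member is close enough."""
    clusters: List[List[str]] = []
    for seq in sequences:
        for cluster in clusters:
            if relative_distance(seq, cluster[0]) < max_rel_dist:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    predicted = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]  # made-up insertions
    print(merge_similar(predicted))
```

Greedy clustering depends on input order; it is used here only because it is short, not because it is the preferred strategy.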
Evaluation/genotyping suite of methods
Current approaches
• svviz – maps reads to REF or ALT alleles
– PacBio
– Illumina paired end and mate-pair
– 10X haplotype-separated
• BioNano – compare size predictions
• Nabsys – evaluates large deletions
Future approaches
• Separate haplotypes on other data
types for svviz using whatshap
• Online manual curation of svviz, IGV,
dotplots, gEVAL, etc.
– Volunteers needed!
• PCR-Sanger targeted sequencing
– Collaborations welcome!
Integrating Sequence-resolved Calls >=20bp
• >1 million calls from 30+ sequence-resolved callsets from 4 techs for AJ Trio
• >500k unique sequence-resolved calls
• 30k INS and 32k DEL with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support
• 28k INS and 29k DEL genotyped by svviz in 1+ individuals
v0.4.0
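The support criterion in this funnel can be read as a simple predicate on each merged call. The sketch below encodes one such reading: keep a call if it is supported by 2+ technologies, or by 5+ callers, or by BioNano/Nabsys. The `MergedCall` fields and the function name are hypothetical, and the step of checking that supporting calls predict sequences less than 20% different is assumed to have happened upstream during merging.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class MergedCall:
    """Hypothetical record for a merged sequence-resolved call."""
    svtype: str                                    # "INS" or "DEL"
    techs: Set[str] = field(default_factory=set)   # technologies with a matching call
    callers: Set[str] = field(default_factory=set) # callsets with a matching call
    mapping_support: bool = False                  # BioNano or Nabsys agrees


def passes_support(call: MergedCall, min_techs: int = 2, min_callers: int = 5) -> bool:
    """Keep calls with multi-technology, multi-caller, or mapping support."""
    return (len(call.techs) >= min_techs
            or len(call.callers) >= min_callers
            or call.mapping_support)


def filter_supported(calls: List[MergedCall]) -> List[MergedCall]:
    """Return only the calls that meet the support criterion."""
    return [c for c in calls if passes_support(c)]


if __name__ == "__main__":
    candidates = [  # made-up examples
        MergedCall("INS", {"PacBio", "Illumina"}, {"callerA", "callerB"}),
        MergedCall("DEL", {"Illumina"}, {"callerA"}),
        MergedCall("DEL", {"Illumina"}, {"callerA"}, mapping_support=True),
    ]
    print(len(filter_supported(candidates)))
```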
Size Distribution of v0.4.0 Calls
[Figure: size distributions of deletions and insertions, shown separately for Not Tandem Repeat and Tandem Repeat, with Alu and LINE peaks labeled]
Sequence-resolved insertion size relative to BioNano
Insertion sequence prediction accuracy differs between methods
[Figure: relative distance from exact match for Illumina local assembly, PacBio raw read, and PacBio consensus assembly]
Developing web-based manual curation tools
https://github.com/svviz/svviz
Outstanding challenges and future work
• Large sequence-resolved insertions
• Many fewer multi-kb insertions
than multi-kb deletions
• Dense calls
• ~1/3 v0.4.0 calls are within 1kb of
another v0.4.0 call
• Sequence-resolved insertion size
doesn’t always match BioNano
• Phasing will be important for
these (e.g., with 10X, whatshap)
• Calls with inaccurate or incomplete
sequence change
• Exploring training a model to
predict sequence accuracy
• Homozygous Reference calls
• Can we definitively state there is
no SV in some regions?
• E.g., using diploid assembly?
• Benchmarking tool development
• How to compare SVs to a
benchmark?
• What performance metrics are
important?
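The "dense calls" observation above (roughly a third of calls lying within 1kb of another call) is the kind of statistic that can be computed directly from call coordinates. The sketch below is one way to do it, assuming calls are given as (chrom, start, end) with 0-based half-open coordinates and that "within 1kb" means the gap between two intervals is at most 1000 bases; the function name and these conventions are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def fraction_near_neighbor(calls: List[Tuple[str, int, int]], max_gap: int = 1000) -> float:
    """Fraction of calls whose gap to some other call on the same chromosome is <= max_gap."""
    if not calls:
        return float("nan")
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in calls:
        by_chrom[chrom].append((start, end))

    near = 0
    for intervals in by_chrom.values():
        intervals.sort()
        max_end_before = None  # furthest end among calls starting earlier
        for i, (start, end) in enumerate(intervals):
            left_close = (max_end_before is not None
                          and start - max_end_before <= max_gap)
            right_close = (i + 1 < len(intervals)
                           and intervals[i + 1][0] - end <= max_gap)
            if left_close or right_close:
                near += 1
            max_end_before = end if max_end_before is None else max(max_end_before, end)
    return near / len(calls)


if __name__ == "__main__":
    example = [  # made-up coordinates
        ("chr1", 1000, 1050),
        ("chr1", 1500, 1600),
        ("chr1", 10000, 10100),
        ("chr2", 1000, 1100),
    ]
    print(fraction_near_neighbor(example))
```

Overlapping calls produce a negative gap and therefore count as near each other under this definition.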
New public data planned for late 2017
• PacBio Sequel sequencing of
GIAB Chinese trio
– Collaboration with Mt. Sinai
– 60x/30x/30x coverage planned
– Potentially >15kb N50 read length
• Oxford Nanopore sequencing of
Ashkenazim trio
– Collaboration with Nick Loman and
Matt Loose
– ~50x/25x/25x coverage planned
– Ultralong read sequencing (50-
100kb+ N50 read length)
New Samples
Additional ancestries
• Shorter term
– Use existing PGP individual samples
– Use existing integration pipeline
• Data-based selection
– Proportion of potential genomes from
different ancestries
• 3 to 8 new samples
• Longer term
– Recruit large family
– Recruit trios from other ancestry groups
Cancer samples
• Longer term
• Make PGP-consented tumor and
normal cell lines from same individual
• Select tumor with diversity of mutation
types
Take-home Messages
• Genome in a Bottle is authoritatively characterizing human
genomes
• Current characterization enables robust benchmarking of “easier”
variants/regions
• Actively working on difficult variants and regions
– Draft variant calls >=20bp available – feedback requested!
• New public long and ultralong read datasets coming!
• What can we help enable?
– Clinical applications – precision medicine
– Research applications – how to know new methods are measuring difficult
regions/variants well
Acknowledgements
• NIST/JIMB
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
– Lesley Chapman
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
Data: http://www.nature.com/articles/sdata201625
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– precision.fda.gov – GA4GH benchmarking app
Biweekly Analysis Team calls (open to all)
– https://groups.google.com/forum/#!forum/giab-analysis-team
Public workshops
– Next workshop Jan 25-26, 2018 in Stanford, CA
– http://jimb.stanford.edu/giabworkshops for info and registration
NIST/JIMB postdoc opportunities available!
Justin Zook: jzook@nist.gov
Marc Salit: salit@nist.gov

More Related Content

What's hot

hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)Shaojun Xie
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amGenome Reference Consortium
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesGenome Reference Consortium
 
Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008Saul Kravitz
 
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectGenome Reference Consortium
 
Theory and practice of graphical population analysis
Theory and practice of graphical population analysisTheory and practice of graphical population analysis
Theory and practice of graphical population analysisGenome Reference Consortium
 
Aug2013 NIST highly confident genotype calls for NA12878
Aug2013 NIST highly confident genotype calls for NA12878Aug2013 NIST highly confident genotype calls for NA12878
Aug2013 NIST highly confident genotype calls for NA12878GenomeInABottle
 
ASHG 2015 Genome in a bottle
ASHG 2015 Genome in a bottleASHG 2015 Genome in a bottle
ASHG 2015 Genome in a bottleGenomeInABottle
 
Genome in a bottle april 30 2015 hvp Leiden
Genome in a bottle april 30 2015 hvp LeidenGenome in a bottle april 30 2015 hvp Leiden
Genome in a bottle april 30 2015 hvp LeidenGenomeInABottle
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsGenome Reference Consortium
 
Giab ashg webinar 160224
Giab ashg webinar 160224Giab ashg webinar 160224
Giab ashg webinar 160224GenomeInABottle
 
Variation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copyVariation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copyGenome Reference Consortium
 
Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3
Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3
Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3GenomeInABottle
 

What's hot (20)

Explaining the assembly model
Explaining the assembly modelExplaining the assembly model
Explaining the assembly model
 
hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)hg19 (GRCh37) vs. hg38 (GRCh38)
hg19 (GRCh37) vs. hg38 (GRCh38)
 
Why graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 amWhy graph genome storage and updating wakes me up at 4 am
Why graph genome storage and updating wakes me up at 4 am
 
Telomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomesTelomere-to-telomere assembly of a complete human chromosomes
Telomere-to-telomere assembly of a complete human chromosomes
 
Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008Human Reference Genome Browser Presentation at BIO-ITWorld 2008
Human Reference Genome Browser Presentation at BIO-ITWorld 2008
 
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) ProjectThe Matched Annotation from NCBI and EMBL-EBI (MANE) Project
The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
 
Lrg and mane 16 oct 2018
Lrg and mane   16 oct 2018Lrg and mane   16 oct 2018
Lrg and mane 16 oct 2018
 
Theory and practice of graphical population analysis
Theory and practice of graphical population analysisTheory and practice of graphical population analysis
Theory and practice of graphical population analysis
 
Jan2016 pac bio giab
Jan2016 pac bio giabJan2016 pac bio giab
Jan2016 pac bio giab
 
Ashg2015 schneider final
Ashg2015 schneider finalAshg2015 schneider final
Ashg2015 schneider final
 
agbt 2016 workshop church
agbt 2016 workshop churchagbt 2016 workshop church
agbt 2016 workshop church
 
Aug2013 NIST highly confident genotype calls for NA12878
Aug2013 NIST highly confident genotype calls for NA12878Aug2013 NIST highly confident genotype calls for NA12878
Aug2013 NIST highly confident genotype calls for NA12878
 
ASHG 2015 Genome in a bottle
ASHG 2015 Genome in a bottleASHG 2015 Genome in a bottle
ASHG 2015 Genome in a bottle
 
Genome in a bottle april 30 2015 hvp Leiden
Genome in a bottle april 30 2015 hvp LeidenGenome in a bottle april 30 2015 hvp Leiden
Genome in a bottle april 30 2015 hvp Leiden
 
ClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materialsClinVar: Getting the most from the reference assembly and reference materials
ClinVar: Getting the most from the reference assembly and reference materials
 
Giab ashg webinar 160224
Giab ashg webinar 160224Giab ashg webinar 160224
Giab ashg webinar 160224
 
Mason abrf single_cell_2017
Mason abrf single_cell_2017Mason abrf single_cell_2017
Mason abrf single_cell_2017
 
Variation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copyVariation graphs and population assisted genome inference copy
Variation graphs and population assisted genome inference copy
 
Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3
Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3
Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 

Similar to 171017 giab for giab grc workshop

171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshopGenomeInABottle
 
GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GenomeInABottle
 
Giab for jax long read 190917
Giab for jax long read 190917Giab for jax long read 190917
Giab for jax long read 190917GenomeInABottle
 
GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GenomeInABottle
 
Genome in a bottle for next gen dx v2 180821
Genome in a bottle for next gen dx v2 180821Genome in a bottle for next gen dx v2 180821
Genome in a bottle for next gen dx v2 180821GenomeInABottle
 
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GenomeInABottle
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016GenomeInABottle
 
GIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM ForumGIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM ForumGenomeInABottle
 
Using accurate long reads to improve Genome in a Bottle Benchmarks 220923
Using accurate long reads to improve Genome in a Bottle Benchmarks 220923Using accurate long reads to improve Genome in a Bottle Benchmarks 220923
Using accurate long reads to improve Genome in a Bottle Benchmarks 220923GenomeInABottle
 
Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030GenomeInABottle
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...GenomeInABottle
 
GIAB-GRC workshop oct2015 giab introduction 151005
GIAB-GRC workshop oct2015 giab introduction 151005GIAB-GRC workshop oct2015 giab introduction 151005
GIAB-GRC workshop oct2015 giab introduction 151005GenomeInABottle
 
Genome in a Bottle- reference materials to benchmark challenging variants and...
Genome in a Bottle- reference materials to benchmark challenging variants and...Genome in a Bottle- reference materials to benchmark challenging variants and...
Genome in a Bottle- reference materials to benchmark challenging variants and...GenomeInABottle
 
150219 agbt giab_poster_marc
150219 agbt giab_poster_marc150219 agbt giab_poster_marc
150219 agbt giab_poster_marcGenomeInABottle
 
170120 giab stanford genetics seminar
170120 giab stanford genetics seminar170120 giab stanford genetics seminar
170120 giab stanford genetics seminarGenomeInABottle
 
Sept2016 plenary nist_intro
Sept2016 plenary nist_introSept2016 plenary nist_intro
Sept2016 plenary nist_introGenomeInABottle
 
GIAB_ASHG_JZook_2023.pdf
GIAB_ASHG_JZook_2023.pdfGIAB_ASHG_JZook_2023.pdf
GIAB_ASHG_JZook_2023.pdfGenomeInABottle
 
150224 giab 30 min generic slides
150224 giab 30 min generic slides150224 giab 30 min generic slides
150224 giab 30 min generic slidesGenomeInABottle
 
Tools for Using NIST Reference Materials
Tools for Using NIST Reference MaterialsTools for Using NIST Reference Materials
Tools for Using NIST Reference MaterialsGenomeInABottle
 

Similar to 171017 giab for giab grc workshop (20)

171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517
 
Giab for jax long read 190917
Giab for jax long read 190917Giab for jax long read 190917
Giab for jax long read 190917
 
GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015
 
Genome in a bottle for next gen dx v2 180821
Genome in a bottle for next gen dx v2 180821Genome in a bottle for next gen dx v2 180821
Genome in a bottle for next gen dx v2 180821
 
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016
 
GIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM ForumGIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM Forum
 
Using accurate long reads to improve Genome in a Bottle Benchmarks 220923
Using accurate long reads to improve Genome in a Bottle Benchmarks 220923Using accurate long reads to improve Genome in a Bottle Benchmarks 220923
Using accurate long reads to improve Genome in a Bottle Benchmarks 220923
 
Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
 
170326 giab abrf
170326 giab abrf170326 giab abrf
170326 giab abrf
 
GIAB-GRC workshop oct2015 giab introduction 151005
GIAB-GRC workshop oct2015 giab introduction 151005GIAB-GRC workshop oct2015 giab introduction 151005
GIAB-GRC workshop oct2015 giab introduction 151005
 
Genome in a Bottle- reference materials to benchmark challenging variants and...
Genome in a Bottle- reference materials to benchmark challenging variants and...Genome in a Bottle- reference materials to benchmark challenging variants and...
Genome in a Bottle- reference materials to benchmark challenging variants and...
 
150219 agbt giab_poster_marc
150219 agbt giab_poster_marc150219 agbt giab_poster_marc
150219 agbt giab_poster_marc
 
170120 giab stanford genetics seminar
170120 giab stanford genetics seminar170120 giab stanford genetics seminar
170120 giab stanford genetics seminar
 
Sept2016 plenary nist_intro
Sept2016 plenary nist_introSept2016 plenary nist_intro
Sept2016 plenary nist_intro
 
GIAB_ASHG_JZook_2023.pdf
GIAB_ASHG_JZook_2023.pdfGIAB_ASHG_JZook_2023.pdf
GIAB_ASHG_JZook_2023.pdf
 
150224 giab 30 min generic slides
150224 giab 30 min generic slides150224 giab 30 min generic slides
150224 giab 30 min generic slides
 
Tools for Using NIST Reference Materials
Tools for Using NIST Reference MaterialsTools for Using NIST Reference Materials
Tools for Using NIST Reference Materials
 

More from Genome Reference Consortium

Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsGenome Reference Consortium
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesGenome Reference Consortium
 
Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonUnderstanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonGenome Reference Consortium
 
Graph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regionsGraph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regionsGenome Reference Consortium
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesGenome Reference Consortium
 

More from Genome Reference Consortium (16)

Genome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkitGenome variation graphs with the vg toolkit
Genome variation graphs with the vg toolkit
 
20181016 grc presentation-pa
20181016 grc presentation-pa20181016 grc presentation-pa
20181016 grc presentation-pa
 
Ashg2017 workshop tg
Ashg2017 workshop tgAshg2017 workshop tg
Ashg2017 workshop tg
 
101717.kh miga ashg_grc
101717.kh miga ashg_grc101717.kh miga ashg_grc
101717.kh miga ashg_grc
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
 
AGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: SchneiderAGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: Schneider
 
AGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: LindsayAGBT2017 Reference Workshop: Lindsay
AGBT2017 Reference Workshop: Lindsay
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
 
Everyday de novo diploid assembly
Everyday de novo diploid assemblyEveryday de novo diploid assembly
Everyday de novo diploid assembly
 
Getting the most from the reference assembly
Getting the most from the reference assemblyGetting the most from the reference assembly
Getting the most from the reference assembly
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
 
Genome in a Bottle
Genome in a BottleGenome in a Bottle
Genome in a Bottle
 
Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonUnderstanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL Hackathon
 
Graph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regionsGraph and assembly strategies for the MHC and ribosomal DNA regions
Graph and assembly strategies for the MHC and ribosomal DNA regions
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
 
Everyday de novo assembly
Everyday de novo assemblyEveryday de novo assembly
Everyday de novo assembly
 

171017 giab for giab grc workshop

  • 1. Genome in a Bottle: Developing benchmark sets for large indels and structural variants Justin Zook, Marc Salit, and the GIAB Consortium NIST Genome-Scale Measurements Group Joint Initiative for Metrology in Biology (JIMB) Oct 16, 2017
  • 2. Take-home Messages • Genome in a Bottle is authoritatively characterizing human genomes • Current characterization enables benchmarking of “easier” variants/regions in germline genomes – Clinical validation – Technology development, optimization, and demonstration • Now working on difficult variants and regions – Draft variant calls >=20bp available and feedback requested – Many challenges remain and collaborations welcome!
  • 3. Why are we doing this? • Technologies evolving rapidly • Different sequencing and bioinformatics methods give different results • Now have concordance in easy regions, but not in difficult regions • Challenge: – How do we characterize 6 billion bases in the genome with high confidence? O’Rawe et al, Genome Medicine, 2013 https://doi.org/10.1186/gm432
  • 4. GIAB is evolving 2012 • No human benchmark calls available • GIAB Consortium formed 2014 • Small variant genotypes for ~77% of pilot genome NA12878 2015 • NIST releases first human genome Reference Material 2016 • 4 new genomes • Small variants for 90% of 5 genomes for GRCh37/38 2017+ • Characterizing difficult variants
  • 5. Genome in a Bottle Consortium Authoritative Characterization of Human Genomes Sample gDNA isolation Library Prep Sequencing Alignment/Mapping Variant Calling Confidence Estimates Downstream Analysis • gDNA reference materials to evaluate performance • GIAB is developing: – reference materials – Reference data – Methods – Tools to calculate performance metrics generic measurement process www.slideshare.net/genomeinabottle
  • 6. Bringing Principles of Metrology to the Genome • Reference materials – DNA in a tube from NIST • Extensive state-of-the-art characterization • “Upgradable” as technology develops • Commercial innovation – PGP genomes suitable for commercial derived products • Benchmarking tools and software – with GA4GH • Enhance new technologies
  • 7. GIAB has characterized 5 human genome RMs • Pilot genome – NA12878 • PGP Human Genomes – Ashkenazi Jewish son – Ashkenazi Jewish trio – Chinese son • Parents also characterized National Institute of Standards & Technology Report of Investigation Reference Material 8391 Human DNA for Whole-Genome Variant Assessment (Son of Eastern European Ashkenazim Jewish Ancestry) This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer (10 mM TRIS, 1 mM EDTA, pH 8.0). This material is intended for assessing performance of human genome sequencing variant calling by obtaining estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods similar to described in reference 1. An information value is considered to be a value that will be of interest and use to the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available. These data and genomic characterizations will be maintained over time as new data accrue and measurement and informatics methods become available. The information values are given as a variant call file (vcf) that contains the high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information (NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
  • 8. Integration of diverse data types and analyses • Data publicly available – Deep short reads – Linked reads – Long reads – Optical/nanopore mapping • Analyses – Small variant calling – SV calling – Local and global assembly Discover & Refine sequence-resolved calls from multiple datasets & analyses Compare variant and genotype calls from different methods Evaluate/genotype calls with other data Identify features associated with reliability of calls from each method Form benchmark calls using heuristics & machine learning Compare benchmarks to high-quality callsets and examine differences
  • 9. Paper describing data… 51 authors 14 institutions 12 datasets 7 genomes Data described in ISA-tab
  • 10. Evolution of high-confidence small variants
      Calls  | HC Regions | HC Calls | HC indels | Concordant with PG | NIST-only in beds | PG-only in beds | PG-only | Variants Phased
      v2.19  | 2.22 Gb    | 3153247  | 352937    | 3030703            | 87                | 404             | 1018795 | 0.3%
      v3.2.2 | 2.53 Gb    | 3512990  | 335594    | 3391783            | 57                | 52              | 657715  | 3.9%
      v3.3   | 2.57 Gb    | 3566076  | 358753    | 3441361            | 40                | 60              | 608137  | 8.8%
      v3.3.2 | 2.58 Gb    | 3691156  | 487841    | 3529641            | 47                | 61              | 469202  | 99.6%
      Annotations: 5-7 errors in NIST; 1-7 errors in NIST; ~2 FPs and ~2 FNs per million NIST variants in PG and NIST bed files
  • 11. Global Alliance for Genomics and Health Benchmarking Task Team • Developed standardized definitions for performance metrics like TP, FP, and FN. • Developing sophisticated benchmarking tools • Integrated into a single framework with standardized inputs and outputs • Standardized bed files with difficult genome contexts for stratification https://github.com/ga4gh/benchmarking-tools Variant types can change when decomposing or recomposing variants: Complex variant: chr1 201586350 CTCTCTCTCT CA DEL + SNP: chr1 201586350 CTCTCTCTCT C chr1 201586359 T A Credit: Peter Krusche, Illumina GA4GH Benchmarking Team
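The point on slide 11 that variant types change under decomposition can be illustrated with a small sketch. It classifies each record by its REF/ALT alleles and then checks that two representations produce the same haplotype once applied to the reference, which is the general idea behind haplotype-aware comparison. The reference string, the coordinates, and the helper names `variant_type` and `apply_variants` are all hypothetical and chosen for illustration; they are not the slide's chr1 coordinates and not code from the GA4GH tools.

```python
# Toy reference (hypothetical); positions below are 1-based, as in VCF.
REFERENCE = "ACTCTCTCTCTG"

# One complex record versus an equivalent deletion + SNP pair.
COMPLEX = [(2, "CTCTCTCTCT", "CA")]
DECOMPOSED = [(2, "CTCTCTCTC", "C"), (11, "T", "A")]


def variant_type(ref, alt):
    """Classify a single REF/ALT pair by allele lengths and shared prefix."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "DEL"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "INS"
    return "COMPLEX"


def apply_variants(reference, variants):
    """Return the haplotype obtained by applying non-overlapping variants."""
    pieces = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("variants overlap")
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(reference[cursor:start])
        pieces.append(alt)
        cursor = start + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    print([variant_type(r, a) for _, r, a in COMPLEX])
    print([variant_type(r, a) for _, r, a in DECOMPOSED])
    print(apply_variants(REFERENCE, COMPLEX) == apply_variants(REFERENCE, DECOMPOSED))
```

Comparing record by record would count these two call sets as different, while comparing the resulting haplotypes treats them as the same event.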
  • 12. Benchmarking Tools Standardized comparison, counting, and stratification with Hap.py + vcfeval https://precision.fda.gov/ https://github.com/ga4gh/benchmarking-tools
  • 13. What are we accessing and what is still challenging?
      Type of variant | Genome context                  | Fraction of variants called* | Number of variants missing* | How to improve?
      Simple SNPs     | Not repetitive                  | ~97%                         | >100k                       | Machine learning
      Simple indels   | Not repetitive                  | ~93%                         | >10k                        | Machine learning
      All variants    | Low mappability                 | <30%                         | >170k                       | Use linked reads and long reads
      All variants    | Regions not in GRCh37/38        | 0                            | >>100k???                   | De novo assembly; long reads
      Small indels    | Tandem repeats and homopolymers | <50%                         | >200k                       | STR/homopolymer callers; long reads; better handle complex and compound variants
      Indels 15-50bp  | All                             | <25%                         | >30k                        | Assembly-based callers; integrate larger variants differently; long reads
      Indels >50bp    | All                             | <1%                          | >20k                        |
      * Approximate values based on fraction of variants in GATKHC or FermiKit that are inside v3.3.2 High-confidence regions
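The footnote to slide 13 describes the estimates as the fraction of a callset's variants that fall inside the high-confidence regions. A minimal sketch of that calculation follows, assuming the regions are already loaded as sorted, non-overlapping BED-style intervals (0-based, half-open) and each variant is reduced to a chromosome and a 1-based position. The function name `fraction_inside` and the toy inputs are hypothetical; real tools also account for variant length and representation, which this sketch ignores.

```python
from bisect import bisect_right


def fraction_inside(variants, regions):
    """Fraction of (chrom, pos) variants whose position lies in a region.

    variants: iterable of (chrom, pos) with 1-based positions (VCF style).
    regions:  dict mapping chrom -> sorted, non-overlapping list of
              (start, end) intervals, 0-based half-open (BED style).
    """
    starts = {chrom: [s for s, _ in ivs] for chrom, ivs in regions.items()}
    inside = 0
    total = 0
    for chrom, pos in variants:
        total += 1
        intervals = regions.get(chrom, [])
        zero_based = pos - 1
        # Index of the last interval starting at or before the variant.
        i = bisect_right(starts.get(chrom, []), zero_based) - 1
        if i >= 0 and zero_based < intervals[i][1]:
            inside += 1
    return inside / total if total else 0.0


if __name__ == "__main__":
    toy_regions = {"chr1": [(0, 100), (200, 300)]}
    toy_variants = [("chr1", 1), ("chr1", 100), ("chr1", 101),
                    ("chr1", 201), ("chr2", 5)]
    print(fraction_inside(toy_variants, toy_regions))
```

Binary search over the interval starts keeps the lookup cheap even when the BED file holds a very large number of regions.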
  • 14. How can we extend our approach to structural variants? Similarities to small variants • Collect callsets from multiple technologies • Compare callsets to find calls supported by multiple technologies Differences from small variants • Callsets have limited sensitivity • Variants are often imprecisely characterized – breakpoints, size, type, etc. • Representation of variants is poorly standardized, especially when complex • Comparison tools in infancy
  • 15. Our strategy Collect many candidate calls for AJ Trio • Gather candidate calls from a variety of approaches – Many technologies • Short, linked, and long reads • Optical and nanopore mapping – Many approaches • Small variant callers • Structural variant callers • Local and global de novo assemblies • Community submitted >1 million calls from 30+ methods using 5+ technologies Refine/evaluate/genotype candidates • Obtain sequence-resolved calls as often as possible using assembly-based approaches • Compare sequence predictions of candidate calls and merge similar calls • Determine raw data’s support of each sequence-resolved call and its genotype
  • 16. Evaluation/genotyping suite of methods Current approaches • svviz – maps reads to REF or ALT alleles – PacBio – Illumina paired end and mate-pair – 10X haplotype-separated • BioNano – compare size predictions • Nabsys – evaluates large deletions Future approaches • Separate haplotypes on other data types for svviz using whatshap • Online manual curation of svviz, IGV, dotplots, gEVAL, etc. – Volunteers needed! • PCR-Sanger targeted sequencing – Collaborations welcome!
  • 17. Integrating Sequence-resolved Calls >=20bp (v0.4.0)
      >1 million calls from 30+ sequence-resolved callsets from 4 techs for AJ Trio
      -> >500k unique sequence-resolved calls
      -> 30k INS and 32k DEL with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support
      -> 28k INS and 29k DEL genotyped by svviz in 1+ individuals
  • 18. Size Distribution of v0.4.0 Calls [Figure: size distributions of deletions and insertions, shown separately for calls in tandem repeats and not in tandem repeats, with Alu and LINE peaks labeled]
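The support rule on slide 17 can be sketched as a small filter. This is one reading of the slide under stated assumptions, not the consortium's pipeline: a cluster of candidate calls is kept when its members come from at least 2 technologies or at least 5 callers, or when it has BioNano/Nabsys support, and "sequences <20% different" is taken to mean edit distance divided by the longer sequence length. The `Call` type and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    caller: str    # name of the calling method (hypothetical field)
    tech: str      # sequencing or mapping technology
    sequence: str  # predicted inserted or deleted sequence


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def sequences_similar(a, b, max_rel_diff=0.2):
    """True if the sequences differ by less than max_rel_diff (assumed metric)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) / longest < max_rel_diff


def passes_support(cluster, has_mapping_support=False,
                   min_techs=2, min_callers=5):
    """Apply the assumed rule: 2+ techs, or 5+ callers, or mapping support."""
    techs = {call.tech for call in cluster}
    callers = {call.caller for call in cluster}
    return (len(techs) >= min_techs
            or len(callers) >= min_callers
            or has_mapping_support)
```

In this reading, `sequences_similar` decides which calls are merged into one cluster, and `passes_support` is then applied to each merged cluster.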
  • 19. Sequence-resolved insertion size relative to BioNano
  • 20. Insertion sequence prediction accuracy differs between methods [Figure: relative distance from exact match for Illumina local assembly, PacBio raw read, and PacBio consensus assembly]
  • 21. Developing web-based Manual curation tools https://github.com/svviz/svviz
  • 22. Outstanding challenges and future work • Large sequence-resolved insertions • Many fewer multi-kb insertions than multi-kb deletions • Dense calls • ~1/3 v0.4.0 calls are within 1kb of another v0.4.0 call • Sequence-resolved insertion size doesn’t always match BioNano • Phasing will be important for these (e.g., with 10X, whatshap) • Calls with inaccurate or incomplete sequence change • Exploring training a model to predict sequence accuracy • Homozygous Reference calls • Can we definitively state there is no SV in some regions? • E.g., using diploid assembly? • Benchmarking tool development • How to compare SVs to a benchmark? • What performance metrics are important?
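The "dense calls" observation on slide 22 (roughly a third of v0.4.0 calls lie within 1kb of another call) is the kind of summary that can be computed directly from a call list. The sketch below assumes each call is reduced to a chromosome and a single start coordinate, which ignores the extent of each variant; the function name `fraction_near_neighbor` and the toy input are hypothetical.

```python
from collections import defaultdict


def fraction_near_neighbor(calls, window=1000):
    """Fraction of calls with another call within `window` bp on the same chromosome.

    calls: iterable of (chrom, pos) tuples; only start positions are used.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in calls:
        by_chrom[chrom].append(pos)

    near = 0
    total = 0
    for positions in by_chrom.values():
        positions.sort()
        for i, pos in enumerate(positions):
            total += 1
            # After sorting, the closest call is always an adjacent one.
            left_close = i > 0 and pos - positions[i - 1] <= window
            right_close = (i + 1 < len(positions)
                           and positions[i + 1] - pos <= window)
            if left_close or right_close:
                near += 1
    return near / total if total else 0.0


if __name__ == "__main__":
    toy_calls = [("chr1", 100), ("chr1", 900), ("chr1", 5000), ("chr2", 100)]
    print(fraction_near_neighbor(toy_calls))
```

Only the two sorted neighbors need to be checked for each call, so the cost is dominated by the per-chromosome sort.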
  • 23. New public data planned for late 2017 • PacBio Sequel sequencing of GIAB Chinese trio – Collaboration with Mt. Sinai – 60x/30x/30x coverage planned – Potentially >15kb N50 read length • Oxford Nanopore sequencing of Ashkenazim trio – Collaboration with Nick Loman and Matt Loose – ~50x/25x/25x coverage planned – Ultralong read sequencing (50-100kb+ N50 read length)
  • 24. New Samples Additional ancestries • Shorter term – Use existing PGP individual samples – Use existing integration pipeline • Data-based selection – Proportion of potential genomes from different ancestries • 3 to 8 new samples • Longer term – Recruit large family – Recruit trios from other ancestry groups Cancer samples • Longer term • Make PGP-consented tumor and normal cell lines from same individual • Select tumor with diversity of mutation types
  • 25. Take-home Messages • Genome in a Bottle is authoritatively characterizing human genomes • Current characterization enables robust benchmarking of “easier” variants/regions • Actively working on difficult variants and regions – Draft variant calls >=20bp available – feedback requested! • New public long and ultralong read datasets coming! • What can we help enable? – Clinical applications – precision medicine – Research applications – how to know new methods are measuring difficult regions/variants well
  • 26. Acknowledgements • NIST/JIMB – Marc Salit – Jenny McDaniel – Lindsay Vang – David Catoe – Lesley Chapman • Genome in a Bottle Consortium • GA4GH Benchmarking Team • FDA
  • 27. For More Information www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails github.com/genome-in-a-bottle – Guide to GIAB data & ftp www.slideshare.net/genomeinabottle Data: http://www.nature.com/articles/sdata201625 Global Alliance Benchmarking Team – https://github.com/ga4gh/benchmarking-tools – precision.fda.gov – GA4GH benchmarking app Biweekly Analysis Team calls (open to all) – https://groups.google.com/forum/#!forum/giab-analysis-team Public workshops – Next workshop Jan 25-26, 2018 in Stanford, CA – http://jimb.stanford.edu/giabworkshops for info and registration NIST/JIMB postdoc opportunities available! Justin Zook: jzook@nist.gov Marc Salit: salit@nist.gov