2. What’s Genome in a Bottle?
• Authoritative Characterization of Human Genomes
– enduring commitment to resource availability
• Samples
• Data
– widely available open resources
– no restrictions on use or distribution
• Enable technology and tool-building with benchmark samples and methods for…
– development
– optimization
– demonstration
• Germline samples available now
• Developing capacity for somatic sample development
3. What GIAB Isn’t
• Population genetics
• Disease-specific
• Many clinical samples
• Non-human
• Genome, not transcriptome, epigenome, proteome, metabolome…
6. Benchmarking strides forward
• Draft GA4GH manuscript describing best practices for benchmarking germline small variants
– >15 co-authors actively editing manuscript
• Robust, sophisticated benchmarking tools publicly available
– GitHub
– PrecisionFDA
7. High-confidence calls are in use
• 286 citations of 2014 paper
• PrecisionFDA challenges
• Clinical labs
• Demonstration of new variant callers
https://blog.dnanexus.com/2017-12-05-evaluating-deepvariant-googles-machine-learning-variant-caller/
8. Clinical Community adopting GIAB
• Justin is serving a 2-year term on the Association for Molecular Pathology (AMP) Clinical Practice Committee
• Monica Basehore appointed GIAB/AMP liaison
• GIAB-derived products are meeting needs in clinical labs
– AMP RM Forum hosted by CDC
• Somatic variants
• ctDNA
• Difficult variants
9. Derived Products are on the market
• 31 products from 3 companies now available based on GIAB PGP cell lines
• DNA + spike-ins
– Clinical variants
– Somatic variants
– Difficult variants
• FFPE
• ctDNA
10. Open data is being used!
• 82 citations of Scientific Data paper
11. GIAB Developing New Data
• 10X Genomics
– Chinese trio now available
• PacBio Sequel of Chinese trio with Mt Sinai
– Read insert N50: ~15 kb
– 202 Gb on the son and 98-116 Gb on each parent
– Data undergoing QC
• BioNano
– New DLS labeling method
• Complete Genomics/BGI
– stLFR linked reads
• Oxford Nanopore
– NIST/Birmingham/Nottingham ultra-long reads
• Starting soon
• Very preliminary N50 of 80-90 kb
– Max reads >1 Mb!
• Current throughput will give ~40x total on AJ trio, but may improve
• Strand-seq
– Collaboration with Korbel lab
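The read-length N50 and coverage figures quoted on this slide can be reproduced from raw read lengths. The following is a minimal sketch, not GIAB's own QC code: the function names are hypothetical, and the 3.1 Gb genome size used for coverage is an assumed round figure for a human genome.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_of_bases:
            return length


def mean_coverage(total_bases, genome_size=3.1e9):
    """Approximate mean coverage; genome_size is an assumed value."""
    return total_bases / genome_size
```

Under the assumed genome size, `mean_coverage(202e9)` gives roughly 65x for the 202 Gb dataset.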
12. Progress on New Samples
Germline Samples
• Performed RFMix ancestry analysis on 6 PGP individuals with WGS and cell lines
– 2 differently admixed Hispanic
– 1 76% African + 24% European
– 1 84% European + 15% South Asian
– 1 77% European + 21% East Asian
– 1 99% European (PGP1)
– 1 self-reported Chinese/Filipino
• Working on MTA for open dissemination
Somatic Samples
• Discussion Friday Morning
13. Goals for This Workshop
• Update consortium and onboard new members
• Review progress on SVs
• Demo Manual Curation App
• Learn about new methods for characterizing difficult regions
• Review, revisit, and update Principles for Dissemination of GIAB Samples
• Discuss plans for new Germline and Tumor Samples
14. Workshop Agenda
• THURSDAY, JANUARY 25, 2018
• 9:00 AM - 10:30 AM: Welcome, Onboarding, and GIAB Progress Update
• 10:30 AM - 11:00 AM: Break
• 11:00 AM - 12:15 PM: Training and Trial Manual Curation of Structural Variants
• 12:15 PM - 1:45 PM: Lunch
• 1:45 PM - 3:15 PM: Feedback about v0.5 Draft Structural Variant Benchmark Set
• 3:15 PM - 3:45 PM: Break
• 3:45 PM - 5:00 PM: New Approaches to Characterizing Difficult Variants and Regions
• 5:00 PM - 5:30 PM: Discussion of Future Work
• 5:30 PM - 6:30 PM: Happy Hour, Sponsored by PacBio and Invitae
• FRIDAY, JANUARY 26, 2018
• 9:00 AM - 10:45 AM: Panel Discussion about Principles for Dissemination of GIAB Samples
• 10:45 AM - 11:15 AM: Break
• 11:15 AM - 12:00 PM: Discussion about Future Germline and Tumor Samples
• 12:00 PM - 1:00 PM: Break
• 1:00 PM - 2:30 PM: GIAB Steering Committee Meeting
15. Steering Committee Agenda
• Roadmap
• Next 2-3 workshops
• Resourcing GIAB work
• Communications
– Best practices
– Ways of working together
• Liaisons
– HGSVC
– Clinical labs
• Samples, consents, repository relationships
– NIST RMs needed?
– Distribute all NIST RMs together
– Cells instead of DNA
– GIAB Imprimatur
• Research v. Standards-making
– Tool development vs reference sample development
17. Unique GIAB roles in genomics
• Develop open-access samples and data for broad uses in industry, academia, and government
• Convene community of experts to characterize genomes -> GIAB/NIST integrates results to form benchmarks
• Develop tools to calculate accurate and standardized performance metrics
19. Progress Update
• New sequencing with long and linked reads
• Developing plan for open access to GIAB materials
• Selecting samples from new ancestries
• Developing cancer samples for somatic benchmarking
• Stay for Friday morning!
Ongoing and Future Work
• Developing cancer samples for somatic benchmarking
• Draft publications about high-confidence calls and benchmarking methods
21. Progress Update
• Best methods agree on 99.9%+ of “easy” calls
• Evaluating “straw man” large indel/SV callsets
Ongoing and Future Work
• Characterize challenging 10-20% of genome
• New methods for reference characterization of somatic genomes
• Refining principled integration methods
• Assembly metrology
22. Progress Update
• GIAB Analysis Team focused on large indels and SVs
Ongoing and Future Work
• Individual collaborations exploring expanding calls for other variant types
23. Progress Update
• Released 3 “straw man” sequence-resolved benchmark callsets >=20bp
• Analysis Team gave critical feedback in each round
• v0.5.0 released Jan 2018
Ongoing and Future Work
• Evaluate v0.5.0
• Write manuscript
• Manual Curation
• Resolve clusters of variants
• Integrate new technologies and methods
24. Our SV Integration Strategy
Collect many candidate calls for AJ Trio
• Gather candidate calls from a variety of approaches
– Many technologies
• Short, linked, and long reads
• Optical and nanopore mapping
– Many approaches
• Small variant callers
• Structural variant callers
• Local and global de novo assemblies
• Community submitted >1 million calls from 30+ methods using 5+ technologies
Refine/evaluate/genotype candidates
• Obtain sequence-resolved calls as often as possible using assembly-based approaches
• Compare sequence predictions of candidate calls and merge similar calls
• Determine raw data’s support of each sequence-resolved call and its genotype
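The "compare sequence predictions and merge similar calls" step can be illustrated with a small sketch. This is not the GIAB integration pipeline: the `Candidate` record, the greedy merge against each cluster's first member, and the `max_dist` and `min_ratio` thresholds are all assumptions made for illustration, and Python's `difflib` stands in for whatever sequence comparison the real methods use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Candidate:
    """Hypothetical sequence-resolved candidate call."""
    chrom: str
    pos: int
    svtype: str   # e.g. "INS" or "DEL"
    seq: str      # inserted or deleted sequence
    method: str   # caller or technology that produced the call


def similar(a, b, max_dist=1000, min_ratio=0.8):
    """True if two calls are nearby, same type, and have similar sequence.

    max_dist and min_ratio are illustrative thresholds, not GIAB's values.
    """
    if (a.chrom, a.svtype) != (b.chrom, b.svtype):
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    ratio = SequenceMatcher(None, a.seq, b.seq, autojunk=False).ratio()
    return ratio >= min_ratio


def merge_similar(candidates, **thresholds):
    """Greedily group calls that are similar to a cluster's first member."""
    clusters = []
    for cand in sorted(candidates, key=lambda c: (c.chrom, c.pos)):
        for cluster in clusters:
            if similar(cluster[0], cand, **thresholds):
                cluster.append(cand)
                break
        else:
            clusters.append([cand])
    return clusters
```

Each resulting cluster records which methods proposed the same sequence change, which is the information needed later to weigh support for a call.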
25. Evolution of SV calls for AJ Trio
v0.2.0
• Only deletions
• Overlap and size-based clustering
• Output sites with multitech support
v0.3.0
• New calling methods
• Deletions and insertions
• Sequence-resolved calls
• Sequence-based clustering
• Output sites with multitech support
v0.4.0
• Include some single tech calls
• Evaluate read support to remove some false positives
• Add genotypes for trio
v0.5.0
• Better calling methods, especially for large insertions
• Include more single tech calls
• Add some phasing info
Future
• Resolve clusters of differing calls
• Improve phasing
• Add new data types
• Improve sequence resolution
• Collaborate with HGSVC?
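The v0.2.0 approach (overlap and size-based clustering of deletions, keeping sites with multi-technology support) can be sketched as follows. This is an illustrative reconstruction rather than the code used for v0.2.0: the `Deletion` record, the use of reciprocal overlap as a combined overlap-and-size criterion, and the 0.5 and 2-technology thresholds are assumptions.

```python
from collections import namedtuple

# Hypothetical record: 0-based, half-open deletion interval plus its technology.
Deletion = namedtuple("Deletion", "chrom start end tech")


def reciprocal_overlap(a, b):
    """Smallest fraction of either deletion covered by the other.

    A high value implies both positional overlap and similar size.
    """
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def cluster_deletions(calls, min_overlap=0.5):
    """Greedy clustering against the first member of each cluster."""
    clusters = []
    for call in sorted(calls, key=lambda d: (d.chrom, d.start, d.end)):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], call) >= min_overlap:
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters


def multitech_sites(clusters, min_techs=2):
    """Keep clusters supported by at least min_techs distinct technologies."""
    return [c for c in clusters if len({d.tech for d in c}) >= min_techs]
```

Position-only clustering of this kind cannot tell apart different insertions at the same site, which is one reason later versions moved to sequence-based clustering.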
26. Progress Update
• Initiated discussions with several groups working on phasing and calling variants in difficult-to-map regions
• Similar data and methods used for both problems
Ongoing and Future Work
• Work with several groups developing new methods
• Integrate difficult-to-map variants into high-confidence calls
• Integrate phasing into high-confidence calls
27. Progress Update
• Initiated discussions with several groups working on short tandem repeats and complex variants
• Explored using RTG vcfeval and varmatch to harmonize multiple VCFs for integration
Ongoing and Future Work
• Add STR methods into integration methods
• Test variant harmonization methods for integration
• Find collaborators for HLA and ALT loci characterization (e.g., graph-based methods, linked/long reads)
28. Progress Update
• Draft manuscript for v3.3.2 small variants
• Preliminary machine learning methods can reproduce SV genotypes from svviz
• Demo of SV manual curation web app (see next session!)
Ongoing and Future Work
• Crowd-source manual curation of SVs
• Use crowd-sourced labels for machine learning
29. Progress Update
• Using assemblies to call and refine structural variants
Ongoing and Future Work
• Need to develop integration methods for all types of somatic variants
• Need to develop methods to integrate and benchmark diploid assemblies
31. Progress Update
• GA4GH made available sophisticated, standardized tools for benchmarking small variants
• “Best practices” manuscript for small variant benchmarking
Ongoing and Future Work
• Develop new methods for structural variant benchmarking
• Develop new methods for somatic variant benchmarking
• Predict performance on clinically interesting variants
32. Benchmarking Best Practices Manuscript
• Focus on germline small variants
• Describe benchmark callsets
• Define performance metrics at different stringency levels
• Sophisticated comparison tools are important
• Stratify performance by variant type and genome context
• Tools available on GitHub and PrecisionFDA
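Stratified performance metrics of the kind listed above can be computed once true-positive, false-positive, and false-negative counts are available per stratum. The sketch below is not the GA4GH benchmarking tooling; the function names and the `(variant_type, genome_context)` key layout are assumptions, and the counts would come from a comparison tool run upstream.

```python
import math


def metrics(tp, fp, fn):
    """Precision, recall, and F1 from counts; NaN where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else math.nan
    recall = tp / (tp + fn) if (tp + fn) else math.nan
    total = precision + recall
    if math.isnan(total) or total == 0:
        f1 = math.nan
    else:
        f1 = 2 * precision * recall / total
    return {"precision": precision, "recall": recall, "f1": f1}


def stratified_metrics(counts):
    """Compute metrics per stratum.

    counts maps a stratum key, assumed here to be
    (variant_type, genome_context), to a dict with tp, fp, and fn.
    """
    return {stratum: metrics(**c) for stratum, c in counts.items()}
```

Reporting per stratum keeps a strong overall score from hiding weak performance in a small, difficult category such as indels in repetitive sequence.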
33. The road ahead...
2018
• Large variants
• Difficult small variants
• Phasing
2019
• Difficult large variants
• Somatic sample development
• Germline samples from new ancestries
2020+
• Diploid assembly
• Somatic structural variation
• Segmental duplications
• Centromere/telomere
• ...
35. Outstanding work summary
• Many variant calls cannot be assessed by comparison to current benchmark callsets (>20% of SNPs, >50% of indels, ~100% of SVs outside our high-confidence regions)
– Currently mostly assessing “easy” things
• No broadly consented tumor-normal cell lines are available
• Benchmarking tools for SVs are not standardized
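The share of calls that fall outside the high-confidence regions, and so cannot be assessed, can be estimated with a simple interval lookup. This is a minimal sketch under stated assumptions: regions are BED-style 0-based half-open intervals that do not overlap, calls are reduced to a single `(chrom, pos)` point, and all function names are hypothetical.

```python
from bisect import bisect_right


def build_index(regions):
    """Index non-overlapping (chrom, start, end) intervals by chromosome."""
    index = {}
    for chrom, start, end in sorted(regions):
        starts, ends = index.setdefault(chrom, ([], []))
        starts.append(start)
        ends.append(end)
    return index


def in_regions(index, chrom, pos):
    """True if pos lies inside one of the indexed intervals on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < ends[i]


def fraction_outside(index, calls):
    """Fraction of (chrom, pos) calls that fall outside the regions."""
    calls = list(calls)
    if not calls:
        raise ValueError("no calls given")
    outside = sum(1 for chrom, pos in calls if not in_regions(index, chrom, pos))
    return outside / len(calls)
```

Running this separately for SNPs, indels, and SVs would give per-type fractions comparable in spirit to the percentages quoted above.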
36. [Roadmap figure for 2018, 2019, and 2020, with rows for Genome Measurement Science, Germline Samples, Somatic Samples, Benchmarking, and Publications]
Milestones in the figure:
• IRB approval
• Strategy for cell line developer/distributor
• Using variant calls to benchmark assemblies
• Identify cell line developer/distributor
• Small repeats
• Difficult to map w/ phasing
• Initial large indels/SVs
• Challenging large indels/SVs
• Non-variant regions for large indels/SVs
• More difficult variant calls in all samples
• Machine learning to integrate indels/SVs
• Further automate arbitration/integration for new techs and difficult variants
• X/Y
• Complex variants
• Select samples for new ancestries
• Diploid assemblies are important part of integration
• SV comparison tools integrated into Benchmarking framework
• Benchmarking/new integrated calls
• SV Integration
• Machine learning
• Paper with HGSVC?
• Predict performance for clinical variants of interest
• Establish cell lines
• Characterize cell lines
• Develop integration methods for somatic variants
• Implement SV callers
• Ultralong read science