1. genomeinabottle.org
Genome in a Bottle Consortium
GIAB/GRC Pre-ASHG Workshop
October 5, 2015
Reference Materials for Clinical Applications of
Human Genome Sequencing
Justin Zook and Marc Salit
National Institute of Standards and Technology
4. genomeinabottle.org
GIAB Scope
• The Genome in a Bottle Consortium is
developing the reference materials, reference
methods, and reference data needed to
assess confidence in human genome variant
calls.
• A principal motivation for this consortium is to
enable performance assessment of
sequencing and science-based regulatory
oversight of clinical sequencing.
5. genomeinabottle.org
Well-characterized, stable RMs
• Obtain metrics for validation,
QC, QA, PT
• Determine sources and types
of bias/error
• Learn to resolve difficult
structural variants
• Improve reference genome
assembly
• Optimization
– integration of data from
multiple platforms
– sequencing and analysis
• Enable regulated applications
Figure: Comparison of SNP calls for NA12878 on 2 platforms, 3 analysis methods
6. genomeinabottle.org
NGS Validation Process using
Genomes in Bottles
Workflow: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis
Figure labels: Pre-Analytical Process; Analytical Process; Genome in a Bottle Scope; Clinical Interpretation; GIAB Data
7. genomeinabottle.org
Genome in a Bottle Consortium (GIAB)
Hosted by US National Institute of Standards and Technology
Goal: Provide infrastructure to assess
confidence in human variant calls
• Appropriately consented widely
available DNA samples, distributed by
the Coriell Institute
– Also, QCed Reference Material (RM)
versions from controlled lots will be
available from NIST
– Also, PGP samples are commercially
available
• High-accuracy reference data for these
samples
• Tools to facilitate their use
– With the Global Alliance Data Working
Group Benchmarking Team
ga4gh.org
8. genomeinabottle.org
GIAB Selected Samples
CEPH/Utah Pedigree 1463
• NA12889, NA12890, NA12891, NA12892
• NA12877, NA12878 (NA12878 available as NIST RM8398)
• NA12879, NA12880, NA12881, NA12882, NA12883, NA12884, NA12885, NA12886, NA12887, NA12888, NA12893
Ashkenazi Jewish Trio (new; Personal Genome Project)
• NA24149, NA24143, NA24385
Asian (Han Chinese) Trio (new; Personal Genome Project)
• NA24694, NA24695, NA24631
Note: Illumina and RTG have used data from the pedigree
to improve variant calls in the specific GIAB samples.
9. genomeinabottle.org
NIST Human Genome
Reference Materials (RMs)
• NIST RM 8398 is available!
– tinyurl.com/giabpilot
– DNA isolated from large
growth cell cultures
– Stable, homogeneous
– Best for regulated uses
– DNA from same cell line at
Coriell (NA12878)
• New AJ and Asian Samples
– Available from Coriell now
– NIST RM available in 2016
11. genomeinabottle.org
Integration Methods to Establish
Reference Variant Calls for NA12878
Candidate Variants from Each Platform
Identify Concordant Variants
Identify Characteristics of Systematic Error
Arbitrate Using Evidence of Systematic Error
Exclude regions potentially biased for all short
reads (e.g., repeats, SVs)
Zook et al., Nature Biotechnology, 2014.
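The integration steps above can be illustrated with a small sketch. This is not the published NIST pipeline; it is a simplified illustration under stated assumptions: calls are held in plain dictionaries, a platform without a call at a site is treated as homozygous reference, and the names `integrate_calls`, `suspect`, and `excluded_sites` are hypothetical.

```python
def integrate_calls(calls_by_platform, suspect, excluded_sites):
    """Arbitrate genotype calls across platforms (illustrative only).

    calls_by_platform: {platform: {site: genotype}}
    suspect: {platform: set of sites with evidence of systematic error}
    excluded_sites: sites in regions potentially biased for all short reads
    Returns (high_confidence_calls, uncertain_sites).
    """
    high_conf, uncertain = {}, set()
    all_sites = set().union(*(c.keys() for c in calls_by_platform.values()))
    for site in sorted(all_sites):
        # Regions such as repeats and SVs are excluded up front.
        if site in excluded_sites:
            uncertain.add(site)
            continue
        # Assumption: a platform with no call at the site is homozygous reference.
        gts = {p: calls.get(site, "0/0") for p, calls in calls_by_platform.items()}
        # Concordant across platforms: accept the shared genotype.
        if len(set(gts.values())) == 1:
            high_conf[site] = next(iter(gts.values()))
            continue
        # Discordant: keep only platforms without evidence of systematic error.
        trusted = {gt for p, gt in gts.items() if site not in suspect.get(p, set())}
        if len(trusted) == 1:
            high_conf[site] = trusted.pop()
        else:
            uncertain.add(site)
    return high_conf, uncertain
```

A site is resolved only when the platforms agree or when the disagreement is explained by a flagged systematic error on one side; everything else stays outside the high-confidence set.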
12. genomeinabottle.org
Assigning confidence to genomic
regions for NA12878
High-confidence (77%)
• Platforms agree or we
understand the systematic
biases causing
disagreement
• At least some methods have
no evidence of systematic
errors
• Mendelian inheritance
consistent
Lower confidence (23%)
• In a region known to be
difficult for current
technologies
– Segmental Dups
– Repeats, Low Complexity
– High/Low GC
– Etc.
• Evidence of systematic error
across many platforms
• Inconsistent inheritance
Zook et al., Nature Biotechnology, 2014.
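One of the criteria above, Mendelian inheritance consistency, can be checked per site with a few lines of code. The sketch below assumes diploid autosomal genotypes written as strings such as `0/1`, ignores de novo mutations, and uses a hypothetical function name.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child genotype can be formed from one allele of each parent.

    Genotypes are strings such as '0/1' or '1|2' (phasing is ignored).
    Assumes diploid autosomal sites and no de novo mutation.
    """
    child_alleles = sorted(child.replace("|", "/").split("/"))
    mother_alleles = mother.replace("|", "/").split("/")
    father_alleles = father.replace("|", "/").split("/")
    # Try every combination of one maternal and one paternal allele.
    return any(
        sorted(pair) == child_alleles
        for pair in product(mother_alleles, father_alleles)
    )
```

Sites where this check fails in a trio or pedigree would fall into the "inconsistent inheritance" category listed under lower confidence.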
13. genomeinabottle.org
Using high-confidence NIST-GIAB
genotypes for NA12878
• NIST has released
several versions of
high-confidence genotypes
for its pilot RM
• These data are
presently being used for
benchmarking
– prior to release of RMs
– SNPs & indels
• ~77% of the genome
• Data on FTP now well-organized
14. genomeinabottle.org
GeT-RM Browser from NCBI and CDC
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of the data underlying each call
15. genomeinabottle.org
Uses of GIAB NA12878
Oncology – Molecular and Cellular Tumor Markers
“Next Generation” Sequencing (NGS) guidelines for
somatic genetic variant detection
www.bioplanet.com/gcat
16. genomeinabottle.org
Global Alliance for Genomics and Health
Benchmarking Task Team
• Formed June 2014 to develop
methods and tools for comparing
variant calls to a benchmark
• Developed standardized definitions
for performance metrics like TP, FP,
and FN.
• Initial focus on germline SNPs/indels
• Developing benchmarking tools
• Comparison engine
• Pluggable web interface with
modules for:
• Reporting/calculation of metrics
• Visualization/user interface
• Working with Genome in a Bottle
Consortium to host data and calls
from their well-characterized
genomes
www.bioplanet.com/gcat
Example User Interface
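The slide names TP, FP, and FN counts but does not spell out the summary metrics built from them. As a minimal sketch, the conventional definitions of precision, recall, and F1 are shown below; the function name is hypothetical and this is not the Benchmarking Task Team's tool.

```python
def benchmark_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from TP, FP, and FN counts.

    Returns NaN for a metric whose denominator is zero.
    """
    nan = float("nan")
    precision = tp / (tp + fp) if (tp + fp) else nan
    recall = tp / (tp + fn) if (tp + fn) else nan
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = nan
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, `benchmark_metrics(90, 10, 30)` gives a precision of 0.9 and a recall of 0.75. How a comparison engine assigns TP, FP, and FN in the first place (region, allele, or genotype match) changes these numbers, which is why standardized definitions matter.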
17. genomeinabottle.org
Global Alliance for Genomics and Health
Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
How should we interpret this complex variant on chr21?
18. genomeinabottle.org
Global Alliance for Genomics and Health
Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
Beyond simple T/F classification: Genotype errors
Truth | Callset | Description | Proposed Name(s) | CM#1 region match | CM#2 allele match | CM#3 genotype match
0/1 | 1/1 | zygosity/genotype error | GE | TP | 1TP, 1GE | FN
1/1 | 0/1 | zygosity/genotype error | GE | TP | 1TP, 1GE | FN
1/2 | 0/1, 1/1, 0/2, 2/2 | common allele, FN allele | GE_FN | TP | 1TP, 1GE, 1FN | FN
0/1 | 1/2 | common allele, FP allele | GE_FP | TP | 1TP, 1GE, 1FP | FP, FN
1/1 | 1/2 | common allele, FP allele | GE_FP | TP | 1TP, 1GE, 1FP | FP, FN
1/2 | 1/3 | common allele, FP allele, FN allele | GE_FP_FN | TP | 1TP, 1GE, 1FP, 1FN | FP, FN
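The allele-match counts and proposed names for the genotype-error cases on this slide follow a regular pattern, which the sketch below tries to capture. The rule is inferred from the rows shown here, not taken from a published specification; the function name is hypothetical, and no-calls and half-calls are deliberately rejected.

```python
def classify_genotype_error(truth, call):
    """Classify a truth/callset genotype pair such as '1/2' vs '0/1'.

    Returns (name, counts), where name is None when there is no genotype
    error with a shared alternate allele. Allele '0' is the reference.
    """
    if "." in truth or "." in call:
        raise ValueError("no-calls and half-calls are not handled in this sketch")
    truth_gt = sorted(truth.split("/"))
    call_gt = sorted(call.split("/"))
    truth_alts = set(truth_gt) - {"0"}
    call_alts = set(call_gt) - {"0"}
    shared = truth_alts & call_alts
    counts = {
        "TP": len(shared),                  # alternate alleles found in both
        "FP": len(call_alts - truth_alts),  # alternate alleles only in the callset
        "FN": len(truth_alts - call_alts),  # alternate alleles only in the truth
        "GE": int(truth_gt != call_gt and bool(shared)),
    }
    if not counts["GE"]:
        return None, counts
    name = "GE" + ("_FP" if counts["FP"] else "") + ("_FN" if counts["FN"] else "")
    return name, counts
```

For instance, a truth genotype of `1/2` called as `1/3` shares allele 1, adds allele 3, and misses allele 2, so the sketch labels it `GE_FP_FN`.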
19. genomeinabottle.org
Global Alliance for Genomics and Health
Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
Beyond simple T/F classification: no-calls and half-calls
Truth | Callset | Description | Proposed Name(s) | CM#1 region match | CM#2 allele match | CM#3 genotype match
0/1 | ./1 | half-call, TP allele | HC_TP | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE | TP
1/1 | ./1 | half-call, TP allele | HC_TP | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE | FN
0/1, 1/1 | ./0 | half-call, FN allele(s) | HC_FN | NC, NCV, TP | 1NC, 1NCV, 1FN | FN
1/2 | ./0 | half-call, FN allele(s) | HC_FN | NC, NCV, TP | 1NC, 2NCV, 2FN | FN
1/2 | ./1, ./2 | half-call, TP allele, FN allele | HC_TP_FN | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE, 1FN | FN
21. genomeinabottle.org
Public data from GIAB AJ PGP Trio
Long reads/“Linked” reads
• ~70/30/30x PacBio
– ~11kb N50
• BioNano
• 10X Genomics
• Moleculo
• Complete Genomics LFR
• Oxford Nanopore
Short reads
• 300x Illumina paired-end
• 15x Illumina 6kb mate-pair
• Complete Genomics
• SOLiD 5500W
• Ion Proton Exome
http://biorxiv.org/content/early/2015/09/15/026468
22. genomeinabottle.org
GIAB Analysis Group – New Data Sets
Leaders
• Francisco de la Vega
– Annai Systems
• Chris Mason
– Weill Cornell Medical Center
• Tina Graves
– Washington University
• Valerie Schneider
– NCBI
• and Justin and Marc
Status
• Analysis Group Responsibilities:
– https://docs.google.com/document/d/10eA0DwB4iYTSFM_LPO9_2LyyN2xEqH49OXHhtNH1uzw/edit?usp=sharing
• Analysis Milestones:
– https://docs.google.com/spreadsheets/d/1Pj4nSzH742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?usp=sharing
• Analysis Methods:
– https://docs.google.com/spreadsheets/d/1Je2g85H7oK6kMXbBOoqQ1FMNrvGnFuUJTJn7deyYiS8/edit?usp=sharing
• Analysis Plan:
– https://drive.google.com/file/d/0B7Ao1qqJJDHQdnVEaVdqbWdEdkE/view?usp=sharing
• Collecting Data and analyses on GIAB
FTP Site
• Recruiting people to help with the
work.
Goal: Establish and distribute a set of authoritative benchmark variant calls of all
types and sizes, as well as homozygous reference regions, on GIAB PGP trios
23. genomeinabottle.org
Data Release Policy: Real-time,
Open, Public Release
Individual Datasets
• Uploaded to GIAB FTP site
as they are collected
• Includes raw reads, aligned
reads, and
variant/reference calls
Integrated High-confidence Calls
• First develop SNP, indel,
and homozygous reference
calls
• Then develop SV and
non-SV calls
• Released calls are versioned
• Preliminary callsets will be
made available to be
critiqued
24. genomeinabottle.org
Analysis Progress: AJ Trio
• SNPs/indels
– Several candidate callsets
– NIST working on integration
– Plan to use 10X/Moleculo/PacBio for difficult-to-map regions
• Assembly
– 2 de novo assemblies of AJ trio (MHAP/PBcR and Falcon/Bionano)
– Will be used by at least 2 groups for SV calling
• Structural variants
– Candidate calls being generated by 15+ groups with >20 different
algorithms and 6 datasets
– 3 integration methods: Bina-MetaSV, DNAnexus/Baylor-
Parliament, NIST-svclassify
– Parliament: ~7k SVs with evidence in PacBio and Illumina
• Long-range Phasing
– 2 phased calls so far (CG LFR and 10X)
– Integration methods needed
25. genomeinabottle.org
Proposed approach to form high-confidence SV (and non-SV) calls
1. Generate candidate calls from multiple methods
2. Compare/evaluate calls using Parliament/MetaSV/svclassify/others?; manually inspect discordant calls
3. Integrate new and revised calls
4. Combine integrated calls (with heuristics and/or machine learning) to generate high-confidence calls
Timeline (dates as listed on the slide): August 30, 2015; Nov 1, 2015; Jan 1, 2016; Jan 26, 2016
26. genomeinabottle.org
Acknowledgments
• FDA – Elizabeth
Mansfield, Computing
staff
• Many members of
Genome in a Bottle
– New members
welcome!
– Sign up on website for
email newsletters
Steering Committee
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey
27. genomeinabottle.org
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis
Team google group emails
www.bioplanet.com/gcat - exome comparison tool
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - GeT-RM Browser
Data: http://biorxiv.org/content/early/2015/09/15/026468
Global Alliance Benchmarking work group
– ga4gh.org/#/benchmarking-team
Twice yearly workshop
– Winter: January 28-29, 2016 at Stanford University, California, USA
– Summer at NIST, Maryland, USA
Public Meetings
Justin Zook: jzook@nist.gov
Marc Salit: salit@nist.gov