genomeinabottle.org
Genome in a Bottle Consortium
January 2016
Stanford University, Stanford, CA
Reference Materials for Human Genome
Sequencing
Marc Salit, Ph.D. and Justin Zook, Ph.D.
National Institute of Standards and Technology
PGP Trio data described in…
August Workshop Description
Cell Systems, Volume 1, Issue 3, 176-177
GIAB Scope
• The Genome in a Bottle Consortium is
developing the reference materials, reference
methods, and reference data needed to
assess confidence in human whole genome
variant calls.
• Priority is authoritative characterization of
human genomes.
GIAB steering committee, Aug 2015
Genome in a Bottle
Consortium Development
• NIST met with sequencing
technology developers to assess
standards needs
– Stanford, June 2011
• Open, exploratory workshop
– ASHG, Montreal, Canada
– October 2011
• Small workshop at NIST to develop
consortium for human genome
reference materials
– FDA, NCBI, NHGRI, NCI, CDC, Wash
U, Broad, technology developers,
clinical labs, CAP, PGP, Partners,
ABRF, others
– developed draft work plan
– April 2012
• Open, public meetings of GIAB
– August 2012 at NIST
– March 2013 at Xgen
– August 2013 at NIST
– January 2014 at Stanford
– August 2014 at NIST
– January 2015 at Stanford
– August 2015 at NIST
– January 28-29, 2016 at Stanford
• Website
– www.genomeinabottle.org
Well-characterized, stable RMs
• Obtain metrics for
validation, QC, QA, PT
• Determine sources and
types of bias/error
• Learn to resolve difficult
structural variants
• Improve reference
genome assembly
• Optimization
• Enable regulated
applications
Comparison of SNP Calls for
NA12878 on 2 platforms, 3
analysis methods
Analytical Performance
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• Use gDNA reference
materials to benchmark
performance
• Characterized Pilot
Genome NA12878
• Ashkenazim Trio, Asian
Trio from PGP in process
• Tools to facilitate their
use
– With the Global Alliance
Data Working Group
Benchmarking Team
generic measurement process
High-confidence SNP/indel calls
Zook et al., Nature Biotechnology, 2014.
• methods to develop
SNP/indel call set
described in manuscript
• broad and quick
adoption of call set for
benchmarking
– struck nerve
NIST Released the GIAB Pilot Genome
as RM 8398 in May 2015
>150 units sold so far
NIST Human Genome
Reference Materials (RMs)
• NIST RM 8398 is available!
– tinyurl.com/giabpilot
– DNA isolated from large
growth cell cultures
– Stable, homogeneous
– Best for regulated uses
– DNA from same cell line at
Coriell (NA12878)
• New AJ and Asian Samples
– Available from Coriell now
– NIST RM available in 2016
Jan 2016 Workshop
Thursday
• Update and Roadmap
• Breakouts
– Analyses for PGP GIAB Trios
– Reference Material Selection
and Development
• Breakout reports
• Roadmap discussion
Friday
• Using GIAB Products for
technology development,
optimization, and
demonstration
– Experiences from the
consortium
• Steering committee
Steering Committee Meeting
Topics
• Future workshops
• Format
• Program committee?
• Crafting a mission statement
• Defining scope
• Liaison with other efforts
Current members
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey
Agenda
Monday
• Breakfast and registration
• Welcome and Context Setting
• NIST RM Update and Status Report
• Charge to Working Groups
• Coffee Break
• Working Group Breakout Discussions
• Lunch (provided)
• Informal Working Group Reports
• Coffee Break
• Breakout Topical Discussions
– Topic #1: Moving beyond the 'easy'
variants and regions of the genome
– Topic #2: Selecting future genomes for
Reference Materials
Tuesday
• Breakfast and registration
• Use cases: Experiences using the pilot
Reference Material
• Discussion of plans to release pilot
Reference Material
• Coffee Break
• Working Group Breakout discussions
• Lunch (provided)
• Working Group leaders present plans
and discussion
• Steering committee Overview
• First meeting of the Steering
Committee (others adjourn)
Please Note
Slides will be made available on SlideShare after
the workshop (see genomeinabottle.org).
Tweets are welcome unless the speaker requests
otherwise. Please use #giab as the hashtag.
We are liaising with…
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Genome Reference
Consortium
• 1000 Genomes SV group
• CAP/CLIA
• ABRF
• FDA
• SEQC
• Global metrology system
• Global Alliance for
Genomics and Health
Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website
NGS RM Project
Slide courtesy of Lisa Kalman, CDC
Association of Biomolecular Resource Facilities (ABRF)
www.abrf.org
Next Generation Sequencing Study
Phase 2: DNA sequencing platforms
Study Design and Launch Plan
Slides courtesy of Chris Mason
January 24, 2016
Aims
Create reference data sets - Sequence data from reference samples will be generated
with intra- and inter-lab replication to model the likely range of performance that
should be expected under normal service laboratory conditions.
Test and create reference samples - Designated reference samples will be easily
accessible to the community for self-evaluation by comparison to the reference data.
Samples should be standardized, able to be stably reproduced over time, and suitable
for development of new laboratory and bioinformatics methods.
Data release and Immediate utility - Performance metrics and data will be developed
for instrument platforms and sample preparation protocols that are deployed now or
will be in the near future in core sequencing facilities. After QC, data will be released
to the entire Genome in a Bottle (GIAB) and ABRF Consortia for use and preparation
for submission as publications.
ABRF NGS Phase II Study
Samples and Platforms – All tested in triplicate across three distinct sites
Platform                            Human DNA          Bacterial DNA
Illumina HiSeq X Ten                A, B, C, C2, C2f   -
Illumina HiSeq 4000                 A, B, C            -
Illumina HiSeq 2500 v4 1T           A, B, C            -
Illumina HiSeq 2500 v3 Rapid Run    C                  Ste, Eco, Mil, P
Illumina NextSeq 500 High Output    C                  -
Illumina MiSeq                      -                  Ste, Eco, Mil, P
Life Tech Proton                    A, B, C exomes     Ste, Eco, Mil, P
Life Tech S5                        A, B, C exomes     Ste, Eco, Mil, P
Life Tech PGM                       -                  Ste, Eco, Mil, P
Pacific Biosciences                 -                  Ste, Eco, Mil, P
Oxford Nanopore                     -                  Ste, Eco, Mil, P
Human Trio: A (maternal), B (paternal), C (son), C2 (son, Coriell)
Bacterial Isolates and Mixture: Ste, Eco, Mil, pool
Illumina (ILMN) - Samples
[Figure: sample and library layout. Human samples (Personal Genome Project, NIST Reference Human Genomes): A (maternal), B (paternal), C (son), C2 (son, Coriell), C2f, Ca. Reference bacterial genomes: Ste, Eco, Mil, pool; %GC: 28, 50, 72. Library types: Reference DNA, TruSeq PCR-free 350; TruSeq PCR-free 550; FFPE DNA, TruSeq Nano; FFPE DNA, TruSeq PCR-free; KAPA libraries from sites a-b-c.]
Organization and Leadership
Sequencing Quality Control
Phase II (SEQC2) – An
Introduction
Slides courtesy of Weida Tong, Ph.D.
Division of Bioinformatics and Biostatistics,
NCTR/FDA
SEQC2 Overview
Datasets:
• Approaches: WGS and TGS
• Platforms: HiSeq, PacBio, MiSeq, etc.
• Samples: trios, NB, cell lines, etc.
Parameters:
• Personal vs reference genome
• Bioinformatics
• Coverage
Study topics:
• Variant calling (e.g., SNV, CNV, indels): Assess WGS accuracy and reproducibility of variant calling by investigating the joint effect of read alignment pipelines, variant calling methods, and coverage, and by comparing results from a personal genome versus the reference genome.
• Short reads vs long reads: Assess short reads alone, long reads alone, and their combination for genome assembly and subsequent variant calling in WGS.
• Detection power for rare mutations: Assess the detection power of ultra-deep sequencing (TGS) for subclonal mutations and its dependence on bioinformatics and coverage.
• Detection accuracy for difficult genes: Assess accuracy for difficult genes, which varies significantly with the complexity of their genomic regions (e.g., GC regions), with a specific focus on HLA genes.
• Application scope of MiSeq: Assess the utility of MiSeq for (1) detection of subclonal mutations, (2) difficult genes (e.g., HLA), and (3) difficult variations (e.g., indels).
Three Trio Datasets
Trio study: SEQC2: HapMap Trio (European)
– Short reads: 80x
– Long reads: TBD
– Notes: Planned for both WGS and TGS; genotyping data and information from HapMap are available.
Trio study: GIAB: Trio (Ashkenazim)
– Short reads: Illumina 300x; Complete Genomics; Ion Torrent; SOLiD (WGS)
– Long reads: 69x (son), 30x (parents); BioNano; Moleculo
– Notes: This dataset was generated by the Genome in a Bottle (GIAB) consortium. We work closely with GIAB to obtain updated information on this trio, and the GIAB leaders also participate in SEQC2.
Trio study: SEQC2: Chinese Trio and test of LCL-germline
– Short reads: 100x
– Long reads: 50x
– Notes: Planned; the datasets will be provided by Dr. Leming Shi, who is part of the SEQC2 leadership team.
Candidate NIST Reference Materials
Genome                 PGP ID     Coriell ID   NIST ID   NIST RM #
CEPH Mother/Daughter   N/A        GM12878      HG001     RM8398
AJ Son                 huAA53E0   GM24385      HG002     RM8391 (son)/RM8392 (trio)
AJ Father              hu6E4515   GM24149      HG003     RM8392 (trio)
AJ Mother              hu8E87A9   GM24143      HG004     RM8392 (trio)
Asian Son              hu91BD69   GM24631      HG005     RM8393
Asian Father           huCA017E   GM24694      N/A       N/A
Asian Mother           hu38168C   GM24695      N/A       N/A
NIST Microbial Genomic DNA Reference Materials
Credit:
Nate Olson
Analysis process for Microbial RMs
Credit:
Nate Olson
GIAB Progress Update
January 2016
Dataset columns: AJ Son, AJ Parents, Chinese son, Chinese parents, NA12878
Illumina Paired-end: X X X X X
Illumina Long Mate pair: X X X X X
Illumina “moleculo”: X X X X X
Complete Genomics: X X X X X
Complete Genomics LFR: X X X
Ion exome: X X X X
BioNano: X X X X
10X: X X X
PacBio: X X X
SOLiD single end: X X X
Illumina exome: X X X X
Oxford Nanopore: X
[Figure: GIAB ftp site downloads/unique-IPs by month. Axes: Month, #IPs, and downloads.]
GIAB Analysis Group – New Data Sets
Leaders
• Francisco de la Vega
– Stanford, TOMA Biosciences
• Chris Mason
– Weill Cornell Medical Center
• Tina Graves
– Washington University
• Valerie Schneider
– NCBI
• and Justin and Marc
Strategic Documents
• Analysis Group Responsibilities:
– https://docs.google.com/document/d/10eA0DwB4iYTSFM_LPO9_2LyyN2xEqH49OXHhtNH1uzw/edit?usp=sharing
• Analysis Milestones:
– https://docs.google.com/spreadsheets/d/1Pj4nSzH742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?usp=sharing
• Analysis Methods:
– https://docs.google.com/spreadsheets/d/1Je2g85H7oK6kMXbBOoqQ1FMNrvGnFuUJTJn7deyYiS8/edit?usp=sharing
• Analysis Plan:
– https://drive.google.com/file/d/0B7Ao1qqJJDHQdnVEaVdqbWdEdkE/view?usp=sharing
• Collecting Data and analyses on GIAB
FTP Site
• Recruiting people to help with the
work.
Goal: Establish and distribute a set of authoritative benchmark variant calls of all
types and sizes, as well as homozygous reference regions, on GIAB PGP trios
GIAB Analysis Group – New Data Sets
Types of analyses
• SNPs/indels
– NIST working on integration
– 10X/moleculo/PacBio for
difficult-to-map regions
• Assembly
– 2 de novo assemblies
– Being used for SV calling
Status
• Structural variants
– Candidate calls being generated
by 15+ groups with >20 different
algorithms and 6 datasets
– 3+ integration methods
– ~monthly calls
• Long-range Phasing
– 2 phased calls so far (CG LFR and
10X)
– Integration methods needed
• Methylation analyses
SNP/Indel Integration Method Update
• Implementing refined integration methods
– Developed so others can readily reproduce results
– Consistent results for all GIAB genomes
– Simpler process taking advantage of best practices
for each technology
• Validating with released NA12878 RM data
– Preliminary comparisons show minor changes
• Application to PGP trios
– Plan to analyze AJ trio by Q2 2016
– Release of NIST RMs in Q2 2016
– Develop calls for GRCh38
Data Release: Real-time, Open,
Public Release
Individual Datasets
• Uploaded to GIAB FTP site
as it is collected
• Includes raw reads, aligned
reads, and
variant/reference calls
Integrated High-confidence Calls
• First develop SNP, indel,
and homozygous reference
calls
• Then develop SV and non-
SV calls
• Released calls are versioned
• Preliminary callsets will be
made available to be
critiqued
GIAB AJ Trio Hybrid PacBio/BioNano
Assembly
Hybrid (PacBio with BioNano) scaffold statistics:
Input           Assembly Notes                   # of Scaffolds   N50      Max      Total
HG002           Falcon                           248              22.7Mb   92.8Mb   2.38Gb
Trio            Falcon                           210              29.3Mb   87.6Mb   2.32Gb
Two Step Trio   celera (child) + falcon (trio)   187              34.3Mb   98.0Mb   2.6Gb
Credits: Ali Bashir, Jason Chin, Alex Hastie
Pendleton et al, Nature Methods, 2015
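For readers less familiar with the N50 column, the statistic can be sketched as follows. This is a minimal illustration, assuming the common definition (the scaffold length at which the running total of the sorted lengths first reaches half the assembly size). The function name `n50` and the toy lengths are hypothetical and are not taken from the assemblies above.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are sorted from longest to shortest; the N50 is the length of
    the scaffold at which the running total first reaches half of the
    summed length of all scaffolds.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one scaffold length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths, in arbitrary units).
toy_scaffolds = [50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

A larger N50 with fewer scaffolds, as in the two-step trio row, indicates a more contiguous assembly, although N50 alone says nothing about correctness.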
svclassify
Proposed approach to form high-confidence SV (and non-SV) calls
1. Generate Candidate Calls
2. Compare/evaluate calls using Parliament/MetaSV/svclassify/others?; manual inspection
3. Integrate new and revised calls; manual inspection
4. Combine integrated calls; manual inspection; targeted experimental validation?
Timeline markers: August 30, 2015; January 2016; Plan in January 2016; Feb 2016 and beyond
Deletion overlap summary for son
By # of callsets
# of callsets   # of calls
1               3780
2               1391
3               859
4               574
5+              344
By Technology
Technology            # of calls
Illumina              3277
PacBio                5177
BioNano               812
CG                    1758
Illumina/CG+PacBio    2318
Illumina/CG+BioNano   518
PacBio+BioNano        467
2+ technologies       2661
Method: converted all calls to BED; combined with bedtools multiinter; calls within 50 bp were merged.
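The merge-and-count idea behind this summary can be sketched in pure Python. This is a stand-in for illustration only, not the bedtools command that was actually run: the function names, the tuple layout, and the interpretation of "within 50 bp" as a gap of at most 50 bp between neighbouring calls are all assumptions.

```python
from collections import Counter


def merge_calls(calls, slop=50):
    """Merge calls on the same chromosome whose gap is at most `slop` bp.

    `calls` is an iterable of (chrom, start, end, callset_name) tuples.
    Returns a list of [chrom, start, end, set_of_callset_names].
    """
    merged = []
    for chrom, start, end, name in sorted(calls):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2] + slop:
            last[2] = max(last[2], end)   # extend the merged region
            last[3].add(name)             # remember the supporting callset
        else:
            merged.append([chrom, start, end, {name}])
    return merged


def support_histogram(merged, cap=5):
    """Count merged regions by number of supporting callsets (capped at `cap`)."""
    return Counter(min(len(region[3]), cap) for region in merged)


# Toy deletion calls (hypothetical coordinates and callset names).
toy_calls = [
    ("chr1", 100, 200, "illumina"),
    ("chr1", 230, 300, "pacbio"),
    ("chr1", 1000, 1100, "pacbio"),
    ("chr2", 100, 200, "bionano"),
]
regions = merge_calls(toy_calls)
print(support_histogram(regions))
```

Grouping by technology instead of by callset would only require mapping each callset name to its technology before adding it to the set.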
Preliminary Confirmation of SVs
Integration results from AJ son
Parliament: BMC Genomics, 2015, 16:286 (performed by Andrew Carroll, DNAnexus)
MetaSV: Bioinformatics, 2015, 31:2741 (performed by Marghoob Mohiyuddin, Bina/Roche)
• Parliament
– Candidates from Illumina
– Confirmed by PacBio and/or
Illumina
– ~50% in both technologies
– ~4.5k deletions, 1k insertions
– 85% of Genotypes consistent
within Trio
• MetaSV
– Multiple types of evidence
from Illumina
[Figure: Venn diagram of MetaSV and Parliament calls at 50 % reciprocal overlap. MetaSV total: 2809, split into 569 (20 %) and 2240 (80 %). Parliament total: 5467, split into 977 (18 %) and 4490 (82 %). Some overlap within Parliament calls.]
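The 50 % reciprocal overlap criterion used for this comparison can be sketched as below. This is a minimal illustration under assumptions: intervals are half-open (start, end) pairs on the same chromosome, and the function name and toy coordinates are hypothetical rather than taken from either tool.

```python
def reciprocal_overlap(a, b, min_fraction=0.5):
    """Return True if intervals a and b overlap reciprocally.

    The shared span must cover at least `min_fraction` of interval a AND
    at least `min_fraction` of interval b. Intervals are (start, end),
    half-open, and assumed to lie on the same chromosome.
    """
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return False
    return (shared >= min_fraction * (a[1] - a[0])
            and shared >= min_fraction * (b[1] - b[0]))


# Two deletions of similar size that share half their length: a match.
print(reciprocal_overlap((100, 200), (150, 250)))
# A small call inside the edge of a much larger one: not a match.
print(reciprocal_overlap((100, 200), (150, 1000)))
```

Requiring the overlap in both directions prevents a very large call from "confirming" many small calls that it merely contains.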
GeT-RM Browser from NCBI and CDC
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of the data underlying each call
Uses of GIAB NA12878
Oncology – Molecular and Cellular Tumor Markers
“Next Generation” Sequencing (NGS) guidelines for
somatic genetic variant detection
www.bioplanet.com/gcat
Global Alliance for Genomics and Health
Benchmarking Task Team
Progress
• Initial version of standardized definitions for performance metrics like TP, FP, and FN.
• Continued development of sophisticated benchmarking tools
– vcfeval – Len Trigg
– hap.py – Peter Krusche
– vgraph – Kevin Jacobs
• Standardized intermediate and final file formats
• Standardized bed files with difficult genome contexts for stratification
• Simulating reads with non-SNP ClinVar variants to demonstrate importance of these tools
• github.com/ga4gh/benchmarking-tools
Next steps
• Further analysis to demonstrate importance of sophisticated tools
• Write manuscript about the team’s tools
• Integrate vcfeval and hap.py to take advantage of strengths of each
• Recommend “Best Practices” for benchmarking
• Explore venues for making the team’s benchmarking process easier to use
Proposed Performance Metrics
Definitions
• Define TP/FP/FN/TN in 4 ways depending on
required stringency of match:
• Loose match: TP if within x-bp of a true variant
• Allele match: TP if ALT allele matches
• Genotype match: TP if genotype and ALT allele
match
• Phasing match: TP if genotype, ALT allele, and
phasing with nearby variants all match
• True negatives are difficult to define because
an infinite number of potential alleles exist
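The first three stringency levels can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Call` record, the 10-bp default window standing in for "x-bp", and the simplification that an allele match requires identical position and ALT string. Real comparison engines also reconcile different representations of the same variant, and the phasing level is omitted because it depends on neighbouring variants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical minimal variant record on a single chromosome."""
    pos: int
    alt: str
    genotype: tuple  # unphased allele indices, e.g. (0, 1)


def strictest_match(truth, query, window=10):
    """Return the strictest level at which `query` matches `truth`.

    One of "none", "loose", "allele", "genotype".
    """
    if abs(truth.pos - query.pos) > window:
        return "none"
    if truth.pos != query.pos or truth.alt != query.alt:
        return "loose"       # nearby, but not the same allele
    if sorted(truth.genotype) != sorted(query.genotype):
        return "allele"      # same ALT allele, different genotype
    return "genotype"        # ALT allele and genotype both agree


truth = Call(pos=1000, alt="A", genotype=(0, 1))
print(strictest_match(truth, Call(1000, "A", (1, 1))))
```

A call counts as a TP at a given stringency if its strictest match is at or above that level, so the same query can be a TP under loose matching and an FP under genotype matching.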
Global Alliance for Genomics and Health
Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
How should we interpret this complex variant on chr21?
GA4GH Benchmarking Tool Architecture
• Inputs: Truth VCF, Query VCF
• Comparison Engine: vcfeval / vgraph / xcmp / bcftools / ...
• VCF-I: two-column VCF with TP/FP/FN annotations
• Quantification (e.g. quantify / hap.py), using Stratification BED files and Confident Call Regions
• Outputs: VCF-R (two-column VCF with TP/FP/FN/UNK annotations) and Counts
Credit: Peter Krusche
https://github.com/ga4gh/benchmarking-tools
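The role of the confident call regions in the quantification step can be sketched with toy data. This is an illustration only and not the behaviour of hap.py or any other tool: records are plain (chrom, pos, label) tuples rather than VCF lines, regions are half-open BED-style intervals, and the rule that every record outside the confident regions becomes UNK is a simplifying assumption (real tools may treat truth and query records differently).

```python
from collections import Counter


def in_regions(chrom, pos, regions):
    """True if (chrom, pos) falls inside any half-open (chrom, start, end) region."""
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def quantify(records, confident_regions):
    """Count TP/FP/FN labels, relabelling records outside confident regions as UNK."""
    counts = Counter()
    for chrom, pos, label in records:
        if not in_regions(chrom, pos, confident_regions):
            label = "UNK"
        counts[label] += 1
    return counts


# Toy annotated records and a single confident region (hypothetical values).
toy_records = [
    ("chr21", 100, "TP"),
    ("chr21", 120, "FN"),
    ("chr21", 150, "FP"),
    ("chr21", 900, "FP"),   # outside the confident region
]
toy_confident = [("chr21", 0, 500)]
print(quantify(toy_records, toy_confident))
```

Running the same function once per stratification BED file would give per-context counts, which is the purpose of the stratification inputs in the diagram.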
Approaches to Benchmarking Variant
Calling
• Well-characterized whole genome Reference
Materials
• Many samples characterized in clinically relevant
regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over
time
Challenges in Benchmarking Small
Variant Calling
• It is difficult to do robust benchmarking of tests designed to
detect many analytes (e.g., many variants)
• Easiest to benchmark only within high-confidence bed file,
but…
• Benchmark calls/regions tend to be biased towards easier
variants and regions
– Some clinical tests are enriched for difficult sites
• Challenges with benchmarking complex variants near
boundaries of high-confidence regions
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance
metrics
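As one way to follow the last recommendation, sensitivity and precision can be reported with a binomial confidence interval. The sketch below uses the Wilson score interval, which is an assumption on our part; the slides do not name a specific interval method, and the function names and toy counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95 %)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def sensitivity_with_ci(tp, fn):
    """Sensitivity (recall) TP / (TP + FN) with its Wilson interval."""
    return tp / (tp + fn), wilson_interval(tp, tp + fn)


def precision_with_ci(tp, fp):
    """Precision TP / (TP + FP) with its Wilson interval."""
    return tp / (tp + fp), wilson_interval(tp, tp + fp)


# Toy counts for a small stratum (hypothetical): few variants give a wide interval.
print(sensitivity_with_ci(tp=18, fn=2))
```

Computing the interval per stratum makes it obvious when a difficult region contains too few benchmark variants to support a precise performance claim.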
Particular Challenges in Benchmarking
SV Calling
• How to establish benchmark calls for difficult
regions?
• How to establish non-SV regions to assess FP
rates?
• Multiple dimensions of accuracy:
– Predicted SV existence
– Predicted SV type
– Predicted size
– Predicted breakpoints
– Predicted exact sequence
Acknowledgments
• FDA – Elizabeth
Mansfield
• Many members of
Genome in a Bottle
– New members
welcome!
– Sign up on website
for email
newsletters
GIAB Steering Committee
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis
Team google group emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - GeT-RM Browser
Data: http://biorxiv.org/content/early/2015/09/15/026468
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
Twice yearly public workshops
– Winter at Stanford University, California, USA
– Summer at NIST, Maryland, USA
Justin Zook: jzook@nist.gov
Marc Salit: salit@nist.gov
GIAB Roadmap: Where are we,
Where are we going?
• Reference Materials
– Germline
– Somatic
• Informatics
– Analysis of GIAB data
– Benchmarking
• Documentary Standards/Publications
– Documentation of methods
– Supporting Use

Giab jan2016 intro and update 160128

  • 1.
    genomeinabottle.org Genome in aBottle Consortium January 2016 Stanford University, Stanford, CA Reference Materials for Human Genome Sequencing Marc Salit, Ph.D. and Justin Zook, Ph.D National Institute of Standards and Technology
  • 2.
  • 3.
  • 4.
    genomeinabottle.org August Workshop Description CellSystems , Volume 1 , Issue 3 , 176 - 177
  • 5.
    genomeinabottle.org GIAB Scope • TheGenome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls. • Priority is authoritative characterization of human genomes. GIAB steering committee, Aug 2015
  • 6.
    genomeinabottle.org Genome in aBottle Consortium Development • NIST met with sequencing technology developers to assess standards needs – Stanford, June 2011 • Open, exploratory workshop – ASHG, Montreal, Canada – October 2011 • Small workshop at NIST to develop consortium for human genome reference materials – FDA, NCBI, NHGRI, NCI, CDC, Wash U, Broad, technology developers, clinical labs, CAP, PGP, Partners, ABRF, others – developed draft work plan – April 2012 • Open, public meetings of GIAB – August 2012 at NIST – March 2013 at Xgen – August 2013 at NIST – January 2014 at Stanford – August 2014 at NIST – January 2015 at Stanford – August 2015 at NIST – January 28-29, 2016 at Stanford – • Website – www.genomeinabottle.org
  • 7.
    genomeinabottle.org Well-characterized, stable RMs •Obtain metrics for validation, QC, QA, PT • Determine sources and types of bias/error • Learn to resolve difficult structural variants • Improve reference genome assembly • Optimization • Enable regulated applications Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods
  • 8.
    Analytical Performance Sample gDNA isolation LibraryPrep Sequencing Alignment/Mapping Variant Calling Confidence Estimates Downstream Analysis • Use gDNA reference materials to benchmark performance • Characterized Pilot Genome NA12878 • Ashkenazim Trio, Asian Trio from PGP in process • Tools to facilitate their use – With the Global Alliance Data Working Group Benchmarking Team genericmeasurementprocess
  • 9.
    genomeinabottle.org High-confidence SNP/indel calls Zooket al., Nature Biotechnology, 2014. • methods to develop SNP/indel call set described in manuscript • broad and quick adoption of call set for benchmarking – struck nerve
  • 10.
    genomeinabottle.org NIST Released theGIAB Pilot Genome as RM 8398 in May 2015 >150 units sold so far
  • 11.
    genomeinabottle.org NIST Human Genome ReferenceMaterials (RMs) • NIST RM 8398 is available! – tinyurl.com/giabpilot – DNA isolated from large growth cell cultures – Stable, homogeneous – Best for regulated uses – DNA from same cell line at Coriell (NA12878) • New AJ and Asian Samples – Available from Coriell now – NIST RM available in 2016
  • 12.
    genomeinabottle.org Jan 2016 Workshop Thursday •Update and Roadmap • Breakouts – Analyses for PGP GIAB Trios – Reference Material Selection and Development • Breakout reports • Roadmap discussion Friday • Using GIAB Products for technology development, optimization, and demonstration – Experiences from the consortium • Steering committee
  • 13.
    genomeinabottle.org Steering Committee Meeting Topics •Future workshops • Format • Program committee? • Crafting a mission statement • Defining scope • Liaison with other efforts Current members – Marc Salit – Justin Zook – David Mittelman – Andrew Grupe – Michael Eberle – Steve Sherry – Deanna Church – Francisco De La Vega – Christian Olsen – Monica Basehore – Lisa Kalman – Christopher Mason – Elizabeth Mansfield – Liz Kerrigan – Leming Shi – Melvin Limson – Alexander Wait Zaranek – Nils Homer – Fiona Hyland – Steve Lincoln – Don Baldwin – Robyn Temple-Smolkin – Chunlin Xiao – Kara Norman – Luke Hickey
  • 14.
    genomeinabottle.org Agenda Monday • Breakfast andregistration • Welcome and Context Setting • NIST RM Update and Status Report • Charge to Working Groups • Coffee Break • Working Group Breakout Discussions • Lunch (provided) • Informal Working Group Reports • Coffee Break • Breakout Topical Discussions – Topic #1: Moving beyond the 'easy' variants and regions of the genome – Topic #2: Selecting future genomes for Reference Materials Tuesday • Breakfast and registration • Use cases: Experiences using the pilot Reference Material • Discussion of plans to release pilot Reference Material • Coffee Break • Working Group Breakout discussions • Lunch (provided) • Working Group leaders present plans and discussion • Steering committee Overview • First meeting of the Steering Committee (others adjourn) Please Note Slides will be made available on SlideShare after the workshop (see genomeinabottle.org). Tweets are welcome unless the speaker requests otherwise. Please use #giab as the hashtag.
  • 15.
    We are liaisingwith… • Illumina Platinum Genomes • CDC GeT-RM • Korean Genome Project • Genome Reference Consortium • 1000 Genomes SV group • CAP/CLIA • ABRF • FDA • SEQC • Global metrology system • Global Alliance for Genomics and Health Benchmarking Team • NCBI/CDC GeT-RM Browser • GCAT website
  • 16.
    NGS RM Project Slidecourtesy of Lisa Kalman, CDC
  • 17.
    Association of BiomolecularResource Facilities (ABRF) www.abrf.org Next Generation Sequencing Study Phase 2: DNA sequencing platforms Study Design and Launch Plan Slides courtesy of Chris Mason January 24, 2016
  • 18.
    Aims Create reference datasets - Sequence data from reference samples will be generated with intra- and inter-lab replication to model the likely range of performance that should be expected under normal service laboratory conditions. Test and create reference samples - Designated reference samples will be easily accessible to the community for self-evaluation by comparison to the reference data. Samples should be standardized, able to be stably reproduced over time, and suitable for development of new laboratory and bioinformatics methods. Data release and Immediate utility - Performance metrics and data will be developed for instrument platforms and sample preparation protocols that are deployed now or will be in the near future in core sequencing facilities. After QC, data will be released to the entire Genome in a Bottle (GIAB) and ABRF Consortia for use and preparation for submission as publications. ABRF NGS Phase II Study
  • 19.
    Samples and Platforms– All tested in triplicate across three distinct sites Platform Human DNA Bacterial DNA Illumina HiSeq X Ten A, B, C, C2, C2f Illumina HiSeq 4000 A, B, C Illumina HiSeq 2500 v4 1T A, B, C Illumina HiSeq 2500 v3 Rapid Run C Ste, Eco, Mil, P Illumina NextSeq 500 High Output C Illumina MiSeq Ste, Eco, Mil, P Life Tech Proton A, B, C exomes Ste, Eco, Mil, P Life Tech S5 A, B, C exomes Ste, Eco, Mil, P Life Tech PGM Ste, Eco, Mil, P Pacific Biosciences Ste, Eco, Mil, P Oxford Nanopore Ste, Eco, Mil, P maternal paternal son son (Coriell) A B C C2 Ste Eco Mil pool Human Trio Bacterial Isolates and Mixture ABRF NGS Phase II Study
  • 20.
    Reference DNA, TruSeq PCR-free350 FFPE DNA, TruSeq Nano FFPE DNA, TruSeq PCR-free KAPA libraries from sites a-b-c Ste Eco Mil pool maternal paternal son son (Coriell) %GC: 28 50 72 A B C C2 Personal Genome Project NIST Reference Human Genomes C2f Reference bacterial genomes TruSeq PCR-free 550 Ca Illumina (ILMN) - Samples ABRF NGS Phase II Study
  • 21.
  • 22.
    Sequencing Quality Control PhaseII (SEQC2) – An Introduction Slides courtesy of Weida Tong, Ph.D. Division of Bioinformatics and Biostatistics, NCTR/FDA 22
  • 23.
    Short reads vs longreads Detection power for rare mutation Detection accuracy for difficult genes Application scope of MiSeq Variants call (e.g., SNV, CNV, Indels) Assess the WGS accuracy and reproducibility for variants call by investigating the join effect of reads alignment pipelines, variants call methods and coverage as well as comparing the results from personal genome versus reference genome. Assess detection power of ultra-deep sequencing (TGS) for subclonal mutation and its dependency on bioinformatics and coverage. Assess the utility of MiSeq for (1) detection of subclonal mutation, (2) the difficult genes (e.g., HLA), and (3) the difficult variations (e.g., Indels) Assess the accuracy for some difficult genes that varies significantly due to complexity in their genomic regions (e.g. GC region) with specifically focused on HLA genes. Datasets: • Approaches: WGS and TGS • Platforms: Hiseq, PacBio, MiSeq, etc • Samples: TRIOs, NB, cell lines, etc Parameters: • Personal vs reference genome • Bioinformatics • Coverage SEQC2 Overview Assess short reads alone, long reads alone and their combination for genome assembly and subsequent variant calling in WGS.
  • 24.
    Trio Study Coverage/platform Notes Short readsLong reads SEQC2: HapMap Trio (European) 80x TBD Planned for both WGS and TGS; genotyping data and information from HapMap are available GIAB: Trio (Ashkenazim) Illumina 300x 69x (son), 30x (parents) This dataset is generated by Genome In A Bottle (GIAB) consortium. We closely work with GIAB to obtain the update information of this Trio and the GIAB leaders also participate in SEQC2. Complete Genomics BioNano Ion Torrent Moleculo SOLiD (WGS) SEQC2: Chinese Trio and test of LCL- germline 100x 50x Panned; the datasets will be provided by Dr. Leming Shi who is a part of SEQC2 leadership team. Three Trio Datasets 24
  • 25.
    Candidate NIST ReferenceMaterials Genome PGP ID Coriell ID NIST ID NIST RM # CEPH Mother/Daugh ter N/A GM12878 HG001 RM8398 AJ Son huAA53E0 GM24385 HG002 RM8391 (son)/RM8392 (trio) AJ Father hu6E4515 GM24149 HG003 RM8392 (trio) AJ Mother hu8E87A9 GM24143 HG004 RM8392 (trio) Asian Son hu91BD69 GM24631 HG005 RM8393 Asian Father huCA017E GM24694 N/A N/A Asian Mother hu38168C GM24695 N/A N/A
  • 26.
    NIST Microbial GenomicDNA Reference Materials Credit: Nate Olson
  • 27.
    Analysis process forMicrobial RMs Credit: Nate Olson
  • 28.
  • 29.
    Dataset AJ SonAJ Parents Chinese son Chinese parents NA12878 Illumina Paired- end X X X X X Illumina Long Mate pair X X X X X Illumina “moleculo” X X X X X Complete Genomics X X X X X Complete Genomics LFR X X X Ion exome X X X X BioNano X X X X 10X X X X PacBio X X X SOLiD single end X X X Illumina exome X X X X Oxford Nanopore X
  • 30.
  • 31.
    GIAB Analysis Group– New Data Sets Leaders • Francisco de la Vega – Stanford, TOMA Biosciences • Chris Mason – Weil Cornell Medical Center • Tina Graves – Washington University • Valerie Schneider – NCBI •and Justin and Marc Strategic Documents • Analysis Group Responsibilities: – https://docs.google.com/document/d/10e A0DwB4iYTSFM_LPO9_2LyyN2xEqH49OXH htNH1uzw/edit?usp=sharing • Analysis Milestones: – https://docs.google.com/spreadsheets/d/1Pj4nSz H742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?u sp=sharing • Analysis Methods – https://docs.google.com/spreadsheet s/d/1Je2g85H7oK6kMXbBOoqQ1FM NrvGnFuUJTJn7deyYiS8/edit?usp=sha ring • Analysis Plan: – https://drive.google.com/file/d/0B7Ao1qq JJDHQdnVEaVdqbWdEdkE/view?usp=shari ng • Collecting Data and analyses on GIAB FTP Site • Recruiting people to help with the work. Goal: Establish and distribute a set of authoritative benchmark variant calls of all types and sizes, as well as homozygous reference regions, on GIAB PGP trios
  • 32.
    GIAB Analysis Group– New Data Sets Types of analyses • SNPs/indels – NIST working on integration – 10X/moleculo/PacBio for difficult-to-map regions • Assembly – 2 de novo assemblies – Being used for SV calling Status • Structural variants – Candidate calls being generated by 15+ groups with >20 different algorithms and 6 datasets – 3+ integration methods – ~monthly calls • Long-range Phasing – 2 phased calls so far (CG LFR and 10X) – Integration methods needed • Methylation analyses Goal: Establish and distribute a set of authoritative benchmark variant calls of all types and sizes, as well as homozygous reference regions, on GIAB PGP trios
  • 33.
    genomeinabottle.org SNP/Indel Integration MethodUpdate • Implementing refined integration methods – Developed so others can readily reproduce results – Consistent results for all GIAB genomes – Simpler process taking advantage of best practices for each technology • Validating with released NA12878 RM data – Preliminary comparisons show minor changes • Application to PGP trios – Plan to analyze AJ trio by Q2 2016 – Release of NIST RMs in Q2 2016 – Develop calls for GRCh38
  • 34.
    Data Release: Real-time, Open, Public Release
    Individual Datasets
    • Uploaded to GIAB FTP site as they are collected
    • Include raw reads, aligned reads, and variant/reference calls
    Integrated High-confidence Calls
    • First develop SNP, indel, and homozygous reference calls
    • Then develop SV and non-SV calls
    • Released calls are versioned
    • Preliminary callsets will be made available to be critiqued
  • 35.
    GIAB AJ Trio Hybrid PacBio/BioNano Assembly
    Hybrid (PacBio with BioNano)
    Input Assembly   Notes                            # of Scaffolds   N50      Max      Total
    HG002            Falcon                           248              22.7Mb   92.8Mb   2.38Gb
    Trio             Falcon                           210              29.3Mb   87.6Mb   2.32Gb
    Two Step Trio    celera (child) + falcon (trio)   187              34.3Mb   98.0Mb   2.6Gb
    Credits: Ali Bashir, Jason Chin, Alex Hastie
    Pendleton et al, Nature Methods, 2015
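The N50 values reported above summarize scaffold contiguity. As a minimal sketch of how such a statistic is commonly computed (the function name and the toy lengths are illustrative assumptions, not GIAB code or data):

```python
def n50(lengths):
    """Return the N50: the largest length L such that scaffolds of
    length >= L together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    # Walk scaffolds from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy scaffold lengths (hypothetical, not taken from the GIAB assemblies).
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))
```

A higher N50 with fewer scaffolds, as in the two-step trio row, indicates a more contiguous assembly, though N50 alone says nothing about correctness.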
  • 36.
  • 37.
    Proposed approach to form high-confidence SV (and non-SV) calls
    • Generate candidate calls (August 30, 2015)
    • Compare/evaluate calls using Parliament/MetaSV/svclassify/others?; manual inspection (January 2016)
    • Integrate new and revised calls; manual inspection (plan in January 2016)
    • Combine integrated calls; manual inspection; targeted experimental validation? (Feb 2016 and beyond)
  • 38.
    Deletion overlap summary for son
    By # of callsets
    # of callsets   # of calls
    1               3780
    2               1391
    3               859
    4               574
    5+              344
    By technology
    Technology            # of calls
    Illumina              3277
    PacBio                5177
    BioNano               812
    CG                    1758
    Illumina/CG+PacBio    2318
    Illumina/CG+BioNano   518
    PacBio+BioNano        467
    2+ technologies       2661
    Method: converted all calls to BED; combined with bedtools multiinter; calls within 50 bp were merged
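The merging and counting described above can be illustrated in pure Python. This is a simplified stand-in, not the bedtools multiinter workflow itself: it pools calls, merges any that lie within a gap of 50 bp (the `slop` parameter, named here for illustration), and counts how many callsets support each merged region. The callset names and coordinates are hypothetical.

```python
from collections import Counter


def merge_calls(callsets, slop=50):
    """Pool calls from several callsets and merge calls on the same
    chromosome that overlap or lie within `slop` bp of each other.

    callsets: dict mapping callset name -> list of (chrom, start, end).
    Returns a list of (chrom, start, end, set_of_supporting_callsets).
    """
    pooled = sorted(
        (chrom, start, end, name)
        for name, calls in callsets.items()
        for chrom, start, end in calls
    )
    merged = []
    for chrom, start, end, name in pooled:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2] + slop:
            # Extend the current region; merging is transitive (chained).
            last[2] = max(last[2], end)
            last[3].add(name)
        else:
            merged.append([chrom, start, end, {name}])
    return [tuple(region) for region in merged]


def support_histogram(merged):
    """Count merged regions by the number of callsets supporting them."""
    return Counter(len(names) for _, _, _, names in merged)


# Hypothetical toy callsets, not GIAB data.
toy = {
    "illumina": [("chr1", 100, 500), ("chr1", 5000, 5300)],
    "pacbio": [("chr1", 520, 900), ("chr2", 100, 400)],
    "bionano": [("chr1", 4990, 5400)],
}
regions = merge_calls(toy)
print(support_histogram(regions))
```

Because merging is chained, a long run of nearby calls collapses into one region; a production analysis would likely also want to compare sizes before treating two calls as the same event.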
  • 39.
    Preliminary Confirmation of SVs
    Integration results from AJ son
    • Parliament: BMC Genomics, 2015, 16:286 (performed by Andrew Carroll, DNAnexus)
      – Candidates from Illumina
      – Confirmed by PacBio and/or Illumina
      – ~50% in both technologies
      – ~4.5k deletions, 1k insertions
      – 85% of genotypes consistent within trio
    • MetaSV: Bioinformatics, 2015, 31:2741 (performed by Marghoob Mohiyuddin, Bina/Roche)
      – Multiple types of evidence from Illumina
    Comparison of the two callsets (Venn diagram, 50% reciprocal overlap)
    • MetaSV total: 2809, split 569 (20%) / 2240 (80%)
    • Parliament total: 5467, split 977 (18%) / 4490 (82%)
    • Some overlap within Parliament calls
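The 50% reciprocal overlap criterion used above can be stated precisely: two calls match only if their shared span covers at least half of each call. A minimal sketch, assuming half-open (start, end) coordinates on the same chromosome (the function name is illustrative):

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    """Return True if intervals a and b, given as half-open (start, end)
    pairs, overlap by at least `min_frac` of the length of EACH interval."""
    a_start, a_end = a
    b_start, b_end = b
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False
    return (overlap >= min_frac * (a_end - a_start)
            and overlap >= min_frac * (b_end - b_start))


# Two 100 bp calls sharing 50 bp pass; a small call inside a much
# larger one fails because the shared span is a small part of the larger call.
print(reciprocal_overlap((100, 200), (150, 250)))
print(reciprocal_overlap((100, 200), (150, 1000)))
```

Requiring the overlap in both directions keeps a very large call from "confirming" every small call it happens to contain.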
  • 40.
    GeT-RM Browser from NCBI and CDC
    • http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
    • Allows visualization of the data underlying each call
  • 41.
    Uses of GIAB NA12878
    • Oncology – Molecular and Cellular Tumor Markers: “Next Generation” Sequencing (NGS) guidelines for somatic genetic variant detection
    • www.bioplanet.com/gcat
  • 42.
    Global Alliance for Genomics and Health Benchmarking Task Team
    Progress
    • Initial version of standardized definitions for performance metrics like TP, FP, and FN
    • Continued development of sophisticated benchmarking tools
      – vcfeval – Len Trigg
      – hap.py – Peter Krusche
      – vgraph – Kevin Jacobs
    • Standardized intermediate and final file formats
    • Standardized bed files with difficult genome contexts for stratification
    • Simulating reads with non-SNP ClinVar variants to demonstrate importance of these tools
    • github.com/ga4gh/benchmarking-tools
    Next steps
    • Further analysis to demonstrate importance of sophisticated tools
    • Write manuscript about the team’s tools
    • Integrate vcfeval and hap.py to take advantage of strengths of each
    • Recommend “Best Practices” for benchmarking
    • Explore venues for making the team’s benchmarking process easier to use
  • 43.
    Proposed Performance Metrics Definitions
    • Define TP/FP/FN/TN in 4 ways depending on required stringency of match:
      – Loose match: TP if within x-bp of a true variant
      – Allele match: TP if ALT allele matches
      – Genotype match: TP if genotype and ALT allele match
      – Phasing match: TP if genotype, ALT allele, and phasing with nearby variants all match
    • True negatives are difficult to define because an infinite number of potential alleles exist
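The four stringency levels above can be sketched as a function that reports the strictest level at which a query call matches a truth call. This is a deliberately simplified illustration, not how vcfeval or hap.py work: it assumes both calls are already normalized to the same representation, it uses a hypothetical default of 10 for the "x-bp" window, and it approximates the phasing match by comparing haplotype order on a single variant rather than phasing relative to nearby variants.

```python
from collections import namedtuple

# gt holds allele indices in haplotype order, e.g. (0, 1); phased marks
# whether that order is meaningful. All field names are illustrative.
Call = namedtuple("Call", "chrom pos ref alt gt phased", defaults=(False,))


def match_level(truth, query, window=10):
    """Return 'phasing', 'genotype', 'allele', 'loose', or None, i.e. the
    strictest level at which `query` counts as a TP against `truth`."""
    if truth.chrom != query.chrom or abs(truth.pos - query.pos) > window:
        return None
    same_allele = (truth.pos, truth.ref, truth.alt) == (query.pos, query.ref, query.alt)
    if not same_allele:
        return "loose"
    if sorted(truth.gt) != sorted(query.gt):
        return "allele"
    # Simplification: assumes both callsets orient haplotypes the same way.
    if truth.phased and query.phased and truth.gt == query.gt:
        return "phasing"
    return "genotype"


truth = Call("chr1", 100, "A", "G", (0, 1), True)
print(match_level(truth, Call("chr1", 100, "A", "G", (1, 1))))
```

The same query can therefore be a TP under a loose definition and an FP/FN pair under a stricter one, which is why reported metrics should state the match level used.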
  • 44.
    Global Alliance for Genomics and Health Benchmarking Task Team
    How should we interpret this complex variant on chr21?
    Credit: Rebecca Truty, Complete Genomics
  • 45.
    GA4GH Benchmarking Tool Architecture
    • Inputs: Truth VCF, Query VCF
    • Comparison Engine: vcfeval / vgraph / xcmp / bcftools / ...
    • VCF-I: two-column VCF with TP/FP/FN annotations
    • Quantification: e.g. quantify / hap.py
      – Additional inputs: Stratification BED files, Confident Call Regions
    • Outputs: VCF-R (two-column VCF with TP/FP/FN/UNK annotations), Counts
    Credit: Peter Krusche
    https://github.com/ga4gh/benchmarking-tools
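The quantification stage above turns per-variant TP/FP/FN annotations into counts per stratum. The following is a rough sketch of that idea only; it does not reproduce the hap.py or quantify implementations, it flattens the two-column truth/query annotation into a single label per record, and all names and intervals are hypothetical.

```python
from collections import Counter


def in_regions(chrom, pos, regions):
    """True if 0-based `pos` falls in any half-open (chrom, start, end)
    BED-style interval in `regions`."""
    return any(c == chrom and s <= pos < e for c, s, e in regions)


def quantify(records, strata):
    """Count labels per stratum and derive recall and precision.

    records: iterable of (chrom, pos, label), label in {'TP', 'FP', 'FN'}.
    strata: dict mapping stratum name -> list of (chrom, start, end).
    """
    counts = {name: Counter() for name in strata}
    for chrom, pos, label in records:
        for name, regions in strata.items():
            if in_regions(chrom, pos, regions):
                counts[name][label] += 1
    summary = {}
    for name, c in counts.items():
        tp, fp, fn = c["TP"], c["FP"], c["FN"]
        summary[name] = {
            "TP": tp, "FP": fp, "FN": fn,
            # None marks a metric that is undefined for an empty stratum.
            "recall": tp / (tp + fn) if tp + fn else None,
            "precision": tp / (tp + fp) if tp + fp else None,
        }
    return summary


# Hypothetical annotated records and strata.
records = [("chr1", 10, "TP"), ("chr1", 20, "TP"), ("chr1", 30, "FN"),
           ("chr1", 150, "FP"), ("chr1", 160, "TP")]
strata = {"easy": [("chr1", 0, 100)], "hard": [("chr1", 100, 200)]}
for name, metrics in quantify(records, strata).items():
    print(name, metrics)
```

The linear scan over regions is fine for a toy example; whole-genome stratification would need an interval index.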
  • 46.
    Approaches to Benchmarking Variant Calling
    • Well-characterized whole genome Reference Materials
    • Many samples characterized in clinically relevant regions
    • Synthetic DNA spike-ins
    • Cell lines with engineered mutations
    • Simulated reads
    • Modified real reads
    • Modified reference genomes
    • Confirming results found in real samples over time
  • 47.
    Challenges in Benchmarking Small Variant Calling
    • It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)
    • Easiest to benchmark only within high-confidence bed file, but…
    • Benchmark calls/regions tend to be biased towards easier variants and regions
      – Some clinical tests are enriched for difficult sites
    • Challenges with benchmarking complex variants near boundaries of high-confidence regions
    • Always manually inspect a subset of FPs/FNs
    • Stratification by variant type and region is important
    • Always calculate confidence intervals on performance metrics
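The advice above to calculate confidence intervals on performance metrics can be made concrete. The slides do not name a method; one common choice for a proportion such as recall or precision is the Wilson score interval, sketched here with illustrative counts:

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    Returns (lower, upper), clipped to [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical stratum: 98 of 100 truth variants detected (recall 0.98).
low, high = wilson_interval(98, 100)
print(f"recall 0.98, 95% CI ({low:.3f}, {high:.3f})")
```

With only 100 variants in a stratum the interval is several percentage points wide, which is the practical reason small or difficult strata need cautious interpretation.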
  • 48.
    Particular Challenges in Benchmarking SV Calling
    • How to establish benchmark calls for difficult regions?
    • How to establish non-SV regions to assess FP rates?
    • Multiple dimensions of accuracy:
      – Predicted SV existence
      – Predicted SV type
      – Predicted size
      – Predicted breakpoints
      – Predicted exact sequence
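The dimensions of accuracy listed above can be scored separately rather than collapsed into a single TP/FP label. A minimal sketch for interval-type SVs such as deletions, assuming dictionary records with hypothetical keys and tolerances (`bp_tol`, `size_tol`) chosen only for illustration:

```python
def compare_sv(truth, query, bp_tol=50, size_tol=0.1):
    """Score a query SV against a truth SV on each accuracy dimension.

    truth/query: dicts with keys chrom, start, end, svtype and optional seq.
    Returns a dict of booleans, one per dimension.
    """
    same_chrom = truth["chrom"] == query["chrom"]
    overlaps = (same_chrom
                and query["start"] < truth["end"]
                and truth["start"] < query["end"])
    t_size = truth["end"] - truth["start"]
    q_size = query["end"] - query["start"]
    return {
        "existence": overlaps,
        "type": overlaps and truth["svtype"] == query["svtype"],
        "size": overlaps and abs(q_size - t_size) <= size_tol * t_size,
        "breakpoints": (overlaps
                        and abs(query["start"] - truth["start"]) <= bp_tol
                        and abs(query["end"] - truth["end"]) <= bp_tol),
        # Exact sequence can only be checked when the truth record has one.
        "sequence": (overlaps
                     and truth.get("seq") is not None
                     and truth.get("seq") == query.get("seq")),
    }


truth = {"chrom": "chr1", "start": 1000, "end": 2000, "svtype": "DEL"}
query = {"chrom": "chr1", "start": 1020, "end": 1990, "svtype": "DEL"}
print(compare_sv(truth, query))
```

Insertions, which occupy a single reference position, would need a different existence test than the interval overlap used here.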
  • 49.
    Acknowledgments
    • FDA – Elizabeth Mansfield
    • Many members of Genome in a Bottle
      – New members welcome!
      – Sign up on website for email newsletters
    GIAB Steering Committee
      – Marc Salit
      – Justin Zook
      – David Mittelman
      – Andrew Grupe
      – Michael Eberle
      – Steve Sherry
      – Deanna Church
      – Francisco De La Vega
      – Christian Olsen
      – Monica Basehore
      – Lisa Kalman
      – Christopher Mason
      – Elizabeth Mansfield
      – Liz Kerrigan
      – Leming Shi
      – Melvin Limson
      – Alexander Wait Zaranek
      – Nils Homer
      – Fiona Hyland
      – Steve Lincoln
      – Don Baldwin
      – Robyn Temple-Smolkin
      – Chunlin Xiao
      – Kara Norman
      – Luke Hickey
  • 50.
    For More Information
    • www.genomeinabottle.org – sign up for general GIAB and Analysis Team Google group emails
    • github.com/genome-in-a-bottle – Guide to GIAB data & ftp
    • www.slideshare.net/genomeinabottle
    • www.ncbi.nlm.nih.gov/variation/tools/get-rm/ – GeT-RM Browser
    • Data: http://biorxiv.org/content/early/2015/09/15/026468
    • Global Alliance Benchmarking Team – https://github.com/ga4gh/benchmarking-tools
    • Twice yearly public workshops
      – Winter at Stanford University, California, USA
      – Summer at NIST, Maryland, USA
    • Justin Zook: jzook@nist.gov
    • Marc Salit: salit@nist.gov
  • 51.
    GIAB Roadmap: Where are we, Where are we going?
    • Reference Materials
      – Germline
      – Somatic
    • Informatics
      – Analysis of GIAB data
      – Benchmarking
    • Documentary Standards/Publications
      – Documentation of methods
      – Supporting use