Genome in a Bottle: You’ve
sequenced. How well did you do?
October 9, 2015
Justin Zook, Marc Salit, and the
Genome in a Bottle Consortium
*Nothing to Disclose
Sequencing technologies and
bioinformatics pipelines disagree
O’Rawe et al. Genome Medicine 2013, 5:28
Who is right?
Is anyone right?
Genome in a Bottle Consortium (GIAB)
Hosted by US National Institute of Standards and Technology
Goal: Provide infrastructure to assess
confidence in human variant calls
• Appropriately consented widely
available DNA samples, distributed by
the Coriell Institute
– Also, QCed Reference Material (RM)
versions from controlled lots will be
available from NIST
– Also, PGP samples are commercially
available
• High-accuracy reference data for these
samples
• Tools to facilitate their use
– With the Global Alliance Data Working
Group Benchmarking Team
Global Alliance for Genomics and Health
ga4gh.org
Genome in a Bottle
genomeinabottle.org
GIAB Selected Samples
[Figure: pedigree diagrams; ✔ marks flag the selected samples]
• CEPH/Utah Pedigree 1463: NA12877, NA12878, NA12879, NA12880, NA12881, NA12882, NA12883, NA12884, NA12885, NA12886, NA12887, NA12888, NA12889, NA12890, NA12891, NA12892, NA12893
– NA12878 available as NIST RM8398
• Ashkenazi Jewish Trio (new, Personal Genome Project): NA24149, NA24143, NA24385
• Asian (Han Chinese) Trio (new, Personal Genome Project): NA24694, NA24695, NA24631
Note: Illumina and RTG have used data from the pedigree to improve variant calls in the specific GIAB samples.
NGS Validation Process using
Genomes in Bottles
[Figure: workflow diagram]
Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis
Spans labeled on the diagram: Pre-Analytical Process, Analytical Process, Genome in a Bottle Scope, Clinical Interpretation, GIAB Data
Pilot Genome: NA12878
Integrated 14 datasets from 5 platforms
to establish Reference SNP/indel Calls for
NA12878
Zook et al., Nature Biotechnology, 2014.
~77 % High-confidence
~23 % Uncertain
Uses of GIAB NA12878
Oncology – Molecular and Cellular Tumor Markers
“Next Generation” Sequencing (NGS) guidelines for
somatic genetic variant detection
www.bioplanet.com/gcat
GeT-RM Browser from NCBI and CDC
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
Global Alliance for Genomics and Health
Benchmarking Task Team
• Developed standardized
definitions for
performance metrics like
TP, FP, and FN.
• Developing sophisticated
benchmarking tools
• vcfeval – Len Trigg
• hap.py – Peter Krusche
• vgraph – Kevin Jacobs
• Standardized bed files
with difficult genome
contexts for stratification
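As a rough illustration of how TP, FP, and FN counts turn into performance metrics, the sketch below computes precision, recall, and F1. The function name and the example counts are hypothetical. The benchmarking team's standardized definitions also cover details such as genotype matching and variant representation, which are not modeled here.

```python
def performance_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall (sensitivity), and F1 from match counts.

    tp, fp, fn are counts of true positives, false positives, and
    false negatives from comparing a call set against a benchmark.
    """
    nan = float("nan")
    precision = tp / (tp + fp) if (tp + fp) else nan
    recall = tp / (tp + fn) if (tp + fn) else nan
    # F1 is the harmonic mean of precision and recall.
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = nan
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = performance_metrics(tp=90, fp=10, fn=30)
print(metrics)
```

Counts like these are only meaningful when query and benchmark calls are compared within the same regions and with the same matching rules.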
Credit: GA4GH, Abby Beeler, Ellie Wood
Stratification of FP Rates
Higher FP rates at Tandem Repeats
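A minimal sketch of stratification, assuming false positives are given as (chromosome, 0-based position) pairs and the difficult regions as BED-style half-open intervals that do not overlap one another. All names and coordinates below are made up for illustration; this is not how vcfeval, hap.py, or vgraph implement it.

```python
from bisect import bisect_right


def build_index(intervals):
    """Group BED-style (chrom, start, end) intervals by chromosome, sorted by start."""
    index = {}
    for chrom, start, end in sorted(intervals):
        index.setdefault(chrom, []).append((start, end))
    return index


def in_regions(index, chrom, pos0):
    """Return True if 0-based pos0 lies in a half-open [start, end) interval."""
    ivs = index.get(chrom, [])
    # Find the last interval whose start is <= pos0.
    i = bisect_right(ivs, (pos0, float("inf"))) - 1
    return i >= 0 and ivs[i][0] <= pos0 < ivs[i][1]


def stratify_fps(fp_sites, index):
    """Count false positives inside and outside the stratification regions."""
    counts = {"inside": 0, "outside": 0}
    for chrom, pos0 in fp_sites:
        key = "inside" if in_regions(index, chrom, pos0) else "outside"
        counts[key] += 1
    return counts


# Hypothetical tandem-repeat regions and FP sites.
tandem_repeats = build_index([("chr1", 100, 200), ("chr1", 500, 650)])
fps = [("chr1", 150), ("chr1", 200), ("chr1", 600), ("chr2", 10)]
print(stratify_fps(fps, tandem_repeats))
```

To turn the counts into rates, divide each by the number of bases (or benchmark variants) in the corresponding stratum.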
New GIAB Trios
from Personal Genome Project
Public, unembargoed data from GIAB
AJ PGP Trio
Long reads/“Linked” reads
• ~70/30/30x PacBio
– ~11kb N50
• ~100x BioNano
• ~30x 10X Genomics
• ~20x Moleculo
• Complete Genomics LFR
• ~0.005x Oxford Nanopore
Short reads
• 300x Illumina paired-end
• 15x Illumina 6kb mate-pair
• 100x Complete Genomics
• 60x SOLiD 5500W
• 1000x Ion Proton Exome
http://biorxiv.org/content/early/2015/09/15/026468
GIAB Analysis Group – New Data Sets
Leaders
• Francisco de la Vega
• Chris Mason
• Tina Graves
• Valerie Schneider
• Justin Zook
• Marc Salit
Status
• Analysis Group Responsibilities:
– https://docs.google.com/document/d/10eA0DwB4iYTSFM_LPO9_2LyyN2xEqH49OXHhtNH1uzw/edit?usp=sharing
• Analysis Milestones:
– https://docs.google.com/spreadsheets/d/1Pj4nSzH742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?usp=sharing
• Analysis Methods:
– https://docs.google.com/spreadsheets/d/1Je2g85H7oK6kMXbBOoqQ1FMNrvGnFuUJTJn7deyYiS8/edit?usp=sharing
• Analysis Plan:
– https://drive.google.com/file/d/0B7Ao1qqJJDHQdnVEaVdqbWdEdkE/view?usp=sharing
• Collecting data and analyses on GIAB FTP site
• Recruiting people to help with the work
Goal: Establish and distribute a set of authoritative benchmark variant calls of all
types and sizes, as well as homozygous reference regions, on GIAB PGP trios
Analysis Progress: AJ Trio
• SNPs/indels
– NIST working on integration
– 10X/Moleculo/PacBio for difficult-to-map regions
• Assembly
– 2 de novo assemblies
– Useful for SV calling
• Structural variants
– Candidate calls being generated by 15+ groups with >20
different algorithms and 6 datasets
– 3+ integration methods
• Long-range Phasing
– 2 phased calls so far (CG LFR and 10X)
– Integration methods needed
• Other analyses
– CpG methylation with PacBio and Illumina
GIAB AJ Trio PacBio-only Assemblies
PacBio Only
Input        Algorithm                      # of Contigs  N50     Max     Total
Child        MHAP/Celera (Phillippy Lab)    13,048        4.5Mb   35.1Mb  3.0Gb
Child        Daligner/Falcon (Chin/Bashir)  9,973         7.1Mb   39.2Mb  3.0Gb
Mother       MHAP/Celera (Phillippy Lab)    23,493        1.03Mb  8.9Mb   3.0Gb
Father       MHAP/Celera (Phillippy Lab)    16,326        0.91Mb  9.8Mb   3.0Gb
Merged Trio  Daligner/Falcon (Chin/Bashir)  5,680         9.25Mb  50.3Mb  2.9Gb
Credits: Ali Bashir, Jason Chin, Adam Phillippy, and Serge Koren
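For readers unfamiliar with the N50 column, here is a small sketch of the usual definition: the contig length at which contigs of that length or longer cover at least half of the total assembly. The contig lengths in the example are made up and are not taken from the assemblies above.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths, for illustration only.
print(n50([10, 8, 6, 4, 2]))  # total is 30; 10 + 8 = 18 covers half, so N50 is 8
```

N50 rewards contiguity but says nothing about correctness, which is why it is reported alongside contig count, maximum length, and total size.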
GIAB AJ Trio Hybrid PacBio/BioNano Assembly
Hybrid (PacBio with BioNano)
Input Assembly  Notes                            # of Scaffolds  N50     Max     Total
HG002           Falcon                           248             22.7Mb  92.8Mb  2.38Gb
Trio            Falcon                           210             29.3Mb  87.6Mb  2.32Gb
Two Step Trio   Celera (child) + Falcon (trio)   187             34.3Mb  98.0Mb  2.6Gb
Credits: Ali Bashir, Jason Chin, Alex Hastie
Pendleton et al, Nature Methods, 2015
Proposed approach to form high-confidence SV (and non-SV) calls
• August 30, 2015: Generate candidate calls
• Nov 1, 2015: Compare/evaluate calls using Parliament/MetaSV/svclassify/others?; manual inspection
• Jan 1, 2016: Integrate new and revised calls; manual inspection
• Jan 26, 2016 and beyond: Combine integrated calls; manual inspection; targeted experimental validation?
Very Preliminary Confirmation of SVs
Integration results from AJ son
Parliament: BMC Genomics, 2015, 16:286 (performed by Andrew Carroll, DNAnexus)
MetaSV: Bioinformatics, 2015, 31:2741 (performed by Marghoob Mohiyuddin, Bina/Roche)
• Parliament
– Candidates from Illumina
– Confirmed by PacBio and/or
Illumina
– ~50% in both technologies
– ~4.5k deletions, 1k insertions
– 85% of Genotypes consistent
within Trio
• MetaSV
– Multiple types of evidence
from Illumina
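One simple way to check whether genotypes are consistent within a trio is a Mendelian test: the child must be able to inherit one allele from each parent. The sketch below assumes diploid autosomal genotypes written as pairs of allele indices and uses made-up data. It is not necessarily how the 85% figure above was computed.

```python
def mendelian_consistent(child, father, mother):
    """Return True if the child's two alleles can come one from each parent.

    Genotypes are pairs of allele indices, e.g. (0, 1) for heterozygous.
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def fraction_consistent(trio_genotypes):
    """Fraction of (child, father, mother) genotype triples that are consistent."""
    if not trio_genotypes:
        raise ValueError("no genotypes given")
    ok = sum(mendelian_consistent(c, f, m) for c, f, m in trio_genotypes)
    return ok / len(trio_genotypes)


# Hypothetical genotypes at four sites: (child, father, mother).
sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent
    ((1, 1), (0, 1), (0, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # inconsistent: father has no alternate allele
    ((0, 0), (0, 1), (0, 0)),  # consistent
]
print(fraction_consistent(sites))
```

Inconsistent sites point to a genotyping error in at least one family member, or occasionally to a de novo event, so they are good candidates for manual inspection.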
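[Figure: Venn diagram comparing MetaSV and Parliament calls, matched at 50% reciprocal overlap]
• MetaSV total: 2809
– 569 (20%) overlapping Parliament
– 2240 (80%) MetaSV only
• Parliament total: 5467
– 977 (18%) overlapping MetaSV
– 4490 (82%) Parliament only
• Some overlap within Parliament calls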
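The 50% reciprocal overlap criterion can be sketched as follows: two calls match only if their shared span covers at least half of each call. The sketch assumes both calls are on the same chromosome and uses half-open coordinates. The function name and example intervals are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end, min_frac=0.5):
    """Return True if intervals A and B overlap by at least min_frac of each.

    Coordinates are half-open [start, end) on the same chromosome.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False
    return (overlap >= min_frac * (a_end - a_start)
            and overlap >= min_frac * (b_end - b_start))


# Hypothetical deletion calls.
print(reciprocal_overlap(100, 200, 150, 250))  # 50 bp shared, half of each call
print(reciprocal_overlap(100, 200, 150, 400))  # 50 bp shared, only 20% of the second call
```

Requiring the overlap in both directions stops a small call from matching a much larger call that merely contains it.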
New GIAB GitHub Site
github.com/genome-in-a-bottle
Credit: Chunlin Xiao, NCBI
WARNINGS
• Easiest to benchmark only within high-confidence bed file
• Benchmark calls/regions tend to be biased
towards easier variants and regions
– Some clinical tests are enriched for difficult sites
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is
important
• Always calculate confidence intervals
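As one possible way to attach a confidence interval to a sensitivity or precision estimate, the sketch below computes a Wilson score interval for a binomial proportion. The choice of the Wilson interval is an assumption made here for illustration (the slides do not name a method), and the counts are made up.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: 95 of 100 benchmark variants detected.
low, high = wilson_interval(95, 100)
print(f"sensitivity 0.95, approx. 95% CI {low:.3f} to {high:.3f}")
```

With only a handful of variants in a stratum the interval becomes very wide, which is a useful reminder not to over-interpret a point estimate from a small set of difficult sites.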
Acknowledgments
• FDA – Elizabeth Mansfield, Computing staff
• Many members of Genome in a Bottle
– New members welcome!
– Sign up on website for email newsletters
Steering Committee
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey
For More Information
www.genomeinabottle.org – sign up for general GIAB and Analysis Team Google Group emails
github.com/genome-in-a-bottle – Guide to GIAB data & FTP
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ – GeT-RM Browser
Data: http://biorxiv.org/content/early/2015/09/15/026468
Global Alliance Benchmarking Team
– ga4gh.org/#/benchmarking-team
Twice yearly workshop
– Winter: January 28-29, 2016 at Stanford University, California, USA
– Summer at NIST, Maryland, USA
Public Meetings!
Justin Zook: jzook@nist.gov
Marc Salit: salit@nist.gov
Contribute calls or
critically evaluate
GIAB calls!
NIST/NRC Postdoc
Opportunities available!

ASHG 2015: Genome in a Bottle
