Comparing and Benchmarking Large Deletion Callsets
Justin Zook
NIST Genome-Scale Measurements Group
June 27, 2016
Sequencing technologies and bioinformatics pipelines disagree
O’Rawe et al. Genome Medicine 2013, 5:28
Genome in a Bottle Consortium
Whole Genome Variant Calling: Sample -> gDNA isolation -> Library Prep -> Sequencing -> Alignment/Mapping -> Variant Calling -> Confidence Estimates -> Downstream Analysis
• gDNA reference materials to evaluate performance
– Materials certified for their variants against a reference sequence, with confidence estimates
• Established consortium to develop reference materials, data, methods, performance metrics
• Characterized Pilot Genome NA12878 for small variants
– Now AJ Son also
• Ashkenazim Trio, Asian Trio from PGP in process
Candidate NIST Reference Materials
Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A
Data for GIAB PGP Trios
Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end WGS | 150x150bp; 250x250bp | ~300x/individual; ~50x/individual | on SRA/FTP | SNPs/indels/some SVs
Complete Genomics | | 100x/individual | on SRA/FTP | SNPs/indels/some SVs
SOLiD 5500W WGS | 50bp single end | 70x/son | on FTP | SNPs
Illumina Paired-end WES | 100x100bp | ~300x/individual | on SRA/FTP | SNPs/indels in exome
Ion Proton Exome | | 1000x/individual | on SRA/FTP | SNPs/indels in exome
Illumina Mate pair | ~6000 bp insert | ~30x/individual | on FTP | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | on FTP | SVs/phasing/assembly
Complete Genomics LFR | | 100x/individual | on SRA/FTP | SNPs/indels/phasing
10X | Pseudo-long reads | 30-45x/individual | on FTP | SVs/phasing/assembly
PacBio | ~10kb reads | ~70x on AJ son, ~30x on each AJ parent | on SRA/FTP | SVs/phasing/assembly/STRs
Oxford Nanopore | 5.8kb 2D reads | 0.02x on AJ son | on FTP | SVs/assembly
Nabsys 2.0 | ~100kbp N50 nanopore maps | 70x on AJ son | | SVs/assembly
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on Asian son | on FTP | SVs/assembly
Dataset availability by genome (columns: AJ Son, AJ Parents, Chinese son, Chinese parents, NA12878)
Illumina Paired-end: X X X X X
Illumina Long Mate pair: X X X X X
Illumina “moleculo”: X X X X X
Complete Genomics: X X X X X
Complete Genomics LFR: X X X
Ion exome: X X X X
BioNano: X X X X
10X: X X X
PacBio: X X X
SOLiD single end: X X X
Illumina exome: X X X X
Oxford Nanopore: X
Paper describing data…
51 authors
14 institutions
12 datasets
7 genomes
Data described in ISA-tab
Integration Methods to Establish Benchmark Small Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.
How can we extend this approach to SVs?
Similarities to small variants
• Collect callsets from multiple technologies
• Compare callsets to find calls supported by multiple technologies
Differences from small variants
• Callsets generally are not sufficiently sensitive to assume that regions without calls are homozygous reference
• Variants are often imprecisely characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly standardized, especially when complex
• Comparison tools in infancy
Callsets Contributed so far
Short reads
• Illumina
– Spiral Genetics
– cortex
– Commonlaw
– MetaSV
– Parliament/assembly
– Parliament/assembly-force
• Complete Genomics
– CG-SV
– CG-CNV
– CG-vcfBeta
Long reads and Linked reads
• PacBio
– CSHL-assembly
– Sniffles
– PBHoney-spots and -tails
– Parliament/pacbio
– Parliament/pacbio-force
– MultibreakSV
– smrt-sv.dip
– Assemblytics-Falcon and -MHAP
• Nanopore mapping
– Nabsys force calls
• Optical mapping
– BioNano with and without haplotype-aware assembly
• 10X Genomics
Step 1: Merging calls
• Process
– Find union of calls >19bp from all deletion callsets and merge any regions if within 1000 bp (results in 28460 regions)
– Annotate each merged region with fraction covered by calls from each callset
– Split out those overlapping tandem repeats longer than 200bp by >25% (2715 regions)
• Helps mitigate different representations of calls in repetitive regions and imprecision of breakpoints from many callers
• Limitations
– May not appropriately call compound heterozygous SVs
– Ignores other types of SVs in the region
– Loses genotype information
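The merge-and-annotate part of this step can be sketched as follows. This is a minimal illustration, not the code used for the deck: the tuple layout, the function names, half-open coordinates, and reading "within 1000 bp" as a gap of at most 1000 bp are all assumptions, and the tandem-repeat split is left out.

```python
from collections import defaultdict

MIN_SIZE = 20          # keep calls >19 bp, as stated on the slide
MERGE_DISTANCE = 1000  # assumed meaning of "within 1000 bp": gap <= 1000


def _coverage_fractions(region):
    """Fraction of the merged region covered by each callset's calls."""
    length = region["end"] - region["start"]
    by_callset = defaultdict(list)
    for start, end, callset in region["members"]:
        by_callset[callset].append((start, end))
    fractions = {}
    for callset, intervals in by_callset.items():
        covered, cursor = 0, region["start"]
        for start, end in sorted(intervals):
            start = max(start, cursor)  # do not double-count overlaps
            if end > start:
                covered += end - start
                cursor = end
        fractions[callset] = covered / length
    return fractions


def merge_calls(calls, min_size=MIN_SIZE, merge_distance=MERGE_DISTANCE):
    """calls: iterable of (chrom, start, end, callset), half-open coordinates.

    Returns merged regions, each annotated with per-callset coverage.
    """
    kept = sorted(c for c in calls if c[2] - c[1] >= min_size)
    regions = []
    for chrom, start, end, callset in kept:
        last = regions[-1] if regions else None
        if (last is not None and last["chrom"] == chrom
                and start - last["end"] <= merge_distance):
            # Close enough to the previous region: extend it.
            last["end"] = max(last["end"], end)
            last["members"].append((start, end, callset))
        else:
            regions.append({"chrom": chrom, "start": start, "end": end,
                            "members": [(start, end, callset)]})
    for region in regions:
        region["fraction_covered"] = _coverage_fractions(region)
    return regions
```

Because merging discards per-call genotypes and mixes all calls in a region, the sketch inherits the limitations listed above.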
Step 2: Find size prediction accuracy
• Find “size prediction accuracy” of each callset by calculating the difference from the median predicted size for regions with calls from >3 callers, and rank callers for <3kb and >3kb size ranges
Size <3kb
Spiral 0.00%
Cortex 0.24%
CGSV 0.65%
AssemblyticsFalcon 0.79%
CGvcf 1.09%
fermikit 1.28%
smrtsvdip 1.43%
MetaSV 1.57%
MultibreakSV 1.62%
PBHoneySpots 2.13%
AssemblyticsMHAP 2.21%
ParliamentAssemblyForce 2.26%
CSHLassembly 2.29%
ParliamentPacBio 2.92%
ParliamentAssembly 3.00%
Size >3kb
Spiral 0.04%
AssemblyticsFalcon 0.06%
CGSV 0.06%
CSHLassembly 0.08%
AssemblyticsMHAP 0.08%
MultibreakSV 0.10%
fermikit 0.11%
PBHoneyTails 0.38%
CommonLaw 0.48%
ParliamentPacBio 0.58%
smrtsvdip 0.62%
MetaSV 1.12%
sniffles 1.57%
Nabsys2tech01Force 3.02%
BioNano 3.67%
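One way to compute such a ranking is sketched below. The slide does not say how the per-region differences are summarized, so the median absolute relative deviation used here is an assumption, as are the function name and the input layout.

```python
from collections import defaultdict
from statistics import median


def size_accuracy(region_sizes, min_callers=4):
    """region_sizes: list of {callset: predicted_size}, one dict per region.

    Returns {callset: median of |size - region median| / region median},
    using only regions with calls from >3 callers (min_callers=4).
    The summary statistic is an assumption, not taken from the slide.
    """
    deviations = defaultdict(list)
    for sizes in region_sizes:
        if len(sizes) < min_callers:
            continue
        consensus = median(sizes.values())
        for callset, size in sizes.items():
            deviations[callset].append(abs(size - consensus) / consensus)
    return {callset: median(devs) for callset, devs in deviations.items()}


def rank_callers(accuracy):
    """Callsets ordered from smallest to largest deviation."""
    return sorted(accuracy, key=accuracy.get)
```

To mirror the two rankings above, the function would be run once on regions below 3 kb and once on regions above 3 kb.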
Step 3: Find calls supported by 2 techs
1. Find calls supported by calls from 2 or more technologies with size prediction within 20%
2. Find sensitivity of each caller to these calls in size ranges 20-50, 50-100, 100-1000, 1000-3000, and >3000 bp
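Both parts of this step can be sketched as below. The names and input layouts are hypothetical, and two details are assumptions because the slide does not define them: "within 20%" is taken relative to the larger of the two sizes, and a size equal to a bin edge goes to the upper bin.

```python
from itertools import combinations

# Size ranges from the slide, as half-open [low, high) bins in bp.
SIZE_BINS = [(20, 50), (50, 100), (100, 1000), (1000, 3000),
             (3000, float("inf"))]


def sizes_agree(a, b, tolerance=0.20):
    """True if two predicted sizes differ by at most 20% of the larger."""
    return abs(a - b) <= tolerance * max(a, b)


def supported_by_two_techs(region_calls):
    """region_calls: list of (callset, technology, predicted_size)."""
    for (_, tech_a, size_a), (_, tech_b, size_b) in combinations(region_calls, 2):
        if tech_a != tech_b and sizes_agree(size_a, size_b):
            return True
    return False


def size_bin(size):
    """Return the (low, high) bin containing size, or None if below 20 bp."""
    for low, high in SIZE_BINS:
        if low <= size < high:
            return (low, high)
    return None


def sensitivity_by_bin(benchmark_regions, callsets):
    """benchmark_regions: list of (consensus_size, set of callsets with a call).

    Returns {callset: {bin: fraction of benchmark calls in that bin found}}.
    Bins with no benchmark calls are omitted.
    """
    totals = {b: 0 for b in SIZE_BINS}
    found = {c: {b: 0 for b in SIZE_BINS} for c in callsets}
    for size, callers in benchmark_regions:
        b = size_bin(size)
        if b is None:
            continue
        totals[b] += 1
        for c in callers:
            if c in found:
                found[c][b] += 1
    return {c: {b: found[c][b] / totals[b] for b in SIZE_BINS if totals[b]}
            for c in callsets}
```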
Step 4: Filter questionable calls supported by 2+ technologies
• 316 calls covered >25% by segmental duplication >10kb
• 631 calls with at least one caller predicting a size >2x different from the consensus size
• 34 calls where the callsets missing the call come from multiple technologies and have a multiplied (1-sensitivity) < 2% in this size tranche
• 87 calls that overlap Ns in the reference
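The multiplied (1-sensitivity) filter can be read as the chance that every callset lacking the call would miss a real deletion of that size. The sketch below follows that reading and assumes the callsets miss calls independently, which the slide does not state. Function names and dictionary layouts are hypothetical.

```python
from math import prod


def miss_probability(missing_callsets, sensitivity):
    """Product of (1 - sensitivity) over the callsets that lack the call.

    sensitivity: {callset: sensitivity in this size tranche, as a fraction}.
    Treating misses as independent is an assumption.
    """
    return prod(1.0 - sensitivity[c] for c in missing_callsets)


def is_questionable(missing_callsets, technology, sensitivity,
                    threshold=0.02, min_techs=2):
    """Flag a call if callsets from multiple technologies miss it and the
    chance of all of them missing a true call is below the threshold."""
    techs = {technology[c] for c in missing_callsets}
    return (len(techs) >= min_techs
            and miss_probability(missing_callsets, sensitivity) < threshold)
```

For example, with the 100-1000bp sensitivities reported for MetaSV (75%), smrtsvdip (77%) and ParliamentPacBio (74%), a call missing from all three gives 0.25 x 0.23 x 0.26, which is about 1.5% and falls below the 2% cutoff.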
Number of Calls Supported by 2 Technologies by Size Range
<50bp 50-100bp 100-1000bp 1kb-3kb >3kb
pre-filtered 2404 1307 2288 481 600
filtered 2325 1188 1875 379 341
Sensitivity to Draft Benchmark Calls
<50bp 50-100bp 100-1000bp 1kb-3kb >3kb
AssemblyticsFalcon 0% 55% 68% 59% 45%
AssemblyticsMHAP 0% 51% 66% 56% 52%
CGvcf 86% 20% 4% 0% 0%
CGCNV 0% 0% 0% 0% 29%
CGSV 0% 0% 39% 65% 56%
CSHLassembly 0% 47% 62% 49% 42%
sniffles 7% 28% 58% 59% 64%
BioNano 0% 0% 2% 26% 37%
Spiral 85% 44% 57% 38% 40%
Cortex 39% 15% 7% 2% 0%
CommonLaw 0% 0% 8% 47% 40%
PBHoneySpots 0% 39% 63% 9% 0%
PBHoneyTails 0% 0% 0% 31% 57%
MetaSV 0% 0% 75% 74% 71%
ParliamentPacBio 0% 0% 74% 75% 48%
ParliamentAssembly 0% 0% 65% 44% 2%
MultibreakSV 16% 66% 72% 59% 47%
CNVnator 0% 0% 22% 71% 74%
ParliamentPacBioForce 1% 45% 72% 31% 18%
ParliamentAssemblyForce 0% 42% 63% 11% 2%
BionanoHaplo 0% 0% 0% 36% 49%
NabsysForce160405 0% 0% 5% 25% 28%
smrtsvdip 0% 66% 77% 65% 55%
fermikit 94% 86% 83% 59% 56%
Size distributions
Concordance between technologies (panels: All Calls; High-confidence Calls)
Support for all candidate regions (panels: # of callsets; # of technologies)
Support for benchmark calls (panels: # of callsets; # of technologies)
Possible double deletion
Clear 1kb homozygous deletion
Possible Complex SV called a deletion
Het in Son and hom ref and alt in parents
Heterozygous deletions in phased 10X reads (panels: ~3kb Heterozygous Deletion; ~5kb Heterozygous Deletion)
Global Alliance for Genomics and Health Benchmarking Task Team
• Developed standardized definitions for performance metrics like TP, FP, and FN.
• Developing sophisticated benchmarking tools
– vcfeval (Len Trigg)
– hap.py (Peter Krusche)
– vgraph (Kevin Jacobs)
• Standardized bed files with difficult genome contexts for stratification
Credit: GA4GH, Abby Beeler, Ellie Wood
Stratification of FP Rates: Higher FP rates at Tandem Repeats
Challenges in Benchmarking Small Variant Calling
• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)
• Easiest to benchmark only within high-confidence bed file, but…
• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites
• Challenges with benchmarking complex variants near boundaries of high-confidence regions
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
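For the last point, a binomial interval on a metric such as sensitivity (TP out of TP+FN) is one option. The sketch below uses the Wilson score interval. That choice is an assumption for illustration, since the slide does not name a method.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%).

    Example use: successes = TP, trials = TP + FN for sensitivity.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half
```

With 95 of 100 benchmark calls detected, the interval is roughly 0.89 to 0.98, which shows how wide the uncertainty remains for strata that contain few variants.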
Particular Challenges in Benchmarking SV Calling
• How to establish benchmark calls for difficult regions?
• How to establish non-SV regions to assess FP rates?
• Multiple dimensions of accuracy:
– Predicted SV existence
– Predicted SV type
– Predicted size
– Predicted breakpoints
– Predicted exact sequence
– Predicted genotype
Approaches to Benchmarking Variant Calling
• Well-characterized whole genome Reference Materials
• Many samples characterized in clinically relevant regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time
Acknowledgements
• NIST
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
– Hemang Parikh
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
– Liz Mansfield
• SV Callset Contributors
– CSHL/JHU
– Mt Sinai
– 10X
– Nabsys
– Spiral Genetics/Stanford
– Heng Li/Mike Lin
– DNAnexus
– Complete Genomics
– Baylor
– Bina/Roche
– BioNano Genomics
– Mark Chaisson
– NIH/NCBI
– NIH/NHGRI
– Can Alkan/Stanford
