Sept2016 sv nist_intro

SV Data Jamboree
Justin Zook and Ali Bashir
With the Genome in a Bottle Consortium
September 15, 2016

Sequencing technologies and bioinformatics pipelines disagree
O’Rawe et al. Genome Medicine 2013, 5:28

Candidate NIST Reference Materials
Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A

Data for GIAB PGP Trios
Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end WGS | 150x150bp; 250x250bp | ~300x/individual; ~50x/individual | on SRA/FTP | SNPs/indels/some SVs
Complete Genomics | - | 100x/individual | on SRA/FTP | SNPs/indels/some SVs
SOLiD 5500W WGS | 50bp single end | 70x/son | on FTP | SNPs
Illumina Paired-end WES | 100x100bp | ~300x/individual | on SRA/FTP | SNPs/indels in exome
Ion Proton Exome | - | 1000x/individual | on SRA/FTP | SNPs/indels in exome
Illumina Mate pair | ~6000 bp insert | ~30x/individual | on FTP | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | on FTP | SVs/phasing/assembly
Complete Genomics LFR | - | 100x/individual | on SRA/FTP | SNPs/indels/phasing
10X | Pseudo-long reads | 30-45x/individual | on FTP | SVs/phasing/assembly
PacBio | ~10kb reads | ~70x on AJ son, ~30x on each AJ parent | on SRA/FTP | SVs/phasing/assembly/STRs
Oxford Nanopore | 5.8kb 2D reads | 0.02x on AJ son | on FTP | SVs/assembly
Nabsys 2.0 | ~100kbp N50 nanopore maps | 70x on AJ son | - | SVs/assembly
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on Asian son | on FTP | SVs/assembly

Paper describing data…
51 authors
14 institutions
12 datasets
7 genomes
Data described in ISA-tab

Integration Methods to Establish Benchmark Small Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.

How can we extend this approach to SVs?
Similarities to small variants
• Collect callsets from multiple technologies
• Compare callsets to find calls supported by multiple technologies
Differences from small variants
• Callsets generally are not sufficiently sensitive to assume that regions without calls are homozygous reference
– SVs of different types/sizes are not always detected easily
• Variants are often imprecisely characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly standardized, especially when complex
• Comparison tools are in their infancy

Callsets Contributed so far
Short reads
• Illumina
– Spiral Genetics
– cortex
– Commonlaw
– MetaSV
• Complete Genomics
– CG-SV
– CG-CNV
– CG-vcfBeta
Long reads and Linked reads
• PacBio
– CSHL-assembly
– Sniffles
– PBHoney-spots and -tails
– Parliament/pacbio
– Parliament/assembly
– MultibreakSV
– smrt-sv.dip
– Assemblytics-Falcon and -MHAP
– NHGRI assembly-based
• Nanopore mapping
– Nabsys force calls
• Optical mapping
– BioNano with and without haplotype-aware assembly
• 10X Genomics Chromium
– Deletions
– Large SVs

AJ Trio Assemblies
On FTP
• PacBio
– Falcon
– Canu
• BioNano
– Haploid
– Diploid
In Process
• Illumina
– DISCOVAR – contig N50 ~100k
• PacBio
– Falcon diploid in process
• Dovetail scaffolding
– With PacBio-Falcon
– With PacBio-Canu
– With DISCOVAR
• 10X?
– By itself
– Phasing PacBio

APPROACH #1: FIND DELETIONS WITH SUPPORT FROM MULTIPLE TECHS AND CONCORDANT BREAKPOINTS

Step 1: Merging calls
• Process
– Find union of calls >19bp from all deletion callsets and merge any regions if within 1000 bp (results in 28460 regions)
– Annotate each merged region with fraction covered by calls from each callset
– Split out those overlapping tandem repeats longer than 200bp by >25% (2715 regions)
• Helps mitigate different representations of calls in repetitive regions and imprecision of breakpoints from many callers
• Limitations
– May not appropriately call compound heterozygous SVs
– Ignores other types of SVs in the region
– Loses genotype information
[Figure labels: Callset #1, Callset #2]
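The merging and annotation in Step 1 can be sketched as follows. This is a minimal illustration, not the consortium's actual pipeline: the function names, the `(chrom, start, end, callset)` tuple layout, and the reading of "within 1000 bp" as a gap of at most 1000 bp between a call and the current region are all assumptions. The tandem-repeat split is omitted.

```python
from collections import defaultdict

def covered_length(intervals):
    """Total number of bases covered by the union of (start, end) intervals."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def merge_deletions(calls, min_size=20, max_gap=1000):
    """calls: iterable of (chrom, start, end, callset) deletion calls.

    Keeps calls of at least min_size bp (i.e. >19bp), merges calls whose gap
    to the current region is <= max_gap, and annotates each merged region
    with the fraction of its length covered by each callset.
    """
    kept = sorted((c for c in calls if c[2] - c[1] >= min_size),
                  key=lambda c: (c[0], c[1]))
    regions = []
    for chrom, start, end, callset in kept:
        last = regions[-1] if regions else None
        if last and last["chrom"] == chrom and start - last["end"] <= max_gap:
            last["end"] = max(last["end"], end)
        else:
            last = {"chrom": chrom, "start": start, "end": end,
                    "calls": defaultdict(list)}
            regions.append(last)
        last["calls"][callset].append((start, end))
    for region in regions:
        length = region["end"] - region["start"]
        region["fraction"] = {cs: covered_length(ivs) / length
                              for cs, ivs in region["calls"].items()}
    return regions
```

Because only the merged interval and per-callset coverage are kept, this sketch shares the limitations listed above: genotypes and SV types are discarded.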

Step 2: Find size prediction accuracy
• Find “size prediction accuracy” of each callset by calculating the difference from the median predicted size for regions with calls from >3 callers, and rank callers for <3kb and >3kb size ranges
Spiral 0.00%
Cortex 0.24%
CGSV 0.65%
AssemblyticsFalcon 0.79%
CGvcf 1.09%
fermikit 1.28%
smrtsvdip 1.43%
MetaSV 1.57%
MultibreakSV 1.62%
PBHoneySpots 2.13%
AssemblyticsMHAP 2.21%
ParliamentAssemblyForce 2.26%
CSHLassembly 2.29%
ParliamentPacBio 2.92%
ParliamentAssembly 3.00%
Spiral 0.04%
AssemblyticsFalcon 0.06%
CGSV 0.06%
CSHLassembly 0.08%
AssemblyticsMHAP 0.08%
MultibreakSV 0.10%
fermikit 0.11%
PBHoneyTails 0.38%
CommonLaw 0.48%
ParliamentPacBio 0.58%
smrtsvdip 0.62%
MetaSV 1.12%
sniffles 1.57%
Nabsys2tech01Force 3.02%
BioNano 3.67%
[Figure panels: Size >3kb, Size <3kb]
IMPORTANT NOTE: These stats are intended for integration and to help developers improve their methods, not to compare methods, since they likely do not reflect actual size prediction accuracy for all methods.
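One way to compute a ranking like the one in Step 2 is sketched below. The slide does not say how per-region differences are summarized into a single percentage per callset; taking the median of the absolute relative deviation is an assumption here, as are the function name and the input layout (one dict of callset name to predicted size per region).

```python
from collections import defaultdict
from statistics import median

def size_accuracy(region_sizes, min_callers=4):
    """Rank callsets by how far their predicted sizes fall from the median.

    region_sizes: list of dicts, each mapping callset name -> predicted
    deletion size in one merged region. Only regions with calls from more
    than 3 callers (min_callers=4) are used.
    Returns a list of (callset, median relative deviation), best first.
    """
    deviations = defaultdict(list)
    for sizes in region_sizes:
        if len(sizes) < min_callers:
            continue
        consensus = median(sizes.values())
        for callset, size in sizes.items():
            deviations[callset].append(abs(size - consensus) / consensus)
    # Assumption: summarize each callset by its median deviation.
    return sorted(((cs, median(d)) for cs, d in deviations.items()),
                  key=lambda item: item[1])
```

To obtain separate rankings for the <3kb and >3kb ranges, the regions would be split by consensus size before calling the function.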

Step 3: Find calls supported by 2+ techs
1. Find calls supported by calls from 2 or more technologies with size prediction within 20%
2. Find sensitivity of each caller to these calls in size ranges 20-50, 50-100, 100-1000, 1000-3000, and >3000 bp
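The two parts of Step 3 can be illustrated with the sketch below. All names are hypothetical, the reference size is assumed to come from the best-ranked callset in the region, and treating the upper edge of each size range as inclusive is an assumption, since the slide does not state how boundaries are handled.

```python
from bisect import bisect_left
from collections import Counter

TRANCHE_EDGES = [50, 100, 1000, 3000]
TRANCHE_NAMES = ["20-50", "50-100", "100-1000", "1000-3000", ">3000"]

def tranche(size):
    """Size range for a call; upper edges are inclusive (an assumption)."""
    return TRANCHE_NAMES[bisect_left(TRANCHE_EDGES, size)]

def supporting_technologies(ref_size, region_calls, tech_of, tolerance=0.20):
    """Technologies with a call whose size is within tolerance of ref_size.

    region_calls: dict callset -> predicted size in this region.
    tech_of: dict callset -> technology name.
    """
    return {tech_of[cs] for cs, size in region_calls.items()
            if abs(size - ref_size) <= tolerance * ref_size}

def sensitivity_by_tranche(supported_calls):
    """supported_calls: list of (ref_size, set of callsets reporting the call).

    Returns {(callset, tranche): fraction of supported calls it reported}.
    """
    totals, found = Counter(), Counter()
    for ref_size, callsets in supported_calls:
        t = tranche(ref_size)
        totals[t] += 1
        for cs in callsets:
            found[(cs, t)] += 1
    return {key: n / totals[key[1]] for key, n in found.items()}
```

A call would count as supported when `len(supporting_technologies(...)) >= 2`.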

Step 4: Filter questionable calls supported by 2+ technologies
• 316 calls covered >25% by segmental duplication >10kb
• 631 calls with at least one caller predicting a size >2x different from the consensus size
• 34 calls where the callsets missing this call come from multiple technologies and have a product of (1 - sensitivity) < 2% in this size tranche
• 87 calls that overlap Ns in the reference
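The four filters in Step 4 can be sketched as a single function that returns the reasons a call is questionable. The dictionary keys, reason labels, and function names are hypothetical, and reading ">2x different" as either more than double or less than half the consensus size is an assumption. The sensitivities passed in are those for the call's size tranche.

```python
from math import prod

def miss_probability(missing_callsets, sensitivity):
    """Product of (1 - sensitivity) over callsets that did not report the call."""
    return prod(1.0 - sensitivity[cs] for cs in missing_callsets)

def filter_reasons(call, sensitivity, tech_of, threshold=0.02):
    """Return the list of filters a call fails (empty list means it passes).

    call: dict with keys
      "segdup_fraction": fraction covered by segmental duplications >10kb
      "consensus_size":  consensus deletion size
      "sizes":           dict callset -> predicted size
      "missing":         callsets that did not report this call
      "overlaps_n":      True if the call overlaps Ns in the reference
    sensitivity: dict callset -> sensitivity in this call's size tranche
    tech_of: dict callset -> technology name
    """
    reasons = []
    if call["segdup_fraction"] > 0.25:
        reasons.append("segdup")
    consensus = call["consensus_size"]
    if any(s > 2 * consensus or s < consensus / 2
           for s in call["sizes"].values()):
        reasons.append("size_discordant")
    missing = call["missing"]
    # Missing callsets must span multiple technologies and be jointly
    # unlikely to miss a real call.
    if (len({tech_of[cs] for cs in missing}) > 1
            and miss_probability(missing, sensitivity) < threshold):
        reasons.append("unlikely_missed")
    if call["overlaps_n"]:
        reasons.append("reference_N")
    return reasons
```

For example, two missing callsets that each have 90% sensitivity give a product of 0.1 × 0.1 = 1%, which falls under the 2% threshold.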

Overview of process
1. Merge deletions within 1kb
2. Rank calls by closeness of predicted size to median size and select call in each region from best callset
3. Find calls supported by 2+ technologies with size within 20%
4. Filter calls overlapping seg dups, reference N’s, or with a call with predicted size 2x larger

Number of Calls Supported by 2 Technologies by Size Range
Calls | <50bp | 50-100bp | 100-1000bp | 1kb-3kb | >3kb
pre-filtered | 2542 | 1567 | 2447 | 731 | 730
filtered | 2427 | 1415 | 2207 | 638 | 524

Support for all candidate regions
[Figure panels: # of callsets, # of technologies]

Support for benchmark calls
[Figure panels: # of callsets, # of technologies]

Approach #2: svcompare (NCBI hackathon)
Builds on SURVIVOR
• Compares each new callset to the first and adds new calls not within 1kb of existing calls
• Outputs multi-sample VCF with type, size, and breakpoints from each callset in each candidate region
• Integrates multiple types, but doesn’t currently output size of insertions or exact sequence
• Developed by Fritz Sedlazeck, JHU
Output stats
• 130k input regions from calls >19bp
• 876 regions have >1 type within a callset
• 2276 regions have >1 type across callsets
• How to integrate discordant types in same region?
https://github.com/NCBI-Hackathons/svcompare
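The incremental comparison described for Approach #2 can be illustrated with the sketch below. This is not svcompare's or SURVIVOR's code or interface. The function names and data layout are hypothetical, "within 1kb" is read as a gap of at most 1000 bp between intervals, and candidate regions keep the coordinates of the call that created them.

```python
def add_callsets(callsets, max_dist=1000):
    """Greedily build candidate regions from an ordered list of callsets.

    callsets: list of (name, [(chrom, start, end, svtype), ...]).
    The first callset seeds the regions; a call from a later callset joins
    an existing region if it lies within max_dist of it, otherwise it
    starts a new region.
    """
    candidates = []
    for name, calls in callsets:
        for chrom, start, end, svtype in calls:
            hit = next((r for r in candidates
                        if r["chrom"] == chrom
                        and start <= r["end"] + max_dist
                        and end >= r["start"] - max_dist), None)
            if hit is None:
                hit = {"chrom": chrom, "start": start, "end": end, "calls": {}}
                candidates.append(hit)
            hit["calls"].setdefault(name, []).append((start, end, svtype))
    return candidates

def discordant_types(region):
    """Return (within, across): whether >1 SV type occurs within any single
    callset, and whether >1 SV type occurs across all callsets in the region."""
    per_callset = {name: {c[2] for c in calls}
                   for name, calls in region["calls"].items()}
    within = any(len(types) > 1 for types in per_callset.values())
    across = len(set().union(*per_callset.values())) > 1
    return within, across
```

Counting regions where `discordant_types` returns `True` in either position gives tallies of the same kind as the "Output stats" above. The linear scan over candidates is quadratic overall and is only suitable for illustration.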

Example start position distance from median start by callset (400-1000bp)

Approach #3: “Type” candidate calls in each dataset
svviz
• Looks for whether reads support REF or ALT allele
– Can often easily infer genotype
• Also generates other stats about mapping reads
• Generates visualization of mapped reads as well
• Nabsys has developed a similar approach for their mapping data
Compatible datasets
• PacBio
• Illumina 150bp and 250bp paired end
• Illumina 6kb mate-pair
• 10X haplotype-separated

10X SV analyses with svviz
• Find reads supporting ref and alt alleles in each haplotype
• Verify support for ref and alt is on different haplotypes for hets
• Verify support from both haplotypes for confident hom var or hom ref call
[Figure panels: Son, Dad, Mom, Son, Dad, Mom]
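The haplotype checks listed above can be expressed as a small decision rule over per-haplotype read counts. This sketch does not call svviz. It assumes the ref/alt supporting read counts per haplotype have already been extracted, and the thresholds `min_reads` and `min_fraction` are hypothetical values chosen only for illustration.

```python
def genotype_from_haplotypes(counts, min_reads=3, min_fraction=0.8):
    """Infer a genotype from reads supporting ref/alt on each haplotype.

    counts: {"hap1": {"ref": int, "alt": int}, "hap2": {"ref": int, "alt": int}}
    Returns "het", "hom_var", "hom_ref", or "uncertain".
    """
    def allele(c):
        total = c["ref"] + c["alt"]
        if total < min_reads:
            return None  # too few reads on this haplotype
        if c["alt"] / total >= min_fraction:
            return "alt"
        if c["ref"] / total >= min_fraction:
            return "ref"
        return None  # mixed support within one haplotype

    a1, a2 = allele(counts["hap1"]), allele(counts["hap2"])
    if a1 is None or a2 is None:
        return "uncertain"
    if a1 == a2:
        # Confident homozygous calls need support from both haplotypes.
        return "hom_var" if a1 == "alt" else "hom_ref"
    # Ref and alt support sit on different haplotypes.
    return "het"
```

A call with alt reads on one haplotype and no coverage on the other comes back as "uncertain" rather than homozygous, matching the requirement for support from both haplotypes.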

Goals for Data Jamboree
Share progress in algorithm development
• New technologies
• New analysis methods
• Visualization methods
• Integration/comparison methods
Outstanding questions to discuss
• Integration
– How to form high-confidence calls, breakpoints, and genotypes from multiple calls?
– What is the minimum viable product for a practical benchmark set?
• Is this a good criterion: “When an individual callset is compared to ours, most FPs/FNs should be errors in the individual callset”?
– How to handle non-deletions?
• SV typing
• Future work
– How to form high-confidence regions?
– SV phasing
– Is anyone developing SV benchmarking tools?

Things to resolve
Integration
• How to compare events with variable breakpoints across callsets?
– Tandem repeats
• How to compare non-deletions?
– Start with insertions?
• Distinguish precise breakpoints when possible
Typing
• Leverage long-range information to type with short reads?
• How to deal with imprecise breakpoints?
• At what point is something validated?
– Potentially high-confidence variants (or reference?)
– Haplotype-separated

Acknowledgements
• NIST
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
– Hemang Parikh
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
– Liz Mansfield
• SV Callset Contributors
– CSHL/JHU
– Mt Sinai
– 10X
– Nabsys
– Spiral Genetics/Stanford
– Heng Li/Mike Lin
– DNAnexus
– Complete Genomics
– Baylor
– Bina/Roche
– BioNano Genomics
– Mark Chaisson
– NIH/NCBI
– NIH/NHGRI
– Can Alkan/Stanford
