This document summarizes an assessment of version 0.5.0 of the GiaB structural variant calling pipeline. It provides statistics on the number and size of total and passed calls, compares the calls to other published call sets, examines overlap with genes and repeats, reports genotyping results using PacBio data, and discusses areas for further improvement.
2. Size and numbers of all calls 0.5.0
Size           DEL     DUP  INV  INS     TRA  Complex
0-50bp         22,769  0    0    21,103  0    120
50-100bp       5,357   0    0    4,636   0    102
100-1000bp     6,995   0    0    8,351   0    971
1000-10000bp   1,262   0    0    1,667   0    417
10000+bp       388     0    0    44      0    34
Total          36,771  0    0    35,801  0    1,644
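A minimal sketch of how calls could be assigned to the size classes in the table above. The bin boundaries are assumed to be half-open (a 50 bp call falls into 50-100bp); the slides do not say which side is inclusive. `size_bin` and `tabulate` are hypothetical helper names, not part of the pipeline.

```python
from bisect import bisect_right
from collections import Counter

# Upper edges of the size classes used in the table (assumed half-open bins).
BIN_EDGES = [50, 100, 1000, 10000]
BIN_LABELS = ["0-50bp", "50-100bp", "100-1000bp", "1000-10000bp", "10000+bp"]


def size_bin(length):
    """Return the size-class label for an SV length (sign is ignored)."""
    return BIN_LABELS[bisect_right(BIN_EDGES, abs(length))]


def tabulate(calls):
    """Count calls per (size class, SV type).

    `calls` is an iterable of (svtype, length) pairs, e.g. ("DEL", -320).
    """
    counts = Counter()
    for svtype, length in calls:
        counts[(size_bin(length), svtype)] += 1
    return counts


if __name__ == "__main__":
    toy_calls = [("DEL", -30), ("DEL", -320), ("INS", 75), ("INS", 12000)]
    for key, n in sorted(tabulate(toy_calls).items()):
        print(key, n)
```

Taking the absolute value lets the same function handle deletions that carry a negative length.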
3. Size and numbers of pass calls 0.5.0
Size           DEL     DUP  INV  INS     TRA  Complex
50-100bp       3,300   0    0    3,238   0    49
100-1000bp     3,747   0    0    6,493   0    361
1000-10000bp   732     0    0    1,343   0    227
10000+bp       108     0    0    38      0    19
Total          7,887   0    0    11,112  0    656
4. Comparison to other published call sets
Data set               Distance (bp)  Refound (%)  DEL    INS    Complex
NA12878 (GiaB)         1,000          67.25        4,602  6,986  2
NA12878 (GiaB)         100            54.13        3,695  5,631  2
1000 genomes project   1,000          18.54        1,947  1,247  2
1000 genomes project   100            10.37        1,776  11     1
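A rough sketch of what "refound within a distance" could mean. It assumes a call counts as refound when the other call set has a call of the same type on the same chromosome with a start position within the given distance. The slides do not state the actual matching rule or which call set is the denominator, so `build_index`, `is_refound` and `refound_percent` are illustrative names only.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(calls):
    """Group start positions by (chromosome, SV type) and sort them."""
    index = defaultdict(list)
    for chrom, pos, svtype in calls:
        index[(chrom, svtype)].append(pos)
    for positions in index.values():
        positions.sort()
    return index


def is_refound(call, index, max_dist):
    """True if the index holds a same-type call within max_dist bp."""
    chrom, pos, svtype = call
    positions = index.get((chrom, svtype), [])
    # First indexed position that is >= pos - max_dist.
    i = bisect_left(positions, pos - max_dist)
    return i < len(positions) and positions[i] <= pos + max_dist


def refound_percent(query, reference, max_dist):
    """Percentage of query calls that are refound in the reference set."""
    if not query:
        return 0.0
    index = build_index(reference)
    hits = sum(is_refound(call, index, max_dist) for call in query)
    return 100.0 * hits / len(query)


if __name__ == "__main__":
    reference = [("chr1", 1000, "DEL"), ("chr1", 5000, "INS")]
    query = [("chr1", 1050, "DEL"), ("chr1", 5600, "INS")]
    print(refound_percent(query, reference, 100))
    print(refound_percent(query, reference, 1000))
```

Under this rule a wider window can only add matches, which agrees with the higher refound rates at 1,000 bp than at 100 bp in the table.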
5. Comparison to other call sets
Comparing the single call VCF files with call set 0.5.0
• Calls 0.5.0 supported: 16,724 (96.84%)
• 545 calls unique to call set 0.5.0
Technologies ->   2       3      4      5      6
0.5.0 calls       2,446   9,583  2,793  1,806  96
Non 0.5.0 calls   27,175  6,941  302    2      0
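The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 96.84% is the share of supported calls among supported plus unique calls; the slides do not say so explicitly, but the numbers agree under that reading.

```python
# Counts of 0.5.0 calls by number of supporting technologies (from the slide).
calls_by_technologies = {2: 2446, 3: 9583, 4: 2793, 5: 1806, 6: 96}

# Calls reported as unique to call set 0.5.0 (from the slide).
unique_calls = 545

# Assumption: "supported" is the sum over the technology columns.
supported = sum(calls_by_technologies.values())

# Assumption: the percentage uses supported + unique as the denominator.
percent_supported = round(100.0 * supported / (supported + unique_calls), 2)

print(supported, percent_supported)
```

The per-technology counts add up to the 16,724 supported calls given in the first bullet.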
6. Overlap with Genes:
• Overlapped passed SV calls 0.5.0 with annotation
• Counted how often a gene is reported.
• Gene PTPRN2 was observed 87 times.
• 56 genes were reported 10 or more times.
[Figure: Overview of SVs overlapping with genes. Histogram; x-axis: genes reported in #SVs (0 to 80); y-axis: frequency (0 to 5,000).]
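The gene counting described on this slide could look roughly like the sketch below. It assumes the overlap step yields (SV id, gene name) pairs and that an SV is counted at most once per gene. Both the input shape and the function names are assumptions, not the pipeline's actual interface.

```python
from collections import Counter


def gene_hit_counts(overlaps):
    """Count in how many distinct SVs each gene is reported.

    `overlaps` is an iterable of (sv_id, gene_name) pairs.
    """
    distinct_pairs = set(overlaps)  # an SV counts once per gene
    return Counter(gene for _, gene in distinct_pairs)


def genes_reported_at_least(counts, threshold=10):
    """Names of genes reported in at least `threshold` SVs, sorted."""
    return sorted(gene for gene, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    toy_overlaps = [
        ("sv1", "GENE_A"), ("sv2", "GENE_A"), ("sv2", "GENE_A"),
        ("sv3", "GENE_B"),
    ]
    counts = gene_hit_counts(toy_overlaps)
    print(counts.most_common(1))
    print(genes_reported_at_least(counts, threshold=2))
```

The same counts could also be used to rank SVs for inspection, as proposed under Suggestions.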
7. Overlap with simple repeats
• 5,931 (30.17%) SVs lie within simple repeats (at least 70% of their size overlaps a simple repeat)
• 1,827 DEL
• 4,014 INS
• 90 complex
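A minimal sketch of the 70% criterion, assuming each SV and each repeat is a half-open interval on the same chromosome. The editor's notes mention SURVIVOR and bedtools for this step; the pure-Python functions below are only a stand-in to show the arithmetic, and their names are hypothetical. How zero-length insertions were given a size is not stated, so they return 0.0 here.

```python
def covered_fraction(start, end, repeats):
    """Fraction of [start, end) covered by the union of repeat intervals."""
    length = end - start
    if length <= 0:
        return 0.0  # assumption: no size, so no measurable overlap
    covered = 0
    cursor = start  # everything left of cursor has already been counted
    for r_start, r_end in sorted(repeats):
        lo = max(r_start, cursor)
        hi = min(r_end, end)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered / length


def in_simple_repeat(start, end, repeats, min_fraction=0.7):
    """True if at least min_fraction of the SV lies within repeats."""
    return covered_fraction(start, end, repeats) >= min_fraction


if __name__ == "__main__":
    repeats = [(90, 150), (140, 180)]
    print(covered_fraction(100, 200, repeats))
    print(in_simple_repeat(100, 200, repeats))
```

Tracking a cursor keeps overlapping repeat annotations from being counted twice.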
8. Sniffles genotyping over HG002
• Using NGMLR alignments + SV calls 0.5.0
• Sniffles will ignore complex types
• The analysis of split reads and read alignments is carried out as before, but restricted to the specified variants
• No quality assessment/filtering is performed
• The VCF is reported either with the exact breakpoints or with breakpoints redefined from the supporting reads
9. Genotyping using PacBio HG002
• Input: 7,887 DEL and 11,112 INS + NGMLR alignments for HG002
• Genotype comparison:
[Figure: Genotyping PacBio. Histogram; x-axis: allele frequency (0 to 100); y-axis: frequency (0 to 2,500).]
                   0/0    0/1    1/1
Svviz for PacBio   2,328  9,737  7,590
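Genotype counts like those above could be tallied from the GT strings of a genotyped VCF. The sketch below assumes diploid, biallelic genotypes, treats phased and unphased calls alike, and groups any genotype with a missing allele as ./.. It is not the script used for the slides.

```python
from collections import Counter


def normalize_gt(gt):
    """Map a GT string such as "1|0" or "0/1" to an unphased, sorted form."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "./."  # any missing allele counts as not genotyped
    return "/".join(sorted(alleles))


def tally_genotypes(gts):
    """Count how often each normalized genotype occurs."""
    return Counter(normalize_gt(gt) for gt in gts)


if __name__ == "__main__":
    toy_gts = ["0/1", "1|0", "1/1", "./.", "0/0"]
    for genotype, n in sorted(tally_genotypes(toy_gts).items()):
        print(genotype, n)
```

Keeping ./. separate from 0/0 matters here, because the slide labels the first column 0/0 while the editor's notes list the same count as ./..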
15. Suggestions
• SV overlap has to be revisited
• Use genes to rank SVs for inspection
• Can we be more precise about what the complex type means?
• Write comparison paper!?
Editor's Notes
Welcome everyone. My name is Fritz Sedlazeck and I am currently working at the Human Genome Sequencing Center @ Baylor in Houston.
Today I am going to talk about challenges in SV calling and our pursuit of the diploid genome that we are working on. Before I dive into that, let me briefly introduce myself and my scientific interests.
1000 genomes: 65889 SVs
NA12878 Giab: 43156 SVs
Maybe do a simple overlap and characterize how many techs are supporting the 0.5.0 calls?
Using SURVIVOR to convert to BED and then bedtools with a repeat annotation
Can I filter out the non-HG002 calls? Or at least compute the number.
2,328 ./.; 9,737 0/1; 7,590 1/1
Check which SVs are getting ignored.