The document discusses challenges in characterizing structural variations across genomes and populations using long-read sequencing technologies. It describes tools developed for accurate mapping and structural variant calling from long reads, including NGMLR for mapping and Sniffles for variant calling. It also presents results of applying these tools to call structural variants in the NA12878 genome from PacBio and Nanopore long reads, detecting many more variants than from short read data. The goal is to better understand variability between genomes and impact on gene regulation.
3. Structural Variations
Genomic Disorders
Evolution
Impact on regulation
Impact on phenotypes
[Figure: heatmap of the number of affected regulatory features (scale 0 to 2000). One axis: regulatory state (CTCF binding site, enhancer, open chromatin region, promoter, promoter flanking region and TF binding site, each as ACTIVE, INACTIVE, POISED, REPRESSED or NA). Other axis: cell line (A549, Aorta, GM12878, H1 ESC, HeLa-S3, HepG2, K562, and others).]
4. Diploid genome
• Impact on Regulation
• Variability of genes
• Need to understand the full structure
5. Challenges: Pursuing the diploid genome
1. Accurate prediction of SVs
2. Comparison of SVs
3. Annotation and interpretation of SVs
4. Population analysis
5. Diploid Genome
Layer et al. (2014)
7. 1.1 Long Read Technologies
• (+) SVs in repetitive regions
• (+) Span SVs
• (+) Uniform coverage
• (+) Can identify more complex SVs
• (-) Higher seq. error rate
• (-) Hard to align
8. 1.1 Accurate mapping and SV calling
NextGenMap-LR (NGMLR):
• Long read mapper
• Convex gap costs
• Faster than BWA-MEM
Sniffles:
• SV caller for long reads
• All types of SVs
• Phasing of SVs
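To illustrate the "convex gap costs" bullet, here is a minimal sketch comparing an affine gap penalty with a convex-style one, where each additional gap base costs less than the previous one. The slides do not give the function or parameters NGMLR actually uses, so the logarithmic form and the values of `gap_open`, `gap_extend` and `gap_scale` are assumptions for illustration only.

```python
import math


def affine_gap_cost(length, gap_open=5.0, gap_extend=1.0):
    """Affine model: every additional gap base costs the same fixed amount."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def convex_gap_cost(length, gap_open=5.0, gap_scale=1.0):
    """Convex-style model (assumed log form): the marginal cost per base shrinks."""
    if length <= 0:
        return 0.0
    return gap_open + gap_scale * math.log(length)


# Long gaps (as produced by SVs) stay affordable under the convex-style model,
# while short gaps are penalised about as much as in the affine model.
for n in (1, 10, 100, 1000):
    print(n, affine_gap_cost(n), round(convex_gap_cost(n), 2))
```

The design idea is that a single long gap caused by a true SV should not be scored as if it were many independent sequencing errors.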
10. 1.1 NA12878: SV calling
Tech.                    Coverage  Avg read len  Method               SVs     TRA    DEL    INS
PacBio                   55x       4,334         Sniffles             22,877  119    9,933  12,052
Oxford Nanopore @Baylor  34x       4,982         Sniffles             12,596  46     7,102  5,166
Illumina                 50x       2 x 101       Manta, Delly, Lumpy  7,275   2,247  3,744  0
Sedlazeck et al. (2017)
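As a quick worked check on the table above, the share of translocation (TRA) calls per technology can be computed directly from the SVs and TRA columns. This is plain arithmetic on the numbers shown on the slide; the dictionary layout is only an illustrative choice.

```python
# SV and TRA counts copied from the NA12878 table on this slide.
calls = {
    "PacBio": {"SVs": 22877, "TRA": 119},
    "Oxford Nanopore": {"SVs": 12596, "TRA": 46},
    "Illumina": {"SVs": 7275, "TRA": 2247},
}

# Percentage of all calls that are translocations, per technology.
tra_share = {tech: 100.0 * c["TRA"] / c["SVs"] for tech, c in calls.items()}

for tech, pct in tra_share.items():
    print(f"{tech}: {pct:.2f}% of calls are TRA")
```

Roughly 31% of the Illumina calls are translocations, against about half a percent or less for the long-read call sets.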
11. 1.1 NA12878: check 2,247 vs 119 TRA
[Figure: alignment views of Illumina, PacBio and ONT data. Labels: Translocation (Illumina data); Truncated reads; Insertion in rep. region.]
Overlap           Illumina TRA (%)
Insertions        53.05
Deletions         12.06
Duplications      0.57
Nested            0.31
High coverage     1.87
Low complexity    9.79
Explained         77.65
Sedlazeck et al. (2017)
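The "Explained" row can be verified as the sum of the individual overlap categories. The snippet below is only a sanity check on the percentages listed on the slide.

```python
# Percentages of Illumina TRA calls overlapping each category (from the slide).
overlap_pct = {
    "Insertions": 53.05,
    "Deletions": 12.06,
    "Duplications": 0.57,
    "Nested": 0.31,
    "High coverage": 1.87,
    "Low complexity": 9.79,
}

# Sum of all categories, rounded to the two decimals used on the slide.
explained = round(sum(overlap_pct.values()), 2)
print(explained)
```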
12. 1.1 NA12878: check 2,247 vs 119 TRA
[Figure: alignment views of ONT, PacBio and Illumina data. Labels: Insertion in rep. region; Inversion; Translocation; Truncated reads.]
Sedlazeck et al. (2017)
13. 1.2 More complex SVs
Inverted tandem duplication:
• Pelizaeus-Merzbacher disease
• MECP2
• VIPR2
Sedlazeck et al. (2017)
[Figure panels: PacBio data, Illumina data]
14. 1.2 More complex SVs
Inversion flanked by deletions:
• Haemophilia A
• Only found via long-range PCR (2007)!
Sedlazeck et al. (2017)
[Figure panels: Illumina data, PacBio data]
15. Challenges
1. Accurate prediction of SVs: Sniffles (talk on Thursday!)
2. Comparison of SVs
3. Annotation and interpretation of SVs
4. Population analysis
5. Diploid Genome
Layer et al. (2014)
17. 2. Genome in a Bottle: merging 95 vcfs (1 min)
10x Genomics
BioNano
Complete Genomics
Illumina
PacBio
[Figures: SV caller comparison; SVs supported by a minimum of 2 callers]
Using PCR + Sanger to validate SVs from multiple categories.
Join CSHL + Baylor to help with validations!
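The "minimum 2 callers" idea can be sketched as follows: calls of the same type whose positions lie close together are grouped, and a group is kept only if at least two different callers contributed to it. This is a simplified illustration, not the actual SURVIVOR implementation; the `SVCall` record, the `max_dist` threshold and the caller names in the example are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    caller: str   # name of the tool or technology that produced the call
    chrom: str
    pos: int      # start position of the SV
    svtype: str   # e.g. "DEL", "INS", "TRA"


def consensus_calls(calls, max_dist=1000, min_callers=2):
    """Cluster same-type calls by start position; keep clusters with enough callers."""
    ordered = sorted(calls, key=lambda c: (c.chrom, c.svtype, c.pos))
    clusters = []
    for call in ordered:
        last = clusters[-1] if clusters else None
        if (last is not None
                and last[-1].chrom == call.chrom
                and last[-1].svtype == call.svtype
                and call.pos - last[-1].pos <= max_dist):
            last.append(call)          # close enough: same event
        else:
            clusters.append([call])    # start a new candidate event
    # A cluster counts only if distinct callers agree on it.
    return [c for c in clusters if len({x.caller for x in c}) >= min_callers]


example = [
    SVCall("manta", "chr1", 1000, "DEL"),
    SVCall("sniffles", "chr1", 1200, "DEL"),
    SVCall("delly", "chr1", 50000, "DEL"),
    SVCall("manta", "chr2", 1000, "INS"),
    SVCall("manta", "chr2", 1100, "INS"),
]
merged = consensus_calls(example)
print(len(merged), [c.pos for c in merged[0]])
```

In the example only the chr1 deletion near position 1000 survives: the chr2 insertions come from a single caller, and the distant chr1 deletion has no partner.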
18. Challenges
1. Accurate prediction of SVs: Sniffles (talk on Thursday!)
2. Comparison of SVs: SURVIVOR
3. Annotation and interpretation of SVs
4. Population analysis
5. Diploid Genome
19. 3. Annotation: SURVIVOR_ant
Annotating SVs with:
• Multiple GTF, BED, VCF
Genome in a Bottle:
• 63,677 genes (GTF)
• 1,733,686 regions (3 BED files)
• 22 seconds
• 8,314 genes impacted
Sedlazeck et al. (2017)
[Figure: histogram of genes impacted by SVs; x-axis: # SVs hitting a gene (0 to 80), y-axis: # genes (frequency, 0 to 6000)]
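The annotation step boils down to interval overlap: for each SV, find the genes it intersects, then count hits per gene to obtain the histogram. The sketch below is a naive quadratic version for illustration, not the SURVIVOR_ant implementation; gene names and coordinates are made up, and intervals are assumed to be half-open.

```python
from collections import Counter


def genes_hit(svs, genes):
    """Count, per gene, how many SVs overlap it (half-open intervals)."""
    hits = Counter()
    for sv_chrom, sv_start, sv_end in svs:
        for name, g_chrom, g_start, g_end in genes:
            # Two intervals overlap if each starts before the other ends.
            if sv_chrom == g_chrom and sv_start < g_end and g_start < sv_end:
                hits[name] += 1
    return hits


# Hypothetical genes: (name, chromosome, start, end).
genes = [("geneA", "chr1", 100, 500), ("geneB", "chr1", 400, 900), ("geneC", "chr2", 100, 200)]
# Hypothetical SVs: (chromosome, start, end).
svs = [("chr1", 450, 460), ("chr1", 50, 120), ("chr1", 1000, 1100), ("chr2", 150, 400)]

hits = genes_hit(svs, genes)
# Histogram: number of SVs per gene -> number of genes with that many hits.
histogram = Counter(hits.values())
print(dict(hits), dict(histogram))
```

For tens of thousands of genes and millions of regions, a real tool would index the annotations (for example with an interval tree) rather than compare every pair.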
20. Challenges
1. Accurate prediction of SVs: Sniffles (talk on Thursday!)
2. Comparison of SVs: SURVIVOR
3. Annotation and interpretation of SVs: SURVIVOR_ANT
4. Population analysis
5. Diploid Genome
22. 4. SVs in 22,600 Individuals
We need large SV studies:
• Common vs. rare SVs
• Inform GWAS studies
• Ethnicity specific SVs
• Catalog variability of regions
• MHC, LPA, etc.
[Figure: CHR6: average SV allele frequency per 100 kb; x-axis: position (0 to 1.5e+08), y-axis: allele frequency (0 to 0.20); MHC and LPA regions marked. Second panel: # SVs shared across individuals.]
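The chromosome 6 plot averages SV allele frequencies in 100 kb windows. A minimal sketch of that binning is shown below; the input format (position, allele frequency pairs for one chromosome) and the example values are assumptions, since the slides do not describe how the figure was computed.

```python
from collections import defaultdict


def mean_af_per_window(svs, window=100_000):
    """Average allele frequency of SVs in fixed-size windows.

    svs: iterable of (position, allele_frequency) on a single chromosome.
    Returns {window_start: mean_allele_frequency}, ordered by position.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, af in svs:
        w = pos // window          # index of the window containing this SV
        sums[w] += af
        counts[w] += 1
    return {w * window: sums[w] / counts[w] for w in sorted(sums)}


# Hypothetical SVs: two in the first window, one in the second.
example_svs = [(10_000, 0.125), (90_000, 0.375), (150_000, 0.5)]
windows = mean_af_per_window(example_svs)
print(windows)
```

Windows without any SV are simply absent from the result, so highly variable regions stand out as windows with many calls and elevated averages.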
23. Challenges
1. Accurate prediction of SVs: Sniffles (talk on Thursday!)
2. Comparison of SVs: SURVIVOR
3. Annotation and interpretation of SVs: SURVIVOR_ANT
4. Population analysis: SURVIVOR
5. Diploid Genome
24. 5.1 Diploid Genome
Challenges:
• Sequencing technology
• Computational methods
• Money
HGSC Approach: GADGET
1. Sequence 100 individuals: PacBio + 10x Genomics
2. SV detection/genotyping
3. Phasing of SVs + SNPs
4. Population-based genotyping of SVs with short reads.
25. 5.2 Diploid Genome
Selecting 100 samples
• We want to maximize the outcome per $ spent
• Selection of samples (red)
• Select top 100 (red)
• Random selection of samples (boxplot)
[Figure: histogram of # SVs per patient; x-axis: # SVs (2e+04 to 1e+05), y-axis: # patients (0 to 250)]
[Figure: Random vs. informed choice of samples (CCDG); x-axis: number of chosen samples (1 to 96), y-axis: SV in population (%) (0 to 100); series: Informed, Top100, Random]
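The three selection strategies in the plot can be compared with a small sketch. How the "informed" choice was actually computed is not stated on the slide; the greedy strategy below (always add the sample that contributes the most not-yet-seen SVs) is an assumption, prompted by the "greedy curve" mentioned in the speaker notes. Sample names and SV sets are made up.

```python
import random


def greedy_selection(sample_svs, k):
    """Pick k samples, each time taking the one that adds the most unseen SVs."""
    covered, chosen = set(), []
    remaining = dict(sample_svs)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, covered


def top_n_selection(sample_svs, k):
    """Pick the k samples with the most SVs, ignoring overlap between them."""
    ranked = sorted(sample_svs, key=lambda s: len(sample_svs[s]), reverse=True)[:k]
    return ranked, set().union(*(sample_svs[s] for s in ranked))


def random_selection(sample_svs, k, seed=0):
    """Pick k samples uniformly at random (seeded for reproducibility)."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(sample_svs), min(k, len(sample_svs)))
    return picked, set().union(*(sample_svs[s] for s in picked))


# Hypothetical cohort: sample -> set of SV identifiers.
cohort = {"s1": {1, 2, 3, 4}, "s2": {1, 2, 3}, "s3": {5, 6}, "s4": {7}}
all_svs = set().union(*cohort.values())

greedy_pick, greedy_cov = greedy_selection(cohort, 2)
top_pick, top_cov = top_n_selection(cohort, 2)
print(greedy_pick, 100 * len(greedy_cov) / len(all_svs))
print(top_pick, 100 * len(top_cov) / len(all_svs))
```

In this toy cohort, the two samples with the most SVs largely share the same calls, so the greedy choice covers more of the population's SVs with the same number of samples.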
27. Challenges/ Summary
1. Accurate prediction of SVs: Sniffles (talk on Thursday!)
2. Comparison of SVs: SURVIVOR
3. Annotation and interpretation of SVs: SURVIVOR_ANT
4. Population analysis: SURVIVOR
5. Diploid Genome: GADGET
All methods are available:
https://github.com/fritzsedlazeck
https://fritzsedlazeck.github.io/
28. Acknowledgments
William Salerno
Stephen Richards
Richard Gibbs
Michael Schatz
Schatz lab
Daniel Jeffares
Jürg Bähler
Christophe Dessimoz
Justin Zook
GiaB consortium
Editor's Notes
Welcome everyone. My name is Fritz Sedlazeck and I am currently working at the Human Genome Sequencing Center @ Baylor in Houston.
Today I am going to talk about challenges in SV calling and our pursuit of the diploid genome. Before I dive into that, let me briefly introduce myself and my scientific interests.
I am a computational biologist mainly focusing on method development for mapping and assembly of short and long reads.
Detection of genomic variations, focusing on SVs.
Benchmarking and detecting biases in methods and sequencing technologies.
And applying all of these to obtain more insights into molecular biology.
The focus of the talk today is around structural variations.
Only when we account for all variations will we be able to obtain deeper insights.
However, there are certain challenges to get there.
Look at the Venn again. Probably each caller has some fraction of true and false positives, reflecting the complexity of calling SVs.
Structural variations are in general loosely defined as 50bp+ …..
Short-read-based calling is often said to lack sensitivity and to have a large FDR!
Avg len for PacBio nowadays much higher!
Many deletions on Nanopore -> 11,394 (96.19%) were deletions, and the majority (89.72%) were within a homopolymer.
Check indel for significance
@ONT INS: probably missing repetitive regions due to caller??
3 times sequenced in different labs!
This highlights a huge bias in short reads and explains why Illumina is not enough!
Look at the Venn again. Probably each caller has some fraction of true and false positives.
So one possible solution could be to combine (make a consensus call).
So now that we can call SVs across different callers. How can we annotate and rank these calls?
AC131097.3: Long non coding
SMYD3: Histone methyltransferase
Now we have the calls and annotation. Are these SVs common in the population or unique to my sample??
#Singletons/stats..
Fraction of rare vs. common.
Patients -> Individuals.
Now that we have methods to identify SVs, reduce FDR, annotate them and a mechanism to know if they are rare or common, we need to understand their context -> Diploid genome.
#SV singletons, #SV two samples?
Put greedy curve without crossing out.
Interesting:
Do the curve for 3 sd.
More time!