2. Overview
• Two ideas in Phase2 SNP calling
– Use exome off-target reads for whole genome SNP site discovery
– Use exome genotype calls to improve overall genotype accuracy
• Preliminary results and plan
3. Review of phase1 SNP pipeline
[Flowchart]
• Low coverage (~4X) WGS BAMs: multi-sample calling, giving population SNP sites
• High coverage (~50X) WES BAMs: single-sample calling, giving individual SNP sites and genotypes
• Apply multi-center consensus strategy and merge SNP sites
• Calculate genotype likelihood on all candidate sites
• Impute genotype/haplotype
4. Two ideas in Phase2
[Flowchart: same boxes as the phase1 SNP pipeline on slide 3. Low coverage (~4X) WGS BAMs with multi-sample calling and population SNP sites; high coverage (~50X) WES BAMs with single-sample calling and individual SNP sites and genotypes; apply multi-center consensus strategy and merge SNP sites; calculate genotype likelihood on all candidate sites; impute genotype/haplotype.]
5. The first idea
• Use exome off-target reads in whole genome SNP calling
– Exome off-target reads contribute significant coverage across the whole genome
– Preliminary results showed higher SNP sensitivity and reasonable quality
7. Exome off-target reads are evenly distributed
• Weighted read depths calculated using EBD in 5kb sliding windows across chr20
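As a rough illustration of the windowed depth calculation, the sketch below computes mean per-base depth in 5 kb sliding windows from read alignment intervals. It is only an assumption-laden sketch: the function name `windowed_depth`, the `(start, end)` interval input, and the `step` parameter are hypothetical, and the depth is unweighted, so it does not reproduce the EBD weighting named on the slide.

```python
def windowed_depth(read_intervals, chrom_length, window=5000, step=1000):
    """Mean per-base depth in sliding windows (unweighted sketch).

    read_intervals: iterable of (start, end) alignments, 0-based, half-open.
    window must be a multiple of step so windows are built from whole bins.
    """
    if window % step != 0:
        raise ValueError("window must be a multiple of step")

    # Accumulate the number of aligned bases falling in each step-sized bin.
    n_bins = -(-chrom_length // step)  # ceiling division
    covered = [0] * n_bins
    for start, end in read_intervals:
        pos = max(start, 0)
        end = min(end, chrom_length)
        while pos < end:
            b = pos // step
            bin_end = min((b + 1) * step, end)
            covered[b] += bin_end - pos
            pos = bin_end

    # Each window is the sum of `window // step` consecutive bins.
    bins_per_window = window // step
    results = []
    for first in range(0, n_bins - bins_per_window + 1):
        w_start = first * step
        w_end = min(w_start + window, chrom_length)
        total = sum(covered[first:first + bins_per_window])
        results.append((w_start, w_end, total / (w_end - w_start)))
    return results


# Toy example: three reads on a 10 kb contig, windows advancing by 2.5 kb.
toy_reads = [(0, 100), (50, 150), (4000, 6000)]
toy_windows = windowed_depth(toy_reads, 10_000, window=5000, step=2500)
```

For a whole chromosome, the same binning idea would usually be run on depths streamed from the BAM files rather than on an in-memory list of intervals.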
8. SNP calling experiment
• Using all 1449 phase2 lowpass BAMs and 1182 exome Illumina BAMs
• Calling model modified from SNPtools
– Combining reads of the same sample to estimate the variance of true variant reads
– Grouping reads of the same sequencing platform to estimate the variance of platform-specific bias
9. SNP calls comparison (chr20 off-target regions)
Call set | #SNP | Ti/Tv | # in Phase1 | Known Ti/Tv | % Rare (MAF<1%) | % Novel to Phase1 | Novel Ti/Tv | OMNI poly sensitivity | OMNI mono false discovery
BI phase2 baseline | 821,141 | 2.31 | 514,021 | 2.34 | 72.5% | 37.4% | 2.24 | 98.2% (50,195/51,126) | 0.9% (12/1265)
BCM phase2 baseline | 847,274 | 2.33 | 502,517 | 2.42 | 68.6% | 40.7% | 2.20 | 98.6% (50,406/51,126) | 1.9% (24/1265)
BCM Phase2 experimental | 911,602 | 2.32 | 521,189 | 2.42 | 69.7% | 42.8% | 2.19 | 98.8% (50,494/51,126) | 2.1% (27/1265)
Additional SNPs | 64,328 | 2.17 | 18,672 | 2.26 | 99.1% | 71.0% | 2.13 | 0.2% (88/51,126) | 0.2% (3/1265)
• Called ~7% more SNPs in off-target regions by using exome reads
• Ti/Tv and OMNI metrics showed reasonable quality
• Additional SNPs are mostly rare in Phase1 calls or novel
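The headline numbers can be checked directly from the counts in the comparison table. The short sketch below redoes that arithmetic; the variable names and the `pct` helper are illustrative only, and the counts are copied from the BCM baseline, BCM experimental, and "Additional SNPs" rows.

```python
def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Counts from the chr20 off-target comparison table.
bcm_baseline_snps = 847_274
bcm_experimental_snps = 911_602
omni_poly_sites = 51_126
omni_mono_sites = 1_265

# Additional SNPs are the difference between experimental and baseline calls.
additional_snps = bcm_experimental_snps - bcm_baseline_snps

# Relative gain, expressed against the baseline and against the new total.
gain_vs_baseline = pct(additional_snps, bcm_baseline_snps)
share_of_experimental = pct(additional_snps, bcm_experimental_snps)

# OMNI-based quality metrics for the experimental call set.
omni_sensitivity = pct(50_494, omni_poly_sites)
omni_false_discovery = pct(27, omni_mono_sites)

# Increments attributable to the additional SNPs.
extra_poly_hits = 50_494 - 50_406
extra_mono_hits = 27 - 24
```

The gain works out to roughly 7.6% relative to the baseline call set, or about 7.1% of the experimental call set, which matches the "~7%" quoted on the slide.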
10. MAF distribution comparison (after imputation)
Both increasing sample size and adding exome reads increase SNP discovery
rate on the rare end (0.1% bin)
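To make the frequency binning concrete, here is a minimal sketch of how sites could be assigned to MAF bins from allele counts. Everything beyond the 0.1% bin mentioned on the slide is an assumption: the bin edges, the function names, and the total of 2898 alleles (1449 diploid samples, assuming one lowpass BAM per sample) are illustrative and not taken from the actual analysis.

```python
from collections import Counter

# Upper edges of the MAF bins; only the 0.1% bin comes from the slide,
# the remaining edges are hypothetical.
BIN_EDGES = [0.001, 0.005, 0.01, 0.05, 0.5]


def minor_allele_frequency(alt_count, total_alleles):
    """Frequency of the less common allele at a biallelic site."""
    af = alt_count / total_alleles
    return min(af, 1 - af)


def maf_bin(maf):
    """Return the upper edge of the first bin that contains maf."""
    for edge in BIN_EDGES:
        if maf <= edge:
            return edge
    raise ValueError("MAF cannot exceed 0.5")


def maf_histogram(sites):
    """Count sites per MAF bin; sites is an iterable of (alt_count, total_alleles)."""
    return Counter(maf_bin(minor_allele_frequency(ac, an)) for ac, an in sites)


# Toy input: assumed 1449 diploid samples, so 2898 alleles per site.
TOTAL_ALLELES = 2 * 1449
toy_sites = [(1, TOTAL_ALLELES), (2, TOTAL_ALLELES), (3, TOTAL_ALLELES),
             (29, TOTAL_ALLELES), (2897, TOTAL_ALLELES)]
toy_hist = maf_histogram(toy_sites)
```

Under these assumptions, singletons and doubletons both land in the 0.1% bin, which is where the slide reports the largest gain in discovery.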
11. The second idea
• Using exome calls to refine genotype imputation
– Exome calls are of high quality and independent from AF
– Exome pipeline addressed platform/capture specific errors.
12. A snapshot of Phase1 exome SNP validation results
Category | Total | Submitted | Yield | Validated | Validated/yield
singleton | 5372 | 100 | 93 | 92 | 98.9%
<1% | 4430 | 50 | 49 | 47 | 95.9%
>1% | 1896 | 50 | 46 | 46 | 100%
SVM overall | 11698 | 200 | 188 | 185 | 98.4%
Why does the <1% category have the lowest validation rate?
• Validation sample selection
• Imputation artifacts
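The validation rates in the table are simply validated sites divided by sites that yielded a result. The sketch below recomputes them from the per-category counts; the dictionary layout and helper name are illustrative only.

```python
# Per-category counts from the validation table:
# (total, submitted, yield, validated)
validation_counts = {
    "singleton": (5372, 100, 93, 92),
    "<1%": (4430, 50, 49, 47),
    ">1%": (1896, 50, 46, 46),
}


def validation_rate(yielded, validated):
    """Validated sites as a percentage of sites that yielded a result."""
    return round(100 * validated / yielded, 1)


rates = {
    category: validation_rate(yielded, validated)
    for category, (_, _, yielded, validated) in validation_counts.items()
}

# The overall row is the column-wise sum of the three categories.
overall = tuple(sum(column) for column in zip(*validation_counts.values()))
overall_rate = validation_rate(overall[2], overall[3])

# Number of sites that failed validation in each category.
failures = {
    category: yielded - validated
    for category, (_, _, yielded, validated) in validation_counts.items()
}
```

The <1% category's lower rate rests on only two failed sites out of 49, so a handful of sites is enough to move it.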
13. A closer look at imputation artifacts
Chr | Pos | Site source | AC | Sample picked for validation | PCR-454 validation | Phase1 release v1 | GL in log-10 scale (RR/RA/AA) | Exome calls (BCM)
20 | 20033172 | EX_SOLID | singleton | NA19468 (SOLiD) | 0/0 | 0/1 | ./.:-5,-0.000391054,-3.04576 | 0/1
20 | 23667835 | EX_ILLUMINA | <1% | NA18510 (Illumina) | 0/0 | 0/1 | ./.:-5,-0.00020851,-3.31876 | 0/0 or ./.
20 | 23667835 | EX_ILLUMINA | <1% | NA18858 (Illumina) | 0/0 | 0/1 | ./.:-2.72124,-0.000825952,-5 | 0/0 or ./.
20 | 25478962 | EX_ILLUMINA | <1% | HG00104 (SOLiD) | 0/0 | 0/1 | ./.:-5,0,-5 | 0/0 or ./.
20 | 25478962 | EX_ILLUMINA | <1% | HG00234 (SOLiD) | 0/0 | 0/1 | ./.:-3.1549,-0.000304111,-5 | 0/0 or ./.
20 | 25478962 | EX_ILLUMINA | <1% | HG00364 (SOLiD) | 0/0 | 0/1 | ./.:-4.69838,-8.69777e-06,-5 | 0/0 or ./.
20 | 25478962 | EX_ILLUMINA | <1% | HG00593 (SOLiD) | 0/0 | 0/1 | ./.:-3.1938,-0.000278053,-5 | 0/0 or ./.
20 | 25478962 | EX_ILLUMINA | <1% | HG01271 (SOLiD) | 0/0 | 0/1 | ./.:-0.31142,-0.290883,-5 | 0/0 or ./.
20 | 60885811 | EX_ILLUMINA | <1% | HG00134 (SOLiD) | 0/0 | 0/1 | ./.:-0.477139,-0.477113,-0.477113 | 0/0 or ./.
20 | 60885811 | EX_ILLUMINA | <1% | HG00350 (SOLiD) | 0/0 | 0/1 | ./.:-0.123447,-0.61343,-2.41117 | 0/0 or ./.
20 | 62326235 | EX_ILLUMINA | <1% | HG00128 (SOLiD) | 0/0 | 0/1 | ./.:-4.22169,-2.6068e-05,-5 | 0/0 or ./.
20 | 62326235 | EX_ILLUMINA | <1% | HG00179 (SOLiD) | 0/0 | 0/1 | ./.:-3.22182,-0.000773747,-2.92812 | 0/0 or ./.
SNPs were called in one sample but incorrectly imputed in other samples
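To read the GL column more easily, the sketch below converts a `GT:GL` field with log-10 scaled RR/RA/AA values into normalized genotype probabilities and reports the most likely genotype. It is only an illustration: the function names are hypothetical, and normalizing the three values assumes a flat prior over genotypes, which may not match how the values in the release were produced.

```python
GENOTYPES = ("0/0", "0/1", "1/1")  # RR, RA, AA


def parse_gl_field(field):
    """Split 'GT:gl_rr,gl_ra,gl_aa' into the GT string and log10 values."""
    genotype, gl_text = field.split(":")
    return genotype, [float(value) for value in gl_text.split(",")]


def normalized_probabilities(log10_values):
    """Convert log10 likelihoods to probabilities that sum to 1.

    Subtracting the maximum first keeps the exponentiation numerically stable.
    """
    top = max(log10_values)
    weights = [10 ** (value - top) for value in log10_values]
    total = sum(weights)
    return [weight / total for weight in weights]


def most_likely_genotype(field):
    """Return (genotype, probability) for the best-supported genotype."""
    _, log10_values = parse_gl_field(field)
    probs = normalized_probabilities(log10_values)
    best = max(range(len(probs)), key=probs.__getitem__)
    return GENOTYPES[best], probs[best]


# Fields copied from the table (NA19468, HG00134, HG00350).
confident_het = most_likely_genotype("./.:-5,-0.000391054,-3.04576")
uninformative = normalized_probabilities(
    parse_gl_field("./.:-0.477139,-0.477113,-0.477113")[1])
leaning_ref = most_likely_genotype("./.:-0.123447,-0.61343,-2.41117")
```

Under this reading, most rows put nearly all of the weight on the heterozygous genotype, while the HG00134 entry is essentially uniform across the three genotypes and the HG00350 entry leans toward homozygous reference.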
15. Future work
• Use both Illumina and SOLiD exome data to assist whole genome SNP calling in next experiment
• Integrate exome genotype calls in whole genome imputation
16. Acknowledgements
HGSC-BCM
• Fuli Yu
• Danny Challis
• Uday Evani
• Matthew Bainbridge
• Donna Muzny
• Jeffrey Reid
• Richard Gibbs
• Yi Wang
BlueBioU@Rice
• Research Computing group