Mols_August2013

Sequence Based Identity by Descent Detection
Joint work with Jasmine Nirody & Yun S. Song
@ University of California, Berkeley
Paula Tataru
Mols Meeting
August 15, 2013

[Figure: two aligned DNA sequences, GACTGAGC and GACTGAAC, that differ at a single site]
Identity By Descent (IBD) tracts
DNA segments that are inherited from a common ancestor
recombination disrupts them
expected length depends on the TMRCA
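
The dependence on the TMRCA can be illustrated under a simplifying assumption that is not stated on the slide: if two haplotypes coalesce t generations ago, 2t meioses separate them, so the distance from a given site to the nearest recombination on one side is roughly exponential with rate 2t per Morgan. The sketch below (the function name is hypothetical) ignores recombinations that leave the common ancestor unchanged.

```python
def expected_one_sided_tract_cm(tmrca_generations: float) -> float:
    """Expected distance in cM from a site to the nearest recombination on one side.

    Assumption (not from the talk): the two lineages together span
    2 * tmrca_generations meioses, and crossovers occur at rate 1 per Morgan
    per meiosis, so the distance is exponential with mean 1 / (2t) Morgans.
    """
    if tmrca_generations <= 0:
        raise ValueError("TMRCA must be positive")
    morgans = 1.0 / (2.0 * tmrca_generations)
    return 100.0 * morgans  # 1 Morgan = 100 cM


if __name__ == "__main__":
    for t in (10, 50, 500, 5000):
        print(t, expected_one_sided_tract_cm(t))
```

Under this approximation, older common ancestors give shorter expected tracts.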

IBD is fundamental in genetics
selection
phasing
imputation
association studies

Current methods use population-wide SNP genotype data
work best for recent IBD (tracts longer than 1 cM)
different IBD definitions
pairwise SNPs disrupt predicted IBD tracts
either probabilistic or deterministic

GERMLINE
Gusev et al., 2009
Identical By State (IBS)
Deterministic
Linear in number of samples
Phased SNP data
Sliding window to find IBS
Allows for genotyping error
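
The window-based search for IBS can be sketched as follows. This is a toy illustration, not GERMLINE's actual implementation: it assumes phased haplotypes stored as equal-length strings, uses fixed non-overlapping windows, requires exact matches (no genotyping-error tolerance), and omits the step that extends seed matches into longer segments. The function name `ibs_seed_matches` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def ibs_seed_matches(haplotypes, window):
    """Map each haplotype pair to the window indices where the pair is identical.

    haplotypes: dict of name -> string of alleles, all the same length.
    window: number of sites per non-overlapping window.
    """
    n_sites = len(next(iter(haplotypes.values())))
    matches = defaultdict(list)
    for w, start in enumerate(range(0, n_sites - window + 1, window)):
        # Bucket haplotypes by their allele string inside this window.
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[start:start + window]].append(name)
        # Every pair that lands in the same bucket is IBS in this window.
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                matches[(a, b)].append(w)
    return dict(matches)


if __name__ == "__main__":
    haps = {"h1": "00110101", "h2": "00110111", "h3": "11110101"}
    print(ibs_seed_matches(haps, window=4))
```

Bucketing by window content avoids comparing every pair directly; only haplotypes that collide in a bucket are paired up.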

FastIBD
Browning & Browning, 2011
IBD inside IBS
Deterministic
Quadratic in number of samples
Unphased SNP data; phasing done with Beagle
Accounts for phase uncertainty and background levels of LD
Models shared haplotype frequencies

RefinedIBD
Browning & Browning, 2013
IBD inside IBS
Probabilistic
Very similar to FastIBD
Identifies candidate IBD segments using GERMLINE
Filter candidates based on a probabilistic model

SMCSD
Paul et al., 2011, Sheehan et al., 2013
same TMRCA
Probabilistic: HMM
Phased sequence data
Based on coalescence theory
Predicts recombination breakpoints that change TMRCA

SMCSD in a nutshell
Designed to estimate demographic history
partition time in discrete intervals
assume constant population size per time interval
use EM to train model
Use decoding to infer IBD
assume demography given
run posterior decoding
changes of TMRCA reveal recombination breakpoints
use posterior probabilities to trim tracts’ endpoints
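
The decoding step can be illustrated with a toy HMM. This is a generic forward-backward sketch, not SMCSD itself: the hidden states stand in for discretized TMRCA intervals, the observations are 0 (the pair matches at a site) or 1 (the pair differs), and all transition and emission probabilities are made-up values. The names `posterior_decode` and `breakpoints` are hypothetical.

```python
def posterior_decode(obs, init, trans, emit):
    """Return per-site posterior state probabilities via scaled forward-backward."""
    n, k = len(obs), len(init)
    fwd, scale = [], []
    for t in range(n):
        if t == 0:
            row = [init[s] * emit[s][obs[0]] for s in range(k)]
        else:
            row = [
                emit[s][obs[t]] * sum(fwd[-1][r] * trans[r][s] for r in range(k))
                for s in range(k)
            ]
        c = sum(row)
        scale.append(c)
        fwd.append([p / c for p in row])
    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = [
            sum(trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r] for r in range(k))
            / scale[t + 1]
            for s in range(k)
        ]
    post = []
    for t in range(n):
        row = [fwd[t][s] * bwd[t][s] for s in range(k)]
        z = sum(row)
        post.append([p / z for p in row])
    return post


def breakpoints(post):
    """Sites where the most probable state (TMRCA interval) changes."""
    path = [max(range(len(row)), key=row.__getitem__) for row in post]
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]


if __name__ == "__main__":
    # Two toy states: 0 = recent TMRCA (few mismatches), 1 = old TMRCA (more).
    init = [0.5, 0.5]
    trans = [[0.95, 0.05], [0.05, 0.95]]
    emit = [[0.99, 0.01], [0.70, 0.30]]  # emit[state][symbol]
    obs = [0] * 30 + [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    post = posterior_decode(obs, init, trans, emit)
    print(breakpoints(post))
```

The posterior probabilities in `post` are also the quantity one would inspect near a breakpoint when deciding where a tract ends; that trimming step is not implemented here.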

Data simulation
Simulate trees in ms
µ = 1.25 × 10^-8
r = 10^-8
sequences of length 10 Mb
10 sequences (45 pairs)
10 replicates
Collect recombination breakpoints from ms output
Reconstruct pairwise IBD tracts
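
One way to picture the reconstruction step is sketched below. The slides do not say how it was done, so this rests on assumptions: the local trees have already been parsed into contiguous per-pair segments of the form (start, end, TMRCA), and a tract is taken to be a maximal run of adjacent segments with the same TMRCA. Parsing of the ms output is omitted, and the function name is hypothetical.

```python
def pairwise_ibd_tracts(segments):
    """Merge adjacent segments of one haplotype pair that share the same TMRCA.

    segments: list of (start, end, tmrca) tuples, sorted by start, where each
    segment lies between two consecutive recombination breakpoints.
    Returns a list of merged (start, end, tmrca) tracts.
    """
    tracts = []
    for start, end, tmrca in segments:
        last = tracts[-1] if tracts else None
        # Exact equality is used on the assumption that equal TMRCAs are read
        # from the same simulator output and are therefore bit-identical.
        if last is not None and last[1] == start and last[2] == tmrca:
            tracts[-1] = (last[0], end, tmrca)
        else:
            tracts.append((start, end, tmrca))
    return tracts


if __name__ == "__main__":
    segs = [(0, 100, 0.5), (100, 250, 0.5), (250, 400, 1.2), (400, 500, 0.5)]
    print(pairwise_ibd_tracts(segs))
```

In the example, the breakpoint at position 100 does not change the pair's TMRCA, so the first two segments form a single tract.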

Human Population
Tennessen et al., 2012, Simons et al., 2013

[Figure: population size (PopSize, log scale from 10^3 to 10^6) and cumulative probability (CumProb, 0 to 1) against generations back in time (0 to 6000)]

European Population
[Figure: comparison of GERMLINE, FastIBD, RefinedIBD, SMCSD-W and SMCSD-T. Summary panel: Recall, Precision, F-score (0 to 1). Panels against tract length (cM, 0.1 to 1.0), each on a 0 to 1 scale: TruePositive, FalseNegative, FalsePositive, Power, Under-prediction, Over-prediction. Inset: PopSize (10^3 to 10^6) over 0 to 6000 generations.]

African Population
[Figure: comparison of GERMLINE, FastIBD, RefinedIBD, SMCSD-W and SMCSD-T. Summary panel: Recall, Precision, F-score (0 to 1). Panels against tract length (cM, 0.1 to about 1.1), each on a 0 to 1 scale: TruePositive, FalseNegative, FalsePositive, Power, Under-prediction, Over-prediction. Inset: PopSize (10^3 to 10^6) over 0 to 6000 generations.]
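
The slides do not state how recall, precision and F-score were computed. A minimal sketch, assuming they are measured by base-pair overlap between true and predicted tracts for one haplotype pair (function names are hypothetical, and tracts within each list are assumed not to overlap one another):

```python
def covered(tracts):
    """Total length covered by a list of (start, end) tracts."""
    return sum(end - start for start, end in tracts)


def overlap(true_tracts, pred_tracts):
    """Total length shared between true and predicted tracts."""
    total = 0
    for ts, te in true_tracts:
        for ps, pe in pred_tracts:
            total += max(0, min(te, pe) - max(ts, ps))
    return total


def recall_precision_f(true_tracts, pred_tracts):
    """Recall, precision and F-score by base-pair overlap (an assumed definition)."""
    shared = overlap(true_tracts, pred_tracts)
    true_len, pred_len = covered(true_tracts), covered(pred_tracts)
    recall = shared / true_len if true_len else 0.0
    precision = shared / pred_len if pred_len else 0.0
    total = recall + precision
    f_score = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_score


if __name__ == "__main__":
    truth = [(0, 100), (200, 300)]
    predicted = [(50, 150), (200, 250)]
    print(recall_precision_f(truth, predicted))
```

Under this definition, predicting tracts that are too short lowers recall, and predicting tracts that are too long lowers precision.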

Conclusion
Simulated data from outbred populations
Existing programs are strong performers for long tracts
SMCSD performs better on shorter tracts
SMCSD uses a more robust IBD definition

Thank you!
