1. Deciphering the story in our DNA
Inference of population history and patterns from molecular data
PhD defense, Paula Tataru
Bioinformatics Research Centre, Aarhus University
Aarhus, January 23rd 2015
Supervisors: Christian N.S. Pedersen & Asger Hobolth
7. Research overview
• 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction [SCFG]
• 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information [SCFG]
• 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains [CTMC]
• 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions [HMM & DFA]
• 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals [CTMC & HMM]
• In preparation: Motif discovery in ranked lists of sequences [DTMC & DFA]
• In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach [DTMC]
Research areas: population genetics, sequence analysis, pattern matching, structural bioinformatics
10. Outline
• 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction [SCFG]
• 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information [SCFG]
Models: SCFG, DFA, DTMC, HMM, CTMC (talk part 1)
Talk part 2:
• 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains [CTMC]
• 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions [HMM & DFA]
• In preparation: Motif discovery in ranked lists of sequences [DTMC & DFA]
• In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach [DTMC]
Talk part 3:
• 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals [CTMC & HMM]
Research areas: population genetics, sequence analysis, pattern matching, structural bioinformatics
12. Population genetics: the Wright-Fisher model
› Evolution of a population forward in time
› Follow the change of the allele count
[Figure: individuals across generations (time); allele count per generation: 3, 2, 4, 3, 4, 5, 5]
1. Modeling toolbox
13. Discrete Time Markov Chain (DTMC)
› Allele count per generation in a population of size N
› States {0, 1, …, N}, 0 ≤ i, j ≤ N
› Transitions binomial
  i → i: Bin(i | N, i/N)
  i → j: Bin(j | N, i/N)
  j → i: Bin(i | N, j/N)
  j → j: Bin(j | N, j/N)
[Figure: allele count path 3, 2, 4, 3, 4, 5, 5, with the values 0.23, 0.20, 0.33, 0.08, 1, 0.26 shown along the path]
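The binomial transitions of the DTMC can be written out as a full transition matrix. The following is a minimal sketch, not code from the talk: the function names and the choice N = 6 are illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """Bin(k | n, p): probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def wright_fisher_matrix(N):
    """Row i holds P(next count = j | current count = i) = Bin(j | N, i/N)."""
    states = range(N + 1)
    return [[binom_pmf(j, N, i / N) for j in states] for i in states]

# N = 6 is an assumed, illustrative population size.
P = wright_fisher_matrix(6)
```

Each row of `P` is a probability distribution over the next allele count, and the states 0 and N are absorbing: once the allele is lost or fixed, the count no longer changes.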
18. Hidden Markov Model (HMM)
› Allele count per generation in a population of size N
  › Difficult to measure
  › Directly affects allele count of a sample of size n
[Figure: individuals across generations (time)]
Hidden Markov chain (allele count in the population): 3, 2, 4, 3, 4, 5, 5
Observable data (allele count in the sample): 2, 1, 2, 2, 2, 3, 3
19. Hidden Markov Model
› Transitions (DTMC) binomial
  i → i: Bin(i | N, i/N)
  i → j: Bin(j | N, i/N)
  j → i: Bin(i | N, j/N)
  j → j: Bin(j | N, j/N)
› Emissions (data) binomial
  state i emits k = 0, 1, 2, 3 with probability Bin(k | n, i/N)
  state j emits k = 0, 1, 2, 3 with probability Bin(k | n, j/N)
20. Hidden Markov Model
› Standard algorithms
  › Forward: likelihood of data
  › Viterbi: global decoding
  › Posterior decoding: local decoding
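Two of the standard algorithms can be sketched for the Wright-Fisher HMM with binomial transitions and emissions. This is an illustrative sketch only: the function names, N = 6, n = 3 and the uniform initial distribution are assumptions, and no numerical scaling is applied, so it is only suitable for short sequences. Posterior decoding is omitted.

```python
from math import comb

def binom_pmf(k, n, p):
    """Bin(k | n, p): probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def build_model(N, n):
    """Binomial transition and emission tables of the Wright-Fisher HMM."""
    states = range(N + 1)
    trans = [[binom_pmf(j, N, i / N) for j in states] for i in states]
    emit = [[binom_pmf(k, n, i / N) for k in range(n + 1)] for i in states]
    return trans, emit

def forward_likelihood(obs, N, n, init):
    """Forward algorithm: P(obs), summed over all hidden allele-count paths."""
    trans, emit = build_model(N, n)
    states = range(N + 1)
    alpha = [init[i] * emit[i][obs[0]] for i in states]
    for y in obs[1:]:
        alpha = [emit[j][y] * sum(alpha[i] * trans[i][j] for i in states)
                 for j in states]
    return sum(alpha)

def viterbi_path(obs, N, n, init):
    """Viterbi algorithm: the single most probable hidden path."""
    trans, emit = build_model(N, n)
    states = range(N + 1)
    delta = [init[i] * emit[i][obs[0]] for i in states]
    backpointers = []
    for y in obs[1:]:
        best_prev = [max(states, key=lambda i: delta[i] * trans[i][j])
                     for j in states]
        delta = [delta[best_prev[j]] * trans[best_prev[j]][j] * emit[j][y]
                 for j in states]
        backpointers.append(best_prev)
    path = [max(states, key=lambda j: delta[j])]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    return path[::-1]

# Assumed setting: N = 6, n = 3, uniform initial distribution.
N, n = 6, 3
uniform = [1 / (N + 1)] * (N + 1)
sample_counts = [2, 1, 2, 2, 2, 3, 3]  # sample allele counts from the slides
likelihood = forward_likelihood(sample_counts, N, n, uniform)
decoded = viterbi_path(sample_counts, N, n, uniform)
```

`likelihood` is the probability of the observed sample counts under the model, and `decoded` is one hidden allele-count path of the same length as the data.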
22. Continuous Time Markov Chain (CTMC)
› Allele count in time in a population of size N
› States {0, 1, …, N}, 0 < i < N
› Rates r_i = i (N − i) / N
  i → i−1 and i → i+1: rate r_i each (total rate out of i: 2 r_i)
  i−1 → i: rate r_{i−1}
  i+1 → i: rate r_{i+1}
[Figure: allele count in DTMC: 3, 2, 4, 3, 4, 5, 5, with the values 0.23, 0.20, 0.33, 0.08, 1, 0.26 shown along the path]
[Figure: allele count in CTMC: the count changes by one at random jump times (t1, t2, …), up or down with probability 1/2 each]
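The rates on this slide define a rate matrix Q with off-diagonal entries r_i and diagonal entries −2 r_i. The sketch below builds it and recovers the jump probabilities of 1/2; the function names and N = 6 are illustrative assumptions, and the boundary states 0 and N are treated as absorbing.

```python
def ctmc_rate_matrix(N):
    """Rate matrix with Q[i][i-1] = Q[i][i+1] = r_i and Q[i][i] = -2 r_i."""
    Q = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N):  # states 0 and N keep all-zero rows (absorbing)
        r = i * (N - i) / N
        Q[i][i - 1] = r
        Q[i][i + 1] = r
        Q[i][i] = -2 * r
    return Q

def jump_probabilities(Q, i):
    """Probability of each destination at the next jump out of state i."""
    total = -Q[i][i]
    return {j: q / total for j, q in enumerate(Q[i]) if j != i and q > 0}

Q = ctmc_rate_matrix(6)  # N = 6 is an assumed, illustrative size
```

The waiting time in an interior state i is exponential with rate 2 r_i, so the chain moves fastest at intermediate allele counts and slows down near 0 and N.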
25. Motif discovery for DTMCs using DFAs
› DTMC describes sequences
  › Allele count in a population
› DFA encodes pattern
  › (i)+ (i+1)+ (i+2)+
[Figure: DFA with states 0, 1, 2, 3 and edges labeled i, i+1, i+2 and k; DTMC state diagram with binomial transitions; example sequence 3 2 4 3 4 5 5]
M. M. Nielsen, P. Tataru, T. Madsen, A. Hobolth, and J. S. Pedersen. Motif discovery in ranked lists of sequences. In preparation.
2. Brief overview
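A four-state DFA for the pattern (i)+ (i+1)+ (i+2)+ can be sketched as a transition table. This is an illustration, not the automaton from the manuscript: the function names are hypothetical, and counting one occurrence each time the accepting state is entered is an assumed convention.

```python
def build_dfa(i):
    """Transition table for (i)+ (i+1)+ (i+2)+; state 3 is accepting."""
    a, b, c = i, i + 1, i + 2
    return {
        0: {a: 1},
        1: {a: 1, b: 2},
        2: {a: 1, b: 2, c: 3},
        3: {a: 1, c: 3},
    }

def count_occurrences(seq, i):
    """Count how many times the DFA enters its accepting state."""
    dfa = build_dfa(i)
    state, count = 0, 0
    for symbol in seq:
        # Any symbol without an explicit edge resets to the start state.
        next_state = dfa[state].get(symbol, 0)
        if next_state == 3 and state != 3:
            count += 1
        state = next_state
    return count

allele_counts = [3, 2, 4, 3, 4, 5, 5]  # example sequence from the slides
hits = count_occurrences(allele_counts, 3)
```

For the example sequence and i = 3, the run 3, 4, 5, 5 at the end matches the pattern once.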
27. Motif discovery for DTMCs using DFAs
› Does the pattern (i)+ (i+1)+ (i+2)+ occur more frequently in specific environments?
[Figure: populations, each with a DTMC sequence over generations (time), e.g. … 2 4 3 4 5 5, and their environment]
› Contribution
  › DFA
  › New approach to significance (random walk)
28. Restricted algorithms for HMMs using DFAs
› HMM describes problem
  › Allele count in a population
› DFA encodes pattern
  › (i)+ (i+1)+ (i+2)+
[Figure: HMM state diagram with binomial transitions and emissions; DFA with states 0, 1, 2, 3 and edges labeled i, i+1, i+2 and k; sample allele counts 2 1 2 2 2 3 3 and population allele counts 3 2 4 3 4 5 5]
P. Tataru, A. Sand, A. Hobolth, T. Mailund, and C. N. Pedersen. Algorithms for hidden Markov models restricted to occurrences of regular expressions. Biology, 2(4):1282–1295, 2013
29. Restricted algorithms for HMMs using DFAs
› Contribution: new algorithms
  › Calculate distribution of #pattern occurrences
  › Adapt decoding algorithms to include #pattern occurrences
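One way to obtain the distribution of the number of pattern occurrences is to extend the forward table with the DFA state and a running occurrence count. The sketch below illustrates that idea and is not the published algorithm: the function name, N = 6, n = 3, i = 3 and the uniform initial distribution are assumptions, and an occurrence is counted each time the DFA enters its accepting state.

```python
from collections import defaultdict
from math import comb

def binom_pmf(k, n, p):
    """Bin(k | n, p): probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def occurrence_distribution(obs, N, n, i, init):
    """P(number of occurrences of (i)+ (i+1)+ (i+2)+ in the hidden path | obs)."""
    states = range(N + 1)
    trans = [[binom_pmf(j, N, h / N) for j in states] for h in states]
    emit = [[binom_pmf(k, n, h / N) for k in range(n + 1)] for h in states]
    # DFA over hidden states; missing edges fall back to state 0, state 3 accepts.
    dfa = {0: {i: 1}, 1: {i: 1, i + 1: 2},
           2: {i: 1, i + 1: 2, i + 2: 3}, 3: {i: 1, i + 2: 3}}
    # table[(h, q, c)] = P(obs so far, hidden state h, DFA state q, c occurrences)
    table = defaultdict(float)
    for h in states:
        table[(h, dfa[0].get(h, 0), 0)] += init[h] * emit[h][obs[0]]
    for y in obs[1:]:
        new_table = defaultdict(float)
        for (h, q, c), p in table.items():
            for h2 in states:
                q2 = dfa[q].get(h2, 0)
                c2 = c + 1 if (q2 == 3 and q != 3) else c
                new_table[(h2, q2, c2)] += p * trans[h][h2] * emit[h2][y]
        table = new_table
    joint = defaultdict(float)
    for (_, _, c), p in table.items():
        joint[c] += p
    total = sum(joint.values())
    return {c: p / total for c, p in sorted(joint.items())}

# Assumed setting: N = 6, n = 3, pattern anchored at i = 3, uniform start.
N, n = 6, 3
uniform = [1 / (N + 1)] * (N + 1)
dist = occurrence_distribution([2, 1, 2, 2, 2, 3, 3], N, n, 3, uniform)
```

`dist[c]` is the probability that the hidden allele-count path contains exactly c occurrences of the pattern, given the observed sample counts.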
33. Conditional expectations for CTMCs
› Required expectations
  › Time
  › Jumps
› Contribution: compare and extend existing methods
  › Accuracy
  › Speed
[Figure: CTMC rate diagram over states i−1, i, i+1 with rates r_{i−1}, r_i, r_{i+1}; a path from state 3 to state 5 over time t]
P. Tataru and A. Hobolth. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics, 12(1):465, 2011
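For reference, the two required expectations have a standard integral form when the chain is conditioned on its end points. The notation below is an assumption, not taken from the slides: Q = (q_{ab}) is the rate matrix, P(t) = e^{Qt}, T_a is the total time spent in state a, and N_{ab} is the number of jumps from a to b.

```latex
\begin{align*}
  I_{ij}(a, b; t) &= \int_0^t P_{ia}(s)\, P_{bj}(t - s)\, ds, \\
  \mathbb{E}\left[ T_a \mid X(0) = i,\ X(t) = j \right]
    &= \frac{I_{ij}(a, a; t)}{P_{ij}(t)}, \\
  \mathbb{E}\left[ N_{ab} \mid X(0) = i,\ X(t) = j \right]
    &= \frac{q_{ab}\, I_{ij}(a, b; t)}{P_{ij}(t)}, \qquad a \neq b.
\end{align*}
```

Both expectations reduce to evaluating the same integral, so methods differ mainly in how accurately and how quickly they compute it.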
36. Beta with spikes: approximate Wright-Fisher
[Figure: individuals across generations (time); allele count per generation: 3, 2, 4, 3, 4, 5, 5]
› What is the distribution of allele count in the current generation?
› Use the beta distribution
› Contribution:
  › Add spikes (better fit)
  › Include selection
P. Tataru, T. Bataillon, and A. Hobolth. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach. In preparation
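The basic beta approximation can be illustrated by matching the mean and variance of the allele frequency. The sketch below covers pure drift only, under the binomial Wright-Fisher transitions shown earlier; mutation, selection and the spikes at 0 and 1 are left out, and the function names are illustrative, not the authors' implementation.

```python
def drift_moments(x0, N, t):
    """Mean and variance of the allele frequency after t generations of pure drift,
    starting from frequency x0 in a population of N gene copies."""
    mean = x0
    var = x0 * (1 - x0) * (1 - (1 - 1 / N) ** t)
    return mean, var

def beta_parameters(mean, var):
    """Shape parameters of the beta distribution with this mean and variance."""
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

# Assumed example: start at frequency 3/6 in a population of size 6, one generation.
mean, var = drift_moments(0.5, 6, 1)
alpha, beta = beta_parameters(mean, var)
```

As t grows, the variance approaches x0 (1 − x0) and both shape parameters shrink towards 0, so the fitted beta piles its mass near 0 and 1. A plain beta density still assigns no probability to exact loss or fixation, which is the gap the spikes are meant to address.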
42. Population genetics: the coalescent model
› Trace the genealogy of sampled individuals backward in time
› Coalescent process terminates when reaching the MRCA (most recent common ancestor)
› Time to coalescent event: CTMC
[Figure: genealogy of the sampled individuals across generations (time), ending at the MRCA]
3. Overview: diCal-IBD
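In the standard coalescent with a constant population size, the waiting time while k lineages remain is exponential with rate k(k−1)/2 in coalescent time units. The sketch below simulates the time to the MRCA under that assumption; the function names and the sample size of 5 are illustrative.

```python
import random
from math import comb

def simulate_tmrca(n, rng):
    """Draw one time to the MRCA for a sample of n lineages (coalescent units)."""
    t = 0.0
    for k in range(n, 1, -1):
        # While k lineages remain, the next coalescence arrives at rate C(k, 2).
        t += rng.expovariate(comb(k, 2))
    return t

def expected_tmrca(n):
    """Sum of the expected waiting times; equals 2 (1 - 1/n)."""
    return sum(1 / comb(k, 2) for k in range(2, n + 1))

rng = random.Random(1)  # fixed seed so the run is reproducible
draws = [simulate_tmrca(5, rng) for _ in range(20000)]
simulated_mean = sum(draws) / len(draws)
```

The last two lineages take the longest to coalesce, which is why the expected time to the MRCA stays below 2 however large the sample is.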
50. The coalescent model: adding recombination
› Multiple sequences and loci analysis
› HMM: hidden states = (possible) coalescent trees for each locus
› CTMC: probability of the alleles at the leaves
54. diCal-IBD: CTMCs and HMMs
› IBD (identity-by-descent) applications
  › Association mapping
  › Demography inference
  › Detection of selection
› First method to use the coalescent with recombination
› One of the first methods to use sequence data
› Outperforms SNP-based methods
› Comparable with sequence-based method
P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014
56. Selection and IBD
› IBD tract length depends on MRCA
  › The more recent the MRCA, the longer the tract
› Recent MRCA can be an indication of positive selection
  › IBD can be used for detecting positive selection¹
  › Standing variation
¹ Albrechtsen A et al. Genetics 2010;186:295-308
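The link between MRCA age and tract length can be made concrete with a common back-of-the-envelope model, which is an assumption here rather than something stated on the slides. Two haplotypes whose MRCA at a focal site lived t generations ago are separated by 2t meioses, recombination breakpoints fall along the genome at rate 2t per Morgan, and the tract extends to the nearest breakpoint on each side of the focal site.

```latex
\begin{align*}
  L_{\text{left}},\ L_{\text{right}} &\sim \operatorname{Exp}(2t)
    \quad \text{(independent, in Morgans)}, \\
  L = L_{\text{left}} + L_{\text{right}} &\sim \operatorname{Gamma}(2,\ 2t), \\
  \mathbb{E}[L] &= \frac{2}{2t} = \frac{1}{t}\ \text{Morgans}
    = \frac{100}{t}\ \text{cM}.
\end{align*}
```

Under this model an MRCA 10 generations back gives an expected tract of about 10 cM, while one 1000 generations back gives about 0.1 cM.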
58. The SLC24A5 gene: diCal-IBD
SLC24A5
› Major influence on natural skin color variation
› Under positive selection in Europeans¹
¹ Wilde S et al. PNAS 2014;111(13):4832-4837