PaulaTataru_PhD_defense

Deciphering the story in our DNA
Inference of population history and patterns from molecular data
PhD defense, Paula Tataru
Aarhus University, Bioinformatics Research Centre
Aarhus, January 23rd 2015
Supervisors: Christian N.S. Pedersen & Asger Hobolth

My PhD studies
› Math: mathematical modeling
› Computer science: implementation
› Bioinformatics: population genetics, sequence analysis, pattern matching, structural bioinformatics

Mathematical modeling toolbox
› SCFG: Stochastic Context-Free Grammar
› DFA: Deterministic Finite Automaton
› DTMC: Discrete Time Markov Chain
› HMM: Hidden Markov Model
› CTMC: Continuous Time Markov Chain

Research overview
• 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction [SCFG]
• 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information [SCFG]
• 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains [CTMC]
• 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions [HMM & DFA]
• 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals [CTMC & HMM]
• In preparation: Motif discovery in ranked lists of sequences [DTMC & DFA]
• In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach [DTMC]
Research areas: population genetics, sequence analysis, pattern matching, structural bioinformatics

Outline
• 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction [SCFG]
• 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information [SCFG]
1. Modeling toolbox: SCFG, DFA, DTMC, HMM, CTMC
2. Brief overview
• 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains [CTMC]
• 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions [HMM & DFA]
• In preparation: Motif discovery in ranked lists of sequences [DTMC & DFA]
• In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach [DTMC]
3. Overview: diCal-IBD
• 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals [CTMC & HMM]

1. Modeling toolbox
Population genetics: the Wright-Fisher model
› Evolution of a population forward in time
› Follow the change of the allele count
[Figure: individuals across generations (time); allele count per generation: 3, 2, 4, 3, 4, 5, 5]

Discrete Time Markov Chain (DTMC)
› Allele count per generation in a population of size N
› States: {0, 1, …, N}, with 0 ≤ i, j ≤ N
› Transitions: binomial, from i to j with probability Bin(j | N, i/N)
[Figure: states i and j with transitions Bin(j | N, i/N) and Bin(i | N, j/N) and self-loops Bin(i | N, i/N) and Bin(j | N, j/N); allele count path 3, 2, 4, 3, 4, 5, 5 annotated with the probabilities 0.23, 0.20, 0.33, 0.08, 1 and 0.26]
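The binomial transition rule of this slide can be checked numerically. The following is a minimal Python sketch, not code from the thesis. The function names are hypothetical, and the population size N = 5 is an assumption: the slide does not state N, but N = 5 reproduces the six probabilities printed in the figure.

```python
from math import comb

def wf_transition(i, j, N):
    """P(next allele count = j | current count = i) = Bin(j | N, i/N)."""
    p = i / N
    return comb(N, j) * p**j * (1 - p)**(N - j)

def path_probabilities(path, N):
    """Transition probability of each consecutive step along a path."""
    return [wf_transition(a, b, N) for a, b in zip(path, path[1:])]

N = 5  # assumed population size (not stated on the slide)
path = [3, 2, 4, 3, 4, 5, 5]  # allele counts shown on the slide
probs = [round(p, 2) for p in path_probabilities(path, N)]
print(probs)  # [0.23, 0.08, 0.2, 0.26, 0.33, 1.0]
```

Under this assumption the values agree with those in the figure, which lists them in a different order. The last step has probability 1 because count N is an absorbing state (fixation).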

Hidden Markov Model (HMM)
› Allele count per generation in a population of size N
› Difficult to measure
› Directly affects allele count of a sample of size n
[Figure: individuals across generations (time); population allele count 3, 2, 4, 3, 4, 5, 5 (hidden Markov chain) and sample allele count 2, 1, 2, 2, 2, 3, 3 (observable data)]

Continuous Time Markov Chain (CTMC)
› Allele count in time in a population of size N
› States: {0, 1, …, N}
› Rates: r_i = i (N - i) / N, for 0 < i < N
[Figure: chain i-1 ↔ i ↔ i+1; state i moves to i-1 and to i+1 at rate r_i each, with diagonal entry -2 r_i, and likewise r_{i-1}, -2 r_{i-1} and r_{i+1}, -2 r_{i+1} for the neighbouring states]
[Figure: allele count in DTMC, 3 2 4 3 4 5 5 with transition probabilities 0.23, 0.20, 0.33, 0.08, 1 and 0.26, compared with allele count in CTMC, where the count changes by one at times t1, t2, … and each jump has probability 1/2]
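The rates on this slide define a rate matrix for the allele count. The following is a minimal Python sketch of that matrix, assuming N = 5 for illustration (the slide does not fix N). The function names are hypothetical.

```python
def rate(i, N):
    """r_i = i (N - i) / N, the rate of each of the moves i -> i-1 and i -> i+1."""
    return i * (N - i) / N

def generator(N):
    """(N+1) x (N+1) rate matrix Q as nested lists; states 0 and N are absorbing."""
    Q = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N):
        r = rate(i, N)
        Q[i][i - 1] = r      # one copy lost
        Q[i][i + 1] = r      # one copy gained
        Q[i][i] = -2 * r     # diagonal makes the row sum to zero
    return Q

Q = generator(5)  # N = 5 is an assumed value
for row in Q:
    print(row)
```

Because both off-diagonal rates in row i equal r_i, the jump chain moves down or up with probability r_i / (2 r_i) = 1/2 each, which matches the 1/2 labels in the figure.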

2. Brief overview
Motif discovery for DTMCs using DFAs
› DTMC describes sequences: allele count in a population
› DFA encodes pattern: (i)+ (i+1)+ (i+2)+
[Figure: DFA with states 0, 1, 2, 3 and transitions labeled i, i+1, i+2 and k, shown with the DTMC (transitions Bin(j | N, i/N) and Bin(i | N, j/N), self-loops Bin(i | N, i/N) and Bin(j | N, j/N)) and the sequence 3 2 4 3 4 5 5]
M. M. Nielsen, P. Tataru, T. Madsen, A. Hobolth, and J. S. Pedersen. Motif discovery in ranked lists of sequences. In preparation.
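The pattern (i)+ (i+1)+ (i+2)+ can be tracked with a four-state automaton. The following is a minimal Python sketch of one such DFA. The transition table is an assumed construction for illustration, not a transcription of the figure, and the function names are hypothetical.

```python
def build_dfa(i):
    """Transition table for a DFA tracking the pattern (i)+ (i+1)+ (i+2)+.

    States: 0 = nothing matched, 1 = in the run of i, 2 = in the run of i+1,
    3 = in the run of i+2 (accepting). Symbols not listed send the DFA to 0.
    """
    a, b, c = i, i + 1, i + 2
    return {
        0: {a: 1},
        1: {a: 1, b: 2},
        2: {a: 1, b: 2, c: 3},
        3: {a: 1, c: 3},
    }

def match_ends(sequence, i):
    """Positions (0-based) at which an occurrence of the pattern ends."""
    dfa, state, ends = build_dfa(i), 0, []
    for pos, symbol in enumerate(sequence):
        state = dfa[state].get(symbol, 0)
        if state == 3:
            ends.append(pos)
    return ends

print(match_ends([3, 2, 4, 3, 4, 5, 5], i=3))  # [5, 6]
```

For i = 3 the slide's sequence contains the occurrences 3 4 5 and 3 4 5 5, which end at positions 5 and 6.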

› Does the pattern (i)+ (i+1)+ (i+2)+ occur more frequently in specific environments?
› Contribution: DFA; new approach to significance (random walk)
[Figure: populations and their environment; each population has a DTMC sequence over generations (time), e.g. … 2 4 3 4 5 5]

Restricted algorithms for HMMs using DFAs
› HMM describes problem: allele count in a population
› DFA encodes pattern: (i)+ (i+1)+ (i+2)+
[Figure: HMM with hidden states i and j; transitions Bin(j | N, i/N) and Bin(i | N, j/N), self-loops Bin(i | N, i/N) and Bin(j | N, j/N); state i emits sample count k = 0, 1, 2, 3 with probability Bin(k | n, i/N), and state j with probability Bin(k | n, j/N). Shown with the DFA for the pattern (states 0, 1, 2, 3; transitions labeled i, i+1, i+2 and k), the observed sequence 2 1 2 2 2 3 3 and the hidden sequence 3 2 4 3 4 5 5]
P. Tataru, A. Sand, A. Hobolth, T. Mailund, and C. N. Pedersen. Algorithms for hidden Markov models restricted to occurrences of regular expressions. Biology, 2(4):1282–1295, 2013
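The transition and emission probabilities on this slide are enough to evaluate the likelihood of an observed sequence with the standard forward algorithm. The following is a minimal Python sketch of the plain, unrestricted forward recursion, not the restricted algorithms of the paper. The sizes N = 5 and n = 3 and the uniform initial distribution are assumptions, and the function names are hypothetical.

```python
from math import comb

def binom(k, n, p):
    """Bin(k | n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def forward_likelihood(observed, N, n, initial):
    """Likelihood of observed sample counts under the Wright-Fisher HMM.

    Hidden state i = allele count in the population (0..N);
    transition Bin(j | N, i/N); emission Bin(k | n, i/N).
    `initial` is a distribution over hidden states (an assumed input).
    """
    states = range(N + 1)
    alpha = [initial[i] * binom(observed[0], n, i / N) for i in states]
    for k in observed[1:]:
        alpha = [
            binom(k, n, j / N) * sum(alpha[i] * binom(j, N, i / N) for i in states)
            for j in states
        ]
    return sum(alpha)

N, n = 5, 3  # assumed population and sample sizes
uniform = [1 / (N + 1)] * (N + 1)  # assumed initial distribution
print(forward_likelihood([2, 1, 2, 2, 2, 3, 3], N, n, uniform))
```

Each step costs O(N^2) operations, so the whole sequence of length L costs O(L N^2). For long sequences the recursion would normally be scaled or run in log space to avoid underflow.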

Conditional expectations for CTMCs
› Required expectations: time and jumps
› Contribution: compare and extend existing methods (accuracy, speed)
[Figure: CTMC path of the allele count and the chain i-1 ↔ i ↔ i+1 with rates r_{i-1}, r_i, r_{i+1} and diagonal entries -2 r_{i-1}, -2 r_i, -2 r_{i+1}; example end points 3 and 5 separated by time t]
P. Tataru and A. Hobolth. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics, 12(1):465, 2011
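To make the two required expectations concrete, they can be estimated by simulation. The following is a minimal Python sketch using rejection sampling, a simple stand-in chosen for illustration and not necessarily one of the methods the paper compares. It uses the rates r_i = i (N - i) / N from the toolbox slide. The values N = 6 and T = 1.0, the sample count and all function names are assumptions.

```python
import random

def simulate_path(start, T, N, rng):
    """Simulate the CTMC for time T; return (end state, time per state, jumps)."""
    state, t, jumps = start, 0.0, 0
    time_in = {}
    while True:
        total = 2 * state * (N - state) / N  # total rate of leaving `state`
        wait = rng.expovariate(total) if total > 0 else float("inf")
        if t + wait >= T:
            time_in[state] = time_in.get(state, 0.0) + (T - t)
            return state, time_in, jumps
        time_in[state] = time_in.get(state, 0.0) + wait
        t += wait
        state += rng.choice((-1, 1))  # down or up, each with probability 1/2
        jumps += 1

def conditional_expectations(start, end, T, N, samples=20000, seed=1):
    """Monte Carlo estimate of E[time in each state] and E[jumps] given end points."""
    rng = random.Random(seed)
    accepted, jump_sum = 0, 0
    time_sum = [0.0] * (N + 1)
    for _ in range(samples):
        final, time_in, jumps = simulate_path(start, T, N, rng)
        if final != end:
            continue  # rejection step: keep only paths that end in `end`
        accepted += 1
        jump_sum += jumps
        for s, d in time_in.items():
            time_sum[s] += d
    if accepted == 0:
        raise ValueError("no simulated path reached the end state")
    return [x / accepted for x in time_sum], jump_sum / accepted

times, jumps = conditional_expectations(start=3, end=5, T=1.0, N=6)
print(times, jumps)
```

Rejection sampling becomes slow when the end state is unlikely, which is one reason analytical methods are attractive for this problem.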

Beta with spikes: approximate Wright-Fisher
› What is the distribution of allele count in the current generation?
› Use the beta distribution
› Contribution: add spikes (better fit); include selection
[Figure: individuals across generations (time); allele count per generation: 3, 2, 4, 3, 4, 5, 5]
P. Tataru, T. Bataillon, and A. Hobolth. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach. In preparation
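One common way to use the beta distribution here is to match its first two moments to those of the Wright-Fisher allele frequency. The following is a minimal Python sketch of that moment matching under pure drift. It assumes no mutation or selection and a population of N gene copies, and it leaves out the spikes at 0 and 1 that the slide lists as a contribution. The function names and example values are hypothetical.

```python
def beta_parameters(x0, N, t):
    """Moment-matched Beta(alpha, beta) for the allele frequency after t generations.

    Assumes pure drift, where the frequency has mean x0 and
    variance x0 (1 - x0) (1 - (1 - 1/N)^t). Requires t >= 1 and 0 < x0 < 1.
    """
    F = 1 - (1 - 1 / N) ** t
    scale = 1 / F - 1  # equals alpha + beta
    return x0 * scale, (1 - x0) * scale

def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total**2 * (total + 1))
    return mean, var

# Assumed example values: starting frequency 0.6, N = 100, 20 generations.
alpha, beta = beta_parameters(x0=0.6, N=100, t=20)
print(alpha, beta, beta_mean_var(alpha, beta))
```

A plain beta density places no probability mass exactly on loss (0) or fixation (1), which is the gap that adding spikes is meant to address.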

3. Overview: diCal-IBD
Population genetics: the coalescent model
› Trace the genealogy of sampled individuals backward in time
› Coalescent process terminates when reaching the most recent common ancestor (MRCA)
› Time to coalescent event: CTMC
[Figure: individuals across generations (time), with the genealogy of the sample traced back to the MRCA]
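The waiting times between coalescent events can be simulated directly. The following is a minimal Python sketch under the standard coalescent with constant population size, with time measured in coalescent units. Both are assumptions, since the slide does not specify a demographic model, and the function names are hypothetical.

```python
import random

def expected_tmrca(n):
    """Expected time to the MRCA for n sampled lineages (coalescent time units)."""
    return sum(2 / (k * (k - 1)) for k in range(2, n + 1))

def simulate_tmrca(n, rng):
    """One draw of the time to the MRCA: while k lineages remain, the waiting
    time to the next coalescent event is exponential with rate k (k - 1) / 2."""
    t = 0.0
    for k in range(n, 1, -1):
        t += rng.expovariate(k * (k - 1) / 2)
    return t

rng = random.Random(0)
draws = [simulate_tmrca(5, rng) for _ in range(10000)]
print(expected_tmrca(5), sum(draws) / len(draws))
```

The number of lineages only decreases, one coalescent event at a time, so the process is a pure death chain that stops when a single lineage, the MRCA, remains.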

Recombination

The coalescent model: adding recombination
› Multiple sequences and loci analysis
› HMM: hidden states = (possible) coalescent trees for each locus
› CTMC: probability of the alleles at the leaves

Identity by descent (IBD)
[Figure: IBD tract]

diCal-IBD: CTMCs and HMMs
› IBD applications: association mapping, demography inference, detection of selection
› First method to use the coalescent with recombination
› One of the first methods to use sequence data
› Outperforms SNP-based methods
› Comparable with sequence-based method
P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014

Selection and IBD
› IBD tract length depends on MRCA
› The more recent the MRCA, the longer the tract
› Recent MRCA can be an indication of positive selection
› IBD can be used for detecting positive selection [1]
› Standing variation
[1] Albrechtsen A et al. Genetics 2010;186:295-308
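The link between MRCA age and tract length can be illustrated with a common back-of-the-envelope approximation. The following is a minimal Python sketch, not the diCal-IBD model. It assumes that the distance from a focal site to the end of the tract on one side is exponentially distributed with rate 2 r t, for two lineages that each go back t generations with recombination rate r per base pair per generation. The rate value and the function name are hypothetical.

```python
def expected_ibd_half_length(tmrca_generations, recomb_rate_per_bp):
    """Expected distance (bp) from a focal site to the tract end on one side,
    under the exponential approximation with rate 2 * r * t."""
    return 1 / (2 * recomb_rate_per_bp * tmrca_generations)

r = 1e-8  # assumed recombination rate per bp per generation
for t in (10, 100, 1000):
    print(t, expected_ibd_half_length(t, r))
```

The expected length is inversely proportional to t, so an MRCA ten times more recent gives a tract about ten times longer.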

Selection and IBD
› IBD segment length depends on MRCA
› The more recent the MRCA, the longer the segment
› Recent MRCA can be an indication of recent selection
› IBD can be used for detecting selection [1]
› Standing variation
[1] Albrechtsen A et al. Genetics 2010;186:295-308

The SLC24A5 gene: diCal-IBD
› Major influence on natural skin color variation
› Under positive selection in Europeans [1]
[Figure label: SLC24A5]
[1] Wilde S et al. PNAS 2014;111(13):4832-4837

Population genetics: Methods vs Data
[Figure: methods diCal, ARGWeaver, MSMC, ms, MaCS, PSMC, coalHMM and Kim Tree, joined by Beta with spikes and diCal-IBD, set against Data; final label: dimension reduction]

Thank you for your attention!
