Deciphering the story in our DNA
PhD defense Paula Tataru
AARHUS
UNIVERSITY
Bioinformatics
Research Centre
Aarhus, January 23rd 2015
Inference of population history
and patterns from molecular data
Supervisors: Christian N.S. Pedersen & Asger Hobolth
My PhD studies
Bioinformatics: Math (mathematical modeling) and Computer science (implementation)
› Population genetics
› Sequence analysis
› Pattern matching
› Structural bioinformatics
Mathematical modeling toolbox
› SCFG: Stochastic Context-Free Grammar
› DFA: Deterministic Finite Automaton
› DTMC: Discrete Time Markov Chain
› HMM: Hidden Markov Model
› CTMC: Continuous Time Markov Chain
Research overview
• 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction (SCFG)
• 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information (SCFG)
• 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains (CTMC)
• 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions (HMM & DFA)
• 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals (CTMC & HMM)
• In preparation: Motif discovery in ranked lists of sequences (DTMC & DFA)
• In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach (DTMC)
Research areas: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics
Outline
1. Modeling toolbox: SCFG, DFA, DTMC, HMM, CTMC
2. Brief overview
• 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains (CTMC)
• 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions (HMM & DFA)
• In preparation: Motif discovery in ranked lists of sequences (DTMC & DFA)
• In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach (DTMC)
3. Overview: diCal-IBD
• 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals (CTMC & HMM)
Listed without a part number:
• 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction (SCFG)
• 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information (SCFG)
Research areas: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics
1. Modeling toolbox
Population genetics: the Wright-Fisher model
› Evolution of a population forward in time
› Follow the change of the allele count
[Figure: individuals across generations (time); allele count per generation: 3, 2, 4, 3, 4, 5, 5]
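The forward-in-time picture above can be sketched in a few lines. This is a minimal illustration assuming pure drift (no mutation or selection) and binomial resampling of the allele count. The function name `wright_fisher_path`, the population size N = 5, the starting count and the seed are choices made for this sketch, not values stated on the slide.

```python
import random


def wright_fisher_path(N, start, generations, rng):
    """Simulate allele counts forward in time under pure drift.

    Each of the N individuals in the next generation copies its allele from a
    uniformly chosen parent, so the next count is Binomial(N, count / N).
    """
    counts = [start]
    for _ in range(generations):
        p = counts[-1] / N
        # The sum of N Bernoulli(p) draws is one Binomial(N, p) draw.
        counts.append(sum(rng.random() < p for _ in range(N)))
    return counts


rng = random.Random(1)  # fixed seed so the run is repeatable
path = wright_fisher_path(N=5, start=3, generations=6, rng=rng)
print(path)  # seven allele counts, each between 0 and 5
```

Counts 0 and N are absorbing in this sketch: once the allele is lost or fixed, the path stays there.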
Discrete Time Markov Chain
› Allele count per generation in a population of size N
› States: {0, 1, …, N}, with 0 ≤ i, j ≤ N
› Transitions: binomial
[Diagram (DTMC): states i and j; transition i → j with probability Bin(j | N, i/N) and j → i with probability Bin(i | N, j/N); self-loops Bin(i | N, i/N) and Bin(j | N, j/N)]
[Figure: allele count per generation 3, 2, 4, 3, 4, 5, 5, annotated with the transition probabilities 0.23, 0.20, 0.33, 0.08, 1, 0.26]
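As a worked example of the binomial transitions, the sketch below builds the full transition matrix P[i][j] = Bin(j | N, i/N) and evaluates it along the allele-count path from the slide. The population size N = 5 is an assumption: it is not stated on the slide, but it reproduces the six probabilities shown there. The helper names `binom_pmf` and `wf_transition_matrix` are made up for this illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Bin(k | n, p): probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def wf_transition_matrix(N):
    """P[i][j] = Bin(j | N, i/N) for allele counts i, j in {0, ..., N}."""
    return [[binom_pmf(j, N, i / N) for j in range(N + 1)] for i in range(N + 1)]


P = wf_transition_matrix(5)      # N = 5 is an assumption (see lead-in)
path = [3, 2, 4, 3, 4, 5, 5]     # allele counts from the slide
steps = [round(P[a][b], 2) for a, b in zip(path, path[1:])]
print(steps)  # [0.23, 0.08, 0.2, 0.26, 0.33, 1.0]
```

Listed in path order, these are the same six values that appear in the figure (there in layout order). The final 1.0 reflects that count N is absorbing.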
Hidden Markov Model
› Allele count per generation in a population of size N
› Difficult to measure
› Directly affects allele count of a sample of size n
[Figure: individuals across generations (time)]
Hidden Markov chain (population allele count): 3, 2, 4, 3, 4, 5, 5
Observable data (sample allele count): 2, 1, 2, 2, 2, 3, 3
Hidden Markov Model
› Transitions (DTMC): binomial
› Emissions (data): binomial
[Diagram: hidden states i and j; transitions Bin(j | N, i/N) and Bin(i | N, j/N), self-loops Bin(i | N, i/N) and Bin(j | N, j/N); state i emits sample count k = 0, 1, 2, 3 with probability Bin(k | n, i/N), and state j with probability Bin(k | n, j/N)]
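The emission side of the model can be tabulated the same way as the transitions. The sketch below assumes N = 5 and n = 3; the slide lists emissions for sample counts 0 to 3, which suggests n = 3, but neither size is stated explicitly. `emission_matrix` is a name chosen for this illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Bin(k | n, p): probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def emission_matrix(N, n):
    """E[i][k] = Bin(k | n, i/N): sample count k given population count i."""
    return [[binom_pmf(k, n, i / N) for k in range(n + 1)] for i in range(N + 1)]


E = emission_matrix(N=5, n=3)  # both sizes are assumptions (see lead-in)
for i, row in enumerate(E):
    print(i, [round(x, 3) for x in row])
```

Each row is a probability distribution over the observable sample counts for one hidden population count.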
Hidden Markov Model
› Standard algorithms
› Forward: likelihood of data
› Viterbi: global decoding
› Posterior decoding: local decoding
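Two of the standard algorithms named above, forward and Viterbi, can be sketched for the binomial HMM as follows. This is a textbook version rather than code from the thesis. It assumes N = 5, n = 3 and a uniform initial distribution, none of which is given on the slide, and it works with raw probabilities, which is only safe for short sequences (longer ones would need scaling or log space).

```python
from math import comb


def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def build_hmm(N, n):
    trans = [[binom_pmf(j, N, i / N) for j in range(N + 1)] for i in range(N + 1)]
    emit = [[binom_pmf(k, n, i / N) for k in range(n + 1)] for i in range(N + 1)]
    init = [1 / (N + 1)] * (N + 1)  # uniform start: an assumption
    return init, trans, emit


def forward_likelihood(obs, init, trans, emit):
    """Forward algorithm: P(obs), summed over all hidden paths."""
    S = range(len(init))
    alpha = [init[s] * emit[s][obs[0]] for s in S]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in S) * emit[s][o] for s in S]
    return sum(alpha)


def viterbi(obs, init, trans, emit):
    """Viterbi: the single most probable hidden path (global decoding)."""
    S = range(len(init))
    delta = [init[s] * emit[s][obs[0]] for s in S]
    back = []
    for o in obs[1:]:
        # Best predecessor of every state s at this step.
        prev = [max(S, key=lambda r: delta[r] * trans[r][s]) for s in S]
        delta = [delta[prev[s]] * trans[prev[s]][s] * emit[s][o] for s in S]
        back.append(prev)
    state = max(S, key=lambda s: delta[s])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    return path[::-1]


init, trans, emit = build_hmm(N=5, n=3)
sample = [2, 1, 2, 2, 2, 3, 3]  # observable data from the slides
likelihood = forward_likelihood(sample, init, trans, emit)
decoded = viterbi(sample, init, trans, emit)
print(likelihood, decoded)
```

Posterior (local) decoding would additionally need a backward pass; it is left out to keep the sketch short.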
Continuous Time Markov Chain
› Allele count in time in a population of size N
› States: {0, 1, …, N}, with 0 < i < N
› Rates: r_i = i (N - i) / N
[Diagram (CTMC): states i-1, i, i+1; state i jumps to i-1 and to i+1 at rate r_i each, so its diagonal entry is -2 r_i (likewise -2 r_{i-1} and -2 r_{i+1} for its neighbours)]
[Figure: allele count in DTMC: 3, 2, 4, 3, 4, 5, 5, annotated with the transition probabilities 0.23, 0.20, 0.33, 0.08, 1, 0.26]
[Figure: allele count in CTMC: a path with waiting-time labels t between jumps and jump probabilities 1/2]
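The rate matrix described above can be written out directly. The sketch below assumes that the boundary counts 0 and N are absorbing, which follows from r_0 = r_N = 0 under the stated formula, and uses N = 5 for illustration; `ctmc_rate_matrix` is a name chosen here.

```python
def ctmc_rate_matrix(N):
    """Q[i][j] for allele counts 0..N with r_i = i (N - i) / N.

    From an interior state i the chain jumps to i-1 and to i+1 at rate r_i
    each, so the diagonal is -2 r_i. States 0 and N have rate 0 (absorbing).
    """
    Q = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N):
        r = i * (N - i) / N
        Q[i][i - 1] = r
        Q[i][i + 1] = r
        Q[i][i] = -2 * r
    return Q


Q = ctmc_rate_matrix(5)
i = 3
mean_wait = -1 / Q[i][i]            # mean waiting time in state i
jump_down = Q[i][i - 1] / -Q[i][i]  # probability that the next jump is to i-1
print(mean_wait, jump_down)
```

Because the two off-diagonal rates are equal, the embedded jump chain moves down or up with probability 1/2 each, matching the 1/2 labels in the figure.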
2. Brief overview
Motif discovery for DTMCs using DFAs
› DTMC describes sequences
› Allele count in a population
› DFA encodes pattern
› (i)+ (i+1)+ (i+2)+
[Diagram: DFA with states 0, 1, 2, 3 and transitions labelled i, i+1, i+2 and k]
[Diagram: DTMC with states i and j, transitions Bin(j | N, i/N) and Bin(i | N, j/N), self-loops Bin(i | N, i/N) and Bin(j | N, j/N); example sequence 3 2 4 3 4 5 5]
M. M. Nielsen, P. Tataru, T. Madsen, A. Hobolth, and J. S. Pedersen. Motif discovery in ranked lists of sequences. In preparation.
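To make the pattern (i)+ (i+1)+ (i+2)+ concrete, here is one way to encode it as a four-state DFA and run it over the example sequence. The state numbering follows the 0 to 3 labels in the diagram, but the exact transition table is reconstructed for this sketch and may differ in detail from the one on the slide. Reporting every position at which the DFA sits in its accepting state is also an assumption about how occurrences are counted.

```python
def pattern_dfa(i):
    """Transition function of a DFA for (i)+ (i+1)+ (i+2)+.

    States: 0 = nothing matched, 1 = in the run of i, 2 = in the run of i+1,
    3 = in the run of i+2 (accepting).
    """
    def step(state, symbol):
        if symbol == i:
            return 1                      # a new run of i can always start
        if symbol == i + 1 and state in (1, 2):
            return 2
        if symbol == i + 2 and state in (2, 3):
            return 3
        return 0                          # any other symbol resets the match
    return step


def match_ends(sequence, i):
    """Positions (0-based) at which the DFA is in its accepting state."""
    step, state, ends = pattern_dfa(i), 0, []
    for pos, symbol in enumerate(sequence):
        state = step(state, symbol)
        if state == 3:
            ends.append(pos)
    return ends


print(match_ends([3, 2, 4, 3, 4, 5, 5], i=3))  # [5, 6]
```

For the slide's sequence with i = 3, matches end at the last two positions: once for 3 4 5 and once for 3 4 5 5.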
Motif discovery for DTMCs using DFAs
› Does the pattern (i)+ (i+1)+ (i+2)+ occur more frequently in specific environments?
[Figure: populations as DTMC sequences (… 2 4 3 4 5 5) over generations (time), arranged along an Environment axis]
› Contribution
› DFA
› New approach to significance (random walk)
Restricted algorithms for HMMs using DFAs
› HMM describes problem
› Allele count in a population
› DFA encodes pattern
› (i)+ (i+1)+ (i+2)+
[Diagrams: the HMM with binomial transitions and emissions, and the DFA with states 0, 1, 2, 3 for the pattern; observed sample counts 2 1 2 2 2 3 3 and hidden allele counts 3 2 4 3 4 5 5]
› Contribution: new algorithms
› Calculate distribution of #pattern occurrences
› Adapt decoding algorithms to include #pattern occurrences
P. Tataru, A. Sand, A. Hobolth, T. Mailund, and C. N. Pedersen. Algorithms for hidden Markov models restricted to occurrences of regular expressions. Biology, 2(4):1282–1295, 2013.
Conditional expectations for CTMCs
› Required expectations
› Time
› Jumps
› Contribution: compare and extend existing methods
› Accuracy
› Speed
[Diagram: the CTMC on states i-1, i, i+1 with rates r_i and diagonal entries -2 r_i; an example path with waiting-time labels and jump probabilities 1/2; endpoints 3 and 5 over a time interval t]
P. Tataru and A. Hobolth. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics, 12(1):465, 2011.
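For reference, the two required expectations have standard integral representations. The notation below is an assumption made for this note and is not transcribed from the slide: Q = (q_{ab}) is the rate matrix, P(t) = e^{Qt}, T_a is the time spent in state a, and N_{ab} is the number of jumps from a to b, all conditional on the chain starting in i and ending in j after time t.

```latex
\begin{align}
  \mathbb{E}\left[T_a \mid X_0 = i,\, X_t = j\right]
    &= \frac{1}{P_{ij}(t)} \int_0^t P_{ia}(s)\, P_{aj}(t-s)\, \mathrm{d}s, \\
  \mathbb{E}\left[N_{ab} \mid X_0 = i,\, X_t = j\right]
    &= \frac{q_{ab}}{P_{ij}(t)} \int_0^t P_{ia}(s)\, P_{bj}(t-s)\, \mathrm{d}s .
\end{align}
```

In the slide's example the endpoints are i = 3 and j = 5. The methods being compared differ in how they evaluate these integrals, which is where accuracy and speed come into play.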
Beta with spikes: approximate Wright-Fisher
› What is the distribution of allele count in the current generation?
› Use the beta distribution
› Contribution:
› Add spikes (better fit)
› Include selection
[Figure: individuals across generations (time); allele count per generation: 3, 2, 4, 3, 4, 5, 5]
P. Tataru, T. Bataillon, and A. Hobolth. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach. In preparation.
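The plain beta approximation that the contribution builds on can be illustrated by moment matching. The sketch below covers pure drift only: it uses the standard mean and variance of the allele frequency after t generations in a Wright-Fisher population with N gene copies, then picks the Beta distribution with the same two moments. The spikes at 0 and 1 and the treatment of selection are not included, and the function names and parameter values are chosen for this example.

```python
def drift_moments(x0, N, t):
    """Mean and variance of the allele frequency after t generations of pure
    drift in a Wright-Fisher population with N gene copies."""
    mean = x0
    var = x0 * (1 - x0) * (1 - (1 - 1 / N) ** t)
    return mean, var


def beta_from_moments(mean, var):
    """Shape parameters (a, b) of the Beta distribution with these moments."""
    if not 0 < var < mean * (1 - mean):
        raise ValueError("moments are not attainable by a Beta distribution")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


mean, var = drift_moments(x0=0.6, N=100, t=20)
a, b = beta_from_moments(mean, var)
print(a, b)
```

A Beta density puts no mass on exactly 0 or 1, whereas the Wright-Fisher chain can lose or fix the allele. That gap is what the added spikes on the slide address.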
3. Overview: diCal-IBD
Population genetics: the coalescent model
› Trace the genealogy of sampled individuals backward in time
› Coalescent process terminates when reaching the MRCA (most recent common ancestor)
› Time to coalescent event: CTMC
[Figure: individuals across generations (time), with the MRCA marked]
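The backward-in-time process can be sketched with the standard (Kingman) coalescent, which is assumed here as the underlying model: while k lineages remain, the waiting time to the next coalescent event is exponential with rate k(k-1)/2, in time units scaled by the population size. The function names, the sample size of 5 and the seed are illustrative choices.

```python
import random


def expected_tmrca(n):
    """Expected time to the MRCA of n lineages, in coalescent time units."""
    return sum(2 / (k * (k - 1)) for k in range(2, n + 1))


def simulate_tmrca(n, rng):
    """Draw one time to the MRCA: while k lineages remain, the waiting time to
    the next coalescent event is exponential with rate k (k - 1) / 2."""
    t = 0.0
    for k in range(n, 1, -1):
        t += rng.expovariate(k * (k - 1) / 2)
    return t


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [simulate_tmrca(5, rng) for _ in range(20000)]
print(sum(draws) / len(draws), expected_tmrca(5))
```

The simulated mean should land close to the closed-form value 2(1 - 1/n), which is 1.6 for n = 5.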
Recombination
The coalescent model: adding recombination
› Multiple sequences and loci analysis
› HMM: hidden states = (possible) coalescent trees for each locus
› CTMC: probability of the alleles at the leaves
Identity by descent
[Figure: IBD tract]
diCal-IBD: CTMCs and HMMs
› IBD applications
› Association mapping
› Demography inference
› Detection of selection
› First method to use the coalescent with recombination
› One of the first methods to use sequence data
› Outperforms SNP-based methods
› Comparable with sequence-based method
P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014.
Selection and IBD
› IBD tract (segment) length depends on MRCA
› The more recent the MRCA, the longer the tract
› Recent MRCA can be an indication of recent positive selection
› IBD can be used for detecting positive selection [1]
› Standing variation
[1] Albrechtsen A et al. Genetics 2010;186:295-308
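The link between MRCA age and tract length can be made quantitative with a common back-of-the-envelope model. It is an assumption added here, not something stated on the slide. Two haplotypes whose MRCA lived t generations ago are separated by 2t meioses, so the distance from a focal site to the nearest crossover on each side is roughly exponential with rate 2t per Morgan. The model ignores that some recombination events do not change the MRCA.

```python
def expected_ibd_length_cm(t_generations):
    """Expected IBD tract length (centimorgans) around a site whose MRCA lived
    t generations ago, under the simplified two-sided exponential model."""
    meioses = 2 * t_generations        # t generations down each lineage
    one_side_morgans = 1 / meioses     # mean of Exponential(rate = 2t)
    return 100 * 2 * one_side_morgans  # both sides, Morgans converted to cM


for t in (10, 100, 1000):
    print(t, expected_ibd_length_cm(t))
```

Under this model the expected length is about 100/t cM, so a tenfold more recent MRCA gives a tenfold longer expected tract.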
The SLC24A5 gene: diCal-IBD
SLC24A5
› Major influence on natural skin color variation
› Under positive selection in Europeans [1]
[1] Wilde S et al. PNAS 2014;111(13):4832-4837
P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014.
Population genetics: Methods vs Data
[Figure: methods placed against data; labels shown: Beta with spikes, diCal-IBD, diCal, ARGWeaver, MSMC, ms, MaCS, PSMC, coalHMM, Kim Tree; annotations: Data, dimension reduction]
Thank you for your attention!

More Related Content

What's hot

Goodwin2016 ngs 10 years
Goodwin2016 ngs 10 yearsGoodwin2016 ngs 10 years
Goodwin2016 ngs 10 yearsPrakash Koringa
 
Integrating phylogenetic inference and metadata visualization for NGS data
Integrating phylogenetic inference and metadata visualization for NGS dataIntegrating phylogenetic inference and metadata visualization for NGS data
Integrating phylogenetic inference and metadata visualization for NGS dataJoão André Carriço
 
Giab for jax long read 190917
Giab for jax long read 190917Giab for jax long read 190917
Giab for jax long read 190917GenomeInABottle
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysismikaelhuss
 
BM405 Lecture Slides 21/11/2014 University of Strathclyde
BM405 Lecture Slides 21/11/2014 University of StrathclydeBM405 Lecture Slides 21/11/2014 University of Strathclyde
BM405 Lecture Slides 21/11/2014 University of StrathclydeLeighton Pritchard
 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisUniversity of California, Davis
 
Computational Resources In Infectious Disease
Computational Resources In Infectious DiseaseComputational Resources In Infectious Disease
Computational Resources In Infectious DiseaseJoão André Carriço
 
Identifying the Coding and Non Coding Regions of DNA Using Spectral Analysis
Identifying the Coding and Non Coding Regions of DNA Using Spectral AnalysisIdentifying the Coding and Non Coding Regions of DNA Using Spectral Analysis
Identifying the Coding and Non Coding Regions of DNA Using Spectral AnalysisIJMER
 
Next Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesNext Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesChung-Tsai Su
 
Single-molecule real-time (SMRT) Nanopore sequencing for Plant Pathology appl...
Single-molecule real-time (SMRT) Nanopore sequencing for Plant Pathology appl...Single-molecule real-time (SMRT) Nanopore sequencing for Plant Pathology appl...
Single-molecule real-time (SMRT) Nanopore sequencing for Plant Pathology appl...Joe Parker
 
A collaborative model for bioinformatics education: combining biologically i...
A collaborative model for bioinformatics education:  combining biologically i...A collaborative model for bioinformatics education:  combining biologically i...
A collaborative model for bioinformatics education: combining biologically i...Elia Brodsky
 
Bioinformatics.Practical Notebook
Bioinformatics.Practical NotebookBioinformatics.Practical Notebook
Bioinformatics.Practical NotebookNaima Tahsin
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...GenomeInABottle
 
AI in Bioinformatics
AI in BioinformaticsAI in Bioinformatics
AI in BioinformaticsAli Kishk
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...VHIR Vall d’Hebron Institut de Recerca
 
Closing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceClosing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceJustin Johnson
 

PaulaTataru_PhD_defense

  • 1. Deciphering the story in our DNA: Inference of population history and patterns from molecular data. PhD defense, Paula Tataru, Bioinformatics Research Centre, Aarhus University. Aarhus, January 23rd, 2015. Supervisors: Christian N.S. Pedersen & Asger Hobolth
  • 2. My PhD studies: Math, Computer science, Bioinformatics
  • 3. My PhD studies: Mathematical modeling, Implementation, Bioinformatics
  • 4. My PhD studies: Mathematical modeling; Implementation; Bioinformatics: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics
  • 5. Mathematical modeling toolbox: SCFG, DFA, DTMC, HMM, CTMC
  • 6. Mathematical modeling toolbox: Stochastic Context Free Grammar (SCFG), Deterministic Finite Automaton (DFA), Discrete Time Markov Chain (DTMC), Hidden Markov Model (HMM), Continuous Time Markov Chain (CTMC)
  • 7. Research overview
      • 2012. BMC Bioinformatics. Evolving stochastic context-free grammars for RNA secondary structure prediction [SCFG]
      • 2013. Bioinformatics. Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information [SCFG]
      • 2011. BMC Bioinformatics. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains [CTMC]
      • 2013. Biology. Algorithms for hidden Markov models restricted to occurrences of regular expressions [HMM & DFA]
      • 2014. Bioinformatics. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals [CTMC & HMM]
      • In preparation. Motif discovery in ranked lists of sequences [DTMC & DFA]
      • In preparation. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach [DTMC]
      Areas: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics
  • 10. Outline
      • 2012. BMC Bioinformatics. Evolving stochastic context-free grammars for RNA secondary structure prediction [SCFG]
      • 2013. Bioinformatics. Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information [SCFG]
      Areas: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics
      Part 1: SCFG, DFA, DTMC, HMM, CTMC
      Part 2:
      • 2011. BMC Bioinformatics. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains [CTMC]
      • 2013. Biology. Algorithms for hidden Markov models restricted to occurrences of regular expressions [HMM & DFA]
      • In preparation. Motif discovery in ranked lists of sequences [DTMC & DFA]
      • In preparation. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach [DTMC]
      Part 3:
      • 2014. Bioinformatics. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals [CTMC & HMM]
  • 12. Population genetics: the Wright-Fisher model (Section 1. Modeling toolbox) › Evolution of a population forward in time › Follow the change of the allele count. Figure: individuals by generations (time); allele counts 3, 2, 4, 3, 4, 5, 5
  • 13. Discrete Time Markov Chain (DTMC) › Allele count per generation in a population of size N › States {0, 1, …, N}, 0 ≤ i, j ≤ N › Transitions: binomial. Diagram: states i and j with transition probabilities i → i: Bin(i | N, i/N); i → j: Bin(j | N, i/N); j → i: Bin(i | N, j/N); j → j: Bin(j | N, j/N). Figure: allele counts 3, 2, 4, 3, 4, 5, 5, annotated with the probabilities 0.23, 0.20, 0.33, 0.08, 1, 0.26
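The binomial transitions on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the helper names `binom_pmf` and `wright_fisher_matrix` are made up here, and the population size N = 5 is an assumption (the slide does not state N). Under that assumption the path 3, 2, 4, 3, 4, 5, 5 yields the same six probabilities that annotate the figure, in path order rather than in the order they were extracted.

```python
from math import comb


def binom_pmf(k, n, p):
    """Bin(k | n, p): probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def wright_fisher_matrix(N):
    """Transition matrix with P[i][j] = Bin(j | N, i/N) on states 0..N."""
    return [[binom_pmf(j, N, i / N) for j in range(N + 1)] for i in range(N + 1)]


N = 5  # assumed population size, not given on the slide
P = wright_fisher_matrix(N)

# Probability of each step along the allele-count path shown in the figure.
path = [3, 2, 4, 3, 4, 5, 5]
steps = [round(P[a][b], 2) for a, b in zip(path, path[1:])]
print(steps)
```

States 0 and N come out as absorbing: once the allele is lost or fixed, the count stays there with probability 1, which is why the last step (5 to 5) has probability 1.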
  • 14. Hidden Markov Model › Allele count per generation in a population of size N › Difficult to measure. Figure: individuals by generations (time); allele counts 3, 2, 4, 3, 4, 5, 5
  • 15. Hidden Markov Model › Allele count per generation in a population of size N › Difficult to measure › Directly affects allele count of a sample of size n. Figure: individuals by generations (time); allele counts 3, 2, 4, 3, 4, 5, 5
  • 16. Hidden Markov Model › Allele count per generation in a population of size N › Difficult to measure › Directly affects allele count of a sample of size n. Figure: individuals by generations (time); allele counts 3, 2, 4, 3, 4, 5, 5; sample counts 2, 1, 2, 2, 2, 3, 3
  • 17. Hidden Markov Model › Allele count per generation in a population of size N › Difficult to measure › Directly affects allele count of a sample of size n. Figure: individuals by generations (time); allele counts 3, 2, 4, 3, 4, 5, 5; sample counts 2, 1, 2, 2, 2, 3, 3. Label: Hidden Markov chain
  • 18. Hidden Markov Model › Allele count per generation in a population of size N › Difficult to measure › Directly affects allele count of a sample of size n. Figure: individuals by generations (time); allele counts 3, 2, 4, 3, 4, 5, 5; sample counts 2, 1, 2, 2, 2, 3, 3. Label: Observable data
  • 19. Hidden Markov Model › Transitions (DTMC): binomial › Emissions (data): binomial. Diagram: hidden states i and j with transition probabilities i → i: Bin(i | N, i/N); i → j: Bin(j | N, i/N); j → i: Bin(i | N, j/N); j → j: Bin(j | N, j/N). State i emits sample count k with probability Bin(k | n, i/N) and state j with probability Bin(k | n, j/N), shown for k = 0, 1, 2, 3
  • 20. Hidden Markov Model › Standard algorithms › Forward: likelihood of data › Viterbi: global decoding › Posterior decoding: local decoding. Diagram: hidden states i and j with binomial transitions Bin(j | N, i/N) and binomial emissions Bin(k | n, i/N), shown for k = 0, 1, 2, 3
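As a rough illustration of the forward algorithm named above, the following Python sketch computes the likelihood of a sequence of sample counts under the binomial transitions and emissions from the diagram. Several things here are assumptions rather than content from the talk: the function names, the population size N = 5, the sample size n = 3 (the diagram lists emissions 0 to 3), and the uniform initial distribution over hidden states.

```python
from math import comb


def binom_pmf(k, n, p):
    """Bin(k | n, p): probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def forward_likelihood(obs, N, n, init=None):
    """Likelihood of the observed sample counts, summed over all hidden paths."""
    states = range(N + 1)
    if init is None:
        # Assumption: uniform start over allele counts 0..N.
        init = [1.0 / (N + 1)] * (N + 1)
    trans = [[binom_pmf(j, N, i / N) for j in states] for i in states]
    emit = [[binom_pmf(k, n, i / N) for k in range(n + 1)] for i in states]

    # alpha[j] = P(observations so far, current hidden state = j)
    alpha = [init[i] * emit[i][obs[0]] for i in states]
    for y in obs[1:]:
        alpha = [
            emit[j][y] * sum(alpha[i] * trans[i][j] for i in states)
            for j in states
        ]
    return sum(alpha)


sample_counts = [2, 1, 2, 2, 2, 3, 3]  # sample row from the slides
likelihood = forward_likelihood(sample_counts, N=5, n=3)
print(likelihood)
```

The sketch works with raw probabilities, which is fine for seven observations; for long sequences one would normally rescale `alpha` at each step or work in log space to avoid underflow.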
  • 21. Continuous Time Markov Chain (CTMC) › Allele count in time in a population of size N › States {0, 1, …, N}, 0 < i < N › Rates r_i = i (N - i) / N. Diagram: states i-1, i, i+1; state i moves to i-1 and to i+1 at rate r_i each, with diagonal entry -2 r_i; states i-1 and i+1 move to i at rates r_{i-1} and r_{i+1}, with diagonal entries -2 r_{i-1} and -2 r_{i+1}
  • 22. Continuous Time Markov Chain (CTMC) › Allele count in time in a population of size N › States {0, 1, …, N}, 0 < i < N › Rates r_i = i (N - i) / N. Diagram: states i-1, i, i+1 with rates r_{i-1}, r_i, r_{i+1} and diagonal entries -2 r_{i-1}, -2 r_i, -2 r_{i+1}. Figure: allele count in DTMC (3, 2, 4, 3, 4, 5, 5; probabilities 0.23, 0.20, 0.33, 0.08, 1, 0.26) compared with allele count in CTMC (jumps labelled 1/2, waiting times labelled t)
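The rate structure on these two slides can be written down as a small generator matrix. The Python sketch below is illustrative only: the function names are invented, N = 5 is an assumed population size, and the boundary states 0 and N are treated as absorbing because the formula gives r_0 = r_N = 0.

```python
def rate_matrix(N):
    """Generator Q with Q[i][i-1] = Q[i][i+1] = r_i and Q[i][i] = -2 r_i."""
    Q = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N):  # interior states 0 < i < N
        r = i * (N - i) / N
        Q[i][i - 1] = r
        Q[i][i + 1] = r
        Q[i][i] = -2 * r
    return Q  # rows 0 and N stay zero: absorbing states


def jump_probabilities(Q, i):
    """Embedded jump chain: where the process goes when it leaves state i."""
    exit_rate = -Q[i][i]
    return {j: q / exit_rate for j, q in enumerate(Q[i]) if j != i and q > 0}


N = 5  # assumed population size
Q = rate_matrix(N)
print(jump_probabilities(Q, 3))
```

Because the two off-diagonal rates out of an interior state are equal, each jump goes up or down with probability 1/2, which matches the 1/2 labels in the CTMC figure; the rates r_i only control how long the chain waits before jumping.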
  • 25. Motif discovery for DTMCs using DFAs (Section 2. Brief overview) › DTMC describes sequences: allele count in a population › DFA encodes pattern: (i)+ (i+1)+ (i+2)+. Diagram: DFA with states 0, 1, 2, 3 and edges labelled i, i+1, i+2 and k; DTMC on states i and j with transition probabilities Bin(i | N, i/N), Bin(j | N, i/N), Bin(i | N, j/N), Bin(j | N, j/N); example sequence 3, 2, 4, 3, 4, 5, 5. Reference: M. M. Nielsen, P. Tataru, T. Madsen, A. Hobolth, and J. S. Pedersen. Motif discovery in ranked lists of sequences. In preparation.
  • 26. Motif discovery for DTMCs using DFAs › Does the pattern (i)+ (i+1)+ (i+2)+ occur more frequently in specific environments? Figure labels: Populations, DTMC sequences, generations (time), Environment; example sequence … 2 4 3 4 5 5. Reference: M. M. Nielsen, P. Tataru, T. Madsen, A. Hobolth, and J. S. Pedersen. Motif discovery in ranked lists of sequences. In preparation.
  • 27. Motif discovery for DTMCs using DFAs › Does the pattern (i)+ (i+1)+ (i+2)+ occur more frequently in specific environments? › Contribution › DFA › New approach to significance (random walk). Figure labels: Populations, DTMC sequences, generations (time), Environment; example sequence … 2 4 3 4 5 5. Reference: M. M. Nielsen, P. Tataru, T. Madsen, A. Hobolth, and J. S. Pedersen. Motif discovery in ranked lists of sequences. In preparation.
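To make the pattern (i)+ (i+1)+ (i+2)+ concrete, here is a small hand-written automaton in Python that scans a sequence of allele counts. It is a sketch under stated assumptions, not the DFA construction from the paper: the state numbering 0 to 3 follows the diagram, but the exact edges are reconstructed from the pattern, and counting one occurrence each time state 3 is entered from state 2 is a convention chosen here for illustration. The function name `count_pattern` is hypothetical.

```python
def count_pattern(seq, i):
    """Count occurrences of (i)+ (i+1)+ (i+2)+ in seq using a 4-state DFA.

    State 0: nothing matched; 1: inside the run of i;
    2: inside the run of i+1; 3: inside the run of i+2 (accepting).
    """
    state, count = 0, 0
    for x in seq:
        if x == i:
            state = 1  # a run of i can start from any state
        elif x == i + 1 and state in (1, 2):
            state = 2
        elif x == i + 2 and state in (2, 3):
            if state == 2:
                count += 1  # first i+2 completes a new occurrence
            state = 3
        else:
            state = 0  # any other symbol (k in the diagram) resets
    return count


# Allele-count sequence from the slides, with i = 3: the tail 3, 4, 5, 5 matches.
print(count_pattern([3, 2, 4, 3, 4, 5, 5], i=3))
```

Running the same scan over many population sequences gives, per sequence, the number of occurrences whose frequency could then be compared across environments.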
  • 28. Restricted algorithms for HMMs using DFAs › HMM describes problem: allele count in a population › DFA encodes pattern: (i)+ (i+1)+ (i+2)+. Diagram: HMM with hidden states i and j, binomial transitions Bin(j | N, i/N) and binomial emissions Bin(k | n, i/N) for k = 0, 1, 2, 3; DFA with states 0, 1, 2, 3 and edges labelled i, i+1, i+2 and k; sample counts 2, 1, 2, 2, 2, 3, 3 and allele counts 3, 2, 4, 3, 4, 5, 5. Reference: P. Tataru, A. Sand, A. Hobolth, T. Mailund, and C. N. Pedersen. Algorithms for hidden Markov models restricted to occurrences of regular expressions. Biology, 2(4):1282–1295, 2013.
  • 29. Restricted algorithms for HMMs using DFAs › Contribution: new algorithms › Calculate distribution of #pattern occurrences › Adapt decoding algorithms to include #pattern occurrences. Diagram: HMM with hidden states i and j, binomial transitions and emissions; DFA with states 0, 1, 2, 3 and edges labelled i, i+1, i+2 and k. Reference: P. Tataru, A. Sand, A. Hobolth, T. Mailund, and C. N. Pedersen. Algorithms for hidden Markov models restricted to occurrences of regular expressions. Biology, 2(4):1282–1295, 2013.
  • 30. Restricted algorithms for HMMs using DFAs. Reference: P. Tataru, A. Sand, A. Hobolth, T. Mailund, and C. N. Pedersen. Algorithms for hidden Markov models restricted to occurrences of regular expressions. Biology, 2(4):1282–1295, 2013.
  • 31. Conditional expectations for CTMCs [Figure: path over values 2 to 5 with times t1, t2, t4, t6 and labels 1/2; CTMC on states i-1, i, i+1 with labels r_{i-1}, r_i, r_{i+1}, -2 r_{i-1}, -2 r_i, -2 r_{i+1}] P. Tataru and A. Hobolth. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics, 12(1):465, 2011
  • 32. Conditional expectations for CTMCs › Required expectations › Time › Jumps [Figure: CTMC on states i-1, i, i+1 with labels r_{i-1}, r_i, r_{i+1}, -2 r_{i-1}, -2 r_i, -2 r_{i+1}; values 3 and 5 and time t] P. Tataru and A. Hobolth. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics, 12(1):465, 2011
  • 33. Conditional expectations for CTMCs › Required expectations › Time › Jumps › Contribution: compare and extend existing methods › Accuracy › Speed [Figure: CTMC on states i-1, i, i+1 with labels r_{i-1}, r_i, r_{i+1}, -2 r_{i-1}, -2 r_i, -2 r_{i+1}; values 3 and 5 and time t] P. Tataru and A. Hobolth. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics, 12(1):465, 2011
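The two required expectations (time spent in a state, and number of jumps between two states, both conditional on the start and end states) can be illustrated with one standard technique: a block-matrix exponential that yields the integral of P_ia(s) P_bj(t-s) over [0, t]. The Python sketch below uses only the standard library and a simple scaling-and-squaring Taylor series. The function names, the default series settings, and the example rate matrix are assumptions for illustration, not code or values from the cited paper.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def mat_exp(a, squarings=10, terms=20):
    """Matrix exponential via scaling and squaring with a Taylor series."""
    n = len(a)
    scaled = [[x / 2**squarings for x in row] for row in a]
    result = [[float(r == c) for c in range(n)] for r in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_and_integral(Q, t, a, b, i, j):
    """Return P_ij(t) and the integral of P_ia(s) * P_bj(t - s) over [0, t]."""
    n = len(Q)
    # exp(t * [[Q, E_ab], [0, Q]]) holds exp(Qt) in its top-left block and
    # the integral of exp(Qs) E_ab exp(Q(t - s)) in its top-right block.
    block = [[0.0] * (2 * n) for _ in range(2 * n)]
    for r in range(n):
        for c in range(n):
            block[r][c] = block[n + r][n + c] = Q[r][c] * t
    block[a][n + b] = t
    e = mat_exp(block)
    return e[i][j], e[i][n + j]


def expected_time(Q, t, a, i, j):
    """E[time spent in state a during [0, t] | X(0) = i, X(t) = j]."""
    p, integral = transition_and_integral(Q, t, a, a, i, j)
    return integral / p


def expected_jumps(Q, t, a, b, i, j):
    """E[number of jumps a -> b during [0, t] | X(0) = i, X(t) = j]."""
    p, integral = transition_and_integral(Q, t, a, b, i, j)
    return Q[a][b] * integral / p


if __name__ == "__main__":
    Q = [[-1.0, 1.0], [2.0, -2.0]]  # hypothetical two-state rate matrix
    print(expected_time(Q, 0.5, 0, 0, 1), expected_jumps(Q, 0.5, 0, 1, 0, 1))
```

A useful sanity check is that the expected times over all states add up to t for any pair of end states.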
  • 34. Beta with spikes: approximate Wright-Fisher [Figure: individuals over generations (time), with allele counts 3 2 4 3 4 5 5] P. Tataru, T. Bataillon, and A. Hobolth. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach. In preparation
  • 35. Beta with spikes: approximate Wright-Fisher › What is the distribution of allele count in the current generation? [Figure: individuals over generations (time), with allele counts 3 2 4 3 4 5 5] P. Tataru, T. Bataillon, and A. Hobolth. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach. In preparation
  • 36. Beta with spikes: approximate Wright-Fisher › What is the distribution of allele count in the current generation? › Use the beta distribution › Contribution: › Add spikes (better fit) › Include selection [Figure: individuals over generations (time), with allele counts 3 2 4 3 4 5 5] P. Tataru, T. Bataillon, and A. Hobolth. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach. In preparation
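The idea of using the beta distribution can be sketched with moment matching under pure drift. Starting from frequency x0 among N gene copies, the Wright-Fisher frequency after t generations has mean x0 and variance x0 (1 - x0) F, with F = 1 - (1 - 1/N)^t, and a beta distribution is chosen to share those two moments. The Python sketch below covers only this plain case: it does not include the spikes at 0 and 1, mutation, or selection from the cited work, and the function name is hypothetical.

```python
def drift_beta_parameters(x0, N, t):
    """Beta(alpha, beta) matching the Wright-Fisher mean and variance.

    x0: initial allele frequency, strictly between 0 and 1
    N: population size (number of gene copies)
    t: number of generations, at least 1
    Assumption: pure drift (no mutation, no selection, no spikes).
    """
    if not 0.0 < x0 < 1.0:
        raise ValueError("x0 must lie strictly between 0 and 1")
    if N < 2 or t < 1:
        raise ValueError("need N >= 2 and t >= 1")
    # Fraction of the maximal variance x0 * (1 - x0) reached after t generations.
    F = 1.0 - (1.0 - 1.0 / N) ** t
    # Moment matching gives alpha + beta = (1 - F) / F.
    scale = (1.0 - F) / F
    return x0 * scale, (1.0 - x0) * scale


if __name__ == "__main__":
    alpha, beta = drift_beta_parameters(x0=0.3, N=100, t=10)
    print(alpha, beta)
```

As t grows, F approaches 1 and both parameters shrink towards 0, so the density piles up near the boundaries 0 and 1. This is consistent with the slide's motivation for adding spikes there.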
  • 37. Beta with spikes: approximate Wright-Fisher. P. Tataru, T. Bataillon, and A. Hobolth. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach. In preparation
  • 38. Outline › Fields: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics › Toolbox: SCFG, DFA, DTMC, HMM, CTMC • 2012 BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction (SCFG) • 2013 Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information (SCFG) (outline marker 1) • 2011 BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains (CTMC) • 2013 Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions (HMM & DFA) • In preparation: Motif discovery in ranked lists of sequences (DTMC & DFA) • In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach (DTMC) (outline marker 2) • 2014 Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals (CTMC & HMM) (outline marker 3)
  • 40. Population genetics: the coalescent model › Trace the genealogy of sampled individuals backward in time [Figure: individuals over generations (time)] (Section 3. Overview: diCal-IBD)
  • 42. Population genetics: the coalescent model › Trace the genealogy of sampled individuals backward in time › Coalescent process terminates when reaching MRCA › Time to coalescent event: CTMC [Figure: individuals over generations (time), with the MRCA marked]
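The statement that the time to a coalescent event follows a CTMC can be made concrete under the standard Kingman coalescent, which is assumed here (constant population size, time in coalescent units). While k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2, and the process stops at the MRCA. The function names below are hypothetical; this is a sketch, not the model implemented in diCal-IBD.

```python
import random


def coalescence_rate(k):
    """Rate at which any pair among k lineages coalesces (coalescent units)."""
    return k * (k - 1) / 2


def expected_tmrca(n):
    """Expected time to the MRCA of n samples: sum of mean waiting times."""
    return sum(1 / coalescence_rate(k) for k in range(2, n + 1))


def simulate_tmrca(n, rng=random):
    """Draw one time to the MRCA by summing exponential waiting times."""
    return sum(rng.expovariate(coalescence_rate(k)) for k in range(n, 1, -1))


if __name__ == "__main__":
    random.seed(0)
    draws = [simulate_tmrca(5) for _ in range(10000)]
    print(expected_tmrca(5), sum(draws) / len(draws))
```

The sum telescopes to 2(1 - 1/n), so the expected time to the MRCA never exceeds 2 coalescent units, however many individuals are sampled.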
  • 43. Recombination
  • 44. The coalescent model: adding recombination
  • 50. The coalescent model: adding recombination › Multiple sequences and loci analysis › HMM: hidden states = (possible) coalescent trees for each locus › CTMC: probability of the alleles at the leaves
  • 51. Identity by descent [Figure: IBD tract]
  • 52. diCal-IBD: CTMCs and HMMs › IBD applications › Association mapping › Demography inference › Detection of selection P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014
  • 53. diCal-IBD: CTMCs and HMMs › IBD applications › Association mapping › Demography inference › Detection of selection › First method to use the coalescent with recombination › One of the first methods to use sequence data P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014
  • 54. diCal-IBD: CTMCs and HMMs › IBD applications › Association mapping › Demography inference › Detection of selection › First method to use the coalescent with recombination › One of the first methods to use sequence data › Outperforms SNP-based methods › Comparable with sequence-based method P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014
  • 55. diCal-IBD: CTMCs and HMMs. P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014
  • 56. Selection and IBD › IBD tract length depends on MRCA › The more recent the MRCA, the longer the tract › Recent MRCA can be an indication of positive selection › IBD can be used for detecting positive selection [1] › Standing variation [1] Albrechtsen A et al. Genetics 2010;186:295-308
  • 57. Selection and IBD › IBD segment length depends on MRCA › The more recent the MRCA, the longer the segment › Recent MRCA can be an indication of recent selection › IBD can be used for detecting selection › Standing variation [1] Albrechtsen A et al. Genetics 2010;186:295-308
  • 58. The SLC24A5 gene: diCal-IBD › SLC24A5 › Major influence on natural skin color variation › Under positive selection in Europeans [1] [1] Wilde S et al. PNAS 2014;111(13):4832-4837. P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014
  • 59. Outline › Fields: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics › Toolbox: SCFG, DFA, DTMC, HMM, CTMC • 2012 BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction (SCFG) • 2013 Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information (SCFG) (outline marker 1) • 2011 BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains (CTMC) • 2013 Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions (HMM & DFA) • In preparation: Motif discovery in ranked lists of sequences (DTMC & DFA) • In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach (DTMC) (outline marker 2) • 2014 Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals (CTMC & HMM) (outline marker 3)
  • 60. Population genetics: Methods vs Data
  • 61. Population genetics: Methods vs Data [Figure: methods diCal, ARGWeaver, MSMC, ms, MaCS, PSMC, coalHMM, Kim Tree]
  • 62. Population genetics: Methods vs Data [Figure: methods Beta with spikes, diCal-IBD, diCal, ARGWeaver, MSMC, ms, MaCS, PSMC, coalHMM, Kim Tree]
  • 63. Population genetics: Methods vs Data [Figure: methods Beta with spikes, diCal-IBD, diCal, ARGWeaver, MSMC, ms, MaCS, PSMC, coalHMM, Kim Tree; label: Data]
  • 65. Population genetics: Methods vs Data [Figure: methods Beta with spikes, diCal-IBD, diCal, ARGWeaver, MSMC, ms, MaCS, PSMC, coalHMM, Kim Tree; labels: Data, dimension reduction]