This document outlines Paula Tataru's qualification exam, which focuses on developing statistical theory and algorithms for analyzing molecular data. It summarizes her past work on using stochastic context-free grammars (SCFGs) for RNA secondary structure prediction and calculating expectations for continuous-time Markov chains (CTMCs). It then discusses her current work on hidden Markov models (HMMs), including algorithms for patterns in HMMs and using HMMs to infer population parameters. Future work includes applying these methods to real data and extending models to incorporate additional properties.
1. Development of statistical theory and algorithms for analyzing molecular data
Paula Tataru
Qualification exam
2. Paula Tataru Qualification exam 2
Outline
● Published work
o SCFGs & RNA secondary structure prediction
o Expectations for CTMCs
● Hidden Markov models (HMMs)
● Patterns & HMMs
● HMMs for inferring population parameters
● Future work
3. Paula Tataru Qualification exam 3
Statistical theory
Markov chains
Discrete-time Markov chains
Continuous-time Markov chains
Hidden Markov models
Stochastic context-free grammars
5. Paula Tataru Qualification exam 5
SCFGs & RNA secondary structure
● Use an SCFG to predict RNA secondary structure
● KH99 is the best existing SCFG
● Find an SCFG that is just as good or better
7. Paula Tataru Qualification exam 7
Expectations for CTMCs
● Calculate expectations for CTMCs
o Time spent in a state
o Changes between any two states
● Three approaches
o Eigenvalue decomposition (EVD)
o Uniformization (UNI)
o Matrix exponential (EXPM)
● Which is most accurate?
● Which is fastest?
[Figure: nucleotide states A, A, G]
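As an illustration of the uniformization (UNI) approach, the sketch below computes the integrals from which the expected time spent in a state is obtained. It is a minimal example and not the implementation compared in the published work: the function names, the fixed truncation at `terms` series terms, and the example rate matrix `Q` are all assumptions made here.

```python
import math


def mat_mul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def time_integrals(Q, T, c, terms=80):
    """Return I with I[a][b] = integral over [0, T] of P_ac(t) * P_cb(T - t) dt.

    Dividing I[a][b] by the transition probability P_ab(T) gives the expected
    time spent in state c, conditional on starting in a and ending in b.
    """
    n = len(Q)
    mu = max(-Q[i][i] for i in range(n))            # uniformization rate
    R = [[(i == j) + Q[i][j] / mu for j in range(n)] for i in range(n)]
    powers = [[[float(i == j) for j in range(n)] for i in range(n)]]  # R^0
    for _ in range(terms):
        powers.append(mat_mul(powers[-1], R))
    I = [[0.0] * n for _ in range(n)]
    pois = math.exp(-mu * T)                        # Poisson(0; mu*T)
    for m in range(terms + 1):
        pois *= mu * T / (m + 1)                    # now Poisson(m + 1; mu*T)
        for a in range(n):
            for b in range(n):
                s = sum(powers[l][a][c] * powers[m - l][c][b]
                        for l in range(m + 1))
                I[a][b] += pois * s / mu
    return I


# Hypothetical two-state rate matrix and time span, for illustration only.
Q = [[-1.0, 1.0], [2.0, -2.0]]
T = 1.5
I_state0 = time_integrals(Q, T, c=0)
```

A quick sanity check: summing `I[a][b]` over all end states `b` and all states `c` must give back `T`, because the chain is always in exactly one state. A fixed truncation is used for brevity; a careful implementation would choose the number of terms from the Poisson tail.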
11. Paula Tataru Qualification exam 11
Hidden Markov models
● Observables
● Hidden states
● Initial state probabilities
● Transition probabilities
● Emission probabilities
12. Paula Tataru Qualification exam 12
Algorithms
● Given a sequence of observations
● Forward algorithm
o What is the likelihood of the observations?
● Viterbi algorithm
o What is the most likely hidden explanation?
... A T G G C C T A A T C G T ...
... C1 C2 C3 C1 C2 C3 C1 C2 C3 N N N N ...
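The two questions above can be sketched in a few lines of Python. This is a minimal textbook version working with raw probabilities; practical implementations use scaling or log-space to avoid underflow on long sequences. The two-state model and its parameters are hypothetical and only serve as an example.

```python
def forward_likelihood(pi, A, B, obs):
    """Likelihood of the observations, summed over all hidden paths."""
    n = len(pi)
    alpha = [pi[x] * B[x][obs[0]] for x in range(n)]
    for y in obs[1:]:
        alpha = [sum(alpha[u] * A[u][x] for u in range(n)) * B[x][y]
                 for x in range(n)]
    return sum(alpha)


def viterbi(pi, A, B, obs):
    """Most likely hidden path and its joint probability with the observations."""
    n = len(pi)
    delta = [pi[x] * B[x][obs[0]] for x in range(n)]
    back = []                                   # back[t][x] = best predecessor of x
    for y in obs[1:]:
        prev = [max(range(n), key=lambda u: delta[u] * A[u][x])
                for x in range(n)]
        delta = [delta[prev[x]] * A[prev[x]][x] * B[x][y] for x in range(n)]
        back.append(prev)
    x = max(range(n), key=lambda u: delta[u])
    best = delta[x]
    path = [x]
    for prev in reversed(back):                 # follow the back-pointers
        x = prev[x]
        path.append(x)
    return path[::-1], best


# Hypothetical model: state 0 = non-coding, state 1 = coding;
# observed symbols 0..3 stand for A, C, G, T.
pi = [0.9, 0.1]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.25, 0.25, 0.25, 0.25], [0.1, 0.4, 0.4, 0.1]]
obs = [0, 3, 2, 2, 1, 1, 3, 0]
likelihood = forward_likelihood(pi, A, B, obs)
path, joint = viterbi(pi, A, B, obs)
```

The forward algorithm sums over all hidden paths while Viterbi keeps only the best one, so `joint` can never exceed `likelihood`.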
13. Paula Tataru Qualification exam 13
Applications of HMMs
● Gene annotation (GeneMark, GENSCAN)
● Protein structure modeling (Phobius, SignalP)
● Sequence alignment (HMMER, SAM)
● Phylogenetic analysis (PhyloHMM, CoalHMM)
15. Paula Tataru Qualification exam 15
Patterns & HMMs
● Patterns in the hidden explanation are of interest
● How many times does a pattern occur?
o restricted forward algorithm
● What is the most likely hidden explanation containing the pattern a given number of times?
o restricted Viterbi algorithm
16. Paula Tataru Qualification exam 16
Patterns & HMMs
● Keep track of the occurrences of the pattern r
● Use a deterministic finite automaton (DFA)
● Each state q encodes how much of the pattern r has already been seen
● Run the HMM and the DFA in parallel
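The idea of a DFA state that records how much of the pattern has been seen can be sketched as follows. This is a simplified illustration for a pattern given as a single fixed string of hidden states; patterns given as regular expressions need a general regex-to-DFA construction, which is not shown. The function names and the example path are assumptions made here.

```python
def build_dfa(pattern, alphabet):
    """Build a DFA for counting occurrences of `pattern`.

    State q is the length of the longest prefix of `pattern` that is a
    suffix of the symbols read so far; state len(pattern) is accepting.
    Returns delta, where delta[q][s] is the state reached from q on symbol s.
    """
    pattern = tuple(pattern)
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for s in alphabet:
            seen = pattern[:q] + (s,)
            k = min(m, len(seen))
            # Shrink k until the length-k prefix matches the length-k suffix.
            while k > 0 and pattern[:k] != seen[len(seen) - k:]:
                k -= 1
            row[s] = k
        delta.append(row)
    return delta


def count_occurrences(delta, accepting, path):
    """Run the DFA over a hidden path and count visits to the accepting state."""
    q, count = 0, 0
    for s in path:
        q = delta[q][s]
        if q == accepting:
            count += 1
    return count


# Hypothetical example: start of a gene, seen as two non-coding states then C1.
states = ("N", "C1", "C2", "C3")
pattern = ("N", "N", "C1")
delta = build_dfa(pattern, states)
hidden = ["N", "N", "C1", "C2", "C3", "N", "N", "N", "C1"]
n_genes = count_occurrences(delta, len(pattern), hidden)
```

Because the automaton falls back to the longest matching prefix rather than to the start state, overlapping occurrences are counted as well.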
17. Paula Tataru Qualification exam 17
Restricted forward
● Forward algorithm
o [equation lost in extraction]
o the likelihood of the observations
18. Paula Tataru Qualification exam 18
Restricted forward
● Forward algorithm
o [equation lost in extraction]
o the likelihood of the observations
● Restricted forward algorithm
o [equation lost in extraction]
o the distribution for the number of pattern occurrences
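One plausible way to realize the restricted forward algorithm is to extend each forward entry with the occurrence count k and the DFA state q. The sketch below follows that idea under stated assumptions: hidden states are integers 0..n-1, the DFA is given as a table `delta[q][x]` over hidden states, and an occurrence is counted whenever the DFA enters a state in `accepting`. The names and the example model are illustrative and not taken from the slides.

```python
from collections import defaultdict


def restricted_forward(pi, A, B, obs, delta, accepting, q0=0):
    """Return {k: P(observations, pattern occurs k times in the hidden path)}."""
    n = len(pi)
    table = defaultdict(float)     # key: (hidden state x, count k, DFA state q)
    for x in range(n):
        q = delta[q0][x]
        table[(x, int(q in accepting), q)] += pi[x] * B[x][obs[0]]
    for y in obs[1:]:
        new = defaultdict(float)
        for (x, k, q), p in table.items():
            if p == 0.0:
                continue
            for x2 in range(n):
                q2 = delta[q][x2]                   # DFA moves with the HMM
                k2 = k + int(q2 in accepting)       # count a completed pattern
                new[(x2, k2, q2)] += p * A[x][x2] * B[x2][y]
        table = new
    dist = defaultdict(float)
    for (x, k, q), p in table.items():
        dist[k] += p
    return dict(dist)


# Hypothetical two-state model; the pattern is "state 0 followed by state 1".
pi = [0.9, 0.1]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.25, 0.25, 0.25, 0.25], [0.1, 0.4, 0.4, 0.1]]
obs = [0, 3, 2, 2, 1, 1, 3, 0]
delta = [[1, 0], [1, 2], [1, 0]]   # DFA states: 0 = nothing, 1 = seen 0, 2 = seen 0,1
joint = restricted_forward(pi, A, B, obs, delta, accepting={2})
likelihood = sum(joint.values())
posterior = {k: p / likelihood for k, p in joint.items()}
```

Summing the returned values over k recovers the ordinary forward likelihood, and normalizing by it gives the distribution of the number of occurrences given the observations.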
19. Paula Tataru Qualification exam 19
● Simple gene finder
● Pattern r = (NN(C1|R3)) | ((C3|R1)NN)
● Generated observed and hidden sequence of length 500
21. Paula Tataru Qualification exam 21
Restricted Viterbi
● Viterbi algorithm
o [equation lost in extraction]
o the most likely hidden explanation
22. Paula Tataru Qualification exam 22
Restricted Viterbi
● Viterbi algorithm
o [equation lost in extraction]
o the most likely hidden explanation
● Restricted Viterbi algorithm
o [equation lost in extraction]
o the most likely hidden explanation containing the pattern a certain number of times
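A restricted Viterbi algorithm can be sketched by replacing the sums of a count-aware forward recursion with maximizations and keeping back-pointers. The version below is one possible formulation, not necessarily the one used in this work. It assumes integer hidden states, a DFA table `delta[q][x]` over hidden states, and a hypothetical parameter `target` giving the required number of occurrences.

```python
def restricted_viterbi(pi, A, B, obs, delta, accepting, target, q0=0):
    """Most likely hidden path in which the pattern occurs exactly `target` times.

    Returns (path, joint probability), or (None, 0.0) if no such path has
    positive probability.
    """
    n = len(pi)
    # Each layer maps (x, k, q) -> (best probability, predecessor key).
    first = {}
    for x in range(n):
        q = delta[q0][x]
        k = int(q in accepting)
        p = pi[x] * B[x][obs[0]]
        if p > 0.0 and k <= target:
            first[(x, k, q)] = (p, None)
    layers = [first]
    for y in obs[1:]:
        new = {}
        for key, (p, _) in layers[-1].items():
            x, k, q = key
            for x2 in range(n):
                q2 = delta[q][x2]
                k2 = k + int(q2 in accepting)
                p2 = p * A[x][x2] * B[x2][y]
                if k2 > target or p2 <= 0.0:
                    continue                        # prune impossible entries
                if p2 > new.get((x2, k2, q2), (0.0, None))[0]:
                    new[(x2, k2, q2)] = (p2, key)
        layers.append(new)
    finals = [(p, key) for key, (p, _) in layers[-1].items() if key[1] == target]
    if not finals:
        return None, 0.0
    best, key = max(finals, key=lambda item: item[0])
    path = []
    for layer in reversed(layers):                  # follow the back-pointers
        path.append(key[0])
        key = layer[key][1]
    return path[::-1], best


# Hypothetical two-state model; the pattern is "state 0 followed by state 1".
pi = [0.9, 0.1]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.25, 0.25, 0.25, 0.25], [0.1, 0.4, 0.4, 0.1]]
obs = [0, 3, 2, 2, 1, 1, 3, 0]
delta = [[1, 0], [1, 2], [1, 0]]
path, joint = restricted_viterbi(pi, A, B, obs, delta, accepting={2}, target=1)
```

Storing every layer makes the table grow with the sequence length, the number of hidden states, the number of DFA states and the target count, which is where memory consumption becomes a concern.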
23. Paula Tataru Qualification exam 23
Results
● Simple gene finder
● Pattern r = (NN(C1|R3)) | ((C3|R1)NN)
● Generated 100 pairs of observed and hidden sequences of length 500, 525, …, 1500
26. Paula Tataru Qualification exam 26
Future work
● Apply to real model and data (GeneTack)
● Incorporate waiting time distribution
o obtained from the restricted forward algorithm
o include in the restricted Viterbi algorithm
● Improve memory consumption
30. Paula Tataru Qualification exam 30
PSMC
● Relies on ancestral recombinations
● Coalescent with recombination theory
● Sequential Markov chain (SMC)
[Figure: coalescence trees at successive sequence positions]
31. Paula Tataru Qualification exam 31
CoalHMM
● Observables: nucleotides (DNA sequences)
● Hidden states: coalescence trees
● PSMC
o two sequences from the same population (or individual)
o coalescence trees are uniquely determined by the TMRCA
… G T C T G A C …
… G A C T G C C …
[Figure: a coalescence tree above each alignment column]
32. Paula Tataru Qualification exam 32
Inferring population parameters
● Calculate likelihood of the data (forward algorithm)
● Find parameters that give the best likelihood
o population size back in time
o recombination rate
… G T C T G A C …
… G A C T G C C …
[Figure: a coalescence tree above each alignment column]
33. Paula Tataru Qualification exam 33
CoalHMM
● Time discretized into K+1 intervals
o [0, t_1), [t_1, t_2), …, [t_K, ∞)
● Apply to more than two sequences (same population)
o constrain coalescence events
[Figure: two trees over sequences 1 to 5]
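The slides do not say how the breakpoints t_1, …, t_K are chosen. One common choice, shown here purely as an assumption, places them at quantiles of a rate-1 exponential distribution (the coalescence time of two sequences in a constant-size population, in coalescent units), so that the K+1 intervals are equally probable. The function names are hypothetical.

```python
import math


def exponential_quantile_breakpoints(K):
    """Breakpoints t_1 < ... < t_K splitting [0, inf) into K+1 intervals that
    are equally probable under a rate-1 exponential coalescence time."""
    return [-math.log(1.0 - i / (K + 1)) for i in range(1, K + 1)]


def interval_index(t, breakpoints):
    """Index i of the interval [t_i, t_{i+1}) containing t, with t_0 = 0
    and the last interval extending to infinity."""
    return sum(1 for b in breakpoints if t >= b)


breakpoints = exponential_quantile_breakpoints(4)   # K = 4, so 5 intervals
state = interval_index(0.7, breakpoints)            # hidden state for TMRCA 0.7
```

With this choice, each hidden state of the discretized model covers the same prior probability mass, and the last interval [t_K, ∞) absorbs all very old coalescence times.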
37. Paula Tataru Qualification exam 37
Future work
● Finalize implementing and testing the CoalHMM
● Apply to real data
o low-coverage NGS data
● Investigate time discretization
● Extend to more than one population
o isolation model
o isolation-with-migration model
39. Paula Tataru Qualification exam 39
Future work – concrete plans
● Proceed with patterns & HMMs
o real model and data (GeneTack)
o waiting time distribution / space consumption
● Finalize & extend CoalHMM
o apply to data
o extend to multiple populations
● Stay abroad (January – June 2013)
o UC Berkeley, Prof. Yun S. Song
o similar work on the sequential Markov chain
43. Development of statistical theory and algorithms for analyzing molecular data
Paula Tataru
Qualification exam
Previous 2½ years
Coming 2 years
Title of thesis
47. Paula Tataru Qualification exam 5
Automated search technique
Two selected grammars
Area under the graph
49. Paula Tataru Qualification exam 7
CTMCs → describe DNA evolution
State → nucleotides
Inference → certain expectations
53. Paula Tataru Qualification exam 11
Sequential data with underlying hidden structure
Model for finding genes
Relate text to figure
57. Paula Tataru Qualification exam 15
Andreas Sand
Pattern → gene
Number of genes
Use Viterbi → very bad, no measure of certainty
Find distribution
Incorporate in prediction
58. Paula Tataru Qualification exam 16
DFA → graph
Nodes → states
59. Paula Tataru Qualification exam 17
2D matrix
Column → Observed up to time t
Hidden state xt
60. Paula Tataru Qualification exam 18
4D matrix
Column → Observed up to time t
Hidden state xt
K patterns
State q
63. Paula Tataru Qualification exam 21
2D matrix
Observed up to time t
Hidden state xt
64. Paula Tataru Qualification exam 22
4D matrix
Observed up to time t
Hidden state xt
K patterns
State q
72. Paula Tataru Qualification exam 30
Moves along the sequence
Each position → one tree
Shift in tree → recombination
74. Paula Tataru Qualification exam 32
Parameters determine event probabilities
75. Paula Tataru Qualification exam 33
Infinite time
77. Paula Tataru Qualification exam 35
CoalHMM
● Notation,
● Probability
Sum over all possible paths
Gives the probability distribution of moving from a given tree to all other trees