Machine Learning for Data Mining
Hidden Markov Models
Andres Mendez-Vazquez
July 22, 2015
1 / 99
Outline
1 Introduction
A little about Markov Random Processes
Setup for Our Problem
2 First Problem
What do we need to calculate?
How to solve it? Forward Procedure
3 Second Problem
Dynamic Programming
Viterbi Algorithm
Final Viterbi Algorithm
4 Third Problem
The most difficult
Expectation Maximization of the Third Problem
The Baum-Welch Algorithm
2 / 99
Markov Random Processes
Markov process (First-Order Markov Model)
A stochastic process is called a Markov process when it has the Markov
property:
P (Xtn |Xtn−1 = xn−1, ..., Xt1 = x1) = P (Xtn |Xtn−1 = xn−1) (1)
Simply
The future path of a Markov process, given its current state and the past
history before, depends only on the current state (not on how this state
has been reached).
Thus (Quite an Oversimplification!!!)
A Markov process is characterized by the (one-step) transition probabilities:
pij = P (Xn = j|Xn−1 = i) (2)
4 / 99
Graphical Interpretation
A sequence of events, each depending only on the previous one
5 / 99
Markov Chains
Definition (Oversimplified)
A Markov chain is thus a process Xn indexed by integers n = 0, 1, ... such
that the states of the system are also indexed by integers, Xn ∈ {0, 1, ...}
From here
The probability of a path i1, i2, ..., in is
P (X1 = i1, X2 = i2, ..., Xn = in) = P (X1 = i1) pi1i2 pi2i3 · · · pin−1in (3)
6 / 99
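Equation (3) can be sketched directly in code. A minimal sketch: the 2-state transition matrix and initial distribution below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Hypothetical 2-state chain (illustrative numbers only)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])      # P[i, j] = p_ij = P(X_n = j | X_{n-1} = i)
initial = np.array([0.5, 0.5])  # P(X_1 = i)

def path_probability(path, P, initial):
    """Probability of a state path i1, i2, ..., in as in Equation (3):
    P(X1 = i1) * p_{i1 i2} * p_{i2 i3} * ... * p_{i_{n-1} i_n}."""
    prob = initial[path[0]]
    for prev, curr in zip(path, path[1:]):
        prob *= P[prev, curr]
    return prob

print(path_probability([0, 1, 1, 0], P, initial))  # 0.5 * 0.3 * 0.6 * 0.4 = 0.036
```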
The transition probability matrix of a Markov chain
Thus
The transition probabilities can be arranged as the transition probability matrix P = (pi,j):

Final State −→
Initial State ↓

      | p1,1  p1,2  p1,3  · · · |
P =   | p2,1  p2,2  p2,3  · · · |
      |  ...   ...   ...        |

Properties
Row i contains the transition probabilities from state i to the other states.
Since the system always goes to some state, the sum of each row's
probabilities is 1.
Then
A matrix with non-negative elements such that the sum of each row
equals 1 is called a (row) stochastic matrix.
7 / 99
Example
We have the following state machine with states 0, 1 and 2.
We have the following stochastic matrix

      | 1 − p       1 − p       0     |
M =   | p (1 − p)   p (1 − p)   1 − p |   (4)
      | p^2         p^2         p     |

with p = 1/3.
8 / 99
Using this
An initial distribution is given as a row vector π.
A stationary probability vector π is defined as a distribution, written as a
row vector, that does not change under application of the transition
matrix:
π = πP (5)
Based on the Perron–Frobenius Theorem (see Wolfram MathWorld)
Theorem: Let M be a Markov chain whose stochastic matrix has no
zero entries. Then, M has a unique stationary distribution, and it
converges to this distribution when started from any initial distribution.
9 / 99
How to find the stationary distribution
We can find this stationary distribution using a slow process called
the Power Method
There are others:
Householder transformations
Givens rotations
Arnoldi iteration
10 / 99
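The Power Method mentioned above can be sketched in a few lines: repeatedly apply the transition matrix to a row vector until it stops changing, so that π = πP holds. A minimal sketch; the 2-state matrix is an illustrative assumption, not from the slides.

```python
import numpy as np

def stationary_distribution(P, tol=1e-10, max_iter=10_000):
    """Power method: iterate pi <- pi P (row vector) until convergence."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)       # start from the uniform distribution
    for _ in range(max_iter):
        nxt = pi @ P               # one step of the chain
        if np.abs(nxt - pi).max() < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical row-stochastic example chain (illustrative numbers)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_distribution(P)
print(pi)   # satisfies pi = pi P
```

Because the matrix has no zero entries, the Perron–Frobenius conditions above guarantee convergence from any starting distribution.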
How to find the stationary distribution
We can find this stationary distribution using a slow process called
the Power Method
There are others:
Householder transformations
Givens rotations
Arnoldi iteration
10 / 99
How to find the stationary distribution
We can find this stationary distribution using a slow process called
the Power Method
There are others:
Householder transformations
Givens rotations
Arnoldi iteration
10 / 99
How to find the stationary distribution
We can find this stationary distribution using a slow process called
the Power Method
There are others:
Householder transformations
Givens rotations
Arnoldi iteration
10 / 99
How to find the stationary distribution
We can find this stationary distribution using a slow process called
the Power Method
There are others:
Householder transformations
Givens rotations
Arnoldi iteration
10 / 99
Outline
1 Introduction
A little about Markov Random Processes
Setup for Our Problem
2 First Problem
What do we need to calculate?
How to solve it? Forward Procedure
3 Second Problem
Dynamic Programming
Viterbi Algorithm
Final Viterbi Algorithm
4 Third Problem
The most difficult
Expectation Maximization of the Third Problem
The Baum-Welch Algorithm
11 / 99
Class Sequence and Observations
We have
A sequence of T observations (feature vectors), X = x1, x2, ..., xT .
In addition
A collection of N classes or hidden states, ω1, ω2, ..., ωN
Thus, we have an element like this one
Ω = ω1, ω2, ..., ωT (6)
Note: The total number of class sequences is N^T for sequences of
length T.
12 / 99
Transition Model
Definition of the transition model λ = (A, B, π) for Hidden Markov
Model (HMM)
Where:
1 N is the number of hidden states.
2 M is the number of different observation symbols.
3 A = {aij} is the state transition probability, aij = P (qt+1 = ωj|qt = ωi)
with 1 ≤ i, j ≤ N (in our case aij > 0).
4 B = {bj (xk)} is the observation symbol probability at state j i.e.
bj (xk) = P (xk|qt = ωj) such that 1 ≤ j ≤ N and 1 ≤ k ≤ M.
5 The initial state distribution π = {πi} where πi = P (q1 = ωi) where
1 ≤ i ≤ N.
13 / 99
Example
We have that
The hidden states form a probabilistic state machine with emission
probabilities for the symbols.
[Figure: hidden states and their observable states]
14 / 99
Using a HMM
It can be used as a generator of sequences X
1 Choose an initial state q1 = ωi according to the initial state
distribution π.
2 Set t = 1.
3 Choose an observation Xt = xk according to the symbol probability in
state ωi, bi (k) .
4 Transit to a new state qt+1 = ωj according to the state transition
probability distribution for state ωi, aij.
5 Set t = t + 1, return to step 3 if t < T.
15 / 99
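Steps 1–5 above can be sketched as a small sampler. A minimal sketch, assuming a hypothetical toy model λ = (A, B, π) with 2 hidden states and 3 symbols; all numbers are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_sequence(A, B, pi, T):
    """Use an HMM lambda = (A, B, pi) as a generator of a sequence X,
    following steps 1-5 above."""
    state = rng.choice(len(pi), p=pi)            # step 1: initial state from pi
    states, observations = [], []
    for _ in range(T):                           # steps 2 and 5: loop over t
        observations.append(rng.choice(B.shape[1], p=B[state]))  # step 3: emit
        states.append(state)
        state = rng.choice(len(pi), p=A[state])  # step 4: transition via a_ij
    return states, observations

# Hypothetical toy model (illustrative numbers)
A = np.array([[0.8, 0.2], [0.3, 0.7]])          # state transitions
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])  # emission probabilities
pi = np.array([0.5, 0.5])                       # initial distribution
states, obs = generate_sequence(A, B, pi, T=10)
```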
What we want!!!
Problem 1
Given the sequence X and a certain model λ, how do we efficiently compute
P (X|λ)?
Problem 2
Given the observation sequence X and a model λ, how do we choose a
corresponding class sequence Ω that best explains the observations?
Problem 3
How do we adjust the parameters of λ to maximize P (X|λ)?
16 / 99
First Problem
Task
Our task is to decide to which class sequence Ω = ω1, ω2, ..., ωT the
sequence X = x1, x2, ..., xT belongs.
For this, first consider a fixed sequence of states
Ω = ω1, ω2, ..., ωT
17 / 99
First, some assumptions
First
Given the sequence of classes, the observations are statistically
independent.
Second
The probability density function in one class does not depend on the other
classes.
Thus, the probability of the observation sequence X

P (X|Ω, λ) = P (x1, x2, ..., xT |ω1, ω2, ..., ωT , λ)
           = ∏_{k=1}^{T} P (xk|ω1, ω2, ..., ωT , λ)
           = ∏_{k=1}^{T} P (xk|ωk, λ)
18 / 99
Then
We need the probability of such a state sequence Ω
P (Ω|λ) (7)
We can use the Markov Condition
P (Ω|λ) = P (ω1, ω2, ..., ωT |λ)
        = P (ωT |ω1, ..., ωT−1, λ) × P (ωT−1|ω1, ..., ωT−2, λ) × · · · × P (ω1|λ)

Thus

P (Ω|λ) = P (ω1|λ) ∏_{k=2}^{T} P (ωk|ωk−1, λ) (8)
19 / 99
Using our notation
Then
P (Ω|λ) = πq1 aq1,q2 · · · aqT−1,qT (9)
Now
P (X|Ω, λ) = bq1 (x1) bq2 (x2) · · · bqT (xT ) (10)
20 / 99
What do we want?
Then, we want the probability of X and Ω occurring simultaneously

P (X, Ω|λ) = P (X|Ω, λ) P (Ω|λ) (11)

[Figure: hidden state machine]
21 / 99
Outline
1 Introduction
A little about Markov Random Processes
Setup for Our Problem
2 First Problem
What do we need to calculate?
How to solve it? Forward Procedure
3 Second Problem
Dynamic Programming
Viterbi Algorithm
Final Viterbi Algorithm
4 Third Problem
The most difficult
Expectation Maximization of the Third Problem
The Baum-Welch Algorithm
22 / 99
Thus, we need to calculate
What?
P (X|λ) = ∑_Ω P (X, Ω|λ) (12)

Using the product rule of probability

P (X|λ) = ∑_Ω P (X|Ω, λ) P (Ω|λ) (13)
23 / 99
Putting All Together
Thus, we have then
P (X|λ) = ∑_Ω P (X|Ω, λ) P (Ω|λ)
        = ∑_{q1,q2,...,qT} πq1 bq1 (x1) [aq1,q2 bq2 (x2)] · · · [aqT−1,qT bqT (xT )]
24 / 99
Complexity for brute force approach
Multiplications
For each state sequence we need about 2T multiplications (rearranging):

[πq1 aq1,q2 · · · aqT−1,qT ] [bq1 (x1) · · · bqT (xT )] (14)

with T factors in each bracket.
Sums
We need N^T additions because of the sum over q1, q2, ..., qT .
Total number of operations

O (T · N^T) (15)
25 / 99
Outline
1 Introduction
A little about Markov Random Processes
Setup for Our Problem
2 First Problem
What do we need to calculate?
How to solve it? Forward Procedure
3 Second Problem
Dynamic Programming
Viterbi Algorithm
Final Viterbi Algorithm
4 Third Problem
The most difficult
Expectation Maximization of the Third Problem
The Baum-Welch Algorithm
26 / 99
This will become part of a Dynamic Programming Solution
Together with a Backward procedure
The forward and backward procedures will help to solve the second problem
27 / 99
Consider the following
Define
αt (i) = P (x1, x2, ..., xt, qt = ωi|λ) (16)

The probability of the partial observation sequence up to time t, ending in state ωi.
We can use an induction
1 Initialization: α1 (i) = πi bi (x1), for 1 ≤ i ≤ N
2 Induction: αt+1 (j) = [∑_{i=1}^{N} αt (i) aij] bj (xt+1), for 1 ≤ t ≤ T − 1
and 1 ≤ j ≤ N.
3 Termination: P (X|λ) = ∑_{i=1}^{N} αT (i)
28 / 99
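The three steps above can be sketched directly. A minimal sketch of the forward procedure; the toy model numbers and the observation sequence are illustrative assumptions, not from the slides.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: alpha[t, i] = P(x_1, ..., x_t, q_t = w_i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # initialization
    for t in range(1, T):                            # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                    # termination: P(X|lambda)

# Hypothetical toy model (illustrative numbers)
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.4], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
alpha, likelihood = forward(A, B, pi, [0, 1, 0])
```

Each row of the table costs O(N) per entry, giving the O(N^2 T) total discussed later, against O(T N^T) for brute-force enumeration.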
How do we do this?
Step 1
You need to prove it is correct for α1 (i) with 1 ≤ i ≤ N
Assume it is true for t
Then, you need to prove it for αt+1 (j) with 1 ≤ t ≤ T − 1 and 1 ≤ j ≤ N.
Then
At termination you have a way to calculate the desired probability P (X|λ).
29 / 99
Step 1
Initialize the forward probabilities as

α1 (i) = P (x1, q1 = ωi|λ) (17)

Here is the idea of moving from one state at time t to another state
at time t + 1
30 / 99
What?
Basically
Our problem is that we do not know from which state we started in order to arrive at state j.
Now, we start calculating αt+1 (j)

αt+1 (j) = P (x1, x2, ..., xt, xt+1, qt+1 = ωj|λ)
         = P (xt+1|x1, x2, ..., xt, qt+1 = ωj, λ) × P (x1, x2, ..., xt, qt+1 = ωj|λ)
31 / 99
We need to calculate
First
P (xt+1|x1, x2, ..., xt, qt+1 = ωj, λ) (18)
Second
P (x1, x2, ..., xt, qt+1 = ωj|λ) (19)
32 / 99
Thus, we start with P (x1, x2, ..., xt, qt+1 = ωj|λ)
We have that

P (x1, x2, ..., xt, qt+1 = ωj|λ) = ∑_{i=1}^{N} P (x1, x2, ..., xt, qt = ωi, qt+1 = ωj|λ)

This step follows by marginalizing over all possible states at time t.
33 / 99
Now, P (x1, x2, ..., xt, qt = ωi, qt+1 = ωj|λ)
Thus

P (x1, x2, ..., xt, qt = ωi, qt+1 = ωj|λ)
  = P (qt+1 = ωj|x1, x2, ..., xt, qt = ωi, λ) × P (x1, x2, ..., xt, qt = ωi|λ)
  = P (x1, x2, ..., xt, qt = ωi|λ) × P (qt+1 = ωj|qt = ωi, λ)
  = αt (i) aij
34 / 99
Thus
We have that

P (x1, x2, ..., xt, qt+1 = ωj|λ) = ∑_{i=1}^{N} αt (i) aij (20)

What about the term

P (xt+1|x1, x2, ..., xt, qt+1 = ωj, λ) = P (xt+1|qt+1 = ωj, λ) = bj (xt+1) (21)

Putting all together

αt+1 (j) = [∑_{i=1}^{N} αt (i) aij] bj (xt+1) (22)
35 / 99
Finally
This is true for
t = 1, 2, ..., T − 1 and 1 ≤ j ≤ N.
Now, applying the marginal idea to P (x1, x2, ..., xT |λ)

P (x1, x2, ..., xT |λ) = ∑_{i=1}^{N} P (x1, x2, ..., xT , qT = ωi|λ) = ∑_{i=1}^{N} αT (i) (23)
36 / 99
Complexity
Given that we are required to compute the αt (j)
For 1 ≤ t ≤ T and 1 ≤ j ≤ N.
We can calculate this using a table (Dynamic Programming idea):

t \ i |    1        2      · · ·     N
  1   | α1 (1)   α1 (2)   · · ·   α1 (N)
  2   | α2 (1)   α2 (2)   · · ·   α2 (N)    (each entry costs N operations)
 ...  |   ...      ...      ...     ...
  T   | αT (1)   αT (2)   · · ·   αT (N)
37 / 99
Complexity
We have
1 Inner loop: O (N)
2 Outer loop: O (NT)
Then, the total is O (N^2 T)
38 / 99
All this come from the Lattice Structure
A lattice of length T over time and depth N over the structure of
the probabilistic state machine.
[Figure: trellis over time steps 1, 2, 3, ..., T]
39 / 99
Notice the following
The forward method
It is enough to solve the first problem.
However for the second problem
We need to define the backward variable,
βt (i) = P (xt+1, xt+2, ..., xT |qt = ωi, λ).
40 / 99
This can be solved in a similar way
Initialization

βT (i) = 1, 1 ≤ i ≤ N (24)

Induction

βt (i) = ∑_{j=1}^{N} aij bj (xt+1) βt+1 (j) (25)

For t = T − 1, T − 2, ..., 1 with 1 ≤ i ≤ N.
41 / 99
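Equations (24) and (25) can be sketched the same way as the forward procedure. A minimal sketch; the toy model and observation sequence are illustrative assumptions, not from the slides.

```python
import numpy as np

def backward(A, B, obs):
    """Backward procedure: beta[t, i] = P(x_{t+1}, ..., x_T | q_t = w_i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                   # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                   # induction, t = T-1, ..., 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# Hypothetical toy model (illustrative numbers)
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.4], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 0]
beta = backward(A, B, obs)

# Sanity check: combining beta_1 with the first observation recovers P(X|lambda)
likelihood = (pi * B[:, obs[0]] * beta[0]).sum()
```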
Example of the backward process
Example
42 / 99
How?
That is for the next midterm
Please try it for yourselves!!!
43 / 99
Here, we have several ways of solving the problem
Problem
Given the observation sequence X and a model λ, how do we choose a
corresponding class sequence Ω?
For this, we define
γt (i) = P (qt = ωi|X, λ) (26)
Meaning
The probability of being in state ωi at time t given the observation
sequence X and the model λ.
44 / 99
Thus
We get

γt (i) = P (x1, ..., xt, ..., xT , qt = ωi|λ) / P (x1, ..., xt, ..., xT |λ)
       = P (x1, ..., xt, qt = ωi, xt+1, ..., xT |λ) / P (x1, ..., xT |λ)
       = P (x1, ..., xT |qt = ωi, λ) P (qt = ωi|λ) / P (x1, ..., xT |λ)
       = P (x1, ..., xt|qt = ωi, λ) P (xt+1, ..., xT |qt = ωi, λ) P (qt = ωi|λ) / P (x1, ..., xT |λ)
       = P (x1, ..., xt, qt = ωi|λ) P (xt+1, ..., xT |qt = ωi, λ) / P (x1, ..., xT |λ)
       = αt (i) βt (i) / ∑_{j=1}^{N} P (x1, ..., xT , qt = ωj|λ)
       = αt (i) βt (i) / ∑_{j=1}^{N} αt (j) βt (j)
45 / 99
In addition, we have that
The γ_t(i) form a probability distribution over states because

Σ_{i=1}^{N} γ_t(i) = 1    (27)

Thus, we can use this to generate the most likely state q_t at time t

q_t = argmax_{1≤i≤N} [γ_t(i)], for 1 ≤ t ≤ T    (28)

46 / 99
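Equations (27) and (28) can be checked numerically. A minimal sketch, not from the slides: α and β below are random stand-ins for the real forward/backward tables.

```python
import numpy as np

# Random stand-ins for the forward table alpha_t(i) and backward table
# beta_t(i); in a real run these come from the forward/backward passes.
rng = np.random.default_rng(0)
T, N = 5, 3
alpha = rng.random((T, N))
beta = rng.random((T, N))

# gamma_t(i) = alpha_t(i) beta_t(i) / sum_j alpha_t(j) beta_t(j)
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)

# Eq. (28): most likely state at each time t
q_hat = gamma.argmax(axis=1)
```

Each row of `gamma` sums to one, which is exactly eq. (27).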
Outline
1 Introduction
A little about Markov Random Processes
Setup for Our Problem
2 First Problem
What do we need to calculate?
How to solve it? Forward Procedure
3 Second Problem
Dynamic Programming
Viterbi Algorithm
Final Viterbi Algorithm
4 Third Problem
The most difficult
Expectation Maximization of the Third Problem
The Baum-Welch Algorithm
47 / 99
What can we use? Dynamic Programming
Optimal Substructure
Optimal solution to sub-problems.
Overlapping Sub-problems
Use previous solutions to reduce complexity.
Finally
Reconstructing an optimal solution.
48 / 99
Optimal Substructure Process
Process
Show that a solution to the problem consists of making a choice that
leaves a sub-problem to solve.
Given a problem, you suppose an optimal solution.
Given this choice, you determine how to solve the sub-problems.
Then, you show that the solution to each sub-problem is optimal,
by contradiction.
49 / 99
Possible Solutions
One Optimality Criterion
It is possible to solve for the state sequence that maximizes the
expected number of correct pairs of states (q_t, q_{t+1}).
However
The most widely used criterion is to find the single best path, i.e., to
maximize P(Ω|X, λ) ≈ P(Ω, X|λ)
50 / 99
Outline
1 Introduction
A little about Markov Random Processes
Setup for Our Problem
2 First Problem
What do we need to calculate?
How to solve it? Forward Procedure
3 Second Problem
Dynamic Programming
Viterbi Algorithm
Final Viterbi Algorithm
4 Third Problem
The most difficult
Expectation Maximization of the Third Problem
The Baum-Welch Algorithm
51 / 99
Viterbi Algorithm
Setup
We want the best single state sequence Ω_i = {ω_{i_1}, ω_{i_2}, ..., ω_{i_T}} for a given
observation sequence X.
Thus, we define

δ_t(i) = max_{q_1, q_2, ..., q_{t−1}} P[q_1, q_2, ..., q_{t−1}, q_t = i, x_1, x_2, ..., x_t | λ]    (29)

52 / 99
Use Induction
We can do the following

δ_{t+1}(j) = max_{q_1, q_2, ..., q_t} P[q_1, q_2, ..., q_{t+1} = j, x_1, x_2, ..., x_{t+1} | λ]
           = max_i {P[q_{t+1} = j, x_{t+1} | q_1, q_2, ..., q_t = i, x_1, x_2, ..., x_t, λ] × P[q_1, q_2, ..., q_t = i, x_1, x_2, ..., x_t | λ]}
           = max_i {P[q_{t+1} = j, x_{t+1} | q_t = i, λ] × P[q_1, q_2, ..., q_t = i, x_1, x_2, ..., x_t | λ]}
           = max_i {P[q_{t+1} = j | q_t = i, λ] P[x_{t+1} | q_{t+1} = j, q_t = i, λ] δ_t(i)}
           = max_i {a_ij P[x_{t+1} | q_{t+1} = j, λ] δ_t(i)}
           = max_i {a_ij b_j(x_{t+1}) δ_t(i)} = max_i {a_ij δ_t(i)} b_j(x_{t+1})

53 / 99
Outline
1 Introduction
A little about Markov Random Processes
Setup for Our Problem
2 First Problem
What do we need to calculate?
How to solve it? Forward Procedure
3 Second Problem
Dynamic Programming
Viterbi Algorithm
Final Viterbi Algorithm
4 Third Problem
The most difficult
Expectation Maximization of the Third Problem
The Baum-Welch Algorithm
54 / 99
Procedure
Initialization
δ_1(i) = π_i b_i(x_1), 1 ≤ i ≤ N
Array Ψ_1(i) = 0, 1 ≤ i ≤ N
Recursion
δ_t(j) = max_{1≤i≤N} {a_ij δ_{t−1}(i)} b_j(x_t), 2 ≤ t ≤ T and 1 ≤ j ≤ N
Ψ_t(j) = argmax_{1≤i≤N} {a_ij δ_{t−1}(i)}, 2 ≤ t ≤ T and 1 ≤ j ≤ N
Termination
P* = max_{1≤i≤N} [δ_T(i)]
q_T* = argmax_{1≤i≤N} [δ_T(i)]
55 / 99
Clearly
Something Notable
All those recursions can be replaced by an iterative bottom-up procedure.
That part
I leave to you to figure out!!!
56 / 99
Backtracking
Path backtracking
q_t* = Ψ_{t+1}(q_{t+1}*), t = T − 1, T − 2, ..., 1.
57 / 99
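Putting the Procedure and Backtracking slides together, the iterative bottom-up Viterbi can be sketched as follows. The 2-state model numbers π, A, B are invented for illustration, not taken from the slides.

```python
import numpy as np

# Hypothetical 2-state, 2-symbol model; A[i, j] = a_ij, B[j, k] = b_j(k).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
x = [0, 0, 1, 1, 0]                      # observation symbol indices

T, N = len(x), len(pi)
delta = np.zeros((T, N))
psi = np.zeros((T, N), dtype=int)

delta[0] = pi * B[:, x[0]]               # initialization: delta_1(i) = pi_i b_i(x_1)
for t in range(1, T):                    # bottom-up recursion
    scores = delta[t - 1][:, None] * A   # scores[i, j] = delta_{t-1}(i) a_ij
    psi[t] = scores.argmax(axis=0)       # Psi_t(j)
    delta[t] = scores.max(axis=0) * B[:, x[t]]

# termination and path backtracking: q_t* = Psi_{t+1}(q_{t+1}*)
path = [int(delta[T - 1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(psi[t][path[-1]]))
path.reverse()
```

In practice the products are replaced by sums of logarithms to avoid underflow for long T.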
Outline
1 Introduction
A little about Markov Random Processes
Setup for Our Problem
2 First Problem
What do we need to calculate?
How to solve it? Forward Procedure
3 Second Problem
Dynamic Programming
Viterbi Algorithm
Final Viterbi Algorithm
4 Third Problem
The most difficult
Expectation Maximization of the Third Problem
The Baum-Welch Algorithm
58 / 99
The most difficult
Problem
There is no analytical solution to the third problem!!!
However
Given a finite observation sequence as training data, we can choose
λ = (A, B, π) such that P(X|λ) is maximized.
For this
We have the Baum-Welch or Forward-Backward Algorithm.
59 / 99
Outline
1 Introduction
A little about Markov Random Processes
Setup for Our Problem
2 First Problem
What do we need to calculate?
How to solve it? Forward Procedure
3 Second Problem
Dynamic Programming
Viterbi Algorithm
Final Viterbi Algorithm
4 Third Problem
The most difficult
Expectation Maximization of the Third Problem
The Baum-Welch Algorithm
60 / 99
Another way to see this is thinking in Maximum Likelihood
Given our observed data sequence X = x_1, x_2, ..., x_T
The likelihood function is obtained from the joint distribution by
marginalizing over the hidden state variables:

P(X|λ) = Σ_Ω P(X, Ω|λ)    (30)

Problems
We cannot simply treat each summation over ω_i independently!!!
We cannot perform all the summations explicitly because doing so results
in N^T terms.
And Again
A further difficulty is that the likelihood function represents a
summation over the emission models for different settings of the latent
variables.
61 / 99
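For a toy model the N^T blow-up can be made concrete: brute-forcing eq. (30) over every state path costs 2^3 = 8 terms here, while the forward recursion from the first problem computes the same number in O(N^2 T). All parameter values below are made up.

```python
import itertools
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2],
              [0.3, 0.7]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
x = [0, 1, 1]
N, T = 2, len(x)

# Eq. (30) by brute force: sum P(X, Omega | lambda) over all N^T paths.
brute = 0.0
for omega in itertools.product(range(N), repeat=T):
    p = pi[omega[0]] * B[omega[0], x[0]]
    for t in range(1, T):
        p *= A[omega[t - 1], omega[t]] * B[omega[t], x[t]]
    brute += p

# The forward recursion computes the same quantity in O(N^2 T).
alpha = pi * B[:, x[0]]
for t in range(1, T):
    alpha = (alpha @ A) * B[:, x[t]]
forward = alpha.sum()
```

Both routes agree on this toy instance, but only the second survives realistic T.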
Outline
1 Introduction
A little about Markov Random Processes
Setup for Our Problem
2 First Problem
What do we need to calculate?
How to solve it? Forward Procedure
3 Second Problem
Dynamic Programming
Viterbi Algorithm
Final Viterbi Algorithm
4 Third Problem
The most difficult
Expectation Maximization of the Third Problem
The Baum-Welch Algorithm
62 / 99
Initial Setup
Let us consider discrete (categorical) HMMs of length T (each
observation sequence is T observations long).
1 N is the number of hidden states.
2 M is the number of different observation symbols.
3 D observations, X = X^(1), X^(2), ..., X^(D), such that
X^(i) = x_1^(i), x_2^(i), ..., x_T^(i).
4 We assume each observation is drawn iid.
5 Finally, we are given Ω = Ω^(1), Ω^(2), ..., Ω^(D), one sequence of
hidden states per observation, such that Ω^(i) = ω_1^(i), ω_2^(i), ..., ω_T^(i).
63 / 99
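As a sketch of the bookkeeping this setup implies (all sizes below are invented), the data and the categorical parameters λ = (A, B, π) fit in a handful of arrays:

```python
import numpy as np

N, M, D, T = 3, 4, 5, 6                    # states, symbols, sequences, length
rng = np.random.default_rng(4)
X = rng.integers(0, M, size=(D, T))        # X[d, t] = x_t^(d), a symbol index
Omega = rng.integers(0, N, size=(D, T))    # Omega[d, t] = omega_t^(d), a state index

pi = np.full(N, 1.0 / N)                   # initial-state distribution
A = np.full((N, N), 1.0 / N)               # A[i, j] = a_ij, rows sum to 1
B = np.full((N, M), 1.0 / M)               # B[i, k] = b_i(k), rows sum to 1
```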
Baum-Welch Algorithm
Remember the EM Algorithm
1 Compute Q(λ, λ_n) = Σ_Ω log[P(X, Ω|λ)] P(Ω|X, λ_n)
2 Set λ_{n+1} = argmax_λ Q(λ, λ_n)
We can rewrite the previous iterations (given P(X, Ω) = P(X) P(Ω|X)) as

argmax_λ Σ_Ω log[P(X, Ω|λ)] P(Ω|X, λ_n) = argmax_λ Σ_Ω log[P(X, Ω|λ)] P(Ω, X|λ_n) / P(X|λ_n)
                                         ≈ argmax_λ Σ_Ω log[P(X, Ω|λ)] P(Ω, X|λ_n)
                                         = argmax_λ Q̂(λ, λ_n)

64 / 99
After all
We know that
P(X|λ_n) does not depend on λ!!!
Now, assuming we have D observations

P(X, Ω|λ) = P(X^(1), X^(2), ..., X^(D), Ω^(1), Ω^(2), ..., Ω^(D) | λ)
          = P(X^(1), Ω^(1), X^(2), Ω^(2), ..., X^(D), Ω^(D) | λ)
          = Π_{d=1}^{D} P(x_1^(d), x_2^(d), ..., x_T^(d), ω_1^(d), ω_2^(d), ..., ω_T^(d) | λ)
          = Π_{d=1}^{D} P(x_1^(d), x_2^(d), ..., x_T^(d) | ω_1^(d), ω_2^(d), ..., ω_T^(d), λ) × P(ω_1^(d), ω_2^(d), ..., ω_T^(d) | λ)

65 / 99
Thus
We have

P(X, Ω|λ) = Π_{d=1}^{D} [Π_{t=1}^{T} P(x_t^(d) | ω_t^(d), λ)] [Π_{t=2}^{T} P(ω_t^(d) | ω_{t−1}^(d), λ)] P(ω_1^(d) | λ)
          = Π_{d=1}^{D} π_{q_1^(d)} b_{q_1^(d)}(x_1^(d)) Π_{t=2}^{T} a_{q_{t−1}^(d) q_t^(d)} b_{q_t^(d)}(x_t^(d))

66 / 99
Now, taking the logarithm
We get

log P(X, Ω|λ) = Σ_{d=1}^{D} [log π_{q_1^(d)} + Σ_{t=2}^{T} log a_{q_{t−1}^(d) q_t^(d)} + Σ_{t=1}^{T} log b_{q_t^(d)}(x_t^(d))]

67 / 99
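With both the observations and the state paths in hand, this complete-data log-likelihood is a plain sum of table lookups. A sketch with invented parameters and D = 2 toy (observation, path) pairs:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Two hypothetical (x^(d), q^(d)) pairs of length T = 3.
data = [([0, 1, 1], [0, 1, 1]),
        ([0, 0, 1], [0, 0, 1])]

ll = 0.0
for x, q in data:
    ll += np.log(pi[q[0]])                                          # log pi term
    ll += sum(np.log(A[q[t - 1], q[t]]) for t in range(1, len(q)))  # log a terms
    ll += sum(np.log(B[q[t], x[t]]) for t in range(len(x)))         # log b terms
```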
We plug back into Q̂(λ, λ_n)
We get the following

Q̂(λ, λ_n) = Σ_Ω Σ_{d=1}^{D} log π_{q_1^(d)} P(Ω, X|λ_n) +
            Σ_Ω Σ_{d=1}^{D} Σ_{t=2}^{T} log a_{q_{t−1}^(d) q_t^(d)} P(Ω, X|λ_n) +
            Σ_Ω Σ_{d=1}^{D} Σ_{t=1}^{T} log b_{q_t^(d)}(x_t^(d)) P(Ω, X|λ_n)

68 / 99
Ok
We have reintroduced P(Ω, X|λ_n)
It looks like it will increase the complexity of our calculations.
However
Using marginalization, we will be able to remove the terms that are not
necessary to calculate the required updates.
Which Updates?
π_i, a_ij and b_j(k)
69 / 99
Now, Which Lagrange Multipliers?
One for the N hidden marginal states of π

Σ_{i=1}^{N} π_i = 1    (31)

One for each i

Σ_{j=1}^{N} a_ij = 1    (32)

One for each i

Σ_{k=1}^{M} b_i(k) = 1    (33)

Remember: b_i(k) = P(x_k | q_t = ω_i)
70 / 99
Finally, we have the new Lagrangian
We have

L̂(λ, λ_n) = Q̂(λ, λ_n) − λ_π (Σ_{i=1}^{N} π_i − 1) − Σ_{i=1}^{N} λ_{a_i} (Σ_{j=1}^{N} a_ij − 1) − Σ_{i=1}^{N} λ_{b_i} (Σ_{k=1}^{M} b_i(k) − 1)

71 / 99
Now, if we derive with respect to π_i
We have that

∂L̂(λ, λ_n)/∂π_i = ∂/∂π_i [Σ_Ω Σ_{d=1}^{D} log π_{q_1^(d)} P(Ω, X|λ_n)] − λ_π
                = ∂/∂π_i [Σ_{j=1}^{N} Σ_{d=1}^{D} log π_j P(q_1^(d) = j, X|λ_n)] − λ_π
                = (1/π_i) Σ_{d=1}^{D} P(q_1^(d) = i, X|λ_n) − λ_π = 0

Remark: The second step is simply the result of marginalizing out, for
each d, all q_t^(d) with t ≠ 1 and all q_t^(d′) with d′ ≠ d.
72 / 99
Now, if we derive with respect to π_i
After all

Σ_Ω Σ_{d=1}^{D} log π_{q_1^(d)} P(Ω, X|λ_n)
  = Σ_{d=1}^{D} Σ_{j_1^(1)=1}^{N} Σ_{j_2^(1)=1}^{N} ... Σ_{j_T^(1)=1}^{N} ... Σ_{j_1^(D)=1}^{N} ... Σ_{j_T^(D)=1}^{N} log π_{j_1^(d)} P(q_1^(1) = j_1^(1), ..., q_T^(D) = j_T^(D), X|λ_n)
  = Σ_{d=1}^{D} Σ_{j_1=1}^{N} log π_{j_1} P(q_1^(d) = j_1, X|λ_n)

Thus we have

π_i = (1/λ_π) Σ_{d=1}^{D} P(q_1^(d) = i, X|λ_n)    (34)

73 / 99
Now deriving against λ_π
We have that

∂L̂(λ, λ_n)/∂λ_π = −(Σ_{i=1}^{N} π_i − 1) = 0    (35)

Thus, substituting into the past equation

(1/λ_π) Σ_{i=1}^{N} Σ_{d=1}^{D} P(q_1^(d) = i, X|λ_n) = 1    (36)

Then

λ_π = Σ_{i=1}^{N} Σ_{d=1}^{D} P(q_1^(d) = i, X|λ_n)    (37)

74 / 99
Plugging back
We get

(1/π_i) Σ_{d=1}^{D} P(q_1^(d) = i, X|λ_n) − λ_π = 0
(1/π_i) Σ_{d=1}^{D} P(q_1^(d) = i, X|λ_n) = Σ_{i=1}^{N} Σ_{d=1}^{D} P(q_1^(d) = i, X|λ_n)
π_i = Σ_{d=1}^{D} P(q_1^(d) = i, X|λ_n) / Σ_{i=1}^{N} Σ_{d=1}^{D} P(q_1^(d) = i, X|λ_n)
π_i = Σ_{d=1}^{D} P(q_1^(d) = i, X|λ_n) / Σ_{d=1}^{D} P(X|λ_n)
π_i = Σ_{d=1}^{D} P(X|λ_n) P(q_1^(d) = i|X, λ_n) / [D P(X|λ_n)]

75 / 99
In addition
We can modify our γ

γ_{q_{t−1}^(d)}(i) = P(q_{t−1}^(d) = i | X^(d), λ_n)    (38)

Meaning
The probability of being in state i at time t − 1 given the observation
sequence X^(d) and the model λ_n.
76 / 99
Thus
We have

π_i = Σ_{d=1}^{D} P(X|λ_n) P(q_1^(d) = i|X, λ_n) / [D P(X|λ_n)]
π_i = (1/D) Σ_{d=1}^{D} P(q_1^(d) = i | X^(d), λ_n)
π_i = (1/D) Σ_{d=1}^{D} γ_{q_1^(d)}(i)

Remark: This can be thought of as the expected frequency (number of
times) in state i at time t = 1, given by γ_{q_1^(d)}(i).
77 / 99
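Given per-sequence state posteriors (random stand-ins below, in place of real forward-backward output), the π update is just the average of the t = 1 rows:

```python
import numpy as np

# gamma[d, t, i] stands in for P(q_t^(d) = i | X^(d), lambda_n);
# random rows normalized so each is a distribution over the N states.
rng = np.random.default_rng(1)
D, T, N = 4, 6, 3
gamma = rng.random((D, T, N))
gamma /= gamma.sum(axis=2, keepdims=True)

# pi_i = (1/D) sum_d gamma_{q_1^(d)}(i): average the t = 1 posteriors.
pi_new = gamma[:, 0, :].mean(axis=0)
```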
Now, for a_ij
We now follow a similar process for the a_ij

∂L̂(λ, λ_n)/∂a_ij = ∂/∂a_ij [Σ_Ω Σ_{d=1}^{D} Σ_{t=2}^{T} log a_{q_{t−1}^(d) q_t^(d)} P(Ω, X|λ_n) − Σ_{i=1}^{N} λ_{a_i} (Σ_{j=1}^{N} a_ij − 1)]
                  = ∂/∂a_ij [Σ_{h=1}^{N} Σ_{k=1}^{N} Σ_{d=1}^{D} Σ_{t=2}^{T} log a_hk P(q_{t−1}^(d) = h, q_t^(d) = k, X|λ_n)] − λ_{a_i}
                  = (1/a_ij) Σ_{d=1}^{D} Σ_{t=2}^{T} P(q_{t−1}^(d) = i, q_t^(d) = j, X|λ_n) − λ_{a_i} = 0

In a similar way

∂L̂(λ, λ_n)/∂λ_{a_i} = −(Σ_{j=1}^{N} a_ij − 1) = 0    (39)

78 / 99
Then
We have that

a_ij = Σ_{d=1}^{D} Σ_{t=2}^{T} P(q_{t−1}^(d) = i, q_t^(d) = j, X|λ_n) / λ_{a_i}

Then, using Σ_{j=1}^{N} a_ij = 1

λ_{a_i} = Σ_{j=1}^{N} Σ_{d=1}^{D} Σ_{t=2}^{T} P(q_{t−1}^(d) = i, q_t^(d) = j, X|λ_n)    (40)

79 / 99
Re-expressing
We get
aij =
D
d=1
T
t=2 P q
(d)
t−1 = i, q
(d)
t = j, X|λn
N
j=1
D
d=1
T
t=2 P q
(d)
t−1 = i, q
(d)
t = j, X|λn
Marginalizing the denominator
aij =
D
d=1
T
t=2 P q
(d)
t−1 = i, q
(d)
t = j, X|λn
D
d=1
T
t=2 P q
(d)
t−1 = i, X|λn
80 / 99
Next
We have
a_ij = [ Σ_{d=1}^{D} Σ_{t=2}^{T} P(q_{t−1}^{(d)} = i, q_t^{(d)} = j | X, λ^n) P(X|λ^n) ] / [ Σ_{d=1}^{D} Σ_{t=2}^{T} P(q_{t−1}^{(d)} = i | X, λ^n) P(X|λ^n) ]
= [ Σ_{d=1}^{D} Σ_{t=2}^{T} P(q_{t−1}^{(d)} = i, q_t^{(d)} = j | X^{(d)}, λ^n) ] / [ Σ_{d=1}^{D} Σ_{t=2}^{T} P(q_{t−1}^{(d)} = i | X^{(d)}, λ^n) ]
81 / 99
Now, we are looking for a more compact version
We define
The probability of being in state i at time t − 1 and in state j at time t, for sequence d:
ξ_{t−1}^{(d)}(i, j) = P(q_{t−1}^{(d)} = i, q_t^{(d)} = j | X^{(d)}, λ) (41)
82 / 99
Using Bayes
We have for ξ_{t−1}^{(d)}(i, j)
ξ_{t−1}^{(d)}(i, j)
= P(x_1^{(d)}, ..., x_{t−1}^{(d)}, q_{t−1}^{(d)} = i, q_t^{(d)} = j, x_t^{(d)}, ..., x_T^{(d)} | λ) / P(X^{(d)}|λ)
= P(x_1^{(d)}, ..., x_{t−1}^{(d)}, q_{t−1}^{(d)} = i | q_t^{(d)} = j, x_t^{(d)}, ..., x_T^{(d)}, λ) P(q_t^{(d)} = j, x_t^{(d)}, ..., x_T^{(d)} | λ) / P(X^{(d)}|λ)
= P(x_1^{(d)}, ..., x_{t−1}^{(d)}, q_{t−1}^{(d)} = i | q_t^{(d)} = j, λ) P(x_{t+1}^{(d)}, ..., x_T^{(d)} | q_t^{(d)} = j, x_t^{(d)}, λ) P(q_t^{(d)} = j, x_t^{(d)} | λ) / P(X^{(d)}|λ)
= P(x_1^{(d)}, ..., x_{t−1}^{(d)}, q_{t−1}^{(d)} = i | q_t^{(d)} = j, λ) P(x_{t+1}^{(d)}, ..., x_T^{(d)} | q_t^{(d)} = j, x_t^{(d)}, λ) P(x_t^{(d)} | q_t^{(d)} = j, λ) P(q_t^{(d)} = j | λ) / P(X^{(d)}|λ)
= P(x_1^{(d)}, ..., x_{t−1}^{(d)}, q_{t−1}^{(d)} = i | q_t^{(d)} = j, λ) P(q_t^{(d)} = j | λ) β_t^{(d)}(j) b_j(x_t^{(d)}) / P(X^{(d)}|λ)
= ...
83 / 99
Again Bayes
Thus ξ_{t−1}^{(d)}(i, j) =
= P(x_1^{(d)}, ..., x_{t−1}^{(d)}, q_{t−1}^{(d)} = i, q_t^{(d)} = j | λ) β_t^{(d)}(j) b_j(x_t^{(d)}) / P(X^{(d)}|λ)
= P(q_t^{(d)} = j | x_1^{(d)}, ..., x_{t−1}^{(d)}, q_{t−1}^{(d)} = i, λ) P(x_1^{(d)}, ..., x_{t−1}^{(d)}, q_{t−1}^{(d)} = i | λ) β_t^{(d)}(j) b_j(x_t^{(d)}) / P(X^{(d)}|λ)
= P(q_t^{(d)} = j | q_{t−1}^{(d)} = i, λ) α_{t−1}^{(d)}(i) β_t^{(d)}(j) b_j(x_t^{(d)}) / P(X^{(d)}|λ)
= α_{t−1}^{(d)}(i) a_ij β_t^{(d)}(j) b_j(x_t^{(d)}) / [ Σ_{k=1}^{N} Σ_{h=1}^{N} α_{t−1}^{(d)}(k) a_kh β_t^{(d)}(h) b_h(x_t^{(d)}) ]
84 / 99
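The last line above says ξ can be read straight off the forward and backward tables. A minimal NumPy sketch (the `xi_matrix` helper name is ours, not from the slides), assuming a discrete-observation HMM whose α and β tables have already been computed for one sequence:

```python
import numpy as np

def xi_matrix(alpha, beta, A, B, obs):
    """xi[t-1, i, j] = P(q_{t-1}=i, q_t=j | X, lambda), for t = 2..T.

    alpha, beta: (T, N) forward/backward tables for one sequence
    A: (N, N) transition matrix, B: (N, M) emission matrix
    obs: length-T integer observation sequence
    """
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(1, T):
        # numerator: alpha_{t-1}(i) * a_ij * b_j(x_t) * beta_t(j)
        num = alpha[t - 1][:, None] * A * B[:, obs[t]][None, :] * beta[t][None, :]
        xi[t - 1] = num / num.sum()  # the sum over (i, j) equals P(X | lambda)
    return xi
```

The normalizing sum in the denominator is exactly the double sum over k and h shown above.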
Again Bayes
Thus ξq
(d)
t−1
(i, j) =
=
P x
(d)
1
, ..., x
(d)
t−1
, q
(d)
t−1
= i, q
(d)
t
= j|λ βt (j) bj (xt)
P X(d)|λ
=
P q
(d)
t
= j|x
(d)
1
, ..., x
(d)
t−1
, q
(d)
t−1
= i, λ P x
(d)
1
, ..., x
(d)
t−1
, q
(d)
t−1
= i|λ β
(d)
t
(j) bj x
(d)
t
P (X|λ)
=
P q
(d)
t
= j|q
(d)
t−1
= i, λ α
(d)
t−1
(i) β
(d)
t
(j) bj x
(d)
t
P X(d)|λ
=
α
(d)
t−1
(i) a
(d)
ij
β
(d)
t
(j) bj x
(d)
t
N
k=1
N
h=1
α
(d)
t−1
(k) a
(d)
kh
β
(d)
t
(h) bh x
(d)
t
84 / 99
Again Bayes
Thus ξq
(d)
t−1
(i, j) =
=
P x
(d)
1
, ..., x
(d)
t−1
, q
(d)
t−1
= i, q
(d)
t
= j|λ βt (j) bj (xt)
P X(d)|λ
=
P q
(d)
t
= j|x
(d)
1
, ..., x
(d)
t−1
, q
(d)
t−1
= i, λ P x
(d)
1
, ..., x
(d)
t−1
, q
(d)
t−1
= i|λ β
(d)
t
(j) bj x
(d)
t
P (X|λ)
=
P q
(d)
t
= j|q
(d)
t−1
= i, λ α
(d)
t−1
(i) β
(d)
t
(j) bj x
(d)
t
P X(d)|λ
=
α
(d)
t−1
(i) a
(d)
ij
β
(d)
t
(j) bj x
(d)
t
N
k=1
N
h=1
α
(d)
t−1
(k) a
(d)
kh
β
(d)
t
(h) bh x
(d)
t
84 / 99
Again Bayes
Thus ξq
(d)
t−1
(i, j) =
=
P x
(d)
1
, ..., x
(d)
t−1
, q
(d)
t−1
= i, q
(d)
t
= j|λ βt (j) bj (xt)
P X(d)|λ
=
P q
(d)
t
= j|x
(d)
1
, ..., x
(d)
t−1
, q
(d)
t−1
= i, λ P x
(d)
1
, ..., x
(d)
t−1
, q
(d)
t−1
= i|λ β
(d)
t
(j) bj x
(d)
t
P (X|λ)
=
P q
(d)
t
= j|q
(d)
t−1
= i, λ α
(d)
t−1
(i) β
(d)
t
(j) bj x
(d)
t
P X(d)|λ
=
α
(d)
t−1
(i) a
(d)
ij
β
(d)
t
(j) bj x
(d)
t
N
k=1
N
h=1
α
(d)
t−1
(k) a
(d)
kh
β
(d)
t
(h) bh x
(d)
t
84 / 99
This can be seen graphically as
This is illustrated by the following figure
85 / 99
In addition
Given the definition of γ_{t−1}^{(d)}(i)
γ_{t−1}^{(d)}(i) = α_{t−1}^{(d)}(i) β_{t−1}^{(d)}(i) / Σ_{j=1}^{N} α_t^{(d)}(j) β_t^{(d)}(j) (42)
Summing over j
Σ_{j=1}^{N} ξ_{t−1}^{(d)}(i, j) = Σ_{j=1}^{N} P(q_{t−1}^{(d)} = i, q_t^{(d)} = j | X^{(d)}, λ)
= P(q_{t−1}^{(d)} = i | X^{(d)}, λ)
= P(x_1^{(d)}, ..., x_{t−1}^{(d)}, q_{t−1}^{(d)} = i, x_t^{(d)}, ..., x_T^{(d)} | λ) / P(X^{(d)}|λ)
86 / 99
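Eq. (42) is just a normalized product of the forward and backward tables, since Σ_j α_t(j) β_t(j) = P(X|λ) for every t. A small illustrative sketch (the helper name is our own), assuming α and β are (T × N) arrays for one sequence:

```python
import numpy as np

def gamma_matrix(alpha, beta):
    """gamma[t, i] = P(q_t = i | X, lambda), eq. (42), from the
    forward table alpha and backward table beta of one sequence."""
    post = alpha * beta                            # alpha_t(i) * beta_t(i)
    return post / post.sum(axis=1, keepdims=True)  # normalize over states
```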
Then
We have
= P(x_1^{(d)}, ..., x_{t−1}^{(d)}, q_{t−1}^{(d)} = i, x_t^{(d)}, ..., x_T^{(d)} | λ) / P(X^{(d)}|λ)
= α_{t−1}(i) β_{t−1}(i) / Σ_{i=1}^{N} P(x_1, ..., x_t, ..., x_T, q_t = ω_i | λ)
= γ_{t−1}^{(d)}(i)
87 / 99
We can also think about the expected values
The following sum
1 It can be thought of as the expected (over time) number of times that state ωi is visited
2 Or the number of transitions made from state ωi
It is
Σ_{t=2}^{T} γ_{t−1}^{(d)}(i) (43)
88 / 99
Also the expected value for ξ_{t−1}^{(d)}(i, j)
The following sum
It can be thought of as the expected number of transitions from i to j
It is
Σ_{t=2}^{T} ξ_{t−1}^{(d)}(i, j) (44)
89 / 99
Remark
Important
Baum et al. proved that the maximization of Q(λ, λ̄) leads to increased likelihood:
max_{λ̄} Q(λ, λ̄) ⇒ P(X|λ̄) ≥ P(X|λ) (45)
For details
Please look back at our slides on EM!!!
90 / 99
We can rewrite aij
Thus
a_ij = [ Σ_{d=1}^{D} Σ_{t=2}^{T} P(q_{t−1}^{(d)} = i, q_t^{(d)} = j | X^{(d)}, λ^n) ] / [ Σ_{d=1}^{D} Σ_{t=2}^{T} P(q_{t−1}^{(d)} = i | X^{(d)}, λ^n) ]
= [ Σ_{d=1}^{D} Σ_{t=2}^{T} ξ_{t−1}^{(d)}(i, j) ] / [ Σ_{d=1}^{D} Σ_{t=2}^{T} γ_{t−1}^{(d)}(i) ]
Remark: This can be seen as
(expected number of transitions from state i to j) / (expected number of transitions made from state i)
91 / 99
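This ratio of expected counts, pooled over all D sequences, can be sketched directly (the `reestimate_A` name is ours, and we assume the per-sequence ξ and γ tables are already available):

```python
import numpy as np

def reestimate_A(xis, gammas):
    """A_ij <- sum_{d,t} xi^{(d)}_{t-1}(i,j) / sum_{d,t} gamma^{(d)}_{t-1}(i).

    xis: list of (T_d - 1, N, N) arrays, one per sequence
    gammas: list of (T_d, N) arrays, one per sequence
    """
    N = xis[0].shape[1]
    num = np.zeros((N, N))
    den = np.zeros(N)
    for xi, gamma in zip(xis, gammas):
        num += xi.sum(axis=0)          # expected transitions i -> j
        den += gamma[:-1].sum(axis=0)  # expected transitions out of i
    return num / den[:, None]
```

Because each γ_{t−1}(i) equals Σ_j ξ_{t−1}(i, j), the rows of the resulting matrix automatically sum to one.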
Now, the terms bi (k)
We have that
∂L̂(λ, λ^n)/∂b_i(k)
= ∂/∂b_i(k) [ Σ_Ω Σ_{d=1}^{D} Σ_{t=1}^{T} log b_{q_t^{(d)}}(x_t^{(d)}) P(Ω, X|λ^n) − λ_{b_i} (Σ_{k=1}^{M} b_i(k) − 1) ]
= ∂/∂b_i(k) [ Σ_{h=1}^{N} Σ_{d=1}^{D} Σ_{t=1}^{T} log b_h(x_t^{(d)}) P(q_t^{(d)} = h, X|λ^n) − λ_{b_i} (Σ_{k=1}^{M} b_i(k) − 1) ]
Now, we have a problem
The term b_i(k) can be different from the term b_i(x_t^{(d)})
We can use this to fix our problem
I(x_t^{(d)} = k) = { 1 when x_t^{(d)} = k; 0 otherwise } (46)
92 / 99
Thus, we have that
Using our previous idea
∂L̂(λ, λ^n)/∂b_i(k) = [ Σ_{d=1}^{D} Σ_{t=1}^{T} P(q_t^{(d)} = i, X|λ^n) I(x_t^{(d)} = k) ] / b_i(k) − λ_{b_i}
In addition
∂L̂(λ, λ^n)/∂λ_{b_i} = −(Σ_{k=1}^{M} b_i(k) − 1) (47)
93 / 99
Thus
We have that
b_i(k) = [ Σ_{d=1}^{D} Σ_{t=1}^{T} P(q_t^{(d)} = i, X|λ^n) I(x_t^{(d)} = k) ] / λ_{b_i} (48)
Thus
λ_{b_i} = Σ_{k=1}^{M} Σ_{d=1}^{D} Σ_{t=1}^{T} P(q_t^{(d)} = i, X|λ^n) I(x_t^{(d)} = k) (49)
94 / 99
We have then
The following result
b_i(k) = [ Σ_{d=1}^{D} Σ_{t=1}^{T} P(q_t^{(d)} = i, X|λ^n) I(x_t^{(d)} = k) ] / [ Σ_{k=1}^{M} Σ_{d=1}^{D} Σ_{t=1}^{T} P(q_t^{(d)} = i, X|λ^n) I(x_t^{(d)} = k) ]
= [ Σ_{d=1}^{D} Σ_{t=1}^{T} P(q_t^{(d)} = i, X|λ^n) I(x_t^{(d)} = k) ] / [ Σ_{d=1}^{D} Σ_{t=1}^{T} P(q_t^{(d)} = i, X|λ^n) ]
= [ Σ_{d=1}^{D} Σ_{t=1}^{T} P(q_t^{(d)} = i | X^{(d)}, λ^n) I(x_t^{(d)} = k) ] / [ Σ_{d=1}^{D} Σ_{t=1}^{T} P(q_t^{(d)} = i | X^{(d)}, λ^n) ]
95 / 99
Using Rabiner’s Notation
We have that
b_i(k) = [ Σ_{d=1}^{D} Σ_{t=1}^{T} γ_t^{(d)}(i) I(x_t^{(d)} = k) ] / [ Σ_{d=1}^{D} Σ_{t=1}^{T} γ_t^{(d)}(i) ] (50)
Which can be seen as
(The expected number of times in state i and observing symbol xk) / (The expected number of times in state i)
96 / 99
Thus, the re-estimation
First
π_i^{(n+1)} = (1/D) Σ_{d=1}^{D} γ_1^{(d)}(i) (51)
Second
a_ij^{(n+1)} = [ Σ_{d=1}^{D} Σ_{t=2}^{T} ξ_{t−1}^{(d)}(i, j) ] / [ Σ_{d=1}^{D} Σ_{t=2}^{T} γ_{t−1}^{(d)}(i) ] (52)
Third
b_i^{(n+1)}(k) = [ Σ_{d=1}^{D} Σ_{t=1}^{T} γ_t^{(d)}(i) I(x_t^{(d)} = k) ] / [ Σ_{d=1}^{D} Σ_{t=1}^{T} γ_t^{(d)}(i) ] (53)
97 / 99
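Equations (51)-(53) can be combined into a single re-estimation pass. The sketch below (the `baum_welch_step` function name is our own, not from the slides) runs forward-backward on each sequence and accumulates the expected counts for π, A and B, assuming discrete observations encoded as integers:

```python
import numpy as np

def baum_welch_step(pi, A, B, sequences):
    """One EM re-estimation step, eqs. (51)-(53), for a discrete-observation HMM.

    pi: (N,) initial distribution, A: (N, N) transitions,
    B: (N, M) emissions, sequences: list of integer observation lists.
    """
    N, M = B.shape
    pi_num = np.zeros(N)
    A_num, A_den = np.zeros((N, N)), np.zeros(N)
    B_num, B_den = np.zeros((N, M)), np.zeros(N)
    for obs in sequences:
        T = len(obs)
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                  # forward pass
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[T - 1] = 1.0                             # backward pass
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        evidence = alpha[-1].sum()                    # P(X^{(d)} | lambda)
        gamma = alpha * beta / evidence               # (T, N) state posteriors
        pi_num += gamma[0]
        for t in range(1, T):                         # xi_{t-1}(i, j) terms
            A_num += alpha[t - 1][:, None] * A * B[:, obs[t]][None, :] \
                     * beta[t][None, :] / evidence
        A_den += gamma[:-1].sum(axis=0)
        for t in range(T):
            B_num[:, obs[t]] += gamma[t]              # indicator I(x_t = k)
        B_den += gamma.sum(axis=0)
    D = len(sequences)
    return pi_num / D, A_num / A_den[:, None], B_num / B_den[:, None]
```

Iterating this step until the likelihood stops improving gives the full Baum-Welch algorithm; each step cannot decrease P(X|λ), per Eq. (45).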
Something Nice
An important aspect of the re-estimation procedure is that
Σ_{i=1}^{N} π_i^{(n+1)} = 1
Σ_{j=1}^{N} a_ij^{(n+1)} = 1, 1 ≤ i ≤ N
Σ_{k=1}^{M} b_j^{(n+1)}(k) = 1, 1 ≤ j ≤ N
These constraints are
Automatically satisfied at each iteration.
98 / 99
Once these probabilities are ready
We can use Naïve Bayes
P(Ωi|X) = P(Ωi, X)/P(X) > P(Ωj, X)/P(X) = P(Ωj|X) ∀i ≠ j (54)
Then
The rule looks like
P(X|Ωi) P(Ωi) > P(X|Ωj) P(Ωj) ∀i ≠ j (55)
99 / 99
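With one trained HMM per class Ωi, the decision rule in Eq. (55) compares P(X|Ωi) P(Ωi) across classes. A hypothetical sketch (the `classify` helper and the (π, A, B) model tuples are our own assumptions), using a scaled forward pass so long sequences do not underflow:

```python
import numpy as np

def classify(models, priors, obs):
    """Pick the class maximizing log P(X | Omega_i) + log P(Omega_i).

    models: list of (pi, A, B) tuples, one trained HMM per class
    priors: list of class priors P(Omega_i)
    obs: integer observation sequence
    """
    def loglik(pi, A, B, obs):
        alpha = pi * B[:, obs[0]]
        ll = 0.0
        for t in range(len(obs)):
            if t > 0:
                alpha = (alpha @ A) * B[:, obs[t]]
            s = alpha.sum()
            alpha = alpha / s   # rescale to avoid numerical underflow
            ll += np.log(s)     # accumulate log P(X | lambda)
        return ll
    scores = [loglik(pi, A, B, obs) + np.log(p)
              for (pi, A, B), p in zip(models, priors)]
    return int(np.argmax(scores))
```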

Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 

12 Machine Learning Supervised Hidden Markov Chains

  • 1. Machine Learning for Data Mining Hidden Markov Models Andres Mendez-Vazquez July 22, 2015 1 / 99
  • 2. Outline 1 Introduction A little about Markov Random Processes Setup for Our Problem 2 First Problem What do we need to calculate? How to solve it? Forward Procedure 3 Second Problem Dynamic Programming Viterbi Algorithm Final Viterbi Algorithm 4 Third Problem The most difficult Expectation Maximization of the Third Problem The Baum-Welch Algorithm 2 / 99
  • 4. Markov Random Processes Markov process (First-Order Markov Model) A stochastic process is called a Markov process when it has the Markov property: P (Xtn = xn|Xtn−1 = xn−1, ..., Xt1 = x1) = P (Xtn = xn|Xtn−1 = xn−1) (1) Simply The future path of a Markov process, given its current state and its past history, depends only on the current state (not on how that state was reached). Thus (Quite an Oversimplification!!!) A Markov process is characterized by its (one-step) transition probabilities: pij = P (Xn = j|Xn−1 = i) (2) 4 / 99
  • 7. Graphical Interpretation A sequence of events, each depending only on the previous one 5 / 99
  • 8. Markov Chains Definition (Oversimplified) A Markov chain is a process Xn, indexed by integers n = 0, 1, ..., whose states are themselves indexed by integers, Xn ∈ {0, 1, ...} From here The probability of a path i1, i2, ..., in is P (X1 = i1, X2 = i2, ..., Xn = in) = P (X1 = i1) pi1i2 pi2i3 · · · pin−1in (3) 6 / 99
  • 10. The transition probability matrix of a Markov chain Thus The transition probabilities can be arranged as a transition probability matrix P, where the entry in row i, column j is the probability of moving to state i from state j: Initial State −→ (columns) Final State ↓ (rows) P = (P (Xn = i|Xn−1 = j)) Properties Column j contains the transition probabilities out of state j. Since the system always goes to some state, the sum of each column is 1 Then A matrix with non-negative elements such that the sum of each column equals 1 is called a (left) stochastic matrix. 7 / 99
  • 13. Example We have the following state machine with states 0, 1, 2, and the following stochastic matrix (4), with p = 1/3:
      [ 1 − p       1 − p       0     ]
  M = [ p (1 − p)   p (1 − p)   1 − p ]
      [ p²          p²          p     ]
Each column of M sums to 1. 8 / 99
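As a quick sanity check, the matrix in Equation (4) can be written down and verified to be column stochastic; a minimal sketch in Python (NumPy assumed available):

```python
import numpy as np

# Transition matrix of the 3-state example, with p = 1/3.
# Entry M[i, j] = P(next state = i | current state = j),
# so each column must sum to 1 (a left stochastic matrix).
p = 1 / 3
M = np.array([
    [1 - p,       1 - p,       0    ],
    [p * (1 - p), p * (1 - p), 1 - p],
    [p**2,        p**2,        p    ],
])

assert np.allclose(M.sum(axis=0), 1.0)  # every column sums to 1
```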
  • 15. Using this We can talk about an initial distribution, given as a probability vector π A stationary probability vector π is defined as a distribution that does not change under application of the transition matrix: Pπ = π (5) Based on the Perron-Frobenius Theorem (read about it in Wolfram MathWorld) Theorem: Let M be a Markov chain whose left stochastic matrix has no zeros. Then, M has a single stationary distribution, and it converges to this distribution when started at any distribution. 9 / 99
  • 17. How to find the stationary distribution We can find this stationary distribution using a simple but slow process called the Power Method There are others: Householder transformations, Givens rotations, Arnoldi iteration 10 / 99
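The power method amounts to applying the transition matrix repeatedly until the distribution stops changing. A minimal sketch for the 3-state example above (NumPy assumed; the iteration count is an arbitrary choice):

```python
import numpy as np

# Power method sketch for the stationary distribution P pi = pi of the
# 3-state example (column-stochastic M, pi a probability vector).
p = 1 / 3
M = np.array([
    [1 - p,       1 - p,       0    ],
    [p * (1 - p), p * (1 - p), 1 - p],
    [p**2,        p**2,        p    ],
])

pi = np.full(3, 1 / 3)     # start from the uniform distribution
for _ in range(1000):       # repeatedly apply the transition matrix
    pi = M @ pi

assert np.allclose(M @ pi, pi)  # pi is (numerically) stationary
```

Since M is column stochastic, each application of `M @ pi` preserves the total probability mass, so no renormalization step is needed in this sketch.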
  • 23. Class Sequence and Observations We have A sequence of T observations (feature vectors), X = x1, x2, ..., xT . In addition A collection of N classes or hidden states, ω1, ω2, ..., ωN Thus, we have an element like this one Ω = ω1, ω2, ..., ωT (6) Note: The total number of class sequences of size T is N^T . 12 / 99
  • 26. Transition Model Definition of the transition model λ = (A, B, π) for a Hidden Markov Model (HMM) Where: 1 N is the number of hidden states. 2 M is the number of different observation symbols. 3 A = {aij} is the state transition probability, aij = P (qt+1 = ωj|qt = ωi) with 1 ≤ i, j ≤ N (in our case aij > 0). 4 B = {bj (xk)} is the observation symbol probability at state j, i.e. bj (xk) = P (xk|qt = ωj) such that 1 ≤ j ≤ N and 1 ≤ k ≤ M. 5 The initial state distribution π = {πi} where πi = P (q1 = ωi) with 1 ≤ i ≤ N. 13 / 99
  • 32. Example We have that The hidden states form a probabilistic state machine with emission probabilities for the symbols. Hidden States Observable States 14 / 99
  • 33. Using a HMM It can be used as a generator of sequences X 1 Choose an initial state q1 = ωi according to the initial state distribution π. 2 Set t = 1. 3 Choose an observation Xt = xk according to the symbol probability in state ωi, bi (k). 4 Transit to a new state qt+1 = ωj according to the state transition probability distribution for state ωi, aij. 5 Set t = t + 1, and return to step 3 if t < T. 15 / 99
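The generator above can be sketched directly; the parameters A, B, π below are toy values chosen for illustration, not taken from the lecture (N = 2 hidden states, M = 3 observation symbols):

```python
import numpy as np

rng = np.random.default_rng(0)

A  = np.array([[0.7, 0.3],         # A[i, j] = P(q_{t+1} = j | q_t = i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],    # B[j, k] = b_j(k) = P(x_t = k | q_t = j)
               [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])          # initial state distribution

def generate(T):
    """Sample a state path and an observation sequence of length T."""
    states, obs = [], []
    q = rng.choice(2, p=pi)        # step 1: draw q_1 ~ pi
    for _ in range(T):             # steps 3-5: emit a symbol, then transit
        states.append(q)
        obs.append(rng.choice(3, p=B[q]))
        q = rng.choice(2, p=A[q])
    return states, obs

states, obs = generate(5)
```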
  • 38. What we want!!! Problem 1 Given the sequence X and a certain model λ, how do we efficiently compute P (X|λ)? Problem 2 Given the observation sequence X and a model λ, how do we choose a corresponding class sequence Ω which best explains the observations? Problem 3 How do we adjust the parameters of λ to maximize P (X|λ)? 16 / 99
  • 41. First Problem Task Our task is to decide to which class sequence Ω = ω1, ω2, ..., ωT the sequence X = x1, x2, ..., xT belongs. For this, first consider a fixed sequence of states Ω = ω1, ω2, ..., ωT 17 / 99
  • 43. First, some assumptions First Given the sequence of classes, the observations are statistically independent. Second The probability density function in one class does not depend on the other classes. Thus, the probability of the observation sequence X P (X|Ω, λ) = P (x1, x2, ..., xT |ω1, ω2, ..., ωT , λ) = ∏_{k=1}^{T} P (xk|ω1, ω2, ..., ωT , λ) = ∏_{k=1}^{T} P (xk|ωk, λ) 18 / 99
  • 48. Then We need the probability of such a state sequence Ω P (Ω|λ) (7) We can use the Markov Condition P (Ω|λ) = P (ω1, ω2, ..., ωT |λ) = P (ωT |ω1, ..., ωT−1, λ) P (ωT−1|ω1, ..., ωT−2, λ) · · · P (ω1|λ) Thus P (Ω|λ) = P (ω1|λ) ∏_{k=2}^{T} P (ωk|ωk−1, λ) (8) 19 / 99
  • 52. Using our notation Then P (Ω|λ) = πq1 aq1,q2 · · · aqT−1,qT (9) Now P (X|Ω, λ) = bq1 (x1) · · · bqT (xT ) (10) 20 / 99
  • 54. What do we want? Then, we want the probability that X and Ω happen simultaneously P (X, Ω|λ) = P (X|Ω, λ) P (Ω|λ) (11) Hidden State Machine 21 / 99
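For one fixed state path, Equation (11) is just a product of entries of π, A and B. A minimal sketch, using hypothetical toy parameters (not from the lecture):

```python
import numpy as np

# Toy model, for illustration only: 2 states, 3 symbols.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])

def joint(path, obs):
    """P(X, Omega | lambda) = P(Omega|lambda) P(X|Omega,lambda) for a fixed path."""
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    return p

# pi_0 b_0(0) * a_01 b_1(2) * a_11 b_1(1)
val = joint([0, 1, 1], [0, 2, 1])
```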
  • 56. Thus, we need to calculate What? P (X|λ) = ∑_Ω P (X, Ω|λ) (12) Using the Product Rule P (X|λ) = ∑_Ω P (X|Ω, λ) P (Ω|λ) (13) 23 / 99
  • 58. Putting All Together Thus, we have then P (X|λ) = ∑_Ω P (X|Ω, λ) P (Ω|λ) = ∑_{q1,q2,...,qT} πq1 bq1 (x1) [aq1,q2 bq2 (x2)] · · · [aqT−1,qT bqT (xT )] 24 / 99
  • 60. Complexity for the brute force approach Multiplications Each path needs about 2T multiplications (rearranging): πq1 aq1,q2 · · · aqT−1,qT (T factors) × bq1 (x1) · · · bqT (xT ) (T factors) (14) Sums We need N^T additions because of the sum over q1, q2, ..., qT . Total number of operations O (T N^T ) (15) 25 / 99
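The brute-force computation sums this product over all N^T paths, which is exactly why it is intractable for long sequences. A sketch with hypothetical toy parameters (N = 2, so only 2^T paths here):

```python
import itertools
import numpy as np

# Toy model, for illustration only: 2 states, 3 symbols.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])

def brute_force(obs, N=2):
    """P(X|lambda) by summing P(X, Omega|lambda) over all N^T state paths."""
    total = 0.0
    for path in itertools.product(range(N), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

prob = brute_force([0, 2, 1])
```

The loop body does about 2T multiplications per path, matching the O(T N^T) count in Equation (15).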
  • 64. This will become part of a Dynamic Programming Solution Together with a Backward procedure The forward and backward procedures will help solve the second problem 27 / 99
  • 65. Consider the following Define αt (i) = P (x1, x2, ..., xt, qt = ωi|λ) (16) the probability of the partial observation sequence until time t, ending in state ωi. We can use an induction 1 Initialization: α1 (i) = πi bi (x1), 1 ≤ i ≤ N 2 Induction: αt+1 (j) = [∑_{i=1}^{N} αt (i) aij] bj (xt+1), 1 ≤ t ≤ T − 1 and 1 ≤ j ≤ N. 3 Termination: P (X|λ) = ∑_{i=1}^{N} αT (i) 28 / 99
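The three steps above translate almost line by line into code; a minimal sketch with hypothetical toy parameters (not from the lecture):

```python
import numpy as np

# Toy model, for illustration only: 2 states, 3 symbols.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])

def forward(obs):
    """Forward procedure: returns P(X|lambda) in O(N^2 T) time."""
    alpha = pi * B[:, obs[0]]          # initialization: alpha_1(i) = pi_i b_i(x_1)
    for x in obs[1:]:                  # induction over t
        alpha = (alpha @ A) * B[:, x]  # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(x)
    return alpha.sum()                 # termination: sum_i alpha_T(i)

prob = forward([0, 2, 1])
```

For this length-3 sequence the forward procedure agrees with summing P (X, Ω|λ) over all 2³ paths, while doing only O(N²T) work.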
  • 69. How do we do this? Step 1 You need to prove it is correct for α1 (i) with 1 ≤ i ≤ N Assume it is true for t Then, you need to prove it for αt+1 (j) with 1 ≤ t ≤ T − 1 and 1 ≤ j ≤ N. Then At termination you have a way to calculate the desired probability P (X|λ). 29 / 99
  • 72. Step 1 Initializes the forward probabilities as α1 (i) = P (x1, q1 = ωi|λ) (17) Here is the idea of moving from one state at time t to another state at time t + 1 30 / 99
  • 74. What? Basically Our problem is that we do not know from which state we arrive at j Now, we start calculating αt+1 (j) αt+1 (j) = P (x1, x2, ..., xt, xt+1, qt+1 = ωj|λ) = P (xt+1|x1, x2, ..., xt, qt+1 = ωj, λ) P (x1, x2, ..., xt, qt+1 = ωj|λ) 31 / 99
  • 77. We need to calculate First P (xt+1|x1, x2, ..., xt, qt+1 = ωj, λ) (18) Second P (x1, x2, ..., xt, qt+1 = ωj|λ) (19) 32 / 99
  • 79. Thus, we start with P (x1, x2, ..., xt, qt+1 = ωj|λ) We have that P (x1, x2, ..., xt, qt+1 = ωj|λ) = ∑_{i=1}^{N} P (x1, x2, ..., xt, qt = ωi, qt+1 = ωj|λ) This step follows from applying the marginal formula over all possible states at time t 33 / 99
  • 81. Now, P (x1, x2, ..., xt, qt = ωi, qt+1 = ωj|λ) Thus P (x1, x2, ..., xt, qt = ωi, qt+1 = ωj|λ) = P (qt+1 = ωj|x1, x2, ..., xt, qt = ωi, λ) P (x1, x2, ..., xt, qt = ωi|λ) = P (x1, x2, ..., xt, qt = ωi|λ) P (qt+1 = ωj|qt = ωi, λ) = αt (i) aij 34 / 99
  • 85. Thus, we have that P(x_1, x_2, ..., x_t, q_{t+1} = ω_j | λ) = Σ_{i=1}^{N} α_t(i) a_{ij} (20). What about the other term? P(x_{t+1} | x_1, x_2, ..., x_t, q_{t+1} = ω_j, λ) = P(x_{t+1} | q_{t+1} = ω_j, λ) = b_j(x_{t+1}) (21). Putting it all together: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_{ij}] b_j(x_{t+1}) (22) 35 / 99
  • 88. Finally, this holds for t = 1, 2, ..., T − 1 and 1 ≤ j ≤ N. Now, applying the marginalization idea to P(x_1, x_2, ..., x_T | λ): P(x_1, x_2, ..., x_T | λ) = Σ_{i=1}^{N} P(x_1, x_2, ..., x_T, q_T = ω_i | λ) = Σ_{i=1}^{N} α_T(i) (23) 36 / 99
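The forward recursion above can be sketched in a few lines. This is my own illustration, not code from the slides: the 2-state, 2-symbol model (`pi`, `A`, `B`) and the observation sequence are made-up toy values.

```python
# Forward procedure sketch: alpha[t, j] stores alpha_{t+1}(j) (0-indexed rows).
import numpy as np

def forward(pi, A, B, obs):
    """Return the alpha table and P(X | lambda). A is N x N, B is N x M."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # alpha_1(i) = pi_i b_i(x_1)
    for t in range(1, T):
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(x_{t+1})
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                # P(X|lambda) = sum_i alpha_T(i)

# Toy model (hypothetical numbers, purely for illustration).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
alpha, likelihood = forward(pi, A, B, [0, 1, 0])
```

A quick sanity check is to compare `likelihood` against the brute-force sum over all N^T state sequences, which the slides argue is intractable in general but is fine for T = 3, N = 2.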
  • 90. Complexity Given that we are required to compute α_t(j) for 1 ≤ t ≤ T and 1 ≤ j ≤ N, we can calculate this using a table (Dynamic Programming idea):
  t \ i |    1    |    2    | ··· |    N
    1   | α_1(1) | α_1(2) | ··· | α_1(N)
    2   | α_2(1) | α_2(2) (cost N each) | ··· | α_2(N)
    ⋮   |   ⋮    |   ⋮    |  ⋱  |   ⋮
    T   | α_T(1) | α_T(2) | ··· | α_T(N) 37 / 99
  • 92. Complexity We have: (1) inner loop, O(N); (2) outer loop, O(NT). Then the total is O(N²T). 38 / 99
  • 93. All this comes from the Lattice Structure: a lattice of length T over time and depth N over the structure of the probabilistic state machine. 39 / 99
  • 94. Notice the following: the forward method is enough to solve the first problem. However, for the second problem we need to define the backward variable, β_t(i) = P(x_{t+1}, x_{t+2}, ..., x_T | q_t = ω_i, λ). 40 / 99
  • 96. This can be solved in a similar way. Initialization: β_T(i) = 1, 1 ≤ i ≤ N (24). Induction: β_t(i) = Σ_{j=1}^{N} a_{ij} b_j(x_{t+1}) β_{t+1}(j) (25), for t = T − 1, T − 2, ..., 1 with 1 ≤ i ≤ N. 41 / 99
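The backward recursion can be sketched the same way. Again a toy illustration of mine (same hypothetical 2-state model as in the forward sketch), not the slides' own code; the consistency identity Σ_i π_i b_i(x_1) β_1(i) = P(X | λ) is a standard check.

```python
# Backward procedure sketch: beta[t, i] stores beta_{t+1}(i) (0-indexed rows).
import numpy as np

def backward(A, B, obs):
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))                       # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij b_j(x_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# Same made-up toy model as before.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]
beta = backward(A, B, obs)
# Consistency check: sum_i pi_i b_i(x_1) beta_1(i) = P(X | lambda)
likelihood = (pi * B[:, obs[0]] * beta[0]).sum()
```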
  • 98. Example of the backward process Example 42 / 99
  • 99. How? That is for the next midterm Please try it for yourselves!!! 43 / 99
  • 100. Here, we have several ways of solving the problem. Problem: given the observation sequence X and a model λ, how do we choose a corresponding state sequence Ω? For this, we define γ_t(i) = P(q_t = ω_i | X, λ) (26). Meaning: the probability of being in state ω_i at time t, given the observation sequence X and the model λ. 44 / 99
  • 103. Thus, we get
  γ_t(i) = P(x_1, ..., x_T, q_t = ω_i | λ) / P(x_1, ..., x_T | λ)
  = P(x_1, ..., x_t, q_t = ω_i, x_{t+1}, ..., x_T | λ) / P(x_1, ..., x_T | λ)
  = P(x_1, ..., x_t, x_{t+1}, ..., x_T | q_t = ω_i, λ) P(q_t = ω_i | λ) / P(x_1, ..., x_T | λ)
  = P(x_1, ..., x_t | q_t = ω_i, λ) P(x_{t+1}, ..., x_T | q_t = ω_i, λ) P(q_t = ω_i | λ) / P(x_1, ..., x_T | λ)
  = P(x_1, ..., x_t, q_t = ω_i | λ) P(x_{t+1}, ..., x_T | q_t = ω_i, λ) / P(x_1, ..., x_T | λ)
  = α_t(i) β_t(i) / Σ_{j=1}^{N} P(x_1, ..., x_T, q_t = ω_j | λ)
  = α_t(i) β_t(i) / Σ_{j=1}^{N} α_t(j) β_t(j) 45 / 99
  • 110. In addition, we have that the γ_t(i) form a probability distribution, because Σ_{i=1}^{N} γ_t(i) = 1 (27). Thus, we can use this to obtain the most likely state q_t at time t: q_t = argmax_{1≤i≤N} [γ_t(i)], for 1 ≤ t ≤ T (28). 46 / 99
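Combining the two recursions, γ_t(i) = α_t(i) β_t(i) / Σ_j α_t(j) β_t(j) can be computed directly. A minimal sketch, reusing the same hypothetical toy model as in the earlier sketches:

```python
# gamma_t(i) = alpha_t(i) beta_t(i) / sum_j alpha_t(j) beta_t(j), then pick
# q_t = argmax_i gamma_t(i) per time step.
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]
T, N = len(obs), 2

alpha = np.zeros((T, N))
beta = np.ones((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)   # each row now sums to 1
q_star = gamma.argmax(axis=1)               # most likely state at each t
```

Note that this per-time-step criterion can produce a state sequence that no transition allows, which is exactly why the Viterbi criterion of the next section is usually preferred.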
  • 112. Outline 1 Introduction A little about Markov Random Processes Setup for Our Problem 2 First Problem What do we need to calculate? How to solve it? Forward Procedure 3 Second Problem Dynamic Programming Viterbi Algorithm Final Viterbi Algorithm 4 Third Problem The most difficult Expectation Maximization of the Third Problem The Baum-Welch Algorithm 47 / 99
  • 113. What can we use? Dynamic Programming. Optimal Substructure: optimal solutions to sub-problems. Overlapping Sub-problems: use previous solutions to reduce complexity. Finally: reconstructing an optimal solution. 48 / 99
  • 116. Optimal Substructure Process: Show that a solution to the problem consists of making a choice, which leaves a sub-problem to solve. Given a problem, you suppose an optimal solution. Given this choice, you determine how to solve the sub-problems. Then, you show that the solution to each sub-problem is optimal, by contradiction. 49 / 99
  • 120. Possible Solutions One optimality criterion: it is possible to solve for the state sequence that maximizes the expected number of correct pairs of states (q_t, q_{t+1}). However, the most widely used criterion is to find the single best path, i.e., to maximize P(Ω | X, λ), which has the same maximizer as P(Ω, X | λ). 50 / 99
  • 122. Outline 1 Introduction A little about Markov Random Processes Setup for Our Problem 2 First Problem What do we need to calculate? How to solve it? Forward Procedure 3 Second Problem Dynamic Programming Viterbi Algorithm Final Viterbi Algorithm 4 Third Problem The most difficult Expectation Maximization of the Third Problem The Baum-Welch Algorithm 51 / 99
  • 123. Viterbi Algorithm Setup: We want the best single state sequence Ω = {ω_{i_1}, ω_{i_2}, ..., ω_{i_T}} for a given observation X. Thus, we define δ_t(i) = max_{q_1, q_2, ..., q_{t−1}} P[q_1, q_2, ..., q_t = i, x_1, x_2, ..., x_t | λ] (29) 52 / 99
  • 125. Use Induction We can do the following:
  δ_{t+1}(j) = max_{q_1, ..., q_t} P[q_1, ..., q_{t+1} = j, x_1, ..., x_{t+1} | λ]
  = max_i {P[q_{t+1} = j, x_{t+1} | q_1, ..., q_t = i, x_1, ..., x_t, λ] × P[q_1, ..., q_t = i, x_1, ..., x_t | λ]}
  = max_i {P[q_{t+1} = j, x_{t+1} | q_t = i, λ] × P[q_1, ..., q_t = i, x_1, ..., x_t | λ]}
  = max_i {P[q_{t+1} = j | q_t = i, λ] P[x_{t+1} | q_{t+1} = j, q_t = i, λ] δ_t(i)}
  = max_i {a_{ij} P[x_{t+1} | q_{t+1} = j, λ] δ_t(i)}
  = max_i {a_{ij} b_j(x_{t+1}) δ_t(i)}
  = max_i {a_{ij} δ_t(i)} b_j(x_{t+1}) 53 / 99
  • 131. Outline 1 Introduction A little about Markov Random Processes Setup for Our Problem 2 First Problem What do we need to calculate? How to solve it? Forward Procedure 3 Second Problem Dynamic Programming Viterbi Algorithm Final Viterbi Algorithm 4 Third Problem The most difficult Expectation Maximization of the Third Problem The Baum-Welch Algorithm 54 / 99
  • 132. Procedure
  Initialization: δ_1(i) = π_i b_i(x_1), 1 ≤ i ≤ N; array Ψ_1(i) = 0, 1 ≤ i ≤ N.
  Recursion: δ_t(j) = max_{1≤i≤N} {a_{ij} δ_{t−1}(i)} b_j(x_t), 2 ≤ t ≤ T and 1 ≤ j ≤ N; Ψ_t(j) = argmax_{1≤i≤N} {a_{ij} δ_{t−1}(i)}, 2 ≤ t ≤ T and 1 ≤ j ≤ N.
  Termination: P* = max_{1≤i≤N} [δ_T(i)]; q_T* = argmax_{1≤i≤N} [δ_T(i)]. 55 / 99
  • 138. Clearly Something Notable: all those recursions can be converted into an iterative bottom-up procedure. That I leave to you to figure out!!! 56 / 99
  • 140. Backtracking Path backtracking: q_t* = Ψ_{t+1}(q_{t+1}*), for t = T − 1, T − 2, ..., 1. 57 / 99
  • 141. Outline 1 Introduction A little about Markov Random Processes Setup for Our Problem 2 First Problem What do we need to calculate? How to solve it? Forward Procedure 3 Second Problem Dynamic Programming Viterbi Algorithm Final Viterbi Algorithm 4 Third Problem The most difficult Expectation Maximization of the Third Problem The Baum-Welch Algorithm 58 / 99
  • 142. The most difficult Problem: there is no analytical solution to the third problem!!! However, given a finite observation sequence as training data, we can choose λ = (A, B, π) such that P(O | λ) is maximized. For this, we have the Baum-Welch (or Forward-Backward) Algorithm. 59 / 99
  • 145. Outline 1 Introduction A little about Markov Random Processes Setup for Our Problem 2 First Problem What do we need to calculate? How to solve it? Forward Procedure 3 Second Problem Dynamic Programming Viterbi Algorithm Final Viterbi Algorithm 4 Third Problem The most difficult Expectation Maximization of the Third Problem The Baum-Welch Algorithm 60 / 99
  • 146. Another way to see this is to think in terms of Maximum Likelihood. Given our observed data sequence X = x_1, x_2, ..., x_T, the likelihood function is obtained from the joint distribution by marginalizing over the hidden state variables: P(X | λ) = Σ_Ω P(X, Ω | λ) (30). Problems: we cannot simply treat each summation over ω_i independently!!! We cannot perform all the summations explicitly, because this results in N^T terms. And again, a further difficulty is that the likelihood function represents a summation over the emission models for different settings of the latent variables. 61 / 99
  • 149. Outline 1 Introduction A little about Markov Random Processes Setup for Our Problem 2 First Problem What do we need to calculate? How to solve it? Forward Procedure 3 Second Problem Dynamic Programming Viterbi Algorithm Final Viterbi Algorithm 4 Third Problem The most difficult Expectation Maximization of the Third Problem The Baum-Welch Algorithm 62 / 99
  • 150. Initial Setup Let us consider discrete (categorical) HMMs of length T (each observation sequence is T observations long). (1) N is the number of hidden states. (2) M is the number of different observation symbols. (3) D observation sequences, X = X^(1), X^(2), ..., X^(D), such that X^(d) = x_1^(d), x_2^(d), ..., x_T^(d). (4) We assume each observation sequence is drawn iid. (5) Finally, we are given Ω = Ω^(1), Ω^(2), ..., Ω^(D), one sequence of hidden states per observation sequence, such that Ω^(d) = ω_1^(d), ω_2^(d), ..., ω_T^(d). 63 / 99
  • 155. Baum-Welch Algorithm Remember the EM Algorithm: (1) Compute Q(λ, λ_n) = Σ_Ω log[P(X, Ω | λ)] P(Ω | X, λ_n). (2) Set λ_{n+1} = argmax_λ Q(λ, λ_n). We can rewrite the previous iterations (given P(X, Ω) = P(X) P(Ω | X)) as
  argmax_λ Σ_Ω log[P(X, Ω | λ)] P(Ω | X, λ_n)
  = argmax_λ Σ_Ω log[P(X, Ω | λ)] P(Ω, X | λ_n) / P(X | λ_n)
  = argmax_λ Σ_Ω log[P(X, Ω | λ)] P(Ω, X | λ_n)
  = argmax_λ Q̂(λ, λ_n) 64 / 99
  • 160. After all, we know that the constant P(X | λ_n) does not depend on λ, so dropping it does not change the maximizer!!! Now, assuming we have D observation sequences:
  P(X, Ω | λ) = P(X^(1), ..., X^(D), Ω^(1), ..., Ω^(D) | λ)
  = Π_{d=1}^{D} P(x_1^(d), ..., x_T^(d), ω_1^(d), ..., ω_T^(d) | λ)
  = Π_{d=1}^{D} P(x_1^(d), ..., x_T^(d) | ω_1^(d), ..., ω_T^(d), λ) × P(ω_1^(d), ..., ω_T^(d) | λ) 65 / 99
  • 165. Thus, we have
  P(X, Ω | λ) = Π_{d=1}^{D} [Π_{t=1}^{T} P(x_t^(d) | ω_t^(d), λ)] [Π_{t=2}^{T} P(ω_t^(d) | ω_{t−1}^(d), λ)] P(ω_1^(d) | λ)
  = Π_{d=1}^{D} π_{q_1^(d)} b_{q_1^(d)}(x_1^(d)) Π_{t=2}^{T} a_{q_{t−1}^(d) q_t^(d)} b_{q_t^(d)}(x_t^(d)) 66 / 99
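The factorization above says the complete-data probability of one (X, Ω) pair is a simple product of one initial, T − 1 transition, and T emission factors. A sketch for a single sequence (D = 1), with the same made-up toy model and an arbitrary state path:

```python
# P(X, Omega | lambda) = pi_{q1} b_{q1}(x1) * prod_{t>=2} a_{q_{t-1} q_t} b_{q_t}(x_t)
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]       # observation symbols x_1..x_T
states = [0, 1, 0]    # one hypothetical hidden state path q_1..q_T

p = pi[states[0]] * B[states[0], obs[0]]
for t in range(1, len(obs)):
    p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
# p = 0.6*0.9 * (0.3*0.8) * (0.4*0.9) = 0.046656
```

Summing this quantity over all N^T paths recovers P(X | λ), which is the marginalization the forward procedure performs efficiently.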
  • 167. Now, taking the logarithm, we get
  log P(X, Ω | λ) = Σ_{d=1}^{D} [log π_{q_1^(d)} + Σ_{t=2}^{T} log a_{q_{t−1}^(d) q_t^(d)} + Σ_{t=1}^{T} log b_{q_t^(d)}(x_t^(d))] 67 / 99
  • 168. We plug this back into Q̂(λ, λ_n) and get
  Q̂(λ, λ_n) = Σ_Ω Σ_{d=1}^{D} log π_{q_1^(d)} P(Ω, X | λ_n) + Σ_Ω Σ_{d=1}^{D} Σ_{t=2}^{T} log a_{q_{t−1}^(d) q_t^(d)} P(Ω, X | λ_n) + Σ_Ω Σ_{d=1}^{D} Σ_{t=1}^{T} log b_{q_t^(d)}(x_t^(d)) P(Ω, X | λ_n) 68 / 99
  • 169. Ok, we have reintroduced P(Ω, X | λ_n). It looks like it will increase the complexity of our calculations. However, using marginalization, we will be able to remove the terms that are not necessary to calculate the required updates. Which updates? π_i, a_{ij} and b_j(k). 69 / 99
  • 172. Now, which Lagrange multipliers? One for the N initial-state probabilities in π: Σ_{i=1}^{N} π_i = 1 (31). One for each i: Σ_{j=1}^{N} a_{ij} = 1 (32). One for each i: Σ_{k=1}^{M} b_i(k) = 1 (33). Remember: b_i(k) = P(x_k | q_t = ω_i). 70 / 99
  • 175. Finally, we have the new Lagrangian
  L̂(λ, λ_n) = Q̂(λ, λ_n) − λ_π [Σ_{i=1}^{N} π_i − 1] − Σ_{i=1}^{N} λ_{a_i} [Σ_{j=1}^{N} a_{ij} − 1] − Σ_{i=1}^{N} λ_{b_i} [Σ_{k=1}^{M} b_i(k) − 1] 71 / 99
  • 176. Now, if we take the derivative with respect to π_i, we have that
  ∂L̂(λ, λ_n)/∂π_i = ∂/∂π_i [Σ_Ω Σ_{d=1}^{D} log π_{q_1^(d)} P(Ω, X | λ_n)] − λ_π
  = ∂/∂π_i [Σ_{j=1}^{N} Σ_{d=1}^{D} log π_j P(q_1^(d) = j, X | λ_n)] − λ_π
  = (1/π_i) Σ_{d=1}^{D} P(q_1^(d) = i, X | λ_n) − λ_π = 0
  Remark: the second step is simply the result of marginalizing out, for each d, all q_t^(d) with t ≠ 1 and all q_t^(d′) with d′ ≠ d. 72 / 99
  • 180. After all,
  $\sum_{\Omega} \sum_{d=1}^{D} \log \pi_{q_1^{(d)}} \, P(\Omega, X|\lambda^n) = \sum_{d=1}^{D} \sum_{j_1^1=1}^{N} \cdots \sum_{j_T^1=1}^{N} \cdots \sum_{j_1^D=1}^{N} \cdots \sum_{j_T^D=1}^{N} \log \pi_{q_1^{(d)}} \, P\!\left(q_1^{(1)} = j_1^1, \ldots, q_T^{(D)} = j_T^D, X \middle| \lambda^n\right) = \sum_{d=1}^{D} \sum_{j_1=1}^{N} \log \pi_{j_1} \, P(q_1^{(d)} = j_1, X|\lambda^n)$
  Thus we have $\pi_i = \frac{1}{\lambda_\pi} \sum_{d=1}^{D} P(q_1^{(d)} = i, X|\lambda^n)$ (34). 73 / 99
  • 184. Now, deriving with respect to $\lambda_\pi$, we have that $\frac{\partial \hat{L}(\lambda, \lambda^n)}{\partial \lambda_\pi} = -\left( \sum_{i=1}^{N} \pi_i - 1 \right) = 0$ (35). Thus, substituting into the previous equation: $\frac{1}{\lambda_\pi} \sum_{i=1}^{N} \sum_{d=1}^{D} P(q_1^{(d)} = i, X|\lambda^n) = 1$ (36). Then $\lambda_\pi = \sum_{i=1}^{N} \sum_{d=1}^{D} P(q_1^{(d)} = i, X|\lambda^n)$ (37). 74 / 99
  • 187. Plugging back, we get
  $\frac{1}{\pi_i} \sum_{d=1}^{D} P(q_1^{(d)} = i, X|\lambda^n) - \lambda_\pi = 0$
  $\Rightarrow \pi_i = \frac{\sum_{d=1}^{D} P(q_1^{(d)} = i, X|\lambda^n)}{\sum_{i=1}^{N} \sum_{d=1}^{D} P(q_1^{(d)} = i, X|\lambda^n)} = \frac{\sum_{d=1}^{D} P(q_1^{(d)} = i, X|\lambda^n)}{\sum_{d=1}^{D} P(X|\lambda^n)} = \frac{\sum_{d=1}^{D} P(X|\lambda^n) \, P(q_1^{(d)} = i|X, \lambda^n)}{D \, P(X|\lambda^n)}$ 75 / 99
  • 192. In addition, we can modify our $\gamma$: $\gamma_{t-1}^{(d)}(i) = P(q_{t-1}^{(d)} = i \mid X^{(d)}, \lambda^n)$ (38). Meaning: the probability of being in state $i$ at time $t-1$ given the observation sequence $X^{(d)}$ and the model $\lambda^n$. 76 / 99
  • 194. Thus we have
  $\pi_i = \frac{\sum_{d=1}^{D} P(X|\lambda^n) \, P(q_1^{(d)} = i|X, \lambda^n)}{D \, P(X|\lambda^n)} = \frac{1}{D} \sum_{d=1}^{D} P(q_1^{(d)} = i|X^{(d)}, \lambda^n) = \frac{1}{D} \sum_{d=1}^{D} \gamma_1^{(d)}(i)$
  Remark: this can be thought of as the expected frequency (number of times) of being in state $i$ at time $t = 1$. 77 / 99
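The $\pi$ update just derived can be sketched numerically. This is a minimal illustration, not code from the slides: `gammas` is a hypothetical list of $D$ arrays of shape $T \times N$ holding $\gamma_t^{(d)}(i)$, assumed to come from the forward-backward computation of the earlier sections.

```python
import numpy as np

# Sketch of Eq. (34)/(51): the new pi is the average, over the D
# training sequences, of the posterior of each state at t = 1.
def update_pi(gammas):
    # gammas[d][t, i] = P(q_t^(d) = i | X^(d), lambda_n)
    return np.mean([g[0] for g in gammas], axis=0)

# Toy check with D = 2 sequences and N = 2 states (numbers invented):
gammas = [np.array([[0.9, 0.1], [0.5, 0.5]]),
          np.array([[0.7, 0.3], [0.4, 0.6]])]
pi = update_pi(gammas)
print(pi)  # averages the two t = 1 rows; sums to 1 up to float error
```

Because each $\gamma_1^{(d)}$ row is already a probability distribution, the averaged $\pi$ is automatically normalized.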
  • 197. Now, for $a_{ij}$ we follow a similar process:
  $\frac{\partial \hat{L}(\lambda, \lambda^n)}{\partial a_{ij}} = \frac{\partial}{\partial a_{ij}} \left[ \sum_{\Omega} \sum_{d=1}^{D} \sum_{t=2}^{T} \log a_{q_{t-1}^{(d)} q_t^{(d)}} \, P(\Omega, X|\lambda^n) \right] - \lambda_{a_i} = \frac{\partial}{\partial a_{ij}} \left[ \sum_{h=1}^{N} \sum_{k=1}^{N} \sum_{d=1}^{D} \sum_{t=2}^{T} \log a_{hk} \, P(q_{t-1}^{(d)} = h, q_t^{(d)} = k, X|\lambda^n) \right] - \lambda_{a_i} = \frac{1}{a_{ij}} \sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i, q_t^{(d)} = j, X|\lambda^n) - \lambda_{a_i} = 0$
  In a similar way, $\frac{\partial \hat{L}(\lambda, \lambda^n)}{\partial \lambda_{a_i}} = -\left( \sum_{j=1}^{N} a_{ij} - 1 \right) = 0$ (39). 78 / 99
  • 201. Then we have that $a_{ij} = \frac{1}{\lambda_{a_i}} \sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i, q_t^{(d)} = j, X|\lambda^n)$. Then, using $\sum_{j=1}^{N} a_{ij} = 1$: $\lambda_{a_i} = \sum_{j=1}^{N} \sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i, q_t^{(d)} = j, X|\lambda^n)$ (40). 79 / 99
  • 203. Re-expressing, we get
  $a_{ij} = \frac{\sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i, q_t^{(d)} = j, X|\lambda^n)}{\sum_{j=1}^{N} \sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i, q_t^{(d)} = j, X|\lambda^n)}$
  Marginalizing the denominator:
  $a_{ij} = \frac{\sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i, q_t^{(d)} = j, X|\lambda^n)}{\sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i, X|\lambda^n)}$ 80 / 99
  • 205. Next, we have
  $a_{ij} = \frac{\sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i, q_t^{(d)} = j|X, \lambda^n) \, P(X|\lambda^n)}{\sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i|X, \lambda^n) \, P(X|\lambda^n)} = \frac{\sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i, q_t^{(d)} = j|X^{(d)}, \lambda^n)}{\sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i|X^{(d)}, \lambda^n)}$ 81 / 99
  • 206. Now, we are looking for a more compact version. We define the probability of being in state $i$ at time $t-1$ and state $j$ at time $t$: $\xi_{t-1}^{(d)}(i, j) = P(q_{t-1}^{(d)} = i, q_t^{(d)} = j \mid X^{(d)}, \lambda)$ (41). 82 / 99
  • 207. Using Bayes, we have for $\xi_{t-1}^{(d)}(i,j)$:
  $\xi_{t-1}^{(d)}(i,j) = \frac{P(x_1^{(d)}, \ldots, x_{t-1}^{(d)}, q_{t-1}^{(d)} = i, q_t^{(d)} = j, x_t^{(d)}, \ldots, x_T^{(d)} \mid \lambda)}{P(X^{(d)}|\lambda)}$
  $= \frac{P(x_1^{(d)}, \ldots, x_{t-1}^{(d)}, q_{t-1}^{(d)} = i \mid q_t^{(d)} = j, x_t^{(d)}, \ldots, x_T^{(d)}, \lambda) \, P(q_t^{(d)} = j, x_t^{(d)}, \ldots, x_T^{(d)} \mid \lambda)}{P(X^{(d)}|\lambda)}$
  $= \frac{P(x_1^{(d)}, \ldots, x_{t-1}^{(d)}, q_{t-1}^{(d)} = i \mid q_t^{(d)} = j, \lambda) \, P(x_{t+1}^{(d)}, \ldots, x_T^{(d)} \mid q_t^{(d)} = j, x_t^{(d)}, \lambda) \, P(q_t^{(d)} = j, x_t^{(d)} \mid \lambda)}{P(X^{(d)}|\lambda)}$
  $= \frac{P(x_1^{(d)}, \ldots, x_{t-1}^{(d)}, q_{t-1}^{(d)} = i \mid q_t^{(d)} = j, \lambda) \, P(x_{t+1}^{(d)}, \ldots, x_T^{(d)} \mid q_t^{(d)} = j, \lambda) \, P(x_t^{(d)} \mid q_t^{(d)} = j, \lambda) \, P(q_t^{(d)} = j \mid \lambda)}{P(X^{(d)}|\lambda)}$
  $= \frac{P(x_1^{(d)}, \ldots, x_{t-1}^{(d)}, q_{t-1}^{(d)} = i \mid q_t^{(d)} = j, \lambda) \, P(q_t^{(d)} = j \mid \lambda) \, \beta_t^{(d)}(j) \, b_j(x_t^{(d)})}{P(X^{(d)}|\lambda)} = \ldots$ 83 / 99
  • 212. Again Bayes. Thus
  $\xi_{t-1}^{(d)}(i,j) = \frac{P(x_1^{(d)}, \ldots, x_{t-1}^{(d)}, q_{t-1}^{(d)} = i, q_t^{(d)} = j \mid \lambda) \, \beta_t^{(d)}(j) \, b_j(x_t^{(d)})}{P(X^{(d)}|\lambda)}$
  $= \frac{P(q_t^{(d)} = j \mid x_1^{(d)}, \ldots, x_{t-1}^{(d)}, q_{t-1}^{(d)} = i, \lambda) \, P(x_1^{(d)}, \ldots, x_{t-1}^{(d)}, q_{t-1}^{(d)} = i \mid \lambda) \, \beta_t^{(d)}(j) \, b_j(x_t^{(d)})}{P(X^{(d)}|\lambda)}$
  $= \frac{P(q_t^{(d)} = j \mid q_{t-1}^{(d)} = i, \lambda) \, \alpha_{t-1}^{(d)}(i) \, \beta_t^{(d)}(j) \, b_j(x_t^{(d)})}{P(X^{(d)}|\lambda)} = \frac{\alpha_{t-1}^{(d)}(i) \, a_{ij} \, \beta_t^{(d)}(j) \, b_j(x_t^{(d)})}{\sum_{k=1}^{N} \sum_{h=1}^{N} \alpha_{t-1}^{(d)}(k) \, a_{kh} \, \beta_t^{(d)}(h) \, b_h(x_t^{(d)})}$ 84 / 99
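The closed form for $\xi_{t-1}(i,j)$ just derived can be evaluated directly with array broadcasting. This is a minimal sketch: the function name `xi_at` and the random toy values are illustrative, and `alpha`/`beta` are assumed to come from the forward and backward procedures of the earlier problems.

```python
import numpy as np

# Sketch of the closed form: xi_{t-1}(i, j) is proportional to
#   alpha_{t-1}(i) * a_ij * b_j(x_t) * beta_t(j),
# normalized over all (i, j) pairs.
def xi_at(alpha, beta, A, B, x, t):
    # numerator as an N x N array via broadcasting
    num = alpha[t - 1][:, None] * A * B[:, x[t]][None, :] * beta[t][None, :]
    return num / num.sum()

rng = np.random.default_rng(0)
A = rng.dirichlet(np.ones(2), size=2)      # toy row-stochastic transitions
B = rng.dirichlet(np.ones(3), size=2)      # toy emissions, M = 3 symbols
alpha = rng.random((4, 2))                 # placeholder forward values
beta = rng.random((4, 2))                  # placeholder backward values
xi = xi_at(alpha, beta, A, B, [0, 2, 1, 0], t=2)
print(np.isclose(xi.sum(), 1.0))  # True: xi is a distribution over (i, j)
```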
  • 216. This can be seen graphically, as illustrated by the following figure. [figure not reproduced in this transcript: the trellis showing the transition $i \to j$ between times $t-1$ and $t$] 85 / 99
  • 217. In addition, given the definition of $\gamma_{t-1}^{(d)}(i)$:
  $\gamma_{t-1}^{(d)}(i) = \frac{\alpha_{t-1}^{(d)}(i) \, \beta_{t-1}^{(d)}(i)}{\sum_{j=1}^{N} \alpha_{t-1}^{(d)}(j) \, \beta_{t-1}^{(d)}(j)}$ (42)
  Summing over $j$:
  $\sum_{j=1}^{N} \xi_{t-1}^{(d)}(i,j) = \sum_{j=1}^{N} P(q_{t-1}^{(d)} = i, q_t^{(d)} = j \mid X^{(d)}, \lambda) = P(q_{t-1}^{(d)} = i \mid X^{(d)}, \lambda) = \frac{P(x_1^{(d)}, \ldots, x_{t-1}^{(d)}, q_{t-1}^{(d)} = i, x_t^{(d)}, \ldots, x_T^{(d)} \mid \lambda)}{P(X^{(d)}|\lambda)}$ 86 / 99
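Eq. (42) can be computed for every time step at once. A minimal sketch, assuming `alpha` and `beta` (shape $T \times N$) are the unscaled outputs of the forward and backward passes; the random toy values and the function name are illustrative.

```python
import numpy as np

# Sketch of Eq. (42): gamma_t(i) = alpha_t(i) beta_t(i) /
# sum_j alpha_t(j) beta_t(j), vectorized over all t.
def gamma_from_alpha_beta(alpha, beta):
    ab = alpha * beta
    return ab / ab.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
alpha = rng.random((5, 3))   # placeholder forward values, T = 5, N = 3
beta = rng.random((5, 3))    # placeholder backward values
gamma = gamma_from_alpha_beta(alpha, beta)
print(gamma.shape)                           # (5, 3)
print(np.allclose(gamma.sum(axis=1), 1.0))   # True: each row is a distribution
```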
  • 221. Then we have
  $\frac{P(x_1^{(d)}, \ldots, x_{t-1}^{(d)}, q_{t-1}^{(d)} = i, x_t^{(d)}, \ldots, x_T^{(d)} \mid \lambda)}{P(X^{(d)}|\lambda)} = \frac{\alpha_{t-1}^{(d)}(i) \, \beta_{t-1}^{(d)}(i)}{\sum_{i=1}^{N} P(x_1, \ldots, x_T, q_{t-1} = \omega_i \mid \lambda)} = \gamma_{t-1}^{(d)}(i)$ 87 / 99
  • 224. We can also think about the expected values. The following sum can be thought of as (1) the expected (over time) number of times that state $\omega_i$ is visited, or (2) the number of transitions made from state $\omega_i$: $\sum_{t=2}^{T} \gamma_{t-1}^{(d)}(i)$ (43). 88 / 99
  • 226. Also, the expected value for $\xi_{t-1}^{(d)}(i,j)$: the following sum can be thought of as the expected number of transitions from $i$ to $j$: $\sum_{t=2}^{T} \xi_{t-1}^{(d)}(i,j)$ (44). 89 / 99
  • 228. Remark (Important): Baum et al. proved that the maximization of $Q(\lambda, \bar{\lambda})$ leads to increased likelihood: $\max_{\bar{\lambda}} Q(\lambda, \bar{\lambda}) \Rightarrow P(X|\bar{\lambda}) \geq P(X|\lambda)$ (45). For details, please look back at our slides on EM!!! 90 / 99
  • 231. We can rewrite $a_{ij}$. Thus
  $a_{ij} = \frac{\sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i, q_t^{(d)} = j|X^{(d)}, \lambda^n)}{\sum_{d=1}^{D} \sum_{t=2}^{T} P(q_{t-1}^{(d)} = i|X^{(d)}, \lambda^n)} = \frac{\sum_{d=1}^{D} \sum_{t=2}^{T} \xi_{t-1}^{(d)}(i,j)}{\sum_{d=1}^{D} \sum_{t=2}^{T} \gamma_{t-1}^{(d)}(i)}$
  Remark: this can be seen as (expected number of transitions from state $i$ to $j$) / (expected number of transitions made from state $i$). 91 / 99
  • 232. Now, the terms $b_i(k)$. We have that
  $\frac{\partial \hat{L}(\lambda, \lambda^n)}{\partial b_i(k)} = \frac{\partial}{\partial b_i(k)} \left[ \sum_{\Omega} \sum_{d=1}^{D} \sum_{t=1}^{T} \log b_{q_t^{(d)}}(x_t^{(d)}) \, P(\Omega, X|\lambda^n) \right] - \lambda_{b_i} = \frac{\partial}{\partial b_i(k)} \left[ \sum_{h=1}^{N} \sum_{d=1}^{D} \sum_{t=1}^{T} \log b_h(x_t^{(d)}) \, P(q_t^{(d)} = h, X|\lambda^n) \right] - \lambda_{b_i}$
  Now, we have a problem: the term $b_i(k)$ can be different from the term $b_i(x_t^{(d)})$. We can use an indicator to fix this: $I(x_t^{(d)} = k) = 1$ when $x_t^{(d)} = k$, and $0$ otherwise (46). 92 / 99
  • 235. Thus, using our previous idea:
  $\frac{\partial \hat{L}(\lambda, \lambda^n)}{\partial b_i(k)} = \frac{\sum_{d=1}^{D} \sum_{t=1}^{T} P(q_t^{(d)} = i, X|\lambda^n) \, I(x_t^{(d)} = k)}{b_i(k)} - \lambda_{b_i}$
  In addition, $\frac{\partial \hat{L}(\lambda, \lambda^n)}{\partial \lambda_{b_i}} = -\left( \sum_{k=1}^{M} b_i(k) - 1 \right)$ (47). 93 / 99
  • 237. Thus we have that $b_i(k) = \frac{1}{\lambda_{b_i}} \sum_{d=1}^{D} \sum_{t=1}^{T} P(q_t^{(d)} = i, X|\lambda^n) \, I(x_t^{(d)} = k)$ (48). Thus $\lambda_{b_i} = \sum_{k=1}^{M} \sum_{d=1}^{D} \sum_{t=1}^{T} P(q_t^{(d)} = i, X|\lambda^n) \, I(x_t^{(d)} = k)$ (49). 94 / 99
  • 239. We have then the following result:
  $b_i(k) = \frac{\sum_{d=1}^{D} \sum_{t=1}^{T} P(q_t^{(d)} = i, X|\lambda^n) \, I(x_t^{(d)} = k)}{\sum_{k=1}^{M} \sum_{d=1}^{D} \sum_{t=1}^{T} P(q_t^{(d)} = i, X|\lambda^n) \, I(x_t^{(d)} = k)} = \frac{\sum_{d=1}^{D} \sum_{t=1}^{T} P(q_t^{(d)} = i, X|\lambda^n) \, I(x_t^{(d)} = k)}{\sum_{d=1}^{D} \sum_{t=1}^{T} P(q_t^{(d)} = i, X|\lambda^n)} = \frac{\sum_{d=1}^{D} \sum_{t=1}^{T} P(q_t^{(d)} = i|X^{(d)}, \lambda^n) \, I(x_t^{(d)} = k)}{\sum_{d=1}^{D} \sum_{t=1}^{T} P(q_t^{(d)} = i|X^{(d)}, \lambda^n)}$ 95 / 99
  • 242. Using Rabiner’s notation, we have that
  $b_i(k) = \frac{\sum_{d=1}^{D} \sum_{t=1}^{T} \gamma_t^{(d)}(i) \, I(x_t^{(d)} = k)}{\sum_{d=1}^{D} \sum_{t=1}^{T} \gamma_t^{(d)}(i)}$ (50)
  which can be seen as (the expected number of times in state $i$ and observing symbol $x_k$) / (the expected number of times in state $i$). 96 / 99
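The emission update in Eq. (50) is a weighted count with the indicator picking out the observed symbol. A minimal sketch, assuming `gammas` is a hypothetical list of $D$ arrays of shape $T \times N$ from forward-backward and `xs` the matching list of symbol sequences; the toy numbers are invented.

```python
import numpy as np

# Sketch of Eq. (50): expected count of emitting symbol k while in
# state i, divided by the expected time spent in state i.
def update_B(gammas, xs, M):
    N = gammas[0].shape[1]
    num = np.zeros((N, M))
    den = np.zeros(N)
    for g, x in zip(gammas, xs):
        for t, k in enumerate(x):
            num[:, k] += g[t]   # I(x_t = k) selects column k
            den += g[t]
    return num / den[:, None]

# One toy sequence, N = 2 states, M = 2 symbols:
gammas = [np.array([[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]])]
B = update_B(gammas, [[0, 1, 0]], M=2)
print(np.allclose(B.sum(axis=1), 1.0))  # True: rows are distributions
```

Note that each row of the result sums to 1 by construction, which is the normalization property emphasized on the later slides.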
  • 244. Thus, the re-estimation. First: $\pi_i^{(n+1)} = \frac{1}{D} \sum_{d=1}^{D} \gamma_1^{(d)}(i)$ (51). Second: $a_{ij}^{(n+1)} = \frac{\sum_{d=1}^{D} \sum_{t=2}^{T} \xi_{t-1}^{(d)}(i,j)}{\sum_{d=1}^{D} \sum_{t=2}^{T} \gamma_{t-1}^{(d)}(i)}$ (52). Third: $b_i^{(n+1)}(k) = \frac{\sum_{d=1}^{D} \sum_{t=1}^{T} \gamma_t^{(d)}(i) \, I(x_t^{(d)} = k)}{\sum_{d=1}^{D} \sum_{t=1}^{T} \gamma_t^{(d)}(i)}$ (53). 97 / 99
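The three re-estimation formulas can be assembled into one full Baum-Welch step. A minimal sketch for a single sequence ($D = 1$) using unscaled forward/backward passes; in practice one scales $\alpha$ and $\beta$ to avoid underflow. The function names and toy parameters are illustrative, not from the slides.

```python
import numpy as np

def forward(pi, A, B, x):
    # Unscaled forward pass: alpha[t, i] = P(x_1..x_t, q_t = i | lambda)
    T, N = len(x), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha

def backward(A, B, x):
    # Unscaled backward pass: beta[t, i] = P(x_{t+1}..x_T | q_t = i, lambda)
    T, N = len(x), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, x):
    alpha, beta = forward(pi, A, B, x), backward(A, B, x)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi[t-1, i, j] for t = 2..T, normalized per time step
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, x[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    new_pi = gamma[0]                                        # Eq. (51)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # Eq. (52)
    new_B = np.zeros_like(B)
    for t, k in enumerate(x):                                # Eq. (53)
        new_B[:, k] += gamma[t]
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B

# Toy model and sequence (numbers invented):
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
x = [0, 1, 1, 0]
new_pi, new_A, new_B = baum_welch_step(pi, A, B, x)
print(np.allclose(new_A.sum(axis=1), 1.0))  # True: rows stay stochastic
```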
  • 247. Something nice: an important aspect of the re-estimation procedure is that $\sum_{i=1}^{N} \pi_i^{(n+1)} = 1$, $\sum_{j=1}^{N} a_{ij}^{(n+1)} = 1$ for $1 \leq i \leq N$, and $\sum_{k=1}^{M} b_j^{(n+1)}(k) = 1$ for $1 \leq j \leq N$. These constraints are satisfied automatically at each iteration. 98 / 99
  • 249. Once these probabilities are ready, we can use Naïve Bayes: $P(\Omega_i|X) = \frac{P(\Omega_i, X)}{P(X)} > \frac{P(\Omega_j, X)}{P(X)} = P(\Omega_j|X) \quad \forall i \neq j$ (54). Then the rule looks like $P(X|\Omega_i) \, P(\Omega_i) > P(X|\Omega_j) \, P(\Omega_j) \quad \forall i \neq j$ (55). 99 / 99
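The classification rule above means: train one HMM $\lambda_c$ per class $\Omega_c$, score a new sequence with the forward procedure, and pick the class maximizing $P(X|\lambda_c)\,P(\Omega_c)$. A minimal sketch; the two toy models below are invented placeholders, one biased toward symbol 0 and one toward symbol 1.

```python
import numpy as np

def forward_likelihood(pi, A, B, x):
    # Forward procedure from the first problem: returns P(X | lambda)
    alpha = pi * B[:, x[0]]
    for sym in x[1:]:
        alpha = (alpha @ A) * B[:, sym]
    return alpha.sum()

def classify(models, priors, x):
    # arg max over classes of P(X | lambda_c) * P(Omega_c)
    scores = [forward_likelihood(*m, x) * p for m, p in zip(models, priors)]
    return int(np.argmax(scores))

model0 = (np.array([0.9, 0.1]),
          np.array([[0.9, 0.1], [0.2, 0.8]]),
          np.array([[0.9, 0.1], [0.3, 0.7]]))   # tends to emit symbol 0
model1 = (np.array([0.1, 0.9]),
          np.array([[0.8, 0.2], [0.1, 0.9]]),
          np.array([[0.6, 0.4], [0.1, 0.9]]))   # tends to emit symbol 1
print(classify([model0, model1], [0.5, 0.5], [0, 0, 0, 1]))  # 0
print(classify([model0, model1], [0.5, 0.5], [1, 1, 1, 1]))  # 1
```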