Introduction to Machine Learning
Expectation Maximization
Andres Mendez-Vazquez
September 16, 2017
Outline
1 Introduction
Maximum-Likelihood
Expectation Maximization
Examples of Applications of EM
2 Incomplete Data
Introduction
Using the Expected Value
Analogy
3 Derivation of the EM-Algorithm
Hidden Features
Proving Concavity
Using the Concave Functions for Approximation
From The Concave Function to the EM
The Final Algorithm
Notes and Convergence of EM
4 Finding Maximum Likelihood Mixture Densities
The Beginning of The Process
Bayes’ Rule for the components
Mixing Parameters
Maximizing Q using Lagrange Multipliers
Lagrange Multipliers
In Our Case
Example on Mixture of Gaussian Distributions
The EM Algorithm
Maximum-Likelihood
We have a density function p (x|Θ)
Assume that we have a data set of size N, X = {x1, x2, ..., xN }.
This data is known as the evidence.
We assume in addition that
The vectors are independent and identically distributed (i.i.d.) with distribution p under the parameter set Θ.
What Can We Do With The Evidence?
We may use Bayes' Rule to estimate the parameters Θ
p (Θ|X) = p (X|Θ) p (Θ) / p (X)   (1)
Or, given a new observation ˜x,
p (˜x|X)   (2)
i.e. to compute the probability of the new observation being supported by the evidence X.
Thus
The former represents parameter estimation and the latter data prediction.
Focusing First on the Estimation of the Parameters Θ
We can interpret Bayes' Rule
p (Θ|X) = p (X|Θ) p (Θ) / p (X)   (3)
Interpreted as
posterior = (likelihood × prior) / evidence   (4)
Thus, we want
likelihood = p (X|Θ)
What we want...
We want to maximize the likelihood as a function of Θ
Maximum-Likelihood
We have
p (x1, x2, ..., xN |Θ) = ∏_{i=1}^{N} p (xi|Θ)   (5)
Also known as the likelihood function.
Because multiplying many quantities p (xi|Θ) ≤ 1 is numerically problematic, we work with the log-likelihood
L (Θ|X) = log ∏_{i=1}^{N} p (xi|Θ) = Σ_{i=1}^{N} log p (xi|Θ)   (6)
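To make Eqs. (5)-(6) concrete, here is a minimal Python sketch (my own, not from the slides) for an assumed univariate Gaussian model, where the maximizer is available in closed form:

```python
import numpy as np

# Sketch of Eq. (6): the log-likelihood for an assumed univariate
# Gaussian model p(x | Theta), with Theta = (mu, sigma).
def log_likelihood(X, mu, sigma):
    # sum_i log p(x_i | Theta) for a Gaussian density
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (X - mu) ** 2 / (2 * sigma ** 2))

# For i.i.d. Gaussian data the maximizer has a closed form:
# mu* = sample mean, sigma* = sample standard deviation.
X = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)
mu_hat, sigma_hat = X.mean(), X.std()
print(mu_hat, sigma_hat, log_likelihood(X, mu_hat, sigma_hat))
```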
Maximum-Likelihood
We want to find a Θ∗
Θ∗ = argmax_Θ L (Θ|X)   (7)
The classic method
∂L (Θ|X) / ∂θi = 0, ∀θi ∈ Θ   (8)
What happens if we have incomplete data?
Data could have been split into
1 X = observed data or incomplete data
2 Y = unobserved data
For this type of problem
We have the famous Expectation Maximization (EM) algorithm
The Expectation Maximization
The EM algorithm
It was first developed by Dempster et al. (1977).
Its popularity comes from the fact that
It can estimate an underlying distribution when data is incomplete or has missing values.
Two main applications
1 When missing values exist in the data.
2 When a likelihood function can be simplified by assuming extra parameters that are missing or hidden.
Clustering
Given a series of data sets
Given the fact that Radial Gaussian Functions are Universal Approximators:
The samples {x1, x2, ..., xN } are the visible variables.
The Gaussian distributions generating each of the samples are the hidden variables.
Then, we model the clusters as a mixture of Gaussians.
Natural Language Processing
Unsupervised induction of probabilistic context-free grammars
Given a sequence of words o1, o2, o3, ... and a normalized Context-Free Grammar,
we want to know the probability of each rule, P (i → jk).
Thus
Here you have two kinds of variables:
The Visible Ones: The sequence of words.
The Hidden Ones: The rules that produce the possible sequences oi → oj.
Natural Language Processing
Baum-Welch Algorithm for Hidden Markov Models
[Figure: an HMM with hidden-state nodes generating the visible data]
Here
Hidden Variables: The circular nodes producing the data.
Visible Variables: The square nodes representing the samples.
Incomplete Data
We assume the following
Two parts of data
1 X = observed data or incomplete data
2 Y = unobserved data
Thus
Z = (X, Y) = Complete Data   (9)
Thus, we have the following probability
p (z|Θ) = p (x, y|Θ) = p (y|x, Θ) p (x|Θ)   (10)
New Likelihood Function
The New Likelihood Function
L (Θ|Z) = L (Θ|X, Y) = p (X, Y|Θ)   (11)
Note: This is the complete-data likelihood.
Thus, we have
L (Θ|X, Y) = p (X, Y|Θ) = p (Y|X, Θ) p (X|Θ)   (12)
Did you notice?
p (X|Θ) is the likelihood of the observed data.
p (Y|X, Θ) is the likelihood of the unobserved data given the observed data!!!
Rewriting
This can be rewritten as
L (Θ|X, Y) = hX,Θ (Y)   (13)
This basically signifies that X and Θ are constant and the only random part is Y.
In addition
L (Θ|X)   (14)
is known as the incomplete-data likelihood function.
Thus
We can connect the incomplete-data and complete-data equations by doing the following
L (Θ|X) = p (X|Θ)
        = Σ_Y p (X, Y|Θ)
        = Σ_Y p (Y|X, Θ) p (X|Θ)
        = ∏_{i=1}^{N} p (xi|Θ) Σ_Y p (Y|X, Θ)
Remarks
Problems
Normally, it is almost impossible to obtain a closed analytical solution for
the previous equation.
However
We can use the expected value of log p (X, Y|Θ), which allows us to find
an iterative procedure to approximate the solution.
The function we would like to have
The Q function
We want an estimation of the complete-data log-likelihood
log p (X, Y|Θ)   (15)
based on the information provided by X and Θn−1, where Θn−1 is the set of parameters estimated at the previous step.
Think about the following: if we want to remove Y,
∫ [log p (X, Y|Θ)] p (Y|X, Θn−1) dY   (16)
Remark: We integrate out Y - actually, this is the expected value of log p (X, Y|Θ).
Use the Expected Value
Then, we want an iterative method to guess Θ from Θn−1
Q (Θ, Θn−1) = E [log p (X, Y|Θ) |X, Θn−1]   (17)
Take into account that
1 X and Θn−1 are taken as constants.
2 Θ is a normal variable that we wish to adjust.
3 Y is a random variable governed by the distribution p (Y|X, Θn−1), the marginal distribution of the missing data.
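As a small illustration (my own sketch, not the slides' code), Eq. (17) for a single observation x with a discrete hidden label y is just a posterior-weighted sum of complete-data log-likelihoods; `complete_loglik` and `p_y_given_x` are assumed inputs:

```python
# Sketch of Q(Theta, Theta_{n-1}) for one observation x and a discrete
# hidden label y in {0, ..., M-1}. `p_y_given_x[y]` holds p(y | x, Theta_{n-1})
# and `complete_loglik(x, y, theta)` returns log p(x, y | theta).
def Q(x, theta, p_y_given_x, complete_loglik):
    # E[ log p(x, y | theta) | x, Theta_{n-1} ]
    return sum(p_y_given_x[y] * complete_loglik(x, y, theta)
               for y in range(len(p_y_given_x)))
```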
Another Interpretation
Given the previous information
E [log p (X, Y|Θ) |X, Θn−1] = ∫_{Y∈Υ} log p (X, Y|Θ) p (Y|X, Θn−1) dY, where Υ is the space of values of Y.
Something Notable
1 In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameters Θn−1.
2 In the worst of cases, this density might be very hard to obtain.
Actually, we use
p (Y, X|Θn−1) = p (Y|X, Θn−1) p (X|Θn−1)   (18)
which does not depend on Θ.
Back to the Q function
The intuition
We have the following analogy. Consider a function h (θ, Y ), where
θ is a constant
Y ∼ pY (y) is a random variable with distribution pY (y).
Thus, if Y is a discrete random variable,
q (θ) = EY [h (θ, Y )] = Σ_y h (θ, y) pY (y)   (19)
Why E-step!!!
From here the name
This is basically the E-step.
The second step
It tries to maximize the Q function
Θn = argmax_Θ Q (Θ, Θn−1)   (20)
Derivation of the EM-Algorithm
The likelihood function we are going to use
Let X be a random vector which results from a parametrized family:
L (Θ) = ln P (X|Θ)   (21)
Note: ln (x) is a strictly increasing function.
We wish to compute Θ
Based on an estimate Θn (after the nth iteration) such that L (Θ) > L (Θn).
Or, equivalently, the maximization of the difference
L (Θ) − L (Θn) = ln P (X|Θ) − ln P (X|Θn)   (22)
Introducing the Hidden Features
Given that the hidden random vector Y exists with values y
P (X|Θ) = Σ_y P (X|y, Θ) P (y|Θ)   (23)
Thus, using the difference L (Θ) − L (Θn),
L (Θ) − L (Θn) = ln Σ_y P (X|y, Θ) P (y|Θ) − ln P (X|Θn)   (24)
Here, we introduce some concepts of convexity
For Convexity
Theorem (Jensen's inequality)
Let f be a convex function defined on an interval I. If x1, x2, ..., xn ∈ I and λ1, λ2, ..., λn ≥ 0 with Σ_{i=1}^{n} λi = 1, then
f (Σ_{i=1}^{n} λi xi) ≤ Σ_{i=1}^{n} λi f (xi)   (25)
Proof:
For n = 1
We have the trivial case
For n = 2
The convexity definition.
Now the inductive hypothesis
We assume that the theorem is true for some n.
Now, we have
The following linear combination for the λi
f (Σ_{i=1}^{n+1} λi xi) = f (λ_{n+1} x_{n+1} + Σ_{i=1}^{n} λi xi)
                        = f (λ_{n+1} x_{n+1} + (1 − λ_{n+1}) · [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λi xi)
                        ≤ λ_{n+1} f (x_{n+1}) + (1 − λ_{n+1}) f ([1/(1 − λ_{n+1})] Σ_{i=1}^{n} λi xi)
Did you notice?
Something Notable
Σ_{i=1}^{n+1} λi = 1
Thus
Σ_{i=1}^{n} λi = 1 − λ_{n+1}
Finally
[1/(1 − λ_{n+1})] Σ_{i=1}^{n} λi = 1
Now
We have that
f (Σ_{i=1}^{n+1} λi xi) ≤ λ_{n+1} f (x_{n+1}) + (1 − λ_{n+1}) f ([1/(1 − λ_{n+1})] Σ_{i=1}^{n} λi xi)
                        ≤ λ_{n+1} f (x_{n+1}) + (1 − λ_{n+1}) · [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λi f (xi)
                        = λ_{n+1} f (x_{n+1}) + Σ_{i=1}^{n} λi f (xi) = Σ_{i=1}^{n+1} λi f (xi)   Q.E.D.
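A quick numeric sanity check of the inequality just proved (my own sketch, not from the slides), using the convex function f(x) = exp(x) and random weights that sum to one:

```python
import numpy as np

# Numeric check of Jensen's inequality (Eq. 25) for convex f(x) = exp(x).
rng = np.random.default_rng(1)
x = rng.normal(size=10)
lam = rng.random(10)
lam /= lam.sum()                    # enforce sum(lambda_i) = 1

lhs = np.exp(np.dot(lam, x))        # f(sum_i lambda_i x_i)
rhs = np.dot(lam, np.exp(x))        # sum_i lambda_i f(x_i)
assert lhs <= rhs                   # Jensen: lhs <= rhs for convex f
print(lhs, rhs)
```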
Thus, for concave functions
It is possible to show that
Given that ln (x) is a concave function:
ln (Σ_{i=1}^{n} λi xi) ≥ Σ_{i=1}^{n} λi ln (xi)
If we take into consideration
Assume that λi = P (y|X, Θn). We know that
1 P (y|X, Θn) ≥ 0
2 Σ_y P (y|X, Θn) = 1
We have
First
L (Θ) − L (Θn) = ln Σ_y P (X|y, Θ) P (y|Θ) − ln P (X|Θn)
               = ln Σ_y P (X|y, Θ) P (y|Θ) · [P (y|X, Θn) / P (y|X, Θn)] − ln P (X|Θn)
               = ln Σ_y P (y|X, Θn) · [P (X|y, Θ) P (y|Θ) / P (y|X, Θn)] − ln P (X|Θn)
               ≥ Σ_y P (y|X, Θn) ln [P (X|y, Θ) P (y|Θ) / P (y|X, Θn)] − Σ_y P (y|X, Θn) ln P (X|Θn)   Why this?
Next
Because
Σ_y P (y|X, Θn) = 1
Then
L (Θ) − L (Θn) ≥ Σ_y P (y|X, Θn) ln [P (X|y, Θ) P (y|Θ) / (P (y|X, Θn) P (X|Θn))] = ∆ (Θ|Θn)
Then, we have
Then, we have proved that
L (Θ) ≥ L (Θn) + ∆ (Θ|Θn)   (26)
Then, we define a new function
l (Θ|Θn) = L (Θn) + ∆ (Θ|Θn)   (27)
Thus l (Θ|Θn)
is bounded from above by L (Θ), i.e. l (Θ|Θn) ≤ L (Θ)
Now, we can do the following
We evaluate at Θn
l (Θn|Θn) = L (Θn) + ∆ (Θn|Θn)
          = L (Θn) + Σ_y P (y|X, Θn) ln [P (X|y, Θn) P (y|Θn) / (P (y|X, Θn) P (X|Θn))]
          = L (Θn) + Σ_y P (y|X, Θn) ln [P (X, y|Θn) / P (X, y|Θn)]
          = L (Θn)
This means that
For Θ = Θn, the functions L (Θ) and l (Θ|Θn) are equal.
Therefore
The function l (Θ|Θn) has the following properties
1 It is bounded from above by L (Θ), i.e. l (Θ|Θn) ≤ L (Θ).
2 For Θ = Θn, the functions L (Θ) and l (Θ|Θn) are equal.
3 The function l (Θ|Θn) is concave... How?
First
We have the value L (Θn)
We know that L (Θn) is constant, i.e. an offset value.
What about ∆ (Θ|Θn)?
Σ_y P (y|X, Θn) ln [P (X|y, Θ) P (y|Θ) / (P (y|X, Θn) P (X|Θn))]
We have that ln is a concave function
ln [P (X|y, Θ) P (y|Θ) / (P (y|X, Θn) P (X|Θn))]
Therefore
Each term is concave
P (y|X, Θn) ln [P (X|y, Θ) P (y|Θ) / (P (y|X, Θn) P (X|Θn))]
Therefore, the sum of concave functions is a concave function
Σ_y P (y|X, Θn) ln [P (X|y, Θ) P (y|Θ) / (P (y|X, Θn) P (X|Θn))]
Given the Concave Function
Thus, we have that
1 We can select the Θ that maximizes l (Θ|Θn).
2 Thus, given a Θn, we can generate Θn+1.
The process can be seen in the following graph
Given
The Previous Constraints
1 l (Θ|Θn) is bounded from above by L (Θ): l (Θ|Θn) ≤ L (Θ)
2 For Θ = Θn, the functions L (Θ) and l (Θ|Θn) are equal: L (Θn) = l (Θn|Θn)
3 The function l (Θ|Θn) is concave
From
The following
Θn+1 = argmax_Θ {l (Θ|Θn)}
     = argmax_Θ { L (Θn) + Σ_y P (y|X, Θn) ln [P (X|y, Θ) P (y|Θ) / (P (y|X, Θn) P (X|Θn))] }
       (the terms involving only Θn are constants)
     ≈ argmax_Θ { Σ_y P (y|X, Θn) ln (P (X|y, Θ) P (y|Θ)) }
     = argmax_Θ { Σ_y P (y|X, Θn) ln [ (P (X, y|Θ) / P (y|Θ)) · (P (y, Θ) / P (Θ)) ] }
     = argmax_Θ { Σ_y P (y|X, Θn) ln [ ( [P (X, y, Θ) / P (Θ)] / [P (y, Θ) / P (Θ)] ) · (P (y, Θ) / P (Θ)) ] }
Thus
Then
Θn+1 = argmax_Θ { Σ_y P (y|X, Θn) ln [ (P (X, y, Θ) / P (y, Θ)) · (P (y, Θ) / P (Θ)) ] }
     = argmax_Θ { Σ_y P (y|X, Θn) ln [P (X, y, Θ) / P (Θ)] }
     = argmax_Θ { Σ_y P (y|X, Θn) ln (P (X, y|Θ)) }
     = argmax_Θ { E_{y|X,Θn} [ln (P (X, y|Θ))] }
Then argmax_Θ {l (Θ|Θn)} ≈ argmax_Θ { E_{y|X,Θn} [ln (P (X, y|Θ))] }
The EM-Algorithm
Steps of EM
1 Expectation under the hidden variables.
2 Maximization of the resulting formula.
E-Step
Determine the conditional expectation E_{y|X,Θn} [ln (P (X, y|Θ))].
M-Step
Maximize this expression with respect to Θ.
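A generic skeleton of this E/M loop (an illustrative sketch, not the slides' code; `e_step` and `m_step` are assumed user-supplied callbacks, e.g. the mixture formulas derived in the next section):

```python
import numpy as np

# Generic EM skeleton. `e_step(X, theta)` returns the posteriors
# p(y | x, theta_n) and the current log-likelihood; `m_step(X, posteriors)`
# returns argmax_Theta Q(Theta, Theta_n).
def em(X, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    theta, prev_ll = theta0, -np.inf
    for _ in range(max_iter):
        posteriors, log_lik = e_step(X, theta)    # E-step
        theta = m_step(X, posteriors)             # M-step
        if log_lik - prev_ll < tol:               # L(Theta) never decreases
            break
        prev_ll = log_lik
    return theta
```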
Notes and Convergence of EM
Gains between L (Θ) and l (Θ|Θn)
Using the hidden variables, it is possible to simplify the optimization of L (Θ) through l (Θ|Θn).
Convergence
Remember that Θn+1 is the estimate for Θ which maximizes the difference ∆ (Θ|Θn).
Therefore
Then, we have
Given the initial estimate of Θ by Θn,
∆ (Θn|Θn) = 0
Now
If we choose Θn+1 to maximize ∆ (Θ|Θn), then
∆ (Θn+1|Θn) ≥ ∆ (Θn|Θn) = 0
We have that
The likelihood L (Θ) is non-decreasing from one EM iteration to the next.
Notes and Convergence of EM
Properties
When the algorithm reaches a fixed point for some Θn, that value maximizes l (Θ|Θn).
Definition
A fixed point of a function is an element of its domain that is mapped to itself by the function:
f (x) = x
Basically, the EM algorithm does the following
EM [Θ∗] = Θ∗
At this moment
We have that
When the algorithm reaches a fixed point for some Θn, that value Θ∗ maximizes l (Θ|Θn).
Then, when the algorithm reaches such a fixed point,
the value maximizes l (Θ|Θn); basically, Θn+1 = Θn.
Therefore
We have
[Figure: L (Θ) and its lower bound l (Θ|Θn), which touch at Θn]
Then
If L and l are differentiable at Θn
Since L and l are equal at Θn, and at a fixed point Θn maximizes l (Θ|Θn), Θn is also a stationary point of L, i.e. the derivative of L vanishes at that point.
[Figure: Local Maxima]
However
You could end up in the following case instead, with no local maximum
[Figure: Saddle Point]
For more on the subject
Please take a look to
Geoffrey McLachlan and Thriyambakam Krishnan, “The EM Algorithm
and Extensions,” John Wiley & Sons, New York, 1996.
Finding Maximum Likelihood Mixture Densities Parameters via EM
Something Notable
The mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community.
We have
p (x|Θ) = Σ_{i=1}^{M} αi pi (x|θi)   (28)
where
1 Θ = (α1, ..., αM , θ1, ..., θM )
2 Σ_{i=1}^{M} αi = 1
3 Each pi is a density function parametrized by θi.
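For concreteness (my own sketch, not from the slides), Eq. (28) for an assumed mixture of univariate Gaussians can be evaluated as follows:

```python
import numpy as np
from scipy.stats import norm

# Sketch of Eq. (28) for an assumed univariate Gaussian mixture:
# p(x | Theta) = sum_i alpha_i * N(x | mu_i, sigma_i).
def mixture_density(x, alphas, mus, sigmas):
    alphas, mus, sigmas = map(np.asarray, (alphas, mus, sigmas))
    # each component density at x, weighted by its mixing coefficient alpha_i
    return np.sum(alphas * norm.pdf(x, loc=mus, scale=sigmas))

# Example with illustrative parameters (the alphas sum to 1).
print(mixture_density(0.5, alphas=[0.3, 0.7], mus=[0.0, 2.0], sigmas=[1.0, 0.5]))
```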
A log-likelihood for this function
We have
log L (Θ|X) = log ∏_{i=1}^{N} p (xi|Θ) = Σ_{i=1}^{N} log [ Σ_{j=1}^{M} αj pj (xi|θj) ]   (29)
Note: This is difficult to optimize because the log sits outside the inner sum.
However
We can simplify this by assuming the following:
1 Each sample has an unobserved label: Y = {yi}_{i=1}^{N} with yi ∈ {1, ..., M}.
2 yi = k if the ith sample was generated by the kth mixture component.
Now
We have
log L (Θ|X, Y) = log [P (X, Y|Θ)]   (30)
Remember that X = {x1, x2, ..., xN } with Y = {y1, y2, ..., yN }; assuming independence,
log [P (X, Y|Θ)] = log [P (x1, x2, ..., xN , y1, y2, ..., yN |Θ)]
                 = log [P (x1, y1, ..., xi, yi, ..., xN , yN |Θ)]
                 = log ∏_{i=1}^{N} P (xi, yi|Θ)
                 = Σ_{i=1}^{N} log P (xi, yi|Θ)
Then
Thus, by the Chain Rule
Σ_{i=1}^{N} log P (xi, yi|Θ) = Σ_{i=1}^{N} log [P (xi|yi, θyi ) P (yi|θyi )]   (31)
Question: Do you need yi if you know θyi, or the other way around?
Finally
Σ_{i=1}^{N} log [P (xi|yi, θyi ) P (yi|θyi )] = Σ_{i=1}^{N} log [P (yi) pyi (xi|θyi )]   (32)
NOPE: You do not need yi if you know θyi, or the other way around.
Finally, we have
Making αyi = P (yi),
log L (Θ|X, Y) = Σ_{i=1}^{N} log [αyi pyi (xi|θyi )]   (33)
Problem
Which Labels?
We do not know the values of Y.
We can get away with this by using the following idea
Assume that Y is a random variable.
Thus
You make a first guess for the parameters at the beginning of EM
Θ^g = (α^g_1, ..., α^g_M , θ^g_1, ..., θ^g_M )   (34)
Then, it is possible to evaluate the parametric probabilities pj (xi|θ^g_j).
Therefore
The mixing parameters αj can be thought of as prior probabilities of each mixture component:
αj = p (component j)   (35)
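A minimal initialization sketch for the first guess Θ^g of a Gaussian mixture (the particular choices, equal weights and data-drawn means, are my own illustrative assumptions, not the slides'):

```python
import numpy as np

# Illustrative first guess Theta^g for an M-component Gaussian mixture:
# equal mixing weights, means drawn from the data, a common spread.
def init_params(X, M, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    alphas = np.full(M, 1.0 / M)                 # alpha^g_j, sum to 1
    mus = rng.choice(X, size=M, replace=False)   # mu^g_j picked from the data
    sigmas = np.full(M, X.std())                 # sigma^g_j
    return alphas, mus, sigmas
```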
We want to calculate the following probability
We want to calculate
p (yi|xi, Θ^g)
Basically
We want a Bayesian formulation of this probability, assuming that y = (y1, y2, ..., yN ) are independent and identically distributed samples.
Using Bayes' Rule
Compute
p (yi|xi, Θ^g) = p (yi, xi|Θ^g) / p (xi|Θ^g)
              = p (xi|yi, θ^g_{yi}) p (yi|θ^g_{yi}) / p (xi|Θ^g)   (knowing yi picks out θ^g_{yi})
              = α^g_{yi} p_{yi} (xi|θ^g_{yi}) / p (xi|Θ^g)
              = α^g_{yi} p_{yi} (xi|θ^g_{yi}) / Σ_{k=1}^{M} α^g_k p_k (xi|θ^g_k)
As in Naive Bayes
There is one such probability per mixture component and sample
p (yi|xi, Θ^g) = α^g_{yi} p_{yi} (xi|θ^g_{yi}) / Σ_{k=1}^{M} α^g_k p_k (xi|θ^g_k), ∀ xi, yi with yi ∈ {1, ..., M}
This is going to be updated at each iteration of the EM algorithm
After the initial guess!!! Until convergence!!!
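A minimal sketch of this E-step quantity (the responsibilities) for an assumed univariate Gaussian mixture; the vectorization over samples and components is my own choice, not the slides' code:

```python
import numpy as np
from scipy.stats import norm

# E-step sketch: p(y_i = k | x_i, Theta^g) for every sample i and component k.
def responsibilities(X, alphas, mus, sigmas):
    X = np.asarray(X, dtype=float)[:, None]              # shape (N, 1)
    # numerator: alpha_k * p_k(x_i | theta^g_k), shape (N, M) by broadcasting
    num = np.asarray(alphas) * norm.pdf(X, loc=mus, scale=sigmas)
    return num / num.sum(axis=1, keepdims=True)           # normalize over k

# Each row is a posterior over the M components, so it sums to 1.
R = responsibilities([0.1, 2.3, -1.0],
                     alphas=[0.5, 0.5], mus=[0.0, 2.0], sigmas=[1.0, 1.0])
print(R.sum(axis=1))   # -> [1. 1. 1.]
```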
Additionally
We assume again that the yi are independent and identically distributed samples
p (y|X, Θ^g) = ∏_{i=1}^{N} p (yi|xi, Θ^g)   (36)
where y = (y1, y2, ..., yN ).
Now, using equation 17
Then
Q (Θ|Θ^g) = Σ_{y∈Υ} log (L (Θ|X, y)) p (y|X, Θ^g)
          = Σ_{y∈Υ} Σ_{i=1}^{N} log [αyi pyi (xi|θyi )] ∏_{j=1}^{N} p (yj|xj, Θ^g)
Images/
Now, using equation 17
Then
Q (Θ|Θg
) =
y∈Y
log (L (Θ|X, y)) p (y|X, Θg
)
=
y∈Y
N
i=1
log [αyi pyi (xi|θyi )]
N
j=1
p (yj|xj, Θg
)
78 / 126
Images/
Here, a small stop
What is the meaning of Σ_{y∈Υ}?
It is actually a summation over all possible states of the random vector y.
Then, we can rewrite the previous summation as
Σ_{y∈Υ} = Σ_{y1=1}^{M} Σ_{y2=1}^{M} · · · Σ_{yN =1}^{M}
running over all the samples {x1, x2, ..., xN }.
79 / 126
Images/
Here, a small stop
What is the meaning of y∈Y
It is actually a summation of all possible states of the random vector y.
Then, we can rewrite the previous summation as
y∈Y
=
M
y1=1
M
y2=1
· · ·
M
yN =1
N
Running over all the samples {x1, x2, ..., xN }.
79 / 126
Images/
Then
We have
Q(\Theta | \Theta^g) = \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \sum_{i=1}^{N} \left[ \log\left[\alpha_{y_i} p_{y_i}(x_i | \theta_{y_i})\right] \prod_{j=1}^{N} p(y_j | x_j, \Theta^g) \right]
80 / 126
We introduce the following
We have the following indicator function
\delta_{l, y_i} = \begin{cases} 1 & \text{if } l = y_i \\ 0 & \text{if } l \neq y_i \end{cases}
Therefore, we can do the following
\alpha_{y_i} = \sum_{l=1}^{M} \delta_{l, y_i}\, \alpha_l
Then
\log\left[\alpha_{y_i} p_{y_i}(x_i | \theta_{y_i})\right] \prod_{j=1}^{N} p(y_j | x_j, \Theta^g) = \sum_{l=1}^{M} \delta_{l, y_i} \log\left[\alpha_l p_l(x_i | \theta_l)\right] \prod_{j=1}^{N} p(y_j | x_j, \Theta^g)
81 / 126
Thus
We have that for
\sum_{y_1=1}^{M} \cdots \sum_{y_N=1}^{M} \sum_{i=1}^{N} \log\left[\alpha_{y_i} p_{y_i}(x_i | \theta_{y_i})\right] \prod_{j=1}^{N} p(y_j | x_j, \Theta^g) = *
* = \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \sum_{i=1}^{N} \sum_{l=1}^{M} \delta_{l, y_i} \log\left[\alpha_l p_l(x_i | \theta_l)\right] \prod_{j=1}^{N} p(y_j | x_j, \Theta^g)
= \sum_{i=1}^{N} \sum_{l=1}^{M} \log\left[\alpha_l p_l(x_i | \theta_l)\right] \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \left[ \delta_{l, y_i} \prod_{j=1}^{N} p(y_j | x_j, \Theta^g) \right]
Because \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} applies only to \delta_{l, y_i} \prod_{j=1}^{N} p(y_j | x_j, \Theta^g)
82 / 126
Then, we have that
First notice the following
\sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{l, y_i} \prod_{j=1}^{N} p(y_j | x_j, \Theta^g) =
= \left[ \sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} \left( \sum_{y_i=1}^{M} \delta_{l, y_i}\, p(y_i | x_i, \Theta^g) \right) \prod_{j=1, j \neq i}^{N} p(y_j | x_j, \Theta^g) \right]
Then, we have
\sum_{y_i=1}^{M} \delta_{l, y_i}\, p(y_i | x_i, \Theta^g) = p(l | x_i, \Theta^g)
83 / 126
In this way
Plugging back into the previous equation
\sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{l, y_i} \prod_{j=1}^{N} p(y_j | x_j, \Theta^g) =
= \left[ \sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} p(l | x_i, \Theta^g) \prod_{j=1, j \neq i}^{N} p(y_j | x_j, \Theta^g) \right]
= \left[ \sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} \prod_{j=1, j \neq i}^{N} p(y_j | x_j, \Theta^g) \right] p(l | x_i, \Theta^g)
84 / 126
Now, what about...?
The left part of the equation
\sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} \prod_{j=1, j \neq i}^{N} p(y_j | x_j, \Theta^g) =
= \left( \sum_{y_1=1}^{M} p(y_1 | x_1, \Theta^g) \right) \cdots \left( \sum_{y_{i-1}=1}^{M} p(y_{i-1} | x_{i-1}, \Theta^g) \right) \times
\left( \sum_{y_{i+1}=1}^{M} p(y_{i+1} | x_{i+1}, \Theta^g) \right) \cdots \left( \sum_{y_N=1}^{M} p(y_N | x_N, \Theta^g) \right)
= \prod_{j=1, j \neq i}^{N} \left( \sum_{y_j=1}^{M} p(y_j | x_j, \Theta^g) \right)
85 / 126
Then, we have that
Plugging back into the original equation
\left[ \sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} \prod_{j=1, j \neq i}^{N} p(y_j | x_j, \Theta^g) \right] p(l | x_i, \Theta^g) =
= \left[ \prod_{j=1, j \neq i}^{N} \left( \sum_{y_j=1}^{M} p(y_j | x_j, \Theta^g) \right) \right] p(l | x_i, \Theta^g)
86 / 126
We can use properties of probability
We know that
\sum_{y_i=1}^{M} p(y_i | x_i, \Theta^g) = 1 \quad (37)
Then
\left[ \prod_{j=1, j \neq i}^{N} \left( \sum_{y_j=1}^{M} p(y_j | x_j, \Theta^g) \right) \right] p(l | x_i, \Theta^g) =
= \left[ \prod_{j=1, j \neq i}^{N} 1 \right] p(l | x_i, \Theta^g)
= p(l | x_i, \Theta^g)
= \frac{\alpha^g_l\, p_l(x_i | \theta^g_l)}{\sum_{k=1}^{M} \alpha^g_k\, p_k(x_i | \theta^g_k)}
87 / 126
Thus
We can write Q in the following way
Q(\Theta, \Theta^g) = \sum_{i=1}^{N} \sum_{l=1}^{M} \log\left[\alpha_l p_l(x_i | \theta_l)\right] p(l | x_i, \Theta^g)
= \sum_{i=1}^{N} \sum_{l=1}^{M} \log(\alpha_l)\, p(l | x_i, \Theta^g) + \sum_{i=1}^{N} \sum_{l=1}^{M} \log\left(p_l(x_i | \theta_l)\right) p(l | x_i, \Theta^g) \quad (38)
88 / 126
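Once the responsibilities p(l | x_i, Θ^g) are in hand, equation 38 is just two weighted sums. A small sketch of how it could be evaluated (mine, not from the slides; the array names are assumptions):

```python
import numpy as np

def q_value(resp, alphas, log_comp_dens):
    """Evaluate Q(Theta, Theta^g) as in equation 38.

    resp          : (N, M) responsibilities p(l | x_i, Theta^g), computed under Theta^g
    alphas        : (M,)   candidate mixing weights alpha_l
    log_comp_dens : (N, M) log p_l(x_i | theta_l) under the candidate Theta
    """
    mixing_term = np.sum(resp * np.log(alphas))     # first double sum of equation 38
    density_term = np.sum(resp * log_comp_dens)     # second double sum of equation 38
    return mixing_term + density_term
```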
A Method
That could be used as a general framework
To solve problems posed as EM problems.
First, we will look at the Lagrange Multipliers setup
Then, we will look at a specific case using a mixture of Gaussians.
Note
Not every mixture of distributions will give you an analytical solution.
90 / 126
Lagrange Multipliers
The method of Lagrange multipliers
Gives a set of necessary conditions to identify optimal points of equality
constrained optimization problems.
This is done by converting a constrained problem
To an equivalent unconstrained problem with the help of certain
unspecified parameters known as Lagrange multipliers.
92 / 126
Lagrange Multipliers
The classical problem formulation
\min f(x_1, x_2, ..., x_n)
\text{s.t. } h_1(x_1, x_2, ..., x_n) = 0
It can be converted into
\min L(x_1, x_2, ..., x_n, \lambda) = \min \left\{ f(x_1, x_2, ..., x_n) - \lambda h_1(x_1, x_2, ..., x_n) \right\} \quad (39)
where
L(x, \lambda) is the Lagrangian function.
\lambda is an unspecified positive or negative constant called the Lagrange Multiplier.
93 / 126
Lagrange Multipliers are an application of Physics
Think about this
Remember the First Law of Newton!!!
Yes!!!
A system in equilibrium does not move
Static Body
94 / 126
Lagrange Multipliers
Definition
Gives a set of necessary conditions to identify optimal points of equality
constrained optimization problems.
95 / 126
Lagrange was a Physicist
He was thinking of the following formula
A system in equilibrium satisfies the following equation:
F_1 + F_2 + ... + F_K = 0 \quad (40)
But functions do not have forces, do they?
Are you sure?
Think about the following
The Gradient of a surface.
96 / 126
Gradient to a Surface
After all, a gradient is a measure of the maximal change
For example, the gradient of a function of three variables:
\nabla f(x) = i\, \frac{\partial f(x)}{\partial x} + j\, \frac{\partial f(x)}{\partial y} + k\, \frac{\partial f(x)}{\partial z} \quad (41)
where i, j and k are unit vectors in the directions x, y and z.
97 / 126
Example
We have f(x, y) = x \exp\{-x^2 - y^2\}
98 / 126
Example
With the gradient at the contours of f(x, y) when projected onto the 2D plane
99 / 126
Now, Think about this
Yes, we can use the gradient
However, we need to do some scaling of the forces by using parameters \lambda
Thus, we have
F_0 + \lambda_1 F_1 + ... + \lambda_K F_K = 0 \quad (42)
where F_0 is the gradient of the principal cost function and F_i, for i = 1, 2, ..., K, are the gradients of the constraints.
100 / 126
Thus
If we have the following optimization:
\min f(x)
\text{s.t. } g_1(x) = 0
\quad\;\; g_2(x) = 0
101 / 126
Geometric interpretation in the case of minimization
What is wrong? The gradients point in the opposite direction; we can fix this by simply multiplying by -1.
Here the cost function is f(x, y) = x \exp\{-x^2 - y^2\}, and we want to minimize f(\vec{x}) subject to g_1(\vec{x}) = 0 and g_2(\vec{x}) = 0.
-\nabla f(\vec{x}) + \lambda_1 \nabla g_1(\vec{x}) + \lambda_2 \nabla g_2(\vec{x}) = 0
Nevertheless, it is equivalent to \nabla f(\vec{x}) - \lambda_1 \nabla g_1(\vec{x}) - \lambda_2 \nabla g_2(\vec{x}) = 0
101 / 126
Method
Steps
1 The original problem is rewritten as: minimize L(x, \lambda) = f(x) - \lambda h_1(x)
2 Take derivatives of L(x, \lambda) with respect to x_i and set them equal to zero.
3 Express all x_i in terms of the Lagrange multiplier \lambda.
4 Plug x, written in terms of \lambda, into the constraint h_1(x) = 0 and solve for \lambda.
5 Calculate x by using the value just found for \lambda.
From step 2
If there are n variables (i.e., x_1, ..., x_n) then you will get n equations with n + 1 unknowns (i.e., n variables x_i and one Lagrange multiplier \lambda).
102 / 126
Example
We can apply this to the following problem
\min f(x, y) = x^2 - 8x + y^2 - 12y + 48
\text{s.t. } x + y = 8
103 / 126
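As a worked check of the five steps above (not on the slides), a short SymPy sketch solves this example and recovers x = 3, y = 5 with \lambda = -2 and a minimum value of -2:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda')
f = x**2 - 8*x + y**2 - 12*y + 48
h = x + y - 8                                 # constraint h(x, y) = 0

L = f - lam * h                               # step 1: the Lagrangian
stationary = sp.solve(
    [sp.diff(L, x), sp.diff(L, y), h],        # steps 2-4: dL/dx = dL/dy = 0 and the constraint
    [x, y, lam], dict=True
)
print(stationary)                             # [{x: 3, y: 5, lambda: -2}]
print(f.subs(stationary[0]))                  # step 5: optimal value, -2
```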
Lagrange Multipliers for Q
We can use the following constraint for that
\sum_{l} \alpha_l = 1 \quad (43)
We have the following cost function
Q(\Theta, \Theta^g) + \lambda \left( \sum_{l} \alpha_l - 1 \right) \quad (44)
Deriving with respect to \alpha_l
\frac{\partial}{\partial \alpha_l} \left[ Q(\Theta, \Theta^g) + \lambda \left( \sum_{l} \alpha_l - 1 \right) \right] = 0 \quad (45)
105 / 126
Thus
The Q function
Q(\Theta, \Theta^g) = \sum_{i=1}^{N} \sum_{l=1}^{M} \log(\alpha_l)\, p(l | x_i, \Theta^g) + \sum_{i=1}^{N} \sum_{l=1}^{M} \log\left(p_l(x_i | \theta_l)\right) p(l | x_i, \Theta^g)
106 / 126
Deriving
We have
\frac{\partial}{\partial \alpha_l} \left[ Q(\Theta, \Theta^g) + \lambda \left( \sum_{l} \alpha_l - 1 \right) \right] = \sum_{i=1}^{N} \frac{1}{\alpha_l}\, p(l | x_i, \Theta^g) + \lambda
107 / 126
Finally
Setting the previous equation equal to 0, we have
\sum_{i=1}^{N} \frac{1}{\alpha_l}\, p(l | x_i, \Theta^g) + \lambda = 0 \quad (46)
Thus
\sum_{i=1}^{N} p(l | x_i, \Theta^g) = -\lambda \alpha_l \quad (47)
Summing over l, and using \sum_{l} \alpha_l = 1 together with \sum_{l=1}^{M} p(l | x_i, \Theta^g) = 1, we get
\lambda = -N \quad (48)
108 / 126
Lagrange Multipliers
Thus
\alpha_l = \frac{1}{N} \sum_{i=1}^{N} p(l | x_i, \Theta^g) \quad (49)
About \theta_l
It is possible to get analytical expressions for \theta_l as functions of everything else.
This is for you to try!!!
For more, please look at
“Geometric Idea of Lagrange Multipliers” by John Wyatt.
109 / 126
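In code, equation 49 is simply a column mean of the responsibility matrix. A tiny sketch (mine, with a made-up 3 × 2 responsibility matrix):

```python
import numpy as np

resp = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.5, 0.5]])          # p(l | x_i, Theta^g) for N = 3 samples, M = 2 components
alphas_new = resp.mean(axis=0)         # equation 49: alpha_l = (1/N) sum_i p(l | x_i, Theta^g)
print(alphas_new)                      # [0.53333333 0.46666667]
```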
Remember?
Gaussian Distribution
p_l(x | \mu_l, \Sigma_l) = \frac{1}{(2\pi)^{d/2} |\Sigma_l|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_l)^T \Sigma_l^{-1} (x - \mu_l) \right\} \quad (50)
111 / 126
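As a sanity check (my sketch, with made-up parameter values), equation 50 written out explicitly agrees with scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

# Equation 50 written out explicitly
d = len(mu)
diff = x - mu
dens = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / \
       np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

print(dens)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # same value
```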
How to use this for Gaussian Distributions
For this, we need to refresh some linear algebra
1 \text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)
2 \text{tr}(AB) = \text{tr}(BA)
3 \sum_{i} x_i^T A x_i = \text{tr}(AB) \text{ where } B = \sum_{i} x_i x_i^T
4 |A^{-1}| = \frac{1}{|A|}
Now, we need the derivative of a matrix function f(A)
Thus, \frac{\partial f(A)}{\partial A} is the matrix whose (i, j)th entry is \frac{\partial f(A)}{\partial a_{i,j}}, where a_{i,j} is the (i, j)th entry of A.
112 / 126
In addition
If A is symmetric
\frac{\partial |A|}{\partial a_{i,j}} = \begin{cases} A_{i,j} & \text{if } i = j \\ 2A_{i,j} & \text{if } i \neq j \end{cases} \quad (51)
Where A_{i,j} is the (i, j)th cofactor of A.
Note: the cofactor is the determinant obtained by deleting the row and column of a given element of a matrix or determinant, preceded by a + or − sign depending on whether the element is in a + or − position.
Thus
\frac{\partial \log |A|}{\partial a_{i,j}} = \begin{cases} \frac{A_{i,j}}{|A|} & \text{if } i = j \\ \frac{2A_{i,j}}{|A|} & \text{if } i \neq j \end{cases} \quad \Longrightarrow \quad \frac{\partial \log |A|}{\partial A} = 2A^{-1} - \text{diag}\left(A^{-1}\right) \quad (52)
113 / 126
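Equation 52 can be verified numerically with a finite difference that respects the symmetry constraint a_{ij} = a_{ji}; this is my own check, not part of the slides, and the test matrix is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T + 4 * np.eye(4)                       # symmetric positive-definite test matrix

A_inv = np.linalg.inv(A)
claimed = 2 * A_inv - np.diag(np.diag(A_inv))     # 2 A^{-1} - diag(A^{-1}) from equation 52

eps = 1e-6
numeric = np.zeros_like(A)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(A)
        E[i, j] = eps
        E[j, i] = eps                              # off-diagonal entries move together
        numeric[i, j] = (np.linalg.slogdet(A + E)[1]
                         - np.linalg.slogdet(A - E)[1]) / (2 * eps)

print(np.max(np.abs(numeric - claimed)))           # prints a tiny number: the identity holds
```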
Finally
The last equation we need
\frac{\partial\, \text{tr}(AB)}{\partial A} = B + B^T - \text{diag}(B) \quad (53)
In addition
\frac{\partial\, x^T A x}{\partial x} = (A + A^T)x, \text{ which equals } 2Ax \text{ when } A \text{ is symmetric} \quad (54)
114 / 126
Thus, using the last part of equation 38
We get, after ignoring constant terms
(remember that they disappear after taking derivatives)
\sum_{i=1}^{N} \sum_{l=1}^{M} \log\left(p_l(x_i | \mu_l, \Sigma_l)\right) p(l | x_i, \Theta^g)
= \sum_{i=1}^{N} \sum_{l=1}^{M} \left( -\frac{1}{2} \log\left(|\Sigma_l|\right) - \frac{1}{2} (x_i - \mu_l)^T \Sigma_l^{-1} (x_i - \mu_l) \right) p(l | x_i, \Theta^g) \quad (55)
115 / 126
Finally
Thus, when taking the derivative with respect to \mu_l
\sum_{i=1}^{N} \Sigma_l^{-1} (x_i - \mu_l)\, p(l | x_i, \Theta^g) = 0 \quad (56)
Then
\mu_l = \frac{\sum_{i=1}^{N} x_i\, p(l | x_i, \Theta^g)}{\sum_{i=1}^{N} p(l | x_i, \Theta^g)} \quad (57)
116 / 126
Now, if we derive with respect to \Sigma_l
First, we rewrite equation 55
\sum_{i=1}^{N} \sum_{l=1}^{M} \left( -\frac{1}{2} \log\left(|\Sigma_l|\right) - \frac{1}{2} (x_i - \mu_l)^T \Sigma_l^{-1} (x_i - \mu_l) \right) p(l | x_i, \Theta^g)
= \sum_{l=1}^{M} \left( -\frac{1}{2} \log\left(|\Sigma_l|\right) \sum_{i=1}^{N} p(l | x_i, \Theta^g) - \frac{1}{2} \sum_{i=1}^{N} p(l | x_i, \Theta^g)\, \text{tr}\left( \Sigma_l^{-1} (x_i - \mu_l)(x_i - \mu_l)^T \right) \right)
= \sum_{l=1}^{M} \left( -\frac{1}{2} \log\left(|\Sigma_l|\right) \sum_{i=1}^{N} p(l | x_i, \Theta^g) - \frac{1}{2} \sum_{i=1}^{N} p(l | x_i, \Theta^g)\, \text{tr}\left( \Sigma_l^{-1} N_{l,i} \right) \right)
Where N_{l,i} = (x_i - \mu_l)(x_i - \mu_l)^T.
117 / 126
Deriving with respect to \Sigma_l^{-1}
We have that
\frac{\partial}{\partial \Sigma_l^{-1}} \left[ \sum_{l=1}^{M} \left( -\frac{1}{2} \log\left(|\Sigma_l|\right) \sum_{i=1}^{N} p(l | x_i, \Theta^g) - \frac{1}{2} \sum_{i=1}^{N} p(l | x_i, \Theta^g)\, \text{tr}\left( \Sigma_l^{-1} N_{l,i} \right) \right) \right]
= \frac{1}{2} \sum_{i=1}^{N} p(l | x_i, \Theta^g) \left( 2\Sigma_l - \text{diag}(\Sigma_l) \right) - \frac{1}{2} \sum_{i=1}^{N} p(l | x_i, \Theta^g) \left( 2N_{l,i} - \text{diag}(N_{l,i}) \right)
= \frac{1}{2} \sum_{i=1}^{N} p(l | x_i, \Theta^g) \left( 2M_{l,i} - \text{diag}(M_{l,i}) \right)
= 2S - \text{diag}(S)
Where M_{l,i} = \Sigma_l - N_{l,i} and S = \frac{1}{2} \sum_{i=1}^{N} p(l | x_i, \Theta^g)\, M_{l,i}
118 / 126
Thus, we have
Thus
If 2S - \text{diag}(S) = 0, then S = 0
Implying
\frac{1}{2} \sum_{i=1}^{N} p(l | x_i, \Theta^g) \left[ \Sigma_l - N_{l,i} \right] = 0 \quad (58)
Or
\Sigma_l = \frac{\sum_{i=1}^{N} p(l | x_i, \Theta^g)\, N_{l,i}}{\sum_{i=1}^{N} p(l | x_i, \Theta^g)} = \frac{\sum_{i=1}^{N} p(l | x_i, \Theta^g)\, (x_i - \mu_l)(x_i - \mu_l)^T}{\sum_{i=1}^{N} p(l | x_i, \Theta^g)} \quad (59)
119 / 126
Thus, we have the iterative updates
They are
\alpha_l^{New} = \frac{1}{N} \sum_{i=1}^{N} p(l | x_i, \Theta^g)
\mu_l^{New} = \frac{\sum_{i=1}^{N} x_i\, p(l | x_i, \Theta^g)}{\sum_{i=1}^{N} p(l | x_i, \Theta^g)}
\Sigma_l^{New} = \frac{\sum_{i=1}^{N} p(l | x_i, \Theta^g)\, (x_i - \mu_l)(x_i - \mu_l)^T}{\sum_{i=1}^{N} p(l | x_i, \Theta^g)}
120 / 126
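One possible NumPy rendering of these three updates (my sketch, not code from the slides; `X` is assumed to be the (N, d) data matrix and `resp` the (N, M) matrix of responsibilities p(l | x_i, Θ^g)):

```python
import numpy as np

def m_step(X, resp):
    """M-step updates for a Gaussian mixture (equations 49, 57 and 59)."""
    N, d = X.shape
    Nl = resp.sum(axis=0)                                    # sum_i p(l | x_i, Theta^g), shape (M,)

    alphas = Nl / N                                          # alpha_l^New
    mus = (resp.T @ X) / Nl[:, None]                         # mu_l^New, shape (M, d)

    Sigmas = np.empty((len(Nl), d, d))
    for l in range(len(Nl)):
        diff = X - mus[l]                                    # (N, d)
        Sigmas[l] = (resp[:, l, None] * diff).T @ diff / Nl[l]   # Sigma_l^New
    return alphas, mus, Sigmas
```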
EM Algorithm for Gaussian Mixtures
Step 1
Initialize:
The means µl
Covariances Σl
Mixing coefficients αl
122 / 126
Evaluate
Step 2 - E-Step
Evaluate the probabilities of component l given x_i using the current parameter values:
p(l | x_i, \Theta^g) = \frac{\alpha^g_l\, p_l(x_i | \theta^g_l)}{\sum_{k=1}^{M} \alpha^g_k\, p_k(x_i | \theta^g_k)}
123 / 126
Now
Step 3 - M-Step
Re-estimate the parameters using the current iteration values:
\alpha_l^{New} = \frac{1}{N} \sum_{i=1}^{N} p(l | x_i, \Theta^g)
\mu_l^{New} = \frac{\sum_{i=1}^{N} x_i\, p(l | x_i, \Theta^g)}{\sum_{i=1}^{N} p(l | x_i, \Theta^g)}
\Sigma_l^{New} = \frac{\sum_{i=1}^{N} p(l | x_i, \Theta^g)\, (x_i - \mu_l)(x_i - \mu_l)^T}{\sum_{i=1}^{N} p(l | x_i, \Theta^g)}
124 / 126
Evaluate
Step 4
Evaluate the log likelihood:
\log p(X | \mu, \Sigma, \alpha) = \sum_{i=1}^{N} \log \left( \sum_{l=1}^{M} \alpha_l^{New}\, p_l\left(x_i | \mu_l^{New}, \Sigma_l^{New}\right) \right)
Step 5
Check for convergence of either the parameters or the log likelihood.
If the convergence criterion is not satisfied, return to step 2.
125 / 126
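Putting steps 1 through 5 together, the following is a compact sketch of the whole loop (an illustration of mine, not code from the slides). It uses scipy.stats.multivariate_normal for the component densities, randomly chosen data points as initial means, and the change in log likelihood as the convergence criterion; a small ridge term keeps the covariances invertible.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, M, n_iter=200, tol=1e-6, seed=0):
    """EM for a mixture of M Gaussians on data X of shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape

    # Step 1: initialize means, covariances and mixing coefficients
    mus = X[rng.choice(N, size=M, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])
    alphas = np.full(M, 1.0 / M)

    prev_ll = -np.inf
    for _ in range(n_iter):
        # Step 2 (E-step): responsibilities p(l | x_i, Theta^g)
        dens = np.column_stack([
            alphas[l] * multivariate_normal(mus[l], Sigmas[l]).pdf(X)
            for l in range(M)
        ])                                               # shape (N, M)
        resp = dens / dens.sum(axis=1, keepdims=True)

        # Step 3 (M-step): update alpha_l, mu_l, Sigma_l
        Nl = resp.sum(axis=0)
        alphas = Nl / N
        mus = (resp.T @ X) / Nl[:, None]
        for l in range(M):
            diff = X - mus[l]
            Sigmas[l] = (resp[:, l, None] * diff).T @ diff / Nl[l] + 1e-6 * np.eye(d)

        # Steps 4-5: log likelihood (under the parameters used in this E-step)
        # and the convergence check
        ll = np.sum(np.log(dens.sum(axis=1)))
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return alphas, mus, Sigmas
```

For numerical robustness on higher-dimensional data, the E-step would normally be carried out with log-densities and a log-sum-exp, but the plain form above follows the slides more directly.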
References I
S. Borman, “The expectation maximization algorithm-a short
tutorial,” Submitted for publication, pp. 1–9, 2004.
J. Bilmes, “A gentle tutorial of the EM algorithm and its application
to parameter estimation for Gaussian mixture and hidden Markov
models,” International Computer Science Institute, vol. 4, 1998.
F. Dellaert, “The expectation maximization algorithm,” tech. rep.,
Georgia Institute of Technology, 2002.
G. McLachlan and T. Krishnan, The EM algorithm and extensions,
vol. 382.
John Wiley & Sons, 2007.
126 / 126

07 Machine Learning - Expectation Maximization

  • 1.
    Introduction to MachineLearning Expectation Maximization Andres Mendez-Vazquez September 16, 2017 1 / 126
  • 2.
    Images/ Outline 1 Introduction Maximum-Likelihood Expectation Maximization Examplesof Applications of EM 2 Incomplete Data Introduction Using the Expected Value Analogy 3 Derivation of the EM-Algorithm Hidden Features Proving Concavity Using the Concave Functions for Approximation From The Concave Function to the EM The Final Algorithm Notes and Convergence of EM 4 Finding Maximum Likelihood Mixture Densities The Beginning of The Process Bayes’ Rule for the components Mixing Parameters Maximizing Q using Lagrange Multipliers Lagrange Multipliers In Our Case Example on Mixture of Gaussian Distributions The EM Algorithm 2 / 126
  • 3.
    Images/ Outline 1 Introduction Maximum-Likelihood Expectation Maximization Examplesof Applications of EM 2 Incomplete Data Introduction Using the Expected Value Analogy 3 Derivation of the EM-Algorithm Hidden Features Proving Concavity Using the Concave Functions for Approximation From The Concave Function to the EM The Final Algorithm Notes and Convergence of EM 4 Finding Maximum Likelihood Mixture Densities The Beginning of The Process Bayes’ Rule for the components Mixing Parameters Maximizing Q using Lagrange Multipliers Lagrange Multipliers In Our Case Example on Mixture of Gaussian Distributions The EM Algorithm 3 / 126
  • 4.
    Images/ Maximum-Likelihood We have adensity function p (x|Θ) Assume that we have a data set of size N, X = {x1, x2, ..., xN } This data is known as evidence. We assume in addition that The vectors are independent and identically distributed (i.i.d.) with distribution p under parameter θ. 4 / 126
  • 5.
    Images/ Maximum-Likelihood We have adensity function p (x|Θ) Assume that we have a data set of size N, X = {x1, x2, ..., xN } This data is known as evidence. We assume in addition that The vectors are independent and identically distributed (i.i.d.) with distribution p under parameter θ. 4 / 126
  • 6.
    Images/ What Can WeDo With The Evidence? We may use the Bayes’ Rule to estimate the parameters θ p (Θ|X) = P (X|Θ) P (Θ) P (X) (1) Or, given a new observation ˜x p (˜x|X) (2) I.e. to compute the probability of the new observation being supported by the evidence X. Thus The former represents parameter estimation and the latter data prediction. 5 / 126
  • 7.
    Images/ What Can WeDo With The Evidence? We may use the Bayes’ Rule to estimate the parameters θ p (Θ|X) = P (X|Θ) P (Θ) P (X) (1) Or, given a new observation ˜x p (˜x|X) (2) I.e. to compute the probability of the new observation being supported by the evidence X. Thus The former represents parameter estimation and the latter data prediction. 5 / 126
  • 8.
    Images/ What Can WeDo With The Evidence? We may use the Bayes’ Rule to estimate the parameters θ p (Θ|X) = P (X|Θ) P (Θ) P (X) (1) Or, given a new observation ˜x p (˜x|X) (2) I.e. to compute the probability of the new observation being supported by the evidence X. Thus The former represents parameter estimation and the latter data prediction. 5 / 126
  • 9.
    Images/ Focusing First onthe Estimation of the Parameters θ We can interpret the Bayes’ Rule p (Θ|X) = P (X|Θ) P (Θ) P (X) (3) Interpreted as posterior = likelihood × prior evidence (4) Thus, we want likelihood = P (X|Θ) 6 / 126
  • 10.
    Images/ Focusing First onthe Estimation of the Parameters θ We can interpret the Bayes’ Rule p (Θ|X) = P (X|Θ) P (Θ) P (X) (3) Interpreted as posterior = likelihood × prior evidence (4) Thus, we want likelihood = P (X|Θ) 6 / 126
  • 11.
    Images/ Focusing First onthe Estimation of the Parameters θ We can interpret the Bayes’ Rule p (Θ|X) = P (X|Θ) P (Θ) P (X) (3) Interpreted as posterior = likelihood × prior evidence (4) Thus, we want likelihood = P (X|Θ) 6 / 126
  • 12.
    Images/ What we want... Wewant to maximize the likelihood as a function of θ 7 / 126
  • 13.
    Images/ Maximum-Likelihood We have p (x1,x2, ..., xN |Θ) = N i=1 p (xi|Θ) (5) Also known as the likelihood function. Because multiplication of quantities p (xi|Θ) ≤ 1 can be problematic L (Θ|X) = log N i=1 p (xi|Θ) = N i=1 log p (xi|Θ) (6) 8 / 126
  • 14.
    Images/ Maximum-Likelihood We have p (x1,x2, ..., xN |Θ) = N i=1 p (xi|Θ) (5) Also known as the likelihood function. Because multiplication of quantities p (xi|Θ) ≤ 1 can be problematic L (Θ|X) = log N i=1 p (xi|Θ) = N i=1 log p (xi|Θ) (6) 8 / 126
  • 15.
    Images/ Maximum-Likelihood We want tofind a Θ∗ Θ∗ = argmaxΘL (Θ|X) (7) The classic method ∂L (Θ|X) ∂θi = 0 ∀θi ∈ Θ (8) 9 / 126
  • 16.
    Images/ Maximum-Likelihood We want tofind a Θ∗ Θ∗ = argmaxΘL (Θ|X) (7) The classic method ∂L (Θ|X) ∂θi = 0 ∀θi ∈ Θ (8) 9 / 126
  • 17.
    Images/ What happened ifwe have incomplete data Data could have been split 1 X = observed data or incomplete data 2 Y = unobserved data For this type of problems We have the famous Expectation Maximization (EM) 10 / 126
  • 18.
    Images/ What happened ifwe have incomplete data Data could have been split 1 X = observed data or incomplete data 2 Y = unobserved data For this type of problems We have the famous Expectation Maximization (EM) 10 / 126
  • 19.
    Images/ What happened ifwe have incomplete data Data could have been split 1 X = observed data or incomplete data 2 Y = unobserved data For this type of problems We have the famous Expectation Maximization (EM) 10 / 126
  • 20.
    Images/ Outline 1 Introduction Maximum-Likelihood Expectation Maximization Examplesof Applications of EM 2 Incomplete Data Introduction Using the Expected Value Analogy 3 Derivation of the EM-Algorithm Hidden Features Proving Concavity Using the Concave Functions for Approximation From The Concave Function to the EM The Final Algorithm Notes and Convergence of EM 4 Finding Maximum Likelihood Mixture Densities The Beginning of The Process Bayes’ Rule for the components Mixing Parameters Maximizing Q using Lagrange Multipliers Lagrange Multipliers In Our Case Example on Mixture of Gaussian Distributions The EM Algorithm 11 / 126
  • 21.
    Images/ The Expectation Maximization TheEM algorithm It was first developed by Dempster et al. (1977). Its popularity comes from the fact It can estimate an underlying distribution when data is incomplete or has missing values. Two main applications 1 When missing values exists. 2 When a likelihood function can be simplified by assuming extra parameters that are missing or hidden. 12 / 126
  • 22.
    Images/ The Expectation Maximization TheEM algorithm It was first developed by Dempster et al. (1977). Its popularity comes from the fact It can estimate an underlying distribution when data is incomplete or has missing values. Two main applications 1 When missing values exists. 2 When a likelihood function can be simplified by assuming extra parameters that are missing or hidden. 12 / 126
  • 23.
    Images/ The Expectation Maximization TheEM algorithm It was first developed by Dempster et al. (1977). Its popularity comes from the fact It can estimate an underlying distribution when data is incomplete or has missing values. Two main applications 1 When missing values exists. 2 When a likelihood function can be simplified by assuming extra parameters that are missing or hidden. 12 / 126
  • 24.
    Images/ The Expectation Maximization TheEM algorithm It was first developed by Dempster et al. (1977). Its popularity comes from the fact It can estimate an underlying distribution when data is incomplete or has missing values. Two main applications 1 When missing values exists. 2 When a likelihood function can be simplified by assuming extra parameters that are missing or hidden. 12 / 126
  • 25.
    Images/ Outline 1 Introduction Maximum-Likelihood Expectation Maximization Examplesof Applications of EM 2 Incomplete Data Introduction Using the Expected Value Analogy 3 Derivation of the EM-Algorithm Hidden Features Proving Concavity Using the Concave Functions for Approximation From The Concave Function to the EM The Final Algorithm Notes and Convergence of EM 4 Finding Maximum Likelihood Mixture Densities The Beginning of The Process Bayes’ Rule for the components Mixing Parameters Maximizing Q using Lagrange Multipliers Lagrange Multipliers In Our Case Example on Mixture of Gaussian Distributions The EM Algorithm 13 / 126
  • 26.
    Images/ Clustering Given a seriesof data sets Given the fact that Radial Gaussian Functions are Universal Approximators . Samples {x1, x2, ..., xN } are the visible parameters The Gaussian distributions generating each of the samples are the hidden parameters Then, we model the cluster as a mixture of Gaussian’s 14 / 126
  • 27.
    Images/ Clustering Given a seriesof data sets Given the fact that Radial Gaussian Functions are Universal Approximators . Samples {x1, x2, ..., xN } are the visible parameters The Gaussian distributions generating each of the samples are the hidden parameters Then, we model the cluster as a mixture of Gaussian’s 14 / 126
  • 28.
    Images/ Natural Language Processing Unsupervisedinduction of probabilistic context-free grammars Here given a series of words o1, o2, o3, ... and normalized Context-Free Grammar We want to know the probabilities of each rule P (i → jk) Thus Here the you have two variables: The Visible Ones: The sequence of words The Hidden Ones: The rule that produces the possible sequence oi → oj 15 / 126
  • 29.
    Images/ Natural Language Processing Unsupervisedinduction of probabilistic context-free grammars Here given a series of words o1, o2, o3, ... and normalized Context-Free Grammar We want to know the probabilities of each rule P (i → jk) Thus Here the you have two variables: The Visible Ones: The sequence of words The Hidden Ones: The rule that produces the possible sequence oi → oj 15 / 126
  • 30.
    Images/ Natural Language Processing Baum-WelchAlgorithm for Hidden Markov Models Visible Data Hidden Data Here Hidden Variables: The circular nodes producing the data Visible Variables: The square nodes representing the samples. 16 / 126
  • 31.
    Images/ Natural Language Processing Baum-WelchAlgorithm for Hidden Markov Models Visible Data Hidden Data Here Hidden Variables: The circular nodes producing the data Visible Variables: The square nodes representing the samples. 16 / 126
  • 32.
    Images/ Outline 1 Introduction Maximum-Likelihood Expectation Maximization Examplesof Applications of EM 2 Incomplete Data Introduction Using the Expected Value Analogy 3 Derivation of the EM-Algorithm Hidden Features Proving Concavity Using the Concave Functions for Approximation From The Concave Function to the EM The Final Algorithm Notes and Convergence of EM 4 Finding Maximum Likelihood Mixture Densities The Beginning of The Process Bayes’ Rule for the components Mixing Parameters Maximizing Q using Lagrange Multipliers Lagrange Multipliers In Our Case Example on Mixture of Gaussian Distributions The EM Algorithm 17 / 126
  • 33.
    Images/ Incomplete Data We assumethe following Two parts of data 1 X = observed data or incomplete data 2 Y = unobserved data Thus Z = (X,Y)=Complete Data (9) Thus, we have the following probability p (z|Θ) = p (x, y|Θ) = p (y|x, Θ) p (x|Θ) (10) 18 / 126
  • 34.
    Images/ Incomplete Data We assumethe following Two parts of data 1 X = observed data or incomplete data 2 Y = unobserved data Thus Z = (X,Y)=Complete Data (9) Thus, we have the following probability p (z|Θ) = p (x, y|Θ) = p (y|x, Θ) p (x|Θ) (10) 18 / 126
  • 35.
    Images/ Incomplete Data We assumethe following Two parts of data 1 X = observed data or incomplete data 2 Y = unobserved data Thus Z = (X,Y)=Complete Data (9) Thus, we have the following probability p (z|Θ) = p (x, y|Θ) = p (y|x, Θ) p (x|Θ) (10) 18 / 126
  • 36.
    Images/ Incomplete Data We assumethe following Two parts of data 1 X = observed data or incomplete data 2 Y = unobserved data Thus Z = (X,Y)=Complete Data (9) Thus, we have the following probability p (z|Θ) = p (x, y|Θ) = p (y|x, Θ) p (x|Θ) (10) 18 / 126
  • 37.
    Images/ Incomplete Data We assumethe following Two parts of data 1 X = observed data or incomplete data 2 Y = unobserved data Thus Z = (X,Y)=Complete Data (9) Thus, we have the following probability p (z|Θ) = p (x, y|Θ) = p (y|x, Θ) p (x|Θ) (10) 18 / 126
  • 38.
    Images/ New Likelihood Function TheNew Likelihood Function L (Θ|Z) = L (Θ|X, Y) = p (X, Y|Θ) (11) Note: The complete data likelihood. Thus, we have L (Θ|X, Y) = p (X, Y|Θ) = p (Y|X, Θ) p (X|Θ) (12) Did you notice? p (X|Θ) is the likelihood of the observed data. p (Y|X, Θ) is the likelihood of the no-observed data under the observed data!!! 19 / 126
  • 39.
    Images/ New Likelihood Function TheNew Likelihood Function L (Θ|Z) = L (Θ|X, Y) = p (X, Y|Θ) (11) Note: The complete data likelihood. Thus, we have L (Θ|X, Y) = p (X, Y|Θ) = p (Y|X, Θ) p (X|Θ) (12) Did you notice? p (X|Θ) is the likelihood of the observed data. p (Y|X, Θ) is the likelihood of the no-observed data under the observed data!!! 19 / 126
  • 40.
    Images/ New Likelihood Function TheNew Likelihood Function L (Θ|Z) = L (Θ|X, Y) = p (X, Y|Θ) (11) Note: The complete data likelihood. Thus, we have L (Θ|X, Y) = p (X, Y|Θ) = p (Y|X, Θ) p (X|Θ) (12) Did you notice? p (X|Θ) is the likelihood of the observed data. p (Y|X, Θ) is the likelihood of the no-observed data under the observed data!!! 19 / 126
  • 41.
    Images/ Rewriting This can berewritten as L (Θ|X, Y) = hX,Θ (Y) (13) This basically signify that X, Θ are constant and the only random part is Y. In addition L (Θ|X) (14) It is known as the incomplete-data likelihood function. 20 / 126
  • 42.
    Images/ Rewriting This can berewritten as L (Θ|X, Y) = hX,Θ (Y) (13) This basically signify that X, Θ are constant and the only random part is Y. In addition L (Θ|X) (14) It is known as the incomplete-data likelihood function. 20 / 126
  • 43.
    Images/ Thus We can connectboth incomplete-complete data equations by doing the following L (Θ|X) =p (X|Θ) = Y p (X, Y|Θ) = Y p (Y|X, Θ) p (X|Θ) = N i=1 p (xi|Θ) Y p (Y|X, Θ) 21 / 126
  • 44.
    Images/ Thus We can connectboth incomplete-complete data equations by doing the following L (Θ|X) =p (X|Θ) = Y p (X, Y|Θ) = Y p (Y|X, Θ) p (X|Θ) = N i=1 p (xi|Θ) Y p (Y|X, Θ) 21 / 126
  • 45.
    Images/ Thus We can connectboth incomplete-complete data equations by doing the following L (Θ|X) =p (X|Θ) = Y p (X, Y|Θ) = Y p (Y|X, Θ) p (X|Θ) = N i=1 p (xi|Θ) Y p (Y|X, Θ) 21 / 126
  • 46.
    Images/ Thus We can connectboth incomplete-complete data equations by doing the following L (Θ|X) =p (X|Θ) = Y p (X, Y|Θ) = Y p (Y|X, Θ) p (X|Θ) = N i=1 p (xi|Θ) Y p (Y|X, Θ) 21 / 126
  • 47.
    Images/ Remarks Problems Normally, it isalmost impossible to obtain a closed analytical solution for the previous equation. However We can use the expected value of log p (X, Y|Θ), which allows us to find an iterative procedure to approximate the solution. 22 / 126
  • 48.
    Images/ Remarks Problems Normally, it isalmost impossible to obtain a closed analytical solution for the previous equation. However We can use the expected value of log p (X, Y|Θ), which allows us to find an iterative procedure to approximate the solution. 22 / 126
  • 49.
    Images/ The function wewould like to have The Q function We want an estimation of the complete-data log-likelihood log p (X, Y|Θ) (15) Based in the info provided by X, Θn−1 where Θn−1 is a previously estimated set of parameters at step n. Think about the following, if we want to remove Y ˆ [log p (X, Y|Θ)] p (Y|X, Θn−1) dY (16) Remark: We integrate out Y - Actually, this is the expected value of log p (X, Y|Θ). 23 / 126
  • 50.
    Images/ The function wewould like to have The Q function We want an estimation of the complete-data log-likelihood log p (X, Y|Θ) (15) Based in the info provided by X, Θn−1 where Θn−1 is a previously estimated set of parameters at step n. Think about the following, if we want to remove Y ˆ [log p (X, Y|Θ)] p (Y|X, Θn−1) dY (16) Remark: We integrate out Y - Actually, this is the expected value of log p (X, Y|Θ). 23 / 126
  • 51.
    Images/ Outline 1 Introduction Maximum-Likelihood Expectation Maximization Examplesof Applications of EM 2 Incomplete Data Introduction Using the Expected Value Analogy 3 Derivation of the EM-Algorithm Hidden Features Proving Concavity Using the Concave Functions for Approximation From The Concave Function to the EM The Final Algorithm Notes and Convergence of EM 4 Finding Maximum Likelihood Mixture Densities The Beginning of The Process Bayes’ Rule for the components Mixing Parameters Maximizing Q using Lagrange Multipliers Lagrange Multipliers In Our Case Example on Mixture of Gaussian Distributions The EM Algorithm 24 / 126
  • 52.
    Images/ Use the ExpectedValue Then, we want an iterative method to guess Θ from Θn−1 Q (Θ, Θn−1) = E [log p (X, Y|Θ) |X, Θn−1] (17) Take in account that 1 X, Θn−1 are taken as constants. 2 Θ is a normal variable that we wish to adjust. 3 Y is a random variable governed by distribution p (Y|X, Θn−1)=marginal distribution of missing data. 25 / 126
  • 53.
    Images/ Use the ExpectedValue Then, we want an iterative method to guess Θ from Θn−1 Q (Θ, Θn−1) = E [log p (X, Y|Θ) |X, Θn−1] (17) Take in account that 1 X, Θn−1 are taken as constants. 2 Θ is a normal variable that we wish to adjust. 3 Y is a random variable governed by distribution p (Y|X, Θn−1)=marginal distribution of missing data. 25 / 126
  • 54.
    Images/ Use the ExpectedValue Then, we want an iterative method to guess Θ from Θn−1 Q (Θ, Θn−1) = E [log p (X, Y|Θ) |X, Θn−1] (17) Take in account that 1 X, Θn−1 are taken as constants. 2 Θ is a normal variable that we wish to adjust. 3 Y is a random variable governed by distribution p (Y|X, Θn−1)=marginal distribution of missing data. 25 / 126
  • 55.
    Images/ Use the ExpectedValue Then, we want an iterative method to guess Θ from Θn−1 Q (Θ, Θn−1) = E [log p (X, Y|Θ) |X, Θn−1] (17) Take in account that 1 X, Θn−1 are taken as constants. 2 Θ is a normal variable that we wish to adjust. 3 Y is a random variable governed by distribution p (Y|X, Θn−1)=marginal distribution of missing data. 25 / 126
  • 56.
    Images/ Another Interpretation Given theprevious information E [log p (X, Y|Θ) |X, Θn−1] = ´ Y∈Y log p (X, Y|Θ) p (Y|X, Θn−1) dY Something Notable 1 In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameter Θn−1. 2 In the worst of cases, this density might be very hard to obtain. Actually, we use p (Y, X|Θn−1) = p (Y|X, Θn−1) p (X|Θn−1) (18) which is not dependent on Θ. 26 / 126
  • 57.
    Images/ Another Interpretation Given theprevious information E [log p (X, Y|Θ) |X, Θn−1] = ´ Y∈Y log p (X, Y|Θ) p (Y|X, Θn−1) dY Something Notable 1 In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameter Θn−1. 2 In the worst of cases, this density might be very hard to obtain. Actually, we use p (Y, X|Θn−1) = p (Y|X, Θn−1) p (X|Θn−1) (18) which is not dependent on Θ. 26 / 126
  • 58.
    Images/ Another Interpretation Given theprevious information E [log p (X, Y|Θ) |X, Θn−1] = ´ Y∈Y log p (X, Y|Θ) p (Y|X, Θn−1) dY Something Notable 1 In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameter Θn−1. 2 In the worst of cases, this density might be very hard to obtain. Actually, we use p (Y, X|Θn−1) = p (Y|X, Θn−1) p (X|Θn−1) (18) which is not dependent on Θ. 26 / 126
  • 59.
    Images/ Another Interpretation Given theprevious information E [log p (X, Y|Θ) |X, Θn−1] = ´ Y∈Y log p (X, Y|Θ) p (Y|X, Θn−1) dY Something Notable 1 In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameter Θn−1. 2 In the worst of cases, this density might be very hard to obtain. Actually, we use p (Y, X|Θn−1) = p (Y|X, Θn−1) p (X|Θn−1) (18) which is not dependent on Θ. 26 / 126
  • 60.
    Images/ Outline 1 Introduction Maximum-Likelihood Expectation Maximization Examplesof Applications of EM 2 Incomplete Data Introduction Using the Expected Value Analogy 3 Derivation of the EM-Algorithm Hidden Features Proving Concavity Using the Concave Functions for Approximation From The Concave Function to the EM The Final Algorithm Notes and Convergence of EM 4 Finding Maximum Likelihood Mixture Densities The Beginning of The Process Bayes’ Rule for the components Mixing Parameters Maximizing Q using Lagrange Multipliers Lagrange Multipliers In Our Case Example on Mixture of Gaussian Distributions The EM Algorithm 27 / 126
  • 61.
    Images/ Back to theQ function The intuition We have the following analogy: Consider h (θ, Y ) a function θ a constant Y ∼ pY (y), a random variable with distribution pY (y). Thus, if Y is a discrete random variable q (θ) = EY [h (θ, Y )] = y h (θ, y) pY (y) (19) 28 / 126
  • 62.
    Images/ Back to theQ function The intuition We have the following analogy: Consider h (θ, Y ) a function θ a constant Y ∼ pY (y), a random variable with distribution pY (y). Thus, if Y is a discrete random variable q (θ) = EY [h (θ, Y )] = y h (θ, y) pY (y) (19) 28 / 126
  • 63.
    Images/ Back to theQ function The intuition We have the following analogy: Consider h (θ, Y ) a function θ a constant Y ∼ pY (y), a random variable with distribution pY (y). Thus, if Y is a discrete random variable q (θ) = EY [h (θ, Y )] = y h (θ, y) pY (y) (19) 28 / 126
  • 64.
    Images/ Back to theQ function The intuition We have the following analogy: Consider h (θ, Y ) a function θ a constant Y ∼ pY (y), a random variable with distribution pY (y). Thus, if Y is a discrete random variable q (θ) = EY [h (θ, Y )] = y h (θ, y) pY (y) (19) 28 / 126
  • 65.
    Images/ Back to theQ function The intuition We have the following analogy: Consider h (θ, Y ) a function θ a constant Y ∼ pY (y), a random variable with distribution pY (y). Thus, if Y is a discrete random variable q (θ) = EY [h (θ, Y )] = y h (θ, y) pY (y) (19) 28 / 126
  • 66.
    Images/ Why E-step!!! From herethe name This is basically the E-step The second step It tries to maximize the Q function Θn = argmaxΘQ (Θ, Θn−1) (20) 29 / 126
  • 67.
    Images/ Why E-step!!! From herethe name This is basically the E-step The second step It tries to maximize the Q function Θn = argmaxΘQ (Θ, Θn−1) (20) 29 / 126
  • 68.
    Images/ Why E-step!!! From herethe name This is basically the E-step The second step It tries to maximize the Q function Θn = argmaxΘQ (Θ, Θn−1) (20) 29 / 126
  • 69.
Derivation of the EM-Algorithm

The likelihood function we are going to use
Let X be a random vector which results from a parametrized family:

  L(Θ) = ln P(X|Θ)    (21)

Note: ln(x) is a strictly increasing function.

We wish to compute Θ
Based on an estimate Θ_n (after the n-th iteration), we want an update such that L(Θ) > L(Θ_n), or equivalently the maximization of the difference

  L(Θ) − L(Θ_n) = ln P(X|Θ) − ln P(X|Θ_n)    (22)
Introducing the Hidden Features

Given that the hidden random vector Y exists, with values y,

  P(X|Θ) = Σ_y P(X|y, Θ) P(y|Θ)    (23)

Thus, using this in the difference L(Θ) − L(Θ_n):

  L(Θ) − L(Θ_n) = ln Σ_y P(X|y, Θ) P(y|Θ) − ln P(X|Θ_n)    (24)
Here, we introduce some concepts of convexity

Theorem (Jensen's inequality)
Let f be a convex function defined on an interval I. If x_1, x_2, ..., x_n ∈ I and λ_1, λ_2, ..., λ_n ≥ 0 with Σ_{i=1}^{n} λ_i = 1, then

  f(Σ_{i=1}^{n} λ_i x_i) ≤ Σ_{i=1}^{n} λ_i f(x_i)    (25)
Proof

For n = 1
We have the trivial case.

For n = 2
This is the definition of convexity.

Now the inductive hypothesis
We assume that the theorem is true for some n.

Now, we have
The following linear combination for the λ_i:

  f(Σ_{i=1}^{n+1} λ_i x_i) = f(λ_{n+1} x_{n+1} + Σ_{i=1}^{n} λ_i x_i)
                           = f(λ_{n+1} x_{n+1} + (1 − λ_{n+1}) · [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i x_i)
                           ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f([1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i x_i)

Did you notice?
Something notable:

  Σ_{i=1}^{n+1} λ_i = 1,   thus   Σ_{i=1}^{n} λ_i = 1 − λ_{n+1},   and finally   [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i = 1

so the rescaled weights λ_i/(1 − λ_{n+1}), i = 1, ..., n, sum to one and the inductive hypothesis applies to them.

Now
We have that

  f(Σ_{i=1}^{n+1} λ_i x_i) ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f([1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i x_i)
                           ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) · [1/(1 − λ_{n+1})] Σ_{i=1}^{n} λ_i f(x_i)
                           = λ_{n+1} f(x_{n+1}) + Σ_{i=1}^{n} λ_i f(x_i) = Σ_{i=1}^{n+1} λ_i f(x_i)

Q.E.D.
Thus, for concave functions

It is possible to show that
Given that ln(x) is a concave function:

  ln(Σ_{i=1}^{n} λ_i x_i) ≥ Σ_{i=1}^{n} λ_i ln(x_i)

If we take into consideration
Assume that λ_i = P(y|X, Θ_n). We know that
 1 P(y|X, Θ_n) ≥ 0
 2 Σ_y P(y|X, Θ_n) = 1
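A quick numerical sanity check, a sketch with assumed random numbers rather than part of the derivation, of the concave form used here, ln(Σ_i λ_i x_i) ≥ Σ_i λ_i ln(x_i), with weights that are non-negative and sum to one, exactly like P(y|X, Θ_n):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.1, 5.0, size=6)     # positive points, since ln requires x > 0
    lam = rng.random(6)
    lam = lam / lam.sum()                 # non-negative weights summing to one

    lhs = np.log(np.dot(lam, x))          # ln(sum_i lambda_i x_i)
    rhs = np.dot(lam, np.log(x))          # sum_i lambda_i ln(x_i)
    print(lhs >= rhs)                     # True: Jensen's inequality for the concave ln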
We have

First:

  L(Θ) − L(Θ_n) = ln Σ_y P(X|y, Θ) P(y|Θ) − ln P(X|Θ_n)
                = ln Σ_y P(X|y, Θ) P(y|Θ) · [P(y|X, Θ_n)/P(y|X, Θ_n)] − ln P(X|Θ_n)
                = ln Σ_y P(y|X, Θ_n) [P(X|y, Θ) P(y|Θ) / P(y|X, Θ_n)] − ln P(X|Θ_n)
                ≥ Σ_y P(y|X, Θ_n) ln [P(X|y, Θ) P(y|Θ) / P(y|X, Θ_n)] − Σ_y P(y|X, Θ_n) ln P(X|Θ_n)

Why this last term?

Next
Because Σ_y P(y|X, Θ_n) = 1, multiplying the constant ln P(X|Θ_n) by this sum changes nothing, and the two sums can be merged:

  L(Θ) − L(Θ_n) ≥ Σ_y P(y|X, Θ_n) ln [ P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n)) ]

and we define this right-hand side as Δ(Θ|Θ_n).
Then, we have

Then, we have proved that

  L(Θ) ≥ L(Θ_n) + Δ(Θ|Θ_n)    (26)

Then, we define a new function

  l(Θ|Θ_n) = L(Θ_n) + Δ(Θ|Θ_n)    (27)

Thus
l(Θ|Θ_n) is bounded from above by L(Θ), i.e. l(Θ|Θ_n) ≤ L(Θ).

Now, we can do the following
We evaluate at Θ_n:

  l(Θ_n|Θ_n) = L(Θ_n) + Δ(Θ_n|Θ_n)
             = L(Θ_n) + Σ_y P(y|X, Θ_n) ln [ P(X|y, Θ_n) P(y|Θ_n) / (P(y|X, Θ_n) P(X|Θ_n)) ]
             = L(Θ_n) + Σ_y P(y|X, Θ_n) ln [ P(X, y|Θ_n) / P(X, y|Θ_n) ]
             = L(Θ_n)

This means that
For Θ = Θ_n, the functions L(Θ) and l(Θ|Θ_n) are equal.
Therefore

The function l(Θ|Θ_n) has the following properties:
 1 It is bounded from above by L(Θ), i.e. l(Θ|Θ_n) ≤ L(Θ).
 2 For Θ = Θ_n, the functions L(Θ) and l(Θ|Θ_n) are equal.
 3 The function l(Θ|Θ_n) is concave... How?
First

We have the value L(Θ_n)
We know that L(Θ_n) is constant, i.e. an offset value.

What about Δ(Θ|Θ_n)?

  Σ_y P(y|X, Θ_n) ln [ P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n)) ]

We have that ln is a concave function applied to

  P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n))

Therefore
Each element is concave:

  P(y|X, Θ_n) ln [ P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n)) ]

Therefore, the sum of concave functions is a concave function:

  Σ_y P(y|X, Θ_n) ln [ P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n)) ]
Given the Concave Function

Thus, we have that
 1 We can select the Θ that maximizes l(Θ|Θ_n).
 2 Thus, given a Θ_n, we can generate Θ_{n+1}.

The process can be seen in the following graph.
Given The Previous Constraints
 1 l(Θ|Θ_n) is bounded from above by L(Θ): l(Θ|Θ_n) ≤ L(Θ).
 2 For Θ = Θ_n, the functions L(Θ) and l(Θ|Θ_n) are equal: L(Θ_n) = l(Θ_n|Θ_n).
 3 The function l(Θ|Θ_n) is concave.
From

The following:

  Θ_{n+1} = argmax_Θ {l(Θ|Θ_n)}
          = argmax_Θ { L(Θ_n) + Σ_y P(y|X, Θ_n) ln [ P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n)) ] }

The terms involving only Θ_n are constants:

          ≈ argmax_Θ { Σ_y P(y|X, Θ_n) ln (P(X|y, Θ) P(y|Θ)) }
          = argmax_Θ { Σ_y P(y|X, Θ_n) ln [ (P(X, y, Θ)/P(y, Θ)) · (P(y, Θ)/P(Θ)) ] }

Thus

Then:

  Θ_{n+1} = argmax_Θ { Σ_y P(y|X, Θ_n) ln [ P(X, y, Θ)/P(Θ) ] }
          = argmax_Θ { Σ_y P(y|X, Θ_n) ln P(X, y|Θ) }
          = argmax_Θ E_{y|X,Θ_n} [ ln P(X, y|Θ) ]

Then

  argmax_Θ {l(Θ|Θ_n)} ≈ argmax_Θ E_{y|X,Θ_n} [ ln P(X, y|Θ) ]
The EM-Algorithm

Steps of EM
 1 Expectation under the hidden variables.
 2 Maximization of the resulting formula.

E-Step
Determine the conditional expectation E_{y|X,Θ_n} [ln P(X, y|Θ)].

M-Step
Maximize this expression with respect to Θ.
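The two steps can be put together into a small working loop. The sketch below runs EM on a two-component one-dimensional Gaussian mixture, anticipating the mixture example developed in the next part; the synthetic data, the number of components and the initial guesses are all assumptions made only for illustration. Note that the printed log-likelihood never decreases (up to numerical precision), which is the convergence property discussed next.

    import numpy as np

    rng = np.random.default_rng(42)
    # Assumed synthetic data: two Gaussian clusters in 1-D
    X = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])

    def normal_pdf(x, mu, var):
        return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

    # Initial guess Theta_0 = (alpha, mu, var) for M = 2 components
    alpha = np.array([0.5, 0.5])
    mu = np.array([-1.0, 1.0])
    var = np.array([1.0, 1.0])

    for n in range(30):
        # E-step: responsibilities p(l | x_i, Theta_n) via Bayes' rule
        dens = np.stack([a * normal_pdf(X, m, v) for a, m, v in zip(alpha, mu, var)])  # (2, N)
        resp = dens / dens.sum(axis=0)

        # M-step: maximize E_{y|X,Theta_n}[ln P(X, y | Theta)] in closed form
        Nl = resp.sum(axis=1)
        alpha = Nl / X.size
        mu = (resp * X).sum(axis=1) / Nl
        var = (resp * (X - mu[:, None]) ** 2).sum(axis=1) / Nl

        log_lik = np.log(dens.sum(axis=0)).sum()   # ln P(X | Theta_n), computed before the update
        print(n, log_lik)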
Notes and Convergence of EM

Gains between L(Θ) and l(Θ|Θ_n)
Using the hidden variables, it is possible to simplify the optimization of L(Θ) through l(Θ|Θ_n).

Convergence
Remember that Θ_{n+1} is the estimate for Θ which maximizes the difference Δ(Θ|Θ_n).

Therefore
Then, we have: given the initial estimate of Θ by Θ_n,

  Δ(Θ_n|Θ_n) = 0

Now
If we choose Θ_{n+1} to maximize Δ(Θ|Θ_n), then

  Δ(Θ_{n+1}|Θ_n) ≥ Δ(Θ_n|Θ_n) = 0

We have that
The likelihood L(Θ_n) is non-decreasing from one iteration to the next.
Notes and Convergence of EM

Properties
When the algorithm reaches a fixed point for some Θ_n, that value maximizes l(Θ|Θ_n).

Definition
A fixed point of a function is an element of its domain that is mapped to itself by the function:

  f(x) = x

Basically, the EM algorithm does the following:

  EM[Θ*] = Θ*

At this moment
We have that when the algorithm reaches a fixed point for some Θ_n, that value Θ* maximizes l(Θ|Θ_n); basically, Θ_{n+1} = Θ_n.
Then
If L and l are differentiable at Θ_n: since L and l are equal at Θ_n, Θ_n is a stationary point of L, i.e. the derivative of L vanishes at that point (a local maximum, as in the figure).

However
You could end up in the following case, which is not a local maximum: a saddle point (see the figure).

For more on the subject
Please take a look at Geoffrey McLachlan and Thriyambakam Krishnan, "The EM Algorithm and Extensions," John Wiley & Sons, New York, 1996.
Finding Maximum Likelihood Mixture Densities Parameters via EM

Something Notable
The mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community.

We have

  p(x|Θ) = Σ_{i=1}^{M} α_i p_i(x|θ_i)    (28)

where
 1 Θ = (α_1, ..., α_M, θ_1, ..., θ_M)
 2 Σ_{i=1}^{M} α_i = 1
 3 Each p_i is a density function parametrized by θ_i.
A log-likelihood for this function

We have

  log L(Θ|X) = log Π_{i=1}^{N} p(x_i|Θ) = Σ_{i=1}^{N} log ( Σ_{j=1}^{M} α_j p_j(x_i|θ_j) )    (29)

Note: this is too difficult to optimize directly because of the log of a sum.

However
We can simplify this by assuming the following:
 1 Each sample has an unobserved label: Y = {y_i}_{i=1}^{N} with y_i ∈ {1, ..., M}.
 2 y_i = k if the i-th sample was generated by the k-th mixture component.
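A minimal sketch of equations (28) and (29) for Gaussian components: the mixture density is a weighted sum of component densities, and the log-likelihood sums the log of that mixture over the data. The data set and the parameter values Θ below are assumptions chosen only to make the computation concrete.

    import numpy as np

    def normal_pdf(x, mu, var):
        return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

    # Assumed parameters Theta = (alpha_1..alpha_M, theta_1..theta_M), with M = 2
    alpha = np.array([0.3, 0.7])                 # mixing weights, sum to one
    mu = np.array([0.0, 4.0])
    var = np.array([1.0, 2.0])

    def mixture_pdf(x):
        # Equation (28): weighted sum of the component densities
        return sum(a * normal_pdf(x, m, v) for a, m, v in zip(alpha, mu, var))

    X = np.array([-0.5, 0.2, 3.1, 4.4, 5.0])     # assumed data set
    log_likelihood = np.sum(np.log(mixture_pdf(X)))   # Equation (29)
    print(mixture_pdf(0.0), log_likelihood)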
Now

We have

  log L(Θ|X, Y) = log [P(X, Y|Θ)]    (30)

Remember that X = {x_1, x_2, ..., x_N} with Y = {y_1, y_2, ..., y_N}; assuming independence,

  log [P(X, Y|Θ)] = log [P(x_1, x_2, ..., x_N, y_1, y_2, ..., y_N|Θ)]
                  = log [P(x_1, y_1, ..., x_i, y_i, ..., x_N, y_N|Θ)]
                  = log Π_{i=1}^{N} P(x_i, y_i|Θ)
                  = Σ_{i=1}^{N} log P(x_i, y_i|Θ)
Then

Thus, by the chain rule,

  Σ_{i=1}^{N} log P(x_i, y_i|Θ) = Σ_{i=1}^{N} log [P(x_i|y_i, θ_{y_i}) P(y_i|θ_{y_i})]    (31)

Question
Do you need y_i if you know θ_{y_i}, or the other way around?

Finally

  Σ_{i=1}^{N} log [P(x_i|y_i, θ_{y_i}) P(y_i|θ_{y_i})] = Σ_{i=1}^{N} log [P(y_i) p_{y_i}(x_i|θ_{y_i})]    (32)

No: you do not need y_i if you know θ_{y_i}, or the other way around.

Finally, we have
Making α_{y_i} = P(y_i),

  log L(Θ|X, Y) = Σ_{i=1}^{N} log [α_{y_i} p_{y_i}(x_i|θ_{y_i})]    (33)
Problem

Which labels?
We do not know the values of Y.

We can get away with this by using the following idea
Assume that Y is a random variable.
Thus

You do a first guess for the parameters at the beginning of EM:

  Θ^g = (α^g_1, ..., α^g_M, θ^g_1, ..., θ^g_M)    (34)

Then, given the parametric form, it is possible to calculate p_j(x_i|θ^g_j).

Therefore
The mixing parameters α_j can be thought of as prior probabilities of each mixture component:

  α_j = p(component j)    (35)
We want to calculate the following probability

We want to calculate

  p(y_i|x_i, Θ^g)

Basically
We want a Bayesian formulation of this probability, assuming that y = (y_1, y_2, ..., y_N) are independent and identically distributed samples.
Using Bayes' Rule

Compute

  p(y_i|x_i, Θ^g) = p(y_i, x_i|Θ^g) / p(x_i|Θ^g)
                  = p(x_i|y_i, θ^g_{y_i}) p(y_i|Θ^g) / p(x_i|Θ^g)
                  = α^g_{y_i} p_{y_i}(x_i|θ^g_{y_i}) / p(x_i|Θ^g)
                  = α^g_{y_i} p_{y_i}(x_i|θ^g_{y_i}) / Σ_{k=1}^{M} α^g_k p_k(x_i|θ^g_k)

(knowing y_i picks out component y_i, so the conditional density is p_{y_i}(·|θ^g_{y_i}) and p(y_i|Θ^g) = α^g_{y_i}).
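The posterior just derived is easy to compute once the component densities can be evaluated. The sketch below fills in Gaussian components (the actual p_k is not fixed by this slide, so this is an assumption) and checks that, for each sample, the responsibilities sum to one.

    import numpy as np

    def normal_pdf(x, mu, var):
        return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

    # Assumed current guess Theta^g for an M = 3 component mixture
    alpha_g = np.array([0.2, 0.5, 0.3])
    mu_g = np.array([-3.0, 0.0, 2.5])
    var_g = np.array([1.0, 0.5, 2.0])

    X = np.array([-2.7, 0.1, 0.4, 2.0, 3.3])     # assumed samples x_i

    # p(y_i = k | x_i, Theta^g) = alpha_k p_k(x_i|theta_k) / sum_m alpha_m p_m(x_i|theta_m)
    numer = alpha_g[:, None] * normal_pdf(X[None, :], mu_g[:, None], var_g[:, None])  # (M, N)
    resp = numer / numer.sum(axis=0)

    print(resp.sum(axis=0))   # each column sums to one, as a posterior over components must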
As in Naive Bayes

We have one such posterior per component and per sample:

  p(y_i|x_i, Θ^g) = α^g_{y_i} p_{y_i}(x_i|θ^g_{y_i}) / Σ_{k=1}^{M} α^g_k p_k(x_i|θ^g_k)   for all x_i, y_i, with k ∈ {1, ..., M}

This is going to be updated at each iteration of the EM algorithm, after the initial guess, until convergence.
Additionally

We assume again that the samples y_i are independent and identically distributed:

  p(y|X, Θ^g) = Π_{i=1}^{N} p(y_i|x_i, Θ^g)    (36)

where y = (y_1, y_2, ..., y_N).
Now, using equation 17

Then

  Q(Θ|Θ^g) = Σ_{y∈Y} log (L(Θ|X, y)) p(y|X, Θ^g)
           = Σ_{y∈Y} Σ_{i=1}^{N} log [α_{y_i} p_{y_i}(x_i|θ_{y_i})] Π_{j=1}^{N} p(y_j|x_j, Θ^g)
Here, a small stop

What is the meaning of Σ_{y∈Y}?
It is actually a summation over all possible states of the random vector y. Then, we can rewrite the previous summation as

  Σ_{y∈Y} = Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M}

that is, N nested sums, one per sample in {x_1, x_2, ..., x_N}.
Then

We have

  Q(Θ|Θ^g) = Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} Σ_{i=1}^{N} ( log [α_{y_i} p_{y_i}(x_i|θ_{y_i})] Π_{j=1}^{N} p(y_j|x_j, Θ^g) )
We introduce the following

We have the following indicator function:

  δ_{l,y_i} = 1 if l = y_i,  and  δ_{l,y_i} = 0 if l ≠ y_i

Therefore, we can do the following:

  α_{y_i} = Σ_{l=1}^{M} δ_{l,y_i} α_l

Then

  log [α_{y_i} p_{y_i}(x_i|θ_{y_i})] Π_{j=1}^{N} p(y_j|x_j, Θ^g) = Σ_{l=1}^{M} δ_{l,y_i} log [α_l p_l(x_i|θ_l)] Π_{j=1}^{N} p(y_j|x_j, Θ^g)
Thus

We have that

  Σ_{y_1=1}^{M} · · · Σ_{y_N=1}^{M} Σ_{i=1}^{N} log [α_{y_i} p_{y_i}(x_i|θ_{y_i})] Π_{j=1}^{N} p(y_j|x_j, Θ^g) =
    = Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} Σ_{i=1}^{N} Σ_{l=1}^{M} δ_{l,y_i} log [α_l p_l(x_i|θ_l)] Π_{j=1}^{N} p(y_j|x_j, Θ^g)
    = Σ_{i=1}^{N} Σ_{l=1}^{M} log [α_l p_l(x_i|θ_l)] Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} ( δ_{l,y_i} Π_{j=1}^{N} p(y_j|x_j, Θ^g) )

because Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} applies only to δ_{l,y_i} Π_{j=1}^{N} p(y_j|x_j, Θ^g).
Then, we have that

First notice the following:

  Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} δ_{l,y_i} Π_{j=1}^{N} p(y_j|x_j, Θ^g) =
    = Σ_{y_1=1}^{M} · · · Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} · · · Σ_{y_N=1}^{M} ( Σ_{y_i=1}^{M} δ_{l,y_i} p(y_i|x_i, Θ^g) ) Π_{j=1, j≠i}^{N} p(y_j|x_j, Θ^g)

Then, we have

  Σ_{y_i=1}^{M} δ_{l,y_i} p(y_i|x_i, Θ^g) = p(l|x_i, Θ^g)
In this way

Plugging back the previous equation:

  Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} δ_{l,y_i} Π_{j=1}^{N} p(y_j|x_j, Θ^g) =
    = Σ_{y_1=1}^{M} · · · Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} · · · Σ_{y_N=1}^{M} p(l|x_i, Θ^g) Π_{j=1, j≠i}^{N} p(y_j|x_j, Θ^g)
    = ( Σ_{y_1=1}^{M} · · · Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} · · · Σ_{y_N=1}^{M} Π_{j=1, j≠i}^{N} p(y_j|x_j, Θ^g) ) p(l|x_i, Θ^g)
Now, what about...?

The left part of the equation:

  Σ_{y_1=1}^{M} · · · Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} · · · Σ_{y_N=1}^{M} Π_{j=1, j≠i}^{N} p(y_j|x_j, Θ^g) =
    = ( Σ_{y_1=1}^{M} p(y_1|x_1, Θ^g) ) · · · ( Σ_{y_{i−1}=1}^{M} p(y_{i−1}|x_{i−1}, Θ^g) ) × ( Σ_{y_{i+1}=1}^{M} p(y_{i+1}|x_{i+1}, Θ^g) ) · · · ( Σ_{y_N=1}^{M} p(y_N|x_N, Θ^g) )
    = Π_{j=1, j≠i}^{N} ( Σ_{y_j=1}^{M} p(y_j|x_j, Θ^g) )
Then, we have that

Plugging back into the original equation:

  ( Σ_{y_1=1}^{M} · · · Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} · · · Σ_{y_N=1}^{M} Π_{j=1, j≠i}^{N} p(y_j|x_j, Θ^g) ) p(l|x_i, Θ^g) =
    = ( Π_{j=1, j≠i}^{N} ( Σ_{y_j=1}^{M} p(y_j|x_j, Θ^g) ) ) p(l|x_i, Θ^g)
We can use properties of probability

We know that

  Σ_{y_i=1}^{M} p(y_i|x_i, Θ^g) = 1    (37)

Then

  ( Π_{j=1, j≠i}^{N} ( Σ_{y_j=1}^{M} p(y_j|x_j, Θ^g) ) ) p(l|x_i, Θ^g) = ( Π_{j=1, j≠i}^{N} 1 ) p(l|x_i, Θ^g)
                                                                      = p(l|x_i, Θ^g)
                                                                      = α^g_l p_l(x_i|θ^g_l) / Σ_{k=1}^{M} α^g_k p_k(x_i|θ^g_k)
Thus

We can write Q in the following way:

  Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{l=1}^{M} log [α_l p_l(x_i|θ_l)] p(l|x_i, Θ^g)
            = Σ_{i=1}^{N} Σ_{l=1}^{M} log(α_l) p(l|x_i, Θ^g) + Σ_{i=1}^{N} Σ_{l=1}^{M} log(p_l(x_i|θ_l)) p(l|x_i, Θ^g)    (38)
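Given a responsibility matrix p(l|x_i, Θ^g), equation (38) is just two weighted sums. The following sketch evaluates both terms for assumed Gaussian components and assumed parameters; it is only meant to show the structure of Q, not a particular model.

    import numpy as np

    def normal_pdf(x, mu, var):
        return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

    X = np.array([-2.7, 0.1, 0.4, 2.0, 3.3])          # assumed data
    alpha_g = np.array([0.4, 0.6]); mu_g = np.array([-2.0, 1.0]); var_g = np.array([1.0, 1.0])

    # Responsibilities p(l | x_i, Theta^g) under the current guess
    numer = alpha_g[:, None] * normal_pdf(X[None, :], mu_g[:, None], var_g[:, None])
    resp = numer / numer.sum(axis=0)                  # shape (M, N)

    def Q(alpha, mu, var):
        # Equation (38): sum_i sum_l [log(alpha_l) + log p_l(x_i|theta_l)] p(l|x_i, Theta^g)
        log_alpha_term = (np.log(alpha)[:, None] * resp).sum()
        log_dens_term = (np.log(normal_pdf(X[None, :], mu[:, None], var[:, None])) * resp).sum()
        return log_alpha_term + log_dens_term

    print(Q(alpha_g, mu_g, var_g))                    # Q evaluated at the current guess itself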
A Method

That could be used as a general framework
To solve problems set up as an EM problem:
  First, we will look at the Lagrange multipliers setup.
  Then, we will look at a specific case using a mixture of Gaussians.

Note
Not all mixtures of distributions will give you an analytical solution.
Lagrange Multipliers

The method of Lagrange multipliers
Gives a set of necessary conditions to identify optimal points of equality-constrained optimization problems.

This is done by converting a constrained problem
To an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers.
Lagrange Multipliers

The classical problem formulation

  min f(x_1, x_2, ..., x_n)
  s.t. h_1(x_1, x_2, ..., x_n) = 0

It can be converted into

  min L(x_1, x_2, ..., x_n, λ) = min { f(x_1, x_2, ..., x_n) − λ h_1(x_1, x_2, ..., x_n) }    (39)

where
L(x, λ) is the Lagrangian function, and λ is an unspecified positive or negative constant called the Lagrange multiplier.
Lagrange Multipliers are applications of Physics

Think about this
Remember the First Law of Newton: a system in equilibrium does not move (a static body).
Lagrange Multipliers

Definition
Gives a set of necessary conditions to identify optimal points of equality-constrained optimization problems.
Lagrange was a Physicist

He was thinking of the following formula
A system in equilibrium satisfies:

  F_1 + F_2 + ... + F_K = 0    (40)

But functions do not have forces? Are you sure?
Think about the following: the gradient of a surface.
Gradient to a Surface

After all, a gradient is a measure of the maximal change. For example, the gradient of a function of three variables:

  ∇f(x) = i ∂f(x)/∂x + j ∂f(x)/∂y + k ∂f(x)/∂z    (41)

where i, j and k are unit vectors in the directions x, y and z.
Example
We have f(x, y) = x exp{−x² − y²}.

Example
With the gradient at the contours when projecting onto the 2D plane (figure).
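For the example function used on these slides, the gradient of equation (41) can be written out and checked numerically. The finite-difference comparison below is a sketch added for illustration; the test point is an arbitrary assumption.

    import numpy as np

    f = lambda x, y: x * np.exp(-x**2 - y**2)

    def grad_f(x, y):
        # Analytic gradient: df/dx = (1 - 2x^2) e^{-x^2 - y^2},  df/dy = -2xy e^{-x^2 - y^2}
        e = np.exp(-x**2 - y**2)
        return np.array([(1.0 - 2.0 * x**2) * e, -2.0 * x * y * e])

    x0, y0, h = 0.7, -0.3, 1e-6               # an arbitrary test point (assumed)
    numeric = np.array([(f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h),
                        (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)])
    print(np.allclose(grad_f(x0, y0), numeric, atol=1e-6))   # True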
Now, think about this

Yes, we can use the gradient
However, we need to do some scaling of the forces by using parameters λ. Thus, we have

  F_0 + λ_1 F_1 + ... + λ_K F_K = 0    (42)

where F_0 is the gradient of the principal cost function and F_i, for i = 1, 2, ..., K, are the gradients of the constraints.
Thus

If we have the following optimization:

  min f(x)
  s.t. g_1(x) = 0
       g_2(x) = 0
Geometric interpretation in the case of minimization

Here the cost function is f(x, y) = x exp{−x² − y²}, and we want to minimize f(x) subject to g_1(x) = 0 and g_2(x) = 0 (see the contour figure).

What is wrong?
The gradients are going in the other direction; we can fix this by simply multiplying by −1:

  −∇f(x) + λ_1 ∇g_1(x) + λ_2 ∇g_2(x) = 0

Nevertheless, it is equivalent to

  ∇f(x) − λ_1 ∇g_1(x) − λ_2 ∇g_2(x) = 0
Method

Steps
 1 The original problem is rewritten as: minimize L(x, λ) = f(x) − λ h_1(x).
 2 Take the derivatives of L(x, λ) with respect to each x_i and set them equal to zero.
 3 Express all x_i in terms of the Lagrangian multiplier λ.
 4 Plug x in terms of λ into the constraint h_1(x) = 0 and solve for λ.
 5 Calculate x by using the just-found value for λ.

From step 2
If there are n variables (i.e., x_1, · · · , x_n) then you will get n equations with n + 1 unknowns (i.e., n variables x_i and one Lagrangian multiplier λ).
Example

We can apply this to the following problem:

  min f(x, y) = x² − 8x + y² − 12y + 48
  s.t. x + y = 8
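This example can be worked through directly with the steps above; the short sympy sketch below carries them out, setting the partial derivatives of L(x, y, λ) = f(x, y) − λ(x + y − 8) to zero and solving.

    import sympy as sp

    x, y, lam = sp.symbols('x y lam')
    f = x**2 - 8*x + y**2 - 12*y + 48
    h = x + y - 8                                    # the equality constraint h(x, y) = 0
    L = f - lam * h                                  # Lagrangian, as in equation (39)

    sol = sp.solve([sp.diff(L, x), sp.diff(L, y), h], [x, y, lam], dict=True)[0]
    print(sol)                                       # {x: 3, y: 5, lam: -2}
    print(f.subs(sol))                               # minimum value of f on the constraint: -2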
Lagrange Multipliers for Q

We can use the following constraint for the mixing weights:

  Σ_l α_l = 1    (43)

We have the following cost function:

  Q(Θ, Θ^g) + λ ( Σ_l α_l − 1 )    (44)

Deriving with respect to α_l:

  ∂/∂α_l [ Q(Θ, Θ^g) + λ ( Σ_l α_l − 1 ) ] = 0    (45)
Thus

The Q function:

  Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{l=1}^{M} log(α_l) p(l|x_i, Θ^g) + Σ_{i=1}^{N} Σ_{l=1}^{M} log(p_l(x_i|θ_l)) p(l|x_i, Θ^g)
Deriving

We have

  ∂/∂α_l [ Q(Θ, Θ^g) + λ ( Σ_l α_l − 1 ) ] = Σ_{i=1}^{N} (1/α_l) p(l|x_i, Θ^g) + λ
Finally

Setting the previous equation equal to 0:

  Σ_{i=1}^{N} (1/α_l) p(l|x_i, Θ^g) + λ = 0    (46)

Thus

  Σ_{i=1}^{N} p(l|x_i, Θ^g) = −λ α_l    (47)

Summing over l, and using Σ_l α_l = 1 and Σ_l p(l|x_i, Θ^g) = 1, we get

  λ = −N    (48)
Lagrange Multipliers

Thus

  α_l = (1/N) Σ_{i=1}^{N} p(l|x_i, Θ^g)    (49)

About θ_l
It is possible to get analytical expressions for θ_l as functions of everything else. This is for you to try!!!

For more, please look at "Geometric Idea of Lagrange Multipliers" by John Wyatt.
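Equation (49) says the new mixing weight of component l is just the average responsibility it takes for the data. A small sketch, with an assumed responsibility matrix standing in for p(l|x_i, Θ^g):

    import numpy as np

    # Assumed responsibilities p(l | x_i, Theta^g) for M = 2 components and N = 4 samples
    resp = np.array([[0.9, 0.8, 0.2, 0.1],
                     [0.1, 0.2, 0.8, 0.9]])          # each column sums to one

    N = resp.shape[1]
    alpha_new = resp.sum(axis=1) / N                 # Equation (49): alpha_l = (1/N) sum_i p(l|x_i, Theta^g)
    print(alpha_new, alpha_new.sum())                # the new weights still sum to one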
Remember?

Gaussian Distribution:
$$p_l(x|\mu_l, \Sigma_l) = \frac{1}{(2\pi)^{d/2} |\Sigma_l|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_l)^T \Sigma_l^{-1} (x - \mu_l) \right\} \quad (50)$$

111 / 126
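A minimal sketch of equation 50 in NumPy, cross-checked against scipy.stats.multivariate_normal; the helper name gaussian_pdf and the test values are my own choices, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density of equation 50, written out explicitly."""
    d = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
    return norm_const * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # should agree
```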
How to use this for Gaussian Distributions

For this, we need to refresh some linear algebra:
1. $\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)$
2. $\text{tr}(AB) = \text{tr}(BA)$
3. $\sum_i x_i^T A x_i = \text{tr}(AB)$, where $B = \sum_i x_i x_i^T$.
4. $|A^{-1}| = \frac{1}{|A|}$

Now, we need the derivative of a matrix function $f(A)$
Thus, $\frac{\partial f(A)}{\partial A}$ is the matrix with $i,j$-th entry $\frac{\partial f(A)}{\partial a_{i,j}}$, where $a_{i,j}$ is the $i,j$-th entry of $A$.

112 / 126
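These four identities are easy to verify numerically; a small NumPy check (my own, purely illustrative) with random matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((4, 4))
C = rng.random((4, 4))
xs = rng.random((10, 4))           # ten vectors x_i of dimension 4

# Identities 1 and 2: trace is additive and invariant under cyclic permutation
assert np.isclose(np.trace(A + C), np.trace(A) + np.trace(C))
assert np.isclose(np.trace(A @ C), np.trace(C @ A))

# Identity 3: sum_i x_i^T A x_i = tr(AB) with B = sum_i x_i x_i^T
B = xs.T @ xs
assert np.isclose(sum(x @ A @ x for x in xs), np.trace(A @ B))

# Identity 4: |A^{-1}| = 1 / |A|
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A))
```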
In addition

If $A$ is symmetric:
$$\frac{\partial |A|}{\partial A} = \begin{cases} A_{i,j} & \text{if } i = j \\ 2A_{i,j} & \text{if } i \neq j \end{cases} \quad (51)$$

Where $A_{i,j}$ is the $i,j$-th cofactor of $A$.
Note: the cofactor is the determinant obtained by deleting the row and column of a given element of the matrix, preceded by a $+$ or $-$ sign depending on whether the element is in a $+$ or $-$ position.

Thus:
$$\frac{\partial \log |A|}{\partial A} = \begin{cases} \frac{A_{i,j}}{|A|} & \text{if } i = j \\ \frac{2A_{i,j}}{|A|} & \text{if } i \neq j \end{cases} = 2A^{-1} - \text{diag}\left(A^{-1}\right) \quad (52)$$

113 / 126
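Equation 52 can also be checked numerically. The following sketch (mine, not from the slides) perturbs symmetric pairs of entries together, since the formula treats the off-diagonal entries of a symmetric matrix as single free parameters, and compares the finite-difference gradient of $\log|A|$ with $2A^{-1} - \text{diag}(A^{-1})$.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.random((3, 3))
A = M @ M.T + 3 * np.eye(3)                      # symmetric positive-definite matrix
A_inv = np.linalg.inv(A)
formula = 2 * A_inv - np.diag(np.diag(A_inv))    # right-hand side of (52)

eps = 1e-6
grad = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A)
        E[i, j] = eps
        if i != j:
            E[j, i] = eps                        # keep the perturbed matrix symmetric
        grad[i, j] = (np.log(np.linalg.det(A + E)) - np.log(np.linalg.det(A))) / eps

print(np.allclose(grad, formula, atol=1e-4))     # True
```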
Finally

The last equations we need:
$$\frac{\partial\, \text{tr}(AB)}{\partial A} = B + B^T - \text{diag}(B) \quad (53)$$

In addition:
$$\frac{\partial\, x^T A x}{\partial x} = \left(A + A^T\right) x \quad (54)$$

which reduces to $2Ax$ when $A$ is symmetric.

114 / 126
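Since the right-hand side of (54) was reconstructed here from the standard identity, a short finite-difference check (an illustration of my own) confirms it for a non-symmetric A as well:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((4, 4))                 # not necessarily symmetric
x = rng.random(4)

analytic = (A + A.T) @ x               # equation 54
eps = 1e-6
numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - x @ A @ x) / eps
    for e in np.eye(4)
])
print(np.allclose(analytic, numeric, atol=1e-4))   # True; for symmetric A this is 2 A x
```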
Thus, using the last part of equation 38

We get, after ignoring constant terms (remember, they disappear after taking derivatives):
$$\sum_{i=1}^{N} \sum_{l=1}^{M} \log\left(p_l(x_i|\mu_l, \Sigma_l)\right) p(l|x_i, \Theta^g) = \sum_{i=1}^{N} \sum_{l=1}^{M} \left[ -\frac{1}{2} \log(|\Sigma_l|) - \frac{1}{2} (x_i - \mu_l)^T \Sigma_l^{-1} (x_i - \mu_l) \right] p(l|x_i, \Theta^g) \quad (55)$$

115 / 126
Finally

Thus, when taking the derivative with respect to $\mu_l$:
$$\sum_{i=1}^{N} \Sigma_l^{-1} (x_i - \mu_l)\, p(l|x_i, \Theta^g) = 0 \quad (56)$$

Then:
$$\mu_l = \frac{\sum_{i=1}^{N} x_i\, p(l|x_i, \Theta^g)}{\sum_{i=1}^{N} p(l|x_i, \Theta^g)} \quad (57)$$

116 / 126
Now, if we derive with respect to $\Sigma_l$

First, we rewrite equation 55:
$$\sum_{i=1}^{N} \sum_{l=1}^{M} \left[ -\frac{1}{2}\log(|\Sigma_l|) - \frac{1}{2}(x_i - \mu_l)^T \Sigma_l^{-1}(x_i - \mu_l) \right] p(l|x_i, \Theta^g)$$
$$= \sum_{l=1}^{M} \left[ -\frac{1}{2}\log(|\Sigma_l|) \sum_{i=1}^{N} p(l|x_i, \Theta^g) - \frac{1}{2}\sum_{i=1}^{N} p(l|x_i, \Theta^g)\, \text{tr}\left( \Sigma_l^{-1}(x_i - \mu_l)(x_i - \mu_l)^T \right) \right]$$
$$= \sum_{l=1}^{M} \left[ -\frac{1}{2}\log(|\Sigma_l|) \sum_{i=1}^{N} p(l|x_i, \Theta^g) - \frac{1}{2}\sum_{i=1}^{N} p(l|x_i, \Theta^g)\, \text{tr}\left( \Sigma_l^{-1} N_{l,i} \right) \right]$$

Where $N_{l,i} = (x_i - \mu_l)(x_i - \mu_l)^T$.

117 / 126
Deriving with respect to $\Sigma_l^{-1}$

We have that:
$$\frac{\partial}{\partial \Sigma_l^{-1}} \left[ \sum_{l=1}^{M} \left( -\frac{1}{2}\log(|\Sigma_l|)\sum_{i=1}^{N} p(l|x_i,\Theta^g) - \frac{1}{2}\sum_{i=1}^{N} p(l|x_i,\Theta^g)\, \text{tr}\left(\Sigma_l^{-1} N_{l,i}\right) \right) \right]$$
$$= \frac{1}{2}\sum_{i=1}^{N} p(l|x_i,\Theta^g)\left(2\Sigma_l - \text{diag}(\Sigma_l)\right) - \frac{1}{2}\sum_{i=1}^{N} p(l|x_i,\Theta^g)\left(2N_{l,i} - \text{diag}\left(N_{l,i}\right)\right)$$
$$= \frac{1}{2}\sum_{i=1}^{N} p(l|x_i,\Theta^g)\left(2M_{l,i} - \text{diag}\left(M_{l,i}\right)\right)$$
$$= 2S - \text{diag}(S)$$

Where $M_{l,i} = \Sigma_l - N_{l,i}$ and $S = \frac{1}{2}\sum_{i=1}^{N} p(l|x_i,\Theta^g)\, M_{l,i}$.

118 / 126
Thus, we have

If $2S - \text{diag}(S) = 0 \implies S = 0$

Implying:
$$\frac{1}{2}\sum_{i=1}^{N} p(l|x_i, \Theta^g)\left[\Sigma_l - N_{l,i}\right] = 0 \quad (58)$$

Or:
$$\Sigma_l = \frac{\sum_{i=1}^{N} p(l|x_i, \Theta^g)\, N_{l,i}}{\sum_{i=1}^{N} p(l|x_i, \Theta^g)} = \frac{\sum_{i=1}^{N} p(l|x_i, \Theta^g)(x_i - \mu_l)(x_i - \mu_l)^T}{\sum_{i=1}^{N} p(l|x_i, \Theta^g)} \quad (59)$$

119 / 126
Thus, we have the iterative updates

They are:
$$\alpha_l^{New} = \frac{1}{N} \sum_{i=1}^{N} p(l|x_i, \Theta^g)$$
$$\mu_l^{New} = \frac{\sum_{i=1}^{N} x_i\, p(l|x_i, \Theta^g)}{\sum_{i=1}^{N} p(l|x_i, \Theta^g)}$$
$$\Sigma_l^{New} = \frac{\sum_{i=1}^{N} p(l|x_i, \Theta^g)(x_i - \mu_l)(x_i - \mu_l)^T}{\sum_{i=1}^{N} p(l|x_i, \Theta^g)}$$

120 / 126
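The three updates translate almost line for line into NumPy. A minimal sketch, assuming X is the (N, d) data matrix and R the (N, M) matrix of responsibilities p(l|x_i, Θ^g) produced by the E-step (these names and the function m_step are mine, not the slides'):

```python
import numpy as np

def m_step(X, R):
    """One M-step of a Gaussian mixture given responsibilities R (N x M)."""
    N, d = X.shape
    Nl = R.sum(axis=0)                           # sum_i p(l | x_i, Theta^g), per component
    alpha = Nl / N                               # alpha_l^New
    mu = (R.T @ X) / Nl[:, None]                 # mu_l^New, responsibility-weighted means
    Sigma = np.empty((R.shape[1], d, d))
    for l in range(R.shape[1]):
        diff = X - mu[l]                         # rows are (x_i - mu_l)
        Sigma[l] = (R[:, l, None] * diff).T @ diff / Nl[l]   # weighted outer products
    return alpha, mu, Sigma
```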
EM Algorithm for Gaussian Mixtures

Step 1 - Initialize:
The means $\mu_l$
The covariances $\Sigma_l$
The mixing coefficients $\alpha_l$

122 / 126
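The slides do not prescribe an initialization scheme; one common choice (an assumption on my part) is random data points as means, identity covariances, and uniform mixing coefficients, as in this sketch:

```python
import numpy as np

def initialize(X, M, seed=0):
    """Step 1: initialize means, covariances and mixing coefficients."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, size=M, replace=False)]   # means: M randomly chosen data points
    Sigma = np.tile(np.eye(d), (M, 1, 1))          # covariances: identity matrices
    alpha = np.full(M, 1.0 / M)                    # mixing coefficients: uniform
    return alpha, mu, Sigma
```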
Evaluate

Step 2 - E-Step: evaluate the probabilities of component $l$ given $x_i$ using the current parameter values:
$$p(l|x_i, \Theta^g) = \frac{\alpha_l^g\, p_l\left(x_i|\theta_l^g\right)}{\sum_{k=1}^{M} \alpha_k^g\, p_k\left(x_i|\theta_k^g\right)}$$

123 / 126
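A sketch of this E-step under the same naming assumptions as before, using scipy.stats.multivariate_normal for the component densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alpha, mu, Sigma):
    """Step 2: responsibilities p(l | x_i, Theta^g) for all i and l."""
    N, M = X.shape[0], alpha.shape[0]
    R = np.empty((N, M))
    for l in range(M):
        R[:, l] = alpha[l] * multivariate_normal.pdf(X, mean=mu[l], cov=Sigma[l])
    R /= R.sum(axis=1, keepdims=True)   # divide by the denominator sum_k alpha_k p_k(x_i)
    return R
```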
Now

Step 3 - M-Step: re-estimate the parameters using the current iteration values:
$$\alpha_l^{New} = \frac{1}{N} \sum_{i=1}^{N} p(l|x_i, \Theta^g)$$
$$\mu_l^{New} = \frac{\sum_{i=1}^{N} x_i\, p(l|x_i, \Theta^g)}{\sum_{i=1}^{N} p(l|x_i, \Theta^g)}$$
$$\Sigma_l^{New} = \frac{\sum_{i=1}^{N} p(l|x_i, \Theta^g)(x_i - \mu_l)(x_i - \mu_l)^T}{\sum_{i=1}^{N} p(l|x_i, \Theta^g)}$$

124 / 126
Evaluate

Step 4 - Evaluate the log likelihood:
$$\log p(X|\mu, \Sigma, \alpha) = \sum_{i=1}^{N} \log\left( \sum_{l=1}^{M} \alpha_l^{New}\, p_l\left(x_i|\mu_l^{New}, \Sigma_l^{New}\right) \right)$$

Step 5 - Check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to Step 2.

125 / 126
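Putting the steps together, here is a minimal EM loop with a log-likelihood stopping rule; it reuses the initialize, e_step and m_step sketches above, and the tolerance and iteration cap are arbitrary choices of mine rather than anything specified in the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, alpha, mu, Sigma):
    """Step 4: log p(X | mu, Sigma, alpha) = sum_i log sum_l alpha_l p_l(x_i | mu_l, Sigma_l)."""
    mix = np.zeros(X.shape[0])
    for l in range(alpha.shape[0]):
        mix += alpha[l] * multivariate_normal.pdf(X, mean=mu[l], cov=Sigma[l])
    return np.log(mix).sum()

def em_gmm(X, M, tol=1e-6, max_iter=200):
    alpha, mu, Sigma = initialize(X, M)              # Step 1
    prev = -np.inf
    for _ in range(max_iter):
        R = e_step(X, alpha, mu, Sigma)              # Step 2
        alpha, mu, Sigma = m_step(X, R)              # Step 3
        ll = log_likelihood(X, alpha, mu, Sigma)     # Step 4
        if ll - prev < tol:                          # Step 5: convergence check
            break
        prev = ll
    return alpha, mu, Sigma, ll
```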
References I

S. Borman, "The expectation maximization algorithm: a short tutorial," Submitted for publication, pp. 1-9, 2004.

J. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," International Computer Science Institute, vol. 4, 1998.

F. Dellaert, "The expectation maximization algorithm," tech. rep., Georgia Institute of Technology, 2002.

G. McLachlan and T. Krishnan, The EM Algorithm and Extensions, vol. 382. John Wiley & Sons, 2007.

126 / 126