Introduction to Machine Learning
Expectation Maximization
Andres Mendez-Vazquez
September 16, 2017
Outline
1 Introduction
Maximum-Likelihood
Expectation Maximization
Examples of Applications of EM
2 Incomplete Data
Introduction
Using the Expected Value
Analogy
3 Derivation of the EM-Algorithm
Hidden Features
Proving Concavity
Using the Concave Functions for Approximation
From The Concave Function to the EM
The Final Algorithm
Notes and Convergence of EM
4 Finding Maximum Likelihood Mixture Densities
The Beginning of The Process
Bayes’ Rule for the components
Mixing Parameters
Maximizing Q using Lagrange Multipliers
Lagrange Multipliers
In Our Case
Example on Mixture of Gaussian Distributions
The EM Algorithm
Maximum-Likelihood
We have a density function p(x|Θ)
Assume that we have a data set of size N, X = {x1, x2, ..., xN }.
This data is known as the evidence.
We assume in addition that
The vectors are independent and identically distributed (i.i.d.) with distribution p under the parameters Θ.
What Can We Do With The Evidence?
We may use Bayes' Rule to estimate the parameters Θ

    p(Θ|X) = P(X|Θ) P(Θ) / P(X)    (1)

Or, given a new observation x̃, we may compute

    p(x̃|X)    (2)

i.e. the probability of the new observation being supported by the evidence X.
Thus
The former represents parameter estimation and the latter data prediction.
Focusing First on the Estimation of the Parameters Θ
We can interpret Bayes' Rule

    p(Θ|X) = P(X|Θ) P(Θ) / P(X)    (3)

Interpreted as

    posterior = (likelihood × prior) / evidence    (4)

Thus, we want the likelihood

    likelihood = P(X|Θ)
Maximum-Likelihood
We have

    p(x1, x2, ..., xN |Θ) = ∏_{i=1}^{N} p(xi|Θ)    (5)

also known as the likelihood function.
Because multiplying quantities p(xi|Θ) ≤ 1 can be numerically problematic, we work with the log-likelihood

    L(Θ|X) = log ∏_{i=1}^{N} p(xi|Θ) = ∑_{i=1}^{N} log p(xi|Θ)    (6)
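To make Eq. (6) concrete, here is a minimal Python sketch (not part of the original slides) that evaluates the log-likelihood of i.i.d. data under a univariate Gaussian; the parameters mu and sigma are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    # log N(x | mu, sigma^2), evaluated element-wise
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def log_likelihood(X, mu, sigma):
    # Eq. (6): the sum of the individual log-densities of the i.i.d. samples
    return np.sum(gaussian_log_pdf(X, mu, sigma))

X = np.array([1.2, 0.8, 1.5, 0.9])   # toy evidence X = {x1, ..., xN}
print(log_likelihood(X, mu=1.0, sigma=0.5))
```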
What happens if we have incomplete data?
The data can be split into
1 X = observed data, or incomplete data
2 Y = unobserved data
For this type of problem
We have the famous Expectation Maximization (EM) algorithm.
The Expectation Maximization
The EM algorithm
It was first developed by Dempster et al. (1977).
Its popularity comes from the fact that
It can estimate an underlying distribution when data is incomplete or has missing values.
Two main applications
1 When missing values exist in the data.
2 When the likelihood function can be simplified by assuming extra parameters that are missing or hidden.
Clustering
Given a series of data sets
Given the fact that radial Gaussian functions are universal approximators:
Samples {x1, x2, ..., xN } are the visible variables.
The Gaussian distributions generating each of the samples are the hidden variables.
Then, we model the clusters as a mixture of Gaussians.
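As an illustration (not in the original slides), the sketch below draws samples from a two-component Gaussian mixture: the component labels play the role of the hidden variables and the samples the role of the visible data. The weights, means, and standard deviations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed mixture parameters: weights alpha, means mu, standard deviations sigma
alpha = np.array([0.3, 0.7])
mu    = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

N = 500
y = rng.choice(len(alpha), size=N, p=alpha)   # hidden component labels
x = rng.normal(mu[y], sigma[y])               # visible samples

# in practice only x is observed; y is what EM has to reason about
print(x[:5], y[:5])
```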
Natural Language Processing
Unsupervised induction of probabilistic context-free grammars
Here, given a series of words o1, o2, o3, ... and a normalized context-free grammar,
we want to know the probability of each rule P(i → jk).
Thus
Here you have two kinds of variables:
The Visible Ones: the sequence of words.
The Hidden Ones: the rules that produce the observed sequence o1, o2, o3, ...
Natural Language Processing
Baum-Welch Algorithm for Hidden Markov Models
[Figure: Visible Data and Hidden Data in a Hidden Markov Model.]
Here
Hidden Variables: the circular nodes producing the data.
Visible Variables: the square nodes representing the samples.
Incomplete Data
We assume the following
Two parts of the data:
1 X = observed data, or incomplete data
2 Y = unobserved data
Thus
    Z = (X, Y) = complete data    (9)
Thus, we have the following probability
    p(z|Θ) = p(x, y|Θ) = p(y|x, Θ) p(x|Θ)    (10)
New Likelihood Function
The New Likelihood Function
    L(Θ|Z) = L(Θ|X, Y) = p(X, Y|Θ)    (11)
Note: this is the complete-data likelihood.
Thus, we have
    L(Θ|X, Y) = p(X, Y|Θ) = p(Y|X, Θ) p(X|Θ)    (12)
Did you notice?
p(X|Θ) is the likelihood of the observed data.
p(Y|X, Θ) is the likelihood of the unobserved data given the observed data!
Rewriting
This can be rewritten as
    L(Θ|X, Y) = h_{X,Θ}(Y)    (13)
This basically signifies that X and Θ are constant and the only random part is Y.
In addition
    L(Θ|X)    (14)
is known as the incomplete-data likelihood function.
Thus
We can connect the incomplete-data and complete-data likelihoods as follows

    L(Θ|X) = p(X|Θ)
           = ∑_Y p(X, Y|Θ)
           = ∑_Y p(Y|X, Θ) p(X|Θ)
           = ∏_{i=1}^{N} p(xi|Θ) ∑_Y p(Y|X, Θ)
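A tiny numerical check (an illustration, not from the slides): summing the complete-data likelihood p(x, y|Θ) over all hidden configurations recovers the incomplete-data likelihood p(x|Θ). Here this is done for a single observation under an assumed two-component Gaussian mixture.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# assumed mixture parameters (illustrative only)
alpha = np.array([0.4, 0.6])
mu    = np.array([0.0, 5.0])
sigma = np.array([1.0, 2.0])

x = 1.3   # a single observation

# complete-data likelihood p(x, y|Θ) = p(y|Θ) p(x|y, Θ), summed over the hidden y
complete_marginalized = sum(alpha[y] * gaussian_pdf(x, mu[y], sigma[y])
                            for y in range(len(alpha)))

# incomplete-data likelihood p(x|Θ), computed directly as the mixture density
incomplete = np.dot(alpha, gaussian_pdf(x, mu, sigma))

print(np.isclose(complete_marginalized, incomplete))   # True
```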
Remarks
Problems
Normally, it is almost impossible to obtain a closed-form analytical solution for the previous equation.
However
We can use the expected value of log p(X, Y|Θ), which allows us to find an iterative procedure to approximate the solution.
The function we would like to have
The Q function
We want an estimate of the complete-data log-likelihood
    log p(X, Y|Θ)    (15)
based on the information provided by X and Θn−1, where Θn−1 is the set of parameters estimated at the previous step.
Think about the following: if we want to remove Y,
    ∫ [log p(X, Y|Θ)] p(Y|X, Θn−1) dY    (16)
Remark: we integrate out Y. Actually, this is the expected value of log p(X, Y|Θ) under p(Y|X, Θn−1).
Use the Expected Value
Then, we want an iterative method to estimate Θ from Θn−1
    Q(Θ, Θn−1) = E[log p(X, Y|Θ) | X, Θn−1]    (17)
Take into account that
1 X and Θn−1 are taken as constants.
2 Θ is an ordinary variable that we wish to adjust.
3 Y is a random variable governed by the distribution p(Y|X, Θn−1), the marginal distribution of the missing data.
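For a discrete hidden variable, Eq. (17) is just a weighted sum. The sketch below (an illustration, not from the slides) computes Q(Θ, Θn−1) for a single observation from a two-component Gaussian mixture, where the hidden value y is the component label; all parameter values are made up for the example.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def posterior_y(x, alpha, mu, sigma):
    # p(y|x, Θn−1): posterior over the hidden component label
    w = alpha * np.exp(log_gauss(x, mu, sigma))
    return w / w.sum()

def Q(theta, theta_old, x):
    # Q(Θ, Θn−1) = Σ_y p(y|x, Θn−1) log p(x, y|Θ)
    alpha, mu, sigma = theta
    post = posterior_y(x, *theta_old)
    log_joint = np.log(alpha) + log_gauss(x, mu, sigma)   # log p(x, y|Θ) for each y
    return np.dot(post, log_joint)

theta_old = (np.array([0.5, 0.5]), np.array([0.0, 4.0]), np.array([1.0, 1.0]))
theta_new = (np.array([0.4, 0.6]), np.array([0.5, 3.5]), np.array([1.0, 1.0]))
print(Q(theta_new, theta_old, x=1.0))
```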
Another Interpretation
Given the previous information
    E[log p(X, Y|Θ) | X, Θn−1] = ∫ log p(X, Y|Θ) p(Y|X, Θn−1) dY, integrating over the space of Y.
Something Notable
1 In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameters Θn−1.
2 In the worst of cases, this density might be very hard to obtain.
Actually, we use
    p(Y, X|Θn−1) = p(Y|X, Θn−1) p(X|Θn−1)    (18)
which does not depend on Θ.
Back to the Q function
The intuition
We have the following analogy. Consider a function h(θ, Y ), where
θ is a constant,
Y ∼ pY (y) is a random variable with distribution pY (y).
Thus, if Y is a discrete random variable,
    q(θ) = EY [h(θ, Y )] = ∑_y h(θ, y) pY (y)    (19)
Why E-step!!!
From here, the name
This is basically the E-step.
The second step
It tries to maximize the Q function:
    Θn = argmax_Θ Q(Θ, Θn−1)    (20)
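A minimal numerical illustration of the M-step in Eq. (20) (not from the slides): when no closed form is available, one can maximize Q over a grid or with a generic optimizer. Here a single scalar parameter is maximized over a grid, with the hypothetical q_function standing in for Q(·, Θn−1).

```python
import numpy as np

def q_function(theta):
    # hypothetical stand-in for Q(Θ, Θn−1); in practice it comes from the E-step
    return -(theta - 2.0)**2

# M-step by brute-force search over a grid of candidate parameter values
grid = np.linspace(-5.0, 5.0, 1001)
theta_n = grid[np.argmax([q_function(t) for t in grid])]
print(theta_n)   # approximately 2.0
```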
Derivation of the EM-Algorithm
The likelihood function we are going to use
Let X be a random vector which results from a parametrized family:
    L(Θ) = ln P(X|Θ)    (21)
Note: ln(x) is a strictly increasing function.
We wish to compute Θ
based on an estimate Θn (after the nth iteration) such that L(Θ) > L(Θn),
or the maximization of the difference
    L(Θ) − L(Θn) = ln P(X|Θ) − ln P(X|Θn)    (22)
Introducing the Hidden Features
Given that the hidden random vector Y exists with values y
    P(X|Θ) = ∑_y P(X|y, Θ) P(y|Θ)    (23)
Thus, using our objective, the difference L(Θ) − L(Θn),
    L(Θ) − L(Θn) = ln ∑_y P(X|y, Θ) P(y|Θ) − ln P(X|Θn)    (24)
Here, we introduce some concepts of convexity
For Convexity
Theorem (Jensen's inequality)
Let f be a convex function defined on an interval I. If x1, x2, ..., xn ∈ I and λ1, λ2, ..., λn ≥ 0 with ∑_{i=1}^{n} λi = 1, then
    f(∑_{i=1}^{n} λi xi) ≤ ∑_{i=1}^{n} λi f(xi)    (25)
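A quick numerical sanity check of Eq. (25) (an illustration, not part of the slides), using the convex function f(x) = x² and random weights λ:

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: x**2                 # a convex function
x = rng.normal(size=5)             # points x1, ..., xn
lam = rng.random(5)
lam = lam / lam.sum()              # weights λi ≥ 0 that sum to 1

lhs = f(np.dot(lam, x))            # f(Σ λi xi)
rhs = np.dot(lam, f(x))            # Σ λi f(xi)
print(lhs <= rhs)                  # True: Jensen's inequality
```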
Proof:
For n = 1
We have the trivial case.
For n = 2
This is the definition of convexity.
Now, the inductive hypothesis
We assume that the theorem is true for some n.
Now, we have
The following linear combination for the λi

    f(∑_{i=1}^{n+1} λi xi) = f(λn+1 xn+1 + ∑_{i=1}^{n} λi xi)
                           = f(λn+1 xn+1 + (1 − λn+1)/(1 − λn+1) ∑_{i=1}^{n} λi xi)
                           ≤ λn+1 f(xn+1) + (1 − λn+1) f(1/(1 − λn+1) ∑_{i=1}^{n} λi xi)
Now
We have that

    f(∑_{i=1}^{n+1} λi xi) ≤ λn+1 f(xn+1) + (1 − λn+1) f(1/(1 − λn+1) ∑_{i=1}^{n} λi xi)
                           ≤ λn+1 f(xn+1) + (1 − λn+1) · 1/(1 − λn+1) ∑_{i=1}^{n} λi f(xi)
                           = λn+1 f(xn+1) + ∑_{i=1}^{n} λi f(xi)    Q.E.D.

where the second inequality applies the induction hypothesis to the normalized weights λi/(1 − λn+1), which sum to 1.
Thus, for concave functions
It is possible to show that
Given that ln(x) is a concave function:
    ln(∑_{i=1}^{n} λi xi) ≥ ∑_{i=1}^{n} λi ln(xi)
If we take into consideration
Assume that λi = P(y|X, Θn). We know that
1 P(y|X, Θn) ≥ 0
2 ∑_y P(y|X, Θn) = 1
We have
First

    L(Θ) − L(Θn) = ln ∑_y P(X|y, Θ) P(y|Θ) − ln P(X|Θn)
                 = ln ∑_y P(X|y, Θ) P(y|Θ) · P(y|X, Θn)/P(y|X, Θn) − ln P(X|Θn)
                 = ln ∑_y P(y|X, Θn) [P(X|y, Θ) P(y|Θ) / P(y|X, Θn)] − ln P(X|Θn)
                 ≥ ∑_y P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ) / P(y|X, Θn)] − ∑_y P(y|X, Θn) ln P(X|Θn)    (Why this?)
Next
Because
    ∑_y P(y|X, Θn) = 1
Then
    L(Θ) − L(Θn) ≥ ∑_y P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))] ≡ ∆(Θ|Θn)
Then, we have
We have proved that
    L(Θ) ≥ L(Θn) + ∆(Θ|Θn)    (26)
Then, we define a new function
    l(Θ|Θn) = L(Θn) + ∆(Θ|Θn)    (27)
Thus l(Θ|Θn)
is bounded from above by L(Θ), i.e. l(Θ|Θn) ≤ L(Θ).
Now, we can do the following
We evaluate at Θn

    l(Θn|Θn) = L(Θn) + ∆(Θn|Θn)
             = L(Θn) + ∑_y P(y|X, Θn) ln [P(X|y, Θn) P(y|Θn) / (P(y|X, Θn) P(X|Θn))]
             = L(Θn) + ∑_y P(y|X, Θn) ln [P(X, y|Θn) / P(X, y|Θn)]
             = L(Θn)

This means that
for Θ = Θn, the functions L(Θ) and l(Θ|Θn) are equal.
Therefore
The function l(Θ|Θn) has the following properties
1 It is bounded from above by L(Θ), i.e. l(Θ|Θn) ≤ L(Θ).
2 For Θ = Θn, the functions L(Θ) and l(Θ|Θn) are equal.
3 The function l(Θ|Θn) is concave... How?
First
We have the value L(Θn)
We know that L(Θn) is constant, i.e. an offset value.
What about ∆(Θ|Θn)?
    ∑_y P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))]
We have that ln is a concave function of its argument
    ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))]
Therefore
Each element is concave
    P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))]
Therefore, the sum of concave functions is a concave function
    ∑_y P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))]
Given the Concave Function
Thus, we have that
1 We can select the Θ that maximizes l(Θ|Θn).
2 Thus, given a Θn, we can generate Θn+1.
The process can be seen in the following graph.
Given
The Previous Constraints
1 l(Θ|Θn) is bounded from above by L(Θ): l(Θ|Θn) ≤ L(Θ)
2 For Θ = Θn, the functions L(Θ) and l(Θ|Θn) are equal: L(Θn) = l(Θn|Θn)
3 The function l(Θ|Θn) is concave.
From
The following

    Θn+1 = argmax_Θ {l(Θ|Θn)}
         = argmax_Θ {L(Θn) + ∑_y P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))]}

The terms involving only Θn are constants, so

         ≈ argmax_Θ {∑_y P(y|X, Θn) ln (P(X|y, Θ) P(y|Θ))}
         = argmax_Θ {∑_y P(y|X, Θn) ln [ (P(X, y|Θ) / P(y|Θ)) · (P(y, Θ) / P(Θ)) ]}
         = argmax_Θ {∑_y P(y|X, Θn) ln [ (P(X, y, Θ)/P(Θ)) / (P(y, Θ)/P(Θ)) · (P(y, Θ) / P(Θ)) ]}
Thus
Then

    Θn+1 = argmax_Θ {∑_y P(y|X, Θn) ln [ (P(X, y, Θ) / P(y, Θ)) · (P(y, Θ) / P(Θ)) ]}
         = argmax_Θ {∑_y P(y|X, Θn) ln [ P(X, y, Θ) / P(Θ) ]}
         = argmax_Θ {∑_y P(y|X, Θn) ln P(X, y|Θ)}
         = argmax_Θ E_{y|X,Θn} [ln P(X, y|Θ)]

Then argmax_Θ {l(Θ|Θn)} ≈ argmax_Θ E_{y|X,Θn} [ln P(X, y|Θ)]
The EM-Algorithm
Steps of EM
1 Expectation over the hidden variables.
2 Maximization of the resulting expression.
E-Step
Determine the conditional expectation E_{y|X,Θn} [ln P(X, y|Θ)].
M-Step
Maximize this expression with respect to Θ.
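A compact sketch of the full E-step/M-step loop for a one-dimensional two-component Gaussian mixture (an illustration of the scheme above, not code from the slides; the closed-form M-step updates used here are the standard Gaussian-mixture ones).

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """Minimal EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # initial guess: uniform weights, means drawn from the data, sample variance
    alpha = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False)
    var = np.array([np.var(x), np.var(x)])

    for _ in range(n_iter):
        # E-step: responsibilities p(y = k | x_i, Θ_n)
        dens = np.exp(-(x[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = alpha * dens
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: closed-form updates of weights, means and variances that
        # maximize the expected complete-data log-likelihood
        Nk = resp.sum(axis=0)
        alpha = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu)**2).sum(axis=0) / Nk
    return alpha, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
print(em_gmm_1d(x))
```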
Notes and Convergence of EM
Gains between L(Θ) and l(Θ|Θn)
Using the hidden variables, it is possible to simplify the optimization of L(Θ) through l(Θ|Θn).
Convergence
Remember that Θn+1 is the estimate of Θ which maximizes the difference ∆(Θ|Θn).
Therefore
Then, we have
Given the current estimate Θn of Θ,
    ∆(Θn|Θn) = 0
Now
If we choose Θn+1 to maximize ∆(Θ|Θn), then
    ∆(Θn+1|Θn) ≥ ∆(Θn|Θn) = 0
We have that
the likelihood is non-decreasing at each iteration: L(Θn+1) ≥ L(Θn).
Notes and Convergence of EM
Properties
When the algorithm reaches a fixed point for some Θn, that value maximizes l(Θ|Θn).
Definition
A fixed point of a function is an element of its domain that is mapped to itself by the function:
    f(x) = x
Basically, at a fixed point the EM algorithm satisfies
    EM[Θ∗] = Θ∗
At this moment
We have that
When the algorithm reaches a fixed point for some Θn, that value Θ∗ maximizes l(Θ|Θn).
Then, when the algorithm reaches such a fixed point,
basically Θn+1 = Θn.
Then
If L and l are differentiable at Θn
Since L and l are equal at Θn, and l is maximized there,
Θn is a stationary point of L, i.e. the derivative of L vanishes at that point.
[Figure: Local Maxima]
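Since EM only guarantees convergence to a stationary point, often a local maximum, a common practical remedy, sketched below purely as an illustration (not from the slides), is to run EM from several random initializations and keep the run with the highest final log-likelihood. The helper run_em(X, seed) returning (parameters, final log-likelihood) is a hypothetical stand-in for any EM implementation, such as the mixture sketch shown earlier.

```python
def best_of_restarts(run_em, X, n_restarts=10):
    # run_em is a hypothetical callable: run_em(X, seed) -> (params, log_likelihood)
    results = [run_em(X, seed=s) for s in range(n_restarts)]
    # keep the run whose final log-likelihood is highest
    return max(results, key=lambda r: r[1])
```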
For more on the subject
Please take a look at
Geoffrey McLachlan and Thriyambakam Krishnan, "The EM Algorithm and Extensions," John Wiley & Sons, New York, 1996.
Finding Maximum Likelihood Mixture Density Parameters via EM
Something Notable
The mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community.
We have
    p(x|Θ) = ∑_{i=1}^{M} αi pi(x|θi)    (28)
where
1 Θ = (α1, ..., αM , θ1, ..., θM )
2 ∑_{i=1}^{M} αi = 1
3 Each pi is a density function parametrized by θi.
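A minimal sketch (illustrative, not from the slides) of evaluating the mixture density in Eq. (28) and the corresponding incomplete-data log-likelihood, taking univariate Gaussian components with made-up parameters:

```python
import numpy as np

def component_pdf(x, theta):
    # pi(x|θi): here each component is a univariate Gaussian with θi = (mu, sigma)
    mu, sigma = theta
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def mixture_pdf(x, alpha, thetas):
    # Eq. (28): p(x|Θ) = Σ_i αi pi(x|θi)
    return sum(a * component_pdf(x, th) for a, th in zip(alpha, thetas))

def log_likelihood(X, alpha, thetas):
    # Σ_i log Σ_j αj pj(xi|θj): the log of a sum that makes direct optimization hard
    return np.sum(np.log(mixture_pdf(X, alpha, thetas)))

alpha  = [0.3, 0.7]                    # mixing parameters, summing to 1
thetas = [(-1.0, 0.5), (2.0, 1.0)]     # (mean, std) of each component
X = np.array([-0.8, 1.9, 2.4, 0.1])
print(log_likelihood(X, alpha, thetas))
```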
A log-likelihood for this function
We have
    log L(Θ|X) = log ∏_{i=1}^{N} p(xi|Θ) = ∑_{i=1}^{N} log ∑_{j=1}^{M} αj pj(xi|θj)    (29)
Note: this is difficult to optimize because of the log of a sum.
However
We can simplify this by assuming the following:
1 The unobserved data Y = {yi}_{i=1}^{N} take values yi ∈ {1, ..., M}.
2 yi = k if the ith sample was generated by the kth mixture component.
Now
We have
    log L(Θ|X, Y) = log P(X, Y|Θ)    (30)
Remember that X = {x1, x2, ..., xN } with Y = {y1, y2, ..., yN }, and assuming independence across samples

    log P(X, Y|Θ) = log P(x1, x2, ..., xN , y1, y2, ..., yN |Θ)
                  = log P(x1, y1, ..., xi, yi, ..., xN , yN |Θ)
                  = log ∏_{i=1}^{N} P(xi, yi|Θ)
                  = ∑_{i=1}^{N} log P(xi, yi|Θ)
Then
Thus, by the chain rule
    ∑_{i=1}^{N} log P(xi, yi|Θ) = ∑_{i=1}^{N} log [P(xi|yi, θ_{yi}) P(yi|θ_{yi})]    (31)
Question: do you need yi if you know θ_{yi}, or the other way around?
Finally
    ∑_{i=1}^{N} log [P(xi|yi, θ_{yi}) P(yi|θ_{yi})] = ∑_{i=1}^{N} log [P(yi) p_{yi}(xi|θ_{yi})]    (32)
NOPE: you do not need yi if you know θ_{yi}, or the other way around.
Problem
Which Labels?
We do not know the values of Y.
We can get around this by using the following idea
Assume that Y is a random variable.
Thus
You make a first guess for the parameters at the beginning of EM
    Θ^g = (α_1^g, ..., α_M^g, θ_1^g, ..., θ_M^g)    (34)
Then, it is possible to evaluate the parametric densities p_j(xi|θ_j^g).
Therefore
The mixing parameters αj can be thought of as the prior probabilities of each mixture component:
    αj = p(component j)    (35)
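A small illustration (not from the slides) of one way to form the initial guess Θ^g for a one-dimensional Gaussian mixture: uniform weights, means drawn from the data, and the overall sample variance for every component. Other initializations, such as k-means, are also common.

```python
import numpy as np

def initial_guess(x, M, seed=0):
    """One possible Θ^g for a 1-D Gaussian mixture with M components."""
    rng = np.random.default_rng(seed)
    alpha = np.full(M, 1.0 / M)                 # uniform prior over components
    mu = rng.choice(x, size=M, replace=False)   # means picked from the data
    var = np.full(M, np.var(x))                 # shared initial variance
    return alpha, mu, var

x = np.random.default_rng(2).normal(size=100)
print(initial_guess(x, M=3))
```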
We want to calculate the following probability
We want to calculate
    p(yi|xi, Θ^g)
Basically
We want a Bayesian formulation of this probability,
assuming that y = (y1, y2, ..., yN ) are independent and identically distributed samples from a distribution.
Using Bayes' Rule
Compute

    p(yi|xi, Θ^g) = p(yi, xi|Θ^g) / p(xi|Θ^g)
                  = p(xi|yi, θ_{yi}^g) p(yi|Θ^g) / p(xi|Θ^g)     (we know θ_{yi}^g, so drop the redundant conditioning)
                  = α_{yi}^g p_{yi}(xi|θ_{yi}^g) / p(xi|Θ^g)
                  = α_{yi}^g p_{yi}(xi|θ_{yi}^g) / ∑_{k=1}^{M} α_k^g p_k(xi|θ_k^g)
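The quantity above is the responsibility of each component for each sample. A vectorized sketch for Gaussian components (illustrative, not from the slides), with made-up parameter values:

```python
import numpy as np

def responsibilities(x, alpha, mu, sigma):
    """p(yi = k | xi, Θ^g) for all samples x and Gaussian components k."""
    # unnormalized weights α_k^g p_k(x_i|θ_k^g), shape (N, M)
    dens = np.exp(-(x[:, None] - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    num = alpha * dens
    # normalize by Σ_k α_k^g p_k(x_i|θ_k^g)
    return num / num.sum(axis=1, keepdims=True)

x = np.array([-1.5, 0.2, 3.1])
alpha = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.5])
sigma = np.array([0.8, 1.2])
print(responsibilities(x, alpha, mu, sigma))   # each row sums to 1
```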
As in Naive Bayes
We have one such probability per mixture component and per sample
    p(yi|xi, Θ^g) = α_{yi}^g p_{yi}(xi|θ_{yi}^g) / ∑_{k=1}^{M} α_k^g p_k(xi|θ_k^g),    ∀ xi, yi and k ∈ {1, ..., M}
This is going to be updated at each iteration of the EM algorithm,
after the initial guess, until convergence!!!
Additionally
We assume again that the yi are independent and identically distributed samples
    p(y|X, Θ^g) = ∏_{i=1}^{N} p(yi|xi, Θ^g)    (36)
where y = (y1, y2, ..., yN ).
Now, using equation (17)
Then

    Q(Θ|Θ^g) = ∑_{y∈Y} log(L(Θ|X, y)) p(y|X, Θ^g)
             = ∑_{y∈Y} ∑_{i=1}^{N} log [α_{yi} p_{yi}(xi|θ_{yi})] ∏_{j=1}^{N} p(yj|xj, Θ^g)
Here, a small stop
What is the meaning of ∑_{y∈Y}?
It is actually a summation over all possible states of the random vector y.
Then, we can rewrite the previous summation as
    ∑_{y∈Y} = ∑_{y1=1}^{M} ∑_{y2=1}^{M} · · · ∑_{yN=1}^{M}
running over the labels of all the samples {x1, x2, ..., xN }.
Then
We have
Q(Θ | Θ^g) = Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} Σ_{i=1}^{N} log[α_{y_i} p_{y_i}(x_i | θ_{y_i})] Π_{j=1}^{N} p(y_j | x_j, Θ^g)
80 / 126
192. Images/
We introduce the following
We have the following indicator function
δ_{l,y_i} = 1 if l = y_i
            0 if l ≠ y_i
Therefore, we can do the following
α_{y_i} = Σ_{l=1}^{M} δ_{l,y_i} α_l
Then
log[α_{y_i} p_{y_i}(x_i | θ_{y_i})] Π_{j=1}^{N} p(y_j | x_j, Θ^g) = Σ_{l=1}^{M} δ_{l,y_i} log[α_l p_l(x_i | θ_l)] Π_{j=1}^{N} p(y_j | x_j, Θ^g)
81 / 126
195. Images/
Thus
We have that
Σ_{y_1=1}^{M} · · · Σ_{y_N=1}^{M} Σ_{i=1}^{N} log[α_{y_i} p_{y_i}(x_i | θ_{y_i})] Π_{j=1}^{N} p(y_j | x_j, Θ^g) = ∗
∗ = Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} Σ_{i=1}^{N} Σ_{l=1}^{M} δ_{l,y_i} log[α_l p_l(x_i | θ_l)] Π_{j=1}^{N} p(y_j | x_j, Θ^g)
  = Σ_{i=1}^{N} Σ_{l=1}^{M} log[α_l p_l(x_i | θ_l)] Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} δ_{l,y_i} Π_{j=1}^{N} p(y_j | x_j, Θ^g)
Because
The sums Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} apply only to δ_{l,y_i} Π_{j=1}^{N} p(y_j | x_j, Θ^g); the term log[α_l p_l(x_i | θ_l)] does not depend on y.
82 / 126
198. Images/
Then, we have that
First notice the following
Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} δ_{l,y_i} Π_{j=1}^{N} p(y_j | x_j, Θ^g) =
  = Σ_{y_1=1}^{M} · · · Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} · · · Σ_{y_N=1}^{M} [Σ_{y_i=1}^{M} δ_{l,y_i} p(y_i | x_i, Θ^g)] Π_{j=1, j≠i}^{N} p(y_j | x_j, Θ^g)
Then, we have
Σ_{y_i=1}^{M} δ_{l,y_i} p(y_i | x_i, Θ^g) = p(l | x_i, Θ^g)
83 / 126
201. Images/
In this way
Plugging back the previous equation
Σ_{y_1=1}^{M} Σ_{y_2=1}^{M} · · · Σ_{y_N=1}^{M} δ_{l,y_i} Π_{j=1}^{N} p(y_j | x_j, Θ^g) =
  = Σ_{y_1=1}^{M} · · · Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} · · · Σ_{y_N=1}^{M} p(l | x_i, Θ^g) Π_{j=1, j≠i}^{N} p(y_j | x_j, Θ^g)
  = [Σ_{y_1=1}^{M} · · · Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} · · · Σ_{y_N=1}^{M} Π_{j=1, j≠i}^{N} p(y_j | x_j, Θ^g)] p(l | x_i, Θ^g)
84 / 126
204. Images/
Now, what about...?
The left factor of the previous expression
Σ_{y_1=1}^{M} · · · Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} · · · Σ_{y_N=1}^{M} Π_{j=1, j≠i}^{N} p(y_j | x_j, Θ^g) =
  = [Σ_{y_1=1}^{M} p(y_1 | x_1, Θ^g)] · · · [Σ_{y_{i−1}=1}^{M} p(y_{i−1} | x_{i−1}, Θ^g)] × [Σ_{y_{i+1}=1}^{M} p(y_{i+1} | x_{i+1}, Θ^g)] · · · [Σ_{y_N=1}^{M} p(y_N | x_N, Θ^g)]
  = Π_{j=1, j≠i}^{N} Σ_{y_j=1}^{M} p(y_j | x_j, Θ^g)
85 / 126
207. Images/
Then, we have that
Plugging back into the original equation
[Σ_{y_1=1}^{M} · · · Σ_{y_{i−1}=1}^{M} Σ_{y_{i+1}=1}^{M} · · · Σ_{y_N=1}^{M} Π_{j=1, j≠i}^{N} p(y_j | x_j, Θ^g)] p(l | x_i, Θ^g) =
  = [Π_{j=1, j≠i}^{N} Σ_{y_j=1}^{M} p(y_j | x_j, Θ^g)] p(l | x_i, Θ^g)
86 / 126
209. Images/
We can use properties of probability
We know that
Σ_{y_i=1}^{M} p(y_i | x_i, Θ^g) = 1 (37)
Then
[Π_{j=1, j≠i}^{N} Σ_{y_j=1}^{M} p(y_j | x_j, Θ^g)] p(l | x_i, Θ^g) =
  = [Π_{j=1, j≠i}^{N} 1] p(l | x_i, Θ^g)
  = p(l | x_i, Θ^g)
  = α^g_l p_l(x_i | θ^g_l) / Σ_{k=1}^{M} α^g_k p_k(x_i | θ^g_k)
87 / 126
213. Images/
Thus
We can write Q in the following way
Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{l=1}^{M} log[α_l p_l(x_i | θ_l)] p(l | x_i, Θ^g)
          = Σ_{i=1}^{N} Σ_{l=1}^{M} log(α_l) p(l | x_i, Θ^g) + Σ_{i=1}^{N} Σ_{l=1}^{M} log(p_l(x_i | θ_l)) p(l | x_i, Θ^g) (38)
88 / 126
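The rearrangement leading to equation 38 can be sanity-checked numerically: for a tiny mixture, the M^N-term sum over all states of y must equal the N·M-term double sum. A minimal sketch, assuming NumPy/SciPy and a made-up one-dimensional Gaussian mixture (nothing below comes from the slides; Q is evaluated at Θ = Θ^g for simplicity):

# Sanity check: the brute-force sum over all M**N states of y equals eq. 38.
import numpy as np
from itertools import product
from scipy.stats import norm

rng = np.random.default_rng(1)
M, N = 2, 4
alpha = np.array([0.3, 0.7])                    # current-guess mixing weights
mu, sigma = np.array([-1.0, 2.0]), np.array([1.0, 0.5])
x = rng.normal(size=N)

pxl = norm.pdf(x[:, None], mu, sigma)           # p_l(x_i | theta_l), shape (N, M)
resp = alpha * pxl
resp /= resp.sum(axis=1, keepdims=True)         # responsibilities p(l | x_i, Theta^g)

log_term = np.log(alpha * pxl)                  # log[alpha_l p_l(x_i | theta_l)]

brute = 0.0                                     # sum over every configuration of y
for y in product(range(M), repeat=N):
    w = np.prod([resp[j, y[j]] for j in range(N)])
    brute += w * sum(log_term[i, y[i]] for i in range(N))

simplified = np.sum(log_term * resp)            # eq. 38 evaluated at Theta = Theta^g
print(np.isclose(brute, simplified))            # True

The agreement of the two numbers is exactly what the chain of equalities in the previous slides proves.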
217. Images/
A Method
That can be used as a general framework
To solve problems posed as EM problems.
First, we will look at the Lagrange multipliers setup
Then, we will look at a specific case using a mixture of Gaussians
Note
Not every mixture of distributions yields an analytical solution.
90 / 126
221. Images/
Lagrange Multipliers
The method of Lagrange multipliers
Gives a set of necessary conditions to identify optimal points of equality
constrained optimization problems.
This is done by converting a constrained problem
To an equivalent unconstrained problem with the help of certain
unspecified parameters known as Lagrange multipliers.
92 / 126
223. Images/
Lagrange Multipliers
The classical problem formulation
min f(x_1, x_2, ..., x_n)
s.t. h_1(x_1, x_2, ..., x_n) = 0
It can be converted into
min L(x_1, x_2, ..., x_n, λ) = min {f(x_1, x_2, ..., x_n) − λ h_1(x_1, x_2, ..., x_n)} (39)
where
L(x, λ) is the Lagrangian function.
λ is an unspecified positive or negative constant called the Lagrange
Multiplier.
93 / 126
227. Images/
Lagrange Multipliers are an application of Physics
Think about this
Remember Newton's First Law!!!
Yes!!!
A system in equilibrium does not move
Static Body
94 / 126
229. Images/
Lagrange was a Physicist
He was thinking of the following formula
A system in equilibrium satisfies the following equation:
F_1 + F_2 + ... + F_K = 0 (40)
But functions do not have forces, do they?
Are you sure?
Think about the following
The gradient of a surface.
96 / 126
232. Images/
Gradient to a Surface
After all, a gradient is a measure of the maximal change
For example, the gradient of a function of three variables:
∇f(x) = i ∂f(x)/∂x + j ∂f(x)/∂y + k ∂f(x)/∂z (41)
where i, j and k are unit vectors in the directions x, y and z.
97 / 126
235. Images/
Now, Think about this
Yes, we can use the gradient
However, we need to scale the forces by using parameters λ
Thus, we have
F_0 + λ_1 F_1 + ... + λ_K F_K = 0 (42)
where F_0 is the gradient of the principal cost function and F_i, for
i = 1, 2, ..., K, are the gradients of the constraints.
100 / 126
237. Images/
Thus
If we have the following optimization:
min f(x)
s.t. g_1(x) = 0
     g_2(x) = 0
101 / 126
238. Images/
Geometric interpretation in the case of minimization
What is wrong? The gradients point in the opposite direction; we can fix
this by simply multiplying by −1
Here the cost function is f(x, y) = x exp(−x² − y²), and we want to minimize
f(x) subject to the constraints g_1(x) = 0 and g_2(x) = 0 (see the figure):
−∇f(x) + λ_1 ∇g_1(x) + λ_2 ∇g_2(x) = 0
Nevertheless, it is equivalent to ∇f(x) − λ_1 ∇g_1(x) − λ_2 ∇g_2(x) = 0
101 / 126
240. Images/
Method
Steps
1 Rewrite the original problem as: minimize L(x, λ) = f(x) − λ h_1(x)
2 Take derivatives of L(x, λ) with respect to x_i and set them equal to
zero.
3 Express all x_i in terms of the Lagrange multiplier λ.
4 Plug x, in terms of λ, into the constraint h_1(x) = 0 and solve for λ.
5 Calculate x by using the value just found for λ.
From step 2
If there are n variables (i.e., x_1, · · · , x_n) then you will get n equations with
n + 1 unknowns (i.e., n variables x_i and one Lagrange multiplier λ).
102 / 126
246. Images/
Example
We can apply that to the following problem
min f(x, y) = x² − 8x + y² − 12y + 48
s.t. x + y = 8
103 / 126
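Following the five steps above on this example: stationarity gives 2x − 8 − λ = 0 and 2y − 12 − λ = 0, so x = (8 + λ)/2 and y = (12 + λ)/2; the constraint then gives 10 + λ = 8, i.e. λ = −2, hence x = 3, y = 5 and f(3, 5) = −2. A minimal sketch that reproduces this (assuming SymPy is available; the script is only an illustration, not part of the slides):

# Quick check of the Lagrange-multiplier recipe on the example above.
import sympy as sp

x, y, lam = sp.symbols('x y lambda')
f = x**2 - 8*x + y**2 - 12*y + 48          # objective
h = x + y - 8                              # equality constraint h(x, y) = 0
L = f - lam * h                            # Lagrangian L(x, y, lambda)

# Stationarity plus feasibility: dL/dx = 0, dL/dy = 0, h = 0
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), h], [x, y, lam], dict=True)[0]
print(sol)                                 # {x: 3, y: 5, lambda: -2}
print(f.subs(sol))                         # optimal value: -2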
248. Images/
Lagrange Multipliers for Q
We can use the following constraint for that
Σ_l α_l = 1 (43)
We have the following cost function
Q(Θ, Θ^g) + λ [Σ_l α_l − 1] (44)
Deriving with respect to α_l
∂/∂α_l {Q(Θ, Θ^g) + λ [Σ_l α_l − 1]} = 0 (45)
105 / 126
251. Images/
Thus
The Q function
Q(Θ, Θ^g) = Σ_{i=1}^{N} Σ_{l=1}^{M} log(α_l) p(l | x_i, Θ^g) + Σ_{i=1}^{N} Σ_{l=1}^{M} log(p_l(x_i | θ_l)) p(l | x_i, Θ^g)
106 / 126
253. Images/
Finally
Setting the previous derivative equal to 0 (only the first term of Q depends on α_l)
Σ_{i=1}^{N} (1/α_l) p(l | x_i, Θ^g) + λ = 0 (46)
Thus
Σ_{i=1}^{N} p(l | x_i, Θ^g) = −λ α_l (47)
Summing both sides over l, the left side gives Σ_{i=1}^{N} Σ_{l=1}^{M} p(l | x_i, Θ^g) = N
and the right side gives −λ Σ_l α_l = −λ, hence
λ = −N (48)
108 / 126
256. Images/
Lagrange Multipliers
Thus
α_l = (1/N) Σ_{i=1}^{N} p(l | x_i, Θ^g) (49)
About θ_l
It is possible to get analytical expressions for θ_l as functions of
everything else.
This is for you to try!!!
For more, please look at
“Geometric Idea of Lagrange Multipliers” by John Wyatt.
109 / 126
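In code, equation 49 is just a column average of the responsibility matrix. A minimal sketch, assuming NumPy and an array resp of shape (N, M) whose entry resp[i, l] holds p(l | x_i, Θ^g) (the names are illustrative, not from the slides):

# Mixing-weight update (eq. 49): alpha_l = (1/N) * sum_i p(l | x_i, Theta^g).
import numpy as np

def update_mixing_weights(resp: np.ndarray) -> np.ndarray:
    assert np.allclose(resp.sum(axis=1), 1.0)   # each row is a distribution over components
    return resp.mean(axis=0)                    # shape (M,), sums to 1 by construction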
261. Images/
How to use this for Gaussian Distributions
For this, we need to refresh some linear algebra
1 tr(A + B) = tr(A) + tr(B)
2 tr(AB) = tr(BA)
3 Σ_i x_i^T A x_i = tr(AB) where B = Σ_i x_i x_i^T.
4 |A^{−1}| = 1 / |A|
Now, we need the derivative of a matrix function f(A)
Thus, ∂f(A)/∂A is the matrix whose (i, j)th entry is ∂f(A)/∂a_{i,j},
where a_{i,j} is the (i, j)th entry of A.
112 / 126
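Identities (3) and (4) are easy to check numerically. A small sketch with arbitrary data (assuming NumPy; sizes and values are my own choices):

# Quick numerical check of identities (3) and (4) above.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5
A = rng.standard_normal((d, d))
xs = rng.standard_normal((n, d))

lhs = sum(x @ A @ x for x in xs)            # sum_i x_i^T A x_i
B = sum(np.outer(x, x) for x in xs)         # B = sum_i x_i x_i^T
print(np.isclose(lhs, np.trace(A @ B)))     # identity (3): True

print(np.isclose(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A)))  # identity (4): True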
266. Images/
In addition
If A is symmetric
∂|A|/∂A = { A_{i,j}  if i = j
          { 2A_{i,j} if i ≠ j   (51)
Where A_{i,j} is the (i, j)th cofactor of A.
Note: The cofactor is the determinant obtained by deleting the row and column of
a given element of a matrix or determinant, preceded by a + or − sign depending
on whether the element is in a + or − position.
Thus
∂ log|A|/∂A = { A_{i,j}/|A|  if i = j
              { 2A_{i,j}/|A| if i ≠ j   = 2A^{−1} − diag(A^{−1}) (52)
113 / 126
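Equation 52 can be verified with a finite-difference check. A sketch with arbitrary data (assuming NumPy; the symmetric perturbation reflects that a_{i,j} and a_{j,i} are a single parameter of a symmetric matrix):

# Finite-difference check of d log|A| / dA = 2 A^{-1} - diag(A^{-1}) for symmetric A.
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)                       # symmetric positive definite

def logdet(M):
    return np.linalg.slogdet(M)[1]

eps = 1e-6
num = np.zeros_like(A)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(A)
        E[i, j] += eps
        E[j, i] += eps if i != j else 0.0         # keep the perturbation symmetric
        num[i, j] = (logdet(A + E) - logdet(A - E)) / (2 * eps)

Ainv = np.linalg.inv(A)
analytic = 2 * Ainv - np.diag(np.diag(Ainv))
print(np.allclose(num, analytic, atol=1e-5))      # True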
272. Images/
Thus, using the last part of equation 38
We get, after ignoring constant terms
Remember, they disappear after taking derivatives
Σ_{i=1}^{N} Σ_{l=1}^{M} log(p_l(x_i | µ_l, Σ_l)) p(l | x_i, Θ^g)
  = Σ_{i=1}^{N} Σ_{l=1}^{M} [−(1/2) log(|Σ_l|) − (1/2) (x_i − µ_l)^T Σ_l^{−1} (x_i − µ_l)] p(l | x_i, Θ^g) (55)
115 / 126
274. Images/
Finally
Thus, taking the derivative with respect to µ_l and setting it to zero
Σ_{i=1}^{N} Σ_l^{−1} (x_i − µ_l) p(l | x_i, Θ^g) = 0 (56)
Then
µ_l = Σ_{i=1}^{N} x_i p(l | x_i, Θ^g) / Σ_{i=1}^{N} p(l | x_i, Θ^g) (57)
116 / 126
276. Images/
Now, if we derive with respect to Σ_l
First, we rewrite equation 55
Σ_{i=1}^{N} Σ_{l=1}^{M} [−(1/2) log(|Σ_l|) − (1/2) (x_i − µ_l)^T Σ_l^{−1} (x_i − µ_l)] p(l | x_i, Θ^g)
  = Σ_{l=1}^{M} [−(1/2) log(|Σ_l|) Σ_{i=1}^{N} p(l | x_i, Θ^g) − (1/2) Σ_{i=1}^{N} p(l | x_i, Θ^g) tr(Σ_l^{−1} (x_i − µ_l)(x_i − µ_l)^T)]
  = Σ_{l=1}^{M} [−(1/2) log(|Σ_l|) Σ_{i=1}^{N} p(l | x_i, Θ^g) − (1/2) Σ_{i=1}^{N} p(l | x_i, Θ^g) tr(Σ_l^{−1} N_{l,i})]
Where N_{l,i} = (x_i − µ_l)(x_i − µ_l)^T.
117 / 126
280. Images/
Deriving with respect to Σ_l^{−1}
We have that
∂/∂Σ_l^{−1} { Σ_{l=1}^{M} [−(1/2) log(|Σ_l|) Σ_{i=1}^{N} p(l | x_i, Θ^g) − (1/2) Σ_{i=1}^{N} p(l | x_i, Θ^g) tr(Σ_l^{−1} N_{l,i})] }
  = (1/2) Σ_{i=1}^{N} p(l | x_i, Θ^g) (2Σ_l − diag(Σ_l)) − (1/2) Σ_{i=1}^{N} p(l | x_i, Θ^g) (2N_{l,i} − diag(N_{l,i}))
  = (1/2) Σ_{i=1}^{N} p(l | x_i, Θ^g) (2M_{l,i} − diag(M_{l,i}))
  = 2S − diag(S)
Where M_{l,i} = Σ_l − N_{l,i} and S = (1/2) Σ_{i=1}^{N} p(l | x_i, Θ^g) M_{l,i}
118 / 126
284. Images/
Thus, we have
Thus
If 2S − diag(S) = 0 =⇒ S = 0
Implying
(1/2) Σ_{i=1}^{N} p(l | x_i, Θ^g) [Σ_l − N_{l,i}] = 0 (58)
Or
Σ_l = Σ_{i=1}^{N} p(l | x_i, Θ^g) N_{l,i} / Σ_{i=1}^{N} p(l | x_i, Θ^g)
    = Σ_{i=1}^{N} p(l | x_i, Θ^g) (x_i − µ_l)(x_i − µ_l)^T / Σ_{i=1}^{N} p(l | x_i, Θ^g) (59)
119 / 126
287. Images/
Thus, we have the iterative updates
They are
α_l^{New} = (1/N) Σ_{i=1}^{N} p(l | x_i, Θ^g)
µ_l^{New} = Σ_{i=1}^{N} x_i p(l | x_i, Θ^g) / Σ_{i=1}^{N} p(l | x_i, Θ^g)
Σ_l^{New} = Σ_{i=1}^{N} p(l | x_i, Θ^g) (x_i − µ_l)(x_i − µ_l)^T / Σ_{i=1}^{N} p(l | x_i, Θ^g)
120 / 126
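Put together, one full EM iteration for a Gaussian mixture alternates the responsibility computation (E-step) with these three closed-form updates (M-step). A minimal NumPy/SciPy sketch written from the formulas above; the function and variable names are my own, not from the slides:

# One EM iteration for a Gaussian mixture, following the update equations above.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, alpha, mu, Sigma):
    """X: (N, d) data; alpha: (M,); mu: (M, d); Sigma: (M, d, d)."""
    N, _ = X.shape
    M = alpha.shape[0]

    # E-step: responsibilities p(l | x_i, Theta^g), shape (N, M)
    resp = np.column_stack([
        alpha[l] * multivariate_normal.pdf(X, mean=mu[l], cov=Sigma[l])
        for l in range(M)
    ])
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: closed-form updates
    Nl = resp.sum(axis=0)                    # sum_i p(l | x_i, Theta^g)
    alpha_new = Nl / N
    mu_new = (resp.T @ X) / Nl[:, None]
    Sigma_new = np.empty_like(Sigma)
    for l in range(M):
        diff = X - mu_new[l]                 # freshly updated mean, the usual choice in EM for GMMs
        Sigma_new[l] = (resp[:, l, None, None] * np.einsum('ni,nj->nij', diff, diff)).sum(axis=0) / Nl[l]
    return alpha_new, mu_new, Sigma_new

Iterating em_step until the log-likelihood (or the parameters) stops changing gives the final mixture estimate, which is exactly the algorithm summarized in the next slides.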
291. Images/
EM Algorithm for Gaussian Mixtures
Step 1
Initialize:
The means µ_l
The covariances Σ_l
The mixing coefficients α_l
122 / 126
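One simple way to carry out this initialization is shown below; the slides do not prescribe a particular scheme, so this sketch (random data points as means, a shared empirical covariance, uniform weights) is only an illustrative default, assuming NumPy:

# One possible initialization for the GMM-EM loop sketched earlier.
import numpy as np

def init_params(X, M, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, size=M, replace=False)]          # M random data points as means
    Sigma = np.tile(np.cov(X, rowvar=False), (M, 1, 1))    # shared empirical covariance
    alpha = np.full(M, 1.0 / M)                             # uniform mixing weights
    return alpha, mu, Sigma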