Introduction to Machine Learning
Expectation Maximization
Andres Mendez-Vazquez
September 16, 2017
Outline
1 Introduction
Maximum-Likelihood
Expectation Maximization
Examples of Applications of EM
2 Incomplete Data
Introduction
Using the Expected Value
Analogy
3 Derivation of the EM-Algorithm
Hidden Features
Proving Concavity
Using the Concave Functions for Approximation
From The Concave Function to the EM
The Final Algorithm
Notes and Convergence of EM
4 Finding Maximum Likelihood Mixture Densities
The Beginning of The Process
Bayes’ Rule for the components
Mixing Parameters
Maximizing Q using Lagrange Multipliers
Lagrange Multipliers
In Our Case
Example on Mixture of Gaussian Distributions
The EM Algorithm
Maximum-Likelihood
We have a density function p(x|Θ)
Assume that we have a data set of size N, X = {x1, x2, ..., xN}.
This data is known as the evidence.
We assume in addition that
The vectors are independent and identically distributed (i.i.d.) with distribution p under the parameters Θ.
What Can We Do With The Evidence?
We may use Bayes' Rule to estimate the parameters Θ
p(Θ|X) = P(X|Θ)P(Θ)/P(X)   (1)
Or, given a new observation x̃, compute
p(x̃|X)   (2)
i.e. the probability of the new observation being supported by the evidence X.
Thus
The former represents parameter estimation and the latter data prediction.
Focusing First on the Estimation of the Parameters Θ
We can interpret Bayes' Rule
p(Θ|X) = P(X|Θ)P(Θ)/P(X)   (3)
Interpreted as
posterior = (likelihood × prior)/evidence   (4)
Thus, we want the
likelihood = P(X|Θ)
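As a quick illustration of equations (3)-(4) (not from the original slides), the following sketch computes the posterior over a grid of parameter values, assuming a Bernoulli likelihood and a uniform prior; all numbers are made up:

```python
import numpy as np

# Minimal sketch: posterior over a grid of parameter values for a coin bias.
# Assumptions (not from the slides): Bernoulli likelihood, uniform prior.
X = np.array([1, 0, 1, 1, 0, 1])             # observed evidence (heads = 1)
theta_grid = np.linspace(0.01, 0.99, 99)     # candidate values of Theta

likelihood = np.array([np.prod(t**X * (1 - t)**(1 - X)) for t in theta_grid])
prior = np.ones_like(theta_grid) / theta_grid.size

evidence = np.sum(likelihood * prior)        # P(X), the normalizing constant
posterior = likelihood * prior / evidence    # Bayes' Rule, equations (3)-(4)

print(theta_grid[np.argmax(posterior)])      # most probable Theta under the evidence
```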
What we want...
We want to maximize the likelihood as a function of Θ.
Maximum-Likelihood
We have
p(x1, x2, ..., xN|Θ) = ∏_{i=1}^{N} p(xi|Θ)   (5)
Also known as the likelihood function.
Because multiplying many quantities p(xi|Θ) ≤ 1 can be problematic (numerical underflow), we work with the log-likelihood
L(Θ|X) = log ∏_{i=1}^{N} p(xi|Θ) = ∑_{i=1}^{N} log p(xi|Θ)   (6)
Maximum-Likelihood
We want to find a Θ∗
Θ∗ = argmax_Θ L(Θ|X)   (7)
The classic method
∂L(Θ|X)/∂θi = 0, ∀θi ∈ Θ   (8)
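A minimal sketch of equations (5)-(8) in practice (assuming a univariate Gaussian model and synthetic data; a closed-form solution exists, but here the zero-gradient condition of (8) is handled numerically):

```python
import numpy as np
from scipy.optimize import minimize

# Maximize the log-likelihood (6) of i.i.d. data under a univariate Gaussian.
# The data and starting point below are made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=500)           # observed evidence

def neg_log_likelihood(params):
    mu, log_sigma = params                              # unconstrained parametrization
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (X - mu)**2 / (2 * sigma**2))

# Setting the gradient to zero (equation 8) is done numerically here.
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                                # close to (2.0, 1.5)
```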
What happens if we have incomplete data?
Data could have been split into
1 X = observed data or incomplete data
2 Y = unobserved data
For this type of problem
We have the famous Expectation Maximization (EM) algorithm.
The Expectation Maximization
The EM algorithm
It was first developed by Dempster et al. (1977).
Its popularity comes from the fact
It can estimate an underlying distribution when data is incomplete or has missing values.
Two main applications
1 When missing values exist.
2 When a likelihood function can be simplified by assuming extra parameters that are missing or hidden.
Clustering
Given a data set
Given the fact that radial Gaussian functions are universal approximators:
The samples {x1, x2, ..., xN} are the visible variables.
The Gaussian distributions generating each of the samples are the hidden variables.
Then, we model the clusters as a mixture of Gaussians.
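To make the visible/hidden split concrete, here is a small generative sketch (illustrative parameters, not from the slides): the component label of each sample plays the role of the hidden variable, the sample itself is the visible one.

```python
import numpy as np

# Minimal generative sketch: the component label y_i is hidden, x_i is visible.
# The mixture weights, means and scales below are illustrative choices.
rng = np.random.default_rng(1)
alphas = np.array([0.3, 0.7])                    # mixing parameters, sum to 1
means = np.array([-2.0, 3.0])
sigmas = np.array([0.5, 1.0])

N = 1000
y = rng.choice(len(alphas), size=N, p=alphas)    # hidden component labels
x = rng.normal(means[y], sigmas[y])              # visible samples

# In clustering we only observe x; EM tries to recover alphas, means, sigmas.
print(x[:5], y[:5])
```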
Natural Language Processing
Unsupervised induction of probabilistic context-free grammars
Here, given a series of words o1, o2, o3, ... and a normalized context-free grammar,
we want to know the probability of each rule P(i → jk).
Thus
Here you have two kinds of variables:
The Visible Ones: The sequence of words.
The Hidden Ones: The rules that produce the possible sequences oi → oj.
Natural Language Processing
Baum-Welch Algorithm for Hidden Markov Models
[Figure: an HMM with hidden-state nodes generating the visible data]
Here
Hidden Variables: The circular nodes producing the data.
Visible Variables: The square nodes representing the samples.
Incomplete Data
We assume the following
Two parts of data
1 X = observed data or incomplete data
2 Y = unobserved data
Thus
Z = (X, Y) = complete data   (9)
Then, we have the following probability
p(z|Θ) = p(x, y|Θ) = p(y|x, Θ) p(x|Θ)   (10)
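A toy numerical check of equation (10), assuming a two-component Gaussian mixture with made-up parameters:

```python
import numpy as np
from scipy.stats import norm

# Check of equation (10): p(x, y|Theta) = p(y|x, Theta) p(x|Theta).
# Two-component Gaussian mixture with illustrative parameters.
alphas = np.array([0.4, 0.6])
means = np.array([0.0, 5.0])
x, y = 1.2, 0                                               # one observed value, one hidden label

p_xy = alphas[y] * norm.pdf(x, means[y], 1.0)               # p(x, y|Theta)
p_x = np.sum(alphas * norm.pdf(x, means, 1.0))              # p(x|Theta), marginalized over y
p_y_given_x = p_xy / p_x                                    # p(y|x, Theta), Bayes' Rule

print(np.isclose(p_xy, p_y_given_x * p_x))                  # True
```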
New Likelihood Function
The New Likelihood Function
L(Θ|Z) = L(Θ|X, Y) = p(X, Y|Θ)   (11)
Note: this is the complete-data likelihood.
Thus, we have
L(Θ|X, Y) = p(X, Y|Θ) = p(Y|X, Θ) p(X|Θ)   (12)
Did you notice?
p(X|Θ) is the likelihood of the observed data.
p(Y|X, Θ) is the likelihood of the unobserved data given the observed data!!!
Rewriting
This can be rewritten as
L(Θ|X, Y) = h_{X,Θ}(Y)   (13)
This basically signifies that X and Θ are constant and the only random part is Y.
In addition
L(Θ|X)   (14)
is known as the incomplete-data likelihood function.
Thus
We can connect both the incomplete- and complete-data equations by doing the following
L(Θ|X) = p(X|Θ)
       = ∑_{Y} p(X, Y|Θ)
       = ∑_{Y} p(Y|X, Θ) p(X|Θ)
       = ∏_{i=1}^{N} p(xi|Θ) ∑_{Y} p(Y|X, Θ)
Remarks
Problems
Normally, it is almost impossible to obtain a closed analytical solution for the previous equation.
However
We can use the expected value of log p(X, Y|Θ), which allows us to find an iterative procedure to approximate the solution.
The function we would like to have
The Q function
We want an estimation of the complete-data log-likelihood
log p(X, Y|Θ)   (15)
based on the information provided by X and Θ_{n−1}, where Θ_{n−1} is the set of parameters estimated at the previous step.
Think about the following: if we want to remove Y,
∫ [log p(X, Y|Θ)] p(Y|X, Θ_{n−1}) dY   (16)
Remark: we integrate out Y. Actually, this is the expected value of log p(X, Y|Θ).
Use the Expected Value
Then, we want an iterative method to guess Θ from Θ_{n−1}
Q(Θ, Θ_{n−1}) = E[log p(X, Y|Θ) | X, Θ_{n−1}]   (17)
Take into account that
1 X and Θ_{n−1} are taken as constants.
2 Θ is an ordinary (non-random) variable that we wish to adjust.
3 Y is a random variable governed by the distribution p(Y|X, Θ_{n−1}), the marginal distribution of the missing data.
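A minimal sketch of equation (17) for a single observation with a discrete hidden label (a two-component Gaussian mixture with illustrative parameters), where the expectation reduces to a finite sum:

```python
import numpy as np
from scipy.stats import norm

# Q(Theta, Theta_{n-1}) in equation (17) for ONE observation x with a hidden
# label y in {0, 1}. All numbers below are illustrative.
x = 1.2
alphas_old, means_old = np.array([0.4, 0.6]), np.array([0.0, 5.0])   # Theta_{n-1}
alphas_new, means_new = np.array([0.5, 0.5]), np.array([1.0, 4.0])   # candidate Theta

# p(y|x, Theta_{n-1}): distribution of the hidden label given the data.
joint_old = alphas_old * norm.pdf(x, means_old, 1.0)
post_old = joint_old / joint_old.sum()

# Q = E_{y|x,Theta_{n-1}}[ log p(x, y|Theta) ], a finite sum because y is discrete.
log_joint_new = np.log(alphas_new * norm.pdf(x, means_new, 1.0))
Q = np.sum(post_old * log_joint_new)
print(Q)
```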
Another Interpretation
Given the previous information
E[log p(X, Y|Θ) | X, Θ_{n−1}] = ∫_{Y∈𝒴} log p(X, Y|Θ) p(Y|X, Θ_{n−1}) dY
Something Notable
1 In the best of cases, this marginal distribution is a simple analytical expression of the assumed parameters Θ_{n−1}.
2 In the worst of cases, this density might be very hard to obtain.
Actually, we use
p(Y, X|Θ_{n−1}) = p(Y|X, Θ_{n−1}) p(X|Θ_{n−1})   (18)
which does not depend on Θ.
Back to the Q function
The intuition
We have the following analogy:
Consider a function h(θ, Y) with
θ a constant, and
Y ∼ pY(y), a random variable with distribution pY(y).
Thus, if Y is a discrete random variable
q(θ) = E_Y[h(θ, Y)] = ∑_{y} h(θ, y) pY(y)   (19)
Why E-step!!!
From here the name
This is basically the E-step.
The second step
It tries to maximize the Q function
Θn = argmax_Θ Q(Θ, Θ_{n−1})   (20)
Derivation of the EM-Algorithm
The likelihood function we are going to use
Let X be a random vector which results from a parametrized family:
L(Θ) = ln P(X|Θ)   (21)
Note: ln(x) is a strictly increasing function.
We wish to compute Θ
Based on an estimate Θn (after the nth iteration) such that L(Θ) > L(Θn).
Or the maximization of the difference
L(Θ) − L(Θn) = ln P(X|Θ) − ln P(X|Θn)   (22)
Introducing the Hidden Features
Given that the hidden random vector Y exists with values y
P(X|Θ) = ∑_{y} P(X|y, Θ) P(y|Θ)   (23)
Thus, using our first constraint L(Θ) − L(Θn)
L(Θ) − L(Θn) = ln ∑_{y} P(X|y, Θ) P(y|Θ) − ln P(X|Θn)   (24)
Here, we introduce some concepts of convexity
For Convexity
Theorem (Jensen's inequality)
Let f be a convex function defined on an interval I. If x1, x2, ..., xn ∈ I and λ1, λ2, ..., λn ≥ 0 with ∑_{i=1}^{n} λi = 1, then
f(∑_{i=1}^{n} λi xi) ≤ ∑_{i=1}^{n} λi f(xi)   (25)
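A quick numerical illustration of inequality (25) (not part of the original slides), using the convex function f(x) = x² and arbitrary weights:

```python
import numpy as np

# Numerical check of Jensen's inequality (25) for the convex f(x) = x**2.
# Points and weights are arbitrary; the weights sum to 1.
f = lambda x: x**2
x = np.array([0.5, 2.0, -1.0, 4.0])
lam = np.array([0.1, 0.4, 0.2, 0.3])

lhs = f(np.dot(lam, x))            # f( sum_i lambda_i x_i )
rhs = np.dot(lam, f(x))            # sum_i lambda_i f(x_i)
print(lhs <= rhs)                  # True for any convex f
```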
Proof:
For n = 1
We have the trivial case.
For n = 2
This is the definition of convexity.
Now the inductive hypothesis
We assume that the theorem is true for some n.
Now, we have
The following linear combination of the λi
f(∑_{i=1}^{n+1} λi xi) = f(λ_{n+1} x_{n+1} + ∑_{i=1}^{n} λi xi)
                       = f(λ_{n+1} x_{n+1} + [(1 − λ_{n+1})/(1 − λ_{n+1})] ∑_{i=1}^{n} λi xi)
                       ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f([1/(1 − λ_{n+1})] ∑_{i=1}^{n} λi xi)
Did you notice?
Something Notable
∑_{i=1}^{n+1} λi = 1
Thus
∑_{i=1}^{n} λi = 1 − λ_{n+1}
Finally
[1/(1 − λ_{n+1})] ∑_{i=1}^{n} λi = 1
Now
We have that
f(∑_{i=1}^{n+1} λi xi) ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f([1/(1 − λ_{n+1})] ∑_{i=1}^{n} λi xi)
                       ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) [1/(1 − λ_{n+1})] ∑_{i=1}^{n} λi f(xi)
                       = λ_{n+1} f(x_{n+1}) + ∑_{i=1}^{n} λi f(xi)   Q.E.D.
Thus, for concave functions
It is possible to show that
Given ln(x), a concave function:
ln(∑_{i=1}^{n} λi xi) ≥ ∑_{i=1}^{n} λi ln(xi)
If we take into consideration
Assume that λi = P(y|X, Θn). We know that
1 P(y|X, Θn) ≥ 0
2 ∑_{y} P(y|X, Θn) = 1
We have
First
L(Θ) − L(Θn) = ln ∑_{y} P(X|y, Θ) P(y|Θ) − ln P(X|Θn)
             = ln ∑_{y} P(X|y, Θ) P(y|Θ) · [P(y|X, Θn)/P(y|X, Θn)] − ln P(X|Θn)
             = ln ∑_{y} P(y|X, Θn) [P(X|y, Θ) P(y|Θ)/P(y|X, Θn)] − ln P(X|Θn)
             ≥ ∑_{y} P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ)/P(y|X, Θn)] − ∑_{y} P(y|X, Θn) ln P(X|Θn)   Why this?
Next
Because
∑_{y} P(y|X, Θn) = 1
Then
L(Θ) − L(Θn) ≥ ∑_{y} P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))] = ∆(Θ|Θn)
Then, we have
Then, we have proved that
L(Θ) ≥ L(Θn) + ∆(Θ|Θn)   (26)
Then, we define a new function
l(Θ|Θn) = L(Θn) + ∆(Θ|Θn)   (27)
Thus l(Θ|Θn)
It is bounded from above by L(Θ), i.e. l(Θ|Θn) ≤ L(Θ).
Now, we can do the following
We evaluate at Θn
l(Θn|Θn) = L(Θn) + ∆(Θn|Θn)
         = L(Θn) + ∑_{y} P(y|X, Θn) ln [P(X|y, Θn) P(y|Θn) / (P(y|X, Θn) P(X|Θn))]
         = L(Θn) + ∑_{y} P(y|X, Θn) ln [P(X, y|Θn) / P(X, y|Θn)]
         = L(Θn)
This means that
For Θ = Θn, the functions L(Θ) and l(Θ|Θn) are equal.
Therefore
The function l(Θ|Θn) has the following properties
1 It is bounded from above by L(Θ), i.e. l(Θ|Θn) ≤ L(Θ).
2 For Θ = Θn, the functions L(Θ) and l(Θ|Θn) are equal.
3 The function l(Θ|Θn) is concave... How?
First
We have the value L(Θn)
We know that L(Θn) is constant, i.e. an offset value.
What about ∆(Θ|Θn)?
∑_{y} P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))]
We have that ln is a concave function
ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))]
Therefore
Each element is concave
P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))]
Therefore, the sum of concave functions is a concave function
∑_{y} P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))]
Given the Concave Function
Thus, we have that
1 We can select the Θ that maximizes l(Θ|Θn).
2 Thus, given a Θn, we can generate Θn+1.
The process can be seen in the following graph.
Given
The Previous Constraints
1 l(Θ|Θn) is bounded from above by L(Θ):
l(Θ|Θn) ≤ L(Θ)
2 For Θ = Θn, the functions L(Θ) and l(Θ|Θn) are equal:
L(Θn) = l(Θn|Θn)
3 The function l(Θ|Θn) is concave.
From
The following
Θn+1 = argmax_Θ {l(Θ|Θn)}
     = argmax_Θ { L(Θn) + ∑_{y} P(y|X, Θn) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θn) P(X|Θn))] }
The terms with Θn are constants:
     ≈ argmax_Θ { ∑_{y} P(y|X, Θn) ln (P(X|y, Θ) P(y|Θ)) }
     = argmax_Θ { ∑_{y} P(y|X, Θn) ln ( [P(X, y|Θ)/P(y|Θ)] · [P(y, Θ)/P(Θ)] ) }
     = argmax_Θ { ∑_{y} P(y|X, Θn) ln ( [ (P(X, y, Θ)/P(Θ)) / (P(y, Θ)/P(Θ)) ] · [P(y, Θ)/P(Θ)] ) }
Thus
Then
Θn+1 = argmax_Θ { ∑_{y} P(y|X, Θn) ln ( [P(X, y, Θ)/P(y, Θ)] · [P(y, Θ)/P(Θ)] ) }
     = argmax_Θ { ∑_{y} P(y|X, Θn) ln [P(X, y, Θ)/P(Θ)] }
     = argmax_Θ { ∑_{y} P(y|X, Θn) ln (P(X, y|Θ)) }
     = argmax_Θ { E_{y|X,Θn}[ln (P(X, y|Θ))] }
Then argmax_Θ {l(Θ|Θn)} ≈ argmax_Θ { E_{y|X,Θn}[ln (P(X, y|Θ))] }
The EM-Algorithm
Steps of EM
1 Expectation under the hidden variables.
2 Maximization of the resulting formula.
E-Step
Determine the conditional expectation E_{y|X,Θn}[ln (P(X, y|Θ))].
M-Step
Maximize this expression with respect to Θ.
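These two steps translate almost directly into code. Below is a generic EM skeleton, a sketch rather than a definitive implementation; e_step and m_step are placeholder callables that a concrete model would have to supply, and the parameters are assumed to live in a single NumPy array:

```python
import numpy as np

def em(X, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM skeleton. `e_step(X, theta)` returns the posterior over the
    hidden variables given the current parameters; `m_step(X, post)` returns
    the parameters maximizing the expected complete-data log-likelihood.
    Both callables are placeholders to be supplied by a concrete model."""
    theta = theta0
    for n in range(max_iter):
        post = e_step(X, theta)           # E-step: p(y | X, Theta_n)
        new_theta = m_step(X, post)       # M-step: argmax_Theta Q(Theta, Theta_n)
        # Stop at a (numerical) fixed point: Theta_{n+1} ~ Theta_n,
        # matching the convergence discussion below.
        if np.max(np.abs(new_theta - theta)) < tol:
            return new_theta
        theta = new_theta
    return theta
```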
Notes and Convergence of EM
Gains between L(Θ) and l(Θ|Θn)
Using the hidden variables, it is possible to simplify the optimization of L(Θ) through l(Θ|Θn).
Convergence
Remember that Θn+1 is the estimate for Θ which maximizes the difference ∆(Θ|Θn).
Therefore
Then, we have
Given the initial estimate Θn of Θ:
∆(Θn|Θn) = 0
Now
If we choose Θn+1 to maximize ∆(Θ|Θn), then
∆(Θn+1|Θn) ≥ ∆(Θn|Θn) = 0
We have that
The likelihood L(Θ) is non-decreasing over the EM iterations.
Notes and Convergence of EM
Properties
When the algorithm reaches a fixed point for some Θn, that value maximizes l(Θ|Θn).
Definition
A fixed point of a function is an element of the domain that is mapped to itself by the function:
f(x) = x
Basically, the EM algorithm does the following
EM[Θ∗] = Θ∗
At this moment
We have that
When the algorithm reaches a fixed point for some Θn, that value Θ∗ maximizes l(Θ|Θn).
Then, when the algorithm reaches such a fixed point,
basically Θn+1 = Θn.
Therefore
We have
Then
If L and l are differentiable at Θn
Since L and l are equal at Θn,
then Θn is a stationary point of L, i.e. the derivative of L vanishes at that point.
[Figure: Local Maxima]
However
You could end up in the following case, with no local maximum:
[Figure: Saddle Point]
For more on the subject
Please take a look at
Geoffrey McLachlan and Thriyambakam Krishnan, "The EM Algorithm and Extensions," John Wiley & Sons, New York, 1996.
Finding Maximum Likelihood Mixture Densities Parameters via EM
Something Notable
The mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community.
We have
p(x|Θ) = ∑_{i=1}^{M} αi pi(x|θi)   (28)
where
1 Θ = (α1, ..., αM, θ1, ..., θM)
2 ∑_{i=1}^{M} αi = 1
3 Each pi is a density function parametrized by θi.
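A small sketch of evaluating the mixture density in equation (28) with Gaussian components (illustrative weights, means, and scales, not from the slides):

```python
import numpy as np
from scipy.stats import norm

# Evaluate the mixture density of equation (28) with Gaussian components.
# M = 3 components; weights, means and scales are illustrative.
alphas = np.array([0.2, 0.5, 0.3])             # sum to 1
means = np.array([-3.0, 0.0, 4.0])
sigmas = np.array([1.0, 0.7, 1.5])

def mixture_pdf(x):
    # p(x|Theta) = sum_i alpha_i p_i(x|theta_i), for a scalar x
    return np.sum(alphas * norm.pdf(x, means, sigmas))

print(mixture_pdf(0.5))
```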
A log-likelihood for this function
We have
log L(Θ|X) = log ∏_{i=1}^{N} p(xi|Θ) = ∑_{i=1}^{N} log ( ∑_{j=1}^{M} αj pj(xi|θj) )   (29)
Note: this is too difficult to optimize because of the sum inside the logarithm.
However
We can simplify this by assuming the following:
1 We assume that each unobserved datum in Y = {yi}_{i=1}^{N} has the range yi ∈ {1, ..., M}.
2 yi = k if the ith sample was generated by the kth mixture component.
Now
We have
log L(Θ|X, Y) = log [P(X, Y|Θ)]   (30)
Remember that X = {x1, x2, ..., xN} and Y = {y1, y2, ..., yN}; assuming independence,
log [P(X, Y|Θ)] = log [P(x1, x2, ..., xN, y1, y2, ..., yN|Θ)]
               = log [P(x1, y1, ..., xi, yi, ..., xN, yN|Θ)]
               = log ∏_{i=1}^{N} P(xi, yi|Θ)
               = ∑_{i=1}^{N} log P(xi, yi|Θ)
Then
Thus, by the chain rule
∑_{i=1}^{N} log P(xi, yi|Θ) = ∑_{i=1}^{N} log [P(xi|yi, θ_{yi}) P(yi|θ_{yi})]   (31)
Question: do you need yi if you know θ_{yi}, or the other way around?
Finally
∑_{i=1}^{N} log [P(xi|yi, θ_{yi}) P(yi|θ_{yi})] = ∑_{i=1}^{N} log [P(yi) p_{yi}(xi|θ_{yi})]   (32)
NOPE: you do not need yi if you know θ_{yi}, or the other way around.
Finally, we have
Making α_{yi} = P(yi)
log L(Θ|X, Y) = ∑_{i=1}^{N} log [α_{yi} p_{yi}(xi|θ_{yi})]   (33)
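A small sketch of equation (33), assuming Gaussian components and hypothetical hidden labels: once the labels are known, the complete-data log-likelihood is a plain sum.

```python
import numpy as np
from scipy.stats import norm

# Complete-data log-likelihood of equation (33) for a Gaussian mixture,
# assuming the hidden labels y were observed. All values are illustrative.
alphas = np.array([0.4, 0.6])
means = np.array([0.0, 5.0])
sigmas = np.array([1.0, 2.0])

x = np.array([0.3, 4.8, 5.5, -0.7])       # visible samples
y = np.array([0, 1, 1, 0])                # hypothetical hidden labels

complete_ll = np.sum(np.log(alphas[y]) + norm.logpdf(x, means[y], sigmas[y]))
print(complete_ll)
```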
Problem
Which Labels?
We do not know the values of Y.
We can get away with this by using the following idea
Assume that Y is a random variable.
Thus
You make a first guess for the parameters at the beginning of EM
Θ^g = (α^g_1, ..., α^g_M, θ^g_1, ..., θ^g_M)   (34)
Then, it is possible to evaluate each parametric probability pj(xi|θ^g_j).
Therefore
The mixing parameters αj can be thought of as the prior probabilities of each mixture component:
αj = p(component j)   (35)
We want to calculate the following probability
We want to calculate
p(yi|xi, Θ^g)
Basically
We want a Bayesian formulation of this probability,
assuming that y = (y1, y2, ..., yN) are independent and identically distributed samples from a distribution.
Using Bayes' Rule
Compute
p(yi|xi, Θ^g) = p(yi, xi|Θ^g) / p(xi|Θ^g)
             = p(xi|yi, θ^g_{yi}) p(yi|θ^g_{yi}) / p(xi|Θ^g)   (knowing yi, only θ^g_{yi} from Θ^g matters, so we drop the rest)
             = α^g_{yi} p_{yi}(xi|θ^g_{yi}) / p(xi|Θ^g)
             = α^g_{yi} p_{yi}(xi|θ^g_{yi}) / ∑_{k=1}^{M} α^g_k pk(xi|θ^g_k)
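This last expression is exactly what the E-step computes in code. A minimal vectorized sketch for one-dimensional Gaussian components (illustrative guess Θ^g, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, alphas, means, sigmas):
    """Responsibilities p(y_i = k | x_i, Theta^g) for a 1-D Gaussian mixture.
    Rows index samples, columns index mixture components."""
    # Numerator: alpha_k^g * p_k(x_i | theta_k^g) for every sample/component pair.
    joint = alphas * norm.pdf(x[:, None], means, sigmas)
    # Denominator: sum over k, i.e. p(x_i | Theta^g).
    return joint / joint.sum(axis=1, keepdims=True)

# Illustrative initial guess Theta^g and data.
alphas = np.array([0.5, 0.5])
means = np.array([-1.0, 2.0])
sigmas = np.array([1.0, 1.0])
x = np.array([-1.2, 0.1, 2.3, 1.9])

resp = e_step(x, alphas, means, sigmas)
print(resp.sum(axis=1))      # each row sums to 1
```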
As in Naive Bayes
We have the fact that there is one such probability per mixture component and per sample
p(yi|xi, Θ^g) = α^g_{yi} p_{yi}(xi|θ^g_{yi}) / ∑_{k=1}^{M} α^g_k pk(xi|θ^g_k), ∀xi, yi and k ∈ {1, ..., M}
This is going to be updated at each iteration of the EM algorithm
After the initial guess!!! Until convergence!!!
Additionally
We assume again that the samples yi are independent and identically distributed
p(y|X, Θ^g) = ∏_{i=1}^{N} p(yi|xi, Θ^g)   (36)
where y = (y1, y2, ..., yN).
Now, using equation 17
Then
Q(Θ|Θ^g) = ∑_{y∈𝒴} log (L(Θ|X, y)) p(y|X, Θ^g)
         = ∑_{y∈𝒴} ∑_{i=1}^{N} log [α_{yi} p_{yi}(xi|θ_{yi})] ∏_{j=1}^{N} p(yj|xj, Θ^g)
Here, a small stop

What is the meaning of \sum_{y \in Y}?
It is actually a summation over all possible states of the random vector y. Then, we can rewrite the previous summation as

\sum_{y \in Y} = \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M}

running over all the samples {x_1, x_2, ..., x_N}.
Then

We have

Q(\Theta \mid \Theta^g) = \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \sum_{i=1}^{N} \left[ \log\left[\alpha_{y_i} p_{y_i}(x_i \mid \theta_{y_i})\right] \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g) \right]
We introduce the following

We have the following function

\delta_{l, y_i} = \begin{cases} 1 & \text{if } l = y_i \\ 0 & \text{if } l \neq y_i \end{cases}

Therefore, we can do the following

\alpha_i = \sum_{j=1}^{M} \delta_{i,j}\, \alpha_j

Then

\log\left[\alpha_{y_i} p_{y_i}(x_i \mid \theta_{y_i})\right] \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g) = \sum_{l=1}^{M} \delta_{l, y_i} \log\left[\alpha_l p_l(x_i \mid \theta_l)\right] \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g)
Thus

We have that

\sum_{y_1=1}^{M} \cdots \sum_{y_N=1}^{M} \sum_{i=1}^{N} \log\left[\alpha_{y_i} p_{y_i}(x_i \mid \theta_{y_i})\right] \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g) = *

* = \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \sum_{i=1}^{N} \sum_{l=1}^{M} \delta_{l,y_i} \log\left[\alpha_l p_l(x_i \mid \theta_l)\right] \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g)

  = \sum_{i=1}^{N} \sum_{l=1}^{M} \log\left[\alpha_l p_l(x_i \mid \theta_l)\right] \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \left[ \delta_{l,y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g) \right]

because \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} applies only to \delta_{l,y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g).
Then, we have that

First notice the following

\sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{l,y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g)
  = \sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} \left( \sum_{y_i=1}^{M} \delta_{l,y_i}\, p(y_i \mid x_i, \Theta^g) \right) \prod_{j=1, j \neq i}^{N} p(y_j \mid x_j, \Theta^g)

Then, we have

\sum_{y_i=1}^{M} \delta_{l,y_i}\, p(y_i \mid x_i, \Theta^g) = p(l \mid x_i, \Theta^g)
In this way

Plugging back the previous equation

\sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_N=1}^{M} \delta_{l,y_i} \prod_{j=1}^{N} p(y_j \mid x_j, \Theta^g)
  = \sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} p(l \mid x_i, \Theta^g) \prod_{j=1, j \neq i}^{N} p(y_j \mid x_j, \Theta^g)
  = \left[ \sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} \prod_{j=1, j \neq i}^{N} p(y_j \mid x_j, \Theta^g) \right] p(l \mid x_i, \Theta^g)
Now, what about...?

The left part of the equation

\sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} \prod_{j=1, j \neq i}^{N} p(y_j \mid x_j, \Theta^g)
  = \left( \sum_{y_1=1}^{M} p(y_1 \mid x_1, \Theta^g) \right) \cdots \left( \sum_{y_{i-1}=1}^{M} p(y_{i-1} \mid x_{i-1}, \Theta^g) \right) \times \left( \sum_{y_{i+1}=1}^{M} p(y_{i+1} \mid x_{i+1}, \Theta^g) \right) \cdots \left( \sum_{y_N=1}^{M} p(y_N \mid x_N, \Theta^g) \right)
  = \prod_{j=1, j \neq i}^{N} \left( \sum_{y_j=1}^{M} p(y_j \mid x_j, \Theta^g) \right)
Then, we have that

Plugging back into the original equation

\left[ \sum_{y_1=1}^{M} \cdots \sum_{y_{i-1}=1}^{M} \sum_{y_{i+1}=1}^{M} \cdots \sum_{y_N=1}^{M} \prod_{j=1, j \neq i}^{N} p(y_j \mid x_j, \Theta^g) \right] p(l \mid x_i, \Theta^g)
  = \left[ \prod_{j=1, j \neq i}^{N} \left( \sum_{y_j=1}^{M} p(y_j \mid x_j, \Theta^g) \right) \right] p(l \mid x_i, \Theta^g)
We can use properties of probability

We know that

\sum_{y_i=1}^{M} p(y_i \mid x_i, \Theta^g) = 1    (37)

Then

\left[ \prod_{j=1, j \neq i}^{N} \left( \sum_{y_j=1}^{M} p(y_j \mid x_j, \Theta^g) \right) \right] p(l \mid x_i, \Theta^g)
  = \left[ \prod_{j=1, j \neq i}^{N} 1 \right] p(l \mid x_i, \Theta^g)
  = p(l \mid x_i, \Theta^g)
  = \frac{\alpha^g_l\, p_l(x_i \mid \theta^g_l)}{\sum_{k=1}^{M} \alpha^g_k\, p_k(x_i \mid \theta^g_k)}
Thus

We can write Q in the following way

Q(\Theta, \Theta^g) = \sum_{i=1}^{N} \sum_{l=1}^{M} \log\left[\alpha_l p_l(x_i \mid \theta_l)\right] p(l \mid x_i, \Theta^g)
                    = \sum_{i=1}^{N} \sum_{l=1}^{M} \log(\alpha_l)\, p(l \mid x_i, \Theta^g) + \sum_{i=1}^{N} \sum_{l=1}^{M} \log\left(p_l(x_i \mid \theta_l)\right) p(l \mid x_i, \Theta^g)    (38)
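As a quick numerical illustration of equation 38 (not slide material), Q can be evaluated from the responsibility matrix resp[i, l] = p(l | x_i, Θ^g) of the earlier sketch; log_pdf is a hypothetical helper returning log p_l(x_i | θ_l).

    import numpy as np

    def Q_value(X, alphas, thetas, resp, log_pdf):
        N, M = resp.shape
        # first sum of (38): responsibilities weighted by the log mixing coefficients
        term_alpha = float(np.sum(resp * np.log(alphas)))
        # second sum of (38): responsibilities weighted by the component log-densities
        term_dens = sum(resp[i, l] * log_pdf(X[i], thetas[l])
                        for i in range(N) for l in range(M))
        return term_alpha + term_dens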
A Method

That could be used as a general framework
To solve problems posed as EM problems.
First, we will look at the Lagrange multipliers setup.
Then, we will look at a specific case using a mixture of Gaussians.

Note
Not every mixture of distributions will give you an analytical solution.
Lagrange Multipliers

The method of Lagrange multipliers
Gives a set of necessary conditions to identify optimal points of equality-constrained optimization problems.

This is done by converting a constrained problem
Into an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers.
Lagrange Multipliers

The classical problem formulation

min f(x_1, x_2, ..., x_n)
s.t. h_1(x_1, x_2, ..., x_n) = 0

It can be converted into

min L(x_1, x_2, ..., x_n, \lambda) = min \left\{ f(x_1, x_2, ..., x_n) - \lambda h_1(x_1, x_2, ..., x_n) \right\}    (39)

where
L(x, λ) is the Lagrangian function.
λ is an unspecified positive or negative constant called the Lagrange multiplier.
Lagrange Multipliers are applications of Physics

Think about this
Remember the First Law of Newton!!!

Yes!!!
A system in equilibrium does not move: a static body.
Lagrange was a Physicist

He was thinking of the following formula
A system in equilibrium satisfies:

F_1 + F_2 + ... + F_K = 0    (40)

But functions do not have forces?
Are you sure?

Think about the following
The gradient of a surface.
Gradient to a Surface

After all, a gradient is a measure of the maximal change
For example, the gradient of a function of three variables:

\nabla f(x) = i \frac{\partial f(x)}{\partial x} + j \frac{\partial f(x)}{\partial y} + k \frac{\partial f(x)}{\partial z}    (41)

where i, j and k are unit vectors in the directions x, y and z.
Example

We have f(x, y) = x \exp\{-x^2 - y^2\}
(Figure: plot of this function, not reproduced here.)
Example

With the gradient shown at the contours when projected onto the 2D plane.
(Figure: contour plot of f with gradient vectors, not reproduced here.)
Now, Think about this

Yes, we can use the gradient
However, we need to do some scaling of the forces by using parameters λ. Thus, we have

F_0 + \lambda_1 F_1 + ... + \lambda_K F_K = 0    (42)

where F_0 is the gradient of the principal cost function and F_i, for i = 1, 2, ..., K, are the gradients of the constraints.
Thus

If we have the following optimization:

min f(x)
s.t. g_1(x) = 0
     g_2(x) = 0
Geometric interpretation in the case of minimization

What is wrong? The gradients point in the other direction; we can fix this by simply multiplying by -1.
Here the cost function is f(x, y) = x \exp\{-x^2 - y^2\} and we want to minimize it.
(Figure: contours of f(x) with the constraint curves g_1(x) = 0 and g_2(x) = 0.)

-\nabla f(x) + \lambda_1 \nabla g_1(x) + \lambda_2 \nabla g_2(x) = 0

Nevertheless, it is equivalent to \nabla f(x) - \lambda_1 \nabla g_1(x) - \lambda_2 \nabla g_2(x) = 0.
Method

Steps
1 The original problem is rewritten as: minimize L(x, λ) = f(x) - λ h_1(x).
2 Take the derivatives of L(x, λ) with respect to x_i and set them equal to zero.
3 Express all x_i in terms of the Lagrange multiplier λ.
4 Plug x, written in terms of λ, into the constraint h_1(x) = 0 and solve for λ.
5 Calculate x by using the value just found for λ.

From step 2
If there are n variables (i.e., x_1, ..., x_n) then you will get n equations with n + 1 unknowns (i.e., n variables x_i and one Lagrange multiplier λ).
Example

We can apply that to the following problem

min f(x, y) = x^2 - 8x + y^2 - 12y + 48
s.t. x + y = 8
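A worked solution following the five steps above (not on the slide, but easy to verify):

L(x, y, λ) = x^2 - 8x + y^2 - 12y + 48 - λ(x + y - 8)
∂L/∂x = 2x - 8 - λ = 0  ⟹  x = (8 + λ)/2
∂L/∂y = 2y - 12 - λ = 0  ⟹  y = (12 + λ)/2
Plugging into x + y = 8: (8 + λ)/2 + (12 + λ)/2 = 8  ⟹  10 + λ = 8  ⟹  λ = -2
Hence x = 3, y = 5, and f(3, 5) = 9 - 24 + 25 - 60 + 48 = -2.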
Lagrange Multipliers for Q

We can use the following constraint for that

\sum_{l} \alpha_l = 1    (43)

We have the following cost function

Q(\Theta, \Theta^g) + \lambda \left( \sum_{l} \alpha_l - 1 \right)    (44)

Deriving by α_l

\frac{\partial}{\partial \alpha_l} \left[ Q(\Theta, \Theta^g) + \lambda \left( \sum_{l} \alpha_l - 1 \right) \right] = 0    (45)
Thus

The Q function

Q(\Theta, \Theta^g) = \sum_{i=1}^{N} \sum_{l=1}^{M} \log(\alpha_l)\, p(l \mid x_i, \Theta^g) + \sum_{i=1}^{N} \sum_{l=1}^{M} \log\left(p_l(x_i \mid \theta_l)\right) p(l \mid x_i, \Theta^g)
Deriving

We have

\frac{\partial}{\partial \alpha_l} \left[ Q(\Theta, \Theta^g) + \lambda \left( \sum_{l} \alpha_l - 1 \right) \right] = \sum_{i=1}^{N} \frac{1}{\alpha_l}\, p(l \mid x_i, \Theta^g) + \lambda
Finally

Setting the previous equation equal to 0

\sum_{i=1}^{N} \frac{1}{\alpha_l}\, p(l \mid x_i, \Theta^g) + \lambda = 0    (46)

Thus

\sum_{i=1}^{N} p(l \mid x_i, \Theta^g) = -\lambda \alpha_l    (47)

Summing over l on both sides, and using \sum_{l} \alpha_l = 1 together with \sum_{l} p(l \mid x_i, \Theta^g) = 1, we get

\lambda = -N    (48)
Lagrange Multipliers

Thus

\alpha_l = \frac{1}{N} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g)    (49)

About θ_l
It is possible to get analytical expressions for θ_l as functions of everything else.
This is for you to try!!!

For more, please look at
“Geometric Idea of Lagrange Multipliers” by John Wyatt.
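In terms of the responsibility matrix resp from the earlier sketch (an illustrative assumption, not slide notation), equation 49 is just a column average:

    alpha_new = resp.mean(axis=0)   # alpha_l = (1/N) * sum_i p(l | x_i, Theta^g)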
Remember?

Gaussian Distribution

p_l(x \mid \mu_l, \Sigma_l) = \frac{1}{(2\pi)^{d/2} |\Sigma_l|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_l)^T \Sigma_l^{-1} (x - \mu_l) \right)    (50)
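A direct numpy translation of equation 50 may help fix the notation; it is only a sketch (in practice one would compute log-densities via a Cholesky factorization for numerical stability):

    import numpy as np

    def gaussian_pdf(x, mu, Sigma):
        d = len(mu)
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)          # (x - mu)^T Sigma^{-1} (x - mu)
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * quad) / norm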
How to use this for Gaussian Distributions

For this, we need to refresh some linear algebra
1 tr(A + B) = tr(A) + tr(B)
2 tr(AB) = tr(BA)
3 \sum_{i} x_i^T A x_i = tr(AB), where B = \sum_{i} x_i x_i^T.
4 |A^{-1}| = \frac{1}{|A|}

Now, we need the derivative of a matrix function f(A)
Thus, \frac{\partial f(A)}{\partial A} is the matrix whose (i, j)th entry is \frac{\partial f(A)}{\partial a_{i,j}}, where a_{i,j} is the (i, j)th entry of A.
In addition

If A is symmetric

\frac{\partial |A|}{\partial A} = \begin{cases} A_{i,j} & \text{if } i = j \\ 2 A_{i,j} & \text{if } i \neq j \end{cases}    (51)

where A_{i,j} is the (i, j)th cofactor of A.

Note: the cofactor is the determinant obtained by deleting the row and column of a given element of a matrix or determinant, preceded by a + or − sign depending on whether the element is in a + or − position.

Thus

\frac{\partial \log |A|}{\partial A} = \begin{cases} A_{i,j} / |A| & \text{if } i = j \\ 2 A_{i,j} / |A| & \text{if } i \neq j \end{cases} = 2 A^{-1} - \text{diag}\left(A^{-1}\right)    (52)
Finally

The last equation we need

\frac{\partial\, \text{tr}(AB)}{\partial A} = B + B^T - \text{diag}(B)    (53)

In addition

\frac{\partial\, x^T A x}{\partial x} = (A + A^T) x    (54)
Thus, using the last part of equation 38

We get, after ignoring constant terms
Remember, they disappear after taking derivatives

\sum_{i=1}^{N} \sum_{l=1}^{M} \log\left(p_l(x_i \mid \mu_l, \Sigma_l)\right) p(l \mid x_i, \Theta^g)
  = \sum_{i=1}^{N} \sum_{l=1}^{M} \left[ -\frac{1}{2} \log(|\Sigma_l|) - \frac{1}{2} (x_i - \mu_l)^T \Sigma_l^{-1} (x_i - \mu_l) \right] p(l \mid x_i, \Theta^g)    (55)
Finally

Thus, when taking the derivative with respect to μ_l

\sum_{i=1}^{N} \Sigma_l^{-1} (x_i - \mu_l)\, p(l \mid x_i, \Theta^g) = 0    (56)

Then

\mu_l = \frac{\sum_{i=1}^{N} x_i\, p(l \mid x_i, \Theta^g)}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}    (57)
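With the responsibility matrix resp and data matrix X from the earlier sketches (illustrative assumptions), equation 57 for all components at once reads:

    Nk = resp.sum(axis=0)                  # sum_i p(l | x_i, Theta^g), one entry per component
    mu_new = (resp.T @ X) / Nk[:, None]    # mu_l = responsibility-weighted average of the data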
Now, if we derive with respect to Σ_l

First, we rewrite equation 55

\sum_{i=1}^{N} \sum_{l=1}^{M} \left[ -\frac{1}{2} \log(|\Sigma_l|) - \frac{1}{2} (x_i - \mu_l)^T \Sigma_l^{-1} (x_i - \mu_l) \right] p(l \mid x_i, \Theta^g)
  = \sum_{l=1}^{M} \left[ -\frac{1}{2} \log(|\Sigma_l|) \sum_{i=1}^{N} p(l \mid x_i, \Theta^g) - \frac{1}{2} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\, \text{tr}\left( \Sigma_l^{-1} (x_i - \mu_l)(x_i - \mu_l)^T \right) \right]
  = \sum_{l=1}^{M} \left[ -\frac{1}{2} \log(|\Sigma_l|) \sum_{i=1}^{N} p(l \mid x_i, \Theta^g) - \frac{1}{2} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\, \text{tr}\left( \Sigma_l^{-1} N_{l,i} \right) \right]

where N_{l,i} = (x_i - \mu_l)(x_i - \mu_l)^T.
Deriving with respect to Σ_l^{-1}

We have that

\frac{\partial}{\partial \Sigma_l^{-1}} \left\{ \sum_{l=1}^{M} \left[ -\frac{1}{2} \log(|\Sigma_l|) \sum_{i=1}^{N} p(l \mid x_i, \Theta^g) - \frac{1}{2} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\, \text{tr}\left( \Sigma_l^{-1} N_{l,i} \right) \right] \right\}
  = \frac{1}{2} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g) \left( 2\Sigma_l - \text{diag}(\Sigma_l) \right) - \frac{1}{2} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g) \left( 2 N_{l,i} - \text{diag}(N_{l,i}) \right)
  = \frac{1}{2} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g) \left( 2 M_{l,i} - \text{diag}(M_{l,i}) \right)
  = 2S - \text{diag}(S)

where M_{l,i} = \Sigma_l - N_{l,i} and S = \frac{1}{2} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\, M_{l,i}.
Thus, we have

Thus
If 2S - \text{diag}(S) = 0 \implies S = 0

Implying

\frac{1}{2} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g) \left[ \Sigma_l - N_{l,i} \right] = 0    (58)

Or

\Sigma_l = \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\, N_{l,i}}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)} = \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\, (x_i - \mu_l)(x_i - \mu_l)^T}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}    (59)
Thus, we have the iterative updates

They are

\alpha_l^{New} = \frac{1}{N} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g)

\mu_l^{New} = \frac{\sum_{i=1}^{N} x_i\, p(l \mid x_i, \Theta^g)}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}

\Sigma_l^{New} = \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\, (x_i - \mu_l)(x_i - \mu_l)^T}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}
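Together these three updates form the M-step. A compact numpy sketch under the same illustrative assumptions as before (the small ridge term eps added to the covariances is my own addition to keep them well conditioned, and the covariance update uses the freshly computed mean, which is the usual choice):

    import numpy as np

    def m_step(X, resp, eps=1e-6):
        N, d = X.shape
        Nk = resp.sum(axis=0)                       # effective number of points per component
        alpha_new = Nk / N                          # alpha_l^New, equation (49)
        mu_new = (resp.T @ X) / Nk[:, None]         # mu_l^New, equation (57)
        Sigma_new = np.empty((len(Nk), d, d))
        for l in range(len(Nk)):
            diff = X - mu_new[l]
            # Sigma_l^New, equation (59), plus a small ridge for numerical stability
            Sigma_new[l] = (resp[:, l, None] * diff).T @ diff / Nk[l] + eps * np.eye(d)
        return alpha_new, mu_new, Sigma_new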
EM Algorithm for Gaussian Mixtures

Step 1
Initialize:
  The means μ_l
  The covariances Σ_l
  The mixing coefficients α_l
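The algorithm then alternates the E-step (computing the responsibilities p(l | x_i, Θ^g)) and the M-step (the three updates above), monitoring the log-likelihood until convergence. A hedged end-to-end sketch that reuses responsibilities and m_step from the earlier snippets; the initialization scheme, tolerance and iteration cap are illustrative choices, not taken from the slides:

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, M, n_iter=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        N, d = X.shape
        # Step 1: initialize means, covariances and mixing coefficients
        mus = X[rng.choice(N, size=M, replace=False)].copy()
        Sigmas = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(M)])
        alphas = np.full(M, 1.0 / M)
        prev_ll = -np.inf
        for _ in range(n_iter):
            # E-step: responsibilities under the current guess Theta^g
            resp = responsibilities(X, alphas, mus, Sigmas)
            # M-step: re-estimate the parameters with the updates derived above
            alphas, mus, Sigmas = m_step(X, resp)
            # incomplete-data log-likelihood, used as the convergence criterion
            mix = sum(a * multivariate_normal.pdf(X, mean=m, cov=S)
                      for a, m, S in zip(alphas, mus, Sigmas))
            ll = np.log(mix).sum()
            if ll - prev_ll < tol:
                break
            prev_ll = ll
        return alphas, mus, Sigmas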
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization

More Related Content

What's hot

Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine LearningKnoldus Inc.
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learningHaris Jamil
 
An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)ananth
 
Evaluating hypothesis
Evaluating  hypothesisEvaluating  hypothesis
Evaluating hypothesisswapnac12
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networksAkash Goel
 
Stochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptxStochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptxShubham Jaybhaye
 
Support Vector Machine ppt presentation
Support Vector Machine ppt presentationSupport Vector Machine ppt presentation
Support Vector Machine ppt presentationAyanaRukasar
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learningKien Le
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & UnderfittingSOUMIT KAR
 
Data Science - Part IX - Support Vector Machine
Data Science - Part IX -  Support Vector MachineData Science - Part IX -  Support Vector Machine
Data Science - Part IX - Support Vector MachineDerek Kane
 
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryHands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryAhmed Yousry
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERKnoldus Inc.
 
Deep Neural Networks for Machine Learning
Deep Neural Networks for Machine LearningDeep Neural Networks for Machine Learning
Deep Neural Networks for Machine LearningJustin Beirold
 
Machine Learning - Splitting Datasets
Machine Learning - Splitting DatasetsMachine Learning - Splitting Datasets
Machine Learning - Splitting DatasetsAndrew Ferlitsch
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximizationbutest
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
 
Autoencoders
AutoencodersAutoencoders
AutoencodersCloudxLab
 

What's hot (20)

Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine Learning
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
 
An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)
 
Evaluating hypothesis
Evaluating  hypothesisEvaluating  hypothesis
Evaluating hypothesis
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
 
Stochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptxStochastic Gradient Decent (SGD).pptx
Stochastic Gradient Decent (SGD).pptx
 
Support Vector Machine ppt presentation
Support Vector Machine ppt presentationSupport Vector Machine ppt presentation
Support Vector Machine ppt presentation
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & Underfitting
 
Data Science - Part IX - Support Vector Machine
Data Science - Part IX -  Support Vector MachineData Science - Part IX -  Support Vector Machine
Data Science - Part IX - Support Vector Machine
 
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryHands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
 
NAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIER
 
GMM
GMMGMM
GMM
 
Deep Neural Networks for Machine Learning
Deep Neural Networks for Machine LearningDeep Neural Networks for Machine Learning
Deep Neural Networks for Machine Learning
 
Machine Learning - Splitting Datasets
Machine Learning - Splitting DatasetsMachine Learning - Splitting Datasets
Machine Learning - Splitting Datasets
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximization
 
Em Algorithm | Statistics
Em Algorithm | StatisticsEm Algorithm | Statistics
Em Algorithm | Statistics
 
Bayesian learning
Bayesian learningBayesian learning
Bayesian learning
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 

Similar to 07 Machine Learning - Expectation Maximization

Explaining the idea behind automatic relevance determination and bayesian int...
Explaining the idea behind automatic relevance determination and bayesian int...Explaining the idea behind automatic relevance determination and bayesian int...
Explaining the idea behind automatic relevance determination and bayesian int...Florian Wilhelm
 
Mncs 16-09-4주-변승규-introduction to the machine learning
Mncs 16-09-4주-변승규-introduction to the machine learningMncs 16-09-4주-변승규-introduction to the machine learning
Mncs 16-09-4주-변승규-introduction to the machine learningSeung-gyu Byeon
 
Statistical Decision Theory
Statistical Decision TheoryStatistical Decision Theory
Statistical Decision TheorySangwoo Mo
 
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님AI Robotics KR
 
Anomaly detection Full Article
Anomaly detection Full ArticleAnomaly detection Full Article
Anomaly detection Full ArticleMenglinLiu1
 
Kinetic bands versus Bollinger Bands
Kinetic bands versus Bollinger  BandsKinetic bands versus Bollinger  Bands
Kinetic bands versus Bollinger BandsAlexandru Daia
 
STOMA FULL SLIDE (probability of IISc bangalore)
STOMA FULL SLIDE (probability of IISc bangalore)STOMA FULL SLIDE (probability of IISc bangalore)
STOMA FULL SLIDE (probability of IISc bangalore)2010111
 
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045Nimai Chand Das Adhikari
 
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...AIRCC Publishing Corporation
 
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...ijcsit
 
Using Cuckoo Algorithm for Estimating Two GLSD Parameters and Comparing it wi...
Using Cuckoo Algorithm for Estimating Two GLSD Parameters and Comparing it wi...Using Cuckoo Algorithm for Estimating Two GLSD Parameters and Comparing it wi...
Using Cuckoo Algorithm for Estimating Two GLSD Parameters and Comparing it wi...AIRCC Publishing Corporation
 
Hessian Matrices in Statistics
Hessian Matrices in StatisticsHessian Matrices in Statistics
Hessian Matrices in StatisticsFerris Jumah
 
Classifiers
ClassifiersClassifiers
ClassifiersAyurdata
 
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive ModelingHands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive ModelingArthur Charpentier
 
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONGENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONijaia
 
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...Andres Mendez-Vazquez
 
New_ML_Lecture_9.ppt
New_ML_Lecture_9.pptNew_ML_Lecture_9.ppt
New_ML_Lecture_9.pptEmmaMaria6
 

Similar to 07 Machine Learning - Expectation Maximization (20)

2008 spie gmm
2008 spie gmm2008 spie gmm
2008 spie gmm
 
Explaining the idea behind automatic relevance determination and bayesian int...
Explaining the idea behind automatic relevance determination and bayesian int...Explaining the idea behind automatic relevance determination and bayesian int...
Explaining the idea behind automatic relevance determination and bayesian int...
 
Mncs 16-09-4주-변승규-introduction to the machine learning
Mncs 16-09-4주-변승규-introduction to the machine learningMncs 16-09-4주-변승규-introduction to the machine learning
Mncs 16-09-4주-변승규-introduction to the machine learning
 
Statistical Decision Theory
Statistical Decision TheoryStatistical Decision Theory
Statistical Decision Theory
 
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
Bayesian Inference : Kalman filter 에서 Optimization 까지 - 김홍배 박사님
 
Anomaly detection Full Article
Anomaly detection Full ArticleAnomaly detection Full Article
Anomaly detection Full Article
 
Kinetic bands versus Bollinger Bands
Kinetic bands versus Bollinger  BandsKinetic bands versus Bollinger  Bands
Kinetic bands versus Bollinger Bands
 
STOMA FULL SLIDE (probability of IISc bangalore)
STOMA FULL SLIDE (probability of IISc bangalore)STOMA FULL SLIDE (probability of IISc bangalore)
STOMA FULL SLIDE (probability of IISc bangalore)
 
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045
Thesis Presentation_Extreme Learning Machine_Nimai_SC14M045
 
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...
 
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...
USING CUCKOO ALGORITHM FOR ESTIMATING TWO GLSD PARAMETERS AND COMPARING IT WI...
 
Using Cuckoo Algorithm for Estimating Two GLSD Parameters and Comparing it wi...
Using Cuckoo Algorithm for Estimating Two GLSD Parameters and Comparing it wi...Using Cuckoo Algorithm for Estimating Two GLSD Parameters and Comparing it wi...
Using Cuckoo Algorithm for Estimating Two GLSD Parameters and Comparing it wi...
 
Hessian Matrices in Statistics
Hessian Matrices in StatisticsHessian Matrices in Statistics
Hessian Matrices in Statistics
 
Classifiers
ClassifiersClassifiers
Classifiers
 
Fuzzy sets
Fuzzy sets Fuzzy sets
Fuzzy sets
 
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive ModelingHands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
 
2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONGENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
 
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
 
New_ML_Lecture_9.ppt
New_ML_Lecture_9.pptNew_ML_Lecture_9.ppt
New_ML_Lecture_9.ppt
 

More from Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusAndres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
 

Recently uploaded

KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamKIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamDr. Radhey Shyam
 
Supermarket billing system project report..pdf
Supermarket billing system project report..pdfSupermarket billing system project report..pdf
Supermarket billing system project report..pdfKamal Acharya
 
Electrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission lineElectrostatic field in a coaxial transmission line
Electrostatic field in a coaxial transmission lineJulioCesarSalazarHer1
 
2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edgePaco Orozco
 
ONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdf
ONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdfONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdf
ONLINE CAR SERVICING SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.PrashantGoswami42
 
Event Management System Vb Net Project Report.pdf
Event Management System Vb Net  Project Report.pdfEvent Management System Vb Net  Project Report.pdf
Event Management System Vb Net Project Report.pdfKamal Acharya
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdfKamal Acharya
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdfKamal Acharya
 
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptxCloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptxMd. Shahidul Islam Prodhan
 
School management system project report.pdf
School management system project report.pdfSchool management system project report.pdf
School management system project report.pdfKamal Acharya
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdfKamal Acharya
 
Natalia Rutkowska - BIM School Course in Kraków
Natalia Rutkowska - BIM School Course in KrakówNatalia Rutkowska - BIM School Course in Kraków
07 Machine Learning - Expectation Maximization

  • 1. Introduction to Machine Learning: Expectation Maximization. Andres Mendez-Vazquez, September 16, 2017.
  • 2. Outline: 1 Introduction (Maximum-Likelihood; Expectation Maximization; Examples of Applications of EM). 2 Incomplete Data (Introduction; Using the Expected Value; Analogy). 3 Derivation of the EM-Algorithm (Hidden Features; Proving Concavity; Using the Concave Functions for Approximation; From The Concave Function to the EM; The Final Algorithm; Notes and Convergence of EM). 4 Finding Maximum Likelihood Mixture Densities (The Beginning of The Process; Bayes' Rule for the components; Mixing Parameters; Maximizing Q using Lagrange Multipliers; Lagrange Multipliers; In Our Case; Example on Mixture of Gaussian Distributions; The EM Algorithm).
  • 4. Maximum-Likelihood. We have a density function p(x|Θ). Assume a data set of size N, X = {x_1, x_2, ..., x_N}; this data is known as the evidence. We assume in addition that the vectors are independent and identically distributed (i.i.d.) with distribution p under the parameter Θ.
  • 6. What Can We Do With The Evidence? We may use Bayes' Rule to estimate the parameters Θ: p(Θ|X) = P(X|Θ) P(Θ) / P(X) (1). Or, given a new observation x̃, we may compute p(x̃|X) (2), i.e. the probability that the new observation is supported by the evidence X. The former represents parameter estimation and the latter data prediction.
  • 9. Focusing First on the Estimation of the Parameters Θ. We can interpret Bayes' Rule p(Θ|X) = P(X|Θ) P(Θ) / P(X) (3) as posterior = (likelihood × prior) / evidence (4). Thus, we want the likelihood = P(X|Θ).
  • 12. What we want... We want to maximize the likelihood as a function of Θ.
  • 13. Maximum-Likelihood. We have p(x_1, x_2, ..., x_N|Θ) = ∏_{i=1}^{N} p(x_i|Θ) (5), also known as the likelihood function. Because multiplying quantities p(x_i|Θ) ≤ 1 can be numerically problematic, we work with the log-likelihood L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = ∑_{i=1}^{N} log p(x_i|Θ) (6).
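As a small illustration (not part of the slides), the sketch below evaluates L(Θ|X) = ∑_i log p(x_i|Θ) for a univariate Gaussian; the data set, the helper name gaussian_log_pdf, and all parameter values are made up for the example.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    # log p(x | mu, sigma) for a univariate Gaussian
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

rng = np.random.default_rng(0)
X = rng.normal(loc=1.5, scale=0.7, size=200)   # i.i.d. sample: the "evidence"

def log_likelihood(theta, X):
    # L(Theta | X) = sum_i log p(x_i | Theta); a raw product of densities would underflow
    mu, sigma = theta
    return np.sum(gaussian_log_pdf(X, mu, sigma))

print(log_likelihood((1.5, 0.7), X))   # near the generating parameters: high value
print(log_likelihood((0.0, 0.7), X))   # a worse parameter value scores lower
```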
  • 15. Maximum-Likelihood. We want to find Θ* = argmax_Θ L(Θ|X) (7). The classic method: solve ∂L(Θ|X)/∂θ_i = 0 for all θ_i ∈ Θ (8).
  • 17. What happens if we have incomplete data? The data may be split into 1) X = observed data (incomplete data) and 2) Y = unobserved data. For this type of problem we have the famous Expectation Maximization (EM) algorithm.
  • 21. The Expectation Maximization. The EM algorithm was first developed by Dempster et al. (1977). Its popularity comes from the fact that it can estimate an underlying distribution when data is incomplete or has missing values. Two main applications: 1) when missing values exist; 2) when the likelihood function can be simplified by assuming extra parameters that are missing or hidden.
  • 26. Clustering. Given a series of data sets, and given the fact that radial Gaussian functions are universal approximators: the samples {x_1, x_2, ..., x_N} are the visible variables, while the Gaussian distributions generating each sample are the hidden variables. Then we model the clusters as a mixture of Gaussians.
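A minimal clustering sketch along these lines, assuming scikit-learn is available; the two-blob data set and every parameter value are invented for illustration (GaussianMixture fits the mixture by EM internally).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),    # samples from one hidden component
               rng.normal(5.0, 1.0, size=(100, 2))])   # samples from another hidden component

gm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM under the hood
labels = gm.predict(X)             # most likely hidden component for each sample
print(gm.weights_, gm.means_)      # estimated mixing parameters and component means
```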
  • 28. Natural Language Processing: unsupervised induction of probabilistic context-free grammars. Given a sequence of words o_1, o_2, o_3, ... and a normalized context-free grammar, we want to estimate the probability of each rule, P(i → jk). Here you have two kinds of variables: the visible ones, the sequence of words, and the hidden ones, the rules that produced the observed sequence.
  • 30. Natural Language Processing: the Baum-Welch algorithm for Hidden Markov Models. (Figure: visible data and hidden data.) Here the hidden variables are the circular nodes producing the data, and the visible variables are the square nodes representing the samples.
  • 33. Incomplete Data. We assume two parts of data: 1) X = observed data (incomplete data); 2) Y = unobserved data. Thus Z = (X, Y) is the complete data (9), and we have the probability p(z|Θ) = p(x, y|Θ) = p(y|x, Θ) p(x|Θ) (10).
  • 38. New Likelihood Function. The new likelihood function is L(Θ|Z) = L(Θ|X, Y) = p(X, Y|Θ) (11), the complete-data likelihood. Thus L(Θ|X, Y) = p(X, Y|Θ) = p(Y|X, Θ) p(X|Θ) (12). Did you notice? p(X|Θ) is the likelihood of the observed data, and p(Y|X, Θ) is the likelihood of the unobserved data given the observed data.
  • 41. Rewriting. This can be rewritten as L(Θ|X, Y) = h_{X,Θ}(Y) (13), which signifies that X and Θ are held constant and the only random part is Y. In addition, L(Θ|X) (14) is known as the incomplete-data likelihood function.
  • 43. Thus, we can connect the incomplete-data and complete-data likelihoods as follows: L(Θ|X) = p(X|Θ) = ∑_Y p(X, Y|Θ) = ∑_Y p(Y|X, Θ) p(X|Θ) = ∏_{i=1}^{N} p(x_i|Θ) ∑_Y p(Y|X, Θ).
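A tiny numeric check of the marginalization step p(x|Θ) = ∑_y p(x, y|Θ) = ∑_y p(y|Θ) p(x|y, Θ) for a single observation from a two-component Gaussian mixture; all parameter values below are made up.

```python
import numpy as np

alphas = np.array([0.3, 0.7])                     # P(y | Theta)
mus, sigmas = np.array([0.0, 4.0]), np.array([1.0, 1.5])

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x = 1.2
joint = alphas * normal_pdf(x, mus, sigmas)       # p(x, y | Theta) for y = 1, 2
p_x = joint.sum()                                 # marginal p(x | Theta)
posterior = joint / p_x                           # p(y | x, Theta)
print(p_x, posterior, np.allclose(posterior * p_x, joint))
```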
  • 47. Remarks. Problem: normally it is almost impossible to obtain a closed-form analytical solution for the previous equation. However, we can use the expected value of log p(X, Y|Θ), which allows us to find an iterative procedure that approximates the solution.
  • 49. The function we would like to have: the Q function. We want an estimate of the complete-data log-likelihood log p(X, Y|Θ) (15) based on the information provided by X and Θ_{n-1}, where Θ_{n-1} is the set of parameters estimated at the previous step. Think about the following: if we want to remove Y, we integrate it out, ∫ [log p(X, Y|Θ)] p(Y|X, Θ_{n-1}) dY (16). Remark: this is exactly the expected value of log p(X, Y|Θ).
  • 52. Use the Expected Value. We want an iterative method to obtain Θ from Θ_{n-1}: Q(Θ, Θ_{n-1}) = E[log p(X, Y|Θ) | X, Θ_{n-1}] (17). Take into account that 1) X and Θ_{n-1} are treated as constants, 2) Θ is the ordinary variable that we wish to adjust, and 3) Y is a random variable governed by the distribution p(Y|X, Θ_{n-1}), the marginal distribution of the missing data.
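A sketch of what Q(Θ, Θ_old) looks like for one observation and a two-component Gaussian mixture, again with invented parameter values and helper names; the weights p(y|x, Θ_old) are computed once and then Q is a function of Θ alone.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def log_joint(x, y, theta):
    # log p(x, y | Theta) = log alpha_y + log p_y(x | theta_y)
    alphas, mus, sigmas = theta
    return np.log(alphas[y]) + np.log(normal_pdf(x, mus[y], sigmas[y]))

x = 1.2
theta_old = (np.array([0.5, 0.5]), np.array([0.0, 3.0]), np.array([1.0, 1.0]))

# E-step weights: p(y | x, Theta_old), held fixed while Theta varies
w = np.array([np.exp(log_joint(x, y, theta_old)) for y in (0, 1)])
w /= w.sum()

def Q(theta):
    # Q(Theta, Theta_old) = E_{y | x, Theta_old}[ log p(x, y | Theta) ]
    return sum(w[y] * log_joint(x, y, theta) for y in (0, 1))

print(Q(theta_old))     # ready to be maximized over Theta in the M-step
```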
  • 56. Another Interpretation. Given the previous information, E[log p(X, Y|Θ) | X, Θ_{n-1}] = ∫ log p(X, Y|Θ) p(Y|X, Θ_{n-1}) dY, integrating over the space of values of Y. Something notable: 1) in the best case, this marginal distribution is a simple analytical expression of the assumed parameters Θ_{n-1}; 2) in the worst case, this density might be very hard to obtain. Actually, we use p(Y, X|Θ_{n-1}) = p(Y|X, Θ_{n-1}) p(X|Θ_{n-1}) (18), which does not depend on Θ.
  • 61. Back to the Q function: the intuition. We have the following analogy. Consider h(θ, Y), a function of a constant θ and a random variable Y ~ p_Y(y). Then, if Y is a discrete random variable, q(θ) = E_Y[h(θ, Y)] = ∑_y h(θ, y) p_Y(y) (19).
  • 66. Why "E-step"? From here comes the name: this expectation is basically the E-step. The second step tries to maximize the Q function: Θ_n = argmax_Θ Q(Θ, Θ_{n-1}) (20).
  • 69. Derivation of the EM-Algorithm. The likelihood function we are going to use: let X be a random vector drawn from a parametrized family, and let L(Θ) = ln P(X|Θ) (21). Note that ln(x) is a strictly increasing function. We wish to compute a new estimate Θ, based on the estimate Θ_n from the nth iteration, such that L(Θ) > L(Θ_n); equivalently, we maximize the difference L(Θ) − L(Θ_n) = ln P(X|Θ) − ln P(X|Θ_n) (22).
  • 73. Introducing the Hidden Features. Given that the hidden random vector Y exists, with values y, P(X|Θ) = ∑_y P(X|y, Θ) P(y|Θ) (23). Thus, the difference of interest becomes L(Θ) − L(Θ_n) = ln ∑_y P(X|y, Θ) P(y|Θ) − ln P(X|Θ_n) (24).
  • 75. Here we introduce some concepts of convexity. Theorem (Jensen's inequality): let f be a convex function defined on an interval I. If x_1, x_2, ..., x_n ∈ I and λ_1, λ_2, ..., λ_n ≥ 0 with ∑_{i=1}^{n} λ_i = 1, then f(∑_{i=1}^{n} λ_i x_i) ≤ ∑_{i=1}^{n} λ_i f(x_i) (25).
  • 76. Proof: for n = 1 we have the trivial case, and for n = 2 it is exactly the definition of convexity. For the inductive hypothesis, we assume that the theorem is true for some n.
  • 79. Now, for a linear combination with n + 1 terms: f(∑_{i=1}^{n+1} λ_i x_i) = f(λ_{n+1} x_{n+1} + ∑_{i=1}^{n} λ_i x_i) = f(λ_{n+1} x_{n+1} + (1 − λ_{n+1}) · [1/(1 − λ_{n+1})] ∑_{i=1}^{n} λ_i x_i) ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f([1/(1 − λ_{n+1})] ∑_{i=1}^{n} λ_i x_i).
  • 82. Did you notice? Something notable: ∑_{i=1}^{n+1} λ_i = 1, thus ∑_{i=1}^{n} λ_i = 1 − λ_{n+1}, and finally [1/(1 − λ_{n+1})] ∑_{i=1}^{n} λ_i = 1.
  • 85. Now we have that f(∑_{i=1}^{n+1} λ_i x_i) ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) f([1/(1 − λ_{n+1})] ∑_{i=1}^{n} λ_i x_i) ≤ λ_{n+1} f(x_{n+1}) + (1 − λ_{n+1}) · [1/(1 − λ_{n+1})] ∑_{i=1}^{n} λ_i f(x_i) = λ_{n+1} f(x_{n+1}) + ∑_{i=1}^{n} λ_i f(x_i) = ∑_{i=1}^{n+1} λ_i f(x_i). Q.E.D.
  • 88. Thus, for concave functions. It can be shown that, since ln(x) is concave, ln(∑_{i=1}^{n} λ_i x_i) ≥ ∑_{i=1}^{n} λ_i ln(x_i). Now take λ_i = P(y|X, Θ_n); we know that 1) P(y|X, Θ_n) ≥ 0 and 2) ∑_y P(y|X, Θ_n) = 1.
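A quick numeric check of this concave version of Jensen's inequality, with a random probability vector standing in for P(y|X, Θ_n); the numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = rng.dirichlet(np.ones(5))        # lambda_i >= 0 and sum to 1, like P(y | X, Theta_n)
x = rng.uniform(0.1, 10.0, size=5)     # positive points so that ln is defined

lhs = np.log(np.dot(lam, x))           # ln( sum_i lambda_i x_i )
rhs = np.dot(lam, np.log(x))           # sum_i lambda_i ln(x_i)
print(lhs, rhs, lhs >= rhs)            # the inequality holds for the concave ln
```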
  • 90. We have, first: L(Θ) − L(Θ_n) = ln ∑_y P(X|y, Θ) P(y|Θ) − ln P(X|Θ_n) = ln ∑_y P(X|y, Θ) P(y|Θ) · [P(y|X, Θ_n) / P(y|X, Θ_n)] − ln P(X|Θ_n) = ln ∑_y P(y|X, Θ_n) [P(X|y, Θ) P(y|Θ) / P(y|X, Θ_n)] − ln P(X|Θ_n) ≥ ∑_y P(y|X, Θ_n) ln [P(X|y, Θ) P(y|Θ) / P(y|X, Θ_n)] − ∑_y P(y|X, Θ_n) ln P(X|Θ_n). Why this?
  • 94. Next. Because ∑_y P(y|X, Θ_n) = 1, we get L(Θ) − L(Θ_n) ≥ ∑_y P(y|X, Θ_n) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n))] = ∆(Θ|Θ_n).
  • 96. Then, we have proved that L(Θ) ≥ L(Θ_n) + ∆(Θ|Θ_n) (26). We define a new function l(Θ|Θ_n) = L(Θ_n) + ∆(Θ|Θ_n) (27). Thus l(Θ|Θ_n) is bounded from above by L(Θ), i.e. l(Θ|Θ_n) ≤ L(Θ).
  • 99. Now, we can do the following. Evaluate at Θ_n: l(Θ_n|Θ_n) = L(Θ_n) + ∆(Θ_n|Θ_n) = L(Θ_n) + ∑_y P(y|X, Θ_n) ln [P(X|y, Θ_n) P(y|Θ_n) / (P(y|X, Θ_n) P(X|Θ_n))] = L(Θ_n) + ∑_y P(y|X, Θ_n) ln [P(X, y|Θ_n) / P(X, y|Θ_n)] = L(Θ_n). This means that for Θ = Θ_n the functions L(Θ) and l(Θ|Θ_n) are equal.
  • 104. Therefore, the function l(Θ|Θ_n) has the following properties: 1) it is bounded from above by L(Θ), i.e. l(Θ|Θ_n) ≤ L(Θ); 2) for Θ = Θ_n the functions L(Θ) and l(Θ|Θ_n) are equal; 3) the function l(Θ|Θ_n) is concave... how?
  • 108. First, the value L(Θ_n): we know that L(Θ_n) is a constant, i.e. an offset value. What about ∆(Θ|Θ_n) = ∑_y P(y|X, Θ_n) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n))]? Note that the concave ln is applied to each term P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n)).
  • 111. Therefore, each element P(y|X, Θ_n) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n))] is concave, and the sum of concave functions is a concave function, so ∑_y P(y|X, Θ_n) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n))] is concave.
  • 114. Given the Concave Function. Thus we have that 1) we can select the Θ that maximizes l(Θ|Θ_n), and 2) in this way, given Θ_n, we can generate Θ_{n+1}. The process can be seen in the accompanying graph (figure in the slides).
  • 117. Given the previous constraints: 1) l(Θ|Θ_n) is bounded from above by L(Θ), l(Θ|Θ_n) ≤ L(Θ); 2) for Θ = Θ_n, the functions L(Θ) and l(Θ|Θ_n) are equal, L(Θ_n) = l(Θ_n|Θ_n); 3) the function l(Θ|Θ_n) is concave.
  • 121. From the following: Θ_{n+1} = argmax_Θ {l(Θ|Θ_n)} = argmax_Θ {L(Θ_n) + ∑_y P(y|X, Θ_n) ln [P(X|y, Θ) P(y|Θ) / (P(y|X, Θ_n) P(X|Θ_n))]}. The terms involving only Θ_n are constants, so this ≈ argmax_Θ {∑_y P(y|X, Θ_n) ln (P(X|y, Θ) P(y|Θ))} = argmax_Θ {∑_y P(y|X, Θ_n) ln [(P(X, y, Θ)/P(y, Θ)) · (P(y, Θ)/P(Θ))]}.
  • 126. Thus: Θ_{n+1} = argmax_Θ {∑_y P(y|X, Θ_n) ln [(P(X, y, Θ)/P(y, Θ)) · (P(y, Θ)/P(Θ))]} = argmax_Θ {∑_y P(y|X, Θ_n) ln [P(X, y, Θ)/P(Θ)]} = argmax_Θ {∑_y P(y|X, Θ_n) ln P(X, y|Θ)} = argmax_Θ {E_{y|X,Θ_n}[ln P(X, y|Θ)]}. Then argmax_Θ {l(Θ|Θ_n)} ≈ argmax_Θ {E_{y|X,Θ_n}[ln P(X, y|Θ)]}.
  • 132. The EM-Algorithm. Steps of EM: 1) expectation over the hidden variables; 2) maximization of the resulting expression. E-step: determine the conditional expectation E_{y|X,Θ_n}[ln P(X, y|Θ)]. M-step: maximize this expression with respect to Θ.
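A generic skeleton of this two-step loop, just to fix the control flow; e_step, m_step, and the simple convergence test are placeholder names that a concrete model (e.g. a Gaussian mixture) would have to supply, so this is a sketch rather than the algorithm itself.

```python
import numpy as np

def em(theta0, X, e_step, m_step, max_iter=100, tol=1e-6):
    # Generic EM loop: alternate the E-step and the M-step until the parameters stabilize.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(theta, X)            # E-step: build E_{y|X,theta}[ln P(X, y | .)]
        new_theta = m_step(stats, X)        # M-step: maximize that expectation over Theta
        if np.max(np.abs(np.asarray(new_theta) - theta)) < tol:   # crude convergence test
            return np.asarray(new_theta)
        theta = np.asarray(new_theta, dtype=float)
    return theta
```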
  • 137. Notes and Convergence of EM. Gains between L(Θ) and l(Θ|Θ_n): using the hidden variables, it is possible to simplify the optimization of L(Θ) through l(Θ|Θ_n). Convergence: remember that Θ_{n+1} is the estimate of Θ which maximizes the difference ∆(Θ|Θ_n).
  • 140. Therefore. Given the initial estimate Θ_n, ∆(Θ_n|Θ_n) = 0. Now, if we choose Θ_{n+1} to maximize ∆(Θ|Θ_n), then ∆(Θ_{n+1}|Θ_n) ≥ ∆(Θ_n|Θ_n) = 0. We thus have that the likelihood L(Θ_n) is non-decreasing from one EM iteration to the next.
  • 143. Notes and Convergence of EM. Properties: when the algorithm reaches a fixed point for some Θ_n, that value maximizes l(Θ|Θ_n). Definition: a fixed point of a function is an element of its domain that is mapped to itself by the function, f(x) = x. Basically, at convergence the EM algorithm satisfies EM[Θ*] = Θ*.
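A tiny, generic illustration of a fixed point f(x) = x (not specific to EM): iterating the cosine map converges to the point where cos(x*) = x*, just as iterating the EM update converges to a Θ* with EM[Θ*] = Θ*.

```python
import math

x = 1.0
for _ in range(100):
    x = math.cos(x)          # repeated application of the map
print(x, math.cos(x))        # both approximately 0.739..., so f(x*) is (numerically) x*
```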
  • 146. At this moment, the algorithm has reached a fixed point for some Θ_n, and that value Θ* maximizes l(Θ|Θ_n); basically, Θ_{n+1} = Θ_n.
  • 149. Then, if L and l are differentiable at Θ_n: since L and l are equal at Θ_n (and l ≤ L everywhere), Θ_n is a stationary point of L, i.e. the derivative of L vanishes at that point. (Figure: local maxima.)
  • 150. However, you could end up in a case with no local maximum, e.g. a saddle point. (Figure: saddle point.)
  • 151. For more on the subject, please take a look at Geoffrey McLachlan and Thriyambakam Krishnan, "The EM Algorithm and Extensions," John Wiley & Sons, New York, 1996.
  • 152. Finding Maximum Likelihood Mixture Densities Parameters via EM. Something notable: the mixture-density parameter estimation problem is probably one of the most widely used applications of the EM algorithm in the computational pattern recognition community. We have p(x|Θ) = ∑_{i=1}^{M} α_i p_i(x|θ_i) (28), where 1) Θ = (α_1, ..., α_M, θ_1, ..., θ_M), 2) ∑_{i=1}^{M} α_i = 1, and 3) each p_i is a density function parametrized by θ_i.
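A direct sketch of evaluating this mixture density with Gaussian components; the number of components and every parameter value are made up for the example.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

alphas = np.array([0.2, 0.5, 0.3])        # mixing parameters, nonnegative and summing to 1
mus    = np.array([-2.0, 0.0, 3.0])       # component parameters theta_i = (mu_i, sigma_i)
sigmas = np.array([0.5, 1.0, 0.8])

def mixture_pdf(x):
    # p(x | Theta) = sum_i alpha_i p_i(x | theta_i)
    return np.sum(alphas * normal_pdf(x, mus, sigmas))

print(mixture_pdf(0.1), alphas.sum())
```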
  • 156. A log-likelihood for this function. We have log L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = ∑_{i=1}^{N} log (∑_{j=1}^{M} α_j p_j(x_i|θ_j)) (29). Note: this is difficult to optimize because the log is applied to a sum. However, we can simplify it by assuming the following: 1) each sample has an unobserved label, Y = {y_i}_{i=1}^{N} with y_i ∈ {1, ..., M}; 2) y_i = k if the ith sample was generated by the kth mixture component.
  • 161. Now we have log L(Θ|X, Y) = log [P(X, Y|Θ)] (30). Remember that X = {x_1, x_2, ..., x_N} and Y = {y_1, y_2, ..., y_N}; assuming independence across samples, log [P(X, Y|Θ)] = log [P(x_1, y_1, ..., x_N, y_N|Θ)] = log ∏_{i=1}^{N} P(x_i, y_i|Θ) = ∑_{i=1}^{N} log P(x_i, y_i|Θ).
  • 166. Then, by the chain rule, ∑_{i=1}^{N} log P(x_i, y_i|Θ) = ∑_{i=1}^{N} log [P(x_i|y_i, θ_{y_i}) P(y_i|θ_{y_i})] (31). Question: do you need y_i if you know θ_{y_i}, or the other way around? Finally, ∑_{i=1}^{N} log [P(x_i|y_i, θ_{y_i}) P(y_i|θ_{y_i})] = ∑_{i=1}^{N} log [P(y_i) p_{y_i}(x_i|θ_{y_i})] (32). No: you do not need y_i if you know θ_{y_i}, or the other way around.
  • 168. Finally, setting α_{y_i} = P(y_i), we have log L(Θ|X, Y) = ∑_{i=1}^{N} log [α_{y_i} p_{y_i}(x_i|θ_{y_i})] (33).
  • 169. Problem: which labels? We do not know the values of Y. We can get around this by using the following idea: assume that Y is a random variable.
  • 172. Thus, you make a first guess for the parameters at the beginning of EM: Θ^g = (α^g_1, ..., α^g_M, θ^g_1, ..., θ^g_M) (34). Then it is possible to evaluate the parametric densities p_j(x_i|θ^g_j). Moreover, the mixing parameters α_j can be thought of as prior probabilities of each mixture component: α_j = p(component j) (35).
We want to calculate the following probability
We want to calculate $p(y_i|x_i,\Theta^{g})$.
Basically
We want a Bayesian formulation of this probability, assuming that $y=(y_1,y_2,\ldots,y_N)$ is an i.i.d. sample from a distribution.
Using Bayes' Rule
Compute
$$p(y_i|x_i,\Theta^{g})=\frac{p(y_i,x_i|\Theta^{g})}{p(x_i|\Theta^{g})}=\frac{p(x_i|y_i,\theta_{y_i}^{g})\,p(y_i)}{p(x_i|\Theta^{g})}=\frac{\alpha_{y_i}^{g}\,p_{y_i}(x_i|\theta_{y_i}^{g})}{p(x_i|\Theta^{g})}=\frac{\alpha_{y_i}^{g}\,p_{y_i}(x_i|\theta_{y_i}^{g})}{\sum_{k=1}^{M}\alpha_k^{g}\,p_k(x_i|\theta_k^{g})}$$
As in Naive Bayes
There is one such probability per mixture component and per sample:
$$p(y_i|x_i,\Theta^{g})=\frac{\alpha_{y_i}^{g}\,p_{y_i}(x_i|\theta_{y_i}^{g})}{\sum_{k=1}^{M}\alpha_k^{g}\,p_k(x_i|\theta_k^{g})}\quad\forall x_i,y_i\text{ and }k\in\{1,\ldots,M\}$$
This is updated at each iteration of the EM algorithm, after the initial guess and until convergence!!!
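A minimal Python sketch (not from the original slides; the function name `responsibilities` and the one-dimensional Gaussian components are illustrative stand-ins for the generic densities $p_l(x|\theta_l^{g})$) of how these posteriors are computed for all samples and components at once:

```python
# A minimal sketch: p(l | x_i, Theta^g) for every sample i and component l.
# `component_pdfs` stands in for the parametric densities p_l(x | theta_l^g).
import numpy as np
from scipy.stats import norm

def responsibilities(X, alphas, component_pdfs):
    # numerator: alpha_l^g * p_l(x_i | theta_l^g) for each i, l
    num = np.column_stack([a * pdf(X) for a, pdf in zip(alphas, component_pdfs)])
    # denominator: sum over k of alpha_k^g * p_k(x_i | theta_k^g)
    return num / num.sum(axis=1, keepdims=True)

# Illustrative usage with two 1-D Gaussian components (hypothetical parameters).
X = np.array([0.1, 2.5, -1.0])
pdfs = [lambda x: norm.pdf(x, loc=0.0, scale=1.0),
        lambda x: norm.pdf(x, loc=3.0, scale=1.0)]
print(responsibilities(X, alphas=[0.6, 0.4], component_pdfs=pdfs))
```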
Additionally
We assume again that the $y_i$ are i.i.d. samples, so
$$p(y|X,\Theta^{g})=\prod_{i=1}^{N}p(y_i|x_i,\Theta^{g})\quad(36)$$
where $y=(y_1,y_2,\ldots,y_N)$.
Now, using equation 17
Then
$$Q(\Theta|\Theta^{g})=\sum_{y\in Y}\log\left(L(\Theta|X,y)\right)p(y|X,\Theta^{g})=\sum_{y\in Y}\sum_{i=1}^{N}\log\left[\alpha_{y_i}p_{y_i}(x_i|\theta_{y_i})\right]\prod_{j=1}^{N}p(y_j|x_j,\Theta^{g})$$
Here, a small stop
What is the meaning of $\sum_{y\in Y}$?
It is actually a summation over all possible states of the random vector $y$. Then, we can rewrite the previous summation as
$$\sum_{y\in Y}=\underbrace{\sum_{y_1=1}^{M}\sum_{y_2=1}^{M}\cdots\sum_{y_N=1}^{M}}_{N\text{ sums}}$$
running over all the samples $\{x_1,x_2,\ldots,x_N\}$.
Then
We have
$$Q(\Theta|\Theta^{g})=\sum_{y_1=1}^{M}\sum_{y_2=1}^{M}\cdots\sum_{y_N=1}^{M}\sum_{i=1}^{N}\left[\log\left[\alpha_{y_i}p_{y_i}(x_i|\theta_{y_i})\right]\prod_{j=1}^{N}p(y_j|x_j,\Theta^{g})\right]$$
We introduce the following
We have the following indicator function
$$\delta_{l,y_i}=\begin{cases}1 & \text{if }l=y_i\\ 0 & \text{if }l\neq y_i\end{cases}$$
Therefore, we can write
$$\alpha_{y_i}=\sum_{l=1}^{M}\delta_{l,y_i}\,\alpha_l$$
Then
$$\log\left[\alpha_{y_i}p_{y_i}(x_i|\theta_{y_i})\right]\prod_{j=1}^{N}p(y_j|x_j,\Theta^{g})=\sum_{l=1}^{M}\delta_{l,y_i}\log\left[\alpha_l p_l(x_i|\theta_l)\right]\prod_{j=1}^{N}p(y_j|x_j,\Theta^{g})$$
Thus
We have that
$$\sum_{y_1=1}^{M}\cdots\sum_{y_N=1}^{M}\sum_{i=1}^{N}\log\left[\alpha_{y_i}p_{y_i}(x_i|\theta_{y_i})\right]\prod_{j=1}^{N}p(y_j|x_j,\Theta^{g})$$
$$=\sum_{y_1=1}^{M}\sum_{y_2=1}^{M}\cdots\sum_{y_N=1}^{M}\sum_{i=1}^{N}\sum_{l=1}^{M}\delta_{l,y_i}\log\left[\alpha_l p_l(x_i|\theta_l)\right]\prod_{j=1}^{N}p(y_j|x_j,\Theta^{g})$$
$$=\sum_{i=1}^{N}\sum_{l=1}^{M}\log\left[\alpha_l p_l(x_i|\theta_l)\right]\sum_{y_1=1}^{M}\sum_{y_2=1}^{M}\cdots\sum_{y_N=1}^{M}\delta_{l,y_i}\prod_{j=1}^{N}p(y_j|x_j,\Theta^{g})$$
because the sums $\sum_{y_1=1}^{M}\sum_{y_2=1}^{M}\cdots\sum_{y_N=1}^{M}$ apply only to $\delta_{l,y_i}\prod_{j=1}^{N}p(y_j|x_j,\Theta^{g})$.
Then, we have that
First notice the following:
$$\sum_{y_1=1}^{M}\sum_{y_2=1}^{M}\cdots\sum_{y_N=1}^{M}\delta_{l,y_i}\prod_{j=1}^{N}p(y_j|x_j,\Theta^{g})=\sum_{y_1=1}^{M}\cdots\sum_{y_{i-1}=1}^{M}\sum_{y_{i+1}=1}^{M}\cdots\sum_{y_N=1}^{M}\left(\sum_{y_i=1}^{M}\delta_{l,y_i}\,p(y_i|x_i,\Theta^{g})\right)\prod_{j=1,\,j\neq i}^{N}p(y_j|x_j,\Theta^{g})$$
Then, we have
$$\sum_{y_i=1}^{M}\delta_{l,y_i}\,p(y_i|x_i,\Theta^{g})=p(l|x_i,\Theta^{g})$$
In this way
Plugging back the previous equation:
$$\sum_{y_1=1}^{M}\sum_{y_2=1}^{M}\cdots\sum_{y_N=1}^{M}\delta_{l,y_i}\prod_{j=1}^{N}p(y_j|x_j,\Theta^{g})=\sum_{y_1=1}^{M}\cdots\sum_{y_{i-1}=1}^{M}\sum_{y_{i+1}=1}^{M}\cdots\sum_{y_N=1}^{M}p(l|x_i,\Theta^{g})\prod_{j=1,\,j\neq i}^{N}p(y_j|x_j,\Theta^{g})$$
$$=\left[\sum_{y_1=1}^{M}\cdots\sum_{y_{i-1}=1}^{M}\sum_{y_{i+1}=1}^{M}\cdots\sum_{y_N=1}^{M}\prod_{j=1,\,j\neq i}^{N}p(y_j|x_j,\Theta^{g})\right]p(l|x_i,\Theta^{g})$$
Now, what about...?
The left part of the equation:
$$\sum_{y_1=1}^{M}\cdots\sum_{y_{i-1}=1}^{M}\sum_{y_{i+1}=1}^{M}\cdots\sum_{y_N=1}^{M}\prod_{j=1,\,j\neq i}^{N}p(y_j|x_j,\Theta^{g})$$
$$=\left(\sum_{y_1=1}^{M}p(y_1|x_1,\Theta^{g})\right)\cdots\left(\sum_{y_{i-1}=1}^{M}p(y_{i-1}|x_{i-1},\Theta^{g})\right)\times\left(\sum_{y_{i+1}=1}^{M}p(y_{i+1}|x_{i+1},\Theta^{g})\right)\cdots\left(\sum_{y_N=1}^{M}p(y_N|x_N,\Theta^{g})\right)$$
$$=\prod_{j=1,\,j\neq i}^{N}\left(\sum_{y_j=1}^{M}p(y_j|x_j,\Theta^{g})\right)$$
Then, we have that
Plugging back into the original equation:
$$\left[\sum_{y_1=1}^{M}\cdots\sum_{y_{i-1}=1}^{M}\sum_{y_{i+1}=1}^{M}\cdots\sum_{y_N=1}^{M}\prod_{j=1,\,j\neq i}^{N}p(y_j|x_j,\Theta^{g})\right]p(l|x_i,\Theta^{g})=\left[\prod_{j=1,\,j\neq i}^{N}\left(\sum_{y_j=1}^{M}p(y_j|x_j,\Theta^{g})\right)\right]p(l|x_i,\Theta^{g})$$
We can use properties of probability
We know that
$$\sum_{y_i=1}^{M}p(y_i|x_i,\Theta^{g})=1\quad(37)$$
Then
$$\left[\prod_{j=1,\,j\neq i}^{N}\left(\sum_{y_j=1}^{M}p(y_j|x_j,\Theta^{g})\right)\right]p(l|x_i,\Theta^{g})=\left[\prod_{j=1,\,j\neq i}^{N}1\right]p(l|x_i,\Theta^{g})=p(l|x_i,\Theta^{g})=\frac{\alpha_l^{g}\,p_l(x_i|\theta_l^{g})}{\sum_{k=1}^{M}\alpha_k^{g}\,p_k(x_i|\theta_k^{g})}$$
Thus
We can write Q in the following way:
$$Q(\Theta,\Theta^{g})=\sum_{i=1}^{N}\sum_{l=1}^{M}\log\left[\alpha_l p_l(x_i|\theta_l)\right]p(l|x_i,\Theta^{g})=\sum_{i=1}^{N}\sum_{l=1}^{M}\log(\alpha_l)\,p(l|x_i,\Theta^{g})+\sum_{i=1}^{N}\sum_{l=1}^{M}\log\left(p_l(x_i|\theta_l)\right)p(l|x_i,\Theta^{g})\quad(38)$$
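A quick numerical check (illustrative only, not from the original slides; the random arrays merely stand in for $\log[\alpha_l p_l(x_i|\theta_l)]$ and $p(l|x_i,\Theta^{g})$) that this collapsed form of Q agrees with the brute-force sum over every label vector $y$ for small $N$ and $M$:

```python
# Brute-force Q over all label vectors y versus the collapsed double sum.
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 2
log_comp = rng.normal(size=(N, M))              # stands in for log[alpha_l p_l(x_i | theta_l)]
post = rng.random((N, M))
post /= post.sum(axis=1, keepdims=True)         # p(l | x_i, Theta^g), rows sum to 1

brute = sum(
    sum(log_comp[i, y[i]] for i in range(N)) * np.prod([post[j, y[j]] for j in range(N)])
    for y in itertools.product(range(M), repeat=N)
)
collapsed = (log_comp * post).sum()
print(np.isclose(brute, collapsed))             # True
```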
A Method
That could be used as a general framework to solve problems set up as EM problems.
First, we will look at the Lagrange multipliers setup; then, we will look at a specific case using a mixture of Gaussians.
Note
Not every mixture of distributions will give you an analytical solution.
Lagrange Multipliers
The method of Lagrange multipliers gives a set of necessary conditions to identify optimal points of equality-constrained optimization problems.
This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers.
Lagrange Multipliers
The classical problem formulation:
$$\min f(x_1,x_2,\ldots,x_n)\quad\text{s.t.}\quad h_1(x_1,x_2,\ldots,x_n)=0$$
It can be converted into
$$\min L(x_1,x_2,\ldots,x_n,\lambda)=\min\left\{f(x_1,x_2,\ldots,x_n)-\lambda h_1(x_1,x_2,\ldots,x_n)\right\}\quad(39)$$
where $L(x,\lambda)$ is the Lagrangian function and $\lambda$ is an unspecified positive or negative constant called the Lagrange multiplier.
Lagrange Multipliers are an application of Physics
Think about this: remember Newton's first law. A system in equilibrium does not move (a static body).
Lagrange Multipliers
Definition: gives a set of necessary conditions to identify optimal points of an equality-constrained optimization problem.
Lagrange was a Physicist
He was thinking of the following formula: a system in equilibrium satisfies
$$F_1+F_2+\cdots+F_K=0\quad(40)$$
But functions do not have forces? Are you sure? Think about the following: the gradient of a surface.
Gradient of a Surface
After all, a gradient is a measure of maximal change. For example, the gradient of a function of three variables:
$$\nabla f(x)=\mathbf{i}\,\frac{\partial f(x)}{\partial x}+\mathbf{j}\,\frac{\partial f(x)}{\partial y}+\mathbf{k}\,\frac{\partial f(x)}{\partial z}\quad(41)$$
where $\mathbf{i}$, $\mathbf{j}$ and $\mathbf{k}$ are unit vectors in the directions $x$, $y$ and $z$.
Example
We have $f(x,y)=x\exp\{-x^2-y^2\}$ (surface plot).
Example
The gradient shown at the contours when projected onto the 2D plane (figure).
Now, Think about This
Yes, we can use the gradient. However, we need to scale the forces by using parameters $\lambda$. Thus, we have
$$F_0+\lambda_1F_1+\cdots+\lambda_KF_K=0\quad(42)$$
where $F_0$ is the gradient of the principal cost function and $F_i$, for $i=1,2,\ldots,K$, are the gradients of the constraints.
Thus
If we have the following optimization:
$$\min f(x)\quad\text{s.t.}\quad g_1(x)=0,\;g_2(x)=0$$
Geometric interpretation in the case of minimization
What is wrong? The gradients point in the opposite direction; we can fix this by simply multiplying by $-1$. Here the cost function is $f(x,y)=x\exp\{-x^2-y^2\}$, which we want to minimize subject to $g_1(\vec{x})$ and $g_2(\vec{x})$ (figure):
$$-\nabla f(\vec{x})+\lambda_1\nabla g_1(\vec{x})+\lambda_2\nabla g_2(\vec{x})=0$$
Nevertheless, it is equivalent to
$$\nabla f(\vec{x})-\lambda_1\nabla g_1(\vec{x})-\lambda_2\nabla g_2(\vec{x})=0$$
Method
Steps:
1 The original problem is rewritten as: minimize $L(x,\lambda)=f(x)-\lambda h_1(x)$.
2 Take the derivatives of $L(x,\lambda)$ with respect to $x_i$ and set them equal to zero.
3 Express all $x_i$ in terms of the Lagrange multiplier $\lambda$.
4 Plug $x$ in terms of $\lambda$ into the constraint $h_1(x)=0$ and solve for $\lambda$.
5 Calculate $x$ by using the value just found for $\lambda$.
From step 2
If there are $n$ variables (i.e., $x_1,\ldots,x_n$) then you will get $n$ equations with $n+1$ unknowns (i.e., $n$ variables $x_i$ and one Lagrange multiplier $\lambda$).
Example
We can apply this to the following problem (worked in the sketch below):
$$\min f(x,y)=x^2-8x+y^2-12y+48\quad\text{s.t.}\quad x+y=8$$
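A worked sketch of the five steps above on this example, assuming SymPy is available (variable names are illustrative); it recovers $x=3$, $y=5$, $\lambda=-2$:

```python
# A minimal sketch of the Lagrange-multiplier method on the example above.
import sympy as sp

x, y, lam = sp.symbols('x y lambda')
f = x**2 - 8*x + y**2 - 12*y + 48      # objective
h = x + y - 8                          # constraint h(x, y) = 0
L = f - lam * h                        # Lagrangian (step 1)

# Steps 2-4: set the partial derivatives to zero and solve together with h = 0.
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), h], [x, y, lam], dict=True)[0]
print(sol)                             # {x: 3, y: 5, lambda: -2}
print(f.subs(sol))                     # minimum value -2
```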
Lagrange Multipliers for Q
We can use the following constraint:
$$\sum_l\alpha_l=1\quad(43)$$
We have the following cost function:
$$Q(\Theta,\Theta^{g})+\lambda\left(\sum_l\alpha_l-1\right)\quad(44)$$
Deriving with respect to $\alpha_l$:
$$\frac{\partial}{\partial\alpha_l}\left[Q(\Theta,\Theta^{g})+\lambda\left(\sum_l\alpha_l-1\right)\right]=0\quad(45)$$
Thus
The Q function:
$$Q(\Theta,\Theta^{g})=\sum_{i=1}^{N}\sum_{l=1}^{M}\log(\alpha_l)\,p(l|x_i,\Theta^{g})+\sum_{i=1}^{N}\sum_{l=1}^{M}\log\left(p_l(x_i|\theta_l)\right)p(l|x_i,\Theta^{g})$$
Deriving
We have
$$\frac{\partial}{\partial\alpha_l}\left[Q(\Theta,\Theta^{g})+\lambda\left(\sum_l\alpha_l-1\right)\right]=\sum_{i=1}^{N}\frac{1}{\alpha_l}\,p(l|x_i,\Theta^{g})+\lambda$$
Finally
Setting the previous equation equal to 0:
$$\sum_{i=1}^{N}\frac{1}{\alpha_l}\,p(l|x_i,\Theta^{g})+\lambda=0\quad(46)$$
Thus
$$\sum_{i=1}^{N}p(l|x_i,\Theta^{g})=-\lambda\alpha_l\quad(47)$$
Summing over $l$, we get
$$\lambda=-N\quad(48)$$
Lagrange Multipliers
Thus
$$\alpha_l=\frac{1}{N}\sum_{i=1}^{N}p(l|x_i,\Theta^{g})\quad(49)$$
About $\theta_l$: it is possible to get analytical expressions for $\theta_l$ as functions of everything else. This is for you to try!!!
For more, please look at "Geometric Idea of Lagrange Multipliers" by John Wyatt.
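A tiny numeric check (illustrative, not from the original slides; the responsibilities are random stand-ins for $p(l|x_i,\Theta^{g})$) that the $\alpha_l$ produced by equation (49) form valid mixing proportions, i.e. they sum to one:

```python
# Equation (49): alpha_l = (1/N) * sum_i p(l | x_i, Theta^g).
import numpy as np

rng = np.random.default_rng(1)
post = rng.random((100, 4))
post /= post.sum(axis=1, keepdims=True)   # each row of p(l | x_i, Theta^g) sums to 1
alphas = post.mean(axis=0)                # the update in equation (49)
print(alphas, alphas.sum())               # the alphas sum to 1.0
```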
Remember?
The Gaussian distribution:
$$p_l(x|\mu_l,\Sigma_l)=\frac{1}{(2\pi)^{d/2}\left|\Sigma_l\right|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu_l)^{T}\Sigma_l^{-1}(x-\mu_l)\right\}\quad(50)$$
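A small sketch (assuming NumPy and SciPy are available; `gaussian_pdf` is an illustrative name) that evaluates equation (50) directly and checks it against `scipy.stats.multivariate_normal`:

```python
# Direct evaluation of the multivariate Gaussian density in equation (50).
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

x = np.array([0.5, -1.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(np.isclose(gaussian_pdf(x, mu, Sigma),
                 multivariate_normal.pdf(x, mean=mu, cov=Sigma)))   # True
```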
How to use this for Gaussian Distributions
For this, we need to refresh some linear algebra:
1 $\operatorname{tr}(A+B)=\operatorname{tr}(A)+\operatorname{tr}(B)$
2 $\operatorname{tr}(AB)=\operatorname{tr}(BA)$
3 $\sum_i x_i^{T}Ax_i=\operatorname{tr}(AB)$ where $B=\sum_i x_ix_i^{T}$.
4 $\left|A^{-1}\right|=\frac{1}{|A|}$
Now, we need the derivative of a matrix function $f(A)$: thus, $\frac{\partial f(A)}{\partial A}$ is the matrix whose $(i,j)$th entry is $\frac{\partial f(A)}{\partial a_{i,j}}$, where $a_{i,j}$ is the $(i,j)$th entry of $A$.
In addition
If $A$ is symmetric:
$$\frac{\partial|A|}{\partial A}=\begin{cases}A_{i,j} & \text{if }i=j\\ 2A_{i,j} & \text{if }i\neq j\end{cases}\quad(51)$$
where $A_{i,j}$ is the $(i,j)$th cofactor of $A$. Note: the cofactor is the determinant obtained by deleting the row and column of a given element of a matrix or determinant, preceded by a + or − sign depending on whether the element is in a + or − position.
Thus
$$\frac{\partial\log|A|}{\partial A}=\begin{cases}\frac{A_{i,j}}{|A|} & \text{if }i=j\\ 2\frac{A_{i,j}}{|A|} & \text{if }i\neq j\end{cases}=2A^{-1}-\operatorname{diag}\left(A^{-1}\right)\quad(52)$$
Finally
The last equation we need:
$$\frac{\partial\operatorname{tr}(AB)}{\partial A}=B+B^{T}-\operatorname{diag}(B)\quad(53)$$
In addition, for symmetric $A$,
$$\frac{\partial x^{T}Ax}{\partial x}=2Ax\quad(54)$$
Thus, using the last part of equation 38
We get, after ignoring constant terms (remember, they disappear after taking derivatives):
$$\sum_{i=1}^{N}\sum_{l=1}^{M}\log\left(p_l(x_i|\mu_l,\Sigma_l)\right)p(l|x_i,\Theta^{g})=\sum_{i=1}^{N}\sum_{l=1}^{M}\left[-\frac{1}{2}\log\left(|\Sigma_l|\right)-\frac{1}{2}(x_i-\mu_l)^{T}\Sigma_l^{-1}(x_i-\mu_l)\right]p(l|x_i,\Theta^{g})\quad(55)$$
Finally
Thus, when taking the derivative with respect to $\mu_l$:
$$\sum_{i=1}^{N}\Sigma_l^{-1}(x_i-\mu_l)\,p(l|x_i,\Theta^{g})=0\quad(56)$$
Then
$$\mu_l=\frac{\sum_{i=1}^{N}x_i\,p(l|x_i,\Theta^{g})}{\sum_{i=1}^{N}p(l|x_i,\Theta^{g})}\quad(57)$$
Now, if we derive with respect to Σ_l
First, we rewrite equation 55:
$$\sum_{i=1}^{N}\sum_{l=1}^{M}\left[-\frac{1}{2}\log\left(|\Sigma_l|\right)-\frac{1}{2}(x_i-\mu_l)^{T}\Sigma_l^{-1}(x_i-\mu_l)\right]p(l|x_i,\Theta^{g})$$
$$=\sum_{l=1}^{M}\left[-\frac{1}{2}\log\left(|\Sigma_l|\right)\sum_{i=1}^{N}p(l|x_i,\Theta^{g})-\frac{1}{2}\sum_{i=1}^{N}p(l|x_i,\Theta^{g})\operatorname{tr}\left(\Sigma_l^{-1}(x_i-\mu_l)(x_i-\mu_l)^{T}\right)\right]$$
$$=\sum_{l=1}^{M}\left[-\frac{1}{2}\log\left(|\Sigma_l|\right)\sum_{i=1}^{N}p(l|x_i,\Theta^{g})-\frac{1}{2}\sum_{i=1}^{N}p(l|x_i,\Theta^{g})\operatorname{tr}\left(\Sigma_l^{-1}N_{l,i}\right)\right]$$
where $N_{l,i}=(x_i-\mu_l)(x_i-\mu_l)^{T}$.
Deriving with respect to Σ_l^{-1}
We have that
$$\frac{\partial}{\partial\Sigma_l^{-1}}\sum_{l=1}^{M}\left[-\frac{1}{2}\log\left(|\Sigma_l|\right)\sum_{i=1}^{N}p(l|x_i,\Theta^{g})-\frac{1}{2}\sum_{i=1}^{N}p(l|x_i,\Theta^{g})\operatorname{tr}\left(\Sigma_l^{-1}N_{l,i}\right)\right]$$
$$=\frac{1}{2}\sum_{i=1}^{N}p(l|x_i,\Theta^{g})\left(2\Sigma_l-\operatorname{diag}(\Sigma_l)\right)-\frac{1}{2}\sum_{i=1}^{N}p(l|x_i,\Theta^{g})\left(2N_{l,i}-\operatorname{diag}(N_{l,i})\right)$$
$$=\frac{1}{2}\sum_{i=1}^{N}p(l|x_i,\Theta^{g})\left(2M_{l,i}-\operatorname{diag}(M_{l,i})\right)=2S-\operatorname{diag}(S)$$
where $M_{l,i}=\Sigma_l-N_{l,i}$ and $S=\frac{1}{2}\sum_{i=1}^{N}p(l|x_i,\Theta^{g})\,M_{l,i}$.
Thus, we have
If $2S-\operatorname{diag}(S)=0\implies S=0$, implying
$$\frac{1}{2}\sum_{i=1}^{N}p(l|x_i,\Theta^{g})\left[\Sigma_l-N_{l,i}\right]=0\quad(58)$$
or
$$\Sigma_l=\frac{\sum_{i=1}^{N}p(l|x_i,\Theta^{g})\,N_{l,i}}{\sum_{i=1}^{N}p(l|x_i,\Theta^{g})}=\frac{\sum_{i=1}^{N}p(l|x_i,\Theta^{g})(x_i-\mu_l)(x_i-\mu_l)^{T}}{\sum_{i=1}^{N}p(l|x_i,\Theta^{g})}\quad(59)$$
Thus, we have the iterative updates
They are:
$$\alpha_l^{New}=\frac{1}{N}\sum_{i=1}^{N}p(l|x_i,\Theta^{g})$$
$$\mu_l^{New}=\frac{\sum_{i=1}^{N}x_i\,p(l|x_i,\Theta^{g})}{\sum_{i=1}^{N}p(l|x_i,\Theta^{g})}$$
$$\Sigma_l^{New}=\frac{\sum_{i=1}^{N}p(l|x_i,\Theta^{g})(x_i-\mu_l)(x_i-\mu_l)^{T}}{\sum_{i=1}^{N}p(l|x_i,\Theta^{g})}$$
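A minimal NumPy/SciPy sketch of one EM iteration built from these update equations (this is an illustration, not the lecture's reference implementation; `e_step` and `m_step` are made-up names):

```python
# One EM iteration for a Gaussian mixture, following the updates above.
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alphas, mus, Sigmas):
    """Compute p(l | x_i, Theta^g) for every sample i and component l."""
    N, M = X.shape[0], len(alphas)
    resp = np.zeros((N, M))
    for l in range(M):
        resp[:, l] = alphas[l] * multivariate_normal.pdf(X, mean=mus[l], cov=Sigmas[l])
    resp /= resp.sum(axis=1, keepdims=True)       # normalize over components
    return resp

def m_step(X, resp):
    """Re-estimate alpha_l, mu_l, Sigma_l from the responsibilities."""
    N, d = X.shape
    Nl = resp.sum(axis=0)                         # effective count per component
    alphas = Nl / N                               # alpha_l^New
    mus = (resp.T @ X) / Nl[:, None]              # mu_l^New
    Sigmas = []
    for l in range(resp.shape[1]):                # Sigma_l^New
        diff = X - mus[l]
        Sigmas.append((resp[:, l, None] * diff).T @ diff / Nl[l])
    return alphas, mus, Sigmas
```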
EM Algorithm for Gaussian Mixtures
Step 1 — Initialize:
the means $\mu_l$
the covariances $\Sigma_l$
the mixing coefficients $\alpha_l$
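A usage sketch (illustrative data, component count, and stopping rule) that initializes as in Step 1 and then iterates the `e_step`/`m_step` functions from the sketch above:

```python
# Initialize and iterate EM on synthetic 2-D data (illustrative only).
import numpy as np

X = np.vstack([np.random.default_rng(0).normal(0, 1, (100, 2)),
               np.random.default_rng(1).normal(4, 1, (100, 2))])
M, d = 2, X.shape[1]
alphas = np.full(M, 1.0 / M)                                          # mixing coefficients
mus = X[np.random.default_rng(2).choice(len(X), M, replace=False)]    # means from random samples
Sigmas = [np.eye(d) for _ in range(M)]                                # covariances

for _ in range(50):                      # or: iterate until the change falls below a tolerance
    resp = e_step(X, alphas, mus, Sigmas)
    alphas, mus, Sigmas = m_step(X, resp)
```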