A Tutorial of the EM-algorithm and Its Application to Outlier Detection
1. A Tutorial of the EM-algorithm and its Application to Outlier Detection
Jaehyeong Ahn
Konkuk University
jayahn0104@gmail.com
September 9, 2020
2. Table of Contents
1 EM-algorithm: An Overview
2 Proof for EM-algorithm
Non-decreasing (Ascent Property)
Convergence
Local Maximum
3 An Example: Gaussian Mixture Model (GMM)
4 Application to Outlier Detection
Directly Used
Indirectly Used
5 Summary
6 References
4. EM-algorithm: An Overview
Introduction
The EM-algorithm (Expectation-Maximization algorithm) is an iterative procedure for computing the maximum likelihood estimator (MLE) when only a subset of the data is available, i.e., when the model depends on unobserved latent variables
The first proper theoretical study of the algorithm was done by
Dempster, Laird, and Rubin (1977) [1]
The EM-algorithm is widely used in various research areas when
unobserved latent variables are included in the model
5. EM-algorithm: An Overview
Data
Y = (y_1, · · · , y_N)^T : Observed data
Model
Assume that Y depends on some unobserved latent variable Z, where Z = (z_1, · · · , z_N)^T
When Z is assumed to be a discrete random variable:
r_ik = P(z_i = k | Y, θ), k = 1, · · · , K
z_i^* = argmax_k r_ik
When Z is assumed to be a continuous random variable:
r_i = f_{Z|Y,Θ}(z_i | Y, θ)
Log-likelihood
ℓ_obs(θ; Y) = log f_{Y|Θ}(Y | θ) = log ∫ f_{Y,Z|Θ}(Y, Z | θ) dz
6. EM-algorithm: An Overview
Goal
By maximizing ℓ_obs(θ; Y) w.r.t. θ:
Find θ̂ which satisfies ∂_{θ_j} ℓ_obs(θ; Y)|_{θ=θ̂} = 0, for j = 1, · · · , J
Compute the estimated value r̂_ik = f_{Z|Y,Θ}(z_i | Y, θ̂)
Problem
The latent variable Z is not observable
It is difficult to compute the integral in ℓ_obs(θ; Y)
Thus the parameters cannot be estimated separately
Solution
Assume the latent variable Z is observed
Define the complete data
Maximize the complete log-likelihood
7. EM-algorithm: An Overview
Data
Y = (y_1, · · · , y_N)^T : Observed data
Z = (z_1, · · · , z_N)^T : Unobserved (latent) variable
It is assumed to be observed
X = (Y, Z): Complete data
Model
r_i = f_{Z|Y,Θ}(z_i | Y, θ)
Complete log-likelihood
ℓ_C(θ; X) = log f_{X|Θ}(X | θ) = log f_{X|Θ}(Y, Z | θ)
Log-likelihood for observed data
ℓ_obs(θ; Y) = log f_{Y|Θ}(Y | θ) = log ∫ f_{X|Θ}(Y, Z | θ) dz
8. EM-algorithm: An Overview
Estimation Idea
Maximize ℓ_C(θ; X) instead of maximizing ℓ_obs(θ; Y)
Since the parameters in ℓ_C(θ; X) can be decoupled
E-step
Compute the Expected Complete Log-Likelihood (ECLL) by taking the conditional expectation of ℓ_C(θ; X) given Y and the current value of the parameters θ^(s)
This step estimates the realizations of z (since the value of z is not observed)
M-step
Maximize the computed ECLL w.r.t. θ
Update the estimate of θ to θ^(s+1), which maximizes the current ECLL
Iterate this procedure until it converges
9. EM-algorithm: An Overview
Estimation
E-step
Take the conditional expectation of ℓ_C(θ; X) given Y and θ = θ^(s)
For θ^(0), set an initial guess
Compute Q(θ | θ^(s)) = E_{θ^(s)}[ℓ_C(θ; X) | Y]
M-step
Maximize Q(θ | θ^(s)) w.r.t. θ
Put θ^(s+1) = argmax_θ Q(θ | θ^(s))
Iterate until the following inequality is satisfied:
||θ^(s+1) − θ^(s)|| < ε, where ε denotes a sufficiently small value
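As an aside not on the original slides, this loop structure can be written as a small Python sketch. `compute_expectations` and `maximize` are hypothetical placeholders for the model-specific E- and M-steps (they are not defined here), and θ is assumed to be packed into a flat NumPy array so the stopping rule can be checked with a norm.

```python
# A minimal sketch of the generic E-step / M-step iteration described above.
import numpy as np

def em(Y, theta0, compute_expectations, maximize, eps=1e-6, max_iter=1000):
    theta = np.asarray(theta0, dtype=float)
    for s in range(max_iter):
        expectations = compute_expectations(theta, Y)       # E-step: quantities needed to form Q(theta | theta^(s))
        theta_new = np.asarray(maximize(expectations, Y))   # M-step: argmax_theta of the current ECLL
        if np.linalg.norm(theta_new - theta) < eps:         # ||theta^(s+1) - theta^(s)|| < eps
            return theta_new
        theta = theta_new
    return theta
```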
10. EM-algorithm: An Overview
Immediate Question
Does maximizing the sequence Q(θ | θ^(s)) lead to maximizing ℓ_obs(θ; Y)?
This question will be answered in the following slides (Proof for EM-algorithm) in 3 parts:
Non-decreasing / Convergence / Local maximum
11. Proof for EM-algorithm
Proof for EM-algorithm
12. Proof for EM-algorithm: Non-decreasing
Non-decreasing (Ascent Property)
Proposition 1.
The sequence ℓ_obs(θ^(s); Y) in the EM-algorithm is non-decreasing
Proof
We write X = (Y, Z) for the complete data
Then
f_{Z|Y,Θ}(Z | Y, θ) = f_{X|Θ}(Y, Z | θ) / f_{Y|Θ}(Y | θ)
Hence,
ℓ_obs(θ; Y) = log f_{Y|Θ}(Y | θ) = log f_{X|Θ}((Y, Z) | θ) − log f_{Z|Y,Θ}(Z | Y, θ)
13. Proof for EM-algorithm: Non-decreasing
Taking conditional expectation given Y and Θ = θ^(s) on both sides yields
ℓ_obs(θ; Y) = E_{θ^(s)}[ℓ_obs(θ; Y) | Y]
= E_{θ^(s)}[log f_{X|Θ}((Y, Z) | θ) | Y] − E_{θ^(s)}[log f_{Z|Y,Θ}(Z | Y, θ) | Y]
= Q(θ | θ^(s)) − H(θ | θ^(s))
where
Q(θ | θ^(s)) = E_{θ^(s)}[log f_{X|Θ}((Y, Z) | θ) | Y]
H(θ | θ^(s)) = E_{θ^(s)}[log f_{Z|Y,Θ}(Z | Y, θ) | Y]
Then we have
ℓ_obs(θ^(s+1); Y) − ℓ_obs(θ^(s); Y) = [Q(θ^(s+1) | θ^(s)) − Q(θ^(s) | θ^(s))] − [H(θ^(s+1) | θ^(s)) − H(θ^(s) | θ^(s))]
14. Proof for EM-algorithm: Non-decreasing
Recall that
ℓ_obs(θ^(s+1); Y) − ℓ_obs(θ^(s); Y) = [Q(θ^(s+1) | θ^(s)) − Q(θ^(s) | θ^(s))]   ... (I)
                                      − [H(θ^(s+1) | θ^(s)) − H(θ^(s) | θ^(s))]   ... (II)
(I) is non-negative:
θ^(s+1) = argmax_θ Q(θ | θ^(s))
Hence Q(θ^(s+1) | θ^(s)) ≥ Q(θ^(s) | θ^(s))
Thus (I) ≥ 0
15. Proof for EM-algorithm: Non-decreasing
(II) is non-positive:
Using Jensen's inequality for concave functions (log is concave)
Theorem 1. (Jensen's inequality) Let f be a concave function, and let X be a random variable. Then E[f(X)] ≤ f(E[X])
H(θ^(s+1) | θ^(s)) − H(θ^(s) | θ^(s)) = E_{θ^(s)}[ log ( f_{Z|Y,Θ}(Z | Y, θ^(s+1)) / f_{Z|Y,Θ}(Z | Y, θ^(s)) ) | Y ]
≤ log E_{θ^(s)}[ f_{Z|Y,Θ}(Z | Y, θ^(s+1)) / f_{Z|Y,Θ}(Z | Y, θ^(s)) | Y ]
= log ∫ ( f_{Z|Y,Θ}(z | Y, θ^(s+1)) / f_{Z|Y,Θ}(z | Y, θ^(s)) ) f_{Z|Y,Θ}(z | Y, θ^(s)) dz
= log 1 = 0
Hence H(θ^(s+1) | θ^(s)) ≤ H(θ^(s) | θ^(s))
16. Proof for EM-algorithm: Non-decreasing
Theorem 1. (Jensen's inequality) Let f be a concave function, and let X be a random variable. Then E[f(X)] ≤ f(E[X])
Figure: Jensen’s inequality for concave function[3]
17. Proof for EM-algorithm: Non-decreasing
Recall that
ℓ_obs(θ^(s+1); Y) − ℓ_obs(θ^(s); Y) = [Q(θ^(s+1) | θ^(s)) − Q(θ^(s) | θ^(s))]   ... (I)
                                      − [H(θ^(s+1) | θ^(s)) − H(θ^(s) | θ^(s))]   ... (II)
We've proven that (I) ≥ 0 and (II) ≤ 0
This shows ℓ_obs(θ^(s+1); Y) − ℓ_obs(θ^(s); Y) ≥ 0
Thus the sequence ℓ_obs(θ^(s); Y) in the EM-algorithm is non-decreasing
18. Proof for EM-algorithm: Convergence
Convergence
We will show that the sequence θ^(s) converges to some θ^* with ℓ(θ^*; y) = ℓ^*, the limit of ℓ(θ^(s))
Assumption
Ω is a subset of R^k
Ω_{θ_0} = {θ ∈ Ω : ℓ(θ; y) ≥ ℓ(θ_0; y)} is compact for any ℓ(θ_0; y) > −∞
ℓ(θ; y) is continuous and differentiable in the interior of Ω
19. Proof for EM-algorithm: Convergence
Theorem 2
Suppose that Q(θ | φ) is continuous in both θ and φ. Then all limit points of any instance {θ^(s)} of the EM algorithm are stationary points, i.e. θ^* = argmax_θ Q(θ | θ^*), and ℓ(θ^(s); y) converges monotonically to some value ℓ^* = ℓ(θ^*; y) for some stationary point θ^*
Theorem 3
Assume the hypothesis of Theorem 2. Suppose in addition that ∂_θ Q(θ | φ) is continuous in θ and φ. Then θ^(s) converges to a stationary point θ^* with ℓ(θ^*; y) = ℓ^*, the limit of ℓ(θ^(s)), if either
{θ : ℓ(θ; y) = ℓ^*} = {θ^*}, or
||θ^(s+1) − θ^(s)|| → 0 and {θ : ℓ(θ; y) = ℓ^*} is discrete
20. Proof for EM-algorithm: Local Maximum
Local Maximum
Recall that
ℓ_C(θ; X) = log f_{X|Θ}(X | θ) = log f_{Z|Y,Θ}(Z | Y, θ) + log f_{Y|Θ}(Y | θ)
Then
Q(θ | θ^(s)) = ∫ log f_{Z|Y,Θ}(z | Y, θ) f_{Z|Y,Θ}(z | Y, θ^(s)) dz + ℓ_obs(θ; Y)
Differentiating w.r.t. θ_j and setting equal to 0 in order to maximize Q gives
0 = ∂_{θ_j} Q(θ | θ^(s)) = ∫ [∂_{θ_j} f_{Z|Y,Θ}(z | Y, θ) / f_{Z|Y,Θ}(z | Y, θ)] f_{Z|Y,Θ}(z | Y, θ^(s)) dz + ∂_{θ_j} ℓ_obs(θ; Y)
21. Proof for EM-algorithm: Local Maximum
Recall that
0 = ∂_{θ_j} Q(θ | θ^(s)) = ∫ [∂_{θ_j} f_{Z|Y,Θ}(z | Y, θ) / f_{Z|Y,Θ}(z | Y, θ)] f_{Z|Y,Θ}(z | Y, θ^(s)) dz + ∂_{θ_j} ℓ_obs(θ; Y)
If θ^(s) → θ^*, then we have for θ^* that (with j = 1, · · · , J)
0 = ∂_{θ_j} Q(θ^* | θ^*)
= ∫ [∂_{θ_j} f_{Z|Y,Θ}(z | Y, θ^*) / f_{Z|Y,Θ}(z | Y, θ^*)] f_{Z|Y,Θ}(z | Y, θ^*) dz + ∂_{θ_j} ℓ_obs(θ^*; Y)
= ∂_{θ_j} ∫ f_{Z|Y,Θ}(z | Y, θ^*) dz + ∂_{θ_j} ℓ_obs(θ^*; Y)
= ∂_{θ_j} ℓ_obs(θ^*; Y)   (since ∫ f_{Z|Y,Θ}(z | Y, θ^*) dz = 1, its derivative vanishes)
22. An Example: Gaussian Mixture Model (GMM)
An Example: Gaussian Mixture Model (GMM)
23. An Example: Gaussian Mixture Model (GMM)
Introduction
Mixture models make use of latent variables to model different parameters for different groups (or clusters) of data points
For a point y_i, let the cluster to which that point belongs be labeled z_i, where z_i is latent (unobserved)
In this example, we assume the observed features y_i to be distributed as a Gaussian whose parameters depend on the cluster that y_i is associated with
24. An Example: Gaussian Mixture Model (GMM)
Figure: Gaussian Mixture Model Example[5]
25. An Example: Gaussian Mixture Model (GMM)
Data
Y = (y_1, · · · , y_N)^T : Observed data
∀i, y_i ∈ R^p
Z = (z_1, · · · , z_N)^T : Unobserved (latent) variable
Assume that Z is observed
∀i, z_i ∈ {1, 2, · · · , K}
X = (Y, Z): Complete data
26. An Example: Gaussian Mixture Model (GMM)
Distribution Assumption
z_i ∼ Mult(π), π ∈ R^K
y_i | z_i = k ∼ N_p(µ_k, Σ_k)
Model
r_ik := p(z_i = k | y_i, θ) = p(y_i | z_i = k, θ) p(z_i = k | θ) / ∑_{k'=1}^{K} p(y_i | z_i = k', θ) p(z_i = k' | θ)
z_i^* = argmax_k r_ik
θ denotes the general parameter; θ = {π, µ, Σ}
27. An Example: Gaussian Mixture Model (GMM)
Notation Simplification
Write z_i = k as
z_i = (z_i^1, · · · , z_i^k, · · · , z_i^K)^T = (0, · · · , 1, · · · , 0)^T
where z_i^j = I(j = k) ∈ {0, 1} for j = 1, · · · , K
Using this:
p(z_i | π) = ∏_{k=1}^{K} π_k^{z_i^k}
p(y_i | z_i, θ) = ∏_{k=1}^{K} N(µ_k, Σ_k)^{z_i^k} = ∏_{k=1}^{K} φ(y_i | µ_k, Σ_k)^{z_i^k}
28. An Example: Gaussian Mixture Model (GMM)
Log-likelihood for observed data
ℓ_obs(θ; Y) = log f_{Y|Θ}(Y | θ)
= ∑_{i=1}^{N} log p(y_i | θ)
= ∑_{i=1}^{N} log ∑_{z_i} p(y_i, z_i | θ)
= ∑_{i=1}^{N} log ∑_{z_i ∈ Z} ∏_{k=1}^{K} π_k^{z_i^k} N(µ_k, Σ_k)^{z_i^k}
This does not decouple the likelihood because the log cannot be 'pushed' inside the summation
29. An Example: Gaussian Mixture Model (GMM)
Complete log-likelihood
ℓ_C(θ; X) = log f_{X|Θ}(Y, Z | θ)
= log f_{Y|Z,Θ}(Y | Z, θ) + log f_{Z|Θ}(Z | θ)
= ∑_{i=1}^{N} [ log ∏_{k=1}^{K} φ(y_i | µ_k, Σ_k)^{z_i^k} + log ∏_{k=1}^{K} π_k^{z_i^k} ]
= ∑_{i=1}^{N} ∑_{k=1}^{K} [ z_i^k log φ(y_i | µ_k, Σ_k) + z_i^k log π_k ]
The parameters are now decoupled since we can estimate π_k and (µ_k, Σ_k) separately
30. An Example: Gaussian Mixture Model (GMM)
Estimation
E-step
Q(θ | θ^(s)) = E_{θ^(s)}[ ∑_{i=1}^{N} ∑_{k=1}^{K} ( z_i^k log φ(y_i | µ_k, Σ_k) + z_i^k log π_k ) | Y ]
= ∑_{i=1}^{N} ∑_{k=1}^{K} ( E_{θ^(s)}[z_i^k | Y] log φ(y_i | µ_k, Σ_k) + E_{θ^(s)}[z_i^k | Y] log π_k )
31. An Example: Gaussian Mixture Model (GMM)
E-step
Note that z_i^k | Y ∼ Bernoulli( p(z_i^k = 1 | Y, θ) )
Hence
r_ik^(s) := E_{θ^(s)}[z_i^k | Y] = p(z_i^k = 1 | Y, θ^(s))
= p(z_i^k = 1, y_i | θ^(s)) / ∑_{k'=1}^{K} p(z_i^{k'} = 1, y_i | θ^(s))
= p(y_i | z_i^k = 1, θ^(s)) p(z_i^k = 1 | θ^(s)) / ∑_{k'=1}^{K} p(y_i | z_i^{k'} = 1, θ^(s)) p(z_i^{k'} = 1 | θ^(s))
= φ(y_i | µ_k^(s), Σ_k^(s)) π_k^(s) / ∑_{k'=1}^{K} φ(y_i | µ_{k'}^(s), Σ_{k'}^(s)) π_{k'}^(s)
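To make the formula concrete, here is a minimal Python sketch of this E-step. It assumes the data sit in an N×p NumPy array Y and the current parameters in arrays pi (K,), mu (K, p), and Sigma (K, p, p); these names and the use of scipy.stats.multivariate_normal are illustrative choices, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(Y, pi, mu, Sigma):
    """Responsibilities r[i, k] = p(z_i^k = 1 | y_i, theta^(s))."""
    N, K = Y.shape[0], len(pi)
    r = np.zeros((N, K))
    for k in range(K):
        # Numerator: phi(y_i | mu_k^(s), Sigma_k^(s)) * pi_k^(s)
        r[:, k] = pi[k] * multivariate_normal.pdf(Y, mean=mu[k], cov=Sigma[k])
    r /= r.sum(axis=1, keepdims=True)  # divide by the sum over components (the denominator)
    return r
```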
32. An Example: Gaussian Mixture Model (GMM)
M-step
Recall that
Q(θ | θ^(s)) = ∑_{i=1}^{N} ∑_{k=1}^{K} ( r_ik^(s) log φ(y_i | µ_k, Σ_k) + r_ik^(s) log π_k )
Set θ^(s+1) = argmax_θ Q(θ | θ^(s)):
π_k^(s+1) = ∑_{i=1}^{N} r_ik^(s) / N
µ_k^(s+1) = ∑_{i=1}^{N} r_ik^(s) y_i / ∑_{i=1}^{N} r_ik^(s)
Σ_k^(s+1) = ∑_{i=1}^{N} r_ik^(s) (y_i − µ_k^(s+1))(y_i − µ_k^(s+1))^T / ∑_{i=1}^{N} r_ik^(s)
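Continuing the same sketch, the three closed-form updates above translate directly into code; the array names are carried over from the hypothetical E-step example on the previous slide.

```python
import numpy as np

def m_step(Y, r):
    """Updated (pi, mu, Sigma) given responsibilities r of shape (N, K)."""
    N, p = Y.shape
    Nk = r.sum(axis=0)                       # sum_i r_ik^(s), one value per component
    pi = Nk / N                              # pi_k^(s+1)
    mu = (r.T @ Y) / Nk[:, None]             # mu_k^(s+1)
    Sigma = np.zeros((r.shape[1], p, p))
    for k in range(r.shape[1]):
        d = Y - mu[k]                        # y_i - mu_k^(s+1), shape (N, p)
        Sigma[k] = (r[:, k, None] * d).T @ d / Nk[k]  # Sigma_k^(s+1)
    return pi, mu, Sigma
```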
33. An Example: Gaussian Mixture Model (GMM)
Iterate until the following inequality is satisfied:
||θ^(s+1) − θ^(s)|| < ε
Let θ̂ = θ^(s)
What we get:
θ̂ = (π̂, µ̂, Σ̂)
r̂_ik = p(z_i = k | Y, θ̂) = φ(y_i | µ̂_k, Σ̂_k) π̂_k / ∑_{k'=1}^{K} φ(y_i | µ̂_{k'}, Σ̂_{k'}) π̂_{k'}
ẑ_i = argmax_k r̂_ik
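Putting the two sketches together, the whole iteration and the final assignment ẑ_i = argmax_k r̂_ik might look as follows; e_step and m_step refer to the hypothetical functions sketched on the previous slides, and the random-responsibility initialization is just one possible choice.

```python
import numpy as np

def fit_gmm(Y, K, eps=1e-6, max_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize with random responsibilities and one M-step.
    r = rng.dirichlet(np.ones(K), size=Y.shape[0])
    pi, mu, Sigma = m_step(Y, r)
    for _ in range(max_iter):
        r = e_step(Y, pi, mu, Sigma)               # E-step
        pi_new, mu_new, Sigma_new = m_step(Y, r)   # M-step
        delta = max(np.abs(pi_new - pi).max(),
                    np.abs(mu_new - mu).max(),
                    np.abs(Sigma_new - Sigma).max())
        pi, mu, Sigma = pi_new, mu_new, Sigma_new
        if delta < eps:                            # ||theta^(s+1) - theta^(s)|| < eps
            break
    r_hat = e_step(Y, pi, mu, Sigma)
    z_hat = r_hat.argmax(axis=1)                   # hat z_i = argmax_k hat r_ik
    return (pi, mu, Sigma), r_hat, z_hat
```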
34. An Example: Gaussian Mixture Model (GMM)
Figure: Gaussian Mixture Model Fitting Example[5]
35. Application to Outlier Detection
Application to Outlier Detection
36. Application to Outlier Detection: Directly Used
Basic Idea
For cases in which the data may have many different clusters with
different orientations
Assume a specific form of the generative model (e.g., a mixture of
Gaussians)
Fit the model to the data (usually for normal behavior)
Estimate the parameters with EM-algorithm
Apply the fitted model to unseen (test) data and obtain estimates of the fit (joint) probabilities
Data points that fit the distribution will have high fit (joint) probabilities
Whereas anomalies (outliers) will have very low fit (joint) probabilities (a sketch of this idea follows below)
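The sketch below illustrates this recipe in Python. It uses scikit-learn's GaussianMixture (which fits the mixture by EM internally) purely as one convenient choice; the 5% training-quantile threshold and all names here are assumptions for illustration, not something prescribed by the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_outlier_flags(Y_train, Y_test, n_components=3, quantile=0.05, seed=0):
    # Fit the assumed generative model (mixture of Gaussians) to (mostly normal) training data.
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(Y_train)
    # Log fit probability of each test point under the fitted mixture.
    log_prob = gmm.score_samples(Y_test)
    # Points whose fit probability falls below a low quantile of the training
    # scores are declared outliers.
    threshold = np.quantile(gmm.score_samples(Y_train), quantile)
    return log_prob, log_prob < threshold
```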
37. Application to Outlier Detection: Directly Used
Simulation in R
https://rpubs.com/JayAhn/650433
38. Application to Outlier Detection: Indirectly Used
Introduction
Interestingly, the EM-algorithm can also be used as a final step after many such outlier detection algorithms, to convert their scores into probabilities [7]
Converting the outlier scores into well-calibrated probability estimates is preferable for several reasons:
1 The probability estimates allow us to select the appropriate threshold
for declaring outliers using a Bayesian risk model
2 The probability estimates obtained from individual models can be
aggregated to build an ensemble outlier detection framework
39. Application to Outlier Detection: Indirectly Used
Motivation
Since outlier detection is mainly an unsupervised learning problem, it is hard to select an appropriate threshold for declaring outliers
Every outlier detection model outputs outlier scores on a different scale, which makes it difficult to construct an outlier ensemble model
40. Application to Outlier Detection: Indirectly Used
Outlier Score Distributions
Figure: Outlier Score Distributions [7]. (a) Outlier score distribution for normal examples; (b) outlier score distribution for outliers
41. Application to Outlier Detection: Indirectly Used
Basic Idea
Treat an outlier score as a univariate random variable
Treat the outlier label as an unobserved latent variable
Estimate the posterior probabilities for the latent variable with the EM-algorithm, in one of two ways (a sketch of the second follows below):
1 Model the posterior probability for outlier scores using a sigmoid function
2 Model the score distribution as a mixture model (mixture of exponential and Gaussian) and calculate the posterior probabilities via Bayes' rule
42. Application to Outlier Detection: Indirectly Used
Bayesian Risk Model
The Bayesian risk model minimizes the overall risk associated with some cost function
For example, in the case of a two-class problem:
The Bayes decision rule for a given observation x is to decide ω_1 if:
(λ_21 − λ_11) P(ω_1 | x) > (λ_12 − λ_22) P(ω_2 | x)
where ω_1, ω_2 are the two classes and λ_ij is the cost of misclassifying ω_j as ω_i
Since P(ω_2 | x) = 1 − P(ω_1 | x), the preceding inequality suggests that the appropriate outlier threshold is automatically determined once the cost functions are known
In the case of a zero-one loss function, the threshold which minimizes the overall risk is 0.5, where
λ_ij = 0 if i = j, 1 if i ≠ j
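To make the decision rule concrete, the small helper below (a hypothetical illustration, not from the slides) computes the posterior-probability threshold implied by a 2×2 cost matrix; under the zero-one loss it reduces to 0.5.

```python
def outlier_threshold(cost_fp, cost_fn, cost_tp=0.0, cost_tn=0.0):
    """Posterior threshold minimizing the Bayes risk in a two-class problem.

    cost_fp: cost of declaring 'outlier' when the point is normal
    cost_fn: cost of declaring 'normal' when the point is an outlier
    Declare 'outlier' when P(outlier | x) exceeds the returned threshold.
    """
    return (cost_fp - cost_tn) / ((cost_fp - cost_tn) + (cost_fn - cost_tp))

# Zero-one loss: both misclassification costs are 1, correct decisions cost 0 -> threshold 0.5.
assert outlier_threshold(1.0, 1.0) == 0.5
```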
44. Summary
These slides gave an overview of the EM-algorithm and its application to outlier detection
We reviewed the basic procedure of the EM-algorithm for estimating the parameters of a model that involves an unobserved latent variable
We also showed that the log-likelihood for the observed data is maximized by the EM-algorithm, through 3 parts: Non-decreasing / Convergence / Local Maximum
Furthermore, we saw that the EM-algorithm can be applied to outlier detection not only directly but also indirectly
45. References
[1] Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. ”Maximum
likelihood from incomplete data via the EM algorithm.” Journal of the Royal
Statistical Society: Series B (Methodological) 39, no. 1 (1977): 1-22.
[2] https://www.math.kth.se/matstat/gru/Statistical%20inference/Lecture8.pdf
[3] https://www.cs.cmu.edu/~epxing/Class/10708-17/notes-17/10708-scribe-lecture8.pdf
[4] http://www2.stat.duke.edu/~sayan/Sta613/2018/lec/emnotes.pdf
[5] Contributions to collaborative clustering and its potential applications on very
high resolution satellite images
46. References
[6] Kriegel, Hans-Peter, Peer Kröger, Erich Schubert, and Arthur Zimek.
”Interpreting and unifying outlier scores.” In Proceedings of the 2011 SIAM
International Conference on Data Mining, pp. 13-24. Society for Industrial and
Applied Mathematics, 2011.
[7] Gao, Jing, and Pang-Ning Tan. ”Converting output scores from outlier
detection algorithms into probability estimates.” In Sixth International Conference
on Data Mining (ICDM’06), pp. 212-221. IEEE, 2006.