Machine Learning
K-means, E.M. and Mixture models
                VU Pham
           phvu@fit.hcmus.edu.vn

       Department of Computer Science

             November 22, 2010




                Machine Learning
Remind: Three Main Problems in ML

• Three main problems in ML:
    – Regression: Linear Regression, Neural net...
    – Classification: Decision Tree, kNN, Bayesian Classifier...
    – Density Estimation: Gauss Naive DE,...

• Today, we will learn:
    – K-means: a trivial unsupervised classification algorithm.
    – Expectation Maximization: a general algorithm for density estimation.
      ∗ We will see how to use EM in general cases and in the specific case of GMM.
    – GMM: a tool for modelling Data-in-the-Wild (density estimator)
      ∗ We will also learn how to use GMM in a Bayesian classifier.




Machine Learning                                                                 1
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             2
Unsupervised Learning
• So far, we have considered supervised learning techniques:
  – Label of each sample is included in the training set
                                 Sample     Label
                                   x1        y1
                                   ...       ...
                                   xn        yk

• Unsupervised learning:
  – Training set contains the samples only
                                 Sample     Label
                                   x1
                                   ...
                                   xn




Machine Learning                                               3
Unsupervised Learning

                   [Two scatter plots of the same 2D data omitted.]

                    (a) Supervised learning.                       (b) Unsupervised learning.

                            Figure 1: Unsupervised vs. Supervised Learning




Machine Learning                                                                                         4
What is unsupervised learning useful for?

• Collecting and labeling a large training set can be very expensive.

• It can find features which are helpful for categorization.

• Gain insight into the natural structure of the data.




Machine Learning                                                        5
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             6
K-means clustering
• Clustering algorithms aim to find groups of “similar” data points among
  the input data.

• K-means is an effective algorithm to extract a given number of clusters
  from a training set.

• Once done, the cluster locations can be used to classify data into distinct
  classes.

  [Scatter plot of the unlabeled input data omitted.]




Machine Learning                                                               7
K-means clustering

• Given:
    – The dataset: {x_n}_{n=1}^{N} = {x_1, x_2, ..., x_N}
    – Number of clusters: K (K < N)

• Goal: find a partition S = {S_k}_{k=1}^{K} that minimizes the objective function

                    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2                 (1)

    where r_{nk} = 1 if x_n is assigned to cluster S_k, and r_{nj} = 0 for j ≠ k.

i.e. Find values for the {r_{nk}} and the {\mu_k} that minimize (1).




Machine Learning                                                                 8
K-means clustering
                    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2

• Select some initial values for the µ_k.

• Expectation: keep the µ_k fixed, minimize J with respect to the r_{nk}.

• Maximization: keep the r_{nk} fixed, minimize J with respect to the µ_k.

• Loop until no change in the partitions (or the maximum number of iterations is
  exceeded).




Machine Learning                                                            9
K-means clustering
                    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2

• Expectation: J is a linear function of the r_{nk}:

                    r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases}

• Maximization: setting the derivative of J with respect to µ_k to zero gives:

                    \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

    Convergence of K-means: assured [why?], but it may lead to a local minimum of J
    [8]. (A code sketch of this loop follows below.)


Machine Learning                                                                 10
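
A minimal NumPy sketch of this loop (illustrative, not from the slides; the
function name and defaults are mine): the E-step assigns each x_n to its closest
mean, the M-step recomputes each mean as the average of its assigned points, and
the loop stops when the partition no longer changes.

    import numpy as np

    def kmeans(X, K, max_iters=100, seed=0):
        """Basic K-means on an (N, d) data array X with K clusters."""
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # initial means
        assign = np.full(len(X), -1)
        for _ in range(max_iters):
            # E-step: r_nk = 1 for the closest mean, 0 otherwise.
            dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
            new_assign = dists.argmin(axis=1)
            if np.array_equal(new_assign, assign):
                break                               # partition unchanged: converged
            assign = new_assign
            # M-step: each mean becomes the average of the points assigned to it.
            for k in range(K):
                if np.any(assign == k):
                    mu[k] = X[assign == k].mean(axis=0)
        return mu, assign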
K-means clustering: How to understand?
                    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2

• Expectation: minimize J with respect to the r_{nk}
    – For each x_n, find the “closest” cluster mean µ_k and put x_n into cluster S_k.

• Maximization: minimize J with respect to the µ_k
    – For each cluster S_k, re-estimate the cluster mean µ_k as the average value
      of all samples in S_k.

• Loop until no change in the partitions (or the maximum number of iterations is
  exceeded).




Machine Learning                                                                    11
K-means clustering: Demonstration




Machine Learning                                       12
K-means clustering: some variations

• Initial cluster centroids:
    – Randomly selected
    – Iterative procedure: k-means++ [2] (see the sketch after this slide)

• Number of clusters K:
    – Empirically/experimentally: 2 ∼ √n
    – Learning K from the data [6]

• Objective function:
    – General dissimilarity measure: k-medoids algorithm.

• Speeding up:
    – kd-trees for pre-processing [7]
    – Triangle inequality for distance calculation [4]

Machine Learning                                            13
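
For the k-means++ seeding mentioned above [2], a hedged sketch (names are
illustrative): the first centroid is drawn uniformly from the data, and each
further centroid is drawn with probability proportional to its squared distance
to the nearest centroid already chosen, which tends to spread the seeds out.

    import numpy as np

    def kmeans_pp_init(X, K, seed=0):
        """k-means++-style seeding for the kmeans sketch above."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]          # first centroid: uniform at random
        for _ in range(K - 1):
            C = np.array(centers)
            # Squared distance of every point to its nearest chosen centroid.
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            # Draw the next centroid with probability proportional to that distance.
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)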
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             14
Expectation Maximization




                   E.M.
Machine Learning                              15
Expectation Maximization

• A general-purpose algorithm for MLE in a wide range of situations.

• First formally stated by Dempster, Laird and Rubin in 1977 [1]
    – We even have several books discussing only EM and its variations!

• An excellent way of doing our unsupervised learning problem, as we will see
    – EM is also used widely in other domains.




Machine Learning                                                                16
EM: a solution for MLE

• Given a statistical model with:
    – a set X of observed data,
    – a set Z of unobserved latent data,
    – a vector of unknown parameters θ,
    – a likelihood function L(θ; X, Z) = p(X, Z | θ)

• Roughly speaking, the aim of MLE is to determine θ = \arg\max_θ L(θ; X, Z)
    – We know the old trick: set the partial derivatives of the log likelihood to zero...
    – But it is not always tractable [e.g.]
    – Other solutions are available.




Machine Learning                                                              17
EM: General Case

                    L(θ; X, Z) = p(X, Z | θ)

• EM is just an iterative procedure for finding the MLE

• Expectation step: keep the current estimate θ^{(t)} fixed, calculate the expected
  value of the log likelihood function

                    Q(θ | θ^{(t)}) = E[\log L(θ; X, Z)] = E[\log p(X, Z | θ)]

• Maximization step: find the parameter that maximizes this quantity

                    θ^{(t+1)} = \arg\max_θ \, Q(θ | θ^{(t)})




Machine Learning                                                                    18
EM: Motivation

• If we know the value of the parameters θ, we can find the value of latent variables
  Z by maximizing the log likelihood over all possible values of Z
    – Searching on the value space of Z.

• If we know Z, we can find an estimate of θ
    – Typically by grouping the observed data points according to the value of asso-
      ciated latent variable,
    – then averaging the values (or some functions of the values) of the points in
      each group.

To understand this motivation, let’s take K-means as a trivial example...




Machine Learning                                                                  19
EM: informal description
     Since both θ and Z are unknown, EM is an iterative algorithm (a schematic skeleton follows this slide):

1. Initialize the parameters θ to some random values.

2. Compute the best values of Z given these parameter values.

3. Use the just-computed values of Z to find better estimates for θ.

4. Iterate until convergence.




Machine Learning                                                      20
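
A schematic Python skeleton of this loop (illustrative, not from the slides). The
e_step, m_step and log_lik callables are placeholders to be supplied by the
concrete model: e_step(theta, X) returns the latent values Z, m_step(X, Z) the
updated parameters, and log_lik(theta, X) is used only to detect convergence.

    def em(X, theta_init, e_step, m_step, log_lik, max_iters=100, tol=1e-6):
        theta = theta_init                     # 1. initialize the parameters
        prev = None
        for _ in range(max_iters):
            Z = e_step(theta, X)               # 2. best/expected Z given current theta
            theta = m_step(X, Z)               # 3. better estimate of theta given Z
            cur = log_lik(theta, X)            # 4. iterate until convergence
            if prev is not None and abs(cur - prev) < tol:
                break
            prev = cur
        return theta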
EM Convergence

• E.M. Convergence: Yes
    – After each iteration, p (X, Z | θ) must increase or remain the same   [NOT OBVIOUS]
    – But it cannot exceed 1 [OBVIOUS]
    – Hence it must converge [OBVIOUS]

• Bad news: E.M. converges to local optimum.
    – Whether the algorithm converges to the global optimum depends on the ini-
      tialization.

• Let’s take K-means as an example, again...

• Details can be found in [9].




Machine Learning                                                                   21
Regularized EM (REM)

• EM tries to infer the latent (missing) data Z from the observations X
    – We want to choose the missing data that has a strong probabilistic relation
      to the observations, i.e. we assume that the observations contain lots of
      information about the missing data.
    – But E.M. does not have any control over the relationship between the missing
      data and the observations!

• Regularized EM (REM) [5] tries to optimize the penalized likelihood

                    L(θ | X, Z) − γ H(Z | X, θ)

    where H(Y) is Shannon’s entropy of the random variable Y:

                    H(Y) = − \sum_y p(y) \log p(y)

    and the positive value γ is the regularization parameter. [When γ = 0?]

Machine Learning                                                               22
Regularized EM (REM)

• E-step: unchanged

• M-step: Find the parameter that maximizes the penalized quantity

                    θ^{(t+1)} = \arg\max_θ \, \tilde{Q}(θ | θ^{(t)})

    where

                    \tilde{Q}(θ | θ^{(t)}) = Q(θ | θ^{(t)}) − γ H(Z | X, θ)

• REM is expected to converge faster than EM (and it does!)

• So, to apply REM, we just need to determine the H (·) part...




Machine Learning                                                              23
Model Selection

• Considering a parametric model:
    – When estimating model parameters using MLE, it is possible to increase the
      likelihood by adding parameters
    – But may result in over-fitting.

• e.g. K-means with different values of K...

• Need a criterion for model selection, e.g. to “judge” which model configuration is
  better, how many parameters are sufficient...
    – Cross Validation
    – Akaike Information Criterion (AIC)
    – Bayes Factor
      ∗ Bayesian Information Criterion (BIC)
      ∗ Deviance Information Criterion
      ∗ Deviance Information Criterion
    – ...

Machine Learning                                                                24
Bayesian Information Criterion
                    BIC = −\log p(data | θ) + \frac{\text{\# of params}}{2} \log n

• Where:
    – θ: the estimated parameters.
    – p(data | θ): the maximized value of the likelihood function for the estimated
      model.
    – n: the number of data points.
    – Note that there are other ways to write the BIC expression, but they are all
      equivalent. (A one-line helper computing this expression follows this slide.)

• Given any two estimated models, the model with the lower value of BIC is
  preferred.




Machine Learning                                                                 25
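
A one-line helper matching the expression above (illustrative; note the sign
convention used on this slide, so a lower value is better):

    import numpy as np

    def bic(log_likelihood, n_params, n_points):
        # BIC = -log p(data | theta) + (# of params / 2) * log n
        return -log_likelihood + 0.5 * n_params * np.log(n_points)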
Bayesian Score

• BIC is an asymptotic (large n) approximation to the better (and harder to evaluate)
  Bayesian score

                    \text{Bayesian score} = \int_θ p(θ) \, p(data | θ) \, dθ

• Given two models, the model selection is based on the Bayes factor

                    K = \frac{\int_{θ_1} p(θ_1) \, p(data | θ_1) \, dθ_1}{\int_{θ_2} p(θ_2) \, p(data | θ_2) \, dθ_2}




Machine Learning                                                             26
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             27
Remind: Bayes Classifier

                       [2D scatter plot of the labeled training data omitted.]

                    p(y = i | x) = \frac{p(x | y = i) \, p(y = i)}{p(x)}




Machine Learning                                                       28
Remind: Bayes Classifier

                       [2D scatter plot of the labeled training data omitted.]

     In case of a Gaussian Bayes Classifier:

                    p(y = i | x) = \frac{ \frac{1}{(2π)^{d/2} ∥Σ_i∥^{1/2}} \exp\left[ −\frac{1}{2} (x − µ_i)^T Σ_i^{-1} (x − µ_i) \right] p_i }{ p(x) }
     How can we deal with the denominator p (x)?

Machine Learning                                                                                  29
Remind: The Single Gaussian Distribution

• Multivariate Gaussian

                    N(x; µ, Σ) = \frac{1}{(2π)^{d/2} ∥Σ∥^{1/2}} \exp\left[ −\frac{1}{2} (x − µ)^T Σ^{-1} (x − µ) \right]

• For maximum likelihood

                    0 = \frac{∂ \ln N(x_1, x_2, ..., x_N; µ, Σ)}{∂ µ}

• and the solution is

                    µ_{ML} = \frac{1}{N} \sum_{i=1}^{N} x_i

                    Σ_{ML} = \frac{1}{N} \sum_{i=1}^{N} (x_i − µ_{ML})(x_i − µ_{ML})^T

    (A small code sketch of these estimators follows this slide.)



Machine Learning                                                                30
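
A small NumPy sketch of these two estimators (illustrative):

    import numpy as np

    def gaussian_mle(X):
        """ML estimates of a single multivariate Gaussian from an (N, d) array X."""
        N = len(X)
        mu = X.mean(axis=0)                  # mu_ML: the sample mean
        centered = X - mu
        sigma = centered.T @ centered / N    # Sigma_ML: the (1/N, biased) sample covariance
        return mu, sigma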
The GMM assumption
• There are k components: {c_i}_{i=1}^{k}

• Component c_i has an associated mean vector µ_i

  [Figure: the component means µ_1, µ_2, µ_3 plotted in 2D.]




Machine Learning                                   31
The GMM assumption
• There are k components: {c_i}_{i=1}^{k}

• Component c_i has an associated mean vector µ_i

• Each component generates data from a Gaussian with mean µ_i and covariance
  matrix Σ_i

• Each sample is generated according to the following guidelines:

  [Figure: the component means µ_1, µ_2, µ_3 plotted in 2D.]




Machine Learning                                     32
The GMM assumption
• There are k components: {c_i}_{i=1}^{k}

• Component c_i has an associated mean vector µ_i

• Each component generates data from a Gaussian with mean µ_i and covariance
  matrix Σ_i

• Each sample is generated according to the following guidelines:
    – Randomly select component c_i with probability P(c_i) = w_i, s.t.
      \sum_{i=1}^{k} w_i = 1




Machine Learning                                   33
The GMM assumption
• There are k components: {c_i}_{i=1}^{k}

• Component c_i has an associated mean vector µ_i

• Each component generates data from a Gaussian with mean µ_i and covariance
  matrix Σ_i

• Each sample is generated according to the following guidelines:
    – Randomly select component c_i with probability P(c_i) = w_i, s.t.
      \sum_{i=1}^{k} w_i = 1
    – Sample x ~ N(µ_i, Σ_i)

  [Figure: a sample x drawn near one of the component means.]


Machine Learning                                      34
Probability density function of GMM
            “Linear combination” of Gaussians:

                    f(x) = \sum_{i=1}^{k} w_i \, N(x; µ_i, Σ_i), \quad \text{where} \; \sum_{i=1}^{k} w_i = 1

   (a) The pdf of a 1D GMM with 3 components, f(x) = w_1 N(µ_1, σ_1^2) + w_2 N(µ_2, σ_2^2) + w_3 N(µ_3, σ_3^2).
   (b) The pdf of a 2D GMM with 3 components.

                            Figure 2: Probability density function of some GMMs.

     (A code sketch evaluating this density follows this slide.)


Machine Learning                                                                                                                          35
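
A short sketch of evaluating this density, together with the training-set
log-likelihood used on the following slides. It assumes SciPy is available for
the Gaussian pdf; function names are illustrative.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_pdf(x, weights, means, covs):
        # f(x) = sum_i w_i N(x; mu_i, Sigma_i); the weights are assumed to sum to 1.
        return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
                   for w, m, S in zip(weights, means, covs))

    def gmm_log_likelihood(X, weights, means, covs):
        # log P(x_1, ..., x_N | G) = sum_i log f(x_i)
        return sum(np.log(gmm_pdf(x, weights, means, covs)) for x in X)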
GMM: Problem definition
                    f(x) = \sum_{i=1}^{k} w_i \, N(x; µ_i, Σ_i), \quad \text{where} \; \sum_{i=1}^{k} w_i = 1

      Given a training set, how do we model these data points using a GMM?

• Given:
    – The training set: {x_i}_{i=1}^{N}
    – Number of clusters: k

• Goal: model this data using a mixture of Gaussians
    – Weights: w_1, w_2, ..., w_k
    – Means and covariances: µ_1, µ_2, ..., µ_k; Σ_1, Σ_2, ..., Σ_k




Machine Learning                                                            36
Computing likelihoods in unsupervised case
                    f(x) = \sum_{i=1}^{k} w_i \, N(x; µ_i, Σ_i), \quad \text{where} \; \sum_{i=1}^{k} w_i = 1

• Given a mixture of Gaussians, denoted by G. For any x, we can define the
  likelihood:

                    P(x | G) = P(x | w_1, µ_1, Σ_1, ..., w_k, µ_k, Σ_k)
                             = \sum_{i=1}^{k} P(x | c_i) \, P(c_i)
                             = \sum_{i=1}^{k} w_i \, N(x; µ_i, Σ_i)

• So we can define the likelihood for the whole training set [Why?]

                    P(x_1, x_2, ..., x_N | G) = \prod_{i=1}^{N} P(x_i | G)
                                              = \prod_{i=1}^{N} \sum_{j=1}^{k} w_j \, N(x_i; µ_j, Σ_j)



Machine Learning                                                                           37
Estimating GMM parameters

• We know this: Maximum Likelihood Estimation

                    \ln P(X | G) = \sum_{i=1}^{N} \ln \left( \sum_{j=1}^{k} w_j \, N(x_i; µ_j, Σ_j) \right)

    – For the maximum likelihood:

                    0 = \frac{∂ \ln P(X | G)}{∂ µ_j}

    – This leads to non-linear, non-analytically-solvable equations!
• Use gradient descent
    – Slow but doable

• A much cuter and recently popular method...



Machine Learning                                                               38
E.M. for GMM

• Remember:
    – We have the training set {x_i}_{i=1}^{N} and the number of components k.
    – Assume we know p(c_1) = w_1, p(c_2) = w_2, ..., p(c_k) = w_k
    – We don’t know µ_1, µ_2, ..., µ_k

The likelihood (here each component is a spherical Gaussian with known variance
σ^2, and K is the Gaussian normalizing constant):

            p(data | µ_1, µ_2, ..., µ_k) = p(x_1, x_2, ..., x_N | µ_1, µ_2, ..., µ_k)
                                         = \prod_{i=1}^{N} p(x_i | µ_1, µ_2, ..., µ_k)
                                         = \prod_{i=1}^{N} \sum_{j=1}^{k} p(x_i | c_j, µ_1, µ_2, ..., µ_k) \, p(c_j)
                                         = \prod_{i=1}^{N} \sum_{j=1}^{k} K \exp\left( −\frac{1}{2σ^2} (x_i − µ_j)^2 \right) w_j


Machine Learning                                                                           39
E.M. for GMM

• For Max. Likelihood, we know \frac{∂}{∂ µ_i} \log p(data | µ_1, µ_2, ..., µ_k) = 0

• Some wild algebra turns this into: For Maximum Likelihood, for each j:

                    µ_j = \frac{ \sum_{i=1}^{N} p(c_j | x_i, µ_1, µ_2, ..., µ_k) \, x_i }{ \sum_{i=1}^{N} p(c_j | x_i, µ_1, µ_2, ..., µ_k) }

  These are non-linear equations in the µ_j’s.
• So:
  – If, for each xi, we know p (cj | xi, µ1, µ2, ..., µk ), then we could easily compute
    µj ,
  – If we know each µj , we could compute p (cj | xi, µ1, µ2, ..., µk ) for each xi
    and cj .




Machine Learning                                                                      40
E.M. for GMM

• E.M. is coming: on the t’th iteration, let our estimates be

                    λ_t = {µ_1(t), µ_2(t), ..., µ_k(t)}

• E-step: compute the expected classes of all data points for each class

                    p(c_j | x_i, λ_t) = \frac{p(x_i | c_j, λ_t) \, p(c_j | λ_t)}{p(x_i | λ_t)}
                                      = \frac{p(x_i | c_j, µ_j(t), σ_j I) \, p(c_j)}{\sum_{m=1}^{k} p(x_i | c_m, µ_m(t), σ_m I) \, p(c_m)}

• M-step: compute µ given our data’s class membership distributions

                    µ_j(t+1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, λ_t) \, x_i}{\sum_{i=1}^{N} p(c_j | x_i, λ_t)}



Machine Learning                                                                                   41
E.M. for General GMM: E-step

• On the t’th iteration, let our estimates be

    λ_t = {µ_1(t), µ_2(t), ..., µ_k(t), Σ_1(t), Σ_2(t), ..., Σ_k(t), w_1(t), w_2(t), ..., w_k(t)}

• E-step: compute the expected classes of all data points for each class

                    τ_{ij}(t) ≡ p(c_j | x_i, λ_t) = \frac{p(x_i | c_j, λ_t) \, p(c_j | λ_t)}{p(x_i | λ_t)}
                                                  = \frac{p(x_i | c_j, µ_j(t), Σ_j(t)) \, w_j(t)}{\sum_{m=1}^{k} p(x_i | c_m, µ_m(t), Σ_m(t)) \, w_m(t)}




Machine Learning                                                                                  42
E.M. for General GMM: M-step

• M-step: compute the parameters given our data’s class membership distributions
  (a code sketch combining this with the E-step follows this slide):

                    w_j(t+1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, λ_t)}{N} = \frac{1}{N} \sum_{i=1}^{N} τ_{ij}(t)

                    µ_j(t+1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, λ_t) \, x_i}{\sum_{i=1}^{N} p(c_j | x_i, λ_t)} = \frac{1}{N \, w_j(t+1)} \sum_{i=1}^{N} τ_{ij}(t) \, x_i

                    Σ_j(t+1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, λ_t) \, [x_i − µ_j(t+1)][x_i − µ_j(t+1)]^T}{\sum_{i=1}^{N} p(c_j | x_i, λ_t)}
                             = \frac{1}{N \, w_j(t+1)} \sum_{i=1}^{N} τ_{ij}(t) \, [x_i − µ_j(t+1)][x_i − µ_j(t+1)]^T


Machine Learning                                                                                     43
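
Putting the E-step and M-step together, a compact NumPy/SciPy sketch of EM for a
general GMM (illustrative: it follows the updates above and the initialization on
the next slide, and omits numerical safeguards such as covariance regularization).

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, k, n_iters=100, seed=0):
        """Fit a k-component GMM to an (N, d) array X; returns (weights, means, covariances)."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        w = np.full(k, 1.0 / k)                                      # uniform initial weights
        mu = X[rng.choice(N, size=k, replace=False)].astype(float)   # means at random data points
        Sigma = np.array([np.cov(X, rowvar=False) for _ in range(k)])
        for _ in range(n_iters):
            # E-step: tau_ij = p(c_j | x_i, lambda_t).
            tau = np.array([w[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                            for j in range(k)]).T                    # shape (N, k)
            tau /= tau.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means and covariances.
            Nj = tau.sum(axis=0)                                     # N * w_j(t+1)
            w = Nj / N
            mu = (tau.T @ X) / Nj[:, None]
            for j in range(k):
                diff = X - mu[j]
                Sigma[j] = (tau[:, j, None] * diff).T @ diff / Nj[j]
        return w, mu, Sigma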
E.M. for General GMM: Initialization

• wj = 1/k, j = 1, 2, ..., k

• Each µj is set to a randomly selected point
    – Or use K-means for this initialization.

• Each Σj is computed using the equation in previous slide...




Machine Learning                                                44
Regularized E.M. for GMM

• In case of REM, the entropy H(·) is

                    H(C | X; λ_t) = − \sum_{i=1}^{N} \sum_{j=1}^{k} p(c_j | x_i; λ_t) \log p(c_j | x_i; λ_t)
                                  = − \sum_{i=1}^{N} \sum_{j=1}^{k} τ_{ij}(t) \log τ_{ij}(t)

    and the penalized likelihood will be

                    L(λ_t; X, C) − γ H(C | X; λ_t) = \sum_{i=1}^{N} \log \sum_{j=1}^{k} w_j \, p(x_i | c_j, λ_t)
                                                    + γ \sum_{i=1}^{N} \sum_{j=1}^{k} τ_{ij}(t) \log τ_{ij}(t)




Machine Learning                                                                          45
Regularized E.M. for GMM

• Some algebra [5] turns this into:

                    w_j(t+1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, λ_t) \, (1 + γ \log p(c_j | x_i, λ_t))}{N}
                             = \frac{1}{N} \sum_{i=1}^{N} τ_{ij}(t) \, (1 + γ \log τ_{ij}(t))

                    µ_j(t+1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, λ_t) \, x_i \, (1 + γ \log p(c_j | x_i, λ_t))}{\sum_{i=1}^{N} p(c_j | x_i, λ_t) \, (1 + γ \log p(c_j | x_i, λ_t))}
                             = \frac{1}{N \, w_j(t+1)} \sum_{i=1}^{N} τ_{ij}(t) \, x_i \, (1 + γ \log τ_{ij}(t))



Machine Learning                                                                         46
Regularized E.M. for GMM

• Some algebra [5] turns this into (cont.):

                    Σ_j(t+1) = \frac{1}{N \, w_j(t+1)} \sum_{i=1}^{N} τ_{ij}(t) \, (1 + γ \log τ_{ij}(t)) \, d_{ij}(t+1)

    where

                    d_{ij}(t+1) = [x_i − µ_j(t+1)][x_i − µ_j(t+1)]^T




Machine Learning                                                                             47
Demonstration

• EM for GMM

• REM for GMM




Machine Learning                   48
Local optimum solution

• E.M. is guaranteed to find the local optimal solution by monotonically increasing
  the log-likelihood

• Whether it converges to the global optimal solution depends on the initialization


       [Two plots omitted: fits obtained from two different initializations.]




Machine Learning                                                                   49
GMM: Selecting the number of components

• We can run the E.M. algorithm with different numbers of components.
    – Need a criterion for selecting the “best” number of components

   [Three plots omitted: the same data fitted with GMMs of different numbers of components.]




Machine Learning                                                                               50
GMM: Model Selection

• Empirically/Experimentally [Sure!]

• Cross-Validation [How?]

• BIC

• ...




Machine Learning                               51
GMM: Model Selection

• Empirically/Experimentally
    – Typically 3-5 components

• Cross-Validation: K-fold, leave-one-out...
    – Omit each point x_i in turn, estimate the parameters θ^{−i} on the basis of the
      remaining points, then evaluate

                    \sum_{i=1}^{N} \log p(x_i | θ^{−i})

• BIC: find k (the number of components) that minimizes the BIC
  (an illustrative selection loop follows this slide)

                    BIC = − \log p(data | θ) + \frac{d_k}{2} \log n

    where d_k is the number of (effective) parameters in the k-component mixture.

Machine Learning                                                                  52
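
An illustrative selection loop built on the em_gmm and gmm_log_likelihood sketches
from earlier slides (assumed in scope); d_k is the usual parameter count for a
k-component full-covariance GMM in d dimensions.

    import numpy as np

    def select_k_by_bic(X, k_values):
        """Fit a GMM for each candidate k and keep the one with the smallest BIC."""
        N, d = X.shape
        best_k, best_bic = None, np.inf
        for k in k_values:
            w, mu, Sigma = em_gmm(X, k)
            # d_k: (k - 1) weights + k*d means + k*d*(d+1)/2 covariance entries.
            d_k = (k - 1) + k * d + k * d * (d + 1) // 2
            b = -gmm_log_likelihood(X, w, mu, Sigma) + 0.5 * d_k * np.log(N)
            if b < best_bic:
                best_k, best_bic = k, b
        return best_k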
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             53
Gaussian mixtures for classification
                    p(y = i | x) = \frac{p(x | y = i) \, p(y = i)}{p(x)}

• To build a Bayesian classifier based on GMM, we can use GMM to model data in
  each class
    – So each class is modeled by one k-component GMM.

• For example:
  Class 0: p (y = 0) , p (x | θ 0), (a 3-component mixture)
  Class 1: p (y = 1) , p (x | θ 1), (a 3-component mixture)
  Class 2: p (y = 2) , p (x | θ 2), (a 3-component mixture)
  ...




Machine Learning                                                           54
GMM for Classification

• As before, each class is modeled by a k-component GMM (an illustrative sketch
  follows this slide).

• A new test sample x is classified according to

                    c = \arg\max_i \, p(y = i) \, p(x | θ_i)

    where

                    p(x | θ_i) = \sum_{j=1}^{k} w_j \, N(x; µ_j, Σ_j), with (w_j, µ_j, Σ_j) the parameters collected in θ_i


• Simple, quick (and is actually used!)




Machine Learning                                                  55
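
An illustrative sketch of such a classifier, reusing the em_gmm and gmm_pdf
sketches from earlier slides (assumed in scope):

    import numpy as np

    def fit_gmm_classifier(X, y, k=3):
        """One k-component GMM per class, plus the class prior p(y = i)."""
        classes = np.unique(y)
        priors = {c: float(np.mean(y == c)) for c in classes}
        models = {c: em_gmm(X[y == c], k) for c in classes}
        return priors, models

    def classify(x, priors, models):
        # c = argmax_i p(y = i) p(x | theta_i); p(x) is the same for every class,
        # so the denominator can be dropped.
        scores = {c: priors[c] * gmm_pdf(x, *models[c]) for c in priors}
        return max(scores, key=scores.get)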
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             56
Case studies

• Background subtraction
    – GMM for each pixel

• Speech recognition
    – GMM for the underlying distribution of feature vectors of each phone

• Many, many others...




Machine Learning                                                             57
What you should already know?

• K-means as a trivial classifier

• E.M. - an algorithm for solving many MLE problems

• GMM - a tool for modeling data
    – Note 1: We can have a mixture model of many different types of distribution,
      not only Gaussians
    – Note 2: Computing the sum of Gaussians may be expensive; some approximations
      are available [3]

• Model selection:
    – Bayesian Information Criterion




Machine Learning                                                               58
Q&A




Machine Learning         59
References

[1] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data
    via the EM algorithm. Journal of the Royal Statistical Society, Series B (Method-
    ological), 39(1):1–38, 1977.

[2] David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful
    Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on
    Discrete Algorithms, pages 1027–1035, 2007.

[3] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis. Improved fast Gauss transform
    and efficient kernel density estimation. In IEEE International Conference on
    Computer Vision, pages 464–471, 2003.

[4] Charles Elkan. Using the Triangle Inequality to Accelerate k-Means. In Proceed-
    ings of the Twentieth International Conference on Machine Learning (ICML),
    2003.

Machine Learning                                                                   60

[5] Haifeng Li, Keshu Zhang, and Tao Jiang. The regularized EM algorithm. In
    Proceedings of the 20th National Conference on Artificial Intelligence, pages
    807–812, Pittsburgh, PA, 2005.

[6] Greg Hamerly and Charles Elkan. Learning the k in k-means. In In Neural
    Information Processing Systems. MIT Press, 2003.

[7] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth
    Silverman, and Angela Y Wu. An efficient k-means clustering algorithm: anal-
    ysis and implementation. IEEE Transactions on Pattern Analysis and Machine
    Intelligence, 24(7):881–892, July 2002.

[8] J MacQueen. Some methods for classification and analysis of multivariate obser-
    vations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics
    and Probability, volume 233, pages 281–297. University of California Press, 1967.

Machine Learning                                                                   61
[9] C.F. Wu. On the convergence properties of the EM algorithm. The Annals of
    Statistics, 11:95–103, 1983.




Machine Learning                                                           62

K-means, EM and Mixture models

  • 1.
    Machine Learning K-means, E.M.and Mixture models VU Pham phvu@fit.hcmus.edu.vn Department of Computer Science November 22, 2010 Machine Learning
  • 2.
    Remind: Three MainProblems in ML • Three main problems in ML: – Regression: Linear Regression, Neural net... – Classification: Decision Tree, kNN, Bayessian Classifier... – Density Estimation: Gauss Naive DE,... • Today, we will learn: – K-means: a trivial unsupervised classification algorithm. – Expectation Maximization: a general algorithm for density estimation. ∗ We will see how to use EM in general cases and in specific case of GMM. – GMM: a tool for modelling Data-in-the-Wild (density estimator) ∗ We also learn how to use GMM in a Bayessian Classifier Machine Learning 1
  • 3.
    Contents • Unsupervised Learning •K-means clustering • Expectation Maximization (E.M.) – Regularized EM – Model Selection • Gaussian mixtures as a Density Estimator – Gaussian mixtures – EM for mixtures • Gaussian mixtures for classification • Case studies Machine Learning 2
  • 4.
    Unsupervised Learning • Sofar, we have considered supervised learning techniques: – Label of each sample is included in the training set Sample Label x1 y1 ... ... xn yk • Unsupervised learning: – Traning set contains the samples only Sample Label x1 ... xn Machine Learning 3
  • 5.
    Unsupervised Learning 60 60 50 50 40 40 30 30 20 20 10 10 0 0 −10 0 10 20 30 40 50 −10 0 10 20 30 40 50 (a) Supervised learning. (b) Unsupervised learning. Figure 1: Unsupervised vs. Supervised Learning Machine Learning 4
  • 6.
    What is unsupervisedlearning useful for? • Collecting and labeling a large training set can be very expensive. • Be able to find features which are helpful for categorization. • Gain insight into the natural structure of the data. Machine Learning 5
  • 7.
    Contents • Unsupervised Learning •K-means clustering • Expectation Maximization (E.M.) – Regularized EM – Model Selection • Gaussian mixtures as a Density Estimator – Gaussian mixtures – EM for mixtures • Gaussian mixtures for classification • Case studies Machine Learning 6
  • 8.
    K-means clustering • Clusteringalgorithms aim to find groups of “similar” data points among 60 the input data. 50 • K-means is an effective algorithm to ex- 40 tract a given number of clusters from a 30 training set. 20 • Once done, the cluster locations can 10 be used to classify data into distinct 0 classes. −10 0 10 20 30 40 50 Machine Learning 7
  • 9.
    K-means clustering • Given: – The dataset: {xn}N = {x1, x2, ..., xN} n=1 – Number of clusters: K (K < N ) • Goal: find a partition S = {Sk }K so that it minimizes the objective function k=1 N ∑ K ∑ J= rnk ∥ xn − µk ∥2 (1) n=1 k=1 where rnk = 1 if xn is assigned to cluster Sk , and rnj = 0 for j ̸= k. i.e. Find values for the {rnk } and the {µk } to minimize (1). Machine Learning 8
  • 10.
    K-means clustering N ∑ K ∑ J= rnk ∥ xn − µk ∥2 n=1 k=1 • Select some initial values for the µk . • Expectation: keep the µk fixed, minimize J respect to rnk . • Maximization: keep the rnk fixed, minimize J respect to the µk . • Loop until no change in the partitions (or maximum number of interations is exceeded). Machine Learning 9
  • 11.
    K-means clustering N ∑ K ∑ J= rnk ∥ xn − µk ∥2 n=1 k=1 • Expectation: J is linear function of rnk   1 if k = arg minj ∥ xn − µj ∥2     rnk =   0  otherwise • Maximization: setting the derivative of J with respect to µk to zero, gives: ∑ n rnk xn µk = ∑ n rnk Convergence of K-means: assured [why?], but may lead to local minimum of J [8] Machine Learning 10
  • 12.
    K-means clustering: Howto understand? N ∑ K ∑ J= rnk ∥ xn − µk ∥2 n=1 k=1 • Expectation: minimize J respect to rnk – For each xn, find the “closest” cluster mean µk and put xn into cluster Sk . • Maximization: minimize J respect to µk – For each cluster Sk , re-estimate the cluster mean µk to be the average value of all samples in Sk . • Loop until no change in the partitions (or maximum number of interations is exceeded). Machine Learning 11
  • 13.
  • 14.
    K-means clustering: somevariations • Initial cluster centroids: – Randomly selected – Iterative procedure: k-mean++ [2] • Number of clusters K: √ – Empirically/experimentally: 2 ∼ n – Learning [6] • Objective function: – General dissimilarity measure: k-medoids algorithm. • Speeding up: – kd-trees for pre-processing [7] – Triangle inequality for distance calculation [4] Machine Learning 13
  • 15.
    Contents • Unsupervised Learning •K-means clustering • Expectation Maximization (E.M.) – Regularized EM – Model Selection • Gaussian mixtures as a Density Estimator – Gaussian mixtures – EM for mixtures • Gaussian mixtures for classification • Case studies Machine Learning 14
  • 16.
    Expectation Maximization E.M. Machine Learning 15
  • 17.
    Expectation Maximization • Ageneral-purpose algorithm for MLE in a wide range of situations. • First formally stated by Dempster, Laird and Rubin in 1977 [1] – We even have several books discussing only on EM and its variations! • An excellent way of doing our unsupervised learning problem, as we will see – EM is also used widely in other domains. Machine Learning 16
  • 18.
    EM: a solutionfor MLE • Given a statistical model with: – a set X of observed data, – a set Z of unobserved latent data, – a vector of unknown parameters θ, – a likelihood function L (θ; X, Z) = p (X, Z | θ) • Roughly speaking, the aim of MLE is to determine θ = arg maxθ L (θ; X, Z) – We known the old trick: partial derivatives of the log likelihood... – But it is not always tractable [e.g.] – Other solutions are available. Machine Learning 17
  • 19.
    EM: General Case L (θ; X, Z) = p (X, Z | θ) • EM is just an iterative procedure for finding the MLE • Expectation step: keep the current estimate θ (t) fixed, calculate the expected value of the log likelihood function ( ) Q θ|θ (t) = E [log L (θ; X, Z)] = E [log p (X, Z | θ)] • Maximization step: Find the parameter that maximizes this quantity ( ) θ (t+1) = arg max Q θ | θ (t) θ Machine Learning 18
  • 20.
    EM: Motivation • Ifwe know the value of the parameters θ, we can find the value of latent variables Z by maximizing the log likelihood over all possible values of Z – Searching on the value space of Z. • If we know Z, we can find an estimate of θ – Typically by grouping the observed data points according to the value of asso- ciated latent variable, – then averaging the values (or some functions of the values) of the points in each group. To understand this motivation, let’s take K-means as a trivial example... Machine Learning 19
  • 21.
    EM: informal description Both θ and Z are unknown, EM is an iterative algorithm: 1. Initialize the parameters θ to some random values. 2. Compute the best values of Z given these parameter values. 3. Use the just-computed values of Z to find better estimates for θ. 4. Iterate until convergence. Machine Learning 20
  • 22.
    EM Convergence • E.M.Convergence: Yes – After each iteration, p (X, Z | θ) must increase or remain [NOT OBVIOUS] – But it can not exceed 1 [OBVIOUS] – Hence it must converge [OBVIOUS] • Bad news: E.M. converges to local optimum. – Whether the algorithm converges to the global optimum depends on the ini- tialization. • Let’s take K-means as an example, again... • Details can be found in [9]. Machine Learning 21
  • 23.
    Regularized EM (REM) •EM tries to inference the latent (missing) data Z from the observations X – We want to choose the missing data that has a strong probabilistic relation to the observations, i.e. we assume that the observations contains lots of information about the missing data. – But E.M. does not have any control on the relationship between the missing data and the observations! • Regularized EM (REM) [5] tries to optimized the penalized likelihood L (θ | X, Z) = L (θ | X, Z) − γH (Z | X, θ) where H (Y ) is Shannon’s entropy of the random variable Y : ∑ H (Y ) = − p (y) log p (y) y and the positive value γ is the regularization parameter. [When γ = 0?] Machine Learning 22
  • 24.
    Regularized EM (REM) •E-step: unchanged • M-step: Find the parameter that maximizes this quantity ( ) θ (t+1) = arg max Q θ | θ (t) θ where ( ) ( ) Q θ|θ (t) =Q θ|θ (t) − γH (Z | X, θ) • REM is expected to converge faster than EM (and it does!) • So, to apply REM, we just need to determine the H (·) part... Machine Learning 23
  • 25.
    Model Selection • Consideringa parametric model: – When estimating model parameters using MLE, it is possible to increase the likelihood by adding parameters – But may result in over-fitting. • e.g. K-means with different values of K... • Need a criteria for model selection, e.g. to “judge” which model configuration is better, how many parameters is sufficient... – Cross Validation – Akaike Information Criterion (AIC) – Bayesian Factor ∗ Bayesian Informaction Criterion (BIC) ∗ Deviance Information Criterion – ... Machine Learning 24
  • 26.
    Bayesian Information Criterion ( ) # of param BIC = − log p data | θ + log n 2 • Where: – θ:( the estimated parameters. ) – p data | θ : the maximized value of the likelihood function for the estimated model. – n: number of data points. – Note that there are other ways to write the BIC expression, but they are all equivalent. • Given any two estimated models, the model with the lower value of BIC is preferred. Machine Learning 25
  • 27.
    Bayesian Score • BICis an asymptotic (large n) approximation to better (and hard to evaluate) Bayesian score ˆ Bayesian score = p (θ) p (data | θ) dθ θ • Given two models, the model selection is based on Bayes factor ˆ p (θ1) p (data | θ1) dθ1 K = ˆθ1 p (θ2) p (data | θ2) dθ2 θ2 Machine Learning 26
  • 28.
    Contents • Unsupervised Learning •K-means clustering • Expectation Maximization (E.M.) – Regularized EM – Model Selection • Gaussian mixtures as a Density Estimator – Gaussian mixtures – EM for mixtures • Gaussian mixtures for classification • Case studies Machine Learning 27
  • 29.
    Remind: Bayes Classifier 70 60 50 40 30 20 10 0 −10 0 10 20 30 40 50 60 70 80 p (x | y = i) p (y = i) p (y = i | x) = p (x) Machine Learning 28
  • 30.
    Remind: Bayes Classifier 70 60 50 40 30 20 10 0 −10 0 10 20 30 40 50 60 70 80 In case of Gaussian Bayes Classifier: [ ] T d/2 1 exp −2 1 (x − µi) Σi (x − µi) pi (2π) ∥Σi ∥1/2 p (y = i | x) = p (x) How can we deal with the denominator p (x)? Machine Learning 29
  • 31.
    Remind: The SingleGaussian Distribution • Multivariate Gaussian   1 1 N (x; µ, Σ) = d/2 exp −  (x − µ)T Σ−1 (x − µ)  (2π) ∥ Σ ∥1/2 2 • For maximum likelihood ∂ ln N (x1, x2, ..., xN; µ, Σ) 0= ∂µ • and the solution is 1 N ∑ µM L = xi N i=1 1 N ∑ ΣM L = (xi − µM L)T (xi − µM L) N i=1 Machine Learning 30
  • 32.
    The GMM assumption •There are k components: {ci}k i=1 • Component ci has an associated mean vector µi µ2 • µ1 µ3 • Machine Learning 31
  • 33.
    The GMM assumption •There are k components: {ci}k i=1 • Component ci has an associated mean vector µi µ2 • Each component generates data from a Gaussian with mean µi and covariance µ1 matrix Σi • Each sample is generated according to µ3 the following guidelines: Machine Learning 32
  • 34.
    The GMM assumption •There are k components: {ci}k i=1 • Component ci has an associated mean vector µi • Each component generates data from a µ2 Gaussian with mean µi and covariance matrix Σi • Each sample is generated according to the following guidelines: – Randomly select component ci with probability P (ci) = wi, s.t. ∑k i=1 wi = 1 Machine Learning 33
  • 35.
    The GMM assumption •There are k components: {ci}k i=1 • Component ci has an associated mean vector µi µ2 • Each component generates data from a x Gaussian with mean µi and covariance matrix Σi • Each sample is generated according to the following guidelines: – Randomly select component ci with probability P (ci) = wi, s.t. ∑k i=1 wi = 1 – Sample ~ N (µi, Σi) Machine Learning 34
  • 36.
    Probability density functionof GMM “Linear combination” of Gaussians: k ∑ k ∑ f (x) = wiN (x; µi, Σi) , where wi = 1 i=1 i=1 0.018 0.016 0.014 0.012 0.01 f (x) 0.008 2 2 w1 N µ1 , σ1 w2 N µ2 , σ2 0.006 2 w3 N µ3 , σ3 0.004 0.002 0 0 50 100 150 200 250 (a) The pdf of an 1D GMM with 3 components. (b) The pdf of an 2D GMM with 3 components. Figure 2: Probability density function of some GMMs. Machine Learning 35
  • 37.
    GMM: Problem definition k ∑ k ∑ f (x) = wiN (x; µi, Σi) , where wi = 1 i=1 i=1 Given a training set, how to model these data point using GMM? • Given: – The trainning set: {xi}N i=1 – Number of clusters: k • Goal: model this data using a mixture of Gaussians – Weights: w1, w2, ..., wk – Means and covariances: µ1, µ2, ..., µk ; Σ1, Σ2, ..., Σk Machine Learning 36
  • 38.
    Computing likelihoods inunsupervised case k ∑ k ∑ f (x) = wiN (x; µi, Σi) , where wi = 1 i=1 i=1 • Given a mixture of Gaussians, denoted by G. For any x, we can define the likelihood: P (x | G) = P (x | w1, µ1, Σ1, ..., wk , µk , Σk ) k ∑ = P (x | ci) P (ci) i=1 k ∑ = wiN (x; µi, Σi) i=1 • So we can define likelihood for the whole training set [Why?] N ∏ P (x1, x2, ..., xN | G) = P (xi | G) i=1 N ∑ ∏ k = wj N (xi; µj , Σj ) i=1 j=1 Machine Learning 37
  • 39.
    Estimating GMM parameters •We known this: Maximum Likelihood Estimation   N ∑ k ∑ ln P (X | G) = ln   wj N (xi; µj , Σj )  i=1 j=1 – For the max likelihood: ∂ ln P (X | G) 0= ∂µj – This leads to non-linear non-analytically-solvable equations! • Use gradient descent – Slow but doable • A much cuter and recently popular method... Machine Learning 38
E.M. for GMM

• Remember:
    – We have the training set {xi}_{i=1}^N and the number of components k.
    – Assume we know p (c1) = w1, p (c2) = w2, ..., p (ck) = wk
    – We don't know µ1, µ2, ..., µk

• The likelihood:

      p (data | µ1, µ2, ..., µk) = p (x1, x2, ..., xN | µ1, µ2, ..., µk)
                                 = ∏_{i=1}^N p (xi | µ1, µ2, ..., µk)
                                 = ∏_{i=1}^N ∑_{j=1}^k p (xi | cj, µ1, µ2, ..., µk) p (cj)
                                 = ∏_{i=1}^N ∑_{j=1}^k K exp( −(1/(2σ²)) (xi − µj)² ) wj

  (K is the Gaussian normalizing constant.)

Machine Learning                                                              39
E.M. for GMM

• For Max. Likelihood, we know

      ∂/∂µi log p (data | µ1, µ2, ..., µk) = 0

• Some wild algebra turns this into: for Maximum Likelihood, for each j,

      µj = ( ∑_{i=1}^N p (cj | xi, µ1, µ2, ..., µk) xi ) / ( ∑_{i=1}^N p (cj | xi, µ1, µ2, ..., µk) )

  These are coupled non-linear equations in the µj's.

• So:
    – If, for each xi, we knew p (cj | xi, µ1, µ2, ..., µk), then we could easily compute each µj.
    – If we knew each µj, we could compute p (cj | xi, µ1, µ2, ..., µk) for each xi and cj.

Machine Learning                                                              40
E.M. for GMM

• E.M. is coming: on the t'th iteration, let our estimates be

      λt = {µ1 (t), µ2 (t), ..., µk (t)}

• E-step: compute the expected classes of all data points for each class

      p (cj | xi, λt) = p (xi | cj, λt) p (cj | λt) / p (xi | λt)
                      = p (xi | cj, µj (t), σj²I) p (cj) / ∑_{m=1}^k p (xi | cm, µm (t), σm²I) p (cm)

• M-step: compute µ given our data's class membership distributions

      µj (t + 1) = ( ∑_{i=1}^N p (cj | xi, λt) xi ) / ( ∑_{i=1}^N p (cj | xi, λt) )

Machine Learning                                                              41
E.M. for General GMM: E-step

• On the t'th iteration, let our estimates be

      λt = {µ1 (t), µ2 (t), ..., µk (t), Σ1 (t), Σ2 (t), ..., Σk (t), w1 (t), w2 (t), ..., wk (t)}

• E-step: compute the expected classes of all data points for each class

      τij (t) ≡ p (cj | xi, λt) = p (xi | cj, λt) p (cj | λt) / p (xi | λt)
                                = p (xi | cj, µj (t), Σj (t)) wj (t) / ∑_{m=1}^k p (xi | cm, µm (t), Σm (t)) wm (t)

Machine Learning                                                              42
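Not on the original slides: a minimal NumPy sketch of this E-step, producing the N × k responsibility matrix τ. It reuses the gaussian_pdf helper from the earlier sketch for p (xi | cj, µj (t), Σj (t)).

    import numpy as np

    def e_step(X, weights, mus, sigmas):
        """tau[i, j] = p(c_j | x_i, lambda_t)."""
        N, k = len(X), len(weights)
        tau = np.zeros((N, k))
        for i, x in enumerate(X):
            for j in range(k):
                tau[i, j] = weights[j] * gaussian_pdf(x, mus[j], sigmas[j])
            tau[i] /= tau[i].sum()   # normalize over components: the denominator p(x_i | lambda_t)
        return tau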
E.M. for General GMM: M-step

• M-step: compute the parameters given our data's class membership distributions

      wj (t + 1) = ( ∑_{i=1}^N p (cj | xi, λt) ) / N
                 = (1/N) ∑_{i=1}^N τij (t)

      µj (t + 1) = ( ∑_{i=1}^N p (cj | xi, λt) xi ) / ( ∑_{i=1}^N p (cj | xi, λt) )
                 = (1 / (N wj (t + 1))) ∑_{i=1}^N τij (t) xi

      Σj (t + 1) = ( ∑_{i=1}^N p (cj | xi, λt) [xi − µj (t + 1)] [xi − µj (t + 1)]ᵀ ) / ( ∑_{i=1}^N p (cj | xi, λt) )
                 = (1 / (N wj (t + 1))) ∑_{i=1}^N τij (t) [xi − µj (t + 1)] [xi − µj (t + 1)]ᵀ

Machine Learning                                                              43
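Not on the original slides: a minimal NumPy sketch of these M-step updates, taking the responsibility matrix τ returned by the e_step sketch above and the data matrix X of shape (N, d). Iterating e_step and m_step until the log-likelihood stops improving gives the full EM loop.

    import numpy as np

    def m_step(X, tau):
        N, k = tau.shape
        Nk = tau.sum(axis=0)                 # effective number of points per component
        weights = Nk / N                     # w_j(t+1)
        mus = (tau.T @ X) / Nk[:, None]      # mu_j(t+1)
        sigmas = []
        for j in range(k):
            diff = X - mus[j]
            # Sigma_j(t+1): responsibility-weighted outer products
            sigmas.append((tau[:, j, None] * diff).T @ diff / Nk[j])
        return weights, mus, sigmas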
E.M. for General GMM: Initialization

• wj = 1/k, j = 1, 2, ..., k

• Each µj is set to a randomly selected data point
    – Or use K-means for this initialization.

• Each Σj is computed using the equation in the previous slide...

Machine Learning                                                              44
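Not on the original slides: a minimal sketch of such an initialization, with uniform weights and means picked from the data. For the covariances it simply starts from the global data covariance, which is a common simplification rather than what the slide literally prescribes.

    import numpy as np

    def init_gmm(X, k, seed=0):
        rng = np.random.default_rng(seed)
        weights = np.full(k, 1.0 / k)                          # w_j = 1/k
        mus = X[rng.choice(len(X), size=k, replace=False)]     # k randomly selected data points
        sigmas = [np.cov(X.T) for _ in range(k)]               # start every Sigma_j at the global covariance
        return weights, mus, sigmas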
Regularized E.M. for GMM

• In the case of REM, the entropy H (·) is

      H (C | X; λt) = − ∑_{i=1}^N ∑_{j=1}^k p (cj | xi; λt) log p (cj | xi; λt)
                    = − ∑_{i=1}^N ∑_{j=1}^k τij (t) log τij (t)

  and the regularized likelihood becomes

      L (λt; X, C) − γ H (C | X; λt) = ∑_{i=1}^N log ∑_{j=1}^k wj p (xi | cj, λt)
                                     + γ ∑_{i=1}^N ∑_{j=1}^k τij (t) log τij (t)

Machine Learning                                                              45
Regularized E.M. for GMM

• Some algebra [5] turns this into:

      wj (t + 1) = (1/N) ∑_{i=1}^N p (cj | xi, λt) (1 + γ log p (cj | xi, λt))
                 = (1/N) ∑_{i=1}^N τij (t) (1 + γ log τij (t))

      µj (t + 1) = ( ∑_{i=1}^N p (cj | xi, λt) xi (1 + γ log p (cj | xi, λt)) )
                   / ( ∑_{i=1}^N p (cj | xi, λt) (1 + γ log p (cj | xi, λt)) )
                 = (1 / (N wj (t + 1))) ∑_{i=1}^N τij (t) xi (1 + γ log τij (t))

Machine Learning                                                              46
Regularized E.M. for GMM

• Some algebra [5] turns this into (cont.):

      Σj (t + 1) = (1 / (N wj (t + 1))) ∑_{i=1}^N τij (t) (1 + γ log τij (t)) dij (t + 1)

  where

      dij (t + 1) = [xi − µj (t + 1)] [xi − µj (t + 1)]ᵀ

Machine Learning                                                              47
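Not on the original slides: a minimal NumPy sketch of the regularized M-step exactly as the update rules are written on the last two slides, with τ from the usual E-step and γ the regularization coefficient. Whether this matches every detail of [5] has not been checked here; treat it as an illustration of the formulas above.

    import numpy as np

    def regularized_m_step(X, tau, gamma):
        N, k = tau.shape
        tau = np.clip(tau, 1e-12, None)              # avoid log(0)
        r = tau * (1.0 + gamma * np.log(tau))        # tau_ij (1 + gamma log tau_ij)
        weights = r.sum(axis=0) / N                  # w_j(t+1)
        mus = (r.T @ X) / (N * weights[:, None])     # mu_j(t+1)
        sigmas = []
        for j in range(k):
            diff = X - mus[j]
            sigmas.append((r[:, j, None] * diff).T @ diff / (N * weights[j]))   # Sigma_j(t+1)
        return weights, mus, sigmas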
Demonstration

• EM for GMM

• REM for GMM

Machine Learning                                                              48
Local optimum solution

• E.M. is guaranteed to converge to a local optimum, since it monotonically increases
  the log-likelihood at every iteration.

• Whether it converges to the global optimum depends on the initialization.

Machine Learning                                                              49
GMM: Selecting the number of components

• We can run the E.M. algorithm with different numbers of components.
    – We need a criterion for selecting the "best" number of components.

Machine Learning                                                              50
GMM: Model Selection

• Empirically/Experimentally [Sure!]

• Cross-Validation [How?]

• BIC

• ...

Machine Learning                                                              51
GMM: Model Selection

• Empirically/Experimentally
    – Typically 3–5 components

• Cross-Validation: K-fold, leave-one-out...
    – Omit each point xi in turn, estimate the parameters θ^(−i) on the basis of the
      remaining points, then evaluate

          ∑_{i=1}^N log p (xi | θ^(−i))

• BIC: find k (the number of components) that minimizes the BIC

          BIC = − log p (data | θ_k) + (dk / 2) log n

  where θ_k is the fitted k-component model and dk is the number of (effective)
  parameters in the k-component mixture.

Machine Learning                                                              52
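Not on the original slides: a minimal sketch of BIC-based selection. It assumes a hypothetical fit_gmm(X, k) helper (for instance an EM loop built from the e_step / m_step sketches above) that returns the fitted parameters together with the final log-likelihood; the parameter count below is for a mixture with full covariance matrices.

    import numpy as np

    def bic(log_lik, n_params, n_samples):
        # BIC = -log p(data | theta_k) + (d_k / 2) log n
        return -log_lik + 0.5 * n_params * np.log(n_samples)

    def select_k(X, candidate_ks, fit_gmm):
        N, d = X.shape
        best = None
        for k in candidate_ks:
            params, log_lik = fit_gmm(X, k)
            dk = (k - 1) + k * d + k * d * (d + 1) // 2   # weights + means + covariances
            score = bic(log_lik, dk, N)
            if best is None or score < best[0]:
                best = (score, k, params)
        return best   # (best BIC, chosen k, fitted parameters)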
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                                                              53
Gaussian mixtures for classification

      p (y = i | x) = p (x | y = i) p (y = i) / p (x)

• To build a Bayesian classifier based on GMMs, we can use a GMM to model the data in
  each class
    – So each class is modeled by one k-component GMM.

• For example:

      Class 0: p (y = 0), p (x | θ0)   (a 3-component mixture)
      Class 1: p (y = 1), p (x | θ1)   (a 3-component mixture)
      Class 2: p (y = 2), p (x | θ2)   (a 3-component mixture)
      ...

Machine Learning                                                              54
GMM for Classification

• As before, each class is modeled by a k-component GMM.

• A new test sample x is classified according to

      c = arg max_i p (y = i) p (x | θi)

  where

      p (x | θi) = ∑_{j=1}^k wj N (x; µj, Σj)

  with θi = {wj, µj, Σj}_{j=1}^k the mixture parameters of class i.

• Simple, quick (and actually used in practice!)

Machine Learning                                                              55
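Not on the original slides: a minimal sketch of this decision rule. It reuses the gmm_pdf helper from the earlier sketch, and assumes class_priors[i] = p (y = i) and class_params[i] = (weights, mus, sigmas) were obtained by fitting one GMM per class with EM.

    import numpy as np

    def classify(x, class_priors, class_params):
        # c = argmax_i p(y = i) p(x | theta_i)
        scores = [prior * gmm_pdf(x, *params)
                  for prior, params in zip(class_priors, class_params)]
        return int(np.argmax(scores))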
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                                                              56
Case studies

• Background subtraction
    – A GMM for each pixel

• Speech recognition
    – A GMM for the underlying distribution of feature vectors of each phone

• Many, many others...

Machine Learning                                                              57
What you should already know

• K-means as a trivial unsupervised classifier

• E.M.: an algorithm for solving many MLE problems

• GMM: a tool for modeling data
    – Note 1: we can build a mixture model from many different types of distribution,
      not only Gaussians.
    – Note 2: computing the sum of Gaussians may be expensive; some approximations
      are available [3].

• Model selection:
    – Bayesian Information Criterion

Machine Learning                                                              58
References

[1] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via
    the EM algorithm. Journal of the Royal Statistical Society, Series B
    (Methodological), 39(1):1–38, 1977.

[2] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful
    seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete
    Algorithms, pages 1027–1035, 2007.

[3] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis. Improved fast Gauss transform
    and efficient kernel density estimation. In IEEE International Conference on
    Computer Vision, pages 464–471, 2003.

[4] Charles Elkan. Using the triangle inequality to accelerate k-means. In Proceedings
    of the Twentieth International Conference on Machine Learning (ICML), 2003.

[5] Haifeng Li, Keshu Zhang, and Tao Jiang. The regularized EM algorithm. In
    Proceedings of the 20th National Conference on Artificial Intelligence, pages
    807–812, Pittsburgh, PA, 2005.

[6] Greg Hamerly and Charles Elkan. Learning the k in k-means. In Neural Information
    Processing Systems. MIT Press, 2003.

[7] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth
    Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: analysis
    and implementation. IEEE Transactions on Pattern Analysis and Machine
    Intelligence, 24(7):881–892, July 2002.

[8] J. MacQueen. Some methods for classification and analysis of multivariate
    observations. In Proceedings of the 5th Berkeley Symposium on Mathematical
    Statistics and Probability, pages 281–297. University of California Press, 1967.

[9] C. F. Wu. On the convergence properties of the EM algorithm. The Annals of
    Statistics, 11:95–103, 1983.

Machine Learning                                                           60–62