Network Intelligence and Analysis Lab 
Clustering Methods via the EM Algorithm
2014.07.10
Sanghyuk Chun

Machine Learning and Unsupervised Learning
• Machine Learning
• Training data
• Learning model
• Unsupervised Learning
• Training data without labels
• Input data: $D = \{x_1, x_2, \dots, x_N\}$
• Most unsupervised learning problems try to find hidden structure in unlabeled data
• Examples: Clustering, Dimensionality Reduction (PCA, LDA), …

Unsupervised Learning and Clustering
• Clustering
• Grouping objects in such a way that objects in the same group are more similar to each other than to objects in other groups
• Input: a set of objects (or data) without group information
• Output: a cluster index for each object
• Usage: Customer Segmentation, Image Segmentation, …
[Figure: a clustering algorithm maps the unlabeled input data to cluster assignments (output)]

K-means Clustering
• Introduction
• Optimization

K-means Clustering
• Intuition: data points in the same cluster are closer to each other than to data points in other clusters
• Goal: minimize the distance between data points in the same cluster
• Objective function:
• $J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
• where $N$ is the number of data points and $K$ is the number of clusters
• $r_{nk} \in \{0, 1\}$ is an indicator variable describing which of the $K$ clusters the data point $\mathbf{x}_n$ is assigned to
• $\boldsymbol{\mu}_k$ is a prototype associated with the $k$-th cluster
• Eventually $\boldsymbol{\mu}_k$ equals the center (mean) of cluster $k$
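
As a concrete reference, here is a minimal NumPy sketch of this distortion $J$; the names (kmeans_objective, X, assignments, centers) are illustrative and not from the slides.

```python
import numpy as np

def kmeans_objective(X, assignments, centers):
    """Distortion J = sum_n sum_k r_nk * ||x_n - mu_k||^2.

    X           : (N, d) data matrix
    assignments : (N,) integer cluster index per point (r_nk encoded as an index)
    centers     : (K, d) cluster prototypes mu_k
    """
    diffs = X - centers[assignments]      # x_n - mu_k for the assigned cluster of each point
    return float(np.sum(diffs ** 2))      # sum of squared Euclidean distances
```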

K-means Clustering – Optimization
• Objective function:
• $\underset{\{r_{nk},\,\boldsymbol{\mu}_k\}}{\arg\min} \; \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
• This problem can be solved through an iterative procedure
• Step 1: minimize $J$ with respect to the $r_{nk}$, keeping $\boldsymbol{\mu}_k$ fixed
• Step 2: minimize $J$ with respect to the $\boldsymbol{\mu}_k$, keeping $r_{nk}$ fixed
• Repeat Steps 1 and 2 until convergence
• Does it always converge?

Optional – Biconvex Optimization
• Biconvex optimization is a generalization of convex optimization in which the objective function and the constraint set may be biconvex
• $f(x, y)$ is biconvex if, fixing $x$, $f_x(y) = f(x, y)$ is convex over $Y$ and, fixing $y$, $f_y(x) = f(x, y)$ is convex over $X$
• One way to solve a biconvex optimization problem is to iteratively solve the corresponding convex subproblems
• This does not guarantee the global optimum
• But it always converges to some local optimum

K-means Clustering – Optimization
• $\underset{\{r_{nk},\,\boldsymbol{\mu}_k\}}{\arg\min} \; \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
• Step 1: minimize $J$ with respect to the $r_{nk}$, keeping $\boldsymbol{\mu}_k$ fixed
• $r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases}$
• Step 2: minimize $J$ with respect to the $\boldsymbol{\mu}_k$, keeping $r_{nk}$ fixed
• Setting the derivative with respect to $\boldsymbol{\mu}_k$ to zero gives
• $2 \sum_n r_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k) = 0$
• $\boldsymbol{\mu}_k = \dfrac{\sum_n r_{nk}\, \mathbf{x}_n}{\sum_n r_{nk}}$
• $\boldsymbol{\mu}_k$ is equal to the mean of all the data points assigned to cluster $k$
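
A minimal NumPy sketch of these two alternating updates. The function name, the random initialization from data points, and the stopping test are assumptions for illustration; an empty cluster simply keeps its previous prototype.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate Step 1 (assignments r_nk) and Step 2 (means mu_k)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]            # initialize prototypes from the data (assumption)
    for _ in range(n_iters):
        # Step 1: r_nk = 1 for the closest prototype, 0 otherwise
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) # (N, K) squared distances
        r = d2.argmin(axis=1)
        # Step 2: mu_k = mean of the points assigned to cluster k
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                              # converged: prototypes stop moving
            break
        mu = new_mu
    return r, mu
```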

K-means Clustering – Conclusion
• Advantages of K-means clustering
• Easy to implement (kmeans in MATLAB, kcluster in Python)
• In practice, it works well
• Disadvantages of K-means clustering
• It can converge to a local optimum
• Computing the Euclidean distance for every point is expensive
• Solution: Batch K-means
• The Euclidean distance is not robust to outliers
• Solution: K-medoids algorithms (use a different metric)
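
The slide names kmeans in MATLAB and kcluster in Python; as one further illustration (an assumption, since the slides do not mention scikit-learn), a library run might look like this sketch, where multiple restarts (n_init) mitigate, but do not remove, convergence to a poor local optimum.

```python
import numpy as np
from sklearn.cluster import KMeans   # assumed to be installed

# Toy data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(6.0, 1.0, size=(100, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))           # roughly 100 points per cluster
```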

Mixture of Gaussians
• Mixture Model
• EM Algorithm
• EM for Gaussian Mixtures

Mixture of Gaussians
• Assumption: there are $k$ components $\{c_i\}_{i=1}^{k}$
• Component $c_i$ has an associated mean vector $\mu_i$
• Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$
[Figure: five Gaussian components with means $\mu_1, \dots, \mu_5$]

Gaussian Mixture Model
• Represent the model as a linear combination of Gaussians
• Probability density function of a GMM:
• $p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k)$
• $N(x \mid \mu_k, \Sigma_k) = \dfrac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right\}$
• This is called a mixture of Gaussians, or Gaussian Mixture Model
• Each Gaussian density is called a component of the mixture and has its own mean $\mu_k$ and covariance $\Sigma_k$
• The parameters $\pi_k$ are called mixing coefficients ($\sum_k \pi_k = 1$)
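
A small sketch evaluating this density with SciPy's multivariate normal; the function name gmm_pdf and the toy parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

# Two-component 1-D example; the mixing coefficients must sum to 1
pis    = [0.3, 0.7]
mus    = [np.array([0.0]), np.array([4.0])]
Sigmas = [np.eye(1), 2.0 * np.eye(1)]
print(gmm_pdf(np.array([1.0]), pis, mus, Sigmas))
```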

Clustering using a Mixture Model
• $p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k)$, where $\sum_k \pi_k = 1$
• Input:
• The training set $\{x_i\}_{i=1}^{N}$
• Number of clusters $k$
• Goal: model this data using a mixture of Gaussians
• Mixing coefficients $\pi_1, \pi_2, \dots, \pi_k$
• Means and covariances $\mu_1, \mu_2, \dots, \mu_k;\; \Sigma_1, \Sigma_2, \dots, \Sigma_k$

Maximum Likelihood of GMM
• $p(x \mid G) = p(x \mid \pi_1, \mu_1, \dots) = \sum_i p(x \mid c_i)\, p(c_i) = \sum_i \pi_i \, N(x \mid \mu_i, \Sigma_i)$
• $p(x_1, x_2, \dots, x_N \mid G) = \prod_i p(x_i \mid G)$
• The log-likelihood function is given by
• $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, N(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
• Goal: find the parameters which maximize the log-likelihood
• Problem: the maximum likelihood solution is hard to compute in closed form
• Solution: use the EM algorithm
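
Evaluating this log-likelihood for fixed parameters is straightforward; maximizing it is the hard part. A hedged sketch follows, using SciPy's logsumexp for numerical stability (a detail not in the slides); the function name is illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k * N(x_n | mu_k, Sigma_k)."""
    # log_prob[n, k] = ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_prob = np.column_stack([
        np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    return float(logsumexp(log_prob, axis=1).sum())
```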

EM (Expectation-Maximization) Algorithm
• The EM algorithm is an iterative procedure for finding the MLE
• An expectation (E) step creates a function for the expectation of the log-likelihood, evaluated using the current estimate of the parameters
• A maximization (M) step computes parameters maximizing the expected log-likelihood found in the E step
• These parameter estimates are then used to determine the distribution of the latent variables in the next E step
• EM always converges to one of the local optima

K-means Revisited: EM and K-means
• $\underset{\{r_{nk},\,\boldsymbol{\mu}_k\}}{\arg\min} \; \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
• E-step: minimize $J$ with respect to the $r_{nk}$, keeping $\boldsymbol{\mu}_k$ fixed
• $r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases}$
• M-step: minimize $J$ with respect to the $\boldsymbol{\mu}_k$, keeping $r_{nk}$ fixed
• $\boldsymbol{\mu}_k = \dfrac{\sum_n r_{nk}\, \mathbf{x}_n}{\sum_n r_{nk}}$

Latent Variable for GMM
• Let $z_k$ be a Bernoulli random variable with probability $\pi_k$
• $p(z_k = 1) = \pi_k$, where $\sum_k z_k = 1$ and $\sum_k \pi_k = 1$
• Because $z$ uses a 1-of-K representation, this distribution can be written in the form
• $p(z) = \prod_{k=1}^{K} \pi_k^{z_k}$
• Similarly, the conditional distribution of $x$ given a particular value of $z$ is a Gaussian:
• $p(x \mid z) = \prod_{k=1}^{K} N(x \mid \mu_k, \Sigma_k)^{z_k}$
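
These two distributions give a generative recipe: first draw $z$ from $p(z)$, then draw $x$ from $p(x \mid z)$. A minimal sketch, with illustrative names:

```python
import numpy as np

def sample_gmm(n, pis, mus, Sigmas, seed=0):
    """Draw n samples by sampling z ~ p(z), then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    zs = rng.choice(len(pis), size=n, p=pis)                            # 1-of-K latent component per sample
    xs = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in zs])
    return xs, zs
```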

Latent Variable for GMM
• The joint distribution is given by $p(x, z) = p(z)\, p(x \mid z)$
• $p(x) = \sum_z p(z)\, p(x \mid z) = \sum_k \pi_k \, N(x \mid \mu_k, \Sigma_k)$
• Thus the marginal distribution of $x$ is a Gaussian mixture of the above form
• Now we are able to work with the joint distribution instead of the marginal distribution
• Graphical representation of a GMM for a set of $N$ i.i.d. data points $\{x_n\}$ with corresponding latent variables $\{z_n\}$, where $n = 1, \dots, N$
[Figure: plate over $n = 1, \dots, N$ containing latent $\mathbf{z}_n$ and observed $\mathbf{x}_n$, governed by the parameters $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$]

EM for Gaussian Mixtures (E-step)
• Conditional probability of $z$ given $x$
• From Bayes' theorem,
• $\gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x}) = \dfrac{p(z_k = 1)\, p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\mathbf{x} \mid z_j = 1)} = \dfrac{\pi_k \, N(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, N(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
• $\gamma(z_k)$ can also be viewed as the responsibility that component $k$ takes for 'explaining' the observation $\mathbf{x}$
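
A sketch of this E-step for a whole data set; the responsibility matrix gamma has shape (N, K), and the function and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    weighted = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])                                                # (N, K) numerators
    return weighted / weighted.sum(axis=1, keepdims=True)
```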

EM for Gaussian Mixtures (M-step)
• Likelihood function for a GMM:
• $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, N(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
• Setting the derivatives of the log-likelihood with respect to the means $\mu_k$ of the Gaussian components to zero, we obtain
• $\mu_k = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n$, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$

EM for Gaussian Mixtures (M-step)
• Setting the derivatives of the log-likelihood with respect to $\Sigma_k$ to zero, we obtain
• $\boldsymbol{\Sigma}_k = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (\mathbf{x}_n - \mu_k)(\mathbf{x}_n - \mu_k)^\top$
• Maximizing the likelihood with respect to the mixing coefficients $\pi_k$ using a Lagrange multiplier, i.e. maximizing
• $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$,
• we obtain $\pi_k = \dfrac{N_k}{N}$
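
Putting the three update equations together, a hedged M-step sketch in NumPy; the small ridge term added to each covariance is an assumption for numerical safety and is not in the slides.

```python
import numpy as np

def m_step(X, gamma, eps=1e-6):
    """Re-estimate (mu_k, Sigma_k, pi_k) from responsibilities gamma of shape (N, K)."""
    N, d = X.shape
    Nk = gamma.sum(axis=0)                                # effective number of points per component
    mus = (gamma.T @ X) / Nk[:, None]                     # mu_k = (1/N_k) sum_n gamma_nk x_n
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                                 # (N, d)
        Sk = (gamma[:, k, None] * diff).T @ diff / Nk[k]  # weighted scatter matrix
        Sigmas.append(Sk + eps * np.eye(d))               # ridge term (assumption): keeps Sigma_k positive definite
    pis = Nk / N                                          # pi_k = N_k / N
    return mus, np.array(Sigmas), pis
```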

EM for Gaussian Mixtures
• $\mu_k$, $\Sigma_k$, $\pi_k$ do not constitute a closed-form solution for the parameters of the mixture model, because the responsibilities $\gamma(z_{nk})$ depend on those parameters in a complex way:
• $\gamma(z_{nk}) = \dfrac{\pi_k \, N(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, N(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
• In the EM algorithm for a GMM, $\gamma(z_{nk})$ and the parameters are optimized iteratively
• In the E step, the responsibilities (posterior probabilities) are evaluated using the current values of the parameters
• In the M step, the means, covariances, and mixing coefficients are re-estimated using the previous results

EM for Gaussian Mixtures
• Initialize the means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $\pi_k$, and evaluate the initial value of the log-likelihood
• E step: evaluate the responsibilities using the current parameters
• $\gamma(z_{nk}) = \dfrac{\pi_k \, N(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, N(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
• M step: re-estimate the parameters using the current responsibilities
• $\mu_k^{\text{new}} = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n$
• $\boldsymbol{\Sigma}_k^{\text{new}} = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (\mathbf{x}_n - \mu_k^{\text{new}})(\mathbf{x}_n - \mu_k^{\text{new}})^\top$
• $\pi_k^{\text{new}} = \dfrac{N_k}{N}$, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$
• Repeat the E step and M step until convergence
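
Combining the pieces, a compact, self-contained EM loop might look like the sketch below. The initialization (random data points as means, the shared data covariance, uniform weights), the tolerance, and the small ridge terms are assumptions for illustration, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iters=200, tol=1e-6, seed=0):
    """EM for a Gaussian mixture: iterate E and M steps until the log-likelihood stalls."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mus = X[rng.choice(N, size=K, replace=False)]             # initial means: random data points (assumption)
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k]
        weighted = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                                    for k in range(K)])
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        Sigmas = np.array([((gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / Nk[k] + 1e-6 * np.eye(d)
                           for k in range(K)])
        pis = Nk / N
        # Convergence check on the log-likelihood
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas, gamma
```

gamma.argmax(axis=1) then gives a hard cluster assignment comparable to the K-means output.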

Relationship between the K-means Algorithm and GMM
• We can derive the K-means algorithm as a particular limit of EM for the Gaussian mixture model
• Consider a Gaussian mixture model whose covariance matrices are given by $\varepsilon I$, where $\varepsilon$ is a variance parameter and $I$ is the identity matrix
• If we consider the limit $\varepsilon \to 0$, the expected complete-data log-likelihood of the GMM becomes
• $\mathbb{E}_z\!\left[\ln p(X, Z \mid \mu, \Sigma, \pi)\right] \;\to\; -\dfrac{1}{2} \sum_n \sum_k r_{nk} \,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 + C$
• Thus we see that, in this limit, maximizing the expected complete-data log-likelihood is equivalent to the K-means algorithm
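
A quick way to see this numerically: with covariances $\varepsilon I$ (and, in this sketch, equal mixing weights, an extra assumption), the responsibilities are a softmax of the negative squared distances scaled by $1/(2\varepsilon)$, and they collapse to the one-hot K-means assignments as $\varepsilon \to 0$. The function name and toy values are illustrative.

```python
import numpy as np

def responsibilities_isotropic(X, mus, eps):
    """gamma[n, k] for equal weights and covariances eps * I: softmax of -||x_n - mu_k||^2 / (2 eps)."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    logits = -0.5 * d2 / eps
    logits -= logits.max(axis=1, keepdims=True)                 # stabilize the softmax
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [5.0, 5.0]])
mus = np.array([[0.5, 0.0], [5.0, 4.5]])
for eps in [10.0, 1.0, 0.01]:
    print(eps, responsibilities_isotropic(X, mus, eps).round(3))  # approaches one-hot as eps -> 0
```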