This document discusses clustering methods using the EM algorithm. It begins with an overview of machine learning and unsupervised learning. It then describes clustering, k-means clustering, and how k-means can be formulated as an optimization of a biconvex objective function solved via an iterative EM algorithm. The document goes on to describe mixture models and how the EM algorithm can be used to estimate the parameters of a Gaussian mixture model (GMM) via maximum likelihood.
Overview of the Network Intelligence and Analysis Lab with a focus on clustering methods via the EM algorithm.
Machine Learning concepts including training data and unsupervised learning, aiming to find hidden structures in data.
Definition and purpose of clustering, including grouping similar objects without prior labels. Applications mentioned: Customer Segmentation, Image Segmentation.
Introduction to K-means clustering, optimization methods outlined.
Objective of K-means to minimize intra-cluster distance, involves mathematical formulation of distance with data points.
Detailed iterative optimization steps in K-means clustering to achieve convergence.
Explanation of biconvex optimization as a method to solve optimization problems in clustering.
Continuation of K-means optimization processes detailing the implications and derivatives.
Advantages of K-means, such as ease of implementation, and disadvantages including local optimum convergence and distance evaluation issues.
Introduction to the Mixture of Gaussians and the EM algorithm for Gaussian mixtures.
Description of Gaussian components in mixtures including mean vectors and covariance matrices.
Probability density functions of Gaussian Mixture Model (GMM), discussing components and parameters.
Details about using mixture models for clustering, covering mixtures' structure and data representation.
Explanation of maximum likelihood estimation challenges and the application of EM algorithm in Gaussian mixtures.
Overview of the EM algorithm procedure, including E and M steps for optimizing parameters.
Re-examination of K-means through the lens of the EM algorithm's operations and optimization.
Exploration of latent variables in Gaussian mixtures, including Bernoulli random variable notation.
Description of joint distributions and their relevance to marginal distributions in Gaussian mixtures.
Conditional probabilities from Bayes’ theorem in the context of EM algorithm for Gaussian mixtures.
Discussion on log-likelihood functions and how to derive estimates for Gaussian mixture means.
Maximizing likelihood with respect to Gaussian components' parameters during the M-step of EM.
Iterative optimization process for estimating parameters in GMM via responsibilities evaluation.
Steps to implement EM algorithm for GMM, detailing initialization and iterative parameter updates.
Theoretical relationship between K-means algorithm and GMM under certain conditions, highlighting their equivalence.
Network Intelligence and Analysis Lab
Clustering methods via EM algorithm
2014.07.10
Sanghyuk Chun
• Machine Learning
• Training data
• Learning model
• Unsupervised Learning
• Training data without labels
• Input data: $D = \{x_1, x_2, \ldots, x_N\}$
• Most unsupervised learning problems try to find hidden structure in unlabeled data
• Examples: Clustering, Dimensionality Reduction (PCA, LDA), …
Machine Learning and Unsupervised Learning
• Clustering
• Grouping objects in such a way that objects in the same group are more similar to each other than to objects in other groups
• Input: a set of objects (or data) without group information
• Output: a cluster index for each object
• Usage: Customer Segmentation, Image Segmentation, …
Unsupervised Learning and Clustering
[Diagram: Input → Clustering Algorithm → Output]
• Intuition: data in the same cluster are closer to each other than to data in other clusters
• Goal: minimize the distance between data in the same cluster
• Objective function:
• $J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
• where $N$ is the number of data points and $K$ is the number of clusters
• $r_{nk} \in \{0,1\}$ is an indicator variable describing which of the $K$ clusters the data point $\mathbf{x}_n$ is assigned to
• $\boldsymbol{\mu}_k$ is a prototype associated with the $k$-th cluster
• Eventually $\boldsymbol{\mu}_k$ is the same as the center (mean) of cluster $k$ (a small code sketch of $J$ follows this slide)
K-means Clustering
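As a concrete illustration of the objective $J$, here is a minimal NumPy sketch; the names `kmeans_objective`, `X`, `mu`, and `assign`, and the toy data, are my own and not from the slides.

```python
import numpy as np

def kmeans_objective(X, mu, assign):
    """Evaluate J = sum_n sum_k r_nk * ||x_n - mu_k||^2.

    X      : (N, d) data points
    mu     : (K, d) cluster prototypes
    assign : (N,)   index of the cluster each point is assigned to
    """
    # r_nk is 1 only for k = assign[n], so we just pick the assigned prototype
    diffs = X - mu[assign]            # residuals x_n - mu_{assign[n]}, shape (N, d)
    return float(np.sum(diffs ** 2))  # squared Euclidean distances, summed

# Hypothetical usage with toy data
X = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.1]])
mu = np.array([[0.05, 0.1], [4.0, 4.0]])
print(kmeans_objective(X, mu, np.array([0, 0, 1])))
```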
• Objective function:
• $\arg\min_{\{r_{nk},\,\boldsymbol{\mu}_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
• This objective can be minimized through an iterative procedure
• Step 1: minimize $J$ with respect to the $r_{nk}$, keeping $\boldsymbol{\mu}_k$ fixed
• Step 2: minimize $J$ with respect to the $\boldsymbol{\mu}_k$, keeping $r_{nk}$ fixed
• Repeat Steps 1 and 2 until convergence
• Does it always converge?
K-means Clustering – Optimization
• Biconvex optimization is a generalization of convex optimization in which the objective function and the constraint set can be biconvex
• $f(x, y)$ is biconvex if, for fixed $x$, $f_x(y) = f(x, y)$ is convex over $Y$ and, for fixed $y$, $f_y(x) = f(x, y)$ is convex over $X$
• One way to solve a biconvex optimization problem is to iteratively solve the corresponding convex problems
• This does not guarantee the global optimum
• But it always converges to some local optimum
Optional – Biconvex optimization
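For a concrete example of the distinction (a standard textbook example, not from the slides): $f(x, y) = xy$ on $\mathbb{R} \times \mathbb{R}$ is biconvex, since for fixed $x$ it is affine (hence convex) in $y$ and vice versa, yet it is not jointly convex because its Hessian $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ has eigenvalues $\pm 1$. The K-means objective behaves analogously: it is linear in the (relaxed) assignments $r_{nk}$ for fixed $\boldsymbol{\mu}_k$ and convex quadratic in $\boldsymbol{\mu}_k$ for fixed $r_{nk}$, but not jointly convex, which is why the alternating scheme only guarantees a local optimum.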
• $\arg\min_{\{r_{nk},\,\boldsymbol{\mu}_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
• Step 1: minimize $J$ with respect to the $r_{nk}$, keeping $\boldsymbol{\mu}_k$ fixed
• $r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases}$
• Step 2: minimize $J$ with respect to the $\boldsymbol{\mu}_k$, keeping $r_{nk}$ fixed
• Setting the derivative with respect to $\boldsymbol{\mu}_k$ to zero gives
• $2\sum_n r_{nk}(\mathbf{x}_n - \boldsymbol{\mu}_k) = 0$
• $\boldsymbol{\mu}_k = \dfrac{\sum_n r_{nk}\,\mathbf{x}_n}{\sum_n r_{nk}}$
• $\boldsymbol{\mu}_k$ is equal to the mean of all the data assigned to cluster $k$ (a code sketch of the full iteration follows this slide)
K-means Clustering – Optimization
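Putting the two steps together, a minimal NumPy sketch of the alternating procedure might look as follows; the function name, the convergence check on $J$, and the handling of empty clusters are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def kmeans(X, mu, n_iters=100, tol=1e-6):
    """Alternate Step 1 (assignments r_nk) and Step 2 (means mu_k) until J stops improving.

    X  : (N, d) data matrix
    mu : (K, d) initial prototypes, e.g. K randomly chosen data points
    """
    mu = mu.astype(float).copy()
    prev_J = np.inf
    for _ in range(n_iters):
        # Step 1: r_nk = 1 for the nearest prototype, 0 otherwise
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        assign = dists.argmin(axis=1)                                # (N,) hard assignments

        # Step 2: mu_k = mean of the points currently assigned to cluster k
        for k in range(mu.shape[0]):
            members = X[assign == k]
            if len(members) > 0:      # keep the old prototype if the cluster is empty
                mu[k] = members.mean(axis=0)

        J = dists[np.arange(len(X)), assign].sum()  # objective after Step 1
        if prev_J - J < tol:                        # J is non-increasing, so this detects convergence
            break
        prev_J = J
    return mu, assign
```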
• Advantages of K-means clustering
• Easy to implement (kmeans in Matlab, kcluster in Python)
• In practice, it works well
• Disadvantages of K-means clustering
• It can converge to a local optimum
• Computing the Euclidean distance for every point is expensive
• Solution: Batch K-means
• Euclidean distance is not robust to outliers
• Solution: K-medoids algorithms (use a different metric)
K-means Clustering – Conclusion
Mixture of Gaussians
Mixture Model
EM Algorithm
EM for Gaussian Mixtures
• Assumption: there are $k$ components $\{c_i\}_{i=1}^{k}$
• Component $c_i$ has an associated mean vector $\mu_i$
• Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$
Mixture of Gaussians
[Figure: five Gaussian components with means $\mu_1, \ldots, \mu_5$]
• Represent the model as a linear combination of Gaussians
• Probability density function of a GMM (a code sketch follows this slide):
• $p(x) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k)$
• $\mathcal{N}(x \mid \mu_k, \Sigma_k) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right\}$
• This is called a mixture of Gaussians or a Gaussian Mixture Model
• Each Gaussian density is called a component of the mixture and has its own mean $\mu_k$ and covariance $\Sigma_k$
• The parameters $\pi_k$ are called mixing coefficients ($\sum_k \pi_k = 1$)
Gaussian Mixture Model
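The density above can be evaluated directly; here is a small sketch, assuming SciPy is available (the helper name `gmm_pdf` and the toy two-component parameters are illustrative).

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

# Hypothetical two-component mixture in two dimensions
pis    = [0.3, 0.7]
mus    = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_pdf(np.array([1.0, 1.0]), pis, mus, Sigmas))
```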
• $p(x) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k)$, where $\sum_k \pi_k = 1$
• Input:
• The training set: $\{x_i\}_{i=1}^{N}$
• Number of clusters: $k$
• Goal: model this data using a mixture of Gaussians
• Mixing coefficients $\pi_1, \pi_2, \ldots, \pi_k$
• Means and covariances: $\mu_1, \mu_2, \ldots, \mu_k$; $\Sigma_1, \Sigma_2, \ldots, \Sigma_k$
Clustering using Mixture Model
• $p(x \mid G) = p(x \mid \pi_1, \mu_1, \ldots) = \sum_i p(x \mid c_i)\,p(c_i) = \sum_i \pi_i\,\mathcal{N}(x \mid \mu_i, \Sigma_i)$
• $p(x_1, x_2, \ldots, x_N \mid G) = \prod_i p(x_i \mid G)$
• The log-likelihood function is given by
• $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$
• Goal: find the parameters which maximize the log-likelihood
• Problem: the maximum likelihood solution is hard to compute directly, because the sum over components sits inside the logarithm and no closed-form solution exists
• Solution: use the EM algorithm (a sketch of evaluating the log-likelihood follows this slide)
Maximum Likelihood of GMM
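For reference, the log-likelihood can be evaluated numerically before worrying about maximizing it; a minimal sketch (names are illustrative) that uses `scipy.special.logsumexp` for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    # log_probs[n, k] = ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    # log-sum-exp over components, then sum over the N data points
    return logsumexp(log_probs, axis=1).sum()
```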
• The EM algorithm is an iterative procedure for finding the MLE
• An expectation (E) step creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters
• A maximization (M) step computes parameters maximizing the expected log-likelihood found in the E step
• These parameter estimates are then used to determine the distribution of the latent variables in the next E step
• EM always converges to one of the local optima
EM (Expectation Maximization) Algorithm
• $\arg\min_{\{r_{nk},\,\boldsymbol{\mu}_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
• E-step: minimize $J$ with respect to the $r_{nk}$, keeping $\boldsymbol{\mu}_k$ fixed
• $r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases}$
• M-step: minimize $J$ with respect to the $\boldsymbol{\mu}_k$, keeping $r_{nk}$ fixed
• $\boldsymbol{\mu}_k = \dfrac{\sum_n r_{nk}\,\mathbf{x}_n}{\sum_n r_{nk}}$
K-means revisited: EM and K-means
• Let $z_k$ be a Bernoulli random variable with probability $\pi_k$
• $p(z_k = 1) = \pi_k$, where $\sum_k z_k = 1$ and $\sum_k \pi_k = 1$
• Because $z$ uses a 1-of-K representation, its distribution can be written in the form
• $p(z) = \prod_{k=1}^{K} \pi_k^{z_k}$
• Similarly, the conditional distribution of $x$ given a particular value of $z$ is a Gaussian
• $p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}$ (a sampling sketch based on this view follows this slide)
Latent variable for GMM
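The latent-variable view also gives a generative recipe: draw the 1-of-K variable $z$ from $p(z)$, then draw $x$ from the selected Gaussian (ancestral sampling). A minimal sketch, with illustrative names and toy parameters:

```python
import numpy as np

def sample_gmm(n, pis, mus, Sigmas, seed=0):
    """Ancestral sampling: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(pis), size=n, p=pis)   # sample the latent 1-of-K component index
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return X, ks

# Hypothetical usage: 5 points from a two-component mixture in 2-D
X, z = sample_gmm(5, [0.3, 0.7],
                  [np.zeros(2), np.array([3.0, 3.0])],
                  [np.eye(2), np.eye(2)])
```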
• The joint distribution is given by $p(x, z) = p(z)\,p(x \mid z)$
• $p(x) = \sum_z p(z)\,p(x \mid z) = \sum_k \pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k)$
• Thus the marginal distribution of $x$ is a Gaussian mixture of the above form
• Now we are able to work with the joint distribution instead of the marginal distribution
• Graphical representation of a GMM for a set of $N$ i.i.d. data points $\{x_n\}$ with corresponding latent variables $\{z_n\}$, where $n = 1, \ldots, N$
Latent variable for GMM
[Plate diagram: latent $\mathbf{z}_n$ and observed $\mathbf{x}_n$ inside a plate over the $N$ points, with parameters $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ outside the plate]
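Spelling out the marginalization step: the sum over $z$ runs over the $K$ one-hot vectors, and in each term exactly one $z_k$ equals 1, so

$$p(x) = \sum_z p(z)\,p(x \mid z) = \sum_z \prod_{k=1}^{K} \bigl[\pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k)\bigr]^{z_k} = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k).$$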
• Conditional probability of $z$ given $x$
• From Bayes' theorem,
• $\gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x}) = \dfrac{p(z_k = 1)\,p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\,p(\mathbf{x} \mid z_j = 1)} = \dfrac{\pi_k\,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
• $\gamma(z_k)$ can also be viewed as the responsibility that component $k$ takes for ‘explaining’ the observation $\mathbf{x}$
EM for Gaussian Mixtures (E-step)
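As a quick worked example with made-up one-dimensional numbers (not from the slides): take $K = 2$, $\pi_1 = \pi_2 = 0.5$, unit variances, $\mu_1 = 0$, $\mu_2 = 4$, and an observation $x = 1$. Then

$$\gamma(z_1) = \frac{0.5\,\mathcal{N}(1 \mid 0, 1)}{0.5\,\mathcal{N}(1 \mid 0, 1) + 0.5\,\mathcal{N}(1 \mid 4, 1)} = \frac{e^{-1/2}}{e^{-1/2} + e^{-9/2}} = \frac{1}{1 + e^{-4}} \approx 0.982,$$

so component 1 takes almost all of the responsibility for explaining $x$.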
• Likelihood function for GMM:
• $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$
• Setting the derivatives of the log-likelihood with respect to the means $\mu_k$ of the Gaussian components to zero, we obtain
• $\mu_k = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{x}_n$, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$
EM for Gaussian Mixtures (M-step)
• Setting the derivative of the log-likelihood with respect to $\Sigma_k$ to zero, we obtain
• $\boldsymbol{\Sigma}_k = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top$
• To maximize the likelihood with respect to the mixing coefficients $\pi_k$, we use a Lagrange multiplier and maximize
• $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$
• which gives $\pi_k = \dfrac{N_k}{N}$
EM for Gaussian Mixtures (M-step)
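Filling in the intermediate step for the mixing coefficients: setting the derivative of the Lagrangian with respect to $\pi_k$ to zero gives

$$\sum_{n=1}^{N} \frac{\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j} \pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} + \lambda = 0.$$

Multiplying both sides by $\pi_k$ and summing over $k$, using $\sum_k \pi_k = 1$ and $\sum_k \gamma(z_{nk}) = 1$, gives $\lambda = -N$; substituting back yields $\pi_k = N_k / N$.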
• $\mu_k$, $\Sigma_k$, and $\pi_k$ do not constitute a closed-form solution for the parameters of the mixture model, because the responsibilities $\gamma(z_{nk})$ depend on those parameters in a complex way:
• $\gamma(z_{nk}) = \dfrac{\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
• In the EM algorithm for a GMM, $\gamma(z_{nk})$ and the parameters are optimized iteratively
• In the E step, the responsibilities (the posterior probabilities) are evaluated using the current values of the parameters
• In the M step, the means, covariances, and mixing coefficients are re-estimated using the responsibilities from the E step
EM for Gaussian Mixtures
• Initialize the means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $\pi_k$, and evaluate the initial value of the log-likelihood
• E step: evaluate the responsibilities using the current parameters
• $\gamma(z_{nk}) = \dfrac{\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
• M step: re-estimate the parameters using the current responsibilities
• $\mu_k^{\text{new}} = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{x}_n$
• $\boldsymbol{\Sigma}_k^{\text{new}} = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^\top$
• $\pi_k^{\text{new}} = \dfrac{N_k}{N}$, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$
• Repeat the E step and M step until convergence (a code sketch follows this slide)
EM for Gaussian Mixtures
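A compact NumPy/SciPy sketch of the loop just described; initializing the means at randomly chosen data points, the small diagonal regularization, and the log-likelihood stopping rule are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=200, tol=1e-6, seed=0):
    """EM for a Gaussian mixture: alternate the E step (responsibilities) and M step."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()   # means at random data points
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    prev_ll = -np.inf

    for _ in range(n_iters):
        # E step: log responsibilities, gamma(z_nk) proportional to pi_k N(x_n | mu_k, Sigma_k)
        log_r = np.column_stack([
            np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
            for k in range(K)
        ])
        ll = logsumexp(log_r, axis=1).sum()                 # current log-likelihood
        gamma = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))

        # M step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)                              # effective counts N_k
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pis = Nk / N

        if ll - prev_ll < tol:                              # stop when the log-likelihood stalls
            break
        prev_ll = ll
    return pis, mus, Sigmas, gamma
```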
• We can derive the K-means algorithm as a particular limit of EM for the Gaussian mixture model
• Consider a Gaussian mixture model whose covariance matrices are all given by $\varepsilon I$, where $\varepsilon$ is a shared variance parameter and $I$ is the identity matrix
• If we consider the limit $\varepsilon \to 0$, the expected complete-data log-likelihood of the GMM becomes
• $\mathbb{E}_z\!\left[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})\right] \to -\dfrac{1}{2} \sum_n \sum_k r_{nk}\,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 + C$
• Thus, we see that in this limit, maximizing the expected complete-data log-likelihood is equivalent to the K-means algorithm
Relationship between K-means algorithm and GMM
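To see why, write the responsibilities under the shared covariance $\varepsilon I$:

$$\gamma(z_{nk}) = \frac{\pi_k \exp\bigl(-\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 / 2\varepsilon\bigr)}{\sum_j \pi_j \exp\bigl(-\|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 / 2\varepsilon\bigr)} \;\longrightarrow\; r_{nk} \quad \text{as } \varepsilon \to 0,$$

so each point's responsibility concentrates entirely on the component with the nearest mean, which is exactly the hard assignment used by K-means.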