This document discusses cluster analysis and various clustering algorithms. It begins with an overview of supervised and unsupervised learning, as well as generative models. It then discusses five common clustering techniques: partitioning, hierarchical, density-based, grid-based, and model-based clustering. The document also covers challenges in cluster analysis, such as centroid initialization, outlier handling, categorical data, the curse of dimensionality, and computational complexity. Specific clustering algorithms discussed in more detail include K-Means, K-Medoids, K-Modes, mini-batch K-Means, and scalable K-Means++.
33. Problem 1 :
Centroid Initialization
(Figure: a good centroid initialization vs. a bad one on the same data.)
No clustering algorithm can guarantee the best clustering result:
with a bad initialization, K-Means converges to a poor local optimum.
36. K-Means++
Start : randomly pick the first centroid c1.
(Figure: data points d1 … d5 with the chosen centroid c1.)
Calculate the distance from each data point to its nearest center:

d1    d2    …    d3    d4    …    d5
1     1     …    5     5     …    7

Normalize all distances into a probability distribution:
• $D = d_1^2 + d_2^2 + \dots + d_n^2$
• $P_i = d_i^2 / D$
• $\sum_i P_i = 1$

P1    P2    …    P3    P4    …    P5
0.05  0.05  …    0.1   0.1   …    0.15
37. K-Means++ (continued)
Pick a new centroid by sampling from that distribution:
• Draw X = rand(0, 1).
• Find the first index j such that P1 + … + Pj > X
(equivalently, P1 + … + P(j-1) <= X).
• dj becomes the new centroid c2.
38. K-Means++ (continued)
While the number of centroids < K, repeat: calculate the distance of each
point to its nearest center, normalize the distances, and pick a new centroid.
When the number of centroids = K, seeding is done: run the usual K-Means
assignment & update steps starting from these centroids.
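A minimal NumPy sketch of this seeding procedure, assuming Euclidean data; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """K-Means++ seeding: sample each new centroid with probability
    proportional to the squared distance to its nearest chosen center."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Randomly pick the first centroid.
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Normalize into a probability distribution (P_i = d_i^2 / D).
        p = d2 / d2.sum()
        # Inverse-CDF sampling: draw X ~ U(0,1) and take the first index
        # whose cumulative probability reaches it.
        x = rng.random()
        j = np.searchsorted(np.cumsum(p), x)
        centers.append(X[j])
    return np.array(centers)
```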
42. K-Medoids (PAM)
• The centroid update is slower than in K-Means.
• Every data point is a medoid candidate; pick the point that gives the
smallest within-cluster distance sum.
• Pre-compute the distance matrix.
• Instead of recomputing all cluster centers, evaluate candidate swaps
until the cost no longer decreases (see the sketch below).
• Because medoids must be actual data points, K-Medoids is less sensitive
to outliers and can sometimes produce better clusters than K-Means on
noisy data.
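A compact sketch of the PAM swap loop over a pre-computed distance matrix; this is one common reading of the procedure above, with illustrative names:

```python
import numpy as np

def k_medoids(D, k, max_iter=100, rng=None):
    """PAM-style K-Medoids on a pre-computed n x n distance matrix D.
    Try medoid swaps and keep one only when it lowers the total
    within-cluster distance; stop when the cost no longer decreases."""
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    cost = lambda meds: D[:, meds].min(axis=1).sum()
    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                  # try swapping each medoid...
            for cand in range(n):           # ...with each non-medoid point
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand
                c = cost(trial)
                if c < best:                # keep only improving swaps
                    medoids, best, improved = trial, c, True
        if not improved:                    # cost no longer decreases
            break
    return medoids, D[:, medoids].argmin(axis=1)
```

Each full pass evaluates O(k * (n - k)) candidate swaps, which is why the update step is slower than the simple mean update in K-Means.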
44.–45. Problem 3 :
Categorical Data
For example, suppose we want to find clusters in patient records…
• If we create one feature for the total number of diseases,
how do we define its value and a distance on it?
• If we create a separate feature for each disease,
how many features will we end up with,
and how do we handle the resulting imbalance problem?
46.–47. K-Modes
Data (translated from the original Chinese table):

Id  Gender  Blood type  Education
1   M       A           University
2   M       B           High school
3   F       A           University
4   F       B           University
5   M       O           University
6   F       O           High school
7   F       B           Graduate school

Step 1 : randomly select two records as the initial modes (k = 2):

Mode id  Gender  Blood type  Education
1        M       A           University
2        M       B           High school
48. K-Modes
Step 2-1 : use the Hamming distance (number of mismatching attributes)
to assign each record to the nearer mode:

Id  Gender  Blood type  Education        D1  D2  Assigned
1   M       A           University       0   2   M1
2   M       B           High school      2   0   M2
3   F       A           University       1   3   M1
4   F       B           University       2   2   M2
5   M       O           University       1   2   M1
6   F       O           High school      3   2   M2
7   F       B           Graduate school  3   2   M2
49. K-Modes
Step 2-2 : update each mode using the most frequent value of each
attribute within its cluster (see the sketch below).

Cluster 1:
Id  Gender  Blood type  Education
1   M       A           University
3   F       A           University
5   M       O           University

Cluster 2:
Id  Gender  Blood type  Education
2   M       B           High school
4   F       B           University
6   F       O           High school
7   F       B           Graduate school

Updated modes:
Mode id  Gender  Blood type  Education
1        M       A           University
2        F       B           High school
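A short sketch of the whole loop on the translated patient table; the encoding of categories as strings is illustrative:

```python
import numpy as np

def k_modes(X, k, max_iter=10, rng=None):
    """K-Modes sketch for categorical data: Hamming distance for
    assignment, per-attribute most frequent value for the mode update."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    modes = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Step 2-1: assign each record to the mode with the fewest mismatches.
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)  # Hamming
        labels = dist.argmin(axis=1)
        # Step 2-2: update each mode attribute to its most frequent value.
        new_modes = modes.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                for a in range(d):
                    vals, counts = np.unique(members[:, a], return_counts=True)
                    new_modes[j, a] = vals[counts.argmax()]
        if (new_modes == modes).all():
            break
        modes = new_modes
    return modes, labels

# The patient table from the slides, translated:
X = np.array([
    ["M", "A", "University"], ["M", "B", "HighSchool"],
    ["F", "A", "University"], ["F", "B", "University"],
    ["M", "O", "University"], ["F", "O", "HighSchool"],
    ["F", "B", "Graduate"],
])
modes, labels = k_modes(X, k=2, rng=0)
```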
50.–57. Problem 4 :
Curse of Dimensionality
(Figure series from
https://www.datasciencecentral.com/profiles/blogs/about-the-curse-of-dimensionality)
More features: is it easier to classify?
Not necessarily. An increase in the dimensionality of a data set requires
exponentially more data to produce a representative sample of that data set.
60. Clustering in
High-Dimensional Data
• Apply dimensionality reduction (see the sketch below)
• Projected clustering
• Subspace clustering
• Manifold learning
• Change the distance function
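One common recipe from the list above is dimensionality reduction before clustering. A hedged scikit-learn sketch; the dataset and parameter choices are illustrative:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic 100-dimensional data with 5 true clusters.
X, _ = make_blobs(n_samples=1000, n_features=100, centers=5, random_state=0)
# Project 100 -> 10 dimensions, then cluster in the reduced space.
X_low = PCA(n_components=10, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_low)
```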
61. Problem 5 :
Computation Complexity
Flow: Start → 1: Centroid initialization → 2-1: Data assignment →
2-2: Update centers → repeat 2-1 / 2-2 until converged → End.
• Average complexity : O(T * n * k * d), with T iterations, n points,
k clusters, and d dimensions.
• Finding the optimal K-Means solution is NP-hard.
• The classic K-Means algorithm is expensive for large data sets.
The Planar k-means Problem is NP-hard
How Slow is the k-Means Method?
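To make the O(T * n * k * d) count concrete, here is a plain Lloyd's-style loop mirroring the flow above (a sketch, not a production implementation):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, rng=None):
    """Plain K-Means loop. Each iteration does the 2-1 assignment
    (n * k * d distance work) and the 2-2 center update, so T iterations
    cost O(T * n * k * d)."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # 2-1: assign every point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # 2-2: move each center to the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # done: centers are stable
            break
        centers = new
    return centers, labels
```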
62.–64. Convergence Properties of the K-Means Algorithms
Batch Gradient Descent
1. Uses all examples in each iteration.
2. Converges more slowly than K-Means.
Stochastic Gradient Descent
1. Uses 1 example in each iteration.
2. Converges faster than K-Means.
3. While SGD converges quickly on large data sets, it finds lower-quality
solutions than the batch algorithm due to stochastic noise.
Mini-Batch Gradient Descent
1. Uses b examples in each iteration.
2. Converges faster than K-Means.
3. Solution quality is better than SGD K-Means.
65.–66. Mini Batch K-Means
Flow: Start → 1: Centroid initialization → repeat { 2-1: sample b data
points → 2-2: data assignment → 2-3: update centroids with a per-center
learning rate } → Done → End.
From "Web-Scale K-Means Clustering"; a sketch of the per-center update
follows below.
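A sketch of the per-center learning-rate update in the spirit of that paper; the batch size and the fixed iteration count are simplifications:

```python
import numpy as np

def minibatch_kmeans(X, k, b=100, n_iter=100, rng=None):
    """Mini-batch K-Means: per-center counts give each center a
    decaying learning rate, as in 'Web-Scale K-Means Clustering'."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), size=b)]            # 2-1: sample b points
        d2 = ((batch[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)                       # 2-2: assignment
        for x, c in zip(batch, labels):                  # 2-3: update
            counts[c] += 1
            eta = 1.0 / counts[c]                        # per-center rate
            centers[c] = (1 - eta) * centers[c] + eta * x
    return centers
```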
67.–68. Mini Batch K-Means
(Figure: clustering error vs. training CPU seconds for K = 3, K = 10, and
K = 50, on 781,265 examples from the RCV1 dataset; referenced from the
scikit-learn user guide.)
95. Average Silhouette Method
Compute the mean Silhouette Coefficient over all samples.
For each sample, the Silhouette Coefficient is calculated from
the mean intra-cluster distance (a) and
the mean nearest-cluster distance (b): s = (b - a) / max(a, b).
97. Average Silhouette Method
+1 : the sample is far from the neighboring clusters (well clustered).
0 : the sample is on or very close to the decision boundary
between two neighboring clusters.
-1 : the sample might have been assigned to the wrong cluster.
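A hedged scikit-learn sketch of using the mean silhouette to compare values of k; the data and the range of k are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean of s = (b - a) / max(a, b) over all samples; higher is better.
    print(k, silhouette_score(X, labels))
```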
108. Maximum Likelihood
(Figure: 1-D data points x1 … x7 under three candidate Gaussians:
mean = 10, std = 2; mean = 18, std = 5; mean = 25, std = 3.)
Given a single Gaussian and 1-dimensional data,
how do we find the best Gaussian parameters?
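For a single Gaussian the maximum-likelihood answer is closed form: the sample mean and the (biased) sample standard deviation. A tiny sketch with made-up data values, since the slide's x1 … x7 are not given:

```python
import numpy as np

x = np.array([8.5, 10.2, 11.1, 17.0, 19.3, 24.8, 25.6])  # illustrative data
mu_hat = x.mean()        # MLE of the mean
sigma_hat = x.std()      # ddof=0: the MLE, not the unbiased estimate
```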
120. Gaussian Mixture Model
Given k Gaussians and 1-dimensional data,
how do we find the best Gaussian parameters?
The problem is computationally difficult (NP-hard), so in practice we use
iterative optimization such as gradient ascent,
with no global-optimum guarantee; a sketch follows below.
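In practice the mixture is usually fit with the EM algorithm, which, like gradient ascent, only reaches a local optimum, so random restarts help. A scikit-learn sketch using the three Gaussians from the earlier figure as simulated ground truth (all parameters illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulate 1-D data from three Gaussians (means 10, 18, 25 as in the figure).
x = np.concatenate([rng.normal(10, 2, 100), rng.normal(18, 5, 100),
                    rng.normal(25, 3, 100)]).reshape(-1, 1)
# n_init restarts guard against bad local optima.
gmm = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(x)
print(gmm.means_.ravel(), np.sqrt(gmm.covariances_.ravel()))
```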