This document discusses cluster analysis and various clustering algorithms. It begins with an overview of supervised and unsupervised learning, as well as generative models. It then discusses five common clustering techniques: partitioning, hierarchical, density-based, grid-based, and model-based clustering. The document also covers challenges in cluster analysis, such as centroid initialization, outlier handling, categorical data, the curse of dimensionality, and computational complexity. Specific clustering algorithms discussed in more detail include K-Means, K-Medoids, K-Modes, mini-batch K-Means, and scalable K-Means++.
33. Problem 1 :
Centroid Initialization
(Figure: a good centroid initialization vs. a bad one on the same data.)
No clustering algorithm can guarantee the best clustering result:
with a bad initialization, K-Means converges to a poor local optimum.
36. K-Means++
Start : randomly pick the first centroid c1.
(Figure: data points d1 … d5 with the chosen centroid c1.)
Calculate the distance from each data point to its nearest center:

d1    d2    …    d3    d4    …    d5
1     1     …    5     5     …    7

Normalize all distances into a probability distribution:
• $D = d_1^2 + d_2^2 + \dots + d_n^2$
• $P_i = d_i^2 / D$
• $\sum_i P_i = 1$

P1    P2    …    P3    P4    …    P5
0.05  0.05  …    0.1   0.1   …    0.15
37. K-Means++ (continued)
Pick a new centroid by sampling from that distribution:
• Draw X = rand(0, 1).
• Find the first index j such that P1 + … + Pj > X
(equivalently, P1 + … + P(j-1) <= X).
• dj becomes the new centroid c2.
38. K-Means++ (continued)
While the number of centroids < K, repeat: calculate the distance of each
point to its nearest center, normalize the distances, and pick a new centroid.
When the number of centroids = K, seeding is done: run the usual K-Means
assignment & update steps starting from these centroids.
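A minimal NumPy sketch of this seeding procedure, assuming Euclidean data; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """K-Means++ seeding: sample each new centroid with probability
    proportional to the squared distance to its nearest chosen center."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Randomly pick the first centroid.
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Normalize into a probability distribution (P_i = d_i^2 / D).
        p = d2 / d2.sum()
        # Inverse-CDF sampling: draw X ~ U(0,1) and take the first index
        # whose cumulative probability reaches it.
        x = rng.random()
        j = np.searchsorted(np.cumsum(p), x)
        centers.append(X[j])
    return np.array(centers)
```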
42. K-Medoids (PAM)
• The centroid update is slower than in K-Means.
• Every data point is a medoid candidate; pick the point that gives the
smallest within-cluster distance sum.
• Pre-compute the distance matrix.
• Instead of recomputing all cluster centers, evaluate candidate swaps
until the cost no longer decreases (see the sketch below).
• Because medoids must be actual data points, K-Medoids is less sensitive
to outliers and can sometimes produce better clusters than K-Means on
noisy data.
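A compact sketch of the PAM swap loop over a pre-computed distance matrix; this is one common reading of the procedure above, with illustrative names:

```python
import numpy as np

def k_medoids(D, k, max_iter=100, rng=None):
    """PAM-style K-Medoids on a pre-computed n x n distance matrix D.
    Try medoid swaps and keep one only when it lowers the total
    within-cluster distance; stop when the cost no longer decreases."""
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    cost = lambda meds: D[:, meds].min(axis=1).sum()
    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                  # try swapping each medoid...
            for cand in range(n):           # ...with each non-medoid point
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand
                c = cost(trial)
                if c < best:                # keep only improving swaps
                    medoids, best, improved = trial, c, True
        if not improved:                    # cost no longer decreases
            break
    return medoids, D[:, medoids].argmin(axis=1)
```

Each full pass evaluates O(k * (n - k)) candidate swaps, which is why the update step is slower than the simple mean update in K-Means.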
44.–45. Problem 3 :
Categorical Data
For example, suppose we want to find clusters in patient records…
• If we create one feature for the total number of diseases,
how do we define its value and a distance on it?
• If we create a separate feature for each disease,
how many features will we end up with,
and how do we handle the resulting imbalance problem?
46.–47. K-Modes
Data (translated from the original Chinese table):

Id  Gender  Blood type  Education
1   M       A           University
2   M       B           High school
3   F       A           University
4   F       B           University
5   M       O           University
6   F       O           High school
7   F       B           Graduate school

Step 1 : randomly select two records as the initial modes (k = 2):

Mode id  Gender  Blood type  Education
1        M       A           University
2        M       B           High school
48. K-Modes
Step 2-1 : use the Hamming distance (number of mismatching attributes)
to assign each record to the nearer mode:

Id  Gender  Blood type  Education        D1  D2  Assigned
1   M       A           University       0   2   M1
2   M       B           High school      2   0   M2
3   F       A           University       1   3   M1
4   F       B           University       2   2   M2
5   M       O           University       1   2   M1
6   F       O           High school      3   2   M2
7   F       B           Graduate school  3   2   M2
49. K-Modes
Step 2-2 : update each mode using the most frequent value of each
attribute within its cluster (see the sketch below).

Cluster 1:
Id  Gender  Blood type  Education
1   M       A           University
3   F       A           University
5   M       O           University

Cluster 2:
Id  Gender  Blood type  Education
2   M       B           High school
4   F       B           University
6   F       O           High school
7   F       B           Graduate school

Updated modes:
Mode id  Gender  Blood type  Education
1        M       A           University
2        F       B           High school
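A short sketch of the whole loop on the translated patient table; the encoding of categories as strings is illustrative:

```python
import numpy as np

def k_modes(X, k, max_iter=10, rng=None):
    """K-Modes sketch for categorical data: Hamming distance for
    assignment, per-attribute most frequent value for the mode update."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    modes = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Step 2-1: assign each record to the mode with the fewest mismatches.
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)  # Hamming
        labels = dist.argmin(axis=1)
        # Step 2-2: update each mode attribute to its most frequent value.
        new_modes = modes.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                for a in range(d):
                    vals, counts = np.unique(members[:, a], return_counts=True)
                    new_modes[j, a] = vals[counts.argmax()]
        if (new_modes == modes).all():
            break
        modes = new_modes
    return modes, labels

# The patient table from the slides, translated:
X = np.array([
    ["M", "A", "University"], ["M", "B", "HighSchool"],
    ["F", "A", "University"], ["F", "B", "University"],
    ["M", "O", "University"], ["F", "O", "HighSchool"],
    ["F", "B", "Graduate"],
])
modes, labels = k_modes(X, k=2, rng=0)
```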
50.–57. Problem 4 :
Curse of Dimensionality
(Figure series from
https://www.datasciencecentral.com/profiles/blogs/about-the-curse-of-dimensionality)
More features: is it easier to classify?
Not necessarily. An increase in the dimensionality of a data set requires
exponentially more data to produce a representative sample of that data set.
60. Clustering in
High-Dimensional Data
• Apply dimensionality reduction (see the sketch below)
• Projected clustering
• Subspace clustering
• Manifold learning
• Change the distance function
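One common recipe from the list above is dimensionality reduction before clustering. A hedged scikit-learn sketch; the dataset and parameter choices are illustrative:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic 100-dimensional data with 5 true clusters.
X, _ = make_blobs(n_samples=1000, n_features=100, centers=5, random_state=0)
# Project 100 -> 10 dimensions, then cluster in the reduced space.
X_low = PCA(n_components=10, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_low)
```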
61. Problem 5 :
Computation Complexity
Flow: Start → 1: Centroid initialization → 2-1: Data assignment →
2-2: Update centers → repeat 2-1 / 2-2 until converged → End.
• Average complexity : O(T * n * k * d), with T iterations, n points,
k clusters, and d dimensions.
• Finding the optimal K-Means solution is NP-hard.
• The classic K-Means algorithm is expensive for large data sets.
The Planar k-means Problem is NP-hard
How Slow is the k-Means Method?
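To make the O(T * n * k * d) count concrete, here is a plain Lloyd's-style loop mirroring the flow above (a sketch, not a production implementation):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, rng=None):
    """Plain K-Means loop. Each iteration does the 2-1 assignment
    (n * k * d distance work) and the 2-2 center update, so T iterations
    cost O(T * n * k * d)."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # 2-1: assign every point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # 2-2: move each center to the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # done: centers are stable
            break
        centers = new
    return centers, labels
```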
62.–64. Convergence Properties of the K-Means Algorithms
Batch Gradient Descent
1. Uses all examples in each iteration.
2. Converges more slowly than K-Means.
Stochastic Gradient Descent
1. Uses 1 example in each iteration.
2. Converges faster than K-Means.
3. While SGD converges quickly on large data sets, it finds lower-quality
solutions than the batch algorithm due to stochastic noise.
Mini-Batch Gradient Descent
1. Uses b examples in each iteration.
2. Converges faster than K-Means.
3. Solution quality is better than SGD K-Means.
65.–66. Mini Batch K-Means
Flow: Start → 1: Centroid initialization → repeat { 2-1: sample b data
points → 2-2: data assignment → 2-3: update centroids with a per-center
learning rate } → Done → End.
From "Web-Scale K-Means Clustering"; a sketch of the per-center update
follows below.
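A sketch of the per-center learning-rate update in the spirit of that paper; the batch size and the fixed iteration count are simplifications:

```python
import numpy as np

def minibatch_kmeans(X, k, b=100, n_iter=100, rng=None):
    """Mini-batch K-Means: per-center counts give each center a
    decaying learning rate, as in 'Web-Scale K-Means Clustering'."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), size=b)]            # 2-1: sample b points
        d2 = ((batch[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)                       # 2-2: assignment
        for x, c in zip(batch, labels):                  # 2-3: update
            counts[c] += 1
            eta = 1.0 / counts[c]                        # per-center rate
            centers[c] = (1 - eta) * centers[c] + eta * x
    return centers
```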
67.–68. Mini Batch K-Means
(Figure: clustering error vs. training CPU seconds for K = 3, K = 10, and
K = 50, on 781,265 examples from the RCV1 dataset; referenced from the
scikit-learn user guide.)
95. Average Silhouette Method
Compute the mean Silhouette Coefficient over all samples.
For each sample, the Silhouette Coefficient is calculated from
the mean intra-cluster distance (a) and
the mean nearest-cluster distance (b): s = (b - a) / max(a, b).
97. Average Silhouette Method
+1 : the sample is far from the neighboring clusters (well clustered).
0 : the sample is on or very close to the decision boundary
between two neighboring clusters.
-1 : the sample might have been assigned to the wrong cluster.
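A hedged scikit-learn sketch of using the mean silhouette to compare values of k; the data and the range of k are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean of s = (b - a) / max(a, b) over all samples; higher is better.
    print(k, silhouette_score(X, labels))
```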
108. Maximum Likelihood
(Figure: 1-D data points x1 … x7 under three candidate Gaussians:
mean = 10, std = 2; mean = 18, std = 5; mean = 25, std = 3.)
Given a single Gaussian and 1-dimensional data,
how do we find the best Gaussian parameters?
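For a single Gaussian the maximum-likelihood answer is closed form: the sample mean and the (biased) sample standard deviation. A tiny sketch with made-up data values, since the slide's x1 … x7 are not given:

```python
import numpy as np

x = np.array([8.5, 10.2, 11.1, 17.0, 19.3, 24.8, 25.6])  # illustrative data
mu_hat = x.mean()        # MLE of the mean
sigma_hat = x.std()      # ddof=0: the MLE, not the unbiased estimate
```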
120. Gaussian Mixture Model
Given k Gaussians and 1-dimensional data,
how do we find the best Gaussian parameters?
The problem is computationally difficult (NP-hard), so in practice we use
iterative optimization such as gradient ascent,
with no global-optimum guarantee; a sketch follows below.
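In practice the mixture is usually fit with the EM algorithm, which, like gradient ascent, only reaches a local optimum, so random restarts help. A scikit-learn sketch using the three Gaussians from the earlier figure as simulated ground truth (all parameters illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulate 1-D data from three Gaussians (means 10, 18, 25 as in the figure).
x = np.concatenate([rng.normal(10, 2, 100), rng.normal(18, 5, 100),
                    rng.normal(25, 3, 100)]).reshape(-1, 1)
# n_init restarts guard against bad local optima.
gmm = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(x)
print(gmm.means_.ravel(), np.sqrt(gmm.covariances_.ravel()))
```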