Machine Learning
Computer Science Department, Faculty of Computer and Information System, Islamic University of Madinah, Madinah, KSA
Computer Science Department, Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt.
K-Means Clustering Simply
Dr. Emad Nabil
The Lloyd Algorithm
Faculty of computers and information systems
Slides are compiled from many resources, thanks to those who made their slides available online.
Most of the slides are by Andrew Ng
Andrew Ng
Supervised learning
Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
Unsupervised learning
Training set: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$
Applications of clustering
Organize computing clusters
Social network analysis
Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison)
Astronomical data analysis
Market segmentation
Clustering: K-means algorithm
[Figure sequence: random initialization of cluster centroids, followed by repeated rounds of (1) the cluster assignment step and (2) the move centroid step. When a round brings no enhancement, stop.]
K-means algorithm
Input:
- $K$ (number of clusters)
- Training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$
$x^{(i)} \in \mathbb{R}^n$ (drop the $x_0 = 1$ convention)
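Data set description example
Each example $x^{(i)}$ has features $f_1, f_2, \dots, f_n$, a cluster index $c^{(i)}$, and an assigned centroid $\mu_{c^{(i)}}$:

Example i   Cluster index   $c^{(i)}$         $\mu_{c^{(i)}}$
1           1               $c^{(1)} = 1$     $(v_1, v_2, \dots, v_n)$
2           1               $c^{(2)} = 1$     $(v_1, v_2, \dots, v_n)$
3           2               $c^{(3)} = 2$     …
4           1               $c^{(4)} = 1$     $(v_1, v_2, \dots, v_n)$
…           …               …                 …
m

$1 \le c^{(i)} \le K$, where $c^{(i)}$ is the index of the centroid assigned to data example $x^{(i)}$.
Say $K = 5$; then we have 5 centroids.
For the cluster with $c^{(i)} = 1$ (examples 1, 2, and 4 above):
$\mu_1 = \frac{x^{(1)} + x^{(2)} + x^{(4)}}{3} = (v_1, v_2, \dots, v_n) \in \mathbb{R}^n$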
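A small worked version of the centroid computation for cluster 1 may help. This is only a sketch: the feature values are made up for illustration (the slide gives none), and the helper name `centroid` is hypothetical.

```python
# Examples 1, 2 and 4 are assigned to cluster 1, example 3 to cluster 2.
# The 2-feature values below are invented purely for illustration.
points = {1: (1.0, 2.0), 2: (3.0, 4.0), 3: (10.0, 10.0), 4: (2.0, 0.0)}
assignment = {1: 1, 2: 1, 3: 2, 4: 1}  # c(i) for each example i


def centroid(cluster, points, assignment):
    """Mean of all points whose cluster index equals `cluster`."""
    members = [points[i] for i in points if assignment[i] == cluster]
    n_features = len(members[0])
    return tuple(sum(p[j] for p in members) / len(members) for j in range(n_features))


mu_1 = centroid(1, points, assignment)
print(mu_1)  # (2.0, 2.0), i.e. (x1 + x2 + x4) / 3
```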
Clustering: Optimization objective
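K-means algorithm
Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n$
Repeat {
    for $i$ = 1 to $m$
        $c^{(i)}$ := index (from 1 to $K$) of cluster centroid closest to $x^{(i)}$
    for $k$ = 1 to $K$
        $\mu_k$ := average (mean) of points assigned to cluster $k$
}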
Step 1 (cluster assignment step): for each $x^{(i)}$, loop over the $K$ centroids to find the nearest one. Many distance measures may be used; here we use squared Euclidean distance:
$c^{(i)} := \arg\min_{k} \|x^{(i)} - \mu_k\|^2$
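The cluster assignment step can be sketched in plain Python as follows. The function names are hypothetical, and the code uses 0-based cluster indices where the slides count from 1.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign_clusters(points, centroids):
    """Return c, where c[i] is the 0-based index of the centroid nearest to points[i]."""
    return [
        min(range(len(centroids)), key=lambda k: squared_euclidean(x, centroids[k]))
        for x in points
    ]


points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 10.0)]
centroids = [(0.0, 1.0), (8.0, 8.0)]
c = assign_clusters(points, centroids)
print(c)  # [0, 0, 1, 1]
```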
Step 2 (move centroid step): for each cluster $k$, set $\mu_k$ to the average (mean) of the points currently assigned to cluster $k$.
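A minimal sketch of the move centroid step, again with hypothetical names and 0-based cluster indices. The slides do not say what to do with a cluster that receives no points; leaving its centroid where it is counts as an assumption made here.

```python
def move_centroids(points, c, centroids):
    """Return new centroids: each one is the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [x for x, ci in zip(points, c) if ci == k]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        # zip(*members) yields one tuple per feature; average each feature.
        new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
c = [0, 0, 1, 1]
moved = move_centroids(points, c, [(0.0, 0.0), (10.0, 10.0)])
print(moved)  # [(0.0, 1.0), (10.0, 11.0)]
```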
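K-means optimization objective
$c^{(i)}$ = index of cluster (1, 2, …, $K$) to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
$\mu_{c^{(i)}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned
Optimization objective:
$J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \|x^{(i)} - \mu_{c^{(i)}}\|^2$
$\min_{c^{(1)}, \dots, c^{(m)},\, \mu_1, \dots, \mu_K} J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$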
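The objective (often called the distortion) is commonly taken to be the mean squared distance between each example and its assigned centroid. The sketch below assumes that definition; the function names are hypothetical and cluster indices are 0-based.

```python
def squared_euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def distortion(points, c, centroids):
    """Mean squared distance from each point to the centroid it is assigned to."""
    total = sum(squared_euclidean(x, centroids[ci]) for x, ci in zip(points, c))
    return total / len(points)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(0.0, 1.0), (10.0, 11.0)]

good = distortion(points, [0, 0, 1, 1], centroids)  # each point sits next to its centroid
bad = distortion(points, [0, 1, 0, 1], centroids)   # two points sent to the far centroid
print(good, bad)  # 1.0 91.0
```

A lower value means the assignment fits the centroids better, which is why the two K-means steps can each be read as reducing this cost.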
Clustering: Random initialization
Random initialization
Should have $K < m$.
Randomly pick $K$ training examples.
Set $\mu_1, \dots, \mu_K$ equal to these $K$ examples.
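Picking the initial centroids can be sketched with the standard library's `random.sample`, which draws without replacement. The function name `init_centroids` and the optional `rng` parameter are hypothetical.

```python
import random


def init_centroids(points, k, rng=None):
    """Pick k distinct training examples to serve as the initial centroids."""
    if not 0 < k < len(points):
        raise ValueError("need 0 < K < m")
    if rng is None:
        rng = random
    return rng.sample(points, k)


points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
initial = init_centroids(points, 2, random.Random(0))  # seeded for repeatability
print(initial)
```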
[Figure: Local optima example 1. One initialization reaches the optimal clustering; another ends in a local optimum, with no enhancement in the objective function over iterations.]
[Figure: Local optima example 2. Further initializations that end in local optima, with no enhancement in the objective function over iterations.]
Random initialization solution: run K-means many times and pick the clustering that gave the lowest cost.
For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get $c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K$.
    Compute cost function (distortion) $J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$.
}
Pick the clustering that gave the lowest cost $J$.
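The restart loop can be sketched end to end in plain Python. Everything here is illustrative: the names `kmeans_once` and `best_of_restarts` are hypothetical, stopping when assignments no longer change is one common criterion, and leaving an empty cluster's centroid in place is an assumption.

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_once(points, k, rng, max_iters=100):
    """One K-means run from a random initialization; returns (cost, c, centroids)."""
    centroids = rng.sample(points, k)  # K distinct training examples
    c = None
    for _ in range(max_iters):
        # Cluster assignment step.
        new_c = [min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in points]
        if new_c == c:  # nothing changed, so no further improvement
            break
        c = new_c
        # Move centroid step (empty clusters keep their centroid).
        for j in range(k):
            members = [x for x, ci in zip(points, c) if ci == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    cost = sum(sq_dist(x, centroids[ci]) for x, ci in zip(points, c)) / len(points)
    return cost, c, centroids


def best_of_restarts(points, k, n_restarts=100, seed=0):
    """Run K-means n_restarts times and keep the run with the lowest distortion."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_restarts)), key=lambda r: r[0])


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cost, c, centroids = best_of_restarts(points, k=2, n_restarts=20)
print(cost, c)
```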
Clustering: Choosing the number of clusters
What is the right value of K?
Choosing the value of K
Elbow method
[Figure: cost function $J$ plotted against $K$ (no. of clusters); the curve bends at the "elbow".]
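The elbow is usually read off the plot by eye. As a rough numerical stand-in, the sketch below picks the K where the cost curve bends most sharply (largest second difference). Both the heuristic and the cost values are assumptions for illustration; neither comes from the slides, and in practice the costs would be the distortion of the best K-means run for each K.

```python
def elbow_k(costs):
    """costs maps K -> distortion for consecutive K; return the K with the sharpest bend."""
    ks = sorted(costs)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Drop gained by reaching K, minus the drop gained by going one further.
        bend = (costs[prev_k] - costs[k]) - (costs[k] - costs[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Made-up distortion values, chosen only to show a visible elbow.
costs = {1: 100.0, 2: 60.0, 3: 20.0, 4: 17.0, 5: 15.0, 6: 14.0}
print(elbow_k(costs))  # 3
```

When the curve has no clear bend, a heuristic like this gives an arbitrary answer, which is one reason to judge K by the downstream purpose instead.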
Sometimes, you’re running K-means to get clusters to use for some later/downstream purpose. Evaluate K-means based on a metric for how well it performs for that later purpose.
E.g. T-shirt sizing
[Figure: two plots of weight against height for T-shirt sizing, one per value of K.]
K=3: Small, Medium, Large
K=5: S, M, L, XL, XXL
Some of the Distance measures
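The measures listed on this slide did not survive extraction. As an assumption, here are three commonly used ones sketched in plain Python; they are not necessarily the ones the slide showed.

```python
import math


def euclidean(a, b):
    """Straight-line distance; K-means above uses its square."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute per-feature differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b (undefined for zero vectors)."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return 1.0 - dot / (norm_a * norm_b)


print(euclidean((0.0, 0.0), (3.0, 4.0)))        # 5.0
print(manhattan((0.0, 0.0), (3.0, 4.0)))        # 7.0
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))  # 1.0
```

Note that taking the mean in the move centroid step is tied to squared Euclidean distance; with another measure, a different centroid update may be more appropriate.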
K-Means Evaluation
Complexity
Running a fixed number t of iterations of the standard algorithm takes only O(t*k*m*n), where:
• t is the number of iterations
• m is the number of examples (data points)
• n is the dimensionality of the points (n-dimensional points)
• k is the number of centroids (or clusters)
Each of the t iterations loops over the m examples; for each example it loops over the k centroids; and finding the distance between a centroid and any $x^{(i)}$ takes n operations.
This is what practical implementations do (often with random restarts between the iterations).
Efficient algorithm!
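To make the t*k*m*n count concrete, the sketch below simply mirrors the loop nesting and tallies per-coordinate operations. It is a counting illustration with hypothetical function names, not a benchmark, and it ignores the cheaper move centroid step.

```python
def assignment_ops_per_iteration(m, k, n):
    """Count per-coordinate operations in one cluster assignment pass."""
    ops = 0
    for _ in range(m):        # m examples
        for _ in range(k):    # k centroids per example
            ops += n          # n coordinates per distance computation
    return ops


def total_ops(t, k, m, n):
    """Operations over t iterations of the assignment step."""
    return t * assignment_ops_per_iteration(m, k, n)


print(total_ops(t=10, k=3, m=1000, n=2))  # 60000
```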
K-Means Visualization
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
