12. k-Means Algorithm
• The k-Means clustering algorithm was proposed by J. A. Hartigan and M. A. Wong [1979].
• Given a set of n distinct objects, k-Means partitions the objects into k clusters such that intra-cluster similarity is high but inter-cluster similarity is low.
• In this algorithm, the user needs to specify k, the number of clusters.
• Assume the objects are described by numeric attributes.
• Then use any one of the distance metrics (Euclidean, Manhattan) to create the clusters.
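As a concrete illustration, the two distance metrics mentioned above can be written in a few lines of Python (a minimal sketch; the function names are my own):

```python
import math

def euclidean(p, q):
    # Straight-line distance: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((1, 1), (5, 4)))  # 5.0
print(manhattan((1, 1), (5, 4)))  # 7
```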
13. k-Means Algorithm
• First, it selects k objects at random from the set of n objects. These k objects are treated as the centroids of the k clusters.
• Each of the remaining objects is assigned to its closest centroid. The collection of objects assigned to each centroid forms a cluster.
• Next, the centroid of each cluster is updated by calculating the mean of the attribute values of its member objects.
• The assignment and update steps are repeated (iterated) until some stopping criterion is reached (e.g., a maximum number of iterations, centroids remain unchanged, or no reassignment occurs).
14. k-Means Algorithm
Input: D is a dataset containing n objects, k is the number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each object in D:
• Compute the distance between the current object and the k cluster centroids.
• Assign the current object to the cluster whose centroid is closest.
3. Compute the “cluster centers” (means) of each cluster. These become the new cluster centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied.
5. Stop
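The steps above can be sketched in plain Python with no external dependencies (a minimal sketch; the function and variable names are my own, and empty clusters are simply left with their old centroid). It is run here on the four 2-D points from the medicine example that follows:

```python
import math

def kmeans(D, k, initial_centroids, max_iter=100):
    # Step 1: start from the given initial centroids (chosen at random in practice)
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        # Step 2: assign each object to its closest centroid
        clusters = [[] for _ in range(k)]
        for x in D:
            dists = [math.dist(x, c) for c in centroids]
            clusters[dists.index(min(dists))].append(x)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# The four medicines from the example that follows:
D = [(1, 1), (2, 1), (4, 3), (5, 4)]
centroids, clusters = kmeans(D, 2, initial_centroids=[(1, 1), (2, 1)])
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)]
```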
17. Example
• Problem: We have 4 types of medicine, each with two attributes (weight and pH index).
Group these objects into K = 2 clusters of medicine.

Medicine  Weight  pH-Index
A         1       1
B         2       1
C         4       3
D         5       4
18. Example
• Step 1: Use initial seed (random) points for partitioning
Suppose A and B are chosen as the initial seeds, so c1 = A = (1, 1) and c2 = B = (2, 1).
Assign each object to the cluster with the nearest seed point, using Euclidean distance.
For example, for object D = (5, 4):
d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
d(D, c2) = √((5 − 2)² + (4 − 1)²) ≈ 4.24
So D is assigned to cluster 2; the result is cluster 1 = {A} and cluster 2 = {B, C, D}.
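The distance computation above can be checked for all four objects at once (a small sketch; `math.dist` is the Python standard-library Euclidean distance):

```python
import math

objects = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
c1, c2 = (1, 1), (2, 1)  # initial seeds: A and B

for name, p in objects.items():
    d1, d2 = math.dist(p, c1), math.dist(p, c2)
    nearest = 1 if d1 <= d2 else 2  # assign to the closest seed point
    print(f"{name}: d(., c1) = {d1:.2f}, d(., c2) = {d2:.2f} -> cluster {nearest}")
```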
19. Example
• Step 2: Compute new centroids of the current partition
Knowing the members of each cluster,
now compute the new centroid of each
group based on these new
memberships.
c1 = (1, 1)
c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
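The centroid update on this slide is just a coordinate-wise mean, which can be sketched as (function name is my own):

```python
def centroid(points):
    # Coordinate-wise mean of a list of points
    return tuple(sum(c) / len(points) for c in zip(*points))

print(centroid([(1, 1)]))                  # cluster 1: (1.0, 1.0)
print(centroid([(2, 1), (4, 3), (5, 4)]))  # cluster 2: (11/3, 8/3)
```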
20. Example
• Step 2: Renew membership based on new centroids
Compute the distance of all objects to the new centroids, then reassign each object to its nearest centroid: now cluster 1 = {A, B} and cluster 2 = {C, D}.
21. Example
• Step 3: Repeat the first two steps until convergence
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
22. Example
• Step 3: Repeat the first two steps until convergence
Compute the distance of all objects to
the new centroids
Stop: there are no new assignments, and the membership of each cluster no longer changes.
23. How to choose k? – Elbow Method
• A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be grouped.
• From the visualization, the optimal number of clusters appears to be around 3, but visualizing the data alone cannot always give the optimal number of clusters.
• The Elbow Method is one of the most popular methods for determining this optimal value of k.
(Scatter plot of the data on axes x1 and x2.)
24. How to choose k? – Elbow Method
Elbow method (using Distortion):
• Step 1: Distortion is the average of the squared distances from the objects to the centers of their respective clusters. Typically, the Euclidean distance metric is used.
• Step 2: Build the clustering model for each value of k from 1 to 9 and calculate the distortion for each value of k in that range.
– Select the value of k at the “elbow”, i.e., the point after which the distortion starts to decrease in a linear fashion.
25. How to choose k? – Elbow Method
Elbow method (using Inertia):
• Step 1: Inertia is the sum of squared distances of the samples (objects) to their closest cluster center.
• Step 2: Build the clustering model for each value of k from 1 to 9 and calculate the inertia for each value of k in that range.
– Select the value of k at the “elbow”, i.e., the point after which the inertia starts to decrease in a linear fashion.
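For the four-medicine example above, the inertia of the final k = 2 clustering can be computed directly (a minimal arithmetic sketch using the final centroids from that example; in practice one would compute this value for each k in the range and look for the elbow in the plot):

```python
# Final clusters from the example, keyed by their centroids
clusters = {
    (1.5, 1.0): [(1, 1), (2, 1)],   # cluster 1: {A, B}
    (4.5, 3.5): [(4, 3), (5, 4)],   # cluster 2: {C, D}
}

# Inertia: sum of squared distances of objects to their closest cluster center
inertia = sum(
    (x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2
    for c, members in clusters.items()
    for x in members
)
print(inertia)  # 1.5
```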
27. Limitations of k-Means algorithm
• Local optimum
– sensitive to the initial seed points
– may converge to a local optimum, which can be an unwanted solution
• Need to specify k, the number of clusters, in advance
• Not suitable for discovering clusters with non-convex shapes
• Applicable only when the mean is defined, so it cannot handle categorical data (use the k-modes algorithm instead)
52. Convergence
• The cluster centroids in iteration 3 and iteration 4 are the same, i.e., there is no change.
• This satisfies the convergence criterion: the data points cannot be re-clustered any further.
• So, stop the process.