12. k-Means Algorithm
• The k-Means clustering algorithm was proposed by J. A. Hartigan and M. A. Wong [1979].
• Given a set of n distinct objects, k-Means partitions the objects into k clusters such that intra-cluster similarity is high but inter-cluster similarity is low.
• In this algorithm, the user needs to specify k, the number of clusters.
• Assume the objects are described by numeric attributes.
• Then use any one of the distance metrics (Euclidean, Manhattan) to create the clusters.
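As a concrete illustration, the two distance metrics mentioned above can be written in a few lines of Python (a minimal sketch; the function names are my own):

```python
import math

def euclidean(p, q):
    # Straight-line distance: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((1, 1), (5, 4)))  # 5.0
print(manhattan((1, 1), (5, 4)))  # 7
```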
13. k-Means Algorithm
• First, it selects k objects at random from the set of n objects. These k objects are treated as the centroids of the k clusters.
• Each of the remaining objects is assigned to its closest centroid. The collection of objects assigned to each centroid forms a cluster.
• Next, the centroid of each cluster is updated by calculating the mean of the attribute values of its member objects.
• The assignment and update steps are repeated (iterated) until some stopping criterion is reached (e.g., a maximum number of iterations, centroids remain unchanged, or no reassignment occurs).
14. k-Means Algorithm
Input: D is a dataset containing n objects, k is the number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each object in D:
• Compute the distance between the current object and the k cluster centroids.
• Assign the current object to the cluster whose centroid is closest.
3. Compute the “cluster centers” (means) of each cluster. These become the new cluster centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied.
5. Stop
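The steps above can be sketched in plain Python with no external dependencies (a minimal sketch; the function and variable names are my own, and empty clusters are simply left with their old centroid). It is run here on the four 2-D points from the medicine example that follows:

```python
import math

def kmeans(D, k, initial_centroids, max_iter=100):
    # Step 1: start from the given initial centroids (chosen at random in practice)
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        # Step 2: assign each object to its closest centroid
        clusters = [[] for _ in range(k)]
        for x in D:
            dists = [math.dist(x, c) for c in centroids]
            clusters[dists.index(min(dists))].append(x)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# The four medicines from the example that follows:
D = [(1, 1), (2, 1), (4, 3), (5, 4)]
centroids, clusters = kmeans(D, 2, initial_centroids=[(1, 1), (2, 1)])
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)]
```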
17. Example
• Problem: We have 4 types of medicine, each with two attributes (weight and pH index).
Group these objects into K = 2 clusters of medicine.

Medicine  Weight  pH-Index
A         1       1
B         2       1
C         4       3
D         5       4
18. Example
• Step 1: Use initial seed (random) points for partitioning
Suppose A and B are chosen as the initial seeds, so c1 = A = (1, 1) and c2 = B = (2, 1).
Assign each object to the cluster with the nearest seed point, using Euclidean distance.
For example, for object D = (5, 4):
d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
d(D, c2) = √((5 − 2)² + (4 − 1)²) ≈ 4.24
So D is assigned to cluster 2; the result is cluster 1 = {A} and cluster 2 = {B, C, D}.
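The distance computation above can be checked for all four objects at once (a small sketch; `math.dist` is the Python standard-library Euclidean distance):

```python
import math

objects = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
c1, c2 = (1, 1), (2, 1)  # initial seeds: A and B

for name, p in objects.items():
    d1, d2 = math.dist(p, c1), math.dist(p, c2)
    nearest = 1 if d1 <= d2 else 2  # assign to the closest seed point
    print(f"{name}: d(., c1) = {d1:.2f}, d(., c2) = {d2:.2f} -> cluster {nearest}")
```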
19. Example
• Step 2: Compute new centroids of the current partition
Knowing the members of each cluster,
now compute the new centroid of each
group based on these new
memberships.
c1 = (1, 1)
c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
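The centroid update on this slide is just a coordinate-wise mean, which can be sketched as (function name is my own):

```python
def centroid(points):
    # Coordinate-wise mean of a list of points
    return tuple(sum(c) / len(points) for c in zip(*points))

print(centroid([(1, 1)]))                  # cluster 1: (1.0, 1.0)
print(centroid([(2, 1), (4, 3), (5, 4)]))  # cluster 2: (11/3, 8/3)
```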
20. Example
• Step 2: Renew membership based on new centroids
Compute the distance of all objects to the new centroids, then reassign each object to its nearest centroid: now cluster 1 = {A, B} and cluster 2 = {C, D}.
21. Example
• Step 3: Repeat the first two steps until convergence
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
22. Example
• Step 3: Repeat the first two steps until convergence
Compute the distance of all objects to
the new centroids
Stop: there are no new assignments, and the membership of each cluster no longer changes.
23. How to choose k? – Elbow Method
• A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be grouped.
• From the visualization, the optimal number of clusters appears to be around 3, but visualizing the data alone cannot always give the optimal number of clusters.
• The Elbow Method is one of the most popular methods for determining this optimal value of k.
(Scatter plot of the data on axes x1 and x2.)
24. How to choose k? – Elbow Method
Elbow method (using Distortion):
• Step 1: Distortion is the average of the squared distances from the objects to the centers of their respective clusters. Typically, the Euclidean distance metric is used.
• Step 2: Build the clustering model for each value of k from 1 to 9 and calculate the distortion for each value of k in that range.
– Select the value of k at the “elbow”, i.e., the point after which the distortion starts to decrease in a linear fashion.
25. How to choose k? – Elbow Method
Elbow method (using Inertia):
• Step 1: Inertia is the sum of squared distances of the samples (objects) to their closest cluster center.
• Step 2: Build the clustering model for each value of k from 1 to 9 and calculate the inertia for each value of k in that range.
– Select the value of k at the “elbow”, i.e., the point after which the inertia starts to decrease in a linear fashion.
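For the four-medicine example above, the inertia of the final k = 2 clustering can be computed directly (a minimal arithmetic sketch using the final centroids from that example; in practice one would compute this value for each k in the range and look for the elbow in the plot):

```python
# Final clusters from the example, keyed by their centroids
clusters = {
    (1.5, 1.0): [(1, 1), (2, 1)],   # cluster 1: {A, B}
    (4.5, 3.5): [(4, 3), (5, 4)],   # cluster 2: {C, D}
}

# Inertia: sum of squared distances of objects to their closest cluster center
inertia = sum(
    (x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2
    for c, members in clusters.items()
    for x in members
)
print(inertia)  # 1.5
```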
27. Limitations of k-Means algorithm
• Local optimum
– sensitive to the initial seed points
– may converge to a local optimum, which can be an unwanted solution
• Need to specify k, the number of clusters, in advance
• Not suitable for discovering clusters with non-convex shapes
• Applicable only when the mean is defined, so it cannot handle categorical data (use the k-modes algorithm instead)
52. Convergence
• The cluster centroids in iteration 3 and iteration 4 are the same, i.e., there is no change.
• This satisfies the convergence criterion: the data points cannot be re-clustered any further.
• So, stop the process.