1. Representative Based Clustering Algorithm: Part 1,
K-Means
Ananda Swarup Das, Technical Staff Member,
IBM India Research Labs, New Delhi, anandaswarup@gmail.com.
December 18, 2016
Ananda Swarup Das, Technical Staff Member, IBM India Research Labs, New Delhi, anandaswarup@gmail.com.Representative Based Clustering Algorithm: Part 1, K-MeansDecember 18, 2016 1 / 15
2. Standing on the Shoulders of Giants
Please note that I use the excellent book titled "Python Machine
Learning" [3] for most of the programming examples in this
presentation.
The theoretical material is drawn from multiple sources such as [1], [2], and
[4].
Thanks to all the authors for such great books.
3. Definition of Clustering
A Formal Definition: Clustering can be defined as partitioning a
given data set D = {x_i}_{i=1}^n, where each x_i ∈ R^d, into k sub-partitions
denoted by C = {C_1, ..., C_k} such that C_i ∩ C_j = ∅ for 1 ≤ i < j ≤ k,
and ∪_{j=1}^k C_j = D. Here k is a user-defined/chosen parameter.
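The two conditions in the definition (pairwise disjointness and covering D) can be checked mechanically for a toy partition; a minimal sketch in plain Python (the data and partition below are made up purely for illustration):

```python
# Toy data set D of five 2-d points, partitioned into k = 3 clusters.
D = {(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (9.0, 1.0)}
C = [{(0.0, 0.0), (1.0, 0.0)}, {(5.0, 5.0), (5.0, 6.0)}, {(9.0, 1.0)}]

# Pairwise disjointness: C_i ∩ C_j = ∅ for i < j.
disjoint = all(C[i].isdisjoint(C[j])
               for i in range(len(C)) for j in range(i + 1, len(C)))

# Covering: the union of all clusters equals D.
covers = set().union(*C) == D

print(disjoint, covers)  # a valid partition satisfies both
```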
4. Is the Definition Okay?
The definition is incomplete in the sense that it says nothing about
the quality of each sub-partition.
Going by the previous definition alone, the points could be grouped
arbitrarily into k sub-groups. (Would that help?)
Did we miss something? (Well, yes: we did not say how
we represent each cluster.)
5. Representative Based Clustering
For each cluster, we try to find a representative point that
summarizes the cluster.
Ideally, this is the mean of the cluster.
The K-means algorithm is an example of representative based
clustering.
6. K-Means Clustering: Definition and the Objective Function
Given the task of clustering, the first important step is to choose an
appropriate scoring function that measures the quality of a clustering.
K-means greedily finds k means μ_1, ..., μ_k for the clusters C_1, ..., C_k.
The sum of squared errors for a cluster C_i is given as
SSE(C_i) = Σ_{x_j ∈ C_i} ||x_j − μ_i||².
The sum of squared errors for the clustering scheme C is defined as
SSE(C) = Σ_{i=1}^k Σ_{x_j ∈ C_i} ||x_j − μ_i||².
The objective is therefore to find the clustering scheme
C* = arg min_C SSE(C).
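The two SSE formulas above can be computed directly with NumPy; a small sketch (the function and array names are my own, not from the slides):

```python
import numpy as np

def sse_cluster(points, mu):
    """SSE(C_i) = sum over x_j in C_i of ||x_j - mu_i||^2."""
    return float(np.sum((points - mu) ** 2))

def sse_clustering(clusters):
    """SSE(C) = sum over all clusters of SSE(C_i), with mu_i the cluster mean."""
    return sum(sse_cluster(pts, pts.mean(axis=0)) for pts in clusters)

# Two tiny clusters in R^2.
c1 = np.array([[0.0, 0.0], [2.0, 0.0]])   # mean (1, 0), SSE = 1 + 1 = 2
c2 = np.array([[5.0, 5.0], [5.0, 7.0]])   # mean (5, 6), SSE = 1 + 1 = 2
print(sse_clustering([c1, c2]))  # → 4.0
```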
8. K-Means Clustering: Algorithmic Steps and Hard
Assignment
As stated in [4]:
1. At the first step t = 0, randomly initialize k centroids denoted by
μ^t_1, ..., μ^t_k.
2. Repeat:
Increment the iteration index t by 1.
Let C_j = ∅ for all j = 1, ..., k.
For each x_j in the data set D, do:
Find i* = arg min_i { ||x_j − μ^{t−1}_i||² }.
C_{i*} = C_{i*} ∪ {x_j}.
3. Update the centroid of each cluster as μ^t_i = (1 / |C_i|) Σ_{x_j ∈ C_i} x_j.
4. Stop if Σ_{i=1}^k ||μ^t_i − μ^{t−1}_i||² ≤ ε, where ε is a user-defined parameter.
Notice that in each iteration of k-means, a point in D is greedily assigned
to at most one cluster. This is called hard assignment.
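The four steps above translate almost line for line into NumPy; a minimal sketch (not a reference implementation: it uses plain random seeding rather than the k-means++ style initialization a production version would use, and for brevity it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, eps=1e-4, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct data points as the initial centroids.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: hard-assign each x_j to its nearest centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: update each centroid to the mean of its cluster
        # (assumes every cluster is non-empty).
        new_mu = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once the centroids have (almost) stopped moving.
        if ((new_mu - mu) ** 2).sum() <= eps:
            return new_mu, labels
        mu = new_mu
    return mu, labels
```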
10. A Question to Ponder
Can we do something so that, instead of greedily assigning a point to at
most one cluster, we assign the point to multiple clusters?
Yes, we can, but we will defer the answer for some time, as we have
some maths to brush up on. Part 2 of this series will answer the question.
11. A Few Things to Learn
1. Clustering is an unsupervised technique.
You are not provided with any training data or labeled data to
train a system.
You are trying to find groups/patterns in the data.
2. How to decide an ideal value for the parameter k:
Use the elbow method.
If the dimension is not too high, one can also use the Bayesian Information
Criterion (BIC).
12. Visualizations
I use make_blobs from sklearn.datasets, following examples from [3], to
generate a 2-d sample data set with four centers. It is synthetic data (for
demo purposes); in practice, one will rarely get such well-clustered data.
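A sketch of the data generation described above, following the pattern used in [3] (the exact parameter values here are my own choices, not necessarily those behind the original figures):

```python
from sklearn.datasets import make_blobs

# 150 two-dimensional points drawn around four centers.
X, y = make_blobs(n_samples=150, n_features=2, centers=4,
                  cluster_std=0.5, shuffle=True, random_state=0)
print(X.shape)  # (150, 2)
```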
13. Introducing KMeans from sklearn.cluster
This is as simple as follows:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='random', n_init=10,
max_iter=800, tol=1e-04, random_state=0)
The important terms:
n_clusters denotes the number of clusters you want. This is actually your
value of k.
init='random' means k random points will be initially selected as the
centroids/means.
n_init denotes the number of times the k-means algorithm will be run with
different centroid seeds; the run with the lowest SSE is kept.
max_iter=800 denotes the maximum number of iterations the KMeans
algorithm will run (the sklearn default is 300).
tol=1e-04 is the minimum tolerance to declare convergence. Remember the
stopping rule Σ_{i=1}^k ||μ^t_i − μ^{t−1}_i||² ≤ ε; here ε is the tol.
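Once constructed, the estimator is fitted with the usual sklearn pattern; a short usage sketch (the blob data is generated inline only so the snippet is self-contained):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=4, cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=4, init='random', n_init=10,
            max_iter=800, tol=1e-04, random_state=0)
labels = km.fit_predict(X)        # cluster index for each point
print(km.cluster_centers_.shape)  # (4, 2): one centroid per cluster
```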
14. Deciding the Number of Clusters k
1. Remember that the sum of squared errors for the clustering scheme C is
defined as SSE(C) = Σ_{i=1}^k Σ_{x_j ∈ C_i} ||x_j − μ_i||². This quantity is also
known as the cluster distortion or cluster inertia.
2. The KMeans module of sklearn.cluster will give you that value as
km.inertia_.
3. Run the KMeans algorithm in a loop where, at each iteration, you
choose a different value of k. Collect the cluster inertia for that value
of k.
4. Make a plot and find the elbow.
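Steps 1-4 above amount to a short loop; a sketch (plotting is left out so the snippet stays self-contained; in practice you would plot distortions against the k values with matplotlib):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=4, cluster_std=0.5, random_state=0)

distortions = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='random', n_init=10,
                max_iter=300, tol=1e-04, random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)  # SSE(C) for this value of k

# The distortion shrinks as k grows; the elbow is where the
# decrease suddenly flattens (here, around k = 4).
print(distortions)
```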
15. The Elbow Method
Figure: Notice the sharp decline in distortion from k = 3 to k = 4, after which
the curve flattens. This bend is called the elbow, and it suggests that k = 4 is
probably a good choice.
16. In the Next Part of the Series
In the next part of this series (probably in a week's time), we will
introduce the Expectation Maximization algorithm with detailed
explanations.
Till then, happy data science with Python.
17. Citations
[1] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to
Statistical Learning: with Applications in R. Springer Texts in Statistics.
Springer New York, 2014.
[2] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information
Retrieval. Cambridge University Press, 2008.
[3] S. Raschka. Python Machine Learning. Packt Publishing, 2015.
[4] M. J. Zaki and W. Meira. Data Mining and Analysis: Fundamental
Concepts and Algorithms. Cambridge University Press, 2014.