1. Representative Based Clustering Algorithm: Part 1,
K-Means
Ananda Swarup Das, Technical Staff Member,
IBM India Research Labs, New Delhi, anandaswarup@gmail.com.
December 18, 2016
Ananda Swarup Das, Technical Staff Member, IBM India Research Labs, New Delhi, anandaswarup@gmail.com.Representative Based Clustering Algorithm: Part 1, K-MeansDecember 18, 2016 1 / 15
2. Standing on the Shoulders of Giants
Please note that I use the excellent book titled "Python Machine
Learning" [3] for most of the programming examples in this
presentation.
The theoretical material is drawn from multiple sources such as [1], [2], and
[4].
Thanks to all the authors for such great books.
3. Definition of Clustering
A Formal Definition: Clustering can be defined as partitioning a
given data set D = {x_i}_{i=1}^n, where each x_i ∈ R^d, into k sub-partitions
denoted by C = {C_1, ..., C_k} such that C_i ∩ C_j = ∅ for 1 ≤ i < j ≤ k,
and ∪_{j=1}^k C_j = D. Here k is a user-defined/chosen parameter.
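The two conditions in the definition (pairwise disjointness and covering D) can be checked mechanically for a toy partition; a minimal sketch in plain Python (the data and partition below are made up purely for illustration):

```python
# Toy data set D of five 2-d points, partitioned into k = 3 clusters.
D = {(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (9.0, 1.0)}
C = [{(0.0, 0.0), (1.0, 0.0)}, {(5.0, 5.0), (5.0, 6.0)}, {(9.0, 1.0)}]

# Pairwise disjointness: C_i ∩ C_j = ∅ for i < j.
disjoint = all(C[i].isdisjoint(C[j])
               for i in range(len(C)) for j in range(i + 1, len(C)))

# Covering: the union of all clusters equals D.
covers = set().union(*C) == D

print(disjoint, covers)  # a valid partition satisfies both
```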
4. Is the Definition Okay?
The definition is incomplete in the sense that it says nothing about
the quality of each sub-partition.
Going by the previous definition alone, the points could be grouped
arbitrarily into k sub-groups. (Would that help?)
Did we miss something? (Well, yes: we did not say how
we represent each cluster.)
5. Representative Based Clustering
For each cluster, we try to find a representative point that
summarizes the cluster.
Ideally, this is the mean of the cluster.
The K-means algorithm is an example of representative based
clustering.
6. K-Means Clustering: Definition and the Objective Function
Given the task of clustering, the first important step is to choose an
appropriate scoring function that measures the quality of a clustering.
K-means greedily finds k means μ_1, ..., μ_k for the clusters C_1, ..., C_k.
The sum of squared errors for a cluster C_i is given as
SSE(C_i) = Σ_{x_j ∈ C_i} ||x_j − μ_i||².
The sum of squared errors for the clustering scheme C is defined as
SSE(C) = Σ_{i=1}^k Σ_{x_j ∈ C_i} ||x_j − μ_i||².
The objective is therefore to find the clustering scheme
C* = arg min_C SSE(C).
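The two SSE formulas above can be computed directly with NumPy; a small sketch (the function and array names are my own, not from the slides):

```python
import numpy as np

def sse_cluster(points, mu):
    """SSE(C_i) = sum over x_j in C_i of ||x_j - mu_i||^2."""
    return float(np.sum((points - mu) ** 2))

def sse_clustering(clusters):
    """SSE(C) = sum over all clusters of SSE(C_i), with mu_i the cluster mean."""
    return sum(sse_cluster(pts, pts.mean(axis=0)) for pts in clusters)

# Two tiny clusters in R^2.
c1 = np.array([[0.0, 0.0], [2.0, 0.0]])   # mean (1, 0), SSE = 1 + 1 = 2
c2 = np.array([[5.0, 5.0], [5.0, 7.0]])   # mean (5, 6), SSE = 1 + 1 = 2
print(sse_clustering([c1, c2]))  # → 4.0
```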
8. K-Means Clustering: Algorithmic Steps and Hard
Assignment
As stated in [4]:
1. At the first step t = 0, randomly initialize k centroids denoted by
μ^t_1, ..., μ^t_k.
2. Repeat:
Increment the iteration index t by 1.
Let C_j = ∅ for all j = 1, ..., k.
For each x_j in the data set D, do:
Find i* = arg min_i { ||x_j − μ^{t−1}_i||² }.
C_{i*} = C_{i*} ∪ {x_j}.
3. Update the centroid of each cluster as μ^t_i = (1 / |C_i|) Σ_{x_j ∈ C_i} x_j.
4. Stop if Σ_{i=1}^k ||μ^t_i − μ^{t−1}_i||² ≤ ε, where ε is a user-defined parameter.
Notice that in each iteration of k-means, a point in D is greedily assigned
to at most one cluster. This is called hard assignment.
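The four steps above translate almost line for line into NumPy; a minimal sketch (not a reference implementation: it uses plain random seeding rather than the k-means++ style initialization a production version would use, and for brevity it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, eps=1e-4, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct data points as the initial centroids.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: hard-assign each x_j to its nearest centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: update each centroid to the mean of its cluster
        # (assumes every cluster is non-empty).
        new_mu = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once the centroids have (almost) stopped moving.
        if ((new_mu - mu) ** 2).sum() <= eps:
            return new_mu, labels
        mu = new_mu
    return mu, labels
```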
10. A Question to Ponder
Can we do something so that, instead of greedily assigning a point to at
most one cluster, we assign the point to multiple clusters?
Yes, we can, but we will defer the answer for some time, as we have
some maths to brush up on. Part 2 of this series will answer the question.
11. A Few Things to Learn
1. Clustering is an unsupervised technique.
You are not provided with any training data or labeled data to
train a system.
You are trying to find groups/patterns in the data.
2. How to decide an ideal value for the parameter k:
Use the elbow method.
If the dimension is not too high, one can also use the Bayesian Information
Criterion (BIC).
12. Visualizations
I use make_blobs from sklearn.datasets, following examples from [3], to
generate a 2-d sample data set with four centers. It is synthetic data (for
demo purposes); in practice, one will rarely get such well-clustered data.
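A sketch of the data generation described above, following the pattern used in [3] (the exact parameter values here are my own choices, not necessarily those behind the original figures):

```python
from sklearn.datasets import make_blobs

# 150 two-dimensional points drawn around four centers.
X, y = make_blobs(n_samples=150, n_features=2, centers=4,
                  cluster_std=0.5, shuffle=True, random_state=0)
print(X.shape)  # (150, 2)
```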
13. Introducing KMeans from sklearn.cluster
This is as simple as follows:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='random', n_init=10,
max_iter=800, tol=1e-04, random_state=0)
The important terms:
n_clusters denotes the number of clusters you want. This is actually your
value of k.
init='random' means k random points will be initially selected as the
centroids/means.
n_init denotes the number of times the k-means algorithm will be run with
different centroid seeds; the run with the lowest SSE is kept.
max_iter=800 denotes the maximum number of iterations the KMeans
algorithm will run (the sklearn default is 300).
tol=1e-04 is the minimum tolerance to declare convergence. Remember the
stopping rule Σ_{i=1}^k ||μ^t_i − μ^{t−1}_i||² ≤ ε; here ε is the tol.
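Once constructed, the estimator is fitted with the usual sklearn pattern; a short usage sketch (the blob data is generated inline only so the snippet is self-contained):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=4, cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=4, init='random', n_init=10,
            max_iter=800, tol=1e-04, random_state=0)
labels = km.fit_predict(X)        # cluster index for each point
print(km.cluster_centers_.shape)  # (4, 2): one centroid per cluster
```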
14. Deciding the Number of Clusters k
1. Remember that the sum of squared errors for the clustering scheme C is
defined as SSE(C) = Σ_{i=1}^k Σ_{x_j ∈ C_i} ||x_j − μ_i||². This quantity is also
known as the cluster distortion or cluster inertia.
2. The KMeans module of sklearn.cluster will give you that value as
km.inertia_.
3. Run the KMeans algorithm in a loop where, at each iteration, you
choose a different value of k. Collect the cluster inertia for that value
of k.
4. Make a plot and find the elbow.
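Steps 1-4 above amount to a short loop; a sketch (plotting is left out so the snippet stays self-contained; in practice you would plot distortions against the k values with matplotlib):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=4, cluster_std=0.5, random_state=0)

distortions = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='random', n_init=10,
                max_iter=300, tol=1e-04, random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)  # SSE(C) for this value of k

# The distortion shrinks as k grows; the elbow is where the
# decrease suddenly flattens (here, around k = 4).
print(distortions)
```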
15. The Elbow Method
Figure: Notice the sharp decline in distortion from k = 3 to k = 4, after which
the curve flattens. This bend is called the elbow, and it suggests that k = 4 is
probably a good choice.
16. In the Next Part of the Series
In the next part of this series (probably in a week's time), we will
introduce the Expectation Maximization algorithm with detailed
explanations.
Till then, happy data science with Python.
17. Citations
[1] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to
Statistical Learning: with Applications in R. Springer Texts in Statistics.
Springer New York, 2014.
[2] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information
Retrieval. Cambridge University Press, 2008.
[3] S. Raschka. Python Machine Learning. Packt Publishing, 2015.
[4] M. J. Zaki and W. Meira. Data Mining and Analysis: Fundamental
Concepts and Algorithms. Cambridge University Press, 2014.