2. Clustering
Clustering is an unsupervised learning method. It is used to segment data into groups of similar items rather than to make predictions.
It does not predict anything on its own, but it can be used to improve the accuracy of predictive methods.
3. Distances for Clustering
Euclidean Distance (As-the-crow-flies Distance): the straight-line distance between two points.
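As an illustration (a minimal Python sketch, not part of the original slides), the Euclidean distance between two points is the square root of the sum of their squared coordinate differences:

```python
from math import sqrt

def euclidean(p, q):
    # Straight-line ("as the crow flies") distance between two points
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```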
Distances between Clusters:
Minimum Distance: distance between the two closest points, one from each cluster
Maximum Distance: distance between the two farthest points, one from each cluster
Centroid Distance: distance between the centroids of the two clusters
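The three inter-cluster distances can be sketched in a few lines of Python (the function names are my own, for illustration only):

```python
from itertools import product
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def minimum_distance(c1, c2):
    # Distance between the closest pair of points, one from each cluster
    return min(euclidean(p, q) for p, q in product(c1, c2))

def maximum_distance(c1, c2):
    # Distance between the farthest pair of points, one from each cluster
    return max(euclidean(p, q) for p, q in product(c1, c2))

def centroid(cluster):
    # Coordinate-wise mean of the cluster's points
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def centroid_distance(c1, c2):
    # Distance between the two cluster centroids
    return euclidean(centroid(c1), centroid(c2))
```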
4. K-Means Clustering
It aims to partition the data into k clusters such that each data point belongs to the cluster whose mean is nearest to it.
Let's say you have some data points and you want to group them into 3 clusters, so k = 3.
1. Start by randomly assigning the data points to 3 clusters.
2. Calculate the centroid (mean) of each cluster.
3. Reassign each data point to the cluster with the closest mean.
4. Recompute the cluster centroids.
5. Repeat steps 3 and 4 until the assignments no longer change (see the sketch below).
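Here is a minimal sketch of this loop in plain Python, assuming 1-dimensional data; the function name and the reseeding of empty clusters are illustrative choices, not part of the slides:

```python
import random

def k_means_1d(points, k, max_iter=100):
    # Step 1: randomly assign each point to one of k clusters
    assignments = [random.randrange(k) for _ in points]
    means = [0.0] * k
    for _ in range(max_iter):
        # Steps 2 and 4: compute the mean of each cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            # Reseed an empty cluster with a random point (one common fix)
            means[c] = sum(members) / len(members) if members else random.choice(points)
        # Step 3: reassign each point to the cluster with the nearest mean
        new_assignments = [min(range(k), key=lambda c: abs(p - means[c]))
                           for p in points]
        if new_assignments == assignments:  # Step 5: stop when nothing changes
            break
        assignments = new_assignments
    return assignments, means
```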
6. K-Means Clustering
Take a group of 12 football players who have each scored a certain number of goals this season (say in the range 3–30). Let's divide them into separate clusters, say three.
Step 1 requires us to randomly split the players into three groups and calculate the mean of each group.
7. K-Means Clustering
Step 2: For each player, reassign them to the group with the closest mean. E.g., Player A (5 goals)
is assigned to Group 2 (mean = 9). Then recalculate the group means.
8. K-Means Clustering
Repeat Step 2 until the group means no longer change.
With this example, the clusters could correspond to the players' positions on the field, such as defenders, midfielders and attackers.
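To make the walk-through concrete, here is a hypothetical run with scikit-learn; the 12 goal counts below are made up for illustration, not taken from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical goal counts for the 12 players (values are made up)
goals = np.array([3, 4, 5, 7, 9, 10, 14, 15, 17, 25, 28, 30]).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(goals)
for g, label in zip(goals.ravel(), km.labels_):
    print(f"{g} goals -> cluster {label}")
print("cluster means:", km.cluster_centers_.ravel())
```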
9. How to find K
In K-Means clustering, the value of K has to be specified beforehand. It can be determined using the method below:
Elbow Method: Clustering is run on the dataset for varying values of K, and the SSE (sum of squared errors) is calculated for each value of K. A graph of SSE against K is then plotted. There is a point on the graph beyond which the SSE stops decreasing significantly as K increases; this point, which looks like the elbow of an arm, is chosen as the value of K.
The SSE is defined as the sum of the squared distances between each member of a cluster and the cluster's centroid, summed over all clusters.
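A sketch of the elbow method using scikit-learn on synthetic toy data (the dataset and parameter choices are illustrative); KMeans exposes the SSE as its inertia_ attribute:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

ks = range(1, 10)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # inertia_ is exactly the SSE defined above

plt.plot(ks, sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()  # choose K at the "elbow", where the curve flattens out
```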
10. Hierarchical Clustering (Agglomerative / Single Linkage)
Start with each data point in its own cluster.
Combine the two nearest clusters, using one of the inter-cluster distances above (e.g., minimum or centroid distance).
Repeat the process until all data points belong to a single cluster (a sketch follows below).
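A minimal sketch using SciPy's agglomerative clustering on toy data (the choice of 3 clusters is illustrative):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)  # toy data

# method="single" merges the clusters with the smallest minimum distance;
# "complete" and "centroid" use the maximum and centroid distances instead
Z = linkage(X, method="single", metric="euclidean")

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```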
12. How many clusters to form?
Visualising the dendrogram: the best choice for the number of clusters is the number of vertical lines cut by a horizontal line that can traverse the maximum vertical distance without intersecting another cluster merge.
For example, in the accompanying dendrogram, the best choice for the number of clusters is 4.
Intuition and prior knowledge of the data set.
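A dendrogram can be drawn with SciPy and matplotlib; this sketch uses synthetic toy data, so the best cut will depend on the data at hand:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20, centers=3, random_state=0)
Z = linkage(X, method="single")

dendrogram(Z)  # look for the longest vertical stretch a horizontal cut can cross
plt.xlabel("data points")
plt.ylabel("merge distance")
plt.show()
```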
13. K-Means vs Hierarchical
For big data, K-Means is better!
The time complexity of K-Means is linear in the number of data points, while that of hierarchical clustering is quadratic.
Results are reproducible in hierarchical clustering, but not in K-Means, since they depend on the initialization of the centroids.
K-Means requires prior knowledge of the data set in order to specify K.
In hierarchical clustering, we can choose the number of clusters by interpreting the dendrogram.
14. Applications
Clustering algorithms can be applied in many fields, for instance:
Marketing: finding groups of customers with similar behavior given a large database of
customer data containing their properties and past buying records;
Biology: classification of plants and animals given their features;
Insurance: identifying groups of motor insurance policy holders with a high average claim
cost; identifying frauds;
Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
World Wide Web: document classification; clustering weblog data to discover groups of
similar access patterns.