Part 4 Clustering
Dr. Jyoti Yadav
Introduction to Clustering
• Clustering is similar to classification, but the basis is different: there are no predefined labels.
• In clustering you don’t know in advance what you are looking for; instead you try to identify segments or clusters in your data. When you apply clustering algorithms to a dataset, unexpected structures, clusters, and groupings can emerge that you would never have thought of otherwise.
• Clustering Models
K-Means Clustering
Hierarchical Clustering
Types of Clustering
Partitioning Clustering
Density Based Clustering
Distribution Model-Based Clustering
Hierarchical Clustering
Fuzzy Clustering
• Fuzzy clustering is a soft clustering method in which a data object may belong to more than one group or cluster.
• Each data point has a set of membership coefficients, which indicate its degree of membership in each cluster.
• The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
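The membership coefficients described above can be computed with the standard fuzzy c-means formula: a point's membership in a cluster falls off with its relative distance to that cluster's center, and the coefficients sum to 1. A minimal sketch, assuming NumPy; the function name and example points are illustrative, not from the slides:

```python
import numpy as np

def fuzzy_memberships(x, centers, m=2.0):
    """Membership coefficients of point x for each cluster center.

    Standard fuzzy c-means formula with fuzzifier m > 1; the
    coefficients sum to 1, so a point can partially belong to
    several clusters at once.
    """
    d = np.linalg.norm(centers - x, axis=1)  # distance to each center
    d = np.maximum(d, 1e-12)                 # guard against division by zero
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum()

# a point at (2, 0) between centers at (0, 0) and (10, 0)
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
u = fuzzy_memberships(np.array([2.0, 0.0]), centers)
```

The point is five times closer to the first center, so its membership there dominates, but it still retains a small membership in the second cluster.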
Clustering Algorithms
• K-Means Algorithm
• Mean-Shift Algorithm
• DBSCAN Algorithm
• Expectation-Maximization Clustering using GMM
• Agglomerative Hierarchical Algorithm
• Affinity Propagation
Applications of Clustering
• Search Result Grouping
• Customer Segmentation
• Medical Imaging
• Anomaly Detection
• Market Segmentation
• Social Network Analysis
• Recommendation Engines
K-Means Algorithm
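The K-means procedure illustrated on the preceding slides alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of the points assigned to it, repeating until the centroids stop moving. A minimal NumPy sketch of that loop (the variable names and synthetic data are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: random initial centroids, then alternate
    assignment and update steps until convergence."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # assignment step: nearest centroid for every point
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # converged
            break
        centroids = new
    return labels, centroids

# two well-separated blobs; K-means should recover them
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

This naive version uses purely random initialization, which is exactly what the "random initialization trap" discussed next can break.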
Random Initialization Trap
What if we have a bad random initialization?
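The usual remedies for a bad random initialization are to spread the initial centroids far apart (the k-means++ strategy) and to rerun the algorithm several times, keeping the run with the lowest WCSS. scikit-learn's KMeans supports both via the `init` and `n_init` parameters; a small sketch on synthetic blobs (the data is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# three tight blobs; one unlucky random init could merge two of them
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 2))
               for c in ([0, 0], [6, 0], [3, 5])])

# defenses against the trap:
#  - init='k-means++' spreads the initial centroids apart
#  - n_init=10 reruns K-means and keeps the lowest-WCSS result
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
km.fit(X)
```

After fitting, `km.inertia_` holds the WCSS of the best of the ten runs, and each blob ends up in its own cluster.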
Choosing the right number of clusters
How to select optimal number of clusters?
WCSS Code
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # the inertia_ attribute holds the WCSS
plt.plot(range(1, 11), wcss)  # from 1 to 10 clusters
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
For the mall.csv file, the WCSS (elbow) plot shows an elbow at k = 5 clusters.
IMP Code for K-means
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Careful')
# rows of cluster 0: column 0 on the x-axis, column 1 on the y-axis
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Standard')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Target')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Careless')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Sensible')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of Clients')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Output of k-means
Hierarchical Clustering
2 Types of Hierarchical Clustering
•Agglomerative
•Divisive
Agglomerative Clustering
Distance between Clusters
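The distance between two clusters depends on the linkage criterion: single linkage takes the closest pair of points, complete linkage the farthest pair, and average linkage the mean over all pairs. A small NumPy sketch of the three criteria (the function name and example clusters are illustrative):

```python
import numpy as np

def cluster_distance(A, B, linkage='single'):
    """Distance between clusters A and B (arrays of points) under
    common linkage criteria: single = closest pair,
    complete = farthest pair, average = mean of all pairs."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # all pairwise distances
    if linkage == 'single':
        return d.min()
    if linkage == 'complete':
        return d.max()
    return d.mean()  # 'average'

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
# single: 3.0 (closest pair), complete: 6.0 (farthest pair), average: 4.5
```

Ward linkage, used in the code later in this section, instead merges the pair of clusters whose merge least increases the total within-cluster variance.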
Agglomerative Hierarchical Clustering
Hierarchical Clustering using a Dendrogram
Using Dendrogram to form 2 clusters
Using Dendrogram to form 4 clusters
Using Dendrogram to form 6 clusters
How to find the optimal number of clusters?
How many clusters can be formed from this Dendrogram?
Optimal number of clusters = 3
Find the number of Clusters from the Dendrogram
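Once the dendrogram suggests a number of clusters, the tree can also be cut programmatically with SciPy's fcluster instead of being read off by eye. A small sketch on synthetic data (the data is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# two well-separated groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2))])

Z = linkage(X, method='ward')                    # same linkage the dendrogram plots
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
```

`fcluster` returns cluster labels starting at 1; here each of the two groups receives its own label.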
Using the dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()
OutputDendrogram
Fitting Hierarchical Clustering to the dataset
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 5, affinity = 'euclidean', linkage = 'ward')
# note: in scikit-learn >= 1.2 the 'affinity' parameter is renamed to 'metric'
y_hc = hc.fit_predict(X)
Output of Hierarchical Clustering
Reference: 5. Types of Clustering Algorithms in ML.pdf