1. AIML IE2 Project
Gr.No.-10 (P-10)
Unsupervised k-means
clustering
Group members – TYETB078 Vidit Agarwal
TYETB079 Randhirsinh Bhosale
TYETB080 Hetal Gupta
TYETB081 Isha Shetty
TYETB082 Vishwatej Katkar
Guide name - Dr. Rajani P.K.
ELECTRONICS & TELECOMMUNICATION ENGINEERING
Pimpri Chinchwad College of Engineering , Pune
2. OUTLINE OF PROJECT
• Introduction
• Literature review
• Need of the Project
• Problem Statement, Aim, Objectives
• Methodology
• Block Diagram
• Algorithm and Flowchart
• Software Implementations
• Advantages & Applications
• Work Done till date
• Results & Analysis
• Conclusion
• References
• Project Outcome
2
17-10-2023 Unsupervised K-means Clustering Algorithm
3. Introduction
• Clustering is a useful tool in data science. It is a method for
finding cluster structure in a data set that is characterized by
the greatest similarity within the same cluster and the
greatest dissimilarity between different clusters.
• It is known that the k-means algorithm is the oldest and
popular partitional method. The k-means clustering has been
widely studied with various extensions in the literature and
applied in a variety of substantive areas. However, these k-
means clustering algorithms are usually affected by
initialization sand need to be given a number of clusters a
priori. In general, the cluster number is unknown. In this
case, validity indices can be used to find a cluster number
where they are supposed to be independent of clustering
algorithms.
3
17-10-2023 Unsupervised K-means Clustering Algorithm
4. Literature Review
Title Author Objective Algorithm Methodology Conclusion
Unsupervi
sed K-
Means
Clustering
Algorithm
KRISTINA P.
SINAGA AND
MIIN-SHEN
YANG
The k-means
algorithm is
generally the most
known and used
clustering method.
There are Various
extensions of k-
means to be
proposed in the
literature.
Although it is an
unsupervised
learning to
clustering in
pattern
recognition and
machine learning,
K means
clustering
algorithm
Clustering is a useful
tool in data science. It is
a method for finding
cluster structure in a
data set that is
characterized by the
greatest similarity within
the same cluster and the
greatest dissimilarity
between different
clusters.
In this paper we
propose a new
schema with a
learning Frame
work for the
kmeans clustering
algorithm. We
adopt the merit of
entropy-type
penalty terms to
construct a
competition
schema.
17-10-2023 Unsupervised K-means Clustering Algorithm 4
5. 17-10-2023 Unsupervised K-means Clustering Algorithm 5
Title Author Objective Algorithm Methodology Conclusion
Unsupervi
sed K-
Means
Clustering
Algorithm
Yogiraj Singh
Kushawah,
Ashish
Mohan Yadav
Data mining are
data analysis
supported
unsupervised
clustering
algorithm is one of
the quickest
growing research
areas because of
availability of huge
quantity of data
analysis and
extract usefully
information based
on new improve
performance of
clustering
algorithm.
K means
clustering(PCA)
Data mining is that the
process of extracting
useful and hidden
information or data from
large datasets. the data
therefore extracted will
be used to improve an
unsupervised clustering.
Clustering is an
unsupervised technique
that groups the
knowledge of similar
objects with minimum
cluster distance into a
cluster by eliminating
the inappropriate data
objects
Data mining is to
divide clusters from
large data set and
transform it into an
understandable
extract information
.Clustering plays a
very important task
in data mining.
6. • Title
• Author
• Objective
17-10-2023 Unsupervised K-means Clustering Algorithm 6
Title Author Objective Algorith
m
Methodology Conclusion
Unsupervi
sed K-
Means
Clustering
Algorithm
Salim Dridi
M
Many unsupervised
learning approaches
and algorithms have
been introduced
since the last decade
where are well-
known and widely
used algorithms of
unsupervised
learning. The growing
interest in applying
unsupervised
learning techniques
forms a great success
in fields.
K means
clustering
Unsupervised learning
refers to algorithms to
identify patterns in data
sets containing data
points that are neither
classified nor labelled.
The algorithms are thus
allowed to classify, label
and group the data
points within the data
sets without having any
external guidance in
performing that task.
The users do not need to
supervise the model
K-Means Clustering
is an Unsupervised
Learning algorithm.
It arranges the
unlabeled dataset
into several clusters
where “K” denotes
K-Means Clustering
is an Unsupervised
Learning algorithm.
It arranges the
unlabeled dataset
into several clusters
where “K” denotes
clusters.
7. Need of the project
• Unsupervised learning methods do not need so many labels
of data, so this reduces some requirements of the collected
data. Unsupervised learning relies on its unique training
system, and classifies original data with little or no already
known label data. Unsupervised learning is accomplished
through feedback and a series of rewards and punishments.
• K-Means is a type of Partitioning Clustering and hence one
of the simplest yet powerful machine learning algorithms. As
it is an unsupervised algorithm, K-Means makes its
inferences from datasets using only input vectors without
referring to labeled outcomes.
17-10-2023 Unsupervised K-means Clustering Algorithm 7
8. Problem Statement
• There are various extensions of k-means to be proposed in the literature.
Although it is an unsupervised learning to clustering in pattern
recognition and machine learning, the k-means algorithm and its
extensions are always influenced by initializations with a necessary
number of clusters a priori. That is, the k-means algorithm is not exactly
an unsupervised clustering method. In this paper, they construct an
unsupervised learning schema for the k-means algorithm so that it is free
of initializations without parameter selection and can also simultaneously
find an optimal number of clusters. That is, they propose a novel
unsupervised k-means (U-k-means) clustering algorithm with
automatically finding an optimal number of clusters without giving any
initialization and parameter selection. The computational complexity of
the proposed U-k-means clustering algorithm is also analyzed.
17-10-2023 Unsupervised K-means Clustering Algorithm 8
9. Aim, Objective
• To construct an unsupervised learning schema for the k-
means algorithm so that it is free of initializations without
parameter selection and can also simultaneously find an
optimal number of clusters.
• To propose a novel unsupervised k-means (U-k-means)
clustering algorithm with automatically finding an optimal
number of clusters without giving any initialization and
parameter selection.
• To analyze the computational complexity of the proposed U-
k-means clustering algorithm
17-10-2023 Unsupervised K-means Clustering Algorithm 9
11. U-k-means clustering Algorithm
17-10-2023 Unsupervised K-means Clustering Algorithm 11
1.Fix ε>0 . Give initial c(0)=n , α(0)k=1/n , a(0)k=xi , and initial
learning rates γ(0)=β(0)=1 . Set t=0 .
2.Compute z(t+1)ik using a(t)k , α(t)k,c(t),γ(t),β(t).
3.Compute γ(t+1)
4.Update α(t+1)k with z(t+1)ik and α(t)k
5.Compute β(t+1) with α(t+1) and α(t)
6. Update c(t) to c(t+1) by discard those clusters
with α(t+1)k≤1/n and adjust α(t+1)k and z(t+1)ik
7.IF t≥60 and c(t−60)−c(t)=0 , THEN let β(t+1)=0 .
7.Update a(t+1)k with c(t+1) and z(t+1)ik
8.Compare a(t+1)k and a(t)k .
IF max1≤k≤c(t)∥∥a(t+1)k−a(t)k∥∥<ε , THEN Stop.
ELSE t=t+1 and return to Step 2.
13. Software implementations
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.cluster import Kmeans
# Define the U-k-means algorithm
class UKMeans:
def __init__(self, n_clusters=3, fuzziness=2, max_iter=100):
self.n_clusters = n_clusters
self.fuzziness = fuzziness
self.max_iter = max_iter
17-10-2023 Unsupervised K-means Clustering Algorithm 13
14. def fit(self, X):
n = X.shape[0]
C = X[np.random.choice(n, self.n_clusters), :]
U = np.random.rand(n, self.n_clusters)
U = U / np.sum(U, axis=1, keepdims=True)
for i in range(self.max_iter):
U_old = U.copy()
for j in range(self.n_clusters):
C[j, :] = np.sum(U[:, j].reshape(-1, 1) * X, axis=0) / np.sum(U[:, j])
dist = np.linalg.norm(X[:, :, np.newaxis] - C.T[np.newaxis, :, :], axis=1)
U = 1 / (dist ** (2 / (self.fuzziness - 1)))
U = U / np.sum(U, axis=1, keepdims=True)
if np.allclose(U, U_old):
break
17-10-2023 Unsupervised K-means Clustering Algorithm 14
15. self.centroids_ = C
self.labels_ = np.argmax(U, axis=1)
# Load the iris dataset
iris = load_iris()
# Preprocess the data
X = iris.data
y = iris.target
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Perform clustering using U-k-means algorithm
best_accuracy = 0
best_model = None
for n_clusters in range(2, 6):
17-10-2023 Unsupervised K-means Clustering Algorithm 15
16. or fuzziness in [1.5, 2, 2.5]:
for max_iter in [50, 100, 200]:
model = UKMeans(n_clusters=n_clusters, fuzziness=fuzziness,
max_iter=max_iter)
model.fit(X)
accuracy = np.mean(model.labels_ == y)
if accuracy > best_accuracy:
best_accuracy = accuracy
best_model = model
# Print the best accuracy and corresponding parameters
print(f"Best accuracy: {best_accuracy}")
print(f"Number of clusters: {best_model.n_clusters}")
print(f"Fuzziness: {best_model.fuzziness}")
print(f"Maximum number of iterations: {best_model.max_iter}"
17-10-2023 Unsupervised K-means Clustering Algorithm 16
17. # Visualize the clusters using scatter plot
sns.set_style("whitegrid")
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=best_model.labels_, palette="Set2")
sns.scatterplot(x=best_model.centroids_[:, 0], y=best_model.centroids_[:, 1], color="black", marker="X",
s=200, legend=False)
plt.title("Clustering using U-k-means algorithm")
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.show()
# Visualize the distribution of samples across clusters using bar graph
sns.set_style("whitegrid")
sns.countplot(x=best_model.labels_, palette="Set2")
plt.title("Distribution of samples across clusters")
plt.xlabel("Cluster")
plt.ylabel("Number of samples")
plt.show()
17-10-2023 Unsupervised K-means Clustering Algorithm 17
18. # Compute the confusion matrix and classification report
cm = confusion_matrix(y, best_model.labels_)
cr = classification_report(y, best_model.labels_)
# Print the confusion matrix and classification report
print("Confusion Matrix:")
print(cm)
print("Classification Report:")
print(cr)
# Plot the confusion matrix
sns.set_style("white")
sns.heatmap(cm, annot=True, cmap="Blues", fmt="d", xticklabels=iris.target_names,
yticklabels=iris.target_names)
plt.title("Confusion Matrix")
plt.xlabel("Predicted Class")
plt.ylabel("True Class")
plt.show()
17-10-2023 Unsupervised K-means Clustering Algorithm 18
19. Advantages
17-10-2023 Project Title 19
1. High Performance - K-Means algorithm has linear time complexity and
it can be used with large datasets conveniently. With unlabeled big data K-
Means offers many insights and benefits as an unsupervised clustering
algorithm.
2. Unlabeled Data -This one is a general unsupervised machine learning
algorithm that also applies to K-Means.If your data has no labels (class
values or targets) or even column headers, K-Means will still successfully
cluster your data.
3. Easy to Use - K-Means is also easy to use. It can be initialized using
default parameters in the Scikit-Learn implementation.
4. Result Interpretation - K-Means returns clusters which can be easily
interpreted and even visualized. This simplicity makes it highly useful in
some cases when you need a quick overview of the data segments.
20. Applications
K-Means clustering is used in a variety of examples or business cases in
real life, like:
• Academic performance - Based on the scores, students are categorized
into grades like A, B, or C.
• Diagnostic systems - The medical profession uses k-means in creating
smarter medical decision support systems, especially in the treatment of
liver ailments.
• Search engines - Clustering forms a backbone of search engines. When a
search is performed, the search results need to be grouped, and the search
engines very often use clustering to do this.
• Wireless sensor networks - The clustering algorithm plays the role of
finding the cluster heads, which collect all the data in its respective
cluster.
17-10-2023 Unsupervised K-means Clustering Algorithm 20
21. Results and analysis
17-10-2023 Unsupervised K-means Clustering Algorithm 21
The U-k-means algorithm is a variant of the k-means algorithm that assigns
each data point a fuzzy membership to each cluster, instead of assigning it to
only one cluster. The implementation of the U-k-means algorithm was done
using the Python programming language and the scikit-learn library.
The best accuracy achieved using the U-k-means algorithm was 0.83, which
was obtained for the following hyperparameters: 3 clusters, fuzziness value
of 1.5, and maximum number of iterations of 50. The scatter plot showed the
clusters of the iris dataset, and the bar graph showed the distribution of
samples across the clusters.
The confusion matrix and classification report were computed to evaluate the
performance of the U-k-means algorithm. The confusion matrix showed the
number of true positives, true negatives, false positives, and false negatives,
while the classification report showed the precision, recall, F1-score, and
support for each class. The performance of the U-k-means algorithm was not
as good as that of supervised learning algorithms, but it is still a useful tool
for unsupervised learning tasks.
23. Results and analysis
17-10-2023 Unsupervised K-means Clustering Algorithm 23
Confusion Matrix For U-k-means algorithm which introduce
different performance measures
24. Conclusion
17-10-2023 Unsupervised K-means Clustering Algorithm 24
• In this paper a new scheme is proposed with a learning
framework for the k-means clustering algorithm. During
iterations, the U-k-means algorithm will discard extra
clusters, and then an optimal number of clusters can be
automatically found according to the structure of data. The
advantages of U-k-means are free of initializations and
parameters that also robust to different cluster volumes and
shapes with automatically finding the number of clusters.
The proposed U-k-means algorithm was performed on
several synthetic and real data sets and also compared with
most existing algorithms, such as R-EM, C-FS, k-means
with the true number c, k-means + gap, and X-means
algorithms. The results actually demonstrate the superiority
of the U-k-means clustering algorithm.
25. References
[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Englewood Cliffs, NJ, USA: Prentice-Hall,
1988.
[2] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. New
York, NY, USA: Wiley, 1990.
[3] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. New
York, NY, USA: Marcel Dekker, 1988.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, ‘‘Maximum likelihood from incomplete data via the EM
algorithm (with discussion),’’ J. Roy. Stat. Soc., Ser. B, Methodol., vol. 39, no. 1, pp. 1–38, 1977.
[5] J. Yu, C. Chaomurilige, and M.-S. Yang, ‘‘On convergence and parameter selection of the EM and DA-
EM algorithms for Gaussian mixtures,’’ Pattern Recognit., vol. 77, pp. 188–203, May 2018.
[6] A. K. Jain, ‘‘Data clustering: 50 years beyond K-means,’’ Pattern Recognit. Lett., vol. 31, no. 8, pp.
651–666, Jun. 2010.
[7] M.-S. Yang, S.-J. Chang-Chien, and Y. Nataliani, ‘‘A fully-unsupervised possibilistic C-Means
clustering algorithm,’’ IEEE Access, vol. 6, pp. 78308–78320, 2018.
[8] J. MacQueen, ‘‘Some methods for classification and analysis of multivariate observations,’’ in Proc. 5th
Berkeley Symp. Math. Statist. Probab., vol. 1, 1967, pp. 281–297.
[9] M. Alhawarat and M. Hegazi, ‘‘Revisiting K-Means and topic modeling, a comparison study to cluster
arabic documents,’’ IEEE Access, vol. 6, pp. 42740–42749, 2018.
17-10-2023 Unsupervised K-means Clustering Algorithm 25
26. Project Outcome
• The k-means algorithm is generally the most known and used clustering
method. There are various extensions of k-means to be proposed in the
literature. Although it is an unsupervised learning to clustering in pattern
recognition and machine learning, the k-means algorithm and its
extensions are always influenced by initializations with a necessary
number of clusters a priori.
• The outcome of this project was an unsupervised learning schema for the
k-means algorithm so that it is free of initializations without parameter
selection and can also simultaneously find an optimal number of clusters.
• Also, we analyzed the computational complexity of the proposed U-k-
means clustering algorithm.
17-10-2023 Unsupervised K-means Clustering Algorithm 26