This document summarizes a research paper that proposes a novel approach to improving the k-means clustering algorithm. The standard k-means algorithm is computationally expensive and produces results that depend heavily on the initial centroid selection. The proposed approach determines initial centroids systematically and uses a heuristic to efficiently assign data points to clusters. It improves both the accuracy and efficiency of k-means clustering by ensuring the entire process takes O(n²) time without sacrificing cluster quality.
New Approach for K-mean and K-medoids Algorithm (Editor IJCATR)
K-means and K-medoids clustering algorithms are widely used for many practical applications. The original k-means and k-medoids algorithms select initial centroids and medoids randomly, which affects the quality of the resulting clusters and sometimes generates unstable and empty clusters that are meaningless. The original algorithms are also computationally expensive, requiring time proportional to the product of the number of data items, the number of clusters, and the number of iterations. The new approach for the k-means algorithm eliminates this deficiency of the existing k-means: it first calculates the initial centroids according to the requirements of users and then produces better, more effective, and stable clusters. It also takes less execution time because it eliminates unnecessary distance computations by using results from the previous iteration. The new approach for k-medoids selects initial medoids systematically based on initial centroids. It generates stable clusters to improve accuracy.
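The distance-computation shortcut this abstract mentions (reusing the previous iteration's distances) can be sketched roughly as follows. This is a minimal illustration of the general idea, not the authors' exact method; the function names and data layout are assumptions. If a point is now at least as close to its current centroid as it was in the previous iteration, it cannot have a nearer centroid it was not already assigned to in common benign cases, so the full scan over all centroids is skipped for that point.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_with_pruning(points, centroids, labels, prev_dists):
    """One assignment pass that skips full distance scans when possible.

    labels[i] is point i's cluster from the previous iteration and
    prev_dists[i] its distance to that cluster's (old) centroid.
    """
    new_labels, new_dists, skipped = [], [], 0
    for i, p in enumerate(points):
        d_same = dist(p, centroids[labels[i]])
        if d_same <= prev_dists[i]:
            # Point moved closer to (or stayed with) its centroid:
            # keep it, skipping distances to every other centroid.
            new_labels.append(labels[i])
            new_dists.append(d_same)
            skipped += 1
        else:
            # Fall back to the usual full scan over all centroids.
            ds = [dist(p, c) for c in centroids]
            j = ds.index(min(ds))
            new_labels.append(j)
            new_dists.append(ds[j])
    return new_labels, new_dists, skipped
```

In a full implementation this pass replaces the plain assignment step inside the k-means loop, with the returned distances fed into the next iteration.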
Optimising Data Using K-Means Clustering Algorithm (IJERA Editor)
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations cause different results; the better choice is to place them as far away from each other as possible.
K-Means clustering uses an iterative procedure that is very sensitive to and dependent upon the initial centroids. The initial centroids in k-means clustering are chosen randomly, and hence the clustering also changes with respect to the initial centroids. This paper tries to overcome this problem of random centroid selection, and the resulting change of clusters, with a premeditated selection of initial centroids. We have used the iris, abalone and wine data sets to demonstrate that the proposed method of finding the initial centroids and using them in the k-means algorithm improves the clustering performance. The clustering also remains the same in every run, as the initial centroids are not selected randomly but through a premeditated method.
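One deterministic seeding scheme in this spirit is maximin (farthest-point) initialization: start from the point nearest the overall data mean, then repeatedly add the point farthest from the centroids chosen so far. This is a sketch only; the paper's exact "premeditated" method may differ.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_point_init(points, k):
    """Deterministic maximin seeding: same input always gives same seeds."""
    # Start from the point closest to the overall mean of the data.
    mean = tuple(sum(c) / len(points) for c in zip(*points))
    centroids = [min(points, key=lambda p: dist(p, mean))]
    # Greedily add the point farthest from all centroids chosen so far.
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids
```

Because no randomness is involved, every run over the same data yields the same initial centroids, and therefore the same final clustering.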
An improvement in k mean clustering algorithm using better time and accuracy (ijpla)
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. In the k-means algorithm, the data is partitioned into K clusters, and the data points are randomly assigned to the clusters, resulting in clusters with roughly the same number of data points. This paper proposes a new K-means clustering algorithm in which the initial centroids are calculated systematically instead of being assigned randomly, improving both accuracy and time.
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R... (ijscmc)
Face recognition is one of the most unobtrusive biometric techniques that can be used for access control as well as surveillance purposes. Various methods for implementing face recognition have been proposed, with varying degrees of performance in different scenarios. The most common issue with effective facial biometric systems is high susceptibility to variations in the face owing to different factors like changes in pose, varying illumination, different expressions, presence of outliers, noise, etc. This paper explores a novel technique for face recognition by performing classification of the face images using an unsupervised learning approach through K-Medoids clustering. The Partitioning Around Medoids (PAM) algorithm has been used for performing K-Medoids clustering of the data. The results are suggestive of increased robustness to noise and outliers in comparison to other clustering methods. Therefore, the technique can also be used to increase the overall robustness of a face recognition system, thereby increasing its invariance and making it a reliably usable biometric modality.
Slides for an introductory session on K-Means Clustering.
A simple and good PPT. Could be used for teaching Clustering Algorithms for Data Mining to MCA students.
Prepared by K. T. Thomas, HOD of Computer Science, Santhigiri College, Vazhithala.
k-Means is a rather simple but well-known algorithm for grouping objects, i.e. clustering. Again, all objects need to be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) they wish to identify. Each object can be thought of as a feature vector in an n-dimensional space, n being the number of features used to describe the objects to cluster. The algorithm then randomly chooses k points in that vector space; these points serve as the initial centers of the clusters. Afterwards, all objects are assigned to the center they are closest to. Usually, the distance measure is chosen by the user and determined by the learning task. After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until the process converges; the algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning the distance measure, initial center choice, and computation of new average centers have been explored, as well as the estimation of the number of clusters k, yet the main principle always remains the same. In this project we will discuss the K-means clustering algorithm, its implementation, and its application to the problem of unsupervised learning.
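The loop described above can be sketched in a few lines of Python. This is a minimal illustration: it seeds deterministically from the first k points (a real implementation would randomize or use a smarter scheme) and stops when assignments no longer change.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of coordinate tuples.

    Returns (labels, centroids): labels[i] is the cluster index of
    points[i]; centroids[j] is the mean of cluster j's members.
    """
    # Deterministic seeding from the first k points (sketch only).
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments unchanged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # guard against an empty cluster
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids
```

On two well-separated groups, e.g. `[(0, 0), (0, 1), (10, 10), (10, 11)]` with k=2, the loop settles after a couple of iterations with each group in its own cluster.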
Anchor Positioning using Sensor Transmission Range Based Clustering for Mobil... (ijdmtaiir)
In wireless sensor networks, energy efficiency is a major concern, as the sensors have minimal energy capacity. Since sensor energy consumption plays a vital role in determining the network lifetime, many strategies have been proposed for energy conservation. One of them is mobile data gathering (MDG). In mobile data gathering, either the sink can go on tour to collect the data, or a Mobile Data Collector (MDC) can collect the data from sensors and drop it back at the static sink. In this paper, a static sink with a mobile data collector is considered as the topic of interest. The mobile data collector reaches the cluster of sensors at appropriate locations called anchor points. For the formation of clusters, many algorithms have been proposed so far; two of them, Square Grid based Clustering (SGC) and Sensor Transmission based Clustering (STC), are analysed here for the selection of anchors in the anchor-based mobile data gathering approach.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
A brief description of clustering, two relevant clustering algorithms (K-means and Fuzzy C-means), clustering validation, and two internal validity indices (Dunn and Davies-Bouldin).
Parametric Comparison of K-means and Adaptive K-means Clustering Performance ... (IJECEIAES)
Image segmentation plays a major role in analyzing the area of interest in image processing. Many researchers have used different types of techniques for analyzing images. One of the widely used techniques is K-means clustering. In this paper we use two algorithms: K-means, and an advance on K-means called adaptive K-means clustering. Both algorithms are used on different types of images with successful results. By comparing the time period, PSNR and RMSE values from the results of both algorithms, we show that the adaptive K-means clustering algorithm gives better results than K-means clustering in image segmentation.
K-means Clustering Method for the Analysis of Log Data (idescitation)
Clustering analysis is one of the main analytical methods in data mining, and the choice of clustering algorithm directly influences the clustering results. This paper discusses the standard k-means clustering algorithm and analyzes its shortcomings. The paper also focuses on web usage mining to analyze the data for pattern recognition; with the help of the k-means algorithm, patterns are identified.
Experimental study of Data clustering using k-Means and modified algorithms (IJDKP)
The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. Clustering algorithms have broad attraction and usefulness in exploratory data analysis. This paper presents the results of an experimental study of different approaches to k-Means clustering, comparing results on different datasets using original k-Means and other modified algorithms implemented in MATLAB R2009b. The results are calculated on performance measures such as number of iterations, number of points misclassified, accuracy, Silhouette validity index, and execution time.
Mine Blood Donors Information through Improved K-Means Clustering (ijcsity)
The number of accidents and health diseases, which are increasing at an alarming rate, is resulting in a huge increase in the demand for blood. There is a necessity for organized analysis of blood donor databases and blood bank repositories. Clustering analysis is one of the data mining applications, and the K-means clustering algorithm is the fundamental algorithm for modern clustering techniques. The K-means clustering algorithm is a traditional, iterative approach: at every iteration, it computes the distance from the centroid of each cluster to each and every data point. This paper improves the original k-means algorithm by improving the initial centroids using the distribution of the data. Results and discussions show that the improved K-means algorithm produces accurate clusters in less computation time to find donor information.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
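That nearest-mean rule can be written directly. A small sketch (illustrative names) that partitions a list of points into the Voronoi cells induced by a set of means:

```python
import math

def voronoi_cells(points, means):
    """Partition points into cells by nearest mean (Euclidean distance)."""
    cells = {i: [] for i in range(len(means))}
    for p in points:
        # Each point falls in the cell of the mean it is closest to.
        nearest = min(range(len(means)), key=lambda i: math.dist(p, means[i]))
        cells[nearest].append(p)
    return cells
```

For means at (0, 0) and (10, 0), a point at (4, 0) lands in the first cell because it lies on that side of the boundary halfway between the two means.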
A survey on Efficient Enhanced K-Means Clustering Algorithm (ijsrd.com)
Data mining is the process of using technology to identify patterns and prospects from large amounts of information. In data mining, clustering is an important research topic with a wide range of unsupervised classification applications. Clustering is a technique that divides data into meaningful groups. K-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. In this paper, we present a comparison of different K-means clustering algorithms.
Unsupervised learning Algorithms and Assumptions (refedey275)
Topics :
Introduction to unsupervised learning
Unsupervised learning Algorithms and Assumptions
K-Means algorithm – introduction
Implementation of K-means algorithm
Hierarchical Clustering – need and importance of hierarchical clustering
Agglomerative Hierarchical Clustering
Working of dendrogram
Steps for implementation of AHC using Python
Gaussian Mixture Models – Introduction, importance and need of the model
Normal (Gaussian) distribution
Implementation of Gaussian mixture model
Understand the different distance metrics used in clustering
Euclidean, Manhattan, Cosine, Mahalanobis
Features of a Cluster – Labels, Centroids, Inertia, Eigenvectors and Eigenvalues
Principal component analysis
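Three of the four distance metrics listed in the outline are easy to state directly; Mahalanobis additionally needs the data's covariance matrix and is omitted from this small sketch:

```python
import math

def euclidean(a, b):
    """Straight-line distance: sqrt of summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)
```

For example, between (0, 0) and (3, 4) the Euclidean distance is 5 while the Manhattan distance is 7; orthogonal vectors have cosine distance 1.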
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
Types of Hierarchical Clustering
There are mainly two types of hierarchical clustering:
Agglomerative hierarchical clustering
Divisive Hierarchical clustering
A distribution in statistics is a function that shows the possible values for a variable and how often they occur.
In probability theory and statistics, the Normal Distribution, also called the Gaussian Distribution, is the most significant continuous probability distribution.
Sometimes it is also called a bell curve.
Spectroscopy or hyperspectral imaging consists of the acquisition, analysis, and extraction of the spectral information measured on a specific region or object using an airborne or satellite device. Hyperspectral imaging has recently become an active field of research. One way of analysing such data is through clustering; however, due to the high dimensionality of the data and the small distance between the different material signatures, clustering such data is a challenging task. In this paper, we empirically compared five clustering techniques on different hyperspectral data sets. The considered clustering techniques are K-means, K-medoids, fuzzy C-means, hierarchical, and density-based spatial clustering of applications with noise. Four data sets are used for this purpose: Botswana, Kennedy Space Center, Pavia, and Pavia University. Besides accuracy, we adopted four more similarity measures: Rand statistics, Jaccard coefficient, Fowlkes-Mallows index, and Hubert index. According to accuracy, we found that fuzzy C-means clustering performs better on the Botswana and Pavia data sets, K-means and K-medoids give better results on the Kennedy Space Center data set, and for Pavia University hierarchical clustering is better.
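Two of the pair-counting indices mentioned above, the Rand statistic and the Jaccard coefficient, compare two clusterings by how they treat each pair of points. A small sketch of how they might be computed (illustrative function names; real evaluations would use an optimized library routine):

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """Count point pairs by whether the two clusterings agree on them."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            a += 1          # together in both clusterings
        elif same_a and not same_b:
            b += 1          # together only in the first
        elif not same_a and same_b:
            c += 1          # together only in the second
        else:
            d += 1          # separated in both clusterings
    return a, b, c, d

def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    a, b, c, d = pair_counts(labels_a, labels_b)
    return (a + d) / (a + b + c + d)

def jaccard_coefficient(labels_a, labels_b):
    """Agreement restricted to pairs grouped together in at least one clustering."""
    a, b, c, _ = pair_counts(labels_a, labels_b)
    return a / (a + b + c)
```

Identical clusterings score 1.0 on both indices; the Jaccard coefficient is stricter because it ignores pairs that both clusterings keep apart.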
Fitness function X-means for prolonging wireless sensor networks lifetime (IJECEIAES)
X-means and k-means are clustering algorithms proposed as a solution for prolonging wireless sensor network (WSN) lifetime. In general, X-means overcomes k-means limitations such as a predetermined number of clusters. The main concept of X-means is to create a network with basic clusters called parents and then generate (j) children clusters by splitting parents. X-means does not provide any criteria for splitting parent clusters, nor does it provide a method to determine the acceptable number of children. This article proposes fitness function X-means (FFX-means) as an enhancement of X-means; FFX-means has a new method that determines whether parent clusters are worth splitting based on predefined network criteria, and later determines the number of children. Furthermore, FFX-means proposes a new cluster-head selection method, where the cluster-head is selected based on the remaining energy of the node and the intra-cluster distance. The simulation results show that FFX-means extends network lifetime by 11.5% over X-means and 75.34% over k-means. Furthermore, the results show that FFX-means balances the nodes' energy consumption, and nearly all nodes deplete their energy within an acceptable range of simulation rounds.
The International Journal of Engineering and Science (IJES)
Volume 2, Issue 10, Pages 20-23, 2013
ISSN(e): 2319-1813, ISSN(p): 2319-1805
Novel Approach of K-Mean Algorithm
Nikhil Chaturvedi (1), Anand Rajavat (2)
(1) Dept. of C.S.E., S.V.I.T.S., Indore; (2) Assoc. Prof., S.V.I.T.S., Indore
ABSTRACT
Emergence of modern techniques for scientific data collection has resulted in large-scale accumulation of data
pertaining to diverse fields. Conventional database querying methods are inadequate to extract useful
information from huge data banks. Cluster analysis is one of the major data analysis methods and the k-means
clustering algorithm is widely used for many practical applications. But the original k-means algorithm is
computationally expensive and the quality of the resulting clusters heavily depends on the selection of initial
centroids. Several methods have been proposed in the literature for improving the performance of the k-means
clustering algorithm. This paper proposes a method for making the algorithm more effective and efficient, so as
to get better clustering with reduced complexity.
KEY WORDS: Data Analysis, Clustering, k-means Algorithm, Enhanced k-means Algorithm.
Date of Submission: 26 September 2013
Date of Publication: 20 October 2013
I. INTRODUCTION
Advances in scientific data collection methods have resulted in the large scale accumulation of
promising data pertaining to diverse fields of science and technology. Owing to the development of novel
techniques for generating and collecting data, the rate of growth of scientific databases has become tremendous.
Hence it is practically impossible to extract useful information from them by using conventional database
analysis techniques. Effective mining methods are absolutely essential to unearth implicit information from
huge databases. Cluster analysis [3] is one of the major data analysis methods which is widely used for many
practical applications in emerging areas like Bioinformatics [1, 2]. Clustering is the process of partitioning a
given set of objects into disjoint clusters. This is done in such a way that objects in the same cluster are similar
while objects belonging to different clusters differ considerably, with respect to their attributes. The k-means
algorithm [3, 4, 5, 6, 7] is effective in producing clusters for many practical applications. But the computational
complexity of the original k-means algorithm is very high, especially for large data sets. Moreover, this
algorithm produces different clusterings depending on the random choice of initial centroids. Several
attempts have been made by researchers to improve the accuracy and efficiency of the k-means clustering
algorithm.
II. BASIC K-MEANS CLUSTERING ALGORITHM
The K-Means clustering algorithm is a partition-based cluster analysis method [10]. The algorithm first
selects k objects as initial cluster centers, then calculates the distance between each object and each cluster
center and assigns each object to the nearest cluster, updates the averages of all clusters, and repeats this
process until the criterion function converges. The square-error criterion for clustering is

E = Σ_{i=1}^{k} Σ_{j=1}^{n_i} (x_{ij} − m_i)²,

where x_{ij} is sample j of class i, m_i is the center of class i, and n_i is the number of samples of class i.
The algorithm process is shown in Fig 2.1. The K-means clustering algorithm is briefly described as follows:
Input: N objects to be clustered (x_1, x_2, ..., x_N); the number of clusters k.
Output: k clusters such that the sum of the dissimilarities between each object and its nearest cluster center is the smallest.
1) Arbitrarily select k objects as the initial cluster centers (m_1, m_2, ..., m_k);
2) Calculate the distance between each object x_i and each cluster center, then assign each object to the nearest
cluster; the distance is calculated as d(x_i, m_j) = sqrt( Σ_l (x_{il} − m_{jl})² ), i = 1, ..., N; j = 1, ..., k,
where d(x_i, m_j) is the distance between object i and cluster center j and l ranges over the attributes;
3) Calculate the mean of the objects in each cluster as the new cluster center: m_i = (1/N_i) Σ_{x ∈ cluster i} x,
i = 1, 2, ..., k, where N_i is the number of samples in the current cluster i;
4) Repeat steps 2) and 3) until the criterion function E converges; return (m_1, m_2, ..., m_k). The algorithm terminates.
www.theijes.com
The IJES
Page 20
Fig 2.1: K-means algorithm process
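As a concrete illustration of steps 1)-4) above, the basic algorithm can be sketched in Python. This is a minimal sketch written for this summary (function and variable names are mine), not code from the paper:

```python
import math
import random

def euclidean(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def basic_kmeans(points, k, max_iter=100, seed=0):
    """Standard k-means: random initial centers, iterate until assignments stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: arbitrary initial centers
    labels = [None] * len(points)
    for _ in range(max_iter):
        # step 2: assign each object to the nearest center
        new_labels = [min(range(k), key=lambda j: euclidean(p, centers[j]))
                      for p in points]
        if new_labels == labels:             # step 4: convergence check on E
            break
        labels = new_labels
        # step 3: recompute each center as the mean of its cluster
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels
```

On well-separated toy data such as two tight groups of points, this recovers the two groups; on harder data the result depends on the random initial centers, which is exactly the deficiency the paper addresses.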
III. RELATED WORK
Several attempts were made by researchers to improve the effectiveness and efficiency of the k-means
algorithm [8]. A variant of the k-means algorithm is the k-modes method [9], which replaces the means of
clusters with modes. Like the k-means method, the k-modes algorithm also produces locally optimal solutions
which are dependent on the selection of the initial modes. The k-prototypes algorithm [9] integrates the k-means
and k-modes processes for clustering the data. In this method, the dissimilarity measure is defined by taking into
account both numeric and categorical attributes. The original k-means algorithm consists of two phases: one for
determining the initial centroids and the other for assigning data points to the nearest clusters and then
recalculating the cluster means. The second phase is carried out repetitively until the clusters get stabilized, i.e.,
data points stop crossing over cluster boundaries. Fang Yuan et al. [8] proposed a systematic method for finding
the initial centroids. The centroids obtained by this method are consistent with the distribution of data. Hence it
produced clusters with better accuracy, compared to the original k-means algorithm. However, Yuan’s method
does not suggest any improvement to the time complexity of the k-means algorithm. Fahim A M et al. proposed
an efficient method for assigning data-points to clusters. The original k-means algorithm is computationally very
expensive because each iteration computes the distances between data points and all the centroids. Fahim’s
approach makes use of two distance functions for this purpose: one similar to that of the k-means algorithm and
another based on a heuristic to reduce the number of distance calculations. But this method presumes that
the initial centroids are determined randomly, as in the case of the original k-means algorithm. Hence there is no
guarantee for the accuracy of the final clusters.
IV. PROPOSED ALGORITHM
In the novel clustering method discussed in this paper, both the phases of the original k-means
algorithm are modified to improve the accuracy and efficiency.
Input:
D = {d1, d2,......,dn} // set of n data items
k // Number of desired clusters
Output:
A set of k clusters.
Steps:
Phase 1: Determine the initial centroids of the clusters
Input:
D = {d1, d2,......,dn} // set of n data items
k // Number of desired clusters
Output: A set of k initial centroids.
Steps:
1. Set m = 1;
2. Compute the distance between each data point and all other data- points in the set D;
3. Find the closest pair of data points in the set D and form a data-point set Am (1 <= m <= k) which contains
these two data points; delete these two data points from the set D;
4. Find the data point in D that is closest to the data-point set Am; add it to Am and delete it from D;
5. Repeat step 4 until the number of data points in Am reaches 0.75*(n/k);
6. If m < k, then set m = m+1, find another pair of data points in D between which the distance is the
shortest, form another data-point set Am and delete them from D, and go to step 4;
7. For each data-point set Am (1 <= m <= k), find the arithmetic mean of the vectors of the data points in Am;
these means will be the initial centroids.
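The Phase 1 steps above can be sketched as follows. This is an illustrative implementation written for this summary (names are mine); ties in the closest-pair search are broken arbitrarily, which the paper does not specify:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def initial_centroids(data, k):
    """Phase 1 sketch: grow k data-point sets from closest pairs, then average each set."""
    D = list(data)                       # working copy; points are removed as they are used
    threshold = 0.75 * len(data) / k     # target set size 0.75*(n/k)
    centroids = []
    for m in range(k):
        # steps 3/6: find the closest remaining pair and seed the set A_m with it
        i, j = min(((a, b) for a in range(len(D)) for b in range(a + 1, len(D))),
                   key=lambda p: euclidean(D[p[0]], D[p[1]]))
        A = [D[i], D[j]]
        del D[j]; del D[i]               # j > i, so delete index j first
        # steps 4/5: repeatedly pull in the remaining point closest to the set A_m
        while len(A) < threshold and D:
            u = min(range(len(D)), key=lambda t: min(euclidean(D[t], a) for a in A))
            A.append(D.pop(u))
        # step 7: the initial centroid is the arithmetic mean of A_m
        centroids.append([sum(c) / len(A) for c in zip(*A)])
    return centroids
```

Note that only 0.75*(n/k) points per set are used to compute each centroid; the remaining points are assigned to clusters later, in Phase 2.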
Phase 2: Assign each data point to the appropriate clusters
Input:
D = {d1, d2,......,dn} // set of n data-points.
C = {c1, c2,.......,ck} // set of k centroids
Output:
A set of k clusters
Steps:
1. Compute the distance of each data-point di (1<=i<=n) to all the centroids cj (1<=j<=k) as d(di, cj);
2. For each data-point di, find the closest centroid cj and assign di to cluster j.
3. Set ClusterId[i]=j; // j:Id of the closest cluster
4. Set Nearest_Dist[i]= d(di, cj);
5. For each cluster j (1<=j<=k), recalculate the centroids;
6. Repeat
7. For each data-point di,
7.1 Compute its distance from the centroid of the present nearest cluster;
7.2 If this distance is less than or equal to the present nearest distance, the data-point stays in the
cluster;
Else
7.2.1 For every centroid cj (1<=j<=k) Compute the distance d(di, cj);
Endfor;
7.2.2 Assign the data-point di to the cluster with
the nearest centroid cj
7.2.3 Set ClusterId[i]=j;
7.2.4 Set Nearest_Dist[i]= d(di, cj);
Endfor;
8. For each cluster j (1<=j<=k), recalculate the centroids;
Until the convergence criterion is met.
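The Phase 2 steps above can be sketched as follows; the centroids produced by Phase 1 are passed in, and ClusterId and Nearest_Dist are tracked per point. This is a sketch written for this summary (names are mine), not the authors' code:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_clusters(data, centroids, max_iter=100):
    """Phase 2 sketch: initial assignment, then iterate with the nearest-distance heuristic."""
    k = len(centroids)
    # steps 1-4: initial assignment and bookkeeping
    cluster_id, nearest_dist = [], []
    for p in data:
        j = min(range(k), key=lambda c: euclidean(p, centroids[c]))
        cluster_id.append(j)
        nearest_dist.append(euclidean(p, centroids[j]))
    for _ in range(max_iter):
        # steps 5/8: recalculate centroids from current memberships
        for j in range(k):
            members = [p for p, c in zip(data, cluster_id) if c == j]
            if members:
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
        moved = False
        # steps 6-7: heuristic reassignment
        for i, p in enumerate(data):
            d_same = euclidean(p, centroids[cluster_id[i]])
            if d_same <= nearest_dist[i]:
                nearest_dist[i] = d_same       # 7.2: stays; skip the other k-1 distances
            else:
                ds = [euclidean(p, centroids[c]) for c in range(k)]  # 7.2.1: full scan
                j = ds.index(min(ds))
                if j != cluster_id[i]:
                    moved = True
                cluster_id[i] = j              # 7.2.3 / 7.2.4
                nearest_dist[i] = ds[j]
        if not moved:                          # convergence: no point crossed a boundary
            break
    return cluster_id, centroids
```

The key design point is the `d_same <= nearest_dist[i]` test: a point whose updated centroid has not moved away from it is kept in place with a single distance evaluation instead of k.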
In the first phase, the initial centroids are determined systematically so as to produce clusters with
better accuracy [8]. The second phase makes use of a variant of the clustering method of Fahim et al. discussed
above. It starts by forming the initial clusters based on the relative distance of each data point from the initial
centroids. These clusters are subsequently fine-tuned by using a heuristic approach, thereby improving the
efficiency. In the first phase, the distances between each data point and all other data points in the set D are
computed first.
Then find out the closest pair of data points and form a set A1 consisting of these two data points, and delete
them from the data point set D. Then determine the data point which is closest to the set A1, add it to A1 and
delete it from D. Repeat this procedure until the number of elements in the set A1 reaches a threshold. At that
point go back to the second step and form another data-point set A2. Repeat this till ’k’ such sets of data points
are obtained. Finally the initial centroids are obtained by averaging all the vectors in each data-point set. The
Euclidean distance is used for determining the closeness of each data point to the cluster centroids. The distance
between a vector X = (x1, x2, ..., xn) and a vector Y = (y1, y2, ..., yn) is

d(X, Y) = sqrt( (x1 − y1)² + (x2 − y2)² + ... + (xn − yn)² ).

The distance between a data point X and a data-point set D is defined as

d(X, D) = min { d(X, Y) : Y ∈ D }.

The initial centroids of the clusters are given as input to the second phase, for assigning data points to
appropriate clusters.
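The two distances above translate directly into code; a minimal sketch (the function names are mine):

```python
import math

def d_point(X, Y):
    # Euclidean distance between vectors X and Y
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(X, Y)))

def d_point_set(X, D):
    # distance between a point X and a data-point set D: the minimum over all Y in D
    return min(d_point(X, Y) for Y in D)
```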
The first step in Phase 2 is to determine the distance between each data-point and the initial centroids of
all the clusters. The data-points are then assigned to the clusters having the closest centroids. This results in an
initial grouping of the data-points. For each data-point, the cluster to which it is assigned (ClusterId) and its
distance from the centroid of the nearest cluster (Nearest_Dist) are noted. Inclusion of data-points in various
clusters may lead to a change in the values of the cluster centroids. For each cluster, the centroids are
recalculated by taking the mean of the values of its data points. Up to this step, the procedure is similar
to the original k-means algorithm, except that the initial centroids are computed systematically. The next stage is
an iterative process which makes use of a heuristic method to improve the efficiency. During the iteration, the
data-points may get redistributed to different clusters. The method involves keeping track of the distance
between each data-point and the centroid of its present nearest cluster. At the beginning of the iteration, the
distance of each data-point from the new centroid of its present nearest cluster is determined. If this distance is
less than or equal to the previous nearest distance, that is an indication that the data point stays in that cluster
itself, and there is no need to compute its distances from the other centroids. This results in saving the time required
to compute the distances to the other k-1 cluster centroids. On the other hand, if the new centroid of the present nearest
cluster is more distant from the data-point than its previous centroid, there is a chance for the data-point getting
included in another nearer cluster. In that case, it is required to determine the distance of the data-point from all
the cluster centroids, identify the new nearest cluster, and record the new value of the nearest distance. The loop
is repeated until no more data points cross cluster boundaries, which is the convergence criterion. The
heuristic method described above results in significant reduction in the number of computations and thus
improves the efficiency.
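To make the saving concrete, one can count distance evaluations in a single heuristic pass. The instrumentation, data, and numbers below are an illustration constructed for this summary, not measurements from the paper:

```python
import math

calls = {"n": 0}

def dist(a, b):
    # Euclidean distance, counting every evaluation
    calls["n"] += 1
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def heuristic_pass(data, centroids, cluster_id, nearest_dist):
    """One iteration: a point whose distance to its own updated centroid has not
    grown costs 1 distance evaluation instead of k."""
    k = len(centroids)
    for i, p in enumerate(data):
        d_same = dist(p, centroids[cluster_id[i]])
        if d_same <= nearest_dist[i]:
            nearest_dist[i] = d_same              # stays: 1 evaluation
        else:
            ds = [dist(p, c) for c in centroids]  # may have moved: k evaluations
            cluster_id[i] = ds.index(min(ds))
            nearest_dist[i] = ds[cluster_id[i]]
    return cluster_id

data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.5], [10.0, 10.5]]
cluster_id = [0, 0, 1, 1]
nearest_dist = [1.0, 1.0, 1.0, 1.0]   # previous nearest distances, assumed at least this large
heuristic_pass(data, centroids, cluster_id, nearest_dist)
print(calls["n"])   # prints 4: one evaluation per point, instead of n*k = 8
```

When clusters have stabilized, almost every point takes the cheap branch, so each iteration approaches n evaluations rather than n*k.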
V. CONCLUSION
The k-means algorithm is widely used for clustering large sets of data. But the standard algorithm does
not always guarantee good results, as the accuracy of the final clusters depends on the selection of the initial
centroids. Moreover, the computational complexity of the standard algorithm is objectionably high owing to the
need to reassign the data points a number of times during every iteration of the loop. This paper presents an
enhanced k-means algorithm which combines a systematic method for finding initial centroids and an efficient
way of assigning data points to clusters. This method ensures that the entire clustering process runs in O(n²) time
without sacrificing the accuracy of the clusters, whereas previous improvements of the k-means algorithm compromise
on either accuracy or efficiency. A limitation of the proposed algorithm is that the value of k, the number of
desired clusters, is still required as an input, regardless of the distribution of the data points.
Evolving some statistical methods to compute the value of k, depending on the data distribution, is suggested for
future research. A method for refining the computation of initial centroids is worth investigating.
REFERENCES:
[1] Amir Ben-Dor, Ron Shamir and Zohar Yakhini, "Clustering Gene Expression Patterns," Journal of Computational Biology, 6(3/4): 281-297, 1999.
[2] Daxin Jiang, Chun Tang and Aidong Zhang, "Cluster Analysis for Gene Expression Data," IEEE Transactions on Knowledge and Data Engineering, 16(11): 1370-1386, 2004.
[3] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, An Imprint of Elsevier, 2006.
[4] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Pearson Education, 2006.
[5] McQueen J, "Some methods for classification and analysis of multivariate observations," Proc. 5th Berkeley Symp. Math. Statist. Prob., (1): 281-297, 1967.
[6] Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, Pearson Education, 2007.
[7] Stuart P. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, 28(2): 129-136, 1982.
[8] Yuan F, Meng Z. H, Zhang H. X and Dong C. R, "A New Algorithm to Get the Initial Centroids," Proc. of the 3rd International Conference on Machine Learning and Cybernetics, pages 26-29, August 2004.
[9] Huang Z, "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Mining and Knowledge Discovery, (2): 283-304, 1998.
[10] C. C. Aggarwal, "A Human-Computer Interactive Method for Projected Clustering," IEEE Transactions on Knowledge and Data Engineering, 16(4): 448-460, 2004.