This document compares hierarchical and non-hierarchical clustering algorithms. It summarizes four algorithms: K-Means, K-Medoids, and Farthest First Clustering (hierarchical), and DBSCAN (non-hierarchical), describing each method's approach with pseudocode, along with the datasets and evaluation metrics used to compare the clustering methods' performance on different datasets.
A H-K Clustering Algorithm for High Dimensional Data Using Ensemble Learning (IJITCS)
Advances to traditional clustering algorithms address problems such as the curse of dimensionality and the sparsity of data across multiple attributes. The traditional H-K clustering algorithm can resolve the randomness and arbitrariness of the initial cluster centers in the K-means algorithm, but applying it to high dimensional data leads to the dimensional-disaster problem because of its high computational complexity. Advanced approaches such as subspace and ensemble clustering algorithms improve clustering performance on high dimensional datasets, each from a different aspect and to a different extent, but each improves performance from only a single perspective. The objective of the proposed model is to improve the performance of traditional H-K clustering and overcome its limitations, namely high computational complexity and poor accuracy on high dimensional data, by combining three approaches: a subspace clustering algorithm and an ensemble clustering algorithm with the H-K clustering algorithm.
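The core H-K move, seeding K-means with centers obtained from a hierarchical pass instead of random picks, can be sketched in plain Python. This is our own minimal illustration, not the proposed model, which additionally layers subspace and ensemble clustering:

```python
import math

def mean(pts):
    n = len(pts)
    return tuple(sum(p[d] for p in pts) / n for d in range(len(pts[0])))

def hierarchical_seeds(points, k):
    """Greedily merge the two closest groups until k remain; return their means."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        means = [mean(c) for c in clusters]
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: math.dist(means[ab[0]], means[ab[1]]))
        clusters[i] += clusters.pop(j)   # j > i, so clusters[i] is unaffected
    return [mean(c) for c in clusters]

def kmeans(points, centers, iters=20):
    """Standard Lloyd iterations starting from the given (deterministic) seeds."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            groups[k].append(p)
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers

data = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0), (9.2, 0.3)]
seeds = hierarchical_seeds(data, 3)   # no random initial centers
final = kmeans(data, seeds)
```

Because the seeds are derived deterministically from the data, repeated runs give identical clusterings, which is exactly the initial-center randomness that H-K addresses.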
Estimating Project Development Effort Using Clustered Regression Approach (cscpconf)
Due to the intangible nature of software, accurate and reliable software effort estimation is a challenge in the software industry. Very accurate estimates of software development effort are unlikely because of the inherent uncertainty in software development projects and the complex, dynamic interaction of the factors that affect development. Heterogeneity exists in software engineering datasets because the data comes from diverse sources. It can be reduced by defining relationships between the data values and classifying them into different clusters. This study focuses on how combining clustering and regression techniques can reduce the problems that data heterogeneity poses for predictive effectiveness. The clustered approach creates subsets of data with a degree of homogeneity that enhances prediction accuracy. The study also observed that ridge regression outperforms the other regression techniques used in the analysis.
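The cluster-then-regress pipeline can be sketched as follows. This toy is our own illustration, not the study's setup: it uses synthetic one-feature "size vs effort" data with two regimes, 1-D k-means to split them, and a closed-form single-feature ridge fit per cluster, showing how homogeneous subsets let each model recover its own regime:

```python
def ridge_1d(xs, ys, lam=0.1):
    """Closed-form ridge slope for y ~ w*x (no intercept): w = Sxy / (Sxx + lambda)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def kmeans_1d(xs, c0, c1, iters=25):
    """Two-center 1-D k-means; returns the final centers."""
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1

# two regimes: small projects with effort ~ 2x, large projects with effort ~ 5x
xs = [1, 2, 3, 30, 40, 50]
ys = [2, 4, 6, 150, 200, 250]
c0, c1 = kmeans_1d(xs, xs[0], xs[-1])
low  = [(x, y) for x, y in zip(xs, ys) if abs(x - c0) <= abs(x - c1)]
high = [(x, y) for x, y in zip(xs, ys) if abs(x - c0) > abs(x - c1)]
w_low  = ridge_1d([x for x, _ in low],  [y for _, y in low])    # ~2
w_high = ridge_1d([x for x, _ in high], [y for _, y in high])   # ~5
```

A single ridge fit over all six points would average the two regimes away; the per-cluster fits recover both slopes.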
Using Ontologies to Improve Document Classification with Transductive Support... (IJDKP)
Many applications of automatic document classification require learning accurately from little training data. Semi-supervised classification uses both labeled and unlabeled data for training. The technique has proven effective in some cases; however, using unlabeled data is not always beneficial.
At the same time, the emergence of web technologies has given rise to the collaborative development of ontologies. In this paper, we propose using ontologies to improve the accuracy and efficiency of semi-supervised document classification.
We used support vector machines, one of the most effective algorithms studied for text. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results applying our algorithm to three different datasets. Our experiments show an accuracy increase of 4% on average, and up to 20%, compared with the traditional semi-supervised model.
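As a generic illustration of the semi-supervised idea, not of the paper's TSVM-plus-ontologies algorithm, the sketch below runs a toy self-training loop: a nearest-centroid classifier trained on a few labeled points repeatedly absorbs whichever unlabeled point it is most confident about (largest margin between the two nearest class centroids):

```python
import math

def centroid(pts):
    return tuple(sum(p[d] for p in pts) / len(pts) for d in range(len(pts[0])))

# two labeled classes and three unlabeled points (toy 2-D data, ours)
labeled = {0: [(0.0, 0.0), (1.0, 0.0)], 1: [(8.0, 8.0), (9.0, 8.0)]}
unlabeled = [(0.5, 0.4), (8.5, 7.9), (4.0, 4.0)]

while unlabeled:
    cents = {c: centroid(pts) for c, pts in labeled.items()}

    def margin(p):
        # gap between the distances to the two nearest class centroids:
        # a large gap means a confident assignment
        ds = sorted(math.dist(p, v) for v in cents.values())
        return ds[1] - ds[0]

    p = max(unlabeled, key=margin)           # most confident point first
    unlabeled.remove(p)
    best = min(cents, key=lambda c: math.dist(p, cents[c]))
    labeled[best].append(p)                  # fold it into the training set
```

The ambiguous point (4.0, 4.0) is labeled last, after the confident points have refined the centroids, which is the intuition behind using unlabeled data carefully.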
Engineering Research Publication, International Journal of Engineering & Technical Research, ISSN: 2321-0869 (O), 2454-4698 (P), www.erpublication.org
The main goal of cluster analysis is to classify elements into groups based on their similarity. Clustering has many applications, in fields such as astronomy, bioinformatics, bibliography, and pattern recognition. This paper presents a survey of clustering methods and techniques, identifying their advantages and disadvantages, to give a solid background for choosing the best method for extracting strong association rules.
A Heterogeneous Population-Based Genetic Algorithm for Data Clustering (IJEEI-IAES)
As a primary data mining method for knowledge discovery, clustering classifies a dataset into groups of similar objects. K-means, the most popular data clustering method, suffers from the drawback of requiring the number of clusters and their initial centers to be provided by the user. In the literature, several methods have been proposed, as k-means variants, genetic algorithms, or combinations of the two, for calculating the number of clusters and finding proper cluster centers. However, none of these solutions has provided satisfactory results, and determining the number of clusters and the initial centers remains the main challenge in clustering. In this paper we present an approach that automatically generates these parameters to achieve optimal clusters, using a modified genetic algorithm that operates on varied individual structures and uses a new crossover operator. Experimental results show that our modified genetic algorithm is a more efficient alternative to the existing approaches.
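The kind of genetic algorithm the abstract describes, where the number of clusters is evolved rather than supplied, can be sketched as follows. All operators here (size-penalised fitness, a prefix-swap crossover that lets individuals change length, Gaussian mutation) are our own simplifications, not the paper's operators:

```python
import random
random.seed(7)

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1]

def fitness(centers):
    # negative SSE, with a penalty per center so cluster count is also evolved
    sse = sum(min((x - c) ** 2 for c in centers) for x in data)
    return -(sse + 0.5 * len(centers))

def crossover(a, b):
    # swap random prefixes, so children can differ in length from their parents
    i, j = random.randrange(1, len(a) + 1), random.randrange(1, len(b) + 1)
    return a[:i] + b[j:], b[:j] + a[i:]

def mutate(ind):
    return [c + random.gauss(0, 0.3) if random.random() < 0.2 else c for c in ind]

# individuals are variable-length lists of 1-D centers
pop = [[random.uniform(0, 10) for _ in range(random.randint(1, 6))] for _ in range(30)]
for _ in range(60):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]                      # elitism: best individuals kept intact
    children = []
    while len(children) < 20:
        a, b = random.sample(survivors, 2)
        c1, c2 = crossover(a, b)
        children += [mutate(c1), mutate(c2)]
    pop = survivors + children

best = max(pop, key=fitness)
```

With three well-separated groups in the data, selection pressure drives the population toward individuals with about three centers, without the user ever specifying k.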
Textual Data Partitioning with Relationship and Discriminative Analysis (IJMTER)
Data partitioning methods partition data values by similarity, with similarity measures used to estimate transaction relationships. Hierarchical clustering produces tree-structured results; partitional clustering produces results in grid format. Text documents are unstructured data with high dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the document grouping process, and clustering accuracy degrades drastically when an unsuitable cluster count is chosen.
Textual data elements divide into two types: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; involving non-discriminative words confuses the clustering process and leads to poor clustering solutions.
A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition documents; it exploits both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to drive the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability by using labels and concept relations for dimensionality reduction.
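Why a DPM-based method needs no preset cluster count can be seen in the Chinese Restaurant Process that underlies Dirichlet Process clustering. The sketch below is a generic CRP draw, not the DPMFP inference itself: each item joins an existing cluster with probability proportional to its size, or opens a new cluster with probability proportional to the concentration parameter alpha:

```python
import random
random.seed(3)

def crp_partition(n_items, alpha=1.0):
    """Sample a partition of n_items from the Chinese Restaurant Process."""
    clusters = []                        # cluster sizes, in creation order
    assignment = []
    for _ in range(n_items):
        weights = clusters + [alpha]     # existing sizes, plus the "new table"
        r = random.uniform(0, sum(weights))
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(clusters):
            clusters.append(1)           # open a new cluster
        else:
            clusters[k] += 1             # join an existing one ("rich get richer")
        assignment.append(k)
    return assignment, clusters

labels, sizes = crp_partition(50)
```

The number of clusters is an outcome of the sampling, governed only by alpha and the data size, which is the property DPMFP inherits.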
International Journal of Engineering Research and Applications (IJERA) is an open access, online, peer-reviewed international journal that publishes research and review articles across computer science, electrical, mechanical, civil, and related engineering fields.
In the recent machine learning community there is a trend of constructing nonlinear versions of linear algorithms through the 'kernel method', for example kernel principal component analysis, kernel Fisher discriminant analysis, support vector machines (SVMs), and the recent kernel clustering algorithms. Typically, in unsupervised kernel clustering algorithms, a nonlinear mapping is first applied to map the data into a much higher dimensional feature space, and clustering is then executed there. A drawback of these kernel clustering algorithms is that the cluster prototypes reside in the high dimensional feature space and therefore lack intuitive, clear descriptions, unless an additional approximate projection from the feature space back to the data space is used, as done in the literature. This paper utilizes the 'kernel method' differently: a novel clustering algorithm founded on the conventional fuzzy c-means algorithm (FCM) is proposed, called the kernel fuzzy c-means algorithm (KFCM). The method adopts a novel kernel-induced metric in the data space to replace the original Euclidean norm metric, so that the cluster prototypes of the fuzzy clustering algorithm still reside in the data space and the clustering results can be interpreted in the original space. This property is also used for clustering incomplete data. Experiments on synthetic data illustrate that KFCM achieves better and more robust clustering performance than other variants of FCM when clustering incomplete data.
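A minimal one-dimensional sketch of kernel fuzzy c-means with a Gaussian kernel is below. The update rules follow the standard kernelised FCM form (membership proportional to (1 - K)^(-1/(m-1)), prototypes as kernel-weighted means); the fuzzifier m, kernel width s, and toy data are our own choices. Note the prototypes stay in the data space, the property the abstract emphasises:

```python
import math

def kfcm_1d(xs, vs, m=2.0, s=2.0, iters=30):
    """Gaussian-kernel fuzzy c-means in 1-D; returns prototypes and memberships."""
    for _ in range(iters):
        # K(x, v) = exp(-(x - v)^2 / s^2)
        K = [[math.exp(-(x - v) ** 2 / s ** 2) for v in vs] for x in xs]
        U = []
        for i in range(len(xs)):
            # u_ik proportional to (1 - K_ik)^(-1/(m-1)), normalised over clusters
            raw = [(1.0 - K[i][k] + 1e-12) ** (-1.0 / (m - 1)) for k in range(len(vs))]
            tot = sum(raw)
            U.append([r / tot for r in raw])
        # v_k = sum_i u_ik^m K_ik x_i / sum_i u_ik^m K_ik  (stays in data space)
        vs = [sum(U[i][k] ** m * K[i][k] * xs[i] for i in range(len(xs))) /
              sum(U[i][k] ** m * K[i][k] for i in range(len(xs)))
              for k in range(len(vs))]
    return vs, U

xs = [0.0, 0.2, 0.4, 6.0, 6.2, 6.4]
centers, memberships = kfcm_1d(xs, [1.0, 5.0])
```

Each prototype converges near the mean of its group in the original coordinates, so the result is directly interpretable without any back-projection from a feature space.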
Ensemble Based Distributed K-Modes Clustering (IJERD)
Clustering is the unsupervised classification of data items into groups. With the explosion in the number of autonomous data sources, there is an emergent need for effective approaches to distributed clustering, in which distributed datasets are clustered without gathering all the data at a single site. K-Means is a popular clustering method owing to its simplicity and speed on large datasets, but it cannot directly handle datasets with categorical attributes, which commonly occur in real-life data. Huang proposed the K-Modes clustering algorithm, introducing a new dissimilarity measure for categorical data: it replaces cluster means with a frequency-based method that updates modes during the clustering process to minimize the cost function. Most distributed clustering algorithms in the literature address numerical data. In this paper, a novel Ensemble based Distributed K-Modes clustering algorithm is proposed, well suited both to categorical datasets and to performing the distributed clustering process asynchronously. The performance of the proposed algorithm is compared with existing distributed K-Means clustering algorithms and a K-Modes based centralized clustering algorithm, with experiments carried out on various datasets from the UCI machine learning repository.
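The two ingredients Huang's K-Modes relies on, simple-matching dissimilarity and frequency-based mode updates, can be sketched in a single-site toy; the paper's distributed/ensemble layer is not reproduced here:

```python
from collections import Counter

def mismatch(a, b):
    """Simple-matching dissimilarity: count of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def update_mode(records):
    """Per attribute, take the most frequent category (the 'mode')."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

def kmodes(data, modes, iters=10):
    for _ in range(iters):
        groups = [[] for _ in modes]
        for rec in data:
            k = min(range(len(modes)), key=lambda k: mismatch(rec, modes[k]))
            groups[k].append(rec)
        modes = [update_mode(g) if g else modes[k] for k, g in enumerate(groups)]
    return modes, groups

data = [("red", "small", "round"), ("red", "small", "oval"),
        ("blue", "large", "square"), ("blue", "large", "round")]
modes, groups = kmodes(data, [data[0], data[2]])
```

Replacing means with per-attribute modes is what lets the algorithm work directly on categorical records, where averaging is undefined.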
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem... (IJECE-IAES)
Data analysis plays a prominent role in interpreting various phenomena, and data mining is the process of extracting useful knowledge from extensive data. Based on classical statistical models, data can be exploited beyond mere storage and management. Cluster analysis, a primary investigation with little or no prior knowledge, has seen research and development across a wide variety of communities. Cluster ensembles blend individual solutions obtained from different clusterings to produce a final, higher quality clustering, as required in many applications; the approach arises from the goals of increasing robustness, scalability, and accuracy. This paper gives a brief overview of the generation methods and consensus functions used in cluster ensembles, surveying and analyzing the various techniques and ensemble methods.
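One widely used consensus function such surveys cover is evidence accumulation: build a co-association matrix counting how often each pair of items co-clusters across the base partitions, then link pairs above a threshold. A small sketch, with the threshold and data our own:

```python
def coassociation(partitions, n, threshold=0.5):
    """Consensus labels from base partitions via a co-association matrix."""
    # co[i][j] = fraction of base partitions in which items i and j share a cluster
    co = [[sum(p[i] == p[j] for p in partitions) / len(partitions)
           for j in range(n)] for i in range(n)]
    # single-link-style grouping over the thresholded matrix (union-find)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# three base clusterings of 5 items (label values are arbitrary per partition)
parts = [[0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1],
         [1, 1, 0, 0, 0]]
labels = coassociation(parts, 5)
```

Because only co-membership counts, the base partitions' label values never need to be aligned, which is what makes this consensus function robust to heterogeneous generation methods.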
Particle Swarm Optimization Based K-Prototype Clustering Algorithm (IOSR-JCE)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes high quality papers on theoretical developments and practical applications in computer technology; original research papers, state-of-the-art reviews, and high quality technical notes are invited for publication.
DCT and DFT Based Biometric Recognition and Multimodal Biometric Security (IAEME Publication)
This paper discusses the study and analysis conducted during this research on various techniques in the biometric domain. A close look at biometric enhancement techniques and their limitations is presented, enabling researchers to understand the research contributions in the area of DCT and DFT based recognition and security and to locate some crucial limitations of this notable research. The paper summarizes the different research papers applicable to the topic above. Biometric recognition and security is a most important subject of research in this area of image processing.
Influence over the Dimensionality Reduction and Clustering for Air Quality Me... (IJAEMS)
The current trend in industry is to analyze large data sets and apply data mining and machine learning techniques to identify patterns. But the challenge with huge data sets is their high dimensionality. In data analytics applications, large amounts of data can actually produce worse performance; moreover, most data mining algorithms are implemented column-wise, and too many columns restrict performance and slow them down. Dimensionality reduction is therefore an important step in data analysis: it converts high dimensional data into a much lower dimension, such that maximum variance is explained within the first few dimensions. This paper focuses on multivariate statistical and artificial neural network techniques for data reduction; each method has a different rationale for preserving the relationships between input parameters during analysis. Principal Component Analysis, a multivariate technique, and the Self-Organising Map, a neural network technique, are presented, and a hierarchical clustering approach is then applied to the reduced data set. A case study of air quality measurement is used to evaluate the performance of the proposed techniques.
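The PCA step can be illustrated compactly: the first principal component of a small 2-D dataset, found by power iteration on the covariance matrix. This is a generic sketch, not the paper's air-quality pipeline:

```python
import math

def first_pc(data, iters=100):
    """First principal component of 2-D data via power iteration."""
    n = len(data)
    means = [sum(r[d] for r in data) / n for d in range(2)]
    X = [[r[d] - means[d] for d in range(2)] for r in data]   # center the data
    # 2x2 sample covariance matrix
    C = [[sum(x[a] * x[b] for x in X) / (n - 1) for b in range(2)] for a in range(2)]
    v = [1.0, 1.0]
    for _ in range(iters):
        # multiply by C and renormalise: converges to the dominant eigenvector
        w = [C[0][0] * v[0] + C[0][1] * v[1], C[1][0] * v[0] + C[1][1] * v[1]]
        norm = math.hypot(*w)
        v = [w[0] / norm, w[1] / norm]
    return v

# points scattered along the y = x direction
pts = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.9), (4, 4.1)]
pc = first_pc(pts)
```

Projecting the points onto this single direction keeps almost all of their variance, which is the sense in which PCA "explains maximum variance within the first few dimensions."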
Finding Relationships between the Our-NIR Cluster Results (CSC Journals)
Evaluating node importance in clustering has been an active research problem, and many methods have been developed. Most clustering algorithms deal with general similarity measures; however, in real situations data often changes over time, and clustering such data naively not only decreases the quality of the clusters but also disregards users' expectations, which usually require recent clustering results. In this regard we proposed the Our-NIR method, shown through node-importance results to improve on the method proposed by Ming-Syan Chen for calculating node importance, which is very useful in clustering categorical data; it still has deficiencies in data labeling and outlier detection. In this paper we modify the Our-NIR method for evaluating node importance by introducing a probability distribution, which comparison of results shows to be better.
Experimental Study of Data Clustering Using k-Means and Modified Algorithms (IJDKP)
The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation, and clustering algorithms have broad appeal and usefulness in exploratory data analysis. This paper presents results of an experimental study of different approaches to k-Means clustering, comparing results on different datasets using the original k-Means and other modified algorithms implemented in MATLAB R2009b. The results are evaluated on several performance measures: number of iterations, number of points misclassified, accuracy, the Silhouette validity index, and execution time.
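The Silhouette validity index named among the performance measures is easy to state: for each point, s = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster. A 1-D sketch of ours, not the study's MATLAB code:

```python
def silhouette(clusters):
    """Mean silhouette over all points; clusters is a list of lists of 1-D values."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            # a: mean intra-cluster distance (0 for singleton clusters)
            a = (sum(abs(p - q) for qi, q in enumerate(cluster) if qi != pi)
                 / (len(cluster) - 1)) if len(cluster) > 1 else 0.0
            # b: mean distance to the nearest other cluster
            b = min(sum(abs(p - q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

good = silhouette([[0.0, 0.1, 0.2], [10.0, 10.1, 10.2]])   # tight, well separated
bad  = silhouette([[0.0, 10.0], [0.1, 10.1]])              # clusters mix both groups
```

Values near +1 indicate compact, well-separated clusters; negative values flag points assigned to the wrong cluster, which is why the index is useful for comparing k-means variants.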
An Heterogeneous Population-Based Genetic Algorithm for Data Clusteringijeei-iaes
As a primary data mining method for knowledge discovery, clustering is a technique of classifying a dataset into groups of similar objects. The most popular method for data clustering K-means suffers from the drawbacks of requiring the number of clusters and their initial centers, which should be provided by the user. In the literature, several methods have proposed in a form of k-means variants, genetic algorithms, or combinations between them for calculating the number of clusters and finding proper clusters centers. However, none of these solutions has provided satisfactory results and determining the number of clusters and the initial centers are still the main challenge in clustering processes. In this paper we present an approach to automatically generate such parameters to achieve optimal clusters using a modified genetic algorithm operating on varied individual structures and using a new crossover operator. Experimental results show that our modified genetic algorithm is a better efficient alternative to the existing approaches.
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
Data partitioning methods are used to partition the data values with similarity. Similarity
measures are used to estimate transaction relationships. Hierarchical clustering model produces tree
structured results. Partitioned clustering produces results in grid format. Text documents are
unstructured data values with high dimensional attributes. Document clustering group ups unlabeled text
documents into meaningful clusters. Traditional clustering methods require cluster count (K) for the
document grouping process. Clustering accuracy degrades drastically with reference to the unsuitable
cluster count.
Textual data elements are divided into two types’ discriminative words and nondiscriminative
words. Only discriminative words are useful for grouping documents. The involvement of
nondiscriminative words confuses the clustering process and leads to poor clustering solution in return.
A variation inference algorithm is used to infer the document collection structure and partition of
document words at the same time. Dirichlet Process Mixture (DPM) model is used to partition
documents. DPM clustering model uses both the data likelihood and the clustering property of the
Dirichlet Process (DP). Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to
discover the latent cluster structure based on the DPM model. DPMFP clustering is performed without
requiring the number of clusters as input.
Document labels are used to estimate the discriminative word identification process. Concept
relationships are analyzed with Ontology support. Semantic weight model is used for the document
similarity analysis. The system improves the scalability with the support of labels and concept relations
for dimensionality reduction process.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
In recent machine learning community, there is a trend of constructing a linear logarithm version of
nonlinear version through the ‘kernel method’ for example kernel principal component analysis, kernel
fisher discriminant analysis, support Vector Machines (SVMs), and the current kernel clustering
algorithms. Typically, in unsupervised methods of clustering algorithms utilizing kernel method, a
nonlinear mapping is operated initially in order to map the data into a much higher space feature, and then
clustering is executed. A hitch of these kernel clustering algorithms is that the clustering prototype resides
in increased features specs of dimensions and therefore lack intuitive and clear descriptions without
utilizing added approximation of projection from the specs to the data as executed in the literature
presented. This paper aims to utilize the ‘kernel method’, a novel clustering algorithm, founded on the
conventional fuzzy clustering algorithm (FCM) is anticipated and known as kernel fuzzy c-means algorithm
(KFCM). This method embraces a novel kernel-induced metric in the space of data in order to interchange
the novel Euclidean matric norm in cluster prototype and fuzzy clustering algorithm still reside in the space
of data so that the results of clustering could be interpreted and reformulated in the spaces which are
original. This property is used for clustering incomplete data. Execution on supposed data illustrate that
KFCM has improved performance of clustering and stout as compare to other transformations of FCM for
clustering incomplete data.
Ensemble based Distributed K-Modes ClusteringIJERD Editor
Clustering has been recognized as the unsupervised classification of data items into groups. Due to the explosion in the number of autonomous data sources, there is an emergent need for effective approaches in distributed clustering. The distributed clustering algorithm is used to cluster the distributed datasets without gathering all the data in a single site. The K-Means is a popular clustering method owing to its simplicity and speed in clustering large datasets. But it fails to handle directly the datasets with categorical attributes which are generally occurred in real life datasets. Huang proposed the K-Modes clustering algorithm by introducing a new dissimilarity measure to cluster categorical data. This algorithm replaces means of clusters with a frequency based method which updates modes in the clustering process to minimize the cost function. Most of the distributed clustering algorithms found in the literature seek to cluster numerical data. In this paper, a novel Ensemble based Distributed K-Modes clustering algorithm is proposed, which is well suited to handle categorical data sets as well as to perform distributed clustering process in an asynchronous manner. The performance of the proposed algorithm is compared with the existing distributed K-Means clustering algorithms, and K-Modes based Centralized Clustering algorithm. The experiments are carried out for various datasets of UCI machine learning data repository.
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...IJECEIAES
Data analysis plays a prominent role in interpreting various phenomena. Data mining is the process to hypothesize useful knowledge from the extensive data. Based upon the classical statistical prototypes the data can be exploited beyond the storage and management of the data. Cluster analysis a primary investigation with little or no prior knowledge, consists of research and development across a wide variety of communities. Cluster ensembles are melange of individual solutions obtained from different clusterings to produce final quality clustering which is required in wider applications. The method arises in the perspective of increasing robustness, scalability and accuracy. This paper gives a brief overview of the generation methods and consensus functions included in cluster ensemble. The survey is to analyze the various techniques and cluster ensemble methods.
Particle Swarm Optimization based K-Prototype Clustering Algorithm iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
DCT AND DFT BASED BIOMETRIC RECOGNITION AND MULTIMODAL BIOMETRIC SECURITYIAEME Publication
This Research paper discusses the study and analysis conducted during this research on various techniques in biometric domain. A close glance on biometric enhancement techniques and their limitations are presented in this research paper. This process would enable researcher to understand the research contributions in the area of DCT and DFT based recognition and security, locate some crucial limitations of these notable research. This paper having summary about the different research papers that applicable to our topic of research which mentioned above. Biometric Recognition and security is a most important subject of research in this area of image processing.
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...IJAEMSJORNAL
The current trend in the industry is to analyze large data sets and apply data mining, machine learning techniques to identify a pattern. But the challenges with huge data sets are the high dimensions associated with it. Sometimes in data analytics applications, large amounts of data produce worse performance. Also, most of the data mining algorithms are implemented column wise and too many columns restrict the performance and make it slower. Therefore, dimensionality reduction is an important step in data analysis. Dimensionality reduction is a technique that converts high dimensional data into much lower dimension, such that maximum variance is explained within the first few dimensions. This paper focuses on multivariate statistical and artificial neural networks techniques for data reduction. Each method has a different rationale to preserve the relationship between input parameters during analysis. Principal Component Analysis which is a multivariate technique and Self Organising Map a neural network technique is presented in this paper. Also, a hierarchical clustering approach has been applied to the reduced data set. A case study of Air quality measurement has been considered to evaluate the performance of the proposed techniques.
Finding Relationships between the Our-NIR Cluster Results
The problem of evaluating node importance in clustering has been an active research area in recent years, and many methods have been developed. Most clustering algorithms deal with general similarity measures. In real situations, however, data often changes over time; clustering this type of data not only decreases the quality of the clusters but also disregards the expectations of users, who usually require recent clustering results. In this regard we proposed the Our-NIR method, which is better than the method proposed by Ming-Syan Chen, as proven by its node-importance results; calculating node importance is very useful in the clustering of categorical data, but the method still has deficiencies in data labeling and outlier detection. In this paper we modify the Our-NIR method for evaluating node importance by introducing a probability distribution, which the comparison of results shows to be an improvement.
Experimental study of Data clustering using k-Means and modified algorithms
The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. Clustering algorithms have broad appeal and usefulness in exploratory data analysis. This paper presents the results of an experimental study of different approaches to k-Means clustering, comparing results on different datasets using the original k-Means and other modified algorithms implemented using MATLAB R2009b. The results are calculated on performance measures such as the number of iterations, the number of points misclassified, accuracy, the Silhouette validity index and execution time.
A Novel Approach for Clustering Big Data based on MapReduce
Clustering is one of the most important applications of data mining. It has attracted the attention of researchers in statistics and machine learning, and is used in many applications such as information retrieval, image processing and social network analytics. It helps the user to understand the similarity and dissimilarity between objects, and cluster analysis makes complex and large data sets easier to understand. Different types of clustering algorithms have been analyzed by various researchers. K-means is the most popular partitioning-based algorithm, as it provides good results through accurate calculation on numerical data; however, K-means gives good results for numerical data only, whereas big data is a combination of numerical and categorical data. The K-prototype algorithm is used to deal with numerical as well as categorical data; it combines the distances calculated from numeric and categorical data. With the growth of data due to social networking websites, business transactions, scientific calculation etc., there are vast collections of structured, semi-structured and unstructured data, so there is a need to optimize K-prototype so that these varieties of data can be analyzed efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments have shown that K-prototype implemented on MapReduce gives better performance gain on multiple nodes compared to a single node. CPU execution time and speedup are used as evaluation metrics for the comparison. An intelligent splitter is proposed in this paper which splits mixed big data into numerical and categorical data. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
Clustering, also known as data segmentation, aims to partition a data set into groups (clusters) according to similarity. Cluster analysis has been extensively studied in many researches, and there are many algorithms for different types of clustering. These classical algorithms can't be applied to big data due to its distinct features, and it is a challenge to apply the traditional techniques to large unstructured data. This study proposes a hybrid model to cluster big data using the well-known traditional K-means clustering algorithm. The proposed model consists of three phases: the Mapper phase, the Clustering phase and the Reduce phase. The first phase uses a map-reduce algorithm to split big data into small datasets; the second phase applies the traditional K-means clustering algorithm to each of the split small data sets; the last phase is responsible for producing the general cluster output of the complete data set. Two functions, Mode and Fuzzy Gaussian, have been implemented and compared in the last phase to determine the most suitable one. The experimental study used four benchmark big data sets: Covtype, Covtype-2, Poker, and Poker-2. The results proved the efficiency of the proposed model in clustering big data using the traditional K-means algorithm. The experiments also show that the Fuzzy Gaussian function produces more accurate results than the traditional Mode function.
K-Means clustering uses an iterative procedure which is very much sensitive and dependent upon the initial centroids. The initial centroids in the k-means clustering are chosen randomly, and hence the clustering also changes with respect to the initial centroids. This paper tries to overcome this problem of random selection of centroids and hence change of clusters with a premeditated selection of initial centroids. We have used the iris, abalone and wine data sets to demonstrate that the proposed method of finding the initial centroids and using the centroids in k-means algorithm improves the clustering performance. The clustering also remains the same in every run as the initial centroids are not randomly selected but through premeditated method.
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
Clustering methods in data mining aim to group a set of patterns based on their similarity. In a data survey, heterogeneous information is established with various types of data scales like nominal, ordinal, binary, and Likert scales. A lack of treatment of heterogeneous data and information leads to loss of information and scanty decision-making. Although many similarity measures have been established, solutions for heterogeneous data in clustering are still lacking. The recent entropy distance measure seems to provide good results for the heterogeneous categorical data. However, it requires many experiments and evaluations. This article presents a proposed framework for heterogeneous categorical data solution using a mini batch k-means with entropy measure (MBKEM) which is to investigate the effectiveness of similarity measure in clustering method using heterogeneous categorical data. Secondary data from a public survey was used. The findings demonstrate the proposed framework has improved the clustering’s quality. MBKEM outperformed other clustering algorithms with the accuracy at 0.88, v-measure (VM) at 0.82, adjusted rand index (ARI) at 0.87, and Fowlkes-Mallow’s index (FMI) at 0.94. It is observed that the average minimum elapsed time-varying for cluster generation, k at 0.26 s. In the future, the proposed solution would be beneficial for improving the quality of clustering for heterogeneous categorical data problems in many domains.
Extended pso algorithm for improvement problems k means clustering algorithm
Clustering is an unsupervised process and one of the most common data mining techniques. The purpose of clustering is to group similar data together, so that instances within a cluster are as similar to each other as possible and as different as possible from the instances in other clusters. In this paper we focus on partitional k-means clustering, which, owing to its ease of implementation and high-speed performance on large data sets, is still very popular after 30 years among the clustering algorithms developed since. To address the problem of the k-means algorithm getting stuck in local optima, we propose an extended PSO algorithm named ECPSO. Our new algorithm is able to escape from local optima and, with a high success rate, produces the problem's optimal answer. The experimental results show that the proposed algorithm performs better than other clustering algorithms, especially on two indices: clustering accuracy and clustering quality.
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high
dimensional data. Many significant subspace clustering algorithms exist, each having different
characteristics caused by the use of different techniques, assumptions, heuristics used etc. A comprehensive
classification scheme is essential which will consider all such characteristics to divide subspace clustering
approaches in various families. The algorithms belonging to same family will satisfy common
characteristics. Such a categorization will help future developers to better understand the quality criteria to
be used and similar algorithms with which to compare results for their proposed clustering algorithms. In
this paper, we first propose the concept of SCAF (Subspace Clustering Algorithms' Family).
The characteristics of a SCAF are based on classes such as cluster orientation, overlap of dimensions, etc.
As an illustration, we further provide a comprehensive, systematic description and comparison of a few
significant algorithms belonging to the "Axis parallel, overlapping, density based" SCAF.
Mine Blood Donors Information through Improved K-Means Clustering
The number of accidents and health diseases, which are increasing at an alarming rate, is resulting in a huge increase in the demand for blood, so there is a necessity for the organized analysis of blood donor databases and blood bank repositories. Clustering analysis is one of the data mining applications, and the K-means clustering algorithm is the fundamental algorithm for modern clustering techniques. K-means is a traditional, iterative approach: at every iteration it attempts to find the distance from the centroid of each cluster to each and every data point. This paper improves the original k-means algorithm by improving the initial centroids using the distribution of the data. Results and discussion show that the improved K-means algorithm produces accurate clusters in less computation time to find the donors' information.
International Journal of Computer Engineering and Information Technology
VOL. 9, NO. 1, January 2017, 6–14
Available online at: www.ijceit.org
E-ISSN 2412-8856 (Online)

Comparison of Hierarchical and Non-Hierarchical Clustering Algorithms

Fidan Kaya Gülağız and Suhap Şahin
Department of Computer Engineering, Kocaeli University, Kocaeli, İzmit, 41380 Turkey
fidan.kaya@kocaeli.edu.tr, suhapsahin@kocaeli.edu.tr
ABSTRACT
As a result of the widespread use of technology, large data volumes have begun to emerge. Analyzing such big data and obtaining knowledge from it is very difficult with simple methods, so data mining methods have arisen. One of these methods is clustering. Clustering is an unsupervised data mining technique that groups data according to the similarities between records. The goal of cluster analysis is to find the subclasses that occur naturally in the data set. Many different methods have been developed for data clustering, and their performance can change according to the data set, the number of records in the dataset, and so on. In this study we evaluate clustering methods using different datasets. The results are compared by considering different parameters such as result similarity, number of steps and processing time. At the end of the study the methods are also analyzed to show the appropriate data set conditions for each method.
Keywords: Data Mining, Hierarchical Clustering, Non-
Hierarchical Clustering, Centroid Similarity.
1. INTRODUCTION
Computer systems are developing with each passing day and are also becoming cheaper. Processors are getting faster and disk capacities are increasing as well. Computers can store more data and process it in less time. Furthermore, thanks to advances in computer networks, data can be accessed quickly from other computers [1]. As electronic devices get cheaper, their usage becomes widespread, and so the data in electronic environments grows rapidly. But these data are not meaningful unless they are processed by a specific method. Data mining is defined as the process by which useful or meaningful information is distinguished from large data [2]. There are many models that can be used to obtain useful information in data mining. These models can be grouped under three main headings: classification, clustering and association rules [3]. Classification models are considered estimator models, while clustering and association rules are considered identifier models [4]. An estimator model is developed from data whose results are known; the developed model is then used to estimate result values for data clusters whose results are not known. Identifier models identify patterns in the present data that can be used to guide decision making.

The clustering model, among the identifier models, separates data into groups according to similarities calculated by taking specific characteristics of the data into consideration. Many methods can be used during the calculation of similarities; some of these are Euclidean distance, Manhattan distance and Minkowski distance [5]. The primary aim of using distance methods is to obtain the similarity, according to distance, between data which are not yet grouped; thus, similar data can be included in the same cluster. To apply clustering analysis it is assumed that the data follow a normal distribution. However, this is just a theoretical assumption and is ignored in practice; only the suitability of the calculated distance values to a normal distribution is considered [6].
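For illustration, the three distance measures mentioned above can be expressed through a single formula; the following is a minimal Python sketch (the function name and example points are our own, not from the paper):

```python
def minkowski(a, b, p):
    """Minkowski distance between two equal-length vectors.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    """
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)
```

For example, minkowski((0, 0), (3, 4), 2) yields the Euclidean distance 5.0, while minkowski((0, 0), (3, 4), 1) yields the Manhattan distance 7.0.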
Many clustering methods have been developed in data mining. Methods are selected according to the number of clusters and the attributes of the data to be clustered. Clustering methods are divided into two categories: hierarchical and non-hierarchical clustering methods. Hierarchical clustering methods have two different classes, the agglomerative and the divisive approaches. Non-hierarchical clustering methods are divided into four sub-classes: partitioning, density-based, grid-based and other approaches [7]. The general architecture of clustering methods is shown in Figure 1.
Fig. 1. Categorization of clustering algorithms [8].
If the clustering algorithm used forms clusters gradually, it belongs to the hierarchical class. The algorithms of the hierarchical class gather the two most similar objects into a cluster. This has a very high processing cost, because all objects are compared before every clustering step. The agglomerative approach is called the bottom-up approach; in this approach, data points form clusters by combining with each other [8]. The divisive approach follows the opposite process and is called top-down. The calculation loads required by the divisive and agglomerative approaches are similar [9].

Clustering algorithms in the non-hierarchical category cluster the data directly. Such algorithms generally change the centers until all points are related to the centers [9]. The K-Means algorithm is the most common example of the partitioning approach; the Fuzzy C-Means and K-Medoids algorithms are also variants of K-Means. Compared with hierarchical classification, non-hierarchical classification is low cost in terms of calculation time, even though the distances of all points to the related centers are calculated consecutively until all centers are minimized, which is itself a costly process [10].

Density-based clustering is another non-hierarchical clustering approach. Density-based clustering algorithms consider dense data regions to be clusters, so they have no problem finding clusters with arbitrary shapes. Among the best-known density-based clustering algorithms are DBSCAN (Density-based spatial clustering of applications with noise) and OPTICS (Ordering points to identify the clustering structure). The grid-based clustering approach takes cells into consideration rather than data points; because of this feature, grid-based clustering algorithms are generally more efficient than clustering algorithms that operate on all data points [11].
In this study, the performance of clustering methods was compared on different data sets. The K-Means, K-Medoids and Farthest First Clustering algorithms were used for hierarchical clustering, and DBSCAN was used for density-based clustering. It was determined how accurately and effectively these algorithms perform clustering, using evaluation parameters such as processing time, difference rate of centers and total error rate. The rest of the paper is organized as follows. In Section II, the clustering algorithms are described. The third section explains the datasets and evaluation metrics used for the performance analysis. Section IV presents the experimental results. Conclusions and some future enhancements are given in the last section.
2. CLUSTERING ALGORITHMS
Within the scope of this study, a performance evaluation of the K-Means, K-Medoids, Farthest First Clustering and Density Based Clustering algorithms was made. For this purpose, seven different data sets from the UCI Data Repository were used. Both the Iterative K-Means and Iterative Multi K-Means versions of the K-Means algorithm were also included in the comparison. In this section the clustering algorithms are explained in detail.
2.1 K-Means Algorithm
The K-Means algorithm works on the premise that new clusters should be formed according to the distance between points and the centers of clusters. The distance between the elements of the data set and the centers of the clusters also gives the error rate of the clustering. The K-Means algorithm consists of four basic steps [12]:
- Determination of the centers.
- Assignment of the points outside the centers to clusters, according to the distance between the centers and the points.
- Calculation of the new centers.
- Repetition of these steps until the decided clusters are obtained.
The biggest problem of the K-Means algorithm is the determination of the starting points. If a bad choice is made initially, many changes occur during the clustering period, and in this case different clustering results can be obtained even with the same number of iterations. Likewise, if the dimensions of the data groups differ, if the densities of the data groups differ, or if there are inconsistencies in the data, the algorithm may not produce good results. The complexity of the K-Means algorithm is lower than that of other methods, and its implementation is easy. The pseudo code of the K-Means algorithm is shown in Figure 2.
Fig. 2. Pseudo code for K-Means algorithm [13].
Input: k // Desired number of clusters
       D = {x1, x2, …, xn} // Set of elements
Output: K = {C1, C2, …, Ck} // Set of k clusters which minimizes the squared-error function

K-Means Algorithm
  Assign initial values for the means µ1, µ2, …, µk;
  Repeat
    Assign each item xi to the cluster with the closest mean;
    Calculate the new mean for each cluster;
  Until the convergence criterion is met;
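As a concrete illustration of these steps, here is a minimal K-Means sketch in Python for 2-D points with squared Euclidean distance (the function and variable names are our own, not from the paper):

```python
import random

def k_means(points, k, max_iter=100, seed=0):
    """Plain K-Means for a list of 2-D points; returns (means, clusters)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)            # step 1: initial centers
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest mean.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda i: (x - means[i][0]) ** 2 + (y - means[i][1]) ** 2)
            clusters[j].append((x, y))
        # Step 3: recompute each mean (an empty cluster keeps its old mean).
        new_means = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                     if c else means[j]
                     for j, c in enumerate(clusters)]
        # Step 4: repeat until the means stop changing.
        if new_means == means:
            break
        means = new_means
    return means, clusters
```

On two well-separated groups of three points each, the sketch recovers the two groups regardless of which points happen to be drawn as the initial centers.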
In our study, besides the classical K-Means method, the Iterative K-Means and Iterative Multi K-Means algorithms were analyzed. The Iterative K-Means algorithm differs from the classical K-Means algorithm in that it works according to two parameters: the minimum number of clusters and the maximum number of clusters. The algorithm takes these parameters as input to increase accuracy. The method runs K-Means for every number of clusters between the minimum and the maximum and calculates a score value for each; the number of clusters with the highest score value is returned as the result. The Iterative Multi K-Means algorithm is a development of the Iterative K-Means algorithm in terms of the number of iterations: differently from Iterative K-Means, it aims to achieve the best cluster number by working with different numbers of iterations, and to get the best iteration value for this number of clusters.
2.2 K-Medoids Algorithm
The basis of the K-Medoids algorithm is finding k representative objects that represent various structural attributes of the data; for this reason the method is called K-Medoids or PAM (Partitioning Around Medoids). The representative objects in the algorithm are called medoids, and these objects are the points nearest to the centers. The most commonly used K-Medoids algorithm was developed by Kaufman and Rousseeuw in 1987 [14]. The representative object that minimizes the average distance to the other objects is the most central object. This partitioning method is therefore applied by reducing the dissimilarity between each object and its reference point.
Fig. 3. Pseudo code for K-Medoids/PAM algorithm [13].
In Figure 3 the pseudo-code of the K-Medoids algorithm is given. In this method, after choosing k representative objects, clusters are formed by assigning each object to the nearest of the k representative objects [15]. In the following steps, each representative object is exchanged with each non-representative object, and this replacement continues as long as the quality of the clustering improves. A cost function is used to evaluate the quality of the clustering; it calculates the dissimilarity between an object and its cluster, and the decision is made according to the cost value returned by this function [16].
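The swap-based procedure can be sketched in Python as follows, using Manhattan distance as the dissimilarity (a minimal sketch; the function names and the choice of distance are our own assumptions):

```python
import random

def manhattan(a, b):
    # Manhattan (city-block) distance between two 2-D points.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(points, medoids):
    # Sum of dissimilarities of all objects to their nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    """K-Medoids (PAM) sketch: keep swapping a medoid with a non-medoid
    while the swap lowers the total cost; stop when no swap helps."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)          # initial representative objects
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for h in points:
                if h in medoids:
                    continue
                candidate = [h if x == m else x for x in medoids]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate      # the swap improved the clustering
                    improved = True
    return medoids
```

Clusters are then formed by assigning every point to its nearest medoid; because the total cost strictly decreases with each accepted swap, the loop is guaranteed to terminate.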
2.3 Farthest First Clustering
The Farthest First Clustering algorithm is a fast, greedy algorithm. It performs the clustering process in two stages, like the K-Means algorithm: selection of the centers, and assignment of the elements to the clusters. Initially the algorithm performs the selection of the k centers. The center of the first cluster is selected randomly. The center of the second cluster is the point farthest from the center of the first cluster. A greedy approach is used in this process: each subsequent center is chosen greedily as the point farthest from the centers already selected. After the center determination process, the remaining points are assigned to the closest clusters according to their distance [17]. In Figure 4 the pseudo-code of the Farthest First Clustering algorithm is given.
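The two stages described above can be sketched in Python as follows (2-D points, squared Euclidean distance; the names are our own, not from the paper):

```python
import random

def farthest_first(points, k, seed=0):
    """Farthest First sketch: greedy center selection, then one-time assignment."""
    rng = random.Random(seed)

    def dist2(p, c):
        # Squared Euclidean distance between two 2-D points.
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    # Stage 1: the first center is random; each next center is the point
    # farthest from its closest already-selected center (greedy choice).
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist2(p, c) for c in centers)))
    # Stage 2: assign every point to its closest center; centers never move.
    clusters = {c: [] for c in centers}
    for p in points:
        clusters[min(centers, key=lambda c: dist2(p, c))].append(p)
    return clusters
```

Note that, unlike K-Means, there is no update loop: the centers are fixed once stage 1 finishes.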
The Farthest First Clustering algorithm is similar in its procedure to the K-Means algorithm, but in fact there are some points of difference. While the Farthest First Clustering algorithm assigns all of the elements to clusters at one time, K-Means continues to iterate for a given number of iterations or until no more changes are made to the clusters. The Farthest First Clustering algorithm also does not need to update the centers: in the K-Means algorithm the center of a cluster is the geometric center of the cluster's elements, whereas in the Farthest First Clustering algorithm the centers are specific points and are determined only once. Since the Farthest First Clustering algorithm performs the selection of centers randomly and in a single pass, it performs well in this respect.

Pseudo code for the K-Medoids/PAM algorithm (Fig. 3):

Input: k // Desired number of clusters
       D = {x1, x2, …, xn} // Set of elements
Output: K // A set of k clusters which minimizes the sum of dissimilarities of all n objects to their nearest medoid

K-Medoids Algorithm
  Randomly choose k objects from the data set to be the cluster medoids in the initial state.
  For each pair of non-selected object h and selected object i, calculate the total swapping cost.
  For each pair of i and h:
    If the cost < 0, i is replaced by h.
  Then assign each non-selected object to the most similar representative object.
  Repeat steps 2 and 3 until no change happens.
Fig. 4. Pseudo code for Farthest First algorithm [17]
2.4 Density Based Clustering
The DBSCAN (Density-based spatial clustering of applications with noise) algorithm is based on revealing the neighborhood relations of data points with each other in two- or multi-dimensional space. A database handled from the spatial perspective is mostly used to analyze spatial data [18]. The DBSCAN algorithm has two initial parameters: ε (epsilon/eps) and minPts (the number of points required to create a dense region). The epsilon parameter indicates the degree of closeness of points in the same cluster to each other. The algorithm starts from an arbitrary point which has not yet been visited. It finds the points in the ε-neighborhood of that point and forms a new cluster if their number is sufficient; otherwise the point is tagged as noise. This means that the point is not a center, but it may still lie within ε distance of another cluster. According to the operating logic of the algorithm, if a point is determined to be a dense part/center of a cluster, the points in its ε-neighborhood are considered elements of that cluster. This process continues until all density-related clusters are obtained. The neighborhood-finding process is the part of the DBSCAN algorithm that requires the most processing power. In Figure 5 the pseudo code of the DBSCAN algorithm is shown. To obtain accurate clusters with the DBSCAN algorithm, an ideal value of the eps parameter needs to be found. If the eps value is given too small a value, only the densest regions/cores of clusters will be obtained; otherwise, undesirable clusters may form.
Fig. 5. Pseudo code for DBSCAN Algorithm [13]
Input: k // Desired number of clusters
       D = {x1, x2, …, xn} // Set of elements
Output: K = {C1, C2, …, Ck} // Set of k clusters

Farthest First Algorithm
  Randomly select the first cluster center;
  For (i = 2, …, k)
    For each remaining point
      Calculate the distance to the current center set;
    Select the point with the maximum distance as the new center;
  For each remaining point
    Calculate the distance to each cluster center;
    Put it in the cluster with the minimum distance;
Input: N objects to be clustered
Global parameters: Eps, MinPts
Output: Clusters of objects

Density Based Algorithm
  Arbitrarily select a point P;
  Retrieve all points density-reachable from P w.r.t. Eps and MinPts;
  If P is a core point, a cluster is formed;
  If P is a noise point, no points are density-reachable from P and DBSCAN visits the next point of the database;
  Continue the process until all of the points have been processed.
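The pseudo code above can be fleshed out as a minimal Python sketch (2-D points, Euclidean distance; the function names and the simple list-based neighborhood search are our own assumptions):

```python
def region_query(points, p, eps):
    # All points whose distance to p is at most eps (p itself included).
    return [q for q in points
            if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps * eps]

def dbscan(points, eps, min_pts):
    """DBSCAN sketch: grow clusters from core points; sparse points are noise."""
    labels = {}                              # point -> cluster id, -1 = noise
    cid = 0
    for p in points:
        if p in labels:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:
            labels[p] = -1                   # tag as noise, for now
            continue
        cid += 1                             # p is a core point: new cluster
        labels[p] = cid
        queue = [q for q in neighbours if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:
                labels[q] = cid              # former noise becomes a border point
            if q in labels:
                continue
            labels[q] = cid
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts: # q is also core: expand the cluster
                queue.extend(q_neighbours)
    return labels
```

With eps = 1.5 and minPts = 3, for example, two tight groups of four points each come out as two separate clusters, while an isolated point between them is labeled noise (-1).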
3. DATASET AND EVALUATION METRICS
Information about the seven datasets used to test the clustering methods is given in Table 1. The data sets were obtained from the UC Irvine Machine Learning Repository [19], a data repository created by David Aha and his students in 1987 that contains many large data sets. The data sets were selected to cover different application areas and to vary in the number of records and attributes; Iris and Wine are among the most commonly used data sets in the UCI repository. This variety makes it possible to analyze the clustering algorithms from different points of view.
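Most UCI repository data files (such as iris.data) are plain comma-separated records, with numeric attribute columns followed by a class label. A minimal stdlib-only loader for that format is sketched below; the function name and the assumption that the label is the last column are illustrative, not part of the original study.

```python
import csv

def load_uci_csv(path):
    """Read a UCI-style CSV file: numeric attribute columns followed by a
    final class-label column. Returns (features, labels)."""
    features, labels = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:  # UCI files often end with a blank line
                continue
            features.append([float(v) for v in row[:-1]])
            labels.append(row[-1])
    return features, labels
```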
Table 1: Summary description of data sets.

Dataset                  Number of Records   Number of Attributes
Iris                     150                 4
Wine                     178                 13
Glass                    214                 10
Heart-Disease            303                 13
Water Treatment Plant    527                 38
Pima Indians             768                 8
Isolet                   7797                617
Different evaluation methods can be used when evaluating the clustering of the data sets. Some of these methods are process time, sum of centroid similarities, sum of average pairwise similarities, sum of squared errors, the AIC (Akaike's Information Criterion) score, and the BIC (Bayesian Information Criterion) score.

Process time refers to the time elapsed during the clustering process.

Sum of centroid similarities (SOCS) represents the distance of the cluster centers to each other. Any distance measure can be used.

Sum of average pairwise similarities (SOAPS) refers to the total distance between the matching points in different clusters (the most similar/closest points in different clusters). Any distance measure can be used.

Sum of squared errors (SSE) refers to the total distance of the points in each cluster from the cluster center.

AIC and BIC are used to measure the quality of a statistical model on a given data set. Low AIC and BIC values indicate that the model is closer to reality.
SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist(x, c_i)²   (1)
AIC = −2 log L + 2p   (2)
BIC = −2 log L + log(n) · p   (3)
Formula (1) gives the calculation of the SSE. Here K is the number of clusters, c_i is the center of the i-th cluster, and x ranges over the elements of the i-th cluster. The AIC evaluation method is shown in Formula (2) and the BIC evaluation method in Formula (3). In Formulas (2) and (3), L is the maximized likelihood of the model, p is the number of model parameters (here, the number of attributes in the data set), and n is the number of records. AIC is better in situations where a false negative finding would be considered more misleading than a false positive, and BIC is better in situations where a false positive is as misleading as a false negative [20]. In the scope of this study, process time, SSE, and SOCS were used as evaluation metrics, and the clustering methods were compared according to these parameters.
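The metrics above can be computed directly from their definitions. The Python sketch below is illustrative only; the Euclidean distance inside `socs` is an assumption (the text allows any distance measure), and the function names are not from the original study.

```python
import math

def sse(points, labels, centers):
    """Formula (1): total squared distance of each point to its cluster center."""
    return sum(sum((x - y) ** 2 for x, y in zip(p, centers[l]))
               for p, l in zip(points, labels))

def socs(centers):
    """Sum of centroid similarities: sum of pairwise distances between
    cluster centers (Euclidean here; any distance measure can be used)."""
    d = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sum(d(centers[i], centers[j])
               for i in range(len(centers)) for j in range(i + 1, len(centers)))

def aic(log_likelihood, p):
    """Formula (2): AIC = -2 log L + 2p."""
    return -2 * log_likelihood + 2 * p

def bic(log_likelihood, p, n):
    """Formula (3): BIC = -2 log L + log(n) * p."""
    return -2 * log_likelihood + math.log(n) * p
```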
4. EXPERIMENTAL RESULTS
To evaluate the efficiency of the different clustering algorithms, several tests were carried out. The tests were run on a computer with an Intel(R) Core(TM) i5-2410M CPU @ 2.30 GHz, 4 GB RAM, a 350 GB hard disk, and the Windows 7 Ultimate operating system. Six different algorithms were used, analyzed on the seven data sets described in Section 3. The algorithms were implemented in Java using the Eclipse IDE and the Java-ML (Java Machine Learning Library) [21]. Table 2 shows the process times of the algorithms, Table 3 the similarity of the cluster centers, and Table 4 the total error of the algorithms.
Table 2: Time taken to form the respective number of clusters.

                           Process Time (in sec.)
Dataset        #Clusters  K-Means   Iterative K-Means  Iterative MultiK-Means  K-Medoids  Farthest First  DBSCAN
Iris           9          0.000679  0.000553           0.000572                0.014359   0.000610        -
Iris           10         0.000784  -                  -                       0.009520   0.000655        0.001058
Wine           3          0.000629  0.000711           0.000561                0.014691   0.000632        0.001049
Wine           5          0.000633  0.000594           0.000561                -          -               -
Wine           10         0.000622  0.000612           0.000588                -          -               -
Glass          2          0.000921  0.000726           0.000724                0.022212   0.000773        -
Glass          3          0.000766  0.000725           0.000795                0.022465   0.000761        0.001956
Glass          4          0.000668  0.000725           0.000890                0.023465   0.000743        -
Heart-Disease  2          0.000970  0.000781           0.000895                0.019614   0.000931        -
Heart-Disease  4          0.000795  0.000723           0.000933                0.021412   0.000813        -
Heart-Disease  5          0.000797  0.000714           0.001141                0.022244   0.000758        0.002131
Water          5          0.000833  0.000674           0.000681                0.020624   0.000659        -
Water          9          0.000790  0.000692           0.000679                0.021894   0.000751        -
Water          13         0.001011  0.000749           0.000619                0.022689   0.000731        -
Pima           2          0.000655  0.000614           0.000584                0.013468   0.000580        -
Pima           7          0.000603  0.000630           0.000600                0.011716   0.000594        -
Pima           14         0.000597  0.000568           0.000581                0.010192   0.000596        0.001038
Isolet         1          -         -                  -                       -          -               0.001102
Isolet         2          0.000611  0.000678           0.000603                0.000481   0.000710        -
Isolet         4          0.000646  0.000674           0.000642                0.000478   0.000772        -
The process times of the clustering algorithms are given in Table 2. When process time is compared across data set sizes, K-Medoids generally takes more time than the other algorithms. When the Farthest First and K-Means algorithms are compared, K-Means is marginally faster, but the difference is only a fraction of a second, so the process times of the two algorithms can be considered the same. Among all the algorithms, K-Medoids and DBSCAN require the most process time. In addition, as the data sets grow, the increase in the process time of the algorithms is small, and the number of clusters chosen for the same data set affects the process time only slightly.
Table 3: Sum of centroid similarities of clusters.

Dataset        #Clusters  K-Means  Iterative K-Means  Iterative MultiK-Means  K-Medoids  Farthest First  DBSCAN
Iris           9          149.85   149.86             149.85                  148.70     148.71          -
Iris           10         149.85   -                  -                       148.82     148.88          131.89
Wine           3          177.80   177.80             177.81                  177.83     177.83          177.83
Wine           5          177.82   177.82             177.82                  -          -               -
Wine           10         177.86   177.86             177.87                  -          -               -
Glass          2          209.48   209.41             209.41                  209.41     205.74          -
Glass          3          211.84   211.74             211.84                  211.69     211.95          171.86
Glass          4          212.83   212.83             212.69                  212.72     212.61          -
Heart-Disease  2          301.42   301.42             301.42                  300.81     300.70          -
Heart-Disease  4          301.62   301.56             301.62                  300.84     300.82          -
Heart-Disease  5          301.78   301.84             301.85                  300.87     300.86          55.75
Water          5          526.80   526.81             526.79                  526.03     526.04          -
Water          9          526.82   526.82             526.82                  526.05     526.07          -
Water          13         526.83   526.85             526.82                  526.05     526.09          -
Pima           2          726.95   726.95             726.95                  688.77     687.84          -
Pima           7          750.30   751.39             750.29                  690.49     689.78          -
Pima           14         752.45   758.48             757.92                  691.09     690.91          687.54
Isolet         1          -        -                  -                       -          -               1136.89
Isolet         2          1234.00  1234.00            1234.00                 1165.24    1144.72         -
Isolet         4          1274.67  1259.77            1259.74                 1169.90    1161.63         -
Table 3 evaluates the similarities of the clusters obtained as a result of the clustering process. A low similarity between centers indicates a more successful clustering. When the results of the different algorithms are evaluated, the center similarities are close for all algorithms, but the DBSCAN algorithm is more successful than the other methods. A review of the data sets also showed that DBSCAN could not perform the clustering process on every data set. As a result, the DBSCAN algorithm is not appropriate for data sets whose data vary widely, since in such cases determining the epsilon parameter is very difficult.
Table 4: Sum of squared errors of the clustering process.

Dataset        #Clusters  K-Means     Iterative K-Means  Iterative MultiK-Means  K-Medoids   Farthest First  DBSCAN
Iris           9          57.64       57.53              53.27                   321.06      315.01          -
Iris           10         54.11       -                  -                       286.11      268.95          265.93
Wine           3          22632.55    22632.55           21989.30                67576.07    67576.07        67576.07
Wine           5          13071.70    12825.17           12806.69                -           -               -
Wine           10         8621.65     8659.13            7599.18                 -           -               -
Glass          2          410782.58   410699.26          410699.26               410699.26   646858.54       -
Glass          3          183758.10   183879.12          183758.10               184353.19   257950.79       799782.81
Glass          4          104375.47   104375.47          104365.58               104310.72   135295.57       -
Heart-Disease  2          1210711.49  1210711.49         1210711.49              2027583.39  2084976.16      -
Heart-Disease  4          785706.62   825521.63          785706.62               2019915.36  2031928.89      -
Heart-Disease  5          704137.84   698950.49          667067.95               2000107.90  2013695.69      189279.47
Water          5          1.45        8.72               7.24                    1.34        1.44            -
Water          9          2.73        2.73               2.38                    1.08        1.09            -
Water          13         2.24        2.04               1.57                    1.01        1.01            -
Pima           2          1.02        1.02               1.02                    2.30        2.31            -
Pima           7          3.05        2.80               2.61                    2.27        2.28            -
Pima           14         1.93        1.62               1.65                    2.26        2.26            2.25
Isolet         1          -           -                  -                       -           -               352462.39
Isolet         2          283497.79   283497.79          283497.79               335562.48   350550.20       -
Isolet         4          250299.70   263343.92          250295.65               332182.18   338303.25       -
Table 4 gives the total error values obtained as a result of the clustering process. The total error values give an idea of the success of the clustering and also help determine an appropriate number of clusters for the data set. Among the K-Means variants, the Iterative MultiK-Means algorithm produced the most accurate results, because it searches, over different iterations, for the number of clusters that gives the minimum error within the given range. In addition, when the data set is appropriate for density-based clustering, DBSCAN had the minimum error compared to the other methods. The remaining methods generally showed similar error values.
5. DISCUSSION OF RESULTS AND CONCLUSION
In this study, different clustering algorithms were compared in terms of process time, sum of centroid similarities, and sum of squared errors on different data sets. As a result of this study, it is understood that the K-Medoids algorithm produces more accurate results than K-Means, but it needs more time and memory. The K-Means algorithm may give different results for different cluster values because it is not a deterministic clustering algorithm. The Farthest First algorithm produces results similar to K-Means. For the DBSCAN algorithm, the distribution of the data in the selected data set has a major effect. It was also shown that choosing an appropriate number of clusters for the data set is very important for the correctness of the clustering process.
Considering all methods, the DBSCAN algorithm gives the most accurate results and K-Means is the fastest algorithm.
REFERENCES
[1] Alpaydın, E., Zeki Veri Madenciliği: Ham Veriden Altın Bilgiye Ulaşma Yöntemleri, Bilişim 2000, Veri Madenciliği Eğitim Semineri, 2000.
[2] Özkan, Y., Veri Madenciliği Yöntemleri, Papatya
Yayıncılık, İstanbul, Turkey, 2008.
[3] Ibaraki, M., Data Mining Techniques for Associations,
Clustering and Classification Methodologies, PAKDD'99,
In: Proceedings of Third Pacific-Asia Conference, 1999,
pp 13-23.
[4] Savaş, S., Topaloğlu, N., Yılmaz, M., Veri Madenciliği ve
Türkiye’ deki Uygulama Örnekleri, İstanbul Ticaret
Üniversitesi Fen Bilimleri Dergisi, 2012, 11 (21), pp 1-23.
[5] Deza, M. M., Deza, E., Encyclopedia of Distances, Springer, 2009.
[6] Çoşlu, E., Veri Madenciliği, In: Proceedings of Akademik
Bilişim 2013 Conference, 2013, pp 573-585
[7] Taşkın, Ç., Emel, G. G., Veri Madenciliğinde Kümeleme
Yaklaşimlari Ve Kohonen Ağlari İle Perakendecilik
Sektöründe Bir Uygulama, Süleyman Demirel
Üniversitesi İktisadi ve İdari Bilimler Fakültesi Dergisi,
2010, 15(3), pp 395-409.
[8] Eden, M. A., Tommy, W. S., A New Shifting Grid
Clustering Algorithm, Pattern Recognition, 2004, 37(3),
pp 503-514.
[9] Kuo, R. J., Ho, L. M., Hu, C. M., Cluster Analysis in
Industrial Market Segmentation Through Artificial Neural
Network, Computers & Industrial Engineering, 2002,
42(2-4), pp 391-399.
[10] Likas, A., Vlassisb, N., Verbeekb, J. J., The Global K-
Means Clustering Algorithm, Pattern Recognition, 2003,
36(2), pp 451-461.
[11] Hsu, C. H., Data Mining to Improve Industrial Standards
and Enhance Production and Marketing: An Empirical
Study in Apparel Industry, Expert Systems with
Applications, 2003, 36(3), pp 4185–4191.
[12] Kaya, H., Köymen, K., Veri Madenciliği Kavrami Ve
Uygulama Alanlari, Doğu Anadolu Bölgesi Araştırmları,
2008.
[13] Capaldo, R., Collova, F., Clustering: A Survey, http://www.slideshare.net/rcapaldo/cluster-analysis-presentation, 2008.
[14] Kaufman, L., Rousseeuw, P. J., Clustering by Means of
Medoids, Statistical Data Analysis Based on The L1–
Norm and Related Methods, Springer, 1987.
[15] Kaufman, L., Rousseeuw, P. J., Finding Groups in Data:
An Introduction to Cluster Analysis, Wiley, 1990.
[16] Işık, M., Çamurcu, A., K-Means, K-Medoids ve Bulanık C-Means Algoritmalarının Uygulamalı Olarak Performanslarının Tespiti, İstanbul Ticaret Üniversitesi Fen Bilimleri Dergisi, 2007, 6(11), pp 31-45.
[17] Sharmila, Kumar, M., An Optimized Farthest First
Clustering Algorithm, In: Proceedings of Nirma
University International Conference on Engineering,
2013, pp 1-5.
[18] Bilgin, T. T., Çamurcu, Y., DBSCAN, OPTICS ve K-
Means Kümeleme Algoritmalarının Uygulamalı
Karşılaştırılması, Politeknik Dergisi, 2005, 8(2), pp 139-
145.
[19] Lichman, M., UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science, 2013.
[20] Dziak, J. J., Coffman, D. L., Lanza, S. T., Li, R., Sensitivity and Specificity of Information Criteria, Technical Report Series #12-119, The Pennsylvania State University, College of Health and Human Development, Methodology Center.
[21] Abeel, T., Peer, Y. V., Saeys, Y., Java-ML: A Machine
Learning Library, Journal of Machine Learning Research,
2009, 10, pp 931-934.
AUTHOR PROFILES:
Fidan Kaya Gülağız received her B.Eng. in Computer Engineering from Kocaeli University in 2010 and her M.E. in Computer Engineering from Kocaeli University in 2012. She is currently working towards a Ph.D. degree in Computer Engineering at Kocaeli University, Turkey, where she is also a Research Assistant in the Computer Engineering Department. Her main research interests include data synchronization, distributed file systems, and data filtering methods.
Suhap Şahin received his B.Eng. in Electrical, Electronics and Communications Engineering from Kocaeli University in 2000 and his Ph.D. from Kocaeli University in Electrical, Electronics and Communications Engineering. He is an associate professor in the Computer Engineering Department at Kocaeli University, Turkey. His main research interests include beamforming, FPGA, OFDM, and wireless communication.