This document discusses and compares various clustering techniques used in data mining. It begins with an introduction to data mining and clustering. It then discusses different types of clustering (hard vs soft), popular clustering methodologies like K-means, hierarchical, density-based etc. It provides examples of clustering applications. The document also discusses challenges in clustering large datasets and proposes approaches like MapReduce. It evaluates pros and cons of different clustering algorithms and their real-world applications.
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...theijes
Data mining extracts previously unknown information from enormous quantities of data, which can lead to knowledge. It provides information that helps in making good decisions. Its effectiveness lies in discovering the hidden facts contained in databases through the use of multiple technologies. Clustering organizes data into clusters or groups that have high intra-cluster similarity and low inter-cluster similarity. This paper deals with the K-means clustering algorithm, which groups data based on their characteristics and attributes and performs the clustering by reducing the distances between the data points and the cluster centers. The algorithm is applied using the open-source tool WEKA, with the Insurance dataset as its input.
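The distance-reduction idea described above can be illustrated with a minimal sketch of the standard K-means loop (Lloyd's algorithm). This is not the WEKA implementation used in the paper; the function name `kmeans`, its parameters, and the random choice of initial centers are assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster a list of coordinate tuples with plain K-means.

    Returns (centroids, assignment), where assignment[i] is the index
    of the centroid closest to points[i].
    """
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct points drawn at random.
    centroids = rng.sample(points, k)
    assignment = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the loop has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, assignment
```

Each pass can only lower (or keep) the total squared distance between points and their assigned centers, which is why the loop stops once the assignment no longer changes.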
Data mining is used to manage the huge volumes of information stored in data warehouses and databases and to discover the required information. Numerous data mining techniques have been proposed, for example association rules, decision trees, neural networks, and clustering. It has been a focus of attention for many years. Clustering of the dataset is a well-known data mining strategy and is among the most effective data mining methods. It groups the dataset into a number of clusters based on predefined criteria, and it is a dependable way to discover the connections between the distinct characteristics of the data.
In the k-means clustering algorithm, the function is selected on the basis of its relevance for predicting the data, and the Euclidean distance between the centroid of each cluster and the data objects outside the cluster is computed in order to cluster the data points. In this work, the author enhanced the Euclidean distance formula to increase the cluster quality.
The problem of accuracy and of redundant dissimilar points in the clusters remains in the improved k-means, so a new enhanced approach has been proposed that uses a similarity function to check the similarity level of a point before including it in the cluster.
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUEijcsit
The huge amount of healthcare data, coupled with the need for data analysis tools, has made data mining
an interesting research area. Data mining tools and techniques help to discover and understand hidden
patterns in a dataset that may not be apparent from visualization of the data alone. Selecting an appropriate
clustering method and the optimal number of clusters for healthcare data can often be confusing and
difficult. A large number of clustering algorithms are currently available for clustering healthcare data, but
it is very difficult for people with little knowledge of data mining to choose a suitable one.
This paper analyzes clustering techniques using a healthcare dataset in order to determine suitable
algorithms that produce optimized clusters. The performance of two clustering algorithms (K-means
and DBSCAN) was compared using Silhouette score values. First, we analyzed the K-means
algorithm using different numbers of clusters (K) and different distance metrics. Second, we analyzed the
DBSCAN algorithm using different minimum numbers of points required to form a cluster (minPts) and
different distance metrics. The experimental results indicate that both K-means and DBSCAN
have strong intra-cluster cohesion and inter-cluster separation. Based on the analysis, K-means
performed better than DBSCAN in terms of clustering accuracy and execution time.
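As an illustration of how a Silhouette score rewards cohesion and separation, the following sketch computes the mean silhouette coefficient for a labelled set of points. It assumes Euclidean distance only (the paper also varies the distance metric), and the function name `silhouette_score` and its signature are illustrative choices rather than anything taken from the paper. Every label is treated as a cluster, so DBSCAN noise points would have to be filtered out beforehand.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient, using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, hence the n - 1 divisor).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

Values close to 1 indicate tight, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points sit in the wrong cluster.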
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Abstract: Learning Analytics by nature relies on computational information processing activities intended to extract from raw data some interesting aspects that can be used to obtain insights into the behaviors of learners, the design of learning experiences, etc. There is a large variety of computational techniques that can be employed, all with interesting properties, but it is the interpretation of their results that really forms the core of the analytics process. As a rising subject, data mining and business intelligence are playing an increasingly important role in the decision support activity of every walk of life. The Variance Rover System (VRS) mainly focuses on large data sets obtained from online web visits, categorizing them into clusters according to some similarity, and on predicting customer behavior and selecting actions to influence that behavior to benefit the company, so as to make optimized and beneficial business expansion decisions. Keywords: Analytics, Business intelligence, Clustering, Data Mining, Standard K-means, Optimized K-means
A SURVEY ON DATA MINING IN STEEL INDUSTRIESIJCSES Journal
In industrial environments, a huge amount of data is generated and collected in databases and data warehouses from all involved areas, such as planning, process design, materials, assembly, production, quality, process control, scheduling, fault detection, shutdown, and customer relationship management. Data mining has become a useful tool for knowledge acquisition in the industrial process of iron and steel making. With the rapid growth of data mining, various industries have started using it to search for hidden patterns, which can feed new knowledge back into the system and inform new models that enhance production quality, productivity, cost, and maintenance. The continuous improvement of all steel production processes, in terms of avoiding quality deficiencies and improving production yield, is an essential task for steel producers. A zero-defect strategy is therefore popular today, and several quality assurance techniques are used to maintain it. This report explains the methods of data mining and describes their application in the industrial environment, especially in the steel industry.
Introduction to feature subset selection methodIJSRD
Data mining is a computational process for discovering patterns in large data sets. It comprises various important techniques, one of which is classification, which has recently received great attention in the database community. Classification techniques can solve problems in fields such as medicine, industry, business, and science. PSO is an optimization method based on social behaviour. Feature Selection (FS) involves finding a subset of prominent features in order to improve predictive accuracy and remove redundant features. Rough Set Theory (RST) is a mathematical tool that deals with the uncertainty and vagueness of decision systems.
Study and Analysis of K-Means Clustering Algorithm Using RapidminerIJERA Editor
An institution is a place where the teacher explains and the student understands and learns the lesson. Every student has their own notion of what is hard and what is easy, and there is no absolute scale for measuring knowledge, but examination scores indicate a student's performance. In this case study, data mining is combined with educational strategies to improve students' performance. Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for data. It allows users to analyse data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). This project describes the use of clustering to improve academic performance in educational institutions. A live experiment was conducted in which computer science students took an exam through MOODLE (an LMS); the resulting data was analysed using RapidMiner (data mining software) and then clustered. This method helps to identify the students who need special advising or counselling from the teacher, so that a high quality of education can be provided.
CONFIGURING ASSOCIATIONS TO INCREASE TRUST IN PRODUCT PURCHASEIJwest
Clustering categorizes data into groups of similar objects. Data mining adds to the complexity of clustering a large dataset with many features. Among such datasets are those of electronic business stores that offer their products through the web. These stores require recommendation systems that can offer users the products they are most likely to need. In this study, users' previous purchases are used to present a sorted list of products to the user. Identifying associations related to users and finding centers increases the precision of the recommended list. Configuring associations and creating user profiles are important topics in current studies. In the proposed method, association rules are used to model user interactions on the web; they use the time spent on a page and the frequency of visits to weight pages and to describe users' interest in page groups. The weight of each transaction item therefore describes the user's interest in that item. The results show that the proposed method presents a more complete model of users' behavior because it combines the weight and the membership degree of pages simultaneously when ranking candidate pages. The method obtained higher accuracy than other methods, even with larger numbers of pages.
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...IJECEIAES
Data analysis plays a prominent role in interpreting various phenomena. Data mining is the process of deriving useful knowledge from extensive data. Based on classical statistical models, data can be exploited beyond its storage and management. Cluster analysis, a primary investigation carried out with little or no prior knowledge, spans research and development across a wide variety of communities. Cluster ensembles combine the individual solutions obtained from different clusterings to produce a final, high-quality clustering, which is required in a wide range of applications. The method arose from the aim of increasing robustness, scalability and accuracy. This paper gives a brief overview of the generation methods and consensus functions used in cluster ensembles. The survey analyzes the various techniques and cluster ensemble methods.
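One common family of consensus functions builds a co-association matrix from the base clusterings. The sketch below is a minimal illustration of that idea under stated assumptions and is not one of the specific methods surveyed in the paper: the function names, the 0.5 threshold, and the use of connected components to extract the final clusters are all illustrative.

```python
def co_association(partitions):
    """Fraction of base clusterings in which each pair of points co-occurs.

    partitions is a list of label lists, one per base clustering, all of
    the same length. Label values need not agree between clusterings.
    """
    n = len(partitions[0])
    m = len(partitions)
    return [
        [sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
        for i in range(n)
    ]


def consensus_clusters(partitions, threshold=0.5):
    """Link points whose co-association exceeds the threshold and return
    the connected components as the consensus labels."""
    matrix = co_association(partitions)
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already placed in an earlier component
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Because the matrix only records whether two points share a cluster, it is unaffected by the arbitrary numbering of clusters in each base solution, which is the main practical obstacle when combining clusterings.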
A study and survey on various progressive duplicate detection mechanismseSAT Journals
Abstract: Duplicate detection is a serious problem in several applications, such as personal details management, customer affiliation management, and data mining. This survey deals with the various duplicate record detection techniques for both small and large datasets. To detect duplicates with less execution time and without degrading dataset quality, methods such as Progressive Blocking and Progressive Neighborhood are used. The progressive sorted neighborhood method (PSNM) is used in this model to detect duplicates in a parallel approach. The Progressive Blocking algorithm works on large datasets, where finding duplicates requires immense time. These algorithms are used to enhance the duplicate detection system. Using them, the efficiency can be doubled over the conventional duplicate detection method. Severa
Data Mining Framework for Network Intrusion Detection using Efficient TechniquesIJAEMSJORNAL
The implementation measures the classification accuracy on benchmark datasets after combining SIS and ANNs. In order to put a number on the gains made by using SIS as a strategic tool in data mining, extensive experiments and analyses are carried out. The predicted results of this investigation will have implications for both theoretical and applied settings. Predictive models in a wide variety of disciplines may benefit from the enhanced classification accuracy enabled by SIS inside ANNs. An invaluable resource for scholars and practitioners in the fields of AI and data mining, this study adds to the continuing conversation about how to maximize the efficacy of machine learning methods.
An Analysis of Outlier Detection through clustering methodIJAEMSJORNAL
This research paper deals with outliers, that is, unusual behavior of an object relative to its surroundings. Outlier detection can be employed for both anomaly detection and abnormal observation. An outlier is identified relative to the other members of the same data set, and its deviation can be measured using terms such as range, size, and activity. By detecting outliers, one can easily reject undesirable data in the field. For instance, in healthcare, a person's health condition can be determined from their latest health report or their regular activity; if the person is found to be inactive, there is a chance that the person is sick. Two approaches are used in this paper for detecting outliers: 1) a centroid-based approach built on the K-Means and hierarchical clustering algorithms, and 2) a clustering-based approach. These approaches help to detect outliers by placing all similar elements in the same group, with the clustering method providing the grouping. The paper is based on these two approaches.
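A centroid-based check of the kind described above can be sketched as follows. The paper does not state its exact decision rule, so the cutoff used here (mean distance plus `z` population standard deviations) and the names `centroid` and `flag_outliers` are assumptions made for illustration; the cluster labels are expected to come from a prior clustering step such as K-means.

```python
import math
import statistics


def centroid(points):
    """Component-wise mean of a list of coordinate tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def flag_outliers(points, labels, z=2.0):
    """Return one boolean per point: True if it lies unusually far
    from the centroid of its own cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centers = {lab: centroid(members) for lab, members in clusters.items()}
    distances = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    # Assumed rule: flag distances above mean + z * standard deviation.
    cutoff = statistics.mean(distances) + z * statistics.pstdev(distances)
    return [d > cutoff for d in distances]
```

A single extreme point also pulls its cluster's centroid and inflates the standard deviation, so with very small data sets a lower `z` or a robust statistic such as the median may be more appropriate.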
Introduction to Multi-Objective Clustering EnsembleIJSRD
Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. In this paper we introduce the concepts of data mining, association rules, and multilevel association rules, together with different algorithms and their advantages, as well as the concepts of fuzzy logic and genetic algorithms. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
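The support-confidence framework mentioned above can be made concrete with a short sketch. The helper names and the tiny concept hierarchy in the usage note are hypothetical; the definitions of support and confidence themselves are the standard ones.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule A -> C."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0


def generalize(transactions, hierarchy):
    """Replace each item by its parent in a concept hierarchy (one level
    up). Items without an entry in the hierarchy are kept unchanged."""
    return [{hierarchy.get(item, item) for item in t} for t in transactions]
```

For example, with a hypothetical hierarchy mapping "skim milk" and "whole milk" to "milk", an itemset that is too rare at the item level can reach the minimum support after `generalize` is applied, which is the motivation for mining rules at multiple levels.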
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...iosrjce
In distributed peer-to-peer systems, huge amounts of data are dispersed, and grouping data from
multiple sources is a tedious task. Applying effective data mining techniques makes the clustering of
distributed data easier and reduces the hurdles caused by processing, storage, and transmission costs.
To perform dynamic distributed clustering, a fully decentralized clustering method has been proposed. HGD
Cluster can cluster a data set that is dispersed among a large number of nodes in a distributed environment
using hierarchical and grid-based clustering techniques. HGD Cluster applies when nodes are fully
asynchronous, decentralized, and adaptable to churn. The general design principles employed in the proposed
algorithm also allow customization for other classes of clustering. It is fully capable of clustering dynamic and
distributed data sets. Using the algorithm, every node can maintain summarized views of the dataset.
The main aim of the proposed system is to customize HGD Cluster so that the hierarchical and grid-based
clustering methods can be executed on the summarized views. Coping with dynamic data is made possible by
gradually adapting the clustering model.
A Survey on the Clustering Algorithms in Sales Data MiningEditor IJCATR
This paper discusses different clustering techniques that can be used in sales databases. The advancement of digital data
collection and the build-up of data in data banks, a result of modernization in sales disciplines, have created great challenges in
processing mass data deposits to obtain better and more meaningful results. Clustering techniques are therefore necessary so that
senior management in the sales department can access processed data when making decisions.
In this paper, I focus on retail sales data mining, classification, and clustering techniques. I analyze the attributes used for
predicting buyers' behavior and purchase performance by means of various classification methods such as decision trees, the C4.5
algorithm, and the ID3 algorithm.
Certain Investigation on Dynamic Clustering in Dynamic Dataminingijdmtaiir
Clustering is the process of grouping a set of objects
into classes of similar objects. Dynamic clustering is a
new research area concerned with datasets that have dynamic
aspects. It requires the clusters to be updated whenever new data
records are added to the dataset, and may result in the
clustering changing over time. When there are continuous updates and
a huge amount of dynamic data, rescanning the database is not
possible in static data mining, whereas a dynamic data mining
process can cope with this. Dynamic data mining occurs when
the derived information is kept for the purpose of analysis
and the environment is dynamic, i.e. many updates occur.
Since this is now established by most researchers,
research is moving on to solving the problems of applying data
mining to dynamic databases. This paper surveys
existing work related to
dynamic clustering and incremental data clustering.
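The core requirement of incremental clustering, updating clusters as records arrive without rescanning the stored data, can be sketched with a running-mean update. This is a generic leader-style illustration, not a method from the surveyed papers; the class name `IncrementalClusters` and the `radius` parameter are assumptions.

```python
import math


class IncrementalClusters:
    """Nearest-centroid clustering that is updated one record at a time."""

    def __init__(self, radius):
        self.radius = radius  # assumed maximum distance for joining a cluster
        self.centroids = []
        self.counts = []

    def add(self, point):
        """Assign point to a cluster, update that cluster, return its index."""
        if self.centroids:
            idx = min(
                range(len(self.centroids)),
                key=lambda i: math.dist(point, self.centroids[i]),
            )
            if math.dist(point, self.centroids[idx]) <= self.radius:
                n = self.counts[idx] + 1
                # Running mean: new = old + (x - old) / n, so earlier
                # records never have to be read again.
                self.centroids[idx] = tuple(
                    c + (x - c) / n
                    for c, x in zip(self.centroids[idx], point)
                )
                self.counts[idx] = n
                return idx
        # No cluster is close enough: open a new one at this point.
        self.centroids.append(tuple(point))
        self.counts.append(1)
        return len(self.centroids) - 1
```

Only the centroids and member counts are stored, so the cost of absorbing a new record depends on the number of clusters rather than on the size of the database. The result does depend on the order in which records arrive.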
DATA MINING IN EDUCATION : A REVIEW ON THE KNOWLEDGE DISCOVERY PERSPECTIVEIJDKP
Knowledge Discovery in Databases (KDD) is the process of finding knowledge in massive amounts of data, and
data mining is the core of this process. Data mining can be used to mine understandable, meaningful patterns from large databases, and these patterns may then be converted into knowledge. Data mining is the process of extracting the information and patterns derived by the KDD process, which helps in crucial decision-making. Data mining works with a data warehouse, and the whole process is divided into an action plan to be performed on the data: selection, transformation, mining, and results interpretation. In this paper, we review the Knowledge Discovery perspective in data mining and consolidate the different areas of data
mining, along with its techniques and methods.
Different Classification Technique for Data mining in Insurance Industry usin...IOSRjournaljce
This paper addresses the issues and techniques for Property/Casualty actuaries applying data mining methods. Data mining means the effective discovery of unknown patterns in a large database. It is an interactive knowledge discovery procedure that includes data acquisition, data integration, data exploration, model building, and model validation. The paper provides an overview of the data discovery method and introduces some important data mining methods for application to insurance, including cluster discovery approaches.
An Iterative Improved k-means ClusteringIDES Editor
Clustering is a data mining (machine learning),
unsupervised learning technique used to place data elements
into related groups without advance knowledge of the group
definitions. One of the most popular and widely studied
clustering methods that minimize the clustering error for
points in Euclidean space is called K-means clustering.
However, the k-means method converges to one of many local
minima, and it is known that the final results depend on the
initial starting points (means). In this research paper, we have
introduced and tested an improved algorithm to start the kmeans
with good starting points (means). The good initial
starting points allow k-means to converge to a better local
minimum; also the numbers of iteration over the full dataset
are being decreased. Experimental results show that initial
starting points lead to good solution reducing the number of
iterations to form a cluster.
STUDENTS’ PERFORMANCE PREDICTION SYSTEM USING MULTI AGENT DATA MINING TECHNIQUEIJDKP
A high prediction accuracy of the students’ performance is more helpful to identify the low performance students at the beginning of the learning process. Data mining is used to attain this objective. Data mining techniques are used to discover models or patterns of data, and it is much helpful in the decision-making.Boosting technique is the most popular techniques for constructing ensembles of classifier to improve the classification accuracy. Adaptive Boosting (AdaBoost) is a generation of boosting algorithm. It is used for
the binary classification and not applicable to multiclass classification directly. SAMME boosting
technique extends AdaBoost to a multiclass classification without reduce it to a set of sub-binaryclassification.In this paper, students’ performance prediction system usingMulti Agent Data Mining is proposed to predict the performance of the students based on their data with high prediction accuracy and provide helpto the low students by optimization rules.The proposed system has been implemented and evaluated by investigate the prediction accuracy ofAdaboost.M1 and LogitBoost ensemble classifiers methods and with C4.5 single classifier method. The results show that using SAMME Boosting technique improves the prediction accuracy and outperformed
C4.5 single classifier and LogitBoost.
A SURVEY ON DATA MINING IN STEEL INDUSTRIESIJCSES Journal
In Industrial environments, huge amount of data is being generated which in turn collected indatabase anddata warehouses from all involved areas such as planning, process design, materials, assembly, production, quality, process control, scheduling, fault detection,shutdown, customer relation management, and so on. Data Mining has become auseful tool for knowledge acquisition for industrial process of Iron and steel making. Due to the rapid growth in Data Mining, various industries started using data mining technology to search the hidden patterns, which might further be used to the system with the new knowledge which might design new models to enhance the production quality, productivity optimum cost and maintenance etc. The continuous improvement of all steel production process regarding the avoidance of quality deficiencies and the related improvement of production yield is an essential task of steel producer. Therefore, zero defect strategy is popular today and to maintain it several quality assurancetechniques areused. The present report explains the methods of data mining and describes its application in the industrial environment and especially, in the steel industry.
Introduction to feature subset selection methodIJSRD
Data Mining is a computational progression to ascertain patterns in hefty data sets. It has various important techniques and one of them is Classification which is receiving great attention recently in the database community. Classification technique can solve several problems in different fields like medicine, industry, business, science. PSO is based on social behaviour for optimization problem. Feature Selection (FS) is a solution that involves finding a subset of prominent features to improve predictive accuracy and to remove the redundant features. Rough Set Theory (RST) is a mathematical tool which deals with the uncertainty and vagueness of the decision systems.
Study and Analysis of K-Means Clustering Algorithm Using RapidminerIJERA Editor
Institution is a place where teacher explains and student just understands and learns the lesson. Every student has his own definition for toughness and easiness and there isn’t any absolute scale for measuring knowledge but examination score indicate the performance of student. In this case study, knowledge of data mining is combined with educational strategies to improve students’ performance. Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for data. It allows users to analyse data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational database. Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).This project describes the use of clustering data mining technique to improve the efficiency of academic performance in the educational institutions .In this project, a live experiment was conducted on students .By conducting an exam on students of computer science major using MOODLE(LMS) and analysing that data generated using RapidMiner(Datamining Software) and later by performing clustering on the data. This method helps to identify the students who need special advising or counselling by the teacher to give high quality of education.
CONFIGURING ASSOCIATIONS TO INCREASE TRUST IN PRODUCT PURCHASEIJwest
Clustering is categorizing data into groups with similar objects. Data mining adds to complexities of clustering a large dataset with various features. Among these datasets, there are electronic business stores which offer their products through web. These stores require recommendation systems which can offer products to the user which the user might require them with higher probability. In this study, previous purchases of users are used to present a sorted list of products to the user. Identifying associations related to users and finding centers increases precision of the recommended list. Configuration of associations and creating a profile for users is important in current studies. In the proposed method, association rules are presented to model user interactions in the web which use time that a page is visited and frequency of visiting a page to weight pages and describes users’ interest to page groups. Therefore, weight of each transaction item describes user’s interest in that item. Analyzing results show that the proposed method presents a more complete model of users’ behavior because it combines weight and membership degree of pages simultaneously for ranking candidate pages. This method has obtained higher accuracy compared to other methods even in higher number of pages.
Applications Of Clustering Techniques In Data Mining A Comparative Study
1. (IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 12, 2020
146 | P a g e
www.ijacsa.thesai.org
Applications of Clustering Techniques in Data
Mining: A Comparative Study
Muhammad Faizan1, Megat F. Zuhairi2*, Shahrinaz Ismail3, Sara Sultan4
Malaysian Institute of Information Technology, Universiti Kuala Lumpur, Kuala Lumpur, Malaysia1, 2, 3
College of Computing and Information Sciences, Karachi Institute of Economics and Technology, Karachi, Pakistan4
Abstract—In modern scientific research, data analysis is
widely used across computer science, communication science,
and biological science. Clustering plays a significant role in
data analysis. Clustering, recognized as an essential problem in
unsupervised learning, deals with segmenting the structure of
data in an unknown region and is the basis for further
understanding. Among the many clustering algorithms (more
than 100 are known), the K-means clustering algorithm is
commonly used because of its simplicity and rapid
convergence. This paper explains the different applications,
literature, challenges, methodologies, and considerations of
clustering methods, as well as the related key objectives for
implementing clustering with big data. It also presents one of
the most common clustering techniques for identifying data
patterns by performing an analysis of sample data.
Keywords—Clustering; data analysis; data mining;
unsupervised learning; k-means; algorithms
I. INTRODUCTION
Data mining is a recent interdisciplinary field of
computational science. It is the process of discovering
interesting information from large amounts of data stored in
data warehouses, databases, or other information repositories,
and of automatically discovering data patterns in massive
databases [1], [2]. Data mining refers to the extraction or
“mining” of valuable information from large data volumes [3],
[4]. Nowadays, people come across massive amounts of
information and store or represent it as datasets [4], [5].
Process discovery is the learning task concerned with
constructing process models from the event logs of information
systems [6]. Interesting insights, observable behaviours, or
high-level information can be extracted from a database by
performing data mining and can be viewed or browsed from
various angles. The knowledge discovered can be applied to
process control, decision making, information management, and
query handling. Decision-makers can use these methods to make
clearer decisions about real-world problems. In data mining,
many data clustering techniques are used to trace a particular
data pattern [2]. Data mining methods are shown in Fig. 1.
Clustering techniques are useful meta-learning tools for
analyzing the knowledge produced by modern applications.
Clustering algorithms are used extensively not only for
organizing and categorizing data but also for data modelling
and data compression [7]. The purpose of clustering is to
group the data according to similarities in their traits,
characteristics, and behaviours [8]. Cluster evaluation is an
essential activity in knowledge discovery and data mining.
Clustering can be carried out in an unsupervised,
semi-supervised, or supervised manner [2]. However, more than
100 clustering algorithms are known, and selecting among them
for the best results is challenging.
PyClustering is an open-source data mining library
written in Python and C++ that provides a wide variety of
clustering methods and algorithms, including bio-inspired
oscillatory networks. PyClustering focuses primarily on cluster
analysis, which makes it more user friendly and understandable.
Many methods and algorithms are available in the C++ namespace
“ccore::clst” and in the Python module “pyclustering.cluster.”
Some of the algorithms and their availability in the
PyClustering modules are listed in Table I [9].
A. Clustering in Data Mining
Data volumes continue to expand exponentially in various
scientific and industrial sectors, and automated categorization
techniques have become standard tools for data set exploration
[10]. Automatic categorization techniques, traditionally called
clustering, help to reveal a dataset's structure [9]. Clustering is
a well-established unsupervised data mining method
[11], and it deals with discovering structure in a collection of
unlabeled data. The overall process followed when
developing an unsupervised learning solution is
summarized in Fig. 2.
Fig. 1. Methods of Data Mining Techniques.
*Corresponding Author
TABLE I. ALGORITHMS AND METHODS IN “PYTHON MODULE PYCLUSTERING”
Algorithm | Python | C++
Agglomerative (Jain & Dubes, 1988) | ✓ | ✓
BIRCH (Zhang, Ramakrishnan, & Livny, 1996) | ✓ |
CLARANS (Ng & Han, 2002) | ✓ |
TTSAS (Theodoridis & Koutroumbas, 2009) | ✓ | ✓
CURE (Guha, Rastogi, & Shim, 1998) | ✓ | ✓
K-Means (Macqueen, 1967) | ✓ | ✓
BANG (Schikuta & Erhart, 1998) | ✓ |
ROCK (Guha, Rastogi, & Shim, 1999) | ✓ | ✓
K-Medians (Jain & Dubes, 1988) | ✓ | ✓
Elbow (Thorndike, 1953) | ✓ | ✓
GA - Genetic Algorithm (Harvey, Cowgill, & Watson, 1999) | ✓ | ✓
DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) | ✓ | ✓
X-Means (Pelleg & Moore, 2000) | ✓ | ✓
K-Means++ (Arthur & Vassilvitskii, 2007) | ✓ | ✓
BSAS (Theodoridis & Koutroumbas, 2009) | ✓ | ✓
K-Medoids (Jain & Dubes, 1988) | ✓ | ✓
Sync-SOM | ✓ |
SyncNet | ✓ | ✓
SOM-SC (Kohonen, 1990) | ✓ | ✓
OPTICS (Ankerst, Breunig, Kriegel, & Sander, 1999) | ✓ | ✓
CLIQUE (Agrawal, Gunopulos, & Raghavan, 2005) | ✓ | ✓
Silhouette (Rousseeuw, 1987) | ✓ |
MBSAS (Theodoridis & Koutroumbas, 2009) | ✓ | ✓
HSyncNet (Shao, He, Böhm, Yang, & Plant, 2013) | ✓ | ✓
EMA (Gupta & Chen, 2011) | ✓ |
Fig. 2. Unsupervised Learning Model.
The main applications of unsupervised learning are:
• Simplifying datasets by aggregating variables with similar
attributes.
• Detecting anomalies that do not fit any group.
• Segmenting datasets by some shared attributes.
Clustering results in a reduction of the dimensionality of
the data set. The objective of a clustering algorithm is to
identify the distinct groups within the data set [12]. There are
different clustering approaches, such as hierarchical, partitional,
grid-based, density-based, and model-based [13]. The performance of
the various methods can differ depending on the type of data used
for clustering and the volume of data available [14]. For
example, document clustering has been investigated for use in
many different areas of text mining and information retrieval
[15]. There are several different quality metrics, and the relative
ranking and performance of different clustering algorithms
can vary considerably depending on which measure is
used. Two types of measures of cluster “goodness” or quality are
used. One type of measure allows comparing
different cluster sets without external knowledge and is called
an “internal quality measure.” The other type is
called an “external quality measure,” which evaluates
how well the clustering works by comparing the groups
generated by the clustering technique to known classes.
Fig. 3 shows a simple example of data clustering based on data
similarity.
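As a rough illustration of the two kinds of measure, the sketch below computes one internal measure (the within-cluster sum of squared errors) and one external measure (purity against known class labels). Both metrics, the function names, and the toy data are illustrative assumptions and are not taken from this paper.

```python
from collections import Counter, defaultdict


def within_cluster_sse(points, labels):
    """Internal measure: squared distance of every point to its cluster mean."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    sse = 0.0
    for members in clusters.values():
        dim = len(members[0])
        mean = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        sse += sum((m[d] - mean[d]) ** 2 for m in members for d in range(dim))
    return sse


def purity(labels, classes):
    """External measure: share of points covered by each cluster's majority class."""
    by_cluster = defaultdict(list)
    for label, cls in zip(labels, classes):
        by_cluster[label].append(cls)
    majority = sum(Counter(c).most_common(1)[0][1] for c in by_cluster.values())
    return majority / len(labels)


# Hypothetical toy data: four 2-D points, a clustering, and known classes.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels = [0, 0, 1, 1]
classes = ["a", "a", "b", "a"]

print(within_cluster_sse(points, labels))  # needs no external knowledge
print(purity(labels, classes))             # needs the known classes
```

A lower sum of squared errors indicates tighter clusters, while a purity closer to 1 indicates closer agreement with the known classes; the two can rank the same clusterings differently.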
1) Types of clustering: Clustering can generally be broken
down into two subgroups:
• Hard Clustering: In hard clustering, each data point either
belongs to a cluster entirely or does not belong to it at all.
o For example, each customer is placed into exactly one of
10 groups.
• Soft Clustering: In soft clustering, a probability or
likelihood of belonging to each cluster is assigned to the data
point instead of placing the data point into a single cluster.
o For example, each customer is assigned a
probability of belonging to each of 10 clusters.
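The difference between the two subgroups can be sketched in a few lines. The example below assumes clusters are summarized by centre points; the softmax-style weighting of negative distances used for the soft memberships is one illustrative choice among many, and all names and values are hypothetical.

```python
import math


def hard_assign(point, centres):
    """Hard clustering: return the index of the single nearest centre."""
    dists = [math.dist(point, c) for c in centres]
    return dists.index(min(dists))


def soft_assign(point, centres, temperature=1.0):
    """Soft clustering: return one membership probability per centre.

    Assumption: memberships are proportional to exp(-distance / temperature).
    """
    dists = [math.dist(point, c) for c in centres]
    weights = [math.exp(-d / temperature) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical data: two cluster centres and one customer.
centres = [(0.0, 0.0), (4.0, 0.0)]
customer = (1.0, 0.0)

print(hard_assign(customer, centres))  # a single cluster index
print(soft_assign(customer, centres))  # probabilities that sum to 1
```

The hard result is a single label, whereas the soft result keeps the information that the customer also has some affinity with the second cluster.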
2) Clustering methodologies: Because clustering is
subjective, it can be used to accomplish many different
objectives. Every methodology follows its own set of rules
that define the ‘similarity’ between
data points. Cluster analysis is not an automated task, but an
iterative information discovery process or multi-objective
collaborative optimization involving trial and error [16]. More
than 100 clustering algorithms are known, but only a
few of them are widely used. Some of the
clustering methodologies are listed in Table II.
The best known and most widely used partitioning
method is K-means [17]–[19]. Among the many clustering
techniques, K-means is an unsupervised and
iterative data mining approach [11]. The standard approach of
all clustering techniques is to identify cluster centres
that represent each cluster. K-means clustering is a method of
cluster analysis that partitions the data points
into k clusters in which each observation belongs to the cluster
with the nearest mean [7]. The most significant advantage of the
K-means algorithm in data mining applications is its efficiency in
clustering large data sets. K-means and its variants
have a computation time complexity that is linear in the
number of records, but they are thought to produce inferior clusters
[15].
The K-means algorithm is a basic algorithm for iterative
clustering. Given the number of classes K in the data set and
an initial centroid for each class, it uses distance as the
metric and represents each class by its centroid. In
the K-means partitioning algorithm, the centre of each cluster
is the mean value of the objects within the cluster.
Fig. 3. Simple Clustering Example.
TABLE II. CLUSTERING METHODOLOGIES
Typical Clustering Methodologies
Method | Algorithm | Description
Distance-based method | Partitioning algorithms: “K-means, K-medians, K-medoids.” Hierarchical algorithms: “Agglomerative, Divisive method.” | These algorithms run iteratively to find the local optima and are very easy to understand, but they do not scale to large datasets.
Grid-based method | Grid-based algorithm: individual regions of the data space are formed into a grid-like structure. | These methods use a single uniform grid mesh to separate the entire problem domain into cells. Each cell represents the data objects located within it using a collection of statistical attributes of those objects.
Density-based method | Density-Based Spatial Clustering of Applications with Noise (DBSCAN); Ordering Points To Identify the Clustering Structure (OPTICS) | These algorithms scan the data space for areas with different densities of data points. They isolate the different density regions and assign the data points within each region to the same cluster.
Probabilistic and generative models | Expectation-maximization algorithm: modeling data from a generative process. | Often these models suffer from over-fitting. A prominent example of such models is the Expectation-Maximization algorithm, which uses multivariate normal distributions.
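As an illustration of the grid-based idea in Table II, the sketch below splits the data space into square cells and keeps only a small statistical summary (count and mean) per occupied cell. The function name `grid_summary`, the cell size, and the sample points are hypothetical, and a real grid-based algorithm would go on to merge neighbouring dense cells into clusters, which is not shown here.

```python
from collections import defaultdict

def grid_summary(points, cell_size):
    """Map each occupied grid cell to (count, mean point) of the points inside it."""
    cells = defaultdict(list)
    for p in points:
        # The cell key is the integer grid coordinate along each dimension.
        key = tuple(int(x // cell_size) for x in p)
        cells[key].append(p)
    summary = {}
    for key, members in cells.items():
        n = len(members)
        mean = tuple(sum(m[d] for m in members) / n for d in range(len(key)))
        summary[key] = (n, mean)
    return summary

# Hypothetical 2-D points falling into two different cells.
points = [(0.2, 0.3), (0.8, 0.9), (2.1, 2.2), (2.9, 2.4)]
summary = grid_summary(points, cell_size=1.0)
```

Later processing can work on the per-cell summaries instead of the individual points, which is why the cost of such methods depends mainly on the number of cells.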
II. BACKGROUND AND DISCUSSION OF CLUSTERING
APPLICATIONS AND APPROACHES
Cluster analysis has many applications in different
domains. For example, it is widely used as a preprocessing
or intermediate step for other data mining tasks,
such as generating a compact summary of data for classification,
pattern discovery, hypothesis generation and testing,
compression, reduction, and outlier detection. Clustering
analysis can also be used in collaborative filtering,
recommendation systems, customer segmentation, multimedia
data analysis, biological data analysis, social network analysis,
and dynamic trend detection. Some clustering techniques
and approaches are discussed in Table III.
A. Requirement and Challenges
Despite recent efforts, clustering “mixed and categorical”
data in the context of big data remains a challenge,
due to the lack of an inherently meaningful similarity
measure between categorical objects and the high computational
complexity of current clustering techniques [18]. For
cluster analysis, there are several items to be considered. Some
of them are listed in Table IV.
Typically, there are multiple ways to use or apply
clustering analysis, for example:
As a stand-alone tool to get insights into data
distribution.
As a preprocessing (or intermediate) step for other
algorithms.
Some advantages and limitations of
clustering techniques are listed in Table V.
According to [24], [25], parallel classification is a better
approach for big data, but the complexity of its
implementation remains a significant challenge. The
MapReduce framework can be suitable for implementing
parallel algorithms, but there is still no algorithm that handles all
the challenges of big data. In [26], the authors proposed a novel
Spark extreme learning machine “SELM” algorithm based on the
Spark parallel framework to boost the speed and enhance the
efficiency of the whole process. In all of their experimental
results, SELM gives the highest speed and minimal error compared to
Parallel Extreme Learning Machine (PELM) and an improved
Extreme Learning Machine (ELM*). Table VI presents the
pros and cons of different clustering algorithms with real-world
applications.
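To illustrate why a MapReduce-style framework can suit parallel clustering, the sketch below expresses one K-means iteration as a map phase and a reduce phase. It is plain in-process Python, not the API of Hadoop, Spark, or any other framework; the function names, the two data chunks, and the starting centres are all hypothetical.

```python
from collections import defaultdict
from functools import reduce

def nearest(point, centres):
    """Index of the centre closest to the point (squared Euclidean distance)."""
    return min(range(len(centres)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centres[i])))

def map_phase(chunk, centres):
    """Emit (centre index, (point as partial sum, count of 1)) for each point in a chunk."""
    return [(nearest(p, centres), (p, 1)) for p in chunk]

def combine(a, b):
    """Add two (vector sum, count) pairs."""
    return (tuple(x + y for x, y in zip(a[0], b[0])), a[1] + b[1])

def reduce_phase(pairs, centres):
    """Group by centre index, add up the partial sums, and return the updated centres."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    new_centres = list(centres)  # a centre that received no points keeps its position
    for key, values in groups.items():
        total, count = reduce(combine, values)
        new_centres[key] = tuple(x / count for x in total)
    return new_centres

# Two chunks stand in for data held on two different workers.
chunks = [[(0.0, 0.0), (1.0, 0.0)], [(9.0, 9.0), (10.0, 9.0)]]
centres = [(0.0, 1.0), (8.0, 8.0)]

pairs = [pair for chunk in chunks for pair in map_phase(chunk, centres)]
centres = reduce_phase(pairs, centres)
```

Each chunk can be mapped independently, and only the small (sum, count) pairs need to be brought together, which is what makes the step easy to parallelize. The iteration itself still has to be repeated until the centres stop changing.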
TABLE III. CLUSTERING TECHNIQUES AND APPROACHES WITH BENEFITS
Ref. | Author | Year | Technique / Algorithm | Approach | Outcomes
[19] | Chunhui Yuan and Haitao Yang | 2019 | K-Means Clustering Algorithm | Different methods applied to each dataset to determine the optimal selection of K-Value. | Concluded that these four methods (Elbow method, silhouette coefficient, gap statistic, and canopy) satisfy the criteria for clustering small data sets. In addition, the canopy algorithm is the best choice for large and complex data sets.
[20] | Tengfei Zhang, Fumin Ma | 2015 | Rough k-means clustering | Improved rough k-means clustering with Gaussian function based on a weighted distance measure | An improved rough k-means algorithm based on a weighted distance measure with Gaussian function handles the objects that are wrongly assigned to clusters, and also handles vulnerable sets while distributing overlapping objects in different clusters by rough k-means with the same weighted distance for both upper and lower bounds.
[21] | Lior Rokach, Oded Maimon | 2015 | Clustering methods: Hierarchical-based, Model-based, Grid-based, Partitioning-based, Density-based. | Different clustering methods/techniques are used to determine clustering efficiency in large data sets and explain how the number of clusters can be calculated. | For large datasets, concluded that “K-means clustering is more efficient in terms of its time, space complexity, and its order-independent” and “Hierarchical clustering is more versatile, but it has the following disadvantages: Time complexity O( and space complexity of a hierarchical agglomerative algorithm is O(
[22] | Zengyou He, Xiaofei Xu, Shengchun Deng, Bin Dong | 2015 | K-means, K-modes, K-Histogram | Compare different clustering algorithms to determine an efficient clustering algorithm for the categorical dataset. | K-Histogram extends K-means to categorical domains by substituting the means of clusters with histograms. In general, K-Histogram is almost the same as the K-modes algorithm, but compared to K-modes, K-Histogram is more stable and converges faster.
[16] | M. Venkat Reddy, M. Vivekananda, RUVN Satish | 2017 | Divisive and Agglomerative Hierarchical Clustering with K-means. | Discover an efficient clustering by comparing Divisive and Agglomerative Hierarchical Clustering with K-means. | To obtain high accuracy, Agglomerative Clustering with K-means will be the practical choice. Divisive clustering with K-means also works efficiently where each cluster can be taken fixedly.
[23] | Ahamed Al Malki, Mohamed M. Rizk, M.A. El-Shorbagy, A. A. Mousa | 2016 | K-means, Genetic algorithm | Introduced a hybrid approach of the Genetic algorithm with K-means for solving clustering problems. | A hybrid approach of K-means with a Genetic algorithm efficiently solves all the problems of K-means, e.g., K-means producing empty clusters from the initial centre vectors and converging to a non-optimal value.
[7] | Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar, Nidhi Gupta | 2012 | Hierarchical, K-Means, DBSCAN, OPTICS, Density-Based Clustering, EM Algorithm | A comparison was made between different clustering techniques to find the best-performing algorithm. | K-means is faster than all the other algorithms discussed in that paper. When using a huge dataset, K-means and EM will give better results than hierarchical clustering.
[11] | Karthikeyan B., Dipu Jo George, G. Manikandan, Tony Thomas | 2020 | K-means, Agglomerative Hierarchical Clustering | Comparative research on K-Means and Agglomerative Hierarchical Clustering to determine the best-suited algorithm. | K-means is best suited for larger datasets in terms of minimum execution time and rate of change in memory usage. It is also concluded that agglomerative clustering is best suited for smaller datasets due to its overall minimum consumption of memory.
TABLE IV. CONSIDERATIONS OF CLUSTERING ANALYSIS
Considerations for Clustering Analysis
Considerations | Options | Examples
Similarity measure | Distance-based / Connectivity-based | Euclidean, road network, vector / Density, contiguity
Partitioning criteria | Single level / Hierarchical partitioning | Often / Multi-level
Cluster space | Full space / Subspace | Low-dimensional / High-dimensional
Separation of clusters | Exclusive / Non-exclusive | Data point belongs to only one region / Data point belongs to multiple regions
TABLE V. ADVANTAGES AND LIMITATIONS OF CLUSTERING TECHNIQUES
Clustering techniques “Advantages & Limitations”
Clustering Techniques | Advantages | Limitations
Data-mining clustering algorithms | Implementation is simple. | Compromises on user's privacy. Do not deal with a large amount of data.
Dimension reduction | It is very fast, reduces the dataset, and the cost of the treatment will be optimized. | It must be applied before the classification algorithm. It cannot provide an efficient result for the high dimensional dataset. It may lose some amount of data.
Parallel classification | It gives minimal execution time and is more scalable. | Difficult to implement.
MapReduce framework | Flexibility, scalability, security and authentication, batch processing, etc. | It does not do best for graphs, iterative, and incremental, multiple inputs, etc.
TABLE VI. CLUSTERING ALGORITHMS PROS AND CONS
Algorithm Name | Pros | Cons | Applications in Real World
K-means | Handles large amounts of data. Minimum execution time. Simple to implement, etc. | Manually choose the K value. Clustering outliers. Dependent on starting point/value. Handle empty clusters, etc. | Wireless networks. System diagnostic. Search Engine. Document Analysis. Fraud detection. Call record Analysis.
Hierarchical Clustering | Do not need to specify the initial value. Easy to implement, scalable and easy to understand, etc. | Cannot handle a large amount of data with different sizes. No backtracking. No swapping between objects. More space and time complexity. | Human skin analysis [27]. Generating a portal site. Web usage mining.
Genetic Algorithm | Easily understandable and converge with different problems. | It cannot always give the best result for all problems but provide the optimum solution. It cannot search for a single point, search from a population of a point. It is computationally expensive, e.g., time-consuming. It may lose data in a crossover. | Engineering Designs. Robotics. Telecommunications, traffic, shipments routing. Virtual Gaming. Marketing.
DBSCAN | The number of clusters does not need to be defined. Handle outliers. | Unable to handle datasets with distinct densities. Struggles to work with High Dimensionality Data. | Satellite pictures, etc.
III. RUNNING EXAMPLE WITH K-MEAN
A car manufacturing company wants to identify the purchase
behaviours of its customers to see which product is getting
more sales and how its customers go about making a purchase. It
currently looks at each customer's details and, based on this
information, decides which product's manufacturing should be
increased and which customer behaviours can help
the company boost sales of other products by starting a
promotional campaign or increasing the availability of resources.
However, the company can potentially have millions of
customers. It is not possible to look at each customer's data
individually and then make a decision; a manual process would
take a huge amount of time. This is where K-means clustering
offers a convenient way to analyze the data automatically. The
K-means clustering algorithm utilizes a fixed number of clusters
for clustering [12], [28]. It starts by partitioning the data
into the chosen number of clusters and then improves the
partitions iteratively to find patterns in the data. Let D= {D1,
D2, …, Dn} be the set of data points and Y= {Y1, Y2, …, Yt}
be the set of centers. This clustering technique is implemented
and analyzed using the WEKA tool. The K-means algorithm
can be implemented in the following steps:
The data set used for the K-means clustering example
focuses on a fictional car dealership. The dealership is starting a
promotional campaign for slow-selling units, in which it is
trying to direct resources to its valuable customers. Table VII
shows the sample dataset used for the analysis.
In Table VII, every row shows the purchase behaviour of
a customer. For example, a customer went to the dealership without
visiting a showroom, did some computer searching, was mostly
interested in the Toyota Harrier, and purchased it without
financing. This kind of understanding of customer behaviour
helps Toyota to manage its sales. K-means clustering allows
the company to perform this analysis with little manual effort by
finding patterns in a given dataset, as shown in Fig. 4 and Fig. 5.
Fig. 4 shows that in cluster 3, 100% of customers
went to the dealership, 45% also went to the showroom,
and 100% did computer searching.
The majority of these customers, 63%, showed interest
in the Fortuner, 45% showed interest in the Harrier, and
the least interest, 9%, was in the Corolla. All of these
customers (100%) ended up financing and purchasing a
product; they consistently went to the dealership and did computer
searching before buying an SUV.
Meanwhile, in cluster 4, only 32% of customers
went to the dealership, 100% went to the showroom,
and 24% also did computer searching. All of these
customers (100%) were interested in the Corolla, 32%
showed interest in the Fortuner, and the least interest,
3%, was in the Harrier. Of these customers, 56%
asked for financing details and 82% ended up purchasing
a product. These are customers looking for a small family
car, i.e., the Corolla, who mostly approach the showrooms.
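Percentages such as those reported for clusters 3 and 4 can be obtained directly from the binary attributes, because the share of customers with a 1 in a column is simply the column mean. The sketch below illustrates this with rows 3, 4, and 7 of Table VII; grouping these three rows together is an assumption made purely for illustration and is not the clustering reported in Fig. 4, and the function name `cluster_profile` is hypothetical.

```python
ATTRIBUTES = ["Dealership", "Showroom", "Computer Search",
              "Harrier", "Corolla", "Fortuner", "Financing"]

def cluster_profile(rows):
    """Return the percentage of rows that have a 1 in each attribute column."""
    n = len(rows)
    return {name: 100.0 * sum(r[i] for r in rows) / n
            for i, name in enumerate(ATTRIBUTES)}

# Rows 3, 4 and 7 of Table VII, grouped by hand for illustration only.
cluster = [
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0, 1, 1],
]

profile = cluster_profile(cluster)
for name, pct in profile.items():
    print(f"{name}: {pct:.0f}%")
```

For 0/1 data, this profile is the cluster centroid scaled by 100, which is why the centroids of a K-means run can be read as customer behaviour percentages.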
TABLE VII. SAMPLE DATASET
No. Dealership Showroom Computer Search Harrier Corolla Fortuner Financing
1 1 0 1 0 1 0 0
2 0 1 0 0 0 1 0
3 1 0 0 1 0 0 1
4 1 0 1 1 0 0 1
5 0 1 1 0 1 1 1
6 0 0 1 0 1 0 0
7 0 0 1 1 0 1 1
8 1 1 0 0 0 1 0
K-means Clustering Algorithm
1. Pick the number of clusters, k.
2. Randomly select k cluster centres.
3. Find the distance between each data point and each
cluster centre.
4. Assign each data point to the cluster centre that is
nearest to it among all the cluster centres.
5. Compute the new cluster centres as
$Y_i = \frac{1}{k_i} \sum_{j=1}^{k_i} D_j$,
where the sum runs over the data points $D_j$ assigned to the
$i$th cluster and $k_i$ represents the number of data points in
the $i$th cluster.
6. Once again, find the distance between each data
point and the new cluster centres.
7. If no data points were reassigned, stop;
otherwise go back to step 3.
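The steps above can be sketched in plain Python as follows. This is a minimal illustration and not the WEKA implementation used in the paper: the squared Euclidean distance, the seeded random choice of initial centres, the `max_iter` safeguard, and the choice of k = 2 are all assumptions. The data are the eight rows of Table VII without the "No." column.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: randomly select k data points as the initial cluster centres.
    centres = [tuple(p) for p in rng.sample(data, k)]
    assignment = None
    for _ in range(max_iter):
        # Steps 3-4: assign every data point to its nearest cluster centre.
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(p, centres[i]))
            for p in data
        ]
        # Step 7: stop when no data point was reassigned.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 5: move each centre to the mean of the points assigned to it.
        for i in range(k):
            members = [p for p, a in zip(data, assignment) if a == i]
            if members:  # an empty cluster keeps its previous centre
                centres[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return centres, assignment

# Table VII: Dealership, Showroom, Computer Search, Harrier, Corolla, Fortuner, Financing
data = [
    [1, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 0],
]

centres, assignment = k_means(data, k=2)
```

Because the result depends on the starting centres, a different `seed` can yield a different partition, which matches the "dependent on starting point/value" limitation listed for K-means in Table VI.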
Fig. 4. Customers Purchase Behaviours.
Fig. 5. Clustered Instances based on Customers Behaviour.
IV. CONCLUSION
This paper describes the different algorithms and
methodologies used to handle large and small sets of data. The
process of clustering is to group data based on their
characteristics and similarities. As the clustering models
described earlier show, many clustering techniques are used to
partition the data into a set of clusters. Algorithm selection should
depend on the properties and the nature of the data collection
because each algorithm has its pros and cons. This shows that
no single algorithm can manage all clustering challenges.
However, some algorithms provide an optimal
solution depending on how well they meet the challenges of the
problem. To achieve good performance in terms of time and space,
K-means would be the best choice for large and categorical
data. However, the time and memory complexity of clustering
algorithms still needs to be reduced by improving them. A
combined approach of the Genetic Algorithm with K-means
can resolve almost all the issues of K-means. The Genetic K-means
Algorithm (GKA) speeds up convergence to a global
optimum, and it is concluded that GKA is faster than evolutionary
algorithms.
V. FUTURE DIRECTIONS AND OPEN ISSUES
To date, data mining and knowledge discovery are
evolving into an essential technology for businesses and researchers
in numerous domains. Although data mining is
extremely powerful, it faces many difficulties during its
use. The problems can relate to performance,
data, and the methods and techniques used. The data
mining process becomes effective when the challenges or
issues are identified correctly and sorted out
appropriately.
Some of the challenges and future directions are as follows:
Efficiency and Scalability of Algorithms: Data
mining algorithms must be efficient and scalable to
extract information from huge amounts of data within
the database. So, as a future direction, a
parallel formulation of an Improved Rough k-means
algorithm should be developed to enhance the efficiency of the algorithm.
Privacy and Security: Data mining ordinarily
leads to serious issues in terms of data
security, privacy, and governance, for instance, when
a retailer reveals its clients' purchasing details without
their permission. So, as a future direction, a single cache
system and DES (Data
Encryption Standard) techniques need to be developed in any
clustering algorithm to improve the privacy and security of data
in the cloud.
Complex Data Types: Complex data elements, objects
with graphical data, temporal data, and spatial data
may be included in the database. Mining these types
of data is not practical on a single device.
Performance: The performance of a data mining
system depends on the efficiency of the algorithms
and techniques it uses. Algorithms and
techniques that are not up to the mark adversely
affect the performance of the data mining process.
Therefore, as a future direction, we need to introduce a new
hybrid approach that combines an Improved Rough k-means Algorithm
with the Genetic Algorithm, which is expected to improve performance
and handle complex data. The combination of Partitioning
Clustering and Hierarchical Clustering Algorithms is also expected to
increase the accuracy of data analysis.
REFERENCES
[1] S. Sharma, J. Agrawal, S. Agarwal, and S. Sharma, “Machine learning
techniques for data mining: A survey,” 2013 IEEE Int. Conf. Comput.
Intell. Comput. Res. IEEE ICCIC 2013, no. I, 2013.
[2] M. Z. Hossain, M. N. Akhtar, R. B. Ahmad, and M. Rahman, “A
dynamic K-means clustering for data mining,” Indones. J. Electr. Eng.
Comput. Sci., vol. 13, no. 2, pp. 521–526, 2019.
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques,
Second Edition. 2013.
[4] D. Patel, R. Modi, and K. Sarvakar, “A Comparative Study of Clustering
Data Mining: Techniques and Research Challenges,” Int. J. Latest
Technol. Eng. Manag. Appl. Sci., vol. 3, no. 9, pp. 67–70, 2014.
[5] P. Indirapriya and D. D. K. Ghosh, “A Survey on Different Clustering
Algorithms in Data Mining Technique,” Int. J. Mod. Eng. Res., vol. 3, no.
1, pp. 267–274, 2013.
[6] J. De Weerdt, S. Vanden Broucke, J. Vanthienen, and B. Baesens,
“Active trace clustering for improved process discovery,” IEEE Trans.
Knowl. Data Eng., vol. 25, no. 12, pp. 2708–2720, 2013.
[7] M. Verma, M. Srivastava, N. Chack, A. K. Diswar, and N. Gupta, “A
Comparative Study of Various Clustering Algorithms in Data Mining,”
Int. J. Eng. Res. Appl., vol. 2, no. 3, pp. 1379–1384,
2012.
[8] V. W. Ajin and L. D. Kumar, “Big data and clustering algorithms,” in
International Conference on Research Advances in Integrated Navigation
Systems, RAINS 2016, 2016.
[9] A. Novikov, “PyClustering: Data Mining Library,” J. Open Source
Softw., vol. 4, no. 36, p. 1230, 2019.
[10] D. Xu and Y. Tian, “A Comprehensive Survey of Clustering
Algorithms,” Ann. Data Sci., vol. 2, no. 2, pp. 165–193, 2015.
[11] B. Karthikeyan, D. J. George, G. Manikandan, and T. Thomas, “A
comparative study on k-means clustering and agglomerative hierarchical
clustering,” Int. J. Emerg. Trends Eng. Res., vol. 8, no. 5, pp. 1600–1604,
2020.
[12] P. K. Jain and R. Pamula, “Two-Step Anomaly Detection Approach.”
[13] A. Saxena et al., “A review of clustering techniques and developments,”
Neurocomputing, vol. 267, pp. 664–681, 2017.
[14] S. Rashid, A. Ahmed, I. Al Barazanchi, and Z. A. Jaaz, “Clustering
algorithms subjected to K-mean and gaussian mixture model on
multidimensional data set,” Period. Eng. Nat. Sci., vol. 7, no. 2, pp. 448–
457, 2019.
[15] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of
Document Clustering Techniques,” TextMining Work. KDD2000, pp.
75–78.
[16] M. V. Reddy, M. Vivekananda, and R. U. V. N. Satish, “Divisive
Hierarchical Clustering with K-means and Agglomerative Hierarchical
Clustering,” Int. J. Comput. Sci. Trends Technol., vol. 5, no. Sep-Oct, pp.
5–11, 2017.
[17] H. H. Ali and L. E. Kadhum, “K- Means Clustering Algorithm
Applications in Data Mining and Pattern Recognition,” Int. J. Sci. Res.,
vol. 6, no. 8, pp. 1577–1584, 2017.
[18] T. H. T. Nguyen, D. T. Dinh, S. Sriboonchitta, and V. N. Huynh, “A
method for k-means-like clustering of categorical data,” J. Ambient Intell.
Humaniz. Comput., no. Berkhin 2002, 2019.
[19] C. Yuan and H. Yang, “Research on K-Value Selection Method of K-
Means Clustering Algorithm,” J, vol. 2, no. 2, pp. 226–235, 2019.
[20] T. Zhang and F. Ma, “Improved rough k-means clustering algorithm
based on weighted distance measure with Gaussian function,” Int. J.
Comput. Math., vol. 94, no. 4, pp. 663–675, 2017.
[21] L. Rokach and O. Maimon, “Clustering methods,” Adv. Inf. Knowl. Process., no.
9781447167341, pp. 131–167, 2015.
[22] Z. He, X. Xu, S. Deng, and B. Dong, “K-Histograms: An Efficient
Clustering Algorithm for Categorical Dataset,” no. 1, pp. 6–8, 2003.
[23] A. Al Malki, M. M. Rizk, M. A. El-Shorbagy, and A. A. Mousa, “Hybrid
Genetic Algorithm with K-Means for Clustering Problems,” Open J.
Optim., vol. 05, no. 02, pp. 71–83, 2016.
[24] B. Zerhari, A. A. Lahcen, and S. Mouline, “Big Data Clustering:
Algorithms and Challenges,” Proc. Int. Conf., no. May, pp. 1–7, 2015.
[25] C. C. Aggarwal, Data classification: Algorithms and applications. 2014.
[26] M. Duan, K. Li, X. Liao, and K. Li, “A Parallel Multiclassification
Algorithm for Big Data Using an Extreme Learning Machine,” IEEE
Trans. Neural Networks Learn. Syst., vol. 29, no. 6, pp. 2337–2351,
2018.
[27] H. Azzag, G. Venturini, A. Oliver, and C. Guinot, “A hierarchical ant
based clustering algorithm and its use in three real-world applications,”
Eur. J. Oper. Res., vol. 179, no. 3, pp. 906–922, 2007.
[28] C. C. Aggarwal and P. S. Yu, “Outlier detection for high dimensional
data,” SIGMOD Rec. (ACM Spec. Interes. Gr. Manag. Data), 2001.