This document proposes using a b-colouring technique in graph theory to perform clustering analysis on the Pima Indian Diabetes database. B-colouring involves assigning colors (clusters) to graph vertices such that no adjacent vertices share a color, and each color class has a dominating vertex connected to all other color classes. This guarantees separation between clusters. The document applies b-colouring clustering to the Pima dataset and evaluates classification accuracy compared to other methods, achieving favorable results. Experimental analysis demonstrates the b-colouring approach and reviews cluster validity indices for determining the optimal partition number.
Semi-Supervised Discriminant Analysis Based On Data Structureiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Ensemble based Distributed K-Modes ClusteringIJERD Editor
Clustering has been recognized as the unsupervised classification of data items into groups. Due to the explosion in the number of autonomous data sources, there is an emergent need for effective approaches in distributed clustering. The distributed clustering algorithm is used to cluster the distributed datasets without gathering all the data in a single site. The K-Means is a popular clustering method owing to its simplicity and speed in clustering large datasets. But it fails to handle directly the datasets with categorical attributes which are generally occurred in real life datasets. Huang proposed the K-Modes clustering algorithm by introducing a new dissimilarity measure to cluster categorical data. This algorithm replaces means of clusters with a frequency based method which updates modes in the clustering process to minimize the cost function. Most of the distributed clustering algorithms found in the literature seek to cluster numerical data. In this paper, a novel Ensemble based Distributed K-Modes clustering algorithm is proposed, which is well suited to handle categorical data sets as well as to perform distributed clustering process in an asynchronous manner. The performance of the proposed algorithm is compared with the existing distributed K-Means clustering algorithms, and K-Modes based Centralized Clustering algorithm. The experiments are carried out for various datasets of UCI machine learning data repository.
Semi-Supervised Discriminant Analysis Based On Data Structureiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Ensemble based Distributed K-Modes ClusteringIJERD Editor
Clustering has been recognized as the unsupervised classification of data items into groups. Due to the explosion in the number of autonomous data sources, there is an emergent need for effective approaches in distributed clustering. The distributed clustering algorithm is used to cluster the distributed datasets without gathering all the data in a single site. The K-Means is a popular clustering method owing to its simplicity and speed in clustering large datasets. But it fails to handle directly the datasets with categorical attributes which are generally occurred in real life datasets. Huang proposed the K-Modes clustering algorithm by introducing a new dissimilarity measure to cluster categorical data. This algorithm replaces means of clusters with a frequency based method which updates modes in the clustering process to minimize the cost function. Most of the distributed clustering algorithms found in the literature seek to cluster numerical data. In this paper, a novel Ensemble based Distributed K-Modes clustering algorithm is proposed, which is well suited to handle categorical data sets as well as to perform distributed clustering process in an asynchronous manner. The performance of the proposed algorithm is compared with the existing distributed K-Means clustering algorithms, and K-Modes based Centralized Clustering algorithm. The experiments are carried out for various datasets of UCI machine learning data repository.
Clustering in Aggregated User Profiles across Multiple Social Networks IJECEIAES
A social network is indeed an abstraction of related groups interacting amongst themselves to develop relationships. However, toanalyze any relationships and psychology behind it, clustering plays a vital role. Clustering enhances the predictability and discoveryof like mindedness amongst users. This article’s goal exploits the technique of Ensemble Kmeans clusters to extract the entities and their corresponding interestsas per the skills and location by aggregating user profiles across the multiple online social networks. The proposed ensemble clustering utilizes known K-means algorithm to improve results for the aggregated user profiles across multiple social networks. The approach produces an ensemble similarity measure and provides 70% better results than taking a fixed value of K or guessing a value of K while not altering the clustering method. This paper states that good ensembles clusters can be spawned to envisage the discoverability of a user for a particular interest.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Novel Clustering Method for Similarity Measuring in Text DocumentsIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Review on Non Linear Dimensionality Reduction Techniques for Face Recognitionrahulmonikasharma
Principal component Analysis (PCA) has gained much attention among researchers to address the pboblem of high dimensional data sets.during last decade a non-linear variantof PCA has been used to reduce the dimensions on a non linear hyperplane.This paper reviews the various Non linear techniques ,applied on real and artificial data .It is observed that Non-Linear PCA outperform in the counterpart in most cases .However exceptions are noted.
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...IJECEIAES
Data analysis plays a prominent role in interpreting various phenomena. Data mining is the process to hypothesize useful knowledge from the extensive data. Based upon the classical statistical prototypes the data can be exploited beyond the storage and management of the data. Cluster analysis a primary investigation with little or no prior knowledge, consists of research and development across a wide variety of communities. Cluster ensembles are melange of individual solutions obtained from different clusterings to produce final quality clustering which is required in wider applications. The method arises in the perspective of increasing robustness, scalability and accuracy. This paper gives a brief overview of the generation methods and consensus functions included in cluster ensemble. The survey is to analyze the various techniques and cluster ensemble methods.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
Data mining is a process to extract information from a huge amount of data and transform it into an
understandable structure. Data mining provides the number of tasks to extract data from large databases such
as Classification, Clustering, Regression, Association rule mining. This paper provides the concept of
Classification. Classification is an important data mining technique based on machine learning which is used to
classify the each item on the bases of features of the item with respect to the predefined set of classes or groups.
This paper summarises various techniques that are implemented for the classification such as k-NN, Decision
Tree, Naïve Bayes, SVM, ANN and RF. The techniques are analyzed and compared on the basis of their
advantages and disadvantages
A Novel Approach to Mathematical Concepts in Data Miningijdmtaiir
-This paper describes three different fundamental
mathematical programming approaches that are relevant to
data mining. They are: Feature Selection, Clustering and
Robust Representation. This paper comprises of two clustering
algorithms such as K-mean algorithm and K-median
algorithms. Clustering is illustrated by the unsupervised
learning of patterns and clusters that may exist in a given
databases and useful tool for Knowledge Discovery in
Database (KDD). The results of k-median algorithm are used
to collecting the blood cancer patient from a medical database.
K-mean clustering is a data mining/machine learning algorithm
used to cluster observations into groups of related observations
without any prior knowledge of those relationships. The kmean algorithm is one of the simplest clustering techniques
and it is commonly used in medical imaging, biometrics and
related fields.
Assessment of Cluster Tree Analysis based on Data Linkagesjournal ijrtem
Abstract: Details linkage is a procedure which almost adjoins two or more places of data (surveyed or proprietary) from different companies to generate a value chest of information which can be used for further analysis. This allows for the real application of the details. One-to-Many data linkage affiliates an enterprise from the first data set with a number of related companies from the other data places. Before performs concentrate on accomplishing one-to-one data linkages. So formerly a two level clustering shrub known as One-Class Clustering Tree (OCCT) with designed in Jaccard Likeness evaluate was suggested in which each flyer contains team instead of only one categorized sequence. OCCT's strategy to use Jaccard's similarity co-efficient increases time complexness significantly. So we recommend to substitute jaccard's similarity coefficient with Jaro wrinket similarity evaluate to acquire the team similarity related because it requires purchase into consideration using positional indices to calculate relevance compared with Jaccard's. An assessment of our suggested idea suffices as approval of an enhanced one-to-many data linkage system.
Index Terms: Maximum-Weighted Bipartite Matching, Ant Colony Optimization, Graph Partitioning Technique
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSEditor IJCATR
This paper presents a hybrid data mining approach based on supervised learning and unsupervised learning to identify the closest data patterns in the data base. This technique enables to achieve the maximum accuracy rate with minimal complexity. The proposed algorithm is compared with traditional clustering and classification algorithm and it is also implemented with multidimensional datasets. The implementation results show better prediction accuracy and reliability.
Reduct generation for the incremental data using rough set theorycsandit
n today’s changing world huge amount of data is ge
nerated and transferred frequently.
Although the data is sometimes static but most comm
only it is dynamic and transactional. New
data that is being generated is getting constantly
added to the old/existing data. To discover the
knowledge from this incremental data, one approach
is to run the algorithm repeatedly for the
modified data sets which is time consuming. The pap
er proposes a dimension reduction
algorithm that can be applied in dynamic environmen
t for generation of reduced attribute set as
dynamic reduct.
The method analyzes the new dataset, when it become
s available, and modifies
the reduct accordingly to fit the entire dataset. T
he concepts of discernibility relation, attribute
dependency and attribute significance of Rough Set
Theory are integrated for the generation of
dynamic reduct set, which not only reduces the comp
lexity but also helps to achieve higher
accuracy of the decision system. The proposed metho
d has been applied on few benchmark
dataset collected from the UCI repository and a dyn
amic reduct is computed. Experimental
result shows the efficiency of the proposed method
Textural Feature Extraction of Natural Objects for Image ClassificationCSCJournals
The field of digital image processing has been growing in scope in the recent years. A digital image is represented as a two-dimensional array of pixels, where each pixel has the intensity and location information. Analysis of digital images involves extraction of meaningful information from them, based on certain requirements. Digital Image Analysis requires the extraction of features, transforms the data in the high-dimensional space to a space of fewer dimensions. Feature vectors are n-dimensional vectors of numerical features used to represent an object. We have used Haralick features to classify various images using different classification algorithms like Support Vector Machines (SVM), Logistic Classifier, Random Forests Multi Layer Perception and Naïve Bayes Classifier. Then we used cross validation to assess how well a classifier works for a generalized data set, as compared to the classifications obtained during training.
Clustering in Aggregated User Profiles across Multiple Social Networks IJECEIAES
A social network is indeed an abstraction of related groups interacting amongst themselves to develop relationships. However, toanalyze any relationships and psychology behind it, clustering plays a vital role. Clustering enhances the predictability and discoveryof like mindedness amongst users. This article’s goal exploits the technique of Ensemble Kmeans clusters to extract the entities and their corresponding interestsas per the skills and location by aggregating user profiles across the multiple online social networks. The proposed ensemble clustering utilizes known K-means algorithm to improve results for the aggregated user profiles across multiple social networks. The approach produces an ensemble similarity measure and provides 70% better results than taking a fixed value of K or guessing a value of K while not altering the clustering method. This paper states that good ensembles clusters can be spawned to envisage the discoverability of a user for a particular interest.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Novel Clustering Method for Similarity Measuring in Text DocumentsIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Review on Non Linear Dimensionality Reduction Techniques for Face Recognitionrahulmonikasharma
Principal component Analysis (PCA) has gained much attention among researchers to address the pboblem of high dimensional data sets.during last decade a non-linear variantof PCA has been used to reduce the dimensions on a non linear hyperplane.This paper reviews the various Non linear techniques ,applied on real and artificial data .It is observed that Non-Linear PCA outperform in the counterpart in most cases .However exceptions are noted.
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...IJECEIAES
Data analysis plays a prominent role in interpreting various phenomena. Data mining is the process to hypothesize useful knowledge from the extensive data. Based upon the classical statistical prototypes the data can be exploited beyond the storage and management of the data. Cluster analysis a primary investigation with little or no prior knowledge, consists of research and development across a wide variety of communities. Cluster ensembles are melange of individual solutions obtained from different clusterings to produce final quality clustering which is required in wider applications. The method arises in the perspective of increasing robustness, scalability and accuracy. This paper gives a brief overview of the generation methods and consensus functions included in cluster ensemble. The survey is to analyze the various techniques and cluster ensemble methods.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
Data mining is a process to extract information from a huge amount of data and transform it into an
understandable structure. Data mining provides the number of tasks to extract data from large databases such
as Classification, Clustering, Regression, Association rule mining. This paper provides the concept of
Classification. Classification is an important data mining technique based on machine learning which is used to
classify the each item on the bases of features of the item with respect to the predefined set of classes or groups.
This paper summarises various techniques that are implemented for the classification such as k-NN, Decision
Tree, Naïve Bayes, SVM, ANN and RF. The techniques are analyzed and compared on the basis of their
advantages and disadvantages
A Novel Approach to Mathematical Concepts in Data Miningijdmtaiir
-This paper describes three different fundamental
mathematical programming approaches that are relevant to
data mining. They are: Feature Selection, Clustering and
Robust Representation. This paper comprises of two clustering
algorithms such as K-mean algorithm and K-median
algorithms. Clustering is illustrated by the unsupervised
learning of patterns and clusters that may exist in a given
databases and useful tool for Knowledge Discovery in
Database (KDD). The results of k-median algorithm are used
to collecting the blood cancer patient from a medical database.
K-mean clustering is a data mining/machine learning algorithm
used to cluster observations into groups of related observations
without any prior knowledge of those relationships. The kmean algorithm is one of the simplest clustering techniques
and it is commonly used in medical imaging, biometrics and
related fields.
Assessment of Cluster Tree Analysis based on Data Linkagesjournal ijrtem
Abstract: Details linkage is a procedure which almost adjoins two or more places of data (surveyed or proprietary) from different companies to generate a value chest of information which can be used for further analysis. This allows for the real application of the details. One-to-Many data linkage affiliates an enterprise from the first data set with a number of related companies from the other data places. Before performs concentrate on accomplishing one-to-one data linkages. So formerly a two level clustering shrub known as One-Class Clustering Tree (OCCT) with designed in Jaccard Likeness evaluate was suggested in which each flyer contains team instead of only one categorized sequence. OCCT's strategy to use Jaccard's similarity co-efficient increases time complexness significantly. So we recommend to substitute jaccard's similarity coefficient with Jaro wrinket similarity evaluate to acquire the team similarity related because it requires purchase into consideration using positional indices to calculate relevance compared with Jaccard's. An assessment of our suggested idea suffices as approval of an enhanced one-to-many data linkage system.
Index Terms: Maximum-Weighted Bipartite Matching, Ant Colony Optimization, Graph Partitioning Technique
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSEditor IJCATR
This paper presents a hybrid data mining approach based on supervised learning and unsupervised learning to identify the closest data patterns in the data base. This technique enables to achieve the maximum accuracy rate with minimal complexity. The proposed algorithm is compared with traditional clustering and classification algorithm and it is also implemented with multidimensional datasets. The implementation results show better prediction accuracy and reliability.
Reduct generation for the incremental data using rough set theorycsandit
n today’s changing world huge amount of data is ge
nerated and transferred frequently.
Although the data is sometimes static but most comm
only it is dynamic and transactional. New
data that is being generated is getting constantly
added to the old/existing data. To discover the
knowledge from this incremental data, one approach
is to run the algorithm repeatedly for the
modified data sets which is time consuming. The pap
er proposes a dimension reduction
algorithm that can be applied in dynamic environmen
t for generation of reduced attribute set as
dynamic reduct.
The method analyzes the new dataset, when it become
s available, and modifies
the reduct accordingly to fit the entire dataset. T
he concepts of discernibility relation, attribute
dependency and attribute significance of Rough Set
Theory are integrated for the generation of
dynamic reduct set, which not only reduces the comp
lexity but also helps to achieve higher
accuracy of the decision system. The proposed metho
d has been applied on few benchmark
dataset collected from the UCI repository and a dyn
amic reduct is computed. Experimental
result shows the efficiency of the proposed method
Textural Feature Extraction of Natural Objects for Image ClassificationCSCJournals
The field of digital image processing has been growing in scope in the recent years. A digital image is represented as a two-dimensional array of pixels, where each pixel has the intensity and location information. Analysis of digital images involves extraction of meaningful information from them, based on certain requirements. Digital Image Analysis requires the extraction of features, transforms the data in the high-dimensional space to a space of fewer dimensions. Feature vectors are n-dimensional vectors of numerical features used to represent an object. We have used Haralick features to classify various images using different classification algorithms like Support Vector Machines (SVM), Logistic Classifier, Random Forests Multi Layer Perception and Naïve Bayes Classifier. Then we used cross validation to assess how well a classifier works for a generalized data set, as compared to the classifications obtained during training.
Data mining is utilized to manage huge measure of information which are put in the data ware houses and databases, to discover required information and data. Numerous data mining systems have been proposed, for example, association rules, decision trees, neural systems, clustering, and so on. It has turned into the purpose of consideration from numerous years. A re-known amongst the available data mining strategies is clustering of the dataset. It is the most effective data mining method. It groups the dataset in number of clusters based on certain guidelines that are predefined. It is dependable to discover the connection between the distinctive characteristics of data.
In k-mean clustering algorithm, the function is being selected on the basis of the relevancy of the function for predicting the data and also the Euclidian distance between the centroid of any cluster and the data objects outside the cluster is being computed for the clustering the data points. In this work, author enhanced the Euclidian distance formula to increase the cluster quality.
The problem of accuracy and redundancy of the dissimilar points in the clusters remains in the improved k-means for which new enhanced approach is been proposed which uses the similarity function for checking the similarity level of the point before including it to the cluster.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSijdkp
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high
dimensional data. Many significant subspace clustering algorithms exist, each having different
characteristics caused by the use of different techniques, assumptions, heuristics used etc. A comprehensive
classification scheme is essential which will consider all such characteristics to divide subspace clustering
approaches in various families. The algorithms belonging to same family will satisfy common
characteristics. Such a categorization will help future developers to better understand the quality criteria to
be used and similar algorithms to be used to compare results with their proposed clustering algorithms. In
this paper, we first proposed the concept of SCAF (Subspace Clustering Algorithms’ Family).
Characteristics of SCAF will be based on the classes such as cluster orientation, overlap of dimensions etc.
As an illustration, we further provided a comprehensive, systematic description and comparison of few
significant algorithms belonging to “Axis parallel, overlapping, density based” SCAF.
Introduction to Multi-Objective Clustering EnsembleIJSRD
Association rule mining is a popular and well researched method for discovering interesting relations between variables in large databases. In this paper we introduce the concept of Data mining, Association rule and Multilevel association rule with different algorithm, its advantage and concept of Fuzzy logic and Genetic Algorithm. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
An Efficient Clustering Method for Aggregation on Data FragmentsIJMER
Clustering is an important step in the process of data analysis with applications to numerous fields. Clustering ensembles, has emerged as a powerful technique for combining different clustering results to obtain a quality cluster. Existing clustering aggregation algorithms are applied directly to large number of data points. The algorithms are inefficient if the number of data points is large. This project defines an efficient approach for clustering aggregation based on data fragments. In fragment-based approach, a data fragment is any subset of the data. To increase the efficiency of the proposed approach, the clustering aggregation can be performed directly on data fragments under comparison measure and normalized mutual information measures for clustering aggregation, enhanced clustering aggregation algorithms are described. To show the minimal computational complexity. (Agglomerative, Furthest, and Local Search); nevertheless, which increases the accuracy.
Published a paper entitled ‘Visualization of Crisp and Rough Clustering using MATLAB’ in CIIT International Journal of Data Mining and Knowledge Engineering on 12th December 2012, Vol.4 , ISSN 0974 – 9683.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Research Inventy : International Journal of Engineering and Science is published by the group of young academic and industrial researchers with 12 Issues per year. It is an online as well as print version open access journal that provides rapid publication (monthly) of articles in all areas of the subject such as: civil, mechanical, chemical, electronic and computer engineering as well as production and information technology. The Journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers will be published by rapid process within 20 days after acceptance and peer review process takes only 7 days. All articles published in Research Inventy will be peer-reviewed.
2. 521
identifies each cluster by at least one dominant object which guarantees the
disparity between clusters. Data analysis motivates many computing applications,
either in a planning phase or as part of their on-line operations. Analysis of data is
a process of examining, cleaning, transforming, and modeling data with the goal
of emphasizing useful information, providing conclusion, and giving correct
decision making. The paper is organized as follows: In Section II, Clustering and
its importance was discussed. In Section III, diabetes and its classification were
discussed. In Section IV, Algorithms for Classifying the Pima Indian diabetic
dataset was analyzed. Applications of b-colouring technique in clustering were
discussed in Section V. Section VI deals with experimental analysis. Finally the
conclusion was given in Section VII.
2 Importance of Clustering
2.1 What is clustering?
Clustering is one of the most important research areas in the field of data mining.
Clustering is a data mining technique used to creating group of objects based on
their features in such a way that the objects belonging to the same groups are
similar and those belonging to the different groups are dissimilar. The main
advantage of clustering is that interesting patterns and structures can be found
directly from very large data sets without the background knowledge. Clustering
can be considered the most important unsupervised learning problem; so, as every
other problem of this kind, it deals with finding a structure in a collection of
unlabeled data.
2.2 Applications of Graph colouring in clustering
Graph colouring techniques are highly utilized in computer science applications,
especially in research areas of computer science such as data mining, image
segmentation, clustering, image capturing, networking etc., For example a data
structure can be designed in the form of tree which in turn utilized vertices and
edges. Similarly modeling of network topologies can be done using graph
concepts. In the same way the most important concept of graph colouring is
utilized in resource allocation, scheduling. Also, paths, walks and circuits in graph
theory are used in tremendous applications say traveling salesman problem,
database design concepts, resource networking. This leads to the development of
new algorithms and new theorems that can be used in tremendous application
especially in clustering. Clustering gene expression data using a graph-theoretic
approach: an application of minimum spanning trees It is a new framework for
representing a set of multi-dimensional gene expression data as a Minimum
Spanning Tree (MST), a concept from the graph theory. A key property of this
3. 522 D. Vijayalakshmi, Dr. K. Thilagavathi
representation is that each cluster of the expression data corresponds to one sub
tree of the MST, which rigorously converts a multi-dimensional clustering
problem to a tree partitioning problem. Graph theoretic approach to image
segmentation. The data to be clustered or segmented by an undirected graph G
with are capacities assigned to the similarity between the linked vertices.
Clustering is achieved by removing the arcs of G to form mutually exclusive sub
graphs such that the largest inter-sub graph maximum flow is minimized.
Clustering Technique with an Example Cluster analysis is the collection of a
patterns into clusters based on similarity. Intuitively, patterns within a valid
cluster are more similar to each other than they are to a pattern belonging to a
different cluster. An example of clustering is depicted in Figure 1. The input
patterns are shown in Figure 1(a), and the desired clusters are shown in Figure
1(b). Here, points belonging to the same cluster are given the same label. The
variety of techniques for representing data, measuring similarity between data
elements, and grouping data elements has produced a rich and often confusing
assortment of clustering methods.
Figure 1: Data clustering
It is important to understand the difference between clustering (unsupervised
classification) and discriminant analysis (supervised classification). In supervised
classification, we are provided with a collection of labeled (pre classified)
patterns; the problem is to label a newly encountered, yet unlabeled, pattern.
Typically, the given labeled (training) patterns are used to learn the descriptions
of classes which in turn are used to label a new pattern. In the case of clustering,
the problem is to group a given collection of unlabeled patterns into meaningful
clusters. In a sense, labels are associated with clusters also, but these category
labels are data driven; that is, they are obtained solely from the data. Clustering is
useful in several exploratory pattern-analysis, grouping, decision- making, and
machine-learning situations; including data mining, document retrieval, image
4. 523
segmentation, and pattern classification. However, in many such problems, there
is little prior information (e.g., statistical models) available about the data, and the
decision-maker must make as few assumptions about the data as possible. It is
under these restrictions that clustering methodology is particularly appropriate for
the exploration of interrelationships among the data points to make an assessment
(perhaps preliminary) of their structure. Limitations of Clustering algorithms most
clustering algorithms are generally based on two popular clustering techniques,
namely partitioning and hierarchical clustering [2]. The hierarchical clustering
algorithms build a cluster hierarchy or, in other words, a tree of clusters,
(dendrogram) whose leaves are the data points and whose internal nodes represent
nested clusters of various sizes [3].Hierarchical clustering methods can be further
subdivided into agglomerative and divisive. Given n objects to be clustered, the
agglomerative (bottom-up) methods begin with n clusters (i.e., all objects are
apart). In each step, the two most similar clusters are identified and merged. This
process continues until all objects are clustered into one group. On the other hand,
the divisive (top-down) methods start from one cluster containing the n objects
and split it until n clusters are obtained. Once two objects have been merged by an
agglomerative algorithm, they will always be in the same cluster; if a divisive
method separates two objects, they will never belong to the same cluster. Among
these hierarchical methods, some authors have proposed to use graph colouring
techniques for the classification purpose. In, the authors propose a divisive
classification method based on dissimilarity tables, where the iterative algorithm
consists, at each step, in finding a partition by subdividing the class with the
largest diameter2 into two classes in order to exhibit a new partition with minimal
diameter. The subdivision is obtained by a 2-colouring of the vertices of the
maximum spanning tree built from the dissimilarity table. The derived
classification structure is a hierarchy. In contrast, given the number k of partitions
to be found, partitioning clustering algorithms try to find the best k partitions of
the n objects. The partitioning algorithm typically starts with an initial partition of
a data set, and then uses an iterative control strategy to optimize an objective
function. It is very often the case that the k clusters found by a partitioning method
are of higher quality (i.e., more similar) than the k clusters produced by a
hierarchical method [4]. Owing to this property, developing partitioning methods
has been one of the main focuses of cluster analysis research. Indeed, many
partitioning methods have been developed, some based on k-means [5], some on
k-medoid [4], etc. The most popular partitioning algorithm used in scientific and
industrial applications is k-means. In k-means, each cluster is represented by its
centroid, which is the average of the data vectors in a cluster. The k-means tries to
minimize E, which is the average dissimilarity from any object in the data set to
the centroid of its cluster. Although k-means often finds the local optimum, it
works well when objects within each cluster are quite close to each other while
the objects from different clusters are not. It also displays some limitations on
attribute types (it does not work well with symbolic attributes) and it is very
5. 524 D. Vijayalakshmi, Dr. K. Thilagavathi
sensitive to (i) the selection of the initial partition and (ii) to the presence of
outliers.
3 Diabetes and its classification
World Health Organization (WHO) report had shown a marked increase in the
number of diabetics and this trend is expected to grow in the next couple of
decades. In the International Diabetes Federation Conference 2003 held in Paris,
India was labeled, as "Diabetes Capital of the World," as of about 190 million
diabetics worldwide, more than 33 million is Indians. The worldwide figure is
expected to rise to 330 million, 52 million of them Indians by 2025, largely due to
population growth, ageing, urbanization, unhealthy eating habits and a sedentary
lifestyle. Diabetes mellitus is a disease in which the body is unable to produce or
unable to properly use and store glucose (a form of sugar). Glucose backs up in
the bloodstream causing one’s blood glucose or "sugar" to rise too high. There are
two major types of diabetes. In type 1 (also called juvenile-onset or insulin-
dependent) diabetes, the body completely stops producing any insulin, a hormone
that enables the body to use glucose found in foods for energy. People with type 1
diabetes must take daily insulin injections to survive. This form of diabetes
usually develops in children or young adults, but can occur at any age. Type 2
(also called adult-onset or non-insulin-dependent) diabetes results when the body
doesn’t produce enough insulin and/or is unable to use insulin properly (insulin
resistance). This form of diabetes usually occurs in people who are over 40,
overweight, and have a family history of diabetes, although today it is
increasingly occurring in younger people, particularly adolescents. Type II
Diabetes (not depending on insulin) is the most common form of diabetes (90 to
95 per cent) and occurs primarily in adults but is now also affecting children and
young adults. Type I Diabetes (insulin-dependent) affects predominately children
and youth, and is the less common form of diabetes (5 to 10 percent). The major
risk factors for diabetes include obesity, high cholesterol, high blood pressure and
physical inactivity. The risk of developing diabetes also increases, as people grow
older. People who develop diabetes while pregnant (a condition called gestational
diabetes) are more likely to develop full-blown diabetes later in life. Poorly
managed diabetes can lead to a host of long-term complications among these are
heart attacks, strokes, blindness, kidney failure, blood vessel disease [6] [7].
4 Clustering Algorithms for Classifying Pima Indian
Diabetic Dataset
A lot of research work has been done on various medical data sets including Pima
Indian diabetes dataset. Classification accuracy achieved for Pima Indian diabetes
6. 525
dataset using 22 different classifiers is given in [8] and using 43 different
classifiers is given in [9]. The performance of proposed cascaded model (k-
means+KNN) is compared with [8] and [9]. The accuracy of most of these
classifiers is in the range of 66. 6% to 77.7%. Hybrid K-means and Decision tree
[10] achieved the classification accuracy of 92.38% using 10 fold cross
validations, cascaded learning system based on Generalized Discriminate analysis
(GDA) and Least Square Support Vector Machine (LS_SVM), showed accuracy
of 82.05% for diagnosis of Pima dataset [11]. Further authors have achieved
classification accuracy of % 72.88 using ANN, 78.21% using DT_ANN where
decision tree C4.5 is used to identify relevant features and given as input to ANN
[12], 79.50% using Cascaded GA_CFS_ANN, relevant feature identified by
Genetic algorithm with Correlation based feature selection is given as input to
ANN [13], 77.71% using GA optimized ANN, 84.10% using GA optimized ANN
with relevant features identified by decision tree and 84.71% with GA optimized
ANN with relevant features identified by GA_CFS[14 ].
5 Applications of b-colouring Technique in Clustering
5.1 Why Graph b- colouring based clustering?
The graph b colouring based clustering reduced the partitioning problem of a data
set into p classes with minimal diameter, to the minimal colouring problem of a
superior threshold graph. The edges of this graph are the pairs of vertices
distanced from more than a given threshold. In such a graph, each colour
corresponds to one class and the number of colours is minimal. Therefore, this
method tends to build a partition of the data set with effectively compact clusters
but which does not give any importance to the cluster-separation. Graph
colouring is used to characterize some properties of graphs. A b-colouring of a
graph G (using colours 1,2,...,k) is a colouring of the vertices of G such that (i)
two neighbors have different colours (proper colouring) and (ii) for each colour
class there exists a dominating vertex which is adjacent to all other k-1 colour
classes. In this work, based on a b-colouring of a graph, we propose a clustering
technique. Additionally, we provide a cluster validation algorithm. This algorithm
aims at finding the optimal number of clusters by evaluating the property of
colour dominating vertex.
5.2 How b-colouring is applied in clustering?
The graph b-colouring is an interesting technique that can be applied to various
domains. The proper b-colouring problem is the assignment of colours (classes) to
the vertices of one graph so that no two adjacent vertices have the same colour,
and for each colour class there exists at least one dominating vertex which is
7. 526 D. Vijayalakshmi, Dr. K. Thilagavathi
adjacent (dissimilar) to all other colour classes. This paper presents a new graph
b-colouring framework for clustering heterogeneous objects into groups. A
number of cluster validity indices are also reviewed. Such indices can be used for
automatically determining the optimal partition. The concept b-colouring is
framework for clustering heterogeneous objects into groups. A number of cluster
validity indices are also reviewed. Such indices can be used for automatically
determining the optimal partition. The proposed approach has interesting
properties and gives good results on benchmark data set as well as on real medical
data set.
6 Experimental Analysis on Pima Indian Diabetic
Dataset
6.1 Pima Indian diabetic dataset
The Pima Indian diabetes database, donated by Vincent Sigillito, is a collection of
medical diagnostic reports of 768 examples from a population living near Phoenix,
Arizona, USA. The paper dealing with this data base [15] uses an adaptive
learning routine that generates and executes digital analogs of perceptron-like
devices, called ADAP. They used 576 training instances and obtained a
classification of 76% on the remaining 192 instances. The samples consist of
examples with 8 attribute values and one of the two possible outcomes, namely
whether the patient is tested positive for diabetes (indicated by output one) or not
(indicated by two). The database now available in the repository has 512
examples in the training set and 256 examples in the test set. The attribute vectors
of these examples are in Table1.
6.2 Clustering using b-colouring technique
The clustering based on the b-colouring of a graph is used for clustering Pima
Indian diabetic dataset. Consider the data to be clustered as an undirected edge-
weighted graph with no self-loops G = (V, E), where V = {v1,...,vn} is the vertex
set and E = V ×V is the edge set. Vertices in G correspond to objects, edges
represent neighborhood relationships, and edge-weights reflect dissimilarity
between pairs of linked vertices. The graph G is traditionally represented with the
corresponding weighted dissimilarity matrix, which is the n × n symmetric matrix
D={di,j| vi,vj∈ V}. A common informal definition states that “a cluster is a set of
entities which are similar, and entities from different clusters are not similar”.
Hence, a cluster should satisfy two fundamental conditions: (1) it should have
high internal homogeneity; (2) there should be high heterogeneity between entities
within one cluster and those from other clusters.
8. 527
These two conditions amount to saying that the weights on the edges within a
cluster should be small, and those on the edges connecting the cluster nodes to the
external ones should be large.
Table 1: Attributes
Attribute Type
Number of times pregnant continuous
Plasma glucose concentration continuous
Diastolic blood pressure (mm Hg) continuous
Triceps skin fold thickness (mm) continuous
2-Hour serum insulin (mu U/ml) continuous
Body mass index [weight in kg/(height)] continuous
Diabetes pedigree function continuous
Age (years) continuous
Class variable Binary
Table 2: A part of Dissimilarity table in Pima Indian diabetes dataset
Vi 1 2 3 4 5 6 7 8 9
1 0
2 0.20 0
3 0.10 0.30 0
4 0.10 0.20 0.25 0
5 0.20 0.20 0.15 0.40 0
6 0.20 0.20 0.20 0.25 0.65 0
7 0.15 0.10 0.15 0.10 0.10 0.75 0
8 0.10 0.20 0.10 0.10 0.05 0.05 0.05 0
9 0.40 0.075 0.15 0.15 0.15 0.15 0.15 0.15 0
Table 3: Accuracy when k=4
Classifier accuracy %
KNN with all samples , k=4 74.4826
K means Cluster all samples, k=4 84.76
A graph b-colouring approach all samples, 93.676
k=4
9. 528 D. Vijayalakshmi, Dr. K. Thilagavathi
Figure 2: A threshold graph with θ = 0.15 for data in Table 2
Figure 3: Initializing colours of vertices with maximal colours in Figure 2
Figure 4: A b colouring graph constructed from Fig 3 by removing colours
without any dominating vertex
10. 529
The dataset is divided into training data and test data using 60-40 ratio.
Experiments were carried out for different values of k ranging from 1 to 15. Table
3 shows the improvement in accuracy Diabetic data set using proposed method
with k = 4.
Figure 5: Accuracy Comparison
7 Conclusion
In this work, a clustering algorithm based on a graph b-colouring technique was
used to cluster Pima Indian diabetic dataset. We have implemented, performed
experiments, and compared our approach with KNN Classification and K-means
clustering. The results show that the clustering based on graph colouring
outperformance than the other clustering approach in terms of accuracy and purity.
The proposed technique offers a real representation of clusters by dominant
objects that guarantees the inter cluster disparity in a partitioning and used to
evaluate the quality of cluster.
References
[1] B. Effantin , Hamamache Kheddouci, “A Distributed algorithm for b-
colouring of a graphs”, lecture Notes in Computer science, Vol 3301, (2006),
pp 430-438.
[2] Jain, A.K., M.N. Murty, and P.J. Flynn, “Data Clustering A Review”, ACM
Computing Surveys, Vol. 31, (1999), 264-323.
11. 530 D. Vijayalakshmi, Dr. K. Thilagavathi
[3] Guha, S., R. Rastogi, and K. Shim, “CURE: An efficient clustering algorithm
for large databases”, Proceedings of the ACM SIGMOD Conference, Seattle,
WA, (1998), 73-84.
[4] NG, R. and J. Han,”CLARANS: a method for clustering objects for spatial
data mining”, IEEE Transactions on Knowledge and Data Engineering,
14(5), (2002), 1003-1016.
[5] Hartigan, J. and M. Wong, “Algorithm AS136: A k-means clustering
algorithm”, Journal of Applied Statistics, Vol. 28, (1979), 100-108.
[6] Editorial, “Diagnosis and Classification of Diabetes Mellitus, American
Diabetes Association, Diabetes Care”, Vol 27, Supplement 1, (Jan 2004).
[7] The Expert Committee on the Diagnosis and Classification of Diabetes
Mellitus, “Follow up report on the Diagnosis of Diabetes Mellitus. Diabetic
Care” 26,pp.3160- 3167, (2003).
[8] Michie, D., Spiegelhalter, D. J., & Taylor, C. C., “Machine learning, neural
and statistical classification”. 1994.
[9] Humar, K., & Novruz, A, “Design of a hybrid system for the diabetes and
heart diseases” Expert Systems with Applications, 2008, 35, 82–89.
[10] B.M Patil, R.C Joshi, Durga Tosniwal, “Hybrid Prediction model for Type-2
Diabetic Patients”, Expert System with Applications, 37, 2010, 8102-8108.
[11] Polat, K., Gunes, S., & Aslan, A., “A cascade learning system for
classification of diabetes disease: Generalized discriminant analysis and least
square support vector machine” Expert Systems with Applications,
2008,34(1), 214–221.
[12] Asha Gowda Karegowda ,MA.Jayaram , “Integrating Decision Tree and
ANN for Categorization of Diabetics Data”, International Conference on
Computer Aided Engineering, December 13-15, 2007, IIT Madras, Chennai,
India.
[13] Asha Gowda Karegowda and M.A. Jayaram, “Cascading GA & CFS for
Feature Subset Selection in Medical Data Mining” , International Conference
on IEEE International Advance Computing Conference (IACC’09) on March
6-7, 2009, Thapar University, Patiala, Punjab India.
[14] Asha Gowda Karegowda , A.S. Manjunath , M.A. Jayaram, “Application Of
Genetic Algorithm Optimized Neural Network Connection Weights For
Medical Diagnosis Of Pima Indians Diabetes”, International Journal on Soft
Computing ( IJSC ), Vol.2, No.2, May 2011.
[15] Smith, J.,W., Everhart, J.,E., Dickson, W.,C., Knowler, W.,C. and Johannes,
R.,S., “Using the ADAP learning algorithm to forecast the onset of diabetes
mellitus, in Proceedings of the Symposium on Computer Applications and
Medical Care”, IEEE Computer Society Press, 261-265, 1988.