IOSR Journal of Engineering (IOSRJEN) www.iosrjen.org
ISSN (e): 2250-3021, ISSN (p): 2278-8719
Vol. 04, Issue 04 (April 2014), ||V6|| pp. 34-37
Different Techniques and Designs for Clustering of Data, Especially Categorical Data: An Overview
Kusha Bhatt 1, Pankaj Dalal 2
1 (M.Tech Scholar, Software Engineering, Shrinathji Institute of Technology & Engg., Nathdwara)
2 (Associate Professor, Dept. of CSE, Shrinathji Institute of Technology & Engg., Nathdwara)
Abstract: - Mining of knowledge from a pool of data is known as data mining. One potent field of research in data mining is clustering. Earlier work on clustering concentrated mainly on quantitative and ordinal data, that is, on numeric data that have an inherent order. These algorithms cannot be applied directly to categorical data, because categorical data items do not have any inherent order. It therefore becomes necessary to have algorithms that solve the problem of categorical data clustering. In this paper, an overview of various categorical data clustering techniques is given.
Keywords: - Clustering, categorical data, K-means, K-modes
I. INTRODUCTION
Clustering is the unsupervised learning task that aims at partitioning a data set into groups of similar items. The goal is to create clusters of data objects such that within-cluster (intra-cluster) similarity is maximized and between-cluster (inter-cluster) similarity is minimized. One of the stages in a clustering task is selecting a clustering strategy: at this stage a particular clustering algorithm is chosen that is suitable for the data and the desired type of clustering. Selecting a clustering algorithm is not an easy task and requires the consideration of several issues, such as the data types, the data set size and dimensionality, the noise level of the data, the type or shape of the expected clusters, and the overall expected clustering quality. In the past few decades, many clustering algorithms have been proposed that employ a wide range of techniques such as iterative optimization, probability distribution functions, density-based concepts, information entropy, and spectral analysis.
A. An Overview of Clustering
Here we provide a brief exploration of the elements that must be taken into consideration when designing clustering algorithms: data types, proximity measures, and objective functions. Each element is described below.
Data Types
The value of an attribute can be classified as follows:
• Quantitative. These attributes contain continuous numerical quantities where a natural order exists between items of the same data type and an arithmetically based distance measure can be defined.
• Qualitative. These attributes contain discrete data whose domain is finite. We refer to the items in the domain of each attribute as categories. Qualitative data are further subdivided as:
• Nominal. Data items belonging to this group do not have any inherent order or proximity. We refer to these data as categorical data.
• Ordinal. These are ordered discrete items that do not have a distance relation.
• Binary. These attributes can take on only two values, 1 or 0. Depending on the context, binary attributes can be qualitative or quantitative.
Proximity Measures
Proximity measures are metrics that define the similarity or dissimilarity between data objects for the
purpose of determining how close or related the data objects are. To define proximity measure there are various
approaches. Depending upon the data types and application area these approaches varies. For most algorithms, these
proximity measures are used to construct a proximity matrix that reflects the distance or similarity between the data
objects. These matrices are used as input for a clustering algorithm that clusters the data according to a partitioning
criterion or an objective function.
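To make this step concrete, the following minimal Python sketch (with made-up categorical records and hypothetical function names, not taken from any of the cited works) builds such a proximity matrix using the simple matching dissimilarity, i.e., the fraction of attributes on which two objects disagree:

# Minimal sketch: proximity (dissimilarity) matrix for categorical data.
# The records and attribute values below are hypothetical examples.

def simple_matching_dissimilarity(x, y):
    """Fraction of attributes on which two categorical objects differ."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)

def proximity_matrix(objects):
    """Symmetric matrix of pairwise dissimilarities."""
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = simple_matching_dissimilarity(objects[i], objects[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix

# Example categorical records: (colour, size, shape)
data = [("red", "small", "round"),
        ("red", "large", "round"),
        ("blue", "small", "square")]
for row in proximity_matrix(data):
    print(["%.2f" % d for d in row])

The resulting matrix can then be supplied as input to any proximity-based clustering algorithm.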
B. Clustering Stages
Clustering is not a one-step process. Following one of the seminal texts on cluster analysis [2], the clustering process can be divided into the following stages:
• Data Collection
• Initial Screening
• Representation
• Clustering Tendency
• Clustering Strategy
• Validation
• Interpretation
II. TYPES OF CLUSTERING
Traditionally, clustering is divided into two basic types:
• Hierarchical clustering methods
• Partitioning methods
Hierarchical clustering algorithms first assign each object to its own cluster and then repeatedly merge pairs of clusters until the whole tree is formed [5]. Examples of such algorithms are BIRCH [4] and CURE [6]. BIRCH [4] is a data clustering method that was among the first to handle noise (data points that are not part of an underlying pattern); it addresses the problems of large databases and the minimization of I/O, and it handles metric attributes. CURE [6] identifies clusters that are non-spherical in shape and have a wide variance in size; it employs random sampling and partitioning, which allows it to handle large data sets. Partitional clustering decomposes the given data set into disjoint clusters; the most common example of a partitional clustering algorithm is K-means. All the above algorithms are mainly used with numeric data; a small sketch contrasting the two types is given below.
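The short Python sketch below contrasts the two types on a few made-up numeric points. It uses the standard scipy and scikit-learn implementations of agglomerative clustering and K-means, not BIRCH [4] or CURE [6] themselves.

# Sketch contrasting hierarchical and partitional clustering on toy numeric data.
# Requires scipy and scikit-learn; the data points are made up for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one dense group
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])  # another dense group

# Hierarchical: repeatedly merge the two closest clusters, building a tree,
# then cut the tree to obtain the desired number of clusters.
tree = linkage(X, method="average")
hier_labels = fcluster(tree, t=2, criterion="maxclust")

# Partitional: directly decompose the data set into k disjoint clusters.
part_labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)

print("hierarchical:", hier_labels)
print("partitional: ", part_labels)

On data this well separated, cutting the hierarchical tree at two clusters and running K-means with k = 2 should recover the same two groups.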
III. REVIEW OF CATEGORICAL DATA CLUSTERING ALGORITHMS
A. Algorithms used earlier
Some of the algorithms used for clustering categorical data are ROCK [6], K-modes, STIRR [7], CACTUS [8], COBWEB [9], and EM [10]. ROCK [6], a robust clustering algorithm for categorical attributes, proposes a novel concept of links to measure the similarity/proximity between a pair of data points; it employs links rather than distances when merging clusters. STIRR [7] is based on an iterative method for assigning and propagating weights on the categorical values in a table; this facilitates a type of similarity measure arising from the co-occurrence of values in the data set. CACTUS [8] introduces a formal definition of a cluster for categorical attributes by generalizing the cluster definition for numerical attributes; it clusters categorical attributes whose values do not have any natural ordering, and no post-processing of the final output by the user is required to define the clusters. COBWEB [9] is an incremental clustering algorithm based on probabilistic categorization trees; the search for a good clustering is guided by a quality measure for partitions of the data, and the algorithm reads unclassified examples given in attribute-value representation. The EM algorithm [10] is an iterative clustering technique: it starts with an initial clustering model for the data and iteratively refines the model to fit the data better, terminating at a (locally) optimal solution after an indeterminate number of iterations. The K-modes algorithm [11] extends K-means to the categorical domain. The K-modes method makes three major modifications to K-means: it uses a different dissimilarity measure, replaces means with modes, and uses a frequency-based method to update the modes. This algorithm defines clusters by how close objects are to the cluster centroid; the distance between cluster centroids is, however, a poor measure of the similarity between the clusters themselves.
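The following minimal Python sketch illustrates the K-modes idea described above. It is a simplified illustration with made-up data, not the implementation from [11]: objects are assigned to the nearest mode under the simple matching dissimilarity, and each mode is updated to the most frequent category of every attribute among its members.

# Simplified K-modes sketch: simple matching dissimilarity, mode-based centers,
# frequency-based mode updates. Data and parameters are illustrative only.
from collections import Counter

def matching_dissimilarity(x, mode):
    return sum(1 for a, b in zip(x, mode) if a != b)

def compute_mode(cluster):
    """Most frequent category per attribute among the cluster's objects."""
    return tuple(Counter(values).most_common(1)[0][0]
                 for values in zip(*cluster))

def k_modes(objects, k, max_iter=20):
    modes = [objects[i] for i in range(k)]          # naive initialization
    labels = [0] * len(objects)
    for _ in range(max_iter):
        # Assignment step: each object goes to the nearest mode.
        new_labels = [min(range(k),
                          key=lambda c: matching_dissimilarity(x, modes[c]))
                      for x in objects]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute each cluster's mode from its members.
        for c in range(k):
            members = [x for x, l in zip(objects, labels) if l == c]
            if members:
                modes[c] = compute_mode(members)
    return labels, modes

data = [("red", "small"), ("red", "medium"), ("blue", "large"), ("blue", "large")]
print(k_modes(data, k=2))

The naive initialization (the first k objects) is exactly the weakness that the initialization methods reviewed in the next subsection try to address.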
B. Recently used algorithms
Tiejun et al. proposed a clustering algorithm for data with mixed attributes that aims to improve speed and efficiency by restricting the search direction of the particle swarm in the gradient space. Compared with traditional algorithms for mixed-attribute data, such as K-prototypes and K-modes, it is an unsupervised self-learning process; furthermore, it does not require parameters to be pre-set according to prior experience, which makes its results more meaningful [12]. Tripathy et al. proposed an algorithm called SSDR, which is more efficient than most of the earlier algorithms, including MMR, MMeR, and SDR, which are
recent algorithms developed in this direction. SSDR handles uncertain data using rough set theory; it takes both numerical and categorical data simultaneously while also accounting for uncertainty, and it gives better performance when tested on well-known data sets [13]. Reddy et al. presented a clustering algorithm based on a similarity-weight and filter-method paradigm that works well for data with mixed numeric and categorical features. They propose a modified description of the cluster center to overcome the numeric-data-only limitation and to provide a better characterization of clusters [14]. Khan et al. proposed an algorithm to compute fixed initial cluster centers for the K-modes clustering algorithm; it exploits a multiple-clustering approach that determines cluster structures from the values of the individual attributes in the data. The algorithm is based on the experimental observations that some data objects do not change cluster membership irrespective of the choice of initial cluster centers, and that individual attributes may provide some information about the cluster structures [15]. Chen et al. presented a method for performing multidimensional clustering on categorical data and showed its superiority over unidimensional clustering. Complex data sets such as the ICAC data can usually be meaningfully partitioned in multiple ways, each based on a subset of the attributes; unidimensional clustering is unable to uncover such partitions because it seeks a single partition that is jointly defined by all the attributes. They proposed a method for multidimensional clustering, namely latent tree analysis [16]. Cao et al. presented a weighting k-modes algorithm for subspace clustering of categorical data and analysed its time complexity. In this algorithm, an additional step is added to the k-modes clustering process to automatically compute the weight of each dimension in each cluster using complement entropy; the attribute weights can then be used to identify the subsets of important dimensions that characterize the different clusters [17]. Chen et al. proposed a novel kernel-density-based definition of cluster centers using a Bayes-type probability estimator. A new algorithm called k-centers is then proposed for central clustering of categorical data, incorporating a new feature-weighting scheme by which each attribute is automatically assigned a weight measuring its individual contribution to the clusters [18].
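As a rough illustration of the attribute-weighting idea shared by these subspace and feature-weighting methods, the Python sketch below weights each attribute within a cluster by how concentrated (low-entropy) its category distribution is. This is a generic entropy-based scheme for illustration only; it does not reproduce the complement-entropy weights of [17] or the kernel-density scheme of [18].

# Generic sketch: weight each attribute within a cluster by how concentrated
# (low-entropy) its category distribution is. Illustrative only; not the
# complement-entropy weighting of [17] or the scheme of [18].
import math
from collections import Counter

def attribute_weights(cluster_objects):
    """One weight per attribute; attributes whose values are nearly constant
    inside the cluster (low entropy) receive higher weight."""
    weights = []
    for values in zip(*cluster_objects):
        counts = Counter(values)
        n = len(values)
        entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
        weights.append(1.0 / (1.0 + entropy))   # monotone decreasing in entropy
    total = sum(weights)
    return [w / total for w in weights]          # normalise to sum to 1

cluster = [("red", "small", "metal"),
           ("red", "large", "metal"),
           ("red", "small", "plastic")]
print(attribute_weights(cluster))   # the constant "red" attribute gets most weight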
IV. CONCLUSION
In data mining, clustering is a dynamic field of research. As data sets grow in size, it is desirable to have algorithms that are able to discover highly correlated regions of objects. In this paper, we have discussed various clustering algorithms for categorical data. The algorithms mentioned in this paper have their own advantages and disadvantages. Having given a brief overview of the available techniques for clustering categorical data, we can now try to develop a technique that emphasizes cluster center initialization for better clustering results; this may be taken as the future scope of this paper.
REFERENCES
[1] Jain, A. K. and Dubes, R. C., "Algorithms for Clustering Data", Prentice-Hall, 1988.
[2] Anil K. Jain and Richard C. Dubes, "Algorithms for Clustering Data", Prentice-Hall, 1988.
[3] Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, and Panos Vassiliadis, "Fundamentals of Data Warehouses", Springer, 1999.
[4] Zhang, T., Ramakrishnan, R. and Livny, M., "BIRCH: An efficient data clustering method for very large databases", International Conference on Management of Data, Montreal, Canada, pp. 103-144, June 1996.
[5] Ying Zhao and George Karypis, "Evaluation of Hierarchical Clustering Algorithms for Document Datasets", CIKM, 2002.
[6] Guha, S., Rastogi, R. and Shim, K., "CURE: An efficient clustering algorithm for large databases", International Conference on Management of Data, Seattle, WA, pp. 73-83, June 1998.
[7] Gibson, D., Kleinberg, J. and Raghavan, P., "Clustering categorical data: an approach based on dynamical systems", Proceedings of the 24th Annual International Conference on Very Large Databases, VLDB Journal, vol. 8, no. 3-4, pp. 222-236, 2000.
[8] Ganti, V., Gehrke, J. and Ramakrishnan, R., "CACTUS - Clustering Categorical Data Using Summaries", Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, pp. 73-83, Aug. 1999.
[9] Zhang, T., Ramakrishnan, R. and Livny, M., "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery, pp. 141-182, 1997.
[10] Liu, C. and Sun, D. X., "Acceleration of EM algorithm for mixture models using ECME", ASA Proceedings of the Stat. Comp. Session, pp. 109-114, 1997.
[11] Huang, Z., "Clustering large data sets with mixed numeric and categorical values", Proceedings of the First Pacific-Asia Conference on Knowledge Discovery & Data Mining, pp. 21-34, 1997.
[12] Zhang Tiejun, Yang Jing, Zhang Jianpei, "Clustering Algorithm for Mixed Attributes Data Based on Restricted Particle Swarm Optimization", International Conference on Computer Science and Information Technology (ICCSIT), vol. 51, 2011.
[13] B. K. Tripathy and Adhir Ghosh, "SSDR: An Algorithm for Clustering Categorical Data Using Rough Set Theory", Pelagia Research Library, Advances in Applied Science Research, pp. 314-326, 2011.
[14] M. V. Jagannatha Reddy and B. Kavitha, "Clustering the Mixed Numerical and Categorical Dataset using Similarity Weight and Filter Method", International Journal of Database Theory and Application, Vol. 5, No. 1, March 2012.
[15] Shehroz S. Khan, Amir Ahmady, "Cluster Center Initialization for Categorical Data Using Multiple Attribute Clustering", 3rd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clustering, SIAM International Conference on Data Mining, Anaheim, California, USA, 2012.
[16] Tao Chen, Nevin L. Zhang, Tengfei Liu, Kin Man Poon, Yi Wang, "Model-based multidimensional clustering of categorical data", Artificial Intelligence, Elsevier, pp. 2246-2269, Oct. 2012.
[17] Fuyuan Cao, Jiye Liang, Deyu Li, Xingwang Zhao, "A weighting k-Modes algorithm for subspace clustering of categorical data", Neurocomputing, Elsevier, Vol. 108, pp. 23-30, May 2013.
[18] Lifei Chen, Shengrui Wang, "Central Clustering of Categorical Data with Automated Feature Weighting", Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, Aug. 2013.