47 292-298

High Accuracy Clustering Algorithm for Categorical Dataset
Aman Ahmad Ansari (1) and Gaurav Pathak (2)
(1) NIMS Institute of Engineering & Technology, Jaipur, India
Email: ansariaa1jan@gmail.com
(2) NIMS Institute of Engineering & Technology, Jaipur, India
Email: pathakg86@gmail.com
Abstract— Clustering is the step-by-step process of grouping objects so that the attributes of all
objects in a group are nearly similar. A cluster is therefore a collection of objects with nearly the
same attribute values: an object in a cluster is similar to the other objects in the same cluster but
different from the objects in other clusters. Clustering is used in a wide range of applications such
as pattern recognition, image processing, data analysis, and machine learning. Recently, more
attention has been paid to categorical data than to numerical data; in categorical data, the range of
a numerical attribute may be organized into classes such as small, medium, and high. A wide range of
algorithms exists for clustering categorical data. Our approach enhances the well-known k-modes
clustering algorithm to improve its accuracy. We propose a new approach named “High Accuracy
Clustering Algorithm for Categorical Datasets”.
Index Terms— clustering, k-mode Algorithm, categorical data, data mining.
I. INTRODUCTION
Data mining refers to extracting or mining knowledge from large amounts of data [1]; the term is also
used as a synonym for KDD (knowledge discovery in databases). Data mining techniques include:
Association Analysis: Discovering association rules showing attribute-value conditions that occur
frequently together in a given data set.
Classification: Learning to assign data objects to predefined classes. This requires supervised
learning, i.e., the training data has to specify what is to be learned.
Clustering: The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering. A cluster is a collection of data objects that are similar to one another
within the same cluster and dissimilar to the objects in other clusters. A cluster of data objects can
be treated collectively as one group. Figure 1 shows an example in which objects are clustered into
three groups.
During a cholera outbreak in London in 1854, John Snow used a special map to plot the cases of the
disease that were reported [2]. A key observation, after the creation of the map, was the close
association between the density of disease cases and a single well located at a central street. Most
clustering algorithms focus on data sets whose objects are defined on a set of numerical values.
However, data sets to be clustered also contain non-numerical values; when each object is described by
multiple such attributes, the data set is a categorical data set.
DOI: 02.ITC.2014.5.47
© Association of Computer Electronics and Electrical Engineers, 2014
Proc. of Int. Conf. on Recent Trends in Information, Telecommunication and Computing, ITC
Figure 1. Clustering of a set of points
Clustering cannot be a one-step process. Jain and Dubes divide the clustering process into the
following stages [3]: a) Data Collection; b) Initial Screening; c) Representation; d) Clustering
Tendency; e) Clustering Strategy; f) Validation; g) Interpretation.
This list of stages is given for exposition purposes since we do not propose solutions for each one of them.
We mainly focus on the problem of Clustering Strategy by proposing a new algorithm for categorical data,
and the problem of Clustering Tendency by proposing a heuristic for identifying appropriate values for the
number of clusters that exist in a data set.
II. PROBLEM DEFINITION
Previous clustering algorithms for categorical data sets are not very accurate and do not give the same
result on every execution with the same categorical data set. We want to solve this problem of
clustering categorical data with high accuracy.
III. CLUSTERING TECHNIQUES
A. Rules for Clustering Techniques
Every clustering algorithm must define the following:
1. The measure used to assess similarity or dissimilarity between pairs of objects.
2. The particular strategy followed in order to merge intermediate results. This strategy obviously
affects the way the end clusters are produced, since we may merge intermediate clusters according to
the distance of their closest or furthest points, or the distance of the average of their points [5].
3. An objective function that needs to be minimized or maximized, as appropriate, in order to produce
the final results.
B. Basic Clustering Techniques
1. Partitional: Given ‘n’ objects, a partitional clustering algorithm constructs k partitions of the
data so that an objective function is optimized. Some of these algorithms have high complexity
because they generate all possible groupings and try to find the optimal solution. Even for a small
number of objects, the number of possible partitions can be large. For this reason, common solutions
start with an initial, usually random, partition and proceed with its refinement. A better approach
is to run the partitional algorithm for several different sets of k initial points and keep track of
the results. The majority of these algorithms can be considered greedy algorithms, i.e., algorithms
that choose the best solution at each step and may not lead to optimal results in the end. The best
solution at each step is the placement of an object in the cluster whose representative point is
nearest to the object. k-means [4], PAM (Partitioning Around Medoids) [5], and CLARA (Clustering
LARge Applications) [5] fall under this category. All of these are applicable to numerical
attributes.
2. Categorical data clustering algorithms: These are for categorical data, where Euclidean or other
numerically oriented distance measures are not meaningful. These algorithms are close to the
partitional and hierarchical types. For each category, there exists a plethora of sub-categories,
e.g., density-based clustering oriented toward geographical data. An exception to this is the class
of approaches to handling categorical data. Visualization of such data is not straightforward and
there is no inherent geometrical structure in it; hence the approaches that have appeared in the
literature mainly use concepts carried by the data, such as co-occurrences in tuples. On the other
hand, data sets that include some categorical attributes are abundant. Moreover, there are data sets
with a mixture of attribute types, such as the United States Census data set [7] and data sets used
in data integration [6].
IV. RELATED WORK
Algorithms such as k-modes, ROCK, and COOLCAT [10] exist for clustering categorical data objects; in
the present work we extend the k-modes algorithm specifically to improve accuracy.
A. K-modes Algorithm
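The first algorithm for categorical data sets is the k-modes algorithm, which is an extension of
k-means [11]. The k-modes algorithm partitions a categorical data set of ‘n’ objects into ‘k’ clusters.
It is based on the k-means paradigm but uses modes in place of means for categorical data and a
frequency-based method to update the modes. The k-modes algorithm chooses ‘k’ random objects as the
initial cluster modes and uses a different dissimilarity measure to calculate the distance between two
objects. For two objects X = {x1, x2, ..., xm} and Y = {y1, y2, ..., ym} described by m categorical
attributes, the dissimilarity measure is
$$d(X, Y) = \sum_{i=1}^{m} \delta(x_i, y_i) \tag{1}$$
where
$$\delta(x_i, y_i) = \begin{cases} 0, & x_i = y_i \\ 1, & x_i \neq y_i \end{cases}$$
Let Q = {q1, q2, q3, ..., qm} be the mode of a cluster whose objects are X_1, X_2, ..., X_n. The mode
minimizes
$$D(\mathbf{X}, Q) = \sum_{i=1}^{n} d(X_i, Q) \tag{2}$$
where $\mathbf{X}$ denotes the set of objects in the cluster and Q is not necessarily one of those objects.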
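As a small illustration of the simple matching measure of Eq. (1), the following Python sketch counts mismatching attributes and sums those counts over a cluster, which is how we read Eq. (2). The function names and the toy tuples are assumptions made for this example only.

```python
def simple_matching(x, y):
    """Eq. (1): number of attributes on which two categorical objects differ."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(1 for xi, yi in zip(x, y) if xi != yi)


def total_dissimilarity(objects, q):
    """Sum of Eq. (1) distances between a candidate mode q and every object."""
    return sum(simple_matching(x, q) for x in objects)


# Toy data, invented for illustration (not taken from the paper).
cluster = [("a", "x", "p"), ("a", "y", "p"), ("b", "x", "p")]
candidate_mode = ("a", "x", "p")
pair_distance = simple_matching(("a", "x", "p"), ("a", "y", "q"))
cluster_cost = total_dissimilarity(cluster, candidate_mode)
```

The candidate mode here is built from the most frequent value of each attribute, which is the choice that keeps the summed distance small.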
Algorithm: k-modes
Input:
• k: number of clusters
• D: data set that contains ‘n’ objects
Output: set of k clusters
Method:
1. Randomly choose k objects as the initial cluster modes, one for each cluster.
2. Allocate each object to the cluster whose mode is most similar to that object, according to
Eq. (1).
3. Update the modes by calculating the most frequent value of each attribute over all objects in the
cluster.
4. Repeat
a. Reallocate each object to the cluster whose mode is most similar to that object,
if that cluster is not its current cluster.
b. Update the modes of the changed clusters.
5. until no changes.
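The steps above can be sketched in Python as follows. This is a minimal batch variant written for illustration, not the authors' tool: the names k_modes, mismatch, and column_modes, the seed and max_iter parameters, and the tie-breaking rules (first cluster wins, first most frequent value wins) are all assumptions.

```python
import random
from collections import Counter


def mismatch(x, y):
    """Simple matching dissimilarity of Eq. (1)."""
    return sum(a != b for a, b in zip(x, y))


def column_modes(members):
    """Most frequent value of each attribute over the given objects."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


def k_modes(data, k, seed=None, max_iter=100):
    """Return (labels, modes) for a list of equal-length categorical tuples."""
    rng = random.Random(seed)
    modes = [tuple(obj) for obj in rng.sample(data, k)]  # step 1: random modes
    labels = [None] * len(data)
    for _ in range(max_iter):
        changed = False
        for i, obj in enumerate(data):                   # steps 2 and 4a
            best = min(range(k), key=lambda c: mismatch(obj, modes[c]))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:                                  # step 5: no changes
            break
        for c in range(k):                               # steps 3 and 4b
            members = [data[i] for i in range(len(data)) if labels[i] == c]
            if members:                                  # keep old mode if empty
                modes[c] = column_modes(members)
    return labels, modes


# Toy data, invented for illustration.
toy = [("a", "x", "p")] * 3 + [("b", "y", "q")] * 3
toy_labels, toy_modes = k_modes(toy, k=2, seed=1)
```

Because the initial modes are drawn at random, different seeds can produce different partitions of the same data, which is the instability that the proposed initialization targets.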
V. PROPOSED METHODOLOGY
The proposed clustering algorithm extends the k-modes clustering algorithm with a new dissimilarity
measure and selects the initial modes using the select_init_mode algorithm, unlike the k-modes
algorithm, which selects the initial modes randomly.
A. Selection of Initial Modes
The result of the clustering process depends on the initial modes. If a clustering algorithm sets the
initial modes randomly, its result may not have the same accuracy on every run for a particular data
set. To overcome this problem, we propose the select_init_mode algorithm. This algorithm uses k-modes
to calculate modes and stores np sets of modes in a mode-pool, P.
Algorithm: select_init_mode
Input:
• np: number of sets of ‘k’ modes in the mode-pool.
• k: number of clusters.
• D: data set having ‘n’ objects.
Output: P: mode-pool.
Method:
1. Set i = 0.
2. Repeat
a. Execute the k-modes clustering algorithm.
b. Store the resulting set of modes in the mode-pool.
c. Increment i.
3. Until i = np
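A minimal Python sketch of the mode-pool loop is given below. To keep the block self-contained, the k-modes run is passed in as a callable; the run_kmodes hook and the stand-in random_objects_as_modes are hypothetical names introduced here, and the stand-in only picks random objects rather than performing a real k-modes run.

```python
import random


def select_init_mode(data, k, n_pool, run_kmodes):
    """Collect n_pool sets of k modes by calling a k-modes routine n_pool times.

    run_kmodes is any callable (data, k, seed) -> list of k modes. It is a
    hypothetical hook for this sketch, not an interface defined in the paper.
    """
    pool = []
    i = 0                                     # step 1
    while i < n_pool:                         # stops once i reaches n_pool
        modes = run_kmodes(data, k, seed=i)   # step 2a: execute k-modes
        pool.append(list(modes))              # step 2b: store the set of modes
        i += 1                                # step 2c: increment i
    return pool


def random_objects_as_modes(data, k, seed):
    """Stand-in for a real k-modes run: returns k randomly chosen objects."""
    return random.Random(seed).sample(data, k)


# Toy data, invented for illustration.
toy = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
pool = select_init_mode(toy, k=2, n_pool=5, run_kmodes=random_objects_as_modes)
```

Using the loop counter as the seed is one simple way to make each run start from different random modes while keeping the whole pool reproducible.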
B. Dissimilarity Measure
Similarity can be defined as how far or close the data objects are from one another. It is expressed
through a “measure, index, or coefficient” [3]. Dissimilarity can be measured in many ways, one of
which is distance, and distance can in turn be measured using any of a variety of distance measures.
The dissimilarity measure used by k-modes does not represent the real semantic distance between an
object and a cluster. For example, take a categorical data set with 3 attributes, A1 = {1, 2},
A2 = {1, 2}, and A3 = {1, 2, 3, 4, 5}, and 7 objects. Using the k-modes clustering algorithm with
k = 2, suppose the first 6 objects are clustered as shown in Table I below.
TABLE I. CLUSTER 1 AND CLUSTER 2
Let the 7th object of the data set be X = [2 1 1]. For this object the dissimilarities are
d(X, C1) = 1 and d(X, C2) = 1, so the object cannot be assigned unambiguously. From the table,
however, we can see that this object should be assigned to cluster 2. With the k-modes dissimilarity
measure we cannot be sure that the object is allocated to cluster 2. To solve this problem, we propose
a new dissimilarity measure that accounts for the frequency of the attribute values of the objects in
each cluster. The new dissimilarity measure is
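$$d_n(X, Y) = \sum_{j=1}^{m} \varphi(x_j, y_j) \tag{3}$$
where
$$\varphi(x_j, y_j) = \begin{cases} 1 - \dfrac{|O_{ljm}|}{|O_l|}, & x_j = y_j \\ 1, & x_j \neq y_j \end{cases}$$
Here Y is the mode of the lth cluster, $|O_l|$ is the number of objects in the lth cluster, and
$|O_{ljm}|$ is the number of objects in the lth cluster with the value $a_j^{(m)}$ of the jth attribute
(the value shared by $x_j$ and $y_j$). Using this dissimilarity measure, we can be sure that the 7th
object is allocated to cluster 2.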
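The effect of the frequency-based measure of Eq. (3) can be sketched in Python as follows. The contents of Table I are not reproduced in this text, so the two clusters below are hypothetical stand-ins chosen so that the simple matching distance of Eq. (1) ties; the function name and arguments are likewise assumptions.

```python
def frequency_dissimilarity(x, mode, members):
    """Eq. (3): a match on attribute j costs 1 - (share of cluster members
    holding that value); a mismatch costs 1."""
    n = len(members)
    if n == 0:
        raise ValueError("cluster must contain at least one object")
    total = 0.0
    for j, (xj, qj) in enumerate(zip(x, mode)):
        if xj != qj:
            total += 1.0
        else:
            matches = sum(1 for obj in members if obj[j] == xj)
            total += 1.0 - matches / n
    return total


# Hypothetical clusters (NOT the paper's Table I), with their modes.
cluster1 = [(1, 1, 1), (1, 1, 2), (1, 2, 3)]
mode1 = (1, 1, 1)
cluster2 = [(2, 1, 4), (2, 1, 5), (2, 2, 4)]
mode2 = (2, 1, 4)

x = (2, 1, 1)  # differs from each mode in exactly one attribute
d1 = frequency_dissimilarity(x, mode1, cluster1)
d2 = frequency_dissimilarity(x, mode2, cluster2)
```

In this toy case the object is one mismatch away from both modes, but the frequency weighting yields a smaller value for the second cluster because its matching values are shared by more of that cluster's members.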
C. Proposed Algorithm
Input:
• np: number of sets of modes in the mode-pool.
• k: number of clusters.
• D: data set having ‘n’ objects.
Output: set of k clusters.
Method:
1. Execute the select_init_mode algorithm, which returns the mode-pool.
2. For each mode, select the most frequent value of every attribute across the corresponding modes of
the np sets in the mode-pool. Initialize all modes with these values.
3. Allocate each object to the cluster whose dissimilarity to that object is lowest, according to
Eq. (3).
4. Update the modes by calculating the most frequent value of each attribute over all objects in the
cluster.
5. Repeat
a. Reallocate each object to the cluster whose dissimilarity to that object is lowest, if that cluster
is not its current cluster.
b. Update the modes of the changed clusters.
6. until no changes.
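Step 2 can be read as a per-attribute majority vote over the mode-pool. The Python sketch below follows that reading under one loud assumption: the modes stored at the same index in different pool entries are taken to describe the same cluster. The paper does not say how modes from different runs are matched, so this correspondence, like the name consolidate_pool, is hypothetical.

```python
from collections import Counter


def consolidate_pool(pool):
    """Build k initial modes from a pool of n_pool sets of k modes.

    Assumes pool[r][c] is the mode of cluster c in run r, i.e. that cluster
    indices line up across runs (an assumption of this sketch).
    """
    k = len(pool[0])
    initial = []
    for c in range(k):
        candidates = [modes[c] for modes in pool]
        # Most frequent value of each attribute among the candidate modes.
        initial.append(
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*candidates))
        )
    return initial


# Toy pool with three runs and two clusters, invented for illustration.
toy_pool = [
    [("a", "x"), ("b", "y")],
    [("a", "y"), ("b", "y")],
    [("a", "x"), ("b", "x")],
]
initial_modes = consolidate_pool(toy_pool)
```

If cluster indices do not line up across runs, the modes would first have to be matched, for example by pairing each mode with its nearest counterpart in a reference run.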
VI. IMPLEMENTATION & RESULT
To implement the proposed algorithm, we designed a tool interface.
Figure 2. Input Frame
Figure 2 shows the initial window of the tool. It takes the input file on which clustering is to be
applied, as well as the number of clusters, from the user.
Figure 3. Result Frame
From the window shown in Figure 3, we can see the output of the k-modes algorithm and of the proposed
algorithm by using the appropriate button. We experimented with two real-life categorical data sets,
the Mushroom data set and the Congressional Voting data set, taken from the UCI Machine Learning
Repository [8].
Clustering Accuracy: The clustering accuracy ‘r’ is defined as
$$r = \frac{\sum_{i=1}^{k} a_i}{n} \tag{4}$$
where
a_i = number of objects occurring in both cluster i and its corresponding class,
k = number of clusters, and
n = number of objects in the data set.
The clustering error is defined as
$$e = 1 - r \tag{5}$$
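Eqs. (4) and (5) can be computed with a few lines of Python. The sketch below assumes that the class corresponding to a cluster is its majority class, which is a common convention but is not stated in the paper; the function names and the toy labels are also assumptions.

```python
from collections import Counter


def clustering_accuracy(cluster_labels, class_labels):
    """Eq. (4): r = (sum of a_i) / n, taking a_i as the size of the majority
    class inside cluster i (an assumption of this sketch)."""
    if len(cluster_labels) != len(class_labels):
        raise ValueError("label lists must have the same length")
    per_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        per_cluster.setdefault(cluster, Counter())[cls] += 1
    matched = sum(c.most_common(1)[0][1] for c in per_cluster.values())
    return matched / len(class_labels)


def clustering_error(cluster_labels, class_labels):
    """Eq. (5): e = 1 - r."""
    return 1.0 - clustering_accuracy(cluster_labels, class_labels)


# Toy labels, invented for illustration: one edible object lands in cluster 1.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_classes = ["e", "e", "p", "p", "p", "p"]
r = clustering_accuracy(toy_clusters, toy_classes)
e = clustering_error(toy_clusters, toy_classes)
```

Here cluster 0 contributes 2 matching objects and cluster 1 contributes 3, so 5 of the 6 objects count as correctly clustered.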
We compare the proposed algorithm with the existing k-modes algorithm. For a fixed number of clusters
‘k’, the clustering errors ‘e’ of both algorithms are compared and shown in Figure 4.
A. Datasets
Congressional Voting Data Set: This data set includes the votes of each U.S. House of Representatives
congressman on sixteen key votes identified by the CQA. The CQA lists nine different types of votes:
voted for, paired for, and announced for (these three interpreted as yes); voted against, paired
against, and announced against (these three interpreted as no); and voted present, voted present to
avoid conflict of interest, and did not vote or otherwise make a position known (these three
interpreted as unknown) [8].
Figure 4. Congressional Voting data (Clustering Error vs No. of clusters)
Mushroom Data Set: We used the Mushroom database as input to our system. This database is drawn from
The Audubon Society Field Guide to North American Mushrooms (1981) and has 8124 data objects. Each
object has 22 attributes (e.g., color, odor, and shape) and a label characterizing the mushroom
specimen as either poisonous (3916 records) or edible (4208 records) [8].
Soybean Disease Data Set: We used the Soybean Disease database as input to our system. This data set
has 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that
the last four classes are unjustified by the data since they have so few examples. There are 35
categorical attributes, some nominal and some ordered. The value “dna” means does not apply. The
values for attributes are encoded numerically, with the first value encoded as “0”, the second as “1”,
and so forth. An unknown value is encoded as “?”. This data set has 307 data objects [8].
The proposed algorithm was also tested on other categorical data sets [8] such as Zoo, Soybean, and
US Census Data.
VII. CONCLUSIONS
Clustering is applicable in many areas, ranging from image processing, bug prediction, and pattern
evolution to machine learning. We therefore need a clustering algorithm that works efficiently as well
as accurately on all types of databases: numerical, categorical, and a mixture of both.
In this paper, we worked only on the accuracy of the clustering algorithm, so that we obtain a more
accurate and nearly identical result on every execution of the algorithm on the same data set. Our
algorithm worked well in this scenario, providing an accurate result on every execution.
We applied this algorithm only to simple real-life categorical data sets, namely the Mushroom database
and the Congressional Voting Data Set. In the future, the algorithm could be applied to bug data sets
to help developers find clusters of bugs that have the same cause. This would help in bug fixing
during development and also after deployment. At present the algorithm works only for categorical data
sets, but in the future it may be enhanced to work well with numerical data sets as well.
REFERENCES
[1] Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.
[2] E. W. Gilbert: "Pioneer Maps of Health and Disease in England", Geographical Journal, 1958.
[3] Anil K. Jain and Richard C. Dubes: "Algorithms for Clustering Data", Prentice-Hall, 2005.
[4] Amir Ahmad and Lipika Dey: "A k-mean clustering algorithm for numeric data", Data & Knowledge Engineering, 2007.
[5] Leonard Kaufman and Peter J. Rousseeuw: "Finding Groups in Data: An Introduction to Cluster Analysis", John
Wiley & Sons, 1990.
[6] Renée J. Miller, Mauricio A. Hernández, and Laura M. Haas: "The Clio Project: Managing Heterogeneity", SIGMOD
Record, 2001.
[7] US Census data set. http://www.census.gov
[8] UCI Repository of Machine Learning Databases. http://archive.ics.uci.edu/ml/datasets.html
[9] Serge Abiteboul, Richard Hull, and Victor Vianu: "Foundations of Databases", Addison-Wesley, 1995.
[10] Daniel Barbará, Julia Couto, and Yi Li: "COOLCAT: An Entropy-based Algorithm for Categorical Clustering",
CIKM, 2002.
[11] Zhihua Cai, Dianhong Wang, and Liangxiao Jiang: "A New Algorithm for Clustering Categorical Data", ICIC,
2006.
