Integrated Intelligent Research (IIR) International Journal of Data Mining Techniques and Applications, Volume 02, Issue 02, December 2013, Page No. 59-63, ISSN: 2278-2419
Improve the Performance of Clustering Using
Combination of Multiple Clustering Algorithms
Kommineni Jenni, Sabahath Khatoon, Sehrish Aqeel
Lecturer, Dept. of Computer Science, King Khalid University, Abha, Kingdom of Saudi Arabia
Email: Jenni.k.507@gmail.com, sabakhan_312@yahoo.com, sehrishaqeel.kku@gmail.com
Abstract - The ever-increasing availability of textual documents has led to a growing challenge for information systems: effectively managing and retrieving the information contained in large collections of texts according to the user's information needs. No clustering method can adequately handle all sorts of cluster structures and properties (e.g. shape, size, overlap, and density). Combining multiple clustering methods is an approach to overcome the deficiencies of single algorithms and further enhance their performance. A disadvantage of the cluster ensemble is the high computational load of combining the clustering results, especially for large and high-dimensional datasets. In this paper we propose a multi-clustering algorithm that combines a Cooperative Hard-Fuzzy Clustering model, based on intermediate cooperation between hard k-means (KM) and fuzzy c-means (FCM) to produce better intermediate clusters, with an ant colony algorithm. The proposed method gives better results than the individual clustering algorithms.
Keywords - Clustering; density; pheromone; ant colony algorithm; k-means; hard k-means; fuzzy c-means; Cooperative Hard-Fuzzy
I. INTRODUCTION
Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A cluster is a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters. A large number of clustering
methods [1]-[10] have been developed in many fields, with
different definitions of clusters and similarity metrics. It is well
known that no clustering method can adequately handle all
sorts of cluster structures and properties (e.g. overlapping,
shape, size and density). In fact, the cluster structure produced
by a clustering method is sometimes an artifact of the method
itself, imposed on the data rather than discovered from its true structure. Combining multiple clusterings [11]-[15] is considered a way to further broaden and stimulate new progress in the area of data clustering. Clustering is an important data mining technique that effectively discovers useful information by analyzing the data. It groups data objects into classes or clusters such that objects within the same cluster are highly similar, while objects in different clusters vary widely [19]. Combining clustering can be classified
into two categories based on the level of cooperation between
the clustering algorithms; either they cooperate on the
intermediate level or at the end-results level. Examples of end-
result cooperation are the Ensemble Clustering and the Hybrid
Clustering approaches [11]-[15]. Ensemble clustering is based
on the idea of combining multiple clusterings of a given dataset
X to produce a superior aggregated solution based on
aggregation function. Recent ensemble clustering techniques
have been shown to be effective in improving the accuracy and
stability of standard clustering algorithms. However, an
inherent drawback of these techniques is the computational
cost of generating and combining multiple clusters of the data.
In this paper, we propose a new Cooperative Hard-Fuzzy
Clustering (CHFC) model based on obtaining best clustering
solutions from the hard k-means (KM) [5] and the fuzzy c-
means (FCM) [6] at the intermediate steps based on a
cooperative criterion to produce better solutions and to achieve
faster convergence to solutions of both the KM and FCM over
overlapped clusters. After obtaining these clusters, the next stage applies an ant-based clustering process that simulates ants searching for food; thanks to the advantages of swarm intelligence, it can avoid falling into a local optimum and obtain better clustering results. The rest of this paper is
organized as follows: in section 2, related work to data
clustering is given. The proposed hybrid model is presented in
section 3. Section 4 provides a Multi clustering. Experimental
results are presented and discussed in section 5. Finally, we
discuss some conclusions and outline of future work in section
6.
II. RELATED WORK
Clustering algorithms have been developed and used in many
fields. Hierarchical and partitioning methods are two major
categories of clustering algorithms. [21], [22] provides an
extensive survey of various clustering techniques. In this
section, we highlight work done on document clustering. Many
clustering techniques have been applied to clustering
documents. For instance, [23] provided a survey on applying hierarchical clustering algorithms to document clustering, and [24] adapted various partition-based clustering algorithms to clustering documents. Another popular approach in document
clustering is agglomerative hierarchical clustering [26].
Algorithms in this family follow a similar template: Compute
the similarity between all pairs of clusters and then merge the
most similar pair. Different agglomerative algorithms may
employ different similarity measuring schemes. K-means and
its variants [25] represent the category of partitioning
clustering algorithms. [25] illustrates that one of the variants,
bisecting k-means, outperforms basic k-means as well as the
agglomerative approach in terms of accuracy and efficiency.
The bisecting k-means algorithm first selects a cluster to split.
Then it utilizes basic k-means to form two sub-clusters and
repeats until the desired number of clusters is reached. The K-means algorithm is simple, so it is widely used in text
clustering. Due to the randomness of the initial center selection in the K-means algorithm, its results are unstable and it easily falls into a local minimum. Dunn then proposed the fuzzy C-means (FCM) algorithm, which was later extended by Bezdek [27]. FCM divides the n data items into C fuzzy categories while minimizing an objective function, and it is the most common fuzzy clustering algorithm [28]. However, the clustering results of the traditional FCM algorithm lack stability, and its handling of boundary-value attribution is deficient.
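To make the FCM update concrete, here is a minimal sketch of one FCM iteration on 1-D data; the fuzzifier m = 2 and the toy values are illustrative assumptions, not taken from the paper:

```python
def fcm_step(data, centroids, m=2.0):
    """One fuzzy c-means iteration on 1-D data: update the membership
    matrix u, then recompute the centroids from the weighted data."""
    k = len(centroids)
    u = []
    for x in data:
        # distances to each centroid; a tiny floor avoids division by zero
        dists = [max(abs(x - c), 1e-12) for c in centroids]
        row = []
        for j in range(k):
            # u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
            denom = sum((dists[j] / dists[l]) ** (2.0 / (m - 1.0)) for l in range(k))
            row.append(1.0 / denom)
        u.append(row)
    new_centroids = []
    for j in range(k):
        # c_j = sum_i u_ij^m * x_i / sum_i u_ij^m
        num = sum((u[i][j] ** m) * data[i] for i in range(len(data)))
        den = sum(u[i][j] ** m for i in range(len(data)))
        new_centroids.append(num / den)
    return new_centroids, u
```

Iterating this step from rough initial centroids moves them toward the fuzzy cluster means.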
Combining clusterings invokes multiple clustering algorithms in the clustering process so that they benefit from each other and achieve a global benefit (i.e. they cooperate to attain better overall clustering quality). Current approaches to multiple clustering are ensemble clustering and hybrid clustering.
Hybrid clustering assumes a cascaded model of multiple
clustering algorithms to enhance the clustering of each other or
to reduce the size of the input representatives to the next level
of the cascaded model.
The hybrid PDDP-k-means algorithm [15] starts by running the PDDP algorithm [8] and enhances the resulting clustering solutions using the k-means algorithm. Hybrid clustering violates the synchronous execution of the clustering algorithms, as one or more of the algorithms stay idle until a former algorithm finishes its clustering.
Deneubourg et al. proposed the Basic Ant Colony Model (BM) for clustering. Agents move randomly on a square grid of cells on which some objects are scattered. When a loaded agent comes to an empty cell, it performs a drop action if the estimated neighbor density is greater than a probability threshold; otherwise it keeps moving to other empty cells.
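The drop decision in this model can be sketched as follows; the density-response formula and the constant k_d are the standard Deneubourg-style form, used here only as an illustration:

```python
import random

def drop_probability(local_density, k_d=0.3):
    """Deneubourg-style response: the higher the estimated neighbourhood
    density f, the more likely a loaded agent drops its object.
    p_drop = (f / (k_d + f))^2; k_d = 0.3 is an illustrative constant."""
    f = max(0.0, min(1.0, local_density))
    return (f / (k_d + f)) ** 2

def agent_drops(local_density, rng=random.random):
    """A loaded agent on an empty cell drops its object when a random
    draw falls below the drop probability."""
    return rng() < drop_probability(local_density)
```

The squared response makes the agent reluctant to drop in sparse regions and eager to drop in dense ones, which is what gradually gathers similar objects together.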
III. PROPOSED METHOD
A. Cooperative Hard-Fuzzy Clustering
In the proposed CHFC [18] model, we aggregate both the KM
and FCM algorithms to cooperate at each iteration
synchronously (i.e. there is no idle time wasted in the waiting
process as in the hybrid models) for the aim of reducing the
total computational time compared to the FCM, achieving
faster convergence to solutions, and producing an overall
solution that is better than that of both clustering algorithms.
Both KM and FCM attempt to reduce the objective function in
every step, and may terminate at a solution that is locally
optimal. A bad choice of initial centroids can have a great
impact on both the performance and quality of the clustering.
That is why many runs are executed and the overall best value
is taken. Also a good choice of initial centroids reduces the
number of iterations that are required for the algorithm to
converge. So when the KM algorithm is fed good cluster centroids from the neighboring FCM, it may achieve better clustering quality. In addition, when the KM produces good cluster centroids and sends them to the neighboring FCM, the FCM algorithm can generate appropriate memberships for each data object, yielding better cluster
centroids. In the CHFC model, both the KM and FCM
exchange their centroids at each iteration and each algorithm
accepts or rejects the received set of centroids based on a
cooperative quality criterion. Minimum Objective Function
[18] is used as a cooperative quality criterion. Both the KM
and FCM receive a solution (i.e. a set of k centroids) from the
neighboring algorithm, each algorithm accepts the received
solution and replaces the current centroids with the new set of
centroids only if the new set of centroids minimizes its
objective function J value at each step. Adopting a solution with a lower objective function than the current one enables both the KM and the FCM to converge faster. The
CHFC alternates between the KM in some iterations and the
FCM in a different number of iterations as the CHFC takes the
best solutions from either the KM or the FCM at each iteration
(i.e. best solutions with minimum objective function value).
This cooperative strategy is simple because both the KM and
FCM have the same structure of prototypes (centroids), which
enables fast exchange of information between both of them.
The Message Passing Interface (MPI) [17] is used to facilitate
the communication between the KM and the FCM (e.g. send
and receive operations). Assume that CKM and CFCM are the
set of k centroids generated by the KM and FCM algorithms,
respectively, at any iteration. Both the KM and the FCM
evaluate their current objective function J using their current
centroids CKM and CFCM, respectively. Each algorithm accepts the set of centroids received from its neighbor only if it minimizes its current objective function J, and then updates its current set of centroids with the received set. These steps
of exchanging and updating centroids are repeated until either
the KM or the FCM converge to the desired clustering quality.
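The exchange-and-accept step described above can be sketched as follows (1-D data; for brevity both sides test the hard k-means objective, whereas in the full CHFC model the FCM side would test its own fuzzy objective):

```python
def km_objective(data, centroids):
    """Hard k-means objective J: total squared distance of each point
    to its nearest centroid (1-D data)."""
    return sum(min((x - c) ** 2 for c in centroids) for x in data)

def cooperative_exchange(data, km_centroids, fcm_centroids):
    """One CHFC-style exchange: each side adopts the neighbour's centroids
    only if they lower its own objective J; otherwise it keeps its own."""
    new_km = km_centroids
    if km_objective(data, fcm_centroids) < km_objective(data, km_centroids):
        new_km = fcm_centroids  # KM accepts the FCM solution
    new_fcm = fcm_centroids
    if km_objective(data, km_centroids) < km_objective(data, fcm_centroids):
        new_fcm = km_centroids  # FCM accepts the KM solution
    return new_km, new_fcm
```

Because both algorithms use the same centroid structure, the exchange is just a comparison of two objective values per iteration.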
B. Quality measures
The clustering results of any clustering algorithm should be
evaluated using an informative quality measure(s) that reflects
the “goodness” of the resulting clusters. Two external quality
measures (F-measure and Entropy) [7]-[8] are used, which
assume that a prior knowledge about the data objects (e.g. class
labels) is given. The Separation Index (SI) [17] is used as
internal quality measure, which does not require prior
knowledge about the objects.
a) F-measure
F-measure combines the precision and recall ideas from the
information retrieval literature. The precision and recall of a
cluster Sj with respect to a class Ri, i, j=1,2,..,k, are defined as:
recall(Ri, Sj) = Lij / |Ri|    (1)

precision(Ri, Sj) = Lij / |Sj|    (2)
where Lij is the number of objects of class Ri in cluster Sj, |Ri|
is the number of objects in class Ri , and |Sj| is the number of
objects in cluster Sj. The F-measure of a class Ri is defined as:
F(Ri) = max_j { 2 · precision(Ri, Sj) · recall(Ri, Sj) / [ precision(Ri, Sj) + recall(Ri, Sj) ] }    (3)
With respect to class Ri we consider the cluster with the
highest F-measure to be the cluster Sj that is mapped to class
Ri, and that F-measure becomes the score for class Ri. The
overall F-measure for the clustering result of k clusters is the
weighted average of the F-measure for each class Ri:
overall F-measure = ( Σ_{i=1}^{k} |Ri| · F(Ri) ) / Σ_{i=1}^{k} |Ri|    (4)
The higher the overall F-measure, the better the clustering due
to the higher accuracy of the resulting clusters mapping to the
original classes.
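Equations (1)-(4) can be sketched in code as follows; the function and variable names are illustrative, not from the paper:

```python
def f_measure(classes, clusters, labels, assignments):
    """Overall F-measure per equations (1)-(4): for each class take the
    best-matching cluster's F value, then weight by class size."""
    n_ij = {(r, s): 0 for r in classes for s in clusters}
    for lab, asg in zip(labels, assignments):
        n_ij[(lab, asg)] += 1  # L_ij: objects of class r in cluster s
    class_size = {r: labels.count(r) for r in classes}
    cluster_size = {s: assignments.count(s) for s in clusters}
    total = 0.0
    for r in classes:
        best = 0.0
        for s in clusters:
            if cluster_size[s] == 0 or n_ij[(r, s)] == 0:
                continue
            prec = n_ij[(r, s)] / cluster_size[s]            # equation (2)
            rec = n_ij[(r, s)] / class_size[r]               # equation (1)
            best = max(best, 2 * prec * rec / (prec + rec))  # equation (3)
        total += class_size[r] * best
    return total / len(labels)                               # equation (4)
```

A perfect clustering scores 1.0; any mixing of classes across clusters lowers the score.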
b) Entropy
Entropy tells us how homogeneous a cluster is. Assume a
partitioning result of a clustering algorithm consisting of k
clusters. For every cluster Sj we compute prij, the probability
that a member of cluster Sj belongs to class Ri. The entropy of
each cluster Sj is calculated as:
E(Sj) = − Σ_{i=1}^{k} prij · log(prij),   j = 1, ..., k    (5)
The overall entropy for a set of k clusters is the sum of
entropies for each cluster weighted by the size of each cluster:
overall Entropy = Σ_{j=1}^{k} ( |Sj| / n ) · E(Sj)    (6)
The lower the overall entropy, the more homogeneous (or similar) the objects within each cluster.
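Equations (5) and (6) can be sketched as:

```python
import math

def overall_entropy(labels, assignments):
    """Overall entropy per equations (5)-(6): per-cluster entropies
    weighted by cluster size; lower means more homogeneous clusters."""
    n = len(labels)
    total = 0.0
    for s in set(assignments):
        members = [lab for lab, a in zip(labels, assignments) if a == s]
        e = 0.0
        for r in set(members):
            p = members.count(r) / len(members)  # pr_ij
            e -= p * math.log(p)                 # equation (5)
        total += (len(members) / n) * e          # equation (6)
    return total
```

Pure clusters contribute zero; a cluster with an even two-class mix contributes log 2 weighted by its size.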
C. Ant Clustering Algorithm
The ant colony algorithm [19] (ACA) is a simulated evolutionary algorithm. It simulates the collaborative behavior of real ants: the task is completed jointly by a number of ants, each of which searches the candidate solution space independently and leaves a certain amount of pheromone on the solutions it finds. The greater the amount of pheromone on a solution, the more likely it is to be selected. Ant colony clustering is inspired by this behavior. The ant colony algorithm is a heuristic global optimization algorithm: according to the amount of pheromone, the cluster centers can merge with the surrounding data, producing a clustering classification [20]. Here it is used to partition the input samples. The data to be clustered are regarded as ants with different attributes, and the cluster centers are regarded as the "food sources" that the ants need to find. Assume the input samples are X = {Xi | i = 1, 2, ..., n}, i.e. there are n input samples. The process of determining the cluster centers is then the process of ants setting out from the colony to find food; during the search, different ants select data elements independently of each other.
Suppose

d_ij = ||m(x_i − x_j)||

denotes the weighted Euclidean distance between x_i and x_j, where m is a weighting factor that can be set according to the contribution of the various components to the clustering. Let r be the clustering radius, let ε denote the statistical error, and let τ_ij(t) denote the residual amount of pheromone on the path between data x_i and data x_j at time t; at the initial moment, the amount of pheromone on the various paths is the same and equal to 0. The amount of pheromone on the path is given by equation (7):

τ_ij = 1 if d_ij ≤ r,  τ_ij = 0 if d_ij > r    (7)
The ant transition probability is given by equation (8), where η_ij denotes the expected degree to which x_i will be merged into X_j, generally taken as 1/d_ij, and α and β weight the pheromone and expectation terms:

p_ij(t) = [τ_ij(t)]^α [η_ij]^β / Σ_{s∈U} [τ_is(t)]^α [η_is]^β  if j ∈ U,  p_ij(t) = 0 otherwise    (8)
If p_ij(t) ≥ p0, x_i will be integrated into the neighborhood of X_j. Suppose C_j = { X_k | d_kj ≤ r, k = 1, 2, ... }, where C_j denotes the set of all data integrated into the neighborhood of X_j. The ideal cluster center is then found as

c_j = (1/J) Σ_{k=1}^{J} x_k,   x_k ∈ C_j

where J is the number of data items in C_j.
The specific steps of the learning method based on the ant colony clustering algorithm are as follows:
1. Randomly choose k representative points according to experience. In general, the initial choice of representative points tends to affect the result of the iteration; that is, it is likely to reach a local rather than the global optimal solution. To address this issue, different initial representative points can be tried in order to avoid falling into a local minimum.
2. Initialization: set N, m, r, ε0, p0, α and β, and let τ_ij(0) = 0 and p_ij(0) = 0.
3. Compute d_ij = ||m(x_i − x_j)||.
4. Calculate the amount of pheromone on the various paths using equation (7).
5. Calculate the probability p_ij(t) that x_i is integrated into X_j using equation (8).
6. Determine whether p_ij(t) ≥ p0 holds; if it does, continue; otherwise set i = i + 1 and go to step 3.
7. Compute the cluster centers according to c_j = (1/J) Σ_{k=1}^{J} x_k, x_k ∈ C_j.
8. Compute the deviation D_j of each cluster j from its center and the overall error ε = Σ_{j=1}^{k} D_j.
9. Determine whether ε ≤ ε0 holds; if it does, output the clustering results; otherwise go to step 3 and continue the iteration.
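A minimal sketch of steps 2-9 on 1-D data follows; the radius, thresholds, and the simplified assignment rule are illustrative assumptions rather than the paper's exact procedure:

```python
def ant_cluster(data, reps, r=1.5, alpha=1.0, beta=1.0, p0=0.5, max_iter=20):
    """Sketch of the ant-colony clustering loop on 1-D data. Pheromone
    tau_ij is 1 inside the radius r and 0 outside (eq. 7); the transition
    probability (eq. 8) decides whether a point joins a representative's
    neighbourhood, and centres are re-estimated as neighbourhood means."""
    centers = list(reps)
    for _ in range(max_iter):
        neighborhoods = [[] for _ in centers]
        for x in data:
            d = [abs(x - c) for c in centers]                    # step 3
            tau = [1.0 if dij <= r else 0.0 for dij in d]        # step 4 (eq. 7)
            eta = [1.0 / dij if dij > 0 else 1e6 for dij in d]   # eta_ij = 1/d_ij
            w = [(tau[j] ** alpha) * (eta[j] ** beta) for j in range(len(centers))]
            if sum(w) == 0:
                continue  # point lies outside every radius
            probs = [wj / sum(w) for wj in w]                    # step 5 (eq. 8)
            j = max(range(len(centers)), key=lambda t: probs[t])
            if probs[j] >= p0:                                   # step 6
                neighborhoods[j].append(x)
        new_centers = [sum(nb) / len(nb) if nb else c            # step 7
                       for nb, c in zip(neighborhoods, centers)]
        if all(abs(a - b) < 1e-6 for a, b in zip(new_centers, centers)):
            break                                                # step 9 (converged)
        centers = new_centers
    return centers
```

On well-separated data the centres settle onto the neighbourhood means within a few iterations.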
IV. MULTI CLUSTERING ALGORITHM
The multi-clustering algorithm consists of the following steps. Initially, we apply the hard-fuzzy clustering algorithm to the input datasets; hard-fuzzy clustering consists of a combination of k-means and fuzzy c-means clustering. These two algorithms are applied to the input dataset and, based on a distance measure, we obtain a set of intermediate clusters. In the second stage, we take the intermediate clusters as input and apply the ant clustering algorithm to them.
Finally, we obtain the desired output clusters, which perform better than the individual clustering algorithms.
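As a sketch, the two-stage pipeline might look as follows; plain k-means stands in for the hard-fuzzy stage and a radius-based re-estimation stands in for the ant stage, both being simplifications of the paper's components:

```python
def kmeans(data, centers, iters=20):
    """Plain 1-D k-means, used here as a stand-in for the intermediate
    (hard-fuzzy) clustering stage."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda t: abs(x - centers[t]))
            groups[j].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def multi_clustering(data, init_centers, r=1.5):
    """Two-stage pipeline sketch: intermediate clustering first, then an
    ant-style refinement that recomputes each centre from the points
    falling inside its radius r."""
    centers = kmeans(data, init_centers)  # stage 1: intermediate clusters
    refined = []
    for c in centers:                     # stage 2: ant-style neighbourhoods
        nb = [x for x in data if abs(x - c) <= r]
        refined.append(sum(nb) / len(nb) if nb else c)
    return refined
```

The point of the second stage is that the refined centres depend only on nearby points, which damps the effect of a poor initial centre choice in stage 1.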
V. EXPERIMENT RESULTS
We use the Mini News Group collection as the input dataset and apply the multi-clustering algorithm to it. The multi-clustering gives results about 10% better than the individual clustering algorithms. From the experimental results, the k-means algorithm and the pheromone-based ant colony algorithm select random sample points as the initial cluster centers, so the clustering accuracy of this approach is unstable, its average accuracy is low, and its clustering effect on actual data is poor. The hard-fuzzy clustering gives somewhat better results. Next, we apply the ant clustering algorithm to the output of the hard-fuzzy clustering to generate the desired output clusters; the final result performs better than the individual clustering algorithms. Precision and recall are used to calculate the performance, and the entropy of the individual clusters and of the multi-cluster result can be calculated. The improved algorithm has a high and stable accuracy rate and can be used for clustering actual data. The performance of the hard-fuzzy clustering algorithm on the Mini News Group dataset is shown in Table 1 and Figure 1, and the result of the proposed method is shown in Table 2 and Figure 2.
Table 1. Performance of Hard-Fuzzy Clustering on the Mini News Group Dataset

No. of Documents   Performance
100                91.34%
200                90.45%
300                88.40%
400                85.37%
500                82.03%
600                79.50%
700                75.21%
800                71.99%
900                68.45%
1000               65.35%
1500               61.23%
2000               55.45%
2500               48.23%
Average            74.08%
Table 2. Performance of Hard-Fuzzy + Ant Clustering on the Mini News Group Dataset

No. of Documents   Performance
100                95.45%
200                94.02%
300                93.33%
400                91.09%
500                89.99%
600                86.54%
700                85.77%
800                82.43%
900                80.99%
1000               78.60%
1500               76.21%
2000               70.45%
2500               67.34%
Average            84.02%
Figure 1. F-measure of Hard-Fuzzy clustering on the Mini News Group Dataset
Figure 2. F-measure of Hard-Fuzzy + Ant clustering on the Mini News Group Dataset
VI. CONCLUSION
In this work we presented the Cooperative Hard-Fuzzy Clustering (CHFC) model combined with the ant clustering algorithm, with the goal of achieving better clustering quality and time performance. The multi-clustering model is based on intermediate cooperation between the k-means and fuzzy c-means algorithms, which exchange their best intermediate solutions; the ant clustering algorithm is then applied to the intermediate clusters to obtain the desired output clusters with better performance. The proposed model achieves faster convergence to solutions and produces an overall solution better than that of either clustering algorithm alone, and it outperforms the hybrid cascaded models.
REFERENCES
[1] A. Jain, M. Murty and P. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, pp: 264-323, 1999.
[2] R. Xu, “Survey of Clustering Algorithms”, IEEE Transactions on Neural
Networks, Vol.16, Issue 3, pp: 645- 678, 2005.
[3] M. Steinbach, G. Karypis and V. Kumar, “A Comparison of Document
Clustering Techniques”, Proceeding of the KDD Workshop on Text Mining,
pp: 109-110, 2000.
[4] R. Duda and P. Hart, “Pattern Classification and scene Analysis”. John
Wiley and Sons, 1973.
[5] J. Hartigan and M. Wong. “A k-means Clustering Algorithm”, Applied
Statistics, Vol. 28, pp: 100-108, 1979.
[6] J. Bezdek, R. Ehrlich, and W. Full, “The Fuzzy C-Means Clustering
Algorithm”, Computers and Geosciences, Vol.10, pp: 191-203, 1984.
[7] S. M. Savaresi and D. Boley. “On the Performance of Bisecting K-means
and PDDP”. In Proc. of the 1st SIAM Int. Conf. on Data Mining, pp: 1-14,
2001.
[8] D. Boley. “Principal Direction Divisive Partitioning”. Data Mining and
Knowledge Discovery,2 (4), pp: 325-344, 1998.
[9] M. Ester, P. Kriegel., J. Sander, and X. Xu. “A Density-Based Algorithm
for Discovering Clusters in Large Spatial Databases with Noise”, Proc. 2nd Int.
Conf. on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press,
1996, pages 226-231.
[10] K. Hammouda and M. Kamel, “Collaborative Document Clustering”,2006
SIAM Conference on Data Mining (SDM06), pp: 453-463, 2006.
[11] A. Strehl and J. Ghosh. “Cluster ensembles – knowledge reuse framework
for combining partitionings”. In conference on Artificial Intelligence (AAAI
2002), pp: 93–98, AAAI/MIT Press, 2002.
[12] Y. Qian and C. Suen. “Clustering combination method”. In International
Conference on Pattern Recognition. ICPR 2000, volume 2, pp: 732–735, 2000.
[13] H. Ayad, M. Kamel, "Cumulative Voting Consensus Method for
Partitions with Variable Number of Clusters," IEEE Transactions on Pattern
Analysis and Machine Intelligence, IEEE Computer Society Digital Library.
IEEE Computer Society, 2007.
[14] Y. Eng, C. Kwoh, and Z. Zhou, ''On the Two-Level Hybrid Clustering
Algorithm,'' AISAT04 - International Conference on Artificial Intelligence in
Science and Technology, 2004, pp. 138-142.
[15] S. Xu and J. Zhang, “A Hybrid Parallel Web Document Clustering
Algorithm and its Performance Study”, Journal of Supercomputing, Vol. 30,
Issue 2, pp: 117-131, 2004.
[17] W. Gropp E. Lusk and A. Skjellum, “Using MPI, Portable Parallel
Programming with Message Passing Interface”. The MIT Press, Cambridge,
MA, 1996.
[16] U. Maulik, S. Bandyopadhyay, "Performance Evaluation of Some Clustering Algorithms and Validity Indices", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 12, pp: 1650-1654, 2002.
[18] R. Kashef and M. S. Kamel, "Hard-Fuzzy Clustering: A Cooperative Approach", IEEE Conf., 2007.
[19] Lan Li, Wan-chun Wu, Qiao-mei Rong, "Research on Hybrid Clustering Based on Density and Ant Colony Algorithm", 2010 Second International Workshop on Education Technology and Computer Science, 2010.
[20] L. Jing, K. Ng Michael and J. Huang. An Entropy Weighting k-Means
Algorithm for Subspace Clustering of High-Dimensional Sparse Data. IEEE
Transactions on Knowledge and Data Engineering, 19(8): 1026-1041,2007.
[21] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: a review", ACM Computing Surveys, vol. 31(3), pp. 264-323, 1999.
[22] X. Rui, "Survey of clustering algorithms", IEEE Transactions on Neural Networks, vol. 16(3), pp. 645-678, 2005.
[23] P.Willett, Recent trends in hierarchical document clustering: a critical
review, Information processing and management, vol. 24,pp. 577-97, 1988.
[24] R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey, "Scatter/Gather: A cluster-based approach to browsing large document collections", In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318-329, 1992.
[25] R. C. Dubes and A. K. Jain, "Algorithms for Clustering Data", Prentice
Hall College Div, Englewood Cliffs, NJ, March 1998.
[26] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document
clustering techniques", KDD Workshop on Text Mining'00, 2000.
[27] N. R. Pal and J. C. Bezdek, On cluster validity for the fuzzy c-means
model, IEEE Trans. Fuzzy Systems, vol. 3, pp. 370-379, 1995.
[28] J. C. Bezdek and N. R. Pal, Some new indexes of cluster validity, IEEE
Trans. Syst. Man. Cybern., vol. 28, pp. 301-315, 1998.