IJSRD - International Journal for Scientific Research & Development| Vol. 3, Issue 10, 2015 | ISSN (online): 2321-0613
All rights reserved by www.ijsrd.com 382
Introduction to Multi-Objective Clustering Ensemble
Bhumi A. Patel (ME Student) and Lokesh P. Gagnani (Assistant Professor)
Department of Information Technology, KITRC, Kalol, Gandhinagar-382721
Abstract— Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. In this paper we introduce the concepts of data mining, association rules, and multilevel association rules with their different algorithms and advantages, together with the concepts of fuzzy logic and genetic algorithms. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Key words: Multilevel Association Rule, Support, Confidence, Genetic-Fuzzy Algorithm, Fuzzy Set
I. INTRODUCTION
A. Data Mining
Data mining is the process of extracting useful information from large amounts of data. Depending on the data being processed, this extraction serves several purposes: obtaining a model for future events; identifying the variables and attributes of the process under study; and predicting (forecasting) the future variation of variables [1].
Data mining can be classified into two high-level categories [1]:
– Predictive Data Mining
– Descriptive Data Mining
1) Predictive Data Mining:
Predictive data mining techniques build a model that predicts future values from past and current data. The main predictive techniques are:
a) Classification
b) Regression Analysis
c) Time Series Analysis
d) Prediction
2) Descriptive Data Mining:
Descriptive data mining techniques organize data according to its general properties and transform it into human-interpretable patterns, associations, or correlations. The main descriptive techniques are:
a) Clustering
b) Summarization
c) Association Rule Mining
d) Sequence Discovery
B. Clustering
Clustering is the task of grouping a set of objects in such a
way that objects in the same group (called a cluster) are more
similar (in some sense or another) to each other than to those
in other groups (clusters).
Fig. 1: The result of a cluster analysis shown as the
colouring of the squares into three clusters.
Data clustering is one of the essential tools for perceiving the structure of a data set. It plays a vital, initial role in data mining, information retrieval, and machine learning. The basic goal of cluster analysis is to discover the natural groupings of objects in a dataset. The data may be of mixed nature, consisting of both numeric and categorical attributes that differ in character [2].
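The grouping idea above can be sketched with a minimal one-dimensional k-means, a standard clustering algorithm: points are assigned to the nearest centre, and centres are recomputed as cluster means until assignments stabilise. The data values and initial centres below are illustrative, not taken from the paper.

```python
def kmeans_1d(points, centres, iters=10):
    """Partition 1-D points into len(centres) clusters by
    alternating nearest-centre assignment and mean update."""
    for _ in range(iters):
        # Assign each point to its nearest centre
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Recompute each centre as the mean of its cluster
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return clusters

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
print(kmeans_1d(points, centres=[0.0, 6.0]))
# -> [[1.0, 1.2, 0.8], [5.0, 5.3, 4.9]]
```

The two natural groups (values near 1 and values near 5) are recovered regardless of where exactly the initial centres are placed, as long as one starts on each side.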
II. CLUSTERING ENSEMBLE
A cluster ensemble system solves a clustering problem in two
steps. The first step takes a data set as input and outputs an
ensemble of clustering solutions. The second step takes the
cluster ensemble as input and combines the solutions to
produce a single clustering as the final output. Figure 2 shows the general process of a cluster ensemble, which consists of generating a set of clusterings from the same dataset and combining them into a final clustering. The objective of this combination is to improve on the quality of the individual clusterings. The idea of combining different clustering results emerged as an alternative approach for improving the quality of the results of clustering algorithms [3].
A cluster ensemble has two major parts [3]:
1) Generation mechanisms
2) Consensus functions
A. Generation Mechanism
Generation is the first step in clustering ensemble methods, in which the set of clusterings to be combined is produced. It yields a collection of clustering solutions, i.e., a cluster ensemble.
Given a data set of n instances X = {x1, x2, ..., xn}, an ensemble constructor generates a cluster ensemble, represented as π = {π1, ..., πr}, where r is the ensemble size (the number of clusterings in the ensemble). Each clustering solution πi is simply a partition of the data set X into Ki disjoint clusters of instances, represented as πi = {ci1, ..., ciKi}.
Fig. 2: Basic Process of Cluster Ensembles[3]
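The notation above can be made concrete with a small sketch: each clustering solution πi is stored as a list of cluster labels over the same n instances, and the ensemble is simply the list of r such solutions. The labellings below are illustrative assumptions, not data from the paper.

```python
def as_clusters(labels):
    """Convert a label list into the disjoint clusters c_i1 ... c_iK
    of one clustering solution (each cluster is a set of instance indices)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, set()).add(idx)
    return list(clusters.values())

# Cluster ensemble pi = {pi_1, ..., pi_r} over n = 6 instances, r = 3
ensemble = [
    [0, 0, 1, 1, 2, 2],   # pi_1: K_1 = 3 clusters
    [0, 0, 0, 1, 1, 1],   # pi_2: K_2 = 2 clusters
    [0, 1, 1, 1, 2, 2],   # pi_3: K_3 = 3 clusters
]

for i, labels in enumerate(ensemble, start=1):
    print(f"pi_{i}:", as_clusters(labels))
```

Note that the ensemble members may disagree both on cluster membership and on the number of clusters Ki; reconciling these disagreements is exactly the job of the consensus function described next.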
B. Consensus Function
The consensus function is the main step in any clustering ensemble algorithm: it produces the final data partition, or consensus partition, which is the result of the ensemble algorithm [2]. Common types of consensus function include:
– Co-association based function
– Graph based methods
– Voting approaches
– Mixture model approaches
– Information theory approach
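As a hedged sketch of the first family above, a co-association based consensus function counts, for each pair of instances, the fraction of ensemble members that place them in the same cluster, and then merges pairs whose co-association exceeds a threshold (single-link style, via union-find). The ensemble and threshold below are illustrative.

```python
def co_association(ensemble, n):
    """co[j][k] = fraction of clusterings placing instances j and k together."""
    co = [[0.0] * n for _ in range(n)]
    for labels in ensemble:
        for j in range(n):
            for k in range(n):
                if labels[j] == labels[k]:
                    co[j][k] += 1.0 / len(ensemble)
    return co

def consensus(ensemble, n, threshold=0.5):
    co = co_association(ensemble, n)
    # Merge instances whose co-association exceeds the threshold
    # (connected components via a small union-find).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for j in range(n):
        for k in range(j + 1, n):
            if co[j][k] > threshold:
                parent[find(j)] = find(k)
    groups = {}
    for j in range(n):
        groups.setdefault(find(j), []).append(j)
    return list(groups.values())

ensemble = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
print(consensus(ensemble, 4))  # instances 0 and 1 are always co-clustered
```

Building the co-association matrix costs O(N²) in the number of instances, which matches the complexity reported for this approach in Table 1.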
Consensus Function             | Computational Complexity | Scalability | Robustness | Ease of Implementation
Mixture Models                 | O(K^3)                   | High        | Low        | Easy to implement
Voting Based Approach          | O(K^3)                   | High        | High       | Easy to implement
Information Theory Approach    | O(K^3)                   | Low         | High       | Not easy to implement
Co-Association Based Approach  | O(N^2)                   | High        | High       | Difficult to implement
Hypergraph Based Approach      | O(N^3)                   | High        | High       | Difficult to implement
Table 1: Comparison among different consensus function approaches [1]
     | Π1 | Π2 | Π3 | Π4 | E[zi1] | E[zi2] | Consensus
Y1   | 2  | B  | X  | β  | 0.999  | 0.001  | 1
Y2   | 2  | A  | X  | α  | 0.997  | 0.003  | 1
Y3   | 2  | A  | Y  | α  | 0.943  | 0.057  | 1
Y4   | 2  | B  | X  | β  | 0.999  | 0.001  | 1
Y5   | 1  | A  | X  | α  | 0.999  | 0.001  | 1
Y6   | 2  | A  | Y  | α  | 0.943  | 0.057  | 1
Y7   | 2  | B  | Y  | β  | 0.124  | 0.876  | 2
Y8   | 1  | B  | Y  | β  | 0.019  | 0.981  | 2
Y9   | 1  | B  | Y  | β  | 0.260  | 0.740  | 2
Y10  | 1  | A  | Y  | α  | 0.115  | 0.885  | 2
Y11  | 2  | B  | Y  | β  | 0.124  | 0.876  | 2
Y12  | 1  | B  | Y  | β  | 0.019  | 0.981  | 2
Table 2: Clustering ensemble and consensus solution [6]
Fig. 3: Four possible partitions of 12 data points into 2
clusters. Different partitions use different sets of labels.[6]
III. MULTI-OBJECTIVE CLUSTERING ENSEMBLE
The goal of multi-objective clustering is to find the clusters in a dataset by applying several clustering algorithms, each corresponding to a different objective function. We propose a
clustering approach that integrates the output of different
clustering algorithms into a single partition. More precisely,
given different clustering objective functions, we seek a
partition that utilizes the appropriate objective functions for
different parts of the data space. This framework can be
viewed as a meta-level clustering since it operates on multiple
clustering algorithms simultaneously. The final partition not
only contains meaningful clusters but also associates a
specific objective function with each cluster[2][4].
Multiobjective clustering is a two-step process: (i)
independent or parallel discovery of clusters by different
clustering algorithms, and (ii) construction of an “optimal”
partition from the discovered clusters. The second step is a
difficult conceptual problem, since clustering algorithms
often are not accompanied by a measure of the goodness of
the detected clusters. The objective function used by a
clustering algorithm is not indicative of the quality of the
partitions found by other clustering algorithms. The
goodness of each cluster should be judged not only by the
clustering algorithm that generated it, but also by an external
assessment criterion.
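The external assessment idea above can be sketched as a simple validity index computed independently of whichever algorithm produced a partition: here, the mean within-cluster distance divided by the mean between-cluster distance, where lower is better. The data and labellings are purely illustrative.

```python
def validity_index(points, labels):
    """Ratio of mean within-cluster distance to mean between-cluster
    distance for a 1-D dataset; algorithm-independent, lower is better."""
    within, between = [], []
    for j in range(len(points)):
        for k in range(j + 1, len(points)):
            d = abs(points[j] - points[k])
            (within if labels[j] == labels[k] else between).append(d)
    return (sum(within) / len(within)) / (sum(between) / len(between))

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
good = [0, 0, 0, 1, 1, 1]   # matches the natural grouping
bad  = [0, 1, 0, 1, 0, 1]   # mixes the two groups
print(validity_index(points, good) < validity_index(points, bad))  # True
```

Because the index never consults the objective function of the algorithm that produced the labels, it can rank partitions coming from different clustering algorithms on an equal footing, which is exactly what the second step of multi-objective clustering requires.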
IV. COMBINING CLUSTER ENSEMBLES WITH MULTI-OBJECTIVE CLUSTERING
The conventional application of cluster analysis to explore a dataset focuses on discovering the single structure that best fits the data. In such a case, several clustering algorithms are applied to the data, obtaining different structures, and a validation method is then applied to select the structure that best fits the data. However, the search for only one best-fitting structure limits the amount of knowledge that can be obtained. Moreover, most validation measures are biased towards a given clustering criterion.
As an attempt to overcome these limitations, several
multi-objective clustering and cluster ensemble methods have
been proposed [4, 7, 6]. The multi-objective approach offers
a set of alternative structures that could represent different
interpretations of the data. However, as the number of
alternatives increases, the analysis becomes harder [4].
Motivated by this context, we propose an algorithm that: (1) provides a robust way to deal with data containing different types of clusters, and (2) finds a concise set of alternative structures for the same data. Our method combines ideas from cluster ensembles and multi-objective clustering. More specifically, we first generate a set of initial partitions by applying several different clustering algorithms to the data. Next, we combine and select partitions by applying a Pareto-based multi-objective genetic algorithm.
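The Pareto-based selection step can be sketched as follows: each candidate partition is scored under several clustering objectives, and only the non-dominated candidates are kept. This shows just the dominance filter, not the full genetic algorithm; the partition names and objective scores are illustrative assumptions (both objectives are taken as minimized).

```python
def dominates(a, b):
    """a dominates b if a is no worse on every objective and
    strictly better on at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """candidates: dict mapping partition name -> objective-score tuple.
    Returns the names of all non-dominated partitions."""
    return [
        name for name, score in candidates.items()
        if not any(dominates(other, score)
                   for o_name, other in candidates.items() if o_name != name)
    ]

scores = {
    "kmeans_k3":    (0.20, 0.90),  # (compactness, connectedness), lower is better
    "single_link":  (0.80, 0.10),
    "average_link": (0.40, 0.30),
    "bad_run":      (0.90, 0.95),
}
print(pareto_front(scores))  # bad_run is dominated and dropped
```

The surviving front is exactly the "concise set of alternative structures" the text describes: each member is best under some trade-off between the objectives, so no single clustering criterion is privileged.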
V. CONCLUSION
Clustering ensembles have emerged as a prominent method for improving the robustness, stability, and accuracy of unsupervised classification solutions, and numerous works have contributed to finding consensus clusterings. The main characteristics of our approach are: (i) Stability: a very similar set of solutions is obtained every time the algorithm is run on the same dataset with the same initial configuration; also, the same best solutions with respect to each known structure always appear in the solution set. (ii) Concision: the number of partitions in the solution set is small enough to be analyzed by domain experts. (iii) Robustness: it produces partitions of high quality for a variety of different data structures and properties. (iv) It reveals a number of distinct structures present in a dataset. (v) The best partitions revealed for each known structure have a number of clusters closer to the true one than the other techniques analyzed. (vi) Its application does not require fine-tuning the algorithms' parameters for different datasets; the only parameter values that differ across datasets depend on the dataset size. Thus, MOCLE's parameters are easily adjusted by the user, without any additional knowledge of the algorithm or the data.
REFERENCES
[1] Venkatadri M. and Hanumat G. Sastry, "Genetic Programming in Data Mining Tasks," International Journal of Advanced Research in Computer Science, Vol. 3, No. 2, March-April 2012.
[2] Martin H. C. Law, Alexander P. Topchy, and Anil K. Jain, "Multiobjective Data Clustering," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
[3] S. Sarumathi, N. Shanthi, and G. Santhiya, "A Survey of Cluster Ensemble," International Journal of Computer Applications (0975-8887), Vol. 65, No. 9, March 2013.
[4] J. Handl and J. Knowles, "Exploiting the Trade-off: The Benefits of Multiple Objectives in Data Clustering," in EMO 2005, LNCS 3410, pp. 547-560, Springer-Verlag, 2005.
[5] Reza Ghaemi, Nasir bin Sulaiman, Hamidah Ibrahim, and Norwati Mustapha, "A Review: Accuracy Optimization in Clustering Ensembles Using Genetic Algorithms," Springer Science+Business Media B.V., 2010.
[6] M. Law, A. Topchy, and A. K. Jain, "Multiobjective Data Clustering," in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Vol. 2, pp. 424-430, 2004.
[7] A. Strehl and J. Ghosh, "Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions," Journal of Machine Learning Research, 3:583-617, 2002.
[8] Jay Prakash and P. K. Singh, "An Effective Multiobjective Approach for Hard Partitional Clustering," Springer-Verlag London Limited, 2009.