Data Mining Using Synergies Between Self-Organizing Maps and
Inductive Learning of Fuzzy Rules
MARIO DROBICS, ULRICH BODENHOFER, WERNER WINIWARTER
Software Competence Center Hagenberg
A-4232 Hagenberg, Austria
{mario.drobics, ulrich.bodenhofer, werner.winiwarter}@scch.at
ERICH PETER KLEMENT
Fuzzy Logic Laboratorium Linz-Hagenberg
Johannes Kepler Universität Linz
A-4040 Linz, Austria
klement@flll.uni-linz.ac.at
Abstract
Identifying structures in large data sets raises a number of problems. On the one hand, many methods cannot be applied to larger data sets, while, on the other
hand, the results are often hard to interpret. We address
these problems by a novel three-stage approach. First,
we compute a small representation of the input data using a self-organizing map. This reduces the amount of
data and allows us to create two-dimensional plots of
the data. Then we use this preprocessed information to
identify clusters of similarity. Finally, inductive learning methods are applied to generate sets of fuzzy descriptions of these clusters. This approach is applied to
three case studies, including image data and real-world
data sets. The results illustrate the generality and intuitiveness of the proposed method.
1. Introduction
In this paper, we address the problem of identifying and
describing regions of interest in large data sets. Decision trees [17] or other inductive learning methods which are often used for this task are not always sufficient, or not even applicable. Clustering methods [2],
on the other hand, do not offer sufficient insight into the
data structure as their results are often hard to display
and interpret.
This contribution is devoted to a novel three-stage fault-tolerant approach to data mining which is able to produce qualitative information from very large data sets.
In the first stage, self-organizing maps (SOMs) [13]
are used to compress the data to a reasonable amount of nodes which still contain all significant information while eliminating possible data faults, such as noise, outliers, and missing values. While, for the purpose of data reduction, other methods may be used as well [7], SOMs have the advantages that they preserve the topology of the data space and allow displaying the results in two-dimensional plots [6].

0-7803-7078-3/01/$10.00 © 2001 IEEE.
The second step is concerned with identifying significant regions of interest within the SOM nodes using a
modified fuzzy c-means algorithm [3, 10]. In our variant, the two main drawbacks of the standard fuzzy c-means algorithm are resolved in a very simple and elegant way. Firstly, the clustering is not influenced by outliers anymore, as they have been eliminated by the SOM in the pre-processing step. Secondly, initialization
is handled by computing a crisp Ward clustering [2] first
to find initial values for the cluster centers.
In the third and last stage, we create linguistic descriptions of the centers of these clusters which help the analyst to interpret the results. As the number of data
samples under consideration has been reduced tremendously by using the SOM nodes, we are able to apply
inductive learning methods [16] to find fuzzy descriptions of the clusters. Using the previously found clusters, we can make use of supervised learning methods in
an unsupervised environment by considering the cluster
membership as goal parameter. Descriptions are composed using fuzzy predicates of the form "x is/is not/is
at least/is at most A", where x is the parameter under
consideration, and A is a linguistic expression modeled
by a fuzzy set.
We applied this method to several real-world data sets.
Through the combination of the two-dimensional projection of the data using the SOM and the fuzzy descriptions generated, the results are not only significant and
accurate, but also very intuitive.
2. Preprocessing
To reduce computational effort without losing significant information, we use self-organizing maps to create a mapping of the data space. Self-organizing maps use unsupervised competitive learning to create a topology-preserving map of the data.
Let us assume that we are given a data set I consisting of K samples I = {x^1, ..., x^K} from an n-dimensional real space X ⊆ ℝ^n. A SOM is defined as a mapping from the data space X to an m-dimensional discrete output space C (most often m = 2). The output space consists of a finite number N_C of neurons. To each neuron s ∈ C of the output space, we associate a parametric weight vector in the input space w^s = (w_1^s, ..., w_n^s)^T ∈ X.

The mapping

    φ : X → C
        x ↦ φ(x)

defines the neural map from the input space X to the topological structure C; φ(x) is defined as

    φ(x) = argmin_i { ||x − w^i||_V },

with

    ||x||_V^2 = ( Σ_{r=1}^{n} m_r x_r^2 / V_r ) / ( Σ_{r=1}^{n} m_r ),

where m_r denotes an individual weight parameter which allows us to take different importances of the input parameters into account. The factor V_r stands for the variance of the values (x_r^1, ..., x_r^K). In other words, the function φ maps every input sample to that neuron of the map whose weight vector is closest to the input with respect to the distance measure ||·||_V.
The goal of the learning process is to reduce the overall distortion error, i.e., the neighborhood-weighted sum of squared distances between the input samples and the weight vectors of their winner neurons.
An additional benefit of this preprocessing step is noise
reduction. As the weight vector of each node represents
the average of several data records, outliers and small
variations in the input are eliminated.
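The preprocessing stage above can be sketched in a few lines of Python. This is a minimal online-SOM training loop under assumed choices (a Gaussian neighborhood and linearly decaying radius and learning rate); the function and parameter names are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def som_preprocess(X, grid=(10, 10), epochs=20, m_r=None, seed=0):
    """Minimal online-SOM sketch: maps samples onto a 2-D grid of weight vectors."""
    rng = np.random.default_rng(seed)
    K, n = X.shape
    m_r = np.ones(n) if m_r is None else np.asarray(m_r, float)
    V = X.var(axis=0) + 1e-12                      # feature variances V_r
    gx, gy = grid
    coords = np.stack(np.meshgrid(np.arange(gx), np.arange(gy), indexing="ij"),
                      axis=-1).reshape(-1, 2).astype(float)
    W = X[rng.integers(0, K, gx * gy)].astype(float)  # weight vectors w^s, init from data

    for t in range(epochs):
        sigma = 2.0 * (1 - t / epochs) + 0.5       # shrinking neighborhood radius
        eta = 0.5 * (1 - t / epochs) + 0.01        # shrinking learning rate
        for x in X[rng.permutation(K)]:
            # phi(x): winner node w.r.t. the weighted norm ||x - w||_V
            d2 = (m_r * (x - W) ** 2 / V).sum(axis=1)
            win = np.argmin(d2)
            # Gaussian neighborhood h(i, win) on the 2-D grid
            g2 = ((coords - coords[win]) ** 2).sum(axis=1)
            h = np.exp(-g2 / (2 * sigma ** 2))
            W += eta * h[:, None] * (x - W)        # pull neighboring weights toward x
    return W, coords
```

Each weight vector ends up averaging the samples mapped to its neighborhood, which is exactly the noise-reduction effect described above.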
3. Clustering
Now we try to identify possible regions of interest in
the map. Regions of interest are characterized through
nodes whose weight vectors are close to each other in
the data space. Additionally, we use the information provided by the map. Doing so, clusters are generated that are not only homogeneous with respect to the input space, but also within the map. To find regions of interest, we use a modified fuzzy c-means [3] clustering. Using this algorithm, a fuzzy segmentation of the data space is created.
For initializing the fuzzy clustering, we first perform a crisp Ward clustering [2]. Ward clustering is an agglomerative clustering method which is, however, often not applicable due to its high computational effort. Through the data reduction done in the preprocessing step, we are not only able to apply this clustering method, but to take the topological information
of the map into account, too. This is accomplished by
merging only nodes/clusters which are neighbors in the
map.
The number of regions of interest L may be fixed a priori or determined by the Ward clustering initialization
procedure, too.
To compute the final fuzzy clustering, we use a modified distance measure, which also takes the weights and the topological structure of the map into account:

    d(x, y) = d_R(x, y)^( h(φ(x), φ(y))^β ),

where the raw distance d_R is given by

    d_R(x, y)^2 = ( Σ_{r=1}^{n} m_r ( |x_r − y_r| / (max_k x_r^k − min_k x_r^k) )^2 ) / ( Σ_{r=1}^{n} m_r ),

and h(i, j) denotes the neighborhood relation in C, i.e., a scaling factor from the unit interval which rates the distance in the discrete topological structure of the map. For details about the learning algorithm, see [13].

Note that, due to the division by max_k x_r^k − min_k x_r^k, d_R(x, y) is always a value between 0 and 1 if all values x_r, y_r are in the interval [min_k x_r^k, max_k x_r^k]. The exponent h(φ(x), φ(y))^β corresponds to the influence of the neighborhood relation h(φ(x), φ(y)); the smaller h(φ(x), φ(y)) is (i.e., the larger the distance of node φ(x) and node φ(y) in the map is), the stronger the raw distance d_R(x, y) is stretched. The factor β controls how strong the influence of the topological structure is (we use β = 2).

The mapping provided by a SOM does not only create a set of weight vectors; it furthermore preserves the topology of the data space. As the map has a fixed dimension of m = 2, it is possible to plot each parameter (or cluster) with respect to the map in a very straightforward manner [6]. For our purposes, we use nets with several hundred nodes. Through this large number of nodes, the overall distortion error can be kept small; however, training may take longer.
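A direct transcription of this modified distance measure might look as follows. The Gaussian form of the neighborhood relation h and the helper names are assumptions for illustration (the paper only requires h to be a scaling factor from the unit interval):

```python
import numpy as np

def raw_distance(x, y, m_r, lo, hi):
    """Range-normalized, feature-weighted raw distance d_R(x, y) in [0, 1].
    lo/hi are the per-feature minima/maxima over the data (min_k x_r^k, max_k x_r^k)."""
    z = (np.abs(x - y) / (hi - lo)) ** 2
    return np.sqrt((m_r * z).sum() / m_r.sum())

def som_distance(x, y, ix, iy, coords, m_r, lo, hi, beta=2.0, sigma=2.0):
    """Modified distance d(x, y) = d_R(x, y) ** (h(phi(x), phi(y)) ** beta).

    ix, iy are the grid indices phi(x), phi(y); coords holds the 2-D grid
    positions of all nodes. h is modeled here as a Gaussian of the grid distance."""
    g2 = ((coords[ix] - coords[iy]) ** 2).sum()
    h = np.exp(-g2 / (2 * sigma ** 2))          # neighborhood relation h in (0, 1]
    dR = raw_distance(x, y, m_r, lo, hi)
    return dR ** (h ** beta)
```

Because d_R lies in [0, 1], a smaller exponent (distant map nodes) pushes the value toward 1, i.e., stretches the raw distance exactly as described above.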
The fuzzy clustering is computed using the usual iterative two-step update procedure. First, the degree of membership to which node i belongs to cluster k, denoted C(i, k), is computed as

    C(i, k) = a(i, k) / Σ_{l=1}^{L} a(i, l),   with   a(i, k) = e^(−α · d(w^i, v^k)).

The factor α ≥ 1 determines the fuzziness of the clustering (1 corresponds to a very fuzzy clustering, while, e.g., a value of 100 gives very crisp clusters; we use α = 40). Then the cluster centers v^k are updated by computing the weighted means of all nodes [10]:

    v^k = ( Σ_i C(i, k) · w^i ) / ( Σ_i C(i, k) ).

This procedure is repeated until the change of the cluster centers is lower than a predefined bound ε.

4. Cluster Descriptions

The final step is concerned with finding descriptions of the regions of interest determined in the clustering step. Depending on the underlying context of the variable under consideration, we assign natural language expressions like very very low, medium, large, etc. to each variable. These expressions are modeled by fuzzy sets which can either be defined by hand or automatically. When defining the fuzzy sets, in particular when they are generated automatically, it is important to preserve the semantic context of the natural language expressions (e.g. that low corresponds to smaller values than high) in order to achieve interpretable descriptions [5].

To find appropriate fuzzy segmentations of the input domains according to the distribution of the sample data automatically, we use a simple clustering-based algorithm. The cluster centers are first initialized by dividing the input range into a certain number of equally sized intervals. Then these centers are iteratively updated using a modified c-means algorithm [14] with simple neighborhood interaction to preserve the ordering of the clusters. Usually, a few iterations (≤ 10) are sufficient for each domain. The fuzzy sets are then created as trapezoidal membership functions around the centers computed previously.

A fuzzy set A induces a fuzzy predicate

    t(x is A) = μ_A(x),

where t is the function which assigns a truth value to the assertion "x is A". In a straightforward way, a fuzzy set A also induces the following fuzzy predicates [4]:

    t(x is not A) = 1 − μ_A(x)
    t(x is at least A) = sup{ μ_A(y) | y ≤ x }
    t(x is at most A) = sup{ μ_A(y) | y ≥ x }

Logical operations on the truth values are defined using triangular norms and conorms, i.e., commutative, associative, and non-decreasing binary operations on the unit interval with neutral elements 1 and 0, respectively [12]. In our applications, we restricted ourselves to the Łukasiewicz operations:

    T_L(x, y) = max(x + y − 1, 0)
    S_L(x, y) = min(x + y, 1)

Then conjunctions and disjunctions of fuzzy predicates can be defined as follows:

    t(A(x) ∧ B(x)) = T_L(t(A(x)), t(B(x)))
    t(A(x) ∨ B(x)) = S_L(t(A(x)), t(B(x)))

Since we want to describe the class membership of the samples, we are interested in fuzzy rules of the form IF A(x) THEN C(x) (for convenience, we refer to such a rule as A → C, although the rule should rather be regarded as a conditional assignment than as a logical implication), where C is a fuzzy predicate that describes the class membership and A may be a compound fuzzy predicate composed of several atomic predicates, e.g.

    IF (age(x) is high) ∧ (size(x) is at least high)
    THEN (class(x) is 3).

The degree of fulfillment of such a statement for a given sample x is defined as t(A(x) ∧ C(x)) (corresponding to the so-called Mamdani inference [15]). The fuzzy set of samples fulfilling a predicate A, which we denote with A(I), is defined as (for x ∈ I)

    μ_{A(I)}(x) = t(A(x)).

Let us define the cardinality of a fuzzy set B of input samples as the sum of μ_B(x), i.e.

    |B| = Σ_{x∈I} μ_B(x).

Finally, the cardinality of samples fulfilling a rule A → C can be defined by

    |A(I) ∩ C(I)| = Σ_{x∈I} t(A(x) ∧ C(x)) = Σ_{x∈I} T(t(A(x)), t(C(x))).

To find the most accurate and significant descriptions, we use FS-FOIL [8], a modification of Quinlan's FOIL algorithm [18] for fuzzy rules. It creates a stepwise coverage of each cluster such that not only spheric, but arbitrary clusters can be handled. We compute descriptions of the individual clusters independently; therefore, the goal predicate C is fixed, and a description rule A → C is uniquely characterized by its antecedent predicate A.
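The ordering-based predicates and Łukasiewicz operations above can be sketched in a few lines. The trapezoidal constructor and the grid-based evaluation of "at least"/"at most" are illustrative choices, not the paper's implementation:

```python
import numpy as np

def trapezoid(a, b, c, d):
    """Trapezoidal membership function for a linguistic label (e.g. 'high')."""
    def mu(x):
        x = np.asarray(x, float)
        rise = np.clip((x - a) / (b - a), 0.0, 1.0)   # increasing flank
        fall = np.clip((d - x) / (d - c), 0.0, 1.0)   # decreasing flank
        return np.minimum(rise, fall)
    return mu

def at_least(mu, xs):
    """t(x is at least A) = sup{ mu_A(y) | y <= x }, evaluated on a sorted grid xs."""
    return np.maximum.accumulate(mu(xs))

def at_most(mu, xs):
    """t(x is at most A) = sup{ mu_A(y) | y >= x }."""
    return np.maximum.accumulate(mu(xs)[::-1])[::-1]

def t_norm(a, b):
    """Lukasiewicz t-norm T_L (conjunction)."""
    return np.maximum(a + b - 1.0, 0.0)

def t_conorm(a, b):
    """Lukasiewicz t-conorm S_L (disjunction)."""
    return np.minimum(a + b, 1.0)
```

Note that "at least A" yields a non-decreasing membership curve and "at most A" a non-increasing one, which is what preserves the ordering semantics of the labels.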
The basic algorithm can be summarized as follows:

    open_nodes = cluster;
    S_i = ∅;
    start with the most general predicate {∅}
    do {
        expand the best n previous predicates
        if a rule A → C is accurate & significant
        {
            add predicate A to rule set S_i
            remove covered nodes from open_nodes
            go on with the most general predicate {∅}
        }
        prune predicates
    } while open_nodes ≠ ∅;
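As a toy illustration of this greedy coverage loop, the following sketch represents predicates as membership vectors over the samples and ranks candidates by the gain, support, and confidence criteria described in the text. The function `fs_foil_sketch`, its parameters, and the beam width `n_best` are illustrative assumptions, not the actual FS-FOIL implementation [8]:

```python
import numpy as np

def t_and(a, b):
    # Lukasiewicz conjunction, applied elementwise to membership vectors
    return np.maximum(a + b - 1.0, 0.0)

def gain(tA, tC):
    # entropy-based information gain G of antecedent A w.r.t. goal predicate C
    inter = t_and(tA, tC).sum()                      # |A(I) n C(I)|
    if inter <= 0.0 or tA.sum() <= 0.0:
        return -np.inf
    return inter * (np.log2(inter / tA.sum()) - np.log2(tC.sum() / len(tC)))

def fs_foil_sketch(atoms, tC, supp_min=0.05, conf_min=0.8, n_best=3, max_len=3):
    """atoms: dict name -> membership vector of an atomic predicate over all samples;
    tC: membership vector of the goal predicate C (the cluster)."""
    N = len(tC)
    open_c = tC.copy()                               # yet-uncovered part of the cluster
    rules = []
    while open_c.sum() > 0.05 * N and len(rules) < 10:
        beam = [((), np.ones(N))]                    # most general predicate
        best = None
        for _ in range(max_len):
            cand = [(names + (nm,), t_and(tA, atoms[nm]))
                    for names, tA in beam for nm in atoms if nm not in names]
            # prune expanded predicates by minimum support
            cand = [(ns, tA) for ns, tA in cand
                    if t_and(tA, tC).sum() / N >= supp_min]
            if not cand:
                break
            cand.sort(key=lambda c: -gain(c[1], open_c))
            beam = cand[:n_best]
            names, tA = beam[0]
            conf = t_and(tA, tC).sum() / max(tA.sum(), 1e-12)
            if conf >= conf_min:                     # accurate & significant: accept
                best = (names, tA)
                break
        if best is None:
            break
        rules.append(best[0])
        open_c = np.maximum(open_c - best[1], 0.0)   # remove covered samples
    return rules
```

On a crisp toy example where one atom coincides with the goal predicate, the loop accepts that single-atom rule and terminates after one pass.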
In detail, the expansion of a predicate A is performed by appending a new atomic fuzzy predicate B by means of conjunction. To decide which predicates to expand, an entropy-based information gain measure is used to define a ranking of the set of predicates:

    G = |A(I) ∩ C(I)| · ( log_2( |A(I) ∩ C(I)| / |A(I)| ) − log_2( |C(I)| / |I| ) )

Rules of interest have to be significant and accurate. The significance of a rule A → C is defined by its support

    supp(A → C) = |A(I) ∩ C(I)| / |I|,

the accuracy as its confidence

    conf(A → C) = supp(A → C) / supp(A),

where supp(A) is defined as

    supp(A) = |A(I)| / |I|.

We consider only rules which satisfy reasonable requirements in terms of support and confidence. This is achieved by two lower bounds supp_min and conf_min, respectively [1]. As, for all predicates A, B, and C,

    supp(A → C) ≥ supp(A ∧ B → C),

we can prune the expanded set of predicates by removing all predicates X for which supp(X) < supp_min.

The final degree of membership of a sample x ∈ X to cluster i with respect to the rule set S_i is given by

    S_L_{A ∈ S_i} t(A(x)),

i.e., the Łukasiewicz disjunction of the degrees of fulfillment of all rule antecedents in S_i.

5. Results

Now we will present three results with very different data sets and purposes, demonstrating the flexibility and general applicability of the proposed approach.

5.1. Iris data

We used the well-known Iris data set according to Fisher [9] to illustrate the ability of our approach to identify hidden structures in a data set. The data consist of three classes of flowers, where each flower is characterized by four attributes. Each class is represented by 50 samples. We generated a SOM with 10 x 10 nodes and a set of 4 clusters to separate the test data. Finally, a compact set of nine intuitive fuzzy rules was created by the above algorithm.

Applying this set of rules to the original data resulted in 10.6% misclassified and 7.3% unclassified samples. Note that, throughout the clustering and labeling process, the goal information has not been taken into account. This means that the proposed method is able to label the goal classes with a correctness of over 80% without even knowing the correct classification.

In the original Iris data set, one goal class is perfectly separated from the others. The remaining two classes, however, have a certain region of overlap in the data space. As a detailed analysis has shown, three of the four generated clusters can be directly associated with one of the goal classes. The fourth cluster exactly contained the region of overlap, which demonstrates that our method is able to preserve the topology of the input space and to detect hidden structures in the data.
5.2. Image segmentation
A typical application of clustering in computer vision is
image segmentation, therefore, it seemed interesting to
apply our three-stage approach to this problem. Moreover, the possibility to describe segments with natural
language expressions gives rise to completely new opportunities in image understanding.
First experiments using pixel coordinates and RGB values showed good results in terms of accuracy; however, the descriptions were rarely intuitive. In order to overcome this problem, we added HSL features (hue, saturation, and lightness), which match the way humans perceive color much more closely, in the final description step. Since humans have a certain understanding of colors and intensities, it is more appropriate to assume predefined fuzzy predicates corresponding to natural language terms describing these properties. Moreover, in this way it can be avoided that the meaning of the descriptions depends on the image under consideration.
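The HSL features can be obtained with Python's standard library; note that `colorsys` calls the color model HLS and returns lightness before saturation. This helper and its 8-bit scaling convention are an illustrative assumption, not part of the paper's system:

```python
import colorsys

def hsl_features(r, g, b):
    """Convert 8-bit RGB channels to (hue, saturation, lightness), each in [0, 1]."""
    # colorsys.rgb_to_hls expects [0, 1]-scaled channels and returns (h, l, s)
    h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
    return h, s, l
```

For example, pure red maps to hue 0, full saturation, and medium lightness, which matches the intuitive color vocabulary used in the rule descriptions.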
The example in Fig. 1 shows an image with 170 x 256
pixels, i.e. we have K = 43520 samples with n = 8 features. First, it was mapped onto a SOM with 10 x 10
nodes. Then four clusters and the corresponding descriptions were computed. On the right-hand side, Fig.
1 shows the computed segmentation. Figure 2 shows the
final rule sets and the cross validation table, which illustrates how well the clusters are described by the rule sets.

[Figure 1. Original image (left) and its segmentation (right).]

[Figure 2. Rule sets for the image segmentation and the cross validation table (rule sets vs. clusters); total error 0.41% (+0.43% missing).]

The rule sets in Fig. 2 can be interpreted as follows: The first rule set corresponds to the blue sky. The second rule set describes the black pants. The snow is classified by the third rule set. Finally, the jackets are identified by the fourth rule set.

5.3. Yeast data set

As third test case, we used the UCI yeast data set due to K. Nakai. The data set consists of K = 1484 samples with n = 8 attributes. Each sample belongs to one of ten classes. The data set has a very complex inner structure and a lot of inconsistencies. Attempts to solve this classification problem with ad hoc structured probability models, binary decision trees, or Bayesian classifier methods were only able to achieve an accuracy of approximately 55% [11].

We computed a SOM with 20 x 20 nodes, which were then grouped to 5 clusters (Fig. 3 shows plots of the class memberships over the SOM). Finally, a set of descriptions for each cluster was created.

[Figure 3. Clusters in the yeast data set.]

To validate our results, we compared the original classification with our clusters and rules. Although we used fewer clusters than goal classes (which were not taken into account), the overall number of samples assigned to the wrong goal class was relatively small (cf. Fig. 4).

[Figure 4. Cross validation table (goal classes vs. clusters); brackets indicate cluster-class assignments; total error 37% (+3% missing).]

The rule sets created for each cluster are shown in Fig. 5. Applying these rules to the original data, and comparing the result with the cluster memberships, an average error of 10.38% with 53.63% missing occurred. The high number of missing samples is due to the complex structure of the clusters (see Fig. 3).

6. Conclusion

In this paper we have shown how the combination of different methods (SOMs, clustering, and inductive
[Figure 5. Generated rule sets for the yeast data set.]

learning) can minimize the shortcomings of the individual methods. Through the generic approach, our methods can be applied to a wide variety of problems, including classification, image segmentation, and data mining. The modular design allows changing individual components on demand or even omitting one step; e.g., when labeled data is available, it is possible to use this information as the goal parameter in the inductive learning step, instead of performing unsupervised clustering first.

Acknowledgements

This work has been done in the framework of the Kplus Competence Center Program which is funded by the Austrian Government, the Province of Upper Austria, and the Chamber of Commerce of Upper Austria.

References

[1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 207-216, Washington, DC, 1993.
[2] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
[3] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
[4] U. Bodenhofer. The construction of ordering-based modifiers. In G. Brewka, R. Der, S. Gottwald, and A. Schierwagen, editors, Fuzzy-Neuro Systems '99, pages 55-62. Leipziger Universitätsverlag, 1999.
[5] U. Bodenhofer and P. Bauer. Towards an axiomatic treatment of "interpretability". In Proc. IIZUKA2000, pages 334-339, Iizuka, October 2000.
[6] G. Deboeck and T. Kohonen (Eds.). Visual Explorations in Finance with Self-Organizing Maps. Springer, London, 1998.
[7] C. Diamantini and M. Panti. An efficient and scalable data compression approach to classification. ACM SIGKDD Explorations, 2:49-55, 2000.
[8] M. Drobics, W. Winiwarter, and U. Bodenhofer. Interpretation of self-organizing maps with fuzzy rules. In Proc. 12th IEEE Int. Conf. on Tools with Artificial Intelligence (ICTAI'00), pages 304-311. IEEE Press, 2000.
[9] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.
[10] F. Hoeppner, F. Klawonn, R. Kruse, and T. A. Runkler. Fuzzy Cluster Analysis: Methods for Image Recognition, Classification, and Data Analysis. John Wiley & Sons, Chichester, 1999.
[11] P. Horton and K. Nakai. A probabilistic classification system for predicting the cellular localization sites of proteins. In D. J. States, P. Agarwal, T. Gaasterland, L. Hunter, and R. Smith, editors, Proc. 4th Int. Conf. on Intelligent Systems for Molecular Biology, pages 109-115. AAAI Press, Menlo Park, 1996.
[12] E. P. Klement, R. Mesiar, and E. Pap. Triangular Norms, volume 8 of Trends in Logic. Kluwer Academic Publishers, Dordrecht, 2000.
[13] T. Kohonen. Self-Organizing Maps, volume 30 of Information Sciences. Springer, Berlin, second extended edition, 1997.
[14] Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Trans. Commun., 28(1):84-95, 1980.
[15] E. H. Mamdani and S. Assilian. An experiment in linguistic synthesis of fuzzy controllers. Int. J. Man-Mach. Stud., 7:1-13, 1975.
[16] R. S. Michalski, I. Bratko, and M. Kubat. Machine Learning and Data Mining. John Wiley & Sons, Chichester, 1998.
[17] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.
[18] J. R. Quinlan. Learning logical definitions from relations. Machine Learning, 5(3):239-266, 1990.