Data Mining Using Synergies Between Self-Organizing Maps and
Inductive Learning of Fuzzy Rules

MARIO DROBICS, ULRICH BODENHOFER, WERNER WINIWARTER
Software Competence Center Hagenberg
A-4232 Hagenberg, Austria
{mario.drobics/ulrich.bodenhofer/werner.winiwarter}

ERICH PETER KLEMENT
Fuzzy Logic Laboratorium Linz-Hagenberg
Johannes Kepler Universität Linz
A-4040 Linz, Austria

0-7803-7078-3/01/$10.00 (C)2001 IEEE.

Abstract

Identifying structures in large data sets raises a number of problems. On the one hand, many methods cannot be applied to larger data sets, while, on the other hand, the results are often hard to interpret. We address these problems by a novel three-stage approach. First, we compute a small representation of the input data using a self-organizing map. This reduces the amount of data and allows us to create two-dimensional plots of the data. Then we use this preprocessed information to identify clusters of similarity. Finally, inductive learning methods are applied to generate sets of fuzzy descriptions of these clusters. This approach is applied to three case studies, including image data and real-world data sets. The results illustrate the generality and intuitiveness of the proposed method.

1. Introduction

In this paper, we address the problem of identifying and describing regions of interest in large data sets. Decision trees [17] or other inductive learning methods which are often used for this task are not always sufficient, or not even applicable. Clustering methods [2], on the other hand, do not offer sufficient insight into the data structure, as their results are often hard to display and interpret. This contribution is devoted to a novel three-stage fault-tolerant approach to data mining which is able to produce qualitative information from very large data sets. In the first stage, self-organizing maps (SOMs) [13] are used to compress the data to a reasonable number of nodes which still contain all significant information
while eliminating possible data faults such as noise, outliers, and missing values. While, for the purpose of data reduction, other methods may be used as well [7], SOMs have the advantage that they preserve the topology of the data space and allow the results to be displayed in two-dimensional plots [6]. The second step is concerned with identifying significant regions of interest within the SOM nodes using a modified fuzzy c-means algorithm [3, 10]. In our variant, the two main drawbacks of the standard fuzzy c-means algorithm are resolved in a very simple and elegant way. Firstly, the clustering is no longer influenced by outliers, as they have been eliminated by the SOM in the preprocessing step. Secondly, initialization is handled by first computing a crisp Ward clustering [2] to find initial values for the cluster centers. In the third and last stage, we create linguistic descriptions of the centers of these clusters which help the analyst to interpret the results. As the number of data samples under consideration has been reduced tremendously by using the SOM nodes, we are able to apply inductive learning methods [16] to find fuzzy descriptions of the clusters. Using the previously found clusters, we can make use of supervised learning methods in an unsupervised environment by considering the cluster membership as the goal parameter. Descriptions are composed using fuzzy predicates of the form "x is/is not/is at least/is at most A", where x is the parameter under consideration and A is a linguistic expression modeled by a fuzzy set. We applied this method to several real-world data sets. Through the combination of the two-dimensional projection of the data using the SOM and the fuzzy descriptions generated, the results are not only significant and accurate, but also very intuitive.
2. Preprocessing

To reduce computational effort without losing significant information, we use self-organizing maps to create a mapping of the data space. Self-organizing maps use unsupervised competitive learning to create a topology-preserving map of the data. Let us assume that we are given a data set consisting of K samples I = {x^1, ..., x^K} from an n-dimensional real space X ⊆ R^n. A SOM is defined as a mapping from the data space X to an m-dimensional discrete output space C (most often m = 2). The output space consists of a finite number N_C of neurons. To each neuron s ∈ C of the output space, we associate a parametric weight vector w^s = (w_1^s, ..., w_n^s)^T ∈ X in the input space. The mapping

    φ : X → C,  x ↦ φ(x)

defines the neural map from the input space X to the topological structure C; φ(x) is defined as

    φ(x) = argmin_i { ‖x − w^i‖_V },

with

    ‖x‖_V² = Σ_{r=1}^{n} (m_r x_r²) / (V_r Σ_{l=1}^{n} m_l),

where m_r denotes an individual weight parameter which allows us to take the different importances of the input parameters into account, and V_r stands for the variance of the values (x_r^1, ..., x_r^K). In other words, the function φ maps every input sample to that neuron of the map whose weight vector is closest to the input with respect to the distance measure ‖.‖_V. The goal of the learning process is to reduce the overall distortion error.

An additional benefit of this preprocessing step is noise reduction. As the weight vector of each node represents the average of several data records, outliers and small variations in the input are eliminated.

3. Clustering

Now we try to identify possible regions of interest in the map. Regions of interest are characterized by nodes whose weight vectors are close to each other in the data space. Additionally, we use the information provided by the map. In doing so, clusters are generated that are not only homogeneous with respect to the input space, but also within the map.
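As a hedged illustration of the best-matching-unit search defined in Section 2, the following sketch finds the node minimizing the weighted, variance-normalized distance ‖.‖_V. The toy data, the equal importance weights m_r, and the sampling of node weights from the data are assumptions of the example, not part of the paper:

```python
import numpy as np

def bmu(x, W, m, V):
    """Index of the node whose weight vector is closest to x w.r.t. the
    weighted norm ||z||_V^2 = sum_r m_r * z_r^2 / (V_r * sum_l m_l)."""
    diff = W - x                                   # (N, n) differences to all nodes
    d2 = (m * diff**2 / V).sum(axis=1) / m.sum()   # squared weighted distances
    return int(np.argmin(d2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                   # toy data set, K = 100, n = 4
W = X[rng.choice(100, size=25, replace=False)]  # 25 node weight vectors (stand-in for a trained SOM)
m = np.ones(4)                                  # equal feature importances m_r
V = X.var(axis=0)                               # per-feature variances V_r
print(bmu(X[0], W, m, V))
```

A trained SOM would adapt W iteratively during competitive learning; here the node weights are simply sampled from the data to keep the sketch self-contained.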
For details about the SOM learning algorithm, see [13]. The mapping provided by a SOM does not only create a set of weight vectors; it furthermore preserves the topology of the data space. As the map has a fixed dimension of m = 2, it is possible to plot each parameter (or cluster) with respect to the map in a very straightforward manner [6]. For our purposes, we will use nets with several hundred nodes. Through this large number of nodes, the overall distortion error can be kept small; however, training may take longer.

To find regions of interest, we use a modified fuzzy c-means clustering [3]. Using this algorithm, a fuzzy segmentation of the data space is created. For initializing the fuzzy clustering, we first perform a crisp Ward clustering [2]. Ward clustering is an agglomerative clustering method, which is, however, often not applicable due to its high computational effort. Through the data reduction done in the preprocessing step, we are not only able to apply this clustering method, but also to take the topological information of the map into account. This is accomplished by merging only nodes/clusters which are neighbors in the map. The number of regions of interest L may be fixed a priori or determined by the Ward clustering initialization procedure.

To compute the final fuzzy clustering, we use a modified distance measure which also takes the weights and the topological structure of the map into account:

    d(x, y) = d_R(x, y)^(h(φ(x), φ(y))^β),

with the raw distance

    d_R(x, y) = sqrt( Σ_{r=1}^{n} m_r ( |x_r − y_r| / (max_k w_r^k − min_k w_r^k) )² / Σ_{r=1}^{n} m_r ),

where h(i, j) denotes the neighborhood relation in C, i.e. a scaling factor from the unit interval which rates the distance in the discrete topological structure of the map. Note that, due to the division by max_k w_r^k − min_k w_r^k, d_R(x, y) is always a value between 0 and 1 if all values x_r, y_r are in the interval [min_k w_r^k, max_k w_r^k].
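A minimal sketch of this modified distance measure, assuming a Gaussian neighborhood kernel h on the map coordinates (the paper only requires h to be a scaling factor from the unit interval; the kernel width and the sample vectors below are made up):

```python
import numpy as np

def raw_distance(x, y, m, lo, hi):
    """d_R(x, y): weighted mean of squared range-normalized differences, in [0, 1]."""
    d2 = m * ((np.abs(x - y) / (hi - lo)) ** 2)
    return np.sqrt(d2.sum() / m.sum())

def neighborhood(p, q, sigma=1.0):
    """h(p, q): assumed Gaussian kernel on map coordinates, values in (0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.exp(-((p - q) ** 2).sum() / (2 * sigma**2)))

def modified_distance(x, y, px, py, m, lo, hi, beta=2.0):
    """d(x, y) = d_R(x, y)^(h(px, py)^beta); px, py are the map positions of the nodes."""
    dR = raw_distance(x, y, m, lo, hi)
    h = neighborhood(px, py)
    return dR ** (h ** beta)

m = np.ones(3)                         # equal feature weights m_r
lo, hi = np.zeros(3), np.ones(3)       # per-feature value ranges
x, y = np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.1, 0.9])
# same map node (h = 1): the distance equals the raw distance;
# distant nodes (h near 0): the raw distance is stretched toward 1
print(modified_distance(x, y, (0, 0), (0, 0), m, lo, hi))
print(modified_distance(x, y, (0, 0), (5, 5), m, lo, hi))
```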
The exponent h(φ(x), φ(y))^β corresponds to the influence of the neighborhood relation h(φ(x), φ(y)); the smaller h(φ(x), φ(y)) is (i.e. the larger the distance between node φ(x) and node φ(y) in the map is), the more strongly the raw distance d_R(x, y) is stretched. The factor β controls how strong the influence of the topological structure is (we use β = 2).

The fuzzy clustering is computed using the usual iterative two-step update procedure. First, the degree of membership to which node i belongs to cluster k, denoted C(i, k), is computed as

    C(i, k) = a(i, k) / Σ_{l=1}^{L} a(i, l),  with  a(i, k) = e^(−α · d(w^i, v^k)).

The factor α ≥ 1 determines the fuzziness of the clustering (1 corresponds to a very fuzzy clustering, while, e.g., a value of 100 gives very crisp clusters; we use α = 40). Then the cluster centers v^k are updated by computing the weighted means of all nodes [10]:

    v^k = Σ_i C(i, k) · w^i / Σ_i C(i, k).

This procedure is repeated until the change of the cluster centers is lower than a predefined bound ε.

4. Cluster Descriptions

The final step is concerned with finding descriptions of the regions of interest determined in the clustering step. Depending on the underlying context of the variable under consideration, we assign natural language expressions like very very low, medium, large, etc. to each variable. These expressions are modeled by fuzzy sets which can either be defined by hand or automatically. When defining the fuzzy sets, in particular when they are generated automatically, it is important to preserve the semantic context of the natural language expressions (e.g. that low corresponds to smaller values than high) in order to achieve interpretable descriptions [5].

To find appropriate fuzzy segmentations of the input domains according to the distribution of the sample data automatically, we use a simple clustering-based algorithm. The cluster centers are first initialized by dividing the input range into a certain number of equally sized intervals. Then these centers are iteratively updated using a modified c-means algorithm [14] with simple neighborhood interaction to preserve the ordering of the clusters. Usually, a few iterations (≤ 10) are sufficient for each domain. The fuzzy sets are then created as trapezoidal membership functions around the centers computed previously.

A fuzzy set A induces a fuzzy predicate t(x is A) = μ_A(x), where t is the function which assigns a truth value to the assertion "x is A". In a straightforward way, a fuzzy set A also induces the following fuzzy predicates [4]:

    t(x is not A) = 1 − μ_A(x),
    t(x is at least A) = sup{ μ_A(y) | y ≤ x },
    t(x is at most A) = sup{ μ_A(y) | y ≥ x }.

Logical operations on the truth values are defined using triangular norms and conorms, i.e. commutative, associative, and non-decreasing binary operations on the unit interval with neutral elements 1 and 0, respectively [12]. In our applications, we restricted ourselves to the Łukasiewicz operations:

    T_L(x, y) = max(x + y − 1, 0),  S_L(x, y) = min(x + y, 1).

Conjunctions and disjunctions of fuzzy predicates can then be defined as follows:

    t(A(x) ∧ B(x)) = T_L(t(A(x)), t(B(x))),  t(A(x) ∨ B(x)) = S_L(t(A(x)), t(B(x))).

Since we want to describe the class membership of the samples, we are interested in fuzzy rules of the form IF A(x) THEN C(x) (for convenience, we refer to such a rule as A → C, although the rule should rather be regarded as a conditional assignment than as a logical implication), where C is a fuzzy predicate that describes the class membership and A may be a compound fuzzy predicate composed of several atomic predicates, e.g.

    IF (age(x) is high) ∧ (size(x) is at least high) THEN (class(x) is 3).

The degree of fulfillment of such a statement for a given sample x is defined as t(A(x) ∧ C(x)) (corresponding to the so-called Mamdani inference [15]). The fuzzy set of samples fulfilling a predicate A, which we denote by A(I), is defined as μ_{A(I)}(x) = t(A(x)) for x ∈ I. Let us define the cardinality of a fuzzy set B of input samples as the sum of μ_B(x), i.e.

    |B| = Σ_{x ∈ I} μ_B(x).

Finally, the cardinality of samples fulfilling a rule A → C can be defined by

    |A(I) ∩ C(I)| = Σ_{x ∈ I} t(A(x) ∧ C(x)) = Σ_{x ∈ I} T_L(t(A(x)), t(C(x))).

To find the most accurate and significant descriptions, we use FS-FOIL [8], a modification of Quinlan's FOIL algorithm [18] for fuzzy rules.
It creates a stepwise coverage of each cluster such that not only spheric, but arbitrary clusters can be handled. We compute descriptions of the individual clusters independently; therefore, the goal predicate C is fixed, and a description rule A → C is uniquely characterized by its antecedent predicate A.
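The ordering-based predicates and Łukasiewicz operations above can be sketched as follows; the triangular membership function and the sampled domain are illustrative assumptions:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular fuzzy set with support [a, c] and peak at b (assumed shape)."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def at_least(mu):
    """t(x is at least A) = sup{mu_A(y) | y <= x}: running sup from the left."""
    return np.maximum.accumulate(mu)

def at_most(mu):
    """t(x is at most A) = sup{mu_A(y) | y >= x}: running sup from the right."""
    return np.maximum.accumulate(mu[::-1])[::-1]

def t_norm(x, y):
    """Lukasiewicz conjunction T_L(x, y) = max(x + y - 1, 0)."""
    return np.maximum(x + y - 1.0, 0.0)

def s_norm(x, y):
    """Lukasiewicz disjunction S_L(x, y) = min(x + y, 1)."""
    return np.minimum(x + y, 1.0)

x = np.linspace(0.0, 1.0, 101)      # sampled input domain
mu_high = tri(x, 0.5, 0.8, 1.1)     # "high", peaking at 0.8
print(at_least(mu_high)[-1])        # "at least high" stays 1 beyond the peak
print(t_norm(0.7, 0.6), s_norm(0.7, 0.6))
```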
The basic algorithm can be summarized as follows:

    open nodes = cluster;
    S_i = ∅;
    start with the most general predicate {∅}
    do {
        expand the best n previous predicates
        if a rule A → C is accurate & significant {
            add predicate A to rule set S_i
            remove covered nodes from open nodes
            go on with the most general predicate {∅}
        }
        prune predicates
    } while open nodes ≠ ∅;

In detail, the expansion of a predicate A is performed by appending a new atomic fuzzy predicate B by means of conjunction. To decide which predicates to expand, an entropy-based information gain measure is used to define a ranking of the set of predicates:

    G = |A(I) ∩ C(I)| · ( log₂( |A(I) ∩ C(I)| / |A(I)| ) − log₂( |C(I)| / |I| ) ).

Rules of interest have to be significant and accurate. The significance of a rule A → C is defined by its support,

    supp(A → C) = |A(I) ∩ C(I)| / |I|,

the accuracy by its confidence,

    conf(A → C) = supp(A → C) / supp(A),  where  supp(A) = |A(I)| / |I|.

We consider only rules which satisfy reasonable requirements in terms of support and confidence. This is achieved by two lower bounds supp_min and conf_min, respectively [1]. As, for all predicates A, B, and C, supp(A → C) ≥ supp(A ∧ B → C), we can prune the expanded set of predicates by removing all predicates X for which supp(X) < supp_min. The final degree of membership of a sample x ∈ X to cluster i with respect to the rule set S_i is given by the Łukasiewicz disjunction of the degrees of fulfillment of all rules in S_i:

    S_L over all A ∈ S_i of t(A(x)).

5. Results

Now we will present three results with very different data sets and purposes, demonstrating the flexibility and general applicability of the proposed approach.

5.1. Iris data

We used the well-known Iris data set according to Fisher [9] to illustrate the ability of our approach to identify hidden structures in a data set. The data consist of three classes of flowers, where each flower is characterized by four attributes. Each class is represented by 50 samples. We generated a SOM with 10 × 10 nodes and a set of 4 clusters to separate the test data. Finally, a compact set of nine intuitive fuzzy rules was created by the above algorithm. Applying this set of rules to the original data resulted in 10.6% misclassified and 7.3% unclassified samples. Note that, throughout the clustering and labeling process, the goal information has not been taken into account. This means that the proposed method is able to label the goal classes with a correctness of over 80% without even knowing the correct classification. In the original Iris data set, one goal class is perfectly separated from the others. The remaining two classes, however, have a certain region of overlap in the data space. As a detailed analysis has shown, three of the four generated clusters can be directly associated with one of the goal classes. The fourth cluster exactly contained the region of overlap, which demonstrates that our method is able to preserve the topology of the input space and to detect hidden structures in the data.

5.2. Image segmentation

A typical application of clustering in computer vision is image segmentation; therefore, it seemed interesting to apply our three-stage approach to this problem. Moreover, the possibility to describe segments with natural language expressions gives rise to completely new opportunities in image understanding. First experiments using pixel coordinates and RGB values showed good results in terms of accuracy; however, the descriptions were rarely intuitive. In order to overcome this problem, we added HSL features (hue, saturation, and lightness), which match the way humans perceive color much more closely, in the final description step. Since humans have a certain understanding of colors and intensities, it is more appropriate to assume predefined fuzzy predicates corresponding to natural language terms describing these properties.
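The HSL feature extraction just mentioned can be sketched with Python's standard colorsys module, which works on RGB values scaled to [0, 1] and returns hue, lightness, and saturation in that order; the sample pixel values are made up:

```python
import colorsys

def hsl_features(r, g, b):
    """Return (hue, saturation, lightness) in [0, 1] for one 8-bit RGB pixel.

    Note that colorsys.rgb_to_hls returns (h, l, s), not (h, s, l),
    so the components are reordered here.
    """
    h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
    return h, s, l

sky = hsl_features(90, 140, 230)      # a blue-ish pixel: hue falls in the blue band
snow = hsl_features(240, 245, 250)    # a very light pixel: lightness close to 1
print(sky)
print(snow)
```

Features like these make descriptions such as "[lightness <= dark]" meaningful independently of the particular image.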
In this way, moreover, it can be avoided that the meaning of the descriptions depends on the image under consideration. The example in Fig. 1 shows an image with 170 × 256 pixels, i.e. we have K = 43520 samples with n = 8 features. First, the image was mapped onto a SOM with 10 × 10 nodes. Then four clusters and the corresponding descriptions were computed. On the right-hand side, Fig. 1 shows the computed segmentation. Figure 2 shows the
final rule sets and the cross-validation table which illustrates how well the clusters are described by the rule sets.

Figure 1. Original image (left) and its segmentation (right)

Figure 2. Rule sets for the image segmentation and cross-validation table; each cluster is covered by its own rule set with 97-99%, and the total error is 0.41% (+0.43% missing)

The rule sets in Fig. 2 can be interpreted as follows: the first rule set corresponds to the blue sky, the second rule set describes the black pants, the snow is classified by the third rule set, and, finally, the jackets are identified by the fourth rule set.

5.3. Yeast data set

As a third test case, we used the UCI yeast data set due to K. Nakai. The data set consists of K = 1484 samples with n = 8 attributes. Each sample belongs to one of ten classes. The data set has a very complex inner structure and a lot of inconsistencies. Attempts to solve this classification problem with ad hoc structured probability models, binary decision trees, or Bayesian classifier methods were only able to achieve an accuracy of approximately 55% [11]. We computed a SOM with 20 × 20 nodes, which were then grouped into 5 clusters (Fig. 3 shows plots of the class memberships over the SOM). Finally, a set of descriptions for each cluster was created.

Figure 3. Clusters in the yeast data set

To validate our results, we compared the original classification with our clusters and rules. Although we used fewer clusters than goal classes (which were not taken into account), the overall number of samples assigned to the wrong goal class was relatively small (cf. Fig. 4).

Figure 4. Cross-validation table (goal classes CYT, NUC, MIT, ME3, ME2, ME1, EXC, VAC, POX, and ERL vs. clusters #0-#4); brackets indicate cluster-class assignments; total error: 37% (+3% missing)

The rule sets created for each cluster are shown in Fig. 5. Applying these rules to the original data and comparing the result with the cluster memberships, an average error of 10.38% with 53.63% missing occurred. The high number of missing samples is due to the complex structure of the clusters (see Fig. 3).

Figure 5. Generated rule sets for the yeast data set

6. Conclusion

In this paper, we have shown how the combination of different methods (SOMs, clustering, and inductive learning) can minimize the shortcomings of the individual methods. Through the generic approach, our methods can be applied to a wide variety of problems, including classification, image segmentation, and data mining. The modular design allows individual components to be exchanged on demand, or even one step to be omitted; e.g., when labeled data is available, it is possible to use this information as the goal parameter in the inductive learning step, instead of performing unsupervised clustering first.

Acknowledgements

This work has been done in the framework of the Kplus Competence Center Program which is funded by the Austrian Government, the Province of Upper Austria, and the Chamber of Commerce of Upper Austria.

References

[1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 207-216, Washington, DC, 1993.
[2] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
[3] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
[4] U. Bodenhofer. The construction of ordering-based modifiers. In G. Brewka, R. Der, S. Gottwald, and A. Schierwagen, editors, Fuzzy-Neuro Systems '99, pages 55-62. Leipziger Universitätsverlag, 1999.
[5] U. Bodenhofer and P. Bauer. Towards an axiomatic treatment of "interpretability". In Proc. IIZUKA2000, pages 334-339, Iizuka, October 2000.
[6] G. Deboeck and T. Kohonen (Eds.). Visual Explorations in Finance with Self-Organizing Maps. Springer, London, 1998.
[7] C. Diamantini and M. Panti. An efficient and scalable data compression approach to classification. ACM SIGKDD Explorations, 2:49-55, 2000.
[8] M. Drobics, W. Winiwarter, and U. Bodenhofer. Interpretation of self-organizing maps with fuzzy rules. In Proc. 12th IEEE Int. Conf. on Tools with Artificial Intelligence (ICTAI'00), pages 304-311. IEEE Press, 2000.
[9] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.
[10] F. Hoeppner, F. Klawonn, R. Kruse, and T. A. Runkler. Fuzzy Cluster Analysis: Methods for Image Recognition, Classification, and Data Analysis. John Wiley & Sons, Chichester, 1999.
[11] P. Horton and K. Nakai. A probabilistic classification system for predicting the cellular localization sites of proteins. In D. J. States, P. Agarwal, T. Gaasterland, L. Hunter, and R. Smith, editors, Proc. 4th Int. Conf. on Intelligent Systems for Molecular Biology, pages 109-115. AAAI Press, Menlo Park, 1996.
[12] E. P. Klement, R. Mesiar, and E. Pap. Triangular Norms, volume 8 of Trends in Logic. Kluwer Academic Publishers, Dordrecht, 2000.
[13] T. Kohonen. Self-Organizing Maps, volume 30 of Information Sciences. Springer, Berlin, second extended edition, 1997.
[14] Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Trans. Commun., 28(1):84-95, 1980.
[15] E. H. Mamdani and S. Assilian. An experiment in linguistic synthesis of fuzzy controllers. Int. J. Man-Mach. Stud., 7:1-13, 1975.
[16] R. S. Michalski, I. Bratko, and M. Kubat. Machine Learning and Data Mining. John Wiley & Sons, Chichester, 1998.
[17] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.
[18] J. R. Quinlan. Learning logical definitions from relations. Machine Learning, 5(3):239-266, 1990.