1. 1. Computational Intelligence for Data Mining. Włodzisław Duch, Department of Informatics, Nicholas Copernicus University, Torun, Poland. With help from R. Adamczak, K. Grąbczewski, K. Grudziński, N. Jankowski, A. Naud. http://www.phys.uni.torun.pl/kmk WCCI 2002, Honolulu, HI
  2. 2. Group members
3. 3. Plan
- What this tutorial is about: how to discover knowledge in data; how to create comprehensible models of data; how to evaluate new data.
- AI, CI & Data Mining
- Forms of useful knowledge
- GhostMiner philosophy
- Exploration & Visualization
- Rule-based data analysis
- Neurofuzzy models
- Neural models
- Similarity-based models
- Committees of models
  4. 4. AI, CI & DM <ul><li>Artificial Intelligence: symbolic models of knowledge. </li></ul><ul><li>Higher-level cognition: reasoning, problem solving, planning, heuristic search for solutions. </li></ul><ul><li>Machine learning, inductive, rule-based methods. </li></ul><ul><li>Technology: expert systems. </li></ul><ul><li>Computational Intelligence, Soft Computing: </li></ul><ul><li>methods inspired by many sources: </li></ul><ul><li>biology – evolutionary, immune, neural computing </li></ul><ul><li>statistics, patter recognition </li></ul><ul><li>probability – Bayesian networks </li></ul><ul><li>logic – fuzzy, rough … </li></ul><ul><li>Perception, object recognition. </li></ul><ul><li>Data Mining, Knowledge Discovery in Databases. </li></ul><ul><li>discovery of interesting patterns, rules, knowledge. </li></ul><ul><li>building predictive data models. </li></ul>
5. 5. Forms of useful knowledge
- AI/Machine Learning camp: neural nets are black boxes. Unacceptable! Symbolic rules forever.
- But ... knowledge accessible to humans is in: symbols, similarity to prototypes, images, visual representations.
- What type of explanation is satisfactory? Interesting question for cognitive scientists. Different answers in different fields.
6. 6. Forms of knowledge
- Humans remember examples of each category and refer to such examples, as similarity-based or nearest-neighbor methods do.
- Humans create prototypes out of many examples, as Gaussian classifiers, RBF networks and neurofuzzy systems do.
- Logical rules are the highest form of summarization of knowledge.
Types of explanation:
- exemplar-based: prototypes and similarity;
- logic-based: symbols and rules;
- visualization-based: maps, diagrams, relations ...
7. 7. GhostMiner Philosophy
- GhostMiner: data mining tools from our lab.
- Separate the process of model building and knowledge discovery from model use => GhostMiner Developer & GhostMiner Analyzer.
- There is no free lunch: provide different types of tools for knowledge discovery. Decision tree, neural, neurofuzzy, similarity-based, committees.
- Provide tools for visualization of data.
- Support the process of knowledge discovery/model building and evaluation, organizing it into projects.
8. 8. Wine data example
Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars. Task: recognize the source of a wine sample. 13 quantities measured, continuous features:
- alcohol content
- ash content
- magnesium content
- flavanoids content
- proanthocyanins phenols content
- OD280/D315 of diluted wines
- malic acid content
- alkalinity of ash
- total phenols content
- nonanthocyanins phenols content
- color intensity
- hue
- proline.
9. 9. Exploration and visualization
- General info about the data.
10. 10. Exploration: data
- Inspect the data.
11. 11. Exploration: data statistics
- Distribution of feature values. Proline has very large values; the data should be standardized before further processing.
12. 12. Exploration: data standardized
- Standardized data: unit standard deviation; about 2/3 of all data should fall within [mean-std, mean+std].
- Other options: normalize to fit in [-1, +1], or normalize after rejecting some extreme values.
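As a concrete illustration of the standardization step above, here is a minimal numpy sketch (not GhostMiner's own code); the two-feature example array is made up, with the second column on a proline-like scale:

```python
import numpy as np

def standardize(X):
    """Z-score each column: zero mean, unit standard deviation.
    For roughly normal features, about 2/3 of the standardized values
    then fall within [-1, +1], i.e. within one std of the mean."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Illustrative data: the second feature has very large values (like proline)
X = np.array([[12.8,  680.0],
              [13.5, 1285.0],
              [12.2,  520.0]])
print(standardize(X))
```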
13. 13. Exploration: 1D histograms
- Distribution of feature values in classes. Some features are more useful than others.
14. 14. Exploration: 1D/3D histograms
- Distribution of feature values in classes, shown in 3D.
15. 15. Exploration: 2D projections
- Projections on selected 2D feature pairs.
16. 16. Visualize data
- Hard to imagine relations in more than 3D.
- SOM mappings: popular for visualization, but rather inaccurate, no measure of distortions.
- Measure of topographical distortions: map all X_i points from R^n to x_i points in R^m, m < n, and ask: how well are the distances R_ij = D(X_i, X_j) reproduced by the distances r_ij = d(x_i, x_j)?
- Use m = 2 for visualization, use higher m for dimensionality reduction.
17. 17. Visualize data: MDS
- Multidimensional scaling: invented in psychometry by Torgerson (1952), re-invented by Sammon (1969) and myself (1994) ...
- Minimize the measure of topographical distortions by moving the x_i coordinates.
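A minimal sketch of the MDS idea described above, assuming a Sammon-style stress and plain gradient descent on the 2D coordinates; real MDS implementations (including GhostMiner's) differ in the exact stress measure and optimizer:

```python
import numpy as np

def sammon_stress(D, d):
    """Topographical distortion: how badly the map distances d = r_ij
    reproduce the original distances D = R_ij."""
    mask = D > 0
    return np.sum((D[mask] - d[mask]) ** 2 / D[mask]) / np.sum(D[mask])

def mds(X, m=2, iters=500, lr=0.05, seed=0):
    """Move the x_i coordinates by gradient descent to reduce the stress."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    Y = rng.normal(scale=1e-2, size=(n, m))
    for _ in range(iters):
        d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
        np.fill_diagonal(d, 1.0)                       # avoid division by zero
        W = (d - D) / (d * np.where(D > 0, D, 1.0))    # per-pair gradient weight
        np.fill_diagonal(W, 0.0)
        grad = 2 * (W[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
        Y -= lr * grad
    return Y

X = np.random.default_rng(1).normal(size=(30, 13))     # stand-in for 13D Wine features
Y = mds(X)
print(sammon_stress(np.linalg.norm(X[:, None] - X[None], axis=-1),
                    np.linalg.norm(Y[:, None] - Y[None], axis=-1)))
```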
18. 18. Visualize data: Wine
- 3 clusters are clearly distinguished, 2D is fine.
- The green outlier can be identified easily.
19. 19. Decision trees
- Simplest things first: use a decision tree to find logical rules.
- Test a single attribute, find a good point to split the data, separating vectors from different classes.
- DT advantages: fast, simple, easy to understand, easy to program, many good algorithms.
20. 20. Decision borders
- Univariate trees: test the value of a single attribute, x < a.
- Multivariate trees: test combinations of attributes.
- Result: the feature space is divided into hyperrectangular areas.
21. 21. SSV decision tree
- Separability Split Value tree: based on the separability criterion.
- Define the left and right sides of a split, then apply the SSV criterion: separate as many pairs of vectors from different classes as possible, while minimizing the number of separated pairs from the same class.
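A simplified sketch of the separability idea above, following the verbal description on the slide (count separated different-class pairs minus separated same-class pairs for a candidate split) rather than the exact SSV definition from the authors' papers:

```python
import numpy as np

def ssv_value(x, y, threshold):
    """Separability of the split 'x < threshold': reward pairs of vectors
    from different classes that fall on opposite sides, penalize pairs
    from the same class that get separated."""
    left = x < threshold
    sep_diff, sep_same = 0, 0
    for c in np.unique(y):
        n_left_c = np.sum(left & (y == c))
        n_right_c = np.sum(~left & (y == c))
        sep_diff += n_left_c * np.sum(~left & (y != c))   # different-class pairs split
        sep_same += n_left_c * n_right_c                  # same-class pairs split
    return sep_diff - sep_same

# Pick the best threshold for a single feature
x = np.array([0.2, 0.5, 0.9, 1.4, 1.8, 2.3])
y = np.array([0,   0,   0,   1,   1,   1])
candidates = (x[:-1] + x[1:]) / 2
print(max(candidates, key=lambda t: ssv_value(x, y, t)))   # 1.15
```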
22. 22. SSV – complex tree
- Trees may always learn to achieve 100% accuracy.
- Very few vectors are left in the leaves!
23. 23. SSV – simplest tree
- Pruning finds the nodes that should be removed to increase generalization – accuracy on unseen data.
- Trees with 7 nodes left: 15 errors/178 vectors.
24. 24. SSV – logical rules
- Trees may be converted to logical rules. The simplest tree leads to 4 logical rules:
- if proline > 719 and flavanoids > 2.3 then class 1
- if proline < 719 and OD280 > 2.115 then class 2
- if proline > 719 and flavanoids < 2.3 then class 3
- if proline < 719 and OD280 < 2.115 then class 3
How accurate are such rules? Not the 15/178 errors (91.5% accuracy) seen on the whole data! Run 10-fold CV and average the results: 85±10%? Run it 10 times!
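The "run 10-fold CV and average, then repeat" protocol can be sketched with scikit-learn; note that the SSV tree itself is not available there, so a generic DecisionTreeClassifier is used as a stand-in and the max_depth value is only illustrative:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_wine(return_X_y=True)

# "Run 10-fold CV and average the results ... run it 10 times": repeat the
# whole stratified 10-fold CV with different shuffles, report mean +/- std.
scores = []
for seed in range(10):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    tree = DecisionTreeClassifier(max_depth=3, random_state=seed)  # stand-in for SSV
    scores.extend(cross_val_score(tree, X, y, cv=cv))
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```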
25. 25. SSV – optimal trees/rules
- Optimal: estimate how well rules will generalize.
- Use stratified crossvalidation for training; use beam search for better results.
- if OD280/D315 > 2.505 and proline > 726.5 then class 1
- if OD280/D315 < 2.505 and hue > 0.875 and malic-acid < 2.82 then class 2
- if OD280/D315 > 2.505 and proline < 726.5 then class 2
- if OD280/D315 < 2.505 and hue > 0.875 and malic-acid > 2.82 then class 3
- if OD280/D315 < 2.505 and hue < 0.875 then class 3
Note: 6/178 errors, or 96.6% accuracy, on the whole data. Run 10-fold CV: results are 90.4 ± 6.1%. Run it 10 times!
26. 26. Logical rules
- Crisp logic rules: for continuous x use linguistic variables (predicate functions).
- s_k(x) ≡ True[X_k ≤ x ≤ X'_k], for example:
  small(x) = True{x | x < 1}
  medium(x) = True{x | x ∈ [1,2]}
  large(x) = True{x | x > 2}
- Linguistic variables are used in crisp (propositional, Boolean) logic rules:
  IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
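The linguistic variables above translate directly into predicate functions; a tiny sketch using the slide's own thresholds (1 and 2) and the Brownie rule:

```python
def small(x):  return x < 1
def medium(x): return 1 <= x <= 2
def large(x):  return x > 2

# A crisp propositional rule built from such predicates:
def is_brownie(height, has_hat, has_beard):
    return small(height) and has_hat and has_beard

print(is_brownie(0.5, True, True))   # True
```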
27. 27. Crisp logic decisions
- Crisp logic is based on rectangular membership functions: True/False values jump from 0 to 1.
- Step functions are used for partitioning of the feature space.
- Very simple hyper-rectangular decision borders: a severe limitation on the expressive power of crisp logical rules!
28. 28. Logical rules - advantages
- Logical rules, if simple enough, are preferable.
- Rules may expose limitations of black-box solutions.
- Only relevant features are used in rules.
- Rules may sometimes be more accurate than NN and other CI methods.
- Overfitting is easy to control; rules usually have a small number of parameters.
- Rules forever!? A logical rule about logical rules is:
IF the number of rules is relatively small AND the accuracy is sufficiently high, THEN rules may be an optimal choice.
29. 29. Logical rules - limitations
- Logical rules are preferred but ...
- Only one class is predicted, p(C_i|X,M) = 0 or 1; such a black-and-white picture may be inappropriate in many applications.
- A discontinuous cost function allows only non-gradient optimization.
- Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules.
- Reliable crisp rules may reject some cases as unclassified.
- Interpretation of crisp rules may be misleading.
- Fuzzy rules are not so comprehensible.
30. 30. How to use logical rules?
- Data has been measured with unknown error. Assume a Gaussian distribution: x becomes a fuzzy number with a Gaussian membership function.
- A set of logical rules R is applied to such fuzzy input vectors.
- Monte Carlo simulations for an arbitrary system => p(C_i|X).
- Analytical evaluation of p(C|X) is based on the cumulative distribution; the error function is practically identical to the logistic function (difference < 0.02).
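A hedged sketch of the Monte Carlo route mentioned above: sample inputs from a Gaussian around the measured vector and count how often each class's rule fires. The two rules and the uncertainties s are illustrative (thresholds borrowed from the SSV tree slide), not the actual optimized values:

```python
import numpy as np

def p_class_mc(x, s, rules, n_samples=20_000, seed=0):
    """Monte Carlo estimate of p(C|X) for rule-based classification of a
    fuzzy (Gaussian) input: sample vectors from N(x, s^2) feature-wise and
    record how often each class rule is fulfilled."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(loc=x, scale=s, size=(n_samples, len(x)))
    return {c: float(np.mean([rule(v) for v in samples]))
            for c, rule in rules.items()}

rules = {
    1: lambda v: v[0] > 719 and v[1] > 2.3,    # proline, flavanoids
    3: lambda v: v[0] > 719 and v[1] <= 2.3,
}
x = np.array([730.0, 2.35])     # measured values close to the rule borders
s = np.array([30.0, 0.2])       # assumed measurement uncertainties
print(p_class_mc(x, s, rules))
```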
31. 31. Rules - choices
- Simplicity vs. accuracy.
- Confidence vs. rejection rate.
- p++ is a hit; p-+ a false alarm; p+- is a miss (p_ab: true class a, prediction b; r = rejected).
Sensitivity   S+(M) = p+|+ = p++ / p+
Specificity   S-(M) = p-|- = p-- / p-
Accuracy (overall)   A(M) = p++ + p--
Error rate   L(M) = p+- + p-+
Rejection rate   R(M) = p+r + p-r = 1 - L(M) - A(M)
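The quantities above computed from the joint probabilities p_ab (true class a, prediction b, with r = rejected); a minimal sketch with made-up numbers:

```python
def rule_quality(p_pp, p_pm, p_pr, p_mp, p_mm, p_mr):
    """p_xy: probability that the true class is x (+/-) and the model
    predicts y (+, -, or r = rejected / left unclassified)."""
    p_plus, p_minus = p_pp + p_pm + p_pr, p_mp + p_mm + p_mr
    return {
        "sensitivity S+": p_pp / p_plus,        # S+(M) = p++ / p+
        "specificity S-": p_mm / p_minus,       # S-(M) = p-- / p-
        "accuracy A":     p_pp + p_mm,          # A(M)
        "error L":        p_pm + p_mp,          # L(M)
        "rejection R":    p_pr + p_mr,          # R(M) = 1 - A(M) - L(M)
    }

print(rule_quality(0.40, 0.05, 0.05, 0.03, 0.42, 0.05))
```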
32. 32. Rules – error functions
- The overall accuracy is a combination of sensitivity and specificity weighted by the a priori probabilities: A(M) = p+ S+(M) + p- S-(M).
- Optimization of rules for the C+ class; a large γ means no errors but a high rejection rate:
  E(M; γ) = γ L(M) - A(M) = γ (p+- + p-+) - (p++ + p--)
  min_M E(M; γ)  <=>  min_M {(1+γ) L(M) + R(M)}
- Optimization with different costs of errors:
  min_M E(M; α) = min_M {p+- + α p-+}
              = min_M {p+ (1 - S+(M)) - p+r(M) + α [p- (1 - S-(M)) - p-r(M)]}
- ROC (Receiver Operating Curve): plot p++ vs p-+, i.e. hit rate vs false-alarm rate.
33. 33. Fuzzification of rules
- Rule R_a(x) = {x > a} is fulfilled by G_x with a probability given by the cumulative Gaussian (error function).
- The error function is approximated by a logistic function; assuming the error distribution σ(x)(1-σ(x)), which for s² = 1.7 approximates a Gaussian to within 3.5%.
- Rule R_ab(x) = {b > x ≥ a} is fulfilled by G_x with a probability given by the difference of two such error functions.
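A numeric check of the statements above: the probability that a Gaussian-blurred input G_x fulfills {x > a} is the cumulative Gaussian (error function), and a logistic function with a suitably chosen slope reproduces it to within about 0.01. The slope constant 1.7 below is the standard logistic-to-Gaussian matching factor, assumed here rather than taken from the slide:

```python
import numpy as np
from math import erf, exp, sqrt

def p_rule_gauss(x, a, s):
    """p({x > a} | G_x): Gaussian with mean x and dispersion s exceeds a."""
    return 0.5 * (1.0 + erf((x - a) / (s * sqrt(2.0))))

def p_rule_logistic(x, a, s, beta=1.7):
    """Logistic approximation sigma(beta * (x - a) / s)."""
    return 1.0 / (1.0 + exp(-beta * (x - a) / s))

worst = max(abs(p_rule_gauss(x, 0.0, 1.0) - p_rule_logistic(x, 0.0, 1.0))
            for x in np.linspace(-4, 4, 801))
print(f"largest difference: {worst:.4f}")    # well below 0.02
```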
34. 34. Soft trapezoids and NN
- The difference of two sigmoids, σ(x) - σ(x-b), makes a soft trapezoidal membership function.
- Conclusion: fuzzy logic with σ(x) - σ(x-b) membership functions is equivalent to crisp logic + Gaussian uncertainty.
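A small numeric sketch of the soft trapezoid above: the difference of two sigmoids gives a membership function for a window, and as the slope grows it approaches the rectangular (crisp) function:

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def soft_trapezoid(x, a, b, beta):
    """Difference of two sigmoids: soft membership in the interval [a, b]."""
    return sigma(beta * (x - a)) - sigma(beta * (x - b))

x = np.linspace(0.0, 3.0, 7)
for beta in (2.0, 10.0, 100.0):
    print(beta, np.round(soft_trapezoid(x, 1.0, 2.0, beta), 2))
# large beta -> values jump to ~0/1: a crisp rectangular window on [1, 2]
```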
35. 35. Optimization of rules
- Fuzzy: large receptive fields, rough estimations. G_x: uncertainty of inputs, small receptive fields.
- Minimization of the number of errors: difficult, non-gradient, but now Monte Carlo or analytical p(C|X; M) can be used.
- Gradient optimization works for a large number of parameters.
- Parameters s_x are known for some features; use them as optimization parameters for the others!
- Probabilities instead of 0/1 rule outcomes.
- Vectors that were not classified by crisp rules now have non-zero probabilities.
36. 36. Mushrooms
- The Mushroom Guide: no simple rule for mushrooms; no rule like 'leaflets three, let it be' for Poisonous Oak and Ivy.
- 8124 cases, 51.8% edible, the rest non-edible. 22 symbolic attributes, up to 12 values each, equivalent to 118 logical features, or 2^118 ≈ 3·10^35 possible input vectors.
- Odor: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy.
- Spore print color: black, brown, buff, chocolate, green, orange, purple, white, yellow.
- Safe rule for edible mushrooms: odor = (almond ∨ anise ∨ none) ∧ spore-print-color = ¬green; 48 errors, 99.41% correct.
- This is why animals have such a good sense of smell! What does it tell us about odor receptors?
37. 37. Mushrooms rules
- To eat or not to eat, that is the question! Not any more ...
A mushroom is poisonous if:
- R1) odor = ¬(almond ∨ anise ∨ none); 120 errors, 98.52%
- R2) spore-print-color = green; 48 errors, 99.41%
- R3) odor = none ∧ stalk-surface-below-ring = scaly ∧ stalk-color-above-ring = ¬brown; 8 errors, 99.90%
- R4) habitat = leaves ∧ cap-color = white; no errors!
- R1 + R2 are quite stable, found even with 10% of the data; R3 and R4 may be replaced by other rules, e.g.:
  R'3) gill-size = narrow ∧ stalk-surface-above-ring = (silky ∨ scaly)
  R'4) gill-size = narrow ∧ population = clustered
- Only 5 of 22 attributes used! Simplest possible rules? 100% in CV tests - the structure of this data is completely clear.
38. 38. Recurrence of breast cancer
- Data from the Institute of Oncology, University Medical Center, Ljubljana.
- 286 cases: 201 no recurrence (70.3%), 85 recurrence cases (29.7%).
- 9 symbolic features: age (9 bins), tumor-size (12 bins), nodes involved (13 bins), degree-malignant (1,2,3), area, radiation, menopause, node-caps.
- Example record: no-recurrence, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes
- Many systems tried, 65-78% accuracy reported.
- Single rule: IF (nodes-involved ∉ [0,2]) ∧ (degree-malignant = 3) THEN recurrence, ELSE no-recurrence
- 77% accuracy; only trivial knowledge in the data: highly malignant cancer involving many nodes is likely to strike back.
39. 39. Neurofuzzy system
- Feature Space Mapping (FSM) neurofuzzy system.
- Neural adaptation, estimation of the probability density distribution (PDF) using a single-hidden-layer network (RBF-like) with nodes realizing separable functions.
- Fuzzy: the crisp yes/no membership is replaced by a degree of membership μ(x).
- Triangular, trapezoidal, Gaussian or other membership functions; membership functions in many dimensions are built as products of one-dimensional factors.
40. 40. FSM
- Rectangular functions: simple rules are created; many nearly equivalent descriptions of this data exist.
  If proline > 929.5 then class 1 (48 cases, 45 correct + 2 recovered by other rules).
  If color < 3.79285 then class 2 (63 cases, 60 correct).
  Interesting rules, but the overall accuracy is only 88±9%.
- Initialize using clusterization or decision trees.
- Triangular & Gaussian functions for fuzzy rules; rectangular functions for crisp rules.
- Between 9 and 14 rules with triangular membership functions are created; accuracy in 10xCV tests is about 96±4.5%. Similar results obtained with Gaussian functions.
41. 41. Prototype-based rules
- P-rules have the form: IF P = arg min_R D(X,R) THEN Class(X) = Class(P)
- D(X,R) is a dissimilarity (distance) function, determining decision borders around prototype P.
- C-rules (crisp) are a special case of F-rules (fuzzy rules); F-rules are a special case of P-rules (prototype rules).
- P-rules are easy to interpret!
  IF X = You are most similar to P = Superman THEN You are in the Super-league.
  IF X = You are most similar to P = Weakling THEN You are in the Failed-league.
- "Similar" may involve different features or different D(X,P).
42. 42. P-rules
- Euclidean distance leads to Gaussian fuzzy membership functions with product as the T-norm.
- Manhattan distance => μ(X;P) = exp{-|X-P|}.
- Various distance functions lead to different membership functions, e.g. data-dependent distance functions for symbolic data.
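The correspondence stated above can be checked numerically: exp(-D(X,P)^2) with Euclidean D factorizes into a product of one-dimensional Gaussian membership functions (product T-norm), and exp(-D(X,P)) with Manhattan D factorizes into exponential factors. The vectors below are arbitrary:

```python
import numpy as np

X = np.array([1.0, 2.0, 0.5])
P = np.array([0.8, 2.5, 0.1])

# Euclidean distance -> Gaussian membership function, product T-norm
mu_euclid = np.exp(-np.sum((X - P) ** 2))
mu_product = np.prod(np.exp(-(X - P) ** 2))        # product of 1-D Gaussians
print(mu_euclid, mu_product)                       # identical

# Manhattan distance -> product of exponential factors
mu_manhattan = np.exp(-np.sum(np.abs(X - P)))
print(mu_manhattan, np.prod(np.exp(-np.abs(X - P))))   # identical
```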
43. 43. Promoters
- DNA strings, 57 nucleotides, 53 + and 53 - samples.
- Example: tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt
- Euclidean distance: symbols s = a, c, t, g replaced by x = 1, 2, 3, 4.
- PDF distance: symbols s = a, c, t, g replaced by p(s|+).
44. 44. P-rules
- New distance functions from information theory => interesting membership functions. Membership functions => new distance functions, with a local D(X,R) for each cluster.
- Crisp logic rules: use the L∞ norm: D_Ch(X,P) = ||X - P||∞ = max_i W_i |X_i - P_i|
- D_Ch(X,P) = const => rectangular contours.
- Chebyshev distance with thresholds θ_P:
  IF D_Ch(X,P) ≤ θ_P THEN C(X) = C(P)
  is equivalent to the conjunctive crisp rule
  IF X_1 ∈ [P_1 - θ_P/W_1, P_1 + θ_P/W_1] ∧ ... ∧ X_N ∈ [P_N - θ_P/W_N, P_N + θ_P/W_N] THEN C(X) = C(P)
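A quick check of the equivalence above: thresholding the weighted Chebyshev (L∞) distance around a prototype makes exactly the same decision as the conjunction of per-feature interval conditions. The prototype, weights and threshold below are arbitrary:

```python
import numpy as np

def chebyshev_rule(X, P, W, theta):
    """IF max_i W_i |X_i - P_i| <= theta THEN class of P."""
    return np.max(W * np.abs(X - P)) <= theta

def conjunctive_rule(X, P, W, theta):
    """IF X_i in [P_i - theta/W_i, P_i + theta/W_i] for all i THEN class of P."""
    return np.all((X >= P - theta / W) & (X <= P + theta / W))

rng = np.random.default_rng(0)
P, W, theta = np.array([0.0, 1.0, -2.0]), np.array([1.0, 2.0, 0.5]), 1.5
for _ in range(1000):
    X = rng.normal(size=3) * 3
    assert chebyshev_rule(X, P, W, theta) == conjunctive_rule(X, P, W, theta)
print("both rules agree on all sampled points")
```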
45. 45. Decision borders
- Euclidean distance from 3 prototypes, one per class.
- Minkowski distance with exponent 20 from 3 prototypes.
- Contours D(P,X) = const and decision borders D(P,X) = D(Q,X).
46. 46. P-rules for Wine
- Manhattan distance: 6 prototypes kept, 4 errors, f2 removed.
- Chebyshev distance: 15 prototypes kept, 5 errors, f2, f8, f10 removed.
- Euclidean distance: 11 prototypes kept, 7 errors.
- Many other solutions.
47. 47. Neural networks
- MLP – Multilayer Perceptrons, the most popular NN models. Use soft hyperplanes for discrimination. Results are difficult to interpret, complex decision borders. Prediction, approximation: infinite number of classes.
- RBF – Radial Basis Functions. RBF with Gaussian functions are equivalent to fuzzy systems with Gaussian membership functions, but ... no feature selection => complex rules; other radial functions => not separable! Use separable functions, not radial => FSM.
- Many methods to convert MLP NN to logical rules.
48. 48. Rules from MLPs
- Why is it difficult? Multi-layer perceptron (MLP) networks stack many perceptron units performing threshold logic:
  M-of-N rule: IF (M conditions of N are true) THEN ...
- Problem: for N inputs the number of subsets is 2^N, an exponentially growing number of possible conjunctive rules.
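The M-of-N rule mentioned above is exactly what a threshold unit with unit weights computes; a one-line sketch:

```python
def m_of_n(conditions, m):
    """M-of-N rule: true when at least m of the n boolean conditions hold,
    i.e. what a perceptron with unit weights and threshold m computes."""
    return sum(bool(c) for c in conditions) >= m

print(m_of_n([True, False, True], 2))   # 2-of-3 rule: True
```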
49. 49. MLP2LN
- Converts MLP neural networks into a network performing logical operations (LN).
- Network structure: input layer; aggregation nodes (better features); linguistic units (windows, filters); rule units (threshold logic); output: one node per class.
50. 50. MLP2LN training
- Constructive algorithm: add as many nodes as needed.
- Optimize the cost function: minimize errors + enforce zero connections + leave only +1 and -1 weights; this makes interpretation easy.
51. 51. L-units
- Create linguistic variables.
- Numerical representation for R-nodes: V_sk = (...) for s_k = low, V_sk = (...) for s_k = normal, etc.
- L-units: 2 thresholds as adaptive parameters; logistic σ(x) or tanh(x) ∈ [-1, +1].
- Soft trapezoidal functions change into rectangular filters (Parzen windows). 4 types, depending on the signs S_i.
- A product of bi-central functions is a logical rule, used by the IncNet NN.
52. 52. Iris example
- Network after training:
  iris setosa: q=1 (0,0,0; 0,0,0; +1,0,0; +1,0,0)
  iris versicolor: q=2 (0,0,0; 0,0,0; 0,+1,0; 0,+1,0)
  iris virginica: q=1 (0,0,0; 0,0,0; 0,0,+1; 0,0,+1)
- Rules:
  If (x3=s ∧ x4=s) then setosa
  If (x3=m ∧ x4=m) then versicolor
  If (x3=l ∧ x4=l) then virginica
- 3 errors only (98%).
53. 53. Learning dynamics
- Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.
54. 54. Thyroid screening
- Garavan Institute, Sydney, Australia.
- 15 binary, 6 continuous features.
- Training: 93+191+3488; validation: 73+177+3178.
- Determine important clinical factors; calculate the probability of each diagnosis.
- [Network diagram: clinical findings (age, sex, TSH, T3, TT4, T4U, TBG, ...) feed hidden units producing the final diagnoses: normal, hyperthyroid, hypothyroid.]
55. 55. Thyroid – some results
Accuracy of diagnoses obtained with several systems – rules are accurate.
  Method                    Rules/Features   Training %   Test %
  MLP2LN optimized          4/6              99.9         99.36
  CART/SSV Decision Trees   3/5              99.8         99.33
  Best Backprop MLP         -/21             100          98.5
  Naive Bayes               -/-              97.0         96.1
  k-nearest neighbors       -/-              -            93.8
56. 56. Psychometry
- Use CI to find knowledge, create an Expert System.
- MMPI (Minnesota Multiphasic Personality Inventory) psychometric test.
- Printed forms are scanned, or the computerized version of the test is used.
- Raw data: 550 questions, e.g.: "I am getting tired quickly: Yes - Don't know - No".
- Results are combined into 10 clinical scales and 4 validity scales using fixed coefficients.
- Each scale measures tendencies towards hypochondria, schizophrenia, psychopathic deviations, depression, hysteria, paranoia, etc.
  57. 57. Scanned form
  58. 58. Computer input
  59. 59. Scales
60. 60. Psychometry: goal
- There is no simple correlation between single values and the final diagnosis.
- Results are displayed in the form of a histogram called a 'psychogram'. Interpretation depends on the experience and skill of an expert and takes into account correlations between peaks.
- Goal: an expert system providing evaluation and interpretation of MMPI tests at an expert level.
- Problem: experts agree only about 70% of the time; alternative diagnoses and personality changes over time are important.
  61. 61. Psychogram
62. 62. Psychometric data
- 1600 cases for women, the same number for men.
- 27 classes: norm, psychopathic, schizophrenia, paranoia, neurosis, mania, simulation, alcoholism, drug addiction, criminal tendencies, abnormal behavior due to ...
- Extraction of logical rules: 14 scales = features. Define linguistic variables and use FSM, MLP2LN, SSV, giving about 2-3 rules/class.
63. 63. Psychometric results
10-CV accuracy for FSM is 82-85%, for C4.5 79-84%. Input uncertainty +G_x of around 1.5% (best ROC) improves FSM results to 90-92%.
  Method   Data   N. rules   Accuracy %   + G_x
  C4.5     F      55         93.0         93.7
  C4.5     M      61         92.5         93.1
  FSM      F      69         95.4         97.6
  FSM      M      98         95.9         96.9
64. 64. Psychometric Expert
- Probabilities for different classes. For greater uncertainties more classes are predicted.
- Fitting the rules to the conditions: typically 3-5 conditions per rule; Gaussian distributions around measured values that fall into the rule interval are shown in green.
- Verbal interpretation of each case, rule and scale dependent.
  65. 65. MMPI probabilities
  66. 66. MMPI rules
  67. 67. MMPI verbal comments
68. 68. Visualization
- Probability of classes versus input uncertainty.
- Detailed input probabilities around the measured values vs. a change in a single scale; changes over time define the 'patient's trajectory'.
- Interactive multidimensional scaling: zooming in on the new case to inspect its similarity to other cases.
  69. 69. Class probability/uncertainty
  70. 70. Class probability/feature
  71. 71. MDS visualization
72. 72. Summary
- Computational intelligence methods: neural, decision trees, similarity-based and others help to understand the data.
- Understanding data is achieved by rules, prototypes, visualization.
- Small is beautiful => simple is best!
- Simplest possible, but not simpler: regularization of models; accurate, but not too accurate: handling of uncertainty; high confidence, but not paranoid: rejecting some cases.
- Challenges: hierarchical systems, discovery of theories rather than data models, integration with image/signal analysis, reasoning in complex domains/objects, applications in bioinformatics, text analysis ...
73. 73. References
- Many papers and comparisons of results for numerous datasets are kept at: http://www.phys.uni.torun.pl/kmk
- See also my homepage at http://www.phys.uni.torun.pl/~duch for this and other presentations and some papers.
- We are slowly getting there. All this and more is included in GhostMiner, data mining software (in collaboration with Fujitsu) just released: http://www.fqspl.com.pl/ghostminer/