Metody Inteligencji Obliczeniowej (Computational Intelligence Methods)

Transcript

  • 1. Computational Intelligence for Data Mining Włodzisław Duch Department of Informatics Nicholas Copernicus University Torun, Poland With help from R. Adamczak, K. Grąbczewski, K. Grudziński, N. Jankowski, A. Naud http://www.phys.uni.torun.pl/kmk WCCI 2002, Honolulu, HI
  • 2. Group members
  • 3. Plan
    • What is this tutorial about?
    • How to discover knowledge in data;
    • how to create comprehensible models of data;
    • how to evaluate new data.
    • AI, CI & Data Mining
    • Forms of useful knowledge
    • GhostMiner philosophy
    • Exploration & Visualization
    • Rule-based data analysis
    • Neurofuzzy models
    • Neural models
    • Similarity-based models
    • Committees of models
  • 4. AI, CI & DM
    • Artificial Intelligence: symbolic models of knowledge.
    • Higher-level cognition: reasoning, problem solving, planning, heuristic search for solutions.
    • Machine learning, inductive, rule-based methods.
    • Technology: expert systems.
    • Computational Intelligence, Soft Computing:
    • methods inspired by many sources:
    • biology – evolutionary, immune, neural computing
    • statistics, pattern recognition
    • probability – Bayesian networks
    • logic – fuzzy, rough …
    • Perception, object recognition.
    • Data Mining, Knowledge Discovery in Databases.
    • discovery of interesting patterns, rules, knowledge.
    • building predictive data models.
  • 5. Forms of useful knowledge
    • AI/Machine Learning camp:
    • Neural nets are black boxes.
    • Unacceptable! Symbolic rules forever.
    • But ... knowledge accessible to humans is in:
    • symbols,
    • similarity to prototypes,
    • images, visual representations.
    • What type of explanation is satisfactory?
    • Interesting question for cognitive scientists.
    • Different answers in different fields.
  • 6. Forms of knowledge
    • Humans remember examples of each category and refer to such examples – as similarity-based or nearest-neighbors methods do.
    • Humans create prototypes out of many examples – as Gaussian classifiers, RBF networks, neurofuzzy systems do.
    • Logical rules are the highest form of summarization of knowledge.
    • Types of explanation:
    • exemplar-based: prototypes and similarity;
    • logic-based: symbols and rules;
    • visualization-based: maps, diagrams, relations ...
  • 7. GhostMiner Philosophy
    • GhostMiner, data mining tools from our lab.
    • Separate the process of model building and knowledge discovery from model use => GhostMiner Developer & GhostMiner Analyzer
    • There is no free lunch – provide different types of tools for knowledge discovery: decision trees, neural, neurofuzzy, similarity-based models, committees.
    • Provide tools for visualization of data.
    • Support the process of knowledge discovery/model building and evaluating, organizing it into projects.
  • 8. Wine data example
    • Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars. Task: recognize the source of a wine sample. 13 quantities measured, continuous features (a loading sketch in Python follows this list):
    • alcohol content
    • ash content
    • magnesium content
    • flavanoids content
    • proanthocyanins phenols content
    • OD280/OD315 of diluted wines
    • malic acid content
    • alkalinity of ash
    • total phenols content
    • nonanthocyanins phenols content
    • color intensity
    • hue
    • proline.
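    A minimal sketch, assuming Python and scikit-learn's copy of the same UCI Wine data as a stand-in for the GhostMiner project used in the tutorial:
        # Load the UCI Wine data (178 samples, 13 continuous features, 3 cultivars).
        from sklearn.datasets import load_wine

        wine = load_wine()
        X, y = wine.data, wine.target
        print(wine.feature_names)      # alcohol, malic_acid, ash, ..., proline
        print(X.shape, set(y))         # (178, 13) {0, 1, 2}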
  • 9. Exploration and visualization
    • General info about the data
  • 10. Exploration: data
    • Inspect the data
  • 11. Exploration: data statistics
    • Distribution of feature values
    Proline has very large values, the data should be standardized before further processing.
  • 12. Exploration: data standardized
    • Standardized data: unit standard deviation, about 2/3 of all data should fall within [mean-std,mean+std]
    Other options: normalize to fit in [-1,+1], or normalize rejecting some extreme values.
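    A minimal sketch (Python/scikit-learn) of the two options mentioned above – z-score standardization and rescaling to [−1, +1]:
        import numpy as np
        from sklearn.datasets import load_wine
        from sklearn.preprocessing import StandardScaler, MinMaxScaler

        X, y = load_wine(return_X_y=True)
        X_std = StandardScaler().fit_transform(X)                      # zero mean, unit std
        X_m11 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # rescale to [-1, +1]
        print(np.mean(np.abs(X_std) <= 1.0))                           # roughly 2/3 within one std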
  • 13. Exploration: 1D histograms
    • Distribution of feature values in classes
    Some features are more useful than others.
  • 14. Exploration: 1D/3D histograms
    • Distribution of feature values in classes, 3D
  • 15. Exploration: 2D projections
    • Projections on selected 2D
  • 16. Visualize data
    • Hard to imagine relations in more than 3D.
    SOM mappings: popular for visualization, but rather inaccurate, with no measure of distortions. Measure of topographical distortions: map all points X_i from R^n to points x_i in R^m, m < n, and ask: how well are the distances R_ij = D(X_i, X_j) reproduced by the distances r_ij = d(x_i, x_j)? Use m = 2 for visualization, higher m for dimensionality reduction.
  • 17. Visualize data: MDS
    • Multidimensional scaling: invented in psychometry by Torgerson (1952), re-invented by Sammon (1969) and myself (1994) …
    Minimize a measure of topographical distortion by moving the x_i coordinates.
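    A minimal sketch of this step, assuming Python/scikit-learn rather than the GhostMiner implementation: metric MDS maps the standardized Wine data to 2D by minimizing the stress (sum of squared differences between original and mapped distances).
        from sklearn.datasets import load_wine
        from sklearn.preprocessing import StandardScaler
        from sklearn.manifold import MDS

        X, y = load_wine(return_X_y=True)
        X_std = StandardScaler().fit_transform(X)       # standardize first (see slide 12)
        mds = MDS(n_components=2, dissimilarity='euclidean', random_state=0)
        X_2d = mds.fit_transform(X_std)                 # 2D coordinates for plotting
        print(mds.stress_)                              # residual topographical distortion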
  • 18. Visualize data: Wine
    • 3 clusters are clearly distinguished, 2D is fine.
    The green outlier can be identified easily.
  • 19. Decision trees
    • Simplest things first: use decision tree to find logical rules.
    Test a single attribute, find a good point to split the data, separating vectors from different classes. DT advantages: fast, simple, easy to understand, easy to program, many good algorithms.
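    A minimal sketch in Python/scikit-learn: a standard CART-style tree (not the SSV tree described next) trained on the Wine data and printed as if-then splits.
        from sklearn.datasets import load_wine
        from sklearn.tree import DecisionTreeClassifier, export_text

        wine = load_wine()
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        tree.fit(wine.data, wine.target)
        print(export_text(tree, feature_names=wine.feature_names))   # readable splits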
  • 20. Decision borders
    • Univariate trees: test the value of a single attribute x < a .
    Multivariate trees: test combinations of attributes. Result: the feature space is divided into hyperrectangular areas.
  • 21. SSV decision tree
    • Separability Split Value tree: based on the separability criterion.
    Define left and right sides of the splits. SSV criterion: separate as many pairs of vectors from different classes as possible; minimize the number of separated pairs from the same class.
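    An illustrative sketch (Python) of a separability score in the spirit of the SSV criterion above – not the exact GhostMiner formula: count pairs from different classes that a candidate split separates, minus separated pairs from the same class.
        import numpy as np
        from sklearn.datasets import load_wine

        def separability_score(x, y, threshold):
            # Illustrative only: different-class pairs split apart minus same-class pairs split apart.
            left = x < threshold
            diff_pairs, same_pairs = 0, 0
            for c in np.unique(y):
                n_left_c = np.sum(left & (y == c))
                diff_pairs += n_left_c * np.sum(~left & (y != c))
                same_pairs += n_left_c * np.sum(~left & (y == c))
            return diff_pairs - same_pairs

        X, y = load_wine(return_X_y=True)
        print(separability_score(X[:, 12], y, 719.0))   # proline split value used on slide 24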
  • 22. SSV – complex tree
    • Trees may always learn to achieve 100% accuracy.
    Very few vectors are left in the leaves!
  • 23. SSV – simplest tree
    • Pruning finds the nodes that should be removed to increase generalization – accuracy on unseen data.
    Trees with 7 nodes left: 15 errors/178 vectors.
  • 24. SSV – logical rules
    • Trees may be converted to logical rules.
    • Simplest tree leads to 4 logical rules:
    • if proline > 719 and flavanoids > 2.3 then class 1
    • if proline < 719 and OD280 > 2.115 then class 2
    • if proline > 719 and flavanoids < 2.3 then class 3
    • if proline < 719 and OD280 < 2.115 then class 3
    How accurate are such rules? Not 15/178 errors, i.e. 91.5% accuracy – that is only the training-set estimate! Run 10-fold CV and average the results: about 85±10%. With such variance, repeat the whole CV 10 times!
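    A minimal sketch of that advice, assuming Python/scikit-learn and a plain decision tree as a stand-in for the SSV rules:
        from sklearn.datasets import load_wine
        from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_wine(return_X_y=True)
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
        scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=cv)
        print(f"{scores.mean():.3f} +/- {scores.std():.3f}")   # mean and spread over 100 folds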
  • 25. SSV – optimal trees/rules
    • Optimal: estimate how well rules will generalize.
    • Use stratified crossvalidation for training;
    • use beam search for better results.
    • if OD280/OD315 > 2.505 and proline > 726.5 then class 1
    • if OD280/OD315 < 2.505 and hue > 0.875 and malic-acid < 2.82 then class 2
    • if OD280/OD315 > 2.505 and proline < 726.5 then class 2
    • if OD280/OD315 < 2.505 and hue > 0.875 and malic-acid > 2.82 then class 3
    • if OD280/OD315 < 2.505 and hue < 0.875 then class 3
    Note 6/178 errors, i.e. 96.6% accuracy on the whole set! Run 10-fold CV: results are 90.4±6.1%. Run the CV 10 times!
  • 26. Logical rules
    • Crisp logic rules: for continuous x use linguistic variables (predicate functions).
    s_k(x) ≡ True[X_k ≤ x ≤ X'_k], for example: small(x) = True{x | x < 1}, medium(x) = True{x | x ∈ [1,2]}, large(x) = True{x | x > 2}. Linguistic variables are used in crisp (propositional, Boolean) logic rules: IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
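    A minimal sketch (Python) of such crisp linguistic variables as Boolean predicates, using the example thresholds above:
        def small(x):  return x < 1
        def medium(x): return 1 <= x <= 2
        def large(x):  return x > 2

        def is_brownie(height, has_hat, has_beard):
            # crisp propositional rule built from linguistic predicates
            return small(height) and has_hat and has_beard

        print(is_brownie(0.5, True, True))   # True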
  • 27. Crisp logic decisions
    • Crisp logic is based on rectangular membership functions:
    True/False values jump from 0 to 1. Step functions are used for partitioning of the feature space. Very simple hyper-rectangular decision borders. Severe limitation on the expressive power of crisp logical rules!
  • 28. Logical rules - advantages
    • Logical rules, if simple enough, are preferable.
    • Rules may expose limitations of black box solutions.
    • Only relevant features are used in rules.
    • Rules may sometimes be more accurate than NN and other CI methods.
    • Overfitting is easy to control, rules usually have a small number of parameters.
    • Rules forever!? A logical rule about logical rules is:
    IF the number of rules is relatively small AND the accuracy is sufficiently high THEN rules may be an optimal choice.
  • 29. Logical rules - limitations
    • Logical rules are preferred but ...
    • Only one class is predicted, p(C_i|X,M) = 0 or 1;
    • such a black-and-white picture may be inappropriate in many applications.
    • A discontinuous cost function allows only non-gradient optimization.
    • Sets of rules are unstable: small change in the dataset leads to a large change in structure of complex sets of rules.
    • Reliable crisp rules may reject some cases as unclassified.
    • Interpretation of crisp rules may be misleading.
    • Fuzzy rules are not so comprehensible.
  • 30. How to use logical rules?
    • Data has been measured with unknown error. Assume Gaussian distribution:
    x – a fuzzy number with a Gaussian membership function G_x of width s_x. A set of logical rules R is applied to such fuzzy input vectors: Monte Carlo simulations work for an arbitrary system => p(C_i|X); analytical evaluation of p(C|X) is based on the cumulative distribution (error function), which differs from the logistic function by less than 0.02.
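    A Monte Carlo sketch of this idea (Python), reusing the four crisp SSV rules from slide 24; the per-feature uncertainties s_x used here are illustrative assumptions:
        import numpy as np

        def classify_crisp(proline, od280, flavanoids):
            # the four SSV rules from slide 24
            if proline > 719 and flavanoids > 2.3: return 1
            if proline < 719 and od280 > 2.115:    return 2
            if proline > 719 and flavanoids < 2.3: return 3
            return 3

        def p_class_mc(x, s_x, n=10000, seed=0):
            rng = np.random.default_rng(seed)
            samples = {k: rng.normal(x[k], s_x[k], n) for k in x}   # Gaussian fuzzy inputs
            labels = np.array([classify_crisp(p, o, f) for p, o, f in
                               zip(samples['proline'], samples['od280'], samples['flavanoids'])])
            return {c: float(np.mean(labels == c)) for c in (1, 2, 3)}

        print(p_class_mc({'proline': 720.0, 'od280': 2.1, 'flavanoids': 2.3},
                         {'proline': 30.0, 'od280': 0.1, 'flavanoids': 0.2}))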
  • 31. Rules - choices
    • Simplicity vs. accuracy.
    • Confidence vs. rejection rate.
    p++ is a hit; p+− a false alarm; p−+ a miss; p+r, p−r – fractions of class + and class − vectors that are rejected.
    Specificity: S−(M) = p(−|−) = p−−/p−
    Sensitivity: S+(M) = p(+|+) = p++/p+
    Rejection rate: R(M) = p+r + p−r = 1 − L(M) − A(M)
    Error rate: L(M) = p+− + p−+
    Accuracy (overall): A(M) = p++ + p−−
  • 32. Rules – error functions
    • The overall accuracy is equal to a combination of sensitivity and specificity weighted by the a priori class probabilities:
    A(M) = p+ S+(M) + p− S−(M)
    Optimization of rules for the C+ class; a large γ means no errors but a high rejection rate:
    E(M; γ) = γ L(M) − A(M) = γ(p+− + p−+) − (p++ + p−−)
    min_M E(M; γ) ⇔ min_M {(1+γ) L(M) + R(M)}
    Optimization with different costs of errors:
    min_M E(M; α) = min_M {p+− + α p−+} = min_M {p−(1 − S−(M)) − p−r(M) + α [p+(1 − S+(M)) − p+r(M)]}
    ROC (Receiver Operating Characteristic) curve: p++(p+−), hit rate as a function of the false-alarm rate.
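    A minimal sketch (Python) computing these quantities from raw counts for a classifier that may also reject cases; the count names are illustrative:
        def rates(n_pp, n_pm, n_mp, n_mm, n_rp, n_rm):
            # n_pp: hits, n_pm: false alarms, n_mp: misses, n_mm: correctly classified class -,
            # n_rp / n_rm: rejected (unclassified) vectors from class + / class -
            n = n_pp + n_pm + n_mp + n_mm + n_rp + n_rm
            S_plus  = n_pp / (n_pp + n_mp + n_rp)      # sensitivity S+(M)
            S_minus = n_mm / (n_pm + n_mm + n_rm)      # specificity S-(M)
            A = (n_pp + n_mm) / n                      # overall accuracy A(M)
            L = (n_pm + n_mp) / n                      # error rate L(M)
            R = (n_rp + n_rm) / n                      # rejection rate R(M) = 1 - A - L
            return S_plus, S_minus, A, L, R

        print(rates(40, 5, 3, 45, 4, 3))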
  • 33. Fuzzification of rules
    • Rule R_a(x) = {x ≥ a} is fulfilled by a Gaussian fuzzy number G_x with probability:
    p(R_a | G_x) = ∫_a^∞ G(y; x, s_x) dy ≈ σ(β(x − a)). The error function is approximated here by the logistic function σ; assuming the error distribution σ(x)(1 − σ(x)), for s² = 1.7 it approximates a Gaussian to better than 3.5%. Rule R_ab(x) = {b > x ≥ a} is fulfilled by G_x with probability p(R_ab | G_x) ≈ σ(β(x − a)) − σ(β(x − b)).
  • 34. Soft trapezoids and NN
    • The difference between two sigmoids makes a soft trapezoidal membership function.
    Conclusion: fuzzy logic with σ(x − a) − σ(x − b) membership functions is equivalent to crisp logic + Gaussian input uncertainty.
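    A minimal sketch (Python) of such a soft trapezoidal membership function; β is the slope, and a large β recovers the crisp rectangular rule a ≤ x < b:
        import numpy as np

        def sigmoid(t):
            return 1.0 / (1.0 + np.exp(-t))

        def soft_window(x, a, b, beta):
            # difference of two logistic sigmoids = soft trapezoid for the window rule
            return sigmoid(beta * (x - a)) - sigmoid(beta * (x - b))

        x = np.linspace(0.0, 4.0, 9)
        print(soft_window(x, a=1.0, b=3.0, beta=10.0))   # ~1 inside [1, 3], ~0 outside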
  • 35. Optimization of rules
    • Fuzzy: large receptive fields, rough estimations.
    • G_x – uncertainty of inputs, small receptive fields.
    Minimization of the number of errors – difficult, non-gradient; but now Monte Carlo or analytical p(C|X;M) can be used.
    • Gradient optimization works for a large number of parameters.
    • Parameters s_x are known for some features; use them as optimization parameters for the others!
    • Probabilities instead of 0/1 rule outcomes.
    • Vectors that were not classified by crisp rules now have non-zero probabilities.
  • 36. Mushrooms
    • The Mushroom Guide: no simple rule for mushrooms; no rule like 'leaflets three, let it be' for Poisonous Oak and Ivy.
    8124 cases, 51.8% edible, the rest non-edible. 22 symbolic attributes, up to 12 values each, equivalent to 118 logical features, or 2^118 ≈ 3·10^35 possible input vectors. Odor: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy. Spore print color: black, brown, buff, chocolate, green, orange, purple, white, yellow. Safe rule for edible mushrooms: odor = (almond ∨ anise ∨ none) ∧ spore-print-color ≠ green; 48 errors, 99.41% correct. This is why animals have such a good sense of smell! What does it tell us about odor receptors?
  • 37. Mushrooms rules
    • To eat or not to eat, this is the question! Not any more ...
    A mushroom is poisonous if:
    R1) odor ≠ (almond ∨ anise ∨ none); 120 errors, 98.52%
    R2) spore-print-color = green; 48 errors, 99.41%
    R3) odor = none ∧ stalk-surface-below-ring = scaly ∧ stalk-color-above-ring ≠ brown; 8 errors, 99.90%
    R4) habitat = leaves ∧ cap-color = white; no errors!
    R1 + R2 are quite stable, found even with 10% of the data; R3 and R4 may be replaced by other rules, e.g.:
    R'3) gill-size = narrow ∧ stalk-surface-above-ring = (silky ∨ scaly)
    R'4) gill-size = narrow ∧ population = clustered
    Only 5 of 22 attributes used! Simplest possible rules? 100% in CV tests – the structure of this data is completely clear.
  • 38. Recurrence of breast cancer
    • Institute of Oncology, University Medical Center, Ljubljana.
    286 cases, 201 no-recurrence (70.3%), 85 recurrence cases (29.7%). 9 symbolic features: age (9 bins), tumor-size (12 bins), nodes-involved (13 bins), degree-malignant (1,2,3), area, radiation, menopause, node-caps. Example record: no-recurrence, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes. Many systems tried, 65-78% accuracy reported. Single rule: IF nodes-involved ∉ [0,2] ∧ degree-malignant = 3 THEN recurrence ELSE no-recurrence; 77% accuracy – only trivial knowledge in the data: highly malignant cancer involving many nodes is likely to strike back.
  • 39. Neurofuzzy system
    • Feature Space Mapping (FSM) neurofuzzy system.
    • Neural adaptation, estimation of probability density distribution (PDF) using single hidden layer network (RBF-like) with nodes realizing separable functions:
    Fuzzy: crisp membership x ∈ A (no/yes) is replaced by a degree of membership μ_A(x). Triangular, trapezoidal, Gaussian or other membership functions. Membership functions in many dimensions are built as products of one-dimensional factors (separable functions).
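    A minimal sketch (Python) of such a separable node: a product of one-dimensional Gaussian membership factors, one per feature; the centers and widths used here are illustrative:
        import numpy as np

        def separable_gaussian(x, centers, widths):
            # product of 1D Gaussian membership functions, one factor per feature
            factors = np.exp(-((x - centers) ** 2) / (2.0 * widths ** 2))
            return float(np.prod(factors))

        x = np.array([1.0, 2.0, 0.5])
        print(separable_gaussian(x, centers=np.array([1.0, 1.5, 0.0]),
                                 widths=np.array([0.5, 1.0, 1.0])))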
  • 40. FSM
    • Rectangular functions: simple rules are created, many nearly equivalent descriptions of this data exist.
    • If proline > 929.5 then class 1 (48 cases, 45 correct + 2 recovered by other rules).
    • If color < 3.79285 then class 2 (63 cases, 60 correct)
    • Interesting rules, but overall accuracy is only 88±9%
    Initialize using clusterization or decision trees. Triangular & Gaussian f. for fuzzy rules. Rectangular functions for crisp rules. Between 9-14 rules with triangular membership functions are created; accuracy in 10xCV tests about 96±4.5% Similar results obtained with Gaussian functions.
  • 41. Prototype-based rules
    • IF P = arg min_R D(X,R) THEN Class(X) = Class(P)
    C-rules (crisp rules) are a special case of F-rules (fuzzy rules); F-rules are a special case of P-rules (prototype-based rules). P-rules have the form given above: D(X,R) is a dissimilarity (distance) function determining the decision borders around prototype P. P-rules are easy to interpret! IF X = you are most similar to P = Superman THEN you are in the Super-league. IF X = you are most similar to P = Weakling THEN you are in the Failed-league. "Similar" may involve different features or a different D(X,P).
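    A minimal sketch (Python) of a P-rule classifier: assign X the class of its nearest prototype under a chosen dissimilarity D(X,P); the prototypes and labels are illustrative:
        import numpy as np

        def p_rule_classify(X, prototypes, proto_classes, D):
            # nearest-prototype rule: class of the prototype minimizing D(X, P)
            dists = np.array([D(X, P) for P in prototypes])
            return proto_classes[int(np.argmin(dists))]

        manhattan = lambda X, P: np.sum(np.abs(X - P))
        protos = np.array([[0.0, 0.0], [3.0, 3.0]])
        labels = ['Weakling', 'Superman']
        print(p_rule_classify(np.array([2.5, 2.0]), protos, labels, manhattan))   # Superman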
  • 42. P-rules Euclidean distance leads to Gaussian fuzzy membership functions with product as the T-norm. Manhattan distance => μ(X;P) = exp{−|X − P|}. Various distance functions lead to different membership functions, e.g. data-dependent distance functions for symbolic data:
  • 43. Promoters DNA strings, 57 nucleotides, 53 + and 53 − samples tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt Euclidean distance, symbolic s = a, c, t, g replaced by x = 1, 2, 3, 4; PDF distance, symbolic s = a, c, t, g replaced by p(s|+)
  • 44. P-rules New distance functions from information theory => interesting membership functions. MF => new distance function, with local D(X,R) for each cluster. Crisp logic rules: use the L∞ norm: D_Ch(X,P) = ||X − P||∞ = max_i W_i |X_i − P_i|; D_Ch(X,P) = const => rectangular contours. Chebyshev distance with threshold θ_P: IF D_Ch(X,P) ≤ θ_P THEN C(X) = C(P) is equivalent to the conjunctive crisp rule IF X_1 ∈ [P_1 − θ_P/W_1, P_1 + θ_P/W_1] ∧ … ∧ X_N ∈ [P_N − θ_P/W_N, P_N + θ_P/W_N] THEN C(X) = C(P).
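    A minimal sketch (Python) checking that equivalence numerically: the thresholded weighted Chebyshev distance gives the same decision as the conjunction of per-feature intervals (P, W and θ here are illustrative):
        import numpy as np

        def chebyshev_rule(X, P, W, theta):
            return np.max(W * np.abs(X - P)) <= theta

        def interval_rule(X, P, W, theta):
            lo, hi = P - theta / W, P + theta / W
            return np.all((lo <= X) & (X <= hi))

        rng = np.random.default_rng(0)
        P, W, theta = np.array([1.0, 2.0]), np.array([2.0, 0.5]), 1.0
        for _ in range(1000):
            X = rng.uniform(-2.0, 5.0, size=2)
            assert chebyshev_rule(X, P, W, theta) == interval_rule(X, P, W, theta)
        print("equivalent on all sampled points")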
  • 45. Decision borders Euclidean distance from 3 prototypes, one per class. Minkowski (α = 20) distance from 3 prototypes. Contours of D(P,X) = const and decision borders D(P,X) = D(Q,X).
  • 46. P-rules for Wine Manhattan distance: 6 prototypes kept, 4 errors, f2 removed. Chebyshev distance: 15 prototypes kept, 5 errors, f2, f8, f10 removed. Euclidean distance: 11 prototypes kept, 7 errors. Many other solutions.
  • 47. Neural networks
    • MLP – Multilayer Perceptrons, most popular NN models.
    • Use soft hyperplanes for discrimination.
    • Results are difficult to interpret, complex decision borders.
    • Prediction, approximation: infinite number of classes.
    • RBF – Radial Basis Functions.
    • RBF with Gaussian functions are equivalent to fuzzy systems with Gaussian membership functions, but …
    • No feature selection => complex rules.
    • Other radial functions => not separable!
    • Use separable functions, not radial => FSM.
    • Many methods to convert MLP NN to logical rules.
  • 48. Rules from MLPs
    • Why is it difficult?
    Multi-layer perceptron (MLP) networks: stack many perceptron units performing threshold logic. M-of-N rule: IF (M conditions out of N are true) THEN ... Problem: for N inputs the number of subsets is 2^N – an exponentially growing number of possible conjunctive rules.
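    A minimal sketch (Python) of the threshold logic mentioned above: a single perceptron-like unit implements an M-of-N rule with all weights equal to 1 and bias −M:
        import numpy as np

        def m_of_n_unit(conditions, M):
            # conditions: array of 0/1 truth values; fires iff at least M of them are true
            w = np.ones_like(conditions, dtype=float)
            return float(np.dot(w, conditions) - M >= 0)

        print(m_of_n_unit(np.array([1, 0, 1, 1]), M=2))   # 1.0 -> at least 2 of 4 hold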
  • 49. MLP2LN
    • Converts MLP neural networks into a network performing logical operations (LN).
    Diagram components: input layer; aggregation (better features); linguistic units (windows, filters); rule units (threshold logic); output (one node per class).
  • 50. MLP2LN training
    • Constructive algorithm: add as many nodes as needed.
    Optimize the cost function: minimize errors + enforce zero connections + leave only +1 and −1 weights; this makes interpretation easy.
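    The slide lists the ingredients of the cost function but not the formula. The sketch below (Python) shows one plausible form of such a penalty, assumed here rather than taken from the slide: a term pulling weights toward zero plus a term pulling the surviving weights toward ±1.
        import numpy as np

        def mlp2ln_style_penalty(weights, lam1=1e-3, lam2=1e-3):
            # assumed penalty form: lam1 * sum(w^2) pulls weights toward 0,
            # lam2 * sum(w^2 (w-1)^2 (w+1)^2) vanishes only at w in {0, +1, -1}
            w = np.asarray(weights, dtype=float)
            to_zero   = lam1 * np.sum(w ** 2)
            to_pm_one = lam2 * np.sum(w ** 2 * (w - 1.0) ** 2 * (w + 1.0) ** 2)
            return to_zero + to_pm_one

        print(mlp2ln_style_penalty([0.0, 1.0, -1.0, 0.4]))   # only the 0.4 weight contributes to the +/-1 term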
  • 51. L-units
    • Create linguistic variables.
    Numerical representation for R-nodes: V_sk = (…) for s_k = low, V_sk = (…) for s_k = normal. L-units: 2 thresholds as adaptive parameters; logistic σ(x) or tanh(x) ∈ [−1, +1]. Soft trapezoidal functions change into rectangular filters (Parzen windows). 4 types, depending on the signs S_i. A product of bi-central functions is a logical rule, used by the IncNet NN.
  • 52. Iris example
    • Network after training:
    iris setosa: q=1 (0,0,0; 0,0,0; +1,0,0; +1,0,0); iris versicolor: q=2 (0,0,0; 0,0,0; 0,+1,0; 0,+1,0); iris virginica: q=1 (0,0,0; 0,0,0; 0,0,+1; 0,0,+1). Rules: If (x3 = s ∧ x4 = s) setosa; If (x3 = m ∧ x4 = m) versicolor; If (x3 = l ∧ x4 = l) virginica. Only 3 errors (98%).
  • 53. Learning dynamics Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.
  • 54. Thyroid screening
    • Garavan Institute, Sydney, Australia
    • 15 binary, 6 continuous
    • Training: 93+191+3488 Validate: 73+177+3178
    • Determine important clinical factors
    • Calculate prob. of each diagnosis.
    Network diagram: clinical findings (age, sex, TSH, T3, TT4, T4U, TBG, …) feed hidden units whose outputs give the final diagnoses: normal, hyperthyroid, hypothyroid.
  • 55. Thyroid – some results. Accuracy of diagnoses obtained with several systems – rules are accurate.
    Method                     Rules/Features   Training %   Test %
    MLP2LN optimized           4/6              99.9         99.36
    CART/SSV Decision Trees    3/5              99.8         99.33
    Best Backprop MLP          -/21             100          98.5
    Naïve Bayes                -/-              97.0         96.1
    k-nearest neighbors        -/-              -            93.8
  • 56. Psychometry
    • Use CI to find knowledge, create Expert System.
    • MMPI (Minnesota Multiphasic Personality Inventory) psychometric test.
    • Printed forms are scanned or computerized version of the test is used.
    • Raw data: 550 questions, ex: I am getting tired quickly: Yes - Don’t know - No
    • Results are combined into 10 clinical scales and 4 validity scales using fixed coefficients.
    • Each scale measures tendencies towards hypochondria, schizophrenia, psychopathic deviations, depression, hysteria, paranoia etc.
  • 57. Scanned form
  • 58. Computer input
  • 59. Scales
  • 60. Psychometry: goal
    • There is no simple correlation between single values and final diagnosis.
    • Results are displayed in the form of a histogram, called a 'psychogram'. Interpretation depends on the experience and skill of an expert, and takes into account correlations between peaks.
    Goal: an expert system providing evaluation and interpretation of MMPI tests at an expert level. Problem: agreement between experts only about 70% of the time; alternative diagnoses and personality changes over time are important.
  • 61. Psychogram
  • 62. Psychometric data
    • 1600 cases for women, the same number for men.
    • 27 classes: norm, psychopathic, schizophrenia, paranoia, neurosis, mania, simulation, alcoholism, drug addiction, criminal tendencies, abnormal behavior due to ...
    Extraction of logical rules: 14 scales = features. Define linguistic variables and use FSM, MLP2LN, SSV - giving about 2-3 rules/class.
  • 63. Psychometric results 10-fold CV accuracy for FSM is 82-85%, for C4.5 79-84%. Adding input uncertainty G_x of about 1.5% (best ROC) improves FSM results to 90-92%.
    Method   Data   N. rules   Accuracy %   + G_x %
    C4.5     ♀      55         93.0         93.7
    C4.5     ♂      61         92.5         93.1
    FSM      ♀      69         95.4         97.6
    FSM      ♂      98         95.9         96.9
  • 64. Psychometric Expert
    • Probabilities for different classes. For greater uncertainties more classes are predicted.
    • Fitting the rules to the conditions:
    • typically 3-5 conditions per rule, Gaussian distributions around measured values that fall into the rule interval are shown in green.
    • Verbal interpretation of each case, rule and scale dependent.
  • 65. MMPI probabilities
  • 66. MMPI rules
  • 67. MMPI verbal comments
  • 68. Visualization
    • Probability of classes versus input uncertainty.
    • Detailed input probabilities around the measured values vs. changes in a single scale; changes over time define the 'patient's trajectory'.
    • Interactive multidimensional scaling: zooming on the new case to inspect its similarity to other cases.
  • 69. Class probability/uncertainty
  • 70. Class probability/feature
  • 71. MDS visualization
  • 72. Summary
    • Computational intelligence methods: neural, decision trees, similarity-based & other, help to understand the data.
    • Understanding data: achieved by rules, prototypes, visualization.
    • Small is beautiful => simple is the best!
    • Simplest possible, but not simpler - regularization of models; accurate but not too accurate - handling of uncertainty;
    • high confidence, but not paranoid - rejecting some cases.
    • Challenges:
    • hierarchical systems, discovery of theories rather than data models, integration with image/signal analysis, reasoning in complex domains/objects, applications in bioinformatics, text analysis ...
  • 73. References
    • Many papers, comparison of results for numerous datasets are kept at:
    • http://www.phys.uni.torun.pl/kmk
    • See also my homepage at:
    • http://www.phys.uni.torun.pl/ ~duch
    • for this and other presentations and some papers.
    We are slowly getting there. All this and more is included in GhostMiner, data mining software (developed in collaboration with Fujitsu), just released … http://www.fqspl.com.pl/ghostminer/