Computational Intelligence for Data Mining
Włodzisław Duch, Department of Informatics, Nicholas Copernicus University, Toruń, Poland
With help from R. Adamczak, K. Grąbczewski, K. Grudziński, N. Jankowski, A. Naud
http://www.phys.uni.torun.pl/kmk
WCCI 2002, Honolulu, HI
Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars. Task: recognize the source of a wine sample. 13 quantities measured, continuous features:
SOM mappings: popular for visualization, but rather inaccurate, with no measure of distortions. Measure of topographical distortions: map all points X_i from R^n to points x_i in R^m, m < n, and ask: how well are the distances R_ij = D(X_i, X_j) reproduced by the distances r_ij = d(x_i, x_j)? Use m = 2 for visualization, higher m for dimensionality reduction.
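The distance-reproduction question above can be turned into a number. A minimal sketch, using a plain normalized MDS-style stress (the slide does not fix a particular weighting, so this choice is an assumption):

```python
import numpy as np

def stress(X, x):
    """Topographical distortion: compare pairwise distances R_ij = D(X_i, X_j)
    in the original space with r_ij = d(x_i, x_j) in the reduced space.
    0 means all distances are reproduced exactly."""
    R = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    iu = np.triu_indices(len(X), k=1)               # each pair counted once
    return np.sum((R[iu] - r[iu]) ** 2) / np.sum(R[iu] ** 2)

# A perfect embedding of the data onto itself has zero stress:
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(stress(X, X))       # 0.0
print(stress(X, 2 * X))   # > 0: uniform stretching distorts distances
```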
Simplest things first: use decision tree to find logical rules.
Test single attribute, find good point to split the data, separating vectors from different classes. DT advantages: fast, simple, easy to understand, easy to program, many good algorithms.
Separability Split Value tree: based on the separability criterion.
Define the left and right sides of the splits. SSV criterion: separate as many pairs of vectors from different classes as possible; minimize the number of separated pairs from the same class.
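The SSV criterion for a single threshold can be sketched as follows: count cross-class pairs split by the threshold, penalize split same-class pairs. This is a simplified illustration of the criterion itself, not of the full tree-building search:

```python
from collections import Counter

def ssv(values, labels, split):
    """Separability Split Value for one threshold on one feature:
    reward pairs from different classes falling on opposite sides of the
    split; penalize same-class pairs that the split separates."""
    left = Counter(l for v, l in zip(values, labels) if v <= split)
    right = Counter(l for v, l in zip(values, labels) if v > split)
    nl, nr = sum(left.values()), sum(right.values())
    same_pairs = sum(left[c] * right[c] for c in left)   # same class, split apart
    diff_pairs = nl * nr - same_pairs                    # different classes, split apart
    return diff_pairs - same_pairs

vals = [0.5, 0.9, 2.1, 2.5]
labs = ['A', 'A', 'B', 'B']
print(ssv(vals, labs, 1.5))  # 4: all four cross-class pairs separated, none broken
print(ssv(vals, labs, 0.7))  # 1: only partial separation, one A-pair broken
```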
Crisp logic rules: for continuous x use linguistic variables (predicate functions).
s_k(x) ≡ True[X_k ≤ x ≤ X'_k], for example:
small(x) = True{x | x < 1}
medium(x) = True{x | x ∈ [1, 2]}
large(x) = True{x | x > 2}
Linguistic variables are used in crisp (propositional, Boolean) logic rules:
IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
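The linguistic variables above are just Boolean predicates; a crisp rule is their conjunction. A minimal sketch (the Brownie attributes are the slide's own toy example):

```python
def small(x):  return x < 1          # True{x | x < 1}
def medium(x): return 1 <= x <= 2    # True{x | x in [1, 2]}
def large(x):  return x > 2          # True{x | x > 2}

def is_brownie(height, has_hat, has_beard):
    """Crisp rule: conjunction of linguistic-variable predicates."""
    return small(height) and has_hat and has_beard

print(is_brownie(0.5, True, True))   # True
print(is_brownie(1.5, True, True))   # False: height is medium, not small
```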
Crisp logic is based on rectangular membership functions:
True/False values jump from 0 to 1. Step functions are used for partitioning of the feature space, giving very simple hyper-rectangular decision borders. This is a severe limitation on the expressive power of crisp logical rules!
Data has been measured with unknown error. Assume a Gaussian error distribution:
x becomes a fuzzy number with a Gaussian membership function. A set of logical rules R is applied to fuzzy input vectors: Monte Carlo simulations for an arbitrary system => p(C_i|X). Analytical evaluation of p(C|X) is based on the cumulative distribution; the error function is approximated by the logistic function to within < 0.02.
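The Monte Carlo route mentioned above is straightforward: sample the Gaussian fuzzy input, push each sample through the crisp rules, and count class frequencies. A minimal sketch with a hypothetical one-feature rule set and threshold:

```python
import random

def rule_class(x):
    """A toy crisp rule set (hypothetical threshold, for illustration only)."""
    return 'C1' if x > 2.0 else 'C2'

def p_class_mc(x, sigma, n=100_000, seed=0):
    """Monte Carlo estimate of p(C1|X) when the measurement x is treated
    as a Gaussian fuzzy number with dispersion sigma."""
    rng = random.Random(seed)
    hits = sum(rule_class(rng.gauss(x, sigma)) == 'C1' for _ in range(n))
    return hits / n

# A measurement sitting exactly on the threshold gives p(C1|x) near 0.5,
# instead of the brittle all-or-nothing answer of the crisp rule:
print(p_class_mc(2.0, 0.3))
print(p_class_mc(3.0, 0.3))  # far from the border: probability near 1
```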
p_{++} is a hit; p_{-+} a false alarm; p_{+-} a miss.
Sensitivity S_+(M) = p(+|+) = p_{++}/p_+
Specificity S_-(M) = p(-|-) = p_{--}/p_-
Rejection rate R(M) = p_{+r} + p_{-r} = 1 - L(M) - A(M)
Error rate L(M) = p_{+-} + p_{-+}
Accuracy (overall) A(M) = p_{++} + p_{--}
The overall accuracy is equal to a combination of sensitivity and specificity weighted by the a priori probabilities:
A(M) = p_+ S_+(M) + p_- S_-(M)
Optimization of rules for the C_+ class: large γ means no errors but a high rejection rate.
E(M; γ) = γ L(M) - A(M) = γ (p_{+-} + p_{-+}) - (p_{++} + p_{--})
min_M E(M; γ)  <=>  min_M {(1+γ) L(M) + R(M)}
Optimization with different costs of errors:
min_M E(M; α) = min_M {p_{+-} + α p_{-+}}
= min_M { p_+ (1 - S_+(M)) - p_{+r}(M) + α [ p_- (1 - S_-(M)) - p_{-r}(M) ] }
ROC (Receiver Operating Characteristic) curve: p_{++}(p_{-+}), hit rate as a function of the false-alarm rate.
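These quantities are easy to compute from the joint probabilities, and the weighted identity A = p_+ S_+ + p_- S_- can be checked directly. A minimal sketch (argument order and names are my own):

```python
def rule_metrics(p_pp, p_pm, p_mp, p_mm, p_pr=0.0, p_mr=0.0):
    """Sensitivity, specificity, accuracy, error and rejection rates from
    joint probabilities p_xy (true class x, prediction y, r = rejected)."""
    p_plus, p_minus = p_pp + p_pm + p_pr, p_mp + p_mm + p_mr
    S_plus = p_pp / p_plus        # sensitivity S+(M)
    S_minus = p_mm / p_minus      # specificity S-(M)
    A = p_pp + p_mm               # overall accuracy A(M)
    L = p_pm + p_mp               # error rate L(M)
    R = p_pr + p_mr               # rejection rate R(M)
    assert abs(A + L + R - 1.0) < 1e-12
    return S_plus, S_minus, A, L, R

S_p, S_m, A, L, R = rule_metrics(0.45, 0.05, 0.10, 0.40)
# A(M) equals the a-priori-weighted combination p+ S+ + p- S-:
print(A, 0.5 * S_p + 0.5 * S_m)  # 0.85 0.85
```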
Rule R_a(x) = {x > a} is fulfilled by G_x with probability:
The error function is approximated by the logistic function; assuming the error distribution σ(x)(1 - σ(x)), for s² = 1.7 it approximates the Gaussian to within < 3.5%. Rule R_ab(x) = {b > x ≥ a} is fulfilled by G_x with probability:
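The logistic-for-Gaussian substitution can be checked numerically. A sketch assuming the matching slope β ≈ 1.7 from the slide; the bound on the deviation is verified over a grid rather than derived:

```python
import math

def gauss_cdf(z):
    """Standard Gaussian cumulative distribution via erf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logistic(z, beta=1.7):
    """Logistic approximation to the Gaussian CDF, slope beta ~ 1.7."""
    return 1 / (1 + math.exp(-beta * z))

def p_rule_a(x, a, sigma):
    """Probability that rule R_a = {x' > a} holds for a Gaussian fuzzy
    number centred at x with dispersion sigma (logistic approximation)."""
    return logistic((x - a) / sigma)

# the two cumulative curves stay within the slide's quoted tolerance:
err = max(abs(gauss_cdf(z) - logistic(z))
          for z in (i / 100 for i in range(-500, 501)))
print(err < 0.035)          # True
print(p_rule_a(2.0, 2.0, 0.3))  # 0.5: on the rule border, half the mass passes
```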
The Mushroom Guide: no simple rule for mushrooms; no rule like 'leaflets three, let it be' for Poisonous Oak and Ivy.
8124 cases, 51.8% edible, the rest non-edible. 22 symbolic attributes, up to 12 values each, equivalent to 118 logical features, or 2^118 ≈ 3·10^35 possible input vectors.
Odor: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy.
Spore print color: black, brown, buff, chocolate, green, orange, purple, white, yellow.
Safe rule for edible mushrooms:
odor = (almond ∨ anise ∨ none) ∧ spore-print-color = ¬green
48 errors, 99.41% correct.
This is why animals have such a good sense of smell! What does it tell us about odor receptors?
To eat or not to eat, this is the question! Not any more ...
A mushroom is poisonous if:
R1) odor = ¬(almond ∨ anise ∨ none); 120 errors, 98.52%
R2) spore-print-color = green; 48 errors, 99.41%
R3) odor = none ∧ stalk-surface-below-ring = scaly ∧ stalk-color-above-ring = ¬brown; 8 errors, 99.90%
R4) habitat = leaves ∧ cap-color = white; no errors!
R1 + R2 are quite stable, found even with 10% of the data; R3 and R4 may be replaced by other rules, e.g.:
R'3) gill-size = narrow ∧ stalk-surface-above-ring = (silky ∨ scaly)
R'4) gill-size = narrow ∧ population = clustered
Only 5 of 22 attributes used! Simplest possible rules? 100% in CV tests – the structure of this data is completely clear.
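The four rules translate directly into code. A sketch applying them in order to one sample, with attribute names taken from the UCI mushroom data (the sample values here are invented for illustration):

```python
def poisonous(m):
    """Apply rules R1-R4 from the slide; m maps attribute name -> value."""
    if m['odor'] not in ('almond', 'anise', 'none'):            # R1
        return True
    if m['spore-print-color'] == 'green':                       # R2
        return True
    if (m['odor'] == 'none'                                     # R3
            and m['stalk-surface-below-ring'] == 'scaly'
            and m['stalk-color-above-ring'] != 'brown'):
        return True
    if m['habitat'] == 'leaves' and m['cap-color'] == 'white':  # R4
        return True
    return False

sample = {'odor': 'foul', 'spore-print-color': 'white',
          'stalk-surface-below-ring': 'smooth',
          'stalk-color-above-ring': 'brown',
          'habitat': 'grasses', 'cap-color': 'brown'}
print(poisonous(sample))  # True: the foul odor alone triggers R1
```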
Neural adaptation, estimation of the probability density function (PDF) using a single-hidden-layer network (RBF-like) with nodes realizing separable functions:
Fuzzy logic: crisp membership x ∈ A (no/yes) is replaced by a degree of membership μ_A(x). Triangular, trapezoidal, Gaussian or other membership functions. Membership functions in many dimensions:
Rectangular functions: simple rules are created, many nearly equivalent descriptions of this data exist.
If proline > 929.5 then class 1 (48 cases, 45 correct + 2 recovered by other rules).
If color < 3.79285 then class 2 (63 cases, 60 correct).
Interesting rules, but overall accuracy is only 88 ± 9%.
Initialize using clustering or decision trees. Triangular and Gaussian functions for fuzzy rules; rectangular functions for crisp rules. Between 9 and 14 rules with triangular membership functions are created; accuracy in 10×CV tests is about 96 ± 4.5%. Similar results are obtained with Gaussian functions.
C-rules (crisp rules) are a special case of F-rules (fuzzy rules). F-rules are a special case of P-rules (prototype rules). P-rules have the form: D(X,P) is a dissimilarity (distance) function, determining decision borders around prototype P. P-rules are easy to interpret!
IF X = You are most similar to P = Superman THEN You are in the Super-league.
IF X = You are most similar to P = Weakling THEN You are in the Failed-league.
'Similar' may involve different features or a different D(X,P).
P-rules: Euclidean distance leads to Gaussian fuzzy membership functions with product as the T-norm. Manhattan distance => μ(X;P) = exp{-|X-P|}. Various distance functions lead to different membership functions, e.g. data-dependent distance functions for symbolic data:
Promoters: DNA strings, 57 nucleotides, 53 + and 53 − samples.
tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt
Euclidean distance: symbols s = a, c, t, g replaced by x = 1, 2, 3, 4.
PDF distance: symbols s = a, c, t, g replaced by p(s|+).
P-rules: new distance functions from information theory => interesting membership functions. MF => new distance function, with local D(X,P) for each cluster.
Crisp logic rules: use the L_∞ norm:
D_Ch(X,P) = ||X − P||_∞ = max_i W_i |X_i − P_i|
D_Ch(X,P) = const => rectangular contours.
Chebyshev distance with threshold θ_P:
IF D_Ch(X,P) ≤ θ_P THEN C(X) = C(P)
is equivalent to the conjunctive crisp rule
IF X_1 ∈ [P_1 − θ_P/W_1, P_1 + θ_P/W_1] ∧ … ∧ X_N ∈ [P_N − θ_P/W_N, P_N + θ_P/W_N] THEN C(X) = C(P)
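The equivalence between the Chebyshev P-rule and the conjunctive crisp rule can be verified directly, since both describe the same weighted hyper-box around P. A sketch with arbitrary illustrative values of P, W and θ:

```python
def chebyshev(X, P, W):
    """Weighted L_inf (Chebyshev) distance: max_i W_i |X_i - P_i|."""
    return max(w * abs(x - p) for x, p, w in zip(X, P, W))

def p_rule(X, P, W, theta):
    """Prototype rule: IF D_Ch(X,P) <= theta THEN class of P."""
    return chebyshev(X, P, W) <= theta

def crisp_rule(X, P, W, theta):
    """Equivalent conjunction: X_i in [P_i - theta/W_i, P_i + theta/W_i]."""
    return all(p - theta / w <= x <= p + theta / w
               for x, p, w in zip(X, P, W))

P, W, theta = [1.0, 2.0], [1.0, 0.5], 1.0
for X in ([1.5, 2.5], [2.5, 2.0], [1.0, 4.5]):
    assert p_rule(X, P, W, theta) == crisp_rule(X, P, W, theta)
print("both forms agree on all test points")
```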
Decision borders: Euclidean distance from 3 prototypes, one per class; Minkowski distance with exponent α = 20 from 3 prototypes. D(P,X) = const contours and decision borders D(P,X) = D(Q,X).
Multi-layer perceptron (MLP) networks: stack many perceptron units performing threshold logic. M-of-N rule: IF (M conditions out of N are true) THEN ... Problem: for N inputs the number of subsets is 2^N, an exponentially growing number of possible conjunctive rules.
Numerical representation for R-nodes:
V_sk = (…) for s_k = low
V_sk = (…) for s_k = normal
L-units: 2 thresholds as adaptive parameters; logistic σ(x) or tanh(x). Soft trapezoidal functions change into rectangular filters (Parzen windows); 4 types, depending on the signs S_i. A product of bi-central functions is a logical rule, used by the IncNet NN.
iris setosa: q = 1 (0,0,0; 0,0,0; +1,0,0; +1,0,0)
iris versicolor: q = 2 (0,0,0; 0,0,0; 0,+1,0; 0,+1,0)
iris virginica: q = 1 (0,0,0; 0,0,0; 0,0,+1; 0,0,+1)
Rules:
If (x_3 = s ∧ x_4 = s) then setosa
If (x_3 = m ∧ x_4 = m) then versicolor
If (x_3 = l ∧ x_4 = l) then virginica
3 errors only (98%).
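The three Iris rules use only petal length (x_3) and petal width (x_4) discretized into small/medium/large. A sketch with placeholder cut points, since the trained thresholds are not listed on the slide:

```python
def linguistic(x, low, high):
    """Map a continuous feature to s(mall)/m(edium)/l(arge) using two
    cut points (hypothetical values, not the trained ones)."""
    return 's' if x < low else ('m' if x <= high else 'l')

def iris_rule(x3, x4, cuts3=(2.5, 4.9), cuts4=(0.9, 1.7)):
    """The three crisp rules from the slide on petal length x3, width x4."""
    L3, L4 = linguistic(x3, *cuts3), linguistic(x4, *cuts4)
    if L3 == 's' and L4 == 's': return 'setosa'
    if L3 == 'm' and L4 == 'm': return 'versicolor'
    if L3 == 'l' and L4 == 'l': return 'virginica'
    return 'unknown'

print(iris_rule(1.4, 0.2))  # setosa
print(iris_rule(4.5, 1.4))  # versicolor
print(iris_rule(6.0, 2.3))  # virginica
```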
Learning dynamics: decision regions shown every 200 training epochs in x_3, x_4 coordinates; borders are optimally placed with wide margins.
[Network diagram: clinical findings (age, sex, TSH, T3, TT4, T4U, TBG, …) feed hidden units that output the final diagnoses: normal, hyperthyroid, hypothyroid.]
Thyroid – some results. Accuracy of diagnoses obtained with several systems – rules are accurate.

Method                    Rules/Features  Training %  Test %
MLP2LN optimized          4/6             99.9        99.36
CART/SSV Decision Trees   3/5             99.8        99.33
Best Backprop MLP         –/21            100         98.5
Naïve Bayes               –/–             97.0        96.1
k-nearest neighbors       –/–             –           93.8
There is no simple correlation between single values and final diagnosis.
Results are displayed in the form of a histogram called a 'psychogram'. Interpretation depends on the experience and skill of an expert, and takes into account correlations between peaks.
Goal: an expert system providing evaluation and interpretation of MMPI tests at an expert level. Problem: experts agree only about 70% of the time; alternative diagnoses and personality changes over time are important.
27 classes: norm, psychopathic, schizophrenia, paranoia, neurosis, mania, simulation, alcoholism, drug addiction, criminal tendencies, abnormal behavior due to ...
Extraction of logical rules: 14 scales = features. Define linguistic variables and use FSM, MLP2LN, SSV – giving about 2-3 rules per class.
Psychometric results: 10-CV accuracy for FSM is 82-85%, for C4.5 79-84%. Adding input uncertainty ±Gx of about 1.5% (best ROC) improves FSM results to 90-92%.

Method  Data  N. rules  Accuracy %  +Gx %
C4.5    ♀     55        93.0        93.7
C4.5    ♂     61        92.5        93.1
FSM     ♀     69        95.4        97.6
FSM     ♂     98        95.9        96.9
Computational intelligence methods: neural, decision trees, similarity-based & other, help to understand the data.
Understanding data: achieved by rules, prototypes, visualization.
Small is beautiful => simple is best!
Simplest possible, but not simpler - regularization of models; accurate but not too accurate - handling of uncertainty;
high confidence, but not paranoid - rejecting some cases.
Challenges:
hierarchical systems, discovery of theories rather than data models, integration with image/signal analysis, reasoning in complex domains/objects, applications in bioinformatics, text analysis ...
Many papers and comparisons of results for numerous datasets are kept at:
http://www.phys.uni.torun.pl/kmk
See also my homepage at:
http://www.phys.uni.torun.pl/ ~duch
for this and other presentations and some papers.
We are slowly getting there. All this and more is included in GhostMiner, data mining software developed in collaboration with Fujitsu, just released: http://www.fqspl.com.pl/ghostminer/