On cascading small decision trees



  1. 1. On Cascading Small Decision Trees Julià Minguillón Combinatorics and Digital Communications Group (CCD) Autonomous University of Barcelona (UAB) Barcelona, Spain http://www.tesisenxarxa.net/TESIS_UAB/AVAILABLE/TDX-1209102-150635/jma1de1.pdf
  2. 2. Table of contents <ul><li>Introduction </li></ul><ul><li>Decision trees </li></ul><ul><li>Combining classifiers </li></ul><ul><li>Experimental results </li></ul><ul><li>Theoretical issues </li></ul><ul><li>Conclusions </li></ul><ul><li>Further research </li></ul><ul><li>References </li></ul>
  3. 3. Introduction <ul><li>Main goal: to build simple and fast classifiers for data mining </li></ul><ul><li>Partial goals: </li></ul><ul><ul><li>To reduce both training and exploitation costs </li></ul></ul><ul><ul><li>To increase classification accuracy </li></ul></ul><ul><ul><li>To permit partial classification </li></ul></ul><ul><li>Several classification systems could be used: decision trees, neural networks, support vector machines, nearest neighbour classifier, etc. </li></ul>
  4. 4. Decision trees <ul><li>Introduced by Quinlan in 1983 and developed by Breiman et al. in 1984 </li></ul><ul><li>Decision trees reproduce the way humans make decisions: a path of questions is followed from the input sample to the output label </li></ul><ul><li>Decision trees are based on recursive partitioning of the input space, trying to separate elements from different classes </li></ul><ul><li>Supervised training: labelled data is used for training </li></ul>
  5. 5. Why decision trees? <ul><li>Natural handling of data of mixed types </li></ul><ul><li>Handling of missing values </li></ul><ul><li>Robustness to outliers in input space </li></ul><ul><li>Insensitive to monotone transformations </li></ul><ul><li>Computational scalability </li></ul><ul><li>Ability to deal with irrelevant inputs </li></ul><ul><li>Interpretability </li></ul>
  6. 6. Growing decision trees (binary) <ul><li>T=(data set) /* initially the tree is a single leaf */ </li></ul><ul><li>while stoppingCriterion(T) is false </li></ul><ul><li>select t from T maximising selectionCriterion(t) </li></ul><ul><li>split t=(t_L, t_R) maximising splittingCriterion(t, t_L, t_R) </li></ul><ul><li>replace t in T with (t_L, t_R) </li></ul><ul><li>end </li></ul><ul><li>prune back T using the BFOS algorithm </li></ul><ul><li>choose T’ minimising classification error on (data set’) </li></ul>
  7. 7. Growing algorithm parameters <ul><li>The computed decision tree is determined by: </li></ul><ul><ul><li>Stopping criterion </li></ul></ul><ul><ul><li>Node selection criterion </li></ul></ul><ul><ul><li>Splitting criterion </li></ul></ul><ul><ul><li>Labelling rule </li></ul></ul><ul><li>If a perfect decision tree is built and then it is pruned back, both the stopping and the node selection criteria become irrelevant </li></ul>
  8. 8. Splitting criterion <ul><li>Measures the gain of a split for a given criterion </li></ul><ul><li>Usually related to the concept of impurity </li></ul><ul><li>Classification performance may be very sensitive to the chosen criterion </li></ul><ul><li>Entropy and R-norm criteria yield the best results on average; the Bayes error criterion yields the worst </li></ul><ul><li>Different kinds of splits: </li></ul><ul><ul><li>Orthogonal hyperplanes: fast, interpretable, poor performance </li></ul></ul><ul><ul><li>General hyperplanes: expensive, partially interpretable </li></ul></ul><ul><ul><li>Distance based (spherical trees): expensive, allow clustering </li></ul></ul>
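As an illustration of an impurity-based splitting criterion, the following is a minimal sketch of the entropy gain of an orthogonal split on a single feature. The function names (`entropy`, `split_gain`, `best_threshold`) and the choice of midpoints as candidate thresholds are assumptions made for this example, not taken from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(values, labels, threshold):
    """Impurity decrease obtained by the orthogonal split `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return (threshold, gain).

    Raises ValueError if all values are identical (no candidate split exists).
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(((t, split_gain(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])
```

For a feature with values `[1, 2, 3, 4]` and labels `[0, 0, 1, 1]`, the split at 2.5 separates the classes perfectly, so its gain equals the full entropy of the node.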
  9. 9. Labelling rule <ul><li>Each leaf t is labelled in order to minimise misclassification error: </li></ul><ul><li>$l(t) = \arg\min_j \left\{ r(t) = \sum_{k=0}^{K-1} C(j,k)\, p(k|t) \right\}$ </li></ul><ul><li>Different classification costs C(j,k) are allowed </li></ul><ul><li>A priori class probabilities may be included </li></ul><ul><li>Margin is defined as $1 - 2\,r(t)$, or also as </li></ul><ul><li>$\max_k \{ p(k|t) \} - \operatorname{2nd\,max}_k \{ p(k|t) \}$ </li></ul>
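The labelling rule and both margin definitions can be sketched in a few lines. This is an illustrative example only: the function names are hypothetical, and `cost[j][k]` is assumed to hold C(j,k), the cost of assigning label j to a sample of class k.

```python
def label_leaf(probs, cost):
    """Return (label, risk) where label j minimises sum_k cost[j][k] * probs[k]."""
    risks = [sum(c_jk * p_k for c_jk, p_k in zip(row, probs)) for row in cost]
    label = min(range(len(risks)), key=risks.__getitem__)
    return label, risks[label]

def margin_from_risk(risk):
    """First definition of the margin: 1 - 2 r(t)."""
    return 1.0 - 2.0 * risk

def margin_top_two(probs):
    """Second definition: largest minus second largest class probability."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second
```

With 0-1 costs and two classes the two margin definitions coincide; with more classes they generally differ, for example 0.4 versus 0.5 for class probabilities (0.7, 0.2, 0.1).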
  10. 10. Problems <ul><li>Repetition, replication and fragmentation </li></ul><ul><li>Poor performance for large data dimensionality or large number of classes </li></ul><ul><li>Orthogonal splits may lead to poor classification performance due to poor internal decision functions </li></ul><ul><li>Overfitting may occur for large decision trees </li></ul><ul><li>Training is very expensive for large data sets </li></ul><ul><li>Decision trees are unstable classifiers </li></ul>
  11. 11. Progressive decision trees <ul><li>Goal: to overcome some problems related to the use of classical decision trees </li></ul><ul><li>Basic idea: to break the classification problem into a sequence of partial classification problems, from easier to more difficult </li></ul><ul><li>Only small decision trees are used: </li></ul><ul><ul><li>Avoid overfitting </li></ul></ul><ul><ul><li>Reduce both training and exploitation costs </li></ul></ul><ul><ul><li>Permit partial classification </li></ul></ul><ul><ul><li>Detect possible outliers </li></ul></ul><ul><li>Decision trees become decision graphs </li></ul>
  12. 12. Growing progressive decision trees <ul><li>Build a complete decision tree of depth d </li></ul><ul><li>Prune it using the BFOS algorithm </li></ul><ul><li>Relabel it using the new labelling rule: a leaf is labelled as mixed if its margin is not large enough (i.e. smaller than a given threshold) </li></ul><ul><li>Join all regions labelled as mixed </li></ul><ul><li>Start again using only the mixed regions </li></ul>
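The growing loop above can be sketched as follows. This is a rough outline under stated assumptions, not the author's implementation: `fit_small_tree` is a hypothetical, caller-supplied trainer (standing in for "build a tree of depth d and prune it") that returns a function mapping a sample to a leaf id together with the class probabilities of each leaf; the top-two margin is used, and `threshold` and `max_stages` are illustrative parameter names.

```python
MIXED = "M"

def relabel(leaf_probs, threshold):
    """Label each leaf with its most probable class, or MIXED if its margin < threshold."""
    labels = {}
    for leaf, probs in leaf_probs.items():
        ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
        margin = probs[ranked[0]] - probs[ranked[1]]
        labels[leaf] = ranked[0] if margin >= threshold else MIXED
    return labels

def grow_progressive(samples, fit_small_tree, threshold, max_stages):
    """Train up to max_stages small trees, each on the samples left mixed by the previous one.

    samples: list of (x, y) pairs.
    fit_small_tree(samples) -> (assign, leaf_probs): hypothetical trainer, where
    assign(x) is the leaf id of x and leaf_probs maps leaf ids to class probabilities.
    """
    stages = []
    for _ in range(max_stages):
        if not samples:
            break
        assign, leaf_probs = fit_small_tree(samples)
        labels = relabel(leaf_probs, threshold)
        stages.append((assign, labels))
        # Join all mixed regions: only their samples reach the next stage.
        samples = [(x, y) for x, y in samples if labels[assign(x)] == MIXED]
    return stages
```

Each stage stores its leaf labels, so at classification time a sample is passed from stage to stage until it reaches a leaf that is not mixed.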
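  13. 13. Example (I) [Figure: tree diagram with leaves labelled 0, 1 and M (mixed)]
  14. 14. Example (II) [Figure: tree diagram with leaves labelled 0, 1 and M (mixed)]
  15. 15. Example (III) [Figure: tree diagram with leaves labelled 0, 1 and M (mixed)]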
  16. 16. Combining classifiers <ul><li>Basic idea: instead of building a complex classifier, build several simple classifiers and combine them into a more complex one </li></ul><ul><li>Several paradigms: </li></ul><ul><ul><li>Voting: bagging, boosting, randomising </li></ul></ul><ul><ul><li>Stacking </li></ul></ul><ul><ul><li>Cascading </li></ul></ul><ul><li>Why do they work? Because different classifiers make different kinds of mistakes </li></ul><ul><li>Different classifiers are built by using different training sets </li></ul>
  17. 17. Cascading generalization <ul><li>Developed by Gama et al. in 2000 </li></ul><ul><li>Basic idea: simple classifiers are combined sequentially, carrying over information from one classifier to the next in the sequence </li></ul><ul><li>Three types of cascading ensembles: </li></ul><ul><ul><li>Type A: no additional info, mixed class </li></ul></ul><ul><ul><li>Type B: additional info, no mixed class </li></ul></ul><ul><ul><li>Type C: additional info, mixed class </li></ul></ul>
  18. 18. Type A progressive decision trees <ul><li>No additional info is carried from one stage to the next, but only samples labelled as mixed are passed down: </li></ul>[Diagram with labels D, T, Y and D’]
  19. 19. Type B progressive decision trees <ul><li>Additional info (estimated class probabilities and margin) is computed for each sample, and all samples are passed down: </li></ul>[Diagram with labels D, T, Y and D’]
  20. 20. Type C progressive decision trees <ul><li>Additional info is computed for each sample, and only samples labelled as mixed are passed down: </li></ul>[Diagram with labels D, T, Y and D’]
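The difference between the three types lies in how the data set for the next stage is built from the current one. The sketch below illustrates this under assumptions: the function name, the tuple layout of `outputs`, and the choice of appending the extra information as additional features are illustrative, not taken from the slides.

```python
MIXED = "M"

def next_stage_data(samples, outputs, kind):
    """Build the data set passed to the next stage of the cascade.

    samples: list of (features, true_label) pairs.
    outputs: list of (label, probs, margin) produced by the current tree, one per sample.
    kind: "A", "B" or "C".
    """
    if kind not in ("A", "B", "C"):
        raise ValueError("kind must be 'A', 'B' or 'C'")
    keep_only_mixed = kind in ("A", "C")   # types A and C pass down mixed samples only
    add_info = kind in ("B", "C")          # types B and C carry over probabilities and margin
    result = []
    for (features, y), (label, probs, margin) in zip(samples, outputs):
        if keep_only_mixed and label != MIXED:
            continue
        extra = list(probs) + [margin] if add_info else []
        result.append((list(features) + extra, y))
    return result
```

Type A shrinks the data set without changing its dimension, type B keeps every sample but enlarges the feature vector, and type C does both.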
  21. 21. Experimental results <ul><li>Four different projects: </li></ul><ul><ul><li>Document layout recognition (real project) </li></ul></ul><ul><ul><li>Hyperspectral imaging (real project) </li></ul></ul><ul><ul><li>Brain tumour classification (real project) </li></ul></ul><ul><ul><li>UCI collection (used for evaluation) </li></ul></ul><ul><li>Basic tools for evaluation: </li></ul><ul><ul><li>N-fold cross-validation </li></ul></ul><ul><ul><li>bootstrapping </li></ul></ul><ul><ul><li>bias-variance decomposition </li></ul></ul>
  22. 22. Document layout recognition (I) <ul><li>Goal: adaptive compression for an automated document storage system using the lossy/lossless JPEG standard </li></ul><ul><li>Four classes: background (removed), text (OCR), line drawings (lossless) and images (lossy) </li></ul><ul><li>Documents are 8.5” x 11.7” at 150 dpi </li></ul><ul><li>Target block size: 8 x 8 pixels (JPEG standard) </li></ul><ul><li>Minguillón, J. et al., Progressive classification scheme for document layout recognition , Proc. of the SPIE, Denver, CO, USA, v. 3816:241-250, 1999 </li></ul>
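  23. 23. Document layout recognition (II) <ul><li>Classical approach: a single decision tree with a block size of 8 x 8 pixels </li></ul><table><tr><th>Size</th><th>Num. Blocks</th><th>|T|</th><th>R</th><th>d_max</th><th>Error</th></tr><tr><td>8 x 8</td><td>211200 / 211200</td><td>721</td><td>8.56</td><td>38</td><td>0.078</td></tr></table>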
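  24. 24. Document layout recognition (III) <ul><li>Progressive approach: four block sizes (64 x 64, 32 x 32, 16 x 16 and 8 x 8) </li></ul><table><tr><th>Size</th><th>Num. Blocks</th><th>|T|</th><th>R</th><th>d_max</th><th>Error</th></tr><tr><td>64 x 64</td><td>3360 / 3360</td><td>6</td><td>2.77</td><td>4</td><td>0.089</td></tr><tr><td>32 x 32</td><td>7856 / 13440</td><td>14</td><td>4.17</td><td>6</td><td>0.047</td></tr><tr><td>16 x 16</td><td>21052 / 53760</td><td>11</td><td>3.72</td><td>6</td><td>0.042</td></tr><tr><td>8 x 8</td><td>27892 / 215040</td><td>18</td><td>4.73</td><td>8</td><td>0.065</td></tr></table>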
  25. 25. Hyperspectral imaging (I) <ul><li>Image size is 710 x 4558 pixels x 14 bands (available ground truth data is only 400 x 2400) </li></ul><ul><li>Ground truth data presents some artifacts due to low resolution: around 10% mislabelled </li></ul><ul><li>19 classes including asphalt, water, rocks, soil and several vegetation types </li></ul><ul><li>Goal: to build a classification system and to identify the most important bands for each class, but also to detect possible outliers in the training set </li></ul><ul><li>Minguillón, J. et al., Adaptive lossy compression and classification of hyperspectral images , Proc. of remote sensing VI, Barcelona, Spain, v. 4170:214-225, 2000 </li></ul>
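  26. 26. Hyperspectral imaging (II) <ul><li>Classical approach: </li></ul><table><tr><th>Tree</th><th>|T|</th><th>R</th><th>P_T</th><th>Error</th></tr><tr><td>T_1</td><td>836</td><td>9.83</td><td>1.0</td><td>0.163</td></tr></table><ul><li>Using the new labelling rule: </li></ul><table><tr><th>Tree</th><th>|T|</th><th>R</th><th>P_T</th><th>Error</th></tr><tr><td>T_2</td><td>650</td><td>9.60</td><td>0.722</td><td>0.092</td></tr></table>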
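  27. 27. Hyperspectral imaging (III) <ul><li>Progressive approach: </li></ul><table><tr><th>Tree</th><th>|T|</th><th>R</th><th>P_T</th><th>Error</th></tr><tr><td>T_3</td><td>44</td><td>4.84</td><td>0.706</td><td>0.094</td></tr><tr><td>T_3A</td><td>9</td><td>3.02</td><td>0.523</td><td>0.056</td></tr><tr><td>T_3B</td><td>8</td><td>2.14</td><td>0.383</td><td>0.199</td></tr></table>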
  28. 28. Brain tumour classification (I) <ul><li>Goal: to build a classification system for helping clinicians to identify brain tumour types </li></ul><ul><li>Too many classes and too few samples: a hierarchical structure partially reproducing the WHO tree has been created </li></ul><ul><li>Different classifiers (LDA, k-NN, decision trees) are combined using a mixture of cascading and voting schemes </li></ul><ul><li>Minguillón, J. et al., Classifier combination for in vivo magnetic resonance spectra of brain tumours , Proc. of Multiple Classifier Systems, Cagliari, Italy, LNCS 2364 </li></ul>
  29. 29. Brain tumour classification (II) <ul><li>Each classification stage is: </li></ul>[Diagram with labels X, k-NN, LDA, DT, V and Y] <ul><li>Decision trees use LDA class distances as additional information </li></ul><ul><li>“Unknown” means that the classifiers disagree </li></ul>
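  30. 30. Brain tumour classification (III) [Figure: hierarchical classification tree with per-node accuracies. Normal 100%, Tumour 99.5%; Benign 92.1%, Malignant 94.9%; Grade II 82.6%, Grade III 0%, Grade IV 94.7%; Astro 94.1%, Oligo 100%; Primary 81.8%, Secondary 91.4%. Class groups: MN+SCH+HB, ASTII+OD, GLB+LYM+PNET+MET. Further values without a recoverable node label: 98.9%, 84.0%, 89.9%, 83.8%, 75.0%]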
  31. 31. UCI collection <ul><li>Goal: exhaustive testing of progressive decision trees </li></ul><ul><li>20 data sets were chosen: </li></ul><ul><ul><li>No categorical variables </li></ul></ul><ul><ul><li>No missing values </li></ul></ul><ul><ul><li>Large range of number of samples, data dimension and number of classes </li></ul></ul><ul><li>Available at http://kdd.ics.uci.edu </li></ul>
  32. 32. Experiments setup <ul><li>N-fold cross-validation with N=3 </li></ul><ul><li>For each training set, 25 bootstrap replicates are generated (subsampling with replacement) </li></ul><ul><li>Each experiment is repeated 5 times and performance results are averaged </li></ul><ul><li>Bias-variance decomposition is computed for each repetition and then averaged </li></ul>
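The setup above can be laid out as a plan of index sets. The following is a minimal sketch using only the standard library; the function names, the seed, and the assumption that each bootstrap replicate has the same size as its training set are illustrative choices not stated in the slides.

```python
import random

def kfold_indices(n, k, rng):
    """Yield (train, test) index lists for k-fold cross-validation over n samples."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def bootstrap(train, rng):
    """Sample len(train) indices from train with replacement (assumed replicate size)."""
    return [rng.choice(train) for _ in train]

def experiment_plan(n, folds=3, replicates=25, repetitions=5, seed=0):
    """Yield (repetition, fold, replicate, bootstrap_train, test) for every run."""
    rng = random.Random(seed)
    for rep in range(repetitions):
        for fold, (train, test) in enumerate(kfold_indices(n, folds, rng)):
            for b in range(replicates):
                yield rep, fold, b, bootstrap(train, rng), test
```

With the defaults this gives 5 x 3 x 25 = 375 training runs per data set, each evaluated on the held-out fold.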
  33. 33. Bias-variance decomposition <ul><li>Several approaches exist (Domingos, 2000) </li></ul><ul><li>First classifiers in a cascading ensemble should have moderate bias and low variance: small (but not too small) decision trees </li></ul><ul><li>Last classifiers should have small bias and moderate variance: large (but not too large) decision trees </li></ul><ul><li>Only classifiers that differ in their bias-variance behaviour should be combined: the number of decision trees should be small </li></ul>
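For 0-1 loss, a decomposition in the style of Domingos (2000) can be estimated from the predictions of models trained on different bootstrap replicates. The sketch below assumes noise-free labels, so the true label is taken as the optimal prediction; the function name and the returned keys are hypothetical.

```python
from collections import Counter

def decompose(predictions, truth):
    """Estimate average bias and variance for 0-1 loss.

    predictions[i]: labels predicted for example i by models trained on different replicates.
    truth[i]: true label of example i (assumed noise-free).
    """
    n = len(truth)
    bias = var_unbiased = var_biased = 0.0
    for preds, y in zip(predictions, truth):
        main = Counter(preds).most_common(1)[0][0]          # main prediction (mode)
        v = sum(p != main for p in preds) / len(preds)       # variance at this example
        if main == y:
            var_unbiased += v    # variance on unbiased examples increases the error
        else:
            bias += 1.0
            var_biased += v      # variance on biased examples decreases it (two classes)
    return {"bias": bias / n,
            "var_unbiased": var_unbiased / n,
            "var_biased": var_biased / n}
```

For two-class problems the average error equals bias plus unbiased variance minus biased variance, which gives a quick consistency check on the estimates.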
  34. 34. Empirical evaluation summary (I) <ul><li>Bias usually predominates over variance on most data sets, hence decision trees outperform the k-NN classifier </li></ul><ul><li>Bias decreases fast when the decision tree has enough leaves </li></ul><ul><li>Variance shows an unpredictable behaviour, depending on data set intrinsic characteristics </li></ul>
  35. 35. Empirical evaluation summary (II) <ul><li>Type B progressive decision trees usually outperform classical decision trees, mainly due to bias reduction. Two or three small decision trees are enough </li></ul><ul><li>Type A progressive decision trees do not outperform classical decision trees in general, but variance is reduced (classifiers are smaller and thus more stable) </li></ul><ul><li>Type C experiments are still running... </li></ul>
  36. 36. Theoretical issues <ul><li>Decision trees are convex combinations of internal node decision functions: </li></ul><ul><li>$T_j(x) = \sum_{i=1}^{|T_j|} p_{ij}\, {}_{ij}\, h_{ij}(x)$ </li></ul><ul><li>Cascading is a convex combination of t decision trees: </li></ul><ul><li>$T(x) = \sum_{j=1}^{t} q_j\, T_j(x)$ </li></ul><ul><ul><li>Type A: the first decision tree is the most important </li></ul></ul><ul><ul><li>Type B: the last decision tree is the most important </li></ul></ul><ul><ul><li>Type C: not applicable </li></ul></ul>
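The cascade as a convex combination of t trees can be illustrated with a few lines of code. This is a sketch under assumptions: each tree is modelled as a function returning a value in [-1, 1] (as in a two-class setting), and the function name and the weight check are illustrative.

```python
def convex_combination(weights, functions):
    """Return T(x) = sum_j weights[j] * functions[j](x) for non-negative weights summing to 1."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")

    def combined(x):
        return sum(w * f(x) for w, f in zip(weights, functions))

    return combined
```

For instance, with two toy stumps and weights (0.75, 0.25), the first tree dominates the combined output, which mirrors the remark that in a type A cascade the first decision tree is the most important.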
  37. 37. Error generalization bounds <ul><li>Convex combinations may be studied under the margin paradigm defined by Schapire et al. </li></ul><ul><li>Generalization error depends on the tree structure and on the VC dimension of the internal node functions </li></ul><ul><li>Unbalanced trees are preferable </li></ul><ul><li>Unbalanced classifiers are preferable </li></ul><ul><li>Modest goal: to see that the current theory related to classifier combination does not contradict progressive decision trees </li></ul>
  38. 38. Conclusions <ul><li>Progressive decision trees generalise classical decision trees and the cascading paradigm </li></ul><ul><li>Cascading is very useful for large data sets with a large number of classes (hierarchical structure) </li></ul><ul><li>Preliminary experiments with type C progressive decision trees look promising… </li></ul><ul><li>Experiments with real data sets show that it is possible to improve classification accuracy and reduce both classification and exploitation costs at the same time </li></ul><ul><li>Fine tuning is absolutely necessary!... </li></ul>
  39. 39. Further research <ul><li>The R-norm splitting criterion may be used to build adaptive decision trees </li></ul><ul><li>Better error generalisation bounds are needed </li></ul><ul><li>A complete and specific theoretical framework for the cascading paradigm must be developed </li></ul><ul><li>Parameters (the margin threshold, d and t) are currently set empirically; more explanations are needed </li></ul><ul><li>New applications (huge data sets): </li></ul><ul><ul><li>Web mining </li></ul></ul><ul><ul><li>DNA interpretation </li></ul></ul>
  40. 40. Selected references <ul><li>Breiman, L. et al., Classification and Regression Trees , Wadsworth International Group, 1984 </li></ul><ul><li>Gama, J. et al., Cascade Generalization , Machine Learning 41(3):315-343, 2000 </li></ul><ul><li>Domingos, P., A unified bias-variance decomposition and its applications , Proc. of the 17th Int. Conf. on Machine Learning, Stanford, CA, USA, 231-238, 2000 </li></ul><ul><li>Schapire, R.E. et al., Boosting the margin: a new explanation for the effectiveness of voting methods , Annals of Statistics 26(5):1651-1686, 1998 </li></ul>