1. Handling Numeric Attributes in Hoeffding Trees
   Bernhard Pfahringer, Geoff Holmes and Richard Kirkby
2. Overview
   - Hoeffding trees are excellent for classification tasks on data streams.
   - Handling numeric attributes well is crucial to the performance of conventional decision trees (for example, C4.5 -> C4.8).
   - Does handling numeric attributes matter for streamed data?
   - We implement a range of methods and empirically evaluate their accuracy and costs.
3. Data Streams - a reminder
   - The idea is that data arrives from a continuous source:
     - Examples are processed one at a time (inspected once)
     - Memory is limited (!)
     - Model construction must scale (N log N in the number of examples)
     - Be ready to predict at any time
   - Because memory is limited, this has implications for any numeric handling method you might construct
   - We only consider methods that work as the tree is built
4. Main assumptions/limitations
   - Assume a stationary concept, i.e. no concept drift or change
     - may seem very limiting, but ...
   - Three-way trade-off:
     - memory
     - speed
     - accuracy
   - Only artificial data sources are used
5. Hoeffding Trees
   - Introduced by Domingos and Hulten (VFDT)
   - "Extension" of decision trees to streams
   - HT algorithm:
     - Initialise tree T to a root node
     - For each example from the stream:
       - Find the leaf L for this example
       - Update the counts in L with the example's attribute values and compute the split function (e.g. information gain, IG) for each attribute
       - If IG(best attribute) - IG(next best attribute) > ε, split L on the best attribute (ε is the Hoeffding bound; see the sketch below)
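A minimal sketch of that split test (the function names and the δ default are ours, not from the slides). The Hoeffding bound says that, with probability 1 - δ, the observed mean of a random variable with range R is within ε = sqrt(R² ln(1/δ) / 2n) of its true mean after n observations, so a gap larger than ε means the best attribute really is best:

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon such that, with probability 1 - delta, the observed mean of a
    variable with the given range is within epsilon of the true mean after
    n independent observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(ig_best: float, ig_second_best: float,
                 n_examples: int, n_classes: int,
                 delta: float = 1e-7) -> bool:
    # Information gain ranges over [0, log2(number of classes)].
    r = math.log2(n_classes)
    epsilon = hoeffding_bound(r, delta, n_examples)
    return (ig_best - ig_second_best) > epsilon
```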
6. Active leaf data structure
   - For each class value:
     - for each nominal attribute:
       - for each possible value:
         - keep a sum of counts/weights
     - for each numeric attribute:
       - keep sufficient statistics to approximate the distribution
       - various possibilities; here we assume a normal distribution, so estimate/record n, mean, variance, plus min/max (one way to do this incrementally is sketched below)
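One standard way to maintain those statistics in a single pass is Welford's online update; a minimal sketch (the class name is ours, and MOA's real estimator also handles weighted examples):

```python
class GaussianEstimator:
    """Incrementally tracks n, mean, variance, min and max of a numeric
    attribute for one class, using Welford's online update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def add(self, value: float) -> None:
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def variance(self) -> float:
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0
```

An active leaf would hold one such estimator per (class, numeric attribute) pair.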
7. Numeric Handling Methods
   - VFDT (VFML - Hulten & Domingos, 2003)
     - Summarize the numeric distribution with a histogram made up of a maximum number of bins N (default 1000); see the sketch below
     - Bin boundaries are determined by the first N unique values seen in the stream
     - Issues: the method is sensitive to data order, and choosing a good N for a particular problem is hard
   - Exhaustive Binary Tree (BINTREE - Gama et al, 2003)
     - The closest implementation of a batch method
     - Incrementally update a binary tree as data is observed
     - Issues: high memory cost, high cost of the split search, data order
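A simplified sketch of the binning scheme as we read it (not the actual VFML code; a real leaf keeps per-class counts in each bin rather than a single total):

```python
import bisect

class VFMLHistogram:
    """Simplified sketch of VFML-style binning: the first max_bins distinct
    values define the bin boundaries; every later value is counted into the
    bin with the smallest boundary at or above it (or into the last bin)."""

    def __init__(self, max_bins: int = 1000):
        self.max_bins = max_bins
        self.boundaries = []  # sorted distinct values seen first
        self.counts = []      # a real leaf keeps per-class counts per bin

    def add(self, value: float) -> None:
        i = bisect.bisect_left(self.boundaries, value)
        is_known = i < len(self.boundaries) and self.boundaries[i] == value
        if not is_known and len(self.boundaries) < self.max_bins:
            # Still collecting boundaries: each new distinct value opens a bin.
            self.boundaries.insert(i, value)
            self.counts.insert(i, 1)
        else:
            # Boundaries are frozen (or the value is already a boundary).
            self.counts[min(i, len(self.counts) - 1)] += 1
```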
8. Numeric Handling Methods (continued)
   - Quantile Summaries (GK - Greenwald and Khanna, 2001)
     - Motivation comes from the VLDB community
     - Maintain a sample of values (quantiles) plus the range of possible ranks that each sample can take (tuples); a simplified sketch follows
     - Extremely space efficient
     - Issues: must fix a maximum number of tuples per summary
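A much-simplified sketch of the GK structure (our own condensation: the paper's banding rule is omitted and compression runs on a fixed schedule, so this illustrates the tuple bookkeeping rather than reproducing the published algorithm):

```python
import math

class GKSummary:
    """Each tuple [v, g, delta] holds a sampled value v, the gap g between
    its minimum rank and its predecessor's, and the uncertainty delta in
    its rank. Any quantile can then be answered within eps * n ranks."""

    def __init__(self, eps: float = 0.01):
        self.eps = eps
        self.n = 0
        self.tuples = []  # [value, g, delta], sorted by value

    def add(self, v: float) -> None:
        i = 0
        while i < len(self.tuples) and self.tuples[i][0] < v:
            i += 1
        # New minimum or maximum values are known exactly (delta = 0).
        at_edge = i == 0 or i == len(self.tuples)
        delta = 0 if at_edge else math.floor(2 * self.eps * self.n)
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1
        if self.n % max(1, int(1.0 / (2.0 * self.eps))) == 0:
            self._compress()

    def _compress(self) -> None:
        # Merge a tuple into its right neighbour when the combined rank
        # uncertainty still fits the 2*eps*n error budget.
        budget = 2.0 * self.eps * self.n
        i = len(self.tuples) - 2
        while i >= 1:  # never merge away the minimum or maximum
            g = self.tuples[i][1]
            g_next, d_next = self.tuples[i + 1][1], self.tuples[i + 1][2]
            if g + g_next + d_next < budget:
                self.tuples[i + 1][1] = g + g_next
                del self.tuples[i]
            i -= 1
```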
9. Numeric Handling Methods (continued)
   - Gaussian Approximation (GAUSS)
     - Assume values conform to a normal distribution
     - Maintain five numbers per class (weight, mean, variance, max, min)
     - Note: not sensitive to data order
     - Incrementally updateable
     - Using the per-class max/min information, split the overall range into N equal parts
     - For each part, use the five numbers per class to compute the approximate class distribution
       - Use this to compute the IG of that split (sketched below)
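A sketch of how the five numbers per class can score candidate splits via the normal CDF (the helper names are ours; the actual method also clips each class's Gaussian to its observed min/max range, which we omit):

```python
import math
from statistics import NormalDist  # Python 3.8+

def entropy(dist):
    """Entropy of a {class: weight} distribution, in bits."""
    total = sum(dist.values())
    if total <= 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total)
                for w in dist.values() if w > 0)

def info_gain(left, right):
    """Information gain of splitting the merged distribution into left/right."""
    classes = set(left) | set(right)
    parent = {c: left.get(c, 0.0) + right.get(c, 0.0) for c in classes}
    n = sum(parent.values())
    child = (sum(left.values()) * entropy(left)
             + sum(right.values()) * entropy(right)) / n
    return entropy(parent) - child

def best_gauss_split(per_class, n_parts=10):
    """per_class maps class -> (weight, mean, std, lo, hi). Evaluate
    n_parts - 1 equally spaced candidate split points over the overall
    [lo, hi] range, estimating each class's weight on either side of a
    candidate from the normal CDF; return (best_gain, best_split)."""
    lo = min(s[3] for s in per_class.values())
    hi = max(s[4] for s in per_class.values())
    best = (float("-inf"), None)
    for k in range(1, n_parts):
        split = lo + k * (hi - lo) / n_parts
        left, right = {}, {}
        for cls, (w, mean, std, _, _) in per_class.items():
            frac = NormalDist(mean, std).cdf(split) if std > 0 \
                   else (1.0 if mean <= split else 0.0)
            left[cls], right[cls] = w * frac, w * (1.0 - frac)
        best = max(best, (info_gain(left, right), split))
    return best
```

The candidate with the highest gain becomes the attribute's proposed split; note that only N - 1 CDF evaluations per class are needed, regardless of how many values the stream has delivered.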
10. Gaussian approximation - 2-class problem (figure only)
11. Gaussian approximation - 3-class problem (figure only)
12. Gaussian approximation - 4-class problem (figure only)
13. Empirical Evaluation
   - Use each numeric handling method (8 in total) to build a Hoeffding Tree (HTMC)
   - Vary the parameters of some methods (VFML10/100/1000; BT; GK100/1000; GAUSS10/100)
   - Train models for 10 hours, then test on one million (holdout) examples
   - Define three application scenarios:
     - Sensor network (100K memory limit)
     - Handheld (32MB)
     - Server (400MB)
14. Data generators
   - Random tree (Domingos & Hulten):
     - (RTS) 10 numeric, 10 nominal with 5 values, 2 classes, leaves start at level 3, max level 5; plus a version with 10% noise added (RTSN)
     - (RTC) 50 numeric, 50 nominal with 5 values, 2 classes, leaves start at level 5, max level 10; plus a version with 10% noise added (RTCN)
   - Random RBF (Kirkby):
     - (RRBFS) 10 numeric, 100 centers, 2 classes
     - (RRBFC) 50 numeric, 1000 centers, 2 classes
   - Waveform (Aha):
     - (Wave21) 21 noisy numeric attributes; (Wave40) plus 19 irrelevant numeric attributes; 3 classes
   - (GenF1-GenF10) (Agrawal et al):
     - hypothetical loan applications, 10 different rules over 6 numeric + 3 nominal attributes, 5% noise, 2 classes
15. Tree Measurements
   - Accuracy (% correct)
   - Number of training examples processed in 10 hours (in millions)
   - Number of active leaves (in hundreds)
   - Number of inactive leaves (in hundreds)
   - Total nodes (in hundreds)
   - Tree depth
   - Training speed (% of generation speed)
   - Prediction speed (% of generation speed)
16. Sensor Network (100K memory limit)

| Method   | % correct | Train (millions) | Active leaves (hundreds) | Inactive leaves (hundreds) | Total nodes (hundreds) | Avg tree depth | Train spd % | Pred spd % |
|----------|-----------|------------------|--------------------------|----------------------------|------------------------|----------------|-------------|------------|
| GAUSS100 | 85.33     | 16               | 0.01                     | 8.08                       | 11.7                   | 20             | 64          | 79         |
| GAUSS10  | 86.16     | 20               | 0                        | 8.87                       | 12.1                   | 12             | 68          | 81         |
| GK1000   | 74.65     | 1                | 0                        | 0.08                       | 0.13                   | 3              | 60          | 88         |
| GK100    | 82.92     | 12               | 0                        | 4.03                       | 5.03                   | 8              | 71          | 84         |
| BT       | 74.45     | 1                | 0                        | 0.07                       | 0.11                   | 3              | 75          | 89         |
| VF1000   | 76.06     | 1                | 0                        | 0.09                       | 0.14                   | 3              | 81          | 88         |
| VF100    | 79.47     | 13               | 0                        | 3.65                       | 4.5                    | 7              | 76          | 85         |
| VF10     | 87.7      | 21               | 0                        | 8.13                       | 10.6                   | 11             | 70          | 82         |
17. Handheld Environment (32MB memory limit)

| Method   | % correct | Train (millions) | Active leaves (hundreds) | Inactive leaves (hundreds) | Total nodes (hundreds) | Avg tree depth | Train spd % | Pred spd % |
|----------|-----------|------------------|--------------------------|----------------------------|------------------------|----------------|-------------|------------|
| GAUSS100 | 90.91     | 853              | 92.6                     | 639                        | 1167                   | 50             | 14          | 69         |
| GAUSS10  | 91.35     | 874              | 93.7                     | 683                        | 1166                   | 24             | 15          | 69         |
| GK1000   | 90.94     | 937              | 2.66                     | 403                        | 581                    | 27             | 16          | 75         |
| GK100    | 89.96     | 961              | 6.89                     | 530                        | 777                    | 34             | 17          | 73         |
| BT       | 90.48     | 808              | 3.68                     | 373                        | 540                    | 22             | 15          | 73         |
| VF1000   | 90.97     | 951              | 4.22                     | 412                        | 604                    | 27             | 17          | 73         |
| VF100    | 90.97     | 973              | 5.99                     | 481                        | 704                    | 24             | 17          | 73         |
| VF10     | 91.53     | 909              | 31.8                     | 675                        | 1009                   | 22             | 16          | 72         |
18. Server Environment (400MB memory limit)

| Method   | % correct | Train (millions) | Active leaves (hundreds) | Inactive leaves (hundreds) | Total nodes (hundreds) | Avg tree depth | Train spd % | Pred spd % |
|----------|-----------|------------------|--------------------------|----------------------------|------------------------|----------------|-------------|------------|
| GAUSS100 | 90.75     | 538              | 566                      | 38.7                       | 998                    | 63             | 6           | 66         |
| GAUSS10  | 91.21     | 518              | 540                      | 26.8                       | 891                    | 28             | 6           | 73         |
| GK1000   | 91.03     | 91               | 17.6                     | 122                        | 197                    | 21             | 3           | 80         |
| GK100    | 89.88     | 158              | 84                       | 145                        | 346                    | 32             | 4           | 75         |
| BT       | 90.50     | 60               | 13.7                     | 92.9                       | 147                    | 19             | 2           | 81         |
| VF1000   | 91.12     | 108              | 19                       | 127                        | 206                    | 22             | 3           | 79         |
| VF100    | 91.19     | 142              | 73.9                     | 143                        | 316                    | 23             | 4           | 75         |
| VF10     | 91.41     | 293              | 320                      | 80.4                       | 591                    | 24             | 4           | 74         |
19. Overall results - comments
   - VFML10 is superior on average in all environments, followed closely by GAUSS10
   - The GK methods are generally competitive
   - BINTREE is only competitive in a server setting
   - The default setting of 1000 bins for VFML is a poor choice
   - Crude binning leaves more memory free, which allows faster growth and better trees (more room to grow)
   - Higher values of N for GAUSS lead to very deep trees (depth in excess of the number of attributes), suggesting repeated splitting on the same attribute (too fine grained)
20. Remarks - sensor network environment
   - The number of training examples is low because learning stops when the last active leaf is deactivated (memory management freezes nodes: with a low number of examples per leaf, the probability of splitting is low)
   - Most accurate methods: VFML10 and GAUSS10
21. Remarks - handheld environment
   - Generates smaller trees (than the server environment) and can therefore process more examples
22. Remarks - server environment
23. VFML10 vs GAUSS10 - closer analysis
   - Recall that VFML10 is superior on average
   - Sensor (avg 87.7 vs 86.2):
     - GAUSS10 superior on 10 datasets
     - VFML10 superior on 6 (2 with no difference)
   - Handheld (avg 91.5 vs 91.4):
     - GAUSS10 superior on 4
     - VFML10 superior on 8 (6 with no difference)
   - Server (avg 91.4 vs 91.2):
     - GAUSS10 superior on 6
     - VFML10 superior on 6 (6 with no difference)
24. Data order (figure only)
25. Conclusion
   - We have presented a method for handling numeric attributes in data streams that performs well in empirical studies
   - The methods employing the most approximation were superior: they allow greater growth when memory is limited
   - On a dataset-by-dataset analysis there is little to choose between VFML10 and GAUSS10
   - The gains made in handling numeric attributes come at a cost in training and prediction speed; the cost is high in some environments
26. All algorithms available
   - https://sourceforge.net/projects/moa-datastream
   - All methods, plus an environment for the experimental evaluation of data streams, are available from the above URL; the system is called Massive Online Analysis (MOA)
