Adaptive XML Tree Mining on Evolving Data Streams

Talk about mining XML trees on evolving data streams.

  1. 1. Adaptive XML Tree Mining on Evolving Data Streams. Albert Bifet, Laboratory for Relational Algorithmics, Complexity and Learning (LARCA), Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya. Porto, 21 May 2009.
  2. 2. Mining Evolving Massive Structured Data. The basic problem: finding interesting structure in data. Mining massive data; mining time-varying data; mining in real time; mining XML data. Image: The Disintegration of the Persistence of Memory, 1952-54, Salvador Dalí. 2 / 30
  3. 3. XML Tree Classification on evolving data streams. Figure: A dataset example (four trees with node labels A, B, C and D, assigned in order to CLASS 1, CLASS 2, CLASS 1 and CLASS 2). 3 / 30
  4. 4. Tree Pattern Mining. Given a dataset of trees, find the complete set of frequent subtrees. Frequent Tree Pattern (FT): include all the trees whose support is no less than min_sup. Closed Frequent Tree Pattern (CT): include no tree which has a super-tree with the same support. CT ⊆ FT. Quote: "Trees are sanctuaries. Whoever knows how to listen to them, can learn the truth." (Hermann Hesse) 4 / 30
  5. 5. Mining Closed Frequent Trees. Our trees are: labeled and unlabeled; ordered and unordered. Our subtrees are: induced; top-down. Figure: two different ordered trees but the same unordered tree. 5 / 30
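The ordered versus unordered distinction on this slide can be made concrete with a small sketch. It assumes unlabeled trees encoded as preorder depth sequences (root at depth 0), and it assumes one possible convention for the canonical form of an unordered tree: sibling subtrees listed in decreasing lexicographic order. The function names are illustrative and do not come from the talk.

```python
def children(depths):
    """Split a valid preorder depth sequence into the sequences of the root's child subtrees."""
    root = depths[0]
    subtrees = []
    for d in depths[1:]:
        if d == root + 1:
            subtrees.append([d])      # a new child of the root starts here
        else:
            subtrees[-1].append(d)    # deeper node: belongs to the current child
    return subtrees


def canonical(depths):
    """Reorder siblings recursively so that equal unordered trees get equal sequences."""
    kids = sorted((canonical(c) for c in children(depths)), reverse=True)
    return [depths[0]] + [d for kid in kids for d in kid]


def same_unordered_tree(a, b):
    """True when two ordered trees differ only in the order of siblings."""
    return canonical(a) == canonical(b)


# Two different ordered trees that are the same unordered tree.
print(canonical([0, 1, 1, 2]))  # [0, 1, 2, 1]
print(same_unordered_tree([0, 1, 1, 2], [0, 1, 2, 1]))  # True
```

Under this convention a sequence is its own canonical representative exactly when `canonical(seq) == seq`, which is one way a miner can avoid visiting the same unordered tree twice.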
  6. 6. A tale of two trees. Consider D = {A, B}, where A and B are the two trees drawn on the slide, and let min_sup = 2. Frequent subtrees: [figure]. 6 / 30
  7. 7. A tale of two trees. Consider D = {A, B}, where A and B are the two trees drawn on the slide, and let min_sup = 2. Closed subtrees: [figure]. 6 / 30
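  9. 9. XML Tree Classification on evolving data streams. Table (tree drawings omitted): each row shows a closed tree, its frequent but not closed subtrees, and its occurrence in tree transactions 1 to 4.
       Closed tree | Trans. 1 | Trans. 2 | Trans. 3 | Trans. 4
       c1          | 1        | 0        | 1        | 0
       c2          | 1        | 0        | 0        | 1
       8 / 30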
  10. 10. XML Tree Classification on evolving data streams.
       Frequent trees (columns grouped by closed tree c1 to c4):
       Id | c1 f1^1 | c2 f2^1 f2^2 f2^3 | c3 f3^1 | c4 f4^1 f4^2 f4^3 f4^4 f4^5
       1  | 1  1    | 1  1    1    1    | 0  0    | 1  1    1    1    1    1
       2  | 0  0    | 0  0    0    0    | 1  1    | 1  1    1    1    1    1
       3  | 1  1    | 0  0    0    0    | 1  1    | 1  1    1    1    1    1
       4  | 0  0    | 1  1    1    1    | 1  1    | 1  1    1    1    1    1
       Closed and maximal trees:
       Id | Closed: c1 c2 c3 c4 | Maximal: c1 c2 c3 | Class
       1  | 1  1  0  1          | 1  1  0           | CLASS 1
       2  | 0  0  1  1          | 0  0  1           | CLASS 2
       3  | 1  0  1  1          | 1  0  1           | CLASS 1
       4  | 0  1  1  1          | 0  1  1           | CLASS 2
       9 / 30
  11. 11. XML Tree Framework on evolving data streams. XML Tree Classification Framework Components: (1) an XML closed frequent tree miner; (2) a data stream classifier algorithm, which we will feed with tuples to be classified online. 10 / 30
  12. 12. Mining Evolving Tree Data Streams. Problem: given a data stream D of rooted and unordered trees, find frequent closed trees. We provide three algorithms, of increasing power: Incremental, Sliding Window, Adaptive. 11 / 30
  15. 15. Mining Closed Unordered Subtrees
       CLOSED_SUBTREES(t, D, min_sup, T)
          if not CANONICAL_REPRESENTATIVE(t)
             then return T
          for every t' that can be extended from t in one step
             do if Support(t') ≥ min_sup
                   then T ← CLOSED_SUBTREES(t', D, min_sup, T)
             do if Support(t') = Support(t)
                   then t is not closed
          if t is closed
             then insert t into T
          return T
       12 / 30
  16. 16. Example: D = {A, B}, A = (0, 1, 2, 3, 2, 1), B = (0, 1, 2, 3, 1, 2, 2), min_sup = 2. Subtrees shown in the lattice figure, by size: (0); (0, 1); (0, 1, 1), (0, 1, 2); (0, 1, 2, 1), (0, 1, 2, 2), (0, 1, 2, 3); (0, 1, 2, 2, 1), (0, 1, 2, 3, 1). 13 / 30
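The example above can be checked with a brute-force sketch. This is not the TreeNat algorithm from the talk: it assumes unlabeled, unordered trees and top-down subtrees (root mapped to root, children matched injectively), grows candidates one leaf at a time, and tests subtree containment by trying all child assignments, which is only reasonable for tiny trees. All function names are illustrative.

```python
from itertools import permutations


def canon(children):
    """Canonical unordered tree: a tuple of child trees in a fixed (descending) order."""
    return tuple(sorted(children, reverse=True))


def parse(depths):
    """Preorder depth sequence -> canonical nested tuple."""
    stack = [[]]  # stack[d] collects the children of the open node at depth d

    def close_to(depth):
        while len(stack) > depth:
            kids = stack.pop()
            stack[-1].append(canon(kids))

    for d in depths[1:]:
        close_to(d)       # finish every open node at depth >= d
        stack.append([])  # open the new node at depth d
    close_to(1)
    return canon(stack[0])


def to_depths(tree, depth=0):
    """Canonical nested tuple -> preorder depth sequence."""
    return (depth,) + tuple(d for child in tree for d in to_depths(child, depth + 1))


def is_subtree(s, t):
    """Top-down, unordered containment: children of s map injectively into children of t."""
    if len(s) > len(t):
        return False
    return any(all(is_subtree(sc, tc) for sc, tc in zip(s, chosen))
               for chosen in permutations(t, len(s)))


def support(t, dataset):
    return sum(is_subtree(t, tree) for tree in dataset)


def extensions(tree):
    """All canonical trees obtained by adding one leaf somewhere in `tree`."""
    result = {canon(tree + ((),))}  # new leaf under the root
    for i, child in enumerate(tree):
        for bigger in extensions(child):  # new leaf inside the i-th child
            result.add(canon(tree[:i] + (bigger,) + tree[i + 1:]))
    return result


def mine(dataset, min_sup):
    """Return (frequent trees with supports, closed frequent trees)."""
    frequent = {}
    todo = [()]  # start from the single-node tree
    while todo:
        t = todo.pop()
        if t in frequent:
            continue
        sup = support(t, dataset)
        if sup >= min_sup:
            frequent[t] = sup
            todo.extend(extensions(t))
    # t is closed when no one-leaf extension keeps the same support.
    closed = [t for t, sup in frequent.items()
              if all(frequent.get(e, 0) < sup for e in extensions(t))]
    return frequent, closed


A = parse((0, 1, 2, 3, 2, 1))
B = parse((0, 1, 2, 3, 1, 2, 2))
frequent, closed = mine([A, B], min_sup=2)
frequent_seqs = sorted(to_depths(t) for t in frequent)
closed_seqs = sorted(to_depths(t) for t in closed)
print(len(frequent_seqs), closed_seqs)
```

Under these assumptions the sketch finds nine frequent subtrees, the same nine sequences listed in the example, of which only the two largest are closed.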
  18. 18. Experimental results. Comparison of the two miners:
       TreeNat: unlabeled trees; top-down subtrees; no occurrences.
       CMTreeMiner: labeled trees; induced subtrees; occurrences.
       14 / 30
  19. 19. Closure Operator on Trees. D: the finite input dataset of trees. T: the (infinite) set of all trees. Definition: we define the following Galois connection pair, where t ⪯ t′ means that t is a subtree of t′. For finite A ⊆ D, σ(A) is the set of trees in T that are subtrees of every tree in A: σ(A) = {t ∈ T | ∀ t′ ∈ A (t ⪯ t′)}. For finite B ⊂ T, τD(B) is the set of trees in D that are supertrees of every tree in B: τD(B) = {t′ ∈ D | ∀ t ∈ B (t ⪯ t′)}. Closure Operator: the composition ΓD = σ ◦ τD is a closure operator. 15 / 30
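The shape of this construction is easier to see on a simpler domain. The sketch below swaps trees for itemsets, so "is a subtree of" becomes "is a subset of": `tau` collects the dataset entries that contain a pattern, `sigma` returns what a family of entries has in common, and their composition is the closure. The dataset and the names are made up for illustration; the tree version needs a subtree test in place of `<=`.

```python
def tau(pattern, dataset):
    """Dataset entries that contain every element of `pattern` (the 'supertrees' side)."""
    return [entry for entry in dataset if pattern <= entry]


def sigma(entries):
    """Largest pattern contained in every given entry (the 'common subtrees' side)."""
    if not entries:
        raise ValueError("no entry contains the pattern; the closure is not representable here")
    return frozenset.intersection(*entries)


def closure(pattern, dataset):
    """Gamma_D = sigma after tau."""
    return sigma(tau(pattern, dataset))


# Hypothetical dataset of three itemsets.
D = [frozenset("abc"), frozenset("abd"), frozenset("ac")]

b = frozenset("b")
print(sorted(closure(b, D)))  # ['a', 'b']
```

Here `{b}` is not closed, because every entry that contains `b` also contains `a`. The three closure-operator properties (extensive, monotone, idempotent) can be checked directly on small examples like this one.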
  20. 20. Galois Lattice of closed set of trees. Figure: lattice over subsets of the dataset trees 1, 2, 3, with nodes 1, 2, 3, 12, 13, 23, 123. 16 / 30
  23. 23. Galois Lattice of closed set of trees. Figure: lattice with nodes 1, 2, 3, 12, 13, 23, 123. Worked example on the figure: B = {[tree]}; τD(B) = {[tree], [tree]}; ΓD(B) = σ ◦ τD(B) = {[tree] and its subtrees}. 17 / 30
  24. 24. Algorithms. Incremental: IncTreeNat. Sliding Window: WinTreeNat. Adaptive: AdaTreeNat, which uses ADWIN to monitor change. ADWIN: an adaptive sliding window whose size is recomputed online according to the rate of change observed. ADWIN has rigorous guarantees (theorems) on the ratio of false positives and false negatives, and on the relation between the size of the current window and change rates. 18 / 30
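The idea of a window whose size follows the observed rate of change can be sketched as follows. This is a simplified, quadratic-time illustration and not the ADWIN implementation: it stores every element instead of using ADWIN's compressed bucket structure, it assumes inputs in [0, 1], and the cut threshold is the Hoeffding-style bound from the ADWIN paper as assumed here. The class name and the default `delta` are illustrative.

```python
import math
from collections import deque


class SimplifiedAdwin:
    """Keeps a window of recent values and drops old ones when the mean appears to change."""

    def __init__(self, delta=0.002):
        self.delta = delta      # confidence parameter (assumed default)
        self.window = deque()

    def _has_change(self):
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        log_term = math.log(4.0 * n / self.delta)  # ln(4 / delta') with delta' = delta / n
        head = 0.0
        # Try every split of the window into an older part W0 and a newer part W1.
        for n0, x in enumerate(self.window, start=1):
            if n0 == n:
                break
            head += x
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)        # harmonic-mean style size term
            eps_cut = math.sqrt(log_term / (2.0 * m))
            if abs(head / n0 - (total - head) / n1) > eps_cut:
                return True
        return False

    def add(self, x):
        """Insert x; shrink the window from the old end while some split looks different."""
        self.window.append(x)
        changed = False
        while self._has_change():
            self.window.popleft()
            changed = True
        return changed

    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0


detector = SimplifiedAdwin()
flags = [detector.add(0.0) for _ in range(200)] + [detector.add(1.0) for _ in range(200)]
print(any(flags), len(detector.window), round(detector.mean(), 2))
```

On a stream that jumps from 0 to 1, the window grows while the data is stationary and shrinks after the jump, so the window mean tracks the new level instead of averaging over both regimes.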
  25. 25. Experimental Validation: TN1. Figure: Experiments on ordered trees with TN1 dataset (time in seconds against dataset size in millions, comparing CMTreeMiner and IncTreeNat). 19 / 30
  26. 26. What is MOA? {M}assive {O}nline {A}nalysis is a framework for online learning from data streams. It is closely related to WEKA. It includes a collection of offline and online methods as well as tools for evaluation: boosting and bagging; Hoeffding Trees with and without Naïve Bayes classifiers at the leaves. 20 / 30
  27. 27. WEKA: the bird 21 / 30
  28. 28. MOA: the bird The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct. 22 / 30
  31. 31. Data stream classification cycle: (1) process an example at a time, and inspect it only once (at most); (2) use a limited amount of memory; (3) work in a limited amount of time; (4) be ready to predict at any point. 23 / 30
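The four requirements above map naturally onto a test-then-train loop, sketched below. The loop and the evaluation scheme are assumptions chosen for illustration rather than MOA code, and the learner is the simplest possible one, a majority-class predictor. Each example is seen once, the model keeps only class counts, and a prediction is available before every update.

```python
from collections import Counter


class MajorityClass:
    """Predicts the most frequent class seen so far; memory is one counter per class."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None  # nothing seen yet
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Test on each example first, then train on it; every example is inspected once."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        model.learn(x, y)
        total += 1
    return correct / total if total else 0.0


# Hypothetical stream of (attributes, class) pairs; the attributes are unused by this learner.
stream = [((1, 0), "a"), ((1, 1), "a"), ((0, 1), "a"), ((0, 0), "b")]
print(prequential_accuracy(stream, MajorityClass()))  # 0.5
```

Any classifier exposing `predict` and `learn` with the same shape could be dropped into the loop in place of `MajorityClass`.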
  32. 32. Environments and Data Sources. Environments: Sensor Network: 100 KB; Handheld Computer: 32 MB; Server: 400 MB. Data Sources: Random Tree Generator; Random RBF Generator; LED Generator; Waveform Generator; Function Generator. 24 / 30
  33. 33. Algorithms: Naive Bayes; Decision stumps; Hoeffding Tree; Hoeffding Option Tree; Bagging and Boosting. Prediction strategies: Majority class; Naive Bayes Leaves; Adaptive Hybrid. 25 / 30
  34. 34. Hoeffding Option Tree. Hoeffding Option Trees: a regular Hoeffding tree containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths. 26 / 30
  35. 35. GUI java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.gui.TaskLauncher 27 / 30
  37. 37. Ensemble Methods. http://www.cs.waikato.ac.nz/~abifet/MOA/ New ensemble methods: ADWIN bagging (when a change is detected, the worst classifier is removed and a new classifier is added); Adaptive-Size Hoeffding Tree bagging. 28 / 30
  38. 38. XML Tree Framework on evolving data streams. Table: BAGGING on unordered trees.
       Dataset  | # Trees | Maximal: Att. | Acc.  | Mem. | Closed: Att. | Acc.  | Mem.
       CSLOG12  | 15483   | 84            | 79.64 | 1.2  | 228          | 78.12 | 2.54
       CSLOG23  | 15037   | 88            | 79.81 | 1.21 | 243          | 78.77 | 2.75
       CSLOG31  | 15702   | 86            | 79.94 | 1.25 | 243          | 77.60 | 2.73
       CSLOG123 | 23111   | 84            | 80.02 | 1.7  | 228          | 78.91 | 4.18
       29 / 30
  39. 39. Conclusions. XML tree stream classifier system. Using Galois Lattice Theory, we present methods for mining closed trees: Incremental; Sliding Window; Adaptive, using ADWIN to monitor change. We use MOA data stream classifiers. 30 / 30