Adaptive XML Tree Mining on Evolving Data Streams


                                 Albert Bifet

       Laboratory for R...
Mining Evolving Massive Structured Data


                                    The basic problem
                          ...
XML Tree Classification on evolving data
               streams

        D             D               D           D

     ...
Tree Pattern Mining

                         Given a dataset of trees, find the
                         complete set of f...
Mining Closed Frequent Trees

Our trees are:                   Our subtrees are:
    Labeled and Unlabeled             Ind...
A tale of two trees


Consider D = {A, B}, where       Frequent subtrees
                                             A   ...
A tale of two trees


Consider D = {A, B}, where        Closed subtrees
                                             A    ...
XML Tree Classification on evolving data
               streams

        D             D               D           D

     ...
XML Tree Classification on evolving data
               streams

                                            Tree Trans.
  ...
XML Tree Classification on evolving data
               streams
                        Frequent Trees
       c1          c...
XML Tree Framework on evolving data
                streams


XML Tree Classification Framework Components
   An XML closed...
Mining Evolving Tree Data Streams


Problem
Given a data stream D of rooted and unordered trees, find
frequent closed trees...
Mining Closed Unordered Subtrees



C LOSED _S UBTREES(t, D, min_sup, T )
 1
 2
 3 for every t that can be extended from t...
Mining Closed Unordered Subtrees



C LOSED _S UBTREES(t, D, min_sup, T )
 1 if not C ANONICAL _R EPRESENTATIVE(t)
 2    t...
Mining Closed Unordered Subtrees



C LOSED _S UBTREES(t, D, min_sup, T )
 1   if not C ANONICAL _R EPRESENTATIVE(t)
 2   ...
Example
D = {A, B}            A = (0, 1, 2, 3, 2, 1)        B = (0, 1, 2, 3, 1, 2, 2)

min_sup = 2.


                    ...
Example
D = {A, B}            A = (0, 1, 2, 3, 2, 1)        B = (0, 1, 2, 3, 1, 2, 2)

min_sup = 2.


                    ...
Experimental results




TreeNat                   CMTreeMiner
   Unlabeled Trees           Labeled Trees
   Top-Down Subt...
Closure Operator on Trees
    D: the finite input dataset of trees
    T : the (infinite) set of all trees

Definition
We defi...
Galois Lattice of closed set of trees




      1           2          3




     12                13    23




         ...
Galois Lattice of closed set of trees




                 1          2           3

      D


B={       }     12         ...
Galois Lattice of closed set of trees


     B={      }



                        1          2           3




τD (B) = {...
Galois Lattice of closed set of trees


     B={         }



                                 1               2          ...
Algorithms

Algorithms
    Incremental: I NC T REE N AT
    Sliding Window: W IN T REE N AT
    Adaptive: A DAT REE N AT U...
Experimental Validation: TN1



                                    CMTreeMiner
       300

 Time 200
(sec.)
       100
  ...
What is MOA?

{M}assive {O}nline {A}nalysis is a framework for online learning
from data streams.




    It is closely re...
WEKA: the bird




                 21 / 30
MOA: the bird

The Moa (another native NZ bird) is not only flightless, like the
                  Weka, but also extinct.
...
MOA: the bird

The Moa (another native NZ bird) is not only flightless, like the
                  Weka, but also extinct.
...
MOA: the bird

The Moa (another native NZ bird) is not only flightless, like the
                  Weka, but also extinct.
...
Data stream classification cycle

1   Process an example at a
    time, and inspect it only
    once (at most)
2   Use a li...
Environments and Data Sources


   Environments
      Sensor Network: 100Kb
      Handheld Computer: 32 Mb
      Server: 4...
Algorithms




Naive Bayes               Prediction strategies
Decision stumps               Majority class
Hoeffding Tree...
Hoeffding Option Tree
Hoeffding Option Trees
Regular Hoeffding tree containing additional option nodes that
allow several ...
GUI
java -cp .:moa.jar:weka.jar
-javaagent:sizeofag.jar moa.gui.TaskLauncher




                                         ...
GUI
java -cp .:moa.jar:weka.jar
-javaagent:sizeofag.jar moa.gui.TaskLauncher




                                         ...
Ensemble Methods
   http://www.cs.waikato.ac.nz/∼abifet/MOA/




New ensemble methods:
   ADWIN bagging: When a change is ...
XML Tree Framework on evolving data
               streams



                            Maximal                 Closed
 ...
Conclusions

XML tree stream classifier system.


    Using Galois Latice Theory, we present methods for mining
    closed ...
Upcoming SlideShare
Loading in...5
×

Adaptive XML Tree Mining on Evolving Data Streams

1,036

Published on

Talk about mining XML trees on evolving data streams.

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,036
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
30
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Adaptive XML Tree Mining on Evolving Data Streams

  1. 1. Adaptive XML Tree Mining on Evolving Data Streams Albert Bifet Laboratory for Relational Algorithmics, Complexity and Learning LARCA Departament de Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya Porto, 21 May 2009
  2. 2. Mining Evolving Massive Structured Data The basic problem Finding interesting structure on data Mining massive data Mining time varying data Mining on real time Mining XML data The Disintegration of Persistence of Memory 1952-54 Salvador Dalí 2 / 30
  3. 3. XML Tree Classification on evolving data streams D D D D B B B B B B B C C C C C C A A C LASS 1 C LASS 2 C LASS 1 C LASS 2 D Figure: A dataset example 3 / 30
  4. 4. Tree Pattern Mining Given a dataset of trees, find the complete set of frequent subtrees Frequent Tree Pattern (FT): Include all the trees whose support is no less than min_sup Closed Frequent Tree Pattern (CT): Include no tree which has a Trees are sanctuaries. super-tree with the same Whoever knows how support to listen to them, can learn the truth. CT ⊆ FT Herman Hesse 4 / 30
  5. 5. Mining Closed Frequent Trees Our trees are: Our subtrees are: Labeled and Unlabeled Induced Ordered and Unordered Top-down Two different ordered trees but the same unordered tree 5 / 30
  6. 6. A tale of two trees Consider D = {A, B}, where Frequent subtrees A B A: B: and let min_sup = 2. 6 / 30
  7. 7. A tale of two trees Consider D = {A, B}, where Closed subtrees A B A: B: and let min_sup = 2. 6 / 30
  8. 8. XML Tree Classification on evolving data streams D D D D B B B B B B B C C C C C C A A C LASS 1 C LASS 2 C LASS 1 C LASS 2 D Figure: A dataset example 7 / 30
  9. 9. XML Tree Classification on evolving data streams Tree Trans. Closed Freq. not Closed Trees 1 2 3 4 D B B c1 C C C C 1 0 1 0 D B B C A C C A c2 A A 1 0 0 1 8 / 30
  10. 10. XML Tree Classification on evolving data streams Frequent Trees c1 c2 c3 c4 Id 1 c1 f1 1 2 3 c2 f2 f2 f2 1 c3 f3 1 2 3 4 5 c4 f4 f4 f4 f4 f4 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 2 0 0 0 0 0 0 1 1 1 1 1 1 1 1 3 1 1 0 0 0 0 1 1 1 1 1 1 1 1 4 0 0 1 1 1 1 1 1 1 1 1 1 1 1 Closed Maximal Trees Trees Id Tree c1 c2 c3 c4 c1 c2 c3 Class 1 1 1 0 1 1 1 0 C LASS 1 2 0 0 1 1 0 0 1 C LASS 2 3 1 0 1 1 1 0 1 C LASS 1 4 0 1 1 1 0 1 1 C LASS 2 9 / 30
  11. 11. XML Tree Framework on evolving data streams XML Tree Classification Framework Components An XML closed frequent tree miner A Data stream classifier algorithm, which we will feed with tuples to be classified online. 10 / 30
  12. 12. Mining Evolving Tree Data Streams Problem Given a data stream D of rooted and unordered trees, find frequent closed trees. We provide three algorithms, of increasing power Incremental Sliding Window Adaptive D 11 / 30
  13. 13. Mining Closed Unordered Subtrees C LOSED _S UBTREES(t, D, min_sup, T ) 1 2 3 for every t that can be extended from t in one step 4 do if Support(t ) ≥ min_sup 5 then T ← C LOSED _S UBTREES(t , D, min_sup, T ) 6 7 8 9 10 return T 12 / 30
  14. 14. Mining Closed Unordered Subtrees C LOSED _S UBTREES(t, D, min_sup, T ) 1 if not C ANONICAL _R EPRESENTATIVE(t) 2 then return T 3 for every t that can be extended from t in one step 4 do if Support(t ) ≥ min_sup 5 then T ← C LOSED _S UBTREES(t , D, min_sup, T ) 6 7 8 9 10 return T 12 / 30
  15. 15. Mining Closed Unordered Subtrees C LOSED _S UBTREES(t, D, min_sup, T ) 1 if not C ANONICAL _R EPRESENTATIVE(t) 2 then return T 3 for every t that can be extended from t in one step 4 do if Support(t ) ≥ min_sup 5 then T ← C LOSED _S UBTREES(t , D, min_sup, T ) 6 do if Support(t ) = Support(t) 7 then t is not closed 8 if t is closed 9 then insert t into T 10 return T 12 / 30
  16. 16. Example D = {A, B} A = (0, 1, 2, 3, 2, 1) B = (0, 1, 2, 3, 1, 2, 2) min_sup = 2. (0, 1, 2, 1) (0, 1, 2, 2, 1) (0, 1, 1) (0, 1, 2, 2) (0) (0, 1) (0, 1, 2) (0, 1, 2, 3, 1) (0, 1, 2, 3) 13 / 30
  17. 17. Example D = {A, B} A = (0, 1, 2, 3, 2, 1) B = (0, 1, 2, 3, 1, 2, 2) min_sup = 2. (0, 1, 2, 1) (0, 1, 2, 2, 1) (0, 1, 1) (0, 1, 2, 2) (0) (0, 1) (0, 1, 2) (0, 1, 2, 3, 1) (0, 1, 2, 3) 13 / 30
  18. 18. Experimental results TreeNat CMTreeMiner Unlabeled Trees Labeled Trees Top-Down Subtrees Induced Subtrees No Occurrences Occurrences 14 / 30
  19. 19. Closure Operator on Trees D: the finite input dataset of trees T : the (infinite) set of all trees Definition We define the following the Galois connection pair: For finite A ⊆ D σ (A) is the set of subtrees of the A trees in T σ (A) = {t ∈ T ∀ t ∈ A (t t )} For finite B ⊂ T τD (B) is the set of supertrees of the B trees in D τD (B) = {t ∈ D ∀ t ∈ B (t t )} Closure Operator The composition ΓD = σ ◦ τD is a closure operator. 15 / 30
  20. 20. Galois Lattice of closed set of trees 1 2 3 12 13 23 123 16 / 30
  21. 21. Galois Lattice of closed set of trees 1 2 3 D B={ } 12 13 23 123 17 / 30
  22. 22. Galois Lattice of closed set of trees B={ } 1 2 3 τD (B) = { , } 12 13 23 123 17 / 30
  23. 23. Galois Lattice of closed set of trees B={ } 1 2 3 τD (B) = { , } 12 13 23 ΓD (B) = σ ◦τD(B) = { and its subtrees } 123 17 / 30
  24. 24. Algorithms Algorithms Incremental: I NC T REE N AT Sliding Window: W IN T REE N AT Adaptive: A DAT REE N AT Uses ADWIN to monitor change ADWIN An adaptive sliding window whose size is recomputed online according to the rate of change observed. ADWIN has rigorous guarantees (theorems) On ratio of false positives and false negatives On the relation of the size of the current window and change rates 18 / 30
  25. 25. Experimental Validation: TN1 CMTreeMiner 300 Time 200 (sec.) 100 I NC T REE N AT 2 4 6 8 Size (Milions) Figure: Experiments on ordered trees with TN1 dataset 19 / 30
  26. 26. What is MOA? {M}assive {O}nline {A}nalysis is a framework for online learning from data streams. It is closely related to WEKA It includes a collection of offline and online as well as tools for evaluation: boosting and bagging Hoeffding Trees with and without Naïve Bayes classifiers at the leaves. 20 / 30
  27. 27. WEKA: the bird 21 / 30
  28. 28. MOA: the bird The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct. 22 / 30
  29. 29. MOA: the bird The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct. 22 / 30
  30. 30. MOA: the bird The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct. 22 / 30
  31. 31. Data stream classification cycle 1 Process an example at a time, and inspect it only once (at most) 2 Use a limited amount of memory 3 Work in a limited amount of time 4 Be ready to predict at any point 23 / 30
  32. 32. Environments and Data Sources Environments Sensor Network: 100Kb Handheld Computer: 32 Mb Server: 400 Mb Data Sources Random Tree Generator Random RBF Generator LED Generator Waveform Generator Function Generator 24 / 30
  33. 33. Algorithms Naive Bayes Prediction strategies Decision stumps Majority class Hoeffding Tree Naive Bayes Leaves Hoeffding Option Tree Adaptive Hybrid Bagging and Boosting 25 / 30
  34. 34. Hoeffding Option Tree Hoeffding Option Trees Regular Hoeffding tree containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths. 26 / 30
  35. 35. GUI java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.gui.TaskLauncher 27 / 30
  36. 36. GUI java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.gui.TaskLauncher 27 / 30
  37. 37. Ensemble Methods http://www.cs.waikato.ac.nz/∼abifet/MOA/ New ensemble methods: ADWIN bagging: When a change is detected, the worst classifier is removed and a new classifier is added. Adaptive-Size Hoeffding Tree bagging 28 / 30
  38. 38. XML Tree Framework on evolving data streams Maximal Closed # Trees Att. Acc. Mem. Att. Acc. Mem. CSLOG12 15483 84 79.64 1.2 228 78.12 2.54 CSLOG23 15037 88 79.81 1.21 243 78.77 2.75 CSLOG31 15702 86 79.94 1.25 243 77.60 2.73 CSLOG123 23111 84 80.02 1.7 228 78.91 4.18 Table: BAGGING on unordered trees. 29 / 30
  39. 39. Conclusions XML tree stream classifier system. Using Galois Latice Theory, we present methods for mining closed trees Incremental Sliding Window Adaptive: using ADWIN to monitor change We use MOA data stream classifiers. 30 / 30
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×