Upcoming SlideShare
×

# Adaptive XML Tree Mining on Evolving Data Streams

1,073
-1

Published on

Talk about mining XML trees on evolving data streams.

Published in: Technology
1 Like
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

Views
Total Views
1,073
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
31
0
Likes
1
Embeds 0
No embeds

No notes for slide

### Adaptive XML Tree Mining on Evolving Data Streams

1. 1. Adaptive XML Tree Mining on Evolving Data Streams Albert Bifet Laboratory for Relational Algorithmics, Complexity and Learning LARCA Departament de Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya Porto, 21 May 2009
2. 2. Mining Evolving Massive Structured Data The basic problem Finding interesting structure on data Mining massive data Mining time varying data Mining on real time Mining XML data The Disintegration of Persistence of Memory 1952-54 Salvador Dalí 2 / 30
3. 3. XML Tree Classiﬁcation on evolving data streams D D D D B B B B B B B C C C C C C A A C LASS 1 C LASS 2 C LASS 1 C LASS 2 D Figure: A dataset example 3 / 30
4. 4. Tree Pattern Mining Given a dataset of trees, ﬁnd the complete set of frequent subtrees Frequent Tree Pattern (FT): Include all the trees whose support is no less than min_sup Closed Frequent Tree Pattern (CT): Include no tree which has a Trees are sanctuaries. super-tree with the same Whoever knows how support to listen to them, can learn the truth. CT ⊆ FT Herman Hesse 4 / 30
5. 5. Mining Closed Frequent Trees Our trees are: Our subtrees are: Labeled and Unlabeled Induced Ordered and Unordered Top-down Two different ordered trees but the same unordered tree 5 / 30
6. 6. A tale of two trees Consider D = {A, B}, where Frequent subtrees A B A: B: and let min_sup = 2. 6 / 30
7. 7. A tale of two trees Consider D = {A, B}, where Closed subtrees A B A: B: and let min_sup = 2. 6 / 30
8. 8. XML Tree Classiﬁcation on evolving data streams D D D D B B B B B B B C C C C C C A A C LASS 1 C LASS 2 C LASS 1 C LASS 2 D Figure: A dataset example 7 / 30
9. 9. XML Tree Classiﬁcation on evolving data streams Tree Trans. Closed Freq. not Closed Trees 1 2 3 4 D B B c1 C C C C 1 0 1 0 D B B C A C C A c2 A A 1 0 0 1 8 / 30
10. 10. XML Tree Classiﬁcation on evolving data streams Frequent Trees c1 c2 c3 c4 Id 1 c1 f1 1 2 3 c2 f2 f2 f2 1 c3 f3 1 2 3 4 5 c4 f4 f4 f4 f4 f4 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 2 0 0 0 0 0 0 1 1 1 1 1 1 1 1 3 1 1 0 0 0 0 1 1 1 1 1 1 1 1 4 0 0 1 1 1 1 1 1 1 1 1 1 1 1 Closed Maximal Trees Trees Id Tree c1 c2 c3 c4 c1 c2 c3 Class 1 1 1 0 1 1 1 0 C LASS 1 2 0 0 1 1 0 0 1 C LASS 2 3 1 0 1 1 1 0 1 C LASS 1 4 0 1 1 1 0 1 1 C LASS 2 9 / 30
11. 11. XML Tree Framework on evolving data streams XML Tree Classiﬁcation Framework Components An XML closed frequent tree miner A Data stream classiﬁer algorithm, which we will feed with tuples to be classiﬁed online. 10 / 30
12. 12. Mining Evolving Tree Data Streams Problem Given a data stream D of rooted and unordered trees, ﬁnd frequent closed trees. We provide three algorithms, of increasing power Incremental Sliding Window Adaptive D 11 / 30
13. 13. Mining Closed Unordered Subtrees C LOSED _S UBTREES(t, D, min_sup, T ) 1 2 3 for every t that can be extended from t in one step 4 do if Support(t ) ≥ min_sup 5 then T ← C LOSED _S UBTREES(t , D, min_sup, T ) 6 7 8 9 10 return T 12 / 30
14. 14. Mining Closed Unordered Subtrees C LOSED _S UBTREES(t, D, min_sup, T ) 1 if not C ANONICAL _R EPRESENTATIVE(t) 2 then return T 3 for every t that can be extended from t in one step 4 do if Support(t ) ≥ min_sup 5 then T ← C LOSED _S UBTREES(t , D, min_sup, T ) 6 7 8 9 10 return T 12 / 30
15. 15. Mining Closed Unordered Subtrees C LOSED _S UBTREES(t, D, min_sup, T ) 1 if not C ANONICAL _R EPRESENTATIVE(t) 2 then return T 3 for every t that can be extended from t in one step 4 do if Support(t ) ≥ min_sup 5 then T ← C LOSED _S UBTREES(t , D, min_sup, T ) 6 do if Support(t ) = Support(t) 7 then t is not closed 8 if t is closed 9 then insert t into T 10 return T 12 / 30
16. 16. Example D = {A, B} A = (0, 1, 2, 3, 2, 1) B = (0, 1, 2, 3, 1, 2, 2) min_sup = 2. (0, 1, 2, 1) (0, 1, 2, 2, 1) (0, 1, 1) (0, 1, 2, 2) (0) (0, 1) (0, 1, 2) (0, 1, 2, 3, 1) (0, 1, 2, 3) 13 / 30
17. 17. Example D = {A, B} A = (0, 1, 2, 3, 2, 1) B = (0, 1, 2, 3, 1, 2, 2) min_sup = 2. (0, 1, 2, 1) (0, 1, 2, 2, 1) (0, 1, 1) (0, 1, 2, 2) (0) (0, 1) (0, 1, 2) (0, 1, 2, 3, 1) (0, 1, 2, 3) 13 / 30
18. 18. Experimental results TreeNat CMTreeMiner Unlabeled Trees Labeled Trees Top-Down Subtrees Induced Subtrees No Occurrences Occurrences 14 / 30
19. 19. Closure Operator on Trees D: the ﬁnite input dataset of trees T : the (inﬁnite) set of all trees Deﬁnition We deﬁne the following the Galois connection pair: For ﬁnite A ⊆ D σ (A) is the set of subtrees of the A trees in T σ (A) = {t ∈ T ∀ t ∈ A (t t )} For ﬁnite B ⊂ T τD (B) is the set of supertrees of the B trees in D τD (B) = {t ∈ D ∀ t ∈ B (t t )} Closure Operator The composition ΓD = σ ◦ τD is a closure operator. 15 / 30
20. 20. Galois Lattice of closed set of trees 1 2 3 12 13 23 123 16 / 30
21. 21. Galois Lattice of closed set of trees 1 2 3 D B={ } 12 13 23 123 17 / 30
22. 22. Galois Lattice of closed set of trees B={ } 1 2 3 τD (B) = { , } 12 13 23 123 17 / 30
23. 23. Galois Lattice of closed set of trees B={ } 1 2 3 τD (B) = { , } 12 13 23 ΓD (B) = σ ◦τD(B) = { and its subtrees } 123 17 / 30
24. 24. Algorithms Algorithms Incremental: I NC T REE N AT Sliding Window: W IN T REE N AT Adaptive: A DAT REE N AT Uses ADWIN to monitor change ADWIN An adaptive sliding window whose size is recomputed online according to the rate of change observed. ADWIN has rigorous guarantees (theorems) On ratio of false positives and false negatives On the relation of the size of the current window and change rates 18 / 30
25. 25. Experimental Validation: TN1 CMTreeMiner 300 Time 200 (sec.) 100 I NC T REE N AT 2 4 6 8 Size (Milions) Figure: Experiments on ordered trees with TN1 dataset 19 / 30
26. 26. What is MOA? {M}assive {O}nline {A}nalysis is a framework for online learning from data streams. It is closely related to WEKA It includes a collection of ofﬂine and online as well as tools for evaluation: boosting and bagging Hoeffding Trees with and without Naïve Bayes classiﬁers at the leaves. 20 / 30
27. 27. WEKA: the bird 21 / 30
28. 28. MOA: the bird The Moa (another native NZ bird) is not only ﬂightless, like the Weka, but also extinct. 22 / 30
29. 29. MOA: the bird The Moa (another native NZ bird) is not only ﬂightless, like the Weka, but also extinct. 22 / 30
30. 30. MOA: the bird The Moa (another native NZ bird) is not only ﬂightless, like the Weka, but also extinct. 22 / 30
31. 31. Data stream classiﬁcation cycle 1 Process an example at a time, and inspect it only once (at most) 2 Use a limited amount of memory 3 Work in a limited amount of time 4 Be ready to predict at any point 23 / 30
32. 32. Environments and Data Sources Environments Sensor Network: 100Kb Handheld Computer: 32 Mb Server: 400 Mb Data Sources Random Tree Generator Random RBF Generator LED Generator Waveform Generator Function Generator 24 / 30
33. 33. Algorithms Naive Bayes Prediction strategies Decision stumps Majority class Hoeffding Tree Naive Bayes Leaves Hoeffding Option Tree Adaptive Hybrid Bagging and Boosting 25 / 30
34. 34. Hoeffding Option Tree Hoeffding Option Trees Regular Hoeffding tree containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths. 26 / 30
35. 35. GUI java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.gui.TaskLauncher 27 / 30
36. 36. GUI java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.gui.TaskLauncher 27 / 30
37. 37. Ensemble Methods http://www.cs.waikato.ac.nz/∼abifet/MOA/ New ensemble methods: ADWIN bagging: When a change is detected, the worst classiﬁer is removed and a new classiﬁer is added. Adaptive-Size Hoeffding Tree bagging 28 / 30
38. 38. XML Tree Framework on evolving data streams Maximal Closed # Trees Att. Acc. Mem. Att. Acc. Mem. CSLOG12 15483 84 79.64 1.2 228 78.12 2.54 CSLOG23 15037 88 79.81 1.21 243 78.77 2.75 CSLOG31 15702 86 79.94 1.25 243 77.60 2.73 CSLOG123 23111 84 80.02 1.7 228 78.91 4.18 Table: BAGGING on unordered trees. 29 / 30
39. 39. Conclusions XML tree stream classiﬁer system. Using Galois Latice Theory, we present methods for mining closed trees Incremental Sliding Window Adaptive: using ADWIN to monitor change We use MOA data stream classiﬁers. 30 / 30
1. #### A particular slide catching your eye?

Clipping is a handy way to collect important slides you want to go back to later.