Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Talk about tree mining on evolving data streams.


  1. Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams. Albert Bifet and Ricard Gavaldà, Universitat Politècnica de Catalunya. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'08), 2008, Las Vegas, USA.
  2. Tree Mining. Mining frequent trees is becoming an important task. Applications: chemical informatics, computer vision, text retrieval, bioinformatics, Web analysis. Many link-based structures may be studied formally by means of unordered trees. Data Streams: the sequence is potentially infinite; high amount of data: sublinear space; high speed of arrival: sublinear time per example.
  3.–6. Introduction: Data Streams. The sequence is potentially infinite; high amount of data: sublinear space; high speed of arrival: sublinear time per example. Once an element from a data stream has been processed, it is discarded or archived. Example puzzle: finding the missing number. Let π be a permutation of {1, . . . , n} and let π−1 be π with one element missing; π−1[i] arrives in increasing order. Task: determine the missing number. A naive solution uses an n-bit vector to memorize all the numbers (O(n) space); a data-stream solution uses O(log n) space by storing n(n + 1)/2 − Σ_{j≤i} π−1[j].
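The O(log n)-space trick from the slide can be sketched in a few lines (a minimal sketch; the function name is ours):

```python
def missing_number(stream, n):
    """Recover the missing element of a permutation of {1, ..., n} with one
    element removed, using only O(log n) bits of state: keep the running sum
    and subtract it from n(n+1)/2. (The elements may arrive in any order;
    the slide's increasing-order assumption is not needed for this trick.)"""
    total = n * (n + 1) // 2
    for x in stream:          # each element is seen once, then discarded
        total -= x
    return total

print(missing_number([1, 2, 3, 5, 6], 6))  # -> 4
```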
  7. Introduction: Trees. Our trees are unlabeled, both ordered and unordered; our subtrees are induced. (The slide shows two different ordered trees that are the same unordered tree.)
  8. Introduction. Induced subtrees: obtained by repeatedly removing leaf nodes. Embedded subtrees: obtained by contracting some of the edges.
  9. Introduction. What is tree pattern mining? Given a dataset of trees, find the complete set of frequent subtrees. Frequent tree patterns (FS): all trees whose support is no less than min_sup. Closed frequent tree patterns (CS): frequent trees with no super-tree of the same support. CS ⊆ FS. Closed frequent tree mining provides a compact representation of the frequent trees without loss of information.
  10. Introduction. Unordered subtree mining example (trees A, B, X, Y shown as figures on the slide): with D = {A, B} and min_sup = 2, there are 2 closed subtrees (X and Y) and 9 frequent subtrees.
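To make the frequent/closed distinction concrete without implementing subtree matching, here is a hedged sketch that uses itemsets in place of trees (subset plays the role of subtree; the closedness definition is identical). The function name and the tiny dataset are ours, not from the talk, and the brute-force enumeration is exponential, for illustration only:

```python
from itertools import combinations

def frequent_and_closed(transactions, min_sup):
    """Enumerate frequent itemsets by brute force, then keep the closed
    ones: patterns with no proper super-pattern of the same support."""
    items = sorted(set().union(*transactions))
    support = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = sum(1 for t in transactions if set(cand) <= t)
            if s >= min_sup:
                support[frozenset(cand)] = s
    closed = {p: s for p, s in support.items()
              if not any(p < q and s == support[q] for q in support)}
    return support, closed

# CS is a subset of FS, and every frequent pattern's support is
# recoverable from CS: the compact, lossless representation on the slide.
FS, CS = frequent_and_closed([{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}], 2)
```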
  11. Introduction. Problem: given a data stream D of rooted, unlabeled, unordered trees, find the frequent closed trees. We provide three algorithms of increasing power: incremental, sliding window, and adaptive.
  12. Outline: 1 Introduction; 2 Data Streams; 3 ADWIN: Concept Drift Mining; 4 Adaptive Closed Frequent Tree Mining; 5 Summary.
  13. Data Streams. At any time t in the data stream, we would like the per-item processing time and storage to be simultaneously O(log^k (N, t)). Approximation algorithms: small error rate with high probability. An algorithm (ε, δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ.
  14.–19. Data Streams: Approximation Algorithms. Sliding window (the slides animate a window sliding over a bit stream): we can maintain simple statistics over sliding windows using O((1/ε) log² N) space, where N is the length of the sliding window and ε is the accuracy parameter. [M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002]
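The cited result of Datar et al. rests on exponential histograms. The following is a simplified DGIM-style sketch (our illustration of the technique, not the paper's exact algorithm) that estimates how many of the last N stream bits are 1 while keeping at most r buckets of each power-of-two size:

```python
def count_ones_sliding(bits, N, r=2):
    """Simplified DGIM-style exponential histogram: estimate the number of
    1s among the last N bits using O(log^2 N) bits of state. With r = 2 the
    estimate is within roughly 50% of the true count."""
    buckets = []  # (timestamp of most recent 1 in bucket, size), oldest first
    for t, b in enumerate(bits):
        # expire buckets whose most recent 1 has left the window
        while buckets and buckets[0][0] <= t - N:
            buckets.pop(0)
        if b:
            buckets.append((t, 1))
            size = 1
            # merge the two oldest buckets of a size once there are > r of it
            while sum(1 for _, s in buckets if s == size) > r:
                i, j = [k for k, (_, s) in enumerate(buckets) if s == size][:2]
                buckets[j] = (buckets[j][0], 2 * size)  # keep newer timestamp
                del buckets[i]
                size *= 2
    if not buckets:
        return 0
    sizes = [s for _, s in buckets]
    return sum(sizes) - sizes[0] // 2  # count the oldest bucket as half full
```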
  20. Outline: 1 Introduction; 2 Data Streams; 3 ADWIN: Concept Drift Mining; 4 Adaptive Closed Frequent Tree Mining; 5 Summary.
  21. ADWIN: Adaptive Sliding Window. ADWIN is an adaptive sliding window whose size is recomputed online according to the rate of change observed. ADWIN has rigorous guarantees (theorems): on the ratio of false positives and negatives, and on the relation between the size of the current window and the change rate. Using a data-stream sliding-window model, ADWIN can provide the exact counts of 1's in O(1) time per point; it tries O(log W) cutpoints, uses O((1/ε) log W) memory words, and its processing time per example is O(log W) (amortized and worst-case).
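ADWIN's core idea can be illustrated naively (our sketch, not the real algorithm): keep the window explicitly and drop old elements while some split has sub-window means that differ by more than a Hoeffding-style threshold. The real ADWIN compresses the window into exponential buckets to reach the O(log W) bounds on the slide; this version is O(|W|) per item and only shows the adaptive shrinking.

```python
import math

class NaiveAdwin:
    """Naive illustration of ADWIN's cut test: on each arrival, drop the
    oldest element while some split W = W0 . W1 has
    |mean(W0) - mean(W1)| above a Hoeffding-style threshold."""

    def __init__(self, delta=0.01):
        self.delta = delta   # confidence parameter
        self.window = []

    def _threshold(self, n0, n1):
        # cut threshold built from the harmonic mean m of |W0| and |W1|
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        return math.sqrt(math.log(4.0 * len(self.window) / self.delta) / (2.0 * m))

    def add(self, x):
        """Insert x; shrink the window while a change is detected.
        Returns True iff any element was dropped."""
        self.window.append(x)
        changed = False
        shrinking = True
        while shrinking and len(self.window) > 1:
            shrinking = False
            total = float(sum(self.window))
            head = 0.0
            for i in range(1, len(self.window)):
                head += self.window[i - 1]
                n0, n1 = i, len(self.window) - i
                if abs(head / n0 - (total - head) / n1) > self._threshold(n0, n1):
                    self.window.pop(0)      # drop oldest, then re-test
                    changed = shrinking = True
                    break
        return changed
```

Fed a stream whose mean jumps (say, a run of 0s followed by a run of 1s), the window stays large on the stable prefix and shrinks around the change point.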
  22.–24. Time Change Detectors and Predictors: A General Framework. (Diagram, built up over three slides: the input xt feeds an Estimator, which outputs an estimation; a Change Detector raises an alarm; a Memory module exchanges information with both.)
  25.–38. Window Management Models. W = 101010110111111. Equal & fixed-size subwindows [Kifer+ 04]: 1010 1011011 1111. Total window against subwindow [Gama+ 04]: 10101011011 1111. Equal-size adjacent subwindows [Dasu+ 06]: 1010101 1011 1111. ADWIN: all adjacent subwindows (the slides animate every split point of W).
  39. Outline: 1 Introduction; 2 Data Streams; 3 ADWIN: Concept Drift Mining; 4 Adaptive Closed Frequent Tree Mining; 5 Summary.
  40. Pattern Relaxed Support. Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng, Yunfeng Liu, and Kunqing Xie. CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets Mining over Stream Data. Linear relaxed interval: the support space of all subpatterns can be divided into n = 1/εr intervals, where εr is a user-specified relaxation factor; each interval is Ii = [li, ui), where li = (n − i) · εr ≥ 0, ui = (n − i + 1) · εr ≤ 1, and i ≤ n. Linear relaxed closed subpattern t: there exists no proper superpattern t′ of t such that their supports belong to the same interval Ii.
  41. Pattern Relaxed Support. As the number of closed frequent patterns is not linear with respect to support, we introduce a new relaxed support. Logarithmic relaxed interval: the support space of all subpatterns can be divided into n = 1/εr intervals, where εr is a user-specified relaxation factor; each interval is Ii = [li, ui), where li = c^i, ui = c^(i+1) − 1, and i ≤ n. Logarithmic relaxed closed subpattern t: there exists no proper superpattern t′ of t such that their supports belong to the same interval Ii.
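Under the logarithmic scheme, a support value s falls in the interval with index i = ⌊log_c s⌋. A small sketch (the function name and the default base c = 2 are ours; the slide leaves the constant c unspecified):

```python
import math

def log_interval(support, c=2):
    """Index i of the logarithmic relaxed interval I_i = [c^i, c^(i+1) - 1]
    that contains a given support value."""
    assert support >= 1 and c > 1
    i = int(math.log(support, c))
    # guard against floating-point rounding at interval boundaries
    while c ** (i + 1) <= support:
        i += 1
    while c ** i > support:
        i -= 1
    return i

# With c = 2, supports 4..7 all land in interval 2, so a pattern and a
# proper superpattern with supports 5 and 6 fall in the same interval,
# and only the superpattern is kept as relaxed-closed.
```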
  42. Galois Lattice of Closed Sets of Trees. For a dataset D, we need: a Galois connection pair and a closure operator. (The slide shows a lattice diagram over elements 1, 2, 3: pairs 12, 13, 23, and top 123.)
  43. Incremental mining on closed frequent trees. 1. Adding a tree transaction does not decrease the number of closed trees for D. 2. Adding a transaction with a closed tree does not modify the number of closed trees for D. (Illustrated on the lattice 1, 2, 3; 12, 13, 23; 123.)
  44. Sliding-window mining on closed frequent trees. 1. Deleting a tree transaction does not increase the number of closed trees for D. 2. Deleting a tree transaction that is repeated does not modify the number of closed trees for D. (Illustrated on the lattice 1, 2, 3; 12, 13, 23; 123.)
  45. Algorithms. Incremental: IncTreeNat. Sliding window: WinTreeNat. Adaptive: AdaTreeNat, which uses ADWIN to monitor change. ADWIN is an adaptive sliding window whose size is recomputed online according to the rate of change observed, with rigorous guarantees (theorems) on the ratio of false positives and negatives and on the relation between the size of the current window and the change rate.
  46. Experimental Validation: TN1. Figure: time (sec.) versus dataset size (millions) on experiments on ordered trees on the TN1 dataset, comparing CMTreeMiner and IncTreeNat.
  47. Experimental Validation. Figure: number of closed trees (series AdaTreeInc 1 and AdaTreeInc 2) against the number of samples, maintaining the same number of closed trees on the input data.
  48. Outline: 1 Introduction; 2 Data Streams; 3 ADWIN: Concept Drift Mining; 4 Adaptive Closed Frequent Tree Mining; 5 Summary.
  49. Summary. Conclusions: a new logarithmic relaxed closed support; using Galois lattice theory, we present methods for mining closed trees: incremental (IncTreeNat), sliding window (WinTreeNat), and adaptive (AdaTreeNat, using ADWIN to monitor change). Future work: labeled trees and XML data.
