# Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Talk about tree mining on evolving data streams.

Published in: Technology
### Transcript

• 1. Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams. Albert Bifet and Ricard Gavaldà, Universitat Politècnica de Catalunya. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'08), 2008, Las Vegas, USA.
• 2. Tree Mining. Mining frequent trees is becoming an important task. Applications: chemical informatics, computer vision, text retrieval, bioinformatics, Web analysis. Many link-based structures may be studied formally by means of unordered trees. Data Streams: the sequence is potentially infinite; high amount of data: sublinear space; high speed of arrival: sublinear time per example.
• 3. Introduction: Data Streams. The sequence is potentially infinite. High amount of data: sublinear space. High speed of arrival: sublinear time per example. Once an element from a data stream has been processed, it is discarded or archived. Example puzzle, Finding Missing Numbers: let $\pi$ be a permutation of $\{1, \ldots, n\}$, and let $\pi_{-1}$ be $\pi$ with one element missing. $\pi_{-1}[i]$ arrives in increasing order of $i$. Task: determine the missing number.
• 4. Introduction: Data Streams. Finding Missing Numbers: use an n-bit vector to memorize all the numbers ($O(n)$ space).
• 5. Introduction: Data Streams. Finding Missing Numbers, data stream solution: $O(\log n)$ space.
• 6. Introduction: Data Streams. Finding Missing Numbers, data stream solution in $O(\log n)$ space: store $\frac{n(n+1)}{2} - \sum_{j \le i} \pi_{-1}[j]$.
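The running-difference idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the function name `find_missing` is made up for the example. It keeps a single integer, which needs O(log n) bits, and never stores the stream.

```python
from typing import Iterable


def find_missing(stream: Iterable[int], n: int) -> int:
    """Return the one value of {1, ..., n} that never appears in the stream."""
    # Start from n(n+1)/2 and subtract every element as it arrives.
    remaining = n * (n + 1) // 2
    for value in stream:
        remaining -= value  # the element can be discarded right after this
    return remaining


print(find_missing([1, 2, 3, 5, 6], 6))  # 4
```

The arrival order does not matter for the sum, so the same sketch works for any ordering of the stream.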
• 7. Introduction: Trees. Our trees are unlabeled, and ordered or unordered. Our subtrees are induced. [Figure: two different ordered trees that are the same unordered tree.]
• 8. Introduction. Induced subtrees: obtained by repeatedly removing leaf nodes. Embedded subtrees: obtained by contracting some of the edges.
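One common way to decide whether two ordered trees are the same unordered tree is to compare a canonical form in which the children of every node are sorted. The sketch below assumes a representation that is not from the slides: an unlabeled tree is a tuple of its child subtrees, and a leaf is the empty tuple. The names `canonical` and `same_unordered` are hypothetical.

```python
def canonical(tree: tuple) -> tuple:
    """Canonical form of an unlabeled tree: children sorted recursively."""
    return tuple(sorted(canonical(child) for child in tree))


def same_unordered(a: tuple, b: tuple) -> bool:
    """True if a and b are equal once the order of children is ignored."""
    return canonical(a) == canonical(b)


# Root with two children: a node that has one child, and a leaf.
left = (((),), ())
# The same two children in the opposite order.
right = ((), ((),))

print(left == right)                # False: different as ordered trees
print(same_unordered(left, right))  # True: same unordered tree
```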
• 9. Introduction: What Is Tree Pattern Mining? Given a dataset of trees, find the complete set of frequent subtrees. Frequent Tree Pattern (FS): include all the trees whose support is no less than min_sup. Closed Frequent Tree Pattern (CS): include no tree which has a super-tree with the same support. CS ⊆ FS. Closed Frequent Tree Mining provides a compact representation of frequent trees without loss of information.
• 10. Introduction: Unordered Subtree Mining. D = {A, B}, min_sup = 2. Number of closed subtrees: 2 (X and Y). Number of frequent subtrees: 9. [Figures of the trees A, B, X, Y and of the nine frequent subtrees are not reproduced in the transcript.]
• 11. Introduction: Problem. Given a data stream D of rooted, unlabelled and unordered trees, find frequent closed trees. We provide three algorithms, of increasing power: Incremental, Sliding Window, Adaptive.
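The FS and CS definitions can be illustrated with a brute-force sketch. As a loudly stated assumption, it uses itemsets as a stand-in for trees, with the subset test `<=` playing the role of the subtree test; a tree miner would swap in an induced-subtree check. The function names are hypothetical, and the enumeration is exponential, so this is only for tiny examples.

```python
from itertools import combinations


def support(pattern: frozenset, dataset: list) -> int:
    """Number of transactions that contain the pattern."""
    return sum(1 for transaction in dataset if pattern <= transaction)


def frequent_patterns(dataset: list, min_sup: int) -> dict:
    """FS: every non-empty pattern whose support is no less than min_sup."""
    items = sorted(set().union(*dataset))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            count = support(pattern, dataset)
            if count >= min_sup:
                result[pattern] = count
    return result


def closed_patterns(frequent: dict) -> dict:
    """CS: frequent patterns with no proper super-pattern of equal support."""
    # A super-pattern with the same support is itself frequent,
    # so searching inside `frequent` is enough.
    return {
        p: s for p, s in frequent.items()
        if not any(p < q and s == frequent[q] for q in frequent)
    }


data = [frozenset("abc"), frozenset("abd")]
fs = frequent_patterns(data, min_sup=2)
cs = closed_patterns(fs)
print(len(fs), len(cs))
```

Every frequent pattern and its support can be recovered from CS alone: the support of a pattern is the largest support among the closed patterns that contain it.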
• 12. Outline: 1. Introduction; 2. Data Streams; 3. ADWIN: Concept Drift Mining; 4. Adaptive Closed Frequent Tree Mining; 5. Summary.
• 13. Data Streams. At any time $t$ in the data stream, we would like the per-item processing time and storage to be simultaneously $O(\log^k(N, t))$. Approximation algorithms: small error rate with high probability. An algorithm $(\varepsilon, \delta)$-approximates $F$ if it outputs $\tilde{F}$ for which $\Pr[|\tilde{F} - F| > \varepsilon F] < \delta$.
• 14. Data Streams: Approximation Algorithms. [Animation: a sliding window advancing over the bit stream 1011000111 1010101.] We can maintain simple statistics over sliding windows using $O(\frac{1}{\varepsilon} \log^2 N)$ space, where $N$ is the length of the sliding window and $\varepsilon$ is the accuracy parameter. M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002.
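The cited result rests on an exponential histogram of buckets whose sizes are powers of two. The sketch below is a simplified textbook variant written for illustration, not the exact structure from the paper: the class name, the `max_per_size` knob, and the rounding in `estimate` are assumptions. With at most two buckets per size, the estimate stays within 50% of the true count of 1s in the window.

```python
from collections import deque


class ExpHistogram:
    """Approximate count of 1s among the last `window` bits of a stream."""

    def __init__(self, window: int, max_per_size: int = 2):
        self.window = window
        self.max_per_size = max_per_size  # assumed knob: buckets allowed per size
        self.time = 0
        # Each bucket is (timestamp of its newest 1, size); newest bucket first.
        self.buckets = deque()

    def add(self, bit: int) -> None:
        self.time += 1
        # Drop buckets whose newest 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit:
            self.buckets.appendleft((self.time, 1))
            self._merge()

    def _merge(self) -> None:
        b = list(self.buckets)
        i = 0
        while i < len(b):
            size = b[i][1]
            j = i
            while j < len(b) and b[j][1] == size:
                j += 1
            if j - i > self.max_per_size:
                # Merge the two oldest buckets of this size, keeping the newer
                # timestamp, then re-check the run of the doubled size.
                b[j - 2:j] = [(b[j - 2][0], 2 * size)]
                i = j - 2
            else:
                i = j
        self.buckets = deque(b)

    def estimate(self) -> int:
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only the oldest bucket may straddle the window edge: count about half.
        return total - self.buckets[-1][1] // 2


hist = ExpHistogram(window=8)
for bit in [1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0]:
    hist.add(bit)
print(hist.estimate(), "ones (approx.) in the last 8 bits")
```

Only the bucket list is stored, never the window itself, which is where the space saving comes from.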
• 21. ADWIN: Adaptive sliding window. ADWIN is an adaptive sliding window whose size is recomputed online according to the rate of change observed. ADWIN has rigorous guarantees (theorems) on the ratio of false positives and negatives, and on the relation of the size of the current window and change rates. ADWIN, using a Data Stream Sliding Window Model: can provide the exact counts of 1's in $O(1)$ time per point; tries $O(\log W)$ cutpoints; uses $O(\frac{1}{\varepsilon} \log W)$ memory words; the processing time per example is $O(\log W)$ (amortized and worst-case).
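The window-shrinking behaviour can be sketched with a deliberately naive version that stores the whole window and tests every cut point, so it costs O(W) per check rather than the O(log W) stated on the slide. The threshold formula is not on the slides; it is the Hoeffding-style bound usually quoted for ADWIN and should be treated as an assumption here, as should the class name `SimpleAdwin`.

```python
import math
from collections import deque


class SimpleAdwin:
    """Naive adaptive window over values in [0, 1] (illustrative only)."""

    def __init__(self, delta: float = 0.01):
        self.delta = delta      # confidence parameter
        self.window = deque()   # oldest value on the left

    def add(self, x: float) -> bool:
        """Append x; return True if old data was dropped (change detected)."""
        self.window.append(x)
        changed = False
        while self._drop_oldest_if_change():
            changed = True
        return changed

    def _drop_oldest_if_change(self) -> bool:
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        delta_prime = self.delta / n
        head = 0.0
        for n0, x in enumerate(self.window, start=1):
            if n0 == n:
                break
            head += x
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic-style combination of sizes
            # Assumed threshold: sqrt(ln(4 / delta') / (2 m)).
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(head / n0 - (total - head) / n1) >= eps_cut:
                self.window.popleft()  # shrink from the old end
                return True
        return False

    def mean(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 0.0


adwin = SimpleAdwin()
detections = [adwin.add(x) for x in [0.0] * 200 + [1.0] * 200]
print(any(detections), len(adwin.window), round(adwin.mean(), 3))
```

On a stream that switches from 0s to 1s, the window sheds the old 0s shortly after the switch and then grows again while the data stays stable.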
• 22–24. Time Change Detectors and Predictors: A General Framework. [Diagram, built up over three slides: the input $x_t$ feeds an Estimator, which outputs an Estimation; a Change Detector, which outputs an Alarm; and a Memory module.]
• 25. Window Management Models. For W = 101010110111111: equal & fixed size subwindows, 1010 1011011 1111 [Kifer+ 04]; total window against subwindow, 10101011011 1111 [Gama+ 04]; equal size adjacent subwindows, 1010101 1011 1111 [Dasu+ 06]; ADWIN, all adjacent subwindows, i.e. every split of W into two adjacent parts, such as 1 | 01010110111111.
• 40. Pattern Relaxed Support. Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng, Yunfeng Liu and Kunqing Xie. CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets Mining over Stream Data. Linear Relaxed Interval: the support space of all subpatterns can be divided into $n = 1/\varepsilon_r$ intervals, where $\varepsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = (n - i)\,\varepsilon_r \ge 0$, $u_i = (n - i + 1)\,\varepsilon_r \le 1$ and $i \le n$. Linear Relaxed closed subpattern $t$: if and only if there exists no proper superpattern $t'$ of $t$ such that their supports belong to the same interval $I_i$.
• 41. Pattern Relaxed Support. As the number of closed frequent patterns is not linear with respect to support, we introduce a new relaxed support. Logarithmic Relaxed Interval: the support space of all subpatterns can be divided into $n = 1/\varepsilon_r$ intervals, where $\varepsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = c^i$, $u_i = c^{i+1} - 1$ and $i \le n$. Logarithmic Relaxed closed subpattern $t$: if and only if there exists no proper superpattern $t'$ of $t$ such that their supports belong to the same interval $I_i$.
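To make the two interval schemes concrete, here is a small sketch that maps a support to its interval index. Several choices are assumptions made for the example rather than statements from the slides: supports are integer counts, the linear scheme takes the number of intervals `n` directly (so $\varepsilon_r = 1/n$) to avoid floating-point rounding, a support equal to the whole dataset is clamped into the top interval, and the logarithmic interval is read as the integer range from $c^i$ to $c^{i+1} - 1$ for an integer base `c`. All function names are hypothetical.

```python
def linear_interval(count: int, total: int, n: int) -> int:
    """Index i with (n - i)/n <= count/total < (n - i + 1)/n."""
    # Integer arithmetic: floor(count * n / total) is n - i.
    # Assumption: count == total is clamped into the top interval (i = 1).
    return max(1, n - (count * n) // total)


def log_interval(count: int, c: int = 2) -> int:
    """Index i with c**i <= count <= c**(i + 1) - 1, for count >= 1."""
    i = 0
    while c ** (i + 1) <= count:
        i += 1
    return i


def same_log_interval(support_t: int, support_super: int, c: int = 2) -> bool:
    """True if a pattern and its superpattern fall in the same interval,
    in which case the smaller pattern is not logarithmic relaxed closed."""
    return log_interval(support_t, c) == log_interval(support_super, c)


print(linear_interval(30, 100, 10))   # 7, i.e. the interval [0.3, 0.4)
print(log_interval(1000, 2))          # 9, since 512 <= 1000 <= 1023
print(same_log_interval(1000, 900))   # True
```

Under the logarithmic scheme the intervals widen as support grows, so high-support patterns are merged more aggressively than low-support ones.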
• 42. Galois Lattice of closed set of trees. We need a Galois connection pair and a closure operator. [Figure: lattice over the dataset D with nodes 1, 2, 3; 12, 13, 23; 123.]
• 43. Incremental mining on closed frequent trees. 1. Adding a tree transaction does not decrease the number of closed trees for D. 2. Adding a transaction with a closed tree does not modify the number of closed trees for D. [Figure: lattice with nodes 1, 2, 3; 12, 13, 23; 123.]
• 44. Sliding Window mining on closed frequent trees. 1. Deleting a tree transaction does not increase the number of closed trees for D. 2. Deleting a tree transaction that is repeated does not modify the number of closed trees for D. [Figure: lattice with nodes 1, 2, 3; 12, 13, 23; 123.]
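The four properties on slides 43 and 44 can be checked on a toy closure system. The sketch below swaps trees for itemsets, which is an assumption made to keep the example short: with support at least 1, the closed itemsets are exactly the intersections of non-empty groups of transactions, so they can be maintained incrementally. The function name `closed_sets` is hypothetical, and this is not the authors' tree algorithm.

```python
def closed_sets(transactions: list) -> set:
    """Closed itemsets (support >= 1), built one transaction at a time."""
    closed = set()
    for transaction in transactions:
        t = frozenset(transaction)
        # New closed sets: t itself plus its intersection with each old one.
        closed |= {t} | {t & c for c in closed}
    return closed


base = ["abc", "abd"]
before = closed_sets(base)

# Adding a transaction never decreases the number of closed sets.
after_add = closed_sets(base + ["ae"])
print(len(before), "->", len(after_add))

# Adding a transaction that is already closed changes nothing.
print(closed_sets(base + ["ab"]) == before)

# Deleting a transaction never increases the number of closed sets.
print(closed_sets(base[:1]) <= before)

# Deleting one copy of a repeated transaction changes nothing.
print(closed_sets(base + ["abd"]) == before)
```

Deletion is shown here by recomputing from the shortened list; a sliding-window method would instead update the stored closed sets in place.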
• 45. Algorithms. Incremental: IncTreeNat. Sliding Window: WinTreeNat. Adaptive: AdaTreeNat, which uses ADWIN to monitor change. ADWIN is an adaptive sliding window whose size is recomputed online according to the rate of change observed. ADWIN has rigorous guarantees (theorems) on the ratio of false positives and negatives, and on the relation of the size of the current window and change rates.
• 46. Experimental Validation: TN1. [Plot: time in seconds (100 to 300) against size in millions (2 to 8), for CMTreeMiner and IncTreeNat.] Figure: Time on experiments on ordered trees on TN1 dataset.
• 47. Experimental Validation. [Plot: number of closed trees (5 to 45) against number of samples (0 to 193,140), for AdaTreeInc 1 and AdaTreeInc 2.] Figure: Number of closed trees maintaining the same number of closed datasets on input data.
• 49. Summary. Conclusions: a new logarithmic relaxed closed support; using Galois Lattice Theory, we present methods for mining closed trees: Incremental (IncTreeNat), Sliding Window (WinTreeNat), and Adaptive (AdaTreeNat, using ADWIN to monitor change). Future Work: labeled trees and XML data.