1.
Mining Adaptively Frequent Closed Unlabeled
Rooted Trees in Data Streams
Albert Bifet and Ricard Gavaldà
Universitat Politècnica de Catalunya
14th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD’08)
2008 Las Vegas, USA
2.
Tree Mining
Mining frequent trees is becoming an important task
Applications:
chemical informatics
computer vision
text retrieval
bioinformatics
Web analysis.
Many link-based structures may be studied formally by means of unordered trees
Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
3.
Introduction: Data Streams
Data Streams
Sequence is potentially inﬁnite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed
it is discarded or archived
Example
Puzzle: Finding Missing Numbers
Let $\pi$ be a permutation of $\{1, \dots, n\}$.
Let $\pi_{-1}$ be $\pi$ with one element missing.
$\pi_{-1}[i]$ arrives in increasing order
Task: Determine the missing number
4.
Introduction: Data Streams
Puzzle: Finding Missing Numbers
Use an n-bit vector to memorize all the numbers (O(n) space)
6.
Introduction: Data Streams
Example
Puzzle: Finding Missing Numbers
Data Streams: O(log(n)) space.
Store $\frac{n(n+1)}{2} - \sum_{j \le i} \pi_{-1}[j]$.
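The running-sum idea for the missing-number puzzle can be sketched in a few lines. This is an illustrative sketch, not code from the talk; the function name `find_missing` is an assumption. It keeps a single counter, which needs O(log n) bits because its value never exceeds n(n+1)/2.

```python
def find_missing(stream, n):
    """Return the element of {1, ..., n} that never appears in `stream`.

    Only one integer is stored, so the memory is O(log n) bits.
    """
    remaining = n * (n + 1) // 2      # sum of 1..n
    for x in stream:                  # each element is seen once, then discarded
        remaining -= x                # n(n+1)/2 minus the sum of items seen so far
    return remaining


# The stream 1, 2, 3, 5, 6 is the permutation of {1..6} with 4 removed.
print(find_missing(iter([1, 2, 3, 5, 6]), 6))  # prints 4
```

The counter does not depend on the arrival order, so the same sketch works even when the elements do not arrive sorted.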
7.
Introduction: Trees
Our trees are:
Unlabeled
Ordered and Unordered
Our subtrees are:
Induced
Two different ordered trees but the same unordered tree
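One way to see the difference between ordered and unordered trees is to compare trees through a canonical form. The sketch below is an assumption made for illustration (nested tuples as the tree encoding, a hypothetical helper `canonical`); it is not the representation used by the authors.

```python
def canonical(tree):
    """Canonical form of an unlabeled rooted tree viewed as unordered.

    A tree is a tuple of its child subtrees; a leaf is the empty tuple.
    Sorting the children recursively removes the sibling order.
    """
    return tuple(sorted(canonical(child) for child in tree))


leaf = ()
t1 = ((leaf, leaf), leaf)   # first child has two leaves, second child is a leaf
t2 = (leaf, (leaf, leaf))   # same children, in the opposite order

print(t1 == t2)                         # False: different as ordered trees
print(canonical(t1) == canonical(t2))   # True: equal as unordered trees
```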
8.
Introduction
Induced subtrees: obtained by repeatedly removing leaf
nodes
Embedded subtrees: obtained by contracting some of the
edges
9.
Introduction
What Is Tree Pattern Mining?
Given a dataset of trees, ﬁnd the complete set of frequent
subtrees
Frequent Tree Pattern (FS):
Include all the trees whose support is no less than min_sup
Closed Frequent Tree Pattern (CS):
Include no tree which has a supertree with the same
support
CS ⊆ FS
Closed Frequent Tree Mining provides a compact
representation of frequent trees without loss of information
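The relation between FS and CS can be illustrated on a toy dataset. To keep the sketch short it uses itemsets as a stand-in for trees (the subset relation plays the role of the subtree relation); this substitution, the brute-force enumeration, and the name `frequent_and_closed` are assumptions for illustration only.

```python
from itertools import combinations


def frequent_and_closed(transactions, min_sup):
    """Return (FS, CS) as dicts mapping each pattern to its support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            pattern = frozenset(combo)
            support = sum(1 for t in transactions if pattern <= t)
            if support >= min_sup:
                frequent[pattern] = support
    # A pattern is closed if no proper superpattern has the same support.
    # Such a superpattern would itself be frequent, so FS is enough to check.
    closed = {p: s for p, s in frequent.items()
              if not any(p < q and s == sq for q, sq in frequent.items())}
    return frequent, closed


transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac")]
fs, cs = frequent_and_closed(transactions, min_sup=2)
print(len(fs), len(cs))  # prints 5 3
```

Here CS = {a, ab, ac} is a subset of FS = {a, b, c, ab, ac}, and the support of a non-closed pattern such as b can be read off as the largest support among its closed superpatterns (ab, with support 2).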
11.
Introduction
Problem
Given a data stream D of rooted, unlabelled and unordered
trees, ﬁnd frequent closed trees.
We provide three algorithms,
of increasing power
Incremental
Sliding Window
Adaptive
13.
Data Streams
Data Streams
At any time t in the data stream, we would like the per-item
processing time and storage to be simultaneously
$O(\log^k(N, t))$.
Approximation algorithms
Small error rate with high probability
An algorithm $(\varepsilon, \delta)$-approximates $F$ if it outputs $\tilde{F}$ for
which $\Pr[|\tilde{F} - F| > \varepsilon F] < \delta$.
14.
Data Streams Approximation Algorithms
1011000111 1010101
Sliding Window
We can maintain simple statistics over sliding windows, using
$O(\frac{1}{\varepsilon} \log^2 N)$ space, where
N is the length of the sliding window
ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani.
Maintaining stream statistics over sliding windows. 2002
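A simplified sketch of the exponential-histogram idea behind this result is given below. It is not the exact structure from the cited paper: the bucket limit per size (`ceil(1/eps) + 1`), the list-based storage, and the class name `ExpHistogram` are assumptions chosen for readability. Buckets hold power-of-two counts of 1s, and the two oldest buckets of a size are merged whenever that size has too many buckets.

```python
import math


class ExpHistogram:
    """Approximate count of 1s among the last `window` bits of a stream."""

    def __init__(self, window, eps):
        self.window = window
        self.max_per_size = math.ceil(1 / eps) + 1   # buckets allowed per size
        self.buckets = []   # (time of newest 1 in bucket, size), oldest first
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its newest 1 has left the window.
        if self.buckets and self.buckets[0][0] <= self.time - self.window:
            self.buckets.pop(0)
        if bit:
            self.buckets.append((self.time, 1))
            self._merge()

    def _merge(self):
        size = 1
        while True:
            same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(same) <= self.max_per_size:
                return
            oldest, second = same[0], same[1]
            # The merged bucket keeps the more recent timestamp.
            self.buckets[second] = (self.buckets[second][0], 2 * size)
            del self.buckets[oldest]
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only the oldest bucket may straddle the window edge: count half of it.
        return total - self.buckets[0][1] // 2


eh = ExpHistogram(window=7, eps=0.5)
for bit in [1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1]:
    eh.add(bit)
print(eh.estimate())  # approximate number of 1s in the last 7 bits
```

Only the number of buckets grows with the window, roughly logarithmically in N for a fixed eps, which is where the space saving over storing the whole window comes from.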
21.
ADWIN: Adaptive sliding window
ADWIN
An adaptive sliding window whose size is recomputed online
according to the rate of change observed.
ADWIN has rigorous guarantees (theorems)
On ratio of false positives and negatives
On the relation of the size of the current window and
change rates
ADWIN, using a Data Stream Sliding Window Model,
can provide the exact counts of 1’s in O(1) time per point.
tries O(log W) cutpoints
uses $O(\frac{1}{\varepsilon} \log W)$ memory words
the processing time per example is O(log W) (amortized
and worst-case).
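The behaviour of an adaptive window can be sketched with a naive version that stores the whole window and tests every cut point, rather than the O(log W) cut points of the bucket-based structure described above. The class name `NaiveAdwin` and the Hoeffding-style threshold with its constants are assumptions for this sketch; inputs are assumed to lie in [0, 1].

```python
import math


class NaiveAdwin:
    """Naive adaptive window: checks all cut points, O(W) time per check."""

    def __init__(self, delta=0.01):
        self.delta = delta
        self.window = []

    def _has_change(self):
        n = len(self.window)
        total = sum(self.window)
        delta_prime = self.delta / n
        head = 0.0
        for n0 in range(1, n):                     # split into W0 (old) and W1 (recent)
            head += self.window[n0 - 1]
            n1 = n - n0
            mean0 = head / n0
            mean1 = (total - head) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)        # harmonic-mean style sample size
            eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(mean0 - mean1) >= eps_cut:
                return True
        return False

    def add(self, x):
        self.window.append(x)
        # Shrink from the old end while some split shows a significant difference.
        while len(self.window) > 1 and self._has_change():
            self.window.pop(0)

    def mean(self):
        return sum(self.window) / len(self.window)


adwin = NaiveAdwin(delta=0.01)
for x in [0.0] * 200 + [1.0] * 200:    # abrupt change after 200 items
    adwin.add(x)
print(len(adwin.window), round(adwin.mean(), 3))
```

After the change, most of the old zeros are dropped, so the window mean tracks the new distribution; on a stream with no change the window keeps growing.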
24.
Time Change Detectors and Predictors: A General
Framework
[Figure: block diagram with input $x_t$, modules Estimator, Change Detect. and Memory, and outputs Estimation and Alarm]
25.
Window Management Models
W = 101010110111111
Equal & fixed size subwindows [Kifer+ 04]: 1010 1011011 1111
Total window against subwindow [Gama+ 04]: 10101011011 1111
Equal size adjacent subwindows [Dasu+ 06]: 1010101 1011 1111
ADWIN: All adjacent subwindows: 1 01010110111111, 10 1010110111111, ..., 1010101101111 11
40.
Pattern Relaxed Support
Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng,
Yunfeng Liu and Kunqing Xie.
CLAIM: An Efﬁcient Method for Relaxed Frequent Closed
Itemsets Mining over Stream Data
Linear Relaxed Interval: The support space of all
subpatterns can be divided into $n = 1/\varepsilon_r$ intervals, where
$\varepsilon_r$ is a user-specified relaxed factor, and each interval can
be denoted by $I_i = [l_i, u_i)$, where $l_i = (n - i) \cdot \varepsilon_r \ge 0$,
$u_i = (n - i + 1) \cdot \varepsilon_r \le 1$ and $i \le n$.
Linear Relaxed closed subpattern $t$: if and only if there
exists no proper superpattern $t'$ of $t$ such that their supports
belong to the same interval $I_i$.
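The linear relaxed intervals can be made concrete with a small sketch. The helper names `linear_interval` and `is_relaxed_closed` are hypothetical, itemsets stand in for the patterns, and a support of exactly 1 (which falls outside the half-open intervals as written) is placed in the top interval by assumption. Exact fractions avoid floating-point trouble at interval boundaries.

```python
from fractions import Fraction


def linear_interval(support, eps_r):
    """Index i with (n - i) * eps_r <= support < (n - i + 1) * eps_r."""
    eps_r = Fraction(eps_r)
    support = Fraction(support)
    n = 1 / eps_r
    if n.denominator != 1:
        raise ValueError("1/eps_r must be an integer")
    if not 0 <= support <= 1:
        raise ValueError("support must lie in [0, 1]")
    if support == 1:
        return 1                      # assumption: put support 1 in the top interval
    return int(n) - support // eps_r


def is_relaxed_closed(pattern, supports, eps_r):
    """True if no proper superpattern has its support in the same interval."""
    i = linear_interval(supports[pattern], eps_r)
    return not any(pattern < other and linear_interval(s, eps_r) == i
                   for other, s in supports.items())


supports = {
    frozenset("a"): Fraction(58, 100),
    frozenset("ab"): Fraction(52, 100),
    frozenset("abc"): Fraction(20, 100),
}
eps_r = Fraction(1, 10)
print(is_relaxed_closed(frozenset("a"), supports, eps_r))    # False
print(is_relaxed_closed(frozenset("ab"), supports, eps_r))   # True
```

Pattern a is closed in the strict sense (its support differs from that of ab), but it is not relaxed closed because both supports fall in [0.5, 0.6).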
41.
Pattern Relaxed Support
As the number of closed frequent patterns is not linear with
respect to support, we introduce a new relaxed support:
Logarithmic Relaxed Interval: The support space of all
subpatterns can be divided into $n = 1/\varepsilon_r$ intervals, where
$\varepsilon_r$ is a user-specified relaxed factor, and each interval can
be denoted by $I_i = [l_i, u_i)$, where $l_i = c^i$, $u_i = c^{i+1} - 1$
and $i \le n$.
Logarithmic Relaxed closed subpattern $t$: if and only if
there exists no proper superpattern $t'$ of $t$ such that their
supports belong to the same interval $I_i$.
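A minimal sketch of the logarithmic intervals follows. It assumes absolute (integer) support counts and reads interval i as covering the counts from c^i up to c^(i+1) - 1 inclusive; that reading, the integer base c, and the name `log_interval` are assumptions made for illustration.

```python
def log_interval(support_count, c=2):
    """Index i such that c**i <= support_count < c**(i + 1)."""
    if support_count < 1 or c < 2:
        raise ValueError("need support_count >= 1 and c >= 2")
    i = 0
    while c ** (i + 1) <= support_count:   # integer arithmetic, no rounding issues
        i += 1
    return i


for count in [1, 2, 3, 4, 1000]:
    print(count, log_interval(count))
```

With c = 2, the counts 2 and 3 share an interval while 1000 and 600 share a much wider one, so high-support patterns are merged more aggressively than with linear intervals.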
42.
Galois Lattice of closed set of trees
We need
a Galois connection pair
a closure operator
[Figure: lattice over dataset D with nodes 1, 2, 3; 12, 13, 23; and 123]
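The role of a closure operator can be sketched on itemsets, used here as a stand-in for trees: the closure of a pattern is the intersection of all transactions that contain it. For trees the intersection step is more involved, so this is only an analogy; the function name `closure` and the convention for patterns that occur in no transaction are assumptions.

```python
def closure(pattern, transactions):
    """Intersection of all transactions containing `pattern`."""
    covering = [t for t in transactions if pattern <= t]
    if not covering:
        # Assumed convention: an unsupported pattern closes to the set of all items.
        return frozenset().union(*transactions)
    return frozenset.intersection(*covering)


transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac")]
b = frozenset("b")
print(sorted(closure(b, transactions)))   # ['a', 'b']

# A pattern is closed exactly when it equals its own closure.
print(closure(frozenset("ab"), transactions) == frozenset("ab"))  # True
```

The operator is extensive (the pattern is contained in its closure), idempotent (closing twice changes nothing) and monotone, which are the properties a closure operator needs.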
43.
Incremental mining on closed frequent trees
1 Adding a tree transaction does not decrease the number of closed trees for D.
2 Adding a transaction with a closed tree does not modify the number of closed trees for D.
[Figure: lattice with nodes 1, 2, 3; 12, 13, 23; and 123]
44.
Sliding Window mining on closed frequent trees
1 Deleting a tree transaction does not increase the number of closed trees for D.
2 Deleting a tree transaction that is repeated does not modify the number of closed trees for D.
[Figure: lattice with nodes 1, 2, 3; 12, 13, 23; and 123]
45.
Algorithms
Algorithms
Incremental: IncTreeNat
Sliding Window: WinTreeNat
Adaptive: AdaTreeNat (uses ADWIN to monitor change)
46.
Experimental Validation: TN1
[Figure: running time (sec.) versus dataset size (millions) for CMTreeMiner and IncTreeNat]
Figure: Time on experiments on ordered trees on TN1 dataset
47.
Experimental Validation
[Figure: number of closed trees versus number of samples, for AdaTreeInc 1 and AdaTreeInc 2]
Figure: Number of closed trees maintaining the same number of
closed datasets on input data
49.
Summary
Conclusions
New logarithmic relaxed closed support
Using Galois Lattice Theory, we present methods for mining
closed trees
Incremental: IncTreeNat
Sliding Window: WinTreeNat
Adaptive: AdaTreeNat, using ADWIN to monitor change
Future Work
Labeled Trees and XML data.