This document discusses mining frequent closed trees from XML data streams. It presents three algorithms for mining closed trees incrementally, with sliding windows, and adaptively using ADWIN to monitor change. Experimental results on real datasets show the adaptive approach using ADWIN achieves good accuracy while using limited memory.
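The adaptive approach monitors the stream with ADWIN, which keeps a window of recent items and drops the older part whenever two sub-windows show significantly different averages. As a rough illustration (a simplified sketch only; the real ADWIN uses exponential buckets to achieve logarithmic memory), the cut test can be written as:

```python
import math
from collections import deque

def adwin_step(window, x, delta=0.01):
    """Append x, then shrink the window while some split point shows a
    statistically significant difference between the means of the older
    and newer parts. A simplified sketch of the ADWIN idea, not the
    exact bucket-based algorithm."""
    window.append(x)
    changed = True
    while changed and len(window) > 2:
        changed = False
        n = len(window)
        total = sum(window)
        left_sum = 0.0
        for i in range(1, n):
            left_sum += window[i - 1]
            n0, n1 = i, n - i
            mu0, mu1 = left_sum / n0, (total - left_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic mean of the two sizes
            eps = math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 * n / delta))
            if abs(mu0 - mu1) > eps:
                window.popleft()  # change detected: drop the oldest element
                changed = True
                break
    return window
```

Feeding a stream whose mean jumps from 0 to 1 makes the window shed almost all pre-change elements, which is the behaviour the adaptive miner relies on.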
MOA is a framework for online learning from data streams. It is closely related to WEKA and includes learning algorithms such as boosting, bagging, and Hoeffding Trees, together with tools for evaluation. MOA deals with evolving data streams and is easy to use and extend.
Leveraging Bagging for Evolving Data Streams - Albert Bifet
The document presents new methods for leveraging bagging for evolving data streams. It discusses using randomization techniques like Poisson distributions for input data and random output codes to increase diversity among classifiers. Experimental results on data streams with concept drift show the proposed methods like Leveraging Bagging and Leveraging Bagging MC improve accuracy over baselines like Hoeffding Trees and Online Bagging, while methods like Leveraging Bagging ME reduce RAM-Hours usage. The paper aims to improve accuracy and resource usage for data stream mining under concept drift.
This document provides information on Ebstein's anomaly, including its anatomy, embryology, clinical presentation, diagnosis, and natural history. Some key points:
- Ebstein's anomaly is a congenital defect involving downward displacement of the tricuspid valve into the right ventricle. This can cause dilation of the right atrium and dysfunction of the right ventricle.
- Clinical presentation varies from neonatal congestive heart failure to later cyanosis, arrhythmias, and right heart failure in adults. Associated defects are common.
- Diagnosis is made through echocardiogram demonstrating displacement of the tricuspid valve leaflets. Other tests like ECG, chest x-ray, and
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra - Natalino Busa
We present a solution for streaming anomaly detection, named “Coral”, based on Spark, Akka and Cassandra. In the system presented, we use Spark to run the data analytics pipeline for anomaly detection. By running Spark on the latest events and data, we make sure that the model is always up-to-date and that the number of false positives is kept low, even under changing trends and conditions. Our machine learning pipeline uses Spark decision tree ensembles and k-means clustering. Once the model is trained by Spark, its parameters are pushed to the streaming event processing layer, implemented in Akka. The Akka layer then scores thousands of events per second according to the latest model provided by Spark. Spark and Akka communicate with each other using Cassandra as a low-latency data store. By doing so, we make sure that every element of this solution is resilient and distributed. Spark performs micro-batches to keep the model up-to-date, while Akka detects new anomalies using the latest Spark-generated data model. The project is currently hosted on GitHub. Have a look at: http://coral-streaming.github.io
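The division of labour described above (Spark fits the clustering model, Akka scores events against its published parameters) can be illustrated with a toy scorer. The function names and threshold below are our own illustrative choices, not Coral's API:

```python
import math

def nearest_centroid_distance(point, centroids):
    """Distance from a point to its closest cluster centre."""
    return min(math.dist(point, c) for c in centroids)

def score_event(point, centroids, threshold):
    """Flag an event as anomalous when it is far from every cluster centre.
    The centroids stand in for the model parameters Spark would publish;
    this function stands in for the Akka scoring layer."""
    return nearest_centroid_distance(point, centroids) > threshold
```

Because scoring only needs the (small) list of centroids and a threshold, it can run at high event rates while the heavier model fitting happens elsewhere, which is the point of the Spark/Akka split.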
Artificial intelligence and data stream mining - Albert Bifet
Big Data and Artificial Intelligence have the potential to
fundamentally shift the way we interact with our surroundings. The
challenge of deriving insights from data streams has been recognized
as one of the most exciting and key opportunities for both academia
and industry. Advanced analysis of big data streams from sensors and
devices is bound to become a key area of artificial intelligence
research as the number of applications requiring such processing
increases. Dealing with the evolution over time of such data streams,
i.e., with concepts that drift or change completely, is one of the
core issues in stream mining. In this talk, I will present an overview
of data stream mining, industrial applications, open source tools, and
current challenges of data stream mining.
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
Mining Big Data Streams with APACHE SAMOA - Albert Bifet
In this talk, we present Apache SAMOA, an open-source platform for
mining big data streams with Apache Flink, Storm and Samza. Real time analytics is
becoming the fastest and most efficient way to obtain useful knowledge
from what is happening now, allowing organizations to react quickly
when problems appear or to detect new trends helping to improve their
performance. Apache SAMOA includes algorithms for the most common
machine learning tasks such as classification and clustering. It
provides a pluggable architecture that allows it to run on Apache
Flink, as well as on several other distributed stream processing
engines such as Storm and Samza.
Efficient Online Evaluation of Big Data Stream Classifiers - Albert Bifet
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
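The prequential setting mentioned above interleaves testing and training: each arriving instance is first used to evaluate the current model and then to update it. A minimal sketch, with an illustrative majority-class learner rather than any method from the paper:

```python
class MajorityClass:
    """Toy incremental learner: always predicts the most frequent label seen."""
    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def prequential_accuracy(model, stream):
    """Interleaved test-then-train: test on each instance before training on it."""
    correct = 0
    i = 0
    for i, (x, y) in enumerate(stream, start=1):
        if model.predict(x) == y:
            correct += 1
        model.learn(x, y)
    return correct / i
```

As the abstract notes, this setting builds a single model over one pass, which is exactly why assessing statistical significance across multiple models requires the extra machinery the paper proposes.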
Apache Samoa: Mining Big Data Streams with Apache Flink - Albert Bifet
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries, and frameworks.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection, and recommendations.
3) A key challenge addressed by SAMOA is how to perform distributed stream mining on high-volume, high-velocity data streams at low latency using approaches like Apache Flink that can scale to handle large, fast data.
Data science involves extracting insights from large volumes of data. It is an interdisciplinary field that uses techniques from statistics, machine learning, and other domains. The document provides examples of classification algorithms like k-nearest neighbors, naive Bayes, and perceptrons that are commonly used in data science to build models for tasks like spam filtering or sentiment analysis. It also discusses clustering, frequent pattern mining, and other machine learning concepts.
This document provides an introduction to big data and MapReduce frameworks. It discusses:
- What big data is and examples of large datasets.
- An overview of MapReduce, including how it allows programmers to break problems into parallelizable map and reduce tasks.
- Details of how MapReduce frameworks like Apache Hadoop work, including distributed processing, fault tolerance, and the roles of mappers, reducers, and other components.
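The map/shuffle/reduce split can be sketched with the canonical word-count example. This is a single-process illustration of the programming model only, not how Hadoop itself is invoked:

```python
from collections import defaultdict

def map_phase(document):
    # mapper: emit a (word, 1) pair for every word in its input split
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # group intermediate pairs by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reducer: sum the counts for each word
    return {word: sum(vals) for word, vals in groups.items()}

docs = ["big data big streams", "data streams"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
```

Because each mapper touches only its own split and each reducer only its own keys, the framework can run the phases on many machines in parallel and rerun failed tasks for fault tolerance.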
The document discusses data stream classification and algorithms for handling data streams. It begins with an introduction to data stream characteristics and challenges. It then discusses approximation algorithms for data streams, including maintaining statistics over sliding windows. Classification algorithms for data streams discussed include Naive Bayes classifiers, perceptrons, and Hoeffding trees, which are decision trees adapted for data streams using the Hoeffding bound inequality to determine the optimal split attribute.
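The Hoeffding bound used by Hoeffding trees states that, with probability at least 1 - δ, the empirical mean of n observations of a variable with range R lies within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean; a split attribute is chosen once the observed gain difference between the two best candidates exceeds ε. For example:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon for a variable with the given range,
    confidence parameter delta, and n observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))
```

With information gain in [0, 1], δ = 1e-7, and 1000 examples at a leaf, ε is roughly 0.09, so the tree splits as soon as the best attribute's gain beats the runner-up's by about that margin; with more examples, ε shrinks and finer distinctions become decidable.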
The document discusses real-time big data management and Apache Flink. It provides an overview of Apache Flink, including its architecture, components, and APIs for batch and streaming data processing. It also provides examples of word count programs in Java, Scala, and Java 8 that demonstrate how to write Flink programs for batch and streaming data.
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
Multi-label Classification with Meta-labels - Albert Bifet
The area of multi-label classification has rapidly developed in recent years. It has become widely known that the baseline binary relevance approach suffers from class imbalance and a restricted hypothesis space that negatively affects its predictive performance, and can easily be outperformed by methods which learn labels together. A number of methods have grown around the label powerset approach, which models label combinations together as class values in a multi-class problem. We describe the label-powerset-based solutions under a general framework of meta-labels. We provide theoretical justification for this framework which has been lacking, by viewing meta-labels as a hidden layer in an artificial neural network. We explain how meta-labels essentially allow a random projection into a space where non-linearities can easily be tackled with established linear learning algorithms. The proposed framework enables comparison and combination of related approaches to different multi-label problems. Indeed, we present a novel model in the framework and evaluate it empirically against several high-performing methods, with respect to predictive performance and scalability, on a number of datasets and evaluation metrics. Our deployment of an ensemble of meta-label classifiers obtains competitive accuracy for a fraction of the computation required by the current meta-label methods for multi-label classification.
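The label powerset idea that meta-labels generalise can be shown in a few lines: each distinct label combination becomes one class value of an ordinary multi-class problem. A minimal sketch with illustrative names, not the paper's implementation:

```python
def label_powerset_fit(label_sets):
    """Map each distinct label combination to one multi-class value:
    the label powerset transformation."""
    classes = {}   # frozenset of labels -> class id
    encoded = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in classes:
            classes[key] = len(classes)
        encoded.append(classes[key])
    return encoded, classes
```

The transformation captures label dependencies for free, but the number of classes can grow with the number of observed combinations, which is the scalability pressure the meta-label framework addresses.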
Pitfalls in benchmarking data stream classification and how to avoid them - Albert Bifet
This document discusses pitfalls in benchmarking data stream classification and proposes ways to avoid them. It analyzes the electricity market dataset, a popular benchmark, and finds that it exhibits temporal dependence that favors classifiers that simply predict the previous value. It introduces new evaluation metrics like kappa plus that account for temporal dependence by comparing to a "no change" classifier. It also proposes a temporally aware classifier called SWT that incorporates previous labels into its predictions. Experiments on electricity and forest cover datasets show SWT and the new metrics better capture classifier performance on temporally dependent streaming data.
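The "no change" baseline and the kappa-plus-style correction can be sketched as follows. The formula here applies the usual kappa form with the baseline's accuracy as the chance term, which is our reading of the metric rather than a verbatim reproduction of the paper's definition:

```python
def no_change_accuracy(labels):
    """Accuracy of the 'no change' baseline that predicts the previous label."""
    correct = sum(1 for prev, cur in zip(labels, labels[1:]) if prev == cur)
    return correct / (len(labels) - 1)

def kappa_plus(p0, p_no_change):
    """Normalise a classifier's accuracy p0 against the no-change baseline:
    1 means perfect, 0 means no better than predicting the previous label."""
    return (p0 - p_no_change) / (1 - p_no_change)
```

On a temporally dependent stream where labels rarely change, the baseline accuracy is high, so a classifier with impressive raw accuracy can still score near zero here, which is exactly the pitfall the paper exposes in the electricity dataset.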
STRIP: Stream Learning of Influence Probabilities - Albert Bifet
This document presents a method called STRIP (Streaming Learning of Influence Probabilities) for learning influence probabilities between users in a social network from a streaming log of propagations. It describes three solutions: (1) storing the whole social graph in memory, (2) using min-wise independent hashing to estimate probabilities while using sublinear space, and (3) estimating probabilities only for the most active users to be more space efficient. Experimental results on a Twitter dataset showed these solutions provided good approximations while using reasonable memory and processing time.
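The min-wise hashing ingredient of solution (2) can be illustrated with a small Jaccard estimator: two sets have the same minimum under a random hash function with probability equal to their Jaccard similarity. This sketch uses Python's built-in tuple hashing as a stand-in for a proper min-wise independent family, and is an illustration of the technique rather than STRIP itself:

```python
def minhash_signature(items, hash_seeds):
    """One minimum per seeded hash function; comparing minima across two
    sets estimates their Jaccard similarity in sublinear space."""
    return [min(hash((seed, x)) for x in items) for seed in hash_seeds]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions on which the two sets share a minimum."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Storing a short signature per user instead of full propagation logs is what lets this style of sketch fit the sublinear-space budget the paper targets.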
Efficient Data Stream Classification via Probabilistic Adaptive Windows - Albert Bifet
This document discusses efficient data stream classification using probabilistic adaptive windows. It introduces the concept of data streams which have potentially infinite sequences of high-speed data that must be processed in real-time with limited memory. It then describes the probabilistic approximate window (PAW) algorithm, which maintains a sample of data instances in logarithmic memory by giving greater weight to newer instances. The document evaluates several data stream classification methods on real and synthetic data streams and finds that k-nearest neighbors with PAW has higher accuracy and lower memory usage than other methods.
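A much-simplified stand-in for PAW's sampling step: every stored instance survives each arrival with a fixed probability q, so older instances are exponentially less likely to remain and the expected window size stays bounded near 1/(1-q). This illustrates the biased-retention idea only; it is not the exact PAW algorithm:

```python
import random

def paw_update(window, x, q=0.98, rng=random):
    """Keep each stored instance with probability q, then add the new one.
    Recency bias: an instance that arrived a steps ago survives with
    probability q**a, so the sample leans heavily toward new data."""
    survivors = [item for item in window if rng.random() < q]
    survivors.append(x)
    return survivors
```

A k-nearest-neighbour classifier run over such a sample sees a small, recency-weighted slice of the stream, which is how the paper obtains high accuracy with low memory.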
Big Data is a new term used to identify datasets that we cannot manage with current methodologies or data mining software tools due to their large size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary due to the volume, variability, and velocity of such data.
Streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends helping to improve their performance. Evolving data streams are contributing to the growth of data created over the last few years. We are creating the same quantity of data every two days, as we created from the dawn of time up until 2003. Evolving data streams methods are becoming a low-cost, green methodology for real time online prediction and analysis. We discuss the current and future trends of mining evolving data streams, and the challenges that the field will have to overcome during the next years.
Mining Frequent Closed Graphs on Evolving Data Streams - Albert Bifet
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this talk we present a framework for studying graph pattern mining on time-varying streams and large datasets.
The document outlines a tutorial on handling concept drift in machine learning. It discusses the challenges of concept drift when applying supervised learning algorithms to streaming data where the underlying data distribution changes over time. The tutorial aims to provide an integrated view of adaptive learning methods and how they can handle concept drift. It covers topics such as the problem of concept drift, techniques for handling drift, evaluating adaptive learning approaches, and applications that experience concept drift.
Fast Perceptron Decision Tree Learning from Evolving Data Streams - Albert Bifet
The document proposes using perceptron learners at the leaves of Hoeffding decision trees to improve performance on data streams. It introduces a new evaluation metric called RAM-Hours that considers both time and memory usage. The authors empirically evaluate different classifier models, including Hoeffding trees with perceptron and naive Bayes learners at leaves, on several datasets. Results show that hybrid models like Hoeffding naive Bayes perceptron trees often provide the best balance of accuracy, time and memory usage.
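One RAM-Hour is one gigabyte of RAM deployed for one hour, so the metric is simply the product of the two resources:

```python
def ram_hours(memory_gb, runtime_hours):
    """RAM-Hours cost of a run: GB of RAM multiplied by hours of runtime.
    Lets accuracy be traded off against combined time and memory cost."""
    return memory_gb * runtime_hours
```

For example, a classifier holding 0.5 GB for 2 hours costs 1 RAM-Hour, the same as one holding 2 GB for 30 minutes, which is what makes the metric useful for comparing methods with different time/memory profiles.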
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos (Adaptive Methods for Data Mining and Learning from Data Streams) - Albert Bifet
This document discusses methods for mining data streams, which are potentially infinite sequences of data that change over time. It describes using the ADWIN algorithm, which is an adaptive sliding window technique without parameters, to extract information from data streams using few resources. It also covers mining massive data, where the amount of digital information created now exceeds available storage. Algorithmic efficiency is important for green computing approaches to efficiently using computing resources. The document provides an example of finding a missing number in an increasing sequence and using random sampling to find a number in the upper half of a sorted list using sublinear space and time.
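The missing-number example works in sublinear space with a single counter: the missing value is the closed-form total of 1..n minus the running sum of the stream, so nothing needs to be stored:

```python
def missing_number(stream, n):
    """Find the one missing value in a stream of the numbers 1..n,
    keeping only a running sum rather than the stream itself."""
    expected = n * (n + 1) // 2  # closed-form sum of 1..n
    return expected - sum(stream)
```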
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe - Paige Cruz
Monitoring and observability aren’t traditionally found in software curricula, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and I will share these foundational concepts to build on:
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
In this talk, we present Apache SAMOA, an open-source platform for
mining big data streams with Apache Flink, Storm and Samza. Real time analytics is
becoming the fastest and most efficient way to obtain useful knowledge
from what is happening now, allowing organizations to react quickly
when problems appear or to detect new trends helping to improve their
performance. Apache SAMOA includes algorithms for the most common
machine learning tasks such as classification and clustering. It
provides a pluggable architecture that allows it to run on Apache
Flink, but also with other several distributed stream processing
engines such as Storm and Samza.
Efficient Online Evaluation of Big Data Stream ClassifiersAlbert Bifet
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries, and frameworks.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection, and recommendations.
3) A key challenge addressed by SAMOA is how to perform distributed stream mining on high-volume, high-velocity data streams at low latency using approaches like Apache Flink that can scale to handle large, fast data.
Data science involves extracting insights from large volumes of data. It is an interdisciplinary field that uses techniques from statistics, machine learning, and other domains. The document provides examples of classification algorithms like k-nearest neighbors, naive Bayes, and perceptrons that are commonly used in data science to build models for tasks like spam filtering or sentiment analysis. It also discusses clustering, frequent pattern mining, and other machine learning concepts.
This document provides an introduction to big data and MapReduce frameworks. It discusses:
- What big data is and examples of large datasets.
- An overview of MapReduce, including how it allows programmers to break problems into parallelizable map and reduce tasks.
- Details of how MapReduce frameworks like Apache Hadoop work, including distributed processing, fault tolerance, and the roles of mappers, reducers, and other components.
The document discusses data stream classification and algorithms for handling data streams. It begins with an introduction to data stream characteristics and challenges. It then discusses approximation algorithms for data streams, including maintaining statistics over sliding windows. Classification algorithms for data streams discussed include Naive Bayes classifiers, perceptrons, and Hoeffding trees, which are decision trees adapted for data streams using the Hoeffding bound inequality to determine the optimal split attribute.
The document discusses real-time big data management and Apache Flink. It provides an overview of Apache Flink, including its architecture, components, and APIs for batch and streaming data processing. It also provides examples of word count programs in Java, Scala, and Java 8 that demonstrate how to write Flink programs for batch and streaming data.
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
Multi-label Classification with Meta-labelsAlbert Bifet
The area of multi-label classification has rapidly developed in recent years. It has become widely known that the baseline binary relevance approach suffers from class imbalance and a restricted hypothesis space that negatively affects its predictive performance, and can easily be outperformed by methods which learn labels together. A number of methods have grown around the label powerset approach, which models label combinations together as class values in a multi-class problem. We describe the label-powerset-based solutions under a general framework of \emph{meta-labels}. We provide theoretical justification for this framework which has been lacking, by viewing meta-labels as a hidden layer in an artificial neural network. We explain how meta-labels essentially allow a random projection into a space where non-linearities can easily be tackled with established linear learning algorithms. The proposed framework enables comparison and combination of related approaches to different multi-label problems. Indeed, we present a novel model in the framework and evaluate it empirically against several high-performing methods, with respect to predictive performance and scalability, on a number of datasets and evaluation metrics. Our deployment of an ensemble of meta-label classifiers obtains competitive accuracy for a fraction of the computation required by the current meta-label methods for multi-label classification.
Pitfalls in benchmarking data stream classification and how to avoid themAlbert Bifet
This document discusses pitfalls in benchmarking data stream classification and proposes ways to avoid them. It analyzes the electricity market dataset, a popular benchmark, and finds that it exhibits temporal dependence that favors classifiers that simply predict the previous value. It introduces new evaluation metrics like kappa plus that account for temporal dependence by comparing to a "no change" classifier. It also proposes a temporally aware classifier called SWT that incorporates previous labels into its predictions. Experiments on electricity and forest cover datasets show SWT and the new metrics better capture classifier performance on temporally dependent streaming data.
STRIP: stream learning of influence probabilities.Albert Bifet
This document presents a method called STRIP (Streaming Learning of Influence Probabilities) for learning influence probabilities between users in a social network from a streaming log of propagations. It describes three solutions: (1) storing the whole social graph in memory, (2) using min-wise independent hashing to estimate probabilities while using sublinear space, and (3) estimating probabilities only for the most active users to be more space efficient. Experimental results on a Twitter dataset showed these solutions provided good approximations while using reasonable memory and processing time.
Efficient Data Stream Classification via Probabilistic Adaptive WindowsAlbert Bifet
This document discusses efficient data stream classification using probabilistic adaptive windows. It introduces the concept of data streams which have potentially infinite sequences of high-speed data that must be processed in real-time with limited memory. It then describes the probabilistic approximate window (PAW) algorithm, which maintains a sample of data instances in logarithmic memory by giving greater weight to newer instances. The document evaluates several data stream classification methods on real and synthetic data streams and finds that k-nearest neighbors with PAW has higher accuracy and lower memory usage than other methods.
Adaptive XML Tree Mining on Evolving Data Streams
1. Adaptive XML Tree Mining on Evolving Data Streams
Albert Bifet
Laboratory for Relational Algorithmics, Complexity and Learning (LARCA)
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Porto, 21 May 2009
2. Mining Evolving Massive Structured Data
The basic problem: finding interesting structure on data
Mining massive data
Mining time-varying data
Mining in real time
Mining XML data
(Image: Salvador Dalí, The Disintegration of the Persistence of Memory, 1952-54)
3. XML Tree Classification on evolving data streams
Figure: a dataset example (four trees over node labels A-D, labeled alternately CLASS 1 and CLASS 2)
4. Tree Pattern Mining
Given a dataset of trees, find the complete set of frequent subtrees.
Frequent Tree Pattern (FT): includes all the trees whose support is no less than min_sup.
Closed Frequent Tree Pattern (CT): includes no tree which has a super-tree with the same support.
CT ⊆ FT
"Trees are sanctuaries. Whoever knows how to listen to them, can learn the truth." (Herman Hesse)
5. Mining Closed Frequent Trees
Our trees are: labeled and unlabeled, ordered and unordered.
Our subtrees are: induced, top-down.
Note that two different ordered trees can be the same unordered tree.
6. A tale of two trees
Consider D = {A, B}, where A and B are the two example trees (drawn in the slides), and let min_sup = 2. One slide lists the frequent subtrees of D; the next lists its closed subtrees.
8. XML Tree Classification on evolving data streams
Figure: the dataset example again (four trees over node labels A-D, labeled alternately CLASS 1 and CLASS 2)
9. XML Tree Classification on evolving data streams
Table: closed frequent trees c1 and c2 (drawn in the slide, together with frequent trees that are not closed) and their occurrences over transactions 1-4:
c1: 1 0 1 0
c2: 1 0 0 1
10. XML Tree Classification on evolving data streams
(The slide first shows the frequent trees c1-c4 with their occurrence vectors; the tree drawings themselves are omitted here.)
Closed and maximal trees per document, with class labels:

Id   Closed c1 c2 c3 c4   Maximal c1 c2 c3   Class
1           1  1  0  1            1  1  0    CLASS 1
2           0  0  1  1            0  0  1    CLASS 2
3           1  0  1  1            1  0  1    CLASS 1
4           0  1  1  1            0  1  1    CLASS 2
11. XML Tree Framework on evolving data streams
XML Tree Classification Framework Components:
An XML closed frequent tree miner
A data stream classifier algorithm, which we feed with tuples to be classified online
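The two components can be wired together in a few lines. The sketch below is purely illustrative, not the paper's implementation: each incoming tree is encoded as a binary tuple over the mined closed patterns, and the tuple is fed to an online learner (a plain perceptron here; the names encode and OnlinePerceptron are ours).

```python
def encode(patterns, tree, contains):
    """Binary feature vector: which mined patterns occur in this tree."""
    return tuple(1 if contains(p, tree) else 0 for p in patterns)

class OnlinePerceptron:
    """Minimal online perceptron over binary feature vectors."""
    def __init__(self, n_features, lr=1.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        s = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if s >= 0 else -1

    def learn(self, x, y):
        # mistake-driven update; y must be in {-1, +1}
        if self.predict(x) != y:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y
```

Any containment predicate can be plugged in as `contains`; in the framework it would be the subtree test supplied by the miner.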
12. Mining Evolving Tree Data Streams
Problem
Given a data stream D of rooted and unordered trees, find frequent closed trees.
We provide three algorithms, of increasing power:
Incremental
Sliding Window
Adaptive
13. Mining Closed Unordered Subtrees
(built up incrementally over three slides; the complete procedure is:)

CLOSED_SUBTREES(t, D, min_sup, T)
 1  if not CANONICAL_REPRESENTATIVE(t)
 2      then return T
 3  for every t' that can be extended from t in one step
 4      do if Support(t') ≥ min_sup
 5          then T ← CLOSED_SUBTREES(t', D, min_sup, T)
 6      do if Support(t') = Support(t)
 7          then t is not closed
 8  if t is closed
 9      then insert t into T
10  return T
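As an illustration only (not the authors' TreeNat code), the recursion above can be sketched in Python for unlabeled, unordered, top-down subtrees. A tree is encoded as a tuple of child subtrees, a leaf as the empty tuple; the helpers canonical, contains, support, and extensions are our own naming, and the exhaustive matching makes this a toy, not an efficient miner.

```python
def canonical(t):
    # canonical representative of an unordered tree: sort child encodings
    return tuple(sorted(canonical(c) for c in t))

def contains(pattern, tree):
    # top-down containment: roots map to roots, pattern children map
    # injectively to distinct tree children (backtracking matching)
    if len(pattern) > len(tree):
        return False
    if not pattern:
        return True
    def match(pi, used):
        if pi == len(pattern):
            return True
        for ti, child in enumerate(tree):
            if ti not in used and contains(pattern[pi], child):
                if match(pi + 1, used | {ti}):
                    return True
        return False
    return match(0, frozenset())

def support(t, D):
    return sum(contains(t, d) for d in D)

def extensions(t):
    # all trees obtained by attaching one new leaf somewhere in t
    yield canonical(t + ((),))
    for i, c in enumerate(t):
        for e in extensions(c):
            yield canonical(t[:i] + (e,) + t[i + 1:])

def closed_subtrees(t, D, min_sup, T):
    # mirrors the CLOSED_SUBTREES pseudocode above
    if t != canonical(t):                  # canonical-representative check
        return T
    sup_t = support(t, D)
    t_is_closed = True
    for ext in set(extensions(t)):
        sup_ext = support(ext, D)
        if sup_ext >= min_sup:
            T = closed_subtrees(ext, D, min_sup, T)
        if sup_ext == sup_t:
            t_is_closed = False            # a super-tree has equal support
    if t_is_closed:
        T.add(t)
    return T
```

On a "tale of two trees"-style dataset, with A a three-node path (((),),) and B a root with two leaves ((), ()), and min_sup = 2, the only closed frequent subtree found is the two-node tree ((),).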
18. Experimental results
TreeNat: unlabeled trees, top-down subtrees, no occurrences.
CMTreeMiner: labeled trees, induced subtrees, occurrences.
19. Closure Operator on Trees
D: the finite input dataset of trees
T: the (infinite) set of all trees
Definition
We define the following Galois connection pair:
For finite A ⊆ D, σ(A) is the set of trees of T that are subtrees of all the trees of A:
σ(A) = {t ∈ T | ∀ t' ∈ A : t ⪯ t'}
For finite B ⊂ T, τD(B) is the set of trees of D that are supertrees of all the trees of B:
τD(B) = {t ∈ D | ∀ t' ∈ B : t' ⪯ t}
Closure Operator
The composition ΓD = σ ∘ τD is a closure operator.
21. Galois Lattice of closed set of trees
(Figure, shown over three slides: the lattice of transaction sets 1, 2, 3, 12, 13, 23, 123; the example trees appear only graphically.)
B = { }
τD(B) = { , }
ΓD(B) = σ ∘ τD(B) = { and its subtrees }
24. Algorithms
Incremental: INCTREENAT
Sliding Window: WINTREENAT
Adaptive: ADATREENAT, which uses ADWIN to monitor change
ADWIN
An adaptive sliding window whose size is recomputed online according to the rate of change observed.
ADWIN has rigorous guarantees (theorems):
on the ratio of false positives and false negatives
on the relation between the size of the current window and the rate of change
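ADWIN's core idea can be sketched naively: keep a window of recent values and drop the oldest elements whenever some split of the window into two sub-windows shows a statistically significant difference of means. The class below (SimpleADWIN is our name) is a simplified O(W)-memory illustration of that idea; the real ADWIN of Bifet and Gavaldà uses exponential histograms to achieve logarithmic memory and update time.

```python
import math

class SimpleADWIN:
    """Naive sketch of an ADWIN-style adaptive window.

    Keeps all values explicitly; the real algorithm compresses the
    window into O(log W) buckets."""

    def __init__(self, delta=0.01):
        self.delta = delta
        self.window = []

    def add(self, x):
        self.window.append(x)
        # shrink from the left while some split shows a significant change
        while self._change_detected():
            self.window.pop(0)

    def _change_detected(self):
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        left_sum = 0.0
        for i in range(1, n):
            left_sum += self.window[i - 1]
            n0, n1 = i, n - i
            mean0 = left_sum / n0
            mean1 = (total - left_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)      # harmonic mean of sizes
            eps = math.sqrt((1.0 / (2 * m)) * math.log(4.0 * n / self.delta))
            if abs(mean0 - mean1) > eps:
                return True
        return False
```

Feeding a stream whose mean jumps from 0 to 1 makes the window discard the stale prefix, so the window mean tracks the new distribution.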
25. Experimental Validation: TN1
Figure: experiments on ordered trees with the TN1 dataset; time (sec.) vs. dataset size (millions), comparing CMTreeMiner and INCTREENAT.
26. What is MOA?
{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.
It is closely related to WEKA.
It includes a collection of offline and online learning methods, as well as tools for evaluation:
boosting and bagging
Hoeffding Trees, with and without Naïve Bayes classifiers at the leaves
28. MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct.
31. Data stream classification cycle
1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
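These four constraints are exactly what a prequential (test-then-train) evaluation loop respects: every example is seen once, used first to test and then to learn, with a learner that keeps bounded state. A minimal sketch, where MajorityClass and prequential are illustrative names rather than MOA code:

```python
from collections import Counter

class MajorityClass:
    """Toy online learner: predicts the most frequent class seen so far."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1

def prequential(stream, learner):
    """Test-then-train: each example is tested on, then learned from, once."""
    correct = total = 0
    for x, y in stream:                  # one pass, one example at a time
        if learner.predict(x) == y:      # ready to predict at any point
            correct += 1
        total += 1
        learner.learn(x, y)              # bounded memory for this toy learner
    return correct / total
```

On a stream of ten identical examples the first prediction is necessarily a miss, giving an accuracy of 0.9.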
32. Environments and Data Sources
Environments
Sensor Network: 100 KB
Handheld Computer: 32 MB
Server: 400 MB
Data Sources
Random Tree Generator
Random RBF Generator
LED Generator
Waveform Generator
Function Generator
33. Algorithms
Classifiers: Naive Bayes, Decision stumps, Hoeffding Tree, Hoeffding Option Tree, Bagging and Boosting
Prediction strategies: Majority class, Naive Bayes Leaves, Adaptive Hybrid
34. Hoeffding Option Tree
Hoeffding Option Trees
Regular Hoeffding tree containing additional option nodes that
allow several tests to be applied, leading to multiple Hoeffding
trees as separate paths.
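Hoeffding trees (and the option-tree variant above) take their name from the Hoeffding bound: after $n$ independent observations of a random variable with range $R$, the true mean deviates from the sample mean by more than

```latex
\varepsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}
```

with probability at most $\delta$. The tree uses this to decide, with confidence $1-\delta$, that the attribute that looks best on the examples seen so far really is the best split attribute, so each example can be inspected once and discarded.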
37. Ensemble Methods
http://www.cs.waikato.ac.nz/~abifet/MOA/
New ensemble methods:
ADWIN bagging: when a change is detected, the worst classifier is removed and a new classifier is added
Adaptive-Size Hoeffding Tree bagging
38. XML Tree Framework on evolving data streams

                     Maximal              Closed
Dataset   # Trees  Att.  Acc.  Mem.   Att.  Acc.  Mem.
CSLOG12    15483    84  79.64  1.2    228  78.12  2.54
CSLOG23    15037    88  79.81  1.21   243  78.77  2.75
CSLOG31    15702    86  79.94  1.25   243  77.60  2.73
CSLOG123   23111    84  80.02  1.7    228  78.91  4.18

Table: BAGGING on unordered trees.
39. Conclusions
An XML tree stream classifier system.
Using Galois lattice theory, we present methods for mining closed trees:
Incremental
Sliding Window
Adaptive: using ADWIN to monitor change
We use MOA data stream classifiers.