Presentation ucb 2012

The ClusTree: Indexing Micro-Clusters
for Anytime Stream Mining

Philipp Kranen (1), Ira Assent (2), Corinna Baldauf (1), Thomas Seidl (1)
(1) Data Management and Data Exploration Group,
RWTH Aachen University, Germany
(2) Department of Computer Science, Aarhus University, Denmark

P. Kranen, I. Assent, C. Baldauf, T. Seidl – The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining

Motivating examples

[Figure: emergency case vs. normal case; stages labeled pre-classifier, full classifier, professional decision]


Applications and tasks

[Figure: stream mining tasks (modeling, classification, outlier detection) arranged by data rate (constant vs. varying)]


Agenda

I. The Anytime principle
Anytime algorithms for stream data mining

II. The ClusTree
Self-adaptive anytime stream clustering

III. The MOA Framework
An open source framework for stream mining algorithms



Definitions I

• Stream
A stream $S : \mathbb{N} \to O \times T$, $S : i \mapsto (o_i, t_i)$, is an infinite sequence of objects $o_i \in O$ from a d-dimensional input space $O$, and $t_i \in T$, $\forall i$, is the discrete arrival time of object $o_i$.
• Inter-arrival time
The inter-arrival time between two consecutive objects $o_i$ and $o_{i+1}$ is denoted as $\Delta t_i$, i.e. $0 \le \Delta t_i = t_{i+1} - t_i \in T$.
• Constant and varying streams
A stream is called constant ↔ $\Delta t_i = \Delta t_j \;\forall i, j$; otherwise it is called varying.
• Stream algorithms
– Online algorithms – the input is given one at a time
– Budget algorithms – tailored to a specific time budget b
– Anytime algorithms – provide a result after any amount of processing time
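The distinction between constant and varying streams can be checked on a finite prefix of arrival times. The following is a minimal sketch; the function names and the integer time stamps are assumptions made for illustration and do not come from the slides.

```python
from typing import List, Sequence


def inter_arrival_times(arrival_times: Sequence[int]) -> List[int]:
    """Return the gaps between consecutive discrete arrival times."""
    return [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]


def is_constant(arrival_times: Sequence[int]) -> bool:
    """A finite stream prefix is constant if all inter-arrival times are equal."""
    return len(set(inter_arrival_times(arrival_times))) <= 1


# Hypothetical arrival times, chosen only for illustration.
constant_prefix = [0, 5, 10, 15, 20]
varying_prefix = [0, 5, 6, 15, 16]
```

A real stream is infinite, so such a check can only refute constancy on the part seen so far.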


Definitions II

• Budget algorithms – tailored to a specific time budget
– Available time < budget → no result
– Available time > budget → idle times

• How should stream processing be done?
– Little time → fast result
– More time → use it to improve the result
[Figure: result quality plotted over time]

• Anytime algorithms – provide a result after any time
For a given input, an anytime algorithm can provide a first result after a very short initialization time, and it uses additional time to improve its result. The algorithm is interruptible at any time and delivers the best result obtained up to the point of interruption.
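The anytime behavior described above can be illustrated with a nearest neighbor classifier that scans its training data one item per step and can be interrupted after any step. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the time allowance is modeled as a number of refinement steps rather than wall-clock time so that the example stays deterministic.

```python
from typing import Iterator, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def anytime_nearest_label(query: Point,
                          training: Sequence[Tuple[Point, str]]) -> Iterator[str]:
    """Yield the label of the best neighbor found so far, one training item per step."""
    best_dist = float("inf")
    best_label = None
    for point, label in training:
        dist = squared_distance(query, point)
        if dist < best_dist:
            best_dist, best_label = dist, label
        yield best_label  # a valid (possibly improvable) result after every step


def classify_with_budget(query: Point,
                         training: Sequence[Tuple[Point, str]],
                         steps: int) -> Optional[str]:
    """Interrupt the anytime algorithm after `steps` steps; return the best result so far."""
    result = None
    for done, result in enumerate(anytime_nearest_label(query, training), start=1):
        if done >= steps:
            break
    return result


# Hypothetical toy data: the first result is wrong, more time corrects it.
training_data = [((5.0, 5.0), "b"), ((0.0, 0.0), "a"), ((1.0, 1.2), "a")]
query_point = (1.0, 1.0)
quick = classify_with_budget(query_point, training_data, steps=1)
refined = classify_with_budget(query_point, training_data, steps=3)
```

With one step the classifier answers from the single item it has seen; with three steps it has found a closer neighbor and changes its answer.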


Anytime algorithms on constant streams

• Can we do better than using all available time?
Yes we can!
[Figure: constant data stream with arrival interval ta; objects of type 1, type 2, …, type m; time labels tf and td]

• Distribute computation time according to confidence values
– Spend less time on confident items
– Use additional time for uncertain objects

• Prerequisites
– Anytime algorithm
– Confidence measure
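One way to picture the idea of distributing computation time by confidence is a greedy loop that always gives the next refinement step to the currently least confident object. This is a toy sketch, not the method from the cited work: the function name, the `gain` parameter and the assumption that one step removes a fixed fraction of the remaining uncertainty are all made up for illustration.

```python
import heapq
from typing import List, Sequence, Tuple


def distribute_steps(confidences: Sequence[float], extra_steps: int,
                     gain: float = 0.5) -> Tuple[List[int], List[float]]:
    """Greedily assign each extra refinement step to the least confident item.

    Toy assumption: one step removes the fraction `gain` of the remaining
    uncertainty (1 - confidence) of the refined item.
    """
    steps = [0] * len(confidences)
    current = list(confidences)
    if not current:
        return steps, current
    heap = [(conf, idx) for idx, conf in enumerate(current)]
    heapq.heapify(heap)
    for _ in range(extra_steps):
        conf, idx = heapq.heappop(heap)    # least confident item so far
        conf += gain * (1.0 - conf)        # hypothetical effect of one step
        steps[idx] += 1
        current[idx] = conf
        heapq.heappush(heap, (conf, idx))
    return steps, current


# Three buffered objects with initial confidences and three spare steps.
steps_per_item, final_confidences = distribute_steps([0.9, 0.5, 0.2], extra_steps=3)
```

In this example the already confident first object receives no extra time, while the least confident object receives two of the three steps.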


Existing anytime classification approaches

• Anytime support vector machines
• Anytime nearest neighbor classification
• Anytime Bayesian classification
– Categorical data
– Continuous data
• Others
• Anytime induction of decision trees
• Anytime A* algorithm
• Anytime clustering
• Anytime outlier detection

[References on last slide.]


Sampling, buffering, anytime clustering

• What about sampling?
→ Not appropriate for classification or outlier detection.

• What about buffering?
→ Durations of bursts are unknown.

• Why anytime clustering?
→ …
→ "Smart buffering"
→ Use micro-clusters as input for further analysis
→ Provide constant (maximal) granularity at regular intervals


Agenda


II. The ClusTree




Problem statement

• Clustering is a frequently used technique
– Provides an overview, reduces amount of data, groups similar objects
• Streaming scenario:
– Use summaries (micro-clusters) as input for further analysis
– But: endless amounts of data (streams) are hard to handle

• Stream clustering challenges:
– Single pass clustering
– Limited time, varying time allowance [Anytime]
– Limited memory, yet least information loss [Fine grained]
– Evolving data [Drift & Novelty]
– Flexible number and size of clusters [Self-adaptive]


Related work

• Stream clustering approaches and paradigms
– Convex clustering approaches (k-center)
– Density-based, grid-based approaches
– Kernels, graphs, fractal dimensions, …
– Process chunks, merge results
– Maintain list, remove oldest or merge closest pair
– Online and offline component

• All approaches have to restrict themselves to the worst case time


Goals

• Anytime clustering [Anytime]
– Don't miss any point, no matter at which speed

• Adaptive model size [Self-adaptive]
– Don't restrict model to worst case assumptions

• Fine grained representation [Fine grained]
– Provide more detailed input for offline component

• Compatible with existing work on drift and novelty [Drift & Novelty]
– Aging / Decay
– Snapshots / Drift & Novelty


ClusTree – basic idea

• Cluster features CF = (N, LS, SS) represent micro-clusters
– Allow computing statistics like mean and variance
• Maintain a balanced hierarchical data structure
• Insert new object into the closest subtree
• Insertion stops if next object arrives
• Most detailed model is stored at leaf level
• Tree (= model) grows if more time is available
[Figure: tree levels reached with less time vs. more time]


ClusTree structure and anytime insert [Fine grained] [Anytime]

• Hierarchy of micro-clusters CF = (N, LS, SS)
• New objects (x1 … xd) are simply added to the cluster feature
– $N = N + 1,\; LS_i = LS_i + x_i,\; SS_i = SS_i + x_i^2$
• Anytime insert: buffer object locally in a local buffer CF

[Figure: layout of an inner entry (cluster feature with n(t), LS(t), SS(t) and a buffer cluster feature with n_b(t), LS_b(t), SS_b(t); LS and SS each have d components) and of a leaf entry (cluster feature with n(t), LS(t), SS(t))]
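The update rule for cluster features can be sketched as follows. The class and method names are assumptions made for illustration; only the triple (N, LS, SS), the per-dimension update and the use of mean and variance come from the slides. The `merge` method relies on the additivity of the three components, and the variance uses the standard identity SS_i/N − (LS_i/N)².

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ClusterFeature:
    """Micro-cluster summary CF = (N, LS, SS), kept per dimension."""
    n: float = 0.0
    ls: List[float] = field(default_factory=list)  # linear sums
    ss: List[float] = field(default_factory=list)  # squared sums

    def add_point(self, x: Sequence[float]) -> None:
        """N = N + 1, LS_i = LS_i + x_i, SS_i = SS_i + x_i^2."""
        if not self.ls:
            self.ls = [0.0] * len(x)
            self.ss = [0.0] * len(x)
        self.n += 1
        for i, xi in enumerate(x):
            self.ls[i] += xi
            self.ss[i] += xi * xi

    def merge(self, other: "ClusterFeature") -> None:
        """Cluster features are additive, so two summaries merge component-wise."""
        if not self.ls:
            self.ls = [0.0] * len(other.ls)
            self.ss = [0.0] * len(other.ss)
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def mean(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def variance(self) -> List[float]:
        """Per-dimension variance: SS_i / N - (LS_i / N)^2."""
        return [s / self.n - (l / self.n) ** 2 for l, s in zip(self.ls, self.ss)]


cf = ClusterFeature()
cf.add_point([1.0, 2.0])
cf.add_point([3.0, 6.0])
```

After the two insertions the summary holds N = 2, LS = (4, 8) and SS = (10, 40), from which mean (2, 4) and variance (1, 4) follow without storing the points themselves.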


Buffer and hitchhiker [Self-adaptive]

• Buffer: interrupt insertion – aggregate objects on interrupt
• Hitchhiker: resume insertion – take buffer along (if same way)
– At most two objects descend together
• Tree grows through splitting nodes starting from the leaf
• Entry structure: (CF, pointer, CF_b)

[Figure: example descent through the tree. Level 1: root; Level 2: hitchhike; Level 3: buffer; Level 4: insert]
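The interplay of buffer and hitchhiker can be sketched on a fixed two-level tree. This is a simplified illustration and not the authors' implementation: the names are hypothetical, the cluster feature is reduced to (N, LS), node splits are omitted, leaf entries simply absorb arriving objects, and the available time is modeled as the number of levels that may be processed before the next object arrives.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class CF:
    """Reduced cluster feature (N, LS); SS is omitted to keep the sketch short."""
    n: float
    ls: List[float]

    def add(self, other: "CF") -> None:
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]

    def mean(self) -> List[float]:
        return [v / self.n for v in self.ls]


@dataclass
class Entry:
    cf: CF                          # summary of subtree plus buffer
    buffer: CF                      # aggregate of objects parked at this entry
    child: Optional["Node"] = None  # None marks a leaf entry


@dataclass
class Node:
    entries: List[Entry]            # entries are assumed non-empty (n > 0)


def make_entry(n: float, ls: Sequence[float], child: Optional[Node] = None) -> Entry:
    return Entry(CF(n, list(ls)), CF(0.0, [0.0] * len(ls)), child)


def closest(node: Node, cf: CF) -> Entry:
    target = cf.mean()
    return min(node.entries,
               key=lambda e: sum((a - b) ** 2 for a, b in zip(e.cf.mean(), target)))


def park(entry: Entry, load: CF) -> None:
    """Add a load to an entry; inner entries also keep it in their buffer."""
    entry.cf.add(load)
    if entry.child is not None:
        entry.buffer.add(load)


def anytime_insert(root: Node, point: Sequence[float], levels_allowed: int) -> None:
    obj = CF(1.0, list(point))
    hitchhiker: Optional[CF] = None
    node = root
    while True:
        entry = closest(node, obj)
        if hitchhiker is not None and closest(node, hitchhiker) is not entry:
            park(closest(node, hitchhiker), hitchhiker)  # ways part here
            hitchhiker = None
        entry.cf.add(obj)
        if hitchhiker is not None:
            entry.cf.add(hitchhiker)
        if entry.child is None:
            return                                       # absorbed by a leaf entry
        levels_allowed -= 1
        if levels_allowed <= 0:                          # interrupted: buffer locally
            entry.buffer.add(obj)
            if hitchhiker is not None:
                entry.buffer.add(hitchhiker)
            return
        if entry.buffer.n > 0:                           # take the buffer along
            picked, entry.buffer = entry.buffer, CF(0.0, [0.0] * len(obj.ls))
            if hitchhiker is None:
                hitchhiker = picked
            else:
                hitchhiker.add(picked)
        node = entry.child


left = Node([make_entry(1, [0.0]), make_entry(1, [2.0])])
right = Node([make_entry(1, [10.0]), make_entry(1, [12.0])])
root = Node([make_entry(2, [2.0], left), make_entry(2, [22.0], right)])

anytime_insert(root, [1.9], levels_allowed=1)  # interrupted at the root
buffered_after_first = root.entries[0].buffer.n
anytime_insert(root, [0.2], levels_allowed=2)  # picks up the buffer as hitchhiker
```

The first object is interrupted at the root and parked in the buffer of the left entry. The second object has time for both levels, takes the buffered aggregate along, and the two part ways in the leaf node because they are closest to different leaf entries.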


Maintaining an up-to-date view [Drift & Novelty]

• Goal: Compatible with existing work on drift and novelty
– New leaf entries get a unique ID
– Aging by an exponential decay function $w(\Delta t) = \beta^{-\lambda \Delta t}$
• Benefits of the employed decay function
– Avoid splits by reusing insignificant entries
– An entry's CF still represents exactly its subtree and its buffer

Lemma 1 (ClusTree Invariant): For each inner entry $e_s$ with timestamp $t + \Delta t$ and decay function $w(\Delta t) = 2^{-\lambda \Delta t}$, it holds that
$$e_s.CF^{(t+\Delta t)} = \left( w(\Delta t) \sum_{i=1}^{s} e_{s_i}.CF^{(t)} \right) + e_s.buffer^{(t+\Delta t)}$$
[Proof in the paper.]
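The invariant can be checked numerically on a small example. The sketch below assumes that decay scales all three CF components by the same weight and reads the summed entries as the entries of the child node; the tuple representation, the function names and the sample values are hypothetical.

```python
from typing import List, Tuple

CFTuple = Tuple[float, List[float], List[float]]  # (N, LS, SS)


def decay_weight(delta_t: float, lam: float, beta: float = 2.0) -> float:
    """Exponential decay w(dt) = beta^(-lambda * dt); beta = 2 as in Lemma 1."""
    return beta ** (-lam * delta_t)


def scale(cf: CFTuple, w: float) -> CFTuple:
    n, ls, ss = cf
    return (w * n, [w * v for v in ls], [w * v for v in ss])


def add(a: CFTuple, b: CFTuple) -> CFTuple:
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])])


# Hypothetical child entries at time t and an object buffered during (t, t + dt].
children_t = [(2.0, [4.0], [10.0]), (1.0, [3.0], [9.0])]
buffer_new = (1.0, [5.0], [25.0])
w = decay_weight(delta_t=1.0, lam=1.0)

# Right-hand side of the invariant: decayed sum of the children plus the buffer.
summed = add(children_t[0], children_t[1])
parent_new = add(scale(summed, w), buffer_new)

# Decaying every child separately and summing afterwards gives the same result,
# because the weight is a common factor of all components.
decayed_children_sum = add(scale(children_t[0], w), scale(children_t[1], w))
```

With λ = 1 and Δt = 1 the weight is 0.5, so the weight of every summary halves per time unit.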


Extensions of the ClusTree

• Insertion of aggregates for extremely fast streams
• Iterative depth first descent for slower streams
• Local look ahead to reduce overlapping
• Explicit noise handling and noise to cluster events
[Figure: panels a) to d), each showing a node with entries labeled e and n above leaf nodes labeled L]


Evaluation – anytime clustering and aggregation

Data set: Forest Covertype
• Anytime clustering (90,000 points per second, pps)
– 88% purity on leaf level
– Purity on higher levels corresponds to faster streams
– >70% purity starting three levels under root

• Aggregation (varying streams)
– Purity drops under 70% at 150,000 pps
– Aggregation significantly improves the purity on the leaf level
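For readers unfamiliar with the purity measure, the following sketch shows one common way to compute it: the share of objects that carry the majority class label of their cluster. Whether the evaluation uses exactly this weighted variant is an assumption, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def purity(clusters: Sequence[Sequence[str]]) -> float:
    """Fraction of objects that carry the majority class label of their cluster."""
    total = sum(len(labels) for labels in clusters)
    if total == 0:
        return 0.0
    majority = sum(Counter(labels).most_common(1)[0][1]
                   for labels in clusters if labels)
    return majority / total


# Class labels of the objects summarized by three hypothetical micro-clusters.
toy_clusters: List[List[str]] = [["a", "a", "b"], ["b", "b", "b", "a"], ["c"]]
toy_purity = purity(toy_clusters)
```

In the toy example 6 of the 8 objects agree with the majority label of their cluster, which gives a purity of 0.75.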


Evaluation – adaptive clustering

• Setup for constant streams
– ClusTree: stream speed → maintainable number of micro-clusters (#MC)
– DenStream [SDM06] & CluStream [VLDB03]: #MC → processable pps
• ClusTree results: #MC is exponential (number of distance computations, #dists, is logarithmic)


Agenda


III. The MOA Framework


The MOA framework

• Extensible open source software
– Data generators, file streams
– Stream mining algorithms
– Measure collection

• Supported stream mining tasks
– Stream clustering, stream classification, outlier detection, …

• Repeatable/benchmark settings

• In collaboration with [partner logos]


References

• Anytime SVM: DeCoste: Anytime Query-Tuned Kernel Machines via Cholesky Factorization. SDM, 2003
• DeCoste et al.: Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. ICML, 2003
• Bayes (continuous data): Seidl et al.: Indexing density models for incremental learning and anytime classification on data streams. EDBT, 2009
• Bayes (categorical): Yang et al.: Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 2007
• Anytime Nearest Neighbor: Ueno et al.: Anytime Classification Using the Nearest Neighbor Algorithm with Applications to Stream Mining. ICDM, 2006
• Anytime + constant: Kranen et al.: Harnessing the strengths of anytime algorithms for constant data streams. DMKD Journal, 2009
• ClusTree: Kranen et al.: Self-Adaptive Anytime Stream Clustering. ICDM, 2009
• A complete list of references including stream clustering, MOA, evaluation, etc.: Kranen: Anytime Algorithms for Stream Data Mining. PhD Thesis, RWTH Aachen, 2011
