Presentation ucb 2012

  1. The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining
     Philipp Kranen (1), Ira Assent (2), Corinna Baldauf (1), Thomas Seidl (1)
     (1) Data Management and Data Exploration Group, RWTH Aachen University, Germany
     (2) Department of Computer Science, Aarhus University, Denmark
  2. Motivating examples
     [Figure labels: emergency, normal; pre classifier, full classifier, professional decision]
  3. Applications and tasks
     • Modeling
     • Classification
     • Outlier detection
     [Figure labels: data rate constant, data rate varying]
  4. Agenda
     I. The Anytime principle: Anytime algorithms for stream data mining
     II. The ClusTree: Self-adaptive anytime stream clustering
     III. The MOA Framework: An open source framework for stream mining algorithms
  5. Definitions I
     • Stream: A stream $S$ is an infinite sequence of objects $o_i$ from a $d$-dimensional input space, where $t_i$ is the discrete arrival time of object $o_i$.
     • Inter-arrival time: The inter-arrival time between two consecutive objects $o_i$ and $o_{i+1}$ is denoted as $\Delta t_i$, with $\Delta t_i > 0$.
     • Constant and varying streams: A stream is called constant $\leftrightarrow \Delta t_i = \Delta t_j \ \forall i, j$; otherwise it is varying.
     • Stream algorithms
       – Online algorithms: the input is given one at a time
       – Budget algorithms: tailored to a specific time budget $b$
       – Anytime algorithms: provide a result after any amount of processing time
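The distinction between constant and varying streams can be checked on a finite prefix of arrival times. The following is a minimal sketch; the function names and the tolerance parameter `tol` are assumptions made for illustration and do not come from the slides.

```python
from typing import List, Sequence


def inter_arrival_times(arrivals: Sequence[float]) -> List[float]:
    """Return the gaps between consecutive arrival times."""
    return [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]


def is_constant(arrivals: Sequence[float], tol: float = 1e-9) -> bool:
    """True if all inter-arrival times of this finite prefix are equal.

    `tol` is a hypothetical tolerance for floating-point timestamps.
    """
    deltas = inter_arrival_times(arrivals)
    return all(abs(delta - deltas[0]) <= tol for delta in deltas)
```

A real stream is infinite, so such a check can only ever confirm that the prefix seen so far is constant.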
  6. Definitions II
     • Budget algorithms: tailored to a specific time budget
       – Available time < budget → no result
       – Available time > budget → idle times
     • How should stream processing be done?
       – Little time → fast result
       – More time → use it to improve the result
       [Figure: quality over time]
     • Anytime algorithms: provide a result after any time
       For a given input, an anytime algorithm can provide a first result after a very short initialization time, and it uses additional time to improve its result. The algorithm is interruptible after any time and will deliver the best result obtained up to the point of interruption.
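The interruptible behaviour described on this slide can be illustrated with a generator that yields a first coarse result quickly and then refines it step by step. This is only a sketch of the anytime principle on a toy task (estimating a mean); `anytime_mean` and `run_until` are hypothetical names, not part of the ClusTree or of MOA.

```python
import time
from typing import Iterator, Sequence


def anytime_mean(values: Sequence[float]) -> Iterator[float]:
    """Yield successively better estimates of the mean of a non-empty sequence.

    The first estimate is available after looking at one value; each
    further step folds in one more value.
    """
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count  # best result obtained so far


def run_until(steps: Iterator[float], deadline: float) -> float:
    """Refine until `deadline` (a time.monotonic() value) and return the best result.

    One step is always taken, so a result exists even if the deadline
    has already passed.
    """
    best = next(steps)
    while time.monotonic() < deadline:
        try:
            best = next(steps)
        except StopIteration:
            break  # nothing left to improve
    return best
```

With a deadline in the past, `run_until` returns the initial estimate; with a generous deadline, it returns the fully refined result.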
  7. Anytime algorithms on constant streams
     • Can we do better than using all available time? Yes we can!
       [Figure: constant data stream with arrival interval $t_a$ and object types 1 … m; further labels $t_f$, $t_d$]
     • Distribute computation time according to confidence values
       – Spend less time on confident items
       – Use additional time for uncertain objects
     • Prerequisites
       – Anytime algorithm
       – Confidence measure
  8. Existing anytime classification approaches
     • Anytime support vector machines
     • Anytime nearest neighbor classification
     • Anytime Bayesian classification
       – Categorical data
       – Continuous data
     • Others
       – Anytime induction of decision trees
       – Anytime A* algorithm
       – Anytime clustering
       – Anytime outlier detection
     [References on last slide.]
  9. Sampling, buffering, anytime clustering
     • What about sampling?
       – Not appropriate for classification or outlier detection.
     • What about buffering?
       – Durations of bursts are unknown.
     • Why anytime clustering?
       – …
       – "Smart buffering"
       – Use micro-clusters as input for further analysis
       – Provide constant (maximal) granularity at regular intervals
  10. Agenda
      I. The Anytime principle: Anytime algorithms for stream data mining
      II. The ClusTree: Self-adaptive anytime stream clustering
      III. The MOA Framework: An open source framework for stream mining algorithms
  11. Problem statement
      • Clustering is a frequently used technique
        – Provides an overview, reduces amount of data, groups similar objects
      • Streaming scenario:
        – Use summaries (micro-clusters) as input for further analysis
        – But: endless amounts of data (streams) are hard to handle
      • Stream clustering challenges:
        – Single pass clustering [Anytime]
        – Limited time, varying time allowance
        – Limited memory, yet least information loss [Fine grained]
        – Evolving data [Drift & Novelty]
        – Flexible number and size of clusters [Self-adaptive]
  12. Related work
      • Stream clustering approaches and paradigms
        – Convex clustering approaches (k-center)
        – Density-based, grid-based approaches
        – Kernels, graphs, fractal dimensions, …
        – Process chunks, merge results
        – Maintain list, remove oldest or merge closest pair
        – Online and offline component
      • All approaches have to restrict themselves to the worst case time
  13. Goals
      • Anytime clustering [Anytime]
        – Don't miss any point, no matter at which speed
      • Adaptive model size [Self-adaptive]
        – Don't restrict model to worst case assumptions
      • Fine grained representation [Fine grained]
        – Provide more detailed input for offline component
      • Compatible with existing work on drift and novelty [Drift & Novelty]
        – Aging / decay
        – Snapshots / drift & novelty
  14. ClusTree – basic idea
      • Cluster features CF = (N, LS, SS) represent micro-clusters
        – Allow computing statistics like mean and variance
      • Maintain a balanced hierarchical data structure
        – Insert new object into the closest subtree
        – Insertion stops if next object arrives
        – Most detailed model is stored at leaf level
        – Tree (= model) grows if more time is available
      [Figure labels: less time, more time]
  15. ClusTree structure and anytime insert [Fine grained] [Anytime]
      • Hierarchy of micro-clusters CF = (N, LS, SS)
      • New objects $(x_1, \dots, x_d)$ are simply added to the cluster feature
        – $N = N + 1$, $LS_i = LS_i + x_i$, $SS_i = SS_i + x_i^2$
      • Anytime insert: buffer object locally in a local buffer CF
      [Figure: an inner entry holds a CF $(n^{(t)}, LS^{(t)}, SS^{(t)})$ and a buffer CF $(n_b^{(t)}, LS_b^{(t)}, SS_b^{(t)})$; a leaf entry holds a CF $(n^{(t)}, LS^{(t)}, SS^{(t)})$]
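The update rule for cluster features, and the statistics they allow, can be sketched in a few lines of Python. The class and method names below are illustrative assumptions; the variance uses the standard population formula SS_i/N - (LS_i/N)^2, which the slides do not spell out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusterFeature:
    """CF = (N, LS, SS) summarising a micro-cluster of d-dimensional points."""
    n: float
    ls: List[float]  # linear sum per dimension
    ss: List[float]  # squared sum per dimension

    @classmethod
    def empty(cls, d: int) -> "ClusterFeature":
        return cls(0.0, [0.0] * d, [0.0] * d)

    def insert(self, x: Sequence[float]) -> None:
        """Apply N = N + 1, LS_i = LS_i + x_i, SS_i = SS_i + x_i^2."""
        self.n += 1
        for i, xi in enumerate(x):
            self.ls[i] += xi
            self.ss[i] += xi * xi

    def mean(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def variance(self) -> List[float]:
        """Per-dimension population variance: SS_i / N - (LS_i / N)^2."""
        return [q / self.n - (s / self.n) ** 2 for s, q in zip(self.ls, self.ss)]
```

For example, inserting the points (1, 2) and (3, 6) into an empty two-dimensional CF gives mean [2.0, 4.0] and variance [1.0, 4.0], without the points themselves being stored.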
  16. Buffer and hitchhiker [Self-adaptive]
      • Buffer: interrupt insertion – aggregate objects on interrupt
      • Hitchhiker: resume insertion – take buffer along (if same way)
        – At most two objects to descend with
      • Tree grows through splitting nodes, starting from the leaf
      • Entry structure: (CF, pointer, CF_b)
      [Figure: example descent. Level 1: root; level 2: hitchhike; level 3: buffer; level 4: insert]
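The interplay of buffer and hitchhiker can be sketched as follows. This is a simplified illustration under several assumptions that are not in the slides: points are one-dimensional, the interruption is modelled by a hypothetical `max_levels` limit instead of a clock, the closest entry is chosen by distance between means, and node splitting and decay are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CF:
    """One-dimensional cluster feature (N, LS, SS); n == 0 means empty."""
    n: float = 0.0
    ls: float = 0.0
    ss: float = 0.0

    @staticmethod
    def of(x: float) -> "CF":
        return CF(1.0, x, x * x)

    def add(self, other: "CF") -> None:
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def mean(self) -> float:
        return self.ls / self.n


@dataclass
class Entry:
    """Entry structure (CF, pointer, CF_b); leaf entries have no child."""
    cf: CF
    child: Optional["Node"] = None
    buffer: CF = field(default_factory=CF)


@dataclass
class Node:
    entries: List[Entry]


def closest(node: Node, cf: CF) -> Entry:
    # Assumes every entry is non-empty, so its mean is defined.
    return min(node.entries, key=lambda e: abs(e.cf.mean() - cf.mean()))


def insert(root: Node, x: float, max_levels: int) -> None:
    """Descend at most `max_levels` (>= 1) levels with object `x`."""
    obj, hitch = CF.of(x), CF()
    node = root
    for level in range(1, max_levels + 1):
        target = closest(node, obj)
        if hitch.n > 0 and closest(node, hitch) is not target:
            # Different way: the hitchhiker gets off at its own closest entry.
            stop = closest(node, hitch)
            stop.cf.add(hitch)
            if stop.child is not None:
                stop.buffer.add(hitch)
            hitch = CF()
        payload = CF()
        payload.add(obj)
        payload.add(hitch)
        target.cf.add(payload)
        if target.child is None:
            return  # leaf entry reached: object (and hitchhiker) inserted
        if level == max_levels:
            target.buffer.add(payload)  # interrupted: park in the local buffer
            return
        # Resume: pick up whatever waits in this entry's buffer.
        hitch.add(target.buffer)
        target.buffer = CF()
        node = target.child
```

In this sketch each inner entry's CF always equals the sum of its child entries' CFs plus its own buffer, so an interrupted insertion never loses an object.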
  17. Maintaining an up-to-date view [Drift & Novelty]
      • Goal: compatible with existing work on drift and novelty
        – New leaf entries get a unique ID
        – Aging by an exponential decay function $w(\Delta t) = \beta^{-\lambda \Delta t}$
      • Benefits of the employed decay function
        – Avoid splits by reusing insignificant entries
        – An entry's CF still represents exactly its subtree and its buffer
      • Lemma 1 (ClusTree Invariant): For each inner entry $e_s$ with timestamp $t + \Delta t$ and decay function $w(\Delta t) = 2^{-\lambda \Delta t}$ it holds that
        $$e_s.CF^{(t+\Delta t)} = \left( w(\Delta t) \cdot \sum_{i=1}^{\nu_s} e_{s_i}.CF^{(t)} \right) + e_s.\mathit{buffer}^{(t+\Delta t)}$$
        [Proof in the paper.]
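The effect of the decay function on a cluster feature can be sketched as follows. The sketch assumes that aging multiplies every component of the CF by the weight, as the lemma suggests; the function names and the tuple representation are illustrative choices, not taken from the slides.

```python
from typing import List, Tuple

CF = Tuple[float, List[float], List[float]]  # (N, LS, SS)


def decay_weight(dt: float, lam: float, beta: float = 2.0) -> float:
    """w(dt) = beta ** (-lam * dt); with beta = 2 the weight halves every 1 / lam."""
    return beta ** (-lam * dt)


def aged(cf: CF, dt: float, lam: float, beta: float = 2.0) -> CF:
    """Return the CF as it looks `dt` time units later, with no new objects added."""
    w = decay_weight(dt, lam, beta)
    n, ls, ss = cf
    return (w * n, [w * v for v in ls], [w * v for v in ss])
```

With lambda = 0.5 and a gap of 2 time units, the weight is 0.5, so an entry with N = 4 counts as only 2 objects. Entries whose weight has become insignificant are the ones that can be reused instead of splitting a node.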
  18. Extensions of the ClusTree
      • Insertion of aggregates for extremely fast streams
      • Iterative depth-first descent for slower streams
      • Local look-ahead to reduce overlapping
      • Explicit noise handling and noise-to-cluster events
      [Figure: panels a) to d), with elements labeled e, n and L]
  19. Evaluation – anytime clustering and aggregation (Forest Covertype)
      • Anytime clustering (90,000 pps)
        – 88% purity on leaf level
        – Purity on higher levels corresponds to faster streams
        – >70% purity starting three levels under root
      • Aggregation (varying streams)
        – Purity drops under 70% at 150,000 pps
        – Aggregation significantly improves the purity on the leaf level
  20. Evaluation – adaptive clustering
      • Setup for constant streams
        – ClusTree: stream speed → maintainable #MC
        – DenStream [SDM06] & CluStream [VLDB03]: #MC → processable pps
      • ClusTree results: #MC is exponential (#dists is logarithmic)
  21. Agenda
      I. The Anytime principle: Anytime algorithms for stream data mining
      II. The ClusTree: Self-adaptive anytime stream clustering
      III. The MOA Framework: An open source framework for stream mining algorithms
  22. The MOA framework
      • Extensible open source software
        – Data generators, file streams
        – Stream mining algorithms
        – Measure collection
      • Supported stream mining tasks
        – Stream clustering, stream classification, outlier detection, …
      • Repeatable/benchmark settings
      • In collaboration with [logos]
  23. References
      • Anytime SVM: DeCoste: Anytime Query-Tuned Kernel Machines via Cholesky Factorization. SDM, 2003
      • DeCoste et al.: Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. ICML, 2003
      • Bayes (continuous data): Seidl et al.: Indexing density models for incremental learning and anytime classification on data streams. EDBT, 2009
      • Bayes (categorical): Yang et al.: Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 2007
      • Anytime Nearest Neighbor: Ueno et al.: Anytime Classification Using the Nearest Neighbor Algorithm with Applications to Stream Mining. ICDM, 2006
      • Anytime + constant: Kranen et al.: Harnessing the strengths of anytime algorithms for constant data streams. DMKD Journal, 2009
      • ClusTree: Kranen et al.: Self-Adaptive Anytime Stream Clustering. ICDM, 2009
      • A complete list of references including stream clustering, MOA, evaluation, etc.: Kranen: Anytime Algorithms for Stream Data Mining. PhD Thesis, RWTH Aachen, 2011
