The ClusTree: Indexing Micro-Clusters
     for Anytime Stream Mining




  Philipp Kranen¹, Ira Assent², Corinna Baldauf¹, Thomas Seidl¹
              ¹ Data Management and Data Exploration Group,
                   RWTH Aachen University, Germany
       ² Department of Computer Science, Aarhus University, Denmark
P. Kranen, I. Assent, C. Baldauf, T. Seidl – The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining

Motivating examples




   [Figure: pipeline of pre-classifier, full classifier and professional decision; cases labelled "emergency" or "normal"]

    Applications and tasks

   [Figure: example applications grouped by task (classification, modeling, outlier detection) and by data rate (constant, varying)]

Agenda



  I.      The Anytime principle
           Anytime algorithms for stream data mining


  II.     The ClusTree
           Self-adaptive anytime stream clustering


  III. The MOA Framework
           An open source framework for stream mining algorithms





Definitions I

   Stream
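         A stream $S : \mathbb{N} \to D \times T$, $i \mapsto (o_i, t_i)$, is an infinite sequence
         of objects $o_i \in D$ from a d-dimensional input space $D$, and
         $t_i \in T$, $t_i \le t_{i+1}$ $\forall i$, is the discrete arrival time of object $o_i$.
   Inter-arrival time
         The inter-arrival time between two consecutive objects $o_i$ and $o_{i+1}$
         is denoted as $\Delta t_i = t_{i+1} - t_i$, i.e. $0 \le \Delta t_i \in T$.
   Constant and varying streams
         A stream $S$ is called constant $\leftrightarrow$ $\Delta t_i = \Delta t_j$ $\forall i, j$, and varying otherwise.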
   Stream algorithms
         – Online algorithms – the input is given one at a time
         – Budget algorithms – tailored to a specific time budget b
         – Anytime algorithms – provide a result after any amount of processing time
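
The definitions of inter-arrival time and constant streams can be illustrated on a finite prefix of a stream. This is only a sketch: the function names and the numeric tolerance are assumptions for illustration, and a real stream is infinite, so the check can only ever cover the objects seen so far.

```python
from typing import List, Sequence


def inter_arrival_times(arrival_times: Sequence[float]) -> List[float]:
    """Return the gaps t[i+1] - t[i] between consecutive arrival times."""
    gaps = [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]
    if any(gap < 0 for gap in gaps):
        raise ValueError("arrival times must be non-decreasing")
    return gaps


def is_constant(arrival_times: Sequence[float], tolerance: float = 1e-9) -> bool:
    """True if all inter-arrival times of the observed prefix are equal."""
    gaps = inter_arrival_times(arrival_times)
    return all(abs(gap - gaps[0]) <= tolerance for gap in gaps)
```

A prefix with arrival times 0, 2, 4, 6 counts as constant here, while 0, 1, 4, 5 counts as varying.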

Definitions II

   Budget Algorithms – tailored to a specific time budget
         – Available time < budget → no result
         – Available time > budget → idle times


   How should stream processing be done?




         – Little time → fast result
         – More time → use it to improve the result
         [Figure: plot of result quality over time]

   Anytime Algorithms – provide a result after any time
            For a given input an anytime algorithm can provide a first result after a very
            short initialization time and it uses additional time to improve its result. The
            algorithm is interruptible after any time and will deliver the best result
            obtained until the point of interruption.
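
The anytime property can be sketched with a toy nearest-neighbor classifier written as a generator: it yields a first result after inspecting a single training object and a possibly better one after each further object, so the caller can stop at any point and keep the best label found so far. The names `anytime_nearest_label` and `classify_with_budget`, and the step-count budget standing in for real time, are assumptions for illustration and not one of the published anytime classifiers.

```python
from typing import Iterator, List, Optional, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def anytime_nearest_label(query: Point,
                          training: List[Tuple[Point, str]]) -> Iterator[str]:
    """Yield the best label found so far after each inspected training object."""
    best_label, best_dist = None, float("inf")
    for point, label in training:
        dist = squared_distance(query, point)
        if dist < best_dist:
            best_label, best_dist = label, dist
        yield best_label


def classify_with_budget(query: Point,
                         training: List[Tuple[Point, str]],
                         max_steps: int) -> Optional[str]:
    """Consume refinement steps until the budget is used up (interruption)."""
    result = None
    for step, label in enumerate(anytime_nearest_label(query, training), start=1):
        result = label  # best result obtained until this point
        if step >= max_steps:
            break  # interrupted: deliver what we have
    return result
```

At least one step is always performed, which plays the role of the short initialization time in the definition.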

Anytime algorithms on constant streams

   Can we do better than using all available time? 

        Yes we can!

        [Figure: constant data stream with arrival interval ta and time marks tf and td; object types 1, 2, …, m]




   Distribute computation time according to confidence values
         – Spend less time on confident items
         – Use additional time for uncertain objects


   Prerequisites
         – Anytime algorithm
         – Confidence measure
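
One way to picture the idea of distributing computation time according to confidence is sketched below. It assumes each object comes with an anytime refiner, modelled as an iterator of (result, confidence) pairs, and a shared budget of refinement steps; the function name `distribute_steps` and the greedy "least confident first" rule are assumptions for illustration, not the scheme from the cited work.

```python
import heapq
from typing import Iterator, List, Tuple

Refiner = Iterator[Tuple[str, float]]  # yields (result, confidence)


def distribute_steps(refiners: List[Refiner], total_steps: int):
    """Give every object a first result, then spend the remaining steps
    on whichever object currently has the lowest confidence."""
    state = []
    heap = []
    for idx, refiner in enumerate(refiners):
        result, conf = next(refiner)  # first result (initialization step)
        state.append((result, conf))
        heapq.heappush(heap, (conf, idx))
    steps_used = [1] * len(refiners)
    remaining = total_steps - len(refiners)
    while remaining > 0 and heap:
        _, idx = heapq.heappop(heap)  # least confident object
        try:
            result, conf = next(refiners[idx])
        except StopIteration:
            continue  # this object cannot be improved any further
        state[idx] = (result, conf)
        steps_used[idx] += 1
        remaining -= 1
        heapq.heappush(heap, (conf, idx))
    return state, steps_used
```

With one confident and one uncertain object and four steps in total, the uncertain object receives three steps and the confident one keeps its first result.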

Existing anytime classification approaches

   Anytime support vector machines
   Anytime nearest neighbor classification
   Anytime Bayesian classification
          Categorical data
          Continuous data
   Others
          Anytime induction of decision trees
          Anytime A* algorithm
          Anytime clustering
          Anytime outlier detection


  [References on last slide.]

Sampling, buffering, anytime clustering

   What about sampling?
          Not appropriate for classification or outlier detection.


       What about buffering?
          Durations of bursts are unknown.


   Why anytime clustering?
          …
          “Smart buffering”
                 Use micro‐clusters as input for further analysis
                 Provide constant (maximal) granularity at regular intervals

Agenda



  I.      The Anytime principle
           Anytime algorithms for stream data mining


  II.     The ClusTree
           Self-adaptive anytime stream clustering


  III. The MOA Framework
           An open source framework for stream mining algorithms





Problem statement

       Clustering is a frequently used technique
              Provides an overview, reduces amount of data, groups similar objects
              Streaming scenario:
                 Use summaries (micro clusters) as input for further analysis
                 But: endless amounts of data (streams) are hard to handle



       Stream clustering challenges:
              Single pass clustering [Anytime]
              Limited time, varying time allowance [Anytime]
              Limited memory, yet least information loss [Fine grained]
              Evolving data [Drift&Novelty]
              Flexible number and size of clusters [Self-adaptive]


Related work

       Stream clustering approaches and paradigms
              Convex clustering approaches (k-center)
              Density-based, grid-based approaches
              kernels, graphs, fractal dimensions, …
              Process chunks, merge results
              Maintain list, remove oldest or merge closest pair
              Online and Offline component


       All approaches have to restrict themselves to the worst case time





Goals

       Anytime clustering [Anytime]
              don’t miss any point, regardless of stream speed

       Adaptive model size [Self-adaptive]
              don’t restrict the model to worst-case assumptions

       Fine grained representation [Fine grained]
              provide more detailed input for the offline component

       Compatible with existing work on drift and novelty [Drift&Novelty]
              Aging / Decay
              Snapshots / Drift & Novelty





ClusTree – basic idea

       Cluster features CF = (N, LS, SS) represent micro-clusters
              Allow computing statistics such as mean and variance
       Maintain a balanced hierarchical data structure
              Insert new object into the closest subtree
              Insertion stops if the next object arrives
              Most detailed model is stored at leaf level
              Tree (= model) grows if more time is available
       [Figure: tree levels, labelled from "less time" to "more time"]





ClusTree structure and anytime insert [Fine grained] [Anytime]


   Hierarchy of micro-clusters CF = (N, LS, SS)
   New objects $x = (x_1, \dots, x_d)$ are simply added to the cluster feature
              $N = N + 1$, $LS_i = LS_i + x_i$, $SS_i = SS_i + x_i^2$
       Anytime insert: on interruption, buffer the object in a local buffer CF

   [Figure: layout of an inner entry and a leaf entry, each holding $n^{(t)}$ and the vectors $LS^{(t)}$ and $SS^{(t)}$ with components 1 … d]
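
A minimal sketch of a cluster feature CF = (N, LS, SS) with the additive update from this slide. The class and method names are assumptions for illustration; the per-dimension variance uses the usual identity SS_i/N - (LS_i/N)^2.

```python
from typing import List, Sequence


class ClusterFeature:
    """Cluster feature CF = (N, LS, SS) of one micro-cluster."""

    def __init__(self, dim: int) -> None:
        self.n = 0.0            # number of objects N
        self.ls = [0.0] * dim   # linear sum LS per dimension
        self.ss = [0.0] * dim   # squared sum SS per dimension

    def insert(self, x: Sequence[float]) -> None:
        """Add one object: N += 1, LS_i += x_i, SS_i += x_i^2."""
        self.n += 1
        for i, xi in enumerate(x):
            self.ls[i] += xi
            self.ss[i] += xi * xi

    def merge(self, other: "ClusterFeature") -> None:
        """Cluster features are additive, so merging is component-wise."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def mean(self) -> List[float]:
        return [ls_i / self.n for ls_i in self.ls]

    def variance(self) -> List[float]:
        return [ss_i / self.n - (ls_i / self.n) ** 2
                for ls_i, ss_i in zip(self.ls, self.ss)]
```

Because all three components are plain sums, a buffer CF can later be merged into another entry without revisiting the original objects.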

Buffer and hitchhiker [Self-adaptive]




   Buffer: interrupt insertion – aggregate objects on interrupt
   Hitchhiker: resume insertion – take buffer along (if same way)
            At most two objects descend together
       Tree grows through splitting nodes starting from the leaf
       Entry structure: (CF, pointer, CFb)

       [Figure: example descent. Level 1: root; Level 2: hitchhike; Level 3: buffer; Level 4: insert]
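
The buffer and hitchhiker mechanism can be sketched as follows. This is a simplified illustration under loud assumptions: the tree shape is fixed (no node splits), an object reaching leaf level is merged into the closest leaf entry, the interruption is modelled as a maximum number of levels (`max_levels`), and all class and function names are hypothetical.

```python
from typing import List, Optional, Sequence


class CF:
    """Cluster feature (N, LS, SS)."""

    def __init__(self, dim: int) -> None:
        self.n, self.ls, self.ss = 0.0, [0.0] * dim, [0.0] * dim

    @classmethod
    def of(cls, x: Sequence[float]) -> "CF":
        cf = cls(len(x))
        cf.n, cf.ls, cf.ss = 1.0, list(x), [v * v for v in x]
        return cf

    def add(self, other: "CF") -> None:
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def mean(self) -> List[float]:
        return [v / self.n for v in self.ls]


class Entry:
    """Entry structure (CF, pointer, CFb); child is None for leaf entries."""

    def __init__(self, cf: CF, child: Optional[List["Entry"]] = None) -> None:
        self.cf, self.child, self.buffer = cf, child, CF(len(cf.ls))


def closest(node: List[Entry], cf: CF) -> Entry:
    target = cf.mean()
    return min(node, key=lambda e: sum((a - b) ** 2
                                       for a, b in zip(e.cf.mean(), target)))


def anytime_insert(root: List[Entry], point: Sequence[float], max_levels: int) -> str:
    obj, hitch, node, level = CF.of(point), None, root, 1
    dim = len(obj.ls)
    while True:
        entry = closest(node, obj)
        if hitch is not None and closest(node, hitch) is not entry:
            # Hitchhiker goes a different way: drop it off at its own entry.
            other = closest(node, hitch)
            other.cf.add(hitch)
            if other.child is not None:
                other.buffer.add(hitch)
            hitch = None
        entry.cf.add(obj)
        if hitch is not None:
            entry.cf.add(hitch)
        if entry.child is None:
            return "inserted"  # reached leaf level
        if level >= max_levels:
            # Interrupted: aggregate object (and hitchhiker) in the local buffer.
            entry.buffer.add(obj)
            if hitch is not None:
                entry.buffer.add(hitch)
            return "buffered"
        if entry.buffer.n > 0:
            # Resume an earlier insertion: take the buffer along as hitchhiker.
            hitch = hitch if hitch is not None else CF(dim)
            hitch.add(entry.buffer)
            entry.buffer = CF(dim)
        node, level = entry.child, level + 1
```

In this sketch an entry's CF always equals the sum of its children's CFs plus its buffer, since a buffered object is counted in the entry's CF when it is buffered and counted in a child's CF only once it hitchhikes down.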

Maintaining an up-to-date view [Drift&Novelty]




       Goal: Compatible with existing work on drift and novelty
              New leaf entries get a unique ID
              Aging by an exponential decay function $w(\Delta t) = \beta^{-\lambda \Delta t}$
       Benefits of the employed decay function
              Avoid splits by reusing insignificant entries
              An entry’s CF still represents exactly its subtree and its buffer


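             Lemma 1 (ClusTree Invariant): For each inner entry $e_s$ with timestamp $t + \Delta t$
             and decay function $w(\Delta t) = 2^{-\lambda \Delta t}$ it holds that
             $$e_s.CF^{(t+\Delta t)} = \left( w(\Delta t) \cdot \sum_{i=1}^{s} e_{s_i}.CF^{(t)} \right) + e_s.buffer^{(t+\Delta t)}$$
             where the sum runs over the child entries $e_{s_i}$ of $e_s$.
             [Proof in the paper.]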
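
The decay function and the invariant can be checked numerically on a small one-dimensional example. This is a sketch under assumptions: cluster features are plain (N, LS, SS) tuples, the helper names are hypothetical, and the children are assumed to still carry their CFs from time t while the parent is updated at time t + Δt.

```python
from typing import Tuple

CFTuple = Tuple[float, float, float]  # (N, LS, SS) for one dimension


def decay_weight(delta_t: float, lam: float) -> float:
    """w(delta_t) = 2^(-lambda * delta_t)."""
    return 2.0 ** (-lam * delta_t)


def scale(cf: CFTuple, w: float) -> CFTuple:
    return (w * cf[0], w * cf[1], w * cf[2])


def add(a: CFTuple, b: CFTuple) -> CFTuple:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def updated_parent(parent_at_t: CFTuple, new_object: float,
                   delta_t: float, lam: float) -> CFTuple:
    """Decay the parent's CF to time t + delta_t, then add the buffered object."""
    x = new_object
    return add(scale(parent_at_t, decay_weight(delta_t, lam)), (1.0, x, x * x))


# Two children at time t; the parent summarizes them and has an empty buffer.
children = [(2.0, 4.0, 10.0), (1.0, 5.0, 25.0)]
parent = add(children[0], children[1])

# One time unit later an object x = 2 is buffered in the parent (lambda = 1).
parent_new = updated_parent(parent, 2.0, delta_t=1.0, lam=1.0)
buffer_new = (1.0, 2.0, 4.0)

# Right-hand side of the invariant: decayed children plus current buffer.
rhs = add(scale(parent, decay_weight(1.0, 1.0)), buffer_new)
```

Here `parent_new` and `rhs` coincide, which is the statement of the invariant for this toy case.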


Extensions of the ClusTree

       Insertion of aggregates
          for extremely fast streams


       Iterative depth first descent
           for slower streams


       Local look ahead
          to reduce overlapping


       Explicit noise handling
          and noise to cluster events
       [Figure, panels a) to d): nodes with entries labelled e and n, and leaves labelled L]

Evaluation – anytime clustering and aggregation

       Dataset: Forest Covertype
       Anytime clustering (90,000 pps)
              88% purity on leaf level
              Purity on higher levels corresponds to faster streams
              >70% purity starting three levels under root

       Aggregation (varying streams)
              Purity drops under 70% at 150,000 pps
              Aggregation significantly improves the purity on the leaf level
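
For readers unfamiliar with the measure, a common definition of purity is sketched below: each cluster is credited with its majority class, and the credited objects are divided by the total number of objects. The slides do not spell out the exact variant used in the evaluation, so this definition and the function name are assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def purity(assignments: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Purity of a clustering given (cluster_id, class_label) pairs."""
    per_cluster: Dict[Hashable, Counter] = {}
    total = 0
    for cluster_id, label in assignments:
        per_cluster.setdefault(cluster_id, Counter())[label] += 1
        total += 1
    if total == 0:
        raise ValueError("no assignments given")
    # Each cluster contributes the size of its majority class.
    majority = sum(max(counts.values()) for counts in per_cluster.values())
    return majority / total
```

For example, a cluster holding two objects of class a and one of class b, plus a second cluster with a single object of class b, gives a purity of 3/4.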

Evaluation – adaptive clustering




       Setup for constant streams
              ClusTree: stream speed → maintainable #MC
              DenStream [SDM06] & CluStream [VLDB03]: #MC → processable pps
       ClusTree results: #MC is exponential (#dists is logarithmic)

Agenda



  I.      The Anytime principle
           Anytime algorithms for stream data mining


  II.     The ClusTree
           Self-adaptive anytime stream clustering


  III. The MOA Framework
           An open source framework for stream mining algorithms





The MOA framework

       Extensible open source software
         – Data generators, file streams

         – Stream mining algorithms

         – Measure collection

       Supported stream mining tasks
         – Stream clustering, stream
              classification, outlier detection, …

       Repeatable/benchmark settings

       In collaboration with

References

      Anytime SVM: DeCoste: Anytime Query-Tuned Kernel Machines via Cholesky
       Factorization. SDM, 2003
      DeCoste et al.: Fast query-optimized kernel machine classification via incremental
       approximate nearest support vectors. ICML, 2003
      Bayes (continuous data): Seidl et al.: Indexing density models for incremental
       learning and anytime classification on data streams. EDBT, 2009
      Bayes (categorical): Yang et al.: Classifying under computational resource constraints:
       anytime classification using probabilistic estimators. Machine Learning, 2007
      Anytime Nearest Neighbor: Ueno et al.: Anytime Classification Using the Nearest
       Neighbor Algorithm with Applications to Stream Mining. ICDM, 2006
      Anytime + constant: Kranen et al.: Harnessing the strengths of anytime algorithms
       for constant data streams. DMKD Journal, 2009
      ClusTree: Kranen et al.: Self-Adaptive Anytime Stream Clustering. ICDM 2009
      A complete list of references including stream clustering, MOA, evaluation, etc.:
       Kranen: Anytime Algorithms for Stream Data Mining. PhD Thesis, RWTH Aachen, 2011
                                                                                                               23

More Related Content

Viewers also liked

Conditional identity based broadcast proxy re-encryption and its application ...
Conditional identity based broadcast proxy re-encryption and its application ...Conditional identity based broadcast proxy re-encryption and its application ...
Conditional identity based broadcast proxy re-encryption and its application ...
Shakas Technologies
 
Contributory broadcast encryption with efficient encryption and short ciphert...
Contributory broadcast encryption with efficient encryption and short ciphert...Contributory broadcast encryption with efficient encryption and short ciphert...
Contributory broadcast encryption with efficient encryption and short ciphert...
Shakas Technologies
 
Conditional identity based broadcast proxy re-encryption and its application ...
Conditional identity based broadcast proxy re-encryption and its application ...Conditional identity based broadcast proxy re-encryption and its application ...
Conditional identity based broadcast proxy re-encryption and its application ...
ieeepondy
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
Krish_ver2
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
hadifar
 
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
error007
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
SSA KPI
 
Clustering: A Survey
Clustering: A SurveyClustering: A Survey
Clustering: A Survey
Raffaele Capaldo
 
Ppt 1
Ppt 1Ppt 1
Application of Clustering in Data Science using Real-life Examples
Application of Clustering in Data Science using Real-life Examples Application of Clustering in Data Science using Real-life Examples
Application of Clustering in Data Science using Real-life Examples
Edureka!
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Jewel Refran
 
Stamp enabling privacy preserving location proofs for mobile users
Stamp enabling privacy preserving location proofs for mobile usersStamp enabling privacy preserving location proofs for mobile users
Stamp enabling privacy preserving location proofs for mobile users
LeMeniz Infotech
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
Prashanth Guntal
 
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCAREK-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
International Journal of Technical Research & Application
 
Optics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structureOptics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structure
Rajesh Piryani
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster Analysis
Derek Kane
 
Best topics for seminar
Best topics for seminarBest topics for seminar
Best topics for seminar
shilpi nagpal
 
Slideshare Powerpoint presentation
Slideshare Powerpoint presentationSlideshare Powerpoint presentation
Slideshare Powerpoint presentation
elliehood
 
Slideshare ppt
Slideshare pptSlideshare ppt
Slideshare ppt
Mandy Suzanne
 

Viewers also liked (19)

Conditional identity based broadcast proxy re-encryption and its application ...
Conditional identity based broadcast proxy re-encryption and its application ...Conditional identity based broadcast proxy re-encryption and its application ...
Conditional identity based broadcast proxy re-encryption and its application ...
 
Contributory broadcast encryption with efficient encryption and short ciphert...
Contributory broadcast encryption with efficient encryption and short ciphert...Contributory broadcast encryption with efficient encryption and short ciphert...
Contributory broadcast encryption with efficient encryption and short ciphert...
 
Conditional identity based broadcast proxy re-encryption and its application ...
Conditional identity based broadcast proxy re-encryption and its application ...Conditional identity based broadcast proxy re-encryption and its application ...
Conditional identity based broadcast proxy re-encryption and its application ...
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
 
Clustering: A Survey
Clustering: A SurveyClustering: A Survey
Clustering: A Survey
 
Ppt 1
Ppt 1Ppt 1
Ppt 1
 
Application of Clustering in Data Science using Real-life Examples
Application of Clustering in Data Science using Real-life Examples Application of Clustering in Data Science using Real-life Examples
Application of Clustering in Data Science using Real-life Examples
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Stamp enabling privacy preserving location proofs for mobile users
Stamp enabling privacy preserving location proofs for mobile usersStamp enabling privacy preserving location proofs for mobile users
Stamp enabling privacy preserving location proofs for mobile users
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
 
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCAREK-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
 
Optics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structureOptics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structure
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster Analysis
 
Best topics for seminar
Best topics for seminarBest topics for seminar
Best topics for seminar
 
Slideshare Powerpoint presentation
Slideshare Powerpoint presentationSlideshare Powerpoint presentation
Slideshare Powerpoint presentation
 
Slideshare ppt
Slideshare pptSlideshare ppt
Slideshare ppt
 

Recently uploaded

Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 

Presentation ucb 2012

  • 1. The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining. Philipp Kranen (1), Ira Assent (2), Corinna Baldauf (1), Thomas Seidl (1). (1) Data Management and Data Exploration Group, RWTH Aachen University, Germany; (2) Department of Computer Science, Aarhus University, Denmark
  • 2. Motivating examples [Figure: classification pipeline with the stages pre-classifier, full classifier, and professional decision; cases are labeled emergency or normal]
  • 3. Applications and tasks [Figure: stream mining tasks (modeling, classification, outlier detection) arranged by data rate (constant or varying)]
  • 4. Agenda
      I. The Anytime principle: anytime algorithms for stream data mining
      II. The ClusTree: self-adaptive anytime stream clustering
      III. The MOA Framework: an open source framework for stream mining algorithms
  • 5. Definitions I
      • Stream: A stream $S : \mathbb{N} \to D \times T$, $i \mapsto (o_i, t_i)$, is an infinite sequence of objects $o_i \in D$ from a d-dimensional input space $D$, where $t_i \in T$ with $t_i \le t_{i+1}\ \forall i$ is the discrete arrival time of object $o_i$.
      • Inter-arrival time: The inter-arrival time between two consecutive objects $o_i$ and $o_{i+1}$ is denoted as $\Delta t_i$, i.e. $0 \le \Delta t_i = t_{i+1} - t_i \in T$.
      • Constant and varying streams: A stream $S$ is called constant $\leftrightarrow \Delta t_i = \Delta t_j\ \forall i, j$; otherwise it is called varying.
      • Stream algorithms
          – Online algorithms: the input is given one at a time
          – Budget algorithms: tailored to a specific time budget b
          – Anytime algorithms: provide a result after any amount of processing time
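The definitions of inter-arrival time and constant streams can be illustrated on a finite prefix of a stream. This is a minimal sketch: the function names and the floating-point tolerance are assumptions made for illustration, and a real stream is infinite, so only an observed prefix can ever be checked.

```python
from typing import List, Sequence


def inter_arrival_times(arrival_times: Sequence[float]) -> List[float]:
    """Return dt_i = t_{i+1} - t_i for each pair of consecutive arrival times."""
    deltas = [later - earlier
              for earlier, later in zip(arrival_times, arrival_times[1:])]
    if any(dt < 0 for dt in deltas):
        raise ValueError("arrival times must be non-decreasing")
    return deltas


def is_constant_stream(arrival_times: Sequence[float],
                       tolerance: float = 1e-9) -> bool:
    """Check whether all inter-arrival times of the observed prefix are equal.

    The tolerance is an assumption of this sketch to absorb rounding errors.
    """
    deltas = inter_arrival_times(arrival_times)
    return all(abs(dt - deltas[0]) <= tolerance for dt in deltas)
```

For example, the arrival times `[0, 2, 4, 6]` form a constant prefix with inter-arrival time 2, while `[0, 1, 4, 6]` belongs to a varying stream.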
  • 6. Definitions II
      • Budget algorithms: tailored to a specific time budget
          – Available time < budget → no result
          – Available time > budget → idle times
      • How should stream processing be done? [Figure: result quality plotted over time]
          – Little time → fast result
          – More time → use it to improve the result
      • Anytime algorithms: provide a result after any time. For a given input, an anytime algorithm can provide a first result after a very short initialization time, and it uses additional time to improve its result. The algorithm is interruptible after any time and will deliver the best result obtained until the point of interruption.
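The anytime contract described above (a quick first result, improvement with more time, interruptible at any point) can be sketched with a generator. The nearest-candidate search below is only a stand-in task chosen for illustration, and the step budget `max_steps` is an assumption that replaces a real interrupt such as the arrival of the next stream object.

```python
from typing import Iterator, Sequence, Tuple


def anytime_nearest(query: float,
                    candidates: Sequence[float]) -> Iterator[Tuple[float, float]]:
    """Yield the best (candidate, distance) found so far after every step."""
    best = candidates[0]
    best_dist = abs(query - best)
    yield best, best_dist  # first result after a very short initialization
    for candidate in candidates[1:]:
        dist = abs(query - candidate)
        if dist < best_dist:  # additional time can only improve the result
            best, best_dist = candidate, dist
        yield best, best_dist


def run_until_interrupted(algorithm: Iterator[Tuple[float, float]],
                          max_steps: int) -> Tuple[float, float]:
    """Consume refinement steps until the (hypothetical) step budget runs out."""
    result = None
    for step, result in enumerate(algorithm, start=1):
        if step >= max_steps:
            break  # interruption: deliver the best result obtained so far
    return result
```

With the query `5.0` and the candidates `[10.0, 4.0, 7.0, 5.5]`, an interruption after one step returns `(10.0, 5.0)`, after two steps `(4.0, 1.0)`, and a full run returns `(5.5, 0.5)`.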
  • 7. Anytime algorithms on constant streams
      • Can we do better than using all available time? Yes we can!
      • [Figure: constant data stream with object types 1, 2, …, m; labels: arrival interval t_a, t_f, t_d]
      • Distribute computation time according to confidence values
          – Spend less time on confident items
          – Use additional time for uncertain objects
      • Prerequisites
          – Anytime algorithm
          – Confidence measure
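The idea of distributing computation time according to confidence can be sketched as a greedy allocation. This is not the authors' scheme: the precomputed confidence curves, the `target_confidence` threshold, and the rule "refine the least confident item next" are all assumptions made for illustration.

```python
from typing import List, Sequence


def distribute_steps(confidence_curves: Sequence[Sequence[float]],
                     total_steps: int,
                     target_confidence: float = 0.9) -> List[int]:
    """Greedily assign refinement steps to the least confident items.

    confidence_curves[i][k] is the (hypothetical) confidence of item i after
    k + 1 refinement steps of an anytime algorithm.
    """
    n_items = len(confidence_curves)
    if total_steps < n_items:
        raise ValueError("need at least one step per item")
    steps = [1] * n_items  # every item gets its quick first result
    remaining = total_steps - n_items

    def current_confidence(i: int) -> float:
        return confidence_curves[i][steps[i] - 1]

    while remaining > 0:
        # Items that are still uncertain and can still be refined.
        open_items = [i for i in range(n_items)
                      if steps[i] < len(confidence_curves[i])
                      and current_confidence(i) < target_confidence]
        if not open_items:
            break  # everything is confident enough; leftover time stays unused
        most_uncertain = min(open_items, key=current_confidence)
        steps[most_uncertain] += 1
        remaining -= 1
    return steps
```

With the curves `[[0.95, 0.97, 0.99], [0.5, 0.7, 0.92], [0.6, 0.8, 0.85]]` and six steps in total, the confident first item keeps its single step while the two uncertain items receive three and two steps.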
  • 8. Existing anytime classification approaches
      • Anytime support vector machines
      • Anytime nearest neighbor classification
      • Anytime Bayesian classification
          – Categorical data
          – Continuous data
      • Others
          – Anytime induction of decision trees
          – Anytime A* algorithm
          – Anytime clustering
          – Anytime outlier detection
      [References on last slide.]
  • 9. Sampling, buffering, anytime clustering
      • What about sampling?
          – Not appropriate for classification or outlier detection.
      • What about buffering?
          – Durations of bursts are unknown.
      • Why anytime clustering?
          – …
          – “Smart buffering”
          – Use micro-clusters as input for further analysis
          – Provide constant (maximal) granularity at regular intervals
  • 10. Agenda
      I. The Anytime principle: anytime algorithms for stream data mining
      II. The ClusTree: self-adaptive anytime stream clustering
      III. The MOA Framework: an open source framework for stream mining algorithms
  • 11. Problem statement
      • Clustering is a frequently used technique
          – Provides an overview, reduces amount of data, groups similar objects
      • Streaming scenario:
          – Use summaries (micro-clusters) as input for further analysis
          – But: endless amounts of data (streams) are hard to handle
      • Stream clustering challenges:
          – Single pass clustering [Anytime]
          – Limited time, varying time allowance
          – Limited memory, yet least information loss [Fine grained]
          – Evolving data [Drift & Novelty]
          – Flexible number and size of clusters [Self-adaptive]
  • 12. Related work
      • Stream clustering approaches and paradigms
          – Convex clustering approaches (k-center)
          – Density-based, grid-based approaches
          – Kernels, graphs, fractal dimensions, …
          – Process chunks, merge results
          – Maintain list, remove oldest or merge closest pair
          – Online and offline component
      • All approaches have to restrict themselves to the worst case time
  • 13. Goals
      • Anytime clustering [Anytime]
          – Don't miss any point, no matter at which speed
      • Adaptive model size [Self-adaptive]
          – Don't restrict the model to worst case assumptions
      • Fine grained representation [Fine grained]
          – Provide more detailed input for the offline component
      • Compatible with existing work on drift and novelty [Drift & Novelty]
          – Aging / decay
          – Snapshots / drift & novelty
  • 14. ClusTree – basic idea
      • Cluster features CF = (N, LS, SS) represent micro-clusters
          – Allow computing statistics like mean and variance
      • Maintain a balanced hierarchical data structure [Figure: tree annotated with "less time" and "more time"]
          – Insert new object into the closest subtree
          – Insertion stops if next object arrives
      • Most detailed model is stored at leaf level
      • Tree (= model) grows if more time is available
  • 15. ClusTree structure and anytime insert [Fine grained] [Anytime]
      • Hierarchy of micro-clusters CF = (N, LS, SS)
      • New objects $(x_1, \ldots, x_d)$ are simply added to the cluster feature
          – $N = N + 1,\ LS_i = LS_i + x_i,\ SS_i = SS_i + x_i^2$
      • Anytime insert: buffer object locally in a local buffer CF
      • [Figure: layout of an inner entry and a leaf entry, with components $n^{(t)}$, $LS_1^{(t)}, \ldots, LS_d^{(t)}$, $SS_1^{(t)}, \ldots, SS_d^{(t)}$ and buffer counterparts marked with subscript b]
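The cluster feature update above, together with the mean and variance it allows to compute, can be sketched as a small class. The class and method names are assumptions for illustration; `merge` relies on the additivity of the three components, which is what makes aggregating several objects into one CF possible.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ClusterFeature:
    """Cluster feature CF = (N, LS, SS) of a micro-cluster in d dimensions."""
    dims: int
    n: float = 0.0
    ls: List[float] = field(default_factory=list)  # linear sum per dimension
    ss: List[float] = field(default_factory=list)  # squared sum per dimension

    def __post_init__(self) -> None:
        if not self.ls:
            self.ls = [0.0] * self.dims
        if not self.ss:
            self.ss = [0.0] * self.dims

    def insert(self, x: Sequence[float]) -> None:
        """N = N + 1, LS_i = LS_i + x_i, SS_i = SS_i + x_i^2."""
        self.n += 1
        for i, value in enumerate(x):
            self.ls[i] += value
            self.ss[i] += value * value

    def merge(self, other: "ClusterFeature") -> None:
        """Add another CF component-wise (CFs are additive)."""
        self.n += other.n
        for i in range(self.dims):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def mean(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def variance(self) -> List[float]:
        """Per-dimension variance SS_i / N - (LS_i / N)^2."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.ls, self.ss)]
```

Inserting the points `[1, 2]` and `[3, 6]` into an empty two-dimensional CF gives `N = 2`, `LS = [4, 8]`, `SS = [10, 40]`, a mean of `[2, 4]`, and a variance of `[1, 4]`.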
  • 16. Buffer and hitchhiker [Self-adaptive]
      • Buffer: interrupt insertion
          – Aggregate objects on interrupt
      • Hitchhiker: resume insertion
          – Take buffer along (if same way)
          – Maximally two objects to descend with
      • Tree grows through splitting nodes starting from the leaf
      • Entry structure: (CF, pointer, CF_b)
      • [Figure: descent through the tree levels. Level 1: root; Level 2: hitchhike; Level 3: buffer; Level 4: insert; the destinations of the descending objects are marked]
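The buffer and hitchhiker mechanism can be sketched as follows. This is a simplified illustration, not the authors' implementation: data is one-dimensional, the interrupt is modeled by a hypothetical `max_levels` budget instead of the arrival of the next object, and at leaf level the object is merged into the closest leaf entry, so node splits and the creation of new entries are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CF:
    """One-dimensional cluster feature (N, LS, SS), for brevity."""
    n: float = 0.0
    ls: float = 0.0
    ss: float = 0.0

    def add(self, other: "CF") -> None:
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def mean(self) -> float:
        return self.ls / self.n

    @staticmethod
    def of(x: float) -> "CF":
        return CF(1.0, x, x * x)


@dataclass
class Entry:
    """Entry structure (CF, pointer, CF_b); child is None for leaf entries."""
    cf: CF
    child: Optional["Node"] = None
    buffer: CF = field(default_factory=CF)


@dataclass
class Node:
    entries: List[Entry]


def closest(node: Node, item: CF) -> Entry:
    return min(node.entries, key=lambda e: abs(e.cf.mean() - item.mean()))


def anytime_insert(root: Node, x: float, max_levels: int) -> None:
    item, hitchhiker, node, level = CF.of(x), CF(), root, 1
    while True:
        entry = closest(node, item)
        if hitchhiker.n > 0:
            hh_entry = closest(node, hitchhiker)
            if hh_entry is not entry:
                # Different way: the hitchhiker stays behind at its own entry.
                hh_entry.cf.add(hitchhiker)
                if hh_entry.child is not None:
                    hh_entry.buffer.add(hitchhiker)
                hitchhiker = CF()
        # An entry's CF covers its subtree and its buffer, so update it first.
        entry.cf.add(item)
        entry.cf.add(hitchhiker)
        if entry.child is None:
            return  # leaf level reached (simplified: merged into the entry)
        if level >= max_levels:
            # Interrupt: aggregate what we carry in the local buffer.
            entry.buffer.add(item)
            entry.buffer.add(hitchhiker)
            return
        # Resume: take the buffered aggregate along as hitchhiker.
        hitchhiker.add(entry.buffer)
        entry.buffer = CF()
        node, level = entry.child, level + 1
```

In a two-level tree, an insert that is interrupted at the root leaves its object in the buffer of the closest root entry. The next insert that descends through the same entry picks the buffer up as hitchhiker and carries it to the leaf level, so at most two objects descend at a time.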
  • 17. Maintaining an up-to-date view [Drift & Novelty]
      • Goal: compatible with existing work on drift and novelty
          – New leaf entries get a unique ID
          – Aging by an exponential decay function $w(\Delta t) = \beta^{-\lambda \Delta t}$
      • Benefits of the employed decay function
          – Avoid splits by reusing insignificant entries
          – An entry's CF still represents exactly its subtree and its buffer
      • Lemma 1 (ClusTree Invariant): For each inner entry $e_s$ with timestamp $t + \Delta t$ and decay function $w(\Delta t) = 2^{-\lambda \Delta t}$ it holds
        $$e_s.CF(t + \Delta t) = \Big( w(\Delta t) \cdot \sum_{i=1}^{s} e_{s_i}.CF(t) \Big) + e_s.\mathit{buffer}(t + \Delta t)$$
        [Proof in the paper.]
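The decay function and the right-hand side of the invariant can be evaluated numerically. In this sketch a cluster feature is a flat tuple `(n, LS_1, …, LS_d, SS_1, …, SS_d)` and all function names are assumptions for illustration; decaying a CF means multiplying every component by the weight $w(\Delta t)$.

```python
from typing import Sequence, Tuple

CF = Tuple[float, ...]  # (n, LS_1..LS_d, SS_1..SS_d), a layout assumed here


def decay_weight(dt: float, lam: float, base: float = 2.0) -> float:
    """w(dt) = base^(-lam * dt); with base 2 the half-life is 1 / lam."""
    return base ** (-lam * dt)


def decay(cf: CF, dt: float, lam: float) -> CF:
    """Scale every component of the cluster feature by the decay weight."""
    w = decay_weight(dt, lam)
    return tuple(w * component for component in cf)


def add(a: CF, b: CF) -> CF:
    return tuple(x + y for x, y in zip(a, b))


def inner_entry_cf(children_at_t: Sequence[CF], buffer_now: CF,
                   dt: float, lam: float) -> CF:
    """Right-hand side of the invariant: w(dt) * sum(children) + buffer."""
    total = children_at_t[0]
    for child in children_at_t[1:]:
        total = add(total, child)
    return add(decay(total, dt, lam), buffer_now)
```

With $\lambda = 0.5$ and $\Delta t = 2$ the weight is exactly one half, so two child CFs `(2, 3, 5)` and `(2, 21, 221)` plus a buffer `(1, 1.5, 2.25)` give the inner entry CF `(3, 13.5, 115.25)`.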
  • 18. Extensions of the ClusTree
      • Insertion of aggregates for extremely fast streams
      • Iterative depth first descent for slower streams
      • Local look ahead to reduce overlapping
      • Explicit noise handling and noise to cluster events
      • [Figure: four panels a) to d)]
  • 19. Evaluation – anytime clustering and aggregation (Forest Covertype)
      • Anytime clustering (90,000 points per second, pps)
          – 88% purity on leaf level
          – Purity on higher levels corresponds to faster streams
          – >70% purity starting three levels under root
      • Aggregation (varying streams)
          – Purity drops under 70% at 150,000 pps
          – Aggregation significantly improves the purity on the leaf level
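The slides do not state which purity variant is reported. As a hedged illustration, the sketch below uses one common definition: every cluster is credited with the objects of its majority class, and the credited objects are divided by the total number of objects. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable],
           class_labels: Sequence[Hashable]) -> float:
    """Fraction of objects that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("need one class label per clustered object")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty clustering")
    per_cluster: Dict[Hashable, Counter] = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values())
                         for counts in per_cluster.values())
    return majority_total / len(cluster_ids)
```

For five objects in clusters `[0, 0, 0, 1, 1]` with classes `['a', 'a', 'b', 'b', 'b']`, four objects agree with the majority class of their cluster, which gives a purity of 0.8.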
  • 20. Evaluation – adaptive clustering
      • Setup for constant streams
          – ClusTree: stream speed → maintainable number of micro-clusters (#MC)
          – DenStream [SDM06] & CluStream [VLDB03]: #MC → processable pps
      • ClusTree results: #MC is exponential (#dists is logarithmic)
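One way to read the last result is through the tree shape: in a balanced tree the number of leaf entries grows exponentially with the height, while one descent needs a number of distance computations that grows only linearly with the height, that is, logarithmically in the number of micro-clusters. The sketch below assumes that every node holds at most `fanout` entries; both parameters are hypothetical and not taken from the slides.

```python
def max_micro_clusters(fanout: int, height: int) -> int:
    """Upper bound on leaf entries: fanout^height in a balanced tree."""
    return fanout ** height


def max_distance_computations(fanout: int, height: int) -> int:
    """One descent compares the object with at most fanout entries per level."""
    return fanout * height
```

For a fanout of 3 and a height of 4 this gives up to 81 micro-clusters at a cost of at most 12 distance computations per insertion.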
  • 21. Agenda
      I. The Anytime principle: anytime algorithms for stream data mining
      II. The ClusTree: self-adaptive anytime stream clustering
      III. The MOA Framework: an open source framework for stream mining algorithms
  • 22. The MOA framework
      • Extensible open source software
          – Data generators, file streams
          – Stream mining algorithms
          – Measure collection
      • Supported stream mining tasks
          – Stream clustering, stream classification, outlier detection, …
      • Repeatable/benchmark settings
      • In collaboration with [partner logo]
  • 23. References
      • Anytime SVM:
          – DeCoste: Anytime Query-Tuned Kernel Machines via Cholesky Factorization. SDM, 2003
          – DeCoste et al.: Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. ICML, 2003
      • Bayes (continuous data): Seidl et al.: Indexing density models for incremental learning and anytime classification on data streams. EDBT, 2009
      • Bayes (categorical): Yang et al.: Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 2007
      • Anytime Nearest Neighbor: Ueno et al.: Anytime Classification Using the Nearest Neighbor Algorithm with Applications to Stream Mining. ICDM, 2006
      • Anytime + constant: Kranen et al.: Harnessing the strengths of anytime algorithms for constant data streams. DMKD Journal, 2009
      • ClusTree: Kranen et al.: Self-Adaptive Anytime Stream Clustering. ICDM, 2009
      • A complete list of references including stream clustering, MOA, evaluation, etc.: Kranen: Anytime Algorithms for Stream Data Mining. PhD Thesis, RWTH Aachen, 2011