A Framework for Clustering Evolving Data Streams
  • Yueshen Xu, Zhejiang Univ.
  • Middleware, CCNT, ZJU
  • 11/05/11

Stream Processing, Event Stream Processing, Complex Event Processing
  • Data Stream Mining, Event Stream Processing, Complex Event Processing
  • Computing mode: in-memory computing, real-time computing, big data
  • Real applications: SAP…, Taobao, Yahoo: S4, Baidu, Brown & MIT
  • We are endeavoring!

The paper itself
  • Published in VLDB 2003
  • Cited 635 times
  • By C. C. Aggarwal (Watson), Jiawei Han (UIUC), Jianyong Wang (THU), and Philip S. Yu (UIC)
  • A standard, a bible, and obligatory reading
  • Expert, pundit!

Data Stream & Streaming Data
  • What is a data stream? A data set that behaves like flowing water (my own view)
  • An infinite process consisting of data which continuously evolves with time (C. C. Aggarwal, Jiawei Han et al.)
  • Formal description: the stream is a sequence of records $X_1, \ldots, X_k, \ldots$ arriving at time stamps $T_1, \ldots, T_k, \ldots$; each $X_i$ is a multi-dimensional record, and $T_i$ is the corresponding time stamp
  • The data model has a decisive influence on the computing model. How?

Principles
  • Very different from those for static data sets (my own thought)
  • One-pass scan: you get only one chance to see each data point
  • No storage for the raw data: the stream is infinite (another form of big data), and storing it is unnecessary
  • In-memory mining: instantaneous
  • Preference for newly arriving data: the user's point of view
  • Approximate results
  • You must change your old ideas about traditional static data sets
  • Data model: ordered, countable, enumerable, infinite, no storage. Vital!

The Framework
  • The methodology: the core value of the paper
  • Micro- and macro-clustering process: necessary and inevitable under this framework
  • The pyramidal time frame: balances accuracy against storage capacity (the principle of approximate results)
  • Cluster feature vector: additivity
  • The micro-clusters: is it sophistic?

Cluster Feature Vector
  • Definition: the CFV is the tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$
      • $\overline{CF2^x}$: the sum of the squares of the data values (per dimension)
      • $\overline{CF1^x}$: the sum of the data values (per dimension)
      • $CF2^t$: the sum of the squares of the time stamps
      • $CF1^t$: the sum of the time stamps
      • $n$: the number of data items belonging to the cluster
  • Why were these components chosen?
  • Why CFV? It is user-oriented and additive
  • Not originated by Prof. Han et al.: it extends the cluster feature vector of BIRCH
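
The additivity claimed on this slide can be illustrated with a minimal Python sketch. The class name `CFV`, its field names, and the helper methods are hypothetical choices made for this illustration; they are not taken from the paper or from any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CFV:
    """Cluster feature vector (CF2x, CF1x, CF2t, CF1t, n); names are illustrative."""
    cf2x: List[float]  # per-dimension sum of squared data values
    cf1x: List[float]  # per-dimension sum of data values
    cf2t: float        # sum of squared time stamps
    cf1t: float        # sum of time stamps
    n: int             # number of points in the cluster

    @classmethod
    def from_point(cls, x, t):
        """Build the CFV of a single point x observed at time t."""
        return cls([v * v for v in x], list(x), t * t, t, 1)

    def __add__(self, other):
        """Additivity: the CFV of a union is the component-wise sum."""
        return CFV(
            [a + b for a, b in zip(self.cf2x, other.cf2x)],
            [a + b for a, b in zip(self.cf1x, other.cf1x)],
            self.cf2t + other.cf2t,
            self.cf1t + other.cf1t,
            self.n + other.n,
        )

    def centroid(self):
        """The centroid is recoverable from the sums alone."""
        return [s / self.n for s in self.cf1x]


a = CFV.from_point([1.0, 2.0], t=1.0)
b = CFV.from_point([3.0, 4.0], t=2.0)
merged = a + b
```

Because every component is a plain sum, the raw points never need to be stored: the centroid, the spread, and the mean time stamp can all be derived from the five components.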

Pyramidal Time Frame (1)
  • Snapshots are classified into different orders, which can vary from 1 to log(T), where T is the clock time elapsed since the beginning of the stream
  • Snapshots of the i-th order occur at time intervals of $\alpha^i$, where $\alpha \geq 1$ is an integer
  • Only the last $\alpha + 1$ snapshots of order i are stored
  • An example: the worst case
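
A small Python sketch of the storage rule, under stated assumptions: the function name `pyramidal_snapshots` is hypothetical, the clock ticks in integer steps starting at 1, orders are numbered from 0 so that every tick yields a snapshot, and $\alpha \geq 2$ is required so that the loop terminates.

```python
from collections import deque


def pyramidal_snapshots(T, alpha=2):
    """Return {order: clock times still stored} after ticks 1..T."""
    if alpha < 2:
        raise ValueError("this sketch assumes alpha >= 2")
    kept = {}
    for t in range(1, T + 1):
        order, step = 0, 1
        # t is a snapshot of every order i for which alpha**i divides t.
        while t % step == 0:
            # A bounded deque silently drops the oldest entry,
            # so only the last alpha + 1 snapshots per order survive.
            kept.setdefault(order, deque(maxlen=alpha + 1)).append(t)
            order += 1
            step *= alpha
    return {i: list(q) for i, q in kept.items()}


frames = pyramidal_snapshots(8, alpha=2)
```

With `alpha=2` and `T=8`, clock time 8 is held at every order. This kind of overlap between orders is the redundancy that the next slide addresses.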

Pyramidal Time Frame (2)
  • The difference from his book: the clock time of an order-i snapshot is divisible by $\alpha^i$, but not by $\alpha^{i+1}$
  • The number of orders is constant
  • Best case: no redundancy
  • Why (my own thought)
      • The newer snapshots are kept, and the older ones are discarded
      • The lower orders are unfriendly to old snapshots, but the higher orders are not
      • The scheme not only punishes older snapshots but also protects them
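
A sketch of the non-redundant variant, assuming the rule is that a snapshot belongs to the single highest order whose interval divides its clock time. The function name `snapshot_order` is hypothetical, and orders are numbered from 0 in this sketch.

```python
def snapshot_order(t, alpha=2):
    """Largest i such that alpha**i divides t, for t >= 1 and alpha >= 2."""
    if t < 1 or alpha < 2:
        raise ValueError("expects t >= 1 and alpha >= 2")
    order = 0
    while t % alpha == 0:
        t //= alpha
        order += 1
    return order


orders = {t: snapshot_order(t) for t in (6, 7, 8, 48)}
```

Each clock time now maps to exactly one order, so no snapshot is stored twice.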

Micro-Cluster (1): Procedure
  • [Figure: timeline labelled t, h, h', and T, showing micro-clusters (CFV) and snapshots]

Micro-Cluster (2): Initialization
  • What is to be initialized? The micro-clusters
  • The number of micro-clusters maintained in each snapshot is constant
      • Determined by the amount of memory available
      • Larger than the natural number of clusters, but smaller than the number of data points in the data stream
  • Each cluster has a unique id
  • Supported by the experiments
  • Reasonable?

Micro-Cluster (3): Updating
  • A new data point arrives: what is done? Join, delete, or merge
  • Join: find the nearest micro-cluster and absorb the point if it falls within that cluster's boundary (RMS deviation and distance)
  • Delete: find the oldest micro-cluster
      • Judged by the average time stamp of the last m data points
      • Approximated from the time stamp statistics contained in the CFV
  • Merge: find the two closest clusters
      • They don't explain how
      • idlist
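
The join, delete, and merge decisions can be sketched in Python as follows. This is a simplified illustration, not the paper's exact procedure: all function and parameter names are hypothetical, a fixed `min_radius` stands in for the boundary of one-point clusters, and staleness is judged by the plain mean time stamp rather than by an estimate over the last m points. It also assumes at least two micro-clusters exist.

```python
import math


def new_cluster(x, t, cid):
    """Start a micro-cluster from a single point x seen at time t."""
    return {"ids": [cid], "n": 1, "cf1x": list(x),
            "cf2x": [v * v for v in x], "cf1t": t}


def centroid(c):
    return [s / c["n"] for s in c["cf1x"]]


def rms_radius(c):
    """RMS deviation of the absorbed points from the centroid."""
    var = sum(max(q / c["n"] - (s / c["n"]) ** 2, 0.0)
              for s, q in zip(c["cf1x"], c["cf2x"]))
    return math.sqrt(var)


def merge(a, b):
    """Additivity: merged statistics are component-wise sums; ids are concatenated."""
    return {"ids": a["ids"] + b["ids"], "n": a["n"] + b["n"],
            "cf1x": [p + q for p, q in zip(a["cf1x"], b["cf1x"])],
            "cf2x": [p + q for p, q in zip(a["cf2x"], b["cf2x"])],
            "cf1t": a["cf1t"] + b["cf1t"]}


def update(clusters, x, t, next_id, factor=2.0, min_radius=1.0, stale_before=0.0):
    """Process one point in place and return the action taken."""
    i = min(range(len(clusters)),
            key=lambda k: math.dist(centroid(clusters[k]), x))
    nearest = clusters[i]
    boundary = factor * max(rms_radius(nearest), min_radius)
    if math.dist(centroid(nearest), x) <= boundary:
        # Join: the point falls inside the nearest cluster's boundary.
        joined = merge(nearest, new_cluster(x, t, next_id))
        joined["ids"] = nearest["ids"]  # joining does not create a new id
        clusters[i] = joined
        return "join"
    # The point starts its own cluster, so room must be made for it.
    o = min(range(len(clusters)),
            key=lambda k: clusters[k]["cf1t"] / clusters[k]["n"])
    if clusters[o]["cf1t"] / clusters[o]["n"] < stale_before:
        del clusters[o]  # Delete: the oldest cluster is stale.
        action = "delete"
    else:
        # Merge: combine the two clusters with the closest centroids.
        _, a, b = min(
            (math.dist(centroid(clusters[a]), centroid(clusters[b])), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters)))
        clusters[a] = merge(clusters[a], clusters[b])
        del clusters[b]
        action = "merge"
    clusters.append(new_cluster(x, t, next_id))
    return action
```

In every branch the number of micro-clusters stays constant, which matches the fixed budget described on the initialization slide.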

Macro-Cluster (1): Find the approximate time stamp
  • What is the analyst's behavior? Finding clusters over a past time horizon of h
  • All about the additivity property
  • I don't understand how they cope with fault tolerance
  • Only two snapshots are necessary
  • What is to be clustered? The CFVs
  • Not user-friendly
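
Why two snapshots suffice can be sketched in Python. The sketch assumes each stored micro-cluster carries the list of ids it was built from, and that an older micro-cluster is subtracted from a current one when its ids are contained in the current id list. The function name `horizon_clusters` and the dictionary layout are hypothetical, and only `n` and the linear sum are tracked; the remaining CFV components would subtract in the same way.

```python
def horizon_clusters(current, older):
    """Statistics of the points that arrived between the two snapshots."""
    result = []
    for c in current:
        n, cf1x = c["n"], list(c["cf1x"])
        for o in older:
            # An older cluster contributes to c if its ids were absorbed into c.
            if set(o["ids"]) <= set(c["ids"]):
                n -= o["n"]
                cf1x = [a - b for a, b in zip(cf1x, o["cf1x"])]
        if n > 0:  # clusters with no points inside the horizon are dropped
            result.append({"ids": c["ids"], "n": n, "cf1x": cf1x})
    return result


older = [
    {"ids": [1], "n": 2, "cf1x": [2.0]},
    {"ids": [2], "n": 1, "cf1x": [5.0]},
    {"ids": [3], "n": 4, "cf1x": [8.0]},
]
current = [
    {"ids": [1, 2], "n": 5, "cf1x": [12.0]},  # clusters 1 and 2 were merged
    {"ids": [3], "n": 4, "cf1x": [8.0]},      # unchanged since the old snapshot
    {"ids": [4], "n": 1, "cf1x": [9.0]},      # created inside the horizon
]
recent = horizon_clusters(current, older)
```

The two inputs play the roles of the snapshot at the current time and the stored snapshot closest to the start of the horizon.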

Macro-Cluster (2): Modified k-means
  • What has been modified in k-means?
      • The micro-clusters are treated as pseudo-points
      • The seeds are no longer picked uniformly at random: the more points a micro-cluster holds, the more important it is
  • The experiments are sufficient
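
A minimal Python sketch of k-means over weighted pseudo-points, under these assumptions: each micro-cluster is reduced to its centroid and its point count, seeds are drawn with probability proportional to the count, and each macro-cluster centre is the weighted mean of its members. The function name `weighted_kmeans` and its parameters are hypothetical.

```python
import math
import random


def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Cluster pseudo-points (micro-cluster centroids) weighted by point count."""
    rng = random.Random(seed)
    # Seeding: draw k distinct pseudo-points, heavier ones being more likely.
    pool = list(range(len(points)))
    centers = []
    for _ in range(k):
        pick = rng.choices(pool, weights=[weights[i] for i in pool])[0]
        pool.remove(pick)
        centers.append(list(points[pick]))
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign each pseudo-point to its nearest centre.
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            totals[j] += w
            sums[j] = [s + w * v for s, v in zip(sums[j], p)]
        # Weighted mean; an empty cluster keeps its previous centre.
        centers = [[s / totals[j] for s in sums[j]] if totals[j] else centers[j]
                   for j in range(k)]
    return centers


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
weights = [1, 3, 2, 2]
centers = weighted_kmeans(points, weights, k=2)
```

The weights pull each centre toward the heavier micro-clusters, so the left-hand centre above ends up closer to 1.0 than to 0.0.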

Q&A

Stream data mining & CluStream framework
