Stream data mining & CluStream framework

This presentation is a learning report that I wrote and presented in my lab. I hope it is helpful to you.

Published in: Education, Technology

  1. A Framework for Clustering Evolving Data Streams
     - Yueshen Xu, Zhejiang Univ.
     - Middleware, CCNT, ZJU, 11/05/11
  2. Stream Processing, Event Stream Processing, Complex Event Processing
     - Data Stream Mining, Event Stream Processing, Complex Event Processing
     - Computing mode: in-memory computing, real-time computing, big data
     - Real applications: SAP…, Taobao, Yahoo: S4, Baidu, Brown & MIT
     - We are endeavoring!
  3. The paper itself
     - Published in VLDB 2003
     - Has been cited 635 times
     - By C.C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu (slide labels: Watson, UIUC, THU, UIC)
     - A standard, a bible, and obligatory reading
     - Expert, pundit!
  4. Data Stream & Streaming Data
     - What is a data stream? A data set that behaves like flowing water (I think)
     - An infinite process consisting of data which continuously evolves with time (C.C. Aggarwal, Jiawei Han et al.)
     - The formalized description: records $\overline{X_1} \ldots \overline{X_k} \ldots$ arrive at time stamps $T_1 \ldots T_k \ldots$, where each $\overline{X_i}$ is a multi-dimensional record and $T_i$ is the corresponding time stamp
     - The data model has a determining influence on the computing model. How?
  5. Principles
     - Very different from those for static data sets (my own thought)
     - One-pass scan
       - You have only one chance to see each item
     - No storage for primitive data
       - Infinite, another form of big data
       - No necessity
     - In-memory mining
       - Instantaneous
     - Preference for newly arriving data
       - User point of view
     - Approximate results
     - You must change your old ideas about traditional static data sets
     - Data model: ordered, countable, enumerable, infinite, no storage. Vital!
  6. The Framework
     - The methodology
       - The core value of the paper
     - Micro- and macro-clustering process
       - Necessity and inevitability under this frame
     - The pyramidal time frame
       - Balancing between accuracy and storage capability
       - The principle of approximate results
     - Cluster Feature Vector
       - Additivity
       - The micro-clusters
     - Is it sophistic?
  7. Cluster Feature Vector
     - Definition: CFV is defined as a tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$
       - $\overline{CF2^x}$, the sum of the squares of the data values (Sigma & Square)
       - $\overline{CF1^x}$, the sum of the data values (Sigma)
       - $CF2^t$, the sum of the squares of the time stamps (Sigma & Square)
       - $CF1^t$, the sum of the time stamps (Sigma)
       - $n$, the number of data items belonging to the cluster
     - Why were these chosen?
     - Why CFV?
       - User-oriented
       - Additivity
     - Not originally proposed by Prof. Han et al.
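The additivity noted on this slide can be illustrated with a minimal Python sketch. The class name `CFV`, its field names, and the helper methods are hypothetical choices for illustration; only the five components themselves come from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CFV:
    """Hypothetical container for the five CFV components."""
    cf2x: List[float]  # per-dimension sum of squared data values
    cf1x: List[float]  # per-dimension sum of data values
    cf2t: float        # sum of squared time stamps
    cf1t: float        # sum of time stamps
    n: int             # number of points in the cluster

    @classmethod
    def from_point(cls, x, t):
        """Build the CFV of a cluster holding a single point x seen at time t."""
        return cls([v * v for v in x], list(x), t * t, t, 1)

    def __add__(self, other):
        """Additivity: the CFV of a union is the component-wise sum."""
        return CFV(
            [a + b for a, b in zip(self.cf2x, other.cf2x)],
            [a + b for a, b in zip(self.cf1x, other.cf1x)],
            self.cf2t + other.cf2t,
            self.cf1t + other.cf1t,
            self.n + other.n,
        )

    def centroid(self):
        """The centroid is recoverable from the sums alone."""
        return [s / self.n for s in self.cf1x]


a = CFV.from_point([1.0, 2.0], t=1.0)
b = CFV.from_point([3.0, 4.0], t=2.0)
merged = a + b
print(merged.centroid(), merged.n)
```

Because every component is a plain sum, merging two clusters or adding one point never requires the raw data, which fits the no-storage principle from slide 5.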
  8. Pyramidal Time Frame (1)
     - Snapshots are classified into different orders, which can vary from 1 to $\log(T)$
     - Snapshots of the i-th order occur at time intervals of $\alpha^i$
     - Only the last $\alpha + 1$ snapshots of order i are stored
     - An example (worst case)
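A small Python sketch of the retention rule may help here. It assumes, following the CluStream paper, that a snapshot of order i is taken whenever the clock time is divisible by α^i and that the last α + 1 snapshots of each order are kept. The function name `pyramidal_snapshots`, the choice to start orders at 0, and the restriction to α ≥ 2 are assumptions of this sketch; redundancy between orders is deliberately not removed.

```python
from collections import defaultdict, deque


def pyramidal_snapshots(T, alpha=2):
    """Return {order: clock times still retained} after ticks 1..T.

    Assumption of this sketch: alpha >= 2 and orders start at 0.
    """
    if alpha < 2:
        raise ValueError("this sketch assumes alpha >= 2")
    # Each order keeps only its last alpha + 1 snapshots.
    kept = defaultdict(lambda: deque(maxlen=alpha + 1))
    for t in range(1, T + 1):
        order = 0
        # t belongs to every order i for which alpha**i divides t.
        while t % (alpha ** order) == 0:
            kept[order].append(t)
            order += 1
    return {i: list(times) for i, times in kept.items()}


frame = pyramidal_snapshots(55, alpha=2)
for order in sorted(frame):
    print(order, frame[order])
```

Recent history is covered densely by the low orders and older history sparsely by the high orders, so the total number of stored snapshots grows only logarithmically with T.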
  9. Pyramidal Time Frame (2)
     - The difference from his book
       - Divided by [formula missing], but not by [formula missing]
       - The number of orders is constant
     - Best case: no redundancy
     - Why? (my own thought)
       - The newer snapshots are kept, and the older ones are abandoned
       - The lower levels are not friendly to old snapshots, but the higher ones are
       - The scheme not only punishes the older snapshots, but also protects them
  10. Micro-Cluster (1): Procedure
      - [Figure: timeline with labels t, h, h', and T, showing micro-clusters (CFV) and snapshots]
  11. Micro-Cluster (2): Initialization
      - What is to be initialized? Micro-clusters
      - The number of micro-clusters maintained in each snapshot is constant
        - Determined by the amount of memory available
        - Larger than the natural number of clusters, but smaller than the number of data points in the data stream
        - Supported by the experiments. Reasonable?
      - Each cluster has a unique id
  12. Micro-Cluster (3): Updating
      - A new data point arrives; what is done? Join, Delete & Merge
      - Join: find the nearest one
        - Find the nearest micro-cluster and check whether the point falls within its boundary
        - RMS & Distance
      - Delete: find the oldest one
        - The average time stamp of the last m data points
        - Take the time stamps contained in the CFV as the approximation
      - Merge: find the closest two clusters
        - They don't explain how (idlist)
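The Join step can be sketched in Python as follows. This is a simplified illustration, not the paper's full procedure: micro-clusters are plain dictionaries with hypothetical keys, the boundary is taken as `boundary_factor` times the RMS deviation of the nearest cluster (the default of 2.0 is an assumption), and the Delete and Merge fallbacks are left to the caller.

```python
import math


def centroid(mc):
    """Centroid from the linear sums in a micro-cluster dict."""
    return [s / mc["n"] for s in mc["cf1x"]]


def rms_deviation(mc):
    """RMS deviation of the points from the centroid, computed from sums only."""
    n = mc["n"]
    variance = sum(s2 / n - (s1 / n) ** 2
                   for s1, s2 in zip(mc["cf1x"], mc["cf2x"]))
    return math.sqrt(max(variance, 0.0))


def try_join(point, t, clusters, boundary_factor=2.0):
    """Absorb the point into the nearest micro-cluster if it lies within
    that cluster's boundary. Return True if absorbed, False otherwise."""
    nearest = min(clusters, key=lambda mc: math.dist(point, centroid(mc)))
    distance = math.dist(point, centroid(nearest))
    if distance > boundary_factor * rms_deviation(nearest):
        return False  # caller must delete an old cluster or merge two
    nearest["cf1x"] = [s + v for s, v in zip(nearest["cf1x"], point)]
    nearest["cf2x"] = [s + v * v for s, v in zip(nearest["cf2x"], point)]
    nearest["cf1t"] += t
    nearest["cf2t"] += t * t
    nearest["n"] += 1
    return True


clusters = [
    # points (0, 0) and (2, 0) seen at times 1 and 2
    {"cf1x": [2.0, 0.0], "cf2x": [4.0, 0.0], "cf1t": 3.0, "cf2t": 5.0, "n": 2},
    # points (100, 100) and (102, 100) seen at times 3 and 4
    {"cf1x": [202.0, 200.0], "cf2x": [20404.0, 20000.0],
     "cf1t": 7.0, "cf2t": 25.0, "n": 2},
]
near_joined = try_join([2.5, 0.0], 5.0, clusters)
far_joined = try_join([50.0, 50.0], 6.0, clusters)
print(near_joined, far_joined)
```

A cluster holding a single point has an RMS deviation of zero, so a real implementation needs a separate rule for its boundary; this sketch does not cover that case.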
  13. Macro-Cluster (1): Find the approximate time stamp
      - What does the analyst do?
        - Find clusters over a past time horizon of h
        - Not user-friendly
      - All about the additivity property
        - Only two snapshots are necessary
        - I don't understand how they cope with fault tolerance
      - What is to be clustered? CFV
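Why two snapshots suffice can be shown with a short Python sketch: since CFV components are sums, the statistics for the horizon are the current snapshot minus the earlier one. The function name `subtract_snapshot` and the dictionary layout are hypothetical, the time-stamp sums are omitted for brevity, and matching is done by a single cluster id, which ignores the id lists of merged clusters.

```python
def subtract_snapshot(current, earlier):
    """Return per-cluster statistics for the points that arrived between
    the earlier snapshot and the current one.

    Both arguments map cluster id -> {"cf1x": [...], "cf2x": [...], "n": int}.
    """
    horizon = {}
    for cid, cur in current.items():
        old = earlier.get(cid)
        if old is None:
            # Cluster was created inside the horizon: keep it unchanged.
            horizon[cid] = dict(cur)
            continue
        n = cur["n"] - old["n"]
        if n <= 0:
            continue  # no new points in this cluster during the horizon
        horizon[cid] = {
            "cf1x": [a - b for a, b in zip(cur["cf1x"], old["cf1x"])],
            "cf2x": [a - b for a, b in zip(cur["cf2x"], old["cf2x"])],
            "n": n,
        }
    return horizon


current = {
    1: {"cf1x": [10.0], "cf2x": [60.0], "n": 4},
    2: {"cf1x": [3.0], "cf2x": [9.0], "n": 1},
}
earlier = {
    1: {"cf1x": [4.0], "cf2x": [10.0], "n": 2},
    3: {"cf1x": [7.0], "cf2x": [49.0], "n": 1},  # deleted since then
}
horizon = subtract_snapshot(current, earlier)
print(horizon)
```

The earlier snapshot is the stored one closest to the requested horizon, so the result is an approximation whose quality depends on the pyramidal time frame.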
  14. Macro-Cluster (2): Modified k-means
      - What has been modified in k-means?
        - The micro-clusters are treated as pseudo-points
        - The seeds are no longer picked randomly
        - The more points a micro-cluster has, the more important it is
      - The experiments are sufficient
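The modifications listed on this slide can be sketched in Python as a weighted k-means over micro-cluster centroids. The function name `weighted_kmeans`, the fixed iteration count, the sampling of seeds without replacement, and the rule of leaving an empty group's seed unchanged are all assumptions of this sketch rather than details taken from the paper.

```python
import math
import random


def weighted_kmeans(centroids, counts, k, iterations=10, rng=None):
    """Cluster micro-cluster centroids (pseudo-points) into k macro-clusters.

    counts[i] is the number of points in micro-cluster i; it weights both
    the seed selection and the centre updates. Requires k <= len(centroids).
    """
    rng = rng or random.Random(0)
    # Seeds are drawn with probability proportional to the point counts.
    remaining = list(range(len(centroids)))
    seeds = []
    for _ in range(k):
        pick = rng.choices(remaining,
                           weights=[counts[i] for i in remaining])[0]
        remaining.remove(pick)
        seeds.append(list(centroids[pick]))
    for _ in range(iterations):
        # Assign each pseudo-point to its nearest seed.
        groups = [[] for _ in seeds]
        for i, c in enumerate(centroids):
            j = min(range(k), key=lambda s: math.dist(c, seeds[s]))
            groups[j].append(i)
        # Move each seed to the count-weighted mean of its group.
        for j, members in enumerate(groups):
            if not members:
                continue  # keep the old seed for an empty group
            total = sum(counts[i] for i in members)
            seeds[j] = [
                sum(counts[i] * centroids[i][d] for i in members) / total
                for d in range(len(seeds[j]))
            ]
    return seeds


one_centre = weighted_kmeans([[0.0, 0.0], [4.0, 0.0]], [3, 1], k=1)
two_centres = weighted_kmeans([[0.0, 0.0], [4.0, 0.0]], [3, 1], k=2)
print(one_centre, sorted(two_centres))
```

With k = 1 the single centre lands at the count-weighted mean, which is closer to the heavier micro-cluster; this is the sense in which more points make a micro-cluster more important.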
  15. Q&A
