Distributed clustering from data streams

Simple objects that surround us are gaining sensors, computational power, and actuators, and are changing from static into adaptive and reactive systems. In this talk we discuss issues for knowledge discovery from distributed data streams generated by sensors with limited computational resources.
We present two clustering algorithms for two different tasks: clustering streaming data, which searches for dense regions of the data space, and clustering streaming data sources, which finds groups of sources that behave similarly over time. In the first setting, a cluster is defined to be a set of data points. In the second setting, a cluster is defined to be a set of sensors. We conclude the talk by presenting the lessons learned.

  1. Distributed Clustering for Smart Grids. Pedro Rodrigues, João Gama, University of Porto, Portugal. Project KDUS (PTDC/EIA-EIA/98355/2008). 4 September 2011, NGDM 11.
  2. Smart Grids. Smart grids: monitoring information on top of the electrical grid; an Internet-like communications layer. A shift in the way in which power grids are operated: intelligent monitoring in real time; interactive with consumers and markets; optimized to make the best use of resources and equipment; predictive rather than reactive; distributed across geographical and organizational boundaries.
  3. Smart Grids and Data Mining. A smart grid forms a network (possibly decomposable) of distributed sources of high-speed data streams. The dynamics of the data are unknown: the topology of the network changes over time, the number of meters tends to increase, and the context in which each meter acts evolves over time. Several data mining tasks are involved: prediction, cluster (profiling) analysis, event and anomaly detection, correlation analysis, etc. All these characteristics constitute real challenges and opportunities for applied research in distributed data mining. The requirement of near real-time analysis for multiple time horizons and multiple space aggregations makes these analyses an even harder research challenge.
  4. Outline. Rationale; clustering distributed data streams; local-to-global clustering of data sources.
  5. Rationale: Sensor Networks. Sensors are usually small, low-cost devices capable of sensing some attribute and of communicating with other sensors. Sensor networks can include thousands of sensors, each one capable of measuring, analysing and transmitting a stream of data. Resources are scarce, which reduces the possibilities for heavy computation, and the sensors operate under limited bandwidth.
  6. Rationale: Comprehension of Ubiquitous Data Streams. Comprehension: extract information about global interaction between sources by looking at the data they produce. When no other information is available, usual knowledge discovery approaches are based on unsupervised techniques (e.g. clustering). However, two different stream clustering problems exist: clustering streaming data points (e.g. meter readings) and clustering streaming data sources (e.g. meters).
  7. Rationale: Comprehension by Clustering Data Points. Information about dense regions of the sensor data space. (Figure: Clusters A, B and C.)
  8. Rationale: Comprehension by Clustering Data Sources. Information about groups of sensors that behave similarly over time. Possible scenario: sensors collecting electricity demand data from different homes, exploring similar consumption patterns. (Figure: Clusters A, B and C.)
  9. DGClust: Setting and Objective. Setting: sensors in a wide network produce streams of heterogeneously distributed data (each sensor produces a univariate stream of data). Objective: to keep a clustering of the observations that are created by aggregating each node's data as a feature in a centralized stream. (Figure: Clusters A, B and C.)
  10. DGClust: Problems and Research Question. Problems: high-speed data streams (excessive storage and processing); widely spread network (heavy communication); centralized clustering (high dimensionality); dynamic data (outdated models). Research question: does local discretization and representative clustering improve validity, communication and computation loads when applied to distributed sensor data streams?
  11. DGClust Methodology: Local Step. DGClust (Distributed Grid Clustering), local step: each sensor keeps an online ordinal discretization of its data, using Partition Incremental Discretization. (Figure: current state of one sensor.)
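The local step can be pictured with a much simpler stand-in than Partition Incremental Discretization. The sketch below is not that algorithm: it assumes a fixed, known value range split into equal-width bins, and the class name `OrdinalDiscretizer` and its fields are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class OrdinalDiscretizer:
    """Equal-width ordinal bins over an assumed fixed range [low, high]."""
    low: float
    high: float
    n_bins: int

    def state(self, x: float) -> int:
        """Map a reading to an ordinal bin index in [0, n_bins - 1]."""
        if x <= self.low:
            return 0                      # clamp readings below the range
        if x >= self.high:
            return self.n_bins - 1        # clamp readings above the range
        width = (self.high - self.low) / self.n_bins
        return int((x - self.low) / width)


# Example: four ordinal states over the range 0..100.
disc = OrdinalDiscretizer(low=0.0, high=100.0, n_bins=4)
states = [disc.state(v) for v in (10.0, 30.0, 99.9)]
```

The sensor then only needs to remember its current bin index, which is what the later steps treat as its state.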
  12. DGClust Methodology: Aggregating Step. DGClust (Distributed Grid Clustering), aggregating step: the central server gathers the global state of the network. Sensors whose state has not changed since the last communication do not transmit to the server. (Figure: per-sensor states, such as low, high, A, B and D, gathered into a global state.)
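A minimal sketch of the aggregating step, assuming each sensor already has some function that maps a reading to a discrete state. The `Sensor` and `Server` classes and the message format are hypothetical; the only behaviour taken from the slide is that a sensor stays silent while its state is unchanged.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple

Message = Tuple[str, Hashable]


class Sensor:
    def __init__(self, sensor_id: str, discretize: Callable[[float], Hashable]):
        self.sensor_id = sensor_id
        self.discretize = discretize
        self.last_sent: Optional[Hashable] = None

    def observe(self, x: float) -> Optional[Message]:
        """Return a message only when the discrete state has changed."""
        state = self.discretize(x)
        if state == self.last_sent:
            return None                   # unchanged state: nothing is sent
        self.last_sent = state
        return (self.sensor_id, state)


class Server:
    def __init__(self, sensor_ids: List[str]):
        # None marks sensors that have not reported yet.
        self.states: Dict[str, Optional[Hashable]] = {s: None for s in sensor_ids}

    def receive(self, message: Optional[Message]) -> None:
        if message is not None:
            sensor_id, state = message
            self.states[sensor_id] = state

    def global_state(self) -> tuple:
        """Current state of every sensor, in a fixed sensor order."""
        return tuple(self.states[s] for s in sorted(self.states))
```

Because the server keeps the last reported state of every sensor, a missing message simply means "same as before".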
  13. DGClust Methodology: Representative Step. DGClust (Distributed Grid Clustering), representative step: the server keeps a small list of the most frequent global states, using Space-Saving frequent items monitoring. (Figure: table of global states with their frequency counts, e.g. 523, 334 and 89.)
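The Space-Saving idea named on this slide can be sketched as follows: keep at most a fixed number of counters and, when a new item arrives with no free counter, replace the item with the smallest count and inherit that count plus one. This is a plain dictionary version written for readability; the class name and the `capacity` parameter are illustrative, and nothing here is taken from the DGClust implementation.

```python
from typing import Dict, Hashable, List, Tuple


class SpaceSaving:
    """Approximate frequent-item monitoring with a bounded number of counters."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts: Dict[Hashable, int] = {}

    def update(self, item: Hashable) -> None:
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Evict the least frequent item; the newcomer inherits its count + 1,
            # so counts may overestimate but never underestimate.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, n: int) -> List[Tuple[Hashable, int]]:
        """The n monitored items with the highest (estimated) counts."""
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Global states are hashable tuples of per-sensor states, so they can be passed to `update` directly, for example `monitor.update(("low", "high", "high"))`.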
  14. DGClust Methodology: Clustering Step. DGClust (Distributed Grid Clustering), clustering step: the server applies partitional clustering to the most frequent states, using Furthest Point Clustering followed by online adaptive k-means.
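Furthest point clustering is commonly implemented as a greedy selection: start from one point and repeatedly add the point that lies furthest from all centers chosen so far. The sketch below shows only that selection, assuming Euclidean distance and taking the first point as the initial center; the online adaptive k-means refinement mentioned on the slide is not shown, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def furthest_point_centers(points: List[Sequence[float]], k: int) -> List[Sequence[float]]:
    """Greedily pick up to k centers, each furthest from those already chosen."""
    if not points or k < 1:
        return []
    centers = [points[0]]                 # assumption: first point seeds the search
    # Distance from every point to its closest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        idx = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[idx])
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return centers
```

In the setting of this slide, the input points would be the frequent global states kept by the server, encoded as numeric vectors.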
  15. DGClust Example (k=5): Varying Resources.
  16. DGClust: Main Findings. Quality of results does not depend on the number of sensors. Communication reduction is constant with any number of sensors (as long as a direct link with the server exists). (Diagram labels: higher discretization granularity, higher clustering quality, lower communication reduction; higher number of sensors, more clustering updates.)
  17. L2GClust: Setting and Objective. Setting: sensors in a wide network produce streams of heterogeneously distributed data (each sensor produces a univariate stream of data). Objective: to keep, at each node, a clustering of the entire network of sensors. (Figure: Clusters A, B and C.)
  18. L2GClust Methodology: Local Sketch. Each sensor keeps a sketch of its most recent data. The common approach to focusing on recent data is sliding windows. Even within the sliding window, the most recent data point is usually more important than the oldest one, which is about to be discarded. In ubiquitous streaming data sources, such as sensor networks, resources like memory and processing power are scarce. Sometimes there is not even enough memory to store all the data points inside the window. Proposed sketch: a memoryless α-fading average.
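The slide does not give the formula for the α-fading average. One common formulation, assumed here, keeps a fading sum S ← x + α·S and a fading count N ← 1 + α·N and reports S / N, so older readings lose weight geometrically and only two numbers are stored. The class name is hypothetical.

```python
class FadingAverage:
    """Memoryless alpha-fading average (assumed formulation: S/N with fading sums)."""

    def __init__(self, alpha: float):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.fading_sum = 0.0
        self.fading_count = 0.0

    def update(self, x: float) -> float:
        """Fold in a new reading and return the current fading average."""
        self.fading_sum = x + self.alpha * self.fading_sum
        self.fading_count = 1.0 + self.alpha * self.fading_count
        return self.fading_sum / self.fading_count
```

With `alpha = 1.0` this reduces to the ordinary running mean; smaller values of `alpha` forget old data faster, which plays the role a sliding window would otherwise play.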
  19. L2GClust Example: Local Clustering. (Figure: example network whose nodes hold the sketch values 1, 10, 2, 100, 10, 11, 99, 95, 5, 10, 10, 3, 12 and 2.)
  20. L2GClust Example: Local Clustering. (Figure: the same network, with centroids {6.9, 98.0}.)
  21. L2GClust Methodology: Local Clustering. Instead of broadcasting the sketch of its own data, each node broadcasts its estimate of the global clustering. This estimate is computed by clustering the centroids of its direct neighbors' estimates of the global clustering, using Furthest Point Clustering. Basically, each node performs an ensemble of clusterings from its direct neighbors.
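One way to picture this local step for univariate sketches: a node pools its own centroids with those received from its direct neighbors, picks k seeds by the furthest-point rule, and averages the pooled centroids assigned to each seed. The final averaging is an assumption made for this sketch, since the slides only name furthest point clustering, and the function name is hypothetical.

```python
from typing import List, Sequence


def merge_neighbor_centroids(own: Sequence[float],
                             neighbors: Sequence[Sequence[float]],
                             k: int) -> List[float]:
    """Estimate k global centroids from own and neighbors' 1-D centroids."""
    pooled = list(own) + [c for estimate in neighbors for c in estimate]
    if not pooled or k < 1:
        return []
    # Furthest-point seeding: each new seed is the pooled value furthest
    # from all seeds chosen so far.
    seeds = [pooled[0]]
    while len(seeds) < min(k, len(pooled)):
        seeds.append(max(pooled, key=lambda v: min(abs(v - s) for s in seeds)))
    # Assumed refinement: average the pooled values closest to each seed.
    groups: List[List[float]] = [[] for _ in seeds]
    for v in pooled:
        nearest = min(range(len(seeds)), key=lambda i: abs(v - seeds[i]))
        groups[nearest].append(v)
    return sorted(sum(g) / len(g) for g in groups if g)


# Centroid sets taken from the example slides; the result is this sketch's
# estimate, not necessarily the value shown in the original figures.
estimate = merge_neighbor_centroids(
    own=[6.9, 98.0],
    neighbors=[[7.71, 97.1], [10.59, 97.38], [5.10, 95.00]],
    k=2,
)
```

Only centroid sets travel between neighbors, so the message size depends on k rather than on the amount of data each sensor has seen.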
  22. L2GClust Example: Local Clustering. (Figure: a node with centroids {6.9, 98.0} and neighboring estimates {7.71, 97.1}, {10.59, 97.38} and {5.10, 95.00}; numeric labels in the figure include 88.07, 87.37, 88.06, 4.19, 2.80, 3.74, 1.21, 3.58, 2.41, 3.50, 88.03, 86.31 and 88.12.)
  23. L2GClust Example: Local Clustering. (Figure: same content as the previous slide.)
  24. L2GClust Example: Local Clustering. (Figure: the same network showing centroids {6.9, 98.0} and {10.36, 97.1}, with the same numeric labels.)
  25. L2GClust: Evaluation Summary. Comparison was performed with the same strategy executed at a central server with access to all data. Measured outcomes were the agreement between a node's clustering estimate and the centralized clustering, averaged over all nodes. Proportion of agreement: cluster validity. Kappa statistic: cluster sanity, K = (P(A) - P(e)) / (1 - P(e)). State-of-the-art simulator: each sensor in the simulation (Visual Sense) generates a Gaussian stream with its mean drawn from one of the predefined Gaussian clusters. Evaluated parameters were number of clusters, network size, and cluster overlap.
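The kappa formula on this slide corrects the observed agreement P(A) for the agreement P(e) expected by chance. A small sketch of that computation, assuming the cluster labels of the node estimate and of the centralized clustering have already been aligned to a common naming (the slides do not say how this was done), and using the usual marginal-frequency estimate of P(e):

```python
from collections import Counter
from typing import Hashable, Sequence


def kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """K = (P(A) - P(e)) / (1 - P(e)) for two aligned label assignments."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and of equal length")
    # P(A): observed proportion of agreement.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P(e): agreement expected by chance from the label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0                        # convention: a single shared label
    return (p_a - p_e) / (1.0 - p_e)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is why the statistic serves as a sanity check alongside the raw proportion of agreement.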
  26. L2GClust: Results. Average proportion of agreement converges (with small fluctuations).
  27. L2GClust: Results. Sanity was confirmed, with the Kappa statistic always above 0.58.
  28. L2GClust: Results. Real data from electricity demand sensors showed the ability to improve as more examples are seen.
  29. L2GClust: Main Properties. Local sketch yields: memoryless storage of summaries; a straightforward adaptation to the most recent data; a reduction of the system's sensitivity to uncertainty. Local clustering with direct neighbors yields: no forwarding of information (reduced communication); low dimensionality of the clustering problem; sensitive information better preserved. Future work: evaluate L2GClust on smart grid sensor networks.
  30. Thank you!
