Scalable Distributed Real-Time Clustering for Big Data Streams

Analyzing a possibly infinite flow of data and applying machine learning algorithms to it is a challenging task. This presentation introduces the SAMOA framework, which allows machine learning algorithms to be developed on top of any distributed stream processing engine. It also demonstrates the development and use of a distributed clustering algorithm based on CluStream, running on the Apache S4 platform.

Transcript

  • 1. Scalable Distributed Real-Time Clustering for Big Data Streams
    European Masters in Distributed Computing (EMDC)
    Student: Antonio Severien (severien@yahoo-inc.com)
    Supervisors: Albert Bifet (Yahoo! Research), Gianmarco De Francisci Morales (Yahoo! Research), Marta Arias (Universitat Politècnica de Catalunya)
  • 2. 27/06/13 Contributions
      • SAMOA (Scalable Advanced Massive Online Analysis)
          • Stream Processing Engine (SPE) abstraction framework
          • Machine learning libraries adapter layer
          • API for implementing data flow topologies
      • SAMOA Clustering Algorithm
          • Distributed stream clustering algorithm based on CluStream (*)
          • Parallelize the clustering task and scale up on resource usage
      (*) "A Framework for Clustering Evolving Data Streams", Aggarwal et al., 2003
  • 3. Motivation
      • How BIG is BIG in Big Data?
          • 2.5 quintillion bytes generated every day
          • 90% of today's data was generated in the last 2 years
          • Sensors, social networks, e-business, mobile, internet logs, etc.
      • Problems: the 3 Vs
          • Storage is unviable due to massive Volume
          • The production rate keeps increasing in Velocity
          • Different sources, different data and different types mean Variety
  • 4. Where is the Big Data?
      • Where is the food?
          • Databases?
          • Data warehouses?
          • Distributed databases?
          • Distributed file systems?
      • It's flowing online! It's streaming!
  • 5. Crunching Big Data
      • Map and Reduce
          • MapReduce/GFS
          • Hadoop/HDFS
      • Stream Processing Engines (SPE)
          • Apache S4
          • Twitter Storm
  • 6. Distributed Systems
      • Actors Model
          • Independent concurrent processes
          • Communicate asynchronously by message passing
      • MapReduce Model
          • Mappers: filtering and sorting
          • Reducers: summary and aggregation
          • Large volumes of distributed data
          • Iterative: map-reduce-map-reduce…
  • 7. Streaming
      • Streaming Model
          • One-pass processing: discard item after use
          • Low memory usage: store statistics and summaries
          • Unbounded flow of data
          • Evolving data sets
          • Limited processing time
          • Arrival order is not guaranteed
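The one-pass, low-memory idea on this slide can be illustrated with a running summary. The sketch below is not part of SAMOA or MOA; the class `RunningStats` and its methods are hypothetical names chosen for this example. It uses Welford's update to keep only a count, a mean and a sum of squared deviations, so each item can be discarded as soon as it has been consumed.

```java
// Hypothetical illustration of one-pass processing; not SAMOA or MOA code.
final class RunningStats {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the current mean

    // Consume one item and discard it: only the three summary fields are kept.
    void update(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() { return count; }

    double mean() { return mean; }

    // Population variance; 0 when no items have been seen yet.
    double variance() { return count > 0 ? m2 / count : 0.0; }
}
```

Memory use stays constant no matter how many items arrive, which is the property the streaming model relies on when the flow of data is unbounded.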
  • 8. Making sense
      • Machine Learning & Data Mining
          • Make sense, extract patterns and react accordingly
          • Train machines to "think"
          • Perceive behavior
          • Relations between similar information
      • Unsupervised Learning
          • Clustering algorithms
  • 9. Machine Learning Tools
      • Mahout
          • Machine learning framework used on top of Hadoop/HDFS
          • Batch processing with the MapReduce model
          • Open source with good community support
      • Massive Online Analysis (MOA)
          • Stream machine learning tool
          • Many algorithms implemented; based on WEKA
          • Single machine constraint
      • Jubatus
          • Distributed streaming machine learning framework
          • No clustering algorithms yet
          • No stream platform abstraction
  • 10. Scalable Advanced Massive Online Analysis (SAMOA)
      • Distributed data streaming machine learning framework
      • Stream Processing Engine (SPE) abstraction
          • Code once, run everywhere
          • Focus on distributed algorithm design
          • Fault tolerance, communication, consistency and availability are provided by the underlying distributed processing platform
      • Initial release provides integration with:
          • Apache S4
          • Twitter Storm
  • 11. Scalable Advanced Massive Online Analysis (SAMOA)
    [Figure: SAMOA architecture. Components: SAMOA algorithms and SAMOA-API; ML adapter (MOA, other ML libraries); SPE adapter (S4, Storm, other SPEs).]
  • 15. Apache S4
      • Distributed, semi-fault-tolerant stream processing platform
      • Based on the Actors model and inspired by the MapReduce model
      • Flexible data flow: any topology and processing unit can be built, beyond the mappers-and-reducers design
      • Specialized in processing events from a stream and emitting events into a stream
  • 16. Scalable Advanced Massive Online Analysis (SAMOA)
    [Figure: mapping of a SAMOA topology (task, stream source, streams, entrance processing item (EPI), processing items (PI)) onto an S4 app (stream source, streams, processing elements (PE)).]
  • 17. How to use?
      • Adding an SPE using the API
          • S4ProcessingItem: processing element wrapper
          • S4Stream: wrapper for an S4 stream
          • S4ComponentFactory: provides components specific to Apache S4, such as processing elements and streams
          • S4TopologyBuilder: creates the topology instances
      • Adding an algorithm and building a topology

            class SimpleTask {
                ...
                TopologyBuilder topologyBuilder = new TopologyBuilder();
                // Entrance processing item that wraps the source of the stream
                EntranceProcessingItem entranceProcessingItem =
                    topologyBuilder.createEntrancePI(new SourceProcessor());
                // Stream emitted by the entrance processing item
                Stream stream = topologyBuilder.createStream(entranceProcessingItem);
                // Downstream processing item, connected to the stream by key
                ProcessingItem processingItem = topologyBuilder.createPI(new Processor());
                processingItem.connectInputKey(stream);
                ...
            }
  • 18. Grouping the Best of All
      • Flexible programming model
      • Distributed stream processing engine abstraction
      • Integrated machine learning and data mining algorithms
      • Easy API to implement new algorithms and SPE adapters
  • 19. SAMOA Clustering Algorithm
      • Distributed stream clustering algorithm
      • Validates the SAMOA implementation and its integration with Apache S4 through the SAMOA-S4 adapter
      • Deployed on Apache S4
  • 20. Stream Clustering Algorithm
      • CluStream Framework
          • Based on k-means
          • Online phase (micro-clustering)
          • Offline phase (macro-clustering)
      • k-means: partition a set of data into k distinct clusters according to a similarity function
          • Minimizes a squared Euclidean distance objective function
            [Formula image not captured in the transcript.]
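The formula on this slide was an image and did not survive in the transcript. For reference, the standard k-means objective is sketched below; the notation ($C_j$ for the j-th cluster, $\mu_j$ for its centroid, $x_i$ for a data point) is an assumption and may differ from the symbols used on the original slide.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

k-means searches for the partition $C_1, \dots, C_k$ that minimizes $J$, the total squared Euclidean distance between each point and the centroid of its cluster.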
  • 21. K-means Clustering Algorithm
      • Advantages
          • Simple, fast and efficient
      • Known issues with k-means
          • Sensitive to initial seeding
          • Minimization problem is NP-hard even for simple configurations
              • For example, points in two dimensions, or only two clusters in general dimension
          • Global optimum not guaranteed
          • Good for spherical clusters, not good for arbitrary shapes
  • 22. Distributed Stream Clustering
      • Online micro-clustering
          • Applied in a local clustering phase
          • Cluster Feature Vectors with Timestamp (CFT)
              • N: number of data objects
              • LS: linear sum of data objects
              • SS: sum of squares of data objects
              • LST: sum of timestamps
              • SST: sum of squares of timestamps
      • Offline macro-clustering
          • Uses micro-clusters as weighted pseudo-points
          • Applied in a global clustering phase with a weighted k-means
          • Uses probabilistic seeding based on the micro-cluster weights
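The five components of the cluster feature vector listed on this slide are all plain sums, which is what makes them cheap to maintain per point and cheap to merge across processes. The following is a minimal sketch under that reading; `MicroCluster` and its methods are hypothetical names for illustration, not the SAMOA or MOA classes, and the radius formula is the usual root-mean-square deviation rather than anything stated on the slide.

```java
// Illustrative sketch only; names are hypothetical, not SAMOA or MOA API.
final class MicroCluster {
    long n;             // N: number of data objects absorbed
    final double[] ls;  // LS: per-dimension linear sum of the objects
    final double[] ss;  // SS: per-dimension sum of squares of the objects
    double lst;         // LST: sum of timestamps
    double sst;         // SST: sum of squares of timestamps

    MicroCluster(int dimensions) {
        ls = new double[dimensions];
        ss = new double[dimensions];
    }

    // Absorb one point; the point itself can be discarded afterwards.
    void insert(double[] point, double timestamp) {
        n++;
        for (int i = 0; i < ls.length; i++) {
            ls[i] += point[i];
            ss[i] += point[i] * point[i];
        }
        lst += timestamp;
        sst += timestamp * timestamp;
    }

    // Every component is a sum, so merging is component-wise addition.
    void merge(MicroCluster other) {
        n += other.n;
        for (int i = 0; i < ls.length; i++) {
            ls[i] += other.ls[i];
            ss[i] += other.ss[i];
        }
        lst += other.lst;
        sst += other.sst;
    }

    // Centroid LS / N (assumes n > 0); with weight N it can serve as a
    // weighted pseudo-point for the macro-clustering phase.
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int i = 0; i < ls.length; i++) {
            c[i] = ls[i] / n;
        }
        return c;
    }

    // Root-mean-square deviation of the absorbed points from the centroid.
    double radius() {
        double sum = 0.0;
        for (int i = 0; i < ls.length; i++) {
            double mean = ls[i] / n;
            sum += Math.max(0.0, ss[i] / n - mean * mean); // guard rounding
        }
        return Math.sqrt(sum);
    }
}
```

Because `merge` is just addition, summaries built by separate local clustering processes can be combined without revisiting the original points; the centroid together with N as its weight is one way to feed a weighted k-means.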
  • 23. CluStream Snapshot
    [Figure: snapshot showing micro-clusters, macro-clusters and ground truth.]
  • 24. Scalable Advanced Massive Online Analysis (SAMOA)
    [Figure: SAMOA clustering task topology. Labels: Clustering, Evaluation, stream source, distribution PI, local clustering PI, global clustering PI, sampling PI, evaluator PI, output.]
  • 25. Experiments, Evaluation & Results
      • Experimental Setup
          • Four 2.4 GHz Intel Xeon dual quad-core, 48 GB RAM
          • Process parallelism level: 1, 8 & 16
          • Instance dimensions: 3 & 15
          • Source dataset: random events generator
          • Noise: 0% & 10%
          • Cluster movement speed: move 0.1 unit every 500 & 12000 instances
      • Evaluations
          • Scalability: measure throughput when adding concurrent processes
          • Clustering quality: measure whether the clustering algorithm is accurate
  • 26. Scalability
    [Figure: "Baseline Comparison". Axes: throughput (instances/second), evaluation step.]
  • 27. Scalability
    [Figure: "Average Throughput with Dimensions 3 and 15". Axes: average throughput (instances/second), process parallelism.]
  • 28. Scalability
    [Figure: "Parallelism Throughput with Dimension 3". Axes: average cumulative throughput (instances/second), process parallelism.]
  • 29. Clustering Quality Metrics
      • Internal & external evaluations
          • Internal evaluation uses attributes available from the clustering structure
          • External evaluation uses external validation structures
              • e.g. ground truth provided by the source generator
      • Metrics
          • Cohesion coefficient (SSE): measures the intra-cluster sum of squared errors
          • Separation coefficient (BSS): measures the inter-cluster (between) sum of squares
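The slide names the two metrics without giving formulas. The sketch below uses the common textbook definitions, which is an assumption: SSE sums the squared distance of every point to its own cluster centroid, and BSS sums, over clusters, the cluster size times the squared distance between the cluster centroid and the overall mean. `ClusterQuality` and its methods are hypothetical names, and every cluster is assumed to be non-empty.

```java
// Illustrative sketch of cohesion (SSE) and separation (BSS);
// names are hypothetical and clusters are assumed non-empty.
final class ClusterQuality {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Component-wise mean of a non-empty set of points.
    static double[] mean(double[][] points) {
        double[] m = new double[points[0].length];
        for (double[] p : points) {
            for (int i = 0; i < m.length; i++) {
                m[i] += p[i];
            }
        }
        for (int i = 0; i < m.length; i++) {
            m[i] /= points.length;
        }
        return m;
    }

    // Cohesion: squared distance of each point to its own cluster centroid.
    static double sse(double[][][] clusters) {
        double total = 0.0;
        for (double[][] cluster : clusters) {
            double[] centroid = mean(cluster);
            for (double[] p : cluster) {
                total += squaredDistance(p, centroid);
            }
        }
        return total;
    }

    // Separation: size-weighted squared distance of each centroid
    // to the mean of all points.
    static double bss(double[][][] clusters) {
        double[] overall = new double[clusters[0][0].length];
        long count = 0;
        for (double[][] cluster : clusters) {
            for (double[] p : cluster) {
                for (int i = 0; i < overall.length; i++) {
                    overall[i] += p[i];
                }
                count++;
            }
        }
        for (int i = 0; i < overall.length; i++) {
            overall[i] /= count;
        }
        double total = 0.0;
        for (double[][] cluster : clusters) {
            total += cluster.length * squaredDistance(mean(cluster), overall);
        }
        return total;
    }
}
```

Under these definitions SSE plus BSS equals the total sum of squares around the overall mean, so for a fixed data set a lower SSE (tighter clusters) goes together with a higher BSS (better separated clusters).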
  • 30. Clustering Quality
    [Figure: 0% noise; snapshots at 25,000 and 45,000 instances.]
  • 31. Clustering Quality
    [Figure: 0% noise; ratio = BSS / GT.]
  • 32. Clustering Quality
    [Figure: 10% noise; snapshots at 25,000 and 45,000 instances, with panels labeled "Good clustering" and "Poor clustering".]
  • 33. Clustering Quality
    [Figure: 10% noise.]
  • 34. Conclusion
      • There is important information in the massive amount of data being produced and discarded
      • There is a need for tools to deal with this efficiently
      • Efforts have been made to crunch big data
      • Interpreting and retrieving relevant information is where machine learning and data mining operate
      • Real-time analysis responds faster to evolving data
      • SAMOA abstracts the platform and maintains the algorithms; good to implement, test and use
  • 35. Acknowledgements
      • Thanks to Erasmus Mundus and all three universities (UPC, KTH and IST) for providing this opportunity
      • Thanks to all the EMDC students
      • Thanks to Yahoo! Research for the great project