Scalable Distributed Real-Time Clustering for Big Data Streams

Analyzing a possibly infinite flow of data and applying machine learning algorithms to it is a challenging task. This presentation introduces the SAMOA framework, which allows machine learning algorithms to be developed on top of any distributed stream processing engine. It also demonstrates the development and use of a distributed clustering algorithm based on CluStream, running on the Apache S4 platform.

  1. Scalable Distributed Real-Time Clustering for Big Data Streams
     European Masters in Distributed Computing (EMDC)
     Student: Antonio Severien (severien@yahoo-inc.com)
     Supervisors: Albert Bifet (Yahoo! Research), Gianmarco De Francisci Morales (Yahoo! Research), Marta Arias (Universitat Politecnica de Catalunya)
  2. Contributions (27/06/13)
     ¤  SAMOA (Scalable Advanced Massive Online Analysis)
        ¤  Stream Processing Engine (SPE) abstraction framework
        ¤  Machine learning libraries adapter layer
        ¤  API for implementing data flow topologies
     ¤  SAMOA Clustering Algorithm
        ¤  Distributed stream clustering algorithm based on CluStream*
        ¤  Parallelize the clustering task and scale up on resource usage
     (*) “A Framework for Clustering Evolving Data Streams”, Aggarwal et al., 2003
  3. Motivation
     ¤  How BIG is BIG in BIG Data?
        ¤  2.5 quintillion bytes generated every day
        ¤  90% of today's data was generated in the last 2 years
        ¤  Sensors, social networks, e-business, mobile, internet logs, etc.
     ¤  Problems: the 3 Vs
        ¤  Volume: storage is unviable due to the massive amount of data
        ¤  Velocity: the production rate keeps increasing
        ¤  Variety: different sources, different data, different types
  4. Where is the Big Data?
     ¤  Where is the food?
        ¤  Databases?
        ¤  Data warehouses?
        ¤  Distributed databases?
        ¤  Distributed file systems?
     ¤  It’s flowing online! It’s Streaming!
  5. Crunching Big Data
     ¤  Map and Reduce
        ¤  MapReduce/GFS
        ¤  Hadoop/HDFS
     ¤  Stream Processing Engines (SPE)
        ¤  Apache S4
        ¤  Twitter Storm
  6. Distributed Systems
     ¤  Actors Model
        ¤  Independent concurrent processes
        ¤  Communicate asynchronously by message passing
     ¤  MapReduce Model
        ¤  Mappers: filtering and sorting
        ¤  Reducers: summary and aggregation
        ¤  Large volume of data, distributed
        ¤  Iterative: map-reduce-map-reduce…
  7. Streaming
     ¤  Streaming Model
        ¤  One-pass processing: discard item after use
        ¤  Low memory usage: store statistics and summaries
        ¤  Unbounded flow of data
        ¤  Evolving data sets
        ¤  Limited processing time
        ¤  Arrival order is not guaranteed
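The one-pass, low-memory idea on this slide can be sketched in a few lines of Java. This is only an illustration, not code from SAMOA or MOA: the class name RunningStats and its methods are made up for the example, and the update rule is Welford's online algorithm, chosen here as one common way to keep a summary (count, mean, variance) without storing the items.

```java
/** One-pass running statistics: each item is folded into a summary and then discarded. */
final class RunningStats {
    private long count;   // number of items seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Consume one stream item in O(1) time and memory (Welford's update). */
    void accept(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Population variance of everything seen so far; 0 for an empty stream. */
    double variance() {
        return count == 0 ? 0.0 : m2 / count;
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.accept(x); // the item is not stored anywhere after this call
        }
        System.out.println(stats.count() + " items, mean=" + stats.mean()
                + ", variance=" + stats.variance());
    }
}
```

Memory use stays constant no matter how long the stream runs, which is the property the streaming model asks for.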
  8. Making sense
     ¤  Machine Learning & Data Mining
        ¤  Make sense, extract patterns and react accordingly
        ¤  Train machines to “think”
        ¤  Perceive behavior
        ¤  Relations between similar information
     ¤  Unsupervised Learning
        ¤  Clustering algorithms
  9. Machine Learning Tools
     ¤  Mahout
        ¤  Machine learning framework used on top of Hadoop/HDFS
        ¤  Batch processing with MapReduce model
        ¤  Open-source and good community support
     ¤  Massive Online Analysis (MOA)
        ¤  Stream machine learning tool
        ¤  Many algorithms implemented; based on WEKA
        ¤  Single machine constraint
     ¤  Jubatus
        ¤  Distributed streaming machine learning framework
        ¤  No clustering algorithms yet
        ¤  No stream platform abstraction
  10. Scalable Advanced Massive Online Analysis (SAMOA)
      ¤  Distributed data streaming machine learning framework
      ¤  Stream Processing Engine abstraction
         ¤  Code once, run everywhere
         ¤  Focus on distributed algorithm design
         ¤  Fault-tolerance, communication, consistency and availability are provided by the underlying distributed processing platform
      ¤  Initial release provides integration with:
         ¤  Apache S4
         ¤  Twitter Storm
  11. Scalable Advanced Massive Online Analysis (SAMOA)
      [Architecture diagram: SAMOA Algorithms & SAMOA-API; ML Adapter (SAMOA, MOA, other ML libraries); SPE Adapter (S4, Storm, other SPE)]
  15. Apache S4
      ¤  Distributed, semi fault-tolerant, stream processing platform
      ¤  Based on the Actors model and inspired by the MapReduce model
      ¤  Flexibility on data flow; any topology and processor unit can be built, besides the mappers and reducers design
      ¤  Specialized in processing events from a stream and emitting events into a stream
  16. Scalable Advanced Massive Online Analysis (SAMOA)
      [Diagram: a SAMOA Topology (Task, Stream Source, EPI, Streams, PIs) mapped onto an S4 App (Stream Source, Streams, PEs)]
  17. How to use?
      ¤  Adding SPE using API
         ¤  S4ProcessingItem: processing element wrapper
         ¤  S4Stream: wrapper for an S4 stream
         ¤  S4ComponentFactory: provides components specific to Apache S4, such as processing elements and streams
         ¤  S4TopologyBuilder: creates the topology instances
      ¤  Adding algorithm and building topology

         class SimpleTask {
             ...
             TopologyBuilder topologyBuilder = new TopologyBuilder();
             // Wrap the source processor as the entrance processing item of the topology
             EntranceProcessingItem entranceProcessingItem =
                 topologyBuilder.createEntrancePI(new SourceProcessor());
             // Create a stream that originates at the entrance processing item
             Stream stream = topologyBuilder.createStream(entranceProcessingItem);
             // Wrap the algorithm's processor and connect the stream as its input
             ProcessingItem processingItem = topologyBuilder.createPI(new Processor());
             processingItem.connectInputKey(stream);
             ...
         }
  18. Grouping the Best of All
      ¤  Flexible programming model
      ¤  Distributed stream processing engine abstraction
      ¤  Integrated machine learning and data mining algorithms
      ¤  Easy API to implement new algorithms and SPE adapters
  19. SAMOA Clustering Algorithm
      ¤  Distributed stream clustering algorithm
      ¤  Validate the SAMOA implementation and its integration with Apache S4 using the SAMOA-S4 adapter
      ¤  Deploy on Apache S4
  20. Stream Clustering Algorithm
      ¤  CluStream Framework
         ¤  Based on k-means
         ¤  Online phase (micro-clustering)
         ¤  Offline phase (macro-clustering)
      ¤  k-means: partition a set of data into k distinct clusters according to a similarity function
         ¤  Minimizes a squared Euclidean distance objective function
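For reference, the usual k-means objective is J = sum over clusters j of the sum over points x in cluster j of ||x - c_j||^2, where c_j is the center of cluster j. The sketch below, assuming plain batch k-means (Lloyd's algorithm) with caller-supplied initial centers, shows how that objective and the assign/update loop fit together. The class KMeansSketch and its method names are invented for this example and are not part of SAMOA, MOA or CluStream.

```java
import java.util.Arrays;

/** Minimal batch k-means (Lloyd's algorithm) with fixed initial centers. */
final class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the center closest to the point (ties go to the lower index). */
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int j = 1; j < centers.length; j++) {
            if (squaredDistance(point, centers[j]) < squaredDistance(point, centers[best])) {
                best = j;
            }
        }
        return best;
    }

    /** Objective J: sum of squared distances from each point to its nearest center. */
    static double objective(double[][] points, double[][] centers) {
        double total = 0.0;
        for (double[] p : points) {
            total += squaredDistance(p, centers[nearest(p, centers)]);
        }
        return total;
    }

    /** Runs the given number of assign/update rounds and returns the final centers. */
    static double[][] fit(double[][] points, double[][] initialCenters, int rounds) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) {
            centers[j] = initialCenters[j].clone();
        }
        for (int r = 0; r < rounds; r++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {          // assignment step
                int j = nearest(p, centers);
                counts[j]++;
                for (int d = 0; d < dim; d++) {
                    sums[j][d] += p[d];
                }
            }
            for (int j = 0; j < k; j++) {        // update step
                if (counts[j] == 0) {
                    continue;                    // an empty cluster keeps its old center
                }
                for (int d = 0; d < dim; d++) {
                    centers[j][d] = sums[j][d] / counts[j];
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {9, 9}, {9, 10}, {10, 9}};
        double[][] seeds = {{0, 0}, {1, 0}};
        double[][] centers = fit(points, seeds, 10);
        System.out.println(Arrays.deepToString(centers));
        System.out.println("objective = " + objective(points, centers));
    }
}
```

Each round can only lower or keep the objective, so the loop settles on a local optimum that depends on the initial centers.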
  21. K-means Clustering Algorithm
      ¤  Advantages
         ¤  Simple, fast and efficient
      ¤  Known issues with k-means
         ¤  Sensitive to initial seeding
         ¤  Minimization problem is NP-hard even for simple configurations
            ¤  e.g. k = 2 clusters in general dimension, or points in the plane
         ¤  Global optimum not guaranteed
         ¤  Good for spherical clustering, not good for arbitrary shapes
  22. Distributed Stream Clustering
      ¤  Online micro-clustering
         ¤  Apply on a local clustering phase
         ¤  Cluster Feature Vectors with Timestamp (CFT)
            ¤  N: number of data objects
            ¤  LS: linear sum of data objects
            ¤  SS: sum of squares of data objects
            ¤  LST: sum of timestamps
            ¤  SST: sum of squares of timestamps
      ¤  Offline macro-clustering
         ¤  Use of micro-clusters as weighted pseudo-points
         ¤  Apply on a global clustering phase with a weighted k-means
         ¤  Uses probabilistic seeding depending on the weighted micro-clusters
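The CFT summary listed above can be sketched as a small Java class. The field names follow the slide (N, LS, SS, LST, SST); everything else, including the class name MicroCluster and its methods, is an assumption made for illustration and is not the SAMOA or MOA implementation. The point of the sketch is that every component is a plain sum, so a micro-cluster can absorb points one at a time and two micro-clusters can be merged by adding their fields.

```java
/** Cluster feature vector with timestamps, using the slide's field names. */
final class MicroCluster {
    long n;             // N: number of data objects absorbed
    final double[] ls;  // LS: per-dimension linear sum of data objects
    final double[] ss;  // SS: per-dimension sum of squares of data objects
    double lst;         // LST: sum of timestamps
    double sst;         // SST: sum of squares of timestamps

    MicroCluster(int dimensions) {
        ls = new double[dimensions];
        ss = new double[dimensions];
    }

    /** Absorb one point; the point itself does not need to be kept afterwards. */
    void insert(double[] point, long timestamp) {
        n++;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += point[d];
            ss[d] += point[d] * point[d];
        }
        lst += timestamp;
        sst += (double) timestamp * timestamp;
    }

    /** Additivity: merging two micro-clusters is a component-wise sum. */
    void merge(MicroCluster other) {
        n += other.n;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += other.ls[d];
            ss[d] += other.ss[d];
        }
        lst += other.lst;
        sst += other.sst;
    }

    /** Centroid LS / N (assumes n > 0); usable as a pseudo-point with weight N. */
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int d = 0; d < c.length; d++) {
            c[d] = ls[d] / n;
        }
        return c;
    }

    /** Sum of squared distances of the absorbed points to the centroid (assumes n > 0). */
    double sumOfSquaredDeviations() {
        double total = 0.0;
        for (int d = 0; d < ls.length; d++) {
            total += ss[d] - ls[d] * ls[d] / n;
        }
        return total;
    }

    /** Mean arrival time LST / N (assumes n > 0). */
    double meanTimestamp() {
        return lst / n;
    }

    public static void main(String[] args) {
        MicroCluster a = new MicroCluster(2);
        a.insert(new double[] {1, 2}, 10);
        a.insert(new double[] {3, 4}, 20);
        MicroCluster b = new MicroCluster(2);
        b.insert(new double[] {5, 6}, 30);
        a.merge(b);
        double[] c = a.centroid();
        System.out.println("n=" + a.n + ", centroid=(" + c[0] + ", " + c[1] + ")"
                + ", mean timestamp=" + a.meanTimestamp());
    }
}
```

In the offline phase, each micro-cluster's centroid and count would stand in for its points as a weighted pseudo-point.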
  23. CluStream Snapshot
      [Figure: micro-clusters, macro-clusters and ground truth]
  24. Scalable Advanced Massive Online Analysis (SAMOA)
      [Diagram: SAMOA Clustering Task. Clustering part: Stream Source, Distribution PI, Local Clustering PI, Global Clustering PI, Output. Evaluation part: Sampling PI, Evaluator PI, Output]
  25. Experiments, Evaluation & Results
      ¤  Experimental Setup
         ¤  Four 2.4 GHz Intel Xeon dual quad-core machines, 48 GB RAM
         ¤  Process parallelism level: 1, 8 & 16
         ¤  Instance dimensions: 3 & 15
         ¤  Source dataset: random events generator
         ¤  Noise: 0% & 10%
         ¤  Cluster movement speed: move 0.1 unit every 500 & 12000 instances
      ¤  Evaluations
         ¤  Scalability: measure throughput when adding concurrent processes
         ¤  Clustering quality: measure whether the clustering algorithm is accurate
  26. Scalability
      [Plot: Baseline Comparison; axes: Evaluation Step, Throughput (instances/second)]
  27. Scalability
      [Plot: Average Throughput with Dimensions 3 and 15; axes: Process Parallelism, Average Throughput (instances/second)]
  28. Scalability
      [Plot: Parallelism Throughput with Dimension 3; axes: Process Parallelism, Avg. Cumulative Throughput (instances/sec)]
  29. Clustering Quality Metrics
      ¤  Internal & External evaluations
         ¤  Internal evaluation uses attributes available from the clustering structure.
         ¤  External evaluation uses external validation structures.
            ¤  e.g. ground truth provided by the source generator.
      ¤  Metrics
         ¤  Cohesion coefficient (SSE): measures the intra-cluster sum of squared errors.
         ¤  Separation coefficient (BSS): measures the between-cluster sum of squares.
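The slides name the two metrics without giving formulas. Assuming the usual textbook definitions, SSE sums the squared distance of every point to the centroid of its own cluster, and BSS sums, per cluster, the cluster size times the squared distance from its centroid to the overall mean. The class ClusterQuality below is a hypothetical sketch under that assumption, not the evaluator used in the experiments.

```java
/** Internal clustering quality: cohesion (SSE) and separation (BSS). */
final class ClusterQuality {

    /** Per-dimension mean of a non-empty set of points. */
    static double[] centroid(double[][] points) {
        double[] c = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < c.length; d++) {
                c[d] += p[d];
            }
        }
        for (int d = 0; d < c.length; d++) {
            c[d] /= points.length;
        }
        return c;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** SSE: squared distance of every point to the centroid of its own cluster. */
    static double sse(double[][][] clusters) {
        double total = 0.0;
        for (double[][] cluster : clusters) {
            double[] c = centroid(cluster);
            for (double[] p : cluster) {
                total += squaredDistance(p, c);
            }
        }
        return total;
    }

    /** BSS: cluster size times squared distance from its centroid to the overall mean. */
    static double bss(double[][][] clusters) {
        int totalPoints = 0;
        for (double[][] cluster : clusters) {
            totalPoints += cluster.length;
        }
        double[][] all = new double[totalPoints][];
        int i = 0;
        for (double[][] cluster : clusters) {
            for (double[] p : cluster) {
                all[i++] = p;
            }
        }
        double[] overall = centroid(all);
        double total = 0.0;
        for (double[][] cluster : clusters) {
            total += cluster.length * squaredDistance(centroid(cluster), overall);
        }
        return total;
    }

    public static void main(String[] args) {
        // clusters[j][i] is the i-th point of cluster j; every cluster must be non-empty
        double[][][] clusters = {
            {{0, 0}, {0, 2}},
            {{10, 0}, {10, 2}}
        };
        System.out.println("SSE = " + sse(clusters) + ", BSS = " + bss(clusters));
    }
}
```

Low SSE together with high BSS indicates tight, well-separated clusters. Under these definitions SSE + BSS equals the total sum of squares around the overall mean, so the two values trade off for a fixed data set.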
  30. Clustering Quality 0% Noise
      [Figures: snapshots at 25,000 and 45,000 instances]
  31. Clustering Quality 0% Noise
      [Plot: Ratio = BSS / GT]
  32. Clustering Quality 10% Noise
      [Figures: snapshots at 25,000 and 45,000 instances, labelled "Good clustering" and "Poor clustering"]
  33. Clustering Quality 10% Noise
      [Plot]
  34. Conclusion
      ¤  There is important information in the massive amount of data being produced and discarded
      ¤  There is a need for tools to deal with this efficiently
      ¤  Efforts have been made to crunch big data
      ¤  Interpreting and retrieving relevant information is where machine learning and data mining operate
      ¤  Real-time analysis responds faster to evolving data
      ¤  SAMOA abstracts the platform and maintains the algorithms; good to implement, test and use.
  35. Acknowledgements
      ¤  Thanks to Erasmus Mundus and all three universities (UPC, KTH and IST) for providing this opportunity
      ¤  Thanks to all the EMDC students
      ¤  Thanks to Yahoo! Research for the great project