SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

A general overview of the Apache SAMOA platform for mining big data streams with machine learning algorithms that run on distributed stream processing engines such as Apache Storm, Apache Flink, Apache Samza and Apache Apex.

Results are shown from experiments with VHT, the Vertical Hoeffding Tree, proposed in "VHT: Vertical Hoeffding Tree." N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.

Presented at Apache Big Data Europe 2015.

  1. SAMOA: A Platform for Mining Big Data Streams. Nicolas Kourtellis, Associate Researcher, Telefonica I+D, Barcelona.
  2. What is Big Data? Search queries, Facebook posts, emails, tweets, photo shares, clicks on ads, …
  3. How BIG is your data? Volume (+ Variety): too large for the RAM of a single commodity server. Velocity: too fast for the CPU of a single commodity server.
  4. What is the Streaming Paradigm? A high volume of data arriving at high speed; models updated in "real" time; a potentially infinite sequence of data; data that changes over time (concept drift).
  5. Mining Big Data Streams. Approximation algorithms: single pass, one data item at a time; sub-linear space and time per data item; small error with high probability. A platform solution: supports different algorithms and processing engines; distributed; scalable.
  6. What is SAMOA? Scalable Advanced Massive Online Analysis. A platform for mining big data streams; a framework for developing new distributed stream mining algorithms; a framework for deploying algorithms on new distributed stream processing engines.
  7. Taxonomy of machine learning tools. Distributed: batch (Hadoop, Mahout); stream (S4, Storm, SAMOA). Non-distributed: batch (R, WEKA, …); stream (MOA).
  8. SAMOA Architecture. [Figure: architecture diagram with the labels SAMOA, Machine Learning Algorithms, Distributed Stream Processing Engines, Flink.]
  9. Why is SAMOA important? Program once, run everywhere. Reuse existing infrastructure. Avoid deploy cycles: no system downtime, no complex backup/update process, no need to select an update frequency.
  10. ML Developer API. [Figure: diagram with the labels Processing Item, Processor, Stream.]
  11. ML Developer API:
        TopologyBuilder builder;  // initialization is not shown on the slide
        // First source and its output stream
        Processor sourceOne = new SourceProcessor();
        builder.addProcessor(sourceOne);
        Stream streamOne = builder.createStream(sourceOne);
        // Second source and its output stream
        Processor sourceTwo = new SourceProcessor();
        builder.addProcessor(sourceTwo);
        Stream streamTwo = builder.createStream(sourceTwo);
        // Join both streams: shuffle grouping for streamOne, key grouping for streamTwo
        Processor join = new JoinProcessor();
        builder.addProcessor(join)
               .connectInputShuffle(streamOne)
               .connectInputKey(streamTwo);
  12. Deployment. SAMOA-API.jar: the API; the algorithm developer depends only on this. SAMOA-S4.jar: S4 bindings, packaged as samoa-s4-deployable.s4r and sent to the S4 cluster. SAMOA-Storm.jar: Storm bindings, packaged as samoa-storm-deployable.jar and sent to the Storm cluster.
  13. Easy to get!
  14. Easy to get!
  15. Easy to get!
  16. Easy to test!
        bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (classifiers.trees.VerticalHoeffdingTree -p 4 -k) -s (generators.RandomTreeGenerator -r 1 -c 2 -o 10 -u 10)"
  17. Case study: Decision Trees. VHT: Vertical Hoeffding Tree*. [Figure: task parallelism.] *VHT: Vertical Hoeffding Tree. N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.
  18. Case study: VHT. [Figure: horizontal parallelism, with the labels Stream, Instances, Stats (three times), Histograms, Model, Model Updates.]
  19. Case study: VHT. [Figure: vertical parallelism, with the labels Stream, Attributes, Stats (three times), Splits, Model.]
  20. Benefits of Vertical Parallelism. With a high number of attributes (e.g., documents): a high level of parallelism. Compared with task parallelism: obvious parallelism observed. Compared with horizontal parallelism: reduced memory usage (no model replication) and parallelized split computation.
  21. Vertical Hoeffding Tree. [Figure: topology diagram with the processors Source (n), Model (n), Stats (n) and Evaluator (1); the labels Instance, Stream, Control, Split and Result; and a legend for Shuffle Grouping, Key Grouping and All Grouping.]
  22. Preliminary results: Tweets. Zipf skew: 1.5. Bag of words: 100, 1000, 10000 (attributes). Size of tweet: ~15 words. Instances: 1,000,000. Class: positive or negative (Gaussian random variable). 10 runs. Local vs. Storm virtual cluster.
  23. Results: Accuracy. [Figure: "Classification Accuracy vs. Parallelism Level vs. Number of Attributes"; x-axis: parallelism level (4, 8, 16, local); y-axis: correct classification % (0 to 100); series: 100, 1000 and 10000 words.]
  24. Results: Speedup. [Figure: "Speedup vs. Parallelism Level vs. Number of Attributes"; x-axis: parallelism level (4, 8, 16); y-axis: speedup (0 to 5); series: 100, 1000 and 10000 words.]
  25. Is SAMOA for you? Are you dealing with big, fast data? Possibly endless streams of data? Evolving data? Do you need models updated in real time? Do you want to test an algorithm on different DSPEs?
  26. SAMOA Team: Albert Bifet, Gianmarco De Francisci Morales, Nicolas Kourtellis, Matthieu Morel, Arinto Murdopo, Olivier Van Laere.
  27. Status. Apache Incubator: released version 0.3.0 in July. Input: local FS, HDFS, Kafka [pending]. Parallel algorithms: Vertical Hoeffding Tree (classification), CluStream (clustering), Adaptive Model Rules (regression), PARMA (frequent pattern mining) [pending]. Execution engines: [not recoverable from the transcript], Heron?
  28. Algorithms in SAMOA. Existing: Vertical Hoeffding Tree (classification), CluStream (clustering), Adaptive Model Rules (regression). Pending: Distributed Naïve Bayes, Stochastic Gradient Descent, Adaptive + Boosting VHT, Parallelized Gradient Boosted Decision Tree, PARMA (frequent pattern mining), … Check the SAMOA Roadmap for more. Looking for contributors!
  29. SAMOA: A Platform for Mining Big Data Streams. @ApacheSAMOA, http://samoa.incubator.apache.org/ and https://github.com/apache/incubator-samoa (Nicolas Kourtellis, @kourtellis, nicolas.kourtellis@telefonica.com)
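
The vertical parallelism described on slides 19 to 21 can be sketched in a few lines of plain Java. This is an illustration, not SAMOA code: the class `VerticalRouting`, its methods, and the choice of `attributeIndex mod parallelism` as the routing key are assumptions made for the example, standing in for the key grouping named on slide 21.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of vertical parallelism; not part of the SAMOA API. */
final class VerticalRouting {

    /** Assumed key grouping: a given attribute always maps to the same Stats worker. */
    static int workerFor(int attributeIndex, int parallelism) {
        return Math.floorMod(attributeIndex, parallelism);
    }

    /** Slices an instance with numAttributes attributes into one index list per worker. */
    static List<List<Integer>> slice(int numAttributes, int parallelism) {
        List<List<Integer>> perWorker = new ArrayList<>();
        for (int w = 0; w < parallelism; w++) {
            perWorker.add(new ArrayList<>());
        }
        for (int i = 0; i < numAttributes; i++) {
            perWorker.get(workerFor(i, parallelism)).add(i);
        }
        return perWorker;
    }

    public static void main(String[] args) {
        // Six attributes spread over four Stats workers.
        List<List<Integer>> batches = slice(6, 4);
        for (int w = 0; w < batches.size(); w++) {
            System.out.println("stats worker " + w + " <- attributes " + batches.get(w));
        }
    }
}
```

Because each worker only ever receives its own attributes, it keeps statistics for that slice alone, which is consistent with the slide's point that vertical parallelism avoids replicating the model.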

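The slides do not show the split rule behind the Vertical Hoeffding Tree. As background, Hoeffding trees in general decide when to split a leaf using the Hoeffding bound, epsilon = sqrt(R^2 * ln(1/delta) / (2n)), where R is the range of the split criterion, delta the allowed error probability and n the number of instances seen at the leaf. The sketch below assumes VHT follows the same rule; the class name, method names and the numbers in `main` are illustrative and do not come from the talk.

```java
/** Hypothetical sketch of the Hoeffding bound split test; not SAMOA code. */
final class HoeffdingBound {

    /** Hoeffding bound for a quantity with the given range after n observations. */
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Split once the gap between the two best attributes exceeds the bound. */
    static boolean shouldSplit(double bestGain, double secondGain,
                               double range, double delta, long n) {
        return bestGain - secondGain > epsilon(range, delta, n);
    }

    public static void main(String[] args) {
        double range = 1.0;   // information gain with two classes lies in [0, 1]
        double delta = 1e-7;  // illustrative confidence parameter
        for (long n : new long[] {200, 1000}) {
            System.out.println("n=" + n
                    + " epsilon=" + epsilon(range, delta, n)
                    + " split=" + shouldSplit(0.30, 0.15, range, delta, n));
        }
    }
}
```

The bound shrinks as more instances arrive, so a gap between the two best attributes that is too small to act on early in the stream becomes sufficient later, without storing the instances themselves.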