SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2016)

A general overview of the Apache SAMOA platform for mining big data streams with machine learning algorithms that run on distributed stream processing engines such as Apache Storm, Apache Flink, Apache Samza and Apache Apex.

Results are shown from experiments with VHT, the Vertical Hoeffding Tree proposed in "VHT: Vertical Hoeffding Tree." N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.

Presented at Apache Big Data North America 2016.

Published in: Science

  1. 1. SAMOA: A Platform for Mining Big Data Streams Nicolas Kourtellis Associate Researcher Telefonica I+D, Barcelona @kourtellis @ApacheSAMOA 1
  2. 2. What is Big Data? Search queries Facebook posts Emails Tweets Photo shares Clicks on ads … 2
  3. 3. How BIG is your data? Volume (+ Variety) Too large for RAM of single commodity server Velocity Too fast for CPU of single commodity server 3
  4. 4. What is the Streaming Paradigm? High amount of data, high speed of arrival Updated models at “real” time Potentially infinite sequence of data Change over time (concept drift) 4
  5. 5. Mining Big Data Streams Approximation algorithms: Single pass, one data item at a time Sub-linear space and time per data item Small error with high probability A platform solution: Support different algorithms & processing engines Distributed Scalable 5
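Slide 5 asks for algorithms that make a single pass, look at one data item at a time, and keep far less state than the stream itself. The sketch below illustrates only those constraints, using Welford's running mean and variance. It is an exact computation rather than an approximation, and the class names are hypothetical; nothing here is taken from SAMOA.

```java
/** Demo entry point: feeds a small stream to the accumulator one item at a time. */
final class RunningStatsDemo {
    public static void main(String[] args) {
        RunningStats s = new RunningStats();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            s.add(x);
        }
        System.out.printf("n=%d mean=%.3f var=%.3f%n", s.count(), s.mean(), s.variance());
    }
}

/** Single-pass mean and variance with O(1) memory (Welford's method). */
final class RunningStats {
    private long n;       // number of items seen so far
    private double mean;  // running mean
    private double m2;    // sum of squared deviations from the running mean

    /** Consumes one data item; the item itself is never stored. */
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long count() { return n; }

    double mean() { return mean; }

    /** Sample variance; 0 until at least two items have been seen. */
    double variance() { return n > 1 ? m2 / (n - 1) : 0.0; }
}
```

Memory use stays at three fields regardless of stream length, which is the property the slide is after.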
  6. 6. What is SAMOA? Scalable Advanced Massive Online Analysis A platform for mining big data streams Framework for developing new distributed stream mining algorithms Framework for deploying algorithms on new distributed stream processing engines 6
  7. 7. Taxonomy of Machine Learning tools. Distributed: Batch (Hadoop: Mahout), Stream (S4, Storm: SAMOA). Non-distributed: Batch (R, WEKA, …), Stream (MOA) 7
  8. 8. SAMOA Architecture [diagram: Machine Learning Algorithms, SAMOA, Distributed Stream Processing Engines (Flink shown)] 8
  9. 9. Why is SAMOA important? Program once, run everywhere Reuse existing infrastructure Avoid deploy cycles No system downtime No complex backup/update process No need to select update frequency 9
  10. 10. ML Developer API [diagram: Processing Item, Processor, Stream] 10
  11. 11. ML Developer API TopologyBuilder builder; Processor sourceOne = new SourceProcessor(); builder.addProcessor(sourceOne); Stream streamOne = builder.createStream(sourceOne); Processor sourceTwo = new SourceProcessor(); builder.addProcessor(sourceTwo); Stream streamTwo = builder.createStream(sourceTwo); Processor join = new JoinProcessor(); builder.addProcessor(join).connectInputShuffle(streamOne).connectInputKey(streamTwo); 11
  12. 12. Deployment. SAMOA-API.jar: API; the algorithm developer depends only on this. SAMOA-S4.jar: S4 bindings -> samoa-s4-deployable.s4r -> to S4 cluster. SAMOA-Storm.jar: Storm bindings -> samoa-storm-deployable.jar -> to Storm cluster 12
  13. 13. Easy to get! 13
  14. 14. Easy to get! 14
  15. 15. Easy to get! 15
  16. 16. Easy to test! bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (classifiers.trees.VerticalHoeffdingTree -p 4 -k) -s (generators.RandomTreeGenerator -r 1 -c 2 -o 10 -u 10)" 16
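The command on slide 16 runs a prequential evaluation, in which every instance is first used to test the model and only then to train it. The sketch below shows that loop in isolation. It assumes that `-i` is the number of instances and `-f` the reporting frequency, which matches the "Instances: 1,000,000" and "Test every 100k instances" settings on the results slides. The `Learner` interface, the majority-class learner and the synthetic stream are hypothetical stand-ins, not SAMOA classes.

```java
import java.util.Random;

/** Test-then-train loop: each instance is used for testing first, then for training. */
final class PrequentialSketch {

    /** Hypothetical minimal learner interface, used only for this sketch. */
    interface Learner {
        int predict(double[] x);
        void train(double[] x, int label);
    }

    /** Toy learner that always predicts the most frequent class seen so far. */
    static final class MajorityClass implements Learner {
        private final long[] counts = new long[2];

        public int predict(double[] x) { return counts[1] > counts[0] ? 1 : 0; }

        public void train(double[] x, int label) { counts[label]++; }
    }

    /** Runs the loop, prints accuracy every `frequency` instances, returns overall accuracy. */
    static double evaluate(Learner learner, long instances, long frequency, long seed) {
        Random rnd = new Random(seed);
        long correct = 0;
        for (long i = 1; i <= instances; i++) {
            double[] x = { rnd.nextDouble() };            // synthetic feature, unused by the toy learner
            int label = rnd.nextDouble() < 0.7 ? 1 : 0;   // synthetic stream: about 70% class 1
            if (learner.predict(x) == label) {
                correct++;                                // test first...
            }
            learner.train(x, label);                      // ...then train on the same instance
            if (i % frequency == 0) {
                System.out.printf("%d instances, accuracy %.4f%n", i, (double) correct / i);
            }
        }
        return (double) correct / instances;
    }

    public static void main(String[] args) {
        evaluate(new MajorityClass(), 100_000, 10_000, 42L);
    }
}
```

Because every instance is scored before the model sees its label, no separate hold-out set is needed, which suits unbounded streams.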
  17. 17. Case study: Decision Trees. VHT: Vertical Hoeffding Tree*. Task parallelism. *VHT: Vertical Hoeffding Tree. N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016. 17
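The slides do not spell out what makes a tree a Hoeffding tree. As background, and assuming VHT keeps the standard Hoeffding tree split rule, a leaf is split once the observed gap between the two best attributes exceeds the Hoeffding bound. Here R is the range of the split criterion G, n the number of instances seen at the leaf, and 1 - δ the desired confidence:

```latex
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}},
\qquad
\text{split when } \Delta G = G(a_{1}) - G(a_{2}) > \epsilon
```

Since ε shrinks as n grows, a leaf that keeps receiving instances eventually gathers enough evidence to choose a split attribute with high probability, which is the "small error with high probability" property from slide 5.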
  18. 18. Case study: VHT. Horizontal parallelism [diagram: Stream, Instances, Stats (x3), Histograms, Model, Model Updates] 18
  19. 19. Case study: VHT. Vertical parallelism [diagram: Stream, Model, Attributes, Stats (x3), Splits] 19
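In the vertical scheme on slide 19, the model sends attributes, not whole instances, to the statistics workers. A toy sketch of that idea follows. The routing rule (attribute index modulo the number of workers), the `StatsWorker` class and the string-keyed counters are assumptions made for illustration and are not taken from the VHT implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of vertical parallelism: attributes, not instances, are partitioned. */
final class VerticalPartitionSketch {

    /** Hypothetical statistics worker: keeps class counts only for the attributes routed to it. */
    static final class StatsWorker {
        // key "attributeIndex=value" -> number of occurrences per class label
        final Map<String, long[]> counts = new HashMap<>();

        void update(int attribute, String value, int label, int numClasses) {
            counts.computeIfAbsent(attribute + "=" + value, k -> new long[numClasses])[label]++;
        }
    }

    /** Assumed routing rule for this sketch: attribute index modulo the number of workers. */
    static int workerFor(int attribute, int parallelism) {
        return Math.floorMod(attribute, parallelism);
    }

    /** Splits one instance into per-attribute messages and sends each to its worker. */
    static void route(String[] instance, int label, int numClasses, List<StatsWorker> workers) {
        for (int a = 0; a < instance.length; a++) {
            workers.get(workerFor(a, workers.size())).update(a, instance[a], label, numClasses);
        }
    }

    public static void main(String[] args) {
        List<StatsWorker> workers = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            workers.add(new StatsWorker());
        }
        route(new String[] {"sunny", "hot", "high", "weak"}, 0, 2, workers);
        route(new String[] {"rain", "mild", "high", "weak"}, 1, 2, workers);
        for (int i = 0; i < workers.size(); i++) {
            System.out.println("worker " + i + " tracks " + workers.get(i).counts.keySet());
        }
    }
}
```

Each worker holds statistics for its own slice of the attributes only, so no worker needs a copy of the model.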
  20. 20. Benefits of Vertical Parallelism. High number of attributes: high level of parallelism (e.g., documents). Vs. task parallelism: obvious parallelism observed. Vs. horizontal parallelism: reduced memory usage (no model replication), parallelized split computation 20
  21. 21. Vertical Hoeffding Tree [diagram: Source (n), Model (n), Stats (n), Evaluator (1); streams: Instance, Control, Split, Result; groupings: Shuffle Grouping, Key Grouping, All Grouping] 21
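The legend on slide 21 names three groupings. The transcript does not say which stream uses which grouping, so the sketch below shows only what each term conventionally means in stream processing engines: shuffle spreads events evenly, key grouping pins a key to one instance, and all grouping broadcasts. The class and method names are hypothetical, and this is not SAMOA's routing code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of three stream groupings. */
final class GroupingSketch {

    /** Shuffle grouping: spread events evenly across instances, here by round-robin. */
    static final class Shuffle {
        private int next;

        int route(int parallelism) {
            int target = next;
            next = (next + 1) % parallelism;
            return target;
        }
    }

    /** Key grouping: events with the same key always reach the same instance. */
    static int routeByKey(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** All grouping: every instance receives a copy of the event (broadcast). */
    static List<Integer> routeToAll(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }

    public static void main(String[] args) {
        Shuffle shuffle = new Shuffle();
        for (int i = 0; i < 4; i++) {
            System.out.println("shuffle -> instance " + shuffle.route(3));
        }
        System.out.println("key 'attr-7' -> instance " + routeByKey("attr-7", 3));
        System.out.println("all -> instances " + routeToAll(3));
    }
}
```

Key grouping is what lets a statistics worker see every update for a given attribute, while broadcast suits messages that every worker must act on.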
  22. 22. Preliminary results: Dense instances Random decision tree Mixed categorical and numerical attributes 10-10, 100-100, 1k-1k, 10k-10k Instances: 1,000,000 2 balanced classes 10 different seeded runs Test every 100k instances MOA HT vs. Local VHT vs. Storm cluster VHT 22
  23. 23. Results: Accuracy [chart: Dense attributes; y-axis: % accuracy; x-axis: nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k); series: local, moa] 23
  24. 24. Results: Accuracy [charts: parallelism = 2 and parallelism = 4; y-axis: % accuracy; x-axis: nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k); series: sharding, wok, wk(0), wk(1k), wk(10k), local] 24
  25. 25. Results: Accuracy Evolution 25
  26. 26. Results: Speedup 26
  27. 27. Results: Speedup 27
  28. 28. Preliminary results: Artificial Tweets Zipf skew: 1.5 Bag of words: 100, 1000, 10000 (attributes) Size of tweet: ~15 words Instances: 1,000,000 Class: positive or negative (Gaussian random variable) 10 different seeded runs Test every 100k instances MOA HT vs. Local VHT vs. Storm cluster VHT 28
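Slide 28 describes sparse synthetic tweets: about 15 words each, drawn from a vocabulary whose word frequencies follow a Zipf law with skew 1.5. One way to produce such bag-of-words vectors is inverse-CDF sampling, sketched below. This is an assumption about how such data could be generated, not the generator used in the experiments; the class name is hypothetical and the class label step is omitted.

```java
import java.util.Random;

/** Draws bag-of-words vectors whose word ids follow a Zipf distribution. */
final class ZipfTweetSketch {
    private final double[] cdf;   // cumulative probability of word ids 0..vocabulary-1
    private final Random rnd;

    ZipfTweetSketch(int vocabulary, double skew, long seed) {
        cdf = new double[vocabulary];
        double sum = 0.0;
        for (int rank = 1; rank <= vocabulary; rank++) {
            sum += 1.0 / Math.pow(rank, skew);   // unnormalized Zipf weight of this rank
            cdf[rank - 1] = sum;
        }
        for (int i = 0; i < vocabulary; i++) {
            cdf[i] /= sum;                       // normalize so the last entry is 1.0
        }
        rnd = new Random(seed);
    }

    /** Draws one word id in [0, vocabulary); id 0 is the most frequent word. */
    int nextWord() {
        double u = rnd.nextDouble();
        int lo = 0;
        int hi = cdf.length - 1;
        while (lo < hi) {                        // binary search for the first cdf entry >= u
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Builds the word-count vector for one tweet with the given number of words. */
    int[] nextTweet(int words) {
        int[] bag = new int[cdf.length];
        for (int i = 0; i < words; i++) {
            bag[nextWord()]++;
        }
        return bag;
    }

    public static void main(String[] args) {
        ZipfTweetSketch gen = new ZipfTweetSketch(100, 1.5, 1L);
        int[] tweet = gen.nextTweet(15);
        int distinct = 0;
        for (int c : tweet) {
            if (c > 0) {
                distinct++;
            }
        }
        System.out.println("distinct words in tweet: " + distinct + " of " + tweet.length);
    }
}
```

With a vocabulary of 10,000 and 15 words per tweet, almost every attribute is zero in any one instance, which is what makes this a sparse counterpart to the dense experiments.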
  29. 29. Results: Accuracy 29
  30. 30. Results: Accuracy 30
  31. 31. Results: Accuracy Evolution 31
  32. 32. Results: Speedup 32
  33. 33. Results: Speedup 33
  34. 34. Is SAMOA for you? Are you dealing with: Big fast data? Possibly endless streams of data? Evolving data? Do you need updated models at real time? Do you want to test an algorithm on different DSPEs? 34
  35. 35. SAMOA Team Albert Bifet Gianmarco De Francisci Morales Nicolas Kourtellis Matthieu Morel Arinto Murdopo Olivier Van Laere 35
  36. 36. Status. Apache Incubator. Released version 0.3.0 in July. Execution engines. Heron? Apache Beam? Input: Local FS, HDFS, Avro, Kafka [pending]. Parallel algorithms: Vertical Hoeffding Tree (classification), CluStream (clustering), Adaptive Model Rules (regression), PARMA (frequent pattern mining) [pending] 36
  37. 37. Algorithms in SAMOA. Existing: Vertical Hoeffding Tree (classification), CluStream (clustering), Adaptive Model Rules (regression). Pending: Distributed Naïve Bayes, Stochastic Gradient Descent, Adaptive + Boosting VHT, Parallelized Gradient Boosted Decision Tree, PARMA (frequent pattern mining), … Check the SAMOA Roadmap for more. Looking for contributors! 37
  38. 38. SAMOA: A Platform for Mining Big Data Streams @ApacheSAMOA http://samoa.incubator.apache.org/ https://github.com/apache/incubator-samoa Nicolas Kourtellis @kourtellis nicolas.kourtellis@telefonica.com 38
