Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink

Flink Forward 2015

Published in: Technology

  1. 1. APACHE SAMOA: MINING BIG DATA STREAMS WITH APACHE FLINK Albert Bifet @abifet 12 October 2015
  2. 2. APACHE SAMOA 0.3.0 • Released July 2015 • MapReduce limitations • Example: how to compute in real time (latency less than 1 second): predictions, frequent items such as Twitter hashtags, sentiment analysis
  3. 3. Streaming Predictive Analytics on Apache Flink Author: Foteini Beligianni Examiner: Vladimir Vlassov Supervisors: Seif Haridi Paris Carbone A thesis submitted for the degree of Master of Science in Distributed Systems and Services
  4. 4. MOTIVATION
  5. 5. REALTIME ANALYTICS real time analytics
  6. 6. REALTIME ANALYTICS real time analytics
  7. 7. APACHE SAMOA VISION • Distributed stream mining platform • Library of state-of-the-art algorithms for practitioners • Development and collaboration framework for researchers • Algorithms & Systems
  8. 8. IMPORTANCE • Example: spam detection in comments on Yahoo News • Trends change in time • Need to retrain model with new data
  9. 9. INTERNET OF THINGS • Figure 3: EMC Digital Universe, 2014
  10. 10. BIG DATA STREAM • Volume + Velocity (+ Variety) • Too large for single commodity server main memory • Too fast for single commodity server CPU • A solution should be: • Distributed • Scalable
  11. 11. BIG DATA PROCESSING ENGINES • Low latency • High latency (not real time) • Apache Storm: characteristics for real-time data processing workloads: (1) fast, (2) scalable, (3) fault-tolerant, (4) reliable, (5) easy to operate • Apache Samza (from LinkedIn): Storm and Samza are fairly similar. Both systems provide: (1) a partitioned stream model, (2) a distributed execution environment, (3) an API for stream processing, (4) fault tolerance, (5) Kafka integration • Real-time computation (streaming computation): MapReduce limitations. Example: how to compute in real time (latency less than 1 second): (1) predictions, (2) frequent items such as Twitter hashtags, (3) sentiment analysis • Apache Spark Streaming
  12. 12. DATA SCIENCE • Figure 1: data scientist
  13. 13. MACHINE LEARNING • Classification • Regression • Clustering • Frequent Pattern Mining
  14. 14. WHAT IS APACHE SAMOA?
  15. 15. STREAMING MODEL • Sequence is potentially infinite • High amount of data, high speed of arrival • Change over time (concept drift) • Approximation algorithms (small error with high probability) • Single pass, one data item at a time • Sub-linear space and time per data item
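The single-pass, bounded-memory constraints of the streaming model can be illustrated with a frequent-items summary. The following is a minimal sketch of the Misra-Gries algorithm, chosen here only as an example; it is not taken from the slides or from the SAMOA code base, and the class name `FrequentItemsSketch` and its methods are hypothetical. With at most k-1 counters, the count of any item is underestimated by at most n/k after n items.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Illustrative Misra-Gries summary: one pass, at most k-1 counters. */
final class FrequentItemsSketch {
    private final int k;
    private final Map<String, Long> counters = new HashMap<>();

    FrequentItemsSketch(int k) {
        if (k < 2) throw new IllegalArgumentException("k must be >= 2");
        this.k = k;
    }

    /** Processes one stream item; memory never exceeds k-1 counters. */
    void add(String item) {
        Long count = counters.get(item);
        if (count != null) {
            counters.put(item, count + 1);
        } else if (counters.size() < k - 1) {
            counters.put(item, 1L);
        } else {
            // No free counter: decrement all, dropping those that reach zero.
            Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> entry = it.next();
                if (entry.getValue() == 1L) {
                    it.remove();
                } else {
                    entry.setValue(entry.getValue() - 1);
                }
            }
        }
    }

    /** Lower-bound estimate of the item's true count (0 if not tracked). */
    long estimate(String item) {
        return counters.getOrDefault(item, 0L);
    }

    /** Number of counters currently held. */
    int size() {
        return counters.size();
    }

    public static void main(String[] args) {
        FrequentItemsSketch sketch = new FrequentItemsSketch(3);
        String[] stream = {"#flink", "#samoa", "#flink", "#ml", "#flink", "#storm", "#flink"};
        for (String tag : stream) {
            sketch.add(tag);
        }
        System.out.println("#flink estimate: " + sketch.estimate("#flink"));
    }
}
```

The estimate is a lower bound rather than an exact count, which is the trade-off the slide describes: a small, bounded error in exchange for memory that does not grow with the stream.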
  16. 16. TAXONOMY • Data mining • Distributed • Batch (Hadoop): Mahout • Stream (Storm, S4, Samza): SAMOA • Non-distributed • Batch: R, WEKA, … • Stream: MOA
  17. 17. ARCHITECTURE An adapter for integrating Apache Flink into Apache SAMOA was implemented in scope of this master thesis, with the main parts of its implementation being addressed in this section. With the use of our adapter, ML algorithms can be executed on top of Apache Flink. The implemented adapter will be used for the evaluation of the ML pipelines and HT algorithm variations. Figure 20: Apache SAMOA’s high level architecture.
  18. 18. STATUS • Parallel algorithms • Classification (Vertical Hoeffding Tree) • Clustering (CluStream) • Regression (Adaptive Model Rules) • Execution engines
  19. 19. IS SAMOA USEFUL FOR YOU? • Only if you need to deal with: • Large fast data • Evolving process (model updates) • What is happening now? • Use feedback in real-time • Adapt to changes faster
  20. 20. GROUPINGS • Key Grouping (hashing) • Shuffle Grouping (round-robin) • All Grouping (broadcast) • Diagram labels: PE, PEI
  21. 21. GROUPINGS • Key Grouping (hashing) • Shuffle Grouping (round-robin) • All Grouping (broadcast) • Diagram labels: PE, PEI
  22. 22. GROUPINGS • Key Grouping (hashing) • Shuffle Grouping (round-robin) • All Grouping (broadcast) • Diagram labels: PE, PEI
  23. 23. GROUPINGS • Key Grouping (hashing) • Shuffle Grouping (round-robin) • All Grouping (broadcast) • Diagram labels: PE, PEI
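The three groupings listed above can be read as three rules for choosing which parallel instance of a processor receives an event. The sketch below illustrates those rules under simple assumptions (instances are numbered 0 to parallelism-1, keys are strings); the class `GroupingSketch` and its methods are hypothetical and are not SAMOA's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative routing rules for the three groupings (hypothetical names). */
final class GroupingSketch {
    private int next = 0; // round-robin cursor used by shuffle grouping

    /** Key grouping: the same key always maps to the same instance. */
    static int keyGrouping(String key, int parallelism) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** Shuffle grouping: instances are chosen in round-robin order. */
    int shuffleGrouping(int parallelism) {
        int target = next;
        next = (next + 1) % parallelism;
        return target;
    }

    /** All grouping: every instance receives a copy (broadcast). */
    static List<Integer> allGrouping(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }

    public static void main(String[] args) {
        GroupingSketch grouping = new GroupingSketch();
        int parallelism = 4;
        System.out.println("key grouping:     " + keyGrouping("attr-7", parallelism));
        System.out.println("shuffle grouping: " + grouping.shuffleGrouping(parallelism)
                + ", " + grouping.shuffleGrouping(parallelism));
        System.out.println("all grouping:     " + allGrouping(parallelism));
    }
}
```

Key grouping is the one to use when state must stay with a key, shuffle grouping when only load balancing matters, and all grouping when every instance needs the same message.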
  24. 24. ML DEVELOPER API Processing Item Processor Stream
  25. 25. ML DEVELOPER API
      TopologyBuilder builder;
      // First source and its output stream
      Processor sourceOne = new SourceProcessor();
      builder.addProcessor(sourceOne);
      Stream streamOne = builder.createStream(sourceOne);
      // Second source and its output stream
      Processor sourceTwo = new SourceProcessor();
      builder.addProcessor(sourceTwo);
      Stream streamTwo = builder.createStream(sourceTwo);
      // Join: shuffle grouping on the first stream, key grouping on the second
      Processor join = new JoinProcessor();
      builder.addProcessor(join)
             .connectInputShuffle(streamOne)
             .connectInputKey(streamTwo);
  26. 26. VERTICAL HOEFFDING TREE (VHT)
  27. 27. DECISION TREE • Nodes are tests on attributes • Branches are possible outcomes • Leaves are class assignments • Figure: decision tree for the class "Car deal?", with tests on the instance attributes Road Tested? (Yes/No), Mileage? (High/Low) and Age? (Old/Recent), and ✅ / ❌ leaves
  28. 28. HOEFFDING TREE • Sample of stream enough for near optimal decision • Estimate merit of alternatives from prefix of stream • Choose sample size based on statistical principles • When to expand a leaf? • Let x1 be the most informative attribute, x2 the second most informative one • Hoeffding bound: split if $\Delta G(x_1, x_2) > \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$ • P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
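The split rule on this slide can be made concrete with a small calculation. In the Hoeffding bound from the cited paper, R is the range of the merit measure G, δ is the allowed probability of choosing the wrong attribute, and n is the number of instances seen at the leaf. The sketch below is illustrative only; the class and method names are hypothetical and the sample values are made up.

```java
/** Illustrative Hoeffding-bound split test (hypothetical names and values). */
final class HoeffdingBoundSketch {

    /** epsilon = sqrt(R^2 * ln(1/delta) / (2n)). */
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Split when the merit gap between the two best attributes exceeds epsilon. */
    static boolean shouldSplit(double bestMerit, double secondMerit,
                               double range, double delta, long n) {
        return bestMerit - secondMerit > epsilon(range, delta, n);
    }

    public static void main(String[] args) {
        double range = 1.0;   // e.g. information gain with two classes
        double delta = 1e-7;  // allowed failure probability
        for (long n : new long[] {100, 1_000, 10_000}) {
            System.out.printf("n=%d epsilon=%.4f split(0.30 vs 0.25)=%b%n",
                    n, epsilon(range, delta, n),
                    shouldSplit(0.30, 0.25, range, delta, n));
        }
    }
}
```

Because ε shrinks as 1/√n, a leaf with two close attributes simply waits for more instances before splitting, which is what lets the tree decide from a prefix of the stream.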
  29. 29. PARALLEL DECISION TREES • Which kind of parallelism? • Task • Data • Horizontal (by instances) • Vertical (by attributes)
  30. 30. HORIZONTAL PARALLELISM • Single attribute tracked in multiple nodes • Aggregation to compute splits • Diagram labels: Stream, Instances, Stats, Histograms, Model, Model Updates • Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010
  31. 31. HOEFFDING TREE PROFILING • CPU time for training, 100 nominal and 100 numeric attributes • Learn 70 % • Split 24 % • Other 6 %
  32. 32. VERTICAL PARALLELISM • Single attribute tracked in single node • Diagram labels: Stream, Model, Attributes, Stats, Splits
  33. 33. ADVANTAGES OF VERTICAL • High number of attributes => high level of parallelism (e.g., documents) • Vs task parallelism • Parallelism observed immediately • Vs horizontal parallelism • Reduced memory usage (no model replication) • Parallelized split computation
  34. 34. VERTICAL HOEFFDING TREE • Components: Source (n), Model (n), Stats (n), Evaluator (1) • Streams: Instance, Control, Split, Result • Legend: Shuffle Grouping, Key Grouping, All Grouping
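The idea behind the vertical design, where each attribute is tracked by exactly one statistics node, can be sketched as a decomposition step: an instance is split into per-attribute events and each event is routed by key grouping on its attribute index. This is a simplified illustration, not the VHT implementation; the class `VerticalSplitSketch`, the `AttributeEvent` type and the modulo routing are assumptions, and a real implementation may key on more than the attribute index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative vertical decomposition of an instance (hypothetical names). */
final class VerticalSplitSketch {

    /** One attribute value of one instance, tagged with the class label. */
    static final class AttributeEvent {
        final int attributeIndex;
        final double value;
        final int classLabel;

        AttributeEvent(int attributeIndex, double value, int classLabel) {
            this.attributeIndex = attributeIndex;
            this.value = value;
            this.classLabel = classLabel;
        }
    }

    /** Key grouping on the attribute index: one attribute, one statistics worker. */
    static int workerFor(int attributeIndex, int numWorkers) {
        return attributeIndex % numWorkers;
    }

    /** Splits an instance into per-attribute events, grouped by target worker. */
    static Map<Integer, List<AttributeEvent>> decompose(double[] attributes,
                                                        int classLabel,
                                                        int numWorkers) {
        Map<Integer, List<AttributeEvent>> byWorker = new HashMap<>();
        for (int i = 0; i < attributes.length; i++) {
            byWorker.computeIfAbsent(workerFor(i, numWorkers), w -> new ArrayList<>())
                    .add(new AttributeEvent(i, attributes[i], classLabel));
        }
        return byWorker;
    }

    public static void main(String[] args) {
        double[] instance = {0.1, 0.7, 0.3, 0.9, 0.5};
        Map<Integer, List<AttributeEvent>> routed = decompose(instance, 1, 2);
        for (Map.Entry<Integer, List<AttributeEvent>> entry : routed.entrySet()) {
            System.out.println("worker " + entry.getKey() + " receives "
                    + entry.getValue().size() + " attribute events");
        }
    }
}
```

Since no attribute is shared between workers, each worker can compute candidate splits for its own attributes without replicating the model, which is the memory advantage the deck attributes to the vertical approach.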
  35. 35. ACCURACY • VHT2 – tree-100 (plot x-axis: No. Leaf Nodes) • Very close and very high accuracy
  36. 36. PERFORMANCE • Profiling Results for text-10000 with 100000 instances (plot: execution time in seconds per classifier, MHT vs. VHT2-par-3, split into t_calc, t_comm, t_serial) • Throughput • VHT2-par-3: 2631 inst/sec • MHT: 507 inst/sec
  37. 37. Streaming Predictive Analytics on Apache Flink Author: Foteini Beligianni Examiner: Vladimir Vlassov Supervisors: Seif Haridi Paris Carbone A thesis submitted for the degree of Master of Science in Distributed Systems and Services
  38. 38. REPLICATED MODEL VHT (RMVHT) • 4 Algorithm Implementation • 4.1.2 Replicated Model of VHT Algorithm (RmVHT)
  39. 39. COMPARISON NATIVE VHT • 6 Experimental Evaluation • Figure 22: Prequential classification error of Flink’s native VHT, SAMOA’s VHT and RmVHT algorithm for UCI-Forest Covertype data set. Flink’s native VHT has data source with parallelism equal to 1.
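The figures in this comparison report prequential classification error, where each instance is first used to test the model and only then to train it. A minimal sketch of that test-then-train loop follows; the `Classifier` interface and the trivial majority-class learner are hypothetical stand-ins used only to keep the example self-contained, not the classifiers evaluated in the thesis.

```java
/** Illustrative prequential (test-then-train) evaluation with hypothetical types. */
final class PrequentialSketch {

    /** Minimal online learner interface, assumed for this example. */
    interface Classifier {
        int predict(double[] x);
        void train(double[] x, int y);
    }

    /** Trivial learner that always predicts the most frequent class seen so far. */
    static final class MajorityClass implements Classifier {
        private final int[] counts;

        MajorityClass(int numClasses) {
            counts = new int[numClasses];
        }

        public int predict(double[] x) {
            int best = 0;
            for (int c = 1; c < counts.length; c++) {
                if (counts[c] > counts[best]) best = c;
            }
            return best;
        }

        public void train(double[] x, int y) {
            counts[y]++;
        }
    }

    /** Predicts each instance before learning from it; returns the error rate. */
    static double prequentialError(Classifier model, double[][] xs, int[] ys) {
        int mistakes = 0;
        for (int i = 0; i < xs.length; i++) {
            if (model.predict(xs[i]) != ys[i]) mistakes++;
            model.train(xs[i], ys[i]);
        }
        return xs.length == 0 ? 0.0 : (double) mistakes / xs.length;
    }

    public static void main(String[] args) {
        double[][] xs = {{0.0}, {1.0}, {2.0}, {3.0}};
        int[] ys = {1, 1, 0, 1};
        System.out.println("prequential error: "
                + prequentialError(new MajorityClass(2), xs, ys));
    }
}
```

Because every instance is scored before the model has seen its label, the error curve reflects how quickly a learner adapts over the stream, not only where it ends up.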
  40. 40. COMPARISON NATIVE VHT • Figure 25: Prequential classification error of Flink’s native VHT, SAMOA’s VHT and RmVHT algorithm for UCI-Forest Covertype data set. Flink’s native VHT has data source with parallelism equal to 8.
  41. 41. COMPARISON NATIVE VHT The Higgs data set is a synthetic data set, a detailed description of which is presented in Appendix Section A.2.1. In general, we observe that Higgs is not well suited to classification with a DT classifier. As we see in Figure 27, SAMOA’s VHT learns more slowly than Flink’s native VHT but achieves lower prequential classification error at the end. Flink’s VHT, on the other hand, learns faster at the beginning, but its prequential classification error then remains stable and slightly greater than SAMOA’s. Figure 27: Prequential classification error of Flink’s native VHT, SAMOA’s VHT and RmVHT algorithm for UCI-HIGGS data set.
  42. 42. COMPARISON NATIVE VHT As we observe in Figure 31, for the Waveform21 data set SAMOA’s VHT outperforms Flink’s native VHT implementation. Moreover, we see that SAMOA’s VHT learns more slowly but achieves lower classification error at the end, whereas Flink’s native VHT learns faster, reducing the classification error very quickly, but then its error remains stable. Figure 31: Classification error of VHT and RmVHT classifier, for Waveform 21-attribute data set on Apache Flink and Apache SAMOA. In Figure 32, we observe that for the Led data set Flink’s native VHT outper-
  43. 43. COMPARISON NATIVE VHT • Native VHT is faster than SAMOA VHT • Native VHT is more accurate than SAMOA VHT in real datasets • Future work for native VHT: stress test with nominal attributes, and use Gini Impurity
  44. 44. CONCLUSIONS • Streaming is the future and is happening now • Mining big data streams is an open field • SAMOA: A Platform for Mining Big Data Streams • Available and open-source (incubating @ASF) http://samoa.incubator.apache.org • A platform for collaboration and research on distributed stream mining
  45. 45. OPEN CHALLENGES • Distributed stream mining algorithms • Active & semi-supervised learning + crowdsourcing • Millions of classes (e.g., Wikipedia pages) • Multi-target learning • System issues (load balancing, communication) • Programming paradigms and abstractions
  46. 46. THE TEAM • Albert Bifet • Matthieu Morel • Gianmarco De Francisci Morales • Arinto Murdopo • Nicolas Kourtellis • Olivier Van Laere
  47. 47. SUPPORTING ORGANISATIONS
  48. 48. THANKS! https://samoa.incubator.apache.org @ApacheSAMOA
