FlinkML - Big data application meetup

My slides from the Big Data Applications meetup on the 27th of July, talking about FlinkML. Also covers some aspects of open-source ML development and an illustration of interactive Flink machine learning with Apache Zeppelin.


  1. FlinkML: Large-scale Machine Learning with Apache Flink. Theodore Vasiloudis, Swedish Institute of Computer Science (SICS). Big Data Application Meetup, July 27th, 2016
  2. Large-scale Machine Learning
  3. What do we mean?
  4. What do we mean? ● Small-scale learning ● Large-scale learning Source: Léon Bottou
  5. What do we mean? ● Small-scale learning ○ We have a small-scale learning problem when the active budget constraint is the number of examples. ● Large-scale learning Source: Léon Bottou
  6. What do we mean? ● Small-scale learning ○ We have a small-scale learning problem when the active budget constraint is the number of examples. ● Large-scale learning ○ We have a large-scale learning problem when the active budget constraint is the computing time. Source: Léon Bottou
  7. Apache Flink
  8. What is Apache Flink? ● Distributed stream and batch data processing engine ● Easy and powerful APIs for batch and real-time streaming analysis ● Backed by a very robust execution backend ○ true streaming dataflow engine ○ custom memory manager ○ native iterations ○ cost-based optimizer
  9. What is Apache Flink?
  10. What does Flink give us? ● Expressive APIs ● Pipelined stream processor ● Closed-loop iterations
  11. Expressive APIs ● Main bounded data abstraction: DataSet ● Program using functional-style transformations, creating a dataflow. case class Word(word: String, frequency: Int) val lines: DataSet[String] = env.readTextFile(...) lines.flatMap(line => line.split(" ").map(word => Word(word, 1))) .groupBy("word").sum("frequency") .print()
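To see what the DataSet transformations on this slide compute, here is the same word-count logic on a plain Scala collection, with no Flink runtime involved (the input lines are illustrative):

```scala
// Word count on a plain Scala List, mirroring the DataSet dataflow above:
// split lines into words, group equal words, count each group.
case class Word(word: String, frequency: Int)

val lines = List("to be or", "not to be")

val counts: List[Word] = lines
  .flatMap(_.split(" "))                        // one entry per word
  .groupBy(identity)                            // group equal words together
  .map { case (w, occs) => Word(w, occs.size) } // count occurrences per word
  .toList
```

In Flink the same chain runs lazily and distributed over a DataSet; on a local collection it executes eagerly, which makes it handy for checking the logic.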
  12. Pipelined Stream Processor
  13. Iterate in the dataflow
  14. Iterate by looping ● Loop in client submits one job per iteration step ● Reuse data by caching in memory or disk
  15. Iterate in the dataflow
  16. Delta iterations
  17. Performance: Extending the Yahoo Streaming Benchmark
  18. FlinkML
  19. FlinkML ● New effort to bring large-scale machine learning to Apache Flink
  20. FlinkML ● New effort to bring large-scale machine learning to Apache Flink ● Goals: ○ Truly scalable implementations ○ Keep glue code to a minimum ○ Ease of use
  21. FlinkML: Overview
  22. FlinkML: Overview ● Supervised learning ○ Optimization framework ○ Support Vector Machine ○ Multiple linear regression
  23. FlinkML: Overview ● Supervised learning ○ Optimization framework ○ Support Vector Machine ○ Multiple linear regression ● Recommendation ○ Alternating Least Squares (ALS)
  24. FlinkML: Overview ● Supervised learning ○ Optimization framework ○ Support Vector Machine ○ Multiple linear regression ● Recommendation ○ Alternating Least Squares (ALS) ● Pre-processing ○ Polynomial features ○ Feature scaling
  25. FlinkML: Overview ● Supervised learning ○ Optimization framework ○ Support Vector Machine ○ Multiple linear regression ● Recommendation ○ Alternating Least Squares (ALS) ● Pre-processing ○ Polynomial features ○ Feature scaling ● Unsupervised learning ○ Quad-tree exact kNN search
  26. FlinkML: Overview ● Supervised learning ○ Optimization framework ○ Support Vector Machine ○ Multiple linear regression ● Recommendation ○ Alternating Least Squares (ALS) ● Pre-processing ○ Polynomial features ○ Feature scaling ● Unsupervised learning ○ Quad-tree exact kNN search ● sklearn-like ML pipelines
  27. FlinkML API // LabeledVector is a feature vector with a label (class or real value) val trainingData: DataSet[LabeledVector] = ... val testingData: DataSet[Vector] = ...
  28. FlinkML API // LabeledVector is a feature vector with a label (class or real value) val trainingData: DataSet[LabeledVector] = ... val testingData: DataSet[Vector] = ... val mlr = MultipleLinearRegression() .setStepsize(0.01) .setIterations(100) .setConvergenceThreshold(0.001)
  29. FlinkML API // LabeledVector is a feature vector with a label (class or real value) val trainingData: DataSet[LabeledVector] = ... val testingData: DataSet[Vector] = ... val mlr = MultipleLinearRegression() .setStepsize(0.01) .setIterations(100) .setConvergenceThreshold(0.001) mlr.fit(trainingData)
  30. FlinkML API // LabeledVector is a feature vector with a label (class or real value) val trainingData: DataSet[LabeledVector] = ... val testingData: DataSet[Vector] = ... val mlr = MultipleLinearRegression() .setStepsize(0.01) .setIterations(100) .setConvergenceThreshold(0.001) mlr.fit(trainingData) // The fitted model can now be used to make predictions val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
  31. FlinkML Pipelines val scaler = StandardScaler() val polyFeatures = PolynomialFeatures().setDegree(3) val mlr = MultipleLinearRegression()
  32. FlinkML Pipelines val scaler = StandardScaler() val polyFeatures = PolynomialFeatures().setDegree(3) val mlr = MultipleLinearRegression() // Construct pipeline of standard scaler, polynomial features and multiple linear // regression val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
  33. FlinkML Pipelines val scaler = StandardScaler() val polyFeatures = PolynomialFeatures().setDegree(3) val mlr = MultipleLinearRegression() // Construct pipeline of standard scaler, polynomial features and multiple linear // regression val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr) // Train pipeline pipeline.fit(trainingData) // Calculate predictions val predictions = pipeline.predict(testingData)
  34. FlinkML: Focus on scalability
  35. Alternating Least Squares: R ≅ X ✕ Y (Users × Items factorization)
  36. Naive Alternating Least Squares
  37. Blocked Alternating Least Squares
  38. Blocked ALS performance (chart: FlinkML blocked ALS performance)
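To make the "alternating" part of ALS concrete, here is a toy rank-1 version on a tiny dense ratings matrix in plain Scala. It is only a sketch of the alternating closed-form least-squares updates; FlinkML's blocked ALS works on sparse, blocked DataSets with rank greater than 1, and none of the names below are FlinkML API:

```scala
// Toy rank-1 ALS: factor a 2x2 ratings matrix R into user and item factors.
val ratings = Array(
  Array(5.0, 3.0),
  Array(4.0, 2.0)
)
var users = Array(1.0, 1.0) // one latent factor per user
var items = Array(1.0, 1.0) // one latent factor per item

// With rank 1 the per-user least-squares solution is a simple ratio:
// u_i = (sum_j r_ij * y_j) / (sum_j y_j^2), and symmetrically for items.
for (_ <- 0 until 20) {
  users = users.indices.map { i =>
    ratings(i).indices.map(j => ratings(i)(j) * items(j)).sum /
      items.map(y => y * y).sum
  }.toArray
  items = items.indices.map { j =>
    ratings.indices.map(i => ratings(i)(j) * users(i)).sum /
      users.map(x => x * x).sum
  }.toArray
}

// Frobenius-norm error of the rank-1 reconstruction R ~ u * y^T
val error = math.sqrt(
  (for (i <- 0 to 1; j <- 0 to 1)
    yield math.pow(ratings(i)(j) - users(i) * items(j), 2)).sum
)
```

Each half-step fixes one factor and solves for the other exactly, which is what makes the blocked formulation possible: each user block only needs the item factors it references, and vice versa.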
  39. Going beyond SGD in large-scale optimization
  40. CoCoA: Communication Efficient Coordinate Ascent ● Beyond SGD → use the primal-dual framework ● Slow updates → immediately apply local updates
  41. Primal-dual framework. Source: Smith (2014)
  42. Primal-dual framework. Source: Smith (2014)
  43. Immediately apply updates. Source: Smith (2014)
  44. Immediately apply updates. Source: Smith (2014)
  45. CoCoA: Communication Efficient Coordinate Ascent
  46. CoCoA performance. Source: Jaggi (2014)
  47. CoCoA performance: available in FlinkML's SVM
  48. Dealing with stragglers: SSP iterations
  49. Dealing with stragglers: SSP iterations ● BSP: Bulk Synchronous Parallel ○ Every worker needs to wait for the others to finish before starting the next iteration
  50. Dealing with stragglers: SSP iterations ● BSP: Bulk Synchronous Parallel ○ Every worker needs to wait for the others to finish before starting the next iteration ● ASP: Asynchronous Parallel ○ Every worker can work individually, updating the model as needed.
  51. Dealing with stragglers: SSP iterations ● BSP: Bulk Synchronous Parallel ○ Every worker needs to wait for the others to finish before starting the next iteration ● ASP: Asynchronous Parallel ○ Every worker can work individually, updating the model as needed. ○ Can be fast, but can often diverge.
  52. Dealing with stragglers: SSP iterations ● BSP: Bulk Synchronous Parallel ○ Every worker needs to wait for the others to finish before starting the next iteration ● ASP: Asynchronous Parallel ○ Every worker can work individually, updating the model as needed. ○ Can be fast, but can often diverge. ● SSP: Stale Synchronous Parallel ○ Relax constraints, so the slowest workers can be up to K iterations behind the fastest ones.
  53. Dealing with stragglers: SSP iterations ● BSP: Bulk Synchronous Parallel ○ Every worker needs to wait for the others to finish before starting the next iteration ● ASP: Asynchronous Parallel ○ Every worker can work individually, updating the model as needed. ○ Can be fast, but can often diverge. ● SSP: Stale Synchronous Parallel ○ Relax constraints, so the slowest workers can be up to K iterations behind the fastest ones. ○ Allows for progress while keeping convergence guarantees.
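The SSP rule on this slide can be sketched in a few lines of plain Scala. This is only an illustration of the staleness bound, not FlinkML code; the worker names, clock map, and `mayAdvance` helper are all assumptions made up for the example:

```scala
// SSP staleness check: a worker may keep iterating only while it stays
// within `staleness` iterations of the slowest worker.
val staleness = 2
val workerClocks = Map("w1" -> 6, "w2" -> 3, "w3" -> 4) // iteration counters

def mayAdvance(worker: String): Boolean = {
  val slowest = workerClocks.values.min
  // staleness = 0 degenerates to BSP (lockstep);
  // an unbounded staleness corresponds to ASP.
  workerClocks(worker) - slowest <= staleness
}
```

Here the fastest worker (at iteration 6) is blocked until the slowest (at iteration 3) catches up, while the others keep making progress, which is exactly the trade-off between throughput and convergence guarantees described above.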
  54. Dealing with stragglers: SSP iterations. Source: Ho et al. (2013)
  55. SSP iterations in Flink: Lasso regression. Source: Peel et al. (2015)
  56. SSP iterations in Flink: Lasso regression. Source: Peel et al. (2015). PR submitted
  57. Challenges in developing an open-source ML library
  58. Challenges in open-source ML libraries ● Depth or breadth ● Design choices ● Testing
  59. Challenges in open-source ML libraries ● Attracting developers ● What to commit ● Avoiding code rot
  60. Current and future work on FlinkML
  61. Current work ● Tooling ○ Evaluation & cross-validation framework ○ Distributed linear algebra ○ Streaming predictors ● Algorithms ○ Implicit ALS ○ Multi-layer perceptron ○ Efficient streaming decision trees ○ Column-wise statistics, histograms
  62. Future of Machine Learning on Flink ● Streaming ML ○ Flink already has SAMOA bindings. ○ Preliminary work has already started: implement state-of-the-art algorithms and develop new techniques.
  63. Future of Machine Learning on Flink ● Streaming ML ○ Flink already has SAMOA bindings. ○ Preliminary work has already started: implement state-of-the-art algorithms and develop new techniques. ● "Computation-efficient" learning ○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning with modest computing resources.
  64. Check it out: @thvasilo tvas@sics.se flink.apache.org ci.apache.org/projects/flink/flink-docs-master/libs/ml
  65.–72. “Demo” (interactive demo slides, shown live with Apache Zeppelin)
  73. Thank you @thvasilo tvas@sics.se flink.apache.org ci.apache.org/projects/flink/flink-docs-master/libs/ml
  74. References ● Flink project: flink.apache.org ● FlinkML docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ ● Léon Bottou: Learning with Large Datasets ● Smith (2014): CoCoA AMP Camp presentation ● Jaggi et al. (2014): “Communication-efficient distributed dual coordinate ascent”, NIPS 2014 ● Ho et al. (2013): “More effective distributed ML via a stale synchronous parallel parameter server”, NIPS 2013 ● Peel et al. (2015): “Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism”, IEEE BigData 2015 ● Recent INRIA paper examining Spark vs. Flink (batch only) ● Extending the Yahoo streaming benchmark (and winning the Twitter Hack-Week with Flink) ● Also interesting: Bayesian anomaly detection in Flink
