
- 1. FlinkML: Large-scale Machine Learning with Apache Flink. Theodore Vasiloudis, Swedish Institute of Computer Science (SICS). Big Data Application Meetup, July 27th, 2016
- 2. Large-scale Machine Learning
- 3. What do we mean?
- 4.–6. What do we mean? ● Small-scale learning ○ We have a small-scale learning problem when the active budget constraint is the number of examples. ● Large-scale learning ○ We have a large-scale learning problem when the active budget constraint is the computing time. Source: Léon Bottou
- 7. Apache Flink
- 8. What is Apache Flink? ● Distributed stream and batch data processing engine ● Easy and powerful APIs for batch and real-time streaming analysis ● Backed by a very robust execution backend ○ true streaming dataflow engine ○ custom memory manager ○ native iterations ○ cost-based optimizer
- 9. What is Apache Flink?
- 10. What does Flink give us? ● Expressive APIs ● Pipelined stream processor ● Closed loop iterations
- 11. Expressive APIs ● Main bounded-data abstraction: DataSet ● Program using functional-style transformations, creating a dataflow. case class Word(word: String, frequency: Int) val lines: DataSet[String] = env.readTextFile(...) lines.flatMap(line => line.split(" ").map(word => Word(word, 1))) .groupBy("word").sum("frequency") .print()
- 12. Pipelined Stream Processor
- 13. Iterate in the dataflow
- 14. Iterate by looping ● Loop in client submits one job per iteration step ● Reuse data by caching in memory or disk
- 15. Iterate in the dataflow
- 16. Delta iterations
- 17. Performance: Extending the Yahoo Streaming Benchmark
- 18. FlinkML
- 19.–20. FlinkML ● New effort to bring large-scale machine learning to Apache Flink ● Goals: ○ Truly scalable implementations ○ Keep glue code to a minimum ○ Ease of use
- 21.–26. FlinkML: Overview ● Supervised learning ○ Optimization framework ○ Support Vector Machine ○ Multiple linear regression ● Recommendation ○ Alternating Least Squares (ALS) ● Pre-processing ○ Polynomial features ○ Feature scaling ● Unsupervised learning ○ Quad-tree exact kNN search ● sklearn-like ML pipelines
- 27.–30. FlinkML API // LabeledVector is a feature vector with a label (class or real value) val trainingData: DataSet[LabeledVector] = ... val testingData: DataSet[Vector] = ... val mlr = MultipleLinearRegression() .setStepsize(0.01) .setIterations(100) .setConvergenceThreshold(0.001) mlr.fit(trainingData) // The fitted model can now be used to make predictions val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
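To make the three knobs on the slide concrete, here is a minimal, self-contained sketch in plain Scala of the update loop behind setStepsize, setIterations, and setConvergenceThreshold, fitting y ≈ w·x with batch gradient descent. This is illustrative only (object and method names are my own); FlinkML runs the equivalent loop as a distributed Flink dataflow.

```scala
// Illustrative sketch: the roles of stepsize, max iterations, and a
// convergence threshold in a gradient-descent fit (not FlinkML code).
object GdSketch {
  // Fit y ≈ w * x by batch gradient descent on squared loss.
  def fit(xs: Array[Double], ys: Array[Double],
          stepsize: Double, maxIterations: Int,
          convergenceThreshold: Double): Double = {
    var w = 0.0
    var iter = 0
    var converged = false
    while (iter < maxIterations && !converged) {
      // Gradient of 1/(2n) * sum_i (w*x_i - y_i)^2 with respect to w.
      val grad = xs.zip(ys).map { case (x, y) => (w * x - y) * x }.sum / xs.length
      val next = w - stepsize * grad
      // Stop early once the weight update falls below the threshold.
      converged = math.abs(next - w) < convergenceThreshold
      w = next
      iter += 1
    }
    w
  }
}
```

For the perfectly linear data y = 2x, the loop converges to w ≈ 2 well before the iteration cap, so the convergence threshold, not the iteration count, is what terminates it.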
- 31.–33. FlinkML Pipelines val scaler = StandardScaler() val polyFeatures = PolynomialFeatures().setDegree(3) val mlr = MultipleLinearRegression() // Construct pipeline of standard scaler, polynomial features and multiple linear // regression val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr) // Train pipeline pipeline.fit(trainingData) // Calculate predictions val predictions = pipeline.predict(testingData)
- 34. FlinkML: Focus on scalability
- 35. Alternating Least Squares R ≅ X Y✕Users Items
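The factorization R ≅ X·Y on the slide can be sketched in a few lines of plain Scala for the rank-1 case, where each alternating step has a closed-form least-squares solution. This is a toy illustration (names are my own, the matrix is fully observed, and there is no regularization); FlinkML's ALS distributes the same alternating idea over blocked user and item factors.

```scala
// Toy rank-1 ALS on a fully observed ratings matrix (not FlinkML code).
object AlsSketch {
  // Alternately solve for user factors x and item factors y so that
  // r(u)(i) ≈ x(u) * y(i).
  def als(r: Array[Array[Double]], iterations: Int): (Array[Double], Array[Double]) = {
    val users = r.length
    val items = r(0).length
    var x = Array.fill(users)(1.0) // user factors
    var y = Array.fill(items)(1.0) // item factors
    for (_ <- 1 to iterations) {
      // Fix y; the least-squares solution for each user factor is closed form.
      x = Array.tabulate(users) { u =>
        (0 until items).map(i => r(u)(i) * y(i)).sum / y.map(v => v * v).sum
      }
      // Fix x; solve for each item factor the same way.
      y = Array.tabulate(items) { i =>
        (0 until users).map(u => r(u)(i) * x(u)).sum / x.map(v => v * v).sum
      }
    }
    (x, y)
  }

  // Squared reconstruction error of the factorization.
  def error(r: Array[Array[Double]], x: Array[Double], y: Array[Double]): Double =
    (for (u <- r.indices; i <- r(0).indices)
      yield math.pow(r(u)(i) - x(u) * y(i), 2)).sum
}
```

On a genuinely rank-1 matrix the alternating updates recover an exact factorization after the first sweep, which is why ALS converges so quickly when the low-rank assumption holds.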
- 36. Naive Alternating Least Squares
- 37. Blocked Alternating Least Squares
- 38. Blocked ALS performance (figure: FlinkML blocked ALS performance)
- 39. Going beyond SGD in large-scale optimization
- 40. CoCoA: Communication Efficient Coordinate Ascent ● Beyond SGD → Use primal-dual framework ● Slow updates → Immediately apply local updates
- 41.–42. Primal-dual framework Source: Smith (2014)
- 43.–44. Immediately Apply Updates Source: Smith (2014)
- 45. CoCoA: Communication Efficient Coordinate Ascent
- 46. CoCoA performance Source: Jaggi (2014)
- 47. CoCoA performance Available on FlinkML SVM
- 48. Dealing with stragglers: SSP Iterations
- 49.–53. Dealing with stragglers: SSP Iterations ● BSP: Bulk Synchronous Parallel ○ Every worker needs to wait for the others to finish before starting the next iteration ● ASP: Asynchronous Parallel ○ Every worker can work individually, updating the model as needed. ○ Can be fast, but can often diverge. ● SSP: Stale Synchronous Parallel ○ Relax the constraints, so the slowest workers can be up to K iterations behind the fastest ones. ○ Allows for progress while keeping convergence guarantees.
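The bounded-staleness rule behind SSP can be sketched in a few lines of plain Scala (all names here are illustrative, not the Flink implementation): a worker whose iteration clock is more than K ticks ahead of the slowest worker must wait, and BSP is the special case K = 0.

```scala
// Illustrative sketch of SSP's bounded-staleness condition (not Flink code).
object SspSketch {
  // clocks(i) = iteration count completed by worker i.
  final case class Workers(clocks: Vector[Int])

  // A worker may advance only if it is at most `staleness` iterations
  // ahead of the slowest worker. staleness = 0 recovers BSP's barrier.
  def mayAdvance(w: Workers, worker: Int, staleness: Int): Boolean =
    w.clocks(worker) - w.clocks.min <= staleness

  // Advance the worker's clock if the staleness bound permits it;
  // otherwise it must wait for the stragglers to catch up.
  def tick(w: Workers, worker: Int, staleness: Int): Workers =
    if (mayAdvance(w, worker, staleness))
      Workers(w.clocks.updated(worker, w.clocks(worker) + 1))
    else w
}
```

With clocks (3, 3, 5) and staleness K = 1, the fast worker at clock 5 is blocked (it is 2 ahead of the minimum), while the workers at clock 3 keep making progress, which is exactly how SSP absorbs stragglers without giving up the convergence bound.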
- 54. Dealing with stragglers: SSP Iterations Source: Ho et al. (2013)
- 55. SSP Iterations in Flink: Lasso Regression Source: Peel et al. (2015)
- 56. SSP Iterations in Flink: Lasso Regression Source: Peel et al. (2015) PR submitted
- 57. Challenges in developing an open-source ML library
- 58. Challenges in open-source ML libraries ● Depth or breadth ● Design choices ● Testing
- 59. Challenges in open-source ML libraries ● Attracting developers ● What to commit ● Avoiding code rot
- 60. Current and future work on FlinkML
- 61. Current work ● Tooling ○ Evaluation & cross-validation framework ○ Distributed linear algebra ○ Streaming predictors ● Algorithms ○ Implicit ALS ○ Multi-layer perceptron ○ Efficient streaming decision trees ○ Column-wise statistics, histograms
- 62.–63. Future of Machine Learning on Flink ● Streaming ML ○ Flink already has SAMOA bindings. ○ Preliminary work has already started: implement state-of-the-art algorithms and develop new techniques. ● "Computation-efficient" learning ○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning with modest computing resources.
- 64. Check it out: @thvasilo tvas@sics.se flink.apache.org ci.apache.org/projects/flink/flink-docs-master/libs/ml
- 65. “Demo”
- 73. Thank you @thvasilo tvas@sics.se flink.apache.org ci.apache.org/projects/flink/flink-docs-master/libs/ml
- 74. References ● Flink project: flink.apache.org ● FlinkML docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ ● Léon Bottou: Learning with Large Datasets ● Smith (2014): CoCoA AMP Camp presentation ● Jaggi (2014): "Communication-efficient distributed dual coordinate ascent." NIPS 2014. ● Ho (2013): "More effective distributed ML via a stale synchronous parallel parameter server." NIPS 2013. ● Peel (2015): "Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism." IEEE BigData 2015. ● Recent INRIA paper examining Spark vs. Flink (batch only) ● Extending the Yahoo streaming benchmark (and winning the Twitter Hack-Week with Flink) ● Also interesting: Bayesian anomaly detection in Flink
