My talk from SICS Data Science Day, describing FlinkML, the Machine Learning library for Apache Flink.
I talk about our approach to large-scale machine learning and how we utilize state-of-the-art algorithms to ensure FlinkML is a truly scalable library.
You can watch a video of the talk here: https://youtu.be/k29qoCm4c_k
3. What is Apache Flink?
● Large-scale data processing engine
● Easy and powerful APIs for batch and real-time streaming analysis
● Backed by a very robust execution backend
○ true streaming dataflow engine
○ custom memory manager
○ native iterations
○ cost-based optimizer
5. What does Flink give us?
● Expressive APIs
● Pipelined stream processor
● Closed loop iterations
6. Expressive APIs
● Main distributed data abstraction: DataSet
● Program using functional-style transformations, creating a Dataflow.
case class Word(word: String, frequency: Int)
val env = ExecutionEnvironment.getExecutionEnvironment
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
.groupBy("word").sum("frequency")
.print()
17. What do we mean?
● Small-scale learning
○ We have a small-scale learning problem when the active budget constraint is the number of examples.
● Large-scale learning
○ We have a large-scale learning problem when the active budget constraint is the computing time.
Source: Léon Bottou
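One compact way to write Bottou's distinction (a sketch in standard empirical-risk-minimization notation, which is assumed here rather than taken from the slides): we always minimize the expected loss, but the binding budget differs.
\min_{f \in \mathcal{F}} \; \mathbb{E}\big[\ell(f(x), y)\big]
\quad \text{subject to} \quad
\begin{cases}
  n \le n_{\max} & \text{(small-scale: the number of examples is the active constraint)} \\
  T \le T_{\max} & \text{(large-scale: the computing time is the active constraint)}
\end{cases}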
21. What do we mean?
● What about the complexity of the problem?
“When you get to a trillion [parameters], you’re getting to something that’s got a chance of really understanding some stuff.” - Hinton, 2013
Source: Wired Magazine
22. What do we mean?
● We have a large-scale learning problem when the active budget constraint is the computing time and/or the model complexity.
25. FlinkML
● New effort to bring large-scale machine learning to Flink
● Goals:
○ Truly scalable implementations
○ Keep glue code to a minimum
○ Ease of use
29. FlinkML: Overview
● Supervised Learning
○ Optimization framework
○ SVM
○ Multiple linear regression
● Recommendation
○ Alternating Least Squares (ALS)
● Pre-processing
○ Polynomial features
○ Feature scaling
● sklearn-like ML pipelines
33. FlinkML API
// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
val mlr = MultipleLinearRegression()
.setStepsize(0.01)
.setIterations(100)
.setConvergenceThreshold(0.001)
mlr.fit(trainingData)
// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
36. FlinkML Pipelines
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()
// Construct pipeline of standard scaler, polynomial features and multiple linear
// regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
// Train pipeline
pipeline.fit(trainingData)
// Calculate predictions
val predictions = pipeline.predict(testingData)
43. CoCoA: Communication-Efficient Coordinate Ascent
● Beyond SGD → use a primal-dual framework
● Slow updates → immediately apply local updates
● Average over batch size → average over K (nodes) << batch size
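A schematic of one CoCoA outer round, following Jaggi (2014); the names below (cocoaRound, localSolvers) are illustrative assumptions, not the FlinkML implementation. Each of the K workers approximately solves its local dual subproblem on its own data partition and applies the update locally right away; only the per-worker model deltas are communicated, and they are averaged over the K workers rather than over the full batch size.
// Hypothetical sketch of one CoCoA outer round; not the actual FlinkML code.
def cocoaRound(w: Array[Double],
               localSolvers: Seq[Array[Double] => Array[Double]]): Array[Double] = {
  val k = localSolvers.size
  // Each of the K workers solves its local subproblem starting from the shared model w,
  // applying updates locally as it goes; here we only keep the resulting model delta.
  val localDeltas = localSolvers.map(solve => solve(w))
  // Communicate the deltas, average over K workers (not over the batch size), update w.
  val averagedDelta = localDeltas.transpose.map(_.sum / k).toArray
  w.zip(averagedDelta).map { case (wi, di) => wi + di }
}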
52. Achieving model parallelism: the parameter server
● The parameter server is essentially a distributed key-value store with two basic commands: push and pull
○ push updates the model
○ pull retrieves a (lazily) updated model
● Allows us to store a model across multiple nodes, and to read and update it as needed.
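A minimal sketch of that push/pull interface, with made-up names (this is not an actual Flink or FlinkML API):
// Hypothetical push/pull interface of a parameter server; names are illustrative.
trait ParameterServer[K, V] {
  def push(key: K, update: V): Unit // send a model update to the server shard owning `key`
  def pull(key: K): V               // read the current (possibly lazily updated) value
}
// Schematic worker step: pull a model slice, compute a local update on local data, push it back.
def workerStep(ps: ParameterServer[Int, Array[Double]],
               slice: Int,
               localUpdate: Array[Double] => Array[Double]): Unit = {
  val weights = ps.pull(slice)
  ps.push(slice, localUpdate(weights))
}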
53. Architecture of a parameter server communicating with groups of workers.
Source: Li (2014)
60. Dealing with stragglers: SSP iterations
● BSP: Bulk Synchronous Parallel
○ Every worker needs to wait for the others to finish before starting the next iteration.
● ASP: Asynchronous Parallel
○ Every worker can work individually and update the model as needed.
○ Can be fast, but can often diverge.
● SSP: Stale Synchronous Parallel
○ Relax the constraints, so the slowest workers can be up to K iterations behind the fastest ones.
○ Allows for progress, while keeping convergence guarantees.
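A rough illustration of the SSP rule (the names SspClocks and mayProceed are assumptions for this sketch, not a FlinkML API): a worker may start its next iteration only while it is at most K iterations ahead of the slowest worker, so a staleness of 0 recovers BSP and an unbounded staleness recovers ASP.
// Hypothetical sketch of the SSP staleness check; names are illustrative.
final case class SspClocks(clocks: Map[Int, Int], staleness: Int) {
  // A worker may proceed only if it is at most `staleness` iterations ahead of the slowest one.
  def mayProceed(workerId: Int): Boolean =
    clocks(workerId) - clocks.values.min <= staleness
}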
65. Coming soon
● Tooling
○ Evaluation & cross-validation framework
○ Predictive Model Markup Language
● Algorithms
○ Quad-tree kNN search
○ Efficient streaming decision trees
○ k-means and extensions
○ Column-wise statistics, histograms
66. FlinkML Roadmap
● Hyper-parameter optimization
● More communication-efficient optimization algorithms
● Generalized Linear Models
● Latent Dirichlet Allocation
68. Future of FlinkML
● Streaming ML
○ Flink already has SAMOA bindings.
○ We plan to kickstart the streaming ML library of Flink, and develop new algorithms.
● “Computation-efficient” learning
○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning with modest computing resources.
73. References
● Flink Project: flink.apache.org
● FlinkML Docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
● Léon Bottou: Learning with Large Datasets
● Wired: Computer Brain Escapes Google's X Lab to Supercharge Search
● Smith: CoCoA AMPCAMP Presentation
● CMU Petuum: Petuum Project
● Jaggi (2014): "Communication-efficient distributed dual coordinate ascent." NIPS 2014.
● Li (2014): "Scaling distributed machine learning with the parameter server." OSDI 2014.
● Ho (2013): "More effective distributed ML via a stale synchronous parallel parameter server." NIPS 2013.
● Peel (2015): "Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism." IEEE BigData 2015.
● Xing (2015): "Petuum: A New Platform for Distributed Machine Learning on Big Data." KDD 2015.
I would like to thank Professor Eric Xing for his permission to use parts of the structure from his great tutorial on large-scale machine learning: A New Look at the System, Algorithm and Theory Foundations of Distributed Machine Learning.