FlinkML: Large-scale Machine Learning with Apache Flink
Theodore Vasiloudis, Swedish Institute of Computer Science (SICS)
Big Data Application Meetup
July 27th, 2016
Large-scale Machine Learning
What do we mean?
● Small-scale learning
○ We have a small-scale learning problem
when the active budget constraint is the
number of examples.
● Large-scale learning
○ We have a large-scale learning problem
when the active budget constraint is the
computing time.
Source: Léon Bottou
Apache Flink
What is Apache Flink?
● Distributed stream and batch data processing engine
● Easy and powerful APIs for batch and real-time streaming analysis
● Backed by a very robust execution backend
○ true streaming dataflow engine
○ custom memory manager
○ native iterations
○ cost-based optimizer
What does Flink give us?
● Expressive APIs
● Pipelined stream processor
● Closed loop iterations
Expressive APIs
● Main bounded data abstraction: DataSet
● Program using functional-style transformations, creating a dataflow.
case class Word(word: String, frequency: Int)
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
  .groupBy("word").sum("frequency")
  .print()
Pipelined Stream Processor
Iterate in the dataflow
Iterate by looping
● Loop in client submits one job per iteration step
● Reuse data by caching in memory or disk
Iterate in the dataflow
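A minimal sketch of a native (bulk) iteration with the Scala DataSet API; the update rule here is purely illustrative:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val initial: DataSet[Double] = env.fromElements(1.0)

// iterate(10) embeds the loop in the dataflow itself: a single job is
// submitted, with no per-step job submission from the client
val result = initial.iterate(10) { current =>
  current.map(x => x / 2 + 1) // illustrative update rule
}

result.print()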
Delta iterations
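Delta iterations additionally maintain a workset of elements that may still change, so each step only touches the "hot" part of the state. A minimal sketch (the update rule and threshold are illustrative):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// (id, value) pairs: the solution set is the evolving state,
// the workset holds the elements that may still change
val initial: DataSet[(Long, Double)] = env.fromElements((1L, 1.0), (2L, 8.0))

val result = initial.iterateDelta(initial, 10, Array(0)) { (solution, workset) =>
  val deltas = workset.map(t => (t._1, t._2 / 2)) // illustrative update
  val nextWorkset = deltas.filter(_._2 > 0.1)     // drop converged elements
  (deltas, nextWorkset)                           // deltas merge into the solution set by key
}

result.print()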
Performance
Extending the Yahoo Streaming Benchmark
FlinkML
● New effort to bring large-scale machine learning to Apache Flink
● Goals:
○ Truly scalable implementations
○ Keep glue code to a minimum
○ Ease of use
FlinkML: Overview
● Supervised Learning
○ Optimization framework
○ Support Vector Machine
○ Multiple linear regression
● Recommendation
○ Alternating Least Squares (ALS)
● Pre-processing
○ Polynomial features
○ Feature scaling
● Unsupervised learning
○ Quad-tree exact kNN search
● sklearn-like ML pipelines
FlinkML API
// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
val mlr = MultipleLinearRegression()
.setStepsize(0.01)
.setIterations(100)
.setConvergenceThreshold(0.001)
mlr.fit(trainingData)
// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
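For completeness, a common way to obtain such a DataSet[LabeledVector] is FlinkML's LibSVM reader; the file path below is a placeholder:

import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.common.LabeledVector

val env = ExecutionEnvironment.getExecutionEnvironment
// Parses a LibSVM/SVMLight-formatted file into labeled feature vectors
val trainingData: DataSet[LabeledVector] = MLUtils.readLibSVM(env, "/path/to/train.libsvm")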
FlinkML Pipelines
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()
// Construct pipeline of standard scaler, polynomial features and multiple linear
// regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
// Train pipeline
pipeline.fit(trainingData)
// Calculate predictions
val predictions = pipeline.predict(testingData)
FlinkML: Focus on scalability
Alternating Least Squares
● Factor the ratings matrix R (users × items) into the product of two low-rank factor matrices: R ≈ X × Y
Naive Alternating Least Squares
Blocked Alternating Least Squares
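FlinkML's ALS follows the same fit/predict pattern as the other learners; a short sketch where the rating triples and parameter values are made up for the example:

import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

val env = ExecutionEnvironment.getExecutionEnvironment

// Ratings as (userId, itemId, rating) triples; the data here is illustrative
val ratings: DataSet[(Int, Int, Double)] = env.fromElements((1, 10, 4.0), (2, 10, 3.0))

val als = ALS()
  .setNumFactors(10) // dimensionality of the latent factors
  .setIterations(10)
  .setLambda(0.1)    // regularization strength
  .setBlocks(100)    // user/item blocks for the blocked formulation

als.fit(ratings)

// Predict ratings for (user, item) pairs
val predictions = als.predict(env.fromElements((1, 10)))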
Blocked ALS performance
FlinkML blocked ALS performance
Going beyond SGD in large-scale optimization
CoCoA: Communication-Efficient Coordinate Ascent
● Beyond SGD → use a primal-dual framework
● Slow updates → immediately apply local updates
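For reference, CoCoA targets the standard regularized loss minimization problem and its dual (formulation as in Jaggi et al., 2014), where \ell_i^* is the convex conjugate of the loss \ell_i and the columns of A are A_i = x_i / (\lambda n):

\min_{w} \; P(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^\top x_i)

\max_{\alpha} \; D(\alpha) = -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i)

Each worker runs a local dual solver on its own coordinates of α, and only the aggregated primal update to w is communicated per round.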
Primal-dual framework (Source: Smith, 2014)
Immediately Apply Updates (Source: Smith, 2014)
CoCoA performance (Source: Jaggi, 2014)
CoCoA performance: available in FlinkML as the SVM classifier
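A sketch of the CoCoA-based SVM, reusing env, trainingData and testingData from the FlinkML API example above; the parameter values are illustrative and follow the documented SVM parameters:

import org.apache.flink.ml.classification.SVM

val svm = SVM()
  .setBlocks(env.getParallelism) // data partitions, one local CoCoA solver each
  .setIterations(100)            // outer (communication) rounds
  .setRegularization(0.001)
  .setStepsize(0.1)

svm.fit(trainingData)
val predictions = svm.predict(testingData)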
Dealing with stragglers: SSP Iterations
● BSP: Bulk Synchronous Parallel
○ Every worker needs to wait for the others to finish before starting the next iteration.
● ASP: Asynchronous Parallel
○ Every worker can work individually, updating the model as needed.
○ Can be fast, but can often diverge.
● SSP: Stale Synchronous Parallel
○ Relax the constraints so that the slowest workers can be up to K iterations behind the fastest ones.
○ Allows for progress while keeping convergence guarantees.
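The bounded-staleness condition itself is simple; a toy sketch of the check (not Flink API, all names hypothetical):

// A worker may start its next iteration only while it is at most
// `staleness` clocks ahead of the slowest worker (staleness = 0 is BSP).
class SspClock(numWorkers: Int, staleness: Int) {
  private val clocks = Array.fill(numWorkers)(0)

  def canProceed(worker: Int): Boolean =
    clocks(worker) - clocks.min <= staleness

  def tick(worker: Int): Unit =
    clocks(worker) += 1
}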
Dealing with stragglers: SSP Iterations (Source: Ho et al., 2013)
SSP Iterations in Flink: Lasso Regression (Source: Peel et al., 2015)
PR submitted
Challenges in developing an open-source ML library
Challenges in open-source ML libraries
● Depth or breadth
● Design choices
● Testing
Challenges in open-source ML libraries
● Attracting developers
● What to commit
● Avoiding code rot
Current and future work on FlinkML
Current work
● Tooling
○ Evaluation & cross-validation framework
○ Distributed linear algebra
○ Streaming predictors
● Algorithms
○ Implicit ALS
○ Multi-layer perceptron
○ Efficient streaming decision trees
○ Column-wise statistics, histograms
Future of Machine Learning on Flink
● Streaming ML
○ Flink already has SAMOA bindings.
○ Preliminary work has already started: implement state-of-the-art algorithms and develop new techniques.
● “Computation efficient” learning
○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning
with modest computing resources.
Check it out:
@thvasilo
tvas@sics.se
flink.apache.org
ci.apache.org/projects/flink/flink-docs-master/libs/ml
“Demo”
Thank you
@thvasilo
tvas@sics.se
flink.apache.org
ci.apache.org/projects/flink/flink-docs-master/libs/ml
References
● Flink project: flink.apache.org
● FlinkML docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
● Léon Bottou: Learning with Large Datasets
● Smith (2014): CoCoA AMPCamp presentation
● Jaggi et al. (2014): “Communication-Efficient Distributed Dual Coordinate Ascent.” NIPS 2014.
● Ho et al. (2013): “More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server.” NIPS 2013.
● Peel et al. (2015): “Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism.” IEEE BigData 2015.
● Recent INRIA paper examining Spark vs. Flink (batch only)
● Extending the Yahoo streaming benchmark (and winning the Twitter Hack-Week with Flink)
● Also interesting: Bayesian anomaly detection in Flink
