FlinkML: Large-scale Machine Learning with Apache Flink
Theodore Vasiloudis, Swedish Institute of Computer Science (SICS)
Big Data Application Meetup
July 27th, 2016
Large-scale Machine Learning
What do we mean?
● Small-scale learning
○ We have a small-scale learning problem
when the active budget constraint is the
number of examples.
● Large-scale learning
○ We have a large-scale learning problem
when the active budget constraint is the
computing time.
Source: Léon Bottou
Apache Flink
What is Apache Flink?
● Distributed stream and batch data processing engine
● Easy and powerful APIs for batch and real-time streaming analysis
● Backed by a very robust execution backend
○ true streaming dataflow engine
○ custom memory manager
○ native iterations
○ cost-based optimizer
What does Flink give us?
● Expressive APIs
● Pipelined stream processor
● Closed loop iterations
Expressive APIs
● Main bounded data abstraction: DataSet
● Program using functional-style transformations, creating a dataflow.
case class Word(word: String, frequency: Int)
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
  .groupBy("word").sum("frequency")
  .print()
Pipelined Stream Processor
Iterate in the dataflow
Iterate by looping
● Loop in client submits one job per iteration step
● Reuse data by caching in memory or disk
Iterate in the dataflow
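A minimal sketch of a native (bulk) iteration with the Scala DataSet API; the update rule here is purely illustrative:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val initial: DataSet[Double] = env.fromElements(1.0)

// iterate(10) embeds the loop in the dataflow itself: a single job is
// submitted, with no per-step job submission from the client
val result = initial.iterate(10) { current =>
  current.map(x => x / 2 + 1) // illustrative update rule
}

result.print()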
Delta iterations
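Delta iterations additionally maintain a workset of elements that may still change, so each step only touches the "hot" part of the state. A minimal sketch (the update rule and threshold are illustrative):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// (id, value) pairs: the solution set is the evolving state,
// the workset holds the elements that may still change
val initial: DataSet[(Long, Double)] = env.fromElements((1L, 1.0), (2L, 8.0))

val result = initial.iterateDelta(initial, 10, Array(0)) { (solution, workset) =>
  val deltas = workset.map(t => (t._1, t._2 / 2)) // illustrative update
  val nextWorkset = deltas.filter(_._2 > 0.1)     // drop converged elements
  (deltas, nextWorkset)                           // deltas merge into the solution set by key
}

result.print()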
Performance
Extending the Yahoo Streaming Benchmark
FlinkML
● New effort to bring large-scale machine learning to Apache Flink
● Goals:
○ Truly scalable implementations
○ Keep glue code to a minimum
○ Ease of use
FlinkML: Overview
● Supervised Learning
○ Optimization framework
○ Support Vector Machine
○ Multiple linear regression
● Recommendation
○ Alternating Least Squares (ALS)
● Pre-processing
○ Polynomial features
○ Feature scaling
● Unsupervised learning
○ Quad-tree exact kNN search
● sklearn-like ML pipelines
FlinkML API
// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
val mlr = MultipleLinearRegression()
.setStepsize(0.01)
.setIterations(100)
.setConvergenceThreshold(0.001)
mlr.fit(trainingData)
// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
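For completeness, a common way to obtain such a DataSet[LabeledVector] is FlinkML's LibSVM reader; the file path below is a placeholder:

import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.common.LabeledVector

val env = ExecutionEnvironment.getExecutionEnvironment
// Parses a LibSVM/SVMLight-formatted file into labeled feature vectors
val trainingData: DataSet[LabeledVector] = MLUtils.readLibSVM(env, "/path/to/train.libsvm")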
FlinkML Pipelines
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()
// Construct pipeline of standard scaler, polynomial features and multiple linear
// regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
// Train pipeline
pipeline.fit(trainingData)
// Calculate predictions
val predictions = pipeline.predict(testingData)
FlinkML: Focus on scalability
Alternating Least Squares
● Factor the ratings matrix R (users × items) into the product of two low-rank factor matrices: R ≈ X × Y
Naive Alternating Least Squares
Blocked Alternating Least Squares
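FlinkML's ALS follows the same fit/predict pattern as the other learners; a short sketch where the rating triples and parameter values are made up for the example:

import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

val env = ExecutionEnvironment.getExecutionEnvironment

// Ratings as (userId, itemId, rating) triples; the data here is illustrative
val ratings: DataSet[(Int, Int, Double)] = env.fromElements((1, 10, 4.0), (2, 10, 3.0))

val als = ALS()
  .setNumFactors(10) // dimensionality of the latent factors
  .setIterations(10)
  .setLambda(0.1)    // regularization strength
  .setBlocks(100)    // user/item blocks for the blocked formulation

als.fit(ratings)

// Predict ratings for (user, item) pairs
val predictions = als.predict(env.fromElements((1, 10)))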
Blocked ALS performance
FlinkML blocked ALS performance
Going beyond SGD in large-scale optimization
CoCoA: Communication-Efficient Coordinate Ascent
● Beyond SGD → use a primal-dual framework
● Slow updates → immediately apply local updates
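For reference, CoCoA targets the standard regularized loss minimization problem and its dual (formulation as in Jaggi et al., 2014), where \ell_i^* is the convex conjugate of the loss \ell_i and the columns of A are A_i = x_i / (\lambda n):

\min_{w} \; P(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^\top x_i)

\max_{\alpha} \; D(\alpha) = -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i)

Each worker runs a local dual solver on its own coordinates of α, and only the aggregated primal update to w is communicated per round.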
Primal-dual framework (Source: Smith, 2014)
Immediately Apply Updates (Source: Smith, 2014)
CoCoA performance (Source: Jaggi, 2014)
CoCoA performance: available in FlinkML as the SVM classifier
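A sketch of the CoCoA-based SVM, reusing env, trainingData and testingData from the FlinkML API example above; the parameter values are illustrative and follow the documented SVM parameters:

import org.apache.flink.ml.classification.SVM

val svm = SVM()
  .setBlocks(env.getParallelism) // data partitions, one local CoCoA solver each
  .setIterations(100)            // outer (communication) rounds
  .setRegularization(0.001)
  .setStepsize(0.1)

svm.fit(trainingData)
val predictions = svm.predict(testingData)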
Dealing with stragglers: SSP Iterations
● BSP: Bulk Synchronous Parallel
○ Every worker needs to wait for the others to finish before starting the next iteration.
● ASP: Asynchronous Parallel
○ Every worker can work individually, updating the model as needed.
○ Can be fast, but can often diverge.
● SSP: Stale Synchronous Parallel
○ Relax the constraints so that the slowest workers can be up to K iterations behind the fastest ones.
○ Allows for progress while keeping convergence guarantees.
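The bounded-staleness condition itself is simple; a toy sketch of the check (not Flink API, all names hypothetical):

// A worker may start its next iteration only while it is at most
// `staleness` clocks ahead of the slowest worker (staleness = 0 is BSP).
class SspClock(numWorkers: Int, staleness: Int) {
  private val clocks = Array.fill(numWorkers)(0)

  def canProceed(worker: Int): Boolean =
    clocks(worker) - clocks.min <= staleness

  def tick(worker: Int): Unit =
    clocks(worker) += 1
}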
Dealing with stragglers: SSP Iterations (Source: Ho et al., 2013)
SSP Iterations in Flink: Lasso Regression (Source: Peel et al., 2015)
PR submitted
Challenges in developing an open-source ML library
Challenges in open-source ML libraries
● Depth or breadth
● Design choices
● Testing
Challenges in open-source ML libraries
● Attracting developers
● What to commit
● Avoiding code rot
Current and future work on FlinkML
Current work
● Tooling
○ Evaluation & cross-validation framework
○ Distributed linear algebra
○ Streaming predictors
● Algorithms
○ Implicit ALS
○ Multi-layer perceptron
○ Efficient streaming decision trees
○ Column-wise statistics, histograms
Future of Machine Learning on Flink
● Streaming ML
○ Flink already has SAMOA bindings.
○ Preliminary work has already started: implement state-of-the-art algorithms and develop new techniques.
● “Computation efficient” learning
○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning
with modest computing resources.
Check it out:
@thvasilo
tvas@sics.se
flink.apache.org
ci.apache.org/projects/flink/flink-docs-master/libs/ml
“Demo”
Thank you
@thvasilo
tvas@sics.se
flink.apache.org
ci.apache.org/projects/flink/flink-docs-master/libs/ml
References
● Flink project: flink.apache.org
● FlinkML docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
● Léon Bottou: Learning with Large Datasets
● Smith (2014): CoCoA AMPCamp presentation
● Jaggi et al. (2014): “Communication-Efficient Distributed Dual Coordinate Ascent.” NIPS 2014.
● Ho et al. (2013): “More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server.” NIPS 2013.
● Peel et al. (2015): “Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism.” IEEE BigData 2015.
● Recent INRIA paper examining Spark vs. Flink (batch only)
● Extending the Yahoo streaming benchmark (and winning the Twitter Hack-Week with Flink)
● Also interesting: Bayesian anomaly detection in Flink
