My slides from the Big Data Applications meetup on the 27th of July, talking about FlinkML, along with some notes on open-source ML development and an illustration of interactive Flink machine learning with Apache Zeppelin.
FlinkML - Big data application meetup
1. FlinkML: Large-scale Machine
Learning with Apache Flink
Theodore Vasiloudis, Swedish Institute of Computer Science (SICS)
Big Data Application Meetup
July 27th, 2016
6. What do we mean?
● Small-scale learning
○ We have a small-scale learning problem
when the active budget constraint is the
number of examples.
● Large-scale learning
○ We have a large-scale learning problem
when the active budget constraint is the
computing time.
Source: Léon Bottou
8. What is Apache Flink?
● Distributed stream and batch data processing engine
● Easy and powerful APIs for batch and real-time streaming analysis
● Backed by a very robust execution backend
○ true streaming dataflow engine
○ custom memory manager
○ native iterations
○ cost-based optimizer
10. What does Flink give us?
● Expressive APIs
● Pipelined stream processor
● Closed loop iterations
11. Expressive APIs
● Main bounded data abstraction: DataSet
● Program using functional-style transformations, creating a dataflow.
case class Word(word: String, frequency: Int)
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
.groupBy("word").sum("frequency")
.print()
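The same dataflow can be sketched with plain Scala collections, which share the functional-style transformation API the slide demonstrates. This is a single-machine illustration only, not Flink's distributed `DataSet` (the grouping and summing are expressed with `groupBy`/`map` over in-memory sequences):

```scala
// Plain-Scala sketch of the word-count dataflow above, using the
// collections library instead of Flink's DataSet (illustration only).
case class Word(word: String, frequency: Int)

val lines = Seq("to be or", "not to be")

val counts: Map[String, Int] = lines
  .flatMap(line => line.split(" ").map(word => Word(word, 1))) // tokenize
  .groupBy(_.word)                                             // like groupBy("word")
  .map { case (w, ws) => w -> ws.map(_.frequency).sum }        // like sum("frequency")
```

In Flink the same chain of transformations builds a lazy dataflow graph that only executes on a sink such as `print()`; with collections it runs eagerly.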
20. FlinkML
● New effort to bring large-scale machine learning to Apache Flink
● Goals:
○ Truly scalable implementations
○ Keep glue code to a minimum
○ Ease of use
26. FlinkML: Overview
● Supervised Learning
○ Optimization framework
○ Support Vector Machine
○ Multiple linear regression
● Recommendation
○ Alternating Least Squares (ALS)
● Pre-processing
○ Polynomial features
○ Feature scaling
● Unsupervised learning
○ Quad-tree exact kNN search
● sklearn-like ML pipelines
30. FlinkML API
// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
val mlr = MultipleLinearRegression()
.setStepsize(0.01)
.setIterations(100)
.setConvergenceThreshold(0.001)
mlr.fit(trainingData)
// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
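To make the `setStepsize`/`setIterations`/`setConvergenceThreshold` parameters concrete, here is a toy single-machine gradient-descent loop for a one-feature linear model. This is a simplified sketch of the kind of optimization loop a linear-regression learner runs, not FlinkML's actual (distributed) implementation; `fitLinear` and its signature are made up for illustration:

```scala
// Toy gradient descent for y ≈ w*x + b: step against the squared-loss
// gradient until the update magnitude falls below a convergence threshold
// or the iteration budget is exhausted (cf. the setters above).
def fitLinear(data: Seq[(Double, Double)],   // (feature, label) pairs
              stepsize: Double,
              iterations: Int,
              threshold: Double): (Double, Double) = {
  var (w, b) = (0.0, 0.0)
  var iter = 0
  var delta = Double.MaxValue
  while (iter < iterations && delta > threshold) {
    val n = data.size
    // Gradients of mean squared error with respect to w and b
    val gw = data.map { case (x, y) => 2 * ((w * x + b) - y) * x }.sum / n
    val gb = data.map { case (x, y) => 2 * ((w * x + b) - y) }.sum / n
    delta = math.abs(stepsize * gw) + math.abs(stepsize * gb)
    w -= stepsize * gw
    b -= stepsize * gb
    iter += 1
  }
  (w, b)
}

// Fit y = 2x on a few points: w approaches 2.0, b approaches 0.0
val (w, b) = fitLinear(Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)),
                       stepsize = 0.05, iterations = 500, threshold = 0.001)
```

The convergence threshold trades accuracy for compute: a looser threshold stops earlier, which matters in the large-scale regime where computing time is the budget constraint.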
33. FlinkML Pipelines
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()
// Construct pipeline of standard scaler, polynomial features and multiple linear
// regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
// Train pipeline
pipeline.fit(trainingData)
// Calculate predictions
val predictions = pipeline.predict(testingData)
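The chaining pattern above can be sketched in a few lines of plain Scala. The `Transformer`/`Predictor` traits here are simplified stand-ins, not FlinkML's actual interfaces, and the `Centerer`/`SignModel` stages are invented for illustration; the point is how `chainPredictor` composes `fit` and `predict` so each stage is fitted on the output of the previous one:

```scala
// Simplified sketch of the chaining pattern behind chainPredictor:
// fitting the chained pipeline fits the transformer, transforms the
// data, and fits the predictor on the transformed data.
trait Predictor {
  def fit(data: Seq[Double]): Unit
  def predict(data: Seq[Double]): Seq[Double]
}

trait Transformer {
  def fit(data: Seq[Double]): Unit
  def transform(data: Seq[Double]): Seq[Double]
  def chainPredictor(p: Predictor): Predictor = {
    val self = this
    new Predictor {
      def fit(data: Seq[Double]): Unit = {
        self.fit(data)
        p.fit(self.transform(data))   // downstream stage sees transformed data
      }
      def predict(data: Seq[Double]): Seq[Double] =
        p.predict(self.transform(data))
    }
  }
}

// Toy stages: a scaler that learns the mean, and a "model" that
// predicts the sign of its (centered) input.
class Centerer extends Transformer {
  private var mean = 0.0
  def fit(data: Seq[Double]): Unit = mean = data.sum / data.size
  def transform(data: Seq[Double]): Seq[Double] = data.map(_ - mean)
}

class SignModel extends Predictor {
  def fit(data: Seq[Double]): Unit = ()  // nothing to learn in this toy model
  def predict(data: Seq[Double]): Seq[Double] = data.map(math.signum)
}

val pipeline = new Centerer().chainPredictor(new SignModel())
pipeline.fit(Seq(1.0, 2.0, 3.0))            // Centerer learns mean = 2.0
val preds = pipeline.predict(Seq(0.0, 4.0)) // centered to -2.0, 2.0
```

This is what keeps glue code to a minimum: pre-processing and learning compose into one object with a single `fit`/`predict` interface, as in sklearn.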
53. Dealing with stragglers: SSP Iterations
● BSP: Bulk Synchronous Parallel
○ Every worker needs to wait for the others to finish before starting the next iteration.
● ASP: Asynchronous Parallel
○ Every worker can work individually, updating the model as needed.
○ Can be fast, but can often diverge.
● SSP: Stale Synchronous Parallel
○ Relax the constraints, so the slowest workers can be up to K iterations behind the fastest ones.
○ Allows for progress, while keeping convergence guarantees.
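The staleness bound at the heart of SSP can be captured in one predicate. The sketch below is an invented simplification (real SSP implementations track per-worker clocks in a parameter server): a worker may start its next iteration only if it is at most K iterations ahead of the slowest worker, so K = 0 degenerates to BSP lockstep, while unbounded K would be ASP:

```scala
// SSP staleness check: a worker at iteration `mine` may proceed only if
// it is at most `k` iterations ahead of the slowest worker.
def canProceed(mine: Int, workerIters: Seq[Int], k: Int): Boolean =
  mine - workerIters.min <= k

val iters = Seq(3, 5, 6)  // current iteration of each worker

canProceed(5, iters, 0)   // false: with k = 0 (BSP) it must wait for the worker at 3
canProceed(5, iters, 2)   // true: 5 - 3 = 2 is within the staleness bound
canProceed(6, iters, 2)   // false: 6 - 3 = 3 exceeds k = 2, so this worker waits
```

Bounding the staleness is what preserves the convergence guarantees: no worker ever computes on a model view more than K iterations out of date.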
61. Current work
● Tooling
○ Evaluation & cross-validation framework
○ Distributed linear algebra
○ Streaming predictors
● Algorithms
○ Implicit ALS
○ Multi-layer perceptron
○ Efficient streaming decision trees
○ Column-wise statistics, histograms
63. Future of Machine Learning on Flink
● Streaming ML
○ Flink already has SAMOA bindings.
○ Preliminary work has already started to implement state-of-the-art algorithms and develop new techniques.
● “Computation-efficient” learning
○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning with modest computing resources.