SlideShare a Scribd company logo
1 of 84
Download to read offline
FlinkML: Large-scale Machine
Learning with Apache Flink
Theodore Vasiloudis, SICS
SICS Data Science Day
October 21st, 2015
Apache Flink
What is Apache Flink?
● Large-scale data processing engine
● Easy and powerful APIs for batch and real-time streaming analysis
● Backed by a very robust execution backend
○ true streaming dataflow engine
○ custom memory manager
○ native iterations
○ cost-based optimizer
What is Apache Flink?
What does Flink give us?
● Expressive APIs
● Pipelined stream processor
● Closed loop iterations
Expressive APIs
● Main distributed data abstraction: DataSet
● Program using functional-style transformations, creating a Dataflow.
case class Word(word: String, frequency: Int)
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap(line => line.split(“ “).map(word => Word(word, 1))
.groupBy(“word”).sum(“frequency”)
.print()
Pipelined Stream Processor
Iterate in the Dataflow
Iterate by looping
● Loop in client submits one job per iteration step
● Reuse data by caching in memory or disk
Iterate in the Dataflow
Delta iterations
Delta iterations
Learn more in
Vasia’s Gelly talk!
Large-scale Machine Learning
What do we mean?
What do we mean?
● Small-scale learning ● Large-scale learning
Source: Léon Bottou
What do we mean?
● Small-scale learning
○ We have a small-scale learning problem
when the active budget constraint is the
number of examples.
● Large-scale learning
Source: Léon Bottou
What do we mean?
● Small-scale learning
○ We have a small-scale learning problem
when the active budget constraint is the
number of examples.
● Large-scale learning
○ We have a large-scale learning problem
when the active budget constraint is the
computing time.
Source: Léon Bottou
What do we mean?
● What about the complexity of the problem?
What do we mean?
● What about the complexity of the problem?
Source: Wired Magazine
Deep learning
What do we mean?
● What about the complexity of the problem?
“When you get to a trillion [parameters], you’re getting to something that’s got a chance
of really understanding some stuff.” - Hinton, 2013
Source: Wired Magazine
What do we mean?
● We have a large-scale learning problem when the active budget constraint
is the computing time and/or the model complexity.
FlinkML
FlinkML
● New effort to bring large-scale machine learning to Flink
FlinkML
● New effort to bring large-scale machine learning to Flink
● Goals:
○ Truly scalable implementations
○ Keep glue code to a minimum
○ Ease of use
FlinkML: Overview
● Supervised Learning
○ Optimization framework
○ SVM
○ Multiple linear regression
FlinkML: Overview
● Supervised Learning
○ Optimization framework
○ SVM
○ Multiple linear regression
● Recommendation
○ Alternating Least Squares (ALS)
FlinkML: Overview
● Supervised Learning
○ Optimization framework
○ SVM
○ Multiple linear regression
● Recommendation
○ Alternating Least Squares (ALS)
● Pre-processing
○ Polynomial features
○ Feature scaling
FlinkML: Overview
● Supervised Learning
○ Optimization framework
○ SVM
○ Multiple linear regression
● Recommendation
○ Alternating Least Squares (ALS)
● Pre-processing
○ Polynomial features
○ Feature scaling
● sklearn-like ML pipelines
FlinkML API
// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
FlinkML API
// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
val mlr = MultipleLinearRegression()
.setStepsize(0.01)
.setIterations(100)
.setConvergenceThreshold(0.001)
FlinkML API
// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
val mlr = MultipleLinearRegression()
.setStepsize(0.01)
.setIterations(100)
.setConvergenceThreshold(0.001)
mlr.fit(trainingData)
FlinkML API
// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
val mlr = MultipleLinearRegression()
.setStepsize(0.01)
.setIterations(100)
.setConvergenceThreshold(0.001)
mlr.fit(trainingData)
// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
FlinkML Pipelines
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()
FlinkML Pipelines
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()
// Construct pipeline of standard scaler, polynomial features and multiple linear
// regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
FlinkML Pipelines
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()
// Construct pipeline of standard scaler, polynomial features and multiple linear
// regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
// Train pipeline
pipeline.fit(trainingData)
// Calculate predictions
val predictions = pipeline.predict(testingData)
State of the art in large-scale ML
Alternating Least Squares
R ≅ X Y✕Users
Items
Naive Alternating Least Squares
Blocked Alternating Least Squares
Blocked ALS performance
FlinkML blocked ALS performance
Going beyond SGD in large-scale
optimization
● Beyond SGD → Use Primal-Dual framework
● Slow updates → Immediately apply local updates
● Average over batch size → Average over K (nodes) << batch size
CoCoA: Communication Efficient Coordinate
Ascent
Primal-dual framework
Source: Smith
(2014)
Primal-dual framework
Source: Smith
(2014)
Immediately Apply Updates
Source: Smith
(2014)
Immediately Apply Updates
Source: Smith
(2014)
Source: Smith
(2014)
Average over nodes (K) instead of batches
Source: Smith
(2014)
CoCoA: Communication Efficient Coordinate
Ascent
CoCoA performance
Source:
Jaggi
(2014)
CoCoA performance
Available on FlinkML
SVM
Achieving model parallelism:
The parameter server
● The parameter server is essentially a distributed key-value store with two
basic commands: push and pull
○ push updates the model
○ pull retrieves a (lazily) updated model
● Allows us to store a model into multiple nodes, read and update it as
needed.
Architecture of a parameter server communicating with groups of workers.
Source:
Li (2014)
Comparison with other large-scale learning systems.
Source:
Li (2014)
Dealing with stragglers: SSP Iterations
● BSP: Bulk Synchronous parallel
○ Every worker needs to wait for the others to finish before starting the next iteration
Dealing with stragglers: SSP Iterations
● BSP: Bulk Synchronous parallel
○ Every worker needs to wait for the others to finish before starting the next iteration
● ASP: Asynchronous parallel
○ Every worker can work individually, update model as needed.
Dealing with stragglers: SSP Iterations
● BSP: Bulk Synchronous parallel
○ Every worker needs to wait for the others to finish before starting the next iteration
● ASP: Asynchronous parallel
○ Every worker can work individually, update model as needed.
○ Can be fast, but can often diverge.
Dealing with stragglers: SSP Iterations
● BSP: Bulk Synchronous parallel
○ Every worker needs to wait for the others to finish before starting the next iteration
● ASP: Asynchronous parallel
○ Every worker can work individually, update model as needed.
○ Can be fast, but can often diverge.
● SSP: State Synchronous parallel
○ Relax constraints, so slowest workers can be up to K iterations behind fastest ones.
Dealing with stragglers: SSP Iterations
● BSP: Bulk Synchronous parallel
○ Every worker needs to wait for the others to finish before starting the next iteration
● ASP: Asynchronous parallel
○ Every worker can work individually, update model as needed.
○ Can be fast, but can often diverge.
● SSP: State Synchronous parallel
○ Relax constraints, so slowest workers can be up to K iterations behind fastest ones.
○ Allows for progress, while keeping convergence guarantees.
Dealing with stragglers: SSP Iterations
Dealing with stragglers: SSP Iterations
Source: Ho et al.
(2013)
SSP Iterations in Flink: Lasso Regression
Source: Peel et
al. (2015)
SSP Iterations in Flink: Lasso Regression
Source: Peel et
al. (2015)
To be merged soon
into FlinkML
Current and future work on
FlinkML
Coming soon
● Tooling
○ Evaluation & cross-validation framework
○ Predictive Model Markup Language
● Algorithms
○ Quad-tree kNN search
○ Efficient streaming decision trees
○ k-means and extensions
○ Colum-wise statistics, histograms
FlinkML Roadmap
● Hyper-parameter optimization
● More communication-efficient optimization algorithms
● Generalized Linear Models
● Latent Dirichlet Allocation
Future of Machine Learning on Flink
● Streaming ML
○ Flink already has SAMOA bindings.
○ We plan to kickstart the streaming ML library of Flink, and develop new algorithms.
Future of FlinkML
● Streaming ML
○ Flink already has SAMOA bindings.
○ We plan to kickstart the streaming ML library of Flink, and develop new algorithms.
● “Computation efficient” learning
○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning
with modest computing resources.
Recent large-scale learning systems
Source: Xing (2015)
Recent large-scale learning systems
Source: Xing (2015)
How to
get here?
Demo?
Thank you.
@thvasilo
tvas@sics.se
References
● Flink Project: flink.apache.org
● FlinkML Docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
● Leon Botou: Learning with Large Datasets
● Wired: Computer Brain Escapes Google's X Lab to Supercharge Search
● Smith: CoCoA AMPCAMP Presentation
● CMU Petuum: Petuum Project
● Jaggi (2014): “Communication-efficient distributed dual coordinate ascent." NIPS 2014.
● Li (2014): "Scaling distributed machine learning with the parameter server." OSDI 2014.
● Ho (2013): "More effective distributed ML via a stale synchronous parallel parameter server." NIPS
2013.
● Peel (2015): “Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism”, IEEE BigData
2015
● Xing (2015): “Petuum: A New Platform for Distributed Machine Learning on Big Data”, KDD 2015
I would like to thank professor Eric Xing for his permission to use parts of the structure from his great tutorial
on large-scale machine learning: A New Look at the System, Algorithm and Theory Foundations of Distributed
Machine Learning
“Demo”
“Demo”
“Demo”
“Demo”
“Demo”
“Demo”
“Demo”
“Demo”
Thank you.
@thvasilo
tvas@sics.se

More Related Content

What's hot

Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetAnkit Beohar
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!Guido Schmutz
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connectconfluent
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLSeveralnines
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®confluent
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connectKnoldus Inc.
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 

What's hot (20)

Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs dataset
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
 
Kafka vs kinesis
Kafka vs kinesisKafka vs kinesis
Kafka vs kinesis
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®Introduction to KSQL: Streaming SQL for Apache Kafka®
Introduction to KSQL: Streaming SQL for Apache Kafka®
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 

Similar to FlinkML: Large Scale Machine Learning with Apache Flink

FlinkML - Big data application meetup
FlinkML - Big data application meetupFlinkML - Big data application meetup
FlinkML - Big data application meetupTheodoros Vasiloudis
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemDatabricks
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
Pitfalls of machine learning in production
Pitfalls of machine learning in productionPitfalls of machine learning in production
Pitfalls of machine learning in productionAntoine Sauray
 
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsMachine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsLightbend
 
Operationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsOperationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsLightbend
 
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Databricks
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Libraryjeykottalam
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger AnalyticsItzhak Kameli
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Databricks
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
Deploying Machine Learning Models with Pulsar Functions - Pulsar Summit Asia...
Deploying Machine Learning Models with Pulsar Functions  - Pulsar Summit Asia...Deploying Machine Learning Models with Pulsar Functions  - Pulsar Summit Asia...
Deploying Machine Learning Models with Pulsar Functions - Pulsar Summit Asia...StreamNative
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Databricks
 
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and IstioAdvanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and IstioAnimesh Singh
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Srivatsan Ramanujam
 
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVM
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVMVoxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVM
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVMManuel Bernhardt
 
DutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationDutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationBigML, Inc
 
Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Thomas Weise
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward
 

Similar to FlinkML: Large Scale Machine Learning with Apache Flink (20)

FlinkML - Big data application meetup
FlinkML - Big data application meetupFlinkML - Big data application meetup
FlinkML - Big data application meetup
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving System
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Pitfalls of machine learning in production
Pitfalls of machine learning in productionPitfalls of machine learning in production
Pitfalls of machine learning in production
 
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsMachine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
 
Operationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsOperationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML Models
 
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Deploying Machine Learning Models with Pulsar Functions - Pulsar Summit Asia...
Deploying Machine Learning Models with Pulsar Functions  - Pulsar Summit Asia...Deploying Machine Learning Models with Pulsar Functions  - Pulsar Summit Asia...
Deploying Machine Learning Models with Pulsar Functions - Pulsar Summit Asia...
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
 
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and IstioAdvanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVM
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVMVoxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVM
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVM
 
DutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationDutchMLSchool 2022 - Automation
DutchMLSchool 2022 - Automation
 
Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
 

Recently uploaded

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 

Recently uploaded (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 

FlinkML: Large Scale Machine Learning with Apache Flink

  • 1. FlinkML: Large-scale Machine Learning with Apache Flink Theodore Vasiloudis, SICS SICS Data Science Day October 21st, 2015
  • 3. What is Apache Flink? ● Large-scale data processing engine ● Easy and powerful APIs for batch and real-time streaming analysis ● Backed by a very robust execution backend ○ true streaming dataflow engine ○ custom memory manager ○ native iterations ○ cost-based optimizer
  • 4. What is Apache Flink?
  • 5. What does Flink give us? ● Expressive APIs ● Pipelined stream processor ● Closed loop iterations
  • 6. Expressive APIs ● Main distributed data abstraction: DataSet ● Program using functional-style transformations, creating a Dataflow. case class Word(word: String, frequency: Int) val lines: DataSet[String] = env.readTextFile(...) lines.flatMap(line => line.split(“ “).map(word => Word(word, 1)) .groupBy(“word”).sum(“frequency”) .print()
  • 8. Iterate in the Dataflow
  • 9. Iterate by looping ● Loop in client submits one job per iteration step ● Reuse data by caching in memory or disk
  • 10. Iterate in the Dataflow
  • 12. Delta iterations Learn more in Vasia’s Gelly talk!
  • 14. What do we mean?
  • 15. What do we mean? ● Small-scale learning ● Large-scale learning Source: Léon Bottou
  • 16. What do we mean? ● Small-scale learning ○ We have a small-scale learning problem when the active budget constraint is the number of examples. ● Large-scale learning Source: Léon Bottou
  • 17. What do we mean? ● Small-scale learning ○ We have a small-scale learning problem when the active budget constraint is the number of examples. ● Large-scale learning ○ We have a large-scale learning problem when the active budget constraint is the computing time. Source: Léon Bottou
  • 18. What do we mean? ● What about the complexity of the problem?
  • 19. What do we mean? ● What about the complexity of the problem? Source: Wired Magazine
  • 21. What do we mean? ● What about the complexity of the problem? “When you get to a trillion [parameters], you’re getting to something that’s got a chance of really understanding some stuff.” - Hinton, 2013 Source: Wired Magazine
  • 22. What do we mean? ● We have a large-scale learning problem when the active budget constraint is the computing time and/or the model complexity.
  • 24. FlinkML ● New effort to bring large-scale machine learning to Flink
  • 25. FlinkML ● New effort to bring large-scale machine learning to Flink ● Goals: ○ Truly scalable implementations ○ Keep glue code to a minimum ○ Ease of use
  • 26. FlinkML: Overview ● Supervised Learning ○ Optimization framework ○ SVM ○ Multiple linear regression
  • 27. FlinkML: Overview ● Supervised Learning ○ Optimization framework ○ SVM ○ Multiple linear regression ● Recommendation ○ Alternating Least Squares (ALS)
  • 28. FlinkML: Overview ● Supervised Learning ○ Optimization framework ○ SVM ○ Multiple linear regression ● Recommendation ○ Alternating Least Squares (ALS) ● Pre-processing ○ Polynomial features ○ Feature scaling
  • 29. FlinkML: Overview ● Supervised Learning ○ Optimization framework ○ SVM ○ Multiple linear regression ● Recommendation ○ Alternating Least Squares (ALS) ● Pre-processing ○ Polynomial features ○ Feature scaling ● sklearn-like ML pipelines
  • 30. FlinkML API // LabeledVector is a feature vector with a label (class or real value) val trainingData: DataSet[LabeledVector] = ... val testingData: DataSet[Vector] = ...
  • 31. FlinkML API // LabeledVector is a feature vector with a label (class or real value) val trainingData: DataSet[LabeledVector] = ... val testingData: DataSet[Vector] = ... val mlr = MultipleLinearRegression() .setStepsize(0.01) .setIterations(100) .setConvergenceThreshold(0.001)
  • 32. FlinkML API // LabeledVector is a feature vector with a label (class or real value) val trainingData: DataSet[LabeledVector] = ... val testingData: DataSet[Vector] = ... val mlr = MultipleLinearRegression() .setStepsize(0.01) .setIterations(100) .setConvergenceThreshold(0.001) mlr.fit(trainingData)
  • 33. FlinkML API // LabeledVector is a feature vector with a label (class or real value) val trainingData: DataSet[LabeledVector] = ... val testingData: DataSet[Vector] = ... val mlr = MultipleLinearRegression() .setStepsize(0.01) .setIterations(100) .setConvergenceThreshold(0.001) mlr.fit(trainingData) // The fitted model can now be used to make predictions val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
  • 34. FlinkML Pipelines val scaler = StandardScaler() val polyFeatures = PolynomialFeatures().setDegree(3) val mlr = MultipleLinearRegression()
  • 35. FlinkML Pipelines val scaler = StandardScaler() val polyFeatures = PolynomialFeatures().setDegree(3) val mlr = MultipleLinearRegression() // Construct pipeline of standard scaler, polynomial features and multiple linear // regression val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
  • 36. FlinkML Pipelines val scaler = StandardScaler() val polyFeatures = PolynomialFeatures().setDegree(3) val mlr = MultipleLinearRegression() // Construct pipeline of standard scaler, polynomial features and multiple linear // regression val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr) // Train pipeline pipeline.fit(trainingData) // Calculate predictions val predictions = pipeline.predict(testingData)
  • 37. State of the art in large-scale ML
  • 38. Alternating Least Squares R ≅ X Y✕Users Items
  • 41. Blocked ALS performance FlinkML blocked ALS performance
  • 42. Going beyond SGD in large-scale optimization
  • 43. ● Beyond SGD → Use Primal-Dual framework ● Slow updates → Immediately apply local updates ● Average over batch size → Average over K (nodes) << batch size CoCoA: Communication Efficient Coordinate Ascent
  • 47. Immediately Apply Updates Source: Smith (2014) Source: Smith (2014)
  • 48. Average over nodes (K) instead of batches Source: Smith (2014)
  • 49. CoCoA: Communication Efficient Coordinate Ascent
  • 52. Achieving model parallelism: The parameter server ● The parameter server is essentially a distributed key-value store with two basic commands: push and pull ○ push updates the model ○ pull retrieves a (lazily) updated model ● Allows us to store a model into multiple nodes, read and update it as needed.
  • 53. Architecture of a parameter server communicating with groups of workers. Source: Li (2014)
  • 54. Comparison with other large-scale learning systems. Source: Li (2014)
  • 55. Dealing with stragglers: SSP Iterations
  • 56. ● BSP: Bulk Synchronous parallel ○ Every worker needs to wait for the others to finish before starting the next iteration Dealing with stragglers: SSP Iterations
  • 57. ● BSP: Bulk Synchronous parallel ○ Every worker needs to wait for the others to finish before starting the next iteration ● ASP: Asynchronous parallel ○ Every worker can work individually, update model as needed. Dealing with stragglers: SSP Iterations
  • 58. ● BSP: Bulk Synchronous parallel ○ Every worker needs to wait for the others to finish before starting the next iteration ● ASP: Asynchronous parallel ○ Every worker can work individually, update model as needed. ○ Can be fast, but can often diverge. Dealing with stragglers: SSP Iterations
  • 59. ● BSP: Bulk Synchronous parallel ○ Every worker needs to wait for the others to finish before starting the next iteration ● ASP: Asynchronous parallel ○ Every worker can work individually, update model as needed. ○ Can be fast, but can often diverge. ● SSP: State Synchronous parallel ○ Relax constraints, so slowest workers can be up to K iterations behind fastest ones. Dealing with stragglers: SSP Iterations
  • 60. ● BSP: Bulk Synchronous parallel ○ Every worker needs to wait for the others to finish before starting the next iteration ● ASP: Asynchronous parallel ○ Every worker can work individually, update model as needed. ○ Can be fast, but can often diverge. ● SSP: State Synchronous parallel ○ Relax constraints, so slowest workers can be up to K iterations behind fastest ones. ○ Allows for progress, while keeping convergence guarantees. Dealing with stragglers: SSP Iterations
  • 61. Dealing with stragglers: SSP Iterations Source: Ho et al. (2013)
  • 62. SSP Iterations in Flink: Lasso Regression Source: Peel et al. (2015)
  • 63. SSP Iterations in Flink: Lasso Regression Source: Peel et al. (2015) To be merged soon into FlinkML
  • 64. Current and future work on FlinkML
  • 65. Coming soon ● Tooling ○ Evaluation & cross-validation framework ○ Predictive Model Markup Language ● Algorithms ○ Quad-tree kNN search ○ Efficient streaming decision trees ○ k-means and extensions ○ Colum-wise statistics, histograms
  • 66. FlinkML Roadmap ● Hyper-parameter optimization ● More communication-efficient optimization algorithms ● Generalized Linear Models ● Latent Dirichlet Allocation
  • 67. Future of Machine Learning on Flink ● Streaming ML ○ Flink already has SAMOA bindings. ○ We plan to kickstart the streaming ML library of Flink, and develop new algorithms.
  • 68. Future of FlinkML ● Streaming ML ○ Flink already has SAMOA bindings. ○ We plan to kickstart the streaming ML library of Flink, and develop new algorithms. ● “Computation efficient” learning ○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning with modest computing resources.
  • 69. Recent large-scale learning systems Source: Xing (2015)
  • 70. Recent large-scale learning systems Source: Xing (2015) How to get here?
  • 71. Demo?
  • 73. References ● Flink Project: flink.apache.org ● FlinkML Docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ ● Leon Botou: Learning with Large Datasets ● Wired: Computer Brain Escapes Google's X Lab to Supercharge Search ● Smith: CoCoA AMPCAMP Presentation ● CMU Petuum: Petuum Project ● Jaggi (2014): “Communication-efficient distributed dual coordinate ascent." NIPS 2014. ● Li (2014): "Scaling distributed machine learning with the parameter server." OSDI 2014. ● Ho (2013): "More effective distributed ML via a stale synchronous parallel parameter server." NIPS 2013. ● Peel (2015): “Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism”, IEEE BigData 2015 ● Xing (2015): “Petuum: A New Platform for Distributed Machine Learning on Big Data”, KDD 2015 I would like to thank professor Eric Xing for his permission to use parts of the structure from his great tutorial on large-scale machine learning: A New Look at the System, Algorithm and Theory Foundations of Distributed Machine Learning
  • 75.
  • 82.