This document discusses building an AI ecosystem on Apache Flink. It proposes unifying batch and stream processing with Flink to avoid maintaining separate code bases. It also proposes using Flink throughout the machine learning pipeline for data acquisition, model training, validation, and serving. This includes enhancing Flink's support for deep learning, iteration, dynamic model serving, and Python APIs. The goal is to provide a one-stop solution for all data and machine learning processing needs within a single system and code base.
[FFE19] Build a Flink AI Ecosystem
1. Build a Flink AI Ecosystem
Jiangjie (Becket) Qin
Flink Forward Berlin 2019
2. Agenda
• Why AI Ecosystem on Flink?
• Flink ML Pipeline & Flink ML Libs
• Deep learning on Flink
• Enhanced Iteration & Dynamic Model Serving
• Better Python support
4. Lambda - what's everyone doing?
[Diagram: offline path - HDFS → Batch Processing (Spark/M-R); online path - Message Queue → Stream Processing (Flink/Storm); the two results are combined to answer queries]
• Two code bases for online and offline processing logic
• High maintenance cost
• Difficult to ensure consistent processing logic
5. Batch-Stream Processing Unification
• Use the same engine for online and offline processing (see the sketch below)
• Spark
• Flink
[Diagram: the same lambda layout with one engine - HDFS → Batch Processing (Flink/Spark) on the offline path, Message Queue → Stream Processing (Flink/Spark Streaming) on the online path, results combined to answer queries]
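A minimal sketch of what the unification buys, assuming the Flink 1.9-era Table API; the table names (hdfs_events, kafka_events) and the query are illustrative, not from the talk:

// The processing logic is written once and applied to both a bounded
// (HDFS-backed) table and an unbounded (message-queue-backed) table.
import org.apache.flink.table.api.{EnvironmentSettings, Table, TableEnvironment}

def countByUser(tEnv: TableEnvironment, tableName: String): Table =
  tEnv.sqlQuery(s"SELECT user_id, COUNT(*) AS cnt FROM $tableName GROUP BY user_id")

val tEnv = TableEnvironment.create(
  EnvironmentSettings.newInstance().inStreamingMode().build())
// Offline path: "hdfs_events" would be registered over files in HDFS.
val offlineCounts = countByUser(tEnv, "hdfs_events")
// Online path: "kafka_events" would be registered over a message queue.
val onlineCounts = countByUser(tEnv, "kafka_events")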
6. So what about ML?
• A typical ML scenario
• Offline training (TF, PyTorch, etc)
• Static models
• Online inference (Flink)
• The data preprocessing logic for training and inference is often maintained as two code bases
[Diagram: offline path - HDFS → Preprocessing → Offline Training → static model; online path - Preprocessing → Inference, loading the static model]
7. So what about ML?
• Online training is gaining popularity
• More timely model updates
• Dynamic model and continuous training
• Progressive validation
• More sophisticated monitoring and model deployment / rollback
[Diagram: online path - Message Queue → Preprocessing → Online Training → dynamic model → Inference]
8. “Lambda” architecture for ML
• Offline training: a static base model
• Online training: incremental updates to the base model
• Users have to deal with different systems / code bases
[Diagram: offline path - HDFS → Preprocessing → Offline Training → static model; online path - Message Queue → Preprocessing → Online Training → incremental updates → dynamic model → Inference; the static base model seeds both online training and inference]
9. Value of Flink
• Inference is latency-sensitive online / nearline processing
• Flink is a good option in this case
[Diagram: the same lambda ML architecture as above, with the Flink-served Inference stage highlighted as the latency-sensitive part]
10. Batch-Stream Unification in ML
• Online inference is latency-sensitive online / nearline processing
• Flink is a good option in this case
• Use Flink everywhere to avoid maintaining different code bases.
[Diagram: the same architecture with Flink running preprocessing, offline training, online training, and inference on both paths]
12. Flink AI Ecosystem By ML Stages
[Diagram: the lambda ML architecture (HDFS / Message Queue → Preprocessing → Offline / Online Training → static / dynamic model → Inference) annotated with the efforts and requirements per AI Flow ML stage:]
• Data Acquisition - rich connector support & dataset management
• Preprocessing - stream-batch unification, strong SQL support
• Model Training - enhanced iteration, Flink ML Lib, DL on Flink (TF, PyTorch)
• Model Validation & Serving / Inference - model validation, dynamic model serving, model management, rollout / rollback, online monitoring, online evaluation
• Across stages - Flink ML Pipeline, Python support
14. Agenda
• Why AI Ecosystem on Flink?
• Flink ML Pipeline & Flink ML Libs
• Deep learning on Flink
• Enhanced Iteration & Dynamic Model Serving
• Better Python support
15. Flink ML Pipeline - Overview
[Diagram: class hierarchy - PipelineStage is the base interface; Transformer and Estimator extend it, and a Model is the Transformer produced by an Estimator]
• Table-based ML Pipeline: a Transformer maps an input Table to an output Table (table2 = transformer.transform(table1)), a data -> data transition used for preprocessing and inference; an Estimator fits on a Table to produce a Model (estimator.fit(table2)), a data -> model transition used for model training.
• ML Lib developers implement the stages (K-Means, NaiveBayes, Linear Regression, DecisionTree, RandomForest, GBDT, ...); ML Lib users only compose them.
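To make the contracts concrete, here is a minimal sketch of the stage interfaces and their use; the interfaces are simplified from the FLIP-39 proposal (the real ones also thread a TableEnvironment and parameters through), and MyScaler / MyKMeans are hypothetical stages:

import org.apache.flink.table.api.Table

// Simplified FLIP-39-style contracts.
trait PipelineStage
trait Transformer extends PipelineStage {
  def transform(input: Table): Table        // data -> data (preprocessing, inference)
}
trait Model extends Transformer             // a fitted Transformer
trait Estimator extends PipelineStage {
  def fit(input: Table): Model              // data -> model (training)
}

// What an ML Lib developer provides (trivial bodies for the sketch).
class MyScaler extends Transformer {
  override def transform(input: Table): Table = input
}
class MyKMeans extends Estimator {
  override def fit(input: Table): Model = new Model {
    override def transform(input: Table): Table = input
  }
}

// What an ML Lib user writes.
val inputTable: Table = ???                 // some given source Table
val scaled: Table = new MyScaler().transform(inputTable)
val model: Model = new MyKMeans().fit(scaled)
val clusters: Table = model.transform(scaled)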
19. Value of Flink ML Pipeline
• Unify the APIs of model training and inference for the end users
• End users only need to deal with Estimators and Transformers
• Ensure consistent logic between training and inference
• The same pipeline topology used in training is persisted and reused for inference (sketched below)
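A sketch of that persistence, assuming the FLIP-39-style Pipeline API (appendStage / fit / toJson / loadJson; treat the exact names as an assumption) and the hypothetical stages from the earlier sketch:

// tEnv: a TableEnvironment; trainTable / inferenceTable: given Tables.
// Training side: compose, fit, and persist the pipeline.
val pipeline = new Pipeline()               // Pipeline as proposed in FLIP-39
pipeline.appendStage(new MyScaler())        // Transformer: kept as-is
pipeline.appendStage(new MyKMeans())        // Estimator: replaced by its Model when fitted
val fitted = pipeline.fit(tEnv, trainTable)
val json = fitted.toJson()                  // persist the fitted topology

// Inference side: reload the identical topology and serve.
val serving = new Pipeline()
serving.loadJson(json)
val scored = serving.transform(tEnv, inferenceTable)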
20. Agenda
• Why AI Ecosystem on Flink?
• Flink ML Pipeline & Flink ML Libs
• Deep learning on Flink
• Enhanced Iteration & Dynamic Model Serving
• Better Python support
21. Deep Learning Pipeline
[Diagram: Data Acquisition → Data Process and Transformation → Model Training → Test and Validation → Model Serving, with Model or Params Tuning feeding back into training]
22. Distributed TF framework in a Cluster/Environment
[Diagram: status quo - a distributed TF framework (WORKER x3, PS x2) produces the resulting model in its own cluster, while one Flink job (SOURCE, SOURCE → JOIN → UDTF) reads from external storage and a queue; together they cover the Data Acquisition, Data Process and Transformation, and Model Training stages of the deep learning pipeline]
24. TensorFlow-Flink Integration
[Diagram: before - a distributed TF framework (WORKERs, PSs) beside a separate Flink job; after - one single Flink job in which SOURCE, SOURCE → JOIN → UDTF feed embedded WORKER and PS operators that produce the resulting model]
25. DL on Flink and ML Pipeline integration
[Diagram: one single Flink job - the SOURCE → JOIN → UDTF preprocessing stages form a Transformer, and the embedded WORKER / PS operators form an Estimator that produces the resulting model]
The ML Pipeline API could be used for both traditional ML and deep learning.
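From the user's side, the integration could look like the following sketch; TFEstimator and its parameters are hypothetical stand-ins (the actual integration is the Flink-AI-Extended project), the point being that distributed TF training hides behind the same Estimator contract:

// Hypothetical Estimator that embeds distributed TF training in the Flink job.
class TFEstimator(numWorkers: Int, numPs: Int, script: String) extends Estimator {
  override def fit(input: Table): Model = {
    // 1. add numWorkers WORKER and numPs PS operators to the job graph
    // 2. stream the preprocessed `input` Table into the workers
    // 3. wrap the resulting model checkpoint as a Model for inference
    ???  // omitted in this sketch
  }
}

// Preprocessing (Transformer) and training (Estimator) in one Flink job.
val rawTable: Table = ???  // some given source Table
val features = new MyScaler().transform(rawTable)
val model = new TFEstimator(numWorkers = 3, numPs = 2, script = "train.py").fit(features)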
26. Agenda
• Why AI Ecosystem on Flink?
• Flink ML Pipeline & Flink ML Libs
• Deep learning on Flink
• Enhanced Iteration & Dynamic Model Serving
• Better Python support
27. Enhance Iteration in Flink
• Native iteration implemented by the processing engine
• A feedback edge on the processing DAG
• Address the caveats of the DataSet / DataStream iterations
[Diagram: records flow through map operators on partitions 1..N of the Flink cluster and loop back along a feedback edge]
28. Multi-variable Iteration

{
  val a: Table = ...                       // iteration variables
  val b: Table = ...
  val resultSeq = Table.iterate(a, b) {
    // step function
    val next_a = b.select('v_b + 1 as 'v_a)
    val next_b = next_a.select('v_a * 2 as 'v_b)
    Seq(next_a, next_b)
  }.times(10)                              // termination condition
}
29. Nested Iteration

{
  val a: Table = ...
  val b: Table = ...
  val resultSeq = iterate(a, b) {
    // inner iteration: 100 rounds per outer step
    val next_a = iterate(a) {
      Seq(a.select('v_a + 1 as 'v_a))
    }.times(100).head
    val next_b = next_a.select('v_a * 2 as 'v_b)
    Seq(next_a, next_b)
  }.times(10)
}
30. Mini-batch iteration
• A stream is chunked into multiple mini-batches
• Each mini-batch iterates independently in the iteration loop
• The results are emitted in the mini-batch order
[Diagram: mini-batches MB1, MB2, MB3 enter the iteration loop over partitions 1..N of the Flink cluster; each iterates independently and results are emitted in mini-batch order (MB1 before MB2)]
31. Mini-batch iteration
• Native support for Stochastic Gradient Descent (SGD)
• Native support for online learning
32. Iteration and Dynamic Model Update
[Diagram: SGD as a native iteration - the initial model and incoming samples feed Gradient Computing; gradients are aggregated in Gradient Reduce and applied to the model, emitting successive versions Model_V1, Model_V2, Model_V3, ... until the Final Model]
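Written against the proposed iteration API from the "Multi-variable iteration" slide, the loop could be sketched as follows; computeGradients and applyGradient are hypothetical helpers, and the iterate / times API is the proposal above, not shipped Flink:

{
  val model: Table = ...    // initial model
  val samples: Table = ...  // training samples
  val finalModel = Table.iterate(model) {
    // step function: one SGD round, emitting Model_V1, Model_V2, ... per iteration
    val grads = computeGradients(samples, model)       // hypothetical gradient UDF
    val reduced = grads.select('grad.sum as 'grad)     // gradient reduce
    Seq(applyGradient(model, reduced))                 // hypothetical model update
  }.times(100).head         // Final Model after 100 rounds
}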
34. Dynamic Model Serving
[Diagram: the lambda ML architecture (HDFS / Message Queue → Preprocessing → Offline / Online Training → Model Validation → static / dynamic model) streaming successive versions Model_V1, Model_V2, Model_V3, ... into the Inference job alongside the samples]
The exact same mechanism of native iteration could be used for dynamic model serving.
35. Agenda
• Why AI Ecosystem on Flink?
• Flink ML Pipeline & Flink ML Libs
• Deep learning on Flink
• Enhanced Iteration & Dynamic Model Serving
• Better Python support
36. Flink Python Table API
[Diagram: the Python Table API runs in a Python VM and talks to a Java gateway server over RPC (Py4j); the DAG graph is built and executed in the Java process between the upstream input and downstream output, while Python UDFs run in a separate Python process]
Working with the Apache Beam community
37. More Python API Support
• Flink ML Pipeline
• Flink-AI-Extended
• DataStream
38. Summary
• Flink has unique value in AI use cases
• Flink fits very well in the “lambda” ML architecture
• Multiple ongoing efforts to make Flink more AI friendly
• Flink ML Pipeline
• Flink ML Libs
• Deep learning on Flink
• Iteration enhancement
• Python API
• …