Apache Flink 101
- The Rise of Stream Processing and Beyond
Bowen Li
Committer@Apache Flink, Senior Engineer@Alibaba
Nov 20, 2019 @Big Data Bellevue
Agenda
● Flink use cases
● Flink in a nutshell - what makes Flink successful in stream processing
● Beyond stream processing
○ Batch
○ Data warehousing and Notebook
○ AI/Machine Learning
○ Serverless
Time = Value
Business demands Real Time Computation
Flink at Alibaba
● Powers real time computations of all business units at Alibaba group
● Powers all search and recommendations, both online and offline
● Provided as cloud service to public on Ali Cloud
Singles’ Day Global Shopping Festival on 11/11
Singles’ Day Stats - Nov 11, 2019
Alibaba
○ GMV
■ $14 million in the first 21s
■ $1.4 billion in the first 96s
■ $38 billion in 24h
○ 982 PB data generated in total
○ 544,000 transactions/sec at peak
Flink @ Alibaba
○ 2 billion events/sec, 3 TB/sec - up 111% from 2018
Flink at Alibaba
Use Case 1: online ML
○ hundreds of millions of events
○ 100+ billion features
○ end-to-end latency of seconds
○ real-time training, feature and model updates
Flink at Alibaba
Use Case 2: Real Time GMV dashboard
Flink at Scale
Real Time AI / ML
Real Time Analytics
Real Time Fraud Detection/Risk Management
Real Time Dynamic Pricing
Flink at Scale
Real Time Compute Service on public cloud backed by Flink
Kinesis Analytics
Flink in a Nutshell
- key differentiators from other open source solutions
ancient squirrel from Ice Age!
Flink in a Nutshell
Stateful Computations …...
Stateful Computations …...
Why does (built-in) state matter?
● computation with context, rather than single record
● no network IO, lower latency
● no external dependency, full control by framework for exactly-once semantics
● ...
Stateful Computations …...
Flink provides built-in state backends that support rich, arbitrary data structures in a fully fault-tolerant way
○ In-memory, spillable backend
○ RocksDB backend
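As a rough illustration of "computation with context, rather than single record", the sketch below keeps one running total per key in local state. It is plain Python, not the Flink API: `KeyedRunningTotal` and its dict-backed store are hypothetical stand-ins for an operator with a keyed state backend.

```python
from collections import defaultdict


class KeyedRunningTotal:
    """Toy stateful operator: one running total per key, held in local memory."""

    def __init__(self):
        # Stand-in for a state backend (in-memory or RocksDB in Flink); here just a dict.
        self._state = defaultdict(int)

    def process(self, key, amount):
        # The result depends on earlier records for the same key, not only this one.
        self._state[key] += amount
        return key, self._state[key]


op = KeyedRunningTotal()
events = [("alice", 5), ("bob", 3), ("alice", 7)]
results = [op.process(k, v) for k, v in events]
print(results)
```

Because the state lives next to the computation, each record is handled without a network round trip to an external store, which is the latency point made on the slide.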
Stateful Computations over Event Streams…...
Flink in a Nutshell
Stateful Computations over Event Streams…...
It means a few things…
1. All your data is data streams!
○ batch vs. streaming - just execution models
○ bounded vs. unbounded data streams - key difference
○ technically all data processing is stream processing
Stateful Computations over Event Streams…...
2. Streaming-first, pipelined execution
○ records flow through the system
-> extremely high throughput, super fast, ultra low latency
○ fundamentally different from batch-first, staged-execution frameworks
○ can’t be achieved by a mini-batch workaround
Stateful Computations over Event Streams…...
3. Events come with time!
○ event time vs. processing time
○ window aggregation, sessionization, pattern recognition, time-based joins
○ out-of-order and late data
Flink has supported the most comprehensive time semantics natively from the beginning
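To make event time, out-of-order data and late data concrete, here is a minimal sketch of tumbling event-time windows driven by a watermark. It is plain Python rather than Flink code; the window size, the out-of-orderness bound and the function name are assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_SIZE = 10          # seconds per tumbling window
MAX_OUT_OF_ORDERNESS = 5  # how far the watermark trails the largest timestamp seen


def tumbling_event_time_counts(events):
    """Count events per [start, start + WINDOW_SIZE) window using event time.

    events: iterable of (event_time_seconds, payload) in arrival order.
    Returns (fired, late): fired is a list of (window_start, count) in firing
    order; late holds events whose window had already fired when they arrived.
    """
    open_windows = defaultdict(int)
    fired, late = [], []
    watermark = float("-inf")
    for ts, payload in events:
        start = ts - ts % WINDOW_SIZE
        if start + WINDOW_SIZE <= watermark:
            late.append((ts, payload))  # window already closed: late data
        else:
            open_windows[start] += 1
        # Advance the watermark, then fire every window that ends at or before it.
        watermark = max(watermark, ts - MAX_OUT_OF_ORDERNESS)
        for w in sorted(w for w in open_windows if w + WINDOW_SIZE <= watermark):
            fired.append((w, open_windows.pop(w)))
    # End of bounded input: flush whatever is still open.
    for w in sorted(open_windows):
        fired.append((w, open_windows.pop(w)))
    return fired, late


# Arrival order differs from event-time order: 8 arrives after 12, 3 arrives very late.
arrivals = [(1, "a"), (12, "b"), (8, "c"), (17, "d"), (3, "e"), (25, "f")]
fired, late = tumbling_event_time_counts(arrivals)
print(fired)
print(late)
```

The event with timestamp 8 is out of order but still counted in window 0, because the watermark had not yet passed the end of that window; the event with timestamp 3 arrives after the window fired and is reported as late.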
Stateful Computations over Event Streams
in an Expressive …...
Flink in a Nutshell
Stateful Computations over Event Streams
in an Expressive …...
Layered APIs with the most comprehensive semantics
Flink’s layered APIs
Streaming SQL, Table API, DataStream API, ProcessFunction
More
- declarative
- optimizable
- understandable
- stable
- unified for streaming and batch
More
- advanced
- precise control
- optimized
(if you know what you are doing)
Streaming SQL
Flink SQL> CREATE TABLE test (`user` BIGINT, msg VARCHAR, ts VARCHAR)
           WITH (
             'connector.type' = 'kafka',
             'connector.topic' = 'topic_name',
             'format.type' = 'avro',
             'connector.startup-mode' = 'earliest-offset'
           );
Flink SQL> SELECT * FROM test;
Flink SQL> INSERT INTO unique_user SELECT DISTINCT `user` FROM test;
Table API
// Java/Scala/Python
tableEnv
    .connect(
        new Kafka()
            .version("universal")   // the descriptor expects "universal" for Kafka 1.0+ clusters
            .topic("topic_name")
            .startFromEarliest())
    .withSchema(...)
    .inAppendMode()
    .registerTableSource("test");
Table test = tableEnv.scan("test");
test.select("user").distinct().insertInto("unique_user");
DataStream API
// Java/Scala
FlinkKafkaConsumer<...> consumer = new FlinkKafkaConsumer<>("test", ...);
consumer.setStartFromEarliest();
DataStream<...> stream = env.addSource(consumer);
stream
    .keyBy("user")
    .flatMap(new DataStreamDistinctReduce() { … });
Does SQL Make Sense in Streaming?
Stream-Table Duality
Stream (Word, Count):
Hello 1
World 1
Hello 2
World 2
Hello 3
Hello 4
Table (Word, Count):
Hello 4
World 2
Stream -> Table: materialized
Table -> Stream: changelog, popular as CDC in database replication
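The duality on this slide can be sketched in a few lines: a running word count emits a changelog stream, and folding that changelog back by key materializes the table. This is plain Python with hypothetical helper names, not Flink's implementation.

```python
def word_count_changelog(words):
    """Turn a stream of words into the changelog of a running word count."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


def materialize(changelog):
    """Fold a changelog stream of (word, count) upserts into a table (dict)."""
    table = {}
    for word, count in changelog:
        table[word] = count  # later entries for a key overwrite earlier ones
    return table


words = ["Hello", "World", "Hello", "World", "Hello", "Hello"]
changelog = list(word_count_changelog(words))
table = materialize(changelog)
print(changelog)
print(table)
```

The changelog reproduces the six stream rows from the slide, and the materialized table holds only the latest count per word.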
Does SQL Make Sense in Streaming?
Streaming: Stream Data -> Dynamic Table -> Continuous Query -> Dynamic Table -> Stream Data
Batch: Batch Data -> Static Table -> Static Query -> Static Table -> Batch Data
A Static Table is a snapshot of a Dynamic Table
Both queries run on Flink
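One way to read the dynamic-table picture: a continuous query keeps its result up to date row by row, and at any moment that result matches what a static query would return on a snapshot of the rows seen so far. The sketch below checks this for a per-user count; the class and function names are hypothetical and the code is plain Python, not Flink SQL.

```python
from collections import Counter


def static_query(rows):
    """Batch-style query: count rows per user over a complete, static table."""
    return dict(Counter(user for user, _ in rows))


class ContinuousCount:
    """Streaming-style query: same count per user, updated one row at a time."""

    def __init__(self):
        self.result = {}

    def on_row(self, row):
        user, _ = row
        self.result[user] = self.result.get(user, 0) + 1
        return user, self.result[user]  # the update emitted downstream


rows = [("u1", "hi"), ("u2", "hey"), ("u1", "bye")]
query = ContinuousCount()
updates = [query.on_row(r) for r in rows]

# After every prefix of the stream, the continuous result equals the static
# query run on a snapshot of the rows seen so far.
check = ContinuousCount()
for i, row in enumerate(rows, start=1):
    check.on_row(row)
    assert check.result == static_query(rows[:i])
print(updates)
```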
Stateful Computations over Event Streams
in an Expressive, Scalable …...
Flink in a Nutshell
Stateful Computations over Event Streams
in an Expressive, Scalable …...
● Horizontally scalable
● Battle tested
○ trillions of records per day
○ terabytes of state
○ run on thousands of cores
Stateful Computations over Event Streams
in an Expressive, Scalable, Operational-focused
…...
Flink in a Nutshell
Stateful Computations over Event Streams
in an Expressive, Scalable, Operational-focused
…...
● Deploy anywhere
○ kubernetes, yarn, mesos, standalone
● Deploy flexibly
○ per-job mode, session mode
● Highly available
○ with HA setup
Stateful Computations over Event Streams
in an Expressive, Scalable, Operational-focused,
Fault Tolerant way
Flink in a Nutshell
Stateful Computations over Event Streams
in an Expressive, Scalable, Operational-focused,
Fault Tolerant way
● Checkpoint/Savepoint
○ on-the-fly, doesn’t sacrifice performance
○ supports incremental checkpoints
● Exactly-once
○ State consistency
○ End-to-end with transactional connectors
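The state-consistency half of exactly-once can be illustrated with a toy job that snapshots its source offset and operator state together and replays from the snapshot after a failure. This is a simplified sketch in plain Python under the assumption of a replayable source; it does not model how Flink actually takes checkpoints while the job keeps running, and `CountingJob` is a hypothetical name.

```python
import copy


class CountingJob:
    """Toy job: sums a replayable source; state and offset are checkpointed together."""

    def __init__(self, source):
        self.source = source   # replayable input, e.g. a log addressed by offset
        self.offset = 0        # position of the next record to read
        self.total = 0         # operator state

    def step(self):
        self.total += self.source[self.offset]
        self.offset += 1

    def checkpoint(self):
        # Snapshot offset and state as one consistent unit.
        return copy.deepcopy({"offset": self.offset, "total": self.total})

    def restore(self, snapshot):
        self.offset = snapshot["offset"]
        self.total = snapshot["total"]


source = [1, 2, 3, 4, 5]
job = CountingJob(source)
job.step()
job.step()
snapshot = job.checkpoint()      # taken after two records
job.step()                       # third record processed, then the job "crashes"

recovered = CountingJob(source)
recovered.restore(snapshot)      # roll back to the snapshot; record 3 is replayed
while recovered.offset < len(source):
    recovered.step()
print(recovered.total)
```

Record 3 is read twice overall, but its effect appears in the state exactly once, because the state was rolled back to the same point as the offset. Extending this guarantee to external systems is what the transactional connectors on the slide are for.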
Apache Flink:
Stateful Computations over Event Streams
in an
Expressive,
Scalable,
Operational-focused,
Fault Tolerant way
The only open source framework that provides all the above capabilities
Going Beyond Stream Processing
● Batch -> Unified Data Processing
● Data Warehousing and Notebook
● Machine Learning / AI / DL
● Serverless
Why Unified Streaming and Batch Data Processing?
Recap the lambda architecture ......
MQ / Pub-Sub -> Stream Processing (online) -> Combine Results (serving)
HDFS / S3 -> Batch Processing (offline) -> Combine Results (serving)
○ infra: high operation cost
○ dev: costly maintenance, and hard to learn 2+ stacks
○ business: hard to keep the two code paths in sync to guarantee consistent logic
Why Flink?
Flink’s philosophy: streaming first, with batch as a special case of streaming
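A small sketch of what "batch as a special case of streaming" can mean in practice: the same operator code runs over a bounded input and over an unbounded one, and the only difference is whether the input ever ends. This is plain Python with a hypothetical `running_max` operator, not a Flink job.

```python
import itertools


def running_max(stream):
    """Same operator for bounded and unbounded input: emit the max seen so far."""
    best = None
    for value in stream:
        best = value if best is None else max(best, value)
        yield best


# Bounded stream ("batch"): the input ends, so the last emitted value is the final answer.
bounded = [3, 1, 4, 1, 5]
batch_result = list(running_max(bounded))[-1]

# Unbounded stream ("streaming"): the input never ends, so we only ever observe a prefix.
unbounded = itertools.cycle([3, 1, 4, 1, 5, 9])
first_seven = list(itertools.islice(running_max(unbounded), 7))
print(batch_result, first_seven)
```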
State-of-the-Art Batch Processing on a Stream Processor
<= Flink 1.8 from Flink 1.9
Performance of Blink versus Spark in the TPC-DS benchmark, aggregate time for all queries together.
Presentation by Xiaowei Jiang at Flink Forward Beijing, 2018.
Data Warehousing
Initial integration in Flink 1.9 for Hive 2.3.4 and 1.2.1
Full integration (read, write, udf) in Flink 1.10 for all Hive 1.x, 2.x, and 3.x
Notebook
Machine Learning & AI & DL
Machine Learning & AI
Recap the “lambda” architecture, again, for ML
(Diagram components: MQ / Pub-Sub, HDFS / S3, Preprocessing, Online Training, Offline Training, Dynamic Model, Static Model, Model Validation, Inference)
Flink is popular for online ML now
Streaming-Batch Unified ML
Use Flink everywhere to reduce maintenance and operation cost of code and infra
Machine Learning & AI
ML Stage: Data Acquisition, Preprocessing, Model Training, Model Validation & Serving, Inference
ML Flow: MQ / Pub-Sub, HDFS / S3, Preprocessing, Online Training, Offline Training, Dynamic Model, Static Model, Model Validation, Inference
Efforts & Requirements:
● Rich Connectors
● Dataset Management
● Stream-Batch unification
● Strong API & SQL Support
● Enhanced Iteration
● Flink ML lib
● DL on Flink (TF, PyTorch)
● Model Serving
● Model Registry & Management
● Rollout/Rollback
● Online Evaluation
● Flink ML Pipeline
● Python API support
Flink ML Libs
● Completely rewritten
● Based on ML pipeline, powered by Table API
● Battle tested algorithms
○ K-means
○ Naive Bayes
○ Linear regression
○ GBDT
○ Decision tree
○ PCA
○ Random forest
○ Correlation
○ ….
Flink ML Pipeline
Two types of operators
● Transformer: data -> data
● Estimator: data -> model
training: input table 1 -> Estimator Pipeline (Transformers and Estimators) -> Model Pipeline, via pipeline.fit(input1)
inference: input table 2 -> Model Pipeline (Transformers and Models) -> result table, via pipeline.transform(input2)
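The Transformer/Estimator split can be sketched as follows. This is a minimal plain-Python illustration of the pattern, not the Flink ML Pipeline API: `Scale`, `MeanCenter`, `_Centered` and this `Pipeline` class are hypothetical, and lists of floats stand in for tables.

```python
class Scale:
    """Transformer: data -> data. Multiplies every value by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, data):
        return [x * self.factor for x in data]


class _Centered:
    """Model produced by MeanCenter; itself a Transformer."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


class MeanCenter:
    """Estimator: data -> model. fit() learns the mean and returns a Transformer."""

    def fit(self, data):
        return _Centered(sum(data) / len(data))


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        """Fit estimators in order, feeding each stage the previous stage's output."""
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(data)  # Estimator -> Model
            data = stage.transform(data)
            fitted.append(stage)
        return Pipeline(fitted)  # a pipeline made of transformers only

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


input1 = [1.0, 2.0, 3.0]   # training input
input2 = [4.0, 6.0]        # inference input
model_pipeline = Pipeline([Scale(2.0), MeanCenter()]).fit(input1)
result = model_pipeline.transform(input2)
print(result)
```

Fitting replaces each estimator with the model it produced, so the fitted pipeline can be applied to new input with `transform` alone.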
Deep Learning Pipeline
Data Acquisition -> Data Processing & Transformation -> Model Training -> Model Validation -> Model Serving
Parameter Tuning
Deep Learning Pipeline
(Diagram: Flink Cluster with source 1, source 2, join, udtf -> External MQ / FS -> TensorFlow Cluster with workers and PS)
Flink Deep Learning Pipeline
Data Acquisition -> Data Processing & Transformation -> Model Training -> Model Validation -> Model Serving
Parameter Tuning
Flink + TensorFlow Integration
Before: Flink Cluster (source 1, source 2, join, udtf) -> External MQ / FS -> TensorFlow Cluster (workers, PS)
After: a single Flink Cluster (source 1, source 2, join, udtf, workers, PS)
DL on Flink ML Pipeline
(Diagram: source 1, source 2, join, udtf, workers and PS, wrapped as Transformer and Estimator)
check out flink-ai-extended https://github.com/alibaba/flink-ai-extended
Serverless
Serverless
Event Driven Function as a Service
Benefits:
● elastic
● lightweight
Challenges:
● state
○ consistency
○ IO
○ capacity
● hard to build complex applications
Event Driven, State Management, Composable
isn’t that….
Stream Processing!
Check out the Stateful Functions project, announced in Oct 2019!
https://statefun.io/
It officially became part of Apache Flink last week.
Thanks!
Twitter: @Bowen__Li
Mailing lists: dev@flink.apache.org / user@flink.apache.org
Meetup: Seattle Flink Meetup
https://www.meetup.com/seattle-apache-flink/
Q & A
