The Apache Flink community has pushed (and continues to push) the boundary of Stream Processing in recent years, following the understanding that Stream Processing is a unifying paradigm for building data processing applications, beyond real-time analytics.
The latest major effort in the Flink community is nothing less than re-architecting the API and runtime stack, with the goals of naturally supporting the full spectrum of analytics and data-driven applications, unifying the APIs for batch and streaming (Table API and DataStream API), and building a streaming runtime that is state-of-the-art not only in stream processing but also in batch processing performance.
In this keynote, we give an overview of the goals and technology behind this effort, and look at the adoption of Apache Flink for Stream Processing and "beyond streaming" use cases, as well as various efforts in the community to support the growth in users, applications, and ecosystem.
KEYNOTE Flink Forward San Francisco 2019: From Stream Processor to a Unified Data Processing System - Stephan Ewen & Xiaowei Jiang & Robert Metzger
Table API is the choice for analytics workloads
Shared implementation with Flink SQL
Modular, composable code
Dynamic queries
Interactive Programming
Ease of Use
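To make the "shared implementation with Flink SQL" point concrete: both the Table API and SQL front ends compile down to the same logical plan, which is what makes the programmatic API composable. The sketch below mimics that idea with a toy, plain-Java fluent builder; all class and method names here are invented for illustration and are not Flink's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a Table API: each call appends to a logical plan,
// the same intermediate form a SQL front end could target.
// All names are invented for illustration, not Flink's real classes.
public class ToyTableApi {
    static class Table {
        private final List<String> plan = new ArrayList<>();
        Table(String source) { plan.add("scan(" + source + ")"); }
        Table filter(String predicate) { plan.add("filter(" + predicate + ")"); return this; }
        Table select(String fields) { plan.add("select(" + fields + ")"); return this; }
        String explain() { return String.join(" -> ", plan); }
    }

    // A SQL front end could parse SELECT fields FROM source WHERE predicate
    // into the very same plan the fluent API builds:
    static Table fromSql(String source, String predicate, String fields) {
        return new Table(source).filter(predicate).select(fields);
    }

    public static void main(String[] args) {
        Table apiQuery = new Table("orders").filter("amount > 100").select("id, amount");
        Table sqlQuery = fromSql("orders", "amount > 100", "id, amount");
        // Both front ends produce the identical logical plan:
        System.out.println(apiQuery.explain());
        System.out.println(apiQuery.explain().equals(sqlQuery.explain())); // prints true
    }
}
```

Because the plan, not the surface syntax, is the shared artifact, optimizations implemented once in the planner benefit both front ends.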
2019.2-2019.6: FLIP-32, refactor the Table API, decouple the API from the implementation, side-by-side support for multiple runners
2019.3-2019.6 (started concurrently, one month later): initial merge of the Blink runner; supports the major functionality of Blink SQL, production ready
2019.6-2019.7: Flink 1.9.0 released
2019.7-2019.10: continue to merge and improve the Blink runner; fully merge all Blink code
2019.10-2019.11: new Flink release (1.10.0/2.0.0?)
Supported execution modes: Local, Remote, and YARN
Supports TableAPI, Batch SQL and Stream SQL
Visualize static and dynamic tables
Supports savepoints in Flink jobs
Supports advanced functionalities in ZeppelinContext
Unified interface for algorithm functions
Table serves as input and output of ML algorithms
Support ML Pipeline
Implemented some major ML algorithms
Design and refine the algorithm implementations for higher performance
Support large scale data, running and validating in production environment
Find and solve performance and stability problems
Get equal or better performance than Spark ML
Accumulated utilities for Flink ML developer
Basic local library: linear algebra, probability, etc.
Distributed computing library: statistics, parallel sort, etc
Refine data transmission, for example: caching training data in memory
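The pipeline idea above — tables serve as both the input and output of ML algorithms, so stages compose — can be sketched in plain Java without assuming anything about Flink ML's real API. Here a "table" is just a list of rows, an Estimator fits on a table to produce a Transformer, and Transformers map tables to tables; all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Toy ML pipeline: a "table" is a List of rows (double[]), estimators
// learn from a table, and transformers map tables to tables so that
// stages chain. All names are illustrative, not Flink ML's classes.
public class ToyPipeline {
    interface Transformer { List<double[]> transform(List<double[]> table); }
    interface Estimator { Transformer fit(List<double[]> table); }

    // A scaler that learns the max of column 0 and divides rows by it.
    static class MaxScaler implements Estimator {
        public Transformer fit(List<double[]> table) {
            double max = table.stream().mapToDouble(r -> r[0]).max().orElse(1.0);
            return t -> t.stream()
                    .map(r -> new double[] { r[0] / max })
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        List<double[]> train = new ArrayList<>();
        train.add(new double[] {2.0});
        train.add(new double[] {4.0});
        // fit() consumes a table; transform() produces a table, which
        // could feed the next pipeline stage.
        Transformer scaler = new MaxScaler().fit(train);
        List<double[]> out = scaler.transform(train);
        System.out.println(out.get(0)[0] + " " + out.get(1)[0]); // prints 0.5 1.0
    }
}
```

Because every stage speaks "table in, table out", a fitted pipeline is itself just another Transformer, which is what makes pipelines of preprocessing plus model stages composable.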
Like the codebase, Flink’s community is growing
Now among the top 5 projects within Apache, in terms of community size
All numbers are looking very positive
1st number: dev@ activity – developer community growth
More contributions
More discussions about the project’s future
2nd number: pageviews – user community growth
Constant growth over the past years
We expect these to grow in the coming years
Also because we’ve started onboarding the huge Chinese Flink community …
… 1300 people attended Flink Forward China in December this year
There’s been an effort in China to translate the Flink docs. We are now collaborating on translating Flink’s official material.
Flink website almost done
Flink docs coming now
great opportunity for Flink: documentation (and contribution opportunities) for one of the most active Flink communities in the world
Flink has thousands of users / WeChat (or DingTalk) group with thousands of members in China
foster further growth of Flink in China
bring them into the Apache world / offer infrastructure in the Apache way
make the Chinese community more visible to other Flink users and within the ASF
English obviously stays the official language for all developer discussions, and we are open to adding more languages
Better Jira processes
cleanup and reorganization of the components
Flinkbot, helping with the review of pull requests, and labeling PRs into components
discussion about a stricter Jira workflow, to make sure people work on things that have consensus to be accepted
the PMC is actively working on onboarding new committers to help with the amount of incoming pull requests, and the overall growing project.
Started a project for a Flink Ecosystem Portal, where people can submit their own Flink extensions for connectors, metrics reporters, file systems, API utilities etc.
This is hosted at Apache Infra, run by the Flink project and based on open source
reach out to me during the conference if you want to talk community, such as starting your own meetup group, ideas how to improve the community etc.