Apache flink 1.7 and Beyond

Apache Flink® 1.7 and Beyond
公司：data Artisans
职位：Engineering Lead
演讲者：Till Rohrmann
@stsffap
1

2
Original creators
of
Apache Flink®
dA Platform
Stream Processing
for the Enterprise

3
What is Apache Flink?
Batch
Processing
process static and
historic data
Data Stream
Processing
realtime results
from data streams
Event-driven
Applications
data-driven actions
and services
Stateful Computations Over
Data Streams

Flink 1.7: What happened so
far?
4

• Contributors: 112
• Resolved issues: 430
• Commits: 970
• Changes LOC: +103824/-63124
5
Flink 1.7.0 in Numbers

• E.g. changing requirements, new algorithms, better serializers,
bug fixes, etc.
• Expensive to restart application from scratch (maintain state)
6
Flink Applications Need to Evolve

• Support for changing state schema
• Adding/Removing fields
• Changing type of fields
• Currently fully supported when using Avro
types
7
State Schema Evolution
“Upgrading Stateful Flink
Streaming Applications:
State of the Union” by
Tzu-Li Tai Today @
5:20 pm Room 2

8
Converting Currencies
7:12pm 9:37am 8:45am
€ 1
$ 1.13
CN¥ 7.8

9
Temporal Tables and Joins
13 11 7
Currency Rate Time
CN¥ 7.8 3
CN¥ 7.89 5
CN¥ 7.75 915 14 12
7 4

10
SQL for Pattern Analysis
SELECT * from ?

11
MATCH_RECOGNIZESELECT *
FROM TaxiRides
MATCH_RECOGNIZE (
PARTITION BY driverId
ORDER BY rideTime
MEASURES
S.rideId as sRideId
AFTER MATCH SKIP PAST LAST ROW
PATTERN (S M{2,} E)
DEFINE
S AS S.isStart = true,
M AS M.rideId <> S.rideId,
E AS E.isStart = false
AND E.rideId = S.rideId
)

• ElasticSearch 6 Table Sink
• Support for views in SQL Client
• More built-in functions: TO_BASE64, LOG2, REPLACE, COSH,…
12
More SQL Improvements
“Flink Streaming SQL 2018”
by Piotr Nowojski Today @
4:00 pm Room 2

• Scala 2.12 Support
• Exactly-once S3 StreamingFileSink
• Kafka 2.0 connector
• Versioned REST API
• Removal of legacy mode
13
Other Notable Features

Flink 1.8+: What is happening
next?
14

15
Capability Spectrum
offline real time
Batch
Event-driven
applications
Streaming
analytics
Strict SLA
applications
Flink

• Deploying Flink applications should be as easy as starting a process
• Bundle application code and Flink into a single image
• Process connects to other application processes and figures out its role
• Removing the cluster out of the equation
16
Flink as a Library
P1
P2 P3 P4
New process

• Active mode
• Flink is aware of underlying cluster framework
• Flink allocate resources
• E.g. existing YARN and Mesos integration
• Reactive mode
• Flink is oblivious to its runtime environment
• External system allocates and releases resources
• Flink scales with respect to available resources
• Relevant for environments: Kubernetes, Docker, as a library
17
Reactive vs. Active

18
Dynamic Scaling
• Latency
• Throughput
• Resource utilization
• Connector signals

• No fundamental difference between batch and stream processing
• Batch allows optimizations because data is bounded and
”complete”
• Batch and streaming still separately treated from task level upwards
• Working toward a single runtime for batch and streaming workloads
19
Batch-Streaming Unification

• Lazy scheduling (batch case)
• Deploy tasks starting from the
sources
• Whenever data is produced
start consumers
• Scheduling of idling tasks 
resource under-utilization
20
Flink Scheduler
src
src
join join
src
build side
build side
prob
e
side
probe
side

• More efficient scheduling by
taking dependencies into account
• E.g. probe side is only scheduled
after build side has been
processed
21
Batch Scheduler
src
src
join join
src
build side
build side
prob
e
side
probe
side
(1)
(2)
(2)
(3)

• Make Flink’s scheduler extendable &
pluggable
• Scheduler considers dependencies and
reacts to signals from ExecutionGraph
• Specialized scheduler for different use
cases
22
Extendable Scheduler
Scheduler
Streaming
Scheduler
Batch
Scheduler
Speculative
Scheduler

• Tasks own produced result
partitions
• Containers cannot be freed
until result is consumed
• One implementation for
streaming and batch loads
23
Flink’s Shuffle Service
Result partitionContainer

• Result partitions are written to
an external shuffle service
• Containers can be freed early
• Different implementations
based on use case
24
External & Persistent Shuffle Service
External shuffle
service (e.g. Yarn,
DFS)

• Support for external catalogs
(Confluent Schema Registry, Hive
Meta Store)
• Data definition language (DDL)
25
End-to-end SQL Only Pipelines
Hive Meta Store
Table
Source
Table
Sink
Output schema
information
Input schema
information
SQL
Query

• Flink 1.7.0 added many new features around SQL, connectors and state evolution
• A lot of new features in the pipeline
• Join the community!
• Subscribe to mailing lists
• Participate in Flink development
• Become active
26
TL;DL

Apache flink 1.7 and Beyond

More Related Content

What's hot

Similar to Apache flink 1.7 and Beyond

More from Till Rohrmann

Recently uploaded

Apache flink 1.7 and Beyond

Editor's Notes