Piotr Nowojski
@PiotrNowojski
piotr@data-artisans.com
Apache Flink
Better, Faster & Uncut
Big Data Warsaw 2018
Original creators of Apache Flink®
Providers of the dA Platform 2, a supported Flink distribution
This will be about ...
● What is Apache Flink?
● What can I do with it?
● What has recently changed?
Stateful Stream Processing
“Why” and “What is” ...
Stream Processing
(Diagram: your code processes records one at a time)
A long-running computation on an endless stream of input
Distributed Stream Processing
(Diagram: multiple parallel instances of your code, each consuming a partition of the input stream)
● Partitions input streams by some key in the data
● Distributes computation across multiple instances
● Each instance is responsible for some key range
Stateful Stream Processing
(Diagram: two parallel instances of your code; each instance keeps its own local sum)

    var sum = 0

    def map(element):
        sum += element
        return sum
Stateful Stream Processing
(Diagram: parallel instances of your code, each with an embedded local state backend)
● Embedded local state backend
● State co-partitioned with the input stream by key (see the sketch below)
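As an illustration, the running-sum logic above could be expressed with Flink's keyed state API roughly as follows. This is a minimal sketch: the class and state names are made up, and the function must be applied to a stream keyed by some field so that the state is scoped per key.

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Keeps a running sum per key in Flink-managed keyed state.
    public class SumFunction extends RichFlatMapFunction<Long, Long> {

        private transient ValueState<Long> sum;

        @Override
        public void open(Configuration parameters) {
            // Register the state with the embedded local state backend.
            sum = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("sum", Long.class));
        }

        @Override
        public void flatMap(Long element, Collector<Long> out) throws Exception {
            Long current = sum.value();                 // null if nothing stored yet
            long updated = (current == null ? 0L : current) + element;
            sum.update(updated);
            out.collect(updated);
        }
    }

Because the state is co-partitioned with the input by key, each parallel instance only ever sees and stores the sums for its own key range.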
State fault tolerance
Fault tolerance concerns for a stateful stream processor:
● Which guarantees: no guarantees, at-least-once, or exactly-once?
● How to ensure exactly-once semantics for the state?
● How to create consistent snapshots of distributed state?
● More importantly, how to do it efficiently without interrupting the computation?
State fault tolerance
(Diagram: parallel operator instances, each with its own local state)
● Consistent snapshotting:
State fault tolerance
● Consistent snapshotting: Checkpoint
(Diagram: each operator instance copies its checkpointed state to a Distributed File System)
State fault tolerance
Restore:
(Diagram: each operator instance reloads its checkpointed state from the Distributed File System)
● Recover all embedded state
● Reset the position in the input stream (see the checkpointing configuration sketch below)
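A minimal sketch of enabling this checkpoint/restore behavior in a job. The interval, the choice of FsStateBackend, and the checkpoint path are illustrative assumptions, not values from the talk.

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointingSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Take a consistent snapshot of all operator state every 10 seconds.
            env.enableCheckpointing(10_000);
            env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

            // Persist the snapshots to a distributed file system (path is hypothetical).
            env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

            // ... build and execute the job here ...
        }
    }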
Apache Flink®
Apache Flink Stack
● Libraries (on top of the APIs)
● DataStream API: Stream Processing
● DataSet API: Batch Processing
● Runtime: Distributed Streaming Data Flow
Streaming and batch as first class citizens.
Programming Model
(Diagram: a dataflow graph of sources, stateful transformations/computations, and sinks)
API and Execution
DataStream<String> lines = env.addSource(new FlinkKafkaConsumer010(…));

DataStream<Event> events = lines.map(line -> parse(line));

DataStream<Statistic> stats = events
    .keyBy("id")
    .timeWindow(Time.seconds(5))
    .aggregate(new MyAggregationFunction());

stats.addSink(new BucketingSink(path));
(Diagram: the resulting streaming dataflow, with two parallel instances each of Source, map(), and keyBy()/window()/apply(), and one Sink instance)
Levels of abstraction
● Process Function (events, state, time): low-level (stateful stream processing) (sketch below)
● DataStream API (streams, windows): stream processing & analytics
● Table API (dynamic tables): declarative DSL
● Stream SQL: high-level language
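To illustrate the lowest level, here is a rough sketch of a Process Function that combines events, state, and time. All names and the 60-second timer are made up, and it must run on a keyed stream so the state is scoped per key.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;

    // Counts events per key and emits the count one minute after an event arrives.
    public class CountWithTimeout extends ProcessFunction<String, String> {

        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void processElement(String value, Context ctx, Collector<String> out)
                throws Exception {
            Long current = count.value();
            count.update(current == null ? 1L : current + 1);

            // Time: register a processing-time timer that fires 60 seconds from now.
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 60_000);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
                throws Exception {
            out.collect("count so far: " + count.value());
        }
    }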
End to end exactly-once
Exactly-once
● End to end exactly-once
○ before Flink 1.4: only for writing to files
○ since Flink 1.4: also for Pravega and Kafka (see the producer sketch below)
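For example, attaching a Kafka 0.11 sink with exactly-once semantics might look roughly like this. The topic, properties, and serialization wrapper are assumptions, and constructor overloads differ slightly between connector versions; the relevant knob is the EXACTLY_ONCE semantic.

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
    import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

    public class ExactlyOnceKafkaSink {

        // Attach a transactional Kafka 0.11 sink to an existing stream.
        public static void addExactlyOnceSink(DataStream<String> stream) {
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            // Kafka transactions must be allowed to outlive the checkpoint
            // interval (the value here is only illustrative).
            props.setProperty("transaction.timeout.ms", "600000");

            stream.addSink(new FlinkKafkaProducer011<String>(
                    "output-topic",                                   // hypothetical topic
                    new KeyedSerializationSchemaWrapper<String>(new SimpleStringSchema()),
                    props,
                    FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));    // transactional writes
        }
    }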
Exactly-once two-phase commit
(Diagram: Kafka (external system) → Data Source → Operator → Data Sink → Kafka (external system), alongside the State Backend and the Job Manager)
Exactly-once two-phase commit
Pre-commit (checkpoint starts):
(Diagram: the Job Manager injects a checkpoint barrier at the Data Source (1))
Exactly-once two-phase commit
Pre-commit without external state:
(Diagram: the Data Source snapshots its Kafka offsets (2) and passes the checkpoint barrier downstream (2))
Exactly-once two-phase commit
Pre-commit the second operator, which has no external state:
(Diagram: the Operator snapshots its state (3) and passes the checkpoint barrier downstream (3))
Exactly-once two-phase commit
Pre-commit with external state in the data sink:
(Diagram: the Data Sink pre-commits its external transaction (4) and snapshots its state (4))
Exactly-once two-phase commit
(Diagram: once the checkpoint is complete, the Job Manager notifies all operators that the checkpoint has completed (1))
Exactly-once two-phase commit
(Diagram: upon the checkpoint-completed notification (1), the Data Sink commits its external transaction (2))
Exactly-once
● End to end exactly-once requires support for transactional writes from the external systems (see the sketch below)
● Kafka supports transactional writes only since version 0.11 (released in the second half of 2017)
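Flink 1.4 ships a TwoPhaseCommitSinkFunction base class that implements the protocol above. The snippet below is only a simplified, self-contained illustration of the same idea, not that exact API; all names in it are made up.

    import java.util.ArrayList;
    import java.util.List;

    // Simplified illustration of a sink following the two-phase-commit protocol.
    public class TwoPhaseCommitSketch {

        // A toy "transaction": records are buffered until they are committed.
        static class Transaction {
            final List<String> buffered = new ArrayList<>();
        }

        private Transaction current = new Transaction();

        // Normal processing: write records into the currently open transaction.
        void invoke(String record) {
            current.buffered.add(record);
        }

        // Phase 1: called when the checkpoint barrier reaches the sink.
        // Flush everything, but do not make it visible to consumers yet.
        Transaction preCommit() {
            Transaction preCommitted = current;
            current = new Transaction();   // open a fresh transaction for new records
            return preCommitted;
        }

        // Phase 2: called when the Job Manager notifies that the checkpoint
        // completed on all operators. Only now does the data become visible.
        void commit(Transaction transaction) {
            transaction.buffered.forEach(System.out::println);
        }

        // Called if the checkpoint fails: discard the pre-committed data.
        void abort(Transaction transaction) {
            transaction.buffered.clear();
        }
    }

The real base class additionally stores the handles of open and pre-committed transactions in operator state, so that after a failure they can still be committed or aborted.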
Large state
Large state
● Large state (multiple GBs per machine) makes checkpointing take too long
● Large state often changes only a little between checkpoints
● Long recovery times
Flink State and Distributed Snapshots
(Diagram: a Source feeds events to a Stateful Operation backed by a State Backend)
Flink State and Distributed Snapshots
(Diagram: trigger checkpoint: a checkpoint barrier is injected at the Source)
Flink State and Distributed Snapshots
(Diagram: take state snapshot: when the barrier reaches the Stateful Operation, its state snapshot is triggered synchronously, e.g. via copy-on-write)
Flink State and Distributed Snapshots
"Asynchronous Snapshotting"
(Diagram: the processing pipeline continues while full snapshots are durably persisted to the DFS asynchronously)
Asynchronous Checkpoints
● Minimize pipeline stall time while taking the snapshot
● Keep overhead (memory, CPU,…) as low as possible while writing
the snapshot
● Support multiple parallel checkpoints
Incremental checkpointing
● Large state with small changes over time
● Efficiently detect the (minimal) set of state changes between two checkpoints
○ copy-on-write
● Persist only the difference (see the configuration sketch below)
○ faster snapshots
○ slower recovery (replaying the differences)
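A sketch of enabling this with the RocksDB state backend, whose constructor exposes an incremental-checkpointing flag; the checkpoint URI is an assumption.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class IncrementalCheckpointsSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // The second argument enables incremental checkpoints: only the
            // RocksDB files changed since the last checkpoint are uploaded.
            env.setStateBackend(
                    new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

            env.enableCheckpointing(10_000);
            // ... build and execute the job here ...
        }
    }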
Local recovery - checkpointing
Checkpointing the state of the operators
(Diagram: the Stateful Operation takes copy-on-write snapshots and persists them to the DFS)
Local recovery - checkpointing
Checkpointing the state of the operators
(Diagram: full snapshots are durably persisted to the DFS asynchronously, and additionally stored on local disks)
Local recovery - restoring
Restoring the state of the operators
(Diagram: by default state is recovered from the DFS, but it is recovered from local state whenever possible)
Local recovery
● Higher cost of asynchronous snapshotting
○ If the system is not overloaded, it does not affect latency/throughput
● Faster recovery (see the configuration sketch below)
○ Smaller downtime
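Task-local recovery is a configuration switch in Flink 1.5; a sketch of the relevant flink-conf.yaml entry (key name as documented for 1.5, so treat it as an assumption):

    # flink-conf.yaml: keep a local copy of each task's state snapshot
    # and prefer it during recovery
    state.backend.local-recovery: true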
Low latency
Challenges in Streaming
● Streaming is very different from batch and micro-batch processing
● Providing high throughput together with low latency is difficult
Challenges in Streaming
(Diagram: your code processes records one at a time)
Batching records hides the per-record hand-over costs, in exchange for higher latency (see the sketch below).
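Flink exposes this throughput/latency trade-off directly; a sketch using the network buffer timeout (the 5 ms value is only an illustration):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BufferTimeoutSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Network buffers are flushed either when full (max throughput)
            // or after this timeout (bounds the added latency). Lower values
            // mean lower latency at the cost of some throughput.
            env.setBufferTimeout(5);

            // ... build and execute the job here ...
        }
    }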
Network changes in Flink 1.5
Conclusion
TL;DR
● Stateful stream processing as a paradigm for continuous
data
● Apache Flink is a sophisticated and battle-tested
stateful stream processor with a comprehensive set of
features
● Efficiency, management, and operational issues for state
are taken very seriously
Thank you!
@PiotrNowojski
@ApacheFlink
@dataArtisans
We are hiring!
data-artisans.com/careers
