
State Management in Apache Flink : Consistent Stateful Distributed Stream Processing

An overview of state management techniques employed in Apache Flink, including pipelined consistent snapshots and their uses for reconfiguration, presented at VLDB 2017.



  1. Paris Carbone <parisc@kth.se> (KTH Royal Institute of Technology), Stephan Ewen <stephan@data-artisans.com> (data Artisans), Gyula Fóra <gyula.fora@king.com> (King Digital Entertainment Ltd), Seif Haridi <haridi@kth.se> (KTH Royal Institute of Technology), Stefan Richter <s.richter@data-artisans.com> (data Artisans), Kostas Tzoumas <kostas@data-artisans.com> (data Artisans). State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing @vldb17
  2. Overview • The Apache Flink System Architecture • Pipelined Consistent Snapshots • Operations with Snapshots • Large Scale Deployments and Evaluation
  3. The Apache Flink Framework [layered architecture diagram: Libraries (SQL/Table, CEP, Graphs, ML), Core API (DataStream, DataSet), Runner (Dataflow Runtime), Setup (Cluster Backend, Metrics)]
  11. Distributed Architecture [diagram, built up over slides 4-11]: the Client submits an optimised logical graph to the Job Manager, which handles • scheduling • state partitioning • snapshot coordination, and deploys physical long-running tasks to Task Managers, which handle • memory management • local snapshot execution • flow control and hold locally managed state. Zookeeper provides • passive failover • snapshot metadata. Partial snapshots are written to an External Snapshot Store (e.g., HDFS).
  13. Snapshots: 1. End-to-End Guarantees 2. Reconfiguration 3. Version Control 4. Isolation
  18. Stateful Processing [diagram, built up over slides 14-18]: a task is invoked per input record and reads/writes managed state (logical operations on collections), so state = f(input); the physical operations run in a Local State Backend, either In-Memory (Heap) or an Embedded Off-heap+Disk Key-Value Store (RocksDB).
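The state = f(input) idea above can be sketched as a toy stateful task. This is a simplified Python illustration, not Flink's DataStream API: managed state here is just a per-key dictionary that the task reads and writes on every input record.

```python
class StatefulTask:
    """Toy stateful operator: invoked once per input record,
    reading and writing locally managed per-key state."""

    def __init__(self):
        self.state = {}  # managed state: key -> running sum

    def invoke(self, record):
        key, value = record
        # read-modify-write: the new state is a function of the input
        self.state[key] = self.state.get(key, 0) + value
        return key, self.state[key]

task = StatefulTask()
task.invoke(("bob", 1))
task.invoke(("bob", 2))
assert task.state["bob"] == 3
```

In Flink the same pattern is expressed through managed state primitives backed by the configured state backend, so the runtime (not the user) decides whether the dictionary lives on the heap or in RocksDB.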
  24. [diagram, built up over slides 19-24]: a stream processor consumes input streams and maintains local states; computation is divided into epochs, all local states are captured after completing an epoch, and input and state can be rolled back to a captured point in the past.
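The epoch-and-rollback model above can be sketched as follows (a minimal Python illustration of the idea, not Flink internals): the processor snapshots its local state together with the input offset at every epoch boundary, so both can be restored to a captured point.

```python
import copy

class EpochProcessor:
    def __init__(self):
        self.state = {}
        self.offset = 0          # position in the (replayable) input
        self.snapshots = []      # (offset, state) per completed epoch

    def process(self, record):
        self.state[record] = self.state.get(record, 0) + 1
        self.offset += 1

    def complete_epoch(self):
        # capture all local state after the epoch completes
        self.snapshots.append((self.offset, copy.deepcopy(self.state)))

    def rollback(self):
        # restore input position and state to the last captured point
        offset, state = self.snapshots[-1]
        self.offset, self.state = offset, copy.deepcopy(state)

p = EpochProcessor()
p.process("a"); p.process("b"); p.complete_epoch()
p.process("c")    # work in the next, uncommitted epoch
p.rollback()      # "c" is gone; it will be reprocessed on replay
assert p.offset == 2 and "c" not in p.state
```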
  31. A Synchronous Approach [diagram, built up over slides 25-31]: a master coordinates all tasks to drain epoch 1 and then copy their states to the Snapshot Store; the same drain-then-copy cycle repeats for epoch 2, and so on.
  32. Synchronous Snapshots • In use: Storm Trident and Spark Streaming • A conservative approach, equivalent to batching • Can cause unnecessary latency (master coordination) • Processing is no longer continuous • Forces many tasks to be idle • Instead, in Apache Flink snapshots are pipelined
  42. Pipelined Snapshots [diagram, built up over slides 33-42]: markers are inserted into the input streams; each task (A, B, C, D, E) performs epoch alignment on the markers, then makes an asynchronous copy of its state to the Snapshot Store and forwards the markers; the snapshot completes once every task has copied its state.
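The epoch alignment step can be sketched for a task with two input channels (simplified Python, not Flink's implementation): after the marker arrives on one channel, that channel's records are buffered until the marker has arrived on every input; only then is the state copied and the buffered records released into the next epoch.

```python
MARKER = "MARKER"

class AligningTask:
    """Toy two-input task summing its inputs, with marker alignment."""

    def __init__(self, num_inputs=2):
        self.state = 0
        self.blocked = set()                        # channels past their marker
        self.buffers = {i: [] for i in range(num_inputs)}
        self.num_inputs = num_inputs
        self.snapshot = None

    def receive(self, channel, record):
        if record == MARKER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:    # aligned on all inputs
                self.snapshot = self.state              # async copy in Flink
                self.blocked.clear()
                for buf in self.buffers.values():       # release next-epoch records
                    for r in buf:
                        self.state += r
                    buf.clear()
            return
        if channel in self.blocked:
            self.buffers[channel].append(record)        # belongs to next epoch
        else:
            self.state += record

t = AligningTask()
t.receive(0, 1)
t.receive(0, MARKER)
t.receive(0, 5)        # buffered: channel 0 is already past its marker
t.receive(1, 2)        # processed: channel 1 is still pre-marker
t.receive(1, MARKER)   # alignment completes, snapshot is taken
assert t.snapshot == 3 and t.state == 8
```

Dropping the buffering (processing channel 0's records immediately) is exactly the weaker at-least-once mode mentioned later in the deck: the snapshot may then include effects of next-epoch records.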
  45. Pipelined Snapshots (cycles). Problem: we cannot wait indefinitely for records in cycles. Solution: log the in-flight records within a cycle as part of the snapshot and replay them upon recovery.
  46. Technique Highlights • Offers exactly-once processing guarantees • Issued periodically or externally by the user • Naturally respects flow-control mechanisms • Channel-state logging is limited to cycles only • Multiple epoch snapshots can be pipelined • Can offer weaker at-least-once processing guarantees by simply dropping alignment, at no alignment cost
  47. Snapshot Usages: 1. End-to-End Guarantees
  48. Exactly-Once: Input and Processing. Important assumptions: • Input streams are persisted with offset indexes (e.g., Kafka, Kinesis) • Data channels are FIFO and reliable (no loss). Each epoch either completes or repeats.
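The replayable-input assumption can be sketched as follows (a minimal Python illustration, not a Kafka or Kinesis client): the source is a durable log read through an offset, so a failed epoch can be repeated simply by seeking back to the offset captured in the snapshot.

```python
class ReplayableSource:
    """Persisted input with offset indexes: reading never consumes,
    so any epoch can be re-read after a rollback."""

    def __init__(self, records):
        self.log = list(records)   # durable, append-only log
        self.offset = 0

    def read(self):
        record = self.log[self.offset]
        self.offset += 1
        return record

    def seek(self, offset):
        self.offset = offset       # rollback: repeat the failed epoch

src = ReplayableSource(["a", "b", "c"])
first = [src.read(), src.read()]   # epoch 1
src.seek(0)                        # epoch 1 failed, so repeat it
assert [src.read(), src.read()] == first == ["a", "b"]
```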
  49. Exactly-Once Output • Idempotency: repeated operations can be tolerated after recovery/rollback (works for mutable stores) • Transactional Processing: requires a two-phase coordination; a snapshot completion eventually leads to an external commit (e.g., Flink's HDFS RollingSink*). [diagram: epoch n in-progress; epochs n-1 and n-2 pending; epoch n-3 committed]
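The transactional-output pattern can be sketched as follows (illustrative Python; Flink's actual RollingSink stages files on HDFS rather than records in memory): each epoch's output is staged as pending, and committed to the externally visible output only when the snapshot for that epoch completes.

```python
class TwoPhaseSink:
    """Per-epoch staging: output becomes visible only after
    the corresponding snapshot completes (the commit phase)."""

    def __init__(self):
        self.pending = {}      # epoch -> staged records
        self.committed = []    # externally visible output

    def write(self, epoch, record):
        self.pending.setdefault(epoch, []).append(record)

    def on_snapshot_complete(self, epoch):
        # snapshot success for `epoch` triggers the external commit
        self.committed.extend(self.pending.pop(epoch, []))

    def on_failure(self, epoch):
        self.pending.pop(epoch, None)   # discard uncommitted output

sink = TwoPhaseSink()
sink.write(1, "a"); sink.write(2, "b")
sink.on_snapshot_complete(1)
assert sink.committed == ["a"] and 2 in sink.pending
```

Because epoch 2's records stay pending until its snapshot completes, a rollback of epoch 2 never exposes partial output downstream.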
  50. Snapshot Usages: 2. Reconfiguration
  55. Dataflow Reconfiguration [diagram, built up over slides 51-55]: stop the job after a snapshot (snap-1, snap-2, snap-3, …), change the parallelism, and resume from the snapshot. Problem: How is state repartitioned from a snapshot?
  60. Reconfiguration: The Issue [diagram, state entries such as 0x100: bob, …, 0x449: alice]. Case I: on reconfiguration, scan remote storage for the keys each task is responsible for; this full scan is too slow. Case II: include key locations in the snapshot metadata (bob: 0x100, carol: 0x344, alice: 0x449, chuck: 0x630, …); this is too much metadata.
  63. Reconfiguration: Key Groups. Pre-partition state in hash(K) space into key-groups. • Snapshot Metadata: contains a reference per stored key-group (less metadata) • Reconfiguration: contiguous key-group allocation to available tasks (less IO). Note: the number of key groups controls the trade-off between metadata to keep and reconfiguration speed.
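The key-group scheme can be sketched as follows (simplified Python; Flink's production assignment differs in its hash function and details): every key deterministically maps to one of a fixed number of key-groups, and key-groups are allocated to tasks as contiguous ranges, so rescaling only moves whole groups and the snapshot needs just one reference per group.

```python
NUM_KEY_GROUPS = 8   # fixed at job start; bounds the metadata size

def key_group(key):
    # every key deterministically maps to one key-group
    return hash(key) % NUM_KEY_GROUPS

def group_range(task_index, parallelism):
    # contiguous allocation of key-groups to tasks (less IO on rescale)
    start = task_index * NUM_KEY_GROUPS // parallelism
    end = (task_index + 1) * NUM_KEY_GROUPS // parallelism
    return range(start, end)

# with parallelism 2, task 0 owns groups 0-3 and task 1 owns 4-7;
# rescaling to 4 tasks just re-slices the same contiguous ranges
assert list(group_range(0, 2)) == [0, 1, 2, 3]
assert list(group_range(1, 4)) == [2, 3]
assert key_group("bob") in range(NUM_KEY_GROUPS)
```

With few key-groups the metadata is tiny but a rescale moves coarse chunks of state; with many key-groups the repartitioning is fine-grained at the cost of more per-group references, which is exactly the trade-off noted on the slide.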
  64. Snapshot Usages: 3. Version Control
  70. Version Control [diagram, built up over slides 65-70]: Pipeline v.1 is forked and updated into Pipeline v.2, and later forked again into Pipeline v.3, each version starting from a snapshot.
  71. Snapshot Usages: 4. Isolation
  73. Isolation Levels [diagram]: an external query (select … from facebook.userID, clients.name … inner join clients on …) can run read-committed (against a snapshot) or read-uncommitted (a dirty read on the latest state).
  79. Large Scale Deployment at King [plots, built up over slides 74-79]. Total snapshotting time per snapshot (alignment + async copies) vs. global state size (100-500 GB), up to ~250 sec: the approximate runtime overhead. Total alignment time (PROC, WIN, OUT tasks) vs. parallelism (30, 50, 70), up to ~1400 msec: the alignment cost, which grows with • #shuffles (keyby) • parallelism.
  80. Teaser: More paper highlights • We can use the same technique to coordinate externally managed state with snapshots. • Epoch markers can act as on-the-fly reconfiguration points. • Internals of asynchronous and incremental snapshots.
  81. [closing slide: title and authors as on slide 1]
