Stephan Ewen - Scaling to large State


Published on http://flink-forward.org/kb_sessions/scaling-stream-processing-with-apache-flink-to-very-large-state/

The majority of streaming programs are stateful: windowed aggregations, sessions, joins, complex event processing, and tables all require keeping some form of state across individual events. With the migration of more and more complex batch jobs and data processing pipelines to streaming applications, some streaming programs need to keep terabytes of state. Apache Flink implements a checkpointing-based recovery mechanism that guarantees exactly-once semantics for state, even in the presence of failures. The cost of checkpointing and recovery depends on the size of the program's state. In this talk, we discuss the current status of stateful processing in Apache Flink, as well as the ongoing efforts to make Flink's fault tolerance mechanism scale to very large state sizes, supporting frequent checkpoints and faster recovery of large state without requiring excessive numbers of machines.

Published in: Data & Analytics


  1. Scaling Apache Flink® to very large State
     Stephan Ewen (@StephanEwen)
  2. State in Streaming Programs

     case class Event(producer: String, evtType: Int, msg: String)
     case class Alert(msg: String, count: Long)

     env.addSource(…)
        .map(bytes => Event.parse(bytes))
        .keyBy("producer")
        .mapWithState { (event: Event, state: Option[Int]) =>
          // pattern rules
        }
        .filter(alert => alert.msg.contains("CRITICAL"))
        .keyBy("msg")
        .timeWindow(Time.seconds(10))
        .sum("count")

     Pipeline: Source → map() → mapWithState() → filter() → window()/sum(),
     with a keyBy before mapWithState() and before the window.

  3. State in Streaming Programs
     The same program, with each operator labeled stateless vs. stateful:
     the source, map(), and filter() stages are stateless, while
     mapWithState() and window()/sum() keep state across events.
  4. Internal & External State
     External State:
     • State lives in a separate data store
     • "State capacity" can be scaled independently of the stream processor
     • Usually much slower than internal state
     • Hard to get "exactly-once" guarantees
     Internal State:
     • State lives in the stream processor
     • Faster than external state
     • Always exactly-once consistent
     • The stream processor has to handle scalability
  5. Scaling Stateful Computation
     State Sharding:
     • Operators keep state shards (partitions)
     • Stream and state partitioning are symmetric, so all state operations are local
     • Increasing the operator parallelism is like adding nodes to a key/value store
     Larger-than-memory State:
     • State is naturally fastest in main memory
     • Some applications keep a lot of historic data: a lot of state, moderate throughput
     • Flink has a RocksDB-based state backend that keeps state partially in memory, partially on disk
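The sharding idea can be sketched in a few lines of plain Scala (a hypothetical stand-in, not Flink's actual key-group logic): each key deterministically maps to one parallel operator instance, so all state accesses for that key are local.

```scala
// Hypothetical sketch of key-based state sharding (not Flink's actual
// key-group implementation): a key deterministically maps to one parallel
// operator instance, so state for that key is read and written locally.
object StateSharding {
  def shardFor(key: String, parallelism: Int): Int =
    math.abs(key.hashCode % parallelism)
}
```

Because the stream is partitioned with the same function (keyBy), events and their state always meet on the same instance; raising the parallelism redistributes keys, much like adding nodes to a key/value store.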
  6. Scaling State Fault Tolerance
     Scale checkpointing (performance during regular operation):
     • Checkpoint asynchronously
     • Checkpoint less (incrementally)
     Scale recovery (performance at recovery time):
     • Recover fewer operators
     • Replicate state
  7. Asynchronous Checkpoints
  8. Asynchronous Checkpoints
     Pipeline: Source / filter() / map() → window()/sum(), backed by a state
     index (e.g., RocksDB). Events are persistent and ordered (per partition /
     key) in the log (e.g., Apache Kafka); events flow without replication or
     synchronous writes.
  9. Asynchronous Checkpoints
     Trigger the checkpoint: inject a checkpoint barrier at the source.
  10. Asynchronous Checkpoints
      Take the state snapshot. RocksDB: trigger a state copy-on-write.
  11. Asynchronous Checkpoints
      Persist the state snapshots: snapshots are durably persisted
      asynchronously while the processing pipeline continues.
  12. Asynchronous Checkpoints
      [Figure: RocksDB's LSM tree.]
  13. Asynchronous Checkpoints
      Asynchronous checkpoints work with the RocksDBStateBackend:
      • In Flink 1.1.x, call RocksDBStateBackend.enableFullyAsyncSnapshots()
      • In Flink 1.2.x, this is the default mode
      • FsStateBackend and MemStateBackend are not yet fully asynchronous
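The asynchronous snapshot sequence on slides 8 to 12 can be sketched in plain Scala (a simplified stand-in, using a concurrent map instead of RocksDB): on a barrier, take a cheap consistent copy of the state, then persist it on a background thread while processing continues.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Simplified stand-in for the asynchronous snapshot mechanism (RocksDB
// achieves this with copy-on-write / LSM snapshots): taking the snapshot
// is cheap and consistent, and persisting it happens off the processing path.
object AsyncSnapshot {
  val state = TrieMap[String, Long]()

  def onBarrier(persist: Map[String, Long] => Unit): Future[Unit] = {
    // Cheap, consistent copy; the pipeline keeps mutating `state` meanwhile.
    val snapshot = state.readOnlySnapshot().toMap
    Future(persist(snapshot)) // durably persisted asynchronously
  }
}
```

The key property is that mutations arriving after the barrier do not leak into the snapshot, so the pipeline never stalls on durable writes.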
  14. Work in Progress
      The following slides show ideas, designs, and work in progress.
      The final techniques that end up in Flink releases may differ,
      depending on results.
  15. Incremental Checkpointing
  16. Full Checkpointing
      [Figure: the state evolves from {A, B, C, D} at t1 to {A, C, D, E, F}
      at t2 to {C, D, E, G, H, I} at t3; each of checkpoints 1 to 3 writes
      the complete state at its point in time.]
  17. Incremental Checkpointing
      [Figure: the same state evolution; checkpoint 1 writes the full state
      {A, B, C, D}, while checkpoints 2 and 3 write only the changes, e.g.,
      the added entries {E, F} and {G, H, I}.]
  18. Incremental Checkpointing
      [Figure: storage layout over four checkpoints; checkpoint 1 stores a
      full snapshot C1, checkpoints 2 and 3 store deltas d2 and d3 that
      reference C1, and checkpoint 4 stores a new full snapshot C4.]
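The delta scheme in the figures above can be sketched as a toy model over an in-memory map (the Delta type is hypothetical, not a Flink API): a delta records what changed since the previous checkpoint, and recovery folds the deltas onto the last full snapshot.

```scala
// Toy model of incremental checkpointing (the Delta type is hypothetical,
// not a Flink API). A delta holds the entries added or changed since the
// previous checkpoint plus the keys removed; recovery applies the deltas,
// in order, on top of the last full snapshot.
object IncrementalCheckpoint {
  type State = Map[String, String]
  case class Delta(upserts: State, removed: Set[String])

  def delta(prev: State, curr: State): Delta =
    Delta(
      curr.filter { case (k, v) => !prev.get(k).contains(v) },
      prev.keySet -- curr.keySet
    )

  def restore(full: State, deltas: Seq[Delta]): State =
    deltas.foldLeft(full)((s, d) => (s -- d.removed) ++ d.upserts)
}
```

Each delta is typically far smaller than the full state, which is the whole point: checkpoints get cheaper, at the cost of applying a chain of deltas on recovery.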
  19. Incremental Checkpointing: Discussion
      To avoid applying many deltas on recovery, perform a full checkpoint
      once in a while:
      • Option 1: every N checkpoints
      • Option 2: once the size of the deltas is as large as a full checkpoint
      Ideally, have a separate merger of deltas (see the later slides on
      state replication).
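The two options above can be captured as a small policy (a hypothetical sketch, not a Flink API), deciding whether the next checkpoint should be a full snapshot or a delta:

```scala
// Hypothetical sketch of the full-checkpoint policy discussed above:
// decide whether a given checkpoint should be a full snapshot or a delta.
case class FullCheckpointPolicy(everyN: Int) {
  // Option 1: take a full checkpoint every N checkpoints.
  def fullByCount(checkpointId: Long): Boolean =
    checkpointId % everyN == 0

  // Option 2: take a full checkpoint once the accumulated deltas are as
  // large as the last full checkpoint.
  def fullBySize(deltaBytesSinceFull: Long, fullCheckpointBytes: Long): Boolean =
    deltaBytesSinceFull >= fullCheckpointBytes
}
```

Option 1 bounds the length of the delta chain; option 2 bounds the total storage and recovery work relative to a full snapshot.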
  20. Incremental Recovery
  21. Full Recovery
      Flink's recovery provides "global consistency": after recovery, all
      states are together as if a failure-free run had happened, even in the
      presence of non-determinism:
      • Network
      • External lookups and other non-deterministic user code
      All operators rewind to the latest completed checkpoint.
  22. Incremental Recovery (figure-only slide)
  23. Incremental Recovery (figure-only slide)
  24. Incremental Recovery (figure-only slide)
  25. State Replication
  26. Standby State Replication
      The biggest delay during recovery is loading state. The only way to
      alleviate this delay is for the recovering machines not to need to
      load state:
      • keep state outside the stream processor, or
      • have hot standbys that can proceed immediately.
      Standbys: replicate state to N other TaskManagers. For failures of up
      to N-1 TaskManagers, no state loading is necessary. Replication
      consistency is managed by checkpoints, and replication can happen in
      addition to checkpointing to DFS.
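The standby scheme can be sketched as follows (hypothetical placement logic, not Flink's): each task's state is replicated to N TaskManagers, so as long as at most N-1 of those fail, some standby already holds the state and can take over without loading anything.

```scala
// Hypothetical sketch of standby placement for state replication (not a
// Flink API): each task's state is replicated to n distinct TaskManagers;
// after failures, any surviving replica can take over without state loading.
object StandbyReplication {
  def replicas(taskManagers: Vector[String], n: Int, taskId: Int): Vector[String] = {
    // Place the primary and its standbys on n distinct TaskManagers.
    val start = taskId % taskManagers.size
    (0 until n).map(i => taskManagers((start + i) % taskManagers.size)).toVector
  }

  // First replica that did not fail, if any; with up to n-1 failures among
  // the n replicas, a survivor always exists.
  def survivor(reps: Vector[String], failed: Set[String]): Option[String] =
    reps.find(tm => !failed.contains(tm))
}
```

The checkpoint mechanism keeps the replicas consistent; replication here is an addition to, not a replacement for, checkpointing to DFS.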
  27. Thank you! Questions?
