Flink Forward SF 2017: Stefan Richter - Improvements for large state and recovery in Flink

Stateful stream processing with exactly-once guarantees is one of Apache Flink's distinctive features, and the scale of state managed by Flink in production keeps growing. This leads to a couple of interesting challenges for state handling in Flink. In this talk, we present current and future developments to improve the handling of large state and recovery in Apache Flink. We show how to keep snapshots of large state swift and how to minimize negative effects on job performance through incremental and asynchronous checkpointing. Furthermore, we discuss how to greatly accelerate recovery after failures and for rescaling. In this context, we go into detail about improved execution graph recovery, caching state on task managers, and exploiting features of modern storage architectures in our state backends.


  1. Stefan Richter (@stefanrrichter), April 11, 2017
     Improvements for large state in Apache Flink
  2. State in Streaming Programs

     case class Event(producer: String, evtType: Int, msg: String)
     case class Alert(msg: String, count: Long)

     env.addSource(…)
       .map(bytes => Event.parse(bytes))
       .keyBy("producer")
       .mapWithState { (event: Event, state: Option[Int]) =>
         // pattern rules
       }
       .filter(alert => alert.msg.contains("CRITICAL"))
       .keyBy("msg")
       .timeWindow(Time.seconds(10))
       .sum("count")

     [Dataflow: Source → map() → mapWithState() → filter() → window()/sum(), connected by keyBy.]
  3. State in Streaming Programs (continued)
     [The same code and dataflow as on the previous slide, with each operator annotated as stateless or stateful.]
  4. Internal vs. External State
     External State:
     • State lives in a separate data store
     • "State capacity" can scale independently of the stream processor
     • Usually much slower than internal state
     • Hard to get "exactly-once" guarantees
     Internal State:
     • State lives in the stream processor
     • Faster than external state
     • Working area is local to the computation
     • Checkpointed to stable storage (DFS)
     • Always exactly-once consistent
     • The stream processor has to handle scalability
  5. Keyed State Backends
     HeapKeyedStateBackend:
     - State lives in memory, on the Java heap
     - Operates on objects
     - Think of a hash map {key obj -> state obj}
     - Async snapshots supported
     RocksDBKeyedStateBackend:
     - State lives in off-heap memory and on disk
     - Operates on bytes, uses serialization
     - Think of a K/V store {key bytes -> state bytes}
     - Log-structured merge (LSM) tree
     - Async snapshots
     - Incremental snapshots
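     A minimal configuration sketch of how a job would opt into the RocksDB backend with incremental snapshots, assuming Flink 1.3+ with the flink-statebackend-rocksdb dependency on the classpath; the checkpoint URI is a placeholder:

       import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
       import org.apache.flink.streaming.api.scala._

       val env = StreamExecutionEnvironment.getExecutionEnvironment
       // Take a checkpoint every 10 seconds.
       env.enableCheckpointing(10000)
       // RocksDB keyed state backend; the second argument enables incremental snapshots.
       // "hdfs:///flink/checkpoints" is only a placeholder checkpoint directory.
       env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true))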
  6. Asynchronous Checkpoints
  7. Synchronous Checkpointing
     [Diagram: the Job Manager's Checkpoint Coordinator thread triggers the checkpoint; on the Task Manager, the event processing thread (loop: processElement) and the checkpointing thread write the state to DFS and acknowledge the checkpoint.]
     Why is async checkpointing so essential for large state?
  8. Synchronous Checkpointing
     Problem: all event processing is on hold while the state is written, to avoid concurrent modifications to the state that is being snapshotted.
  9. Asynchronous Checkpointing
     [Diagram: the Checkpoint Coordinator triggers the checkpoint; the event processing thread takes a snapshot of the state and keeps processing, while the checkpointing thread writes the state to DFS and acknowledges the checkpoint.]
  10. Asynchronous Checkpointing
     Problem: how to deal with concurrent modifications?
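     The gist of the asynchronous approach can be illustrated outside of Flink's actual classes: the processing thread freezes a point-in-time view of its state and hands it to a background thread, so writing to DFS never blocks event processing. A minimal, hypothetical sketch using an immutable map as the snapshot-friendly state:

       import java.util.concurrent.Executors
       import scala.concurrent.{ExecutionContext, Future}

       object AsyncSnapshotSketch {
         // Single background thread playing the role of the checkpointing thread.
         private val snapshotPool =
           ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

         // Immutable map: "freezing" a snapshot is just keeping a reference, no copying, no locking.
         @volatile private var state: Map[String, Long] = Map.empty

         // Event processing thread: keeps swapping in updated versions of the state.
         def processElement(key: String): Unit =
           state = state.updated(key, state.getOrElse(key, 0L) + 1)

         // Triggered by the checkpoint barrier: freeze the current view, upload it asynchronously.
         def triggerCheckpoint(writeToDfs: Map[String, Long] => Unit): Future[Unit] = {
           val frozen = state                       // point-in-time view, never modified afterwards
           Future(writeToDfs(frozen))(snapshotPool) // runs on the snapshot thread, processing continues
         }
       }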
  11. Incremental Checkpoints
  12. What we will discuss
     ▪ What are incremental checkpoints?
     ▪ Why is RocksDB so well suited for this?
     ▪ How do we integrate this with Flink’s checkpointing?
  13. Full Checkpointing
     [Figure (key→state tables over time): the complete state is written at every checkpoint: Checkpoint 1 = {2→B, 4→W, 6→N}, Checkpoint 2 = {2→B, 3→K, 4→L, 6→N}, Checkpoint 3 = {2→Q, 3→K, 6→N, 9→S}.]
  14. Incremental Checkpointing
     [Figure: the same sequence of states, but each incremental checkpoint stores only the delta to its predecessor: iCheckpoint 1 = Δ(-, c1) = {2→B, 4→W, 6→N}, iCheckpoint 2 = Δ(c1, c2) = {3→K, 4→L}, iCheckpoint 3 = Δ(c2, c3) = {2→Q, 4→deleted, 9→S}.]
  15. Incremental Recovery
     [Figure: recovery rebuilds the latest state by applying the deltas in order: Δ(-, c1) + Δ(c1, c2) + Δ(c2, c3).]
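     A hypothetical sketch (not Flink code) of what replaying the increments means: apply the deltas from oldest to newest, where a delta maps a key either to a new value or to a deletion tombstone (like the deleted key 4 in iCheckpoint 3):

       object IncrementalRecoverySketch {
         // A delta maps a key either to its new value or to None (a deletion tombstone).
         type Delta = Map[Int, Option[String]]

         // Rebuild the latest state by applying the deltas from oldest to newest.
         def recover(deltasOldestFirst: Seq[Delta]): Map[Int, String] =
           deltasOldestFirst.foldLeft(Map.empty[Int, String]) { (state, delta) =>
             delta.foldLeft(state) {
               case (s, (key, Some(value))) => s.updated(key, value) // insert or overwrite
               case (s, (key, None))        => s - key               // tombstone: drop the key
             }
           }
       }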
  16. RocksDB Architecture (simplified)
     [Figure: writes go to an in-memory Memtable, which is periodically flushed to immutable SSTable files (indexed, sorted by key) on storage; SSTables are periodically merged in the background.]
     - All writes go against the Memtable
     - The Memtable is a mutable buffer (a couple of MB) with unique keys
     - Reads consider the Memtable first, then the SSTables
     - SSTables are immutable
     - We can consider newly created SSTables as Δs!
  17. RocksDB Compaction
     ▪ A background thread merges SSTable files
     ▪ Removes copies of the same key (the latest version survives)
     ▪ Actual deletion of keys happens here
     [Example: merging SSTable-1 = {2→C, 7→N, 9→Q} and SSTable-2 = {1→V, 7→deleted, 9→S} yields SSTable-3 = {1→V, 2→C, 9→S}.]
     Compaction consolidates our Δs!
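     A hypothetical sketch of the merge step (not RocksDB's actual implementation): later tables win on duplicate keys, and tombstoned keys are dropped for good, which reproduces the SSTable-1/2/3 example above:

       object CompactionSketch {
         // An SSTable modeled as a key/value map; None marks a deleted key.
         type SSTable = Map[Int, Option[String]]

         // Merge from oldest to newest so later entries overwrite earlier ones,
         // then drop all tombstones: only the latest surviving version of each key remains.
         def compact(oldestToNewest: Seq[SSTable]): SSTable =
           oldestToNewest
             .foldLeft(Map.empty[Int, Option[String]])(_ ++ _)
             .filter { case (_, value) => value.isDefined }

         // Slide example:
         // compact(Seq(Map(2 -> Some("C"), 7 -> Some("N"), 9 -> Some("Q")),
         //             Map(1 -> Some("V"), 7 -> None, 9 -> Some("S"))))
         //   == Map(1 -> Some("V"), 2 -> Some("C"), 9 -> Some("S"))
       }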
  18. Flink’s Incremental Checkpointing
     [Setup diagram: the Checkpoint Coordinator with its SharedStateRegistry on the Job Manager, three StatefulMap instances (1/3, 2/3, 3/3), the network between them, and a DFS for checkpoint data.]
  19. Flink’s Incremental Checkpointing
     Step 1: The Checkpoint Coordinator sends a checkpoint barrier that triggers a snapshot on each instance.
  20. Flink’s Incremental Checkpointing
     Step 2: Each instance writes its incremental snapshot (Δ1, Δ2, Δ3) to distributed storage.
  21. Flink’s Incremental Checkpointing
     Step 2 (continued): Each instance writes its incremental snapshot to distributed storage.
  22. Incremental Snapshot of Operator
     [Figure: the operator's local RocksDB working directory ("data") holds a manifest and SSTable files such as 00225.sst and 00226.sst; on the Distributed FS://sfmap/1/ there is a shared directory ("share"), and the SharedStateRegistry tracks which files live there.]
  23. Incremental Snapshot of Operator
     [Figure: to take a snapshot, a local checkpoint directory chk-1 is created with a copy/hardlink of the current manifest plus 00225.sst and 00226.sst.]
  24. Incremental Snapshot of Operator
     [Figure: the snapshot is uploaded asynchronously to the DFS: the SSTables go into the share directory, and chk-1 on the DFS stores the manifest and an sst.list, the list of SSTables referenced by this snapshot.]
  25. Flink’s Incremental Checkpointing
     Step 3: Each instance acknowledges and sends a handle (e.g. a file path in DFS: H1, H2, H3) to the Checkpoint Coordinator.
  26. Incremental Snapshot of Operator
     [Figure: the SharedStateRegistry now tracks the shared files with reference counts {00225.sst = 1}, {00226.sst = 1}.]
  27. Flink’s Incremental Checkpointing
     Step 4: The Checkpoint Coordinator signals CP 1 success to all instances.
  28. Flink’s Incremental Checkpointing
     What happens when another checkpoint is triggered?
  29. Incremental Snapshot of Operator
     [Figure: at the time of the next snapshot, the local working directory contains 00226.sst, 00228.sst, and 00229.sst (00225.sst is gone from the local working set); the DFS share directory still holds 00225.sst and 00226.sst, and the SharedStateRegistry counts are {00225.sst = 1}, {00226.sst = 1}.]
  30. Incremental Snapshot of Operator
     [Figure: the local checkpoint directory chk-2 links the current manifest plus 00226.sst, 00228.sst, and 00229.sst.]
  31. Incremental Snapshot of Operator
     [Figure: only the missing SSTable files (00228.sst, 00229.sst) are uploaded to the share directory; chk-2 on the DFS stores its manifest and sst.list, and the SharedStateRegistry is updated to {00225.sst = 1}, {00226.sst = 2}, {00228.sst = 1}, {00229.sst = 1}.]
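     A hypothetical sketch of the "upload missing SSTable files" step: diff the SSTables referenced by the current snapshot against what earlier checkpoints already placed in the shared DFS directory and transfer only the new files; the names and the upload callback here are placeholders:

       object UploadMissingSketch {
         // currentSstables: files referenced by this snapshot (e.g. from the local chk-2 directory).
         // alreadyShared:   files that earlier checkpoints already uploaded to the share directory.
         def uploadMissing(currentSstables: Set[String],
                           alreadyShared: Set[String],
                           upload: String => Unit): Set[String] = {
           val missing = currentSstables -- alreadyShared // e.g. Set("00228.sst", "00229.sst")
           missing.foreach(upload)                        // 00226.sst is only re-referenced, not re-uploaded
           currentSstables                                // full list that goes into this checkpoint's sst.list
         }
       }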
  32. Deleting Incremental Checkpoints
     [Figure: the Checkpoint Coordinator now holds CP 1 and CP 2, each referencing handles H1, H2, H3; we now delete the outdated checkpoint CP 1.]
  33. Deleting Incremental Snapshot
     [Figure: before the deletion, the share directory holds 00225.sst, 00226.sst, 00228.sst, and 00229.sst, chk-1 and chk-2 each hold a manifest and sst.list, and the registry counts are {00225.sst = 1}, {00226.sst = 2}, {00228.sst = 1}, {00229.sst = 1}.]
  34. Deleting Incremental Snapshot
     [Figure: deleting chk-1 unregisters the files it referenced, dropping the counts to {00225.sst = 0}, {00226.sst = 1}, {00228.sst = 1}, {00229.sst = 1}.]
  35. Deleting Incremental Snapshot
     [Figure: files whose count reaches 0 are removed from the share directory: 00225.sst is deleted, leaving {00226.sst = 1}, {00228.sst = 1}, {00229.sst = 1}.]
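     A hypothetical reference-counting sketch of the idea behind the SharedStateRegistry (not Flink's actual class): registering a file for a checkpoint bumps its count, discarding a checkpoint decrements it, and a shared file is physically removed only when its count reaches zero, exactly as in the chk-1 deletion above:

       import scala.collection.mutable

       class RefCountingRegistrySketch(deleteFromDfs: String => Unit) {
         private val refCounts = mutable.Map.empty[String, Int]

         // Called when a completed checkpoint references a shared file.
         def register(file: String): Unit =
           refCounts.update(file, refCounts.getOrElse(file, 0) + 1)

         // Called when a checkpoint that referenced the file is discarded.
         def unregister(file: String): Unit = {
           val remaining = refCounts.getOrElse(file, 0) - 1
           if (remaining <= 0) {
             refCounts.remove(file)
             deleteFromDfs(file)   // last reference gone: remove the file from the share directory
           } else {
             refCounts.update(file, remaining)
           }
         }
       }

       // Deleting chk-1 above unregisters 00225.sst (1 -> 0: file deleted) and
       // 00226.sst (2 -> 1: file kept, because chk-2 still references it).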
  36. Wrapping up
  37. Incremental checkpointing benefits
     ▪ Incremental checkpoints can dramatically reduce checkpointing overhead for large state.
     ▪ Incremental checkpoints are asynchronous.
     ▪ RocksDB’s compaction consolidates the increments, keeping the overhead for recovery low.
  38. Incremental checkpointing limitations
     ▪ Breaks the unification of checkpoints and savepoints (checkpoints: low overhead; savepoints: features).
     ▪ Uses a RocksDB-specific format.
     ▪ Currently no support for rescaling from an incremental checkpoint.
  39. Further improvements in Flink 1.3/1.4
     ▪ AsyncHeapKeyedStateBackend (merged)
     ▪ AsyncHeapOperatorStateBackend (PR)
     ▪ MapState (merged)
     ▪ RocksDBInternalTimerService (PR)
     ▪ AsyncHeapInternalTimerService
  40. Questions?
