Tech Talk @ Google on Flink Fault Tolerance and HA

A technical review and examples on the core of the fault tolerance architecture of Flink.
  1. Apache Flink Streaming: Resiliency and Consistency. Paris Carbone - PhD Candidate, KTH Royal Institute of Technology <parisc@kth.se, senorcarbone@apache.org>. Technical Talk @ Google
  2. Overview • The Flink execution engine architecture • Exactly-once processing challenges • Using snapshots for recovery • The ABS algorithm for DAGs and cyclic dataflows • Recovery, cost and performance • Job Manager high availability
  3. Current Focus (architecture diagram): Streaming API and Batch API on top of the Flink Optimiser and Flink Runtime, with libraries (Table, ML, Gelly) and State Management
  4. Executing Pipelines • Streaming pipelines translate into job graphs (diagram: User Pipeline, Flink Client, Flink Job Graph Builder/Optimiser, Job Graph, Flink Runtime)
  5. Unbounded Data Processing Architectures • 1) Streaming (distributed data flow): long-lived task execution, state is kept inside tasks • 2) Micro-batch (e.g. Spark Streaming), building on batch engines (Hadoop, Spark)
  6. The Flink Runtime Engine • Tasks run operator logic in a pipelined fashion • They are scheduled among workers • State is kept within tasks (long-lived task execution) • The Job Manager handles scheduling and monitoring
  7. Task Failures • Task failures are guaranteed to occur • We need to make them transparent • Can we simply recover a task from scratch? Recovery must preserve task state with no loss, no duplicates and a consistent state
  8. Record Acknowledgements • We can monitor (bookkeep) the consumption of every event • This guarantees no loss • Duplicates and inconsistent state are not handled
  9. Atomic Transactions per Update (e.g. Trident) • Guarantees no loss and consistent state • No duplication: achievable e.g. with Bloom filters or caching • Fine-grained recovery • Non-constant load on external distributed storage can lead to unsustainable execution
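To make the per-update idea concrete, here is a minimal sketch (not Flink or Trident code) of a sink that applies each update as one external transaction and drops replayed records by remembering recently seen record IDs; the RecordStore interface and the notion of a unique record ID are assumptions for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: transactional, de-duplicating state updates.
// RecordStore and recordId are illustrative assumptions, not a real API.
interface RecordStore {
    void beginTransaction();
    void write(String key, long value);
    void commit();
}

class DedupingTransactionalSink {
    private final RecordStore store;
    // Bounded cache of recently applied record IDs (a Bloom filter would trade memory for false positives).
    private final Map<Long, Boolean> seen = Collections.synchronizedMap(
            new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<Long, Boolean> e) { return size() > 100_000; }
            });

    DedupingTransactionalSink(RecordStore store) { this.store = store; }

    void apply(long recordId, String key, long value) {
        if (seen.putIfAbsent(recordId, Boolean.TRUE) != null) {
            return; // replayed record: already applied, skip to avoid duplicates
        }
        store.beginTransaction();  // one external transaction per update: no loss, consistent state,
        store.write(key, value);   // but the per-record round-trips create non-constant external load
        store.commit();
    }
}
```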
  10. Lessons Learned from Batch (batch-1, batch-2) • If a batch computation fails, simply repeat the computation as a transaction • The transaction rate is constant • Can we apply these principles to a true streaming execution?
  11-20. Distributed Snapshots: "A collection of operator states and records in transit (channels) that reflects a moment at a valid execution." (Diagram: processes p1-p4 with a consistent cut and an inconsistent cut containing a causality violation.) Idea: we can resume from a snapshot that defines a consistent cut. Assumptions: repeatable sources, reliable FIFO channels.
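A hedged formalization of the quoted definition, using Lamport's happened-before relation (my notation, not from the slides): a cut is consistent exactly when it is closed under causal predecessors, so the snapshot never records the receipt of a message whose send lies outside the cut.

```latex
% C is a set of events, one history prefix per process p1..p4.
C \text{ is a consistent cut} \iff \forall e \in C,\ \forall e' :\ \big(e' \rightarrow e\big) \Rightarrow e' \in C
% The inconsistent cut on the slide violates this: it contains a receive whose send is outside C.
```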
  21-29. Distributed Snapshots (timeline): snapshots snap-t1 and snap-t2 are taken at times t1 and t2; after a failure at t3, execution is reset from snap-t2. Assumptions: repeatable sources, reliable FIFO channels.
  30. Taking Snapshots - the initial approach (see Naiad): • Pause execution at t1, t2, … • Collect state • Restore execution
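A minimal sketch of this stop-the-world approach, with hypothetical task/coordinator interfaces (not Naiad's or Flink's APIs): every task is paused before any state is collected, so the whole pipeline stalls for the duration of the snapshot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the naive "pause everything" snapshot, for contrast with ABS below.
interface PausableTask {
    void pause();                     // stop processing input
    Map<String, byte[]> collectState();
    void resume();
}

class StopTheWorldSnapshotter {
    List<Map<String, byte[]>> snapshot(List<PausableTask> tasks) {
        tasks.forEach(PausableTask::pause);      // global pause: a latency spike for the whole job
        List<Map<String, byte[]>> states = new ArrayList<>();
        for (PausableTask t : tasks) {
            states.add(t.collectState());        // collect every task's state while nothing moves
        }
        tasks.forEach(PausableTask::resume);     // only now does processing continue
        return states;
    }
}
```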
  31. Lamport to the Rescue: "The global-state-detection algorithm is to be superimposed on the underlying computation: it must run concurrently with, but not alter, this underlying computation."
  32. Chandy-Lamport Snapshots - using markers/barriers: • A marker triggers snapshots • It separates pre-snapshot from post-snapshot events • It leads to a consistent execution cut • Markers propagate the snapshot while execution continues under continuous ingestion
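The classic marker rules can be sketched roughly as follows, for one process (schematic Java; the Channel type, state representation and message dispatch are illustrative assumptions): on the first marker the process records its own state and emits markers on all outputs; afterwards it logs in-flight records per input channel until that channel's marker arrives.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Schematic Chandy-Lamport rules for a single process; types are illustrative only.
class ChandyLamportProcess {
    interface Channel { void send(Object msg); }
    static final Object MARKER = new Object();

    private final List<Channel> outputs;
    private byte[] recordedState;                                            // local state at first marker
    private final Map<Channel, List<Object>> channelLog = new HashMap<>();   // in-flight channel records
    private final Set<Channel> markerReceived = new HashSet<>();
    private boolean recording = false;

    ChandyLamportProcess(List<Channel> outputs) { this.outputs = outputs; }

    void onMessage(Channel in, Object msg, byte[] currentState) {
        if (msg == MARKER) {
            if (!recording) {                            // first marker: record state, broadcast markers
                recordedState = currentState.clone();
                recording = true;
                outputs.forEach(out -> out.send(MARKER));
            }
            markerReceived.add(in);                      // this channel's recording is complete (possibly empty)
        } else if (recording && !markerReceived.contains(in)) {
            // pre-marker record still in transit on this channel: it belongs to the snapshot
            channelLog.computeIfAbsent(in, c -> new java.util.ArrayList<>()).add(msg);
        }
        // normal processing of msg continues regardless (not shown)
    }
}
```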
  33. Asynchronous Snapshots: markers propagate through the dataflow while snapshots snap-t1 and snap-t2 are taken, without halting execution
  34-50. Asynchronous Barrier Snapshotting (animation): barriers flow with the stream; at tasks with multiple inputs, channels whose barrier has already arrived are marked pending and their records are buffered while the task is aligning, until the barrier has been received on every input
  51. Asynchronous Barrier Snapshotting Benefits • It takes advantage of the execution graph structure • No records in transit are included in the snapshot • Aligning has a lower impact than halting
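A rough sketch of the alignment step at a task with several input channels (illustrative types, not Flink's internal classes): records from channels whose barrier has already arrived are buffered until the barrier is seen on every input, at which point the task snapshots its state and forwards a single barrier downstream.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Illustrative barrier-alignment bookkeeping for one checkpoint at a multi-input task.
class BarrierAligner {
    private final int numInputs;
    private final Set<Integer> blockedInputs = new HashSet<>();             // inputs whose barrier arrived
    private final Map<Integer, Queue<Object>> buffered = new HashMap<>();   // records held back while aligning

    BarrierAligner(int numInputs) { this.numInputs = numInputs; }

    // Returns true when the task should snapshot its state and forward the barrier.
    boolean onBarrier(int input) {
        blockedInputs.add(input);
        return blockedInputs.size() == numInputs;       // all barriers in: alignment complete
    }

    // Returns true if the record may be processed now, false if it must wait for alignment.
    boolean onRecord(int input, Object record) {
        if (blockedInputs.contains(input)) {            // post-barrier record on a blocked channel
            buffered.computeIfAbsent(input, i -> new ArrayDeque<>()).add(record);
            return false;                               // "aligning": held until the checkpoint completes
        }
        return true;                                    // pre-barrier record: process immediately
    }

    // After the snapshot, release the buffered records and unblock the channels.
    Map<Integer, Queue<Object>> finishAlignment() {
        blockedInputs.clear();
        Map<Integer, Queue<Object>> drained = new HashMap<>(buffered);
        buffered.clear();
        return drained;
    }
}
```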
  52-57. Snapshots on Cyclic Dataflows (animation): naively aligning on a cycle makes an operator wait for a barrier from its own loop back-edge, which leads to a deadlock
  58. Snapshots on Cyclic Dataflows - requirements: • Checkpointing should eventually terminate • Records in transit within loops should be included in the checkpoint
  59-66. Snapshots on Cyclic Dataflows (animation): the loop head performs a downstream backup - it forwards the barrier into the loop, then logs the records arriving on the back-edge until the barrier returns, and includes that log in the snapshot
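One way to make the downstream backup concrete, as a hedged sketch with illustrative types (not Flink's internal code): the loop-head task does not wait for a barrier on the back-edge; it snapshots, emits the barrier into the loop, and backs up every back-edge record until its own barrier comes around, so records in transit inside the cycle end up in the checkpoint and the checkpoint terminates.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative back-edge handling on a cyclic dataflow (ABS-style sketch).
class LoopHeadSnapshotter {
    private final List<Object> backEdgeLog = new ArrayList<>();  // records in transit inside the loop
    private boolean loggingBackEdge = false;

    // Barrier arrived on the regular (non-loop) inputs: snapshot and start logging the back-edge.
    byte[] onBarrierFromOutside(byte[] operatorState, Runnable forwardBarrier) {
        forwardBarrier.run();         // emit the barrier into the loop instead of waiting on the back-edge
        loggingBackEdge = true;       // back up loop records from this point on
        return operatorState.clone(); // local snapshot of the operator state
    }

    // Every record arriving on the back-edge channel.
    void onBackEdgeRecord(Object record) {
        if (loggingBackEdge) {
            backEdgeLog.add(record);  // these records belong to the checkpoint
        }
    }

    // The barrier returned around the loop: the checkpoint for this task is complete.
    List<Object> onBarrierFromBackEdge() {
        loggingBackEdge = false;
        List<Object> logged = new ArrayList<>(backEdgeLog);
        backEdgeLog.clear();
        return logged;                // persisted together with the operator state
    }
}
```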
  67. Implementation in Flink • A checkpoint Coordinator/Driver actor per job in the JM: sends periodic markers to the sources; collects acknowledgements with state handles and registers complete snapshots; injects references to the latest consistent state handles and back-edge records into pending execution graph tasks • Tasks: snapshot state transparently in the given backend upon receiving a barrier and send an ack to the coordinator; propagate barriers further like normal records
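A compressed sketch of the task-side protocol described above (StateBackend, CheckpointCoordinatorClient and OutputChannel are placeholder interfaces, not Flink's actual classes): on a barrier the task writes its state to the backend, sends the resulting state handle as an acknowledgement to the coordinator, and forwards the barrier like a regular record.

```java
import java.util.List;

// Placeholder interfaces standing in for the roles described on the slide.
interface StateBackend { String write(long checkpointId, byte[] state); }   // returns a state handle
interface CheckpointCoordinatorClient { void acknowledge(long checkpointId, String taskId, String stateHandle); }
interface OutputChannel { void emit(Object recordOrBarrier); }

class SnapshottingTask {
    private final String taskId;
    private final StateBackend backend;
    private final CheckpointCoordinatorClient coordinator;
    private final List<OutputChannel> outputs;
    private byte[] operatorState = new byte[0];

    SnapshottingTask(String taskId, StateBackend backend,
                     CheckpointCoordinatorClient coordinator, List<OutputChannel> outputs) {
        this.taskId = taskId;
        this.backend = backend;
        this.coordinator = coordinator;
        this.outputs = outputs;
    }

    // Invoked once barrier alignment for this checkpoint has completed.
    void onBarrier(long checkpointId, Object barrier) {
        String handle = backend.write(checkpointId, operatorState);   // snapshot state in the backend
        coordinator.acknowledge(checkpointId, taskId, handle);        // ack with the state handle
        outputs.forEach(out -> out.emit(barrier));                    // propagate the barrier downstream
    }
}
```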
  68. Recovery • Full rescheduling of the execution graph from the latest checkpoint - simple, no further modifications needed • Partial rescheduling of the execution graph for all upstream dependencies - trickier, duplicate elimination must be guaranteed
  69. Performance Impact: http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
  70-74. Node Failures (animation): a Job Manager schedules tasks on Task Managers; the Client submits the job graph through the optimiser. JM state: pending execution graphs, snapshots (state handles). Failed tasks can be redeployed on other Task Managers, but the Job Manager itself is a single point of failure
  75. Passive Standby JM: several JMs run with passive standbys; the Client talks to the leader, and Zookeeper holds the shared state • Leader JM address • Pending execution graphs • State snapshots
  76-80. Eventual Leader Election (animation over t0..t3): one JM is elected leader; when it fails, the remaining JMs run leader election and, after a recovery period, a standby JM eventually takes over
  81-86. Scheduling Jobs (animation): the Client submits job-1 to the leading Job Manager; the optimiser builds the execution graph, which is persisted with a blocking write in Zookeeper before the tasks are deployed to the Task Managers
  87-95. Logging Snapshots (animation): while checkpointing, tasks write their states to a State Backend; only the resulting state handles are written to Zookeeper with asynchronous writes, and the completed global snapshot is registered there
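The split the slides describe, sketched with placeholder interfaces (DurableStateBackend, ZookeeperClient and the znode paths are assumptions for illustration, not Flink's or ZooKeeper's actual APIs): the potentially large checkpoint data goes to the state backend, and only a small state handle, plus the completed global snapshot metadata, is written asynchronously to Zookeeper.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Placeholder interfaces; the znode paths below are made up for the example.
interface DurableStateBackend { String store(byte[] checkpointData); }   // returns a state handle (e.g. a path)
interface ZookeeperClient { CompletableFuture<Void> writeAsync(String znode, byte[] smallPayload); }

class SnapshotLogger {
    private final DurableStateBackend backend;
    private final ZookeeperClient zk;

    SnapshotLogger(DurableStateBackend backend, ZookeeperClient zk) {
        this.backend = backend;
        this.zk = zk;
    }

    // Large state goes to the backend; only the handle is logged (asynchronously) in Zookeeper.
    CompletableFuture<Void> logTaskSnapshot(long checkpointId, String taskId, byte[] checkpointData) {
        String handle = backend.store(checkpointData);
        return zk.writeAsync("/flink/checkpoints/" + checkpointId + "/" + taskId, handle.getBytes());
    }

    // Once every task has acknowledged, the JM registers the global snapshot as complete.
    CompletableFuture<Void> registerGlobalSnapshot(long checkpointId, Map<String, String> taskHandles) {
        return zk.writeAsync("/flink/completed/" + checkpointId, taskHandles.toString().getBytes());
    }
}
```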
  96. HA in Zookeeper • Task Managers query ZK for the current leader before connecting (with a lease) • Upon standby failover, restart the pending execution graphs • Inject state handles from the last global snapshot
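As a rough illustration of the first bullet, using the plain ZooKeeper client (the /flink/leader znode path and payload format are assumptions, not Flink's actual layout): a Task Manager reads the leader address and watches the znode so it can re-query and reconnect when a new leader is elected.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustration only: znode path and payload format are hypothetical.
public class LeaderLookup implements Watcher {
    private static final String LEADER_ZNODE = "/flink/leader";   // hypothetical path
    private final ZooKeeper zk;

    public LeaderLookup(String zkQuorum) throws Exception {
        // The session timeout acts as the "lease": if the session expires, cached leader info is stale.
        this.zk = new ZooKeeper(zkQuorum, 10_000, this);
    }

    // Read the current leader JM address and re-register a watch so changes trigger process().
    public String currentLeaderAddress() throws KeeperException, InterruptedException {
        byte[] data = zk.getData(LEADER_ZNODE, true, new Stat());
        return new String(data);                                   // e.g. "jm-host:6123" in this sketch
    }

    @Override
    public void process(WatchedEvent event) {
        // Fired when the leader znode changes: the Task Manager should re-query before reconnecting.
        if (LEADER_ZNODE.equals(event.getPath())) {
            System.out.println("Leader changed, re-querying Zookeeper before reconnecting");
        }
    }
}
```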
  97. Summary • The Flink runtime monitors long-running tasks • Exactly-once processing guarantees can be established with snapshots without halting the execution • The ABS algorithm persists the minimal state at a low cost • Master HA is achieved through Zookeeper and passive failover
