
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink

Flink Forward 2015



  1. Fault Tolerance and Job Recovery in Apache Flink™ · Till Rohrmann · trohrmann@apache.org · @stsffap
  2. (figure-only slide)
  3. Better safe than sorry
     • Failures will happen
     • EMC estimated $1.7 billion in costs due to data loss and system downtime
     • Recovery saves you time and money
     • Switch between algorithms
     • Live upgrade of your system
  4. Fault Tolerance
  5. Fault tolerance guarantees
     • At most once: no guarantees at all
     • At least once: sufficient for many applications
     • Exactly once
     • Flink provides all of these guarantees
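
     Not on the slides: a minimal sketch of how the guarantee level is chosen in Flink's Java DataStream API when checkpointing is enabled; the 5-second interval is an arbitrary example.

         import org.apache.flink.streaming.api.CheckpointingMode;
         import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

         public class GuaranteeSetup {
             public static void main(String[] args) {
                 StreamExecutionEnvironment env =
                         StreamExecutionEnvironment.getExecutionEnvironment();
                 // Checkpoint every 5 seconds with exactly-once semantics;
                 // CheckpointingMode.AT_LEAST_ONCE relaxes the guarantee for lower latency.
                 env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
             }
         }
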
  6. Checkpoints
     • Consistent snapshots of the distributed data stream and operator state
  7. Barriers
     • Markers for checkpoints
     • Injected into the data flow
  8. Alignment for multi-input operators
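
     A rough illustration of the alignment idea, not Flink's internal implementation: once a barrier arrives on one input, further records from that input are buffered until the same barrier has arrived on every input; only then is the operator state snapshotted and the barrier forwarded. All type and method names below are hypothetical placeholders.

         import java.util.ArrayDeque;
         import java.util.HashSet;
         import java.util.Queue;
         import java.util.Set;

         // Illustrative alignment logic for an operator with several input channels.
         class BarrierAligner {
             private final Set<Integer> alignedChannels = new HashSet<>();
             private final Queue<Object> buffered = new ArrayDeque<>();
             private final int numChannels;

             BarrierAligner(int numChannels) {
                 this.numChannels = numChannels;
             }

             void onBarrier(int channel, long checkpointId) {
                 alignedChannels.add(channel);
                 if (alignedChannels.size() == numChannels) {
                     snapshotState(checkpointId);          // checkpoint the operator state
                     emitBarrierDownstream(checkpointId);  // forward the barrier
                     alignedChannels.clear();
                     while (!buffered.isEmpty()) {         // replay records held back during alignment
                         process(buffered.poll());
                     }
                 }
             }

             void onRecord(int channel, Object record) {
                 if (alignedChannels.contains(channel)) {
                     buffered.add(record);                 // this input is ahead of the barrier: hold back
                 } else {
                     process(record);
                 }
             }

             // Placeholder hooks for the surrounding runtime.
             void snapshotState(long checkpointId) {}
             void emitBarrierDownstream(long checkpointId) {}
             void process(Object record) {}
         }
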
  9. Operator State
     • Stateless operators: ds.filter(_ != 0)
     • System state: ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
     • User-defined state:

         public class CounterSum implements RichReduceFunction<Long> {
             private OperatorState<Long> counter;

             @Override
             public Long reduce(Long v1, Long v2) throws Exception {
                 counter.update(counter.value() + 1);
                 return v1 + v2;
             }

             @Override
             public void open(Configuration config) {
                 counter = getRuntimeContext().getOperatorState("counter", 0L, false);
             }
         }
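
     A hypothetical wiring of the CounterSum above (assuming the env from the earlier sketch): keying the stream scopes the counter per key, and its state is then included in every checkpoint. The key selector is purely illustrative.

         env.fromElements(1L, 2L, 3L, 4L)
            .keyBy(v -> v % 2)            // illustrative key selector
            .reduce(new CounterSum())
            .print();
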
  10. (figure-only slide)
  11. (figure-only slide)
  12. (figure-only slide)
  13. (figure-only slide)
  14. Advantages
     • Separation of app logic from recovery: the checkpointing interval is just a config parameter
     • High throughput: controllable checkpointing overhead
     • Low impact on latency
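
     A sketch of the "just a config parameter" point, using the checkpoint settings of the DataStream API; the millisecond values are arbitrary examples.

         env.enableCheckpointing(10_000);              // checkpoint interval in ms
         env.getCheckpointConfig()
            .setMinPauseBetweenCheckpoints(5_000);     // enforce a pause between checkpoints to bound overhead
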
  15. (figure-only slide)
  16. Cluster High Availability
  17. Without high availability (diagram: JobManager, TaskManager)
  18. With high availability (diagram: JobManager, stand-by JobManager, TaskManager, Apache ZooKeeper™; annotation: "KEEP GOING")
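
     A hedged flink-conf.yaml sketch of this setup; the high-availability.* keys come from later Flink releases (the release contemporary with this talk used recovery.mode), and the quorum address and storage path are placeholders.

         # Stand-by JobManagers coordinated through ZooKeeper (illustrative values)
         high-availability: zookeeper
         high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
         high-availability.storageDir: hdfs:///flink/ha/
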
  19. Persisting jobs (diagram with Client, JobManager, TaskManagers, Apache ZooKeeper™; built up over slides 19 to 22)
     1. Submit job
     2. Persist execution graph
     3. Write handle to ZooKeeper
     4. Deploy tasks
  23. Handling checkpoints (diagram with Client, JobManager, TaskManagers, Apache ZooKeeper™; built up over slides 23 to 27)
     1. Take snapshots
     2. Persist snapshots
     3. Send handles to the JobManager
     4. Create global checkpoint
     5. Persist global checkpoint
     6. Write handle to ZooKeeper
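
     A hedged flink-conf.yaml sketch of the "persist snapshots" step: snapshots go to a durable store so that only small handles travel to the JobManager and on to ZooKeeper. Key names are from later Flink releases and the HDFS path is a placeholder.

         # Durable storage for operator snapshots (illustrative values)
         state.backend: filesystem
         state.checkpoints.dir: hdfs:///flink/checkpoints
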
  28. Conclusion
  29. (figure-only slide)
  30. (figure-only slide)
  31. TL;DR
     • Job recovery mechanism with low latency and high throughput
     • Exactly-once processing semantics
     • No single point of failure
     • Flink will always keep processing your data
  32. flink.apache.org · @ApacheFlink
