Distributed Snapshots


My presentation on distributed snapshots for graduate OS course

Published in: Technology

  1. Distributed Snapshots: Determining Global States of Distributed Systems (K. Mani Chandy, Leslie Lamport)
  2. Overview
      • The paper presents the snapshot algorithm
      • The algorithm aims to discover a global state of a distributed system
  3. Motivation
      • We want global state discovery
      • Communication latency and clock skew mean that no process can observe all the others at the same instant
      • Applications of global state discovery
          • Checkpointing
          • Detection of deadlock involving global resources … why?
          • Consistent view of distributed bank accounts
          • Phase detection (e.g. barriers)
  4. What is a Global State?
      • Processes are finite state machines (FSMs)
      • A global state of a system is a set of states {p_1, …, p_n} such that p_i represents the state of process i.
      • … is this sufficient?
  5. NO! What about channels?
      • Process states alone are an insufficient characterization of the system!
      • Processes communicate using channels
      • The global state must also account for messages currently in transit
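To see why process states alone fall short, take the distributed bank account example from the Motivation slide. The following is a minimal sketch, not part of the original talk: the names `GlobalState` and `total_money`, and the balances, are made up for illustration. It shows that the total amount of money is only accounted for when the transfers still in the channels are counted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class GlobalState:
    """Hypothetical container: one state per process plus one message list per channel."""
    # process name -> account balance
    process_states: Dict[str, int]
    # (sender, receiver) -> transfers still in transit, oldest first
    channel_states: Dict[Tuple[str, str], List[int]] = field(default_factory=dict)


def total_money(state: GlobalState) -> int:
    """Sum the balances held by processes and the transfers still in transit."""
    held = sum(state.process_states.values())
    in_transit = sum(sum(msgs) for msgs in state.channel_states.values())
    return held + in_transit


# p1 started with 100 and p2 with 50; p1 has sent 30 that p2 has not yet received.
state = GlobalState(
    process_states={"p1": 70, "p2": 50},
    channel_states={("p1", "p2"): [30], ("p2", "p1"): []},
)

processes_only = sum(state.process_states.values())
print(processes_only)      # 120: the transfer in transit is missing
print(total_money(state))  # 150: matches the initial total
```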
  6. Stable Properties
      • The algorithm is targeted at a specific class of problems
      • Check whether a stable property holds
          • Once it becomes true, it remains true in every later state
      • “Are all lights currently green?” Is this an example of a stable property?
  7. Quick Recap
      • We want global state detection
      • Stable properties
      • Moving on to …
          • System model
          • Assumptions
          • Chandy-Lamport algorithm
  8. Eagle’s Eye View
  9. Definitions
      • The state of a channel is the sequence of messages moving through it (sent but not yet received)
      • An event is an atomic action that
          • May change the state of a process
          • May change the state of at most one channel incident on the process
          • Is defined as a 5-tuple ⟨p, s, s′, M, c⟩: process p moves from state s to state s′, and message M (if any) is sent or received on channel c (if any)
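The event definition above can be sketched in code. This is an illustrative encoding, not something from the slides: the `Event` type, the `apply_event` helper, the choice to name a channel by its `(sender, receiver)` pair, and the state names are all assumptions. An event on a channel leaving p appends M to the tail of the channel; an event on a channel entering p removes M from the head.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

Channel = Tuple[str, str]  # (sender, receiver); each channel is one-directional


class Event(NamedTuple):
    """Hypothetical encoding of the 5-tuple <p, s, s', M, c>."""
    p: str                 # process in which the event occurs
    s: str                 # state of p immediately before the event
    s_next: str            # state of p immediately after the event (s')
    M: Optional[str]       # message sent or received, or None
    c: Optional[Channel]   # channel whose state changes, or None


def apply_event(e: Event, states: Dict[str, str],
                channels: Dict[Channel, List[str]]) -> None:
    """Apply e in place; raise ValueError if e cannot occur in this global state."""
    if states[e.p] != e.s:
        raise ValueError("p is not in state s")
    if e.c is not None:
        sender, receiver = e.c
        if sender == e.p:
            # c leaves p: M is sent, so it joins the tail of the channel.
            channels[e.c].append(e.M)
        elif receiver == e.p:
            # c enters p: M is received, so it must be at the head (FIFO).
            if not channels[e.c] or channels[e.c][0] != e.M:
                raise ValueError("M is not at the head of c")
            channels[e.c].pop(0)
        else:
            raise ValueError("c is not incident on p")
    states[e.p] = e.s_next


states = {"p": "idle", "q": "idle"}
channels = {("p", "q"): [], ("q", "p"): []}

# p sends a token to q, then q receives it.
apply_event(Event("p", "idle", "waiting", "token", ("p", "q")), states, channels)
apply_event(Event("q", "idle", "holding", "token", ("p", "q")), states, channels)
print(states)    # {'p': 'waiting', 'q': 'holding'}
print(channels)  # {('p', 'q'): [], ('q', 'p'): []}
```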
  10. Assumptions (oh no!)
       • Channels
           • FIFO
           • Infinite buffers
           • Error-free
           • Finite delivery time
       • No failures
       • States can be captured in finite time
       • Hidden assumption: each step of the algorithm must be atomic with respect to the process state (why?)
  11. Snapshot (Chandy-Lamport) Algorithm
       • A process decides to take a snapshot “spontaneously” and sends itself a marker.
       • Upon receiving the marker over a channel c, a process will …
           • If it has not seen a marker before: record its own state, record the state of c as empty, start recording its other incoming channels, and send a marker on every outgoing channel
           • Else: stop recording c; the state of c is the sequence of messages received on c since the process recorded its own state
       • Will a marker ever be received on the same channel twice?
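The marker rules above can be sketched as a small single-threaded simulation. This is a minimal illustration under assumptions that are not in the slides: the `Process` class, the `MARKER` sentinel, integer "transfer" messages, and a `network` dictionary of FIFO queues are all hypothetical. The initiator also records its state directly rather than literally sending itself a marker, which has the same effect.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical sentinel; application messages here are ints


class Process:
    def __init__(self, name, balance, peers, network):
        self.name = name
        self.balance = balance        # the process state in this sketch
        self.peers = peers            # names of all other processes
        self.network = network        # (src, dst) -> deque, one FIFO channel each
        self.recorded_state = None    # own state, once recorded
        self.channel_state = {}       # incoming channel -> messages recorded on it
        self.recording = set()        # incoming channels still being recorded

    def send(self, dst, amount):
        self.balance -= amount
        self.network[(self.name, dst)].append(amount)

    def _record_and_send_markers(self):
        # Record own state, start recording every incoming channel,
        # and send a marker on every outgoing channel.
        self.recorded_state = self.balance
        for peer in self.peers:
            self.channel_state[(peer, self.name)] = []
            self.recording.add((peer, self.name))
            self.network[(self.name, peer)].append(MARKER)

    def start_snapshot(self):
        self._record_and_send_markers()

    def receive(self, src):
        chan = (src, self.name)
        msg = self.network[chan].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self._record_and_send_markers()  # first marker seen
            # Stop recording this channel. On a first marker nothing was
            # recorded on it yet, so its state stays empty.
            self.recording.discard(chan)
        else:
            self.balance += msg
            if chan in self.recording:
                self.channel_state[chan].append(msg)

    def done(self):
        # Finished locally once a marker has arrived on every incoming channel.
        return self.recorded_state is not None and not self.recording


network = {("p1", "p2"): deque(), ("p2", "p1"): deque()}
p1 = Process("p1", 100, ["p2"], network)
p2 = Process("p2", 50, ["p1"], network)

p1.send("p2", 30)     # p1 = 70, 30 in transit to p2
p2.send("p1", 10)     # p2 = 40, 10 in transit to p1
p1.start_snapshot()   # p1 records 70 and sends a marker behind the 30
p1.receive("p2")      # the 10 arrives after p1 recorded: recorded on the channel
p2.receive("p1")      # the 30 arrives before p2's marker: part of p2's state
p2.receive("p1")      # marker: p2 records 70, channel p1->p2 is empty
p1.receive("p2")      # marker: p1 stops recording channel p2->p1

total = (p1.recorded_state + p2.recorded_state
         + sum(sum(msgs) for msgs in p1.channel_state.values())
         + sum(sum(msgs) for msgs in p2.channel_state.values()))
print(p1.recorded_state, p2.recorded_state)  # 70 70
print(p1.channel_state, p2.channel_state)    # {('p2', 'p1'): [10]} {('p1', 'p2'): []}
print(total)                                 # 150
```

In this run the recorded total matches the initial 150, yet the simulated system never actually passed through the recorded combination of balances and channel contents, which is the point raised on the Properties of Snapshot slide.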
  12. Algorithm in Action
  13. Termination of Algorithm
       • A process is finished once it has received a marker on every incoming channel
       • How could you distribute the actual snapshot?
       • How would we handle multiple concurrent snapshots?
  14. Properties of Snapshot
       • The recorded global state is reachable from the state at the start of the snapshot, and the state at the end of the snapshot is reachable from it
       • The system was not necessarily ever in the recorded state
       • The recorded state is nonetheless a consistent global state
       • How can we guarantee that the recorded state actually occurred?
  15. Stability Detection
       • If the stable property is true in the recorded state, it is true by the end of the algorithm.
       • If it is false in the recorded state, it was false at the beginning of the snapshot.
       • Intuitive explanation?
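The two claims above can be written as one chain of implications. The notation here is an assumption chosen for illustration: y is the stable property, S_ι the global state when the snapshot starts, S* the recorded state, and S_φ the global state when the snapshot ends.

```latex
% Stability: once y holds, it holds in every state reachable from there.
\[
  y(S) \;\wedge\; S' \text{ reachable from } S \;\Longrightarrow\; y(S')
\]
% S* is reachable from S_\iota, and S_\phi is reachable from S*, hence:
\[
  y(S_\iota) \;\Longrightarrow\; y(S^{*}) \;\Longrightarrow\; y(S_\phi)
\]
```

Reading the second implication forwards gives the first bullet; reading the first implication as its contrapositive (y false in S* means y false in S_ι) gives the second.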
  16. Issues
       • Many assumptions are necessary
           • Overhead becomes high with methods that work around the assumptions
       • Cannot discover transient properties
       • Hard to see which kinds of problems the algorithm solves
       • How would you deal with failures? Termination?
       • At best a good guess. How would you do this?
  17. Questions