Distributed Snapshots

My presentation on distributed snapshots for graduate OS course

Presentation Transcript

  • Distributed Snapshots: Determining Global States of Distributed Systems, by K. Mani Chandy and Leslie Lamport
  • Overview
    • Paper shows the Snapshot Algorithm
    • Aims to discover a global state of the distributed system
  • Motivation
    • We want Global State Discovery
    • Communication latency and clock skew prevent us from doing this well
    • Applications of global state discovery
      • Checkpointing
      • Detection of Deadlock with Global Resources … why?
      • Consistent view of Distributed Bank Accounts
      • Phase Detection (e.g. Barriers)
  • What is a Global State?
    • Processes are finite state machines (FSMs)
    • A global state of a system is a set of states {p_1, …, p_n} such that p_i represents the state of process i.
    • … is this sufficient?
  • NO! What about channels?
    • Insufficient characterization of the system!
    • Processes communicate using channels
    • Must account for messages currently in transit
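
To make the point about channels concrete, here is a minimal sketch of a global state that stores both process states and in-transit messages. The `GlobalState` class, its field names, and the bank-balance numbers are assumptions made for illustration; they do not appear in the slides.

```python
from dataclasses import dataclass, field

@dataclass
class GlobalState:
    # Hypothetical layout: local state of every process (here, a balance).
    process_states: dict[str, int] = field(default_factory=dict)
    # Messages in transit on each directed channel (sender, receiver), oldest first.
    channel_states: dict[tuple[str, str], list[int]] = field(default_factory=dict)

    def total(self) -> int:
        """Sum of all local states plus everything still in transit."""
        in_transit = sum(sum(msgs) for msgs in self.channel_states.values())
        return sum(self.process_states.values()) + in_transit

# p has sent 30 to q, but q has not received it yet.
state = GlobalState(
    process_states={"p": 70, "q": 50},
    channel_states={("p", "q"): [30], ("q", "p"): []},
)

# Looking at process states alone loses the 30 that is in transit.
processes_only = sum(state.process_states.values())
print(processes_only, state.total())
```

A view built only from process states would report 120 here, while the system actually holds 150, which is why channel contents have to be part of the global state.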
  • Stable Properties
    • Algorithm targeted at specific problems
    • Check if a stable property holds
      • Once it is true, remains true for all later points
    • “Are all lights currently green?” Is this an example of a stable property?
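
The definition above can be tried out on a recorded sequence of states. The sketch below is illustrative only: `violates_stability` and both example traces are made-up names and data, and checking a single trace can only refute stability, never prove it.

```python
def violates_stability(trace, predicate):
    """Return True if the predicate becomes true and later false again."""
    seen_true = False
    for state in trace:
        holds = predicate(state)
        if seen_true and not holds:
            return True
        seen_true = seen_true or holds
    return False

# Traffic lights can turn green and later turn red again.
lights_trace = [("red", "green"), ("green", "green"), ("green", "red")]
all_green = lambda lights: all(color == "green" for color in lights)

# A "computation has terminated" flag never resets once it is set.
terminated_trace = [False, False, True, True]
terminated = lambda flag: flag

print(violates_stability(lights_trace, all_green))
print(violates_stability(terminated_trace, terminated))
```

On these traces the lights predicate is refuted as a stable property, while the termination flag shows no violation.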
  • Quick Recap
    • We want Global State Detection
    • Stable Properties
    • Moving on …
      • System Model
      • Assumptions
      • Chandy-Lamport Algorithm
  • Eagle’s Eye View
  • Definitions
    • The state of a channel is the sequence of messages moving through it
    • An event is an atomic action that
      • May change the state of a process
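      • May change the state of at most one channel incident on the process
      • Defined as a 5-tuple <p, s, s’, M, c>: process p moves from state s to state s’, sending or receiving message M on channel c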
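
The event definition above can be sketched as a small data type plus a function that applies an event to a global state. This is a sketch under assumptions: the `Event` class, `can_occur`, `apply_event`, and the encoding of a channel as a `(sender, receiver)` pair are hypothetical choices, not notation from the slides.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Event:
    p: str                                # process in which the event occurs
    s: int                                # state of p just before the event
    s_next: int                           # state of p just after the event (s')
    M: Optional[int] = None               # message sent or received, if any
    c: Optional[tuple[str, str]] = None   # (sender, receiver) channel, if any

def can_occur(procs, chans, e):
    """An event is enabled if p is in state s and, for a receive, M heads c."""
    if procs[e.p] != e.s:
        return False
    if e.c is not None and e.c[1] == e.p:  # channel directed towards p
        return bool(chans[e.c]) and chans[e.c][0] == e.M
    return True

def apply_event(procs, chans, e):
    """Return the global state that follows from applying event e."""
    if not can_occur(procs, chans, e):
        raise ValueError("event cannot occur in this global state")
    procs = dict(procs)
    chans = {ch: list(msgs) for ch, msgs in chans.items()}
    procs[e.p] = e.s_next
    if e.c is not None:
        if e.c[0] == e.p:
            chans[e.c].append(e.M)   # send: M joins the tail of c
        else:
            chans[e.c].pop(0)        # receive: M leaves the head of c
    return procs, chans

procs = {"p": 100, "q": 50}
chans = {("p", "q"): []}
send = Event(p="p", s=100, s_next=70, M=30, c=("p", "q"))
recv = Event(p="q", s=50, s_next=80, M=30, c=("p", "q"))

after_send = apply_event(procs, chans, send)
after_recv = apply_event(*after_send, recv)
```

Each event touches one process and at most one channel, which matches the atomicity requirement in the definition.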
  • Assumptions (oh no!)
    • Channels
      • FIFO
      • Infinite Buffers
      • Error-free
      • Finite delivery time
    • No failures
    • States can be captured in finite time
    • Hidden assumption: steps in algorithm must be atomic in terms of process state (why?)
  • Snapshot (Chandy-Lamport) Algorithm
    • A process decides to take a snapshot “spontaneously” and sends itself a marker.
    • Upon receiving the marker over a channel c, a process will …
      • If no marker has been seen before: record its own state, record the state of c as empty, start recording all other incoming channels, and send a marker on every outgoing channel
      • Otherwise: stop recording c; the state of c is the sequence of messages received on c since the process recorded its own state
    • Will a marker ever be received on the same channel twice?
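
The marker rules above can be sketched as a small single-threaded simulation. Everything here is an assumption made for illustration: the `Process` class, the `MARKER` sentinel, the `connect` helper, and the two-process bank scenario. The initiator records and broadcasts directly instead of literally sending itself a marker, which should have the same effect.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical sentinel that marks the snapshot boundary

class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.incoming = {}         # sender name -> FIFO queue into this process
        self.outgoing = {}         # receiver name -> FIFO queue out of this process
        self.recorded_state = None
        self.channel_state = {}    # sender name -> messages recorded on that channel
        self.recording = set()     # incoming channels still being recorded

    def send(self, to, amount):
        self.balance -= amount
        self.outgoing[to].append(amount)

    def start_snapshot(self):
        self._record_and_broadcast(first_channel=None)

    def _record_and_broadcast(self, first_channel):
        # Record own state, then send a marker on every outgoing channel.
        self.recorded_state = self.balance
        for sender in self.incoming:
            self.channel_state[sender] = []
            if sender != first_channel:   # the marker's own channel stays empty
                self.recording.add(sender)
        for queue in self.outgoing.values():
            queue.append(MARKER)

    def receive(self, sender):
        msg = self.incoming[sender].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self._record_and_broadcast(first_channel=sender)
            else:
                self.recording.discard(sender)   # channel state is now final
        else:
            self.balance += msg
            if sender in self.recording:
                self.channel_state[sender].append(msg)

    def done(self):
        return self.recorded_state is not None and not self.recording

def connect(a, b):
    """Create one directed FIFO channel from a to b."""
    queue = deque()
    a.outgoing[b.name] = queue
    b.incoming[a.name] = queue

p, q = Process("p", 100), Process("q", 50)
connect(p, q)
connect(q, p)

p.send("q", 30)      # 30 is in transit when the snapshot starts
p.start_snapshot()   # p records 70 and sends a marker behind the 30
q.send("p", 10)      # q sends before it has seen the marker
q.receive("p")       # q receives 30
q.receive("p")       # q receives the marker and records its state
p.receive("q")       # p receives 10 while recording channel q -> p
p.receive("q")       # p receives the marker and stops recording

snapshot_total = (
    p.recorded_state + q.recorded_state
    + sum(sum(msgs) for proc in (p, q) for msgs in proc.channel_state.values())
)
print(p.recorded_state, q.recorded_state, p.channel_state, snapshot_total)
```

In this run the recorded process states and channel contents add up to the 150 units the system started with, even though the two processes recorded their states at different times.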
  • Algorithm in Action
  • Termination of Algorithm
    • Terminates when every process has received a marker on every incoming channel
    • How could you distribute the actual snapshot?
    • How would we handle multiple concurrent snapshots?
  • Properties of Snapshot
    • The recorded global state is reachable from the state at the start of the snapshot, and the state at the end of the snapshot is reachable from it
    • The system was not necessarily ever in the recorded state
    • Can obtain a consistent global state with it.
    • How can we guarantee state returned actually occurred?
  • Stability Detection
    • If the stable property is true in the recorded state, it is true by the end of the algorithm.
    • If it is false in the recorded state, it was false at the beginning of the snapshot.
    • Intuitive explanation?
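
One way to write down the intuition is sketched below. The symbols are assumptions chosen for this note rather than notation from the slides: S_ι is the global state when the snapshot starts, S* is the recorded state, S_φ is the global state when the algorithm terminates, y is the stable property, and ⇝ means "is reachable from".

```latex
\begin{align*}
& S_\iota \leadsto S^{*} \leadsto S_\phi
  && \text{(the recorded state lies between start and end)} \\
& y(S) \land (S \leadsto S') \implies y(S')
  && \text{(definition of a stable property)} \\
& y(S^{*}) \implies y(S_\phi),
  \qquad \lnot y(S^{*}) \implies \lnot y(S_\iota)
  && \text{(both follow from the two lines above)}
\end{align*}
```

The second implication is the contrapositive of stability applied to S_ι ⇝ S*: if y had held at the start, it would still hold in the recorded state.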
  • Issues
    • Many assumptions necessary
      • Overhead becomes high with methods that work around assumptions
    • Cannot discover transient properties
    • Hard to see which types of problems the algorithm can solve
    • How would you deal with failures? Termination?
    • At best a good guess. How would you do this?
  • Questions