Papers We Love
San Francisco Edition
July 24th, 2014
Henry Robinson
henry@cloudera.com / @henryr
• Software engineer at Cloudera since 2009
• My interests are in databases and distributed
systems
• I write about them - in particular, about papers in
those areas - at http://the-paper-trail.org
Papers of which we
are quite fond
• Impossibility of Distributed
Consensus with One Faulty
Process, by Fischer, Lynch
and Paterson (1985)
• Winner of the 2001 Dijkstra Prize
• Walk through the proof (leaving rigour for the paper
itself)
• Show how this gives rise to a framework for thinking
about distributed systems
Consensus
or: agreeing to agree
• Consensus is the problem of having a set of
processes agree on a value proposed by one of
those processes
• Validity: the value agreed upon must have been proposed by some process (safety)
• Termination: at least one non-faulty process eventually decides (liveness)
• Agreement: all deciding processes agree on the same value (safety)
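A minimal checking sketch for the three properties above (the helper names are mine, not from the talk): given the values proposed and the values decided in a finished run, the two safety properties can be asserted directly; termination is a liveness property and cannot be confirmed from a finite trace.

def check_consensus_run(proposed, decided):
    # proposed: {process: value it proposed}; decided: {process: value it decided}
    # Agreement (safety): all deciding processes agree on the same value.
    assert len(set(decided.values())) <= 1, "agreement violated"
    # Validity (safety): any decided value must have been proposed by some process.
    for value in decided.values():
        assert value in proposed.values(), "validity violated"
    # Termination (liveness): at least one non-faulty process must eventually
    # decide; a finite trace can only ever show that it has not happened *yet*.

check_consensus_run(proposed={"p0": 1, "p1": 0, "p2": 1},
                    decided={"p0": 1, "p2": 1})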
Transactional Commit
Should I commit this
transaction?
[Magic consensus protocol]
YES! No :(
Replicated State Machines
[Diagram: a client and three nodes; each node holds a log with entries N-3, N-2, N-1 and a new entry N = S]
1: Client proposes state N should be S
2: Magic consensus protocol
3: New state written to log
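A minimal sketch of why this works (the log entries and state below are illustrative, not from the talk): once the consensus protocol has fixed the contents of every log slot, each node applies the same entries in the same order, so every replica computes the same state.

def apply_log(initial_state, log):
    # Apply the agreed-upon log entries in order; entries here are (key, value) writes.
    state = dict(initial_state)
    for key, value in log:
        state[key] = value
    return state

# The "magic consensus protocol" guarantees every node sees the same log,
# so every replica ends up in the same state S.
agreed_log = [("x", 1), ("y", 2), ("x", 3)]
replicas = [apply_log({}, agreed_log) for _ in range(3)]
assert replicas[0] == replicas[1] == replicas[2] == {"x": 3, "y": 2}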
Strong Leader Election
[Diagram: a cast of millions]
1: Who’s the leader?
2: Magic consensus protocol
3: There can only be one
What does FLP actually say?
Fischer Lynch Paterson
Choose at most two.
Distributed consensus is impossible when at least one process might fail
“[a] surprising result”
• Distributed: i.e. message passing
• Consensus: Termination, Validity, Agreement
• Impossible: no algorithm solves consensus in every case
• Might fail: crash failures
Hierarchy of Failure Modes
• Crash failures: fail by stopping
• Omission failures: fail by dropping messages
• Byzantine failures: fail by doing whatever the hell I like
More on the system model
• The system model is the abstraction we layer over
messy computers and networks in order to actually
reason about them.
• Message deliveries are the only way that nodes
may communicate
• Messages are delivered in any order
• But are never lost (c.f. crash model vs. omission
model), and are always delivered exactly once
• Nodes do not have access to a shared clock.
• So cannot mutually estimate the passage of time
• Messages are the only way that nodes may coordinate with each other
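A minimal simulation sketch of this system model (the class and method names are my own): nodes can only react to delivered messages, the scheduler may deliver pending messages in any order, every sent message arrives exactly once, and no node can consult a shared clock.

import random

class Node:
    def __init__(self, name):
        self.name = name
        self.received = []

    def on_message(self, sender, payload):
        # The only way a node learns anything is by receiving a message.
        self.received.append((sender, payload))

class Network:
    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}
        self.pending = []  # sent-but-undelivered messages

    def send(self, sender, dest, payload):
        self.pending.append((sender, dest, payload))

    def deliver_one(self, rng):
        # The scheduler (an adversary, if you like) picks ANY pending message
        # to deliver next; messages are never lost and arrive exactly once.
        sender, dest, payload = self.pending.pop(rng.randrange(len(self.pending)))
        self.nodes[dest].on_message(sender, payload)

rng = random.Random(0)
net = Network([Node("a"), Node("b"), Node("c")])
net.send("a", "b", "hello")
net.send("a", "c", "world")
while net.pending:
    net.deliver_one(rng)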
The Proof itself
Some definitions
• Configuration: the state of every node in the system, plus the set of sent but undelivered messages
• Initial configuration: what each node in the system would propose as the decision at time 0
• Univalent: a configuration from which only one decision is possible, no matter what messages are received (a 0-valent or 1-valent configuration can only decide 0 or 1 respectively)
• Bivalent: a configuration from which either decision value is still possible
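These definitions suggest a brute-force way to determine valency for a tiny, deterministic toy protocol (everything below is an illustrative assumption, not the paper’s formalism): explore every order in which the pending messages could be delivered and collect the reachable decisions. One reachable value means the configuration is univalent; both values mean it is bivalent.

def reachable_decisions(config, pending, step, decision):
    # step(config, msg) -> (new_config, newly_sent_messages)
    # decision(config)  -> 0, 1, or None if undecided
    d = decision(config)
    if d is not None:
        return {d}
    found = set()
    for i, msg in enumerate(pending):
        new_config, new_msgs = step(config, msg)
        found |= reachable_decisions(new_config,
                                     pending[:i] + pending[i + 1:] + new_msgs,
                                     step, decision)
    return found

def valency(config, pending, step, decision):
    values = reachable_decisions(config, pending, step, decision)
    return "bivalent" if values == {0, 1} else f"{values.pop()}-valent"

# Toy protocol: two proposers each send their bit to an acceptor, which simply
# decides whatever it receives first. The configuration is the acceptor's value.
def step(config, msg):
    dest, value = msg
    if dest == "acceptor" and config is None:
        return value, []
    return config, []

def decision(config):
    return config  # None means undecided

print(valency(None, [("acceptor", 0), ("acceptor", 1)], step, decision))  # bivalent
print(valency(0, [("acceptor", 1)], step, decision))                      # 0-valent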
Proof sketch
• Start in an initial, ‘undecided’ configuration (Lemma 2: this always exists!)
• Messages are delivered, leading to another undecided state
• More messages are delivered, and the system can still be kept undecided (Lemma 3: you can always get here!)
Lemma 2: Communication Matters
2-node system
• C: 00, V: 1
• C: 01, V: 0
• C: 11, V: 0
• C: 10, V: 1
(C:XY means process 0 has initial value X, process 1 has initial value Y; V is that configuration’s valency)
• Suppose every initial configuration is univalent. Two of these configurations (e.g. C:00 and C:01) differ only at one node, but their valencies are different.
• Every execution of the protocol (i.e. set of messages delivered) from C:00 ends with “I decided 1!”; every execution from C:01 ends with “I decided 0!”
• What if process 1 fails? Are the configurations any different?
• For the remaining process there is no difference in initial state, yet the same execution produces a different outcome?!
Every protocol has an undecided (‘bivalent’)
initial state
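The same argument generalizes past two nodes; a compact restatement in my own notation, not the slides’:

    C_0 = (0,0,...,0)  →  C_1 = (1,0,...,0)  →  ...  →  C_n = (1,1,...,1)

Order the initial configurations in a chain where, in C_i, processes 1..i propose 1 and the rest propose 0, so C_{i-1} and C_i differ only in process i’s input. By validity, C_0 is 0-valent and C_n is 1-valent. If every initial configuration were univalent, some adjacent pair would have C_{i-1} 0-valent and C_i 1-valent. But a deciding run in which process i takes no steps (it may simply have crashed) applies to both configurations and must reach the same decision from each, contradicting their different valencies. So some initial configuration must be bivalent.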
Lemma 3: Indecisiveness is Sticky
[Diagram: starting from the bivalent configuration C, some message e is sent in C. One branch shows the configurations reachable while e is not delivered; the other shows the set of configurations reached by delivering e last. One of the ‘e-arrived-last’ configurations must be bivalent.]
Configuration set D: the configurations where e is received last
• Consider the possibilities:
• If one of the configurations in D is bivalent, we’re done
• Otherwise, show that the lack of a bivalent configuration leads to a contradiction
• Do this by first showing that there must be both 0-valent and 1-valent configurations in D
• and then that this leads to a contradiction
Since C is bivalent, some 0-valent configuration is reachable from it. Either:
• The protocol reaches the 0-valent configuration before receiving e:
1. C moves to a 0-valent configuration before receiving e
2. e is then received, landing in D; that configuration must also be 0-valent
• Or the protocol reaches the 0-valent configuration only after receiving e:
1. e is received, landing in D
2. the 0-valent state is arrived at afterwards, so (since D contains no bivalent configurations) the configuration in D must itself be 0-valent
The same argument applied to a reachable 1-valent configuration puts a 1-valent configuration in D.
Now for the contradiction
• There must be two configurations C0 and C1 that
are separated by a single message m where
receiving e in Ci moves the configuration to Di
• We will write that as Ci + e = Di
• So C0 + m = C1
• and C0 + m + e = C1 + e = D1
• and C0 + e = D0
• Now consider the destinations of m and e. If they go to different processes, their receipt is commutative:
• C0 + m + e = D1
• C0 + e + m = D0 + m = D1
• Contradiction: the 1-valent D1 would be reachable from the 0-valent D0!
• Instead, e and m might go to the same process p.
• Consider a deciding computation R from C0 in which p does nothing (i.e. looks as if it has failed)
• To get from C0 to D0 and D1, only e and m are received, so only p takes any steps along the way
• So R can be applied from D0 and D1 as well as from C0
• Since D0 and D1 are both univalent, the configurations D0 + R and D1 + R are 0-valent and 1-valent respectively
• Now remember:
• A = C0 + R, and A has decided (R is a deciding run)
• D1 = C0 + m + e
• D0 = C0 + e
• Because p takes no steps in R, R commutes with m and e:
• C0 + R + m + e = A + m + e = D1 + R => 1-valent
• C0 + R + e = A + e = D0 + R => 0-valent
• But A has already decided, so it cannot lead to both 0-valent and 1-valent configurations. Contradiction.
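The whole neighbourhood can be sketched as a small diagram (my layout, same notation as above):

    C0 --m--> C1
     |         |
     e         e
     v         v
    D0 --m--> D1
 (0-valent) (1-valent)

If e and m go to different processes, the square commutes and the 1-valent D1 is reachable from the 0-valent D0. If they both go to p, take the deciding run R from C0 in which p is silent: A = C0 + R has decided, yet A + e = D0 + R is 0-valent and A + m + e = D1 + R is 1-valent.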
• To recap: let e be some message that is sent (but not yet delivered) in configuration C. Let D be the set of all configurations reachable from C in which e is received last, and consider the configurations reachable from C in which e has not yet been received.
• D either contains a bivalent configuration, or both 0-valent and 1-valent configurations. If it contains a bivalent configuration, we’re done; so assume it does not.
• Then there must be some C0 and C1 (neither having received e) where C0 + e is 0-valent, C1 + e is 1-valent, and C1 = C0 + e’ (e’ plays the role of m above).
• Consider the destinations of e’ and e. If they are not the same, then C0 + e + e’ = C0 + e’ + e = C1 + e = D1, which is 1-valent; but C0 + e = D0 is 0-valent. Contradiction.
• If they go to the same process p, let A be the configuration reached by a deciding run from C0 in which p does nothing (looks as if it has failed). Applying that run from D0 and D1 gives E0 and E1. But we can also get from A to E0 (by applying e) or to E1 (by applying e’ then e). Since A has already decided, this is a contradiction.
What are the consequences?
“These results do not show that such problems cannot be ‘solved’ in practice; rather, they point up the need for more refined models of distributed computing that better reflect realistic assumptions about processor and communication timings, and for less stringent requirements on the solution to such problems. (For example, termination might be required only with probability 1.)”
Paxos
• Paxos cleverly defers the hard part to its leader election scheme
• If leader election is perfect, so is Paxos!
• But perfect leader election is solvable iff consensus
is.
• Impossibilities all the way down…
Randomized Consensus
• Nice way to circumvent technical impossibilities:
make their probability vanishingly small
• Ben-Or gave an algorithm that terminates with
probability 1
• (But the number of rounds needed to converge might be large)
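A rough, lock-step simulation sketch of the usual crash-fault presentation of Ben-Or’s algorithm (the round structure and thresholds follow common textbook treatments; the lock-step scheduling, random message sampling and helper names are my simplifications, not the paper’s asynchronous model):

import random
from collections import Counter

def ben_or_sim(inputs, f, rng, max_rounds=10_000):
    # inputs: one 0/1 proposal per process; tolerates up to f crash faults, f < n/2.
    n = len(inputs)
    assert 2 * f < n
    values = list(inputs)
    decided = [None] * n
    for _ in range(max_rounds):
        # Phase 1: everyone reports its current value. Each process hears n - f
        # reports (a random subset here, standing in for the missing messages).
        proposals = []
        for _p in range(n):
            tally = Counter(rng.sample(values, n - f))
            v, cnt = tally.most_common(1)[0]
            proposals.append(v if cnt > n / 2 else None)
        # Phase 2: everyone reports its proposal. Decide a value proposed at least
        # f + 1 times, adopt it if proposed at least once, otherwise flip a coin.
        new_values = []
        for p in range(n):
            tally = Counter(rng.sample(proposals, n - f))
            v, cnt = max(((val, c) for val, c in tally.items() if val is not None),
                         key=lambda vc: vc[1], default=(None, 0))
            if v is not None and cnt >= f + 1:
                decided[p] = v
            new_values.append(v if v is not None else rng.randrange(2))
        values = new_values
        if all(d is not None for d in decided):
            return decided
    return decided

rng = random.Random(1)
print(ben_or_sim([0, 1, 1, 0, 1, 0, 1], f=2, rng=rng))

With a fair coin, every process eventually flips the same way, so runs like this terminate with probability 1, which is exactly the loophole the FLP quote above leaves open.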
Failure Detectors
• Deep connection between the ability to tell if a
machine has failed, and consensus.
• Lots of research into ‘weak’ failure detectors, and
how weak they can be and still solve consensus
FLP vs CAP
• FLP and CAP are not the same thing (see http://the-
paper-trail.org/blog/flp-and-cap-arent-the-same-
thing/)
• FLP is the stronger result: it holds even in a more benign system model (crash-stop failures with reliable message delivery, rather than the omission/partition failures CAP needs)
• Theorem: CAP is actually really boring
Further reading
• 100 Impossibility Proofs for Distributed Computing
(Lynch, 1989)
• The Weakest Failure Detector for Solving Consensus (Chandra, Hadzilacos and Toueg, 1996)
• Sharing Memory Robustly in Message-Passing Systems (Attiya et al., 1995)
• Wait-Free Synchronization (Herlihy, 1991)
• Another Advantage of Free Choice: Completely
Asynchronous Agreement Protocols (Ben-Or, 1983)
