Distributed Consensus
A.K.A. “What do we eat for lunch?” (real world edition)
Konrad 'ktoso' Malawski (@ktosopl)
GeeCON 2014 @ Kraków, PL
hAkker @ 
Konrad `@ktosopl` Malawski 
typesafe.com 
geecon.org 
Java.pl / KrakowScala.pl 
sckrk.com / meetup.com/Paper-Cup @ London 
GDGKrakow.pl 
meetup.com/Lambda-Lounge-Krakow
You? 
Distributed systems?
What is this talk about?
The network.
How to think about distributed systems.
Some healthy madness.
Code in slides covers only the “simplest possible case”.
Ordering[T]
Slightly chronological.
By no means is it “worst to best”.
Consensus
Consensus - informal 
“we all agree on something”
Consensus - formal
Termination
Every correct process decides some value.
Validity
If all correct processes propose the same value v, then all correct processes decide v.
Integrity
If a correct process decides v, then v must have been proposed by some correct process.
Agreement
Every correct process must agree on the same value.
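The four properties above can be read as checks over what the correct processes proposed and decided. A minimal sketch in plain Scala; the object name, the method names and the use of Int process ids are assumptions made for illustration, not part of the talk:

```scala
object ConsensusProperties {
  // `proposed` and `decided` only contain entries for *correct* processes,
  // keyed by an (assumed) integer process id.

  /** Termination: every correct process has decided some value. */
  def termination[V](processes: Set[Int], decided: Map[Int, V]): Boolean =
    processes.forall(decided.contains)

  /** Agreement: no two correct processes decided differently. */
  def agreement[V](decided: Map[Int, V]): Boolean =
    decided.values.toSet.size <= 1

  /** Integrity: every decided value was proposed by some correct process. */
  def integrity[V](proposed: Map[Int, V], decided: Map[Int, V]): Boolean =
    decided.values.forall(v => proposed.values.exists(_ == v))

  /** Validity: if everyone proposed the same v, then everyone decided v. */
  def validity[V](proposed: Map[Int, V], decided: Map[Int, V]): Boolean = {
    val distinct = proposed.values.toSet
    distinct.size != 1 || decided.values.forall(_ == distinct.head)
  }
}
```

In lunch terms: with proposals pizza, sushi, pizza and everyone deciding pizza, all four checks hold; deciding kebab would break Integrity, because nobody proposed it.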
Distributed Consensus 
What is a distributed system anyway?
Distributed system definition 
A distributed system is one in which the failure 
of a computer you didn't even know existed 
can render your own computer unusable. 
— Leslie Lamport 
http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt
Distributed system definition 
A system in which participants communicate 
asynchronously using messages. 
Distributed Systems - failure detection 
Jim had quit CorpSoft a while ago, 
but no-one ever told Bob…
Failure detection: 
• can only rely on external knowledge 
• but what if there’s no-one to tell you? 
• thus: must be in-some-way time based
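A time-based detector can be sketched as "suspect a node once its heartbeats stop arriving". This is a minimal illustration in plain Scala, not how Akka does it; the class name, the String node names and the explicit `nowMillis` parameter are assumptions chosen to keep the example deterministic:

```scala
object FailureDetection {
  /** Suspects a node once no heartbeat has been seen for more than `timeoutMillis`. */
  final class TimeoutFailureDetector(timeoutMillis: Long) {
    private var lastSeen = Map.empty[String, Long]

    /** Record that `node` was heard from at time `nowMillis`. */
    def heartbeat(node: String, nowMillis: Long): Unit =
      lastSeen = lastSeen.updated(node, nowMillis)

    /** True if `node` has been silent for too long, or was never heard from. */
    def isSuspected(node: String, nowMillis: Long): Boolean =
      lastSeen.get(node) match {
        case Some(seen) => nowMillis - seen > timeoutMillis
        case None       => true
      }
  }
}
```

Note that a suspicion is not proof: to this detector, a slow node and a dead node look exactly the same.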
Two Generals Problem
Yellow and Blue armies must attack Pink City.
They must attack together, otherwise they’ll die in vain.
Now they must agree on the exact time of the attack.
They can only send messengers, which Pink may intercept and kill.
Two Generals Problem - happy case
Yellow: “I need to inform Blue about my attack plan.”
Blue: “I don’t know when Yellow will attack…”
1) The initial message is not lost.
Yellow: “I don’t know if Blue will also attack at 13:37… I’ll wait until I hear back from him.”
Why?
2) The message might not have reached Blue.
Blue must confirm the reception of the command.
1) Yellow is now sure, but Blue isn’t!
Why?
2) Blue’s confirmation might have been lost!
This exactly mirrors the initial situation!
2 Generals Problem 
Translated to Akka
2 Generals translated to Akka:
Akka Actors implement the Actor Model:
Actors:
• communicate via messages
• create other actors
• change their behaviour on receiving a msg
Gains?
Distribution / separation / modelling abstraction
2 Generals translated to Akka: 
case class AttackAt(when: Date) 
Presentation–sized–snippet = does not cover all cases
2 Generals translated to Akka:
    import java.util.Date
    import akka.actor.{Actor, ActorRef}

    class General(general: Option[ActorRef]) extends Actor {

      val WhenIWantToAttack: Date = ???

      // If we know the other general, open by sending our attack time.
      general foreach { _ ! AttackAt(WhenIWantToAttack) }

      def receive = {
        case AttackAt(when) =>
          println(s"General $otherGeneralName attacks at $when")
          println("I must confirm this!")

          // The confirmation is itself a message that needs confirming.
          sender() ! AttackAt(when)
      }

      def otherGeneralName =
        if (self.path.name == "blue") "yellow" else "blue"
    }
2 Generals translated to Akka:
    val system = ActorSystem("two-generals")

    val blue =
      system.actorOf(Props(new General(general = None)), name = "blue")

    val yellow =
      system.actorOf(Props(new General(Some(blue))), name = "yellow")

The blue general attacks at 13:37, I must confirm this!
The yellow general attacks at 13:37, I must confirm this!
The blue general attacks at 13:37, I must confirm this!
...
8 Fallacies of Distributed Computing 
1. The network is reliable. 
2. Latency is zero. 
3. Bandwidth is infinite. 
4. The network is secure. 
5. Topology doesn’t change. 
6. There is one administrator. 
7. Transport cost is zero. 
8. The network is homogeneous. 
Peter Deutsch “The Eight Fallacies of Distributed Computing” 
https://blogs.oracle.com/jag/resource/Fallacies.html
Failure Models
Failure models: 
Fail – Stop 
Fail – Recover 
Byzantine
2-phase commit
2PC - step 1: Propose value
2PC - step 1: Promise to agree on write
2PC - step 2: Commit the write
2PC - step 1: Propose value, and die
2PC - step 1: Propose value to 1 node, and die
2PC: Prepare needs timeouts
2PC: Timeouts + recovery committer
Still can’t tolerate if the “accepted value” Actor dies
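The proposer's decision rule in 2PC is small enough to write down: commit only if every acceptor promised, and treat a missing reply (timeout) like a refusal. A minimal sketch in plain Scala; all names here (`Vote`, `Decision`, `decide`, String acceptor names) are assumptions for illustration and differ from the Akka snippets in these slides:

```scala
object TwoPhaseCommitDecision {
  sealed trait Vote
  case object VotePrepared extends Vote
  case object VoteConflict extends Vote

  sealed trait Decision
  case object DoCommit extends Decision
  case object DoAbort extends Decision

  /**
   * `votes` holds only the replies that arrived before the timeout,
   * keyed by acceptor name. A missing entry counts as "not prepared".
   */
  def decide(acceptors: Set[String], votes: Map[String, Vote]): Decision = {
    val everyonePrepared =
      acceptors.forall(a => votes.get(a).contains(VotePrepared))
    if (everyonePrepared) DoCommit else DoAbort
  }
}
```

The rule says nothing about what happens when the proposer itself dies between the two phases; that is the gap the recovery committer tries to close.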
2 Phase Commit 
translated to Akka
2PC translated to Akka
    case class Prepare(transactionId: Int, value: Any)
    case object Commit

    sealed trait AcceptorStatus
    case object Prepared extends AcceptorStatus
    case object Conflict extends AcceptorStatus
2PC translated to Akka
    class Proposer(acceptors: List[ActorRef]) extends Actor {
      var transactionId = 0
      var preparedAcceptors = 0

      def receive = {
        // Phase 1: ask every acceptor to prepare the write.
        case value: String =>
          transactionId += 1
          preparedAcceptors = 0
          acceptors foreach { _ ! Prepare(transactionId, value) }

        // Phase 2: commit only once every acceptor has promised.
        case Prepared =>
          preparedAcceptors += 1

          if (preparedAcceptors == acceptors.size)
            acceptors foreach { _ ! Commit }

        // Any conflict aborts the transaction.
        case Conflict =>
          context stop self
      }
    }
2PC with ResumeProposer in Akka
    case class Prepare(transactionId: Int, value: Any)
    case object Commit
    case object StatusPlz // asks an acceptor to report its AcceptorStatus

    sealed trait AcceptorStatus
    case object Prepared extends AcceptorStatus
    case object Conflict extends AcceptorStatus
    case class Committed(value: Any) extends AcceptorStatus
2PC with ResumeProposer in Akka
    class ResumeProposer(
        proposer: ActorRef,
        acceptors: List[ActorRef]) extends Actor {

      // We get a Terminated message if the original proposer dies.
      context watch proposer

      var anyAcceptorCommitted = false

      def receive = {
        case Terminated(`proposer`) =>
          println("Proposer died! Try to finish the transaction...")
          acceptors foreach { _ ! StatusPlz }

        case _: AcceptorStatus =>
          // impl of recovery here
      }
    }
Quorum
Quorum voting 
From the perspective of 
the Omnipotent Observer * 
* does not exist in a running system
Quorum voting – split votes
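From the Omnipotent Observer's seat, quorum voting comes down to counting. A minimal sketch in plain Scala, assuming a quorum means a strict majority of the whole cluster; the object and method names are made up for this example:

```scala
object Quorum {
  /** Smallest number of votes that is a strict majority of `clusterSize`. */
  def majority(clusterSize: Int): Int = clusterSize / 2 + 1

  /** The value backed by a majority of the whole cluster, if there is one. */
  def winner[V](clusterSize: Int, votes: Seq[V]): Option[V] =
    votes
      .groupBy(identity)
      .collectFirst {
        case (value, sameVotes) if sameVotes.size >= majority(clusterSize) => value
      }
}
```

With 4 nodes and a 2 vs 2 result, `winner` returns `None`: that is the split vote. Two different values can never both reach a majority, so at most one winner exists.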
James Mickens “The Saddest Moment” 
http://research.microsoft.com/en-us/people/mickens/thesaddestmoment.pdf
Paxos
Basic Paxos 
= 
“choose exactly one value”
Paxos - photo by Luigi Piazzi
Paxos: a high-level overview 
It’s the distributed systems algorithm
Paxos: a high-level overview 
JavaZone had a full session on Paxos already today…
A few Paxos whitepapers
“Reaching Agreement in the Presence of Faults” – Pease, Shostak, Lamport, 1980
…
“FLP Impossibility Result” – Fischer et al, 1985
“The Part-Time Parliament” – Lamport, 1998
…
“Paxos Made Simple” – Lamport, 2001
“Fast Paxos” – Lamport, 2005
…
“Paxos Made Live” – Chandra et al, 2007
…
“Paxos Made Moderately Complex” – van Renesse, 2011 ;-)
Lamport’s “Replicated State Machine”
Paxos: The cast
Consensus time!
Choose a value (raise your hand):
v1 = Basic Paxos + Raft
v2 = Just Raft (if enough time, Paxos)
Basic Paxos 
simple example
Paxos: Proposals 
ProposalNr must:
• be greaterThan any prev proposalNr used by this Proposer
• example: [roundNr|serverId]
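One way to get such numbers is to compare the round first and use the server id only as a tie-breaker, so two proposers can never produce the same number. A minimal sketch in plain Scala; `ProposalNr` as a case class and the `next` helper are assumptions for illustration:

```scala
object ProposalNumbers {
  /** Ordered by round first, server id second: the [roundNr|serverId] layout. */
  final case class ProposalNr(roundNr: Int, serverId: Int) extends Ordered[ProposalNr] {
    def compare(that: ProposalNr): Int =
      if (roundNr != that.roundNr) Integer.compare(roundNr, that.roundNr)
      else Integer.compare(serverId, that.serverId)
  }

  /** Next number for `serverId`, strictly greater than the highest one seen so far. */
  def next(highestSeen: ProposalNr, serverId: Int): ProposalNr =
    ProposalNr(highestSeen.roundNr + 1, serverId)
}
```

Bumping the round past the highest number seen (not just the highest number used locally) also lets a proposer that lost a round catch up quickly.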
Paxos: 2 phases 
Phase 1: Prepare 
Phase 2: Accept
Paxos, Prepare Phase
    // Proposer: pick a fresh proposal number and ask for promises.
    n = nextSeqNr()
    acceptors ! Prepare(n)

    // Acceptor: promise to ignore proposals below n,
    // and report what (if anything) was already accepted.
    case Prepare(n) =>
      if (n > minProposal) {
        minProposal = n
      }

      sender() ! Accepted(acceptedProposal, acceptedValue)

    // Proposer: if any acceptor already accepted a value,
    // replace my value with the accepted value of the highest proposal!
    value = highestN(responses).acceptedValue
Paxos, Accept Phase
    // Proposer: ask the acceptors to accept (n, value).
    acceptors ! Accept(n, value)

    // Acceptor: accept unless a higher proposal has been promised since.
    case Accept(n, value) =>
      if (n >= minProposal) {
        minProposal = n
        acceptedProposal = n
        acceptedValue = value
        learners ! Learn(value)
      }

      sender() ! minProposal

    // Proposer: a reply above n means another proposer got ahead of us.
    if (acceptedN > n) restartPaxos()
    else println(n + " was chosen!")
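The two phases can be tried end to end without any actors. The sketch below is plain Scala under strong assumptions: calls are synchronous, every acceptor is always reachable (so it does not deal with majorities at all), values are Strings, and proposal numbers are plain Ints. `Promise`, `Acceptor` and `propose` are names invented for this example:

```scala
object BasicPaxosSketch {
  /** What an acceptor reports in phase 1: its highest accepted proposal, if any. */
  final case class Promise(acceptedProposal: Int, acceptedValue: Option[String])

  final class Acceptor {
    private var minProposal = 0
    private var acceptedProposal = 0
    private var acceptedValue: Option[String] = None

    /** Phase 1: promise to ignore proposals below n; report what was accepted so far. */
    def prepare(n: Int): Promise = {
      if (n > minProposal) minProposal = n
      Promise(acceptedProposal, acceptedValue)
    }

    /** Phase 2: accept unless a higher proposal was promised; reply with minProposal. */
    def accept(n: Int, value: String): Int = {
      if (n >= minProposal) {
        minProposal = n
        acceptedProposal = n
        acceptedValue = Some(value)
      }
      minProposal
    }
  }

  /** One full round. Returns the chosen value, or None if the proposer must restart. */
  def propose(n: Int, myValue: String, acceptors: Seq[Acceptor]): Option[String] = {
    val promises = acceptors.map(_.prepare(n))
    val alreadyAccepted = promises.filter(_.acceptedValue.isDefined)
    // Adopt the value of the highest accepted proposal, if there is one.
    val value =
      if (alreadyAccepted.isEmpty) myValue
      else alreadyAccepted.maxBy(_.acceptedProposal).acceptedValue.get
    val replies = acceptors.map(_.accept(n, value))
    if (replies.exists(_ > n)) None else Some(value)
  }
}
```

Proposing "pizza" with n = 1 and then "sushi" with n = 2 against the same acceptors yields "pizza" both times: once a value is chosen, later proposers can only re-propose it.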
Basic Paxos
Basic Paxos needs extensions for the “real world”.
Additions: 
• “stable leader” 
• performance (basic = 2 * broadcast roundtrip) 
• ensure full replication 
• configuration changes
Multi Paxos
“Basically everyone does it, but everyone does it differently.”
• Keeps the Leader 
• Clients find and talk to the Leader 
• Skips Phase 1, in stable state 
• 2 delays instead of 4, until learning a value
Raft
Raft – inspired by Paxos
Paxos is great.
Multi-Paxos is great, but there is no “common understanding” of it.
Raft wants to be understandable and just as solid.
“In Search of an Understandable Consensus Algorithm” (2013)
• Leader based
• Fewer processes than Paxos
• Its goal is simplicity
• “Basic” includes snapshotting / membership
Raft - summarised on one page
Diego Ongaro & John Ousterhout – In Search of an Understandable Consensus Algorithm
Raft - starting the cluster
Raft - Election timeout
Raft - 1st election
Raft - Election Timeout
Raft - Election Phase
Raft – heartbeat = empty entries
Akka–Raft
(community project)
(work in progress)
Raft, reminder:
Raft translated to Akka
    abstract class RaftActor
      extends Actor
      with FSM[RaftState, Metadata]
Raft translated to Akka
    onTransition {

      case Follower -> Candidate =>
        self ! BeginElection
        resetElectionDeadline()

      // ...
    }
Raft translated to Akka
    case Event(BeginElection, m: ElectionMeta) =>
      log.info("Init election (among {} nodes) for {}",
        m.config.members.size, m.currentTerm)

      val request = RequestVote(m.currentTerm, m.clusterSelf,
        replicatedLog.lastTerm, replicatedLog.lastIndex)

      m.membersExceptSelf foreach { _ ! request }

      // A candidate always votes for itself.
      val includingThisVote = m.incVote
      stay() using includingThisVote.withVoteFor(m.currentTerm, m.clusterSelf)
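The snippet above is the candidate's side. The receiving side follows the vote-granting rule from the Raft paper: at most one vote per term, and only for a candidate whose log is at least as up to date as ours. A minimal sketch in plain Scala; it is not akka-raft's code, and `FollowerState`, `handle` and the String candidate ids are assumptions for illustration:

```scala
object RaftVoting {
  final case class RequestVote(term: Int, candidateId: String, lastLogTerm: Int, lastLogIndex: Int)

  final case class FollowerState(
      currentTerm: Int,
      votedFor: Option[String],
      lastLogTerm: Int,
      lastLogIndex: Int)

  /** Returns the updated state and whether the vote was granted. */
  def handle(state: FollowerState, req: RequestVote): (FollowerState, Boolean) = {
    // Seeing a newer term moves us to that term and clears our vote.
    val s =
      if (req.term > state.currentTerm) state.copy(currentTerm = req.term, votedFor = None)
      else state

    // The candidate's log must be at least as up to date as ours.
    val logUpToDate =
      req.lastLogTerm > s.lastLogTerm ||
        (req.lastLogTerm == s.lastLogTerm && req.lastLogIndex >= s.lastLogIndex)

    // At most one vote per term (voting again for the same candidate is fine).
    val freeToVote = s.votedFor.forall(_ == req.candidateId)

    if (req.term >= s.currentTerm && freeToVote && logUpToDate)
      (s.copy(votedFor = Some(req.candidateId)), true)
    else
      (s, false)
  }
}
```

Because every node grants at most one vote per term and a candidate needs a majority, two leaders cannot be elected in the same term.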
Raft Heartbeat using Akka
    sendHeartbeat(m)
    log.info("Starting heartbeat, with interval: {}", heartbeatInterval)
    setTimer(HeartbeatName, SendHeartbeat, heartbeatInterval, repeat = true)

    val leaderBehaviour = {
      // ...
      case Event(SendHeartbeat, m: LeaderMeta) =>
        sendHeartbeat(m)
        stay()
    }
akka-raft is a work in progress community project – it may change a lot
Akka-Raft in User-Land //alpha!!!
    class WordConcatRaftActor extends RaftActor {

      type Command = Cmnd

      var words = Vector[String]()

      /** Applied when command committed by Raft consensus */
      def apply = {
        case AppendWord(word) =>
          words = words :+ word
          word

        case GetWords =>
          log.info("Replying with {}", words.toList)
          words.toList
      }
    }
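Why this works is the replicated state machine idea: if every node applies the same committed commands in the same order, every node ends up with the same words. A minimal sketch in plain Scala, independent of akka-raft; `replay` is a made-up helper and `Cmnd` / `AppendWord` are redefined locally just for this example:

```scala
object ReplicatedWords {
  sealed trait Cmnd
  final case class AppendWord(word: String) extends Cmnd

  /** Deterministically folds a committed log into the state machine's state. */
  def replay(committed: Seq[Cmnd]): Vector[String] =
    committed.foldLeft(Vector.empty[String]) {
      case (words, AppendWord(word)) => words :+ word
    }
}
```

Consensus only has to agree on the log; the state then follows from replaying it, which is also how a restarted node can catch up.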
FLP Impossibility
FLP Impossibility Result
Impossibility of Distributed Consensus with One Faulty Process
1985 by Fischer, Lynch, Paterson
ktoso @ typesafe.com 
twitter: ktosopl 
github: ktoso 
blog: project13.pl 
team blog: letitcrash.com 
JavaZone @ Oslo 2014 
Takk! 
Dzięki! 
Thanks! 
ありがとう! 
akka.io
Happy Byzantine Lunch-time! 
Konrad 'ktoso' Malawski 
GeeCON 2014 @ Kraków, PL
©Typesafe 2014 – All Rights Reserved
Links
1. http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf
2. http://hydra.infosys.tuwien.ac.at/teaching/courses/AdvancedDistributedSystems/download/1975_Akkoyunlu,%20Ekanadham,%20Huber_Some%20constraints%20and%20tradeoffs%20in%20the%20design%20of%20network%20communications.pdf
3. http://research.microsoft.com/en-us/people/mickens/thesaddestmoment.pdf
4. http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf
5. http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
6. http://the-paper-trail.org/blog/consensus-protocols-paxos/
7. http://static.googleusercontent.com/media/research.google.com/en//archive/paxos_made_live.pdf
8. http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
9. https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf
10. Recent Leslie Lamport interview: http://www.se-radio.net/2014/04/episode-203-leslie-lamport-on-distributed-systems/
11. http://book.mixu.net/distsys/
12. http://codahale.com/you-cant-sacrifice-partition-tolerance/
Links
1. Excellent Paxos lecture by Diego Ongaro https://www.youtube.com/watch?v=JEpsBg0AO6o
2. Fallacies, actual paper: http://www.rgoarchitects.com/Files/fallacies.pdf
3. Diego Ongaro & John Ousterhout – In Search of an Understandable Consensus Algorithm
4. http://macs.citadel.edu/rudolphg/csci604/ImpossibilityofConsensus.pdf
Images / drawings
1. Paxos Island Photo – Luigi Piazzi (CC license) https://www.flickr.com/photos/photolupi/3686769346/in/photolist-6BME5J-orKHL2-58qmez-58uz7s-7bRwTj-7bRvHY-6DdRC2-fBqFFU-35KTg7-8vbe23-bsBGL7-58qq6z-58uAjG-8vbeCd-d1Sqqw-d1Smsj-d1Sqi5-d1SoMA-d1SmBE-d1SpVo-d1Sk2U-d1SoBQ-d1SoXu-d1SoqN-d1Spqu-d1Sq4w-d1SpLU-d1SKDG-d1Skcu-d1Sp8f-d1Sqaq-d1SpCw-75YaVN-d1SLs1-d1SK15-d1SJiC-d1Suiu-d1SKtS-d1SjQS-d1StyU-d1SKi1-d1SxGS-d1Sm6j-d1Sxdh-d1SKMN-d1SxAq-d1SwgC-d1Smgj-d1SvhJ-d1SjC7
2. Drawings – myself (use-them-at-will-unless-mocking-my-horrible-drawing-skills-license)

Distributed Consensus A.K.A. "What do we eat for lunch?"

  • 1.
    Konrad 'ktoso' Malawski Distributed Consensus “What do we eat for lunch?” GeeCON 2014 @ Kraków, PL A.K.A. Konrad `@ktosopl` Malawski
  • 2.
    Konrad 'ktoso' Malawski Distributed Consensus GeeCON 2014 @ Kraków, PL A.K.A. “What do we eat for lunch?” real world edition Konrad `@ktosopl` Malawski
  • 3.
    hAkker @ Konrad`@ktosopl` Malawski
  • 4.
    hAkker @ Konrad`@ktosopl` Malawski typesafe.com geecon.org Java.pl / KrakowScala.pl sckrk.com / meetup.com/Paper-Cup @ London GDGKrakow.pl meetup.com/Lambda-Lounge-Krakow
  • 5.
  • 6.
  • 7.
  • 8.
    What is thistalk about? The network. ! How to think about distributed systems. ! Some healthy madness. Code in slides covers only “simplest possible case”.
  • 9.
    Ordering[T] Slightly chronological. ! By no means is it “worst to best”.
  • 10.
  • 11.
    Consensus - informal “we all agree on something”
  • 12.
    Consensus - formal Termination Every correct process decides some value. ! Validity If all correct processes propose the same value v, then all correct processes decide v. ! Integrity If a correct process decides v, then v must have been proposed by some correct process. ! Agreement Every correct process must agree on the same value.
  • 13.
  • 14.
  • 15.
  • 16.
    Distributed Consensus Whatis a distributed system anyway?
  • 17.
    Distributed system definition A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. — Leslie Lamport http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt
  • 18.
    Distributed system definition A system in which participants communicate asynchronously using messages. http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt
  • 19.
    Distributed Systems -failure detection
  • 20.
    Distributed Systems -failure detection
  • 21.
    Distributed Systems -failure detection Jim had quit CorpSoft a while ago, but no-one ever told Bob…
  • 22.
    Distributed Systems -failure detection
  • 23.
    Distributed Systems -failure detection Failure detection: • can only rely on external knowledge • but what if there’s no-one to tell you? • thus: must be in-some-way time based
  • 24.
  • 25.
    Two Generals Problem Yellow and Blue armies must attack Pink City. They must attack together, otherwise they’ll die in vain. Now they must agree on the exact time of the attack. ! They can only send messengers, which Pink may intercept and kill.
  • 26.
  • 27.
    Two Generals Problem- happy case I need to inform blue about my attack plan. I don’t know when yellow will attack…
  • 28.
  • 29.
  • 30.
    Two Generals Problem- happy case I don’t know if Blue will also attack at 13:37… I’ll wait until I hear back from him.
  • 31.
    Two Generals Problem- happy case I don’t know if Blue will also attack at 13:37… I’ll wait until I hear back from him. Why?
  • 32.
    2) Message mighthave not reached blue
  • 33.
    Blue must confirmthe reception of the command
  • 34.
    1) Yellow isnow sure, but Blue isn’t!
  • 35.
    1) Yellow isnow sure, but Blue isn’t! Why?
  • 36.
    2) Blue’s confirmationmight have been lost!
  • 37.
    This is exactlymirrors the initial situation!
  • 38.
    2 Generals Problem Translated to Akka
  • 39.
    2 Generals translatedto Akka: Akka Actors implement the Actor Model: ! Actors: • communicate via messages • create other actors • change their behaviour on receiving a msg !
  • 40.
    2 Generals translatedto Akka: Akka Actors implement the Actor Model: ! Actors: • communicate via messages • create other actors • change their behaviour on receiving a msg ! Gains? Distribution / separation / modelling abstraction
  • 41.
    2 Generals translatedto Akka: case class AttackAt(when: Date) Presentation–sized–snippet = does not cover all cases
  • 42.
    2 Generals translatedto Akka: ! ! class General(general: Option[ActorRef]) extends Actor {! ! ! val WhenIWantToAttack: Date = ???! ! general foreach { _ ! AttackAt(WhenIWantToAttack) }! ! def receive = {! case AttackAt(when) =>! println(s”General ${otherGeneralName} attacks at $when”)! ! ! ! println(s”I must confirm this!")! ! sender() ! AttackAt(when)! }! ! def otherGeneralName = ! ! ! ! if(self.path.name == “blue")!“yellow" else "blue"! }! Presentation–sized–snippet = does not cover all cases
  • 43.
    2 Generals translatedto Akka: ! ! class General(general: Option[ActorRef]) extends Actor {! ! ! val WhenIWantToAttack: Date = ???! ! general foreach { _ ! AttackAt(WhenIWantToAttack) }! ! def receive = {! case AttackAt(when) =>! println(s”General ${otherGeneralName} attacks at $when”)! ! ! ! println(s”I must confirm this!")! ! sender() ! AttackAt(when)! }! ! def otherGeneralName = ! ! ! ! if(self.path.name == “blue")!“yellow" else "blue"! }! Presentation–sized–snippet = does not cover all cases
  • 44.
    2 Generals translatedto Akka: ! ! class General(general: Option[ActorRef]) extends Actor {! ! ! val WhenIWantToAttack: Date = ???! ! general foreach { _ ! AttackAt(WhenIWantToAttack) }! ! def receive = {! case AttackAt(when) =>! println(s”General ${otherGeneralName} attacks at $when”)! ! ! ! println(s”I must confirm this!")! ! sender() ! AttackAt(when)! }! ! def otherGeneralName = ! ! ! ! if(self.path.name == “blue")!“yellow" else "blue"! }! Presentation–sized–snippet = does not cover all cases
  • 45.
    2 Generals translatedto Akka: ! ! class General(general: Option[ActorRef]) extends Actor {! ! ! val WhenIWantToAttack: Date = ???! ! general foreach { _ ! AttackAt(WhenIWantToAttack) }! ! def receive = {! case AttackAt(when) =>! println(s”General ${otherGeneralName} attacks at $when”)! ! ! ! println(s”I must confirm this!")! ! sender() ! AttackAt(when)! }! ! def otherGeneralName = ! ! ! ! if (self.path.name == “blue")!"yellow" else "blue"! }! Presentation–sized–snippet = does not cover all cases
  • 46.
    2 Generals translatedto Akka: val system = ActorSystem("two-generals")! ! val blue = ! system.actorOf(Props(new General(general = None)), name = "blue")! ! val yellow = ! system.actorOf(Props(new General(Some(blue))), name = "yellow")! The blue general attacks at 13:37, I must confirm this!! The yellow general attacks at 13:37, I must confirm this!! The blue general attacks at 13:37, I must confirm this!! ... Presentation–sized–snippet = does not cover all cases
  • 47.
    8 Fallacies ofDistributed Computing
  • 48.
    8 Fallacies ofDistributed Computing 1. The network is reliable. 2. Latency is zero. 3. Bandwidth is infinite. 4. The network is secure. 5. Topology doesn’t change. 6. There is one administrator. 7. Transport cost is zero. 8. The network is homogeneous. Peter Deutsch “The Eight Fallacies of Distributed Computing” https://blogs.oracle.com/jag/resource/Fallacies.html
  • 49.
  • 50.
    Failure models: Fail– Stop Fail – Recover Byzantine
  • 51.
    Failure models: Fail– Stop Fail – Recover Byzantine
  • 52.
    Failure models: Fail– Stop Fail – Recover Byzantine
  • 53.
    Failure models: Fail– Stop Fail – Recover Byzantine
  • 54.
  • 55.
    2PC - step1: Propose value
  • 56.
    2PC - step1: Propose value
  • 57.
    2PC - step1: Promise to agree on write
  • 58.
    2PC - step2: Commit the write
  • 59.
    2PC - step1: Propose value, and die
  • 60.
    2PC - step1: Propose value to 1 node, and die
  • 61.
  • 62.
    2PC: Timeouts +recovery committer
  • 63.
    2PC: Timeouts +recovery committer
  • 64.
    2PC: Timeouts +recovery committer
  • 65.
    2PC: Timeouts +recovery committer
  • 66.
    2PC: Timeouts +recovery committer
  • 67.
    Still can’t tolerateif the “accepted value” Actor dies
  • 68.
    2PC: Timeouts +recovery committer
  • 69.
    2PC: Timeouts +recovery committer
  • 70.
    2 Phase Commit translated to Akka
  • 71.
    2PC translated toAkka case class Prepare(value: Any)! case object Commit! ! sealed class AcceptorStatus! case object Prepared extends AcceptorStatus! case object Conflict extends AcceptorStatus! ! Presentation–sized–snippet = does not cover all cases
  • 72.
    2PC translated toAkka case class Prepare(value: Any)! case object Commit! ! sealed class AcceptorStatus! case object Prepared extends AcceptorStatus! case object Conflict extends AcceptorStatus! ! Presentation–sized–snippet = does not cover all cases
  • 73.
    2PC translated toAkka class Proposer(acceptors: List[ActorRef]) extends Actor {! var transactionId = 0! var preparedAcceptors = 0! ! def receive = {! case value: String =>! transactionId += 1! acceptors foreach { _ ! Prepare(transactionId, value) }! ! case Prepared =>! preparedAcceptors += 1! ! if (preparedAcceptors == acceptors.size)! acceptors foreach { _ ! Commit }! ! case Conflict =>! ! ! ! ! ! context stop self! }! }! Presentation–sized–snippet = does not cover all cases
  • 74.
    2PC translated toAkka class Proposer(acceptors: List[ActorRef]) extends Actor {! var transactionId = 0! var preparedAcceptors = 0! ! def receive = {! case value: String =>! transactionId += 1! acceptors foreach { _ ! Prepare(transactionId, value) }! ! case Prepared =>! preparedAcceptors += 1! ! if (preparedAcceptors == acceptors.size)! acceptors foreach { _ ! Commit }! ! case Conflict =>! ! ! ! ! ! context stop self! }! }! Presentation–sized–snippet = does not cover all cases
  • 75.
    2PC translated toAkka class Proposer(acceptors: List[ActorRef]) extends Actor {! var transactionId = 0! var preparedAcceptors = 0! ! def receive = {! case value: String =>! transactionId += 1! acceptors foreach { _ ! Prepare(transactionId, value) }! ! case Prepared =>! preparedAcceptors += 1! ! if (preparedAcceptors == acceptors.size)! acceptors foreach { _ ! Commit }! ! case Conflict =>! ! ! ! ! ! context stop self! }! }! Presentation–sized–snippet = does not cover all cases
  • 76.
    2PC with ResumeProposerin Akka case class Prepare(value: Any)! case object Commit! ! sealed class AcceptorStatus! case object Prepared extends AcceptorStatus! case object Conflict extends AcceptorStatus! case class Committed(value: Any) extends AcceptorStatus! Presentation–sized–snippet = does not cover all cases
  • 77.
    2PC with ResumeProposerin Akka ! class ResumeProposer(! proposer: ActorRef, ! acceptors: List[ActorRef]) extends Actor {! ! context watch proposer! ! var anyAcceptorCommitted = false! ! def receive = {! case Terminated(`proposer`) =>! println("Proposer died! Try to finish the transaction...")! acceptors map { _ ! StatusPlz }! ! case _: AcceptorStatus =>! // impl of recovery here! }! } Presentation–sized–snippet = does not cover all cases
  • 78.
    2PC with ResumeProposerin Akka Presentation–sized–snippet = does not cover all cases
  • 79.
  • 80.
    Quorum voting Fromthe perspective of the Omnipotent Observer *
  • 81.
    Quorum voting Fromthe perspective of the Omnipotent Observer * * does not exist in a running system
  • 82.
  • 83.
  • 84.
  • 85.
  • 86.
  • 87.
  • 88.
    Quorum voting –split votes
  • 89.
    Quorum voting –split votes
  • 90.
    Quorum voting –split votes
  • 91.
    Quorum voting –split votes
  • 92.
    Quorum voting –split votes
  • 93.
    James Mickens “TheSaddest Moment” http://research.microsoft.com/en-us/people/mickens/thesaddestmoment.pdf
  • 94.
  • 95.
    Basic Paxos = “choose exactly one value”
  • 96.
    Paxos - photoby Luigi Piazzi
  • 97.
    Paxos: a high-leveloverview It’s the distributed systems algorithm
  • 98.
    Paxos: a high-leveloverview JavaZone had a full session on Paxos already today…
  • 99.
    A few Paxoswhitepapers "Reaching Agreement in the Presence of Faults” – Lamport, 1980 … “FLP Impossibility Result” – Fisher et al, 1985 “The Part Time Parliament” – Lamport, 1998 … “Paxos made Simple” – Lamport, 2001 “Fast Paxos” – Lamport, 2005 … “Paxos made Live” – Chandra et al, 2007 … “Paxos made Moderately Complex” – Rennesse, 2011 ;-)
  • 100.
  • 101.
  • 102.
  • 103.
  • 104.
  • 105.
  • 106.
  • 107.
    ! Consensus time! Chose a value (raise your hand)
  • 108.
    Consensus time! Chosea value (raise your hand): v1 = Basic Paxos + Raft v2 = Just Raft
  • 109.
  • 110.
  • 111.
    Consensus time! Choose a value (raise your hand):
    v1 = Basic Paxos + Raft
    v2 = Just Raft (if enough time, Paxos)
  • 112.
  • 113.
    Paxos: Proposals
    ProposalNr must:
    • be greater than any previous proposalNr used by this Proposer
    • example: [roundNr|serverId]
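    One way to read [roundNr|serverId], as a hedged sketch: compare by round first and use the server id as a tie-breaker, so two Proposers never produce the same number. The case class ProposalNr and its nextAfter method are assumptions made for this example, not code from the talk.

```scala
/** Proposal number laid out as [roundNr|serverId] (illustrative names). */
final case class ProposalNr(roundNr: Long, serverId: Int) extends Ordered[ProposalNr] {

  /** Higher round wins; serverId only breaks ties between Proposers. */
  def compare(that: ProposalNr): Int = {
    val byRound = java.lang.Long.compare(roundNr, that.roundNr)
    if (byRound != 0) byRound else java.lang.Integer.compare(serverId, that.serverId)
  }

  /** Next number for this Proposer, greater than its own last one
    * and greater than the highest number it has seen from anyone. */
  def nextAfter(highestSeen: ProposalNr): ProposalNr =
    ProposalNr(math.max(roundNr, highestSeen.roundNr) + 1, serverId)
}
```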
  • 114.
    Paxos: 2 phases Phase 1: Prepare Phase 2: Accept
  • 115.
    Paxos, Prepare Phase n = nextSeqNr()
  • 116.
    Paxos, Prepare Phase acceptors ! Prepare(n, value)
  • 117.
    Paxos, Prepare Phase
    case Prepare(n, value) =>
      if (n > minProposal) {
        minProposal = n
        accVal = value
      }

      sender() ! Accepted(minProposal, accVal)
  • 118.
  • 119.
    Paxos, Prepare Phase
    value = highestN(responses).accVal
    // replace my value with the accepted value!
  • 120.
    Paxos, Accept Phase acceptors ! Accept(n, value)
  • 121.
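    Paxos, Accept Phase
    case Accept(n, value) =>
      if (n >= minProposal) {
        // Scala has no chained assignment, so set both fields explicitly
        minProposal = n
        acceptedProposal = n
        acceptedValue = value
        // only tell learners about a value this acceptor actually accepted
        learners ! Learn(value)
      }

      sender() ! minProposal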
  • 122.
  • 123.
  • 124.
    Paxos, Accept Phase
    if (acceptedN > n) restartPaxos()
    else println(n + " was chosen!")
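    Both phases put together, as a hedged in-memory sketch: no Akka, no network, no failures, String values and Int proposal numbers. It follows Basic Paxos as described in “Paxos Made Simple”, where Prepare carries only the proposal number and an acceptor stores a value only in the Accept phase. BasicPaxosSketch, Promise, Acceptor and propose are names invented for this example; minProposal, acceptedProposal and acceptedValue mirror the slides.

```scala
object BasicPaxosSketch {
  /** Reply to Prepare: what this acceptor has accepted so far, if anything. */
  final case class Promise(acceptedProposal: Option[Int], acceptedValue: Option[String])

  final class Acceptor {
    private var minProposal = 0 // highest proposal number promised so far
    private var acceptedProposal: Option[Int] = None
    private var acceptedValue: Option[String] = None

    /** Phase 1: promise to reject anything below n, report the accepted value. */
    def prepare(n: Int): Promise = {
      if (n > minProposal) minProposal = n
      Promise(acceptedProposal, acceptedValue)
    }

    /** Phase 2: accept unless a higher number was promised; reply with minProposal. */
    def accept(n: Int, value: String): Int = {
      if (n >= minProposal) {
        minProposal = n
        acceptedProposal = Some(n)
        acceptedValue = Some(value)
      }
      minProposal
    }
  }

  /** One proposer round. Some(value) if a majority accepted, None means "restart". */
  def propose(n: Int, myValue: String, acceptors: Seq[Acceptor]): Option[String] = {
    val promises = acceptors.map(_.prepare(n))
    // Replace my value with the value of the highest-numbered accepted proposal.
    val value = promises
      .collect { case Promise(Some(accN), Some(v)) => accN -> v }
      .sortBy(_._1)
      .lastOption
      .map(_._2)
      .getOrElse(myValue)
    val replies = acceptors.map(_.accept(n, value))
    val accepted = replies.count(_ == n) // a reply above n is a rejection
    if (accepted > acceptors.size / 2) Some(value) else None
  }
}
```

    Once a value is chosen, a later proposer with a different value ends up re-proposing the chosen one. That is the “choose exactly one value” property.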
  • 125.
    Basic Paxos
    Basic Paxos needs extensions for the “real world”. Additions:
    • “stable leader”
    • performance (basic = 2 * broadcast roundtrip)
    • ensure full replication
    • configuration changes
  • 126.
  • 127.
    Multi Paxos “Basically everyone does it, but everyone does it differently.”
  • 128.
    Multi Paxos
    • Keeps the Leader
    • Clients find and talk to the Leader
    • Skips Phase 1 in stable state
    • 2 delays instead of 4 until learning a value
  • 129.
  • 130.
    Raft – inspired by Paxos
    Paxos is great. Multi-Paxos is great, but has no “common understanding”.
    Raft wants to be understandable and just as solid.
    “In Search of an Understandable Consensus Algorithm” (2013)
  • 131.
    Raft – inspired by Paxos
    • Leader based
    • Fewer processes than Paxos
    • Its goal is simplicity
    • “Basic” includes snapshotting / membership
  • 132.
    Raft - summarised on one page Diego Ongaro & John Ousterhout – In Search of an Understandable Consensus Algorithm
  • 133.
  • 134.
  • 135.
    Raft - starting the cluster
  • 136.
  • 137.
    Raft - 1st election
  • 138.
  • 139.
  • 140.
  • 141.
  • 142.
  • 143.
  • 144.
  • 145.
  • 146.
  • 147.
  • 148.
  • 149.
  • 150.
  • 151.
    Raft – heartbeat = empty entries
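    "Heartbeat = empty entries" as a hedged sketch in plain Scala: a heartbeat is an ordinary AppendEntries message with no entries, and any AppendEntries from a current leader pushes the follower's election deadline forward. The fields follow the AppendEntries RPC in the Raft paper (leaderId left out); RaftHeartbeat, Follower and the millisecond parameters are assumptions for this example.

```scala
object RaftHeartbeat {
  final case class Entry(term: Long, command: String)

  /** A heartbeat is simply an AppendEntries with an empty `entries`. */
  final case class AppendEntries(term: Long, prevLogIndex: Long, prevLogTerm: Long,
                                 entries: Seq[Entry], leaderCommit: Long) {
    def isHeartbeat: Boolean = entries.isEmpty
  }

  final case class Follower(currentTerm: Long, electionDeadlineMillis: Long)

  /** Messages from a stale leader are ignored; anything else resets the deadline. */
  def onAppendEntries(f: Follower, msg: AppendEntries,
                      nowMillis: Long, timeoutMillis: Long): Follower =
    if (msg.term < f.currentTerm) f
    else f.copy(currentTerm = msg.term, electionDeadlineMillis = nowMillis + timeoutMillis)

  /** No heartbeat before the deadline: time to become a Candidate. */
  def shouldStartElection(f: Follower, nowMillis: Long): Boolean =
    nowMillis >= f.electionDeadlineMillis
}
```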
  • 152.
  • 153.
    Akka-Raft (community project) (work in progress)
  • 154.
  • 155.
    Raft translated to Akka
    abstract class RaftActor
      extends Actor
      with FSM[RaftState, Metadata]
  • 156.
  • 157.
    Raft translated to Akka
    onTransition {

      case Follower -> Candidate =>
        self ! BeginElection
        resetElectionDeadline()

      // ...
    }
  • 158.
  • 159.
    Raft translated to Akka
    case Event(BeginElection, m: ElectionMeta) =>
      log.info("Init election (among {} nodes) for {}",
        m.config.members.size, m.currentTerm)

      // ask everyone else for their vote in this term
      val request = RequestVote(m.currentTerm, m.clusterSelf,
        replicatedLog.lastTerm, replicatedLog.lastIndex)

      m.membersExceptSelf foreach { _ ! request }

      // a candidate always votes for itself
      val includingThisVote = m.incVote
      stay() using includingThisVote.withVoteFor(m.currentTerm, m.clusterSelf)
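    The receiving side of RequestVote, as a hedged sketch of the rule from the Raft paper: at most one vote per term, and only for a candidate whose log is at least as up to date as the voter's. This is plain Scala, not akka-raft code. RaftVoting, VoterState and handle are invented names; the RequestVote fields mirror the call on the slide, with assumed types.

```scala
object RaftVoting {
  final case class RequestVote(term: Long, candidateId: String,
                               lastLogTerm: Long, lastLogIndex: Long)

  final case class VoterState(currentTerm: Long, votedFor: Option[String],
                              lastLogTerm: Long, lastLogIndex: Long)

  /** Returns the new voter state and whether the vote was granted. */
  def handle(s: VoterState, req: RequestVote): (VoterState, Boolean) =
    if (req.term < s.currentTerm) (s, false) // stale candidate
    else {
      // A newer term starts with a clean slate: no vote cast yet.
      val st =
        if (req.term > s.currentTerm) s.copy(currentTerm = req.term, votedFor = None)
        else s
      // Candidate's log must be at least as up to date as ours.
      val logOk =
        req.lastLogTerm > st.lastLogTerm ||
          (req.lastLogTerm == st.lastLogTerm && req.lastLogIndex >= st.lastLogIndex)
      // Either we have not voted in this term, or we voted for this same candidate.
      val canVote = st.votedFor.forall(_ == req.candidateId)
      if (logOk && canVote) (st.copy(votedFor = Some(req.candidateId)), true)
      else (st, false)
    }
}
```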
  • 160.
  • 161.
    Raft Heartbeat using Akka
    sendHeartbeat(m)
    log.info("Starting heartbeat, with interval: {}", heartbeatInterval)
    setTimer(HeartbeatName, SendHeartbeat, heartbeatInterval, repeat = true)
    akka-raft is a work in progress community project – it may change a lot
  • 162.
  • 163.
    Raft Heartbeat using Akka
    sendHeartbeat(m)
    log.info("Starting heartbeat, with interval: {}", heartbeatInterval)
    setTimer(HeartbeatName, SendHeartbeat, heartbeatInterval, repeat = true)

    val leaderBehaviour = {
      // ...
      case Event(SendHeartbeat, m: LeaderMeta) =>
        sendHeartbeat(m)
        stay()
    }
  • 164.
    Akka-Raft in User-Land // alpha!!!
    class WordConcatRaftActor extends RaftActor {

      type Command = Cmnd

      var words = Vector[String]()

      /** Applied when command committed by Raft consensus */
      def apply = {
        case AppendWord(word) =>
          words = words :+ word
          word

        case GetWords =>
          log.info("Replying with {}", words.toList)
          words.toList
      }
    }
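    Why this works, as a hedged sketch without Akka or akka-raft: the state machine is deterministic, so every replica that applies the same committed log in the same order ends up with the same words. WordConcat, StateMachine and replay are names made up for this example; Cmnd, AppendWord and GetWords are taken from the slide, with assumed definitions.

```scala
object WordConcat {
  sealed trait Cmnd
  final case class AppendWord(word: String) extends Cmnd
  case object GetWords extends Cmnd

  /** Deterministic state machine: same commands in, same state out. */
  final class StateMachine {
    private var words = Vector.empty[String]

    /** Called once per committed command, in log order. */
    def apply(cmd: Cmnd): Any = cmd match {
      case AppendWord(word) =>
        words = words :+ word
        word
      case GetWords =>
        words.toList
    }

    def current: List[String] = words.toList
  }

  /** What a replica does with the committed part of the log. */
  def replay(committedLog: Seq[Cmnd]): List[String] = {
    val sm = new StateMachine
    committedLog.foreach(cmd => sm.apply(cmd))
    sm.current
  }
}
```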
  • 165.
  • 166.
    FLP Impossibility Proof Impossibility of Distributed Consensus with One Faulty Process 1985 by Fischer, Lynch, Paterson
  • 167.
    FLP Impossibility Result Impossibility of Distributed Consensus with One Faulty Process 1985 by Fischer, Lynch, Paterson
  • 168.
  • 169.
    ktoso @ typesafe.com
    twitter: ktosopl
    github: ktoso
    blog: project13.pl
    team blog: letitcrash.com
    JavaZone @ Oslo 2014
    Takk! Dzięki! Thanks! ありがとう!
    akka.io
  • 170.
    Happy Byzantine Lunch-time! Konrad 'ktoso' Malawski GeeCON 2014 @ Kraków, PL
  • 171.
    © Typesafe 2014 – All Rights Reserved
  • 172.
    Links
    1. http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf
    2. http://hydra.infosys.tuwien.ac.at/teaching/courses/AdvancedDistributedSystems/download/1975_Akkoyunlu,%20Ekanadham,%20Huber_Some%20constraints%20and%20tradeoffs%20in%20the%20design%20of%20network%20communications.pdf
    3. http://research.microsoft.com/en-us/people/mickens/thesaddestmoment.pdf
    4. http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf
    5. http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
    6. http://the-paper-trail.org/blog/consensus-protocols-paxos/
    7. http://static.googleusercontent.com/media/research.google.com/en//archive/paxos_made_live.pdf
    8. http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
    9. https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf
    10. Recent Leslie Lamport interview: http://www.se-radio.net/2014/04/episode-203-leslie-lamport-on-distributed-systems/
    11. http://book.mixu.net/distsys/
    12. http://codahale.com/you-cant-sacrifice-partition-tolerance/
    Peter Deutsch “The Eight Fallacies of Distributed Computing” https://blogs.oracle.com/jag/resource/Fallacies.html
  • 173.
    Links
    1. Excellent Paxos lecture by Diego Ongaro https://www.youtube.com/watch?v=JEpsBg0AO6o
    2. Fallacies, actual paper: http://www.rgoarchitects.com/Files/fallacies.pdf
    3. Diego Ongaro & John Ousterhout – In Search of an Understandable Consensus Algorithm
    4. http://macs.citadel.edu/rudolphg/csci604/ImpossibilityofConsensus.pdf
  • 174.
    Images / drawings
    1. Paxos Island Photo – Luigi Piazzi (CC license) https://www.flickr.com/photos/photolupi/3686769346/in/photolist-6BME5J-orKHL2-58qmez-58uz7s-7bRwTj-7bRvHY-6DdRC2-fBqFFU-35KTg7-8vbe23-bsBGL7-58qq6z-58uAjG-8vbeCd-d1Sqqw-d1Smsj-d1Sqi5-d1SoMA-d1SmBE-d1SpVo-d1Sk2U-d1SoBQ-d1SoXu-d1SoqN-d1Spqu-d1Sq4w-d1SpLU-d1SKDG-d1Skcu-d1Sp8f-d1Sqaq-d1SpCw-75YaVN-d1SLs1-d1SK15-d1SJiC-d1Suiu-d1SKtS-d1SjQS-d1StyU-d1SKi1-d1SxGS-d1Sm6j-d1Sxdh-d1SKMN-d1SxAq-d1SwgC-d1Smgj-d1SvhJ-d1SjC7
    2. Drawings – myself (use-them-at-will-unless-mocking-my-horrible-drawing-skills-license)