
Distributed Consensus A.K.A. "What do we eat for lunch?"

Distributed consensus is everywhere! Even if it is not obvious at first, most apps nowadays are distributed systems, and these sometimes have to "agree on a value". This is where consensus algorithms come in. In this session we'll look at the general problem and solve a few example cases using the Raft algorithm, implemented with Akka's Actor and Cluster modules.

Published in: Technology

  1. 1. Distributed Consensus A.K.A. “What do we eat for lunch?” Konrad 'ktoso' Malawski (`@ktosopl`), GeeCON 2014 @ Kraków, PL
  2. 2. Distributed Consensus A.K.A. “What do we eat for lunch?” (real world edition) Konrad 'ktoso' Malawski (`@ktosopl`), GeeCON 2014 @ Kraków, PL
  3. 3. hAkker @ Konrad `@ktosopl` Malawski
  4. 4. hAkker @ Konrad `@ktosopl` Malawski typesafe.com geecon.org Java.pl / KrakowScala.pl sckrk.com / meetup.com/Paper-Cup @ London GDGKrakow.pl meetup.com/Lambda-Lounge-Krakow
  5. 5. You? Distributed systems?
  6. 6. You? Distributed systems? ?
  7. 7. You? Distributed systems? ? ?
  8. 8. What is this talk about? The network. How to think about distributed systems. Some healthy madness. Code in slides covers only the “simplest possible case”.
  9. 9. Ordering[T]: slightly chronological. By no means is it “worst to best”.
  10. 10. Consensus
  11. 11. Consensus - informal “we all agree on something”
  12. 12. Consensus - formal
        • Termination: every correct process decides some value.
        • Validity: if all correct processes propose the same value v, then all correct processes decide v.
        • Integrity: if a correct process decides v, then v must have been proposed by some correct process.
        • Agreement: every correct process must agree on the same value.
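The four properties above can be checked mechanically against one finished run. The following is a minimal sketch in plain Scala (no Akka), not code from the talk: `Run`, `proposed` and `decided` are hypothetical names, and every process in the run is assumed to be correct.

```scala
object ConsensusProperties {
  // One finished run: what each correct process proposed,
  // and what it decided (None = it never decided).
  final case class Run[V](proposed: Map[String, V], decided: Map[String, Option[V]])

  // Termination: every correct process decided some value.
  def termination[V](r: Run[V]): Boolean =
    r.proposed.keys.forall(p => r.decided.getOrElse(p, None).isDefined)

  // Agreement: all decisions that were made are the same value.
  def agreement[V](r: Run[V]): Boolean =
    r.decided.values.flatten.toSet.size <= 1

  // Integrity: every decided value was proposed by some process.
  def integrity[V](r: Run[V]): Boolean = {
    val proposals = r.proposed.values.toSet
    r.decided.values.flatten.forall(v => proposals.contains(v))
  }

  // Validity: if everyone proposed the same v, every decision is v.
  def validity[V](r: Run[V]): Boolean = {
    val proposals = r.proposed.values.toSet
    proposals.size != 1 || r.decided.values.flatten.forall(_ == proposals.head)
  }
}
```

For example, a run in which "a" proposes pizza, "b" proposes sushi and both decide pizza satisfies all four checks, while a run in which each process decides its own proposal violates agreement.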
  13. 13. Consensus
  14. 14. Consensus
  15. 15. Distributed Consensus
  16. 16. Distributed Consensus What is a distributed system anyway?
  17. 17. Distributed system definition A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. — Leslie Lamport http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt
  18. 18. Distributed system definition A system in which participants communicate asynchronously using messages. http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt
  19. 19. Distributed Systems - failure detection
  20. 20. Distributed Systems - failure detection
  21. 21. Distributed Systems - failure detection Jim had quit CorpSoft a while ago, but no-one ever told Bob…
  22. 22. Distributed Systems - failure detection
  23. 23. Distributed Systems - failure detection Failure detection: • can only rely on external knowledge • but what if there’s no-one to tell you? • thus: must be in-some-way time based
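A time-based failure detector, as the bullets above suggest, can be sketched as follows. This is an illustrative sketch in plain Scala, not Akka's own failure detector: `Detector`, `heartbeat` and `suspected` are hypothetical names, and timestamps are passed in explicitly so the logic stays deterministic.

```scala
object FailureDetectorSketch {
  // Remembers the last time (in ms) we heard from each peer.
  final case class Detector(timeoutMs: Long, lastHeard: Map[String, Long] = Map.empty) {

    // Record that `peer` was heard from at time `nowMs`.
    def heartbeat(peer: String, nowMs: Long): Detector =
      copy(lastHeard = lastHeard.updated(peer, nowMs))

    // Peers silent for longer than the timeout are only *suspected*:
    // they may be dead, slow, or cut off by the network.
    def suspected(nowMs: Long): Set[String] =
      lastHeard.collect { case (peer, t) if nowMs - t > timeoutMs => peer }.toSet
  }
}
```

With a 1000 ms timeout, a peer last heard at time 0 is suspected at time 1500, while a peer last heard at time 900 is not. Note that the detector can never distinguish "Jim quit" from "Jim's messages are delayed"; it only knows that time has passed.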
  24. 24. Two Generals Problem
  25. 25. Two Generals Problem Yellow and Blue armies must attack Pink City. They must attack together, otherwise they’ll die in vain. Now they must agree on the exact time of the attack. They can only send messengers, which Pink may intercept and kill.
  26. 26. Two Generals Problem
  27. 27. Two Generals Problem - happy case I need to inform blue about my attack plan. I don’t know when yellow will attack…
  28. 28. Two Generals Problem - happy case
  29. 29. 1) Initial message not lost
  30. 30. Two Generals Problem - happy case I don’t know if Blue will also attack at 13:37… I’ll wait until I hear back from him.
  31. 31. Two Generals Problem - happy case I don’t know if Blue will also attack at 13:37… I’ll wait until I hear back from him. Why?
  32. 32. 2) Message might not have reached Blue
  33. 33. Blue must confirm the reception of the command
  34. 34. 1) Yellow is now sure, but Blue isn’t!
  35. 35. 1) Yellow is now sure, but Blue isn’t! Why?
  36. 36. 2) Blue’s confirmation might have been lost!
  37. 37. This exactly mirrors the initial situation!
  38. 38. 2 Generals Problem Translated to Akka
  39. 39. 2 Generals translated to Akka: Akka Actors implement the Actor Model. Actors: • communicate via messages • create other actors • change their behaviour on receiving a msg
  40. 40. 2 Generals translated to Akka: Akka Actors implement the Actor Model. Actors: • communicate via messages • create other actors • change their behaviour on receiving a msg. Gains? Distribution / separation / modelling abstraction
  41. 41. 2 Generals translated to Akka: case class AttackAt(when: Date) Presentation–sized–snippet = does not cover all cases
  42. 42. 2 Generals translated to Akka:
        import java.util.Date
        import akka.actor.{Actor, ActorRef}

        class General(general: Option[ActorRef]) extends Actor {

          // placeholder from the slide: the time this general wants to attack
          val WhenIWantToAttack: Date = ???

          // if we know the other general, propose our attack time right away
          general foreach { _ ! AttackAt(WhenIWantToAttack) }

          def receive = {
            case AttackAt(when) =>
              println(s"General ${otherGeneralName} attacks at $when")
              println(s"I must confirm this!")
              // the confirmation is itself an AttackAt, so the exchange never ends
              sender() ! AttackAt(when)
          }

          def otherGeneralName =
            if (self.path.name == "blue") "yellow" else "blue"
        }
  46. 46. 2 Generals translated to Akka:
        val system = ActorSystem("two-generals")

        val blue =
          system.actorOf(Props(new General(general = None)), name = "blue")

        val yellow =
          system.actorOf(Props(new General(Some(blue))), name = "yellow")

        Output:
        The blue general attacks at 13:37, I must confirm this!
        The yellow general attacks at 13:37, I must confirm this!
        The blue general attacks at 13:37, I must confirm this!
        ...
  47. 47. 8 Fallacies of Distributed Computing
  48. 48. 8 Fallacies of Distributed Computing 1. The network is reliable. 2. Latency is zero. 3. Bandwidth is infinite. 4. The network is secure. 5. Topology doesn’t change. 6. There is one administrator. 7. Transport cost is zero. 8. The network is homogeneous. Peter Deutsch “The Eight Fallacies of Distributed Computing” https://blogs.oracle.com/jag/resource/Fallacies.html
  49. 49. Failure Models
  50. 50. Failure models: Fail – Stop Fail – Recover Byzantine
  54. 54. 2-phase commit
  55. 55. 2PC - step 1: Propose value
  56. 56. 2PC - step 1: Propose value
  57. 57. 2PC - step 1: Promise to agree on write
  58. 58. 2PC - step 2: Commit the write
  59. 59. 2PC - step 1: Propose value, and die
  60. 60. 2PC - step 1: Propose value to 1 node, and die
  61. 61. 2PC: Prepare needs timeouts
  62. 62. 2PC: Timeouts + recovery committer
  63. 63. 2PC: Timeouts + recovery committer
  64. 64. 2PC: Timeouts + recovery committer
  65. 65. 2PC: Timeouts + recovery committer
  66. 66. 2PC: Timeouts + recovery committer
  67. 67. Still can’t tolerate if the “accepted value” Actor dies
  68. 68. 2PC: Timeouts + recovery committer
  69. 69. 2PC: Timeouts + recovery committer
  70. 70. 2 Phase Commit translated to Akka
  71. 71. 2PC translated to Akka
        // transactionId is included so the message matches the
        // Prepare(transactionId, value) call made by the Proposer
        case class Prepare(transactionId: Int, value: Any)
        case object Commit

        sealed class AcceptorStatus
        case object Prepared extends AcceptorStatus
        case object Conflict extends AcceptorStatus
  73. 73. 2PC translated to Akka
        import akka.actor.{Actor, ActorRef}

        class Proposer(acceptors: List[ActorRef]) extends Actor {
          var transactionId = 0
          var preparedAcceptors = 0

          def receive = {
            case value: String =>
              // phase 1: start a new transaction and ask every acceptor to prepare
              transactionId += 1
              preparedAcceptors = 0 // forget promises from the previous transaction
              acceptors foreach { _ ! Prepare(transactionId, value) }

            case Prepared =>
              preparedAcceptors += 1
              // phase 2: commit only once every acceptor has promised
              if (preparedAcceptors == acceptors.size)
                acceptors foreach { _ ! Commit }

            case Conflict =>
              // any conflict ends the attempt: the proposer gives up
              context stop self
          }
        }
  76. 76. 2PC with ResumeProposer in Akka
        case class Prepare(transactionId: Int, value: Any)
        case object Commit

        sealed class AcceptorStatus
        case object Prepared extends AcceptorStatus
        case object Conflict extends AcceptorStatus
        case class Committed(value: Any) extends AcceptorStatus
  77. 77. 2PC with ResumeProposer in Akka
        import akka.actor.{Actor, ActorRef, Terminated}

        class ResumeProposer(
          proposer: ActorRef,
          acceptors: List[ActorRef]) extends Actor {

          // deliver a Terminated message to this actor when the proposer dies
          context watch proposer

          var anyAcceptorCommitted = false

          def receive = {
            case Terminated(`proposer`) =>
              println("Proposer died! Try to finish the transaction...")
              // StatusPlz is not defined on the slides; presumably a status request message
              acceptors foreach { _ ! StatusPlz }

            case _: AcceptorStatus =>
              // impl of recovery here
          }
        }
  78. 78. 2PC with ResumeProposer in Akka
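The decision rule at the heart of the 2PC slides can be isolated from the actor plumbing. The sketch below is plain Scala under stated assumptions: `Vote`, `Decision`, `Abort`, `Waiting` and `decide` are hypothetical names introduced for illustration (the slides stop the proposer on `Conflict` rather than sending an explicit abort).

```scala
object TwoPhaseCommitSketch {
  sealed trait Vote
  case object Prepared extends Vote
  case object Conflict extends Vote

  sealed trait Decision
  case object Commit extends Decision
  case object Abort extends Decision
  case object Waiting extends Decision // still missing replies: this is where timeouts are needed

  // acceptors: everyone who was sent Prepare
  // votes:     replies received so far, keyed by acceptor
  def decide(acceptors: Set[String], votes: Map[String, Vote]): Decision =
    if (votes.values.exists(_ == Conflict)) Abort
    else if (acceptors.forall(a => votes.get(a).contains(Prepared))) Commit
    else Waiting
}
```

The `Waiting` result shows why the plain protocol blocks: if one acceptor never answers, or the proposer dies while waiting, nobody can move forward without timeouts and some form of recovery.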
  79. 79. Quorum
  80. 80. Quorum voting From the perspective of the Omnipotent Observer *
  81. 81. Quorum voting From the perspective of the Omnipotent Observer * * does not exist in a running system
  82. 82. Quorum voting
  83. 83. Quorum voting
  84. 84. Quorum voting
  85. 85. Quorum voting
  86. 86. Quorum voting
  87. 87. Quorum voting
  88. 88. Quorum voting – split votes
  89. 89. Quorum voting – split votes
  90. 90. Quorum voting – split votes
  91. 91. Quorum voting – split votes
  92. 92. Quorum voting – split votes
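Quorum voting and the split-vote case can be illustrated with a few lines of plain Scala. This is a sketch, not code from the talk: `majority` and `winner` are hypothetical helper names, and the votes are assumed to have already been collected in one place (the "omnipotent observer" view that does not exist in a running system).

```scala
object QuorumSketch {
  // Smallest number of votes that is more than half of the cluster.
  def majority(clusterSize: Int): Int = clusterSize / 2 + 1

  // Some(value) if one value reached a majority, None on a split vote.
  def winner[V](clusterSize: Int, votes: Seq[V]): Option[V] =
    votes.groupBy(identity).collectFirst {
      case (value, vs) if vs.size >= majority(clusterSize) => value
    }
}
```

In a cluster of 5, three votes for the same value win. In a cluster of 4 with a 2:2 split, `winner` returns `None` and the vote has to be retried, which is one reason odd cluster sizes are popular.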
  93. 93. James Mickens “The Saddest Moment” http://research.microsoft.com/en-us/people/mickens/thesaddestmoment.pdf
  94. 94. Paxos
  95. 95. Basic Paxos = “choose exactly one value”
  96. 96. Paxos - photo by Luigi Piazzi
  97. 97. Paxos: a high-level overview It’s the distributed systems algorithm
  98. 98. Paxos: a high-level overview JavaZone had a full session on Paxos already today…
  99. 99. A few Paxos whitepapers
        • “Reaching Agreement in the Presence of Faults” – Pease, Shostak, Lamport, 1980
        • …
        • “FLP Impossibility Result” – Fischer et al., 1985
        • “The Part-Time Parliament” – Lamport, 1998
        • …
        • “Paxos Made Simple” – Lamport, 2001
        • “Fast Paxos” – Lamport, 2005
        • …
        • “Paxos Made Live” – Chandra et al., 2007
        • …
        • “Paxos Made Moderately Complex” – van Renesse, 2011 ;-)
  100. 100. Lamport’s “Replicated State Machine”
  101. 101. Paxos: The cast
  102. 102. Paxos: The cast
  103. 103. Paxos: The cast
  104. 104. Paxos: The cast
  105. 105. Paxos: The cast
  106. 106. Paxos: The cast
  107. 107. Consensus time! Choose a value (raise your hand)
  108. 108. Consensus time! Choose a value (raise your hand): v1 = Basic Paxos + Raft, v2 = Just Raft
  111. 111. Consensus time! Choose a value (raise your hand): v1 = Basic Paxos + Raft, v2 = Just Raft (if enough time, Paxos)
  112. 112. Basic Paxos simple example
  113. 113. Paxos: Proposals ProposalNr must: • be greater than any previous proposalNr used by this Proposer • example: [roundNr|serverId]
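One way to build proposal numbers of the form [roundNr|serverId] is to pack both into a single number, with the round in the high bits. The sketch below is an assumption about the encoding, not something the slides specify: the 16-bit width for `serverId` and the names `proposalNr`, `roundOf` and `next` are all hypothetical.

```scala
object ProposalNumberSketch {
  // Assumed layout: round number in the high bits, server id in the low 16 bits.
  val ServerIdBits = 16

  def proposalNr(roundNr: Long, serverId: Int): Long = {
    require(serverId >= 0 && serverId < (1 << ServerIdBits), "serverId must fit in 16 bits")
    (roundNr << ServerIdBits) | serverId.toLong
  }

  def roundOf(n: Long): Long = n >> ServerIdBits

  // Next number for this server: strictly greater than anything seen so far,
  // and never equal to another server's number because the low bits differ.
  def next(highestSeen: Long, serverId: Int): Long =
    proposalNr(roundOf(highestSeen) + 1, serverId)
}
```

Comparing two such numbers as plain `Long`s compares rounds first and uses the server id only as a tie-breaker, so two proposers in the same round can never produce the same number.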
  114. 114. Paxos: 2 phases Phase 1: Prepare Phase 2: Accept
  115. 115. Paxos, Prepare Phase n = nextSeqNr()
  116. 116. Paxos, Prepare Phase acceptors ! Prepare(n, value)
  117. 117. Paxos, Prepare Phase
        case Prepare(n, value) =>
          if (n > minProposal) {
            minProposal = n
            accVal = value
          }

          sender() ! Accepted(minProposal, accVal)
  119. 119. Paxos, Prepare Phase
        // replace my value with the accepted value!
        value = highestN(responses).accVal
  120. 120. Paxos, Accept Phase acceptors ! Accept(n, value)
  121. 121. Paxos, Accept Phase
        case Accept(n, value) =>
          if (n >= minProposal) {
            // Scala has no chained assignment, so both fields are set explicitly
            minProposal = n
            acceptedProposal = n
            acceptedValue = value
          }

          learners ! Learn(value)
          sender() ! minProposal
  122. 122. Paxos, Accept Phase
  123. 123. Paxos, Accept Phase
  124. 124. Paxos, Accept Phase
        if (acceptedN > n) restartPaxos()
        else println(n + " was chosen!")
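The acceptor's two handlers can also be written as pure state transitions, which makes the rules easy to test. The sketch below follows the common textbook formulation of single-decree Paxos, which differs slightly from the slide snippets (here `Prepare` carries no value and the acceptor reports what it has already accepted). All names are hypothetical, and persistence, learners and networking are left out.

```scala
object PaxosAcceptorSketch {
  // minProposal: highest proposal number promised so far
  // accepted:    (proposal number, value) accepted so far, if any
  final case class Acceptor(minProposal: Long = 0L, accepted: Option[(Long, String)] = None)

  // Phase 1: promise to ignore proposals below n, and report what was already accepted.
  def prepare(a: Acceptor, n: Long): (Acceptor, Option[(Long, String)]) =
    if (n > a.minProposal) (a.copy(minProposal = n), a.accepted)
    else (a, a.accepted)

  // Phase 2: accept unless a higher number was promised; always reply with minProposal.
  def accept(a: Acceptor, n: Long, value: String): (Acceptor, Long) =
    if (n >= a.minProposal) (Acceptor(n, Some((n, value))), n)
    else (a, a.minProposal)

  // Proposer side: any reply above n means a newer proposal exists, so start over.
  def mustRestart(n: Long, replies: Seq[Long]): Boolean = replies.exists(_ > n)
}
```

A proposer that sees an already accepted value in the `prepare` replies must adopt the one with the highest proposal number instead of its own, which is what keeps a chosen value from being overwritten.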
  125. 125. Basic Paxos needs extensions for the “real world”. Additions: • “stable leader” • performance (basic = 2 * broadcast roundtrip) • ensure full replication • configuration changes
  126. 126. Multi Paxos
  127. 127. Multi Paxos “Basically everyone does it, but everyone does it differently.”
  128. 128. Multi Paxos • Keeps the Leader • Clients find and talk to the Leader • Skips Phase 1, in stable state • 2 delays instead of 4, until learning a value
  129. 129. Raft
  130. 130. Raft – inspired by Paxos. Paxos is great. Multi-Paxos is great, but there is no “common understanding” of it. Raft wants to be understandable and just as solid. "In Search of an Understandable Consensus Algorithm" (2013)
  131. 131. Raft – inspired by Paxos • Leader based • Fewer processes than Paxos • Its goal is simplicity • “Basic” includes snapshotting / membership
  132. 132. Raft - summarised on one page. Diego Ongaro & John Ousterhout – In Search of an Understandable Consensus Algorithm
  133. 133. Raft
  134. 134. Raft
  135. 135. Raft - starting the cluster
  136. 136. Raft - Election timeout
  137. 137. Raft - 1st election
  138. 138. Raft - 1st election
  139. 139. Raft - Election Timeout
  140. 140. Raft - Election Phase
  141. 141. Raft
  142. 142. Raft
  143. 143. Raft
  144. 144. Raft
  145. 145. Raft
  146. 146. Raft
  147. 147. Raft
  148. 148. Raft
  149. 149. Raft
  150. 150. Raft
  151. 151. Raft – heartbeat = empty entries
  152. 152. Raft – heartbeat = empty entries
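"Heartbeat = empty entries" means the leader's heartbeat is just an append request with nothing to append, and the follower handles both the same way. The sketch below is a heavily simplified illustration in plain Scala, not akka-raft code: all names are hypothetical, and real Raft also checks the previous log index and term before appending.

```scala
object RaftHeartbeatSketch {
  // entries == Nil is a heartbeat; otherwise it carries log entries to replicate.
  final case class AppendEntries(term: Int, entries: List[String])

  final case class Follower(currentTerm: Int, log: Vector[String], electionDeadlineMs: Long)

  // Any message from a current leader pushes the election deadline into the future.
  def onAppendEntries(f: Follower, msg: AppendEntries, nowMs: Long, timeoutMs: Long): Follower =
    if (msg.term < f.currentTerm) f // stale leader: ignore, keep the old deadline
    else f.copy(
      currentTerm = msg.term,
      log = f.log ++ msg.entries,
      electionDeadlineMs = nowMs + timeoutMs)

  // Once the deadline passes without hearing from a leader, the follower starts an election.
  def electionTimedOut(f: Follower, nowMs: Long): Boolean = nowMs >= f.electionDeadlineMs
}
```

A follower with deadline 100 that receives an empty `AppendEntries` at time 90 with a 150 ms timeout keeps its log unchanged and moves its deadline to 240, so no election starts while the leader is alive.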
  153. 153. Akka–Raft (community project) (work in progress)
  154. 154. Raft, reminder:
  155. 155. Raft translated to Akka
        abstract class RaftActor
          extends Actor
          with FSM[RaftState, Metadata]
  157. 157. Raft translated to Akka
        onTransition {
          case Follower -> Candidate =>
            self ! BeginElection
            resetElectionDeadline()
          // ...
        }
  159. 159. Raft translated to Akka
        case Event(BeginElection, m: ElectionMeta) =>
          log.info("Init election (among {} nodes) for {}",
            m.config.members.size, m.currentTerm)

          // ask every other member for its vote in the current term
          val request = RequestVote(m.currentTerm, m.clusterSelf,
            replicatedLog.lastTerm, replicatedLog.lastIndex)

          m.membersExceptSelf foreach { _ ! request }

          // a candidate always votes for itself
          val includingThisVote = m.incVote
          stay() using includingThisVote.withVoteFor(m.currentTerm, m.clusterSelf)
        } // closes the enclosing block, which is not shown on the slide
  160. 160. Raft translated to Akka
  161. 161. Raft Heartbeat using Akka
        sendHeartbeat(m)
        log.info("Starting heartbeat, with interval: {}", heartbeatInterval)
        setTimer(HeartbeatName, SendHeartbeat, heartbeatInterval, repeat = true)
        akka-raft is a work in progress community project – it may change a lot
  163. 163. Raft Heartbeat using Akka
        sendHeartbeat(m)
        log.info("Starting heartbeat, with interval: {}", heartbeatInterval)
        setTimer(HeartbeatName, SendHeartbeat, heartbeatInterval, repeat = true)

        val leaderBehaviour: StateFunction = {
          // ...
          case Event(SendHeartbeat, m: LeaderMeta) =>
            sendHeartbeat(m)
            stay()
        }
  164. 164. Akka-Raft in User-Land //alpha!!
        class WordConcatRaftActor extends RaftActor {

          type Command = Cmnd

          var words = Vector[String]()

          /** Applied when command committed by Raft consensus */
          def apply = {
            case AppendWord(word) =>
              words = words :+ word
              word

            case GetWords =>
              log.info("Replying with {}", words.toList)
              words.toList
          }
        }
  165. 165. FLP Impossibility
  166. 166. FLP Impossibility Proof: “Impossibility of Distributed Consensus with One Faulty Process”, 1985, by Fischer, Lynch, Paterson
  167. 167. FLP Impossibility Result: “Impossibility of Distributed Consensus with One Faulty Process”, 1985, by Fischer, Lynch, Paterson
  169. 169. ktoso @ typesafe.com twitter: ktosopl github: ktoso blog: project13.pl team blog: letitcrash.com JavaZone @ Oslo 2014 Takk! Dzięki! Thanks! ありがとう! akka.io
  170. 170. Happy Byzantine Lunch-time! Konrad 'ktoso' Malawski GeeCON 2014 @ Kraków, PL
  171. 171. ©Typesafe 2014 – All Rights Reserved
  172. 172. Links
        1. http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf
        2. http://hydra.infosys.tuwien.ac.at/teaching/courses/AdvancedDistributedSystems/download/1975_Akkoyunlu,%20Ekanadham,%20Huber_Some%20constraints%20and%20tradeoffs%20in%20the%20design%20of%20network%20communications.pdf
        3. http://research.microsoft.com/en-us/people/mickens/thesaddestmoment.pdf
        4. http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf
        5. http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
        6. http://the-paper-trail.org/blog/consensus-protocols-paxos/
        7. http://static.googleusercontent.com/media/research.google.com/en//archive/paxos_made_live.pdf
        8. http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
        9. https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf
        10. Recent Leslie Lamport interview: http://www.se-radio.net/2014/04/episode-203-leslie-lamport-on-distributed-systems/
        11. http://book.mixu.net/distsys/
        12. http://codahale.com/you-cant-sacrifice-partition-tolerance/
        Peter Deutsch “The Eight Fallacies of Distributed Computing” https://blogs.oracle.com/jag/resource/Fallacies.html
  173. 173. Links
        1. Excellent Paxos lecture by Diego Ongaro https://www.youtube.com/watch?v=JEpsBg0AO6o
        2. Fallacies, actual paper: http://www.rgoarchitects.com/Files/fallacies.pdf
        3. Diego Ongaro & John Ousterhout – In Search of an Understandable Consensus Algorithm
        4. http://macs.citadel.edu/rudolphg/csci604/ImpossibilityofConsensus.pdf
  174. 174. Images / drawings
        1. Paxos Island Photo – Luigi Piazzi (CC license) https://www.flickr.com/photos/photolupi/3686769346/in/photolist-6BME5J-orKHL2-58qmez-58uz7s-7bRwTj-7bRvHY-6DdRC2-fBqFFU-35KTg7-8vbe23-bsBGL7-58qq6z-58uAjG-8vbeCd-d1Sqqw-d1Smsj-d1Sqi5-d1SoMA-d1SmBE-d1SpVo-d1Sk2U-d1SoBQ-d1SoXu-d1SoqN-d1Spqu-d1Sq4w-d1SpLU-d1SKDG-d1Skcu-d1Sp8f-d1Sqaq-d1SpCw-75YaVN-d1SLs1-d1SK15-d1SJiC-d1Suiu-d1SKtS-d1SjQS-d1StyU-d1SKi1-d1SxGS-d1Sm6j-d1Sxdh-d1SKMN-d1SxAq-d1SwgC-d1Smgj-d1SvhJ-d1SjC7
        2. Drawings – myself (use-them-at-will-unless-mocking-my-horrible-drawing-skills-license)
