The Paxos Commit Algorithm

  1. The Paxos Commit Algorithm
     Paxos Commit Protocol
     Jim Gray and Leslie Lamport, Microsoft Research, 1 January 2004
     Review by Ahmed Hamza
  2. Agenda
     • Paxos Commit Algorithm: Overview
     • The participating processes
         • The resource managers
         • The leader
         • The acceptors
     • Paxos Commit Algorithm: the base version
     • Failure scenarios
     • Optimizations for Paxos Commit
     • Performance
     • Paxos Commit vs. Two-Phase Commit
     • Using a dynamic set of resource managers
  3. Paxos Commit Algorithm: Overview
     • Leslie Lamport and Jim Gray applied Paxos to transaction commit in "Consensus on Transaction Commit"
     • One instance of Paxos (a consensus algorithm) is executed for each resource manager, in order to agree upon the value (Prepared/Aborted) that it proposes
     • "Not-synchronous" commit algorithm
     • Fault-tolerant (unlike 2PC)
         • Intended for systems where failures are fail-stop only, for both processes and network
     • Safety is guaranteed (unlike 3PC)
     • Formally specified and checked
     • Can be optimized to the theoretically best performance
  4. Participants: the resource managers
     • N resource managers ("RM") execute the distributed transaction, then each one chooses a value (its "locally chosen value", or "LCV"): 'p' for prepared iff it is willing to commit, 'a' for aborted otherwise
     • Every RM tries to get its LCV accepted by a majority set of acceptors ("MS": any subset whose cardinality is strictly greater than half of the total)
     • Each RM is the first proposer in its own instance of Paxos
     Participants: the leader
     • Coordinates the commit algorithm
     • All the instances of Paxos share the same leader
     • It is not a single point of failure (unlike in 2PC)
     • Assumed always defined (true, many leader (s)election algorithms exist) and unique (not necessarily true, but unlike in 3PC, safety does not rely on it)
  5. Participants: the acceptors
     • A denotes the set of acceptors
     • All the instances of Paxos share the same set A of acceptors
     • 2F+1 acceptors are involved in order to tolerate F failures
     • We will consider only F+1 acceptors, leaving F more as "spares" (less communication overhead)
     • Each acceptor keeps track of its own progress in an N×1 vector
     • The vectors need to be merged into an N×|MS| table, called aState, in order to take the global decision (we want "many" p's)
     [Figure: RM1, RM2 and RM3 submit their values (a, p, p) to the Paxos consensus box (MS), made of acceptors AC1 to AC5, and each gets an "Ok!" back.]
     aState          Acc1  Acc2  Acc3  Acc4  Acc5
     1st instance     a     a     a     a     a
     2nd instance     p     p     p     p     p
     3rd instance     p     p     p     p     p
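The merge of the per-acceptor vectors into aState can be sketched as follows. This is only an illustration: the function name `merge_vectors`, the dict-based layout and the acceptor names are assumptions, not something the slides define.

```python
def merge_vectors(vectors):
    """Merge per-acceptor vectors into an aState-like table.

    `vectors` maps an acceptor name to its N x 1 vector, represented here
    (by assumption) as a dict {instance: value}. The result has one row per
    Paxos instance and one column per acceptor: {instance: {acceptor: value}}.
    """
    a_state = {}
    for acc, vector in vectors.items():
        for instance, value in vector.items():
            a_state.setdefault(instance, {})[acc] = value
    return a_state


# Values copied from the aState table of the slide; names are hypothetical.
vectors = {f"Acc{i}": {1: "a", 2: "p", 3: "p"} for i in range(1, 6)}
a_state = merge_vectors(vectors)
```

Each row of `a_state` then holds everything the acceptors reported for one resource manager, which is the shape the global decision needs.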
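  6. Paxos Commit (base)
     [Figure: message sequence diagram among the resource managers RM0 to RM4 (N=5), the acceptors AC0 to AC2 (F=2) and the leader L, with the writes on log marked. Notation: $rm \in RM$, $acc \in MS \subseteq A$, $v \in \{p, a\}$. Messages, in order: BeginCommit; prepare; p2a $\langle rm, 0, v(rm) \rangle$; p2b $\langle acc, rm, 0, v(rm) \rangle$; p3 (commit if Global Commit holds, abort otherwise). Multiplicity labels in the figure: 1×, (N-1)×, (N(F+1)-1)×, F×, ×N. Annotations: "Opt."; "Not blocked iff F acceptors respond"; markers T1 and T2.]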
  7. Global Commit Condition
     $$\text{Global Commit} \iff (\forall\, rm)(\exists\, b)(\exists\, MS)(\forall\, acc \in MS)\ \big(\langle \text{p2b}, acc, rm, b, p \rangle \text{ was sent and received}\big)$$
     That is: there must be exactly one row for each RM involved in the commit, and in each of those rows there must be at least F+1 entries that have 'p' as their value and refer to the same ballot.
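A minimal sketch of how the global commit condition might be evaluated over an aState-like table. The layout `a_state[rm][acc] = (ballot, value)` and the name `global_commit` are assumptions made for this example, not part of the slides.

```python
from collections import Counter


def global_commit(a_state, rms, f):
    """Return True iff every RM has at least F+1 'p' entries in one ballot.

    `a_state[rm][acc]` is assumed to be the (ballot, value) pair carried by
    the phase 2b message that acceptor `acc` sent for the instance of `rm`.
    """
    for rm in rms:
        row = a_state.get(rm, {})
        # Count, per ballot, the acceptors that reported 'p' for this RM.
        prepared = Counter(b for (b, v) in row.values() if v == "p")
        if not any(count >= f + 1 for count in prepared.values()):
            return False
    return True


# Hypothetical data with F = 1, so F+1 = 2 matching entries are needed.
a_state = {
    "RM1": {"AC1": (0, "p"), "AC2": (0, "p")},
    "RM2": {"AC1": (0, "p"), "AC2": (1, "p"), "AC3": (0, "a")},
}
```

With this data RM1 satisfies the condition, while RM2 does not: its two 'p' entries refer to different ballots, so they do not add up to F+1 entries for the same ballot.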
  8. [T1] What if some RMs do not submit their LCV?
     [Figure: the leader exchanges messages with one majority of acceptors on behalf of the missing RM j, using a ballot $b_{L1} > 0$ and $v \in \{p, a\}$. Message labels: p1a, p1b, p2a; captions: "accept?", "promise", "prepare?".]
     • Leader: «Has resource manager j ever proposed a value to you?»
     • Acceptor i, case (1): «Yes, in my last session (ballot) $b_i$ with it I accepted its proposal $v_i$»
     • Acceptor i, case (2): «No, never»
     • In answering, the acceptor promises not to answer any ballot $b_{L2} < b_{L1}$
     • If at least |MS| acceptors answered:
         • if case (2) holds for ALL of them, then V = 'a' [FREE]
         • else V = $v(\max(\{b_i\}))$, the value accepted in the highest ballot [FORCED]
     • Leader: «I am j, I propose V»
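The FREE/FORCED rule for the value V can be sketched as below. The representation of the answers (None for "No, never", a `(ballot, value)` pair otherwise) and the name `choose_value` are assumptions made for illustration.

```python
def choose_value(replies, majority):
    """Pick the value V the leader proposes on behalf of a missing RM.

    `replies` holds one entry per acceptor that answered the leader:
    None for "No, never", or a (ballot, value) pair for
    "Yes, in ballot b_i I accepted v_i".
    Returns None while fewer than `majority` acceptors have answered.
    """
    if len(replies) < majority:
        return None  # not enough promises yet, keep waiting
    accepted = [r for r in replies if r is not None]
    if not accepted:
        return "a"  # FREE: no acceptor accepted anything, so abort is safe
    # FORCED: reuse the value that was accepted in the highest ballot.
    return max(accepted, key=lambda r: r[0])[1]
```

For example, `choose_value([None, (0, "p")], majority=2)` yields `"p"`: one acceptor already accepted the RM's own proposal, so the leader is forced to propose it again rather than abort.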
  9. [T2] What if the leader fails?
     • If the leader fails, some leader (s)election algorithm is executed. A faulty election (2+ leaders) doesn't preclude safety (unlike in 3PC), but can impede progress…
     • Non-terminating example: an infinite sequence of p1a-p1b-p2a messages from 2 leaders
         • Not really likely to happen
         • It can be avoided (random T?)
     [Figure: leaders L1 and L2 alternately contact the majority set MS with increasing ballots $b_1 > 0$, $b_2 > b_1$, $b_3 > b_2$, $b_4 > b_3$, each after a timeout T; after every new ballot its leader is trusted and the other one is ignored.]
  10. Optimizations for Paxos Commit (1)
      • Co-location: each acceptor is on the same node as an RM, and the initiating RM is on the same node as the initial leader
          • -1 message phase (BeginCommit), -(F+2) messages
      • "Real-Time assumptions": RMs can prepare spontaneously. The prepare phase is not needed anymore; RMs just "know" they have to prepare within some amount of time
          • -1 message phase (Prepare), -(N-1) messages
      [Figure: two diagrams with RM0 to RM4, the leader L and acceptors AC0 to AC2. The first shows the co-located nodes and the BeginCommit, p2a and p3 messages; the second shows the (N-1)× prepare messages marked "Not needed anymore!".]
  11. Optimizations for Paxos Commit (2)
      • Phase 3 elimination: the acceptors send their phase 2b messages (the columns of aState) directly to the RMs, which evaluate the global commit condition
      • Paxos Commit + Phase 3 Elimination = Faster Paxos Commit (FPC)
      • FPC + Co-location + R.T.A. = Optimal Consensus Algorithm
      [Figure: two diagrams with RM0 to RM4, the leader L and acceptors AC0 to AC2; one shows p2b messages followed by p3, the other shows p2b messages only.]
  12. Performance
                                            2PC               Paxos Commit          Faster Paxos Commit
                                     No coloc.  Coloc.    No coloc.    Coloc.     No coloc.    Coloc.
      Message delays*                    4        3           5          4            4          3
      Messages*                        3N-1     3N-3      NF+F+3N-1   NF+3N-3     2NF+3N-1   2FN-2F+3N-3
      Stable storage write delays**         2                   2                       2
      Stable storage writes**              N+1                N+F+1                   N+F+1
      *Not assuming the RMs' concurrent preparation (slides-like scenario)
      **Assuming the RMs' concurrent preparation (real-time constraints needed)
      • If we deploy only one acceptor for Paxos Commit (F=0), its fault tolerance and cost are the same as 2PC's. Are they exactly the same protocol in that case?
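The message-count row of the performance table can be checked numerically with a small helper. The formulas are the ones given on the slide; the function name `message_counts` and the returned layout are assumptions made for this sketch.

```python
def message_counts(n, f):
    """Messages per protocol, as (no co-location, co-location) pairs.

    n is the number of resource managers and f the number of tolerated
    acceptor failures; the formulas are copied from the performance table.
    """
    return {
        "2PC": (3 * n - 1, 3 * n - 3),
        "Paxos Commit": (n * f + f + 3 * n - 1, n * f + 3 * n - 3),
        "Faster Paxos Commit": (
            2 * n * f + 3 * n - 1,
            2 * f * n - 2 * f + 3 * n - 3,
        ),
    }


counts = message_counts(n=5, f=2)
```

Setting `f=0` makes the Paxos Commit entries collapse to the 2PC ones, which is the arithmetic behind the closing question of the slide.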
  13. Paxos Commit vs. 2PC
      • Yes, but…
      • …two slightly different versions of 2PC!
      [Figure: two message diagrams among RM1, the TM and the other RMs, with markers T1 and T2: "2PC from Lamport and Gray's paper" and "2PC from the slides of the course".]
  14. Using a dynamic set of RM
      • You add one process, the registrar, that acts just like another resource manager, except for the following: its value is not $v_{registrar} \in \{p, a\}$ but $v_{registrar} = \{rm : rm \text{ joined the transaction}\}$
      • RMs can join the transaction until the commit protocol begins
      • The global commit condition now holds on the set of resource managers proposed by the registrar and decided in its own instance of Paxos:
      $$\text{Global Commit}_{DynRM} \iff (\forall\, rm \in v_{registrar})(\exists\, b)(\exists\, MS)(\forall\, acc \in MS)\ \big(\langle \text{p2b}, acc, rm, b, p \rangle \text{ was sent and received}\big)$$
      [Figure: RM1, RM2 and RM3 send "join" to the registrar REG; REG submits the set "RM1;RM2;RM3" and the RMs submit their values (a, p, p) to the Paxos box (MS), made of acceptors AC1 to AC5, and each gets an "Ok!" back.]
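A hedged sketch of the dynamic variant of the commit condition: the registrar's own Paxos instance first decides which RMs take part, and the usual check is then applied to exactly that set. The helper names `chosen` and `global_commit_dyn`, and the `(ballot, value)` layout, are assumptions made for illustration.

```python
from collections import Counter


def chosen(row, f):
    """Value that at least F+1 acceptors accepted in one ballot, else None.

    `row` maps an acceptor to the (ballot, value) pair it reported.
    """
    votes = Counter(row.values())  # keys are (ballot, value) pairs
    for (_ballot, value), count in votes.items():
        if count >= f + 1:
            return value
    return None


def global_commit_dyn(registrar_row, a_state, f):
    """Commit iff the registrar's set is decided and each member chose 'p'."""
    members = chosen(registrar_row, f)
    if members is None:
        return False  # the set of participants has not been decided yet
    return all(chosen(a_state.get(rm, {}), f) == "p" for rm in members)
```

The registrar's value is assumed to be a `frozenset` of RM names so that it can be counted like any other value; an RM that is absent from that set is simply not consulted.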
  15. Thank You! Questions?
