the Paxos Commit algorithm

This is the presentation I used for a seminar on the "Paxos Commit" algorithm, one of Leslie Lamport's works (in this case, joint work with Jim Gray). You can find the original paper here:
http://research.microsoft.com/users/lamport/pubs/pubs.html#paxos-commit

Feel free to post comments ;)
Enjoy.

1. the Paxos Commit algorithm (title slide)

2. Agenda
  • Paxos Commit Algorithm: Overview
  • The participating processes
    • The resource managers
    • The leader
    • The acceptors
  • Paxos Commit Algorithm: the base version
  • Failure scenarios
  • Optimizations for Paxos Commit
  • Performance
  • Paxos Commit vs. Two-Phase Commit
  • Using a dynamic set of resource managers
3. Paxos Commit Algorithm: Overview
  • Paxos was applied to transaction commit by Leslie Lamport and Jim Gray in "Consensus on Transaction Commit".
  • One instance of Paxos (the consensus algorithm) is executed for each resource manager, in order to agree upon the value (Prepared/Aborted) proposed by it (see the sketch below).
  • A "non-synchronous" commit algorithm.
  • Fault-tolerant (unlike 2PC).
    • Intended to be used in systems where failures are fail-stop only, for both processes and the network.
  • Safety is guaranteed (unlike 3PC).
  • Formally specified and checked.
  • Can be optimized to the theoretically best performance.
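To make the structure above concrete, here is a minimal Python sketch (the names Decision and transaction_commits are mine, for illustration only): one decided value per resource manager's instance, and the transaction commits only if every instance decides Prepared.

    from enum import Enum

    class Decision(Enum):
        PREPARED = "p"
        ABORTED = "a"

    def transaction_commits(decided_values):
        """decided_values: dict mapping each RM id to the Decision chosen
        by that RM's own Paxos instance."""
        return all(v is Decision.PREPARED for v in decided_values.values())

    # Three RMs, one consensus instance each: commit only if all decided Prepared.
    print(transaction_commits({"RM1": Decision.PREPARED,
                               "RM2": Decision.PREPARED,
                               "RM3": Decision.PREPARED}))   # True
    print(transaction_commits({"RM1": Decision.ABORTED,
                               "RM2": Decision.PREPARED,
                               "RM3": Decision.PREPARED}))   # False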
4. Participants: the resource managers
  • N resource managers ("RMs") execute the distributed transaction, then each chooses a value (its "locally chosen value", or LCV): 'p' for Prepared iff it is willing to commit.
  • Every RM tries to get its LCV accepted by a majority set of acceptors ("MS": any subset with cardinality strictly greater than half of the total; see the sketch below).
  • Each RM is the first proposer in its own instance of Paxos.

  Participants: the leader
  • Coordinates the commit algorithm.
  • All the instances of Paxos share the same leader.
  • It is not a single point of failure (unlike 2PC).
  • Assumed to always be defined (true: many leader-(s)election algorithms exist) and unique (not necessarily true, but unlike 3PC, safety does not rely on it).
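A literal reading of the majority-set ("MS") definition above, as a small illustrative helper (the function name is assumed, not from the slides):

    def is_majority_set(subset, acceptors):
        """A majority set: any subset of the acceptors whose cardinality is
        strictly greater than half of the total."""
        return set(subset) <= set(acceptors) and len(subset) > len(acceptors) / 2

    acceptors = {"AC1", "AC2", "AC3", "AC4", "AC5"}
    print(is_majority_set({"AC1", "AC2", "AC3"}, acceptors))  # True  (3 > 2.5)
    print(is_majority_set({"AC1", "AC2"}, acceptors))         # False (2 <= 2.5)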
5. Participants: the acceptors
  • A denotes the set of acceptors.
  • All the instances of Paxos share the same set A of acceptors.
  • 2F+1 acceptors are involved in order to achieve tolerance to F failures.
  • We will consider only F+1 acceptors, leaving F more for "spare" purposes (less communication overhead).
  • Each acceptor keeps track of its own progress in an N×1 vector.
  • The vectors need to be merged into an N×|MS| table, called aState, in order to take the global decision (we want "many" p's); a merge sketch follows below.
  [Diagram: acceptors Acc1–Acc5, of which a majority set MS acts as a "consensus box" running one Paxos instance per RM; RM1 proposes 'a', RM2 and RM3 propose 'p', and each gets an "Ok!" back; the acceptors' columns of accepted values form the aState table, one row per instance.]
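A rough sketch of how the acceptors' vectors could be merged into the aState table described above (the function name and the use of None for "nothing accepted yet" are assumptions of this illustration):

    def merge_astate(acceptor_vectors):
        """acceptor_vectors: dict acceptor_id -> list of length N, where entry rm
        is the value ('p', 'a' or None) that acceptor accepted in RM rm's instance.
        Returns the N x |MS| aState table: one row per RM, one column per acceptor."""
        acceptor_ids = sorted(acceptor_vectors)
        n = len(next(iter(acceptor_vectors.values())))
        return [[acceptor_vectors[acc][rm] for acc in acceptor_ids] for rm in range(n)]

    # Three RMs, a majority set of three acceptors (with F=2, 2F+1=5 exist in total).
    vectors = {"Acc1": ["a", "p", "p"],
               "Acc2": ["a", "p", "p"],
               "Acc3": ["a", "p", None]}
    for rm, row in enumerate(merge_astate(vectors), start=1):
        print(f"instance of RM{rm}: {row}")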
6. Paxos Commit: the base version
  • Scenario: N=5 resource managers (RM0–RM4), F=2, leader L, acceptors AC0–AC2; writes to the log are marked on the timeline, and the failure points T1 and T2 are discussed on the next slides.
  • Message flow: RM0 starts the protocol with a single BeginCommit message that carries its own phase2a ("0 0 v(0)"); the leader sends prepare to the other N-1 RMs; each RM sends a phase2a message "rm 0 v(rm)" for ballot 0 of its own instance (N(F+1)-1 messages in the diagram); the acceptors answer with phase2b messages "acc rm 0 v(rm)"; finally the leader sends N phase3 messages: Commit if the global commit condition holds, else Abort. (The two message shapes are sketched below.)
  • The protocol is not blocked as long as a majority of the acceptors (F+1 of the 2F+1) respond.
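For reference, the two message shapes that recur in this flow can be written down as plain records; the field names follow the diagram's "rm 0 v(rm)" and "acc rm 0 v(rm)" labels, and the classes themselves are only illustrative:

    from dataclasses import dataclass

    @dataclass
    class Phase2a:
        """An RM's ballot-0 proposal in its own instance ("rm 0 v(rm)")."""
        rm: int
        ballot: int
        value: str        # 'p' (Prepared) or 'a' (Aborted)

    @dataclass
    class Phase2b:
        """An acceptor's report of what it accepted ("acc rm 0 v(rm)")."""
        acc: int
        rm: int
        ballot: int
        value: str

    print(Phase2a(rm=1, ballot=0, value="p"))
    print(Phase2b(acc=0, rm=1, ballot=0, value="p"))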
7. Global Commit Condition
  • There must be exactly one row for each RM involved in the commitment, and in each of those rows there must be at least F+1 phase2b entries ("acc rm b p") that have 'p' as their value and refer to the same ballot b (a code reading of this condition follows below).
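Read as code, the condition is a straightforward check over the received phase2b messages. This is a sketch using the acc/rm/ballot/value fields from these slides, not the paper's own specification:

    from collections import Counter

    def global_commit(phase2b_msgs, rms, F):
        """phase2b_msgs: list of (acc, rm, ballot, value) tuples.
        Commit iff every RM in `rms` has at least F+1 'p' entries in one ballot."""
        for rm in rms:
            votes_per_ballot = Counter(ballot
                                       for (_, r, ballot, value) in phase2b_msgs
                                       if r == rm and value == "p")
            if not votes_per_ballot or max(votes_per_ballot.values()) < F + 1:
                return False
        return True

    # F=1, so each RM needs 2 'p' entries in the same ballot.
    msgs = [(acc, rm, 0, "p") for acc in (1, 2) for rm in (1, 2, 3)]
    print(global_commit(msgs, rms=[1, 2, 3], F=1))     # True
    print(global_commit(msgs, rms=[1, 2, 3, 4], F=1))  # False: RM4 has no 'p' entries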
8. [T1] What if some RMs do not submit their LCV?
  The leader runs phase 1 of Paxos in the missing RM j's instance, with a ballot b_L1 > 0, against one majority of acceptors (the value-selection rule is sketched below):
  • p1a ("prepare?"): Leader: «Has resource manager j ever proposed a value to you?»
  • p1b ("promise"): Acceptor i answers either
    (1) «Yes, in my last session (ballot) b_i with it I accepted its proposal v_i», or
    (2) «No, never»,
    and promises not to answer any lower ballot b_L2 < b_L1.
  • p2a ("accept?"): if at least |MS| acceptors answered:
    if case (2) holds for ALL of them, then V = 'a' [FREE];
    else V = v(max({b_i})) [FORCED].
    The leader then proposes on j's behalf: «I am j, I propose V».
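The FREE/FORCED rule above, written out as a small illustrative function (the helper name and argument shapes are assumptions of this sketch):

    def choose_phase2a_value(phase1b_answers, majority_size):
        """phase1b_answers: one entry per responding acceptor, either None
        (case 2: it never accepted anything for this RM) or a (ballot, value)
        pair (case 1). Returns the value the leader must propose in phase 2a,
        or None if fewer than |MS| acceptors answered."""
        if len(phase1b_answers) < majority_size:
            return None
        accepted = [a for a in phase1b_answers if a is not None]
        if not accepted:
            return "a"                                   # FREE: abort on the silent RM's behalf
        return max(accepted, key=lambda ba: ba[0])[1]    # FORCED: value of the highest ballot

    print(choose_phase2a_value([None, None, None], majority_size=3))          # 'a' (free)
    print(choose_phase2a_value([None, (1, "p"), (3, "p")], majority_size=3))  # 'p' (forced)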
9. [T2] What if the leader fails?
  • If the leader fails, some leader-(s)election algorithm is executed. A faulty election (2+ leaders) does not compromise safety (unlike 3PC), but it can impede progress…
  • Non-terminating example: an infinite sequence of p1a-p1b-p2a messages from two leaders, L1 and L2, each opening a higher ballot (b1 > 0, b2 > b1, b3 > b2, b4 > b3, …) so that the majority set MS alternates between trusting one leader and ignoring the other's pending proposal.
  • Not really likely to happen.
  • It can be avoided (random T? see the sketch below).
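One way to realize the slide's "random T?" hint is a randomized back-off before a would-be leader opens a higher ballot; this is only an illustration of the idea, not part of the protocol as specified:

    import random
    import time

    def start_next_ballot(current_ballot, max_delay=0.05):
        """Wait a random amount of time before trying a higher ballot, so two
        competing leaders are unlikely to keep pre-empting each other forever."""
        time.sleep(random.uniform(0, max_delay))
        return current_ballot + 1

    print(start_next_ballot(4))  # 5, after a short randomized pause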
10. Optimizations for Paxos Commit (1)
  • Co-location: each acceptor is on the same node as an RM, and the initiating RM is on the same node as the initial leader.
    • -1 message phase (BeginCommit), -(F+2) messages
  • "Real-time assumptions": RMs can prepare spontaneously. The Prepare phase is no longer needed; the RMs just "know" they have to prepare within some amount of time.
    • -1 message phase (Prepare), -(N-1) messages
  [Diagrams: with co-location (RM0/AC0/L, RM1/AC1, RM2/AC2, plus RM3 and RM4), RM0 hands its p2a to the co-located leader and the separate BeginCommit message disappears; with spontaneous preparation, the (N-1) prepare messages are not needed anymore. A helper restating these savings follows below.]
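Restating the savings listed above as a tiny helper, with the numbers taken directly from the slide (the function itself is just for illustration):

    def optimization_savings(N, F):
        """Message phases and messages saved by each optimization, per the slide."""
        return {
            "co-location":         {"message phases": 1, "messages": F + 2},
            "spontaneous prepare": {"message phases": 1, "messages": N - 1},
        }

    print(optimization_savings(N=5, F=2))
    # {'co-location': {'message phases': 1, 'messages': 4},
    #  'spontaneous prepare': {'message phases': 1, 'messages': 4}}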
11. Optimizations for Paxos Commit (2)
  • Phase 3 elimination: the acceptors send their phase2b messages (the columns of aState) directly to the RMs, which evaluate the global commit condition themselves.
    • Paxos Commit + Phase 3 Elimination = Faster Paxos Commit (FPC)
    • FPC + Co-location + R.T.A. = optimal consensus algorithm
  [Diagram: instead of sending phase2b to the leader, which then sends phase3 to the RMs, each acceptor sends its phase2b column directly to every RM.]
12. Performance
  • If we deploy only one acceptor for Paxos Commit (F=0), its fault tolerance and cost are the same as 2PC's. Are they exactly the same protocol in that case? (See the helper below.)

                                  2PC                 Paxos Commit           Faster Paxos Commit
                                  Coloc.   No coloc.  Coloc.      No coloc.  Coloc.        No coloc.
  Message delays*                 3        4          4           5          3             4
  Messages*                       3N-3     3N-1       NF+3N-3     NF+F+3N-1  2FN-2F+3N-3   2NF+3N-1
  Stable storage write delays**   2                   2                      2
  Stable storage writes**         N+1                 N+F+1                  N+F+1

  *  Not assuming the RMs' concurrent preparation (slides-like scenario)
  ** Assuming the RMs' concurrent preparation (real-time constraints needed)
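For convenience, the message-count formulas from the table can be evaluated for a concrete configuration (a small helper, not part of the slides). Note that with F=0 all three columns coincide, which is exactly what the question above hints at:

    def message_counts(N, F):
        """The message-count formulas from the table above, for easy comparison."""
        return {
            "2PC":                 {"coloc": 3*N - 3,               "no coloc": 3*N - 1},
            "Paxos Commit":        {"coloc": N*F + 3*N - 3,         "no coloc": N*F + F + 3*N - 1},
            "Faster Paxos Commit": {"coloc": 2*F*N - 2*F + 3*N - 3, "no coloc": 2*N*F + 3*N - 1},
        }

    for protocol, counts in message_counts(N=5, F=2).items():
        print(f"{protocol}: {counts}")
    # With F=0, every formula collapses to the 2PC column.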
13. Paxos Commit vs. 2PC
  • Yes, but…
  • …these are two slightly different versions of 2PC!
  [Diagrams: the 2PC message flow from Lamport and Gray's paper next to the 2PC taught in the course slides (TM, RM1, the other RMs), with the failure points T1 and T2 marked.]
14. Using a dynamic set of RMs
  • You add one process, the registrar, which acts just like another resource manager, except for the following:
    • RMs can join the transaction until the commit protocol begins.
    • The global commit condition now holds on the set of resource managers proposed by the registrar and decided in its own instance of Paxos (see the sketch below).
  [Diagram: RM1, RM2 and RM3 send join messages to the registrar (REG); the registrar's own Paxos instance, run through the majority set MS of acceptors AC1–AC5, decides the participant set "RM1;RM2;RM3", alongside the usual per-RM instances ('a' for RM1, 'p' for RM2 and RM3); phase2b messages still carry acc, rm, b, p.]
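A sketch of how the decision changes with dynamic membership (names and argument shapes are illustrative): the commit condition now ranges over whatever set of RMs the registrar's own instance decided, rather than a fixed N:

    def commit_with_registrar(decided_rm_set, prepared_votes, F):
        """decided_rm_set: the RM ids decided in the registrar's own Paxos instance.
        prepared_votes: dict rm -> number of acceptors that reported 'p' for that
        RM's instance within a single ballot. Commit iff every RM in the decided
        set has at least F+1 such votes."""
        return all(prepared_votes.get(rm, 0) >= F + 1 for rm in decided_rm_set)

    # RM3 joined late but is in the registrar's decided set, so its votes count too.
    print(commit_with_registrar({"RM1", "RM2", "RM3"},
                                {"RM1": 2, "RM2": 2, "RM3": 2}, F=1))  # True
    print(commit_with_registrar({"RM1", "RM2", "RM3"},
                                {"RM1": 2, "RM2": 2}, F=1))            # False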
15. Thank You! Questions?
