http://research.microsoft.com/users/lamport/pubs/pubs.html#paxos-commit


- 2. Agenda
  - Paxos Commit Algorithm: Overview
  - The participating processes
    - The resource managers
    - The leader
    - The acceptors
  - Paxos Commit Algorithm: the base version
  - Failure scenarios
  - Optimizations for Paxos Commit
  - Performance
  - Paxos Commit vs. Two-Phase Commit
  - Using a dynamic set of resource managers
- 3. Paxos Commit Algorithm: Overview
  - Paxos was applied to transaction commit by L. Lamport and Jim Gray in "Consensus on Transaction Commit"
  - One instance of Paxos (the consensus algorithm) is executed for each resource manager, in order to agree upon a value (Prepared/Aborted) proposed by it
  - "Not-synchronous" commit algorithm
  - Fault-tolerant (unlike 2PC)
    - Intended to be used in systems where failures are fail-stop only, for both processes and network
  - Safety is guaranteed (unlike 3PC)
  - Formally specified and checked
  - Can be optimized to the theoretically best performance
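The per-RM structure above can be sketched in a few lines. This is my own minimal illustration, not the slides' protocol: assume each RM's Paxos instance has already chosen a value, and the transaction commits iff every instance chose 'p' (Prepared).

```python
# Illustrative sketch (names are mine): Paxos Commit runs one consensus
# instance per resource manager; the global outcome is Commit only when
# every instance chooses 'p' (Prepared), otherwise Abort.

def global_decision(chosen_values):
    """chosen_values: the value ('p' or 'a') chosen by each RM's
    Paxos instance. Commit iff all N instances chose 'p'."""
    return "commit" if all(v == "p" for v in chosen_values) else "abort"

print(global_decision(["p", "p", "p"]))  # commit
print(global_decision(["p", "a", "p"]))  # abort
```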
- 4. Participants: the resource managers
  - N resource managers ("RM") execute the distributed transaction, then each chooses a value ("locally chosen value" or "LCV"; 'p' for prepared iff it is willing to commit)
  - Every RM tries to get its LCV accepted by a majority set of acceptors ("MS": any subset with a cardinality strictly greater than half of the total)
  - Each RM is the first proposer in its own instance of Paxos

  Participants: the leader
  - Coordinates the commit algorithm
  - All the instances of Paxos share the same leader
  - It is not a single point of failure (unlike 2PC)
  - Assumed always defined (true: many leader-(s)election algorithms exist) and unique (not necessarily true, but unlike 3PC, safety does not rely on it)
- 5. Participants: the acceptors
  - A denotes the set of acceptors
  - All the instances of Paxos share the same set A of acceptors
  - 2F+1 acceptors are involved in order to achieve tolerance to F failures
  - We will consider only F+1 acceptors, leaving F more for "spare" purposes (less communication overhead)
  - Each acceptor keeps track of its own progress in an Nx1 vector
  - The vectors need to be merged into an Nx|MS| table, called aState, in order to take the global decision (we want "many" p's)

  [Diagram: RM1 (a), RM2 (p), RM3 (p) each run their own Paxos instance against the consensus box of acceptors AC1-AC5 (the majority set MS); the acceptors' Nx1 vectors (1st, 2nd, 3rd instance) are merged into the aState table.]
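The 2F+1 sizing above follows from simple majority arithmetic; a quick check (my own illustration, not from the slides):

```python
# With 2F+1 acceptors, a majority set MS needs F+1 members (strictly more
# than half), so up to F acceptors can fail while a full majority remains.

def majority_size(num_acceptors):
    """Smallest cardinality strictly greater than half of the total."""
    return num_acceptors // 2 + 1

F = 2
acceptors = 2 * F + 1                       # 5 acceptors
print(majority_size(acceptors))             # 3, i.e. F+1
# Even after F failures, the survivors still form a majority set:
assert acceptors - F >= majority_size(acceptors)
```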
- 6. Paxos Commit (base)
  [Message-flow diagram, N=5, F=2, with log writes marked: RM0 starts the protocol with one BeginCommit message to the leader L, together with its own p2a ⟨0, 0, v(0)⟩; the leader sends (N-1) prepare messages to the other RMs; each RM rm sends its p2a ⟨rm, 0, v(rm)⟩ to the acceptors AC0-AC2 (N(F+1)-1 messages); the acceptors answer with p2b ⟨acc, rm, 0, v(rm)⟩ messages (optionally F more); the leader then sends N p3 messages: if (Global Commit) then commit p3, else abort p3. The protocol is not blocked iff F acceptors respond. T1 and T2 mark the timeout points discussed in the next two slides.]
- 7. Global Commit Condition
  - Evaluated over the p2b messages ⟨acc, rm, b, p⟩ collected in aState
  - That is: there must be one and only one row for each RM involved in the commitment; in each of those rows there must be at least F+1 entries that have 'p' as a value and refer to the same ballot
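The condition can be checked mechanically. A hedged sketch (the data layout and names are mine, not the slides'): aState is modeled as one row per RM, each row holding the (ballot, value) pairs reported by the acceptors in the majority set for that RM.

```python
# Illustrative check of the global commit condition: every RM's row must
# contain at least F+1 entries with value 'p' agreeing on the same ballot.
from collections import Counter

def can_commit(a_state, F):
    """a_state: list of rows, one per RM; each row is a list of
    (ballot, value) pairs from the acceptors in the majority set."""
    for row in a_state:
        prepared_per_ballot = Counter(b for (b, v) in row if v == "p")
        if not prepared_per_ballot or max(prepared_per_ballot.values()) < F + 1:
            return False
    return True

F = 1
a_state = [
    [(0, "p"), (0, "p")],   # RM1: F+1 'p' entries on the same ballot
    [(0, "p"), (1, "p")],   # RM2: 'p' entries, but on different ballots
]
print(can_commit(a_state, F))   # False: RM2 fails the same-ballot requirement
```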
- 8. [T1] What if some RMs do not submit their LCV?
  The leader runs phase 1 of Paxos for RM j's instance, at a ballot b_L1 > 0, against one majority of acceptors:
  - p1a ("prepare?") — Leader: «Has resource manager j ever proposed you a value?» (the acceptors promise not to answer any ballot b_L2 < b_L1)
  - p1b ("promise") — Acceptor i answers either:
    (1) «Yes, in my last session (ballot) b_i with it I accepted its proposal v_i», or
    (2) «No, never»
  - p2a ("accept?") — If at least |MS| acceptors answered: if case (2) holds for ALL of them, then V = 'a' [FREE]; else V = v(maximum({b_i})) [FORCED]. Leader: «I am j, I propose V»
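The FREE/FORCED choice above is the heart of the recovery step. A minimal sketch, with my own encoding of the p1b responses (None for case 2, a (ballot, value) pair for case 1):

```python
# Sketch of the leader's phase-2a value selection after collecting p1b
# responses from a majority of acceptors. If no acceptor ever accepted a
# value for this RM, the leader is FREE to propose 'a' (abort); otherwise
# it is FORCED to propose the value accepted at the highest ballot.

def choose_phase2_value(p1b_responses):
    """p1b_responses: one entry per responding acceptor; None means
    case (2) 'never accepted', (ballot, value) means case (1)."""
    accepted = [r for r in p1b_responses if r is not None]
    if not accepted:
        return "a"           # FREE: safe to propose abort
    return max(accepted)[1]  # FORCED: value of the maximum ballot b_i

print(choose_phase2_value([None, None, None]))          # 'a'
print(choose_phase2_value([None, (2, "p"), (1, "a")]))  # 'p'
```

Tuples compare ballot-first here, so `max(accepted)` picks the highest b_i directly.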
- 9. [T2] What if the leader fails?
  - If the leader fails, some leader-(s)election algorithm is executed. A faulty election (2+ leaders) doesn't preclude safety (unlike 3PC), but can impede progress...
  - Non-terminating example: an infinite sequence of p1a-p1b-p2a messages from 2 leaders L1 and L2, each starting a higher ballot (b_2 > b_1, b_3 > b_2, b_4 > b_3, ...), so the majority set MS alternately trusts one leader and ignores the other
  - Not really likely to happen
  - It can be avoided (random timeouts?)
- 10. Optimizations for Paxos Commit (1)
  - Co-location: each acceptor is on the same node as an RM, and the initiating RM is on the same node as the initial leader
    - -1 message phase (BeginCommit), -(F+2) messages
  - "Real-time assumptions": RMs can prepare spontaneously. The prepare phase is not needed anymore; RMs just "know" they have to prepare within some amount of time
    - -1 message phase (Prepare), -(N-1) messages

  [Diagrams: with co-location, BeginCommit stays local to the RM0/AC0/L node and each p2a flows between co-located RM/acceptor pairs; with real-time assumptions, the (N-1) prepare messages are not needed anymore.]
- 11. Optimizations for Paxos Commit (2)
  - Phase 3 elimination: the acceptors send their phase 2b messages (the columns of aState) directly to the RMs, which evaluate the global commit condition themselves
    - Paxos Commit + Phase 3 Elimination = Faster Paxos Commit (FPC)
    - FPC + Co-location + R.T.A. = Optimal Consensus Algorithm

  [Diagram: the p2b messages go straight from AC0-AC2 to RM0-RM4, replacing the leader's p3 broadcast.]
- 12. Performance
  - If we deploy only one acceptor for Paxos Commit (F=0), its fault tolerance and cost are the same as 2PC's. Are they exactly the same protocol in that case?

                              2PC                  Paxos Commit           Faster Paxos Commit
                              No coloc.  Coloc.    No coloc.    Coloc.    No coloc.    Coloc.
    Message delays*           4          3         5            4         4            3
    Messages*                 3N-1       3N-3      NF+F+3N-1    NF+3N-3   2NF+3N-1     2FN-2F+3N-3
    Stable storage writes**   N+1                  N+F+1                  N+F+1
    Stable storage
    write delays**            2                    2                      2

  *Not assuming RMs' concurrent preparation (slides-like scenario)
  **Assuming RMs' concurrent preparation (real-time constraints needed)
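Plugging concrete values into the message-count formulas makes the F=0 claim easy to check. A small sketch using the "no co-location" column of the table (function names are mine):

```python
# Message counts from the comparison table, no co-location:
#   2PC:          3N-1
#   Paxos Commit: NF+F+3N-1   (collapses to 3N-1 when F=0)

def msgs_2pc(n):
    return 3 * n - 1

def msgs_paxos_commit(n, f):
    return n * f + f + 3 * n - 1

N = 5
print(msgs_2pc(N))              # 14
print(msgs_paxos_commit(N, 0))  # 14: same cost as 2PC when F = 0
print(msgs_paxos_commit(N, 2))  # 26: the price of tolerating 2 failures
```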
- 13. Paxos Commit vs. 2PC
  - Yes, but...
  - ...two slightly different versions of 2PC!

  [Diagrams: the 2PC from Lamport and Gray's paper vs. the 2PC from the slides of the course, showing the TM, RM1 and the other RMs, with timeout points T1 and T2.]
- 14. Using a dynamic set of RMs
  - You add one process, the registrar, that acts just like another resource manager, despite the following:
  - RMs can join the transaction until the Commit Protocol begins
  - The global commit condition now holds on the set of resource managers proposed by the registrar and decided in its own instance of Paxos, evaluated over the p2b messages ⟨acc, rm, b, p⟩

  [Diagram: RM1 (a), RM2 (p), RM3 (p) send join messages to the registrar REG; the registrar's own Paxos instance, run against acceptors AC1-AC5 (MS), decides the set RM1;RM2;RM3.]
- 15. Thank You! Questions?
