SlideShare a Scribd company logo
1 of 111
Download to read offline
Distributed Systems Consensus
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology
(Tehran Polytechnic)
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 1 / 56
What is the Problem?
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 2 / 56
Two Generalsā€™ Problem
Two generals need to be agree on time to attack to win.
They communicate through messengers, who may be killed on their
way.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 3 / 56
Two Generalsā€™ Problem
Two generals need to be agree on time to attack to win.
They communicate through messengers, who may be killed on their
way.
Agreement is the problem.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 3 / 56
Replicated State Machine Problem (1/2)
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 4 / 56
Replicated State Machine Problem (1/2)
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 4 / 56
Replicated State Machine Problem (1/2)
The solution: replicate the server.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 4 / 56
Replicated State Machine Problem (2/2)
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 5 / 56
Replicated State Machine Problem (2/2)
Make the server deterministic (state machine).
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 5 / 56
Replicated State Machine Problem (2/2)
Make the server deterministic (state machine).
Replicate the server.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 5 / 56
Replicated State Machine Problem (2/2)
Make the server deterministic (state machine).
Replicate the server.
Ensure correct replicas step through
the same sequence of state
transitions (How?)
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 5 / 56
Replicated State Machine Problem (2/2)
Make the server deterministic (state machine).
Replicate the server.
Ensure correct replicas step through
the same sequence of state
transitions (How?)
Agreement is the problem.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 5 / 56
The Agreement Problem
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 6 / 56
The Agreement Problem
Some nodes propose values (or actions) by sending them to the
others.
All nodes must decide whether to accept or reject those values.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 7 / 56
The Agreement Problem
Some nodes propose values (or actions) by sending them to the
others.
All nodes must decide whether to accept or reject those values.
But, ...
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 7 / 56
The Agreement Problem
Some nodes propose values (or actions) by sending them to the
others.
All nodes must decide whether to accept or reject those values.
But, ...
Concurrent processes and uncertainty of timing, order of events and
inputs.
Failure and recovery of machines/processors, of communication
channels.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 7 / 56
Agreement Requirements
Safety
Liveness
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 8 / 56
Agreement Requirements
Safety
ā€¢ Validity: only a value that has been proposed may be chosen.
Liveness
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 8 / 56
Agreement Requirements
Safety
ā€¢ Validity: only a value that has been proposed may be chosen.
ā€¢ Agreement: no two correct nodes choose diļ¬€erent values.
Liveness
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 8 / 56
Agreement Requirements
Safety
ā€¢ Validity: only a value that has been proposed may be chosen.
ā€¢ Agreement: no two correct nodes choose diļ¬€erent values.
ā€¢ Integrity: a node chooses at most once.
Liveness
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 8 / 56
Agreement Requirements
Safety
ā€¢ Validity: only a value that has been proposed may be chosen.
ā€¢ Agreement: no two correct nodes choose diļ¬€erent values.
ā€¢ Integrity: a node chooses at most once.
Liveness
ā€¢ Termination: every correct node eventually choose a value.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 8 / 56
Agreement in Distributed Systems: Possible Solutions
Two-Phase Commit (2PC)
Paxos
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 9 / 56
Two-Phase Commit (2PC)
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 10 / 56
The Two-Phase Commit (2PC) Problem
The problem ļ¬rst was encountered in database systems.
Suppose a database system is updating some complicated data
structures that include parts residing on more than one machine.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 11 / 56
The Two-Phase Commit (2PC) Problem
The problem ļ¬rst was encountered in database systems.
Suppose a database system is updating some complicated data
structures that include parts residing on more than one machine.
System model:
ā€¢ Concurrent processes and uncertainty of timing, order of events and
inputs (asynchronous systems).
ā€¢ Failure and recovery of machines/processors, of communication
channels.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 11 / 56
Intuitive Example (1/3)
You want to organize outing with 3 friends at 6pm Tuesday.
ā€¢ Go out only if all friends can make it.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 12 / 56
Intuitive Example (2/3)
What do you do?
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 13 / 56
Intuitive Example (2/3)
What do you do?
ā€¢ Call each of them and ask if can do 6pm on Tuesday (voting phase)
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 13 / 56
Intuitive Example (2/3)
What do you do?
ā€¢ Call each of them and ask if can do 6pm on Tuesday (voting phase)
ā€¢ If all can do Tuesday, call each friend back to ACK (commit)
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 13 / 56
Intuitive Example (2/3)
What do you do?
ā€¢ Call each of them and ask if can do 6pm on Tuesday (voting phase)
ā€¢ If all can do Tuesday, call each friend back to ACK (commit)
ā€¢ If one cannot do Tuesday, call other three to cancel (abort)
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 13 / 56
Intuitive Example (3/3)
Critical details
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 14 / 56
Intuitive Example (3/3)
Critical details
ā€¢ While you were calling everyone to ask, people who have promised
they can do 6pm Tuesday must reserve that slot.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 14 / 56
Intuitive Example (3/3)
Critical details
ā€¢ While you were calling everyone to ask, people who have promised
they can do 6pm Tuesday must reserve that slot.
ā€¢ You need to remember the decision and tell anyone whom you have
not been able to reach during commit/abort phase.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 14 / 56
Intuitive Example (3/3)
Critical details
ā€¢ While you were calling everyone to ask, people who have promised
they can do 6pm Tuesday must reserve that slot.
ā€¢ You need to remember the decision and tell anyone whom you have
not been able to reach during commit/abort phase.
That is exactly how 2PC works.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 14 / 56
The 2PC Players
Coordinator (Transaction Manager)
ā€¢ Begins transaction.
ā€¢ Responsible for commit/abort.
Participants (Resource Managers)
ā€¢ The servers with the data used in the distributed transaction.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 15 / 56
The 2PC Algorithm
Phase 1 - prepare phase
Phase 2 - commit phase
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 16 / 56
The 2PC Algorithm - Prepare Phase
Coordinator asks each participant canCommit.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 17 / 56
The 2PC Algorithm - Prepare Phase
Coordinator asks each participant canCommit.
Participants must prepare to commit using permanent storage
before answering yes.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 17 / 56
The 2PC Algorithm - Prepare Phase
Coordinator asks each participant canCommit.
Participants must prepare to commit using permanent storage
before answering yes.
ā€¢ Lock the objects.
ā€¢ Participants are not allowed to cause an abort after it replies yes to
canCommit.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 17 / 56
The 2PC Algorithm - Prepare Phase
Coordinator asks each participant canCommit.
Participants must prepare to commit using permanent storage
before answering yes.
ā€¢ Lock the objects.
ā€¢ Participants are not allowed to cause an abort after it replies yes to
canCommit.
Outcome of the transaction is uncertain until doCommit or
doAbort.
ā€¢ Other participants might still cause an abort.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 17 / 56
The 2PC Algorithm - Commit Phase
The coordinator collects all votes.
ā€¢ If unanimous yes, causes commit.
ā€¢ If any participant voted no, causes abort.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 18 / 56
The 2PC Algorithm - Commit Phase
The coordinator collects all votes.
ā€¢ If unanimous yes, causes commit.
ā€¢ If any participant voted no, causes abort.
The fate of the transaction is decided atomically at the
coordinator, once all participants vote.
ā€¢ Coordinator records fate using permanent storage.
ā€¢ Then broadcasts doCommit or doAbort to participants.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 18 / 56
2PC Sequence of Events
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 19 / 56
Recovery in 2PC
Recovery after timeouts.
Recovery after crashes and reboot.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 20 / 56
Recovery in 2PC
Recovery after timeouts.
Recovery after crashes and reboot.
Note: you cannot diļ¬€erentiate between the above in a realistic
asynchronous network.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 20 / 56
Handling Timeout
To avoid processes blocking for ever.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 21 / 56
Handling Timeout
To avoid processes blocking for ever.
Two scenarios:
ā€¢ Coordinator waits for votes from participants.
ā€¢ Participant is waiting for the ļ¬nal decision.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 21 / 56
Handling Timeout at Coordinator
If B voted no, can coordinator unilaterally abort?
If B voted yes, can coordinator unilaterally abort/commit?
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 22 / 56
Handling Timeout at Coordinator
If B voted no, can coordinator unilaterally abort?
If B voted yes, can coordinator unilaterally abort/commit?
ā€¢ Coordinator waits for votes from participants.
ā€¢ Participant is waiting for the ļ¬nal decision.
ā€¢ Coordinator timeout abort and send
doAbort to participants.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 22 / 56
Handling Timeout at Participants
If B times out on TM and has voted yes, then execute termination
protocol.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
Handling Timeout at Participants
If B times out on TM and has voted yes, then execute termination
protocol.
Simple protocol: participant is remained blocked
until it can establish communication with coordinator.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
Handling Timeout at Participants
If B times out on TM and has voted yes, then execute termination
protocol.
Simple protocol: participant is remained blocked
until it can establish communication with coordinator.
Cooperative protocol: participant sends a decision-request message
to other participants.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
Handling Timeout at Participants
If B times out on TM and has voted yes, then execute termination
protocol.
Simple protocol: participant is remained blocked
until it can establish communication with coordinator.
Cooperative protocol: participant sends a decision-request message
to other participants.
ā€¢ B sends status message to A
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
Handling Timeout at Participants
If B times out on TM and has voted yes, then execute termination
protocol.
Simple protocol: participant is remained blocked
until it can establish communication with coordinator.
Cooperative protocol: participant sends a decision-request message
to other participants.
ā€¢ B sends status message to A
ā€¢ If A has received commit/abort from TM, ...
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
Handling Timeout at Participants
If B times out on TM and has voted yes, then execute termination
protocol.
Simple protocol: participant is remained blocked
until it can establish communication with coordinator.
Cooperative protocol: participant sends a decision-request message
to other participants.
ā€¢ B sends status message to A
ā€¢ If A has received commit/abort from TM, ...
ā€¢ If A has not responded to TM, ...
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
Handling Timeout at Participants
If B times out on TM and has voted yes, then execute termination
protocol.
Simple protocol: participant is remained blocked
until it can establish communication with coordinator.
Cooperative protocol: participant sends a decision-request message
to other participants.
ā€¢ B sends status message to A
ā€¢ If A has received commit/abort from TM, ...
ā€¢ If A has not responded to TM, ...
ā€¢ If A has responded with no/yes ...
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
Handling Crash and Recovery (1/2)
All nodes must log protocol progress.
ā€¢ Participants: prepared, uncertain, committed/aborted
ā€¢ Coordinator: prepared, committed/aborted, done
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 24 / 56
Handling Crash and Recovery (1/2)
All nodes must log protocol progress.
ā€¢ Participants: prepared, uncertain, committed/aborted
ā€¢ Coordinator: prepared, committed/aborted, done
Nodes cannot back out if commit is decided.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 24 / 56
Handling Crash and Recovery (2/2)
Coordinator crashes:
ā€¢ If it ļ¬nds no commit on disk, it aborts.
ā€¢ If it ļ¬nds commit, it commits.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 25 / 56
Handling Crash and Recovery (2/2)
Coordinator crashes:
ā€¢ If it ļ¬nds no commit on disk, it aborts.
ā€¢ If it ļ¬nds commit, it commits.
Participant crashes:
ā€¢ If it ļ¬nds no yes on disk, it aborts.
ā€¢ If it ļ¬nds yes, runs termination protocol to decide.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 25 / 56
Fault-Tolerance Limitations of 2PC
Even with recovery enabled, 2PC is not really fault-tolerant (or
live), because it can be blocked even when one (or a few)
machines fail.
Blocking means that it does not make progress during the failures.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 26 / 56
Fault-Tolerance Limitations of 2PC
Even with recovery enabled, 2PC is not really fault-tolerant (or
live), because it can be blocked even when one (or a few)
machines fail.
Blocking means that it does not make progress during the failures.
Any scenarios?
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 26 / 56
2PC Blocking Scenario
TM sends doCommit to A, A gets it and commits, and then both
TM and A die.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 27 / 56
2PC Blocking Scenario
TM sends doCommit to A, A gets it and commits, and then both
TM and A die.
B, C, D have already also replied yes, have locked their mutexes,
and now need to wait for TM or A to reappear.
ā€¢ They cannot recover the decision with certainty until TM or A are
online.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 27 / 56
2PC Blocking Scenario
TM sends doCommit to A, A gets it and commits, and then both
TM and A die.
B, C, D have already also replied yes, have locked their mutexes,
and now need to wait for TM or A to reappear.
ā€¢ They cannot recover the decision with certainty until TM or A are
online.
This is why 2PC is called a blocking protocol:
2PC is safe, but not live.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 27 / 56
Impossibility of
Distributed Consensus
with One Faulty Process
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 28 / 56
FLP
Fischer-Lynch-Paterson (FLP)
ā€¢ M.J. Fischer, N.A. Lynch, and M.S. Paterson, Impossibility of distributed
consensus with one faulty process, Journal of the ACM, 1985.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 29 / 56
FLP Impossibility Result
It is impossible for a set of processors in an asynchronous system
to agree on a binary value, even if only a single process is subject
to an unannounced failure.
The core of the problem is asynchrony.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 30 / 56
FLP Impossibility Result
What FLP says: you cannot guarantee both safety and progress
when there is even a single fault at an inopportune moment.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 31 / 56
FLP Impossibility Result
What FLP says: you cannot guarantee both safety and progress
when there is even a single fault at an inopportune moment.
What FLP does not say: in practice, how close can you get to the
ideal (always safe and live)?
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 31 / 56
FLP Impossibility Result
What FLP says: you cannot guarantee both safety and progress
when there is even a single fault at an inopportune moment.
What FLP does not say: in practice, how close can you get to the
ideal (always safe and live)?
So, Paxos ...
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 31 / 56
Paxos
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 32 / 56
Paxos
The only known completely-safe and largely-live agreement
protocol.
L. Lamport, The part-time parliament, ACM Transactions on
Computer Systems, 1998.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 33 / 56
The Paxos Players
Proposers
ā€¢ Suggests values for consideration by acceptors.
Acceptors
ā€¢ Considers the values proposed by proposers.
ā€¢ Renders an accept/reject decision.
Learners
ā€¢ Learns the chosen value.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 34 / 56
The Paxos Players
Proposers
ā€¢ Suggests values for consideration by acceptors.
Acceptors
ā€¢ Considers the values proposed by proposers.
ā€¢ Renders an accept/reject decision.
Learners
ā€¢ Learns the chosen value.
A node can act as more than one roles (usually 3).
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 34 / 56
Single Proposal, Single Acceptor
Use just one acceptor
ā€¢ Collects proposersā€™ proposals.
ā€¢ Decides the value and tells everyone else.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 35 / 56
Single Proposal, Single Acceptor
Use just one acceptor
ā€¢ Collects proposersā€™ proposals.
ā€¢ Decides the value and tells everyone else.
Sounds familiar?
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 35 / 56
Single Proposal, Single Acceptor
Use just one acceptor
ā€¢ Collects proposersā€™ proposals.
ā€¢ Decides the value and tells everyone else.
Sounds familiar?
ā€¢ two-phase commit (2PC)
ā€¢ acceptor fails = protocol blocks
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 35 / 56
Single Proposal, Multiple Acceptors
One acceptor is not fault-tolerant enough.
Letā€™s have multiple acceptors.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 36 / 56
Single Proposal, Multiple Acceptors
One acceptor is not fault-tolerant enough.
Letā€™s have multiple acceptors.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 36 / 56
Single Proposal, Multiple Acceptors
One acceptor is not fault-tolerant enough.
Letā€™s have multiple acceptors.
From there, must reach a decision. How?
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 36 / 56
Single Proposal, Multiple Acceptors
One acceptor is not fault-tolerant enough.
Letā€™s have multiple acceptors.
From there, must reach a decision. How?
Decision = value accepted by the majority.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 36 / 56
Single Proposal, Multiple Acceptors
One acceptor is not fault-tolerant enough.
Letā€™s have multiple acceptors.
From there, must reach a decision. How?
Decision = value accepted by the majority.
P1: an acceptor must accept ļ¬rst proposal it receives.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 36 / 56
Multiple Proposals, Multiple Acceptors
If there are multiple proposals, no proposal may get the majority.
ā€¢ 3 proposals may each get 1/3 of the acceptors.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 37 / 56
Multiple Proposals, Multiple Acceptors
If there are multiple proposals, no proposal may get the majority.
ā€¢ 3 proposals may each get 1/3 of the acceptors.
Solution: acceptors can accept multiple proposals, distinguished by
a unique proposal number.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 37 / 56
Handling Multiple Proposals
All chosen proposals must have the same value.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 38 / 56
Handling Multiple Proposals
All chosen proposals must have the same value.
P2: If a proposal with value v is chosen, then every higher-numbered
proposal that is chosen also has value v.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 38 / 56
Handling Multiple Proposals
All chosen proposals must have the same value.
P2: If a proposal with value v is chosen, then every higher-numbered
proposal that is chosen also has value v.
ā€¢ P2a: ... accepted ...
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 38 / 56
Handling Multiple Proposals
All chosen proposals must have the same value.
P2: If a proposal with value v is chosen, then every higher-numbered
proposal that is chosen also has value v.
ā€¢ P2a: ... accepted ...
ā€¢ P2b: ... proposed ...
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 38 / 56
The Paxos Algorithm
Phase 1a - prepare phase
Phase 1b - promise phase
Phase 2a - accept phase
Phase 2b - accepted phase
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 39 / 56
Paxos Algorithm
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 40 / 56
Paxos Algorithm - Prepare Phase
A proposer selects a proposal number n and sends a prepare
request with number n to majority of acceptors.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 41 / 56
Paxos Algorithm - Promise Phase
If an acceptor receives a prepare request with number n greater
than that of any prepare request it saw
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 42 / 56
Paxos Algorithm - Promise Phase
If an acceptor receives a prepare request with number n greater
than that of any prepare request it saw
ā€¢ It responses yes to that request with a promise not to accept any
more proposals numbered less than n.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 42 / 56
Paxos Algorithm - Promise Phase
If an acceptor receives a prepare request with number n greater
than that of any prepare request it saw
ā€¢ It responses yes to that request with a promise not to accept any
more proposals numbered less than n.
ā€¢ It includes the highest-numbered proposal (if any) that it has
accepted.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 42 / 56
Paxos Algorithm - Accept Phase
If the proposer receives a response yes to its prepare requests from
a majority of acceptors
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 43 / 56
Paxos Algorithm - Accept Phase
If the proposer receives a response yes to its prepare requests from
a majority of acceptors
ā€¢ It sends an accept request to each of those acceptors for a
proposal numbered n with a values v, which is the value of the
highest-numbered proposal among the responses.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 43 / 56
Paxos Algorithm - Accepted Phase
If an acceptor receives an accept request for a proposal numbered
n
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 44 / 56
Paxos Algorithm - Accepted Phase
If an acceptor receives an accept request for a proposal numbered
n
ā€¢ It accepts the proposal unless it has already responded to a
prepare request having a number greater than n.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 44 / 56
Deļ¬nition of Chosen
A value is chosen at proposal number n, iļ¬€ majority of acceptors
accept that value in phase 2 of the proposal number.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 45 / 56
Paxos Example
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 46 / 56
Paxos Example
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 47 / 56
Paxos - Safety (1/3)
If a value v is chosen at proposal number n, any value that is sent
out in phase 2 of any later proposal numbers must be also v.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 48 / 56
Paxos - Safety (2/3)
Decision = Majority (any two majorities share at least one element)
Therefore after the ļ¬rst round in which there is a decision, any
subsequent round involves at least one acceptor that has accepted
v.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 49 / 56
Paxos - Safety (3/3)
Now suppose our claim is not true, and let m is the ļ¬rst proposal
number that is later than n and in 2nd phase, the value sent out is
w = v.
This is not possible, because if the proposer P was able to start
2nd phase for w, it means it got a majority to accept round for m
(for m > n). So, either:
ā€¢ v would not have been the value decided, or
ā€¢ v would have been proposed by P
Therefore, once a majority accepts v, that never changes.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 50 / 56
Liveness
If two or more proposers race to propose new values, they might
step on each other toes all the time.
ā€¢ P1: prepare(n1)
ā€¢ P2: prepare(n2)
ā€¢ P1: accept(n1, v1)
ā€¢ P2: accept(n2, v2)
ā€¢ P1: prepare(n3)
ā€¢ P2: prepare(n4)
ā€¢ ...
ā€¢ n1 < n2 < n3 < n4 < Ā· Ā· Ā·
With randomness, this occurs exceedingly rarely.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 51 / 56
Paxos is Everywhere
Google: Chubby (Paxos-based distributed lock service)
ā€¢ Most Google services use Chubby directly or indirectly
Yahoo: Zookeeper (Paxos-based distributed lock service)
ā€¢ Zookeeper is open-source and integrates with Hadoop
UW: Scatter (Paxos-based consistent DHT)
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 52 / 56
Summary
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 53 / 56
Summary
Replicated State Machine
2PC: blocking
FLP impossibility
Paxos
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 54 / 56
References:
M.J. Fischer, N.A. Lynch, and M.S. Paterson, Impossibility of dis-
tributed consensus with one faulty process, Journal of the ACM,
1985.
L. Lamport, The part-time parliament, ACM Transactions on Com-
puter Systems, 1998.
L. Lamport, Paxos made simple, ACM Sigact News, 2001.
J. Gray, and L. Lamport, Consensus on transaction commit, ACM
Transactions on Database Systems, 2006.
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 55 / 56
Questions?
Acknowledgements
Some slides were derived from Jinyang Li slides (New York University).
Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 56 / 56

More Related Content

What's hot

17. PAXOS consensus Algorithm.pptx
17. PAXOS consensus Algorithm.pptx17. PAXOS consensus Algorithm.pptx
17. PAXOS consensus Algorithm.pptxMohan Kumar Ch
Ā 
Introduction to Blockchain and Smart Contracts
Introduction to Blockchain and Smart ContractsIntroduction to Blockchain and Smart Contracts
Introduction to Blockchain and Smart ContractsSaad Zaher
Ā 
System models in distributed system
System models in distributed systemSystem models in distributed system
System models in distributed systemishapadhy
Ā 
Apache Kafka and Blockchain - Comparison and a Kafka-native Implementation
Apache Kafka and Blockchain - Comparison and a Kafka-native ImplementationApache Kafka and Blockchain - Comparison and a Kafka-native Implementation
Apache Kafka and Blockchain - Comparison and a Kafka-native ImplementationKai WƤhner
Ā 
Ethereum 2.0
Ethereum 2.0Ethereum 2.0
Ethereum 2.0Gene Leybzon
Ā 
Consensus Algorithms.pptx
Consensus Algorithms.pptxConsensus Algorithms.pptx
Consensus Algorithms.pptxRajapriya82
Ā 
Best Practices for Streaming IoT Data with MQTT and Apache KafkaĀ®
Best Practices for Streaming IoT Data with MQTT and Apache KafkaĀ®Best Practices for Streaming IoT Data with MQTT and Apache KafkaĀ®
Best Practices for Streaming IoT Data with MQTT and Apache KafkaĀ®confluent
Ā 
PoW vs. PoS - Key Differences
PoW vs. PoS - Key DifferencesPoW vs. PoS - Key Differences
PoW vs. PoS - Key Differences101 Blockchains
Ā 
Blockchain: The New Technology and Its Applications for Libraries
Blockchain: The New Technology and Its Applications for LibrariesBlockchain: The New Technology and Its Applications for Libraries
Blockchain: The New Technology and Its Applications for LibrariesBohyun Kim
Ā 
Election algorithms
Election algorithmsElection algorithms
Election algorithmsAnkush Kumar
Ā 
Understanding hd wallets design and implementation
Understanding hd wallets  design and implementationUnderstanding hd wallets  design and implementation
Understanding hd wallets design and implementationArcBlock
Ā 
Keptn- A Cloud-native application life-cycle orchestration.pdf
Keptn- A Cloud-native application life-cycle orchestration.pdfKeptn- A Cloud-native application life-cycle orchestration.pdf
Keptn- A Cloud-native application life-cycle orchestration.pdfKnoldus Inc.
Ā 
Transactions and Concurrency Control
Transactions and Concurrency ControlTransactions and Concurrency Control
Transactions and Concurrency ControlDilum Bandara
Ā 
Solidity
SoliditySolidity
Soliditygavofyork
Ā 
Understanding Electronic Commerce
Understanding Electronic CommerceUnderstanding Electronic Commerce
Understanding Electronic CommerceTyler Hannan
Ā 

What's hot (20)

17. PAXOS consensus Algorithm.pptx
17. PAXOS consensus Algorithm.pptx17. PAXOS consensus Algorithm.pptx
17. PAXOS consensus Algorithm.pptx
Ā 
Replication in Distributed Systems
Replication in Distributed SystemsReplication in Distributed Systems
Replication in Distributed Systems
Ā 
Process Management-Process Migration
Process Management-Process MigrationProcess Management-Process Migration
Process Management-Process Migration
Ā 
Introduction to Blockchain and Smart Contracts
Introduction to Blockchain and Smart ContractsIntroduction to Blockchain and Smart Contracts
Introduction to Blockchain and Smart Contracts
Ā 
System models in distributed system
System models in distributed systemSystem models in distributed system
System models in distributed system
Ā 
cloud computing: Vm migration
cloud computing: Vm migrationcloud computing: Vm migration
cloud computing: Vm migration
Ā 
Apache Kafka and Blockchain - Comparison and a Kafka-native Implementation
Apache Kafka and Blockchain - Comparison and a Kafka-native ImplementationApache Kafka and Blockchain - Comparison and a Kafka-native Implementation
Apache Kafka and Blockchain - Comparison and a Kafka-native Implementation
Ā 
Highā€“Performance Computing
Highā€“Performance ComputingHighā€“Performance Computing
Highā€“Performance Computing
Ā 
24 Multithreaded Algorithms
24 Multithreaded Algorithms24 Multithreaded Algorithms
24 Multithreaded Algorithms
Ā 
Ethereum 2.0
Ethereum 2.0Ethereum 2.0
Ethereum 2.0
Ā 
Consensus Algorithms.pptx
Consensus Algorithms.pptxConsensus Algorithms.pptx
Consensus Algorithms.pptx
Ā 
Best Practices for Streaming IoT Data with MQTT and Apache KafkaĀ®
Best Practices for Streaming IoT Data with MQTT and Apache KafkaĀ®Best Practices for Streaming IoT Data with MQTT and Apache KafkaĀ®
Best Practices for Streaming IoT Data with MQTT and Apache KafkaĀ®
Ā 
PoW vs. PoS - Key Differences
PoW vs. PoS - Key DifferencesPoW vs. PoS - Key Differences
PoW vs. PoS - Key Differences
Ā 
Blockchain: The New Technology and Its Applications for Libraries
Blockchain: The New Technology and Its Applications for LibrariesBlockchain: The New Technology and Its Applications for Libraries
Blockchain: The New Technology and Its Applications for Libraries
Ā 
Election algorithms
Election algorithmsElection algorithms
Election algorithms
Ā 
Understanding hd wallets design and implementation
Understanding hd wallets  design and implementationUnderstanding hd wallets  design and implementation
Understanding hd wallets design and implementation
Ā 
Keptn- A Cloud-native application life-cycle orchestration.pdf
Keptn- A Cloud-native application life-cycle orchestration.pdfKeptn- A Cloud-native application life-cycle orchestration.pdf
Keptn- A Cloud-native application life-cycle orchestration.pdf
Ā 
Transactions and Concurrency Control
Transactions and Concurrency ControlTransactions and Concurrency Control
Transactions and Concurrency Control
Ā 
Solidity
SoliditySolidity
Solidity
Ā 
Understanding Electronic Commerce
Understanding Electronic CommerceUnderstanding Electronic Commerce
Understanding Electronic Commerce
Ā 

Similar to Paxos

Time, clocks and the ordering of events
Time, clocks and the ordering of eventsTime, clocks and the ordering of events
Time, clocks and the ordering of eventsAmir Payberah
Ā 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2Amir Payberah
Ā 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1Amir Payberah
Ā 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2Amir Payberah
Ā 
Gossip-based algorithms
Gossip-based algorithmsGossip-based algorithms
Gossip-based algorithmsAmir Payberah
Ā 
Virtual Memory - Part2
Virtual Memory - Part2Virtual Memory - Part2
Virtual Memory - Part2Amir Payberah
Ā 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and SpannerAmir Payberah
Ā 
Process Management - Part1
Process Management - Part1Process Management - Part1
Process Management - Part1Amir Payberah
Ā 
Mesos and YARN
Mesos and YARNMesos and YARN
Mesos and YARNAmir Payberah
Ā 
Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Amir Payberah
Ā 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Amir Payberah
Ā 
Introduction to stream processing
Introduction to stream processingIntroduction to stream processing
Introduction to stream processingAmir Payberah
Ā 
CPU Scheduling - Part1
CPU Scheduling - Part1CPU Scheduling - Part1
CPU Scheduling - Part1Amir Payberah
Ā 

Similar to Paxos (15)

Time, clocks and the ordering of events
Time, clocks and the ordering of eventsTime, clocks and the ordering of events
Time, clocks and the ordering of events
Ā 
Deadlocks
DeadlocksDeadlocks
Deadlocks
Ā 
Chubby
ChubbyChubby
Chubby
Ā 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2
Ā 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1
Ā 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2
Ā 
Gossip-based algorithms
Gossip-based algorithmsGossip-based algorithms
Gossip-based algorithms
Ā 
Virtual Memory - Part2
Virtual Memory - Part2Virtual Memory - Part2
Virtual Memory - Part2
Ā 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and Spanner
Ā 
Process Management - Part1
Process Management - Part1Process Management - Part1
Process Management - Part1
Ā 
Mesos and YARN
Mesos and YARNMesos and YARN
Mesos and YARN
Ā 
Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1
Ā 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2
Ā 
Introduction to stream processing
Introduction to stream processingIntroduction to stream processing
Introduction to stream processing
Ā 
CPU Scheduling - Part1
CPU Scheduling - Part1CPU Scheduling - Part1
CPU Scheduling - Part1
Ā 

More from Amir Payberah

P2P Content Distribution Network
P2P Content Distribution NetworkP2P Content Distribution Network
P2P Content Distribution NetworkAmir Payberah
Ā 
Cloud Computing
Cloud ComputingCloud Computing
Cloud ComputingAmir Payberah
Ā 
Data Intensive Computing Frameworks
Data Intensive Computing FrameworksData Intensive Computing Frameworks
Data Intensive Computing FrameworksAmir Payberah
Ā 
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformAmir Payberah
Ā 
The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformAmir Payberah
Ā 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module ProgrammingAmir Payberah
Ā 
File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2Amir Payberah
Ā 
File System Implementation - Part1
File System Implementation - Part1File System Implementation - Part1
File System Implementation - Part1Amir Payberah
Ā 
File System Interface
File System InterfaceFile System Interface
File System InterfaceAmir Payberah
Ā 
Virtual Memory - Part1
Virtual Memory - Part1Virtual Memory - Part1
Virtual Memory - Part1Amir Payberah
Ā 
Main Memory - Part2
Main Memory - Part2Main Memory - Part2
Main Memory - Part2Amir Payberah
Ā 
Main Memory - Part1
Main Memory - Part1Main Memory - Part1
Main Memory - Part1Amir Payberah
Ā 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2Amir Payberah
Ā 
Process Management - Part3
Process Management - Part3Process Management - Part3
Process Management - Part3Amir Payberah
Ā 
Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Amir Payberah
Ā 

More from Amir Payberah (20)

P2P Content Distribution Network
P2P Content Distribution NetworkP2P Content Distribution Network
P2P Content Distribution Network
Ā 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
Ā 
Data Intensive Computing Frameworks
Data Intensive Computing FrameworksData Intensive Computing Frameworks
Data Intensive Computing Frameworks
Ā 
The Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics PlatformThe Stratosphere Big Data Analytics Platform
The Stratosphere Big Data Analytics Platform
Ā 
The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics Platform
Ā 
Security
SecuritySecurity
Security
Ā 
Protection
ProtectionProtection
Protection
Ā 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module Programming
Ā 
IO Systems
IO SystemsIO Systems
IO Systems
Ā 
File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2
Ā 
File System Implementation - Part1
File System Implementation - Part1File System Implementation - Part1
File System Implementation - Part1
Ā 
File System Interface
File System InterfaceFile System Interface
File System Interface
Ā 
Storage
StorageStorage
Storage
Ā 
Virtual Memory - Part1
Virtual Memory - Part1Virtual Memory - Part1
Virtual Memory - Part1
Ā 
Main Memory - Part2
Main Memory - Part2Main Memory - Part2
Main Memory - Part2
Ā 
Main Memory - Part1
Main Memory - Part1Main Memory - Part1
Main Memory - Part1
Ā 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2
Ā 
Threads
ThreadsThreads
Threads
Ā 
Process Management - Part3
Process Management - Part3Process Management - Part3
Process Management - Part3
Ā 
Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3
Ā 

Recently uploaded

Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
Ā 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
Ā 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
Ā 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
Ā 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
Ā 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
Ā 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
Ā 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
Ā 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
Ā 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
Ā 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
Ā 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto GonzƔlez Trastoy
Ā 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
Ā 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
Ā 
Russian Call Girls in Karol Bagh Aasnvi āž”ļø 8264348440 šŸ’‹šŸ“ž Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi āž”ļø 8264348440 šŸ’‹šŸ“ž Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi āž”ļø 8264348440 šŸ’‹šŸ“ž Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi āž”ļø 8264348440 šŸ’‹šŸ“ž Independent Escort S...soniya singh
Ā 
(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...
(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...
(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...gurkirankumar98700
Ā 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
Ā 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
Ā 

Recently uploaded (20)

Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
Ā 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
Ā 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
Ā 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Ā 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Ā 
Call Girls In Mukherjee Nagar šŸ“± 9999965857 šŸ¤© Delhi šŸ«¦ HOT AND SEXY VVIP šŸŽ SE...
Call Girls In Mukherjee Nagar šŸ“±  9999965857  šŸ¤© Delhi šŸ«¦ HOT AND SEXY VVIP šŸŽ SE...Call Girls In Mukherjee Nagar šŸ“±  9999965857  šŸ¤© Delhi šŸ«¦ HOT AND SEXY VVIP šŸŽ SE...
Call Girls In Mukherjee Nagar šŸ“± 9999965857 šŸ¤© Delhi šŸ«¦ HOT AND SEXY VVIP šŸŽ SE...
Ā 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
Ā 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
Ā 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
Ā 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
Ā 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Ā 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
Ā 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
Ā 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Ā 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
Ā 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
Ā 
Russian Call Girls in Karol Bagh Aasnvi āž”ļø 8264348440 šŸ’‹šŸ“ž Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi āž”ļø 8264348440 šŸ’‹šŸ“ž Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi āž”ļø 8264348440 šŸ’‹šŸ“ž Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi āž”ļø 8264348440 šŸ’‹šŸ“ž Independent Escort S...
Ā 
(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...
(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...
(Genuine) Escort Service Lucknow | Starting ā‚¹,5K To @25k with A/C šŸ§‘šŸ½ā€ā¤ļøā€šŸ§‘šŸ» 89...
Ā 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Ā 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
Ā 

Paxos

  • 1. Distributed Systems Consensus Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 1 / 56
  • 2. What is the Problem? Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 2 / 56
  • 3. Two Generalsā€™ Problem Two generals need to be agree on time to attack to win. They communicate through messengers, who may be killed on their way. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 3 / 56
  • 4. Two Generalsā€™ Problem Two generals need to be agree on time to attack to win. They communicate through messengers, who may be killed on their way. Agreement is the problem. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 3 / 56
  • 5. Replicated State Machine Problem (1/2) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 4 / 56
  • 6. Replicated State Machine Problem (1/2) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 4 / 56
  • 7. Replicated State Machine Problem (1/2) The solution: replicate the server. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 4 / 56
  • 8. Replicated State Machine Problem (2/2) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 5 / 56
  • 9. Replicated State Machine Problem (2/2) Make the server deterministic (state machine). Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 5 / 56
  • 10. Replicated State Machine Problem (2/2) Make the server deterministic (state machine). Replicate the server. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 5 / 56
  • 11. Replicated State Machine Problem (2/2) Make the server deterministic (state machine). Replicate the server. Ensure correct replicas step through the same sequence of state transitions (How?) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 5 / 56
  • 12. Replicated State Machine Problem (2/2) Make the server deterministic (state machine). Replicate the server. Ensure correct replicas step through the same sequence of state transitions (How?) Agreement is the problem. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 5 / 56
  • 13. The Agreement Problem Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 6 / 56
  • 14. The Agreement Problem Some nodes propose values (or actions) by sending them to the others. All nodes must decide whether to accept or reject those values. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 7 / 56
  • 15. The Agreement Problem Some nodes propose values (or actions) by sending them to the others. All nodes must decide whether to accept or reject those values. But, ... Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 7 / 56
  • 16. The Agreement Problem Some nodes propose values (or actions) by sending them to the others. All nodes must decide whether to accept or reject those values. But, ... Concurrent processes and uncertainty of timing, order of events and inputs. Failure and recovery of machines/processors, of communication channels. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 7 / 56
  • 17. Agreement Requirements Safety Liveness Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 8 / 56
  • 18. Agreement Requirements Safety ā€¢ Validity: only a value that has been proposed may be chosen. Liveness Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 8 / 56
  • 19. Agreement Requirements Safety ā€¢ Validity: only a value that has been proposed may be chosen. ā€¢ Agreement: no two correct nodes choose diļ¬€erent values. Liveness Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 8 / 56
  • 20. Agreement Requirements Safety ā€¢ Validity: only a value that has been proposed may be chosen. ā€¢ Agreement: no two correct nodes choose diļ¬€erent values. ā€¢ Integrity: a node chooses at most once. Liveness Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 8 / 56
  • 21. Agreement Requirements Safety ā€¢ Validity: only a value that has been proposed may be chosen. ā€¢ Agreement: no two correct nodes choose diļ¬€erent values. ā€¢ Integrity: a node chooses at most once. Liveness ā€¢ Termination: every correct node eventually choose a value. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 8 / 56
  • 22. Agreement in Distributed Systems: Possible Solutions Two-Phase Commit (2PC) Paxos Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 9 / 56
  • 23. Two-Phase Commit (2PC) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 10 / 56
  • 24. The Two-Phase Commit (2PC) Problem The problem ļ¬rst was encountered in database systems. Suppose a database system is updating some complicated data structures that include parts residing on more than one machine. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 11 / 56
  • 25. The Two-Phase Commit (2PC) Problem The problem ļ¬rst was encountered in database systems. Suppose a database system is updating some complicated data structures that include parts residing on more than one machine. System model: ā€¢ Concurrent processes and uncertainty of timing, order of events and inputs (asynchronous systems). ā€¢ Failure and recovery of machines/processors, of communication channels. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 11 / 56
  • 26. Intuitive Example (1/3) You want to organize outing with 3 friends at 6pm Tuesday. ā€¢ Go out only if all friends can make it. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 12 / 56
  • 27. Intuitive Example (2/3) What do you do? Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 13 / 56
  • 28. Intuitive Example (2/3) What do you do? ā€¢ Call each of them and ask if can do 6pm on Tuesday (voting phase) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 13 / 56
  • 29. Intuitive Example (2/3) What do you do? ā€¢ Call each of them and ask if can do 6pm on Tuesday (voting phase) ā€¢ If all can do Tuesday, call each friend back to ACK (commit) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 13 / 56
  • 30. Intuitive Example (2/3) What do you do? ā€¢ Call each of them and ask if can do 6pm on Tuesday (voting phase) ā€¢ If all can do Tuesday, call each friend back to ACK (commit) ā€¢ If one cannot do Tuesday, call other three to cancel (abort) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 13 / 56
  • 31. Intuitive Example (3/3) Critical details Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 14 / 56
  • 32. Intuitive Example (3/3) Critical details ā€¢ While you were calling everyone to ask, people who have promised they can do 6pm Tuesday must reserve that slot. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 14 / 56
  • 33. Intuitive Example (3/3) Critical details ā€¢ While you were calling everyone to ask, people who have promised they can do 6pm Tuesday must reserve that slot. ā€¢ You need to remember the decision and tell anyone whom you have not been able to reach during commit/abort phase. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 14 / 56
  • 34. Intuitive Example (3/3) Critical details ā€¢ While you were calling everyone to ask, people who have promised they can do 6pm Tuesday must reserve that slot. ā€¢ You need to remember the decision and tell anyone whom you have not been able to reach during commit/abort phase. That is exactly how 2PC works. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 14 / 56
  • 35. The 2PC Players Coordinator (Transaction Manager) ā€¢ Begins transaction. ā€¢ Responsible for commit/abort. Participants (Resource Managers) ā€¢ The servers with the data used in the distributed transaction. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 15 / 56
  • 36. The 2PC Algorithm Phase 1 - prepare phase Phase 2 - commit phase Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 16 / 56
  • 37. The 2PC Algorithm - Prepare Phase Coordinator asks each participant canCommit. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 17 / 56
  • 38. The 2PC Algorithm - Prepare Phase Coordinator asks each participant canCommit. Participants must prepare to commit using permanent storage before answering yes. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 17 / 56
  • 39. The 2PC Algorithm - Prepare Phase Coordinator asks each participant canCommit. Participants must prepare to commit using permanent storage before answering yes. ā€¢ Lock the objects. ā€¢ Participants are not allowed to cause an abort after it replies yes to canCommit. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 17 / 56
  • 40. The 2PC Algorithm - Prepare Phase Coordinator asks each participant canCommit. Participants must prepare to commit using permanent storage before answering yes. ā€¢ Lock the objects. ā€¢ Participants are not allowed to cause an abort after it replies yes to canCommit. Outcome of the transaction is uncertain until doCommit or doAbort. ā€¢ Other participants might still cause an abort. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 17 / 56
  • 41. The 2PC Algorithm - Commit Phase The coordinator collects all votes. ā€¢ If unanimous yes, causes commit. ā€¢ If any participant voted no, causes abort. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 18 / 56
  • 42. The 2PC Algorithm - Commit Phase The coordinator collects all votes. ā€¢ If unanimous yes, causes commit. ā€¢ If any participant voted no, causes abort. The fate of the transaction is decided atomically at the coordinator, once all participants vote. ā€¢ Coordinator records fate using permanent storage. ā€¢ Then broadcasts doCommit or doAbort to participants. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 18 / 56
  • 43. 2PC Sequence of Events Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 19 / 56
  • 44. Recovery in 2PC Recovery after timeouts. Recovery after crashes and reboot. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 20 / 56
  • 45. Recovery in 2PC Recovery after timeouts. Recovery after crashes and reboot. Note: you cannot diļ¬€erentiate between the above in a realistic asynchronous network. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 20 / 56
  • 46. Handling Timeout To avoid processes blocking for ever. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 21 / 56
  • 47. Handling Timeout To avoid processes blocking for ever. Two scenarios: ā€¢ Coordinator waits for votes from participants. ā€¢ Participant is waiting for the ļ¬nal decision. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 21 / 56
  • 48. Handling Timeout at Coordinator If B voted no, can coordinator unilaterally abort? If B voted yes, can coordinator unilaterally abort/commit? Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 22 / 56
  • 49. Handling Timeout at Coordinator If B voted no, can coordinator unilaterally abort? If B voted yes, can coordinator unilaterally abort/commit? ā€¢ Coordinator waits for votes from participants. ā€¢ Participant is waiting for the ļ¬nal decision. ā€¢ Coordinator timeout abort and send doAbort to participants. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 22 / 56
  • 50. Handling Timeout at Participants If B times out on TM and has voted yes, then execute termination protocol. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
  • 51. Handling Timeout at Participants If B times out on TM and has voted yes, then execute termination protocol. Simple protocol: participant is remained blocked until it can establish communication with coordinator. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
  • 52. Handling Timeout at Participants If B times out on TM and has voted yes, then execute termination protocol. Simple protocol: participant is remained blocked until it can establish communication with coordinator. Cooperative protocol: participant sends a decision-request message to other participants. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
  • 53. Handling Timeout at Participants If B times out on TM and has voted yes, then execute termination protocol. Simple protocol: participant is remained blocked until it can establish communication with coordinator. Cooperative protocol: participant sends a decision-request message to other participants. ā€¢ B sends status message to A Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
  • 54. Handling Timeout at Participants If B times out on TM and has voted yes, then execute termination protocol. Simple protocol: participant is remained blocked until it can establish communication with coordinator. Cooperative protocol: participant sends a decision-request message to other participants. ā€¢ B sends status message to A ā€¢ If A has received commit/abort from TM, ... Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
  • 55. Handling Timeout at Participants If B times out on TM and has voted yes, then execute termination protocol. Simple protocol: participant is remained blocked until it can establish communication with coordinator. Cooperative protocol: participant sends a decision-request message to other participants. ā€¢ B sends status message to A ā€¢ If A has received commit/abort from TM, ... ā€¢ If A has not responded to TM, ... Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
  • 56. Handling Timeout at Participants If B times out on TM and has voted yes, then execute termination protocol. Simple protocol: participant is remained blocked until it can establish communication with coordinator. Cooperative protocol: participant sends a decision-request message to other participants. ā€¢ B sends status message to A ā€¢ If A has received commit/abort from TM, ... ā€¢ If A has not responded to TM, ... ā€¢ If A has responded with no/yes ... Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 23 / 56
  • 57. Handling Crash and Recovery (1/2) All nodes must log protocol progress. ā€¢ Participants: prepared, uncertain, committed/aborted ā€¢ Coordinator: prepared, committed/aborted, done Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 24 / 56
  • 58. Handling Crash and Recovery (1/2) All nodes must log protocol progress. ā€¢ Participants: prepared, uncertain, committed/aborted ā€¢ Coordinator: prepared, committed/aborted, done Nodes cannot back out if commit is decided. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 24 / 56
  • 59. Handling Crash and Recovery (2/2) Coordinator crashes: ā€¢ If it ļ¬nds no commit on disk, it aborts. ā€¢ If it ļ¬nds commit, it commits. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 25 / 56
  • 60. Handling Crash and Recovery (2/2) Coordinator crashes: ā€¢ If it ļ¬nds no commit on disk, it aborts. ā€¢ If it ļ¬nds commit, it commits. Participant crashes: ā€¢ If it ļ¬nds no yes on disk, it aborts. ā€¢ If it ļ¬nds yes, runs termination protocol to decide. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 25 / 56
  • 61. Fault-Tolerance Limitations of 2PC Even with recovery enabled, 2PC is not really fault-tolerant (or live), because it can be blocked even when one (or a few) machines fail. Blocking means that it does not make progress during the failures. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 26 / 56
  • 62. Fault-Tolerance Limitations of 2PC Even with recovery enabled, 2PC is not really fault-tolerant (or live), because it can be blocked even when one (or a few) machines fail. Blocking means that it does not make progress during the failures. Any scenarios? Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 26 / 56
  • 63. 2PC Blocking Scenario TM sends doCommit to A, A gets it and commits, and then both TM and A die. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 27 / 56
  • 64. 2PC Blocking Scenario TM sends doCommit to A, A gets it and commits, and then both TM and A die. B, C, D have already also replied yes, have locked their mutexes, and now need to wait for TM or A to reappear. ā€¢ They cannot recover the decision with certainty until TM or A are online. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 27 / 56
  • 65. 2PC Blocking Scenario TM sends doCommit to A, A gets it and commits, and then both TM and A die. B, C, D have already also replied yes, have locked their mutexes, and now need to wait for TM or A to reappear. ā€¢ They cannot recover the decision with certainty until TM or A are online. This is why 2PC is called a blocking protocol: 2PC is safe, but not live. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 27 / 56
  • 66. Impossibility of Distributed Consensus with One Faulty Process Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 28 / 56
  • 67. FLP Fischer-Lynch-Paterson (FLP) ā€¢ M.J. Fischer, N.A. Lynch, and M.S. Paterson, Impossibility of distributed consensus with one faulty process, Journal of the ACM, 1985. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 29 / 56
  • 68. FLP Impossibility Result It is impossible for a set of processors in an asynchronous system to agree on a binary value, even if only a single process is subject to an unannounced failure. The core of the problem is asynchrony. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 30 / 56
  • 69. FLP Impossibility Result What FLP says: you cannot guarantee both safety and progress when there is even a single fault at an inopportune moment. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 31 / 56
  • 70. FLP Impossibility Result What FLP says: you cannot guarantee both safety and progress when there is even a single fault at an inopportune moment. What FLP does not say: in practice, how close can you get to the ideal (always safe and live)? Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 31 / 56
  • 71. FLP Impossibility Result What FLP says: you cannot guarantee both safety and progress when there is even a single fault at an inopportune moment. What FLP does not say: in practice, how close can you get to the ideal (always safe and live)? So, Paxos ... Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 31 / 56
  • 72. Paxos Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 32 / 56
  • 73. Paxos The only known completely-safe and largely-live agreement protocol. L. Lamport, The part-time parliament, ACM Transactions on Computer Systems, 1998. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 33 / 56
  • 74. The Paxos Players Proposers ā€¢ Suggests values for consideration by acceptors. Acceptors ā€¢ Considers the values proposed by proposers. ā€¢ Renders an accept/reject decision. Learners ā€¢ Learns the chosen value. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 34 / 56
  • 75. The Paxos Players Proposers ā€¢ Suggests values for consideration by acceptors. Acceptors ā€¢ Considers the values proposed by proposers. ā€¢ Renders an accept/reject decision. Learners ā€¢ Learns the chosen value. A node can act as more than one roles (usually 3). Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 34 / 56
  • 76. Single Proposal, Single Acceptor Use just one acceptor ā€¢ Collects proposersā€™ proposals. ā€¢ Decides the value and tells everyone else. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 35 / 56
  • 77. Single Proposal, Single Acceptor Use just one acceptor ā€¢ Collects proposersā€™ proposals. ā€¢ Decides the value and tells everyone else. Sounds familiar? Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 35 / 56
  • 78. Single Proposal, Single Acceptor Use just one acceptor ā€¢ Collects proposersā€™ proposals. ā€¢ Decides the value and tells everyone else. Sounds familiar? ā€¢ two-phase commit (2PC) ā€¢ acceptor fails = protocol blocks Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 35 / 56
  • 79. Single Proposal, Multiple Acceptors One acceptor is not fault-tolerant enough. Letā€™s have multiple acceptors. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 36 / 56
  • 80. Single Proposal, Multiple Acceptors One acceptor is not fault-tolerant enough. Letā€™s have multiple acceptors. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 36 / 56
  • 81. Single Proposal, Multiple Acceptors One acceptor is not fault-tolerant enough. Letā€™s have multiple acceptors. From there, must reach a decision. How? Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 36 / 56
  • 82. Single Proposal, Multiple Acceptors One acceptor is not fault-tolerant enough. Letā€™s have multiple acceptors. From there, must reach a decision. How? Decision = value accepted by the majority. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 36 / 56
  • 83. Single Proposal, Multiple Acceptors One acceptor is not fault-tolerant enough. Letā€™s have multiple acceptors. From there, must reach a decision. How? Decision = value accepted by the majority. P1: an acceptor must accept ļ¬rst proposal it receives. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 36 / 56
  • 84. Multiple Proposals, Multiple Acceptors If there are multiple proposals, no proposal may get the majority. ā€¢ 3 proposals may each get 1/3 of the acceptors. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 37 / 56
  • 85. Multiple Proposals, Multiple Acceptors If there are multiple proposals, no proposal may get the majority. ā€¢ 3 proposals may each get 1/3 of the acceptors. Solution: acceptors can accept multiple proposals, distinguished by a unique proposal number. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 37 / 56
  • 86. Handling Multiple Proposals All chosen proposals must have the same value. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 38 / 56
  • 87. Handling Multiple Proposals All chosen proposals must have the same value. P2: If a proposal with value v is chosen, then every higher-numbered proposal that is chosen also has value v. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 38 / 56
  • 88. Handling Multiple Proposals All chosen proposals must have the same value. P2: If a proposal with value v is chosen, then every higher-numbered proposal that is chosen also has value v. ā€¢ P2a: ... accepted ... Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 38 / 56
  • 89. Handling Multiple Proposals All chosen proposals must have the same value. P2: If a proposal with value v is chosen, then every higher-numbered proposal that is chosen also has value v. ā€¢ P2a: ... accepted ... ā€¢ P2b: ... proposed ... Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 38 / 56
  • 90. The Paxos Algorithm Phase 1a - prepare phase Phase 1b - promise phase Phase 2a - accept phase Phase 2b - accepted phase Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 39 / 56
  • 91. Paxos Algorithm Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 40 / 56
  • 92. Paxos Algorithm - Prepare Phase A proposer selects a proposal number n and sends a prepare request with number n to majority of acceptors. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 41 / 56
  • 93. Paxos Algorithm - Promise Phase If an acceptor receives a prepare request with number n greater than that of any prepare request it saw Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 42 / 56
  • 94. Paxos Algorithm - Promise Phase If an acceptor receives a prepare request with number n greater than that of any prepare request it saw ā€¢ It responses yes to that request with a promise not to accept any more proposals numbered less than n. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 42 / 56
  • 95. Paxos Algorithm - Promise Phase If an acceptor receives a prepare request with number n greater than that of any prepare request it saw ā€¢ It responses yes to that request with a promise not to accept any more proposals numbered less than n. ā€¢ It includes the highest-numbered proposal (if any) that it has accepted. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 42 / 56
  • 96. Paxos Algorithm - Accept Phase If the proposer receives a response yes to its prepare requests from a majority of acceptors Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 43 / 56
  • 97. Paxos Algorithm - Accept Phase If the proposer receives a response yes to its prepare requests from a majority of acceptors ā€¢ It sends an accept request to each of those acceptors for a proposal numbered n with a values v, which is the value of the highest-numbered proposal among the responses. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 43 / 56
  • 98. Paxos Algorithm - Accepted Phase If an acceptor receives an accept request for a proposal numbered n Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 44 / 56
  • 99. Paxos Algorithm - Accepted Phase If an acceptor receives an accept request for a proposal numbered n ā€¢ It accepts the proposal unless it has already responded to a prepare request having a number greater than n. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 44 / 56
  • 100. Deļ¬nition of Chosen A value is chosen at proposal number n, iļ¬€ majority of acceptors accept that value in phase 2 of the proposal number. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 45 / 56
  • 101. Paxos Example Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 46 / 56
  • 102. Paxos Example Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 47 / 56
  • 103. Paxos - Safety (1/3) If a value v is chosen at proposal number n, any value that is sent out in phase 2 of any later proposal numbers must be also v. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 48 / 56
  • 104. Paxos - Safety (2/3) Decision = Majority (any two majorities share at least one element) Therefore after the ļ¬rst round in which there is a decision, any subsequent round involves at least one acceptor that has accepted v. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 49 / 56
  • 105. Paxos - Safety (3/3) Now suppose our claim is not true, and let m is the ļ¬rst proposal number that is later than n and in 2nd phase, the value sent out is w = v. This is not possible, because if the proposer P was able to start 2nd phase for w, it means it got a majority to accept round for m (for m > n). So, either: ā€¢ v would not have been the value decided, or ā€¢ v would have been proposed by P Therefore, once a majority accepts v, that never changes. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 50 / 56
  • 106. Liveness If two or more proposers race to propose new values, they might step on each other toes all the time. ā€¢ P1: prepare(n1) ā€¢ P2: prepare(n2) ā€¢ P1: accept(n1, v1) ā€¢ P2: accept(n2, v2) ā€¢ P1: prepare(n3) ā€¢ P2: prepare(n4) ā€¢ ... ā€¢ n1 < n2 < n3 < n4 < Ā· Ā· Ā· With randomness, this occurs exceedingly rarely. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 51 / 56
  • 107. Paxos is Everywhere Google: Chubby (Paxos-based distributed lock service) ā€¢ Most Google services use Chubby directly or indirectly Yahoo: Zookeeper (Paxos-based distributed lock service) ā€¢ Zookeeper is open-source and integrates with Hadoop UW: Scatter (Paxos-based consistent DHT) Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 52 / 56
  • 108. Summary Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 53 / 56
  • 109. Summary Replicated State Machine 2PC: blocking FLP impossibility Paxos Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 54 / 56
  • 110. References: M.J. Fischer, N.A. Lynch, and M.S. Paterson, Impossibility of dis- tributed consensus with one faulty process, Journal of the ACM, 1985. L. Lamport, The part-time parliament, ACM Transactions on Com- puter Systems, 1998. L. Lamport, Paxos made simple, ACM Sigact News, 2001. J. Gray, and L. Lamport, Consensus on transaction commit, ACM Transactions on Database Systems, 2006. Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 55 / 56
  • 111. Questions? Acknowledgements Some slides were derived from Jinyang Li slides (New York University). Amir H. Payberah (Tehran Polytechnic) Consensus 1393/6/31 56 / 56