20080426 distributed algorithms_pedone_lecture04

Published in: Technology, Business

  1. Fault-tolerant broadcasts, Part II: Algorithms
     © Fernando Pedone

     Fault-tolerant broadcasts – Reliable broadcast: system model
       • ∏ = { p1, p2, …, pn }
       • Asynchronous system (i.e., no timing assumptions)
       • Crash-stop failure model (at most f failures)
       • Quasi-reliable channels
         – If p sends a message m to q and both processes are correct, then q eventually receives m

     Fault-tolerant broadcasts – Non-uniform reliable broadcast
       Agreement: if a correct process delivers m, then all correct processes eventually deliver m.

       broadcast(R, m) works as follows:
           send(m) to all
       upon receive(m) for the first time do
           deliver(R, m)
           send(m) to all

     Fault-tolerant broadcasts – Non-uniform reliable broadcast (cont'd)
       [Diagrams: message exchanges among processes P, Q, and R]
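The relay rule of the non-uniform reliable broadcast above can be tried out in a small simulation. This is a minimal sketch, not the lecture's code: the function name `reliable_broadcast`, the parameter `initial_targets` (the processes reached by the broadcaster's first `send(m) to all` before it possibly crashes), and the FIFO message queue are assumptions made for illustration.

```python
from collections import deque


def reliable_broadcast(n, initial_targets, crashed=frozenset()):
    """Simulate broadcast(R, m) among processes 0..n-1.

    initial_targets: processes reached by the broadcaster's send(m) to all
                     (hypothetical parameter; a crashing broadcaster may
                     reach only some processes).
    crashed:         processes that never handle any message.
    Returns the set of processes that executed deliver(R, m).
    """
    delivered = set()
    in_transit = deque(initial_targets)  # one entry per copy of m in flight
    while in_transit:
        p = in_transit.popleft()
        if p in crashed or p in delivered:
            continue  # crashed process, or not the first receive(m)
        delivered.add(p)            # deliver(R, m)
        in_transit.extend(range(n))  # send(m) to all
    return delivered


# The broadcaster (process 0) crashes after reaching only process 1.
print(reliable_broadcast(4, initial_targets=[1], crashed={0}))
```

In this run process 1 relays m, so the correct processes 2 and 3 also deliver it, which is what the agreement property requires.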
  2. Fault-tolerant broadcasts – Uniform reliable broadcast
       Uniform agreement: if a process delivers m, then all correct processes eventually deliver m.

       broadcast(UR, m) works as follows:
           send(m) to all
       upon receive(m)
           if [for f+1 processes q: received(m) from q] then
               deliver(UR, m)
           if [didn't send m yet] then
               send(m) to all

     Fault-tolerant broadcasts – Uniform reliable broadcast (cont'd)
       – f = 1
       [Diagrams: message exchanges among processes P, Q, and R]

     Fault-tolerant broadcasts – Cost of reliable broadcast algorithms

       Algorithm                        Number of messages   Latency
       Reliable Broadcast (optimized)   n²                   δ
       Uniform Reliable Broadcast       n²                   2δ

     Fault-tolerant broadcasts – Reducing the number of messages
       • Asynchronous system + S

       broadcast(R, m) works as follows:
           send(m) to all
       upon receive(m) for the first time do
           deliver(R, m)
           start Task Propagate(m, sender(m))
       Task Propagate(m, q)
           if suspect q then send(m) to all
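The delivery threshold of the uniform reliable broadcast above (deliver only after copies from f+1 distinct processes) can be sketched as a per-process handler. The class name `UrbProcess` and its attributes are hypothetical; the sketch tracks a single message m and leaves the network out.

```python
class UrbProcess:
    """State of one process for a single message m (illustrative sketch)."""

    def __init__(self, f):
        self.f = f                # maximum number of crash failures
        self.senders = set()      # distinct processes m was received from
        self.relayed = False      # "didn't send m yet" flag
        self.delivered = False

    def on_receive(self, sender):
        """Handle receive(m) from `sender`.

        Returns True when the caller must now send(m) to all.
        """
        self.senders.add(sender)
        if not self.delivered and len(self.senders) >= self.f + 1:
            self.delivered = True  # deliver(UR, m)
        if not self.relayed:
            self.relayed = True
            return True
        return False


p = UrbProcess(f=1)
must_relay = p.on_receive(sender=0)   # first copy: relay, but do not deliver
p.on_receive(sender=0)                # duplicate from the same sender
p.on_receive(sender=2)                # second distinct sender: deliver
```

Waiting for f+1 distinct senders guarantees that at least one correct process holds m before anyone delivers it, which is the reason the latency in the cost table grows from δ to 2δ.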
  3. Fault-tolerant broadcasts – FIFO reliable broadcast
       – Transforms Reliable Broadcast into FIFO Uniform Reliable Broadcast
       – Assumptions
         • If m is the i-th message broadcast by p, then m is tagged with sender(m) = p and seq#(m) = i
         • Each process keeps a local vector of counters next[1..n]

       Initialization
           msgSet ← Ø
           next[q] ← 1, for each process q
       To execute broadcast(F, m):
           broadcast(R, m)
       deliver(F, –) occurs as follows:
           upon deliver(R, m) do
               q ← sender(m)
               msgSet ← msgSet ∪ { m }
               while (∃m' ∈ msgSet: sender(m') = q and next[q] = seq#(m'))
                   deliver(F, m')
                   next[q] ← next[q] + 1

     Fault-tolerant broadcasts – Causal reliable broadcast
       – Transforms Uniform Reliable Broadcast into Causal Reliable Broadcast
       – Notation
         • Message sequences: 〈 m1, m2, … 〉
         • ⊥ is the empty sequence
         • ⊕ is the concatenation operator

       Initialization
           rcntDlvrs ← ⊥
       To execute broadcast(C, m):
           broadcast(F, rcntDlvrs ⊕ 〈 m 〉)
           rcntDlvrs ← ⊥
       deliver(C, –) occurs as follows:
           upon deliver(F, 〈 m1, m2, …, ml 〉) do
               for i from 1 to l do
                   if p has not previously executed deliver(C, mi) then
                       deliver(C, mi)
                       rcntDlvrs ← rcntDlvrs ⊕ 〈 mi 〉
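The FIFO delivery loop above (buffer in msgSet, deliver while next[q] matches) can be sketched as follows. The class name `FifoDeliverer` and the `delivered` list are assumptions for illustration. Unlike the slide, the sketch removes a message from the buffer once it is delivered; this does not change the delivery order because next[q] only grows.

```python
from collections import defaultdict


class FifoDeliverer:
    """FIFO delivery on top of an unordered reliable broadcast (sketch)."""

    def __init__(self):
        self.msg_set = {}                   # (sender, seq#) -> payload
        self.next = defaultdict(lambda: 1)  # next[q] starts at 1 for every q
        self.delivered = []                 # payloads in deliver(F, -) order

    def on_r_deliver(self, sender, seq, payload):
        """Called upon deliver(R, m) with sender(m) and seq#(m)."""
        self.msg_set[(sender, seq)] = payload
        # Deliver every buffered message from `sender` that is now in order.
        while (sender, self.next[sender]) in self.msg_set:
            key = (sender, self.next[sender])
            self.delivered.append(self.msg_set.pop(key))  # deliver(F, m')
            self.next[sender] += 1


d = FifoDeliverer()
d.on_r_deliver("p", 2, "second")  # arrives early, stays buffered
d.on_r_deliver("p", 1, "first")   # unblocks both messages
print(d.delivered)
```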
  4. Fault-tolerant broadcasts – Atomic broadcast
       [Diagram: relationships among broadcast primitives]
         Reliable Broadcast   + total order →   Atomic Broadcast
         + FIFO order (first transformation)
         FIFO Broadcast       + total order →   FIFO Atomic Broadcast
         + causal order (second transformation)
         Causal Broadcast     + total order →   Causal Atomic Broadcast

     Fault-tolerant broadcasts – From atomic broadcast to consensus
       [Diagram: consensus layered on atomic broadcast; processes P, Q, and R invoke propose(v1), propose(v2), and propose(v3); labels shown: broadcast(v1), broadcast(v2), deliver(v1), deliver(v2), decide(v2)]

       To execute propose(v):
           broadcast(v)
           Let v be the first value delivered
           decide(v)

     Fault-tolerant broadcasts – From consensus to atomic broadcast
       Initialization
           R-delivered ← Ø
           A-delivered ← Ø
           k ← 0
       To execute broadcast(A, m):
           broadcast(R, m)
       deliver(A, –) occurs as follows:
           upon deliver(R, m) do
               R-delivered ← R-delivered ∪ { m }
           upon R-delivered \ A-delivered ≠ Ø
               k ← k + 1
               A-undelivered ← R-delivered \ A-delivered
               propose(k, A-undelivered)
               wait until decide(k, msgSetk)
               A-deliverk ← msgSetk \ A-delivered
               deliver all messages in A-deliverk in some deterministic order
               A-delivered ← A-delivered ∪ A-deliverk

     Fault-tolerant broadcasts – From consensus to atomic broadcast (cont'd)
       [Diagram: P executes broadcast(A, m1), broadcast(R, m1), and propose(m3, m2); Q executes broadcast(A, m2), broadcast(R, m2), and propose(m2, m1, m3); R executes broadcast(A, m3), broadcast(R, m3), and propose(m1, m2); P then executes deliver(A, m1) and deliver(A, m2)]
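The reduction from atomic broadcast to consensus above can be exercised with a toy model. Everything here is a sketch under stated assumptions: `ToyConsensus` is a hypothetical stand-in that decides the first value proposed for each instance k (a real consensus algorithm would replace it), `step` plays the role of the "upon R-delivered \ A-delivered ≠ Ø" handler, and sorting is used as the deterministic order.

```python
class ToyConsensus:
    """Stand-in for consensus: instance k decides the first proposal for k."""

    def __init__(self):
        self.decisions = {}

    def propose(self, k, value):
        # Returns the decision of instance k (the first value proposed).
        return self.decisions.setdefault(k, value)


class AtomicBroadcaster:
    """One process running the consensus-based atomic broadcast (sketch)."""

    def __init__(self, consensus):
        self.consensus = consensus
        self.r_delivered = set()
        self.a_delivered = set()
        self.k = 0
        self.log = []  # messages in deliver(A, -) order

    def on_r_deliver(self, m):
        self.r_delivered.add(m)

    def step(self):
        # upon R-delivered \ A-delivered != Ø
        while self.r_delivered - self.a_delivered:
            self.k += 1
            undelivered = frozenset(self.r_delivered - self.a_delivered)
            msg_set = self.consensus.propose(self.k, undelivered)
            batch = sorted(msg_set - self.a_delivered)  # deterministic order
            self.log.extend(batch)
            self.a_delivered |= set(batch)


shared = ToyConsensus()
p, q = AtomicBroadcaster(shared), AtomicBroadcaster(shared)
for m in ("m1", "m2"):
    p.on_r_deliver(m)
for m in ("m2", "m3"):
    q.on_r_deliver(m)
p.step()
q.step()
p.on_r_deliver("m3")
p.step()
print(p.log, q.log)
```

Both processes end with the same log even though they R-delivered the messages in different orders, because every batch is fixed by the decision of instance k rather than by local arrival order.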
  5. Fault-tolerant broadcasts – From consensus to atomic broadcast (cont'd)
       [Diagram: processes P, Q, and R run a reliable broadcast phase followed by a consensus phase; annotation: "Possibly suspecting the coordinator"]

     Fault-tolerant broadcasts – Generic broadcast
       – Single-message case
       [Diagram: m is sent to P, Q, and R, which answer with ACK(m); annotation: "Only message received so far"]

     Fault-tolerant broadcasts – Two non-conflicting messages
       [Diagram: P, Q, and R receive m1 and m2 and answer with ACK(m1), ACK(m2), and ACK(m1, m2)]

     Fault-tolerant broadcasts – The general case (m1 and m2 conflict)
       How many ACKs are needed for delivery?
       [Diagram: P1 and P2 send ACK(m1) while P3 and P4 send ACK(m2); the delivery order is violated]
       nACK ≤ n/2 ⇒ order violated
  6. Fault-tolerant broadcasts – The general case (m1 and m2 conflict)
       [Diagram: P1, P2, and P3 send ACK(m1); m2 reaches P4; conflict detected, consensus is needed; P1 to P5 then run consensus]

     Fault-tolerant broadcasts – The general case (m1 and m2 conflict)
       • Conflicts should be detected: nACK > n / 2
       • For conflicting messages:
         "If m1 and m2 conflict, and m1 (or m2) has been delivered before consensus, then consensus cannot contradict this order."
       How to ensure this???

     Fault-tolerant broadcasts – The general case (m1 and m2 conflict)
       [Diagram: P1, P2, and P3 send ACK(m1), P4 sends ACK(m2), P5 sends ACK(m3); phases: exchange all ACKed messages, check, consensus; P3, P4, and P5 each collect { P3:m1 ; P4:m2 ; P5:m3 }]

     Fault-tolerant broadcasts – The general case (m1 and m2 conflict)
       [Diagram: P1: m1, P2: m1, P3: m1, P4: m2, P5: m2]
       Which messages should be chosen?
  7. Fault-tolerant broadcasts
       – If nACK = (n+1) / 2 (i.e., majority), how much should nCHK be?
       – How to choose a message in the "check phase"?
       – Problem: we can't tolerate failures
       – What can be done?
       – Increase nACK… for example, if nACK = 4 and n = 5, could we have nCHK = 4?

     Fault-tolerant broadcasts
       [Diagram: nACK processes hold m1 and nCHK processes answer in the check phase, one of them holding m2]
       Could m2 be delivered in the "ack phase"?
       Can we be sure that m1 was delivered in the "ack phase"?

     Fault-tolerant broadcasts – The general case (cont'd)
       [Diagram: processes P1, P2, P3, …, Pn-1, Pn; the ACK quorum has nACK processes, the CHK quorum has nCHK processes, n − nCHK processes are outside the CHK quorum, and nCHK/2 marks half of the CHK quorum]
       nCHK/2 + (n − nCHK) < nACK

     Fault-tolerant broadcasts
       – What is the relation between nACK and nCHK?
           nCHK/2 + (n − nCHK) < nACK
           nCHK + 2n − 2nCHK < 2nACK
           −nCHK + 2n < 2nACK
           2nACK + nCHK > 2n
         Best case: nACK = nCHK = x
           2x + x > 2n
           x > 2n / 3   (Optimal for ACK with 2δ)
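The quorum inequality 2nACK + nCHK > 2n derived above is easy to check numerically. The function names below are hypothetical; the sketch only evaluates the inequality and the best-case bound x > 2n/3 from the slides.

```python
def quorums_are_safe(n, n_ack, n_chk):
    """True when 2*nACK + nCHK > 2n, the relation derived on the slide."""
    return 2 * n_ack + n_chk > 2 * n


def smallest_equal_quorum(n):
    """Smallest integer x with nACK = nCHK = x and 3x > 2n."""
    return (2 * n) // 3 + 1


n = 5
print(quorums_are_safe(n, n_ack=3, n_chk=4))  # majority ACK quorum, nCHK = 4
print(quorums_are_safe(n, n_ack=3, n_chk=5))  # majority ACK quorum, nCHK = n
print(quorums_are_safe(n, n_ack=4, n_chk=4))  # the slide's nACK = nCHK = 4
print(smallest_equal_quorum(n))
```

For n = 5, a majority ACK quorum (nACK = 3) satisfies the inequality only with nCHK = 5, so no failure can be tolerated, whereas nACK = nCHK = 4 satisfies it and tolerates one crash.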
  8. Fault-tolerant broadcasts
       Initialization
           R-delivered ← G-delivered ← Ø
           pending1 ← g-Deliver1 ← Ø
           k ← 1
       To execute broadcast(m):
           broadcast(R, m)
       when deliver(R, m) do
           R-delivered ← R-delivered ∪ { m }
       when (R-delivered \ (G-delivered ∪ pendingk) ≠ Ø)
           if [ for all m, m' ∈ (R-delivered \ G-delivered): m doesn't conflict with m' ] then
               pendingk ← R-delivered \ G-delivered
               send (k, pendingk, ACK) to all
           else
               Handle conflict (next slide)
       when receive (k, pendingk, ACK) from pj
           while ∃m such that [ for nACK pj: received (k, pendingk, ACK) from pj and m ∈ (pendingk \ g-Deliverk) ] do
               g-Deliverk ← g-Deliverk ∪ { m }
               deliver(m)

     Fault-tolerant broadcasts – Optimality (i.e., lower bound)
       [Diagram: m is sent to P, Q, R, and S, which answer with ACK(m); two communication steps of δ each]

     Fault-tolerant broadcasts – Handling conflicts:
       when (R-delivered \ (G-delivered ∪ pendingk) ≠ Ø)
           if [ for all m, m' ∈ (R-delivered \ G-delivered): m doesn't conflict with m' ] then
               ...
           else
               send (k, pendingk, CHK) to all
               wait until [ for nCHK pj: received (k, pendingk, CHK) from pj ]
               msgSet = { m | for (nCHK+1)/2 pj: received (k, { …, m, … }, CHK) from pj }
               propose(k, msgSet, (R-delivered \ (G-delivered ∪ pendingk)))
               wait until decide (k, NCset, Cset)
               for each m ∈ NCset \ (G-delivered ∪ g-Deliverk) do deliver(m)
               in ID order: for each m ∈ Cset \ (G-delivered ∪ g-Deliverk) do deliver(m)
               G-delivered ← G-delivered ∪ NCset ∪ Cset
               k ← k + 1
               pendingk ← g-Deliverk ← Ø

     Paxos – The part-time parliament
       – Leslie Lamport
       – Consensus algorithm
       – SRC Research Report 49, Sep. 1, 1989
       – ACM TOCS, May 1998
         • Submitted in 1990
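The msgSet computation in the conflict handler above (keep the messages reported by (nCHK+1)/2 processes in the check phase) might look like this. The function name `check_phase_msg_set` is hypothetical, and rounding (nCHK+1)/2 up to an integer is an assumption, since the slide does not say how a fractional threshold is treated.

```python
from collections import Counter
from math import ceil


def check_phase_msg_set(chk_sets):
    """Messages that appear in at least (nCHK+1)/2 of the CHK replies.

    chk_sets: one pending set per process heard from in the check phase,
              so nCHK = len(chk_sets).
    Assumption: the threshold (nCHK+1)/2 is rounded up.
    """
    n_chk = len(chk_sets)
    threshold = ceil((n_chk + 1) / 2)
    counts = Counter(m for pending in chk_sets for m in set(pending))
    return {m for m, c in counts.items() if c >= threshold}


# Four CHK replies: three processes acknowledged m1, one acknowledged m2.
print(check_phase_msg_set([{"m1"}, {"m1"}, {"m1"}, {"m2"}]))
```

Here only m1 reaches the threshold of three replies, so it is the only message that may already have been delivered in the ack phase and must therefore be proposed ahead of the conflicting m2.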
  9. Paxos – System model
       • ∏ = { p1, p2, …, pn }
       • Asynchronous system, plus…
       • Leader-election oracle (Ω)
       • Crash-recovery failure model
       • Unreliable channels
       • Stable storage

     Paxos – The protocol: choosing a value
       [Diagram: proposers, acceptors, and learners. Phase 1a: a proposer with value v sends ballot number b. Phase 1b: each acceptor applies "if b > bal then bal ← b", moving from (bal = 0, msg = ⊥) to (bal = b, msg = ⊥), and replies. Phase 2a: the proposer sends (b, v). Phase 2b: each acceptor moves to (bal = b, msg = v) and sends (b, v). Acceptor state is kept in stable storage.]

     Paxos – Learning a chosen value
       [Diagram: in Phase 2b the acceptors, each with (bal = b, msg = v), send (b, v) to the learners; in Phase 3 the learners learn v]

     Paxos – Handling old ballot numbers
       [Diagram: a proposer with value v' uses a ballot b' < b; the acceptors, each with (bal = b, msg = v), ignore the message, and the proposer times out]
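The acceptor behaviour in the Paxos diagrams above (raise bal on a higher ballot, accept (b, v) in Phase 2, ignore old ballots) can be sketched as a small state machine. The class and method names are hypothetical. The field `accepted_bal`, which records the ballot at which msg was accepted, is not shown on the slides; it is added here as an assumption taken from the usual presentation of Paxos, where Phase 1b replies report it. Stable storage is not modelled.

```python
class Acceptor:
    """Single-instance Paxos acceptor (illustrative sketch)."""

    def __init__(self):
        self.bal = 0            # highest ballot seen (bal = 0 initially)
        self.msg = None         # accepted value; None stands for ⊥
        self.accepted_bal = 0   # assumption: ballot at which msg was accepted

    def on_phase1a(self, b):
        """Ballot request: 'if b > bal then bal ← b', then reply (Phase 1b)."""
        if b > self.bal:
            self.bal = b
            return ("1b", b, self.accepted_bal, self.msg)
        return None             # old ballot number: ignore message

    def on_phase2a(self, b, v):
        """Accept request: store v unless a higher ballot was promised."""
        if b >= self.bal:
            self.bal = b
            self.accepted_bal = b
            self.msg = v
            return ("2b", b, v)
        return None             # old ballot number: ignore message


a = Acceptor()
print(a.on_phase1a(1))        # promise for ballot 1
print(a.on_phase2a(1, "v"))   # accept v at ballot 1
print(a.on_phase2a(0, "v'"))  # b' < b: ignored
```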
  10. Paxos – Handling new ballot numbers
       [Diagram: a proposer with value v' uses a ballot b' > b; the acceptors move from (bal = b, msg = v) to (bal = b', msg = v); the proposer adopts new value = v and sends (b', v) to the acceptors, which forward (b', v) to the learners]

     Paxos – The termination problem
       [Diagram: proposers P with (v1, b1) and (v2, b2) keep issuing ballots b1, b2, b3; the acceptors A move through bal = 0, b1, b2, b3 while msg stays ⊥, so the learners L learn nothing]

     Paxos – A leader-based protocol
       [Diagram: proposers P with values v1 and v2 send them to a leader, which runs Phase 1 and Phase 2 with the acceptors A, initially (bal = 0, msg = ⊥)]

     Paxos – Ballot reservation
       [Diagram: the leader runs Phase 1 once and then only Phase 2 for the instances Consensus 1, …, Consensus n]
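The "new value = v" step in the diagram for new ballot numbers above says that a proposer with a higher ballot must give up its own value when an acceptor has already accepted one. A minimal sketch of that choice follows; the function name and the reply format `(accepted_bal, value)` are assumptions for illustration, following the usual Paxos rule of adopting the value accepted at the highest ballot.

```python
def value_for_phase2a(phase1b_replies, own_value):
    """Pick the value a proposer sends in Phase 2a.

    phase1b_replies: (accepted_bal, value) pairs from a quorum of acceptors;
                     value is None (⊥) when the acceptor accepted nothing.
    own_value:       the value the proposer originally wanted (v').
    """
    accepted = [(bal, v) for bal, v in phase1b_replies if v is not None]
    if not accepted:
        return own_value  # nothing accepted so far: free to use v'
    # Adopt the value accepted at the highest ballot.
    return max(accepted, key=lambda reply: reply[0])[1]


print(value_for_phase2a([(0, None), (1, "v"), (0, None)], own_value="v'"))
print(value_for_phase2a([(0, None), (0, None)], own_value="v'"))
```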
