Zab dsn-2011

Talk given at the DSN 2011 conference.

  1. Zab: High-performance broadcast for primary-backup systems
     Flavio Junqueira, Benjamin Reed, Marco Serafini
     Yahoo! Research
     June 2011
  2. Setting up the stage
     • Background: ZooKeeper
     • Coordination service
       – Web-scale applications
       – Intensive use (high performance)
       – Source of truth for many applications
  3. ZooKeeper
     • Open source Apache project
     • Used in production
       – Yahoo!
       – Facebook
       – Rackspace
       – ...
     http://zookeeper.apache.org
  4. ZooKeeper
     • ... is a leader-based, replicated service
       – Processes crash and recover
     • Leader
       – Executes requests
       – Propagates state updates
     • Follower
       – Applies state updates
     [Diagram: the leader broadcasts state updates, which the followers
      deliver via atomic broadcast]
  5. ZooKeeper
     • Client
       – Submits operations to a server
       – If that server is a follower, it forwards the request to the leader
       – The leader executes the operation and propagates the state update
     [Diagram: a client request reaches a follower, is forwarded to the
      leader, and the resulting update is delivered via atomic broadcast]
  6. ZooKeeper
     • State updates
       – All followers apply the same updates
       – All followers apply them in the same order
       – Atomic broadcast
     • Performance requirements
       – Multiple outstanding operations
       – Low latency and high throughput
  7. ZooKeeper
     • Update the configuration, then create ready
     • If ready exists, then the configuration is consistent
     [Diagram: the leader pipelines four operations: (1) del /cfg/ready,
      (2) setData /cfg/client, (3) setData /cfg/server, (4) create /cfg/ready]
     • If 1 doesn't commit, then 2 and 3 can't commit
     • If 2 and 3 don't commit, then 4 must not commit
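The recipe on this slide relies on the leader committing a client's pipelined updates strictly in submission order: once one operation fails, none of the later ones may commit. A minimal sketch of that guarantee over a toy in-memory store (class and method names are illustrative, not the ZooKeeper client API):

```java
import java.util.*;

// Toy model of the slide's config-update recipe: apply a client's pipelined
// updates strictly in submission order, and drop the rest of the pipeline as
// soon as one operation fails. Paths (/cfg/ready etc.) follow the slide.
class OrderedStore {
    private final Map<String, String> znodes = new HashMap<>();
    private boolean aborted = false; // set once any op fails

    void apply(String op, String path, String data) {
        if (aborted) return; // FIFO order: nothing commits after a failure
        switch (op) {
            case "del":     if (znodes.remove(path) == null) aborted = true; break;
            case "setData": znodes.put(path, data); break;
            case "create":  if (znodes.putIfAbsent(path, data) != null) aborted = true; break;
        }
    }
    boolean exists(String path) { return znodes.containsKey(path); }
    String read(String path)    { return znodes.get(path); }
}
```

With this ordering, a reader that sees /cfg/ready can conclude both setData operations committed, which is exactly the invariant the slide states.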
  8. ZooKeeper
     • Exploring Paxos
       – Efficient consensus protocol
       – State-machine replication
       – Multiple consecutive instances
     • Why is it not suitable out of the box?
       – Does not guarantee order with multiple outstanding operations
  9. Paxos at a glance
     [Diagram: a proposer and three acceptor/learner processes exchange
      messages 1a, 1b, 2a, 2b, 3a]
     • Phase 1: selects a value to propose (1b: the acceptor promises not to
       accept lower ballots)
     • Phase 2: proposes a value (2b: if a quorum accepts, the value is chosen)
     • Phase 3: the value is learned
 10. Paxos run
     [Diagram: a run over instances 27–29 that interleaves operations of P1,
      P2, and P3. A1 has accepted A (27) and B (28) from P1; another acceptor
      has accepted C (27) from P2. P3 then runs Phase 1 with ballot 3 and
      proposes C, B, D for instances 27–29, mixing values that different
      proposers submitted as consecutive operations.]
 11. ZooKeeper
     • Another requirement
       – Minimize downtime
       – Efficient recovery
     • Reduce the amount of state transferred
     • Zab
       – A single identifier determines the missing values for each process
 12. Zab and PO Broadcast
 13. Definitions
     • Processes: lead or follow
     • Followers
       – Maintain a history of transactions (state updates)
     • Transaction identifiers: ⟨e, c⟩
       – e: epoch number of the leader
       – c: counter within the epoch
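The ⟨e, c⟩ identifier induces a total order: epoch first, then counter within the epoch. A sketch of that comparison (the name Zxid follows ZooKeeper's terminology; the two-field layout here is illustrative, and the real implementation packs both into one 64-bit number):

```java
// Transaction identifier ⟨e, c⟩ from the slide: ordered first by the
// leader's epoch e, then by the counter c within that epoch.
final class Zxid implements Comparable<Zxid> {
    final long epoch;    // e: epoch number of the leader that created the txn
    final long counter;  // c: position of the txn within that epoch

    Zxid(long epoch, long counter) { this.epoch = epoch; this.counter = counter; }

    @Override public int compareTo(Zxid o) {
        if (epoch != o.epoch) return Long.compare(epoch, o.epoch);
        return Long.compare(counter, o.counter);
    }
    @Override public String toString() { return "<" + epoch + "," + counter + ">"; }
}
```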
 14. Properties of PO Broadcast
     • Integrity
       – Only broadcast transactions are delivered
       – A leader recovers before broadcasting new transactions
     • Total order and agreement
       – Followers deliver the same transactions and in the same order
 15. Primary order
     • Local: transactions of a leader are accepted in order
     • Global: transactions in a history respect the order of epochs
 16. Primary order
     • Local: transactions of a primary are accepted in order
     • Global: transactions in a history respect the order of epochs
     [Diagram: the leader calls abcast(⟨e,10⟩), abcast(⟨e,11⟩), abcast(⟨e,12⟩);
      a follower accepts them]
 17. Primary order
     • Local: transactions of a primary are accepted in order
     • Global: transactions in a history respect the order of epochs
     [Diagram: the same broadcasts ⟨e,10⟩, ⟨e,11⟩, ⟨e,12⟩, illustrating the
      local order the follower must preserve]
 18. Primary order
     • Local: transactions of a primary are accepted in order
     • Global: transactions in a history respect the order of epochs
     [Diagram: the leader of epoch e calls abcast(⟨e,10⟩), abcast(⟨e,11⟩);
      a new leader of epoch e' calls abcast(⟨e',1⟩)]
 19. Primary order
     • Local: transactions of a primary are accepted in order
     • Global: transactions in a history respect the order of epochs
     [Diagram: with a leader of epoch e and a new leader of epoch e', the
      follower's history must order the epoch-e transactions before ⟨e',1⟩]
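The local and global conditions above can be stated as a single check over a follower's history of ⟨epoch, counter⟩ pairs. A small sketch of that check (illustrative code, not taken from the Zab implementation):

```java
// Checks the slides' primary-order property over a history given as
// {epoch, counter} pairs in delivery order.
final class PrimaryOrder {
    static boolean respectsPrimaryOrder(long[][] history) {
        for (int i = 1; i < history.length; i++) {
            long pe = history[i - 1][0], pc = history[i - 1][1];
            long e = history[i][0], c = history[i][1];
            if (e < pe) return false;               // global: epochs never go backwards
            if (e == pe && c <= pc) return false;   // local: in-epoch counters strictly increase
        }
        return true;
    }
}
```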
 20. Zab in Phases
     • Phase 0 – Leader election
       – A prospective leader is elected
     • Phase 1 – Discovery
       – Followers promise not to go back to previous epochs
       – Followers send the prospective leader their last epoch and history
       – The leader selects the longest history of the latest epoch
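The Discovery rule on the slide, picking the longest history from the latest epoch among a quorum of follower responses, can be sketched like this (class and field names are illustrative, not from the Zab codebase):

```java
import java.util.*;

// Phase 1 (Discovery) selection rule: among the followers' responses,
// prefer the latest epoch, breaking ties by history length.
final class Discovery {
    static final class Response {
        final long lastEpoch;
        final List<String> history;  // transactions in delivery order
        Response(long lastEpoch, List<String> history) {
            this.lastEpoch = lastEpoch; this.history = history;
        }
    }

    static Response selectInitialHistory(List<Response> quorum) {
        Response best = quorum.get(0);
        for (Response r : quorum) {
            if (r.lastEpoch > best.lastEpoch
                || (r.lastEpoch == best.lastEpoch && r.history.size() > best.history.size()))
                best = r;
        }
        return best;
    }
}
```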
 21. Zab in Phases
     • Phase 2 – Synchronization
       – The leader sends the new history to the followers
       – Followers confirm its leadership
     • Phase 3 – Broadcast
       – The leader proposes new transactions
       – It commits a transaction once a quorum acknowledges it
 22. Zab in Phases
     • Phases 1 and 2: Recovery
       – Critical to guarantee order with multiple outstanding transactions
     • Phase 3: Broadcast
       – Just like Phases 2 and 3 of Paxos
 23. Zab: Sample run
     [Diagram: followers f1, f2, f3 all accept ⟨0,1⟩; f1 and f2 also accept
      ⟨0,2⟩; f1 also accepts ⟨0,3⟩. In the new epoch (f1.a = 0, f2.a = 0,
      f3.a = 0), the longest history ⟨0,1⟩ ⟨0,2⟩ ⟨0,3⟩ becomes the initial
      history of the new epoch.]
 24. Zab: Sample run
     [Diagram: f1, f2, f3 all accept ⟨0,1⟩ and ⟨0,2⟩, so ⟨0,2⟩ is chosen.
      Epoch-1 transactions ⟨1,1⟩ and ⟨1,2⟩ are then accepted (f1.a = 1,
      f2.a = 1, f3.a = 2). An initial history that orders ⟨0,2⟩ after the
      epoch-1 transactions can't happen: it would violate primary order.]
 25. Paxos run (revisited)
     [Table: the earlier Paxos run re-expressed with ⟨epoch, counter⟩
      identifiers. L1 and L2 start their epochs with empty histories after
      Phases 1 and 2; L3 starts epoch 3 with history ⟨2,1⟩:C. The followers'
      per-epoch states show how values A, B, C, D from different leaders end
      up interleaved across epochs 1–3.]
 26. Notes on implementation
     • Use of TCP
       – Ordered delivery, retransmissions, etc.
       – Notion of session
     • Elect the leader with the most committed transactions
       – No follower → leader copies
     • Recovery
       – The last zxid is sufficient
       – In Phase 2, the leader commands followers to add or truncate
         transactions
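The recovery note above can be made concrete: comparing only the follower's last zxid against the leader's committed history is enough to decide whether to send the follower its missing transactions or command it to truncate. A sketch under the assumption that each zxid is a single long with the epoch in the high 32 bits and the counter in the low 32 bits (as ZooKeeper packs them); the method and command names are illustrative:

```java
import java.util.*;

// Phase 2 decision sketch: given the leader's committed history and a
// follower's last zxid, emit either a truncate command or the list of
// transactions the follower is missing.
final class Recovery {
    static long zxid(long epoch, long counter) { return (epoch << 32) | counter; }

    static List<String> syncFollower(List<Long> leaderHistory, long followerLastZxid) {
        int idx = leaderHistory.indexOf(followerLastZxid);
        if (idx < 0) {
            // Follower logged a txn the leader never committed: truncate back
            // to the latest leader txn not newer than the follower's.
            long target = 0;
            for (long z : leaderHistory) if (z <= followerLastZxid) target = z;
            return List.of("TRUNC to " + Long.toHexString(target));
        }
        List<String> cmds = new ArrayList<>();
        for (long z : leaderHistory.subList(idx + 1, leaderHistory.size()))
            cmds.add("ADD " + Long.toHexString(z));
        return cmds;
    }
}
```

This is why a single identifier suffices for recovery: the zxid order is total, so one comparison point determines exactly which suffix to ship or cut.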
 27. Performance
 28. Experimental setup
     • Implementation in Java
     • 13 identical servers
       – Xeon 2.50 GHz, Gigabit interface, two SATA disks
     http://zookeeper.apache.org
 29. Throughput
     [Plot: continuous saturated throughput, in operations per second
      (0–70,000), vs. number of servers in the ensemble (2–14), for four
      configurations: net only, net + disk, net + disk (no write cache),
      and the net cap]
 30. Latency
 31. Wrap up
 32. Conclusion
     • ZooKeeper
       – Multiple outstanding operations
       – Dependencies between consecutive updates
     • Zab
       – Primary Order Broadcast
       – Synchronization phase
       – Efficient recovery
 33. Questions?
     http://zookeeper.apache.org
