1. Zab: High-performance broadcast
for primary-backup systems
Flavio Junqueira, Benjamin Reed, Marco Serafini
Yahoo! Research
June 2011
2. Setting up the stage
• Background: ZooKeeper
• Coordination service
  ‣ Web-scale applications
  ‣ Intensive use (high performance)
  ‣ Source of truth for many applications
3. ZooKeeper
• Open source Apache project
• Used in production
  ‣ Yahoo!
  ‣ Facebook
  ‣ Rackspace
  ‣ ...
http://zookeeper.apache.org
4. ZooKeeper
• ... is a leader-based, replicated service
  ‣ Processes crash and recover
• Leader
  ‣ Executes requests
  ‣ Propagates state updates
• Follower
  ‣ Applies state updates
[Figure: a leader and two followers; the leader broadcasts state updates, which the followers deliver via atomic broadcast]
5. ZooKeeper
• Client
  ‣ Submits operations to a server
  ‣ If follower, forwards to leader
  ‣ Leader executes and propagates state update
[Figure: a client submits a request to a follower, which forwards it to the leader; the leader executes it and broadcasts the resulting state update, which the followers deliver via atomic broadcast]
6. ZooKeeper
• State updates
  ‣ All followers apply the same updates
  ‣ All followers apply them in the same order
  ‣ Atomic broadcast
• Performance requirements
  ‣ Multiple outstanding operations
  ‣ Low latency and high throughput
7. ZooKeeper
• Update configuration, then create ready
• If ready exists, then the configuration is consistent
[Figure: the leader broadcasts four updates, which the followers apply in order:
 1. del /cfg/ready   2. setData /cfg/server   3. setData /cfg/client   4. create /cfg/ready]
• If 1 doesn’t commit, then 2+3 can’t commit
• If 2+3 don’t commit, then 4 must not commit
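The ready-znode pattern above can be sketched against a toy in-memory store (a sketch only: `store`, `update_config`, and `read_config` are illustrative names, not the ZooKeeper API). The writer deletes /cfg/ready, rewrites the configuration, and recreates /cfg/ready only after every update is in place, so a reader that sees /cfg/ready sees a consistent configuration:

```python
# Toy in-memory key-value "store" standing in for ZooKeeper znodes.
# All names here are illustrative, not the real client API.
store = {}

def update_config(server, client):
    """Writer side: /cfg/ready is absent while the config is inconsistent."""
    store.pop("/cfg/ready", None)        # 1. del /cfg/ready
    store["/cfg/server"] = server        # 2. setData /cfg/server
    store["/cfg/client"] = client        # 3. setData /cfg/client
    store["/cfg/ready"] = True           # 4. create /cfg/ready

def read_config():
    """Reader side: only trust the config if /cfg/ready exists."""
    if "/cfg/ready" not in store:
        return None                      # config update in progress
    return store["/cfg/server"], store["/cfg/client"]

update_config("s1:2181", "c1:2181")
print(read_config())                     # ('s1:2181', 'c1:2181')
```

The commit dependencies on the slide are what make this safe under atomic broadcast: if step 1 commits but 2+3 do not, step 4 must not commit, so readers never see a half-written configuration marked ready.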
8. ZooKeeper
• Exploring Paxos
  ‣ Efficient consensus protocol
  ‣ State-machine replication
  ‣ Multiple consecutive instances
• Why is it not suitable out of the box?
  ‣ Does not guarantee order
  ‣ Multiple outstanding operations
9. Paxos at a glance
[Figure: three processes exchange messages 1a, 1b, 2a, 2b, 3a; one is proposer + acceptor + learner, the other two are acceptor + learner.
 Phase 1: proposer selects a value to propose; 1b: acceptor promises not to accept lower ballots.
 Phase 2: proposer proposes a value; 2b: if a quorum accepts, the value is chosen.
 Phase 3: value learned]
10. Paxos run
[Figure: acceptors A1, A2, A3 and instances 27, 28, 29. A1 has accepted A (instance 27) and B (instance 28) from P1 (ballot 1); A3 has accepted C (instance 27) from P2 (ballot 2). P3 sends ⟨1a,3⟩ for instances 27, 28, 29; from the ⟨1b⟩ replies it proposes ⟨2a,3,C⟩ for 27, ⟨2a,3,B⟩ for 28, and ⟨2a,3,D⟩ for 29. The resulting sequence interleaves operations of P1, P2, and P3]
11. ZooKeeper
• Another requirement
  ‣ Minimize downtime
  ‣ Efficient recovery
  ‣ Reduce the amount of state transferred
• Zab
  ‣ One identifier
  ‣ Missing values for each process
13. Definitions
• Processes: Lead or Follow
• Followers
  ‣ Maintain a history of transactions (updates)
• Transaction identifiers: ⟨e,c⟩
  ‣ e: epoch number of the leader
  ‣ c: epoch counter
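Since identifiers are pairs ⟨e,c⟩, the total order on transactions is just lexicographic comparison: first by epoch, then by counter. A minimal sketch (the `Zxid` class is illustrative, not from the ZooKeeper codebase):

```python
from functools import total_ordering

@total_ordering
class Zxid:
    """Transaction identifier <e,c>: epoch e, counter c within the epoch."""
    def __init__(self, epoch, counter):
        self.epoch, self.counter = epoch, counter

    def __eq__(self, other):
        return (self.epoch, self.counter) == (other.epoch, other.counter)

    def __lt__(self, other):
        # Epoch dominates: every transaction of a later epoch orders after
        # all transactions of earlier epochs, regardless of the counter.
        return (self.epoch, self.counter) < (other.epoch, other.counter)

# <0,3> precedes <1,1> even though its counter is larger.
print(Zxid(0, 3) < Zxid(1, 1))  # True
```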
14. Properties of PO Broadcast
• Integrity
  ‣ Only broadcast transactions are delivered
  ‣ Leader recovers before broadcasting new transactions
• Total order and agreement
  ‣ Followers deliver the same transactions and in the same order
15. Primary order
• Local: Transactions of a leader accepted in order
• Global: Transactions in history respect the order of epochs
16. Primary order
• Local: Transactions of a primary accepted in order
• Global: Transactions in history respect the order of epochs
[Figure: the leader broadcasts ⟨e,10⟩, ⟨e,11⟩, ⟨e,12⟩ and the follower accepts them in that order]
18. Primary order
• Local: Transactions of a primary accepted in order
• Global: Transactions in history respect the order of epochs
[Figure: the leader broadcasts ⟨e,10⟩ and ⟨e,11⟩; a new leader broadcasts ⟨e’,1⟩; in the follower’s history the epoch-e transactions precede ⟨e’,1⟩]
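Both properties can be checked mechanically over a follower's history: epochs must never go backwards, and within an epoch counters must strictly increase. A small checker, with zxids as (epoch, counter) tuples (the function name is illustrative):

```python
def respects_primary_order(history):
    """history: list of (epoch, counter) zxids as accepted by a follower.
    Local: within an epoch, counters strictly increase.
    Global: epochs are non-decreasing across the history."""
    for (e1, c1), (e2, c2) in zip(history, history[1:]):
        if e2 < e1:                 # epoch went backwards: global violation
            return False
        if e2 == e1 and c2 <= c1:   # counter out of order: local violation
            return False
    return True

print(respects_primary_order([(0, 1), (0, 2), (1, 1)]))  # True
print(respects_primary_order([(1, 1), (0, 2)]))          # False
```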
20. Zab in Phases
• Phase 0 - Leader election
  ‣ Prospective leader elected
• Phase 1 - Discovery
  ‣ Followers promise not to go back to previous epochs
  ‣ Followers send their last epoch and history
  ‣ Leader selects longest history of latest epoch
21. Zab in Phases
• Phase 2 - Synchronization
  ‣ Leader sends new history to followers
  ‣ Followers confirm leadership
• Phase 3 - Broadcast
  ‣ Leader proposes new transactions
  ‣ Leader commits if quorum acknowledges
22. Zab in Phases
• Phases 1 and 2: Recovery
  ‣ Critical to guarantee order with multiple outstanding transactions
• Phase 3: Broadcast
  ‣ Just like Phases 2 and 3 of Paxos
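The broadcast phase above can be sketched as a quorum-commit loop (a sketch only: the `Leader`/`Follower` classes are illustrative, not the Zab implementation). The leader assigns the next ⟨e,c⟩, sends the proposal to every follower, and commits once a quorum acknowledges:

```python
class Follower:
    """Illustrative follower: accepts proposals, delivers on commit."""
    def __init__(self):
        self.history, self.delivered = [], []

    def accept(self, zxid, update):
        self.history.append((zxid, update))
        return True                          # ack the proposal

    def commit(self, zxid):
        self.delivered.append(zxid)          # deliver in commit order

class Leader:
    """Minimal sketch of Zab Phase 3 (broadcast)."""
    def __init__(self, epoch, followers):
        self.epoch, self.counter = epoch, 0
        self.followers = followers
        self.committed = []

    def broadcast(self, update):
        self.counter += 1
        zxid = (self.epoch, self.counter)    # next transaction id <e,c>
        acks = 1                             # leader acks its own proposal
        for f in self.followers:
            if f.accept(zxid, update):
                acks += 1
        # Commit once a quorum (majority of leader + followers) acknowledges.
        if acks > (len(self.followers) + 1) // 2:
            self.committed.append((zxid, update))
            for f in self.followers:
                f.commit(zxid)
        return zxid

leader = Leader(epoch=1, followers=[Follower(), Follower()])
leader.broadcast("setData /cfg/server")
leader.broadcast("create /cfg/ready")
print(leader.committed[-1][0])  # (1, 2)
```

Because the leader assigns counters from a single sequence and followers ack over FIFO channels, every committed transaction of an epoch is delivered in the order it was broadcast, even with multiple proposals outstanding.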
23. Zab: Sample run
[Figure: followers f1, f2, f3 all accept ⟨0,1⟩; f1 and f2 also accept ⟨0,2⟩; f1 alone accepts ⟨0,3⟩. In the new epoch f1.a = f2.a = f3.a = 0, with last zxids ⟨0,3⟩, ⟨0,2⟩, ⟨0,1⟩; f1’s history, the longest of the latest epoch, becomes the initial history of the new epoch]
24. Zab: Sample run
[Figure: f1, f2, f3 accept ⟨0,1⟩ and ⟨0,2⟩; f1 and f2 accept ⟨1,1⟩, which is chosen; f1 also accepts ⟨1,2⟩. New epoch with f1.a = 1, f2.a = 1, f3.a = 2 and last zxids ⟨1,2⟩, ⟨1,1⟩, ⟨0,2⟩: can’t happen!]
26. Notes on implementation
• Use of TCP
  ‣ Ordered delivery, retransmissions, etc.
  ‣ Notion of session
• Elect leader with most committed txns
  ‣ No follower → leader copies
• Recovery
  ‣ Last zxid is sufficient
  ‣ In Phase 2, leader commands to add or truncate
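Both points lean on the zxid: elect the process with the highest last zxid (so nothing needs to be copied from a follower to the leader), and sync a follower from its last zxid alone, either sending the transactions it is missing or telling it to truncate. A sketch with zxids as (epoch, counter) tuples (function names are illustrative):

```python
def elect_leader(last_zxids):
    """last_zxids: dict server -> last (epoch, counter).
    Pick the server with the most up-to-date history."""
    return max(last_zxids, key=last_zxids.get)

def sync_commands(leader_history, follower_last):
    """Phase 2 sketch: from the follower's last zxid alone, the leader
    decides what to send (ADD missing transactions) or orders a TRUNC."""
    zxids = [z for z, _ in leader_history]
    if follower_last is None or follower_last in zxids:
        # Follower is a prefix of the leader: send what it is missing.
        start = 0 if follower_last is None else zxids.index(follower_last) + 1
        return [("ADD", t) for t in leader_history[start:]]
    # Follower accepted transactions the leader never committed: drop them.
    return [("TRUNC", None)]

last = {"s1": (1, 2), "s2": (1, 1), "s3": (0, 2)}
print(elect_leader(last))  # s1
```

This is why a single identifier suffices for recovery: since each history is a prefix of the leader's (or diverges only in uncommitted suffixes), the last zxid pinpoints exactly which values each process is missing.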
28. Experimental setup
• Implementation in Java
• 13 identical servers
  ‣ Xeon 2.50GHz, Gigabit interface, two SATA disks
29. Throughput
[Figure: continuous saturated throughput, in operations per second (0 to 70,000), vs. number of servers in the ensemble (2 to 14), for four series: net only, net + disk, net + disk (no write cache), and net cap]