Zab: High-performance broadcast
  for primary-backup systems
  Flavio Junqueira, Benjamin Reed, Marco Serafini

                Yahoo! Research
                     June 2011
Setting the stage


•   Background: ZooKeeper
•   Coordination service
    - Web-scale applications
    - Intensive use (high performance)
    - Source of truth for many applications

ZooKeeper

•   Open source Apache project
•   Used in production
    - Yahoo!
    - Facebook
    - Rackspace
    - ...

http://zookeeper.apache.org

ZooKeeper

•   ... is a leader-based, replicated service
    - Processes crash and recover

•   Leader
    - Executes requests
    - Propagates state updates

•   Follower
    - Applies state updates

[Figure: a leader and two followers connected by atomic broadcast; the leader broadcasts and the followers deliver]

ZooKeeper

•   Client
    - Submits operations to a server
    - If follower, forwards to leader
    - Leader executes and propagates state update

[Figure: a client sends a request to a server; the leader broadcasts and the followers deliver over atomic broadcast]

ZooKeeper

•   State updates
    - All followers apply the same updates
    - All followers apply them in the same order
    - Atomic broadcast

•   Performance requirements
    - Multiple outstanding operations
    - Low latency and high throughput

ZooKeeper

•   Update configuration and create ready
•   If ready exists, then configuration is consistent

[Figure: the leader broadcasts four updates to two followers: del /cfg/ready, setData /cfg/server, setData /cfg/client, create /cfg/ready]

•   If 1 doesn’t commit, then 2+3 can’t commit
•   If 2+3 don’t commit, then 4 must not commit

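The two constraints above amount to requiring that the committed updates form a prefix of the sequence the leader issued. A minimal sketch of that check follows; the function name and the issue order of the four operations are assumptions made for illustration, not something the slides spell out.

```python
def is_committed_prefix(issued, committed):
    """Return True if `committed` is a prefix of `issued` (same order, no gaps)."""
    return list(committed) == list(issued[:len(committed)])


# Assumed issue order for the configuration change (hypothetical reading of the figure).
issued = [
    "del /cfg/ready",
    "setData /cfg/server",
    "setData /cfg/client",
    "create /cfg/ready",
]

# Committing the first three updates leaves no gap.
ok = is_committed_prefix(issued, issued[:3])
# Committing `create /cfg/ready` while skipping a setData leaves a gap.
bad = is_committed_prefix(issued, [issued[0], issued[1], issued[3]])
print(ok, bad)
```

With a gap, `/cfg/ready` could exist while part of the configuration is stale, which is the situation the slide rules out.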
ZooKeeper

•   Exploring Paxos
    - Efficient consensus protocol
    - State-machine replication
    - Multiple consecutive instances

•   Why is it not suitable out of the box?
    - Does not guarantee order
    - Multiple outstanding operations

Paxos at a glance

•   Roles: one process is Acceptor + Proposer + Learner; the other two are Acceptor + Learner
•   Phase 1 (messages 1a, 1b): selects value to propose
    - 1b: Acceptor promises not to accept lower ballots
•   Phase 2 (messages 2a, 2b): proposes a value
    - 2b: If quorum, value is chosen
•   Phase 3 (message 3a): value learned

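The acceptor's side of Phases 1 and 2 can be sketched as follows. This is a textbook single-instance acceptor written for illustration; the class and method names are hypothetical and it is not taken from any ZooKeeper code.

```python
class Acceptor:
    """Single-instance Paxos acceptor (illustrative sketch)."""

    def __init__(self):
        self.promised = -1     # highest ballot seen so far
        self.accepted = None   # (ballot, value) last accepted, if any

    def on_prepare(self, ballot):
        """Handle 1a: promise not to accept lower ballots, report what was accepted."""
        if ballot > self.promised:
            self.promised = ballot
            return ("1b", self.accepted)
        return None  # ignore stale ballots

    def on_accept(self, ballot, value):
        """Handle 2a: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return ("2b", ballot, value)
        return None


a = Acceptor()
first = a.on_accept(1, "A")     # accepts A at ballot 1
promise = a.on_prepare(3)       # promises ballot 3 and reports <1, A>
stale = a.on_accept(2, "C")     # ballot 2 is now too low
print(first, promise, stale)
```

A proposer that collects such 1b replies from a quorum must propose the value reported with the highest ballot, or its own value if none was reported.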
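Paxos run

[Figure: message diagram for Paxos instances 27, 28, and 29 with proposer P3 and acceptors A1, A2, A3]

•   Interleaves operations of P1, P2, and P3
•   P3
    - Phase 1a: 27: <1a, 3>, 28: <1a, 3>, 29: <1a, 3>
    - Phase 2a: 27: <2a, 3, C>, 28: <2a, 3, B>, 29: <2a, 3, D>
•   A1 (has accepted A and B from P1)
    - Accepted before: 27: <1, A>, 28: <1, B>
    - Replies: 27: <1b, 1, A>, 28: <1b, 1, B>, 29: <1b, _, _>
•   A2 (has accepted C from P2)
    - Accepted before: 27: <2, C>
    - Accepted after: 27: <3, C>, 28: <3, B>, 29: <3, D>
•   A3
    - Accepted before: 27: <2, C>
    - Replies: 27: <1b, 2, C>, 28: <1b, _, _>, 29: <1b, _, _>
    - Accepted after: 27: <3, C>, 28: <3, B>, 29: <3, D>
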
ZooKeeper

•   Another requirement
    - Minimize downtime
    - Efficient recovery

•   Reduce the amount of state transferred
•   Zab
    - One identifier
    - Missing values for each process

Zab and PO Broadcast
Definitions

•   Processes: Lead or Follow
•   Followers
    - Maintain a history of transactions (updates)

•   Transaction identifiers: ⟨e,c⟩

    - e : epoch number of the leader
    - c : epoch counter

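One way to picture the identifier ⟨e,c⟩ is as a pair compared by epoch first and counter second. The sketch below assumes that lexicographic order; the `Zxid` class and its field names are illustrative choices, not the representation used by the authors' implementation.

```python
from typing import NamedTuple


class Zxid(NamedTuple):
    """Transaction identifier <e, c>; tuples compare epoch first, then counter."""
    epoch: int
    counter: int


ids = [Zxid(1, 2), Zxid(0, 3), Zxid(1, 1)]
ordered = sorted(ids)
print(ordered)
```

Any transaction of epoch 0 sorts before every transaction of epoch 1, regardless of its counter.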
Properties of PO Broadcast


•   Integrity
    - Only broadcast transactions are delivered
    - Leader recovers before broadcasting new transactions

•   Total order and agreement
    - Followers deliver the same transactions and in the
      same order

Primary order

•   Local: Transactions of a leader accepted in
    order
•   Global: Transactions in history respect the
    order of epochs
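The two conditions can be read as a check over a follower's history. The sketch below is a simplified illustration under that reading: it represents each transaction as an `(epoch, counter)` pair and only checks ordering, not whether transactions are missing. The function name is hypothetical.

```python
def respects_primary_order(history):
    """history: list of (epoch, counter) pairs in the order they were accepted."""
    for prev, cur in zip(history, history[1:]):
        if cur[0] < prev[0]:
            return False  # global order: an older epoch appears after a newer one
        if cur[0] == prev[0] and cur[1] <= prev[1]:
            return False  # local order: counters of one epoch must increase
    return True


in_order = respects_primary_order([(0, 10), (0, 11), (1, 1)])
mixed_epochs = respects_primary_order([(0, 10), (1, 1), (0, 11)])
print(in_order, mixed_epochs)
```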




Primary order

•    Local: Transactions of a primary accepted in
     order
•    Global: Transactions in history respect the
     order of epochs

[Figure: Leader broadcasts abcast(⟨e,10⟩), abcast(⟨e,11⟩), abcast(⟨e,12⟩) to Follower]

Primary order

•     Local: Transactions of a primary accepted in
      order
•     Global: Transactions in history respect the
      order of epochs

[Figure: Leader broadcasts abcast(⟨e,10⟩) and abcast(⟨e,11⟩); a new leader, Leader’, broadcasts abcast(⟨e’,1⟩); both reach Follower]

Zab in Phases

•   Phase 0 - Leader election
    - Prospective leader elected

•   Phase 1 - Discovery
    - Followers promise not to go back to previous
      epochs
    - Followers send the prospective leader their last epoch and history
    - Prospective leader selects longest history of latest epoch

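The selection rule in Phase 1 can be sketched as a maximum over the followers' replies: prefer the latest epoch, and among those the longest history. The reply format and function name below are assumptions made for the example; the sample values mirror the first sample run later in the deck.

```python
def select_initial_history(replies):
    """replies maps a follower name to (last_epoch, history).

    Returns the longest history among the replies with the latest epoch.
    """
    _, history = max(replies.values(), key=lambda r: (r[0], len(r[1])))
    return list(history)


replies = {
    "f1": (0, [(0, 1), (0, 2), (0, 3)]),
    "f2": (0, [(0, 1), (0, 2)]),
    "f3": (0, [(0, 1)]),
}
initial = select_initial_history(replies)
print(initial)
```

A follower with a later epoch wins even if its history is shorter, since the epoch is compared first.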
Zab in Phases

•   Phase 2 - Synchronization
    - Prospective leader sends new history to followers
    - Followers confirm leadership

•   Phase 3 - Broadcast
    - Leader proposes new transactions
    - Leader commits if quorum acknowledges

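The commit rule in Phase 3 can be sketched as a small tracker of acknowledgements. Two assumptions are made here that the slides do not state: a quorum is a strict majority of the ensemble, and proposals commit strictly in identifier order. The class and method names are hypothetical.

```python
class ProposalTracker:
    """Tracks acknowledgements for outstanding proposals (illustrative sketch)."""

    def __init__(self, ensemble_size):
        # Assumption: a quorum is a strict majority of the ensemble.
        self.quorum = ensemble_size // 2 + 1
        self.acks = {}        # zxid -> set of servers that acknowledged it
        self.committed = []   # zxids committed so far, in order

    def propose(self, zxid):
        self.acks[zxid] = set()

    def ack(self, zxid, server):
        """Record an acknowledgement; return the zxids that became committed."""
        self.acks[zxid].add(server)
        newly_committed = []
        # Commit in zxid order: stop at the first proposal lacking a quorum.
        for pending in sorted(self.acks):
            if len(self.acks[pending]) < self.quorum:
                break
            newly_committed.append(pending)
        for done in newly_committed:
            del self.acks[done]
        self.committed.extend(newly_committed)
        return newly_committed


tracker = ProposalTracker(ensemble_size=3)
tracker.propose((1, 1))
tracker.propose((1, 2))
tracker.ack((1, 2), "leader")
early = tracker.ack((1, 2), "f1")   # quorum for <1,2>, but <1,1> is still pending
tracker.ack((1, 1), "leader")
both = tracker.ack((1, 1), "f1")    # <1,1> reaches a quorum, so both commit in order
print(early, both)
```

Holding back ⟨1,2⟩ until ⟨1,1⟩ commits is what lets several proposals be outstanding at once without breaking the order.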
Zab in Phases


•   Phases 1 and 2: Recovery
    - Critical to guarantee order with multiple
      outstanding transactions

•   Phase 3: Broadcast
    - Just like Phases 2 and 3 of Paxos

Zab: Sample run

                  f1          f2          f3

                ⟨0,1⟩       ⟨0,1⟩       ⟨0,1⟩
                ⟨0,2⟩       ⟨0,2⟩
                ⟨0,3⟩

New epoch     f1.a = 0,   f2.a = 0,   f3.a = 0,
                ⟨0,3⟩       ⟨0,2⟩       ⟨0,1⟩

Initial history of new epoch

Zab: Sample run

                  f1          f2          f3

                ⟨0,1⟩       ⟨0,1⟩       ⟨0,1⟩
                ⟨0,2⟩       ⟨0,2⟩       ⟨0,2⟩
    Chosen!     ⟨1,1⟩       ⟨1,1⟩
                ⟨1,2⟩

New epoch     f1.a = 1,   f2.a = 1,   f3.a = 2,
                ⟨1,2⟩       ⟨1,1⟩       ⟨0,2⟩

Can’t happen!

Paxos run (revisited)

                Epoch 1, Phase 3      Epoch 2, Phase 3      Epoch 3, Phase 3
                L1 History: ∅         L2 History: ∅         L3 History: ⟨2,1⟩,C

Follower 1      Epoch: 1              Epoch: 1              Epoch: 3
                ⟨1,1⟩,A               ⟨1,1⟩,A               ⟨2,1⟩,C
                ⟨1,2⟩,B               ⟨1,2⟩,B               ⟨3,1⟩,D

Follower 2      Epoch: 1              Epoch: 2              Epoch: 2
                ∅                     ⟨2,1⟩,C               ⟨2,1⟩,C

Follower 3      Epoch: 1              Epoch: 2              Epoch: 3
                ∅                     ⟨2,1⟩,C               ⟨2,1⟩,C
                                                            ⟨3,1⟩,D

Phases 1 and 2 of Epoch 2 run between the first and second columns; Phases 1 and 2 of Epoch 3 run between the second and third.

Notes on implementation

•   Use of TCP
    - Ordered delivery, retransmissions, etc.
    - Notion of session

•   Elect leader with most committed txns
    - No follower → leader copies

•   Recovery
    - Last zxid is sufficient
    - In Phase 2, leader commands followers to add or truncate transactions

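The recovery note can be sketched as a function that turns a follower's last zxid into add or truncate commands. This is an illustration only: the command names, the function name, and the `(epoch, counter)` representation are hypothetical, and the sketch assumes the follower's history agrees with the leader's up to the truncation point.

```python
from bisect import bisect_right


def sync_commands(leader_history, follower_last):
    """Return the commands a leader might send to bring one follower in sync.

    leader_history: sorted list of (epoch, counter) zxids.
    follower_last: the follower's last zxid, or None if its history is empty.
    """
    if follower_last is None:
        return [("ADD", list(leader_history))]
    # Number of leader transactions with zxid <= follower_last.
    keep = bisect_right(leader_history, follower_last)
    commands = []
    if keep == 0 or leader_history[keep - 1] != follower_last:
        # The follower's last transaction is unknown to the leader: roll it back.
        target = leader_history[keep - 1] if keep else None
        commands.append(("TRUNCATE", target))
    missing = leader_history[keep:]
    if missing:
        commands.append(("ADD", missing))
    return commands


leader = [(0, 1), (0, 2), (1, 1)]
behind = sync_commands(leader, (0, 1))     # follower only lacks the tail
diverged = sync_commands(leader, (0, 3))   # follower holds a transaction the leader lacks
print(behind, diverged)
```

Only the last zxid crosses the wire in this sketch; the leader derives everything else from its own history.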
Performance
Experimental setup


•   Implementation in Java
•   13 identical servers
    - Xeon 2.50 GHz, Gigabit interface, two SATA
      disks

http://zookeeper.apache.org

Throughput

[Figure: Continuous saturated throughput. x-axis: Number of servers in ensemble (2 to 14); y-axis: Operations per second (0 to 70000); series: Net only, Net + Disk, Net + Disk (no write cache), Net cap]

Latency




Wrap up
Conclusion

•   ZooKeeper
    - Multiple outstanding operations
    - Dependencies between consecutive updates

•   Zab
    - Primary Order Broadcast
    - Synchronization phase
    - Efficient recovery

Questions?


http://zookeeper.apache.org

Zab dsn-2011

  • 1. Zab: High-performance broadcast for primary-backup systems Flavio Junqueira, Benjamin Reed, Marco Serafini Yahoo! Research June 2011
  • 2. Setting up the stage • Background: ZooKeeper • Coordination service ! Web-scale applications ! Intensive use (high performance) ! Source of truth for many applications June 2011 2
  • 3. ZooKeeper • Open source Apache project • Used in production ! Yahoo! ! Facebook ! Rackspace ! ... http://zookeeper.apache.org June 2011 3
  • 4. ZooKeeper • ... is a leader-based, replicated service ! Processes crash and recover • Leader ! Executes requests Leader Follower Follower ! Propagates state updates Broadcast Deliver Deliver • Follower Atomic broadcast ! Applies state updates June 2011 4
  • 5. ZooKeeper • Client Client ! Submits operations to a server Request ! If follower, forwards to Leader Follower Follower leader Broadcast Deliver Deliver ! Leader executes and propagates state update Atomic broadcast June 2011 5
  • 6. ZooKeeper • State updates ! All followers apply the same updates ! All followers apply them in the same order ! Atomic broadcast • Performance requirements ! Multiple outstanding operations ! Low latency and high throughput June 2011 6
  • 7. ZooKeeper • Update configuration and create ready • If ready exists, then configuration is consistent setData del setData /cfg/client /cfg/ready /cfg/server create B B Follower /cfg/ready Leader create /cfg/ready setData Follower /cfg/server setData B /cfg/client del B /cfg/ready • If 1 doesn’t commit, then 2+3 can’t • If 2+3 don’t commit, then 4 must not commit commit June 2011 7
  • 8. ZooKeeper • Exploring Paxos ! Efficient consensus protocol ! State-machine replication ! Multiple consecutive instances • Why is it not suitable out of the box? ! Does not guarantee order ! Multiple outstanding operations June 2011 8
  • 9. Paxos at a glance 1b: Acceptor promises 2b: If quorum, value not to accept lower is chosen ballots Acceptor + Learner 1a 1b 2a 2b 3a Acceptor + Proposer + Learner 1a 1b 2a 2b 3a Acceptor + Learner Phase 1: Phase 2: Phase 3: Selects Proposes Value value to a value learned propose June 2011 9
  • 10. Paxos run Interleaves operations of P1, 27: <1a,3> 27: <2a, 3, C> P2, and and P3 28: <1a,3> 28: <2a, 3, B> 29: <1a,3> 29: <2a, 3, D> P3 Has accepted A and B from P1 A1 27: <1, A> 27: <1b, 1, A> 28: <1, B> 28: <1b, 1, B> 29: <1b, _, _> A2 Has 27: <3, C> 27: <2, C> accepted C 28: <3, B> from P2 29: <3, D> A3 27: <2, C> 27: <1b, 2, C> 27: <3, C> 28: <1b, _, _> 28: <3, B> 29: <1b, _, _> 29: <3, D> June 2011 10
  • 11. ZooKeeper
      • Another requirement
          - Minimize downtime
          - Efficient recovery
              • Reduce the amount of state transferred
      • Zab
          - One identifier
          - Missing values for each process
  • 12. Zab and PO Broadcast
  • 13. Definitions
      • Processes: Lead or Follow
      • Followers
          - Maintain a history of transactions (updates)
      • Transaction identifiers: ⟨e,c⟩
          - e : epoch number of the leader
          - c : epoch counter
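An identifier ⟨e,c⟩ orders first by epoch and then by counter. A minimal Java sketch of that ordering follows; the class name `Zxid`, its fields, and the packing used in `toLong` (epoch in the high 32 bits, counter in the low 32 bits) are assumptions for illustration and are not stated on the slide.

```java
// Hypothetical sketch of a transaction identifier <e,c> (illustrative names only).
final class Zxid implements Comparable<Zxid> {
    final int epoch;    // e: epoch number of the leader
    final int counter;  // c: counter within the epoch

    Zxid(int epoch, int counter) {
        this.epoch = epoch;
        this.counter = counter;
    }

    // Order by epoch first, then by counter within the same epoch.
    @Override
    public int compareTo(Zxid other) {
        if (epoch != other.epoch) {
            return Integer.compare(epoch, other.epoch);
        }
        return Integer.compare(counter, other.counter);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Zxid && compareTo((Zxid) o) == 0;
    }

    @Override
    public int hashCode() {
        return 31 * epoch + counter;
    }

    // Assumed packing: epoch in the high 32 bits, counter in the low 32 bits.
    long toLong() {
        return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    @Override
    public String toString() {
        return "<" + epoch + "," + counter + ">";
    }
}
```

With this ordering, any transaction of a later epoch compares greater than every transaction of an earlier epoch, regardless of counter.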
  • 14. Properties of PO Broadcast
      • Integrity
          - Only broadcast transactions are delivered
          - Leader recovers before broadcasting new transactions
      • Total order and agreement
          - Followers deliver the same transactions and in the same order
  • 15. Primary order
      • Local: Transactions of a leader accepted in order
      • Global: Transactions in history respect the order of epochs
  • 16. Primary order
      • Local: Transactions of a primary accepted in order
      • Global: Transactions in history respect the order of epochs
      [Figure: Leader issues abcast(⟨e,10⟩), abcast(⟨e,11⟩), abcast(⟨e,12⟩) to a Follower]
  • 18. Primary order
      • Local: Transactions of a primary accepted in order
      • Global: Transactions in history respect the order of epochs
      [Figure: Leader issues abcast(⟨e,10⟩), abcast(⟨e,11⟩); Leader’ issues abcast(⟨e’,1⟩); one Follower]
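The two primary-order conditions can be checked on a follower's history with a short Java sketch. The class `PrimaryOrderCheck` and the `{epoch, counter}` array encoding are hypothetical. The sketch only checks that identifiers never go backwards; it does not check for gaps in the counter.

```java
// Hypothetical primary-order check over a history of {epoch, counter} pairs.
final class PrimaryOrderCheck {
    // Returns true if every transaction follows its predecessor either within
    // the same epoch with a larger counter (local order) or in a later epoch
    // (global order).
    static boolean respectsPrimaryOrder(int[][] history) {
        for (int i = 1; i < history.length; i++) {
            int[] prev = history[i - 1];
            int[] cur = history[i];
            boolean sameEpochInOrder = cur[0] == prev[0] && cur[1] > prev[1];
            boolean laterEpoch = cur[0] > prev[0];
            if (!sameEpochInOrder && !laterEpoch) {
                return false;
            }
        }
        return true;
    }
}
```

For example, taking e = 0 and e’ = 1, the history ⟨0,10⟩, ⟨0,11⟩, ⟨1,1⟩ passes, while ⟨0,10⟩, ⟨1,1⟩, ⟨0,11⟩ fails because a transaction of the old epoch follows one of the new epoch.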
  • 20. Zab in Phases
      • Phase 0 - Leader election
          - Prospective leader elected
      • Phase 1 - Discovery
          - Followers promise not to go back to previous epochs
          - Followers send their last epoch and history to the prospective leader
          - Prospective leader selects the longest history of the latest epoch
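The selection rule in Phase 1 (longest history of the latest epoch) can be sketched in Java as follows. `FollowerReport` and `Discovery` are hypothetical names, and representing a history as a list of strings is an assumption made to keep the example short.

```java
import java.util.List;

// Hypothetical report a follower sends during discovery.
final class FollowerReport {
    final String follower;       // follower name, e.g. "f1"
    final int lastEpoch;         // last epoch the follower reports
    final List<String> history;  // the follower's history of transactions

    FollowerReport(String follower, int lastEpoch, List<String> history) {
        this.follower = follower;
        this.lastEpoch = lastEpoch;
        this.history = history;
    }
}

final class Discovery {
    // Picks the report with the latest epoch; among reports with that epoch,
    // picks the longest history. Returns null if there are no reports.
    static FollowerReport selectInitialHistory(List<FollowerReport> reports) {
        FollowerReport best = null;
        for (FollowerReport r : reports) {
            if (best == null
                    || r.lastEpoch > best.lastEpoch
                    || (r.lastEpoch == best.lastEpoch
                        && r.history.size() > best.history.size())) {
                best = r;
            }
        }
        return best;
    }
}
```

The epoch takes priority over length: a shorter history from a later epoch wins over a longer history from an earlier one.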
  • 21. Zab in Phases
      • Phase 2 - Synchronization
          - Prospective leader sends new history to followers
          - Followers confirm leadership
      • Phase 3 - Broadcast
          - Leader proposes new transactions
          - Leader commits if a quorum acknowledges
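The commit rule in Phase 3 can be sketched as a per-proposal acknowledgement tracker. This is a minimal Java sketch; `ProposalAcks` is a hypothetical name, and the sketch assumes a quorum is a strict majority of the ensemble and that the caller decides whether the leader's own acknowledgement is counted.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical tracker of acknowledgements for a single proposal.
final class ProposalAcks {
    private final int ensembleSize;
    private final Set<String> ackers = new HashSet<>();

    ProposalAcks(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    // Records an acknowledgement (duplicates are ignored) and returns true
    // once a strict majority of the ensemble has acknowledged.
    boolean ack(String serverId) {
        ackers.add(serverId);
        return hasQuorum();
    }

    boolean hasQuorum() {
        return ackers.size() > ensembleSize / 2;
    }
}
```

In a five-server ensemble the third distinct acknowledgement makes `ack` return `true`, at which point the proposal can be committed.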
  • 22. Zab in Phases
      • Phases 1 and 2: Recovery
          - Critical to guarantee order with multiple outstanding transactions
      • Phase 3: Broadcast
          - Just like Phases 2 and 3 of Paxos
  • 23. Zab: Sample run
      f1       f2       f3
      ⟨0,1⟩    ⟨0,1⟩    ⟨0,1⟩
      ⟨0,2⟩    ⟨0,2⟩
      ⟨0,3⟩
      • New epoch: f1.a = 0, ⟨0,3⟩; f2.a = 0, ⟨0,2⟩; f3.a = 0, ⟨0,1⟩
      • Initial history of new epoch
  • 24. Zab: Sample run
      f1       f2       f3
      ⟨0,1⟩    ⟨0,1⟩    ⟨0,1⟩
      ⟨0,2⟩    ⟨0,2⟩    ⟨0,2⟩    Chosen!
      ⟨1,1⟩    ⟨1,1⟩
      ⟨1,2⟩
      • New epoch: f1.a = 1, ⟨1,2⟩; f2.a = 1, ⟨1,1⟩; f3.a = 2, ⟨0,2⟩
      • Can’t happen!
  • 25. Paxos run (revisited)
      [Figure: three Phase 3 columns; Phases 1 and 2 of Epoch 2 and Phases 1 and 2 of Epoch 3 run between them]
                    Epoch 1, Phase 3      Epoch 2, Phase 3      Epoch 3, Phase 3
      Leader        L1, History: ∅        L2, History: ∅        L3, History: ⟨2,1⟩,C
      Follower 1    Epoch: 1              Epoch: 1              Epoch: 3
                    ⟨1,1⟩,A ⟨1,2⟩,B       ⟨1,1⟩,A ⟨1,2⟩,B       ⟨2,1⟩,C ⟨3,1⟩,D
      Follower 2    Epoch: 1              Epoch: 2              Epoch: 2
                    ∅                     ⟨2,1⟩,C               ⟨2,1⟩,C
      Follower 3    Epoch: 1              Epoch: 2              Epoch: 3
                    ∅                     ⟨2,1⟩,C               ⟨2,1⟩,C ⟨3,1⟩,D
  • 26. Notes on implementation
      • Use of TCP
          - Ordered delivery, retransmissions, etc.
          - Notion of session
      • Elect leader with most committed txns
          - No follower-to-leader copies
      • Recovery
          - Last zxid is sufficient
          - In Phase 2, leader commands to add or truncate
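The recovery note (the last zxid is sufficient; the leader commands to add or truncate) can be sketched as a decision function in Java. Everything here is a simplified assumption for illustration: the names `SyncPlanner` and `Command`, zxids encoded as `long` values in increasing order, and `0` standing for an empty history. It is not the ZooKeeper implementation.

```java
import java.util.List;

// Hypothetical planner for the Phase 2 command sent to one follower.
final class SyncPlanner {
    enum Command { NONE, ADD, TRUNCATE }

    // leaderHistory: the leader's zxids in increasing order.
    // followerLast: the last zxid the follower reports (0 if its history is empty).
    static Command plan(List<Long> leaderHistory, long followerLast) {
        long leaderLast = leaderHistory.isEmpty()
                ? 0L
                : leaderHistory.get(leaderHistory.size() - 1);
        if (followerLast == leaderLast) {
            return Command.NONE;      // already in sync
        }
        if (followerLast == 0L || leaderHistory.contains(followerLast)) {
            return Command.ADD;       // follower holds a prefix: send the missing suffix
        }
        // The follower holds a transaction the leader does not have, so it must
        // drop it; it may still need transactions added after truncating.
        return Command.TRUNCATE;
    }
}
```

Because the leader is elected with the most committed transactions, it never needs to copy anything from a follower; it only decides what each follower must add or drop.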
  • 28. Experimental setup
      • Implementation in Java
      • 13 identical servers
          - Xeon 2.50GHz, Gigabit interface, two SATA disks
      http://zookeeper.apache.org
  • 29. Throughput
      [Figure: Continuous saturated throughput. x-axis: Number of servers in ensemble (2 to 14); y-axis: Operations per second (0 to 70000); series: Net only, Net + Disk, Net + Disk (no write cache), Net cap]
  • 30. Latency
  • 32. Conclusion
      • ZooKeeper
          - Multiple outstanding operations
          - Dependencies between consecutive updates
      • Zab
          - Primary Order Broadcast
          - Synchronization phase
          - Efficient recovery