SlideShare a Scribd company logo
1 of 33
Zab: High-performance broadcast
  for primary-backup systems
  Flavio Junqueira, Benjamin Reed, Marco Serafini

                Yahoo! Research
                     June 2011
Setting up the stage


•   Background: ZooKeeper
•   Coordination service
    ! Web-scale applications
    ! Intensive use (high performance)
    ! Source of truth for many applications


                        June 2011             2
ZooKeeper

•   Open source Apache project
•   Used in production
    ! Yahoo!
    ! Facebook
    ! Rackspace
    ! ...
                                http://zookeeper.apache.org

                    June 2011                             3
ZooKeeper

•   ... is a leader-based, replicated service
    ! Processes crash and recover

•   Leader
    ! Executes requests
                                          Leader     Follower     Follower
    ! Propagates state updates
                                     Broadcast     Deliver      Deliver

•   Follower
                                                 Atomic broadcast
    ! Applies state updates

                              June 2011                                   4
ZooKeeper

•   Client
                                                    Client
    ! Submits operations to a
      server                                               Request

    ! If follower, forwards to          Leader     Follower      Follower
      leader
                                   Broadcast     Deliver       Deliver
    ! Leader executes and
      propagates state update                  Atomic broadcast


                            June 2011                                    5
ZooKeeper

•   State updates
    ! All followers apply the same updates
    ! All followers apply them in the same order
    ! Atomic broadcast

•   Performance requirements
    ! Multiple outstanding operations
    ! Low latency and high throughput
                         June 2011                 6
ZooKeeper
• Update configuration and create ready
• If ready exists, then configuration is
consistent
                                                    setData        del
                                     setData      /cfg/client   /cfg/ready
                                    /cfg/server
                         create          B
                                                       B
                                                                             Follower
                       /cfg/ready

         Leader
                        create
                      /cfg/ready     setData                                 Follower
                                    /cfg/server     setData
                                         B        /cfg/client      del
                                                       B        /cfg/ready




    • If 1 doesn’t commit, then 2+3 can’t                • If 2+3 don’t commit, then 4 must not
    commit                                               commit
                                             June 2011                                       7
ZooKeeper

•   Exploring Paxos
    ! Efficient consensus protocol
    ! State-machine replication
    ! Multiple consecutive instances

•   Why is it not suitable out of the box?
    ! Does not guarantee order
    ! Multiple outstanding operations

                        June 2011            8
Paxos at a glance
                     1b: Acceptor promises         2b: If quorum, value
                      not to accept lower                 is chosen
                             ballots
Acceptor + Learner


                     1a               1b        2a                  2b 3a
    Acceptor +
Proposer + Learner

                      1a              1b          2a                2b    3a
Acceptor + Learner

                          Phase 1:                       Phase 2:           Phase 3:
                           Selects                       Proposes            Value
                          value to                        a value           learned
                          propose

                                             June 2011                                 9
Paxos run                                           Interleaves
                                                                             operations of P1,
           27: <1a,3>                                    27: <2a, 3, C>      P2, and and P3
           28: <1a,3>                                    28: <2a, 3, B>
           29: <1a,3>                                    29: <2a, 3, D>
P3
                        Has
                   accepted A and
                     B from P1
A1
     27: <1, A>               27: <1b, 1, A>
     28: <1, B>               28: <1b, 1, B>
                              29: <1b, _, _>
A2
                             Has                                          27: <3, C>
     27: <2, C>
                         accepted C                                       28: <3, B>
                           from P2                                        29: <3, D>
A3
     27: <2, C>                         27: <1b, 2, C>          27: <3, C>
                                        28: <1b, _, _>          28: <3, B>
                                        29: <1b, _, _>
                                                                29: <3, D>




                                          June 2011                                              10
ZooKeeper

•   Another requirement
    ! Minimize downtime
    ! Efficient recovery

•   Reduce the amount of state transfered
•   Zab
    ! One identifier
    ! Missing values for each process

                          June 2011         11
Zab and PO Broadcast
Definitions

•   Processes: Lead or Follow
•   Followers
    ! Maintain a history of transactions (updates)

•   Transaction identifiers: !e,c"

    ! e : epoch number of the leader
    ! c : epoch counter

                             June 2011               13
Properties of PO Broadcast


•   Integrity
    ! Only broadcast transactions are delivered
    ! Leader recovers before broadcasting new transactions

•   Total order and agreement
    ! Followers deliver the same transactions and in the
      same order


                             June 2011                       14
Primary order

•   Local: Transactions of a leader accepted in
    order
•   Global: Transactions in history respect the
    order of epochs




                      June 2011                   15
Primary order

•    Local: Transactions of a primary accepted in
     order
•    Global: Transactions in history respect the
     order of epochs
             abcast(!e,10") abcast(!e,11") abcast(!e,12")
    Leader



Follower



                                     June 2011              16
Primary order

•    Local: Transactions of a primary accepted in
     order
•    Global: Transactions in history respect the
     order of epochs
             abcast(!e,10") abcast(!e,11") abcast(!e,12")
    Leader



Follower



                                    June 2011               17
Primary order

•     Local: Transactions of a primary accepted in
      order
•     Global: Transactions in history respect the
      order of epochs
               abcast(!e,10") abcast(!e,11")
    Leader

                                               abcast(!e’,1")
    Leader’


    Follower
                                        June 2011               18
Primary order

•    Local: Transactions of a primary accepted in
     order
•    Global: Transactions in history respect the
     order of epochs
              abcast(!e,10")         abcast(!e,11")
    Leader

                               abcast(!e’,1")
    Leader’


Follower
                                       June 2011      19
Zab in Phases

•   Phase 0 - Leader election
    ! Prospective leader          elected

•   Phase 1- Discovery
    ! Followers promise not to go back to previous
      epochs
    ! Followers send to          their last epoch and history

    !    selects longest history of latest epoch
                           June 2011                            20
Zab in Phases

•   Phase 2 - Synchronization
    !    sends new history to followers

    ! Followers confirm leadership

•   Phase 3 - Broadcast
    !    proposes new transactions

    !    commits if quorum acknowledges

                       June 2011          21
Zab in Phases


•   Phases 1 and 2: Recovery
    ! Critical to guarantee order with multiple
      outstanding transactions

•   Phase 3: Broadcast
    ! Just like Phases 2 and 3 of Paxos



                         June 2011                22
Zab: Sample run

                  f1                  f2       f3

               !0,1"               !0,1"     !0,1"
               !0,2"               !0,2"
               !0,3"
New epoch
             f1.a = 0,          f2.a = 0,   f3.a = 0,
               !0,3"              !0,2"       !0,1"
            Initial history
            of new epoch



                              June 2011                 23
Zab: Sample run

                  f1               f2         f3

                !0,1"          !0,1"        !0,1"
                !0,2"          !0,2"        !0,2"
     Chosen!    !1,1"          !1,1"
                !1,2"
New epoch

               f1.a = 1,      f2.a = 1,    f3.a = 2,
                 !1,2"          !1,1"        !0,2"

                           Can’t happen!


                              June 2011                24
Paxos run (revisited)
       Epoch 1, Phase 3                Epoch 2, Phase 3                  Epoch 3, Phase 3
         L1 History: #     Phases 1     L2 History: #        Phases 1     L3 History: !2,1",C
                             and 2                             and 2
                          of Epoch 2                        of Epoch 3




Follower 1
              Epoch: 1                           Epoch: 1                      Epoch: 3
              !1,1",A                            !1,1",A                       !2,1",C
              !1,2",B                            !1,2",B                       !3,1",D
Follower 2
              Epoch: 1                           Epoch: 2                      Epoch: 2
              #                                  !2,1",C                       !2,1",C

Follower 3                                                                     Epoch: 3
              Epoch: 1                           Epoch: 2
              #                                  !2,1",C                       !2,1",C
                                                                               !3,1",D



                                           June 2011                                            25
Notes on implementation

•   Use of TCP
    ! Ordered delivery, retransmissions, etc.

    ! Notion of session

•   Elect leader with most committed txns
    ! No follower ! leader copies

•   Recovery
    ! Last zxid is sufficient
    ! In Phase 2, leader commands to add or truncate

                               June 2011               26
Performance
Experimental setup


•   Implementation in Java
•   13 identical servers
    ! Xeon 2.50GHz, Gigabit interface, two SATA
      disks


                                   http://zookeeper.apache.org

                       June 2011                             28
Throughput
                                        Continuous saturated throughput
                        70000
                                                                         Net only
                                                                      Net + Disk
                        60000                         Net + Disk (no write cache)
                                                                          Net cap

                        50000
Operations per second




                        40000


                        30000


                        20000


                        10000


                            0
                                2   4     6           8          10           12    14
                                        Number of servers in ensemble




                                                  June 2011                              29
Latency




  June 2011   30
Wrap up
Conclusion

•   Zookeeper
    ! Multiple outstanding operations
    ! Dependencies between consecutive updates

•   Zab
    ! Primary Order Broadcast
    ! Synchronization phase
    ! Efficient recovery


                              June 2011          32
Questions?


http://zookeeper.apache.org

More Related Content

What's hot

Le protocole HTTP
Le protocole HTTPLe protocole HTTP
Le protocole HTTPSouhaib El
 
Introduction au Framework Laravel
Introduction au Framework LaravelIntroduction au Framework Laravel
Introduction au Framework LaravelHoucem Hedhly
 
บทที่ 10 การเงินการธนาคาร การคลังรัฐบาล
บทที่ 10 การเงินการธนาคาร การคลังรัฐบาลบทที่ 10 การเงินการธนาคาร การคลังรัฐบาล
บทที่ 10 การเงินการธนาคาร การคลังรัฐบาลOrnkapat Bualom
 
Moteur de Recommandation
Moteur de RecommandationMoteur de Recommandation
Moteur de RecommandationSoft Computing
 
softCours design pattern m youssfi partie 9 creation des objets abstract fact...
softCours design pattern m youssfi partie 9 creation des objets abstract fact...softCours design pattern m youssfi partie 9 creation des objets abstract fact...
softCours design pattern m youssfi partie 9 creation des objets abstract fact...ENSET, Université Hassan II Casablanca
 
Diagramme d’activités
Diagramme d’activitésDiagramme d’activités
Diagramme d’activitésabdoMarocco
 
Théorie des langages - 04 Théorie des langages
Théorie des langages - 04 Théorie des langagesThéorie des langages - 04 Théorie des langages
Théorie des langages - 04 Théorie des langagesYann Caron
 
Scikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en PythonScikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en PythonGael Varoquaux
 
ความยืดหยุ่นของอุปสงค์ (Elasticity of demand)
ความยืดหยุ่นของอุปสงค์ (Elasticity of demand)ความยืดหยุ่นของอุปสงค์ (Elasticity of demand)
ความยืดหยุ่นของอุปสงค์ (Elasticity of demand)Areewan Plienduang
 
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013Amazon Web Services
 
텐서플로우로 배우는 딥러닝
텐서플로우로 배우는 딥러닝텐서플로우로 배우는 딥러닝
텐서플로우로 배우는 딥러닝찬웅 주
 
Installation et Configuration ee JDK et de Tomcat
Installation et Configuration ee JDK et de TomcatInstallation et Configuration ee JDK et de Tomcat
Installation et Configuration ee JDK et de TomcatMohamed Ben Bouzid
 
Unit 4 costs production
Unit 4 costs productionUnit 4 costs production
Unit 4 costs productionsavinee
 

What's hot (20)

Le protocole HTTP
Le protocole HTTPLe protocole HTTP
Le protocole HTTP
 
Support distributed computing and caching avec hazelcast
Support distributed computing and caching avec hazelcastSupport distributed computing and caching avec hazelcast
Support distributed computing and caching avec hazelcast
 
Introduction au Framework Laravel
Introduction au Framework LaravelIntroduction au Framework Laravel
Introduction au Framework Laravel
 
Uml examen
Uml  examenUml  examen
Uml examen
 
บทที่ 10 การเงินการธนาคาร การคลังรัฐบาล
บทที่ 10 การเงินการธนาคาร การคลังรัฐบาลบทที่ 10 การเงินการธนาคาร การคลังรัฐบาล
บทที่ 10 การเงินการธนาคาร การคลังรัฐบาล
 
Moteur de Recommandation
Moteur de RecommandationMoteur de Recommandation
Moteur de Recommandation
 
softCours design pattern m youssfi partie 9 creation des objets abstract fact...
softCours design pattern m youssfi partie 9 creation des objets abstract fact...softCours design pattern m youssfi partie 9 creation des objets abstract fact...
softCours design pattern m youssfi partie 9 creation des objets abstract fact...
 
Traitement distribue en BIg Data - KAFKA Broker and Kafka Streams
Traitement distribue en BIg Data - KAFKA Broker and Kafka StreamsTraitement distribue en BIg Data - KAFKA Broker and Kafka Streams
Traitement distribue en BIg Data - KAFKA Broker and Kafka Streams
 
Diagramme d’activités
Diagramme d’activitésDiagramme d’activités
Diagramme d’activités
 
Zookeeper 소개
Zookeeper 소개Zookeeper 소개
Zookeeper 소개
 
Théorie des langages - 04 Théorie des langages
Théorie des langages - 04 Théorie des langagesThéorie des langages - 04 Théorie des langages
Théorie des langages - 04 Théorie des langages
 
Scikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en PythonScikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en Python
 
ความยืดหยุ่นของอุปสงค์ (Elasticity of demand)
ความยืดหยุ่นของอุปสงค์ (Elasticity of demand)ความยืดหยุ่นของอุปสงค์ (Elasticity of demand)
ความยืดหยุ่นของอุปสงค์ (Elasticity of demand)
 
Econ100 04 sec015
Econ100 04 sec015Econ100 04 sec015
Econ100 04 sec015
 
Java RMI
Java RMIJava RMI
Java RMI
 
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013
Advanced Data Migration Techniques for Amazon RDS (DAT308) | AWS re:Invent 2013
 
텐서플로우로 배우는 딥러닝
텐서플로우로 배우는 딥러닝텐서플로우로 배우는 딥러닝
텐서플로우로 배우는 딥러닝
 
Installation et Configuration ee JDK et de Tomcat
Installation et Configuration ee JDK et de TomcatInstallation et Configuration ee JDK et de Tomcat
Installation et Configuration ee JDK et de Tomcat
 
Unit 4 costs production
Unit 4 costs productionUnit 4 costs production
Unit 4 costs production
 
Value Chain กรณี Starbucks
Value Chain กรณี StarbucksValue Chain กรณี Starbucks
Value Chain กรณี Starbucks
 

Similar to High-performance broadcast for primary-backup systems

PushToTest TestMaker 6.5 Open Source Test Design Document
PushToTest TestMaker 6.5 Open Source Test Design DocumentPushToTest TestMaker 6.5 Open Source Test Design Document
PushToTest TestMaker 6.5 Open Source Test Design DocumentClever Moe
 
Environment Delivery Management Services
Environment Delivery Management  ServicesEnvironment Delivery Management  Services
Environment Delivery Management Servicesdrummondrj
 
Approximating Change Sets at Philips Healthcare: A Case Study
Approximating Change Sets at Philips Healthcare: A Case StudyApproximating Change Sets at Philips Healthcare: A Case Study
Approximating Change Sets at Philips Healthcare: A Case StudyRahul Premraj
 
20110903 candycane
20110903 candycane20110903 candycane
20110903 candycaneYusuke Ando
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
 
codeBeamer: Agile ALM & Collaboration Solution
codeBeamer: Agile ALM & Collaboration Solution codeBeamer: Agile ALM & Collaboration Solution
codeBeamer: Agile ALM & Collaboration Solution Intland Software GmbH
 
Sv jug - mar 2013 - sl
Sv jug - mar 2013 - slSv jug - mar 2013 - sl
Sv jug - mar 2013 - slCloudBees
 
Zararfa SummerCamp 2012 - Community update and Zarafa Development Process
Zararfa SummerCamp 2012 - Community update and Zarafa Development ProcessZararfa SummerCamp 2012 - Community update and Zarafa Development Process
Zararfa SummerCamp 2012 - Community update and Zarafa Development ProcessZarafa
 
Getting started with GIT
Getting started with GITGetting started with GIT
Getting started with GITpratz0909
 
New York Kubernetes: CI/CD Patterns for Kubernetes
New York Kubernetes: CI/CD Patterns for KubernetesNew York Kubernetes: CI/CD Patterns for Kubernetes
New York Kubernetes: CI/CD Patterns for KubernetesAndrew Phillips
 
Value-Stream-Mapping,
Value-Stream-Mapping, Value-Stream-Mapping,
Value-Stream-Mapping, Towo Toivola
 
Atril-Déjà Vu Tea mserver 2 general presentation
Atril-Déjà Vu Tea mserver 2   general presentationAtril-Déjà Vu Tea mserver 2   general presentation
Atril-Déjà Vu Tea mserver 2 general presentationcohlmann
 
New features in Pig 0.11
New features in Pig 0.11New features in Pig 0.11
New features in Pig 0.11Hortonworks
 
Continuous Development with Jenkins - Stephen Connolly at PuppetCamp Dublin '12
Continuous Development with Jenkins - Stephen Connolly at PuppetCamp Dublin '12Continuous Development with Jenkins - Stephen Connolly at PuppetCamp Dublin '12
Continuous Development with Jenkins - Stephen Connolly at PuppetCamp Dublin '12Puppet
 
AWS Customer Presentation - The Server Labs
AWS Customer Presentation - The Server Labs AWS Customer Presentation - The Server Labs
AWS Customer Presentation - The Server Labs Amazon Web Services
 
Kubernetes I Deep Dive.pptx
Kubernetes I Deep Dive.pptxKubernetes I Deep Dive.pptx
Kubernetes I Deep Dive.pptxssuser368371
 
Lean and Kanban Principles for Software Developers
Lean and Kanban Principles for Software DevelopersLean and Kanban Principles for Software Developers
Lean and Kanban Principles for Software DevelopersCory Foy
 

Similar to High-performance broadcast for primary-backup systems (20)

PushToTest TestMaker 6.5 Open Source Test Design Document
PushToTest TestMaker 6.5 Open Source Test Design DocumentPushToTest TestMaker 6.5 Open Source Test Design Document
PushToTest TestMaker 6.5 Open Source Test Design Document
 
Environment Delivery Management Services
Environment Delivery Management  ServicesEnvironment Delivery Management  Services
Environment Delivery Management Services
 
Approximating Change Sets at Philips Healthcare: A Case Study
Approximating Change Sets at Philips Healthcare: A Case StudyApproximating Change Sets at Philips Healthcare: A Case Study
Approximating Change Sets at Philips Healthcare: A Case Study
 
20110903 candycane
20110903 candycane20110903 candycane
20110903 candycane
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
codeBeamer: Agile ALM & Collaboration Solution
codeBeamer: Agile ALM & Collaboration Solution codeBeamer: Agile ALM & Collaboration Solution
codeBeamer: Agile ALM & Collaboration Solution
 
Sv jug - mar 2013 - sl
Sv jug - mar 2013 - slSv jug - mar 2013 - sl
Sv jug - mar 2013 - sl
 
Zararfa SummerCamp 2012 - Community update and Zarafa Development Process
Zararfa SummerCamp 2012 - Community update and Zarafa Development ProcessZararfa SummerCamp 2012 - Community update and Zarafa Development Process
Zararfa SummerCamp 2012 - Community update and Zarafa Development Process
 
Getting started with GIT
Getting started with GITGetting started with GIT
Getting started with GIT
 
New York Kubernetes: CI/CD Patterns for Kubernetes
New York Kubernetes: CI/CD Patterns for KubernetesNew York Kubernetes: CI/CD Patterns for Kubernetes
New York Kubernetes: CI/CD Patterns for Kubernetes
 
How to Introduce Continuous Delivery
How to Introduce Continuous DeliveryHow to Introduce Continuous Delivery
How to Introduce Continuous Delivery
 
Value-Stream-Mapping,
Value-Stream-Mapping, Value-Stream-Mapping,
Value-Stream-Mapping,
 
Atril-Déjà Vu Tea mserver 2 general presentation
Atril-Déjà Vu Tea mserver 2   general presentationAtril-Déjà Vu Tea mserver 2   general presentation
Atril-Déjà Vu Tea mserver 2 general presentation
 
New features in Pig 0.11
New features in Pig 0.11New features in Pig 0.11
New features in Pig 0.11
 
Go Training
Go TrainingGo Training
Go Training
 
Continuous Development with Jenkins - Stephen Connolly at PuppetCamp Dublin '12
Continuous Development with Jenkins - Stephen Connolly at PuppetCamp Dublin '12Continuous Development with Jenkins - Stephen Connolly at PuppetCamp Dublin '12
Continuous Development with Jenkins - Stephen Connolly at PuppetCamp Dublin '12
 
Subversion last minute survival crash course
Subversion  last minute survival crash courseSubversion  last minute survival crash course
Subversion last minute survival crash course
 
AWS Customer Presentation - The Server Labs
AWS Customer Presentation - The Server Labs AWS Customer Presentation - The Server Labs
AWS Customer Presentation - The Server Labs
 
Kubernetes I Deep Dive.pptx
Kubernetes I Deep Dive.pptxKubernetes I Deep Dive.pptx
Kubernetes I Deep Dive.pptx
 
Lean and Kanban Principles for Software Developers
Lean and Kanban Principles for Software DevelopersLean and Kanban Principles for Software Developers
Lean and Kanban Principles for Software Developers
 

Recently uploaded

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Recently uploaded (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

High-performance broadcast for primary-backup systems

  • 1. Zab: High-performance broadcast for primary-backup systems Flavio Junqueira, Benjamin Reed, Marco Serafini Yahoo! Research June 2011
  • 2. Setting up the stage • Background: ZooKeeper • Coordination service ! Web-scale applications ! Intensive use (high performance) ! Source of truth for many applications June 2011 2
  • 3. ZooKeeper • Open source Apache project • Used in production ! Yahoo! ! Facebook ! Rackspace ! ... http://zookeeper.apache.org June 2011 3
  • 4. ZooKeeper • ... is a leader-based, replicated service ! Processes crash and recover • Leader ! Executes requests Leader Follower Follower ! Propagates state updates Broadcast Deliver Deliver • Follower Atomic broadcast ! Applies state updates June 2011 4
  • 5. ZooKeeper • Client Client ! Submits operations to a server Request ! If follower, forwards to Leader Follower Follower leader Broadcast Deliver Deliver ! Leader executes and propagates state update Atomic broadcast June 2011 5
  • 6. ZooKeeper • State updates ! All followers apply the same updates ! All followers apply them in the same order ! Atomic broadcast • Performance requirements ! Multiple outstanding operations ! Low latency and high throughput June 2011 6
  • 7. ZooKeeper • Update configuration and create ready • If ready exists, then configuration is consistent setData del setData /cfg/client /cfg/ready /cfg/server create B B Follower /cfg/ready Leader create /cfg/ready setData Follower /cfg/server setData B /cfg/client del B /cfg/ready • If 1 doesn’t commit, then 2+3 can’t • If 2+3 don’t commit, then 4 must not commit commit June 2011 7
  • 8. ZooKeeper • Exploring Paxos ! Efficient consensus protocol ! State-machine replication ! Multiple consecutive instances • Why is it not suitable out of the box? ! Does not guarantee order ! Multiple outstanding operations June 2011 8
  • 9. Paxos at a glance 1b: Acceptor promises 2b: If quorum, value not to accept lower is chosen ballots Acceptor + Learner 1a 1b 2a 2b 3a Acceptor + Proposer + Learner 1a 1b 2a 2b 3a Acceptor + Learner Phase 1: Phase 2: Phase 3: Selects Proposes Value value to a value learned propose June 2011 9
  • 10. Paxos run Interleaves operations of P1, 27: <1a,3> 27: <2a, 3, C> P2, and and P3 28: <1a,3> 28: <2a, 3, B> 29: <1a,3> 29: <2a, 3, D> P3 Has accepted A and B from P1 A1 27: <1, A> 27: <1b, 1, A> 28: <1, B> 28: <1b, 1, B> 29: <1b, _, _> A2 Has 27: <3, C> 27: <2, C> accepted C 28: <3, B> from P2 29: <3, D> A3 27: <2, C> 27: <1b, 2, C> 27: <3, C> 28: <1b, _, _> 28: <3, B> 29: <1b, _, _> 29: <3, D> June 2011 10
  • 11. ZooKeeper • Another requirement ! Minimize downtime ! Efficient recovery • Reduce the amount of state transfered • Zab ! One identifier ! Missing values for each process June 2011 11
  • 12. Zab and PO Broadcast
  • 13. Definitions • Processes: Lead or Follow • Followers ! Maintain a history of transactions (updates) • Transaction identifiers: !e,c" ! e : epoch number of the leader ! c : epoch counter June 2011 13
  • 14. Properties of PO Broadcast • Integrity ! Only broadcast transactions are delivered ! Leader recovers before broadcasting new transactions • Total order and agreement ! Followers deliver the same transactions and in the same order June 2011 14
  • 15. Primary order • Local: Transactions of a leader accepted in order • Global: Transactions in history respect the order of epochs June 2011 15
  • 16. Primary order • Local: Transactions of a primary accepted in order • Global: Transactions in history respect the order of epochs abcast(!e,10") abcast(!e,11") abcast(!e,12") Leader Follower June 2011 16
  • 17. Primary order • Local: Transactions of a primary accepted in order • Global: Transactions in history respect the order of epochs abcast(!e,10") abcast(!e,11") abcast(!e,12") Leader Follower June 2011 17
  • 18. Primary order • Local: Transactions of a primary accepted in order • Global: Transactions in history respect the order of epochs abcast(!e,10") abcast(!e,11") Leader abcast(!e’,1") Leader’ Follower June 2011 18
  • 19. Primary order • Local: Transactions of a primary accepted in order • Global: Transactions in history respect the order of epochs abcast(!e,10") abcast(!e,11") Leader abcast(!e’,1") Leader’ Follower June 2011 19
  • 20. Zab in Phases • Phase 0 - Leader election ! Prospective leader elected • Phase 1- Discovery ! Followers promise not to go back to previous epochs ! Followers send to their last epoch and history ! selects longest history of latest epoch June 2011 20
  • 21. Zab in Phases • Phase 2 - Synchronization ! sends new history to followers ! Followers confirm leadership • Phase 3 - Broadcast ! proposes new transactions ! commits if quorum acknowledges June 2011 21
  • 22. Zab in Phases • Phases 1 and 2: Recovery ! Critical to guarantee order with multiple outstanding transactions • Phase 3: Broadcast ! Just like Phases 2 and 3 of Paxos June 2011 22
  • 23. Zab: Sample run f1 f2 f3 !0,1" !0,1" !0,1" !0,2" !0,2" !0,3" New epoch f1.a = 0, f2.a = 0, f3.a = 0, !0,3" !0,2" !0,1" Initial history of new epoch June 2011 23
  • 24. Zab: Sample run f1 f2 f3 !0,1" !0,1" !0,1" !0,2" !0,2" !0,2" Chosen! !1,1" !1,1" !1,2" New epoch f1.a = 1, f2.a = 1, f3.a = 2, !1,2" !1,1" !0,2" Can’t happen! June 2011 24
  • 25. Paxos run (revisited) Epoch 1, Phase 3 Epoch 2, Phase 3 Epoch 3, Phase 3 L1 History: # Phases 1 L2 History: # Phases 1 L3 History: !2,1",C and 2 and 2 of Epoch 2 of Epoch 3 Follower 1 Epoch: 1 Epoch: 1 Epoch: 3 !1,1",A !1,1",A !2,1",C !1,2",B !1,2",B !3,1",D Follower 2 Epoch: 1 Epoch: 2 Epoch: 2 # !2,1",C !2,1",C Follower 3 Epoch: 3 Epoch: 1 Epoch: 2 # !2,1",C !2,1",C !3,1",D June 2011 25
  • 26. Notes on implementation • Use of TCP ! Ordered delivery, retransmissions, etc. ! Notion of session • Elect leader with most committed txns ! No follower ! leader copies • Recovery ! Last zxid is sufficient ! In Phase 2, leader commands to add or truncate June 2011 26
  • 28. Experimental setup • Implementation in Java • 13 identical servers ! Xeon 2.50GHz, Gigabit interface, two SATA disks http://zookeeper.apache.org June 2011 28
  • 29. Throughput Continuous saturated throughput 70000 Net only Net + Disk 60000 Net + Disk (no write cache) Net cap 50000 Operations per second 40000 30000 20000 10000 0 2 4 6 8 10 12 14 Number of servers in ensemble June 2011 29
  • 30. Latency June 2011 30
  • 32. Conclusion • Zookeeper ! Multiple outstanding operations ! Dependencies between consecutive updates • Zab ! Primary Order Broadcast ! Synchronization phase ! Efficient recovery June 2011 32