Redundancy Doesn't Always Mean "HA" or "Cluster"
            A cautionary tale against using hammers to solve all redundancy and resiliency problems ...

            OpenStack Design Summit – Oct 2012



             Randy Bias                                                                        Dan Sneddon
             @randybias                                                                        @dxs
             CTO, Cloudscaling                                                                 Sr. Engineer, Cloudscaling


                      CCA - NoDerivs 3.0 Unported License - Usage OK, no modifications, full attribution*
                                                * All unlicensed or borrowed works retain their original licenses
Thursday, October 18, 12
Our Journey Today


      1. “HA” pairs are not the only type of redundancy

      2. Alternative redundancy patterns for HA

      3. Redundancy patterns in Open Cloud System*




        * Cloudscaling’s OpenStack-powered cloud operating system (“distribution”)

What Do We Mean By “HA”?
      We mean what most people mean ...




                   Two servers or network devices that look like one

“HA HA”?
      HA pairs come in a couple of flavors




                            Active / Passive

“HA HA”?
      People like this flavor best, but it’s not always possible...




                               Active / Active

“HA HA HA HA HA”??
      Many people wish they could get it more like this ...




                           HA cluster aka ‘massive operational nightmare’

Cluster<bleep>!
      Imagine this was 4 or 6 nodes in the cluster




• 4 network tech.
• 7 NICs / node
• A million different ways to break




“HA” Pairs Are One Type of Redundancy
      Herein lies the problem ...




The Problem With “HA”-mmers
      There are many, but these two matter most ...



      • Catastrophic failures
      • No scale out



HA Pairs Have Binary Failures
      Either working or dead, nothing in-between




What is Scale-out?


      Scale-up - Make boxes bigger (usually an HA pair):   A  B

      Scale-out - Make moar boxes:   A  B  C  D  ...  N


Scaling out is a mindset
      Scaling up is like treating your servers as pets




     bowzer.company.com                     web001.company.com


                           Servers *are* cattle
HA Pair Failures* - 100% down
      Hardware rarely fails, operators fail, software fails
      Who          Type         Year   Why       Duration
      Apple        Switch       2005   Bug       2 hrs
      Flexiscale   SAN          2007   Ops Err   24 hrs
      Vendio       NAS          2008   Ops Err   8 hrs
      UOL Brazil   SAN          2011   Bug       72 hrs
      Twitter      Datacenter   2012   Bug+Ops   2 hrs

           * This is a handful of examples as a baseline; I’m sure you can find many more

“HA” Pairs Are an All-in Move
      They better not fail ...




Risk Reduction
      Many small failure domains are usually better




Big failure domains vs. small
      Would you rather have the whole cloud down or just a
      small bit for a short period of time?




                                     Still a scale-up pattern ...
                                   wouldn’t you rather scale-out?
Pair vs. Scale-out Load Balancing
      No scale-out




      HA pair: State Sync (100% loss)
      Scale-out: Shared-nothing Architecture (20% loss)
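
The loss figures on this slide can be reproduced with a little arithmetic. The sketch below assumes equally sized, independent failure domains and that the scale-out side of the diagram has five load balancers (inferred from the 20% figure); `capacity_lost` is a hypothetical helper, not part of OCS.

```python
def capacity_lost(failure_domains: int, failed: int = 1) -> float:
    """Fraction of capacity lost when `failed` of `failure_domains`
    equally sized, independent failure domains go down."""
    if failure_domains < 1 or not 0 <= failed <= failure_domains:
        raise ValueError("need at least one domain and 0 <= failed <= failure_domains")
    return failed / failure_domains


# A state-synced HA pair behaves as ONE failure domain: a bug or operator
# error that hits the shared state takes both members down together.
ha_pair = capacity_lost(failure_domains=1)

# Five shared-nothing load balancers (assumed count): one failure only
# removes that node's share.
scale_out = capacity_lost(failure_domains=5)

print(f"HA pair: {ha_pair:.0%} loss, scale-out of 5: {scale_out:.0%} loss")
# prints "HA pair: 100% loss, scale-out of 5: 20% loss"
```

Adding more shared-nothing nodes shrinks the share lost per failure, which is the risk-reduction argument made on the preceding slides.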

What’s Usually an “HA” Pair in OpenStack?
      Everything ...



      • Service Endpoints (APIs)
      • Messaging System (RPC)
      • Worker Threads (e.g. Scheduler, Networking)
      • Database (MySQL)

What needs to be an HA pair?
      Not much needs state synchronization



      • Service Endpoints (APIs)
      • Messaging System (RPC)
      • Worker Threads (e.g. Scheduler, Networking)
      • Database (MySQL)

Fault Tolerance Methodologies




Fault Tolerance in OCS




Service Distribution
      High Availability Without Compromise




          Resilient        Stateless         Scale-out




Service Distribution
      Combines Standard Networking Technologies

      OSPF: /etc/quagga/ospfd.conf

          router ospf
           ospf router-id 10.1.1.1
           network 10.1.255.1/32 area 0.0.0.0

      Anycast: /etc/quagga/zebra.conf

          interface lo:2
           description Pound listening address
           ip address 10.1.255.1/32

      Load-Balancing Proxy: /etc/pound/pound.conf

          ListenHTTP
              Address 10.1.255.1
              Port 8774
              xHTTP 1
              Service
                  BackEnd
                      Address 10.1.1.1
                      Port 8774
                  End
                  BackEnd
                      Address 10.1.1.2
                      Port 8774
                  End
              End
          End
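
Because every back-end in the pound.conf fragment on this slide follows the same pattern, the file lends itself to being generated. The following is only a sketch of that idea: `render_pound_conf` is a hypothetical helper, not something shown in the talk, and it assumes all back-ends listen on the same port as the anycast listener.

```python
from typing import Sequence


def render_pound_conf(listen_addr: str, port: int, backends: Sequence[str]) -> str:
    """Render a ListenHTTP section with one BackEnd block per address."""
    lines = [
        "ListenHTTP",
        f"    Address {listen_addr}",
        f"    Port {port}",
        "    xHTTP 1",
        "    Service",
    ]
    for addr in backends:
        lines += [
            "        BackEnd",
            f"            Address {addr}",
            f"            Port {port}",
            "        End",
        ]
    lines += ["    End", "End"]
    return "\n".join(lines) + "\n"


# Values taken from the slide: anycast listener 10.1.255.1:8774, two back-ends.
print(render_pound_conf("10.1.255.1", 8774, ["10.1.1.1", "10.1.1.2"]))
```

Scaling out then means appending an address to the list and re-rendering the file on each proxy node.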


Resilient OpenStack
      Horizontally Scalable, No Single Point Of Failure

      • Service Endpoints (APIs): Service Distribution
      • Messaging System (RPC): ZeroMQ
      • Worker Threads (e.g. Scheduler, Networking): Service Distribution
      • Database (MySQL): MMR + HA



Service Distribution Advantages
      What Makes This a Superior Solution?

      • True horizontal scalability with no centralized controller
      • Services are always running, failover is nearly instant
      • Reduced complexity, fewer idle resources
      • No need for separate load balancers


      [Diagram: Failover (2 servers) vs. Distributed Services (5 servers ...)]

Perfect For Site Resiliency
      Service Distribution Works With Multiple Sites
        • Traditional HA pairs do not support cross-site resiliency

        • Service Distribution fails over across sites without DNS redirection




Service Distribution in Action
        Example: Distributed Load Balancing
          1) OSPF

        [Diagram: two nodes, each running Quagga and an HTTP Proxy, send OSPF advertisements to the OSPF Router(s)]




Service Distribution in Action
        Example: Distributed Load Balancing
          1) OSPF
          2) ECMP Per-flow Load Balancing
          3) Load-balancing HTTP Proxy

        [Diagram: two nodes, each running Quagga and an HTTP Proxy, send OSPF advertisements to the OSPF Router(s), which apply per-flow load balancing across the two nodes]




Service Distribution in Action
        Example: Distributed Load Balancing
          1) OSPF
          2) ECMP Per-flow Load Balancing
          3) Load-balancing HTTP Proxy
          4) Unlimited # of Back-End Servers

        [Diagram: two nodes, each running Quagga and an HTTP Proxy, send OSPF advertisements to the OSPF Router(s), which apply per-flow load balancing across the two nodes; the proxies sit in front of four back-end servers]
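
Per-flow ECMP can be pictured as hashing each flow's 5-tuple onto the set of equal-cost next hops. The sketch below is illustrative only: real routers use their own vendor-specific hash functions, and the proxy names, the client address, and `pick_next_hop` are all assumptions made for the example. Only the anycast address 10.1.255.1 and port 8774 come from the slides.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Map a flow 5-tuple onto one of the equal-cost next hops."""
    if not next_hops:
        raise ValueError("no next hops available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:8], "big") % len(next_hops)]


# Hypothetical proxy nodes that both advertise the anycast address via OSPF.
proxies = ["proxy-1", "proxy-2"]
# (source IP, source port, destination IP, destination port, protocol)
flow = ("192.0.2.10", 40000, "10.1.255.1", 8774, "tcp")

chosen = pick_next_hop(flow, proxies)
# Packets of the same flow always hash to the same proxy.
assert chosen == pick_next_hop(flow, proxies)

# If that proxy stops advertising the route, the router drops it from the
# ECMP set and the flow is rehashed onto a surviving proxy.
survivors = [p for p in proxies if p != chosen]
print(chosen, "->", pick_next_hop(flow, survivors))
```

No state is synchronized between the proxies in this model; the only shared knowledge is the routing table on the OSPF router(s).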




Failure Resiliency
      [Diagram: four clients, with flows 1-4 each landing on a different one of five Load Balancer/Proxy nodes, in front of ten servers; 10% load on each server]
Failure Resiliency
      [Diagram: same layout; the Load Balancer/Proxy serving flow 1 is marked X (failed) and flow 1 is redirected to the Load Balancer/Proxy serving flow 2; still 10% load on each server]

Failure Resiliency
      [Diagram: same layout; one server is marked X (failed) and the remaining servers are labeled "10% Increased Server Load"]

OCS NAT Service
      Example: Scale-out Network Address Translation
      [Diagram, top to bottom: BGP (multiple ISP providers), NAT, Service Distribution, VMs]

Brokerless Messaging With ZeroMQ
      Avoiding RabbitMQ’s Single Point Of Failure
      [Diagram: RabbitMQ (Brokered). Nova-Compute, Nova-Scheduler and Nova-API all connect through a central RabbitMQ Broker, labeled "Single Point Of Failure"]
Brokerless Messaging With ZeroMQ
      Avoiding RabbitMQ’s Single Point Of Failure
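      [Diagram: RabbitMQ (Brokered) vs. ZeroMQ (Peer To Peer). On the left, Nova-Compute, Nova-Scheduler and Nova-API all connect through a central RabbitMQ Broker, labeled "Single Point Of Failure"; on the right, the same three services connect to each other directly]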
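
The difference between the two topologies can be checked with a toy reachability model. This is not ZeroMQ or RabbitMQ code; the lower-case node names, the full-mesh layout for the peer-to-peer case, and the `reachable` helper are assumptions made for illustration.

```python
from collections import deque


def reachable(links, start, down=frozenset()):
    """Return the nodes reachable from `start`, skipping nodes in `down`."""
    if start in down:
        return set()
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in neighbors.get(node, ()):
            if peer not in seen and peer not in down:
                seen.add(peer)
                queue.append(peer)
    return seen


services = ["nova-api", "nova-scheduler", "nova-compute"]

# Brokered: every service talks only to the broker (star topology).
brokered = [(svc, "broker") for svc in services]
# Peer to peer: every service talks directly to every other (assumed full mesh).
peer_to_peer = [(a, b) for i, a in enumerate(services) for b in services[i + 1:]]

# With the broker down, nova-api can reach nothing but itself.
print(sorted(reachable(brokered, "nova-api", down={"broker"})))
# With one peer down, the remaining two services still reach each other.
print(sorted(reachable(peer_to_peer, "nova-api", down={"nova-compute"})))
```

In the brokered model a single node failure partitions every service; in the peer-to-peer model a failure only removes the node that failed.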
What did we learn today?


          1. HA-mmers are for nails

          2. Scale-out rules for redundancy

          3. Design-for-failure is a mentality, not a pair

          4. Resiliency over redundancy



Q&A
 Randy Bias                                         Dan Sneddon
 @randybias                                         @dxs
 CTO, Cloudscaling                                  Sr. Engineer, Cloudscaling



                                                      OCS 2.0
                      Public Cloud Benefits | Private Cloud Control | Open Cloud Economics


More Related Content

Viewers also liked

AppSphere 15 - Preparing for System Failure: How Pearson used AppDynamics to ...
AppSphere 15 - Preparing for System Failure: How Pearson used AppDynamics to ...AppSphere 15 - Preparing for System Failure: How Pearson used AppDynamics to ...
AppSphere 15 - Preparing for System Failure: How Pearson used AppDynamics to ...
AppDynamics
 
Designing apps for resiliency
Designing apps for resiliencyDesigning apps for resiliency
Designing apps for resiliency
Masashi Narumoto
 
FORUM PA 2015 - Microservices with IBM Bluemix
FORUM PA 2015 - Microservices with IBM BluemixFORUM PA 2015 - Microservices with IBM Bluemix
FORUM PA 2015 - Microservices with IBM Bluemix
gjuljo
 
Fault tolerance made easy
Fault tolerance made easyFault tolerance made easy
Fault tolerance made easy
Uwe Friedrichsen
 
Architecture without an end state
Architecture without an end stateArchitecture without an end state
Architecture without an end state
Michael Nygard
 
Patterns of resilience
Patterns of resiliencePatterns of resilience
Patterns of resilience
Uwe Friedrichsen
 
Resiliency through failure @ QConNY 2013
Resiliency through failure @ QConNY 2013Resiliency through failure @ QConNY 2013
Resiliency through failure @ QConNY 2013
Ariel Tseitlin
 
[ML15]Class Cat佐々木さん「いち早く人工知能テクノロジーを取り入れた製品・サービスを市場に展開するには?」
[ML15]Class Cat佐々木さん「いち早く人工知能テクノロジーを取り入れた製品・サービスを市場に展開するには?」[ML15]Class Cat佐々木さん「いち早く人工知能テクノロジーを取り入れた製品・サービスを市場に展開するには?」
[ML15]Class Cat佐々木さん「いち早く人工知能テクノロジーを取り入れた製品・サービスを市場に展開するには?」
AINOW
 
Resilient Architecture
Resilient ArchitectureResilient Architecture
Resilient Architecture
Matt Stine
 

Viewers also liked (10)

AppSphere 15 - Preparing for System Failure: How Pearson used AppDynamics to ...
AppSphere 15 - Preparing for System Failure: How Pearson used AppDynamics to ...AppSphere 15 - Preparing for System Failure: How Pearson used AppDynamics to ...
AppSphere 15 - Preparing for System Failure: How Pearson used AppDynamics to ...
 
Designing apps for resiliency
Designing apps for resiliencyDesigning apps for resiliency
Designing apps for resiliency
 
Resilience engineering
Resilience engineeringResilience engineering
Resilience engineering
 
FORUM PA 2015 - Microservices with IBM Bluemix
FORUM PA 2015 - Microservices with IBM BluemixFORUM PA 2015 - Microservices with IBM Bluemix
FORUM PA 2015 - Microservices with IBM Bluemix
 
Fault tolerance made easy
Fault tolerance made easyFault tolerance made easy
Fault tolerance made easy
 
Architecture without an end state
Architecture without an end stateArchitecture without an end state
Architecture without an end state
 
Patterns of resilience
Patterns of resiliencePatterns of resilience
Patterns of resilience
 
Resiliency through failure @ QConNY 2013
Resiliency through failure @ QConNY 2013Resiliency through failure @ QConNY 2013
Resiliency through failure @ QConNY 2013
 
[ML15]Class Cat佐々木さん「いち早く人工知能テクノロジーを取り入れた製品・サービスを市場に展開するには?」
[ML15]Class Cat佐々木さん「いち早く人工知能テクノロジーを取り入れた製品・サービスを市場に展開するには?」[ML15]Class Cat佐々木さん「いち早く人工知能テクノロジーを取り入れた製品・サービスを市場に展開するには?」
[ML15]Class Cat佐々木さん「いち早く人工知能テクノロジーを取り入れた製品・サービスを市場に展開するには?」
 
Resilient Architecture
Resilient ArchitectureResilient Architecture
Resilient Architecture
 

Similar to OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

Considerations for Building Your Private Cloud.pdf
Considerations for Building Your Private Cloud.pdfConsiderations for Building Your Private Cloud.pdf
Considerations for Building Your Private Cloud.pdf
OpenStack Foundation
 
Cloud Tech III: Actionable Metrics
Cloud Tech III: Actionable MetricsCloud Tech III: Actionable Metrics
Cloud Tech III: Actionable Metrics
royrapoport
 
Riak Use Cases : Dissecting The Solutions To Hard Problems
Riak Use Cases : Dissecting The Solutions To Hard ProblemsRiak Use Cases : Dissecting The Solutions To Hard Problems
Riak Use Cases : Dissecting The Solutions To Hard ProblemsAndy Gross
 
NoSQL @ Qbranch -2010-04-15
NoSQL @ Qbranch -2010-04-15NoSQL @ Qbranch -2010-04-15
NoSQL @ Qbranch -2010-04-15
Mårten Gustafson
 
Erlang for video delivery
Erlang for video deliveryErlang for video delivery
Erlang for video delivery
Hugh Watkins
 
Apache Hadoop Talk at QCon
Apache Hadoop Talk at QConApache Hadoop Talk at QCon
Apache Hadoop Talk at QConCloudera, Inc.
 
Secrets of the asset pipeline
Secrets of the asset pipelineSecrets of the asset pipeline
Secrets of the asset pipeline
Ken Collins
 
Operating your OpenStack Private Cloud.pdf
Operating your OpenStack Private Cloud.pdfOperating your OpenStack Private Cloud.pdf
Operating your OpenStack Private Cloud.pdf
OpenStack Foundation
 
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
NETWAYS
 
MySQL Cluster no PayPal
MySQL Cluster no PayPalMySQL Cluster no PayPal
MySQL Cluster no PayPal
MySQL Brasil
 
Abstractions at Scale – Our Experiences at Twitter
Abstractions at Scale – Our Experiences at TwitterAbstractions at Scale – Our Experiences at Twitter
Abstractions at Scale – Our Experiences at Twitter
Leonidas Tsementzis
 
DreamObjects - Ceph Day Nov 2012
DreamObjects - Ceph Day Nov 2012DreamObjects - Ceph Day Nov 2012
DreamObjects - Ceph Day Nov 2012
Ceph Community
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed Database
ArangoDB Database
 
Architecting for Change: QCONNYC 2012
Architecting for Change: QCONNYC 2012Architecting for Change: QCONNYC 2012
Architecting for Change: QCONNYC 2012
Kellan
 
The computer science behind a modern disributed data store
The computer science behind a modern disributed data storeThe computer science behind a modern disributed data store
The computer science behind a modern disributed data store
J On The Beach
 
Big Data Overview
Big Data OverviewBig Data Overview
Big Data Overview
Howie Rosenshine
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
Dean Wampler
 
Inside the Atlassian OnDemand Private Cloud
Inside the Atlassian OnDemand Private CloudInside the Atlassian OnDemand Private Cloud
Inside the Atlassian OnDemand Private Cloud
Atlassian
 
WordPress: Performance Optimization and Scaling - WordCamp Las Vegas 2011
WordPress: Performance Optimization and Scaling - WordCamp Las Vegas 2011WordPress: Performance Optimization and Scaling - WordCamp Las Vegas 2011
WordPress: Performance Optimization and Scaling - WordCamp Las Vegas 2011Matt Martz
 
Riak intro to..
Riak intro to..Riak intro to..
Riak intro to..Adron Hall
 

Similar to OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster" (20)

Considerations for Building Your Private Cloud.pdf
Considerations for Building Your Private Cloud.pdfConsiderations for Building Your Private Cloud.pdf
Considerations for Building Your Private Cloud.pdf
 
Cloud Tech III: Actionable Metrics
Cloud Tech III: Actionable MetricsCloud Tech III: Actionable Metrics
Cloud Tech III: Actionable Metrics
 
Riak Use Cases : Dissecting The Solutions To Hard Problems
Riak Use Cases : Dissecting The Solutions To Hard ProblemsRiak Use Cases : Dissecting The Solutions To Hard Problems
Riak Use Cases : Dissecting The Solutions To Hard Problems
 
NoSQL @ Qbranch -2010-04-15
NoSQL @ Qbranch -2010-04-15NoSQL @ Qbranch -2010-04-15
NoSQL @ Qbranch -2010-04-15
 
Erlang for video delivery
Erlang for video deliveryErlang for video delivery
Erlang for video delivery
 
Apache Hadoop Talk at QCon
Apache Hadoop Talk at QConApache Hadoop Talk at QCon
Apache Hadoop Talk at QCon
 
Secrets of the asset pipeline
Secrets of the asset pipelineSecrets of the asset pipeline
Secrets of the asset pipeline
 
Operating your OpenStack Private Cloud.pdf
Operating your OpenStack Private Cloud.pdfOperating your OpenStack Private Cloud.pdf
Operating your OpenStack Private Cloud.pdf
 
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
OSDC 2018 | The Computer science behind a modern distributed data store by Ma...
 
MySQL Cluster no PayPal
MySQL Cluster no PayPalMySQL Cluster no PayPal
MySQL Cluster no PayPal
 
Abstractions at Scale – Our Experiences at Twitter
Abstractions at Scale – Our Experiences at TwitterAbstractions at Scale – Our Experiences at Twitter
Abstractions at Scale – Our Experiences at Twitter
 
DreamObjects - Ceph Day Nov 2012
DreamObjects - Ceph Day Nov 2012DreamObjects - Ceph Day Nov 2012
DreamObjects - Ceph Day Nov 2012
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed Database
 
Architecting for Change: QCONNYC 2012
Architecting for Change: QCONNYC 2012Architecting for Change: QCONNYC 2012
Architecting for Change: QCONNYC 2012
 
The computer science behind a modern disributed data store
The computer science behind a modern disributed data storeThe computer science behind a modern disributed data store
The computer science behind a modern disributed data store
 
Big Data Overview
Big Data OverviewBig Data Overview
Big Data Overview
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
 
Inside the Atlassian OnDemand Private Cloud
Inside the Atlassian OnDemand Private CloudInside the Atlassian OnDemand Private Cloud
Inside the Atlassian OnDemand Private Cloud
 
WordPress: Performance Optimization and Scaling - WordCamp Las Vegas 2011
WordPress: Performance Optimization and Scaling - WordCamp Las Vegas 2011WordPress: Performance Optimization and Scaling - WordCamp Las Vegas 2011
WordPress: Performance Optimization and Scaling - WordCamp Las Vegas 2011
 
Riak intro to..
Riak intro to..Riak intro to..
Riak intro to..
 

More from Randy Bias

Services are the New Cloud Platform (Services-as-a-Platform)
Services are the New Cloud Platform (Services-as-a-Platform)Services are the New Cloud Platform (Services-as-a-Platform)
Services are the New Cloud Platform (Services-as-a-Platform)
Randy Bias
 
Rebooting the OpenContrail Community
Rebooting the OpenContrail CommunityRebooting the OpenContrail Community
Rebooting the OpenContrail Community
Randy Bias
 
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"

  • 1. Redundancy Doesn't Always Mean "HA" or "Cluster". A cautionary tale against using hammers to solve all redundancy and resiliency problems ... OpenStack Design Summit – Oct 2012. Randy Bias (@randybias), CTO, Cloudscaling; Dan Sneddon (@dxs), Sr. Engineer, Cloudscaling. CCA - NoDerivs 3.0 Unported License - Usage OK, no modifications, full attribution* (* All unlicensed or borrowed works retain their original licenses). Thursday, October 18, 12
  • 2. Our Journey Today 1. “HA” pairs are not the only type of redundancy 2. Alternative redundancy patterns for HA 3. Redundancy patterns in Open Cloud System* (* Cloudscaling’s OpenStack-powered cloud operating system (“distribution”))
  • 3. What Do We Mean By “HA”? We mean what most people mean ... Two servers or network devices that look like one
  • 4. “HA HA”? HA pairs come in a couple flavors: Active / Passive
  • 5. “HA HA”? People like this flavor best, but it’s not always possible... Active / Active
  • 6. “HA HA HA HA HA”?? Many people wish they could get it more like this ... HA cluster aka ‘massive operational nightmare’
  • 7. Cluster<bleep>! Imagine this was 4 or 6 nodes in the cluster • 4 network tech. • 7 NICs / node • A million different ways to break
  • 8. “HA” Pairs Are One Type of Redundancy. Herein lies the problem ...
  • 9. The Problem With “HA”-mmers. There are many, but these two matter most ... • Catastrophic failures • No scale out
  • 10. HA Pairs Have Binary Failures. Either working or dead, nothing in-between
  • 11. What is Scale-out? Scale-up - Make boxes bigger (usually an HA pair). Scale-out - Make moar boxes. [Diagram: boxes labeled A, B and A, B, C, D, N]
  • 12. Scaling out is a mindset. Scaling up is like treating your servers as pets (bowzer.company.com). Servers *are* cattle (web001.company.com).
  • 13. HA Pair Failures* - 100% down. Hardware rarely fails, operators fail, software fails
      Who          Type         Year   Why       Duration
      Apple        Switch       2005   Bug       2 hrs
      Flexiscale   SAN          2007   Ops Err   24 hrs
      Vendio       NAS          2008   Ops Err   8 hrs
      UOL Brazil   SAN          2011   Bug       72 hrs
      Twitter      Datacenter   2012   Bug+Ops   2 hrs
      * This is a handful of examples as a baseline; I’m sure you can find many more
  • 14. “HA” Pairs Are an All-in Move. They better not fail ...
  • 15. Risk Reduction. Many small failure domains is usually better
  • 16. Big failure domains vs. small. Would you rather have the whole cloud down or just a small bit for a short period of time? Still a scale-up pattern ... wouldn’t you rather scale-out?
  • 17. Pair vs. Scale-out Load Balancing. Pair: State Sync, no scale-out (100% loss). Scale-out: Shared-nothing Architecture (20% loss).
  • 19. What’s Usually an “HA” Pair in OpenStack? Everything ... Service Endpoints (APIs), Messaging System (RPC), Worker Threads (e.g. Scheduler, Networking), Database (MySQL)
  • 20. What needs to be an HA pair? Not much needs state synchronization. Service Endpoints (APIs), Messaging System (RPC), Worker Threads (e.g. Scheduler, Networking), Database (MySQL)
  • 21. Fault Tolerance Methodologies
  • 22. Fault Tolerance in OCS
  • 23. Service Distribution. High Availability Without Compromise: Resilient, Stateless, Scale-out
  • 24. Service Distribution Combines Standard Networking Technologies
      OSPF (/etc/quagga/ospfd.conf):
        router ospf
         ospf router-id 10.1.1.1
         network 10.1.255.1 area 0.0.0.0
      Anycast (/etc/quagga/zebra.conf):
        interface lo:2
         description Pound listening address
         ip address 10.1.255.1/32
      Load-Balancing Proxy (/etc/pound/pound.conf):
        ListenHTTP
          Address 10.1.255.1
          Port 8774
          xHTTP 1
          Service
            BackEnd
              Address 10.1.1.1
              Port 8774
            End
            BackEnd
              Address 10.1.1.2
              Port 8774
            End
          End
        End
  • 25. Resilient OpenStack. Horizontally Scalable, No Single Point Of Failure. Service Endpoints (APIs): Service Distribution. Messaging System (RPC): ZeroMQ. Worker Threads (e.g. Scheduler, Networking): Service Distribution. Database (MySQL): MMR + HA.
  • 26. Service Distribution Advantages. What Makes This a Superior Solution? • True horizontal scalability with no centralized controller • Services are always running, failover is nearly instant • Reduced complexity, fewer idle resources • No need for separate load balancers [Diagram: Failover vs. Distributed Services]
  • 27. Perfect For Site Resiliency. Service Distribution Works With Multiple Sites • Traditional HA pairs do not support cross-site resiliency • Service Distribution fails over across sites without DNS redirections
  • 28. Service Distribution in Action. Example: Distributed Load Balancing. 1) OSPF [Diagram: OSPF Router(s) receiving OSPF advertisements from Quagga on two HTTP Proxy nodes]
  • 29. Service Distribution in Action. Example: Distributed Load Balancing. 1) OSPF 2) ECMP Per-flow Load Balancing 3) Load-balancing [Diagram: OSPF Router(s), OSPF advertisements, Per-Flow Load Balancing, Quagga, HTTP Proxy nodes]
  • 30. Service Distribution in Action. Example: Distributed Load Balancing. 1) OSPF 2) ECMP Per-flow Load Balancing 3) Load-balancing 4) Unlimited # of Back-End Servers [Diagram: OSPF Router(s), OSPF advertisements, Per-Flow Load Balancing, Quagga, HTTP Proxy nodes, back-end Servers]
  • 31. Failure Resiliency [Diagram: Clients 1-4, five Load Balancer/Proxy nodes, Servers; label: 10% Server Load Each]
  • 32. Failure Resiliency [Diagram: Clients 1-4, five Load Balancer/Proxy nodes with an X marking a failure, Servers; label: 10% Server Load Each]
  • 33. Failure Resiliency [Diagram: Clients 1-4, five Load Balancer/Proxy nodes, Servers with an X marking a failure; labels: 10%, Increased Server Load]
  • 34. OCS NAT Service. Example: Scale-out Network Address Translation [Diagram: BGP, Multiple ISP providers, NAT Service Distribution, VMs]
  • 35. Brokerless Messaging With ZeroMQ. Avoiding RabbitMQ’s Single Point Of Failure [Diagram: RabbitMQ (Brokered): Nova-Compute, Nova-Scheduler, Nova-API, and the RabbitMQ Broker, labeled Single Point Of Failure]
  • 36. Brokerless Messaging With ZeroMQ. Avoiding RabbitMQ’s Single Point Of Failure [Diagram: RabbitMQ (Brokered), with the RabbitMQ Broker labeled Single Point Of Failure, vs. ZeroMQ (Peer To Peer); each side shows Nova-Compute, Nova-Scheduler, Nova-API]
  • 37. What did we learn today? 1. HA-mmers are for nails 2. Scale-out rules for redundancy 3. Design-for-failure is a mentality, not a pair 4. Resiliency over redundancy
  • 38. Q&A. Randy Bias (@randybias), CTO, Cloudscaling; Dan Sneddon (@dxs), Sr. Engineer, Cloudscaling. OCS 2.0: Public Cloud Benefits | Private Cloud Control | Open Cloud Economics
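
The "Pair vs. Scale-out Load Balancing" and "Service Distribution in Action" slides argue that per-flow ECMP across several shared-nothing proxies turns a 100% outage into a roughly 20% one. The following is a minimal Python sketch of that arithmetic, not the routers' actual behavior: the proxy names, the synthetic client flows, and the SHA-256-modulo hash are all assumptions made for illustration (real routers use their own ECMP hash). Only the anycast address 10.1.255.1 and port 8774 come from the slides.

```python
import hashlib
from collections import Counter


def pick_next_hop(flow, next_hops):
    """Per-flow choice: hash the flow tuple, then index into the live next hops."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return next_hops[int.from_bytes(digest[:8], "big") % len(next_hops)]


def share_per_node(flows, next_hops):
    """Fraction of all flows that lands on each live next hop."""
    counts = Counter(pick_next_hop(flow, next_hops) for flow in flows)
    return {hop: counts[hop] / len(flows) for hop in next_hops}


# Hypothetical client flows: (src_ip, src_port, dst_ip, dst_port, proto).
# The destination is the anycast address and port from the Pound example.
flows = [
    ("192.0.2.%d" % (i % 250 + 1), 1024 + i, "10.1.255.1", 8774, "tcp")
    for i in range(10000)
]

# Hypothetical names for five proxies that all advertise the anycast address.
proxies = ["proxy-1", "proxy-2", "proxy-3", "proxy-4", "proxy-5"]

# With all five proxies advertising, each carries roughly a fifth of the flows,
# so losing any one of them affects roughly 20% of traffic, not 100%.
before = share_per_node(flows, proxies)

# proxy-3 fails and its OSPF advertisement is withdrawn; the router now
# spreads every flow over the four survivors.
survivors = [p for p in proxies if p != "proxy-3"]
after = share_per_node(flows, survivors)

for hop in proxies:
    print(hop, round(before[hop], 3), round(after.get(hop, 0.0), 3))
```

One caveat on the design: with a plain modulo hash, flows on surviving proxies may also be remapped when the set of next hops changes. Because the proxies are stateless, a remapped flow only needs to reconnect, which is part of why the shared-nothing design tolerates this.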