SlideShare a Scribd company logo
1 of 24
•      At scale the incredibly rare is commonplace
•      Availability through application redundancy
       •   Inter-region replication
       •   AWS Regions & Availability Zones
       •   Recovery Oriented Computing
       •   Avoiding capacity meltdown
•      Example rare events from industry & AWS
•      Infrastructure & application lessons




    11/29/2012                http://perspectives.mvdirona.com   2
•      Server & disk failure rates:
       •   Disk drives: 4% to 6% annual failure rate
       •   Servers: 2% to 4% annual failure rate (AFR)
•      3% server AFR yields MTBF of 292,000 hours
       •   More than 33 years
•      But, at scale, in DC with 64,000 servers with 2 disks each:
       •   On average, more than 5 servers & 17 disks fail each day
•      Failure both inevitable & common
•      Applies to all infrastructure at all levels
       •   Switchgear, cooling plants, transformers, PDUs, servers, disks,…

    11/29/2012                     http://perspectives.mvdirona.com           3
Every day, AWS adds enough new server capacity to
       support all of Amazon’s global infrastructure when
        it was a $5.2B annual revenue enterprise (2003).




11/29/2012          http://perspectives.mvdirona.com        4
Total Number of S3 Objects

                                                                                                  >1 Trillion


                 Peak Requests:
                    500,000+                                                        762 Billion
                   per second




                                                                     262 Billion

                                                    102 Billion
                      14 Billion   40 Billion
   2.9 Billion

     Q4 2006           Q4 2007      Q4 2008           Q4 2009             Q4 2010     Q4 2011      Q4 2012

11/29/2012                             http://perspectives.mvdirona.com                                         5
US GovCloud       US West x 2             US East               LATAM         Europe West       Asia Pacific   Asia Pacific   Australia
(US ITAR Region   (N. California and   (Northern Virginia)     (Sao Paola)        (Dublin)        Region         Region        Region
   -- Oregon)          Oregon)                                                                   (Singapore)      (Tokyo)     (Australia)




                                             >10 data centers
                                             In US East alone




       9 AWS Regions and growing
       21 AWS Edge Locations for CloudFront (CDN) & Route 53 (DNS)

    11/29/2012                                               http://perspectives.mvdirona.com                                        6
•      At scale the incredibly rare is commonplace
•      Availability through application redundancy
       •   Inter-region replication
       •   AWS Regions & Availability Zones
       •   Recovery Oriented Computing
       •   Avoiding capacity meltdown
•      Example rare events from industry & AWS
•      Infrastructure & application lessons




    11/29/2012                http://perspectives.mvdirona.com   7
• 5th app availability “9” only via multi-datacenter replication
• Conventional approach:
  • Two datacenters in distant locations
  • Replicate all data to both datacenters                  99.999%
• The industry-wide dominant multi-DC availability approach
  • Looks rock solid but performs remarkably poorly in practice
• Acid Test:
  • Are you willing to pull the plug on the primary server?



 11/29/2012              http://perspectives.mvdirona.com         8
•      Asynchronous replication between datacenters
       •   Committing to an SSD order 1 to 2 msec
       •   LA to New York 74msec round trip
       •   You can’t wait 74 msec to commit a transaction
•      On failure, a difficult & high skill decision:
       •   Fail-over & lose transactions, or
       •   Don’t fail-over & lose availability
•      I’ve been in these calls in past roles
       •   No win situation
       •   Very hard to get right



    11/29/2012                        http://perspectives.mvdirona.com   9
• Fragile: Active/Passive doesn’t work
  • Failover to a system that hasn’t been taking operational load
  • Passive secondary not recently tested
  • Secondary config or S/W version different, incorrect load balancer
    config, incorrect network ACLs, latent hardware problem, router
    problem, resource shortage under load, …
  • Can’t test without negative customer impact
  • If you don’t test it, it won’t work
• 2-way redundancy expensive:
  • More than ½ capacity reserved to handle failure
  • 3 datacenters much less expensive but impractical w/o high scale
 11/29/2012               http://perspectives.mvdirona.com         10
• Choose Region to be close to user, close to data, or meeting
  jurisdictional requirements
• Synchronous replication to 2 (or better 3) Availability Zones
       •   Easy when less than 2 to 3 msec away
       •   Can failover w/o customer impact
•      ELB over EC2 instances in different AZs
•      Stateless EC2 apps easy
•      Persistent state is the challenge
       •   DynamoDB
       •   Simple Storage Service
       •   Mutli-AZ RDS
    11/29/2012                      http://perspectives.mvdirona.com   11
• Recovery Oriented Computing (ROC)
  • Assume software & hardware will fail frequently & unpredictably
  • Heavily instrument applications to detect failures
                 Bohr Bug
     App                         Urgent Alert
                                                                               Bohr bug: Repeatable functional
                                                                               software issue (functional bugs); should
           Heisenbug
                                                                               be rare in production
    Restart
                                                                               Heisenbug: Software issue that only
      Failure
                   Reboot                                                      occurs in unusual cross-request timing
                       Failure
                                                                               issues or the pattern of long sequences
                                  Re-image
                                                                               of independent operations; some found
                                     Failure
                                                Replace                        only in production



11/29/2012                                          http://perspectives.mvdirona.com                                 12
• All systems produce non-linear latencies and/or failures
  beyond application-specific load level
• Load limit is software release dependent
   • Changes as the application changes
• Canary in the data center
  • Route increased load to one server in the fleet
  • When starts showing non-linear delay or failure, immediately
    reduce load or take out of load balancer rotation
  • Result: limit is known before full fleet melts down


 2009/2/26                                         13
•   No amount of capacity head room is sufficient
•   Graceful degradation prior to admission control
    • First: shed non-critical workload
    • Then: degraded operations mode
    • Finally: admission control
•   Related concept: Metered rate-of-service admission
    • Allow in small increments of users when restarting after failure
•   Best practice: do not acquire new resources when failing away
    •   No new EC2 instances, no new EBS volumes, ….
    •   Minimize new AWS control plane resource requests
    •   Run active/active & just stop using failed instances

2009/2/26                                                      14
•      Run active/active & test in production
       • Without constant live load, it won’t work when needed
       • Test in production or it won’t work when needed
•      Amazon.com: Game Days
       • Disable all or part of amazon.com production capacity
         in entire datacenter
       • With warning & planning to avoid customer impact
•      Netflix: Chaos Monkey
       • .NET Application that can be run from command line
       • Can be pointed at set of resources in a region
       • Mon to Thurs 9am to 3pm random instance kill
       • Application configuration options (including opt out)
    11/29/2012                     http://perspectives.mvdirona.com   15
•      At scale the incredibly rare is commonplace
•      Availability through application redundancy
       •   Inter-region replication
       •   AWS Regions & Availability Zones
       •   Recovery Oriented Computing
       •   Avoiding capacity meltdown
•      Example rare events from industry & AWS
•      Infrastructure & application lessons




    11/29/2012                http://perspectives.mvdirona.com   16
…cause of the outage was an abrupt power
                                                failure in our state-of-the-art data center facility
                                                … The problem was not just that the power
                                                failure happened, the problem was that it
                                                happened abruptly, with no warning whatsoever,
                                                and all our equipment went down all at once…
                                                Data centers, certainly this one, have triple, and
                                                even quadruple, redundancy in their power
                                                systems just to prevent such an abrupt power
                                                outage.


                                                               Observations
                                                •    Single datacenter availability model
                                                     doesn’t work
                                                •    Costs scale with facility redundancy
                                                     levels
                                                •    Decreasing or inverse payback as
                                                     single facility redundancy increases
11/29/2012   http://perspectives.mvdirona.com                                                   17
Utility                High-Voltage             Mid-Voltage                                      Generator
                                                                                                           2.5kva
     Power                  Transformer              Transformer
    Provider                   10kva                     3kva

                                                                  480V                                 480V

                 110,000V                  13,500V                                   Programmable           Gen
                                                              Utility                    Logic
                                                             Breaker                                       Breaker
                                                                                       Controller
•     Normal Utility Failure Scenario
      •   Small faults common during year
      •   Utility fails but servers run on UPS
                                                                         UPS             UPS         UPS
      •   5 to 10 seconds later, start gen
      •   12 sec to stabilize gen prior to load
      •   5 to 10 sec for gen power stability                                      Data Center Pod
      •   Generator holds load until 30 min of                                      Critical Load
          stable utility power                                                          2.5mw
    11/29/2012                                  http://perspectives.mvdirona.com                               18
Utility                High-Voltage             Mid-Voltage                                       Generator
                                                                                                            2.5kva
     Power                  Transformer              Transformer
    Provider                   10kva                      3kva

                                                                   480V                                 480V

                 110,000V                  13,500V                                      Switchgear           Gen
                                                               Utility                  proprietery         Breaker
•   Failure Scenario                                          Breaker                      PLC
    •   10:41pm: Regional substation high voltage
        transformer failure
    •   Voltage spike sent through mid-voltage
        transformer into data center                                      UPS             UPS         UPS
    •   Switchgear proprietary PLC S/W incorrectly
        detects local datacenter ground fault
    •   10:41pm: Gen starts but breaker locked out
        •   avoid connecting gen into possible direct short                         Data Center Pod
    •   ~10:53pm UPSs discharge and fail sequentially                                Critical Load
                                                                                         2.5mw
    11/29/2012                                   http://perspectives.mvdirona.com                               19
•      At scale the incredibly rare is commonplace
•      Availability through application redundancy
       •   Inter-region replication
       •   AWS Regions & Availability Zones
       •   Recovery Oriented Computing
       •   Avoiding capacity meltdown
•      Example rare events from industry & AWS
•      Infrastructure & application lessons




    11/29/2012                http://perspectives.mvdirona.com   20
•      Multi-AZ is AWS unique & a powerful app redundancy model
•      Full power distribution redundancy & concurrent maintainability
       •   Power redundant even during maintenance operations
•      Switchgear now custom programmed to AWS specs
       •   “hold the load” prioritized above capital equipment protection
•      All configuration settings with maximum engineering headroom
•      Network redundancy & resiliency:
       •   Systematically replace 2-way redundancy with N-way (N>>2)
       •   Custom monitoring to pinpoint faulty router or link fault before app impact
•      Full production testing of all power distribution systems

    11/29/2012                       http://perspectives.mvdirona.com                    21
• Even incredibly rare events will happen at scale
• Multi-AZ & ROC protect against infrastructure, app, & admin issues
• Design applications using ROC principals
       • When application health is unknown, assume failure
       • Trust no single instance, server, router, or data center
       • Only use small number of simple, tested application failure paths
       • Test in production
•      No new resources when failing away
•      More high-scale application best practices at:
       •   http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf


    11/29/2012                         http://perspectives.mvdirona.com      22
• Black Swan events will happen at scale
• Multi-data center required for last 9
  • App & admin errors dominate infrastructure faults
• Multi-AZ redundancy will operate through unexpected
    failures in all levels in app & infrastructure stack
    • Multi-AZ is more reliable & easier to administer
    • Use a small number of simple app failure paths
    • Test failure paths frequently in production
• Reap reward of sleeping all night & riding through failure

 11/29/2012                 http://perspectives.mvdirona.com   23
We are sincerely eager to
 hear your feedback on this
presentation and on re:Invent.

 Please fill out an evaluation
   form when you have a
            chance.

More Related Content

What's hot

Accelerating Server Hardware Upgrades with PlateSpin Migrate P2P
Accelerating Server Hardware Upgrades with PlateSpin Migrate P2PAccelerating Server Hardware Upgrades with PlateSpin Migrate P2P
Accelerating Server Hardware Upgrades with PlateSpin Migrate P2PNovell
 
Virtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - VarrowVirtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - VarrowAndrew Miller
 
Server Virtualization Seminar Presentation
Server Virtualization Seminar PresentationServer Virtualization Seminar Presentation
Server Virtualization Seminar Presentationshabi_hassan
 
Planning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPMPlanning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPMWASdev Community
 
Solaris Linux Performance, Tools and Tuning
Solaris Linux Performance, Tools and TuningSolaris Linux Performance, Tools and Tuning
Solaris Linux Performance, Tools and TuningAdrian Cockcroft
 
Infrastructure as Code - Getting Started, Concepts & Tools
Infrastructure as Code - Getting Started, Concepts & ToolsInfrastructure as Code - Getting Started, Concepts & Tools
Infrastructure as Code - Getting Started, Concepts & ToolsLior Kamrat
 
Delphix Platform Overview
Delphix Platform OverviewDelphix Platform Overview
Delphix Platform OverviewFranco_Dagosto
 
Architecting with power vm
Architecting with power vmArchitecting with power vm
Architecting with power vmCharlie Cler
 
Five things virtualization has changed in your dr plan
Five things virtualization has changed in your dr planFive things virtualization has changed in your dr plan
Five things virtualization has changed in your dr planJosh Mazgelis
 
Healthcheck 07 application
Healthcheck 07 applicationHealthcheck 07 application
Healthcheck 07 applicationNakedi Kobo
 
Building azure applications ireland
Building azure applications irelandBuilding azure applications ireland
Building azure applications irelandMichael Meagher
 
Deploying Maximum HA Architecture With PostgreSQL
Deploying Maximum HA Architecture With PostgreSQLDeploying Maximum HA Architecture With PostgreSQL
Deploying Maximum HA Architecture With PostgreSQLDenish Patel
 
Software Defined Agility for IBM FlashSystem V9000
Software Defined Agility for IBM FlashSystem V9000Software Defined Agility for IBM FlashSystem V9000
Software Defined Agility for IBM FlashSystem V9000Catalogic Software
 
Nyoug delphix slideshare
Nyoug delphix slideshareNyoug delphix slideshare
Nyoug delphix slideshareKyle Hailey
 
SAP and VMware (Virtualizing SAP)
SAP and VMware (Virtualizing SAP)SAP and VMware (Virtualizing SAP)
SAP and VMware (Virtualizing SAP)Cenk Ersoy
 
Kscope 2013 delphix
Kscope 2013 delphixKscope 2013 delphix
Kscope 2013 delphixKyle Hailey
 
Vn212 rad rtb2_power_vm
Vn212 rad rtb2_power_vmVn212 rad rtb2_power_vm
Vn212 rad rtb2_power_vmSylvain Lamour
 

What's hot (20)

Accelerating Server Hardware Upgrades with PlateSpin Migrate P2P
Accelerating Server Hardware Upgrades with PlateSpin Migrate P2PAccelerating Server Hardware Upgrades with PlateSpin Migrate P2P
Accelerating Server Hardware Upgrades with PlateSpin Migrate P2P
 
Designing virtual infrastructure
Designing virtual infrastructureDesigning virtual infrastructure
Designing virtual infrastructure
 
Virtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - VarrowVirtualizing Tier One Applications - Varrow
Virtualizing Tier One Applications - Varrow
 
Server Virtualization Seminar Presentation
Server Virtualization Seminar PresentationServer Virtualization Seminar Presentation
Server Virtualization Seminar Presentation
 
Planning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPMPlanning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPM
 
Solaris Linux Performance, Tools and Tuning
Solaris Linux Performance, Tools and TuningSolaris Linux Performance, Tools and Tuning
Solaris Linux Performance, Tools and Tuning
 
Infrastructure as Code - Getting Started, Concepts & Tools
Infrastructure as Code - Getting Started, Concepts & ToolsInfrastructure as Code - Getting Started, Concepts & Tools
Infrastructure as Code - Getting Started, Concepts & Tools
 
Delphix Platform Overview
Delphix Platform OverviewDelphix Platform Overview
Delphix Platform Overview
 
Double-Take Software
Double-Take SoftwareDouble-Take Software
Double-Take Software
 
Architecting with power vm
Architecting with power vmArchitecting with power vm
Architecting with power vm
 
Five things virtualization has changed in your dr plan
Five things virtualization has changed in your dr planFive things virtualization has changed in your dr plan
Five things virtualization has changed in your dr plan
 
Healthcheck 07 application
Healthcheck 07 applicationHealthcheck 07 application
Healthcheck 07 application
 
Building azure applications ireland
Building azure applications irelandBuilding azure applications ireland
Building azure applications ireland
 
Deploying Maximum HA Architecture With PostgreSQL
Deploying Maximum HA Architecture With PostgreSQLDeploying Maximum HA Architecture With PostgreSQL
Deploying Maximum HA Architecture With PostgreSQL
 
Software Defined Agility for IBM FlashSystem V9000
Software Defined Agility for IBM FlashSystem V9000Software Defined Agility for IBM FlashSystem V9000
Software Defined Agility for IBM FlashSystem V9000
 
Nyoug delphix slideshare
Nyoug delphix slideshareNyoug delphix slideshare
Nyoug delphix slideshare
 
SAP and VMware (Virtualizing SAP)
SAP and VMware (Virtualizing SAP)SAP and VMware (Virtualizing SAP)
SAP and VMware (Virtualizing SAP)
 
Kscope 2013 delphix
Kscope 2013 delphixKscope 2013 delphix
Kscope 2013 delphix
 
Vn212 rad rtb2_power_vm
Vn212 rad rtb2_power_vmVn212 rad rtb2_power_vm
Vn212 rad rtb2_power_vm
 
What is Delphix
What is DelphixWhat is Delphix
What is Delphix
 

Similar to CPN208 Failures at Scale & How to Ride Through Them - AWS re: Invent 2012

High Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best PracticesHigh Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best PracticesRightScale
 
Architecting for failures in the Cloud - Barcamp Bangalore 2013
Architecting for failures in the Cloud - Barcamp Bangalore 2013Architecting for failures in the Cloud - Barcamp Bangalore 2013
Architecting for failures in the Cloud - Barcamp Bangalore 2013P3 InfoTech Solutions Pvt. Ltd.
 
Introduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesIntroduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesJason TC HOU (侯宗成)
 
Tokyo azure meetup #12 service fabric internals
Tokyo azure meetup #12   service fabric internalsTokyo azure meetup #12   service fabric internals
Tokyo azure meetup #12 service fabric internalsTokyo Azure Meetup
 
Running Siebel on AWS - Oracle Open World 13
Running Siebel on AWS - Oracle Open World 13Running Siebel on AWS - Oracle Open World 13
Running Siebel on AWS - Oracle Open World 13Milind Waikul
 
Containerized Hadoop beyond Kubernetes
Containerized Hadoop beyond KubernetesContainerized Hadoop beyond Kubernetes
Containerized Hadoop beyond KubernetesDataWorks Summit
 
Database Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big DataDatabase Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big Dataexponential-inc
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsJignesh Shah
 
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...VMworld
 
Rightscale Webinar: Building Blocks for Private and Hybrid Clouds
Rightscale Webinar: Building Blocks for Private and Hybrid CloudsRightscale Webinar: Building Blocks for Private and Hybrid Clouds
Rightscale Webinar: Building Blocks for Private and Hybrid CloudsRightScale
 
Availability Considerations for SQL Server
Availability Considerations for SQL ServerAvailability Considerations for SQL Server
Availability Considerations for SQL ServerBob Roudebush
 
E2 evc 3-2-1-rule - mikeresseler
E2 evc   3-2-1-rule - mikeresselerE2 evc   3-2-1-rule - mikeresseler
E2 evc 3-2-1-rule - mikeresselerMike Resseler
 
Oracle Fusion Middleware - pragmatic approach to build up your applications -...
Oracle Fusion Middleware - pragmatic approach to build up your applications -...Oracle Fusion Middleware - pragmatic approach to build up your applications -...
Oracle Fusion Middleware - pragmatic approach to build up your applications -...ORACLE USER GROUP ESTONIA
 
HA and DR for Cloud Workloads
HA and DR for Cloud WorkloadsHA and DR for Cloud Workloads
HA and DR for Cloud Workloadsswamybabu
 
SolarWinds Scalability for the Enterprise
SolarWinds Scalability for the EnterpriseSolarWinds Scalability for the Enterprise
SolarWinds Scalability for the EnterpriseSolarWinds
 
1 architecture & design
1   architecture & design1   architecture & design
1 architecture & designMark Swarbrick
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Toronto-Oracle-Users-Group
 
Citrix XenDesktop: Dealing with Failure - SYN408
Citrix XenDesktop: Dealing with Failure - SYN408Citrix XenDesktop: Dealing with Failure - SYN408
Citrix XenDesktop: Dealing with Failure - SYN408Tom Gamull
 

Similar to CPN208 Failures at Scale & How to Ride Through Them - AWS re: Invent 2012 (20)

High Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best PracticesHigh Availability in the Cloud - Architectural Best Practices
High Availability in the Cloud - Architectural Best Practices
 
Architecting for failures in the Cloud - Barcamp Bangalore 2013
Architecting for failures in the Cloud - Barcamp Bangalore 2013Architecting for failures in the Cloud - Barcamp Bangalore 2013
Architecting for failures in the Cloud - Barcamp Bangalore 2013
 
Introduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesIntroduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network Issues
 
Tokyo azure meetup #12 service fabric internals
Tokyo azure meetup #12   service fabric internalsTokyo azure meetup #12   service fabric internals
Tokyo azure meetup #12 service fabric internals
 
Db trends final
Db trends   finalDb trends   final
Db trends final
 
Designing Scalable Applications
Designing Scalable ApplicationsDesigning Scalable Applications
Designing Scalable Applications
 
Running Siebel on AWS - Oracle Open World 13
Running Siebel on AWS - Oracle Open World 13Running Siebel on AWS - Oracle Open World 13
Running Siebel on AWS - Oracle Open World 13
 
Containerized Hadoop beyond Kubernetes
Containerized Hadoop beyond KubernetesContainerized Hadoop beyond Kubernetes
Containerized Hadoop beyond Kubernetes
 
Database Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big DataDatabase Virtualization: The Next Wave of Big Data
Database Virtualization: The Next Wave of Big Data
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
 
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...
 
Rightscale Webinar: Building Blocks for Private and Hybrid Clouds
Rightscale Webinar: Building Blocks for Private and Hybrid CloudsRightscale Webinar: Building Blocks for Private and Hybrid Clouds
Rightscale Webinar: Building Blocks for Private and Hybrid Clouds
 
Availability Considerations for SQL Server
Availability Considerations for SQL ServerAvailability Considerations for SQL Server
Availability Considerations for SQL Server
 
E2 evc 3-2-1-rule - mikeresseler
E2 evc   3-2-1-rule - mikeresselerE2 evc   3-2-1-rule - mikeresseler
E2 evc 3-2-1-rule - mikeresseler
 
Oracle Fusion Middleware - pragmatic approach to build up your applications -...
Oracle Fusion Middleware - pragmatic approach to build up your applications -...Oracle Fusion Middleware - pragmatic approach to build up your applications -...
Oracle Fusion Middleware - pragmatic approach to build up your applications -...
 
HA and DR for Cloud Workloads
HA and DR for Cloud WorkloadsHA and DR for Cloud Workloads
HA and DR for Cloud Workloads
 
SolarWinds Scalability for the Enterprise
SolarWinds Scalability for the EnterpriseSolarWinds Scalability for the Enterprise
SolarWinds Scalability for the Enterprise
 
1 architecture & design
1   architecture & design1   architecture & design
1 architecture & design
 
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?Extreme Availability using Oracle 12c Features: Your very last system shutdown?
Extreme Availability using Oracle 12c Features: Your very last system shutdown?
 
Citrix XenDesktop: Dealing with Failure - SYN408
Citrix XenDesktop: Dealing with Failure - SYN408Citrix XenDesktop: Dealing with Failure - SYN408
Citrix XenDesktop: Dealing with Failure - SYN408
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

CPN208 Failures at Scale & How to Ride Through Them - AWS re: Invent 2012

  • 1.
  • 2. At scale the incredibly rare is commonplace • Availability through application redundancy • Inter-region replication • AWS Regions & Availability Zones • Recovery Oriented Computing • Avoiding capacity meltdown • Example rare events from industry & AWS • Infrastructure & application lessons 11/29/2012 http://perspectives.mvdirona.com 2
  • 3. Server & disk failure rates: • Disk drives: 4% to 6% annual failure rate • Servers: 2% to 4% annual failure rate (AFR) • 3% server AFR yields MTBF of 292,000 hours • More than 33 years • But, at scale, in DC with 64,000 servers with 2 disks each: • On average, more than 5 servers & 17 disks fail each day • Failure both inevitable & common • Applies to all infrastructure at all levels • Switchgear, cooling plants, transformers, PDUs, servers, disks,… 11/29/2012 http://perspectives.mvdirona.com 3
  • 4. Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $5.2B annual revenue enterprise (2003). 11/29/2012 http://perspectives.mvdirona.com 4
  • 5. Total Number of S3 Objects >1 Trillion Peak Requests: 500,000+ 762 Billion per second 262 Billion 102 Billion 14 Billion 40 Billion 2.9 Billion Q4 2006 Q4 2007 Q4 2008 Q4 2009 Q4 2010 Q4 2011 Q4 2012 11/29/2012 http://perspectives.mvdirona.com 5
  • 6. US GovCloud US West x 2 US East LATAM Europe West Asia Pacific Asia Pacific Australia (US ITAR Region (N. California and (Northern Virginia) (Sao Paola) (Dublin) Region Region Region -- Oregon) Oregon) (Singapore) (Tokyo) (Australia) >10 data centers In US East alone 9 AWS Regions and growing 21 AWS Edge Locations for CloudFront (CDN) & Route 53 (DNS) 11/29/2012 http://perspectives.mvdirona.com 6
  • 7. At scale the incredibly rare is commonplace • Availability through application redundancy • Inter-region replication • AWS Regions & Availability Zones • Recovery Oriented Computing • Avoiding capacity meltdown • Example rare events from industry & AWS • Infrastructure & application lessons 11/29/2012 http://perspectives.mvdirona.com 7
  • 8. • 5th app availability “9” only via multi-datacenter replication • Conventional approach: • Two datacenters in distant locations • Replicate all data to both datacenters 99.999% • The industry-wide dominant multi-DC availability approach • Looks rock solid but performs remarkably poorly in practice • Acid Test: • Are you willing to pull the plug on the primary server? 11/29/2012 http://perspectives.mvdirona.com 8
  • 9. Asynchronous replication between datacenters • Committing to an SSD order 1 to 2 msec • LA to New York 74msec round trip • You can’t wait 74 msec to commit a transaction • On failure, a difficult & high skill decision: • Fail-over & lose transactions, or • Don’t fail-over & lose availability • I’ve been in these calls in past roles • No win situation • Very hard to get right 11/29/2012 http://perspectives.mvdirona.com 9
  • 10. • Fragile: Active/Passive doesn’t work • Failover to a system that hasn’t been taking operational load • Passive secondary not recently tested • Secondary config or S/W version different, incorrect load balancer config, incorrect network ACLs, latent hardware problem, router problem, resource shortage under load, … • Can’t test without negative customer impact • If you don’t test it, it won’t work • 2-way redundancy expensive: • More than ½ capacity reserved to handle failure • 3 datacenters much less expensive but impractical w/o high scale 11/29/2012 http://perspectives.mvdirona.com 10
  • 11. • Choose Region to be close to user, close to data, or meeting jurisdictional requirements • Synchronous replication to 2 (or better 3) Availability Zones • Easy when less than 2 to 3 msec away • Can failover w/o customer impact • ELB over EC2 instances in different AZs • Stateless EC2 apps easy • Persistent state is the challenge • DynamoDB • Simple Storage Service • Mutli-AZ RDS 11/29/2012 http://perspectives.mvdirona.com 11
  • 12. • Recovery Oriented Computing (ROC) • Assume software & hardware will fail frequently & unpredictably • Heavily instrument applications to detect failures Bohr Bug App Urgent Alert Bohr bug: Repeatable functional software issue (functional bugs); should Heisenbug be rare in production Restart Heisenbug: Software issue that only Failure Reboot occurs in unusual cross-request timing Failure issues or the pattern of long sequences Re-image of independent operations; some found Failure Replace only in production 11/29/2012 http://perspectives.mvdirona.com 12
  • 13. • All systems produce non-linear latencies and/or failures beyond application-specific load level • Load limit is software release dependent • Changes as the application changes • Canary in the data center • Route increased load to one server in the fleet • When starts showing non-linear delay or failure, immediately reduce load or take out of load balancer rotation • Result: limit is known before full fleet melts down 2009/2/26 13
  • 14. No amount of capacity head room is sufficient • Graceful degradation prior to admission control • First: shed non-critical workload • Then: degraded operations mode • Finally: admission control • Related concept: Metered rate-of-service admission • Allow in small increments of users when restarting after failure • Best practice: do not acquire new resources when failing away • No new EC2 instances, no new EBS volumes, …. • Minimize new AWS control plane resource requests • Run active/active & just stop using failed instances 2009/2/26 14
  • 15. Run active/active & test in production • Without constant live load, it won’t work when needed • Test in production or it won’t work when needed • Amazon.com: Game Days • Disable all or part of amazon.com production capacity in entire datacenter • With warning & planning to avoid customer impact • Netflix: Chaos Monkey • .NET Application that can be run from command line • Can be pointed at set of resources in a region • Mon to Thurs 9am to 3pm random instance kill • Application configuration options (including opt out) 11/29/2012 http://perspectives.mvdirona.com 15
  • 16. At scale the incredibly rare is commonplace • Availability through application redundancy • Inter-region replication • AWS Regions & Availability Zones • Recovery Oriented Computing • Avoiding capacity meltdown • Example rare events from industry & AWS • Infrastructure & application lessons 11/29/2012 http://perspectives.mvdirona.com 16
  • 17. …cause of the outage was an abrupt power failure in our state-of-the-art data center facility … The problem was not just that the power failure happened, the problem was that it happened abruptly, with no warning whatsoever, and all our equipment went down all at once… Data centers, certainly this one, have triple, and even quadruple, redundancy in their power systems just to prevent such an abrupt power outage. Observations • Single datacenter availability model doesn’t work • Costs scale with facility redundancy levels • Decreasing or inverse payback as single facility redundancy increases 11/29/2012 http://perspectives.mvdirona.com 17
  • 18. Utility High-Voltage Mid-Voltage Generator 2.5kva Power Transformer Transformer Provider 10kva 3kva 480V 480V 110,000V 13,500V Programmable Gen Utility Logic Breaker Breaker Controller • Normal Utility Failure Scenario • Small faults common during year • Utility fails but servers run on UPS UPS UPS UPS • 5 to 10 seconds later, start gen • 12 sec to stabilize gen prior to load • 5 to 10 sec for gen power stability Data Center Pod • Generator holds load until 30 min of Critical Load stable utility power 2.5mw 11/29/2012 http://perspectives.mvdirona.com 18
  • 19. Utility High-Voltage Mid-Voltage Generator 2.5kva Power Transformer Transformer Provider 10kva 3kva 480V 480V 110,000V 13,500V Switchgear Gen Utility proprietery Breaker • Failure Scenario Breaker PLC • 10:41pm: Regional substation high voltage transformer failure • Voltage spike sent through mid-voltage transformer into data center UPS UPS UPS • Switchgear proprietary PLC S/W incorrectly detects local datacenter ground fault • 10:41pm: Gen starts but breaker locked out • avoid connecting gen into possible direct short Data Center Pod • ~10:53pm UPSs discharge and fail sequentially Critical Load 2.5mw 11/29/2012 http://perspectives.mvdirona.com 19
  • 20. At scale the incredibly rare is commonplace • Availability through application redundancy • Inter-region replication • AWS Regions & Availability Zones • Recovery Oriented Computing • Avoiding capacity meltdown • Example rare events from industry & AWS • Infrastructure & application lessons 11/29/2012 http://perspectives.mvdirona.com 20
  • 21. Multi-AZ is AWS unique & a powerful app redundancy model • Full power distribution redundancy & concurrent maintainability • Power redundant even during maintenance operations • Switchgear now custom programmed to AWS specs • “hold the load” prioritized above capital equipment protection • All configuration settings with maximum engineering headroom • Network redundancy & resiliency: • Systematically replace 2-way redundancy with N-way (N>>2) • Custom monitoring to pinpoint faulty router or link fault before app impact • Full production testing of all power distribution systems 11/29/2012 http://perspectives.mvdirona.com 21
  • 22. • Even incredibly rare events will happen at scale • Multi-AZ & ROC protect against infrastructure, app, & admin issues • Design applications using ROC principals • When application health is unknown, assume failure • Trust no single instance, server, router, or data center • Only use small number of simple, tested application failure paths • Test in production • No new resources when failing away • More high-scale application best practices at: • http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf 11/29/2012 http://perspectives.mvdirona.com 22
  • 23. • Black Swan events will happen at scale • Multi-data center required for last 9 • App & admin errors dominate infrastructure faults • Multi-AZ redundancy will operate through unexpected failures in all levels in app & infrastructure stack • Multi-AZ is more reliable & easier to administer • Use a small number of simple app failure paths • Test failure paths frequently in production • Reap reward of sleeping all night & riding through failure 11/29/2012 http://perspectives.mvdirona.com 23
  • 24. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.