CPN208 Failures at Scale & How to Ride Through Them - AWS re:Invent 2012
At scale, rare and unexpected events will happen. Things eventually will go wrong. This talk dives into what can go wrong at scale and how to architect applications to ride through disaster obliviously. We’ll talk about AWS infrastructure design including Regions and Availability Zones and show how applications can be written and operated to best exploit this industry-unique infrastructure redundancy model. Believing that experience is one of the best teachers, we will go through some of the more interesting and educational industry post mortems including some experienced at AWS to motivate these application design decisions and show how they can mitigate the damage of the truly unexpected.

  1. • At scale the incredibly rare is commonplace
     • Availability through application redundancy
        • Inter-region replication
        • AWS Regions & Availability Zones
        • Recovery Oriented Computing
        • Avoiding capacity meltdown
     • Example rare events from industry & AWS
     • Infrastructure & application lessons
  2. • Server & disk failure rates:
        • Disk drives: 4% to 6% annual failure rate
        • Servers: 2% to 4% annual failure rate (AFR)
     • 3% server AFR yields MTBF of 292,000 hours
        • More than 33 years
     • But, at scale, in a datacenter with 64,000 servers, each with 2 disks:
        • On average, more than 5 servers & 17 disks fail each day
     • Failure both inevitable & common
     • Applies to all infrastructure at all levels
        • Switchgear, cooling plants, transformers, PDUs, servers, disks, …
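To make the arithmetic behind these figures explicit, here is a minimal sketch in Python, assuming the midpoint rates quoted above (3% server AFR, 5% disk AFR):

```python
HOURS_PER_YEAR = 365 * 24   # 8,760

server_afr = 0.03           # 3% annual failure rate (midpoint of 2%-4%)
disk_afr = 0.05             # 5% annual failure rate (midpoint of 4%-6%)
servers = 64_000
disks = servers * 2

# A 3% AFR means roughly one failure per 33 server-years.
mtbf_hours = HOURS_PER_YEAR / server_afr
print(f"Server MTBF: {mtbf_hours:,.0f} hours (~{mtbf_hours / HOURS_PER_YEAR:.0f} years)")

# Expected failures per day across the whole fleet.
print(f"Server failures per day: {servers * server_afr / 365:.1f}")
print(f"Disk failures per day:   {disks * disk_afr / 365:.1f}")
```

The output (292,000 hours, ~5.3 servers and ~17.5 disks per day) matches the slide: an individual server almost never fails, yet the fleet sees failures every single day.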
  3. Every day, AWS adds enough new server capacity to support all of Amazon’s global
     infrastructure when it was a $5.2B annual revenue enterprise (2003).
  4. Total number of S3 objects (peak requests: 500,000+ per second):
        • Q4 2006: 2.9 billion
        • Q4 2007: 14 billion
        • Q4 2008: 40 billion
        • Q4 2009: 102 billion
        • Q4 2010: 262 billion
        • Q4 2011: 762 billion
        • Q4 2012: >1 trillion
  5. 9 AWS Regions and growing:
        • US GovCloud (US ITAR Region, Oregon)
        • US West x 2 (N. California and Oregon)
        • US East (Northern Virginia)
        • LATAM (São Paulo)
        • Europe West (Dublin)
        • Asia Pacific (Singapore)
        • Asia Pacific (Tokyo)
        • Australia Region
     • >10 data centers in US East alone
     • 21 AWS Edge Locations for CloudFront (CDN) & Route 53 (DNS)
  6. • At scale the incredibly rare is commonplace
     • Availability through application redundancy
        • Inter-region replication
        • AWS Regions & Availability Zones
        • Recovery Oriented Computing
        • Avoiding capacity meltdown
     • Example rare events from industry & AWS
     • Infrastructure & application lessons
  7. • 5th app availability “9” (99.999%) only via multi-datacenter replication
     • Conventional approach:
        • Two datacenters in distant locations
        • Replicate all data to both datacenters
     • The industry-wide dominant multi-DC availability approach
        • Looks rock solid but performs remarkably poorly in practice
     • Acid test:
        • Are you willing to pull the plug on the primary server?
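For context on what that fifth 9 buys, each additional 9 cuts the annual downtime budget by 10x; a quick calculation:

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

for nines in range(2, 6):
    availability = 1 - 10 ** -nines
    downtime_min = MINUTES_PER_YEAR * 10 ** -nines
    print(f"{nines} nines ({availability:.3%} available): "
          f"{downtime_min:,.1f} minutes of downtime per year")
```

Five nines leaves roughly 5.3 minutes of downtime per year, which is less than a single failover event typically costs, hence the need for redundancy that rides through failures rather than reacting to them.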
  8. • Asynchronous replication between datacenters
        • Committing to an SSD: on the order of 1 to 2 msec
        • LA to New York: 74 msec round trip
        • You can’t wait 74 msec to commit a transaction
     • On failure, a difficult & high-skill decision:
        • Fail over & lose transactions, or
        • Don’t fail over & lose availability
     • I’ve been in these calls in past roles
        • No-win situation
        • Very hard to get right
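A rough illustration of why synchronous cross-country commits are untenable, assuming a single fully serialized writer (which overstates the effect for pipelined workloads) and the latencies quoted above:

```python
local_ssd_commit_ms = 1.5    # ~1-2 msec local SSD commit
la_to_ny_rtt_ms = 74.0       # cross-country round trip from the slide

def max_serial_commits_per_sec(commit_latency_ms: float) -> float:
    """Upper bound on commits/sec for one fully serialized writer."""
    return 1000.0 / commit_latency_ms

print(f"Local commit (async replication): "
      f"{max_serial_commits_per_sec(local_ssd_commit_ms):7.1f} commits/sec")
print(f"Synchronous cross-country commit: "
      f"{max_serial_commits_per_sec(local_ssd_commit_ms + la_to_ny_rtt_ms):7.1f} commits/sec")
```

Roughly 667 commits/sec locally versus about 13 when every commit waits on the wide-area round trip, which is why distant datacenter pairs replicate asynchronously and then face the fail-over-or-not dilemma above.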
  9. • Fragile: Active/Passive doesn’t work
        • Failover to a system that hasn’t been taking operational load
        • Passive secondary not recently tested
        • Secondary config or S/W version different, incorrect load balancer config,
          incorrect network ACLs, latent hardware problem, router problem,
          resource shortage under load, …
        • Can’t test without negative customer impact
        • If you don’t test it, it won’t work
     • 2-way redundancy expensive:
        • More than ½ capacity reserved to handle failure
        • 3 datacenters much less expensive but impractical w/o high scale
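The cost point can be quantified: if any one of N datacenters must be able to fail while the survivors carry the full load, each site has to be sized at 1/(N-1) of the load, leaving 1/N of all purchased capacity in reserve. A small sketch:

```python
def redundancy_overhead(n_datacenters: int) -> tuple[float, float]:
    """Capacity needed to survive the loss of one of n datacenters.

    Returns (extra capacity bought as a fraction of the load,
             fraction of total capacity held in reserve)."""
    total_capacity = n_datacenters / (n_datacenters - 1)   # load normalized to 1.0
    return total_capacity - 1.0, 1.0 - 1.0 / total_capacity

for n in (2, 3, 4):
    extra, reserved = redundancy_overhead(n)
    print(f"{n} datacenters: buy {extra:.0%} extra capacity; "
          f"{reserved:.0%} of the total sits reserved")
```

Two sites mean buying 100% extra capacity with half of it idle; three sites cut that to 50% extra, which is why three-way redundancy is far cheaper per unit of served load, but only practical at high scale.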
  10. • Choose Region to be close to users, close to data, or to meet jurisdictional requirements
      • Synchronous replication to 2 (or better, 3) Availability Zones
         • Easy when less than 2 to 3 msec away
         • Can fail over w/o customer impact
      • ELB over EC2 instances in different AZs
      • Stateless EC2 apps are easy
      • Persistent state is the challenge
         • DynamoDB
         • Simple Storage Service
         • Multi-AZ RDS
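As one concrete way to push persistent state into a managed Multi-AZ service, here is a sketch using today's boto3 SDK (which postdates this 2012 talk); the identifier, instance class, and password are illustrative placeholders, and MultiAZ=True asks RDS to keep a synchronously replicated standby in a second Availability Zone:

```python
import boto3

# Requires AWS credentials; region chosen per the slide's Region-selection criteria.
rds = boto3.client("rds", region_name="us-east-1")

# Hypothetical database; MultiAZ=True provisions a synchronous standby in a
# different Availability Zone and RDS fails over to it automatically.
rds.create_db_instance(
    DBInstanceIdentifier="orders-db",       # placeholder name
    Engine="mysql",
    DBInstanceClass="db.m5.large",          # placeholder size
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me",         # use a secrets store in practice
    MultiAZ=True,
)
```

The application keeps a single endpoint; the replication, standby, and failover decision are handled below the line, which is the point of pushing state into the managed services listed above.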
  11. • Recovery Oriented Computing (ROC)
         • Assume software & hardware will fail frequently & unpredictably
         • Heavily instrument applications to detect failures
      • Bohr bug: repeatable functional software issue (functional bug); should be rare
        in production; handled by an urgent alert to the app
      • Heisenbug: software issue that occurs only in unusual cross-request timing or in
        patterns of long sequences of independent operations; some found only in production
      • Recovery escalation for Heisenbugs: restart → on failure, reboot → on failure,
        re-image → on failure, replace
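The Heisenbug escalation path on this slide (restart, then reboot, then re-image, then replace) translates almost literally into code; a minimal sketch with hypothetical recovery hooks that merely simulate success probabilities:

```python
import random
from typing import Callable

# Hypothetical hooks; a real system would call fleet-management APIs here.
def restart_service(host: str) -> bool: return random.random() < 0.7
def reboot_host(host: str) -> bool:     return random.random() < 0.8
def reimage_host(host: str) -> bool:    return random.random() < 0.9
def replace_host(host: str) -> bool:    return True   # swapping hardware always "works"

ESCALATION: list[tuple[str, Callable[[str], bool]]] = [
    ("restart", restart_service),
    ("reboot", reboot_host),
    ("re-image", reimage_host),
    ("replace", replace_host),
]

def recover(host: str) -> str:
    """Walk the ROC ladder, stopping at the cheapest action that clears the fault."""
    for name, action in ESCALATION:
        if action(host):
            return name
    raise RuntimeError(f"{host}: every recovery action failed; page an operator")

print(recover("web-042"))   # usually "restart"; escalates only when needed
```

Note that this ladder only makes sense for Heisenbugs; a Bohr bug will simply fail again after every restart, which is why the slide routes those to an urgent alert instead.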
  12. • All systems produce non-linear latencies and/or failures beyond an
        application-specific load level
      • Load limit is software-release dependent
         • Changes as the application changes
      • Canary in the data center
         • Route increased load to one server in the fleet
         • When it starts showing non-linear delay or failure, immediately reduce load or
           take it out of load balancer rotation
         • Result: limit is known before the full fleet melts down
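A sketch of the canary idea: step up load on a single host and note where latency stops growing roughly linearly, so the per-server limit is known before the rest of the fleet is pushed there. The probe and the knee threshold below are illustrative stand-ins:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median

def send_request(host: str) -> float:
    """Hypothetical timed probe; replace with a real call against the canary host."""
    t0 = time.perf_counter()
    time.sleep(0.005)                      # stand-in for real request work
    return time.perf_counter() - t0

def canary_knee(host: str, knee_factor: float = 2.0) -> int:
    """Step up concurrent load on one host until median latency turns non-linear.

    Returns the last concurrency level that still looked healthy, i.e. the
    per-server limit to respect across the rest of the fleet."""
    baseline, healthy = None, 1
    for concurrency in (1, 2, 4, 8, 16, 32, 64):
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            latencies = list(pool.map(send_request, [host] * concurrency * 20))
        p50 = median(latencies)
        baseline = baseline or p50
        if p50 > knee_factor * baseline:   # latency knee: back off before meltdown
            break
        healthy = concurrency
    return healthy

print(f"Safe per-server concurrency: ~{canary_knee('canary-host-01')}")
```

Because the limit moves with every software release, the probe has to run continuously against live traffic, not once in a lab.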
  13. • No amount of capacity headroom is sufficient
      • Graceful degradation prior to admission control
         • First: shed non-critical workload
         • Then: degraded operations mode
         • Finally: admission control
      • Related concept: metered rate-of-service admission
         • Allow users back in small increments when restarting after failure
      • Best practice: do not acquire new resources when failing away
         • No new EC2 instances, no new EBS volumes, …
         • Minimize new AWS control plane resource requests
         • Run active/active & just stop using failed instances
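A sketch of the degradation ladder and of metered re-admission after recovery; the thresholds and rates are illustrative, the point is degrading before turning users away and then letting them back in gradually:

```python
import time

def service_mode(load: float) -> str:
    """Pick an operating mode from current load as a fraction of the known safe limit.

    Thresholds are illustrative; the key property is degrading *before*
    rejecting users."""
    if load < 0.70:
        return "normal"
    if load < 0.85:
        return "shed non-critical work"     # e.g. batch jobs, recommendations
    if load < 0.95:
        return "degraded operations"        # e.g. serve cached or stale results
    return "admission control"              # protect existing sessions, reject new ones

class MeteredAdmission:
    """After recovery, let users back in small increments instead of all at once."""
    def __init__(self, start_rate: float = 10.0, growth_per_min: float = 1.5):
        self.rate = start_rate              # new sessions allowed per minute
        self.growth = growth_per_min
        self.window_start = time.monotonic()
        self.admitted = 0

    def allow_new_session(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:
            # A real system would also check health before opening the tap wider.
            self.rate *= self.growth
            self.window_start, self.admitted = now, 0
        if self.admitted < self.rate:
            self.admitted += 1
            return True
        return False
```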
  14. • Run active/active & test in production
         • Without constant live load, it won’t work when needed
         • Test in production or it won’t work when needed
      • Amazon.com: Game Days
         • Disable all or part of amazon.com production capacity in an entire datacenter
         • With warning & planning to avoid customer impact
      • Netflix: Chaos Monkey
         • .NET application that can be run from the command line
         • Can be pointed at a set of resources in a region
         • Mon to Thurs, 9am to 3pm: random instance kill
         • Application configuration options (including opt out)
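A hedged sketch of a Chaos-Monkey-style instance killer (not Netflix's actual tool): terminate one randomly chosen tagged instance, only Monday through Thursday, 9am to 3pm. It uses today's boto3 SDK and a hypothetical chaos-opt-in tag rather than the opt-out configuration the slide mentions:

```python
import random
from datetime import datetime

import boto3

def maybe_kill_one_instance(region: str = "us-east-1") -> None:
    """Terminate one random opted-in instance, Mon-Thu 9am-3pm only."""
    now = datetime.now()
    if now.weekday() > 3 or not 9 <= now.hour < 15:   # Mon=0 .. Thu=3
        return

    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},  # hypothetical tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instances:
        victim = random.choice(instances)
        print(f"Chaos test: terminating {victim}")
        ec2.terminate_instances(InstanceIds=[victim])
```

The business-hours restriction is deliberate: failures should be injected when engineers are present to observe and fix what breaks, which is what makes the exercise a test rather than an outage.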
  15. • At scale the incredibly rare is commonplace
      • Availability through application redundancy
         • Inter-region replication
         • AWS Regions & Availability Zones
         • Recovery Oriented Computing
         • Avoiding capacity meltdown
      • Example rare events from industry & AWS
      • Infrastructure & application lessons
  16. “…cause of the outage was an abrupt power failure in our state-of-the-art data center
      facility… The problem was not just that the power failure happened, the problem was
      that it happened abruptly, with no warning whatsoever, and all our equipment went down
      all at once… Data centers, certainly this one, have triple, and even quadruple,
      redundancy in their power systems just to prevent such an abrupt power outage.”
      Observations:
      • Single-datacenter availability model doesn’t work
      • Costs scale with facility redundancy levels
      • Decreasing or inverse payback as single-facility redundancy increases
  17. [Diagram: utility power provider (110,000V) → high-voltage transformer →
      mid-voltage transformer (13,500V) → 480V distribution through switchgear with a
      programmable logic controller, generator breaker & utility breaker → UPSs →
      data center pod critical load (2.5 MW), backed by a generator]
      • Normal utility failure scenario
         • Small faults common during the year
         • Utility fails but servers run on UPS
         • 5 to 10 seconds later, start generator
         • 12 sec to stabilize generator prior to load
         • 5 to 10 sec for generator power stability
         • Generator holds load until 30 min of stable utility power
  18. [Diagram: same power distribution path as the previous slide, highlighting the
      switchgear’s proprietary PLC, generator breaker & utility breaker]
      • Failure scenario
         • 10:41pm: regional substation high-voltage transformer failure
         • Voltage spike sent through mid-voltage transformer into the data center
         • Switchgear proprietary PLC S/W incorrectly detects local datacenter ground fault
         • 10:41pm: generator starts but breaker locked out
            • To avoid connecting the generator into a possible direct short
         • ~10:53pm: UPSs discharge and fail sequentially
  19. • At scale the incredibly rare is commonplace
      • Availability through application redundancy
         • Inter-region replication
         • AWS Regions & Availability Zones
         • Recovery Oriented Computing
         • Avoiding capacity meltdown
      • Example rare events from industry & AWS
      • Infrastructure & application lessons
  20. • Multi-AZ is unique to AWS & a powerful app redundancy model
      • Full power distribution redundancy & concurrent maintainability
         • Power-redundant even during maintenance operations
      • Switchgear now custom programmed to AWS specs
         • “Hold the load” prioritized above capital equipment protection
      • All configuration settings chosen with maximum engineering headroom
      • Network redundancy & resiliency:
         • Systematically replace 2-way redundancy with N-way (N >> 2)
         • Custom monitoring to pinpoint a faulty router or link before app impact
      • Full production testing of all power distribution systems
  21. • Even incredibly rare events will happen at scale
      • Multi-AZ & ROC protect against infrastructure, app, & admin issues
      • Design applications using ROC principles
         • When application health is unknown, assume failure
         • Trust no single instance, server, router, or data center
         • Use only a small number of simple, tested application failure paths
         • Test in production
      • No new resources when failing away
      • More high-scale application best practices at:
         • http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
  22. • Black Swan events will happen at scale
      • Multi-datacenter required for the last 9
         • App & admin errors dominate infrastructure faults
      • Multi-AZ redundancy will operate through unexpected failures at all levels of the
        app & infrastructure stack
         • Multi-AZ is more reliable & easier to administer
         • Use a small number of simple app failure paths
         • Test failure paths frequently in production
      • Reap the reward of sleeping all night & riding through failure
  23. We are sincerely eager to hear your feedback on this presentation and on re:Invent.
      Please fill out an evaluation form when you have a chance.
