Amazon cloud failure

Post mortem of the Amazon Cloud disruption: what happened and why, and how it can be avoided.

  1. Amazon AWS Disruption. Suhas A. Kelkar, Director, Incubator Team. May 2nd, 2011
  2. What Happened… Easter Weekend AWS Disruption. Affected region.
  3. Amazon Basics (Image Credit: DongWoo Lee, blog.edog.net)
  4. AWS Basics: Regions
  5. Five geographic regions: US East (N. Virginia), US West (N. California), EU (Ireland), APAC (Singapore), and APAC (Tokyo)
  6. The AWS EC2 SLA commits to 99.95% availability for each region
  7. Each region contains one or more Availability Zones
  8. Availability Zones
  9. Distinct locations engineered to be insulated from failures in other Availability Zones in the same region
  10. EBS (Elastic Block Store)
  11. Create storage volumes from 1 GB to 1 TB
  12. A file system can be created on top of an EBS volume
  13. EBS volumes live in a single Availability Zone and can only be attached to instances in that same zone (see the sketch after this slide block)
  14. Volumes are automatically replicated within the same Availability Zone
  15. Control plane services coordinate user requests and propagate them to the EBS clusters
  [EBS Architecture diagram: Regions 1..5 contain Control Plane Services and Availability Zones 1..n; each zone holds EBS clusters whose nodes are connected by a primary high bandwidth network and a secondary low bandwidth replication network.]
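To make the zone locality of EBS concrete (slides 13 and 14), here is a minimal sketch using the present-day boto3 SDK rather than 2011-era tooling. The zone name, instance ID, and device name are placeholders; the constraint it illustrates is the one from the slides: the volume and the instance must sit in the same Availability Zone.

    import boto3

    # Assumption: boto3 is configured with valid AWS credentials; IDs are placeholders.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Create a volume pinned to a single Availability Zone.
    volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100)  # size in GiB

    # Wait until the volume is ready before attaching it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

    # Attaching succeeds only if the instance also lives in us-east-1a.
    ec2.attach_volume(
        VolumeId=volume["VolumeId"],
        InstanceId="i-0123456789abcdef0",  # placeholder: an instance in us-east-1a
        Device="/dev/sdf",
    )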
  16. Amazon Cloud Disruption Post Mortem: The Trigger
  17. An incorrect traffic shift routed traffic onto the lower capacity secondary EBS network
  18. Many nodes in the affected AZ became completely isolated and lost connection to their replicas
  19. Re-mirroring storm
  20. After the incorrect traffic shift was rolled back, the previously isolated nodes began searching the EBS cluster for available server space so they could re-mirror their data
  21. The cluster's free capacity was soon exhausted, leaving many nodes stuck in a loop searching for free space
  22. This led to a re-mirroring storm in which a large number of volumes were effectively “stuck” while their nodes kept searching
  23. Why did this affect other AZs?
  24. The EBS cluster became unable to service “create volume” API requests
  25. This caused thread starvation in the EBS control plane, affecting service to other AZs
  Amazon Cloud Disruption Post Mortem (contd.): The two major factors at the root of this problem
  26. Nodes that failed to find new replica nodes did not back off aggressively enough (see the sketch after the walk-through below)
  27. A race condition (bug) in the code caused nodes to fail incorrectly when they were closing a large number of replication requests.
  Failure walk-through:
    1. Traffic normally flows through the primary high bandwidth network
    2. Traffic was incorrectly shifted (manual error) to the low bandwidth secondary network
    3. This caused congestion in the secondary network
    4. Nodes assumed their replica destinations had failed
    5. The mistake was quickly realized and the traffic shift was rolled back
    6. Re-mirroring storm (due to the earlier node isolation)
    7. Free space ran out; nodes got stuck in a loop and volumes got stuck
    8. API requests from the control plane got stuck, holding up threads in the control plane
  Services to other AZs were affected due to thread starvation in the control plane. The control plane essentially experienced a distributed DoS attack!
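Factor 26 and lesson 36 both come down to retry behaviour. The sketch below is not Amazon's actual fix, just a hypothetical illustration of what "backing off more aggressively" looks like: capped exponential backoff with full jitter, so that thousands of stuck nodes do not hammer the cluster in lockstep. The find_free_replica_slot helper is invented for the example.

    import random
    import time

    def find_free_replica_slot():
        """Hypothetical placeholder for the cluster search; None means no capacity."""
        return None  # simulate an exhausted cluster

    def remirror_with_backoff(max_attempts=8, base_delay=1.0, max_delay=300.0):
        for attempt in range(max_attempts):
            slot = find_free_replica_slot()
            if slot is not None:
                return slot
            # Capped exponential backoff with full jitter: each failed search
            # waits a random amount of time up to an exponentially growing cap.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
        # Give up and surface the failure instead of searching "on and on".
        raise RuntimeError("no free replica capacity; escalating instead of looping")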
  28. Lessons Learnt: Propagation of the Problem
  29. Other AWS services such as the Relational Database Service (RDS) and EC2 instances rely on the EBS control plane for their data volume needs.
  30. RDS in particular uses multiple EBS volumes simultaneously.
  31. Hence this issue, which started with one AZ, quickly affected other AZs in the region
  32. Lessons Learnt
  33. Applications should not rely on a single Region or AZ (see the multi-AZ sketch after this list)
  34. Appropriate policies and a change management process (requiring manual approvals for risky changes) need to be implemented
  35. Automation instead of manual changes would have helped prevent errors
  36. Modify the re-mirroring search algorithm to back off more aggressively in the case of a large scale interruption
  37. Make highly reliable multi-AZ deployments easy to design and operate so that customer adoption is accelerated
  38. Finally, having a failover hybrid or heterogeneous environment would have helped.
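As a hedged illustration of lesson 33, the sketch below spreads identical instances across every available zone in a region using boto3. The AMI ID and instance type are placeholders, and a real deployment would still need load balancing and cross-zone data replication on top of this.

    import boto3

    # Assumption: configured credentials; the AMI ID below is a placeholder.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Discover the region's Availability Zones instead of hard-coding one.
    zones = [
        z["ZoneName"]
        for z in ec2.describe_availability_zones()["AvailabilityZones"]
        if z["State"] == "available"
    ]

    # Launch one instance per zone so the loss of a single AZ does not
    # take the whole application down.
    for zone in zones:
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )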
