Amazon cloud failure

1. Amazon AWS DisruptionSuhas A. Kelkar, Director, Incubator TeamMay 2nd, 2011

2. 5/6/2011 2 What Happened…Easter Weekend AWS Disruption … Affected region

3. Amazon Basics Image Credit: DongWoo Lee from blog.edog.net

5. Geographically five regions: US East (N Virginia), US West (N California), EU (Ireland), APAC (Singapore) and APAC (Tokyo)

6. AWS EC2 SLA 99.95% availability for each region

7. Contains one or more Availability Zones

8. Availability Zones

9. Distinct locations engineered to be insulated from failures in other availability zones in the same region

10. EBS (Elastic Block Storage)

11. Create storage volumes 1GB to 1TB

12. One can create a file system on top of EBS volumes

13. EBS volumes are kept in a Availability Zone and can be attached to instances also in that same zone

14. Automatically replicated within same Availability Zone

17. Incorrect traffic shift onto the lower capacity EBS network

18. Many nodes in the affected AZ got completely isolated and lost connection to their replicas

19. Re-mirroring storm

20. After rolling back incorrect traffic shift, the previously isolated nodes now began searching the EBS cluster for available server space so they could re-mirror data

21. Free capacity of cluster was soon exhausted leaving many nodes stuck in a loop searching for free space

22. This led to a re-mirroring storm where a large number of volumes were effectively “stuck” while the nodes searched on and on

23. Why did this affect other AZs

24. The EBS cluster became unable to service “create volume” API requests

26. Nodes failing to find new nodes did not back off aggressively enough

29. Other AWS Services such as the Relational Database Service (RDS) and EC2 instances rely on the EBS Control Plane for their data volume needs.

30. RDS in particular uses multiple EBS volumes simultaneously.

31. Hence this issue that started with one AZ in one region quickly affected other regions

32. Lessons Learnt

33. Applications should not rely on a single Region or AZs

34. Appropriate Policies and Change Management Process (requiring manual approvals for risky changes) needs to be implemented

35. Automation instead of manual changes would have helped prevent errors

36. Modify the re-mirroring search algorithm to back off more aggressively in case of a large scale interruption

37. Making highly-reliable multi-AZ deployments easy to design and operate so that the customer adoption is accelerated

38. Finally, having a fail over hybridorheterogeneous environment would have helped.

Amazon cloud failure

Recommended

Recommended

More Related Content

What's hot

What's hot (12)

Similar to Amazon cloud failure

Similar to Amazon cloud failure (20)

More from Suhas Kelkar

More from Suhas Kelkar (6)

Recently uploaded

Recently uploaded (20)

Amazon cloud failure