The EBS cluster became unable to service “create volume” API requests
This caused thread starvation in the EBS control plane, affecting service to other AZs.

Amazon Cloud Disruption Post Mortem (contd.)
The two major factors at the root of this problem:
Nodes that failed to find a new replica node did not back off aggressively enough before retrying (see the backoff sketch after the failure walk-through below).
A race condition (bug) in the node code caused nodes to fail incorrectly when they were closing a large number of replication requests.

Failure walk-through
[Diagram: Control Plane Services (threads), Availability Zone 1, EBS Cluster 1 with Nodes 1..n, primary high-bandwidth network, secondary low-bandwidth replication network]
1. Traffic usually flows through the primary high-bandwidth network.
2. Traffic was incorrectly shifted (manual error) to the low-bandwidth secondary network.
3. This caused congestion in the secondary network.
4. Nodes assumed their replica destinations had failed.
5. The mistake was quickly realized and the traffic shift was rolled back.
6. A re-mirroring storm followed (due to the earlier node isolation).
7. Free space ran out; nodes got stuck in a loop and volumes got stuck.
8. API requests from the control plane got stuck, holding up threads in the control plane.
Services to other AZs were affected due to thread starvation in the control plane. The control plane essentially experienced a distributed DoS attack!
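The first root-cause factor is a retry-behavior problem: nodes that could not find a replica target kept searching without backing off, which fed the re-mirroring storm in steps 4-7 above. Below is a minimal sketch of the kind of mitigation this implies, capped exponential backoff with jitter around a hypothetical find_replica_target() call; the function name, attempt limits, and delays are illustrative assumptions, not details from the AWS report.

```python
import random
import time

def find_replica_with_backoff(find_replica_target, max_attempts=8,
                              base_delay=0.5, max_delay=60.0):
    """Retry a replica search with capped exponential backoff and full jitter.

    find_replica_target: callable returning a target node or None (illustrative).
    """
    for attempt in range(max_attempts):
        target = find_replica_target()
        if target is not None:
            return target
        # Back off more aggressively on every failed attempt instead of
        # hammering the cluster the way the isolated EBS nodes did.
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, delay))  # full jitter spreads retries out
    return None  # give up and let a higher-level process decide what to do
```

The jitter matters as much as the exponent: without it, all of the isolated nodes would retry in lockstep and re-create the same thundering herd.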
Lessons learnt
Propagation of the problem:
Other AWS Services such as the Relational Database Service (RDS) and EC2 instances rely on the EBS Control Plane for their data volume needs.
RDS in particular uses multiple EBS volumes simultaneously.
Hence this issue, which started in one AZ of one region, quickly affected other AZs in that region.
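The propagation path is the thread-starvation mechanism from step 8 of the walk-through: control-plane threads blocked on stuck EBS API calls were no longer available to serve requests for other AZs. Below is a small sketch of a common mitigation for that failure mode, a bounded per-dependency thread pool with a timeout (often called a bulkhead); the pool size, timeout, and call_stuck_ebs_api() name are illustrative assumptions, not AWS's actual design.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Bulkhead: give the risky dependency (the degraded EBS cluster) its own small
# pool so that stuck calls can exhaust at most these workers, never the whole
# control plane.  Sizes and timeout are illustrative.
ebs_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="ebs-az1")

def create_volume(request, call_stuck_ebs_api):
    """Handle a create-volume request without letting a hung backend call
    pin this control-plane thread forever."""
    future = ebs_pool.submit(call_stuck_ebs_api, request)
    try:
        # Bounded wait: if the degraded cluster never answers, fail fast and
        # free this thread to serve requests for healthy Availability Zones.
        return future.result(timeout=5.0)
    except TimeoutError:
        future.cancel()  # best effort; the worker thread may still be blocked
        raise RuntimeError("EBS cluster in AZ1 not responding; try another AZ")
```

Without the bounded pool and the timeout, every stuck create-volume call holds a shared control-plane thread, which is exactly the distributed-DoS effect the slides describe.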