An incorrect traffic shift onto the lower capacity EBS network in one availability zone caused many nodes to become isolated and lose data replication. When the traffic shift was rolled back, the nodes began an intensive search for available storage space to re-mirror data, exhausting the cluster's free capacity. This led to a re-mirroring storm that left many volumes stuck. The overloaded EBS cluster could no longer service API requests from the control plane, causing thread starvation that disrupted other availability zones and regions. The root causes were nodes not backing off replication requests aggressively enough when failing to find space, and a race condition bug exacerbating node failures during large numbers of replication requests.