You deployed automation, enabled automatic database master failover and tested it many times: great, you can now sleep at night without being paged by a failing server. However, when you wake up in the morning, things might not have gone the way you expect. This talk will be about such a surprise.
Once upon a time, a failure brought down a MySQL master database. Automation kicked in and fixed things. However, a fancy failure, combined with human errors, an edge-case recovery, and a lack of oversight in tooling and scripting lead to a split-brain and data corruption. This talk will go into details about the convoluted—but still real-world—sequence of events that lead to this disaster. I cover what could have avoided the split-brain and what could have made data reconciliation easier.