
Autopsy of an automation disaster


You’ve deployed automation, enabled automatic master failover and tested it many times: great, you can now sleep at night without being paged by a failing server. However, when you wake up in the morning, things might not have gone the way you expected. This talk is about such a surprise.

Once upon a time, a failure brought down a master. Automation kicked in and fixed things. However, a fancy failure, combined with human errors, an edge-case recovery, and a lack of oversight in automation, led to a split-brain. This talk goes into detail about the convoluted - but still real-world - sequence of events that led to this disaster. I will cover what could have avoided the split-brain and what could have made it easier to fix.


Autopsy of an automation disaster

  1. Autopsy of an automation disaster
     Jean-François Gagné - Saturday, February 4, 2017
     FOSDEM MySQL & Friends Devroom
  2. To err is human
     To really foul things up requires a computer[1] (or a script)
     [1]: http://quoteinvestigator.com/2010/12/07/foul-computer/
  3. Booking.com
     ● Based in Amsterdam since 1996
     ● Online Hotel/Accommodation/Travel Agent (OTA):
       ● +1.134.000 properties in 225 countries
       ● +1.200.000 room nights reserved daily
     ● +40 languages (website and customer service)
     ● +13.000 people working in 187 offices worldwide
     ● Part of the Priceline Group
     ● And we use MySQL:
       ● Thousands (1000s) of servers, ~90% replicating
       ● >150 masters: ~30 with >50 slaves & ~10 with >100 slaves
  4. Session Summary
     1. MySQL replication at Booking.com
     2. Automation disaster: external eye
     3. Chain of events: analysis
     4. Learning / takeaway
  5. MySQL replication at Booking.com
     ● Typical MySQL replication deployment at Booking.com:

     +---+
     | M |
     +---+
       |
       +------+-- ... --+---------------+-------- ...
       |      |         |               |
     +---+  +---+     +---+           +---+
     | S1|  | S2|     | Sn|           | M1|
     +---+  +---+     +---+           +---+
                                        |
                                   +-- ... --+
                                   |         |
                                 +---+     +---+
                                 | T1|     | Tm|
                                 +---+     +---+
  6. MySQL replication at Booking.com'
     ● And we use Orchestrator:
  7. MySQL replication at Booking.com''
     ● Orchestrator allows us to:
       ● Visualize our replication deployments
       ● Move slaves for planned maintenance of an intermediate master
       ● Automatically replace an intermediate master in case of its unexpected failure
         (thanks to pseudo-GTIDs when we have not deployed GTIDs)
       ● Automatically replace a master in case of a failure (failing over to a slave)
     ● But Orchestrator cannot replace a master alone:
       ● Booking.com uses DNS for master discovery
       ● So Orchestrator calls a homemade script to repoint DNS (and to do other magic)
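
The DNS repointing script mentioned above is where much of the later trouble concentrates, so it helps to see the general shape of such a hook. Below is a minimal, hypothetical sketch in Python of a post-failover hook that moves a master DNS record to the promoted server; the argument convention, the record names and the use of nsupdate are assumptions for illustration, not Booking.com's actual script (which also does "other magic").

    #!/usr/bin/env python3
    """Hypothetical post-failover hook: repoint the master DNS record.

    Assumes the failover tool passes the failed and the promoted master as
    arguments (e.g. via Orchestrator hook placeholders):
        repoint_dns.py <failed_host> <promoted_host>
    All names below are illustrative, not Booking.com's real setup.
    """
    import subprocess
    import sys

    DNS_SERVER = "ns1.example.com"           # authoritative server (assumption)
    ZONE = "example.com"                     # DNS zone (assumption)
    MASTER_RECORD = "db-master.example.com"  # name applications resolve (assumption)
    TTL = 60

    def repoint(promoted_host: str) -> None:
        """Replace the master CNAME so clients discover the new master."""
        commands = "\n".join([
            f"server {DNS_SERVER}",
            f"zone {ZONE}",
            f"update delete {MASTER_RECORD}. CNAME",
            f"update add {MASTER_RECORD}. {TTL} CNAME {promoted_host}.",
            "send",
        ])
        # nsupdate reads dynamic-update commands from stdin
        subprocess.run(["nsupdate"], input=commands.encode(), check=True)

    if __name__ == "__main__":
        failed_host, promoted_host = sys.argv[1], sys.argv[2]
        print(f"repointing {MASTER_RECORD}: {failed_host} -> {promoted_host}")
        repoint(promoted_host)

The incident below shows why such a script needs guards before it blindly steals the record; one possible guard is sketched after the summary slide.
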
  8. Our subject database
     ● Simple replication deployment (in two data centers):

DNS (master)      +---+
points here -->   | A |
                  +---+
                    |
                    +------------------------+
                    |                        |
Reads             +---+                    +---+
happen here -->   | B |                    | X |
                  +---+                    +---+
                                             |
                                           +---+     And reads
                                           | Y | <-- happen here
                                           +---+
  9. Split brain: 1st event
     ● A and B (two servers in the same data center) fail at the same time:

DNS (master)      +-/+
points here -->   | A |
but accesses      +/-+
are now failing

Reads             +-/+                     +---+
happen here -->   | B |                    | X |
but accesses      +/-+                     +---+
are now failing                              |
                                           +---+     And reads
                                           | Y | <-- happen here
                                           +---+

     (I will cover how/why this happened later.)
  10. Split brain: 1st event'
      ● Orchestrator fixes things:

                  +-/+
                  | A |
                  +/-+

Reads             +-/+                     +---+     Now, DNS (master)
happen here -->   | B |                    | X | <-- points here
but accesses      +/-+                     +---+
are now failing                              |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
  11. Split brain: disaster
      ● A few things happened during that day and night, and I woke up to this:

                  +-/+
                  | A |
                  +/-+

DNS               +---+                    +---+
points here -->   | B |                    | X |
                  +---+                    +---+
                                             |
                                           +---+
                                           | Y |
                                           +---+
  12. Split brain: disaster'
      ● And to make things worse, reads are still happening on Y:

                  +-/+
                  | A |
                  +/-+

DNS (master)      +---+                    +---+
points here -->   | B |                    | X |
                  +---+                    +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
  13. Split brain: disaster''
      ● This is not good:
        ● When A and B failed, X was promoted as the new master
        ● Something made DNS point to B (we will see what later)
          → writes are now happening on B
        ● But B is outdated: all writes to X (after the failure of A) did not reach B
        ● So we have data on X that cannot be read on B
        ● And we have new data on B that is not read on Y

                  +-/+
                  | A |
                  +/-+

DNS (master)      +---+                    +---+
points here -->   | B |                    | X |
                  +---+                    +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
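
A quick way to confirm, from the data itself, that B and X have really diverged is to probe both (and Y) for their latest writes. A minimal sketch, assuming a pt-heartbeat-style table (meta.heartbeat) and a sample application table (app.orders); hostnames, credentials and table names are invented for illustration.

    """Quick divergence check between the two 'masters' B and X (sketch)."""
    import pymysql

    HOSTS = {"B": "b.example.com", "X": "x.example.com", "Y": "y.example.com"}

    def probe(host: str) -> dict:
        conn = pymysql.connect(host=host, user="check", password="***",
                               autocommit=True)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT MAX(ts) FROM meta.heartbeat")
                heartbeat = cur.fetchone()[0]
                cur.execute("SELECT MAX(id) FROM app.orders")
                max_id = cur.fetchone()[0]
                cur.execute("SELECT @@global.read_only")
                read_only = cur.fetchone()[0]
        finally:
            conn.close()
        return {"heartbeat": heartbeat, "max_id": max_id, "read_only": read_only}

    for name, host in HOSTS.items():
        info = probe(host)
        print(f"{name}: heartbeat={info['heartbeat']} "
              f"max_id={info['max_id']} read_only={info['read_only']}")

    # If both B and X show recent heartbeats and neither is read_only,
    # both are taking writes: that is the split-brain.  Y should track X.
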
  14. Split-brain: analysis
      ● Digging more into the chain of events, we find that:
        ● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
        ● So after their failures, A and B came back and formed an isolated replication chain

                  +-/+
                  | A |
                  +/-+

                  +-/+                     +---+     DNS (master)
                  | B |                    | X | <-- points here
                  +/-+                     +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
  15. Split-brain: analysis
      ● Digging more into the chain of events, we find that:
        ● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
        ● So after their failures, A and B came back and formed an isolated replication chain
        ● And something caused a failure of A

                  +---+
                  | A |
                  +---+
                    |
                  +---+                    +---+     DNS (master)
                  | B |                    | X | <-- points here
                  +---+                    +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
  16. Split-brain: analysis
      ● Digging more into the chain of events, we find that:
        ● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
        ● So after their failures, A and B came back and formed an isolated replication chain
        ● And something caused a failure of A
      ● But how did DNS end up pointing to B?
        ● The failover to B called the DNS repointing script
        ● The script stole the DNS entry from X and pointed it to B

                  +-/+
                  | A |
                  +/-+

                  +---+                    +---+     DNS (master)
                  | B |                    | X | <-- points here
                  +---+                    +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
  17. Split-brain: analysis
      ● Digging more into the chain of events, we find that:
        ● After the 1st failure of A, a 2nd one was detected and Orchestrator failed over to B
        ● So after their failures, A and B came back and formed an isolated replication chain
        ● And something caused a failure of A
      ● But how did DNS end up pointing to B?
        ● The failover to B called the DNS repointing script
        ● The script stole the DNS entry from X and pointed it to B
      ● But is that all: what made A fail?

                  +-/+
                  | A |
                  +/-+

DNS (master)      +---+                    +---+
points here -->   | B |                    | X |
                  +---+                    +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
  18. Split-brain: analysis'
      ● What made A fail?
        ● Once A and B came back up as a new replication chain, they had outdated data
        ● If B had come back before A, it could have been re-slaved to X

                  +---+
                  | A |
                  +---+
                    |
                  +---+                    +---+     DNS (master)
                  | B |                    | X | <-- points here
                  +---+                    +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
  19. Split-brain: analysis'
      ● What made A fail?
        ● Once A and B came back up as a new replication chain, they had outdated data
        ● If B had come back before A, it could have been re-slaved to X
        ● But as A came back before re-slaving, it injected heartbeat and pseudo-GTID (p-GTID) events into B

                  +-/+
                  | A |
                  +/-+

                  +---+                    +---+     DNS (master)
                  | B |                    | X | <-- points here
                  +---+                    +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
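
To see why A coming back up "poisons" B, here is a rough sketch of the kind of writes involved: a heartbeat row plus a uniquely named, harmless statement that serves as a pseudo-GTID marker in the binary log. The schema, table and marker format are illustrative assumptions, not Orchestrator's or Booking.com's exact implementation.

    """Illustrative writer of heartbeat + pseudo-GTID-style markers.

    A process like this runs against whatever it believes is the master;
    once A came back as B's master, these writes flowed to B, so B's data
    and binary logs no longer matched X's history.
    """
    import time
    import uuid
    import pymysql

    conn = pymysql.connect(host="a.example.com", user="maint", password="***",
                           autocommit=True)

    with conn.cursor() as cur:
        while True:
            # 1. Heartbeat: lets monitoring measure replication lag.
            cur.execute("REPLACE INTO meta.heartbeat (server_id, ts) "
                        "VALUES (@@server_id, NOW(6))")
            # 2. Pseudo-GTID-style marker: a uniquely named no-op that shows up
            #    in the binary log and can be matched across servers.
            marker = uuid.uuid4().hex
            cur.execute(f"DROP VIEW IF EXISTS meta.`_pseudo_gtid_{marker}`")
            time.sleep(10)   # injection interval, illustrative

Because B accepted these writes from A, B could no longer simply be re-slaved under X; it needed a re-clone.
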
  20. Split-brain: analysis'
      ● What made A fail?
        ● Once A and B came back up as a new replication chain, they had outdated data
        ● If B had come back before A, it could have been re-slaved to X
        ● But as A came back before re-slaving, it injected heartbeat and pseudo-GTID (p-GTID) events into B
        ● Then B could have been re-cloned without problems

                  +---+
                  | A |
                  +---+
                    |
                  +---+                    +---+     DNS (master)
                  | B |                    | X | <-- points here
                  +---+                    +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
  21. Split-brain: analysis'
      ● What made A fail?
        ● Once A and B came back up as a new replication chain, they had outdated data
        ● If B had come back before A, it could have been re-slaved to X
        ● But as A came back before re-slaving, it injected heartbeat and pseudo-GTID (p-GTID) events into B
        ● Then B could have been re-cloned without problems
        ● But A was re-cloned instead (human error #1)

                  +---+
                  | A |
                  +---+

                  +-/+                     +---+     DNS (master)
                  | B |                    | X | <-- points here
                  +/-+                     +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
  22. Split-brain: analysis'
      ● What made A fail?
        ● Once A and B came back up as a new replication chain, they had outdated data
        ● If B had come back before A, it could have been re-slaved to X
        ● But as A came back before re-slaving, it injected heartbeat and pseudo-GTID (p-GTID) events into B
        ● Then B could have been re-cloned without problems
        ● But A was re-cloned instead (human error #1)
      ● Why did Orchestrator not fail over right away?
        ● B was promoted hours after A was brought down…
        ● Because A was downtimed for only 4 hours (human error #2)

                  +-/+
                  | A |
                  +/-+

                  +---+                    +---+     DNS (master)
                  | B |                    | X | <-- points here
                  +---+                    +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
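
Human error #2 (a downtime set shorter than the operation it was supposed to cover) is the kind of mistake a small pre-flight check can catch. A minimal sketch; how the remaining downtime is queried, and the durations used, are assumptions.

    """Pre-flight check: is the maintenance downtime long enough? (sketch)"""
    from datetime import timedelta

    def get_downtime_remaining(host: str) -> timedelta:
        # Placeholder: in reality, ask whatever tracks downtimes
        # (Orchestrator, the monitoring system, ...).
        return timedelta(hours=4)

    def check_downtime(host: str, expected_operation: timedelta,
                       safety_margin: timedelta = timedelta(hours=2)) -> None:
        remaining = get_downtime_remaining(host)
        needed = expected_operation + safety_margin
        if remaining < needed:
            raise SystemExit(
                f"{host}: remaining downtime {remaining} < {needed} needed "
                "for this operation; extend the downtime before re-cloning.")

    # Re-clone expected to take ~12 hours (invented figure):
    check_downtime("a.example.com", expected_operation=timedelta(hours=12))
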
  23. Orchestrator anti-flapping
      ● Orchestrator has a failover throttling/acknowledgment mechanism[1]:
        ● Automated recovery will happen
          ● for an instance in a cluster that has not recently been recovered
          ● unless such recent recoveries were acknowledged
      ● In our case:
        ● the recovery might have been acknowledged too early (human error #0?)
        ● or the “recently” timeout might have been too short
        ● and maybe Orchestrator should not have failed over the second time
      [1]: https://github.com/github/orchestrator/blob/master/docs/topology-recovery.md#blocking-acknowledgments-anti-flapping
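
The mechanism described on this slide amounts to a gate in front of automated recovery: block it when the same cluster was recovered recently, unless a human has acknowledged that recovery. A sketch of the idea only, not Orchestrator's actual code; the window length and the in-memory registry are assumptions.

    """Sketch of an anti-flapping gate for automated failover."""
    import time
    from dataclasses import dataclass, field

    BLOCK_WINDOW_SECONDS = 3600   # how long "recently" lasts (assumption)

    @dataclass
    class Recovery:
        cluster: str
        timestamp: float
        acknowledged: bool = False

    @dataclass
    class RecoveryRegistry:
        recoveries: list = field(default_factory=list)

        def record(self, cluster: str) -> Recovery:
            rec = Recovery(cluster, time.time())
            self.recoveries.append(rec)
            return rec

        def recovery_allowed(self, cluster: str) -> bool:
            cutoff = time.time() - BLOCK_WINDOW_SECONDS
            return not any(r.cluster == cluster and r.timestamp > cutoff
                           and not r.acknowledged for r in self.recoveries)

    registry = RecoveryRegistry()
    first = registry.record("subject-db")            # first failover (A -> X)
    print(registry.recovery_allowed("subject-db"))   # False: blocked
    first.acknowledged = True                        # premature ack (human error #0?)
    print(registry.recovery_allowed("subject-db"))   # True: 2nd failover may proceed
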
  24. Split brain: summary
      ● So in summary, this disaster was caused by:
        1. A fancy failure: 2 servers failing in the same data center at the same time
        2. A debatable premature acknowledgment in Orchestrator, and probably too short a timeout for recent failovers
        3. An edge-case recovery: both servers forming a new replication topology
        4. Re-cloning of the wrong server (A instead of B)
        5. Too short a downtime for the re-cloning
        6. Orchestrator failing over something that it should not have
        7. A DNS repointing script that was not defensive enough
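
Point 7 is worth a concrete illustration: a defensive repointing script would refuse to steal a DNS entry that no longer points at the master it was told had failed. Below is a hypothetical guard that could sit in front of the repointing sketch shown earlier; the record name is again an assumption.

    """Defensive guard for the DNS repointing hook (sketch)."""
    import socket
    import sys

    MASTER_RECORD = "db-master.example.com"   # assumption, as in the earlier sketch

    def guard_repoint(failed_host: str) -> None:
        current_ip = socket.gethostbyname(MASTER_RECORD)
        failed_ip = socket.gethostbyname(failed_host)
        if current_ip != failed_ip:
            # The record already points elsewhere (here: to X, promoted after
            # the first failure): do not silently create a split-brain.
            sys.exit(f"refusing to repoint {MASTER_RECORD}: it no longer points "
                     f"at {failed_host} ({failed_ip}) but at {current_ip}; "
                     "a previous failover probably moved it - paging a human")
        # ... safe to proceed with the actual DNS update to the promoted host ...
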
  25. Fancy failure: more details
      ● Why did A and B fail at the same time?
        ● Deployment error: the two servers in the same rack/failure domain?
        ● And/or very unlucky?
      ● Very unlucky because… 10 to 20 servers failed that day in the same data center,
        because of human operations and sensitive hardware

DNS (master)      +-/+
points here --->  | A |
but now accesses  +/-+
are failing

                  +-/+                     +---+
                  | B |                    | X |
                  +/-+                     +---+
                                             |
                                           +---+
                                           | Y |
                                           +---+
  26. How to fix such a situation?
      ● Fixing non-intersecting data on B and X is hard.
      ● Some solutions are:
        ● Kill B or X (and lose data)
        ● Replay writes from B on X (manually or with replication)
      ● But AUTO_INCREMENTs are in the way:
        ● up to i used on A before the 1st failover
        ● i-n to j1 used on X after recovery
        ● i to j2 used on B after the 2nd failover

                  +-/+
                  | A |
                  +/-+

DNS (master)      +---+                    +---+
points here -->   | B |                    | X |
                  +---+                    +---+
                                             |
                                           +---+     Reads
                                           | Y | <-- happen here
                                           +---+
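
The AUTO_INCREMENT problem is easiest to see with concrete numbers. A small sketch, with purely invented ranges, of how the ids written on X and on B overlap, and of the kind of query that would list the conflicting rows; the table name is an assumption.

    """Why AUTO_INCREMENT gets in the way of replaying B's writes onto X.

    Invented example: A had used ids up to i = 1000 before the 1st failover,
    X (slightly behind A) continued from 995 up to j1 = 1800, and B
    (promoted later from A's data) continued from 1001 up to j2 = 1400.
    Ids 1001-1400 exist on BOTH X and B but describe different rows,
    so B's writes cannot simply be replayed onto X.
    """
    x_ids = range(995, 1801)    # ids written on X after the recovery
    b_ids = range(1001, 1401)   # ids written on B after the 2nd failover

    conflicts = set(x_ids) & set(b_ids)
    print(f"{len(conflicts)} conflicting ids, e.g. {sorted(conflicts)[:5]}")

    # On the real servers, the conflicting rows could be extracted with a
    # query of this shape (app.orders is an invented table), run on both
    # B and X and compared row by row before any replay:
    #   SELECT * FROM app.orders WHERE id BETWEEN 1001 AND 1400;
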
  27. Takeaway
      ● Twisted situations happen
      ● Automation (including failover) is not simple:
        → code automation scripts defensively
      ● Be mindful of premature acknowledgments
      ● Downtime more rather than less
      ● Shut down slaves first
      ● Try something other than AUTO_INCREMENTs for PKs (monotonically increasing UUID[1] [2]?)
      [1]: https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
      [2]: http://mysql.rjweb.org/doc.php/uuid
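
The "monotonically increasing UUID" idea from [1] can be sketched quickly: take a UUIDv1 (which embeds a timestamp) and reorder its time fields so the value sorts roughly by generation time, keeping primary-key inserts append-mostly. A minimal illustration; storing the result as BINARY(16), as the Percona post recommends, is left out.

    """Time-ordered UUIDs as an alternative to AUTO_INCREMENT PKs (sketch).

    uuid1() embeds a 60-bit timestamp split into time_low-time_mid-time_hi;
    reordering the fields as time_hi + time_mid + time_low makes the hex
    string increase (roughly) monotonically with time.
    """
    import uuid

    def ordered_uuid() -> str:
        time_low, time_mid, time_hi, clock_seq, node = str(uuid.uuid1()).split("-")
        return time_hi + time_mid + time_low + clock_seq + node   # 32 hex chars

    ids = [ordered_uuid() for _ in range(3)]
    print(ids)
    print(ids == sorted(ids))   # True for back-to-back generation
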
  28. Improvements in Orchestrator
      ● Orchestrator failed over something that it should not have
      ● Should Orchestrator be changed?
        ● I do not know…
        ● Not easy to define what should be changed
        ● Suggestions welcome: https://github.com/github/orchestrator/issues
  29. Links
      ● Booking.com Blog: https://blog.booking.com/
      ● GitHub Orchestrator: https://github.com/github/orchestrator/
      ● UUID as Primary Key:
        ● https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/
        ● http://mysql.rjweb.org/doc.php/uuid/
      ● Myself:
        ● https://jfg-mysql.blogspot.com/
        ● https://www.linkedin.com/in/jfg956/
  30. Thanks
      Jean-François Gagné
      jeanfrancois DOT gagne AT booking.com
