The two little bugs that almost brought down Booking.com

This short talk is about an incident that kept DBAs working over a weekend. Two bugs, one in our application code and one in the database, joined forces and almost brought down Booking.com, at one of the worst possible times. Curious about what happened? Come to this talk to learn more.

Published in: Technology

  1. The two little bugs that almost brought down Booking.com
      Jean-François Gagné (System Engineer)
      jeanfrancois DOT gagne AT booking.com
      April 25, 2017 – Percona Live Santa Clara 2017
  2. For a, b and c relatively small: Consequence(a + b + c) is much bigger than Consequence(a) + Consequence(b) + Consequence(c)
  3. MySQL/MariaDB replication at Booking.com
      ● Typical Booking.com MySQL/MariaDB replication deployment:

                             +---+
                             | M |
                             +---+
                               |
        +------+-- ... --+---------------+-------- ...
        |      |         |               |
      +---+  +---+     +---+           +---+
      | S1|  | S2|     | Sn|           | M1|
      +---+  +---+     +---+           +---+
                                         |
                                 +-- ... --+
                                 |         |
                               +---+     +---+
                               | T1|     | Tm|
                               +---+     +---+
  4. Impacted setup (simplified)

                               +---+
                               | M |
                               +---+
                                 |
            +---------- .... ----------+--------------+
            |                          |              |
          +---+                      +---+          +---+
          | M1|                      | Mi|          | Mj|
          +---+                      +---+          +---+
            |                          |              |
        +-----+- .. -+-----+     +- .. -+     +-----+-- .. -+-----+
        |     |      |     |     |      |     |     |       |     |
      +---+ +---+  +---+ +---+ +---+  +---+ +---+ +---+   +---+ +---+
      | S1| | S2|  | S.| | Sn| | T.|  | Tm| | U1| | U2|   | U.| | Uo|
      +---+ +---+  +---+ +---+ +---+  +---+ +---+ +---+   +---+ +---+
  5. Upgrade from 5.5 to new major version

                               +---+
                               | M |
                               +---+
                                 |
            +---------- .... ----------+--------------+
            |                          |              |
          +---+                      +---+          +---+
          | M1|                      | Mi|          | Mj|
          +---+                      +---+          +---+
            |                          |              |
        +-----+- .. -+-----+     +- .. -+     +-----+-- .. -+-----+
        |     |      |     |     |      |     |     |       |     |
      +---+ +---+  +---+ +---+ +---+  +---+ +---+ +---+   +---+ +---+
      | S1| | S2|  | S.| | Sn| | T.|  | Tm| | U1| | U2|   | U.| | Uo|
      +---+ +---+  +---+ +---+ +---+  +---+ +---+ +---+   +---+ +---+
  6. Bad transaction on the master

                               +---+
                               | M | <<-- “bad transaction”
                               +---+
                                 |
            +---------- .... ----------+--------------+
            |                          |              |
          +---+                      +---+          +---+
          | M1|                      | Mi|          | Mj|
          +---+                      +---+          +---+
            |                          |              |
        +-----+- .. -+-----+     +- .. -+     +-----+-- .. -+-----+
        |     |      |     |     |      |     |     |       |     |
      +---+ +---+  +---+ +---+ +---+  +---+ +---+ +---+   +---+ +---+
      | S1| | S2|  | S.| | Sn| | T.|  | Tm| | U1| | U2|   | U.| | Uo|
      +---+ +---+  +---+ +---+ +---+  +---+ +---+ +---+   +---+ +---+
  7. Oops! M2 runs OOM and is killed

                               +---+
                               | M | <<-- “bad transaction”
                               +---+
                                 |
            +---------- .... ----------+--------------+
            |                          |              |
          +---+                      +-/+           +---+
          | M1|                      | Mi|          | Mj|
          +---+                      +/-+           +---+
            |                          |              |
        +-----+- .. -+-----+     +- .. -+     +-----+-- .. -+-----+
        |     |      |     |     |      |     |     |       |     |
      +---+ +---+  +---+ +---+ +---+  +---+ +---+ +---+   +---+ +---+
      | S1| | S2|  | S.| | Sn| | T.|  | Tm| | U1| | U2|   | U.| | Uo|
      +---+ +---+  +---+ +---+ +---+  +---+ +---+ +---+   +---+ +---+
  8. Oops²! All “blue” run OOM and are killed

                               +---+
                               | M | <<-- “bad transaction”
                               +---+
                                 |
            +---------- .... ----------+--------------+
            |                          |              |
          +---+                      +-/+           +---+
          | M1|                      | Mi|          | Mj|
          +---+                      +/-+           +---+
            |                          |              |
        +-----+- .. -+-----+     +- .. -+     +-----+-- .. -+-----+
        |     |      |     |     |      |     |     |       |     |
      +-/+  +---+  +-/+  +---+ +---+  +---+ +---+ +-/+    +---+ +-/+
      | S1| | S2|  | S.| | Sn| | T.|  | Tm| | U1| | U2|   | U.| | Uo|
      +/-+  +---+  +/-+  +---+ +---+  +---+ +---+ +/-+    +---+ +/-+
  9. What is the “bad” transaction?
      ● DELETE FROM TABLE WHERE …lot of rows…;
      ● Transaction of ~2 GB in the binary logs (RBR)
      ● Obviously a bug in the application (but it should not have triggered an OOM)
  10. What needs to be done next?
      ● Reminder: 5.5 is not replication crash safe
      ● Next version is crash safe, but can’t…
      ● Crashed slaves either OOM again or are corrupted
      ● We need to re-clone all crashed slaves!
  11. What saved us?
      ● Engaged team of skilled DBAs: all joined in to help
      ● Data not too sensitive to replication delay
      ● Data not too sensitive to “skipping transactions”
      ● pt-slave-restart
      ● IDEMPOTENT mode
      ● A torrenting cloning tool
  12. What could have helped us?
      ● A “working” torrenting cloning tool…
        ● Not used often enough, so we did not know it was broken (fixed in less than 2 hours)
      ● An AUTO-FIX/AUTO-REPAIR mode (RBR)
        ● Instead of skipping transactions (and making data diverge), it should repair (fix) slave drift (and make data converge)
      https://bugs.mysql.com/bug.php?id=54250
      http://blog.wl0.org/2016/05/the-differences-between-idempotent-and-my-suggested-auto-repair-mode/
  13. ● We are hiring!
        ● MySQL Engineer / DBA
        ● System Administrator
        ● System Engineer
        ● Site Reliability Engineer
        ● Developer / Designer
        ● Technical Team Lead
        ● Product Owner
        ● Data Scientist
        ● And many more…
      ● https://workingatbooking.com/
      Want to know more…
  14. Thanks
      Jean-François Gagné
      jeanfrancois DOT gagne AT booking.com
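
The “bad” transaction on slide 9 was a single DELETE that produced about 2 GB of row-based binary log. One common way to avoid that on the application side is to delete in small batches, committing after each one, so that no single transaction grows large. The sketch below is not the fix used at Booking.com (the slides do not describe it); it uses Python's built-in sqlite3 module only so that it is self-contained, and the function name `delete_in_batches`, the `events` table and its `expired` column are hypothetical. On MySQL/MariaDB the same idea is usually written with `DELETE ... LIMIT n` in a loop.

```python
import sqlite3


def delete_in_batches(conn, table, where_sql, params=(), batch_size=1000):
    """Delete rows matching `where_sql` in small transactions.

    Returns (rows_deleted, non_empty_batches). `table` and `where_sql` are
    interpolated into the statement, so they must come from trusted code,
    never from user input.
    """
    total = 0
    batches = 0
    while True:
        with conn:  # commits on success, so each batch is its own transaction
            cur = conn.execute(
                f"DELETE FROM {table} WHERE rowid IN ("
                f"SELECT rowid FROM {table} WHERE {where_sql} LIMIT ?)",
                (*params, batch_size),
            )
        if cur.rowcount == 0:  # nothing left to delete
            break
        total += cur.rowcount
        batches += 1
    return total, batches


# Hypothetical demo data: 10,000 rows, half of them flagged as expired.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, expired INTEGER)")
conn.executemany(
    "INSERT INTO events (expired) VALUES (?)",
    [(i % 2,) for i in range(10_000)],
)
conn.commit()

deleted, batches = delete_in_batches(conn, "events", "expired = ?", (1,))
print(deleted, batches)
```

Each batch replicates as its own small transaction, at the cost of losing all-or-nothing semantics for the whole delete, which may or may not be acceptable for a given workload.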
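
Slides 11 and 12 contrast IDEMPOTENT mode, which suppresses duplicate-key and key-not-found errors on the slave, with the AUTO-REPAIR mode the speaker wishes for, which would fix the drift instead. The toy model below illustrates that difference on an in-memory table. It is a deliberately simplified sketch, not how the server applies row events: the `RowEvent` class, the `apply_event` function and the mode strings are hypothetical, and the AUTO_REPAIR branch follows the idea stated on slide 12 (make data converge) rather than any existing server option.

```python
from dataclasses import dataclass
from typing import Dict, Optional

Row = Dict[str, object]


@dataclass(frozen=True)
class RowEvent:
    """One row change: before=None is an INSERT, after=None is a DELETE."""
    key: int
    before: Optional[Row]
    after: Optional[Row]


class ReplicationError(Exception):
    """Raised in STRICT mode when the replica does not match the event."""


def apply_event(table: Dict[int, Row], ev: RowEvent, mode: str) -> None:
    """Apply `ev` to `table` (primary key -> row) under the given mode."""
    if mode not in ("STRICT", "IDEMPOTENT", "AUTO_REPAIR"):
        raise ValueError(f"unknown mode: {mode}")
    exists = ev.key in table
    if ev.before is None:  # INSERT
        if exists and mode == "STRICT":
            raise ReplicationError(f"duplicate key {ev.key}")
        table[ev.key] = dict(ev.after)  # lenient modes overwrite the row
    elif ev.after is None:  # DELETE
        if not exists and mode == "STRICT":
            raise ReplicationError(f"key {ev.key} not found")
        table.pop(ev.key, None)  # a missing row is already "deleted"
    else:  # UPDATE
        if exists:
            table[ev.key] = dict(ev.after)
        elif mode == "STRICT":
            raise ReplicationError(f"key {ev.key} not found")
        elif mode == "AUTO_REPAIR":
            table[ev.key] = dict(ev.after)  # recreate the row: data converges
        # IDEMPOTENT: the error is swallowed and the row stays missing


# The master has row 1; both replicas have drifted and lost it.
master = {1: {"status": "open"}}
update = RowEvent(key=1, before={"status": "open"}, after={"status": "closed"})
apply_event(master, update, "STRICT")

idempotent_replica: Dict[int, Row] = {}
apply_event(idempotent_replica, update, "IDEMPOTENT")

repaired_replica: Dict[int, Row] = {}
apply_event(repaired_replica, update, "AUTO_REPAIR")

print(idempotent_replica == master, repaired_replica == master)
```

In this model the IDEMPOTENT replica keeps replicating but stays different from the master, while the AUTO_REPAIR replica ends up with the same row as the master.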
