Successfully reported this slideshow.
Your SlideShare is downloading. ×

MySQL Infrastructure Testing Automation at GitHub

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Loading in …3
×

Check these out next

1 of 38 Ad

MySQL Infrastructure Testing Automation at GitHub

Download to read offline

The database team at GitHub is tasked with keeping the data available and with maintaining its integrity. Our infrastructure automates away much of our operation, but automation requires trust, and trust is gained by testing. This session highlights three examples of infrastructure testing automation that helps us sleep better at night:

- Backups: scheduling backups; making backup data accessible to our engineers; auto-restores and backup validation. What metrics and alerts we have in place.
- Failovers: how we continuously test our failover mechanism, orchestrator. How we setup a failover scenario, what defines a successful failover, how we automate away the cleanup. What we do in production.
- Schema migrations: how we ensure that gh-ost, our schema migration tool, which keeps rewriting our (and your!) data, does the right thing. How we test new branches in production without putting production data at risk.

The database team at GitHub is tasked with keeping the data available and with maintaining its integrity. Our infrastructure automates away much of our operation, but automation requires trust, and trust is gained by testing. This session highlights three examples of infrastructure testing automation that helps us sleep better at night:

- Backups: scheduling backups; making backup data accessible to our engineers; auto-restores and backup validation. What metrics and alerts we have in place.
- Failovers: how we continuously test our failover mechanism, orchestrator. How we setup a failover scenario, what defines a successful failover, how we automate away the cleanup. What we do in production.
- Schema migrations: how we ensure that gh-ost, our schema migration tool, which keeps rewriting our (and your!) data, does the right thing. How we test new branches in production without putting production data at risk.

Advertisement
Advertisement

More Related Content

Slideshows for you (20)

Similar to MySQL Infrastructure Testing Automation at GitHub (20)

Advertisement

Recently uploaded (20)

MySQL Infrastructure Testing Automation at GitHub

  1. 1. How people build software! MySQL Infrastructure Testing Automation 
 @ GitHub IkeWalker GitHub Boston MySQL Meetup December 11, 2017 1 !
  2. 2. How people build software! Agenda • About • MySQL @ GitHub • Automation • Backup/restores • Failovers • Schema migrations 2 !
  3. 3. How people build software! About me • Database Architect • Working with MySQL since 2006 • Organizer of Boston MySQL Meetup github.com/ikewalker @iowalker 3 !
  4. 4. How people build software! 4 • The world’s largest Octocat T-shirt and stickers store • And water bottles • And hoodies • We also do stuff related to things • Word is new swag is coming up GitHub
  5. 5. How people build software! GitHub • 66M repositories • 24M developers • 117K businesses • More than a million teams • World’s largest open source hosting • Alexa top 100 • Critical path in build flows 5 !
  6. 6. How people build software! MySQL at GitHub • GitHub stores repositories in git, and uses MySQL as the backend database for all related metadata: • Repository metadata, users, issues, pull requests, comments etc. • Website/API/Auth/more all use MySQL. • We run a few (growing number of) clusters, totaling over 100 MySQL servers. • The setup isn’t very large but very busy. 6 !
  7. 7. How people build software! MySQL at GitHub • Our MySQL servers must be available, responsive and in good state • GitHub has 99.95% SLA • Availability issues must be handled quickly, as automatically as possible. 7 !
  8. 8. How people build software! github/database-infrastructure • @ggunson, @jessbreckenridge, @jonahberquist, @shlomi-noach, @tomkrouper, @gtowey • Concerned with: • Data availability • Data integrity 8 !
  9. 9. How people build software! Testing 9 !
  10. 10. How people build software! Backups/restores that ^ 10
  11. 11. How people build software! Your data It’s important 11 !
  12. 12. How people build software! Restores • Dedicated restore servers. • One per cluster. • Continuously restores, catches up with replication, restores, catches up with replication, restores, … • Sending a “success” event at the end of each cycle. • We monitor for number of “success” events in past 24-ish hours, per cluster. 12 !
  13. 13. How people build software! 13 ! ! ! ! ! production replicas auto-restore replica master ! auto-restore replicas """""" backup replica
  14. 14. How people build software! Restores • New host provisioning uses same flow as restore. • A human may kick a restore/reclone manually. • Chatops: 
 .mysql backup-restore -H restore.this.host -r 14 !
  15. 15. How people build software! Restore failure • A specific backup/restore may fail because computers. • No reason for panic. • Previous backup/restores proven to be working • At most we lose time • Two sequential failures, or failures across clusters are incidents to be investigated 15 !
  16. 16. How people build software! Restore: delayed replica • One delayed replica per cluster • Lagging at 4 hours • Chatops: .mysql panic 16 !
  17. 17. How people build software! Failovers ^ that, too 17
  18. 18. How people build software! MySQL setup @ GitHub • Plain-old single writer master-replicas asynchronous replication. • Not yet semi-sync • Cross DC, multiple data centers • 5.7, RBR • Servers with special roles: production replica, backup, auto-restore, migration-test, analytics, … • 2-3 tiers of replication • Occasional cluster split (functional sharding) • Very dynamic, always changing 18 !
  19. 19. How people build software! Points of failure • Master failure, sev1 • Intermediate masters failure 19 ! ! ! ! ! ! ! ! ! !
  20. 20. How people build software! orchestrator • Topology discovery • Refactoring • Failovers for masters and intermediate masters • Open source, Apache 2 license • github.com/github/orchestrator 20 !
  21. 21. How people build software! orchestrator failovers @ GitHub • Automated master & intermediate master failovers for all clusters. • On failover, runs GitHub-specific hooks • Grabbing VIP/DNS • Updating server role • Kicking services (e.g. pt-heartbeat) • Notifying chat • Running puppet 21 !
  22. 22. How people build software! Testing cluster • Dedicated testing cluster in production • Does not take production traffic • “load-test” traffic • Resembles a production topology: • OS, MySQL Versions • Data centers • Server roles • DNS • Proxy • Used for many of our deployment tests 22 !
  23. 23. How people build software! Failover testing • Multiple times per day: • Setup the cluster in desired topology layout • Inject failure (kill/block/reject) • Wait, expect recovery • Check topology: • Expect new master, correct DNS changes, replica capacity, … • Restore old master from backup • (an implicit backup/restore test) • “success/failure” event 23 !
  24. 24. How people build software! Failover in production • We expect < 30s failover • Intermediate master failover has low impact on subset of users, depending on cluster/DC/server • Master failover implies outage • Planned master switchover takes a few seconds 24 !
  25. 25. How people build software! A moment of reflection 25
  26. 26. How people build software! What builds trust in failovers? A testing environment? 26 !
  27. 27. How people build software! Chaos testing in production • First steps into regular testing • Manual • Supported by our peers • Learning, understanding impact 27 !
  28. 28. How people build software! Tests that go wrong • Many things can go wrong • Corrupt replication • Invalidated servers • Unassigned DNS • Cleanups 28 !
  29. 29. How people build software! Schema migrations 29
  30. 30. How people build software! Is your data correct? The data you see is merely a ghost of your original data 30 !
  31. 31. How people build software! gh-ost • Young. 16 months old. • In production at GitHub since born. • Software • Bugs • Development • Bugs 31
  32. 32. How people build software! gh-ost testing • gh-ost works perfectly well on our data • Tested, re-tested, and tested again • Full coverage of production tables 32
  33. 33. How people build software! gh-ost testing servers • Dedicated servers that run continuous tests 33
  34. 34. How people build software! 34 ! ! ! # ! ! production replicas testing replica master ! gh-ost testing replicas ! ! ! # ! ! production replicas testing replica master !
  35. 35. How people build software! gh-ost testing • Trivial ENGINE=INNODB migration • Stop replication • Cut-over, cut-back • Checksum both tables, compare • Checksum failure: stop the world, alert • Success/failure: event • Drop ghost table • Catch up • Next table 35
  36. 36. How people build software! gh-ost development cycle • Work on branch
 .deploy gh-ost/mybranch to prod/mysql_role=ghost_testing • Let continuous tests run • Depending on nature of change, observe hours/days/more. • Merge • Tests run regardless of deployed branch 36
  37. 37. How people build software! Conclusion • Backup & restore • Failovers • Schema migrations 37
  38. 38. How people build software! Thank you! Questions? github.com/ikewalker @iowalker 38 !

×