The database team at GitHub is tasked with keeping the data available and with maintaining its integrity. Our infrastructure automates away much of our operation, but automation requires trust, and trust is gained by testing. This session highlights three examples of infrastructure testing automation that helps us sleep better at night:
- Backups: scheduling backups; making backup data accessible to our engineers; auto-restores and backup validation. What metrics and alerts we have in place.
- Failovers: how we continuously test our failover mechanism, orchestrator. How we setup a failover scenario, what defines a successful failover, how we automate away the cleanup. What we do in production.
- Schema migrations: how we ensure that gh-ost, our schema migration tool, which keeps rewriting our (and your!) data, does the right thing. How we test new branches in production without putting production data at risk.
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
MySQL Infrastructure Testing Automation at GitHub
1. How people build software!
MySQL Infrastructure
Testing Automation
@ GitHub
IkeWalker
GitHub
Boston MySQL Meetup
December 11, 2017
1
!
2. How people build software!
Agenda
• About
• MySQL @ GitHub
• Automation
• Backup/restores
• Failovers
• Schema migrations
2
!
3. How people build software!
About me
• Database Architect
• Working with MySQL since 2006
• Organizer of Boston MySQL Meetup
github.com/ikewalker
@iowalker
3
!
4. How people build software! 4
• The world’s largest Octocat T-shirt and stickers store
• And water bottles
• And hoodies
• We also do stuff related to things
• Word is new swag is coming up
GitHub
5. How people build software!
GitHub
• 66M repositories
• 24M developers
• 117K businesses
• More than a million teams
• World’s largest open source hosting
• Alexa top 100
• Critical path in build flows
5
!
6. How people build software!
MySQL at GitHub
• GitHub stores repositories in git, and uses MySQL
as the backend database for all related metadata:
• Repository metadata, users, issues, pull
requests, comments etc.
• Website/API/Auth/more all use MySQL.
• We run a few (growing number of) clusters, totaling
over 100 MySQL servers.
• The setup isn’t very large but very busy.
6
!
7. How people build software!
MySQL at GitHub
• Our MySQL servers must be available, responsive
and in good state
• GitHub has 99.95% SLA
• Availability issues must be handled quickly, as
automatically as possible.
7
!
8. How people build software!
github/database-infrastructure
• @ggunson, @jessbreckenridge, @jonahberquist,
@shlomi-noach, @tomkrouper, @gtowey
• Concerned with:
• Data availability
• Data integrity
8
!
12. How people build software!
Restores
• Dedicated restore servers.
• One per cluster.
• Continuously restores, catches up with replication,
restores, catches up with replication, restores, …
• Sending a “success” event at the end of each cycle.
• We monitor for number of “success” events in past
24-ish hours, per cluster.
12
!
13. How people build software! 13
!
!
!
!
!
production replicas
auto-restore replica
master
!
auto-restore replicas
""""""
backup replica
14. How people build software!
Restores
• New host provisioning uses same flow as restore.
• A human may kick a restore/reclone manually.
• Chatops:
.mysql backup-restore -H restore.this.host -r
14
!
15. How people build software!
Restore failure
• A specific backup/restore may fail because
computers.
• No reason for panic.
• Previous backup/restores proven to be working
• At most we lose time
• Two sequential failures, or failures across clusters
are incidents to be investigated
15
!
16. How people build software!
Restore: delayed replica
• One delayed replica per cluster
• Lagging at 4 hours
• Chatops: .mysql panic
16
!
18. How people build software!
MySQL setup @ GitHub
• Plain-old single writer master-replicas
asynchronous replication.
• Not yet semi-sync
• Cross DC, multiple data centers
• 5.7, RBR
• Servers with special roles: production replica,
backup, auto-restore, migration-test, analytics, …
• 2-3 tiers of replication
• Occasional cluster split (functional sharding)
• Very dynamic, always changing
18
!
19. How people build software!
Points of failure
• Master failure, sev1
• Intermediate masters failure
19
!
! !
!
!
!
! !
!
!
20. How people build software!
orchestrator
• Topology discovery
• Refactoring
• Failovers for masters and intermediate masters
• Open source, Apache 2 license
• github.com/github/orchestrator
20
!
21. How people build software!
orchestrator failovers @ GitHub
• Automated master & intermediate master failovers
for all clusters.
• On failover, runs GitHub-specific hooks
• Grabbing VIP/DNS
• Updating server role
• Kicking services (e.g. pt-heartbeat)
• Notifying chat
• Running puppet
21
!
22. How people build software!
Testing cluster
• Dedicated testing cluster in production
• Does not take production traffic
• “load-test” traffic
• Resembles a production topology:
• OS, MySQL Versions
• Data centers
• Server roles
• DNS
• Proxy
• Used for many of our deployment tests
22
!
23. How people build software!
Failover testing
• Multiple times per day:
• Setup the cluster in desired topology layout
• Inject failure (kill/block/reject)
• Wait, expect recovery
• Check topology:
• Expect new master, correct DNS changes,
replica capacity, …
• Restore old master from backup
• (an implicit backup/restore test)
• “success/failure” event
23
!
24. How people build software!
Failover in production
• We expect < 30s failover
• Intermediate master failover has low impact on
subset of users, depending on cluster/DC/server
• Master failover implies outage
• Planned master switchover takes a few seconds
24
!
26. How people build software!
What builds trust in failovers?
A testing environment?
26
!
27. How people build software!
Chaos testing in production
• First steps into regular testing
• Manual
• Supported by our peers
• Learning, understanding impact
27
!
28. How people build software!
Tests that go wrong
• Many things can go wrong
• Corrupt replication
• Invalidated servers
• Unassigned DNS
• Cleanups
28
!
30. How people build software!
Is your data correct?
The data you see is merely a ghost of your original data
30
!
31. How people build software!
gh-ost
• Young. 16 months old.
• In production at GitHub since born.
• Software
• Bugs
• Development
• Bugs
31
32. How people build software!
gh-ost testing
• gh-ost works perfectly well on our data
• Tested, re-tested, and tested again
• Full coverage of production tables
32
33. How people build software!
gh-ost testing servers
• Dedicated servers that run continuous tests
33
34. How people build software! 34
!
!
!
#
!
!
production replicas
testing replica
master
!
gh-ost testing replicas
!
!
!
#
!
!
production replicas
testing replica
master
!
35. How people build software!
gh-ost testing
• Trivial ENGINE=INNODB migration
• Stop replication
• Cut-over, cut-back
• Checksum both tables, compare
• Checksum failure: stop the world, alert
• Success/failure: event
• Drop ghost table
• Catch up
• Next table
35
36. How people build software!
gh-ost development cycle
• Work on branch
.deploy gh-ost/mybranch to prod/mysql_role=ghost_testing
• Let continuous tests run
• Depending on nature of change, observe hours/days/more.
• Merge
• Tests run regardless of deployed branch
36
37. How people build software!
Conclusion
• Backup & restore
• Failovers
• Schema migrations
37
38. How people build software!
Thank you!
Questions?
github.com/ikewalker
@iowalker
38
!