Spil Games: Migrate an infrastructure to Galera Cluster

  • 2,539 views
Uploaded on

Galera Cluster is a synchronous multi-master cluster for MySQL that allows you to synchronously replicate your data to every node in the cluster. Galera Cluster makes the life of a DBA easier with …

Galera Cluster is a synchronous multi-master cluster for MySQL that allows you to synchronously replicate your data to every node in the cluster. Galera Cluster makes the life of a DBA easier with features like automatic node joining, electing donor nodes and automatic node removal once a node has failed. There is also no need to distinguish master and slave relations in your application as all nodes in the cluster are writable. Consider all nodes in the cluster as one big MySQL database server.

In this talk I will explain how Spil Games strives to have their entire MySQL environment running on Galera Cluster. At Spil Games we have a large variety of MySQL implementations ranging from traditional functionally sharded database master-master pairs, single masters with many read slaves and user/location sharded database nodes. Migrating these different implementations to a Galera Cluster requires careful planning.

The biggest challenge is to migrate all our existing (ageing) asynchronous Master-Master setup to a synchronous Galera replicator. Because the ageing servers are less performant than the new servers used for Galera we merge multiple existing clusters into one cluster, saving us lots of space and power in our data center.

The session will include design choices, learnings and the pitfalls we fell into.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
2,539
On Slideshare
0
From Embeds
0
Number of Embeds
3

Actions

Shares
Downloads
0
Comments
0
Likes
4

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide
  • so that may be the reason our name is not widely known.

Transcript

  • 1. Migrate an infrastructure to Galera Cluster Art van Scheppingen Head of Database Engineering
  • 2. 2 1. Who are we? 2. What is Galera? 3. What is Spil Games using Galera for? 4. What have we learned? 5. Future technologies 6. Conclusion 7. Questions? Overview
  • 3. Who are we? Who is Spil Games?
  • 4. 4 • Game publishers • Company founded in 2001 • 350+ employees world wide • 180M+ unique visitors per month • Over 60M registered users • 45 portals in 19 languages • Casual games • Social games • Real time multiplayer games • Mobile games • 40+ MySQL clusters • 65k queries per second • 8 Sphinx servers • 8k queries per second Facts
  • 5. 5 Geographic Reach 180 Million Monthly Active Users(*) Source: (*) Google Analytics, August 2012
  • 6. What is Galera? How to get Highly Available and beyond
  • 7. 1. Replication plugin for MySQL by Codership • Synchronous (parallel) replication • Supports InnoDB • MyISAM “works” • Committing transactions actually replicates data 2. Allows clustering of nodes • Minimum of 3 nodes for HA • Galera Arbitrator allows 2 nodes • Primary Component • Normally the full cluster is a PC • Split: several components, one PC is allowed What is Galera?
  • 8. How does Galera work? Galera Server-1 Server-2 Server-n MySQL MySQL MySQL Connect/read/write to any node Synchronous replication
  • 9. Galera synchronous replication Galera replication Server-n MySQL MySQL MySQL commit Client receives OK Transaction applied to slaves
  • 10. Galera vs Master-Slave replication Server-1 Server-2 Server-n MySQL master MySQL slave read onlyread+write Asynchronous single-threaded replication
  • 11. High Availability (1) Galera Server-1 Server-2 Server-n MySQL MySQL MySQL
  • 12. High Availability (2) Galera Server-1 Server-2 Server-n Load balancer / Query router MySQL MySQL MySQL
  • 13. High Availability (3) Galera Server-1 Load balancer MySQL MySQL MySQL Server-2 Load balancer Server-n Load balancer
  • 14. High Availability (4) Galera Server-1 Server-2 Server-n Load balancer (port 3306 read,3307 write) MySQL MySQL MySQL read+write read only read only
  • 15. GaleraGalera Server-1 Server-2 Server-n Load balancer read/write to two nodes New node SST Requests to join cluster Cluster drains node MySQL MySQL MySQL Synchronous replication MySQL Node joining SST (State Snapshot Transfer)
  • 16. GaleraGalera Server-1 Server-2 Server-n Load balancer Existing node IST Requests to join cluster MySQL MySQL MySQL Synchronous replication MySQL Node joining IST (Incremental State Transfer)
  • 17. Galera replication over WAN Galera replication Server-n MySQL MySQL DC1 DC2 Commit is delayed by RTT
  • 18. WAN replication Galera 2.x DC1 DC2 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6
  • 19. WAN replication Galera 3.x DC1 DC2 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6
  • 20. What are we using Galera for? Synchronous replication for the masses
  • 21. 1. Legacy services databases • MySQL Master-Master 2. SSP (Spil Storage Platform) • MySQL Master-Master (last pair remaining) • Galera 3. ROAR (Read Often, Alter Rarely) • Galera / Sphinx Our systems
  • 22. Master-Master setup used at Spil Games Server-1 Server-2 Server-n MySQL active master MySQL inactive master Asynchronous replication read onlyread+write MMM db-something (192.168.1.1) db-something-r1 (192.168.1.2) db-something-r2 (192.168.1.3)
  • 23. Master-Master setup used at Spil Games MySQL active master MySQL inactive master read onlyread+write MMM db-something (192.168.1.1) db-something-r1 (192.168.1.2) db-something-r2 (192.168.1.3) MySQL slave MySQL slave db-something-r3 (192.168.1.4) db-something-r4 (192.168.1.5)read only
  • 24. Migrating legacy dbs to Galera (lab) Galera MySQL MySQL MySQL legacy1 inactive master MySQL Clone database (innobackupex) Feed database dump (mysqldump) Start slaving legacy3 inactive master legacy2 inactive master
  • 25. Scaling Galera (1) Server-1 Server-2 Server-n Load balancer (port 3306 write+read) Galera GaleraGalera MySQLMySQL MySQL MySQLMySQL
  • 26. Scaling Galera (2) Server-1 Server-2 Server-n Load balancer (port 3306 write, 3307 read) Galera MySQL MySQL MySQLMySQL MySQL read only read only asynchronous replication asynchronous replication
  • 27. 1. Around 20 legacy database clusters • 50 servers in total 2. Maintenance • Master-Master requires a lot of (manual) maintenance 3. Replacement is needed • 35 of them will be older than 3 years in 2014 4. Current state: 1 pair migrated to a cluster in Openstack Why consolidate legacy systems?
  • 28. 28 • Storage API between application and databases • All data is sharded • User • Function • Location • Initial design: • Every cluster (two masters) will contain two shards • Data written interleaved • HA for both shards • Both masters active and “warmed up” SSP (Spil Storage Platform) SSP Shard 1 Shard 2
  • 29. SSP Master-Master setup Server-1 Server-2 Server-n MySQL active master MySQL active master Asynchronous replication read+writeread+write MMM db-ssp001 (192.168.2.1) db-ssp002 (192.168.2.2)
  • 30. SSP Master-Master setup Server-1 Server-2 Server-n MySQL broken master MySQL active master read+write MMM db-ssp001 (192.168.2.1) db-ssp002 (192.168.2.2)
  • 31. SSP Galera setup Galera Server-1 Server-2 Server-n Load balancer MySQL MySQL MySQL read/write to any node Synchronous replication
  • 32. 1. Total of 2 old style SSP shard nodes (1 pair) 2. Total of 6 Galera SSP shard nodes (2 clusters) 3. 6 more nodes are planned for Q2 4. Add Galera nodes/clusters when necessary Current state of the SSP
  • 33. What have we learned so far? Pitfalls, hurdles, etc
  • 34. Creating backups 1. Two ways to make backups: • Issue SST • Either mysqldump or Innobackupex • Regular Innobackupex • --galera_info • set global wsrep_desync=on to remove node
  • 35. Backup SST Galera Server-1 Server-2 Server-n Load balancer MySQL MySQL MySQL read/write to two nodes Synchronous replication Backup receiver SST Request SST Cluster drains node Backup done
  • 36. Backup Innobackupex Galera Server-1 Server-2 Server-n Load balancer MySQL MySQL MySQL read/write to two nodes Synchronous replication BackupPC wsrep_desync=ON Stream backup wsrep_desync=OFF
  • 37. Restoring backups 1. Restored backup can be used to prevent SST of new joiners 2. Automated backup verification • Restores (randomly) chosen backup • Installs necessary MySQL version (5.1/5.5) • Perform basic checks • Enable replication • Will not work fully as it needs a working cluster to join
  • 38. Monitoring 1. Cluster • Nodes in the cluster • Warning at 2, critical at 1 • Availability of the address 2. Load balancer • Node checks • Write only on one node 3. Performance monitoring • Adding metrics to mysql_statsd is easy • wsrep_flow_control
  • 39. Flow control 1. Usage of replication threads • Scale from 0.0 to 1.0 2. Recommended to stay below 0.1 (10% blocked) 3. Adding more nodes will not solve your problem 4. Increase replication threads • Recommended 2*CPU cores • What if 64 is not enough? • How do you close flood gates?
  • 40. Other things we bumped into… 1. MySQL version updates • Update one by one • PXC SST changes 2. Availability after restart • Joins cluster after IST/SST • LRU still loading 3. In descriptive errors during SST • Local user authentication (after starting mysqld with sudo!) 4. Schema changes
  • 41. Future for Galera at Spil Games What will we do in the near future?
  • 42. Openstack 1. Offer DAAS to our (internal) customers 2. Spawning (automated) database nodes and clusters when necessary 3. Mix and match Galera and regular MySQL replication
  • 43. WAN Replication 1. No immediate use case (yet) • No need for WAN in sharded environment • Game catalogue cluster needs it in the future 2. Test Galera 3.0 • Datacenter awareness
  • 44. MaxScale 1. Beta testing MaxScale for SkySQL • Works flawless in the lab (so far) • Not yet tested with mixed Galera/MySQL replication 2. MaxScale itself is not HA (yet) • Keepalived?
  • 45. Conclusion What is our verdict?
  • 46. Conclusion(s) 1. Galera definitely live up to expectations 2. Decreased cluster wide performance 3. Increased replication performance 4. High investment in time for initial setup/tools 5. Maintenance is easier 6. Well worth the investment for us
  • 47. Questions?
  • 48. • Building globally available storage layers Friday 1:50PM - 2:40PM @ Ballroom F • MaxScale, the pluggable router Today 4:50PM - 5:40PM @ Ballroom A • Geographical Disaster recovery challenges and solutions Today 4:50PM - 5:40PM @ Ballroom C • Write Conflicts in Multi-Master Replication Topologies Thursday 4:30PM - 5:20PM @ Ballroom H Other (Galara related) talks to attend to
  • 49. 49 • Presentation can be found at: http://spil.com/plsc2014galera • If you wish to contact me: Email: art@spilgames.com Twitter: @banpei • Engineering @ Spil Games Blog: http://engineering.spilgames.com Twitter: @spilengineering Thank you!
  • 50. 50 Our current HA environment: http://thinkaurelius.com/2013/03/30/titan-server-from-a-single-server-to-a-highly-available-cluster/ What we have learned so far: http://renaissanceronin.wordpress.com/2009/10/05/playing-with-plasma-cutters/ Near future: http://www.example-infographics.com/envisioning-the-near-future-of-technology/ Conclusion: http://www.flickr.com/photos/louisephotography/5796499806/in/photostream/ Photo sources