MySQL for Large Scale Social Games


Presented at OSCON 2011

Published in: Technology

  1. 1. MySQL for large scale social games. Yoshinori Matsunobu, Principal Infrastructure Architect, Oracle ACE Director at DeNA. Former APAC Lead MySQL Consultant at MySQL/Sun/Oracle. Twitter: @matsunobu 1
  2. 2. Table of contents. Easier maintenance and automating failover: non-stop master migration; automated master failover; new open source software: “MySQL MHA”. Optimizing MySQL for faster H/W 2
  3. 3. Company Introduction: DeNA. One of the largest social game providers in Japan. Provides both a social game platform and social games themselves. Subsidiary ngmoco:) in San Francisco. Japan-localized phone, smartphone, and PC games. 2-3 billion page views per day. 25+ million users. 1000+ MySQL servers, 150+ {master, slaves} pairs. $1.3B revenue in 2010 3
  4. 4. Games are expanding / shrinking rapidly. It is very difficult to predict social game workloads: traffic is sometimes unexpectedly high, sometimes much lower than expected. Each social game’s traffic tends to go down after months or years. For expanding games: adding slaves; adding more shards (it is possible to add shards without stopping services); scaling up the master’s H/W (more RAM, HDD -> SSD/PCI-E SSD, faster network, etc). For shrinking games: decreasing slaves; migrating the master to a lower-spec machine; consolidating a few masters/slaves within a single machine 4
  5. 5. Desire for Easier Operations. We want to move master servers more easily. Scaling up: increasing RAM, replacing with faster SSD. Upgrading MySQL: results in 10 minutes or more of downtime to fill the buffer pool. Scaling down: moving unpopular games to lower-spec servers. Working around power outages: moving games to a remote datacenter. If you can allocate maintenance downtime, it’s easy, but we can’t do so many times: it requires announcing to users, coordinating with customer support, etc; longer downtime reduces revenue; operations staff will be exhausted by too much midnight work. Reducing maintenance time is important to manage hundreds or thousands of MySQL servers 5
  6. 6. Switching master in seconds. If we can switch a master in less than 3 seconds, it is acceptable in most of our cases. Steps: (1) stop updates on the master; (2) wait until at least one of the slaves (the new master) has synced with the current master; (3) grant writes and allocate the virtual IP (etc) to the new master; (4) all the remaining slaves start replication from the new master 6
  7. 7. Blocking writes on master. MySQL provides several commands/solutions to block writes, but not all of them are safe. FLUSH TABLES WITH READ LOCK: clients will wait forever unless timeouts are set on the client side; running transactions will be aborted in the end (“updating master1 -> updating master2 -> committing master1 -> getting error on committing master2” will result in data inconsistency); flushing all tables sometimes takes a very long time, so run “FLUSH NO_WRITE_TO_BINLOG TABLES” beforehand. SET GLOBAL read_only = 1: clients get errors immediately; running transactions will be aborted. Dropping the MySQL user (used from applications): applications can not establish new MySQL connections; current sessions are NOT terminated until disconnect; current sessions do not encounter errors; works with non-persistent connections only 7
  8. 8. Trade-off between safeness and performance. What we are now doing at DeNA: (1) checking that there are no long-running updates (100 seconds of updates will take 100 seconds on slaves); (2) dropping the app user -- starting downtime; (3) waiting for a while (2 seconds maximum) until all active application sessions are disconnected, ignoring replication threads and sessions sleeping 1 second or more (highly likely daemon programs or unused sessions, which can be killed safely), and not killing active sessions immediately; (4) executing FLUSH TABLES WITH READ LOCK when there are no active sessions or 2 seconds have passed; (5) starting slave promotion -- ending downtime. At most 1 second is enough to do all processes 8
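The waiting rule on slide 8 can be sketched as a small filter over process-list rows. This is only an illustrative sketch, not DeNA's or MHA's code: the `Session` type and both function names are hypothetical, and identifying replication threads by the `Binlog Dump` command or the `system user` account is an assumption about how such sessions appear in `SHOW PROCESSLIST`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Session:
    """One row of SHOW PROCESSLIST (hypothetical, reduced to three columns)."""
    user: str      # MySQL account name
    command: str   # e.g. "Query", "Sleep", "Binlog Dump"
    time: int      # seconds spent in the current state


def is_active_app_session(s: Session, sleep_cutoff: int = 1) -> bool:
    """True if this session should delay the master switch."""
    if s.command == "Binlog Dump":
        return False   # replication thread feeding a slave
    if s.user == "system user":
        return False   # replication thread on a slave
    if s.command == "Sleep" and s.time >= sleep_cutoff:
        return False   # likely a daemon or unused session
    return True


def ready_to_lock(sessions: Iterable[Session], waited_seconds: float,
                  max_wait: float = 2.0) -> bool:
    """True once FLUSH TABLES WITH READ LOCK may be issued:
    no active application session remains, or the wait limit has passed."""
    if waited_seconds >= max_wait:
        return True
    return not any(is_active_app_session(s) for s in sessions)
```

A caller would poll the process list after dropping the application user and issue the lock as soon as `ready_to_lock` returns true.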
  9. 9. Our solution. Developing “MySQL-MHA: Master High Availability manager and tools”. This is an automated failover tool, but it can also be used for fast online master switch: switching the original master to a new master gracefully. We have switched 10+ masters so far, and could switch in 0.5 - 1 second of downtime. [Diagram] From: host1 (current master) replicating to host2 (backup), host3 (slave), host4 (slave), host5 (remote). To: host2 (new master) replicating to host3 (slave), host4 (slave), host5 (remote) 9
  10. 10. Master Failover: What makes it difficult? MySQL replication is asynchronous. It is likely that some (or all) of the slaves have not received all binary log events from the crashed master. It is also likely that only some slaves have received the latest events. In the example diagram, id=102 is not replicated to any slave; slave 2 is the latest among the slaves, but slave 1 and slave 3 have lost some events. It is necessary to do the following: copy id=102 from the master (if possible); apply all differential events, otherwise data inconsistency happens. [Diagram: the master, reached through the writer IP, holds events id=99 to id=102; slave1, slave2 and slave3 each hold only part of them. Annotated steps: 1. Save binlog events that exist on master only; 2. Identify which events are not sent; 3. Apply lost events] 10
  11. 11. Current stable HA solutions and issues. Pacemaker (Heartbeat) + DRBD (or shared disk). Cost: requires an additional passive master server (not handling any application traffic). Performance: to make HA really work in DRBD replication environments, innodb-flush-log-at-trx-commit and sync-binlog must be 1, but these kill write performance; otherwise necessary binlog events might be lost on the master, then slaves can’t continue replication and data consistency issues happen. MySQL Cluster: MySQL Cluster is really highly available, but unfortunately we use InnoDB. Others: unstable, too complex, too hard to operate/administer, wrong/no documentation; not working with standard MySQL (are you saying we have to migrate all 150+ applications to bleeding edge distributions?); not working with remote datacenters, etc 11
  12. 12. Our solution: Developing MySQL-MHA. MySQL Master High Availability manager and tools. Manager (MySQL-MasterHA-Manager): masterha_manager, other helper commands. Node (MySQL-MasterHA-Node): save_binary_logs, apply_diff_relay_logs, purge_relay_logs. The manager pings master availability. When detecting master failure, it promotes one of the slaves to the new master and fixes consistency issues between slaves. [Diagram: a manager host monitoring two replication groups, each consisting of a master with slave1, slave2, slave3] 12
  13. 13. Internals: steps for recovery. [Diagram: Dead Master, Latest Slave, Slave(i)] Wait until the SQL thread executes all events (final Relay_Log_File, Relay_Log_Pos). (i1) Partial transaction (Master_Log_File, Read_Master_Log_Pos). (i2) Differential relay logs from each slave’s read pos to the latest slave’s read pos. (X) Differential binary logs from the latest slave’s read pos to the dead master’s tail of the binary log. On slave(i): wait until the SQL thread executes events, then apply i1 -> i2 -> X. On the latest slave, i2 is empty 13
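The i2 and X ranges from slide 13 can be illustrated with a small planning function. This is a sketch under stated assumptions, not MHA's implementation: `recovery_plan` and the `Pos` alias are hypothetical names, a position is modelled as a (binary log file name, byte offset) pair, and file names are assumed to be fixed-width so that tuple comparison orders them correctly. The partial-transaction step (i1) is left out because it depends on each slave's own relay log.

```python
from typing import Dict, Optional, Tuple

# (binary log file name, byte offset); tuples compare by file name, then offset
Pos = Tuple[str, int]
Range = Optional[Tuple[Pos, Pos]]


def recovery_plan(read_pos: Dict[str, Pos],
                  master_tail: Pos) -> Tuple[str, Dict[str, Dict[str, Range]]]:
    """Pick the latest slave and compute, per slave, the log ranges to apply.

    read_pos    -- each slave's (Master_Log_File, Read_Master_Log_Pos)
    master_tail -- end of the dead master's binary log, if it could be saved
    """
    latest_name = max(read_pos, key=read_pos.get)
    latest_pos = read_pos[latest_name]
    plan = {}
    for name, pos in read_pos.items():
        # i2: relay-log events this slave is missing relative to the latest slave
        i2 = None if pos == latest_pos else (pos, latest_pos)
        # X: binlog events that never left the dead master
        x = None if latest_pos == master_tail else (latest_pos, master_tail)
        plan[name] = {"i2": i2, "X": x}
    return latest_name, plan
```

With three slaves at offsets 400, 900 and 650 of the same file, the slave at 900 is chosen as the latest, its i2 range is empty, and every slave receives the same X range.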
  14. 14. Advantages of MySQL MHA. Master failover and slave promotion can be done very quickly: total downtime can be 10-30 seconds. Master crash does not result in data inconsistency. No need to modify current MySQL settings: we use MHA for 150+ normal MySQL 5.0/5.1/5.5 masters, without modifying anything. Problems of MHA do not result in MySQL failure: you can install/uninstall/upgrade/downgrade/restart without stopping MySQL. No need to increase lots of servers. No performance penalty. Works with any storage engine. Can also be used for failback (fast online master switch) 14
  15. 15. MySQL MHA Project Info. Project top page; documentation; source tarball and rpm package (stable release); the latest source repository (dev release): manager source and per-MySQL-server source. SkySQL provides commercial support for MHA 15
  16. 16. Table of contents. Easier maintenance and automating failover: non-stop master migration; automated master failover; new open source software: “MySQL MHA”. Optimizing MySQL for faster H/W 16
  17. 17. Per-server performance is important. To handle 1 million queries per second: at 1000 queries/sec per server, 1000 servers in total; at 10000 queries/sec per server, 100 servers in total. The additional 900 servers will cost $10M initially and $1M every year. If you can increase per-server throughput, you can reduce the total number of servers, which will decrease TCO. Sharding is not everything 17
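The server counts on slide 17 follow from a single division. A minimal sketch of that arithmetic, where the function name `servers_needed` is hypothetical and only the query rates come from the slide:

```python
def servers_needed(total_qps: int, per_server_qps: int) -> int:
    """Smallest number of servers that can absorb total_qps (integer ceiling)."""
    return -(-total_qps // per_server_qps)


slow_fleet = servers_needed(1_000_000, 1_000)    # 1000 servers
fast_fleet = servers_needed(1_000_000, 10_000)   # 100 servers
extra_servers = slow_fleet - fast_fleet          # 900 servers avoided
```

The 900 servers avoided are the ones the slide prices at $10M initially and $1M every year.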
  18. 18. History of MySQL performance improvements. H/W improvements: HDD RAID, write cache; large RAM; SATA SSD, PCI-Express SSD; more CPU cores; faster network. S/W improvements: improved algorithms (i/o scheduling, swap control, etc); much better concurrency; avoiding stalls; improved space efficiency (compression, etc) 18
  19. 19. 32bit Linux. [Diagram: updates going to three servers, each with 2GB RAM and HDD RAID (20GB), each with many slaves] Random disk i/o speed (IOPS) on HDD is very slow: 100-200/sec per drive. Databases easily became disk i/o bound, regardless of disk size. Applications could not handle large data (e.g. 30GB+ per server). Lots of database servers were needed. Per-server traffic was not so high because both the number of users and data volume per server were not so high. Backup and restore completed in a short time. MyISAM was widely used because it’s very space efficient and fast 19
  20. 20. 64bit Linux + large RAM + BBWC. [Diagram: one server with 16GB RAM and HDD RAID (120GB), with many slaves] Memory pricing went down, and 64bit Linux matured. It became common to deploy 16GB or more RAM on a single Linux machine. Memory hit ratio increased, and much larger data could be stored. The number of database servers decreased (consolidated). Per-server traffic increased (the number of users per server increased). “Transaction commit” overheads were greatly reduced thanks to battery backed up write cache. From the database point of view, InnoDB became faster than MyISAM (row level locks, etc). Direct I/O became common 20
  21. 21. Side effect caused by fast server. [Diagram: master and slave, both on HDD RAID] After 16-32GB RAM became common, we could run many more users and data per server; write traffic per server also increased. 4-8 disk RAID 5/10 also became common, which improved concurrency a lot: on 6 HDD RAID 10, single thread IOPS is around 200, and 100 threads IOPS is around 1000-2000. Good parallelism on both reads and writes on the master. On slaves, there is only one writer thread (SQL thread), so there is no parallelism on writes: 6 HDD RAID 10 is as slow as a single HDD for writes. Slaves became the performance bottleneck earlier than the master. Serious replication delay happened (10+ minutes at peak time) 21
  22. 22. Using SATA SSD on slaves. [Diagram: master on HDD RAID, slave on SATA SSD] IOPS differences between master (1000+) and slave (100+) have caused serious replication delay. Is there any way to gain high enough IOPS from a single thread? Read IOPS on SATA SSD is 3000+, which should be enough (15 times better than HDD). Just replacing HDD with SSD solved replication delay. Overall read throughput became much better. Using SSD on the master was still risky. Using SSD on slaves (IOPS: 100+ -> 3000+) was more effective than using it on the master (IOPS: 1000+ -> 3000+). We mainly deployed SSD on slaves. The number of slaves could be reduced. From the MySQL point of view: good concurrency on HDD RAID has been required: InnoDB Plugin 22
  23. 23. How about PCI-Express SSD? Deploying on both master and slaves? If PCI-E SSD is used on the master, replication delay will happen again: 10,000 IOPS from a single thread, 40,000+ IOPS from 100 threads. 10,000 IOPS from 100 threads can be achieved with SATA SSD. Parallel SQL threads should be implemented in MySQL. Deploying on only slaves? If using HDD on the master, SATA SSD should be enough to handle workloads: PCI-Express SSD is much more expensive than SATA SSD. How about running multiple MySQL instances on a single server? Virtualization is not fast; running multiple MySQL instances on a single OS is more reasonable. Does PCI-E SSD have enough storage capacity to run multiple instances? On HDD environments, typically only 100-200GB of database data can be stored because of slow random IOPS on HDD. FusionIO SLC: 320GB Duo + 160GB = 480GB. FusionIO MLC: 1280GB Duo + 640GB = 1920GB. tachIOn SLC: 800GB x 2 = 1600GB 23
  24. 24. Running multiple slaves on single box. [Diagram, before and after: masters (M), backup servers (B) and slaves (S1, S2, S3) of several replication groups; after consolidation, slaves of different masters share machines] Running multiple slaves on a single PCI-E slave. Master and backup server are still HDD based. Consolidating multiple slaves: since the slave’s SQL thread is single threaded, you can gain better concurrency by running multiple instances. The number of instances is mainly restricted by capacity 24
  25. 25. Our environment. Machine: HP DL360G7 (1U), or Dell R610. PCI-E SSD: FusionIO MLC (640GB Duo + 320GB non-Duo), tachIOn SLC (800GB x 2). CPU: two sockets, Nehalem 6-core per socket, HT enabled; 24 logical CPU cores are visible; a four socket machine is too expensive. RAM: 60GB or more. Network: Broadcom BCM5709, four ports; using four network cables + bonding mode 4 + link aggregation; BONDING_OPTS="miimon=100 mode=4 lacp_rate=1 xmit_hash_policy=1". HDD: 4-8 SAS RAID1+0, for backups, redo logs, relay logs, (optionally) doublewrite buffer 25
  26. 26. Benchmarks on our real workloads. Consolidating 7 instances on FusionIO (640GB MLC Duo + 320GB MLC). Let half of SELECT queries go to these slaves. 6GB innodb_buffer_pool_size. Peak QPS (total of 7 instances): 61683.7 query/s, 37939.1 select/s, 7861.1 update/s, 1105 insert/s, 1843 delete/s, 3143.5 begin/s. CPU utilization: %user 27.3%, %sys 11% (%soft 4%), %iowait 4%; c.f. SATA SSD: %user 4%, %sys 1%, %iowait 1%. Buffer pool hit ratio: 99.4%; SATA SSD (single instance/server): 99.8%. No replication delay. No significant (100+ms) response time delay caused by SSD 26
  27. 27. CPU loads. Per-CPU statistics, 22:10:57 to 22:11:57 (columns: CPU, %user, %nice, %sys, %iowait, %irq, %soft, %steal, %idle, intr/s): all: 27.13, 0.00, 6.58, 4.06, 0.14, 3.70, 0.00, 58.40, 56589.95; … ; CPU 23: 30.85, 0.00, 7.43, 0.90, 1.65, 49.78, 0.00, 9.38, 44031.82. CPU utilization was high, but the server should be able to handle more: %user 27.3%, %sys 11% (%soft 4%), %iowait 4%. Reached the storage capacity limit (960GB); using 1920GB MLC should be fine to handle more instances. Network became the first bottleneck: Recv 14.6MB/s, Send 28.7MB/s. CentOS5 + bonding is not good for network request handling (only a single CPU core can handle requests); I got the above result when I tested with normal bond0. We are now using link aggregation + bond4 with 4 network cables, and the CPU bottleneck went away 27
  28. 28. Things to consider. To run multiple MySQL instances on a single server, you need to allocate different IP addresses or port numbers. Administration tools are also affected. We allocated different (virtual) IP addresses because some of our existing internal tools depend on “port=3306”: bind-address=“virtual ip address” in my.cnf. Creating separate directories and files: socket files, data directories, InnoDB files, binary log files etc should be stored in different locations from each other. Storing some files on HDD, others on SSD: binary logs, relay logs, redo logs, error/slow logs, ibdata0 (files where the doublewrite buffer is written), and backup files on HDD; others on SSD 28
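The per-instance separation described on slide 28 could be expressed as a generated `[mysqld]` section for each instance. The sketch below is illustrative only: the helper names, the instance names, the `/ssd/mysql` and `/hdd/mysql` roots and the example addresses are all hypothetical, and enabling `innodb_file_per_table` is an assumption made so that table data stays in the SSD data directory while the system tablespace and logs sit on HDD.

```python
from typing import Dict


def instance_options(name: str, virtual_ip: str,
                     ssd_root: str = "/ssd/mysql",
                     hdd_root: str = "/hdd/mysql") -> Dict[str, str]:
    """my.cnf [mysqld] options for one instance; all paths are hypothetical."""
    ssd = f"{ssd_root}/{name}"
    hdd = f"{hdd_root}/{name}"
    return {
        # every instance keeps port 3306 and is told apart by its virtual IP
        "bind-address": virtual_ip,
        "port": "3306",
        # SSD: socket and table data
        "socket": f"{ssd}/mysql.sock",
        "datadir": f"{ssd}/data",
        "innodb_file_per_table": "1",
        # HDD: system tablespace (doublewrite buffer), redo, binary, relay, error logs
        "innodb_data_home_dir": f"{hdd}/innodb",
        "innodb_log_group_home_dir": f"{hdd}/innodb",
        "log-bin": f"{hdd}/binlog/mysql-bin",
        "relay-log": f"{hdd}/relaylog/relay-bin",
        "log-error": f"{hdd}/log/error.log",
    }


def render(options: Dict[str, str]) -> str:
    """Format the options as a [mysqld] section."""
    return "\n".join(["[mysqld]"] + [f"{k} = {v}" for k, v in options.items()])
```

Calling `render(instance_options("game1_slave", "192.0.2.11"))` yields one section; a second instance with another name and address gets its own non-overlapping paths.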
  29. 29. Optimizing for Social Game workloads. Users can easily increase by millions in a few days. Database size grows rapidly, especially if the PK is “user_id + xxx_id” (e.g. item_id); growing by GBs per day is not uncommon. Scaling reads is not difficult: adding slaves or adding caching servers. Scaling writes is not trivial: sharding, scaling up. Solutions depend on what kinds of tables we’re using, INSERT/UPDATE/DELETE workloads, etc 29
  30. 30. INSERT-mostly tables. History tables such as access logs, diary, battle history. INSERT and SELECT mostly. Secondary index is needed (user_id, etc). Table size becomes huge (easily exceeding 1TB). Locality (most SELECTs go to recent data). INSERT performance in general: fast in InnoDB (thanks to “Insert Buffering”; much faster than MyISAM). To modify index leaf blocks, they have to be in the buffer pool. When index size becomes too large to fit in the buffer pool, disk reads happen. In-memory workloads -> disk-bound workloads: suddenly suffering from serious performance slowdown; UPDATE/DELETE/SELECT also getting much slower. No faster storage device can compete with in-memory workloads 30
  31. 31. INSERT gets slower. [Chart: time to insert 1 million records (InnoDB, HDD); y-axis: seconds (0-600); x-axis: existing records (millions, 1-145); series: sequential order and random order; annotations: 10,000 rows/s, 2,000 rows/s, and “Index size exceeded buffer pool size”] Secondary index size exceeded the InnoDB buffer pool size at 73 million records for the random order test. Insertion gradually took more time because the buffer pool hit ratio got worse (more random disk reads were needed). For sequential order inserts, insertion time did not change: no random reads/writes 31
  32. 32. INSERT performance difference. In-memory INSERT throughput: 15000+ insert/s from a single thread on recent H/W. Exceeding the buffer pool, starting disk reads: degrading to 2000-4000 insert/s on HDD, single thread; 6000-8000 insert/s on multi-threaded workloads. Serious replication delay often happens. Faster storage does not solve everything: at most 5000 insert/s on the fastest SSDs such as tachIOn/FusionIO; InnoDB actually uses CPU resources quite a lot for disk i/o bound inserts (e.g. calculating checksums, malloc/free). It is important to minimize index size so that INSERT can complete in memory 32
  33. 33. Approach to complete INSERT in memory. [Diagram: a single big physical table (index) split into Partition 1, Partition 2, Partition 3, Partition 4] Range partition by datetime: supported since MySQL 5.1. Index size per partition becomes total_index_size / number_of_partitions. INT or TIMESTAMP enables hourly based partitions; TIMESTAMP does not support partition pruning. Old partitions can be dropped by ALTER TABLE .. DROP PARTITION 33
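Hourly range partitions as described on slide 33 are usually maintained by a scheduled job that adds upcoming partitions and drops expired ones. The sketch below only generates the statements and makes several assumptions: the table name `battle_history`, the `pYYYYMMDDHH` naming scheme and both function names are hypothetical; the table is assumed to be already partitioned by RANGE on an INT column holding a UNIX timestamp, with no MAXVALUE partition, so that ADD PARTITION is allowed.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def add_hourly_partitions(table: str, first_hour: datetime, hours: int) -> List[str]:
    """ALTER TABLE statements adding one RANGE partition per hour.

    Each partition holds rows whose INT timestamp is below the end of its hour.
    """
    statements = []
    for i in range(hours):
        start = first_hour + timedelta(hours=i)
        upper = int((start + timedelta(hours=1)).timestamp())
        name = start.strftime("p%Y%m%d%H")
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION "
            f"(PARTITION {name} VALUES LESS THAN ({upper}))"
        )
    return statements


def drop_hourly_partition(table: str, hour: datetime) -> str:
    """ALTER TABLE statement dropping the partition that covers the given hour."""
    return f"ALTER TABLE {table} DROP PARTITION {hour.strftime('p%Y%m%d%H')}"
```

Dropping a partition removes its rows and its slice of every index at once, which is what keeps the index being inserted into small enough to stay in the buffer pool.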
  34. 34. Optimizing UPDATE, DELETE, SELECT. Using SSD is really, really helpful. IOPS difference is significant: updates in memory: 15,000/s; on HDD: 300/s; on SATA SSD: 1,800/s; on PCI-E SSD: 4,000/s. We have used SATA SSD with RAID0 on slaves. Now we are gradually adding PCI-E SSD (FusionIO and tachIOn), consolidating 6-10 MySQL instances. If all data fit in memory and traffic is very high, using NoSQL is helpful. We use HandlerSocket on the user database (pk: user_id); its database size is less than the InnoDB buffer pool size. Check Oracle’s memcached API project; it should be very easy to use 34
  35. 35. Large-HDD servers and SSD servers. “History Shard”: putting history data (comments, logs, etc) here; using range partitioning; large enough HDD with RAID 10: 900GB (10K RPM) x 8 or 300GB (15K RPM) x 10 HDD; data size tends to be huge, but doesn’t matter so much. “Application Shard”: middle range SSD (including SATA SSD), or PCI-E SSD; data size matters a lot 35
  36. 36. Our near-future deployments. [Diagram: PCI-E or SATA/SAS SSD servers (master and slave/backup) holding Game1_shard1, Game1_shard2, Game1_shard3, Game1_shard4, Game2_shard1, Game2_shard2; large HDD servers (master and slave/backup) holding Game1_history_shard1, Game1_history_shard2, Game1_history_shard3, Game1_history_shard4] By moving history tables, application data size can be decreased significantly (less than 30%), so PCI-E servers can consolidate shards a lot. Workloads on HDD servers are mostly in-memory, so they can consolidate good numbers of shards. A server crash causes failure of multiple shards, so automated failover is important 36
  37. 37. Summary. Automated master failover and easier master maintenance are important to manage hundreds of master servers: scaling up, scaling down, version up, etc. Using MHA will help a lot: configuring MHA does not require MySQL settings changes; master failover in 10-30 seconds, without a passive server; moving a master can be done in 0.5-2 seconds of downtime. Optimizing MySQL for faster H/W: deploying history tables (insert-mostly tables, hundreds of GBs) on HDD; deploying application tables on PCI-E SSD; consolidating multiple MySQL instances on a single box 37