MySQL for large scale
     social games


                 Yoshinori Matsunobu

Principal Infrastructure Architect, Oracle ACE Director at DeNA
 Former APAC Lead MySQL Consultant at MySQL/Sun/Oracle
   Yoshinori.Matsunobu@gmail.com, Twitter: @matsunobu
            http://yoshinorimatsunobu.blogspot.com/
Table of contents

   Easier maintenance and automating failover
     Non-stop master migration
     Automated master failover
     New Open Source Software: “MySQL MHA”


   Optimizing MySQL for faster H/W




Company Introduction: DeNA

   One of the largest social game providers in Japan
     Both social game platform and social games themselves
     Subsidiary ngmoco:) in San Francisco
   Feature phone, smartphone, and PC games localized for Japan
   2-3 billion page views per day
   25+ million users
   1000+ MySQL servers, 150+ master/slave pairs
   $1.3B revenue in 2010
Games are expanding / shrinking rapidly
    It is very difficult to predict social game workloads
       Traffic is sometimes unexpectedly high, sometimes much lower than
       expected

    Traffic for each social game tends to decline after months or years

    For expanding games
       Adding slaves
       Adding more shards
        – It is possible to add shards without stopping services
       Scaling up the master's H/W
        – More RAM, HDD -> SSD / PCI-E SSD, faster network, etc.

    For shrinking games
       Decreasing slaves
       Migrating the master to a lower-spec machine
       Consolidating a few masters/slaves onto a single machine
Desire for Easier Operations
   We want to move master servers more easily
      Scaling up: increasing RAM, replacing with faster SSD
      Upgrading MySQL: results in 10+ minutes of downtime to warm the
      buffer pool
      Scaling down: moving unpopular games to lower-spec servers
      Working around power outages: moving games to a remote datacenter


   If you can allocate maintenance downtime it's easy, but we
   can't do that many times
      Announcing to users, coordinating with customer support, etc.
      Longer downtime reduces revenue
      Operations staff get exhausted by too much midnight work


   Reducing maintenance time is important for managing hundreds
   or thousands of MySQL servers
Switching master in seconds

   If we can switch a master in less than 3 seconds, that is
   acceptable in most of our cases
     Stop updates on the master
     Wait until at least one of the slaves (the new master) has
     synced with the current master
     Grant writes and allocate a virtual IP (etc.) to the new master
     All remaining slaves start replication from the new master




Blocking writes on master
  MySQL provides several commands/solutions to block writes, but
  not all of them are safe
     FLUSH TABLES WITH READ LOCK
      – Clients will wait forever unless timeouts are set on the client side
      – Running transactions will eventually be aborted
            "Updating master1 -> updating master2 -> committing master1 -> getting an error on
             committing master2" will result in data inconsistency
      – Flushing all tables sometimes takes a very long time
            Run "FLUSH NO_WRITE_TO_BINLOG TABLES" beforehand
     SET GLOBAL read_only = 1
      – Clients get errors immediately
      – Running transactions will be aborted
     Dropping the MySQL user (used by applications)
      –   New MySQL connections can not be established from applications
      –   Current sessions are NOT terminated until they disconnect
      –   Current sessions do not encounter errors
      –   Works with non-persistent connections only
Trade-off between safeness and performance
  What we are now doing at DeNA is:
     Check that there are no long-running updates
      – 100 seconds of updates will take 100 seconds to replay on slaves
     Drop the app user -- downtime starts
     Wait for a while (2 seconds maximum) until all active
     application sessions are disconnected
      – Ignore replication threads and sessions sleeping 1 second or more (highly
        likely daemon programs or unused sessions, which can be killed safely)
      – Do not kill active sessions immediately
     Execute FLUSH TABLES WITH READ LOCK when there
     are no active sessions or 2 seconds have passed
     Start slave promotion -- downtime ends

     At most 1 second is enough for the whole process
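The session-filtering rule above can be sketched as a small function. This is a hypothetical illustration, not DeNA's actual tooling: the dict fields mirror SHOW PROCESSLIST columns, and the "repl" user name is a placeholder.

```python
# Sketch of the filtering rule: before blocking writes, ignore
# replication threads and long-idle sessions; only genuinely active
# application sessions get the (up to 2 second) grace period.
def killable_now(session):
    """Sessions needing no grace period: replication threads and
    connections idle for 1 second or more (likely daemons/unused)."""
    if session["user"] in ("system user", "repl"):  # placeholder repl user
        return True
    return session["command"] == "Sleep" and session["time"] >= 1

def sessions_to_wait_for(processlist):
    """Active application sessions we give up to 2 seconds to finish."""
    return [s for s in processlist if not killable_now(s)]

processlist = [
    {"id": 1, "user": "app", "command": "Query", "time": 0},
    {"id": 2, "user": "app", "command": "Sleep", "time": 30},
    {"id": 3, "user": "repl", "command": "Binlog Dump", "time": 1000},
]
print([s["id"] for s in sessions_to_wait_for(processlist)])  # [1]
```

Only session 1 (an in-flight query) is waited for; the idle session and the replication thread are excluded up front.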
Our solution
  Developing "MySQL-MHA: Master High Availability manager and tools"
     http://code.google.com/p/mysql-master-ha
     This is an automated failover tool, but it can also
     be used for fast online master switch

  Switching the original master to the new master gracefully
     From: host1 (current master), host2 (backup), host3 (slave),
     host4 (slave), host5 (remote)
     To:   host2 (new master), host3 (slave), host4 (slave),
     host5 (remote)

  We have switched 10+ masters so far, with only 0.5-1 second of
  downtime each time
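Such a graceful switch is driven from MHA's `masterha_master_switch` command. A representative invocation might look like the following; the config path and hostname are placeholders, not values from the deck:

```sh
# Graceful online master switch: the current master is alive, host2
# becomes the new master, and the old master rejoins as a slave.
masterha_master_switch --conf=/etc/masterha/app1.cnf \
  --master_state=alive \
  --new_master_host=host2 \
  --orig_master_is_new_slave
```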
Master Failover: What makes it difficult?
   MySQL replication is asynchronous. When the master crashes, it is
   likely that some (or all) slaves have not received all binary log
   events from the crashed master, and that only some slaves have
   received the latest events.

   Example: the master (holding the writer IP) has executed events
   id=99 to id=102. id=102 has not been replicated to any slave.
   Slave 2 is the most up to date of the slaves, but slave 1 and
   slave 3 have lost some events.

   Recovery therefore requires:
   1. Saving the binlog events that exist on the master only (copying
      id=102 from the master, if possible)
   2. Identifying which events were not sent to each slave
   3. Applying all differential events; otherwise data inconsistency
      happens
Current stable HA solutions and issues

 Pacemaker (Heartbeat) + DRBD (or shared disk)
    Cost: an additional passive master server (not handling any application traffic)
    Performance: for HA to really work on DRBD replication environments, innodb-
    flush-log-at-trx-commit and sync-binlog must be 1, but these kill write performance.
    Otherwise necessary binlog events might be lost on the master; then slaves can't
    continue replication, and data consistency issues happen

 MySQL Cluster
    MySQL Cluster is really highly available, but unfortunately we use InnoDB

 Others
    Unstable, too complex, too hard to operate/administer, wrong or missing documentation
    Not working with standard MySQL (are you saying we have to migrate all 150+
    applications to bleeding-edge distributions?)
    Not working with remote datacenters, etc.
Our solution: Developing MySQL-MHA
 MySQL Master High Availability manager and tools
 http://code.google.com/p/mysql-master-ha

 MySQL-MasterHA-Manager, running on a separate manager server:
 - masterha_manager
 - other helper commands

 MySQL-MasterHA-Node, installed on each MySQL server (master and slaves):
 - save_binary_logs
 - apply_diff_relay_logs
 - purge_relay_logs

 The Manager pings the master for availability. When it detects master
 failure, it promotes one of the slaves to the new master and fixes
 consistency issues between the slaves.
Internals: steps for recovery
   On each slave(i), wait until its SQL thread has executed all events
   already in its relay log (up to the final Relay_Log_File /
   Relay_Log_Pos).

   Using each slave's Master_Log_File / Read_Master_Log_Pos, determine:
      (i1) the partial transaction, if any, at the tail of slave(i)'s
           relay log
      (i2) the differential relay log events from slave(i)'s read
           position to the latest slave's read position
      (X)  the differential binary log events from the latest slave's
           read position to the tail of the dead master's binary log

   Then, on each slave(i), apply i1 -> i2 -> X
      – On the latest slave, i2 is empty
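The i2/X bookkeeping can be expressed as a toy model over integer event ids (this is not MHA's actual implementation, and the per-slave read positions are chosen for illustration; the slide only says slave 1 and slave 3 lost some events):

```python
# Toy model of the recovery arithmetic: events carry integer ids, and
# each slave has read events up to some position.
def recovery_plan(master_events, slave_read_pos):
    latest = max(slave_read_pos.values())          # latest slave's read position
    x = [e for e in master_events if e > latest]   # (X): events only on the dead master
    plan = {}
    for slave, pos in slave_read_pos.items():
        # (i2): events the latest slave has received but this slave has not
        i2 = [e for e in master_events if pos < e <= latest]
        plan[slave] = i2 + x                       # apply i2, then X (i1 omitted here)
    return plan

# The slide's example: master executed 99..102; id=102 reached no slave;
# slave2 is the latest (read up to 101), slave1/slave3 read up to 100.
plan = recovery_plan([99, 100, 101, 102],
                     {"slave1": 100, "slave2": 101, "slave3": 100})
print(plan["slave1"])  # [101, 102]
print(plan["slave2"])  # [102]  (i2 is empty on the latest slave)
```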
Advantages of MySQL MHA
   Master failover and slave promotion can be done very quickly
      Total downtime can be 10-30 seconds

   Master crash does not result in data inconsistency

   No need to modify current MySQL settings
      We use MHA for 150+ normal MySQL 5.0/5.1/5.5 masters, without
      modifying anything

   Problems of MHA do not result in MySQL failure
      You can install/uninstall/upgrade/downgrade/restart without stopping
      MySQL

   No need to add lots of servers
   No performance penalty
   Works with any storage engine
   Can also be used for failback (fast online master switch)
MySQL MHA Project Info
   Project top page
      http://code.google.com/p/mysql-master-ha/

   Documentation
      http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6

   Source tarball and rpm package (stable release)
      http://code.google.com/p/mysql-master-ha/downloads/list

   The latest source repository (dev release)
       https://github.com/yoshinorim/MySQL-MasterHA-Manager (Manager source)
       https://github.com/yoshinorim/MySQL-MasterHA-Node (per-MySQL-server source)

   SkySQL provides commercial support for MHA
Table of contents

   Easier maintenance and automating failover
     Non-stop master migration
     Automated master failover
     New Open Source Software: “MySQL MHA”


   Optimizing MySQL for faster H/W




Per-server performance is important

   To handle 1 million queries per second...
     1000 queries/sec per server : 1000 servers in total
     10000 queries/sec per server : 100 servers in total


   The additional 900 servers would cost $10M initially, plus $1M
   every year

   If you can increase per-server throughput, you can
   reduce the total number of servers, which decreases
   TCO

   Sharding is not everything
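The arithmetic above is simple enough to spell out:

```python
# Server-count arithmetic from the slide: total workload divided by
# per-server throughput, rounded up.
def servers_needed(total_qps, per_server_qps):
    return -(-total_qps // per_server_qps)  # ceiling division

low = servers_needed(1_000_000, 1_000)    # 1000 servers
high = servers_needed(1_000_000, 10_000)  # 100 servers
print(low - high)  # 900 extra servers at the lower per-server throughput
```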
History of MySQL performance improvements

   H/W improvements
      HDD RAID, write cache
      Large RAM
      SATA SSD, PCI-Express SSD
      More CPU cores
      Faster network


   S/W improvements
      Improved algorithms (I/O scheduling, swap control, etc.)
      Much better concurrency
      Avoiding stalls
      Improved space efficiency (compression, etc.)
32bit Linux
    Typical deployment: masters taking updates, with 2GB RAM and 20GB
    HDD RAID each, plus many slaves

    Random disk I/O speed (IOPS) on HDD is very slow
        100-200/sec per drive
    Databases easily became disk-I/O bound, regardless of disk size
    Applications could not handle large data (i.e. 30GB+ per server)
    Lots of database servers were needed
    Per-server traffic was not so high because neither the number of
    users nor the data volume per server was very high
        Backup and restore completed in a short time
    MyISAM was widely used because it's very space efficient and fast
64bit Linux + large RAM + BBWC

  Typical deployment: 16GB RAM, 120GB HDD RAID, plus many slaves

  Memory prices went down, and 64bit Linux became mature
  It became common to deploy 16GB or more RAM on a single Linux machine
  Memory hit ratios increased, and much larger data could be stored
  The number of database servers decreased (consolidated)
  Per-server traffic increased (the number of users per server increased)
  "Transaction commit" overhead was dramatically reduced thanks to
  battery-backed write cache (BBWC)
  From the database point of view,
      InnoDB became faster than MyISAM (row-level locks, etc.)
      Direct I/O became common
Side effect caused by fast server
   After 16-32GB RAM became common, we could serve many more users and
   more data per server
      Write traffic per server also increased
   4-8 disk RAID 5/10 also became common, which improved concurrency a lot
   On 6-HDD RAID 10, single-thread IOPS is around 200; 100-thread IOPS
   is around 1000-2000
   Good parallelism on both reads and writes on the master (HDD RAID)

   On slaves, there is only one writer thread (the SQL thread), so
   there is no parallelism on writes
      6-HDD RAID 10 is as slow as a single HDD for writes

   Slaves (also on HDD RAID) became a performance bottleneck earlier
   than the master

   Serious replication delay happened (10+ minutes at peak time)
Using SATA SSD on slaves
   IOPS differences between the master (1000+) and slaves (100+)
   caused serious replication delay
   Is there any way to gain high enough IOPS from a single thread?

   Read IOPS on SATA SSD is 3000+, which should be enough (15 times
   better than HDD)
   Just replacing HDD with SSD on the slaves solved the replication
   delay
   Overall read throughput became much better
   Using SSD on the master was still risky
   Using SSD on slaves (IOPS: 100+ -> 3000+) was more effective than
   using it on the master (IOPS: 1000+ -> 3000+)
   We mainly deployed SSD on slaves (master: HDD RAID, slaves: SATA SSD)
      The number of slaves could be reduced
   From the MySQL point of view, good concurrency on the master's HDD
   RAID was still required: InnoDB Plugin
How about PCI-Express SSD?
  Deploying on both master and slaves?
     If PCI-E SSD is used on the master, replication delay will happen again
       – 10,000 IOPS from a single thread, 40,000+ IOPS from 100 threads
     10,000 IOPS from 100 threads can be achieved with SATA SSD
     Parallel SQL threads would need to be implemented in MySQL

  Deploying on slaves only?
     If using HDD on the master, SATA SSD should be enough to handle the workload
       – PCI-Express SSD is much more expensive than SATA SSD
     How about running multiple MySQL instances on a single server?
       – Virtualization is not fast
       – Running multiple MySQL instances on a single OS is more reasonable

  Does PCI-E SSD have enough storage capacity to run multiple instances?
     On HDD environments, typically only 100-200GB of database data can be stored
     per server because of HDD's slow random IOPS
     FusionIO SLC: 320GB Duo + 160GB = 480GB
     FusionIO MLC: 1280GB Duo + 640GB = 1920GB
     tachIOn SLC: 800GB x 2 = 1600GB
Running multiple slaves on single box
Before: each master (M) had its own dedicated slaves (S1-S3) and a
backup server (B). After: the slaves of multiple masters run as
multiple MySQL instances consolidated on shared PCI-E SSD boxes, while
the masters and backup servers stay on HDD.

       Running multiple slaves on a single PCI-E SSD server
          Master and backup servers are still HDD based
          Consolidating multiple slaves
          Since a slave's SQL thread is single threaded, you can gain better
          concurrency by running multiple instances
          The number of instances is mainly restricted by storage capacity
Our environment
   Machine
      HP DL360G7 (1U), or Dell R610
   PCI-E SSD
      FusionIO MLC (640GB Duo + 320GB non-Duo)
      tachIOn SLC (800GB x 2)
   CPU
      Two sockets, Nehalem 6-core per socket, HT enabled
         – 24 logical CPU cores are visible
         – Four-socket machines are too expensive
   RAM
      60GB or more
   Network
      Broadcom BCM5709, Four ports
      Using four network cables + bonding mode 4 + link aggregation
         – BONDING_OPTS="miimon=100 mode=4 lacp_rate=1 xmit_hash_policy=1"
   HDD
      4-8 SAS RAID1+0
      For backups, redo logs, relay logs, (optionally) doublewrite buffer
Benchmarks on our real workloads
  Consolidating 7 instances on FusionIO (640GB MLC Duo + 320GB MLC)
      Half of the SELECT queries go to these slaves
      6GB innodb_buffer_pool_size

  Peak QPS (total of 7 instances)
      61683.7 query/s
      37939.1 select/s
      7861.1 update/s
      1105 insert/s
      1843 delete/s
      3143.5 begin/s

  CPU utilization
      %user 27.3%, %sys 11% (%soft 4%), %iowait 4%
      Cf. SATA SSD: %user 4%, %sys 1%, %iowait 1%

  Buffer pool hit ratio
      99.4%
      SATA SSD (single instance/server): 99.8%

  No replication delay
  No significant (100ms+) response time delay caused by SSD
CPU loads




   22:10:57     CPU   %user   %nice    %sys %iowait     %irq   %soft %steal     %idle   intr/s
   22:11:57     all   27.13    0.00    6.58    4.06     0.14    3.70   0.00     58.40 56589.95
   …
   22:11:57      23   30.85    0.00    7.43     0.90    1.65   49.78     0.00    9.38 44031.82
  CPU utilization was high, but the box should be able to handle more
      %user 27.3%, %sys 11% (%soft 4%), %iowait 4%
      We reached the storage capacity limit (960GB); using 1920GB MLC should make it
      possible to handle more instances

  Network became the first bottleneck
      Recv: 14.6MB/s, Send: 28.7MB/s
      CentOS 5 + bonding is not good at handling network requests (only a single CPU
      core handles them); the result above was taken with a normal bond0 setup
      We are now using link aggregation (bonding mode 4) with 4 network cables, and
      the CPU bottleneck went away
Things to consider
   To run multiple MySQL instances on a single server,
   you need to allocate different IP addresses or port numbers
      Administration tools are also affected
      We allocated different (virtual) IP addresses because some existing
      internal tools depend on "port=3306"
      bind-address="virtual ip address" in my.cnf

   Creating separate directories and files
      Socket files, data directories, InnoDB files, binary log files, etc.
      should be stored in different locations for each instance

   Storing some files on HDD, others on SSD
      Binary logs, relay logs, redo logs, error/slow logs, ibdata1 (the file
      the doublewrite buffer is written to), and backup files on HDD
      Everything else on SSD
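A minimal sketch of generating a per-instance my.cnf fragment that follows the layout above. The helper, paths, and IP are hypothetical illustrations, not DeNA's actual configuration:

```python
# Hypothetical generator for per-instance my.cnf fragments: each
# instance gets its own virtual IP (keeping port=3306 for existing
# tools), data files on SSD, and binary/redo logs on HDD.
def instance_cnf(n, vip):
    return "\n".join([
        "[mysqld]",
        f"bind-address = {vip}",                 # one virtual IP per instance
        "port = 3306",                           # internal tools assume 3306
        f"socket = /var/lib/mysql{n}/mysql.sock",
        f"datadir = /ssd/mysql{n}/data",                    # data on SSD
        f"log-bin = /hdd/mysql{n}/binlog/mysql-bin",        # binlogs on HDD
        f"innodb_log_group_home_dir = /hdd/mysql{n}/redo",  # redo logs on HDD
    ])

print(instance_cnf(1, "10.0.0.101"))
```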
Optimizing for Social Game workloads
    Games can easily gain millions of users in a few days
       Database size grows rapidly
         – Especially if the PK is "user_id + xxx_id" (e.g. item_id)
         – Growing by GBs per day is not uncommon


    Scaling reads is not difficult
       Adding slaves or adding caching servers


    Scaling writes is not trivial
       Sharding, scaling up


    Solutions depend on what kinds of tables we're using, the
    INSERT/UPDATE/DELETE workload, etc.
INSERT-mostly tables
  History tables such as access logs, diaries, battle history
     INSERT and SELECT mostly
     A secondary index is needed (user_id, etc.)
     Table size becomes huge (easily exceeding 1TB)
     Locality (most SELECTs go to recent data)


  INSERT performance in general
     Fast in InnoDB (thanks to insert buffering; much faster than MyISAM)
     To modify index leaf blocks, they have to be in the buffer pool
     When the index becomes too large to fit in the buffer pool, disk reads
     happen
     In-memory workloads -> disk-bound workloads
      – Suddenly suffering serious performance slowdown
      – UPDATE/DELETE/SELECT also get much slower
     No storage device is fast enough to compete with in-memory workloads
INSERT gets slower
           [Chart: time (seconds) to insert 1 million records (InnoDB, HDD)
           vs. existing records (millions, 1-145). Sequential-order inserts
           stay flat at ~10,000 rows/s; random-order inserts degrade to
           ~2,000 rows/s after the index size exceeds the buffer pool size.]

           The secondary index size exceeded the InnoDB buffer pool size at
           73 million records in the random-order test
           Inserts gradually take more time because the buffer pool hit ratio
           keeps getting worse (more random disk reads are needed)
           For sequential-order inserts, insertion time did not change:
           no random reads/writes
INSERT performance difference
   In-memory INSERT throughput
      15000+ insert/s from a single thread on recent H/W

   Exceeding the buffer pool, disk reads start
      Degrading to 2000-4000 insert/s on HDD, single thread
      6000-8000 insert/s on multi-threaded workloads


   Serious replication delay often happens

   Faster storage does not solve everything
      At most 5000 insert/s even on the fastest SSDs such as tachIOn/FusionIO
       – InnoDB actually uses a lot of CPU for disk-I/O-bound inserts (i.e.
         calculating checksums, malloc/free)


   It is important to minimize index size so that INSERTs can
   complete in memory
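A back-of-the-envelope way to see where the crossover lands. The bytes-per-row figure is an illustrative assumption, not a measured value from the deck:

```python
# Rough estimate of how many rows fit before the secondary index
# outgrows the buffer pool and inserts become disk bound.
def rows_until_disk_bound(buffer_pool_bytes, index_bytes_per_row):
    return buffer_pool_bytes // index_bytes_per_row

# e.g. a 10GB buffer pool with ~147 index bytes per row crosses over
# around 73 million rows, roughly matching the chart's shape.
print(rows_until_disk_bound(10 * 1024**3, 147) // 1_000_000)  # 73
```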
Approach to complete INSERT in memory
    [Diagram: one single big physical table (index) split into
    Partitions 1-4.]

    Range partitioning by datetime
        Available since MySQL 5.1
        Index size per partition becomes total_index_size / number_of_partitions
        An INT or TIMESTAMP column enables hourly based partitions
          – but TIMESTAMP does not support partition pruning
        Old partitions can be dropped with ALTER TABLE .. DROP PARTITION
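In the spirit of the slide, here is a sketch that generates hourly RANGE-partition DDL over an INT column holding YYYYMMDDHH values. The table and column names are hypothetical:

```python
# Generate MySQL 5.1-style hourly RANGE partition DDL over an INT
# column storing YYYYMMDDHH values (table/column names are made up).
def hourly_partition_ddl(table, column, hours):
    parts = ",\n".join(
        f"  PARTITION p{h} VALUES LESS THAN ({h + 1})" for h in hours
    )
    return (f"ALTER TABLE {table}\n"
            f"PARTITION BY RANGE ({column}) (\n{parts}\n);")

ddl = hourly_partition_ddl("battle_history", "created_hour",
                           [2011101500, 2011101501])
print(ddl)
# An old partition would later be dropped with, e.g.:
#   ALTER TABLE battle_history DROP PARTITION p2011101500;
```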
Optimizing UPDATE, DELETE, SELECT
   Using SSD is really, really helpful
      IOPS difference is significant
       –   Updates in memory: 15,000/s
       –   On HDD : 300/s
       –   On SATA SSD: 1,800/s
       –   On PCI-E SSD : 4,000/s

      We have used SATA SSD with RAID0 on slaves
      Now we are gradually increasing PCI-E SSD (FusionIO and tachIOn),
      consolidating 6-10 MySQL instances

   If all data fits in memory and traffic is very high, using
   NoSQL is helpful
      We use HandlerSocket on users' databases (PK: user_id)
       – Database size is less than the InnoDB buffer pool size
      Also check Oracle's memcached API project; it should be very easy to use
Large-HDD servers and SSD servers

   “History Shard”
     Putting history data (comments, logs, etc) here
     Using range partitioning
     Large enough HDD with RAID 10
      – 900GB (10K RPM) x 8 or 300GB (15K RPM) x 10 HDD
     Data size tends to be huge, but doesn’t matter so much


   “Application Shard”
     Middle range SSD (including SATA SSD), or PCI-E SSD
     Data size matters a lot

Our near-future deployments
  PCI-E or SATA/SAS SSD servers: masters holding application shards
  (Game1_shard1-4, Game2_shard1-2, ...), each with a slave/backup

  Large HDD servers: masters holding history shards
  (Game1_history_shard1-4, ...), each with a slave/backup

  By moving history tables out, application data size can be decreased
  significantly (to less than 30%), so PCI-E servers can consolidate shards a lot
  Mostly in-memory workloads on the HDD servers, so they can consolidate a good
  number of shards
  A server crash now causes multiple shard failures
         Automated failover is important
Summary
  Automated master failover and easier master
  maintenance are important for managing hundreds of
  master servers
    Scaling up, scaling down, upgrades, etc.
    Using MHA helps a lot
     – Configuring MHA does not require MySQL settings changes
     – Master failover in 10-30 seconds, without a passive server
     – Moving a master can be done with 0.5-2 seconds of downtime


  Optimizing MySQL for faster H/W
    Deploying history tables (insert-mostly tables, hundreds of
    GBs) on HDD
    Deploying application tables on PCI-E SSD
    Consolidating multiple MySQL instances on a single box

 
Launching Assassin's Creed with the Eagle Vision
Launching Assassin's Creed with the Eagle VisionLaunching Assassin's Creed with the Eagle Vision
Launching Assassin's Creed with the Eagle Vision
 
ubisoftfinal
ubisoftfinalubisoftfinal
ubisoftfinal
 
Introducing MySQL MHA (JP/LT)
Introducing MySQL MHA (JP/LT)Introducing MySQL MHA (JP/LT)
Introducing MySQL MHA (JP/LT)
 
1 structure of the videogame industry
1   structure of the videogame industry1   structure of the videogame industry
1 structure of the videogame industry
 
MySQL Indexing - Best practices for MySQL 5.6
MySQL Indexing - Best practices for MySQL 5.6MySQL Indexing - Best practices for MySQL 5.6
MySQL Indexing - Best practices for MySQL 5.6
 
Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)Linux performance tuning & stabilization tips (mysqlconf2010)
Linux performance tuning & stabilization tips (mysqlconf2010)
 
Ubisoft Entertainment
Ubisoft EntertainmentUbisoft Entertainment
Ubisoft Entertainment
 
The games industry- Industry Structure
The games industry- Industry StructureThe games industry- Industry Structure
The games industry- Industry Structure
 
Ubisoft
UbisoftUbisoft
Ubisoft
 
MySQL Performance Tips & Best Practices
MySQL Performance Tips & Best PracticesMySQL Performance Tips & Best Practices
MySQL Performance Tips & Best Practices
 

Similar to MySQL for Large Scale Social Games

Buytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemakerBuytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemaker
kuchinskaya
 
MySQL High Availability Solutions
MySQL High Availability SolutionsMySQL High Availability Solutions
MySQL High Availability Solutions
Lenz Grimmer
 
Mysqlhacodebits20091203 1260184765-phpapp02
Mysqlhacodebits20091203 1260184765-phpapp02Mysqlhacodebits20091203 1260184765-phpapp02
Mysqlhacodebits20091203 1260184765-phpapp02
Louis liu
 

Similar to MySQL for Large Scale Social Games (20)

MySQL Utilities -- Cool Tools For You: PHP World Nov 16 2016
MySQL Utilities -- Cool Tools For You: PHP World Nov 16 2016MySQL Utilities -- Cool Tools For You: PHP World Nov 16 2016
MySQL Utilities -- Cool Tools For You: PHP World Nov 16 2016
 
Mysql replication @ gnugroup
Mysql replication @ gnugroupMysql replication @ gnugroup
Mysql replication @ gnugroup
 
Mysql-MHA
Mysql-MHAMysql-MHA
Mysql-MHA
 
Mysql Latency
Mysql LatencyMysql Latency
Mysql Latency
 
Buytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemakerBuytaert kris my_sql-pacemaker
Buytaert kris my_sql-pacemaker
 
MySQL 5.6 Global Transaction Identifier - Use case: Failover
MySQL 5.6 Global Transaction Identifier - Use case: FailoverMySQL 5.6 Global Transaction Identifier - Use case: Failover
MySQL 5.6 Global Transaction Identifier - Use case: Failover
 
MySQL 5.7 clustering: The developer perspective
MySQL 5.7 clustering: The developer perspectiveMySQL 5.7 clustering: The developer perspective
MySQL 5.7 clustering: The developer perspective
 
MySQL Replication Update -- Zendcon 2016
MySQL Replication Update -- Zendcon 2016MySQL Replication Update -- Zendcon 2016
MySQL Replication Update -- Zendcon 2016
 
MySQL Replication Overview -- PHPTek 2016
MySQL Replication Overview -- PHPTek 2016MySQL Replication Overview -- PHPTek 2016
MySQL Replication Overview -- PHPTek 2016
 
MySQL Replication Basics -Ohio Linux Fest 2016
MySQL Replication Basics -Ohio Linux Fest 2016MySQL Replication Basics -Ohio Linux Fest 2016
MySQL Replication Basics -Ohio Linux Fest 2016
 
MySQL HA with PaceMaker
MySQL HA with  PaceMakerMySQL HA with  PaceMaker
MySQL HA with PaceMaker
 
MySQL High Availability Solutions
MySQL High Availability SolutionsMySQL High Availability Solutions
MySQL High Availability Solutions
 
Mysqlhacodebits20091203 1260184765-phpapp02
Mysqlhacodebits20091203 1260184765-phpapp02Mysqlhacodebits20091203 1260184765-phpapp02
Mysqlhacodebits20091203 1260184765-phpapp02
 
MySQL High Availability Solutions
MySQL High Availability SolutionsMySQL High Availability Solutions
MySQL High Availability Solutions
 
Introduction to Galera Cluster
Introduction to Galera ClusterIntroduction to Galera Cluster
Introduction to Galera Cluster
 
Spil Games @ FOSDEM: Galera Replicator IRL
Spil Games @ FOSDEM: Galera Replicator IRLSpil Games @ FOSDEM: Galera Replicator IRL
Spil Games @ FOSDEM: Galera Replicator IRL
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Architecting cloud
Architecting cloudArchitecting cloud
Architecting cloud
 
MySQL HA Alternatives 2010
MySQL  HA  Alternatives 2010MySQL  HA  Alternatives 2010
MySQL HA Alternatives 2010
 

Recently uploaded

“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
Muhammad Subhan
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
FIDO Alliance
 

Recently uploaded (20)

AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
 
Vector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxVector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptx
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 

MySQL for Large Scale Social Games

  • 1. MySQL for large scale social games Yoshinori Matsunobu Principal Infrastructure Architect, Oracle ACE Director at DeNA Former APAC Lead MySQL Consultant at MySQL/Sun/Oracle Yoshinori.Matsunobu@gmail.com, Twitter: @matsunobu http://yoshinorimatsunobu.blogspot.com/ 1
  • 2. Table of contents Easier maintenance and automating failover Non-stop master migration Automated master failover New Open Source Software: “MySQL MHA” Optimizing MySQL for faster H/W 2
  • 3. Company Introduction: DeNA One of the largest social game providers in Japan Both social game platform and social games themselves Subsidiary ngmoco:) in San Francisco Japan localized phone, Smart Phone, and PC games 2-3 billion page views per day 25+ million users 1000+ MySQL servers, 150+ {master, slaves} pairs 1.3B$ revenue in 2010 3
  • 4. Games are expanding / shrinking rapidly It is very difficult to predict social game workloads: traffic is sometimes unexpectedly high, sometimes much lower than expected, and traffic for each social game tends to decline after months / years. For expanding games: Adding slaves; Adding more shards – It’s possible to add shards without stopping services; Scaling up master’s H/W – More RAM, HDD->SSD/PCI-E SSD, Faster NW, etc. For shrinking games: Decreasing slaves; Migrating master to a lower-spec machine; Consolidating a few masters/slaves within a single machine 4
  • 5. Desire for Easier Operations We want to move master servers more easily. Scaling-up: Increasing RAM, replacing with faster SSD. Upgrading MySQL: Results in 10 minutes or more of downtime to fill the buffer pool. Scaling-down: Moving unpopular games to lower-spec servers. Working around power outages: Moving games to a remote datacenter. If you can allocate maintenance downtime, it’s easy, but we can’t do so many times: announcing to users, coordinating with customer support, etc. Longer downtime reduces revenue, and operations staff will be exhausted by too much midnight work. Reducing maintenance time is important to manage hundreds or thousands of MySQL servers 5
  • 6. Switching master in seconds If we can switch a master in less than 3 seconds, it is acceptable in most of our cases Stopping updates on the master Waiting until at least one of the slaves (new master) has synced with the current master Granting writes, allocating virtual ip (etc) to the new master All the rest slaves start replication from the new master 6
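The wait-for-sync and repointing steps above can be sketched in SQL. This is a minimal sketch only; the host name, binlog file names, and positions are illustrative placeholders, not DeNA's exact procedure:

```sql
-- On the current master, after writes are blocked:
SHOW MASTER STATUS;   -- note File / Position, e.g. mysql-bin.000123 / 4567

-- On the candidate new master, wait until it has applied everything
-- (returns >= 0 once synced, -1 if the 5-second timeout expires):
SELECT MASTER_POS_WAIT('mysql-bin.000123', 4567, 5);

-- Get the new master's own coordinates:
SHOW MASTER STATUS;

-- On each remaining slave, repoint replication to the new master:
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST     = 'host2',                -- new master (illustrative)
  MASTER_LOG_FILE = 'mysql-bin.000001',     -- new master's coordinates
  MASTER_LOG_POS  = 107;
START SLAVE;
```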
  • 7. Blocking writes on master MySQL provides several commands/solutions to block writes, but not all of them are safe FLUSH TABLES WITH READ LOCK – Clients will wait forever, unless setting timeouts on client side – Running transactions will be aborted in the end “Updating master1 -> updating master 2 -> committing master1 -> getting error on committing master 2” will result in data inconsistency – Flushing all tables sometimes takes very long time Run “FLUSH NO_WRITE_TO_BINLOG TABLES” beforehand SET GLOBAL read_only = 1 – Getting errors immediately – Running transactions will be aborted Dropping MySQL user (used from applications) – Can not establish new MySQL connection from applications – Current sessions are NOT terminated until disconnect – Current sessions do not encounter errors – Works with non-persistent connections only 7
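The blocking options above, written out as statements (a hedged sketch; 'app' is a hypothetical application account name):

```sql
-- 1) Flush table caches first, without writing to the binlog,
--    so the later global lock completes quickly:
FLUSH NO_WRITE_TO_BINLOG TABLES;

-- 2) Block all writes; clients wait (set client-side timeouts),
--    and running transactions may still end up aborted:
FLUSH TABLES WITH READ LOCK;
-- ... promote the slave ...
UNLOCK TABLES;

-- Alternative: make writes fail immediately (aborts running transactions):
SET GLOBAL read_only = 1;

-- Alternative: stop only NEW application connections; existing sessions
-- keep working until they disconnect (non-persistent connections only):
DROP USER 'app'@'%';
```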
  • 8. Trade-off between safeness and performance What we are now doing at DeNA is: Checking that there are no long-running updates – 100 seconds of updates will take 100 seconds on slaves. Dropping the app user -- starting downtime. Waiting for a while (2 seconds maximum) until all active application sessions are disconnected – ignoring replication threads and sessions sleeping 1 second or more (highly likely daemon programs or unused sessions, which can be killed safely) – not killing active sessions immediately. Executing FLUSH TABLES WITH READ LOCK when there are no active sessions or 2 seconds have passed. Starting slave promotion -- ending downtime. At most 1 second is enough to do all processes 8
  • 9. Our solution Developing “MySQL-MHA: Master High Availability manager and tools” (http://code.google.com/p/mysql-master-ha). This is an automated failover tool, but it can also be used for fast online master switch, moving from the original master to a new master gracefully. Topology before: host1 (current master) -> {host2 (backup), host3 (slave), host4 (slave), host5 (remote)}. After: host2 (new master) -> {host3 (slave), host4 (slave), host5 (remote)}. We have switched 10+ masters so far, and we could switch with 0.5 – 1 second of downtime 9
  • 10. Master Failover: What makes it difficult? MySQL replication is asynchronous, so it is likely that some (or all) slaves have not received all binary log events from the crashed master, and that only some slaves have received the latest events. Example: the master (holding the writer IP) has events id=99..102; id=102 is not replicated to any slave; slave 2 is the latest among the slaves, while slave 1 and slave 3 have lost some events. It is necessary to: 1. Save binlog events that exist on the master only (copy id=102 from the master, if possible); 2. Identify which events were not sent to each slave; 3. Apply the lost events. All differential events must be applied, otherwise data inconsistency happens. 10
  • 11. Current stable HA solutions and issues Pacemaker(Heartbeat) + DRBD (or shared disk) Cost: Additional passive master server (not handling any application traffic) Performance: To make HA really work on DRBD replication environments, innodb-flush-log-at-trx-commit and sync-binlog must be 1, but these kill write performance. Otherwise necessary binlog events might be lost on the master; then slaves can’t continue replication, and data consistency issues happen. MySQL Cluster MySQL Cluster is really Highly Available, but unfortunately we use InnoDB. Others Unstable, too complex, too hard to operate/administer, wrong or missing documentation. Not working with standard MySQL (are you saying we have to migrate all 150+ applications to bleeding edge distributions?), not working with a remote datacenter, etc 11
  • 12. Our solution: Developing MySQL-MHA (MySQL Master High Availability manager and tools, http://code.google.com/p/mysql-master-ha). MySQL-MasterHA-Manager runs on a manager node and provides masterha_manager and other helper commands. MySQL-MasterHA-Node runs on each MySQL server (master and slaves) and provides save_binary_logs, apply_diff_relay_logs, and purge_relay_logs. The Manager pings master availability; when detecting master failure, it promotes one of the slaves to the new master and fixes consistency issues between the slaves 12
  • 13. Internals: steps for recovery For each slave(i), wait until its SQL thread executes all events in its relay log. Define: (i1) the partial transaction at the tail of slave(i)’s relay log (up to its final Relay_Log_File / Relay_Log_Pos); (i2) the differential relay logs from slave(i)’s read position (Master_Log_File / Read_Master_Log_Pos) to the latest slave’s read position; (X) the differential binary logs from the latest slave’s read position to the tail of the dead master’s binary log. Then, on each slave(i), apply i1 -> i2 -> X. On the latest slave, i2 is empty. 13
  • 14. Advantages of MySQL MHA Master failover and slave promotion can be done very quickly Total downtime can be 10-30 seconds Master crash does not result in data inconsistency No need to modify current MySQL settings We use MHA for 150+ normal MySQL 5.0/5.1/5.5 masters, without modifying anything Problems of MHA do not result in MySQL failure You can install/uninstall/upgrade/downgrade/restart without stopping MySQL No need to increase lots of servers No performance penalty Works with any storage engine Can also be used for failback (fast online master switch) 14
  • 15. MySQL MHA Project Info Project top page http://code.google.com/p/mysql-master-ha/ Documentation http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6 Source tarball and rpm package (stable release) http://code.google.com/p/mysql-master-ha/downloads/list The latest source repository (dev release) https://github.com/yoshinorim/MySQL-MasterHA-Manager (Manager source) https://github.com/yoshinorim/MySQL-MasterHA-Node (Per-MySQL server source) SkySQL provides commercial support for MHA 15
  • 16. Table of contents Easier maintenance and automating failover Non-stop master migration Automated master failover New Open Source Software: “MySQL MHA” Optimizing MySQL for faster H/W 16
  • 17. Per-server performance is important To handle 1 million queries per second.. 1000 queries/sec per server : 1000 servers in total 10000 queries/sec per server : 100 servers in total Additional 900 servers will cost 10M$ initially, 1M$ every year If you can increase per server throughput, you can reduce the total number of servers, which will decrease TCO Sharding is not everything 17
  • 18. History of MySQL performance improvements H/W improvements: HDD RAID, write cache; large RAM; SATA SSD, PCI-Express SSD; more CPU cores; faster network. S/W improvements: improved algorithms (I/O scheduling, swap control, etc); much better concurrency; avoiding stalls; improved space efficiency (compression, etc) 18
  • 19. 32bit Linux Typical setup: many masters, each with 2GB RAM and HDD RAID (20GB), plus many slaves. Random disk i/o speed (IOPS) on HDD is very slow, 100-200/sec per drive, so databases easily became disk i/o bound, regardless of disk size. Applications could not handle large data (i.e. 30GB+ per server), so lots of database servers were needed. Per-server traffic was not so high because both the number of users and the data volume per server were not so high. Backup and restore completed in a short time. MyISAM was widely used because it’s very space efficient and fast 19
  • 20. 64bit Linux + large RAM + BBWC Typical setup: 16GB RAM, HDD RAID (120GB), plus many slaves. Memory pricing went down, and 64bit Linux went mature; it became common to deploy 16GB or more RAM on a single Linux machine. Memory hit ratio increased, and much larger data could be stored. The number of database servers decreased (consolidated), and per-server traffic increased (more users per server). “Transaction commit” overheads were extremely reduced thanks to battery backed up write cache. From the database point of view, InnoDB became faster than MyISAM (row level locks, etc). Direct I/O became common 20
  • 21. Side effect caused by fast servers After 16-32GB RAM became common, we could run many more users and data per server, so write traffic per server also increased. 4-8 disk RAID 5/10 also became common, which improved concurrency a lot: on 6-HDD RAID 10, single-thread IOPS is around 200, while 100-thread IOPS is around 1000-2000. That gives good parallelism on both reads and writes on the master. On slaves, however, there is only one writer thread (the SQL thread), so there is no parallelism on writes: 6-HDD RAID 10 is as slow as a single HDD for writes. Slaves became the performance bottleneck earlier than the master, and serious replication delay happened (10+ minutes at peak time) 21
  • 22. Using SATA SSD on slaves IOPS differences between the master (1000+, on HDD RAID) and slaves (100+, single writer thread) caused serious replication delay. Is there any way to gain high enough IOPS from a single thread? Read IOPS on SATA SSD is 3000+ (15 times better than HDD), which should be enough: just replacing HDD with SSD solved the replication delay, and overall read throughput became much better. Using SSD on the master was still risky, and using SSD on slaves (IOPS: 100+ -> 3000+) was more effective than using it on the master (IOPS: 1000+ -> 3000+), so we mainly deployed SSD on slaves; the number of slaves could be reduced. From the MySQL point of view, good concurrency on HDD RAID has been required: InnoDB Plugin 22
  • 23. How about PCI-Express SSD? Deploying on both master and slaves? If PCI-E SSD is used on master, replication delay will happen again – 10,000IOPS from single thread, 40,000+ IOPS from 100 threads 10,000IOPS from 100 threads can be achieved with SATA SSD Parallel SQL threads should be implemented in MySQL Deploying on only slaves? If using HDD on master, SATA SSD should be enough to handle workloads – PCI-Express SSD is much more expensive than SATA SSD How about running multiple MySQL instances on single server? – Virtualization is not fast – Running multiple MySQL instances on single OS is more reasonable Does PCI-E SSD have enough storage capacity to run multiple instances? On HDD environments, typically only 100-200GB of database data can be stored because of slow random IOPS on HDD FusionIO SLC: 320GB Duo + 160GB = 480GB FusionIO MLC: 1280GB Duo + 640GB = 1920GB tachIOn SLC: 800GB x 2 = 1600GB 23
  • 24. Running multiple slaves on a single box Before: each master (M) has its own backup (B) and slaves (S1, S2, S3) on dedicated machines. After: slaves from different masters (S1, S2, ...) are consolidated onto a single PCI-E SSD server, while the master and backup servers are still HDD based. Since a slave’s SQL thread is single threaded, you gain better concurrency by running multiple instances. The number of instances is mainly restricted by storage capacity 24
  • 25. Our environment Machine HP DL360G7 (1U), or Dell R610 PCI-E SSD FusionIO MLC (640GB Duo + 320GB non-Duo) tachIOn SLC (800GB x 2) CPU Two sockets, Nehalem 6-core per socket, HT enabled – 24 logical CPU cores are visible – Four socket machine is too expensive RAM 60GB or more Network Broadcom BCM5709, Four ports Using four network cables + bonding mode 4 + link aggregation – BONDING_OPTS="miimon=100 mode=4 lacp_rate=1 xmit_hash_policy=1" HDD 4-8 SAS RAID1+0 For backups, redo logs, relay logs, (optionally) doublewrite buffer 25
  • 26. Benchmarks on our real workloads Consolidating 7 instances on FusionIO (640GB MLC Duo + 320GB MLC) Let half of SELECT queries go to these slaves 6GB innodb_buffer_pool_size Peak QPS (total of 7 instances) 61683.7 query/s 37939.1 select/s 7861.1 update/s 1105 insert/s 1843 delete/s 3143.5 begin/s CPU Utilization %user 27.3%, %sys 11%(%soft 4%), %iowait 4% C.f. SATA SSD:%user 4%, %sys 1%, %iowait 1% Buffer pool hit ratio 99.4% SATA SSD (single instance/server): 99.8% No replication delay No significant (100+ms) response time delay caused by SSD 26
  • 27. CPU loads 22:10:57 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 22:11:57 all 27.13 0.00 6.58 4.06 0.14 3.70 0.00 58.40 56589.95 … 22:11:57 23 30.85 0.00 7.43 0.90 1.65 49.78 0.00 9.38 44031.82 CPU utilization was high, but should be able to handle more %user 27.3%, %sys 11%(%soft 4%), %iowait 4% Reached storage capacity limit (960GB). Using 1920GB MLC should be fine to handle more instances Network became the first bottleneck Recv: 14.6MB/s, Send: 28.7MB/s CentOS5 + bonding is not good for network requests handling (only single CPU core can handle requests) (I got the above result when I tested with normal bond0) We are now using link aggregation + bond4 with 4 network cables, then the CPU bottleneck went away 27
  • 28. Things to consider To run multiple MySQL instances in single server, you need to allocate different IP addresses or port numbers Administration tools are also affected We allocated different (virtual) IP addresses because some of existing internal tools depend on “port=3306” bind-address=“virtual ip address” in my.cnf Creating separated directories and files Socket files, data directories, InnoDB files, binary log files etc should be stored on different location each other Storing some files on HDD, others on SSD Binary logs, Relay logs, Redo logs, error/slow logs, ibdata0 (files where doublewrite buffer is written), backup files on HDD Others on SSD 28
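The per-instance separation above can be sketched as a my.cnf fragment for one instance out of several on the same box, following the file-placement rules on this slide. All paths and the virtual IP below are illustrative assumptions, not DeNA's actual configuration:

```ini
# /etc/mysql1.cnf -- instance 1 of N on this box (illustrative paths/IP)
[mysqld]
bind-address = 192.168.10.101          # per-instance virtual IP; port stays 3306
port         = 3306
socket       = /var/run/mysql1/mysql.sock
datadir      = /ssd/mysql1/data        # InnoDB data files on SSD
log-bin      = /hdd/mysql1/binlog/mysql-bin    # sequential-write files on HDD
relay-log    = /hdd/mysql1/relaylog/relay-bin
innodb_log_group_home_dir = /hdd/mysql1/redo   # redo logs on HDD
log-error    = /hdd/mysql1/log/error.log
```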
  • 29. Optimizing for Social Game workloads Easily increasing millions of users in a few days Database size grows rapidly – Especially if PK is “user_id + xxx_id” (i.e. item_id) – Increasing GB/day is not uncommon Scaling reads is not difficult Adding slaves or adding caching servers Scaling writes is not trivial Sharding, scaling up Solutions depend on what kinds of tables we’re using, INSERT/UPDATE/DELETE workloads, etc 29
  • 30. INSERT-mostly tables History tables such as access logs, diary, battle history INSERT and SELECT mostly Secondary index is needed (user_id, etc) Table size becomes huge (easily exceeding 1TB) Locality (Most of SELECT go to recent data) INSERT performance in general Fast in InnoDB (Thanks to “Insert Buffering”. Much faster than MyISAM) To modify index leaf blocks, they have to be in buffer pool When index size becomes too large to fit in the buffer pool, disk reads happen In-memory workloads -> disk-bound workloads – Suddenly suffering from serious performance slowdown – UPDATE/DELETE/SELECT also getting much slower Any faster storage devices can not compete with in-memory workloads 30
  • 31. INSERT gets slower [Chart: time to insert 1 million records (InnoDB, HDD) vs. existing records (millions); sequential-order inserts stay flat at ~10,000 rows/s, while random-order inserts degrade toward ~2,000 rows/s] The secondary index size exceeded the InnoDB buffer pool size at 73 million records in the random-order test, and insertion gradually takes more time because the buffer pool hit ratio gets worse (more random disk reads are needed). For sequential-order inserts, insertion time did not change: no random reads/writes 31
  • 32. INSERT performance difference In-memory INSERT throughput 15000+ insert/s from single thread on recent H/W Exceeding buffer pool, starting disk reads Degrading to 2000-4000 insert/s on HDD, single thread 6000-8000 insert/s on multi-threaded workloads Serious replication delay often happens Faster storage does not solve everything At most 5000 insert/s on fastest SSDs such as tachIOn/FusionIO – InnoDB actually uses CPU resources quite a lot for disk i/o bound inserts (i.e. calculating checksum, malloc/free) It is important to minimize index size so that INSERT can complete in memory 32
  • 33. Approach to complete INSERT in memory Partition 1 Partition 2 Single big physical table(index) Partition 3 Partition 4 Range partition by datetime Started from MySQL 5.1 Index size per partition becomes total_index_size / number_of_partitions INT or TIMESTAMP enables hourly based partitions – TIMESTAMP does not support partition pruning Old partitions can be dropped by ALTER TABLE .. DROP PARTITION 33
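The range-partitioning idea above, sketched as DDL for a hypothetical history table (table and column names are illustrative). Note the INT datetime column: in 5.1, partition pruning works with INT but not with TIMESTAMP:

```sql
-- Hypothetical insert-mostly history table, range-partitioned by hour
CREATE TABLE battle_history (
  user_id INT UNSIGNED NOT NULL,
  dt      INT UNSIGNED NOT NULL,   -- e.g. 2011103112 (YYYYMMDDHH), set by the app
  body    VARCHAR(255),
  INDEX (user_id, dt)
) ENGINE=InnoDB
PARTITION BY RANGE (dt) (
  PARTITION p2011103112 VALUES LESS THAN (2011103113),
  PARTITION p2011103113 VALUES LESS THAN (2011103114),
  PARTITION pmax        VALUES LESS THAN MAXVALUE
);

-- Index size per partition is total_index_size / number_of_partitions,
-- so recent partitions stay in the buffer pool.
-- Old data can be dropped instantly, without long DELETEs:
ALTER TABLE battle_history DROP PARTITION p2011103112;
```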
  • 34. Optimizing UPDATE, DELETE, SELECT Using SSD is really, really helpful IOPS difference is significant – Updates in memory: 15,000/s – On HDD : 300/s – On SATA SSD: 1,800/s – On PCI-E SSD : 4,000/s We have used SATA SSD with RAID0 on slaves Now we are gradually increasing PCI-E SSD (FusionIO and tachIOn), consolidating 6-10 MySQL instances If all data fit in memory and traffics are very high, using NoSQL is helpful We use HandlerSocket on user’s database (pk: user_id) – Database size is less than InnoDB buffer pool size Check Oracle’s memcached API project. Should be very easy to use 34
  • 35. Large-HDD servers and SSD servers “History Shard” Putting history data (comments, logs, etc) here Using range partitioning Large enough HDD with RAID 10 – 900GB (10K RPM) x 8 or 300GB (15K RPM) x 10 HDD Data size tends to be huge, but doesn’t matter so much “Application Shard” Middle range SSD (including SATA SSD), or PCI-E SSD Data size matters a lot 35
  • 36. Our near-future deployments PCI-E or SATA/SAS SSD servers hold the application shards (Game1_shard1 .. Game1_shard4, Game2_shard1, Game2_shard2, ...); large HDD servers hold the history shards (Game1_history_shard1 .. Game1_history_shard4, ...); each group has a master plus slave/backup. By moving history tables, application data size can be decreased significantly (to less than 30%), so PCI-E servers can consolidate shards a lot. HDD servers run mostly in-memory workloads, so they can also consolidate a good number of shards. A server crash then causes multiple shards to fail, so automated failover is important 36
  • 37. Summary Automated master failover and easier master maintenance is important to manage hundreds of master servers Scaling up, scaling down, version up, etc Using MHA will help a lot – Configuring MHA does not require MySQL settings changes – Master failover in 10-30 seconds, without passive server – Moving master can be done in 0.5-2 seconds of downtime Optimizing MySQL for faster H/W Deploying history tables (insert-mostly tables, hundreds of GBs) on HDD Deploying application tables on PCI-E SSD Consolidating multiple MySQL instances on single box 37