Using HAProxy to Scale MySQL: growing from a single database server to a master-master replication topology with multiple slaves, using HAProxy to maintain high availability and ease maintenance tasks.
2. Overview
+ MySQL use at MapMyFitness
+ Scaling MySQL
+ Integrating HAProxy and MySQL
+ Some issues that came up
+ Results
3. MySQL at MapMyFitness
+ Mapmyfitness DB
• Writer Active/Passive
• 5 Dedicated Read Slaves
+ Workouts DB
• Writer Active/Passive
• 4 Dedicated Read Slaves
+ Additional BI and Developer Servers
4. Scaling MySQL
+ Bigger faster hardware
• More RAM
• Faster Disks
+ Partition by Service
• Mapmyfitness (Default)
• Workouts
+ Split Reads and Writes
• 97% reads, 3% writes
• Different query profiles
5. HAProxy
+ HAProxy
• High Availability
• Load Balancing
+ MMF uses MySQL in Active/Passive Clusters
+ Reads and Writes are split at the application layer
• Writes go to one master
• Reads are routed to slaves
+ MySQL maintenance
• Upgrading servers
• OS Maintenance
+ Automatic Slave failover
+ Automatic Master Failover?
6. HAProxy at MapMyFitness
+ Mapmyfitness Read Array
+ Workouts Read Array
+ Writer Array
+ HAProxy Servers
• Two servers for each array
• keepalived: prevent Single Point of Failure (SPOF)
• Configs managed via puppet
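keepalived floats a shared virtual IP between the two HAProxy servers in each pair. A minimal sketch of a keepalived config for one such pair, with hypothetical interface, router ID, and VIP values:

vrrp_script chk_haproxy {
    script "killall -0 haproxy"    # succeeds only while an haproxy process is alive
    interval 2
}

vrrp_instance VI_READER {
    interface eth0                 # hypothetical interface name
    state MASTER                   # BACKUP on the second server
    virtual_router_id 51           # hypothetical router id
    priority 101                   # set lower (e.g. 100) on the second server
    virtual_ipaddress {
        172.16.16.100              # hypothetical floating VIP
    }
    track_script {
        chk_haproxy
    }
}

If haproxy dies or the box goes down, the VIP moves to the peer, so clients keep connecting to the same address.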
7. HAProxy Walk-through
+ You need three things
• Front-end
• Back-end
• Health Checks
+ We use four
• Front-end
• Back-end
• Health Checks
• Monitors
13. Some Issues
+ MySQL Replication Paths
• current master is not up
• current master is not replicating and the alternate master is the writer
+ Solved by using HAProxy monitors
• Combine multiple health checks
• Check local replication
• Check replication upstream
• Check alternate master status
14. Some Issues
+ Hot Core
• Individual CPU at 100% utilization
• HAProxy is a single process
• It can use multiple cores, but there are some issues
+ Solved by using nbproc configuration option
• Will use multiple CPUs
• No parent process
• Stats are by PID
• Multiple processes means multiple checks
15. Some Issues
+ High connection rates
• TCP source port exhaustion
• Greater than 533 MySQL connect/disconnects per second
• 64k Sockets in TIME_WAIT
+ Solved by using multiple source IPs
• Bind multiple IPs to the HAProxy server's interface
• server mysql03 … source 172.16.16.22
• 5 IPs mapped on the MapMyFitness read array HAProxy
• Let HAProxy manage the source ports
16. Results
+ Scale MySQL
• Reduced I/O wait on writers
• Moved query load to read array members
• Protected the servers from connection overloads
+ Server Maintenance
• add and remove servers from the array
• Allows for MySQL and server OS upgrades
+ High Availability
• Automatic failover of read members
• Could allow automatic failover of writers
• Write to multiple masters
MySQL is where all of our metadata is stored: user profile data, route IDs, start and stop points, workout duration, calories, etc.
We use a master-master replication topology with slaves, but only write to one master server, hence active/passive. The second master is a hot failover.
mapmyfitness db: In the last year the monolithic db grew from 400 GB to 750 GB.
It was split between the mapmyfitness (default) and workouts data servers around March; today (July 12, 2014) the mapmyfitness db is 480 GB and workouts is 340 GB, 820 GB total.
Additional BI, development, and customer service team MySQL data servers:
mysql06: real-time mapmyfitness and workouts data
devint and extint: mapmyfitness and workouts data, generally refreshed once a week
Currently we:
add 40-50k new registered users every day
add 700-900k workouts per day, and will probably hit a million a day shortly
add 400-500k routes per day
We are on track to almost double usage each year, so we definitely need to scale MySQL.
Bigger, faster hardware: RAM went from 48 GB to 96 GB, and we now have three 128 GB RAM servers for the mapmyfitness db. We also moved from HDD to SSD.
Split reads and writes:
97% reads, 3% writes
Different query profiles:
immediate read after write
delayed read (up to 8 seconds)
not time sensitive
Set aside specific servers for specific tasks (writer, reader, reporter, thunderdome, etc.).
Partition by service:
Mapmyfitness (default): metadata for users and routes
Workouts: workout-specific data, 24x7 data
We need to do more, perhaps creating a routes service.
But then we will have to shard within these services to scale to 100-500 million users.
So the real question is: how do we scale all these servers with minimal service interruption?
HAProxy
Traditionally used for web servers, but it can handle any TCP connection, including MySQL. It provides high availability using multiple backend servers, offers load balancing (roundrobin, least connections), and can route traffic to different backend servers for specific uses. It should be a good fit for our current MySQL ecosystem.
Active/passive clusters:
active master
passive backup (which can also be used as a reader)
MMF has been set up this way since very early on (2009). It requires plenty of manual intervention to point each app at different servers for different uses, and there are lots of places where configs could be missed.
Reads and writes are split at the application layer:
writes go to one master
immediate reads go to the current master
delayed reads go to a slave of the current master
non-time-sensitive reads go to other slaves
The "reporting" slave takes long-running queries; the "thunderdome" slave takes bad queries that need more resources.
HAProxy also allows fairly non-disruptive MySQL maintenance:
server upgrades
OS maintenance
writer switches
slave failover
MySQL server overload protection (max connections)
In this case HAProxy listens on all interfaces on port 33306 for the mysql-reader array.
check port: which port to use for the health check
addr: which server IP to query on the check port
maxconn: maximum number of connections allowed to a server
weight: percentage of traffic to be sent to a server
source: the source IP used to connect to a server; this allows increased connection rates
backup: whether the server is a backup
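Putting those parameters together, a hedged sketch of what such a listen section could look like (mysql03 and source 172.16.16.22 appear elsewhere in this talk; the other server names, IPs, check port 9200, and the maxconn/weight values are illustrative, not our exact config):

listen mysql-reader
    bind 0.0.0.0:33306
    mode tcp
    balance roundrobin
    option httpchk                   # health checks speak HTTP to the xinetd script below
    server mysql03 172.16.16.3:3306 check port 9200 addr 172.16.16.3 maxconn 500 weight 100 source 172.16.16.22
    server mysql04 172.16.16.4:3306 check port 9200 addr 172.16.16.4 maxconn 500 weight 100 source 172.16.16.23
    server mysql09 172.16.16.9:3306 check port 9200 addr 172.16.16.9 maxconn 500 weight 100 source 172.16.16.24 backup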
This requires xinetd and custom scripts.
The status check:
ERROR_MSG=`/usr/bin/mysql --defaults-file=/opt/mysql_tools/.password/mysqlchkusr.cnf --verbose -e "show databases;" 2>/dev/null`
#
# Check the output. If it is not empty then everything is fine and we return
# something. Else, we just do not return anything.
#
if [ "$ERROR_MSG" != "" ]
then
    # mysql is fine, return http 200…
The replication check:
tmp_file=/tmp/$RANDOM
rm -f $tmp_file
/usr/bin/mysql --defaults-file=/opt/mysql_tools/.password/mysqlchkusr.cnf -e "show slave status\G" > $tmp_file
Slave_IO_Running=`grep Slave_IO_Running: $tmp_file | awk '{ print $2 }'`
Slave_SQL_Running=`grep Slave_SQL_Running: $tmp_file | awk '{ print $2 }'`
Seconds_Behind_Master=`grep Seconds_Behind_Master: $tmp_file | awk '{ print $2 }'`
#
# Check the output. If both slave threads are running and the slave is less
# than 8 seconds behind its master, everything is fine and we return
# something. Else, we just do not return anything.
#
if [ "$Slave_IO_Running" == "Yes" ] && [ "$Slave_SQL_Running" == "Yes" ] && [ "$Seconds_Behind_Master" -lt 8 ]
then
    # mysql is fine, return http 200
    /bin/echo -e "HTTP/1.1 200 OK\n"…
The check script returns an HTTP 200 if MySQL is OK, or a 503 if it is not:
if [ "$ERROR_MSG" != "" ]
then
# mysql is fine, return http 200
/bin/echo -e "HTTP/1.1 200 OK\n"
/bin/echo -e "Content-Type: Content-Type: text/plain\n"
/bin/echo -e "\n"
/bin/echo -e "MySQL is running.\n"
/bin/echo -e "\n"
else
# mysql is not running, return http 503
/bin/echo -e "HTTP/1.1 503 Service Unavailable\n"
/bin/echo -e "Content-Type: Content-Type: text/plain\n"
/bin/echo -e "\n"
/bin/echo -e "MySQL is *down*.\n"
/bin/echo -e "\n"
fi
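For reference, a typical xinetd service definition that exposes a check script like this over HTTP; the service name, port 9200, and script path here are assumptions, not necessarily our exact setup:

# /etc/xinetd.d/mysqlchk (hypothetical name, port, and path)
service mysqlchk
{
    disable        = no
    type           = UNLISTED       # port is given here, not in /etc/services
    port           = 9200
    socket_type    = stream
    wait           = no
    user           = nobody
    server         = /opt/mysql_tools/mysqlchk
    log_on_failure += USERID
    per_source     = UNLIMITED      # do not rate-limit HAProxy's checks
}

HAProxy's option httpchk then connects to port 9200 on each backend and treats the script's 200/503 responses as up/down.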
The acl no_repl_mysql03 checks whether replication on mysql03 is within 8 seconds of its master.
The acl mysql10 checks whether the mysql10 server is up.
monitor fail:
fails if replication on mysql03 is more than 8 seconds behind its master
fails if replication on mysql09 is down and mysql10 is up (no replication path to the "active" master)
MySQL replication: it always works even if it’s not working.
If the writer is mysql09 and replication on mysql10 goes down, replication on mysql04 and mysql18 (slaves of mysql10) will remain up, and their Seconds_Behind_Master will be 0 because mysql10 is not applying any transactions. So mysql04 and mysql18 will be serving stale data even though replication is running.
The monitor can check the replication path and fail even if only one part of the chain fails.
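A sketch of how such a monitor can be expressed in HAProxy config terms. The 8-second logic lives in the replication check script shown earlier; the helper backends, check ports, and monitor port below are assumptions for illustration:

backend repl_mysql03                  # passes only while mysql03 replication is healthy
    option httpchk
    server mysql03 172.16.16.3:3306 check port 9201

backend repl_mysql09                  # same replication check for mysql09
    option httpchk
    server mysql09 172.16.16.9:3306 check port 9201

backend st_mysql10                    # passes while mysql10 is up at all
    option httpchk
    server mysql10 172.16.16.10:3306 check port 9200

frontend monitor_readers
    bind :33340                       # hypothetical monitor port
    mode http
    monitor-uri /mysql_status
    acl no_repl_mysql03 nbsrv(repl_mysql03) lt 1
    acl no_repl_mysql09 nbsrv(repl_mysql09) lt 1
    acl mysql10 nbsrv(st_mysql10) gt 0
    monitor fail if no_repl_mysql03
    monitor fail if no_repl_mysql09 mysql10

Anything polling /mysql_status, including another HAProxy, gets a 503 when a combined condition trips; this is how multiple health checks are combined into a single answer.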
HAProxy can reliably handle 50k connections per second on a single CPU.
That said, we were at ~1,350 connections per second with ~1,600 concurrent connections, well below the known capabilities of HAProxy.
And we still hit 100% CPU on HAProxy. When that happens we see latency (slow connections), because HAProxy is too busy to handle the additional connection requests and those requests get queued.
The immediate solution was to increase the number of HAProxy processes running on the server. nbproc spawns multiple HAProxy processes, all using the same config file. tofu07 is a 12-core machine, so we spun up 11 HAProxy processes. CPU utilization dropped to 25-30% on a couple of processes, with the remaining processes running at 2-3% CPU.
But:
no parent process
unreliable stats (served by whichever PID answers)
multiple processes means multiple checks to each server from each process
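The change itself is small; a sketch of the relevant global section (11 matches tofu07's 12 cores minus one left for the OS):

global
    nbproc 11    # spawn 11 haproxy processes, all sharing this config

Each process binds the same frontends and runs its own health checks, which is exactly where the caveats above come from.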
We might also be hitting a bug fixed in 1.4.24 (2013/06/17):
- BUG/MAJOR: backend: consistent hash can loop forever in certain circumstances
So we should upgrade to v1.4.25 (the latest).
We are running:
[root@589033-tofu05 ~]# haproxy -vv
HA-Proxy version 1.4.22 2012/08/09
Copyright 2000-2012 Willy Tarreau <w@1wt.eu>
Build options :
TARGET = linux26
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing
OPTIONS = USE_REGPARM=1 USE_PCRE=1
Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
Encrypted password support via crypt(3): yes
Available polling systems :
sepoll : pref=400, test result OK
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 4 (4 usable), will use sepoll.
[root@589033-tofu05 ~]# strace -c -p 1769
Process 1769 attached - interrupt to quit
^CProcess 1769 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
45.64 0.018719 0 50170 65 sendto
33.92 0.013914 0 103596 50446 recvfrom
4.16 0.001707 0 4816 2456 connect
3.61 0.001480 6 259 epoll_wait
2.71 0.001113 0 3060 70 shutdown
2.15 0.000880 0 6395 epoll_ctl
1.83 0.000751 0 2977 close
1.55 0.000636 0 7183 setsockopt
1.41 0.000580 0 4809 fcntl
1.07 0.000437 0 2441 socket
1.07 0.000437 0 2463 95 accept
0.81 0.000331 0 2370 bind
0.08 0.000032 0 187 brk
------ ----------- ----------- --------- --------- ----------------
100.00 0.041017 190726 53132 total
HAProxy works as a reverse proxy, and so uses its own IP address to connect to the server.
Any system has around 64k TCP source ports available to connect to a remote IP:port. Once a combination of "source IP:port => dst IP:port" is in use, it can't be re-used.
You can't have more than 64k open connections from an HAProxy box to a single remote IP:port pair.
There is an issue with the MySQL client library: when a client sends its "QUIT" sequence, it performs a few internal operations and then immediately shuts down the TCP connection, without waiting for the server to do it. A basic tcpdump will show this easily.
Note that you won't be able to reproduce this issue on a loopback interface, because the server answers fast enough… You must use a LAN connection and two different servers.
Basically, here is the sequence currently performed by a MySQL client:
MySQL Client ==> "QUIT" sequence ==> MySQL Server
MySQL Client ==> FIN ==> MySQL Server
MySQL Client <== FIN ACK <== MySQL Server
MySQL Client ==> ACK ==> MySQL Server
A "clean" sequence would instead be (this is what happens on the loopback interface when both client and server are on the same machine):
MySQL Client ==> "QUIT" sequence ==> MySQL Server
MySQL Client <== FIN <== MySQL Server
MySQL Client ==> FIN ACK ==> MySQL Server
MySQL Client <== ACK <== MySQL Server
Because the client closes first, its side of the connection remains in TIME_WAIT, unavailable for twice the MSL (Maximum Segment Lifetime): 2 minutes. This parameter is in the kernel source code and can only be changed by recompiling a custom kernel.
Since each source port is unavailable for 2 minutes, above roughly 533 MySQL connections per second you are in danger of TCP source port exhaustion: 64,000 available ports / 120 seconds = 533.3.
This TCP port exhaustion appears on the MySQL client server itself, but also on the HAProxy box, because it forwards the client traffic to the servers.
We added 5 IPs to the read array HAProxy boxes, which can theoretically now hold 320k sockets in TIME_WAIT state.
Each server in the backend generally gets a separate source IP.
We have seen a maximum of 162k sockets in TIME_WAIT, which equates to 1,350 connect/disconnects per second to the read array servers.
The kernel is not all that efficient at managing these IP:port combinations, so we additionally let HAProxy manage them itself, which gets us to the full 320k limit (2,667 connects/disconnects per second). Setting a port range on the source line lets HAProxy's built-in port allocation manage sockets more efficiently than the stock (non-custom) Linux kernel code.
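In config terms, that means giving each server line its own source IP with an explicit port range so HAProxy, rather than the kernel, picks the source ports. A sketch (IPs, check port, and range are illustrative):

server mysql03 172.16.16.3:3306 check port 9200 source 172.16.16.22:1025-65000
server mysql04 172.16.16.4:3306 check port 9200 source 172.16.16.23:1025-65000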
We also have these set in /etc/sysctl.conf
# haproxy net config
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65023
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_max_orphans = 60000
net.ipv4.tcp_synack_retries = 3
net.core.somaxconn = 65536
Reducing I/O wait and hot-table contention allows for more DML per second.
The default (mapmyfitness db) writer is 33% writes with I/O wait at 0.1%:
SSD servers fall down at greater than 10% I/O wait
1k DML per second, 2k reads per second
The workouts writer is 25% writes with I/O wait at 4.2%:
HDD servers fall down at greater than 20% I/O wait
200 DML per second, 650 reads per second
Default read array: supporting 10k reads per second with 0.3% I/O wait.
We've had peaks of 23k reads per second, bumping up against the 10% I/O wait threshold on the single SSD read server, so we've added an additional SSD server.
Connection overload is limited at HAProxy, not on the server, so we will always be able to connect to the server even if all of the configured connections are in use.
Add: modify the HAProxy config and reload.
Remove: for maintenance, stop the xinetd service; to remove permanently, modify the HAProxy config and reload.
Reloads are graceful, using the -sf option in the HAProxy init.d script.
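Under the hood, a graceful reload amounts to something like this (stock config and pid file paths; adjust to your layout):

# start a new haproxy; -sf tells the old process (by PID) to stop
# listening, finish its current connections, and then exit
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
        -sf $(cat /var/run/haproxy.pid)

The old process drains existing connections while the new one accepts fresh ones, so clients see no hard cut.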
HAProxy manages the connections according to the health checks: when a server is detected as down, its connections are moved to another available server in the backend. If all the servers are down, the backups listed in the HAProxy back-end config take the connections.
Writer failover: not a good idea. Why did the original writer fail? Is there a flapping condition causing the current writer to go down and then come back up? MySQL replication, while pretty robust, has had serious problems in our environment in the past, with duplicate key violations on insert or missing rows on deletes, caused by flapping or duplicate writes to both masters.
Writes to multiple masters would dramatically increase our write throughput, but again, issues with duplicate keys and foreign key constraints have stopped us from moving in that direction.