M|18 Technical Introduction: Automatic Failover

Published in: Data & Analytics
  1. MariaDB MaxScale: Switchover, Failover and Rejoin
     Wagner Bianchi, Remote DBA Team Lead @ MariaDB RDBA Team
     Esa Korhonen, Software Engineer @ MariaDB MaxScale Engineering Team
  2. Introduction to MariaDB MaxScale
     ● Intelligent database proxy:
       ○ Separates client application from backend(s)
       ○ Understands authentication, queries and backend roles
       ○ Typical use cases: read-write splitting, load balancing
       ○ Many plugins: query filtering, logging, caching
     ● Latest GA version: 2.2
     [Diagram labels: CLIENT, DATABASE SERVERS]
  3. Query processing stages
     [Diagram components: Client, Protocol, Filter, Router, Parser, Server State, Monitor, Backend; arrow labels: "updates", "monitors", "uses"]
  4. What is new in MariaDB-Monitor for MaxScale 2.2*
     ● Support for replication cluster manipulation: failover, switchover, rejoin
       ○ failover: replace a failed master with a slave
       ○ switchover: swap a slave with a live master
       ○ rejoin: bring a standalone server back to the cluster, or redirect slaves replicating from the wrong master
     ● Failover and rejoin can be set to activate automatically
     ● Reduces the need for custom scripts or replication management tools
     ● Supported topologies: 1 master, N slaves, 1-level depth
     ● Limited support for external masters
     * Note: renamed from the previous mysqlmon
  5. Switchover
     ● Controlled swap of the master with a designated slave
     ● Monitor user must have the SUPER privilege
     ● Depends on read_only to freeze the cluster
       ○ SUPER users bypass this
     ● Waits for all slaves to catch up with the master
       ○ no data should be lost, but it can be slow
     ● Configuration settings:
       ○ replication_user & replication_password
       ○ switchover_timeout

     $ ./maxctrl list servers
     ┌──────────────┬───────────┬──────┬─────────────┬─────────────────┐
     │ Server       │ Address   │ Port │ Connections │ State           │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalMaster1 │ 127.0.0.1 │ 3001 │ 0           │ Master, Running │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave1  │ 127.0.0.1 │ 3002 │ 0           │ Slave, Running  │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave2  │ 127.0.0.1 │ 3003 │ 0           │ Slave, Running  │
     └──────────────┴───────────┴──────┴─────────────┴─────────────────┘
     $ ./maxctrl call command mariadbmon switchover MariaDB-Monitor LocalSlave1
     OK
     $ ./maxctrl list servers
     ┌──────────────┬───────────┬──────┬─────────────┬─────────────────┐
     │ Server       │ Address   │ Port │ Connections │ State           │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalMaster1 │ 127.0.0.1 │ 3001 │ 0           │ Slave, Running  │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave1  │ 127.0.0.1 │ 3002 │ 0           │ Master, Running │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave2  │ 127.0.0.1 │ 3003 │ 0           │ Slave, Running  │
     └──────────────┴───────────┴──────┴─────────────┴─────────────────┘
  6. Failover
     ● Promote a slave to take the place of the failed master
     ● The damage has already been done, so there is no need to worry about the old master
     ● Chooses a new master based on the following criteria (in order of importance):
       ○ not in the exclusion list
       ○ has the latest event in its relay log
       ○ has processed the latest event
       ○ has log_slave_updates on
     ● Configuration:
       ○ failover_timeout
     ● May lose data with the failed master
       ○ (semi)sync replication

     $ ./maxctrl list servers
     ┌──────────────┬───────────┬──────┬─────────────┬────────────────┐
     │ Server       │ Address   │ Port │ Connections │ State          │
     ├──────────────┼───────────┼──────┼─────────────┼────────────────┤
     │ LocalMaster1 │ 127.0.0.1 │ 3001 │ 0           │ Down           │
     ├──────────────┼───────────┼──────┼─────────────┼────────────────┤
     │ LocalSlave1  │ 127.0.0.1 │ 3002 │ 0           │ Slave, Running │
     ├──────────────┼───────────┼──────┼─────────────┼────────────────┤
     │ LocalSlave2  │ 127.0.0.1 │ 3003 │ 0           │ Slave, Running │
     └──────────────┴───────────┴──────┴─────────────┴────────────────┘
     $ ./maxctrl call command mariadbmon failover MariaDB-Monitor
     OK
     $ ./maxctrl list servers
     ┌──────────────┬───────────┬──────┬─────────────┬─────────────────┐
     │ Server       │ Address   │ Port │ Connections │ State           │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalMaster1 │ 127.0.0.1 │ 3001 │ 0           │ Down            │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave1  │ 127.0.0.1 │ 3002 │ 0           │ Master, Running │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave2  │ 127.0.0.1 │ 3003 │ 0           │ Slave, Running  │
     └──────────────┴───────────┴──────┴─────────────┴─────────────────┘
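The selection criteria on the failover slide can be read as a lexicographic ordering. Below is a minimal Python sketch of that reading. The `Slave` record and its fields (`relay_log_seq`, `processed_seq`) are hypothetical stand-ins for whatever the monitor actually reads from each server; this illustrates the ordering only and is not MaxScale's implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Slave:
    # All fields are hypothetical stand-ins for values the monitor would read.
    name: str
    relay_log_seq: int       # newest event received into the relay log
    processed_seq: int       # newest event actually applied
    log_slave_updates: bool  # whether replicated events are written to the binlog


def pick_new_master(slaves: Iterable[Slave], excluded: Set[str]) -> Optional[Slave]:
    """Return the best promotion candidate, or None if every slave is excluded."""
    candidates = [s for s in slaves if s.name not in excluded]
    if not candidates:
        return None
    # Tuples compare left to right, which mirrors "in order of importance".
    return max(
        candidates,
        key=lambda s: (s.relay_log_seq, s.processed_seq, s.log_slave_updates),
    )


slaves = [
    Slave("LocalSlave1", relay_log_seq=120, processed_seq=118, log_slave_updates=True),
    Slave("LocalSlave2", relay_log_seq=120, processed_seq=120, log_slave_updates=False),
]
print(pick_new_master(slaves, excluded=set()).name)
```

With this sample data both slaves have the same relay log position, so the tie is broken by the processed position and `LocalSlave2` is chosen even though it has `log_slave_updates` off.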
  7. Automatic failover
     ● Trigger: master must be down for a set amount of time
     ● Additional check by looking at slave connections
     ● Configuration settings:
       ○ auto_failover
       ○ failcount & monitor_interval
       ○ verify_master_failure & master_failure_timeout

     $ ./maxctrl list servers
     ┌──────────────┬───────────┬──────┬─────────────┬─────────────────┐
     │ Server       │ Address   │ Port │ Connections │ State           │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalMaster1 │ 127.0.0.1 │ 3001 │ 0           │ Master, Running │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave1  │ 127.0.0.1 │ 3002 │ 0           │ Slave, Running  │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave2  │ 127.0.0.1 │ 3003 │ 0           │ Slave, Running  │
     └──────────────┴───────────┴──────┴─────────────┴─────────────────┘
     $ docker stop maxscalebackends_testing1_master1_1
     $ ./maxctrl list servers
     ┌──────────────┬───────────┬──────┬─────────────┬────────────────┐
     │ Server       │ Address   │ Port │ Connections │ State          │
     ├──────────────┼───────────┼──────┼─────────────┼────────────────┤
     │ LocalMaster1 │ 127.0.0.1 │ 3001 │ 0           │ Down           │
     ├──────────────┼───────────┼──────┼─────────────┼────────────────┤
     │ LocalSlave1  │ 127.0.0.1 │ 3002 │ 0           │ Slave, Running │
     ├──────────────┼───────────┼──────┼─────────────┼────────────────┤
     │ LocalSlave2  │ 127.0.0.1 │ 3003 │ 0           │ Slave, Running │
     └──────────────┴───────────┴──────┴─────────────┴────────────────┘
     $ ./maxctrl list servers
     ┌──────────────┬───────────┬──────┬─────────────┬─────────────────┐
     │ Server       │ Address   │ Port │ Connections │ State           │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalMaster1 │ 127.0.0.1 │ 3001 │ 0           │ Down            │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave1  │ 127.0.0.1 │ 3002 │ 0           │ Master, Running │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave2  │ 127.0.0.1 │ 3003 │ 0           │ Slave, Running  │
     └──────────────┴───────────┴──────┴─────────────┴─────────────────┘
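The automatic-failover trigger described on this slide can be pictured as a counter of consecutive failed monitor passes plus an optional veto from the slaves. The Python sketch below is illustrative only. The setting names come from the slide, but the `FailoverTrigger` class, the `failcount=3` value and the way `master_failure_timeout` is folded into a single `slaves_see_master` boolean are assumptions, not the monitor's real logic.

```python
from dataclasses import dataclass


@dataclass
class FailoverTrigger:
    # Setting names follow the slide; the class and its logic are illustrative.
    failcount: int
    auto_failover: bool = True
    verify_master_failure: bool = True
    failed_passes: int = 0

    def on_monitor_pass(self, master_up: bool, slaves_see_master: bool) -> bool:
        """Record one monitor pass; return True when failover should start."""
        if master_up:
            self.failed_passes = 0
            return False
        self.failed_passes += 1
        if not self.auto_failover or self.failed_passes < self.failcount:
            return False
        # Extra check: hold off while a slave still appears connected to the master.
        if self.verify_master_failure and slaves_see_master:
            return False
        return True


trigger = FailoverTrigger(failcount=3)  # arbitrary value chosen for the example
passes = [trigger.on_monitor_pass(master_up=False, slaves_see_master=False) for _ in range(3)]
print(passes)
```

Since one pass happens every `monitor_interval`, the minimum downtime before failover in this model is roughly `failcount` times `monitor_interval`.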
  8. Rejoin
     ● Directs the joining server to replicate from the cluster master
       ○ redirect a slave replicating from the wrong master
       ○ start replication on a standalone server
     ● Looks at GTIDs to decide if the joining server can replicate
     ● Manual/automatic mode (auto_rejoin=1)
     ● Typical use case: master goes down -> failover -> old master comes back -> rejoined to cluster

     $ ./maxctrl list servers
     ┌──────────────┬───────────┬──────┬─────────────┬─────────────────┐
     │ Server       │ Address   │ Port │ Connections │ State           │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalMaster1 │ 127.0.0.1 │ 3001 │ 0           │ Down            │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave1  │ 127.0.0.1 │ 3002 │ 0           │ Master, Running │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave2  │ 127.0.0.1 │ 3003 │ 0           │ Slave, Running  │
     └──────────────┴───────────┴──────┴─────────────┴─────────────────┘
     $ docker start maxscalebackends_testing1_master1_1
     $ ./maxctrl list servers
     ┌──────────────┬───────────┬──────┬─────────────┬─────────────────┐
     │ Server       │ Address   │ Port │ Connections │ State           │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalMaster1 │ 127.0.0.1 │ 3001 │ 0           │ Running         │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave1  │ 127.0.0.1 │ 3002 │ 0           │ Master, Running │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave2  │ 127.0.0.1 │ 3003 │ 0           │ Slave, Running  │
     └──────────────┴───────────┴──────┴─────────────┴─────────────────┘
     $ ./maxctrl call command mariadbmon rejoin MariaDB-Monitor LocalMaster1
     $ ./maxctrl list servers
     ┌──────────────┬───────────┬──────┬─────────────┬─────────────────┐
     │ Server       │ Address   │ Port │ Connections │ State           │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalMaster1 │ 127.0.0.1 │ 3001 │ 0           │ Slave, Running  │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave1  │ 127.0.0.1 │ 3002 │ 0           │ Master, Running │
     ├──────────────┼───────────┼──────┼─────────────┼─────────────────┤
     │ LocalSlave2  │ 127.0.0.1 │ 3003 │ 0           │ Slave, Running  │
     └──────────────┴───────────┴──────┴─────────────┴─────────────────┘
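The slide says the monitor looks at GTIDs to decide whether a joining server can replicate, without giving the rule. The Python sketch below shows one simplified way such a check could work, assuming MariaDB's `domain-server_id-sequence` GTID format. The rule itself (same domain, and the joining server holds no events beyond the master's position) is an assumption made for illustration; the real check is likely more involved.

```python
from typing import NamedTuple


class Gtid(NamedTuple):
    domain: int
    server_id: int
    sequence: int


def parse_gtid(text: str) -> Gtid:
    """Parse a MariaDB-style GTID such as '0-3001-42' (domain-server_id-sequence)."""
    domain, server_id, sequence = (int(part) for part in text.split("-"))
    return Gtid(domain, server_id, sequence)


def can_rejoin(joining_gtid: str, master_gtid: str) -> bool:
    """Simplified rule (an assumption): same domain, and no events the master lacks."""
    joining, master = parse_gtid(joining_gtid), parse_gtid(master_gtid)
    return joining.domain == master.domain and joining.sequence <= master.sequence


# Hypothetical positions: the old master stopped at sequence 40, the new master is at 45.
print(can_rejoin("0-3001-40", "0-3002-45"))
```

In this model, an old master that accepted writes the promoted slave never received would have a higher sequence number than the new master and would be refused.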
  9. External master handling
     [Two diagrams, each labelled: DC A, DC B, "replicating from"]
  10. Switchover details
      Starting checks:
        1. Cluster has 1 master and >1 slaves
        2. All servers use GTID replication and the cluster GTID domain is known
        3. Requested new master has binary log on
      Prepare current master:
        1. SET GLOBAL read_only=1;
        2. FLUSH TABLES;
        3. FLUSH LOGS;
        4. update GTID-info
      Wait until all slaves catch up to master:
        1. MASTER_GTID_WAIT()
      Stop slave replication on new master:
        1. STOP SLAVE;
        2. RESET SLAVE ALL;
        3. SET GLOBAL read_only=0
      Redirect slaves & old master to new master:
        1. STOP SLAVE;
        2. RESET SLAVE;
        3. CHANGE MASTER TO …
        4. START SLAVE;
      Check that replication is working:
        1. FLUSH TABLES;
        2. Check that all slaves receive the new GTID
      [Diagram labels: A B C, A B C, B A C]
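The phases on this slide can be laid out as an ordered plan of (server, statement) pairs, which makes it easier to see which server each statement runs on. The Python sketch below only arranges the statements quoted on the slide. The function name, the server names and the choice to run the final `FLUSH TABLES;` on the new master are assumptions, and the elided `CHANGE MASTER TO …` and argument-less `MASTER_GTID_WAIT()` are kept exactly as the slide shows them.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (server name, statement as written on the slide)


def switchover_plan(old_master: str, new_master: str, other_slaves: List[str]) -> List[Step]:
    """Order the slide's statements by phase; nothing is executed here."""
    plan: List[Step] = []
    # Phase 1: freeze writes on the current master and flush.
    # ("update GTID-info" is monitor bookkeeping, not SQL, so it is omitted.)
    for sql in ("SET GLOBAL read_only=1;", "FLUSH TABLES;", "FLUSH LOGS;"):
        plan.append((old_master, sql))
    # Phase 2: every slave waits until it has caught up with the master.
    for slave in [new_master, *other_slaves]:
        plan.append((slave, "MASTER_GTID_WAIT()"))
    # Phase 3: the new master stops replicating and becomes writable.
    for sql in ("STOP SLAVE;", "RESET SLAVE ALL;", "SET GLOBAL read_only=0;"):
        plan.append((new_master, sql))
    # Phase 4: remaining slaves and the old master replicate from the new master.
    for server in [*other_slaves, old_master]:
        for sql in ("STOP SLAVE;", "RESET SLAVE;", "CHANGE MASTER TO …", "START SLAVE;"):
            plan.append((server, sql))
    # Phase 5: generate an event to check that replication works (target server assumed).
    plan.append((new_master, "FLUSH TABLES;"))
    return plan


for server, sql in switchover_plan("A", "B", ["C"]):
    print(f"{server}: {sql}")
```

Keeping the plan as data rather than executing it directly makes the ordering easy to review: the old master is frozen before anything else happens, and the new master is only made writable after every slave has caught up.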
  11. DEMO TIME!!
  12. MaxScale 2.2 New Features
      ● At this point you know that MariaDB MaxScale can perform:
        ○ Automatic/manual failover;
        ○ Manual switchover;
        ○ Rejoin of a crashed node as a slave of an existing cluster;
      ● These processes rely on the new MariaDBMon monitor;
      ● Hidden details when implementing and/or doing break/fix:
        ○ For switchover/failover/rejoin to work, the monitor user (MariaDBMon) needs access to all the servers; alternatively, configure a separate user through replication_user and replication_password with access to all the servers;
        ○ If the monitor user (MariaDBMon) has an encrypted password, replication_password must be encrypted as well; otherwise, the CHANGE MASTER TO run by these processes won't be able to configure replication for the new server;
  13. MaxScale 2.2 New Features
      ● Failover: replacing a failed master.
      ● For automatic failover, the auto_failover variable must be true in the monitor configuration;
        ○ auto_failover=true activates automatic failover;
      ● For manual failover, auto_failover should be set to false in the monitor configuration;
        ○ with auto_failover=false, the failover can be activated manually:
          [root@box01 ~]# maxadmin call command mariadbmon failover replication-cluster-monitor
      ● The master must be down for the manual failover to work;
      ● Enable or disable auto_failover with the alter monitor command.
  14. MaxScale 2.2 New Features
      ● Failover: replacing a failed master (automatic, auto_failover=true)

      #: checking the current configuration
      [root@box01 ~]# grep auto_failover /var/lib/maxscale/maxscale.cnf.d/replication-cluster-monitor.cnf
      auto_failover=true
      #: shut down the current master (check the current topology with `maxadmin list servers` first to confirm which server it is)
      [root@box02 ~]# systemctl stop mariadb.service
      #: watching the actions in the log file
      2018-02-10 13:51:02 error : Monitor was unable to connect to server [192.168.50.13]:3306 : "Can't connect to MySQL server on '192.168.50.13'"
      2018-02-10 13:51:02 notice : [mariadbmon] Server [192.168.50.13]:3306 lost the master status.
      2018-02-10 13:51:02 notice : Server changed state: box03[192.168.50.13:3306]: master_down. [Master, Running] -> [Down]
      2018-02-10 13:51:02 warning: [mariadbmon] Master has failed. If master status does not change in 4 monitor passes, failover begins.
      2018-02-10 13:51:06 notice : [mariadbmon] Performing automatic failover to replace failed master 'box03'.
      2018-02-10 13:51:06 notice : [mariadbmon] Promoting server 'box02' to master.
      2018-02-10 13:51:06 notice : [mariadbmon] Redirecting slaves to new master.
      2018-02-10 13:51:07 warning: [mariadbmon] Setting standalone master, server 'box02' is now the master.
      2018-02-10 13:51:07 notice : Server changed state: box02[192.168.50.12:3306]: new_master. [Slave, Running] -> [Master, Running]
  15. MaxScale 2.2 New Features
      ● Failover: replacing a failed master (manual, auto_failover=false)

      #: setting auto_failover=false
      [root@box01 ~]# maxadmin alter monitor replication-cluster-monitor auto_failover=false
      #: current master is down, automatic failover deactivated
      2018-02-09 23:31:01 error : Monitor was unable to connect to server [192.168.50.12]:3306:"Can't connect to MySQL server on '192.168.50.12'"
      2018-02-09 23:31:01 notice : [mariadbmon] Server [192.168.50.12]:3306 lost the master status.
      2018-02-09 23:31:01 notice : Server changed state: box02[192.168.50.12:3306]: master_down. [Master, Running] -> [Down]
      #: manual failover executed
      [root@box01 ~]# maxadmin call command mariadbmon failover replication-cluster-monitor
      #: let's check the logs
      2018-02-09 23:32:30 info : (17) [cli] MaxAdmin: call command "mariadbmon" "failover" "replication-cluster-monitor"
      2018-02-09 23:32:30 notice : (17) [mariadbmon] Stopped monitor replication-cluster-monitor for the duration of failover.
      2018-02-09 23:32:30 notice : (17) [mariadbmon] Promoting server 'box03' to master.
      2018-02-09 23:32:30 notice : (17) [mariadbmon] Redirecting slaves to new master.
      2018-02-09 23:32:30 notice : (17) [mariadbmon] Failover performed.
      2018-02-09 23:32:30 warning: [mariadbmon] Setting standalone master, server 'box03' is now the master.
      2018-02-09 23:32:30 notice : Server changed state: box03[192.168.50.13:3306]: new_master. [Slave, Running] -> [Master, Running]
  16. MaxScale 2.2 New Features
      ● Failover: replacing a failed master, additional details
      ● The duration of the monitor passes is based on the monitor's monitor_interval value:
        2018-02-10 13:51:02 warning: [mariadbmon] Master has failed. If master status does not change in 4 monitor passes, failover begins.
        ○ As it is currently set to 1000 ms (1 second), the failover is triggered after 4 seconds, counting from the pass in which the monitor reported the first message;
        ○ If the failover process does not complete within the time configured in failover_timeout (90 seconds by default), the failover is canceled and the feature is disabled;
        ○ To enable failover again (after checking for possible problems), use the alter monitor command:
          [root@box01 ~]# maxadmin alter monitor replication-cluster-monitor auto_failover=true
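The 4-second figure can be cross-checked against the timestamps in the automatic-failover log shown earlier in the deck. The short Python sketch below does this. The two log lines are copied from that log, while the helper names `timestamp` and `expected_delay_seconds` are made up for the example.

```python
from datetime import datetime

# Two lines copied from the automatic-failover log shown earlier in the deck.
LOG = """\
2018-02-10 13:51:02 warning: [mariadbmon] Master has failed. If master status does not change in 4 monitor passes, failover begins.
2018-02-10 13:51:06 notice : [mariadbmon] Performing automatic failover to replace failed master 'box03'.
"""


def timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def expected_delay_seconds(remaining_passes: int, monitor_interval_ms: int) -> float:
    """Delay implied by the slide: remaining passes times monitor_interval."""
    return remaining_passes * monitor_interval_ms / 1000


warned, started = (timestamp(line) for line in LOG.splitlines())
observed = (started - warned).total_seconds()
print(observed, expected_delay_seconds(4, 1000))
```

Both values come out as 4.0 seconds, matching the slide's statement for a `monitor_interval` of 1000 ms.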
  17. MaxScale 2.2 New Features
      ● Switchover: swapping a slave with a running master.
      ● The switchover process relies on the replication_user and replication_password settings added to the monitor configuration;
      ● The process is triggered manually and may take up to switchover_timeout seconds to complete (default: 90 seconds);
      ● If the process fails, the failure is written to the log and auto_failover is disabled if it was enabled;
        [root@team01-box01 ~]# maxadmin call command mariadbmon switchover replication-cluster-monitor new_master master
  18. MaxScale 2.2 New Features
      ● Switchover: swapping a slave with a running master.

      #: checking the current server list
      [root@team01-box01 ~]# maxadmin list servers
      Servers.
      -------------------+-----------------+-------+-------------+--------------------
      Server             | Address         | Port  | Connections | Status
      -------------------+-----------------+-------+-------------+--------------------
      box02              | 10.132.116.147  |  3306 |           0 | Slave, Running
      box03              | 10.132.116.161  |  3306 |           0 | Master, Running
      -------------------+-----------------+-------+-------------+--------------------
      #: new_master=box02, current_master=box03 (as recorded in the log below)
      [root@team01-box01 ~]# maxadmin call command mariadbmon switchover replication-cluster-monitor box02 box03
      #: checking logs
      2018-02-14 16:44:46 info : (712) [cli] MaxAdmin: call command "mariadbmon" "switchover" "replication-cluster-monitor" "box02" "box03"
      2018-02-14 16:44:46 notice : (712) [mariadbmon] Stopped the monitor replication-cluster-monitor for the duration of switchover.
      2018-02-14 16:44:46 notice : (712) [mariadbmon] Demoting server 'box03'.
      2018-02-14 16:44:46 notice : (712) [mariadbmon] Promoting server 'box02' to master.
      2018-02-14 16:44:46 notice : (712) [mariadbmon] Old master 'box03' starting replication from 'box02'.
      2018-02-14 16:44:46 notice : (712) [mariadbmon] Redirecting slaves to new master.
      2018-02-14 16:44:47 notice : (712) [mariadbmon] Switchover box03 -> box02 performed.
      2018-02-14 16:44:47 notice : Server changed state: box02[10.132.116.147:3306]: new_master. [Slave, Running] -> [Master, Slave, Running]
      2018-02-14 16:44:47 notice : Server changed state: box03[10.132.116.161:3306]: new_slave. [Master, Running] -> [Slave, Running]
      2018-02-14 16:44:48 notice : Server changed state: box02[10.132.116.147:3306]: new_master. [Master, Slave, Running] -> [Master, Running]
  19. MaxScale 2.2 New Features
      ● Rejoin: joining a standalone server to the cluster.
      ● Enables a crashed backend server to rejoin the cluster automatically when it comes back online;
      ● When auto_rejoin is enabled, the monitor attempts to direct standalone servers, and servers replicating from a relay master, to the main cluster master server;
      ● Test it as we did:
        ○ Check which server is the current master, then shut down its MariaDB Server;
        ○ The failover will happen if auto_failover is enabled;
        ○ Start the MariaDB Server process that was shut down;
        ○ List the servers again with maxadmin and watch the logs.
  20. MaxScale 2.2 New Features
      ● Rejoin: joining a standalone server to the cluster.

      #: current_master=box02
      [root@team01-box02 ~]# mysqladmin shutdown
      #: watching logs, the failover will happen as the master "crashed"
      2018-02-14 18:44:36 error : Monitor was unable to connect to server [10.132.116.147]:3306 : "Can't connect to MySQL server on '10.132.116.147' (115)"
      2018-02-14 18:44:36 notice : [mariadbmon] Server [10.132.116.147]:3306 lost the master status.
      2018-02-14 18:44:36 notice : Server changed state: box02[10.132.116.147:3306]: master_down. [Master, Running] -> [Down]
      2018-02-14 18:44:36 warning: [mariadbmon] Master has failed. If master status does not change in 4 monitor passes, failover begins.
      2018-02-14 18:44:40 notice : [mariadbmon] Performing automatic failover to replace failed master 'box02'.
      2018-02-14 18:44:40 notice : [mariadbmon] Promoting server 'box03' to master.
      2018-02-14 18:44:40 notice : [mariadbmon] Redirecting slaves to new master.
      2018-02-14 18:44:41 warning: [mariadbmon] Setting standalone master, server 'box03' is now the master.
      2018-02-14 18:44:41 notice : Server changed state: box03[10.132.116.161:3306]: new_master. [Slave, Running] -> [Master, Running]
      #: starting the old master again
      [root@team01-box02 ~]# systemctl start mariadb.service
      #: watching logs
      2018-02-14 18:47:27 notice : Server changed state: box02[10.132.116.147:3306]: server_up. [Down] -> [Running]
      2018-02-14 18:47:27 notice : [mariadbmon] Directing standalone server 'box02' to replicate from 'box03'.
      2018-02-14 18:47:27 notice : [mariadbmon] 1 server(s) redirected or rejoined the cluster.
      2018-02-14 18:47:28 notice : Server changed state: box02[10.132.116.147:3306]: new_slave. [Running] -> [Slave, Running]
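When following a failover and rejoin in a busy log, it can help to pull out only the "Server changed state" lines. The Python sketch below does this for the four state-change lines from the rejoin log. The regular expression is written against the line format visible in these slides and may need adjusting for other MaxScale versions; the `StateChange` type and `state_changes` function are names invented for this example.

```python
import re
from typing import List, NamedTuple


class StateChange(NamedTuple):
    server: str
    event: str
    before: str
    after: str


# Matches e.g. "Server changed state: box02[10.132.116.147:3306]: master_down. [Master, Running] -> [Down]"
PATTERN = re.compile(
    r"Server changed state: (?P<server>\w+)\[[^\]]+\]: (?P<event>\w+)\. "
    r"\[(?P<before>[^\]]+)\] -> \[(?P<after>[^\]]+)\]"
)

# State-change lines copied from the rejoin log in this deck.
LOG = """\
2018-02-14 18:44:36 notice : Server changed state: box02[10.132.116.147:3306]: master_down. [Master, Running] -> [Down]
2018-02-14 18:44:41 notice : Server changed state: box03[10.132.116.161:3306]: new_master. [Slave, Running] -> [Master, Running]
2018-02-14 18:47:27 notice : Server changed state: box02[10.132.116.147:3306]: server_up. [Down] -> [Running]
2018-02-14 18:47:28 notice : Server changed state: box02[10.132.116.147:3306]: new_slave. [Running] -> [Slave, Running]
"""


def state_changes(log_text: str) -> List[StateChange]:
    """Extract every state transition found in the given log text."""
    return [StateChange(**m.groupdict()) for m in PATTERN.finditer(log_text)]


for change in state_changes(LOG):
    print(f"{change.server}: {change.before} -> {change.after} ({change.event})")
```

Read in order, the extracted transitions retell the typical use case from the rejoin slide: the master goes down, a slave is promoted, the old master comes back up, and it rejoins as a slave.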
  21. Thank you! Time for questions and answers
