Development of Fault-Tolerant Failover Tools with MySQL Utilities - MySQL Connect 2013 [CON4276]
 


The occurrence of failures and crashes can compromise the high availability of your database system, affecting your revenue and reputation. Therefore, it is fundamental to minimize downtime and have an efficient strategy for crash recovery. Replication and failover are commonly applied to deal with those situations, but what if failures occur during the recovery process? This can really be a headache, so it is better to be prepared. This session discusses the development of fault-tolerant failover solutions using the MySQL utilities library and covers the following topics:
• Issues during failover/switchover
• Fault-tolerant failover solutions
• Using the MySQL utilities library to provide your own solution


    Presentation Transcript

    • 1 Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
    • Development of Fault-Tolerant Failover Tools with MySQL Utilities
      • Dr. Paulo Jesus, Software Developer, MySQL Utilities
      • Dr. Lars Thalmann, MySQL Development Director
    • Program Agenda
      • Introduction: Faults, Fault-Tolerance and Failover
      • MySQL Failover
        • How to Handle Master Crash, Slave Crash and Connection Failures Automatically!
      • Introducing the mysqlfailover utility
        • Fault-Tolerance and Taking Advantage of its Features
      • Final Remarks
    • Introduction: Faults, Fault-Tolerance and Failover
    • Faults: Different types of faults can lead to system failures.
      • Crash: the server crashes and stops due to a cosmic ray (more likely a software or hardware malfunction)
      • Message loss: the communication channel fails
      • Byzantine: data corruption or malicious attacks
    • Fault-Tolerance: Fault-tolerance is a property that enables the system to continue operating properly when faults occur.
      • Provides high availability
      • Improves reliability
      • Reduces system downtime and revenue loss
      • Typically obtained through redundancy (example: replication)
    • Failover: Failover can be achieved by automatically switching to another server (redundant or standby) upon failure.
      • Fault-tolerant strategy:
        • Promote a slave as the new master
        • Reduce system downtime
      • Note: Switchover is the transfer of the master role in an otherwise healthy topology and is performed manually. In many cases, the original master can be returned to the topology as a slave.
    • Context: MySQL Replication (Faults, Fault-Tolerance and Failover)
      • Replication topology: master A; slaves B, C and D; asynchronous replication
      • [Diagram: master A with transactions trx1, trx2 and trx3 replicating to slaves B, C and D, which are at different points in the stream]
    • Context: MySQL Replication, slave crash ("Here comes the cosmic ray!!!")
      • Not critical...
      • Partial impact: reduced read performance
      • Replace the slave
    • Context: MySQL Replication, connection failure (master-slave connection lost)
      • Not critical...
      • Partial impact: outdated data read from the slave
      • Restore the connection
    • Context: MySQL Replication, master crash ("Here comes another cosmic ray!!!")
      • Now, it gets serious!
      • Replication stops: reads only, no more writes
      • Failover is needed
    • Context: MySQL Replication, failover process
      • Promote the "best" slave as master (here, for simplicity, the most up-to-date one)
      • Connect the remaining slaves to the new master
    • Context: MySQL Replication, Failover Is Not So Easy… Why?
      • The new master might not be the most up-to-date slave (the "best" candidate might be the one with the best hardware), but it must be at the end of failover
      • Ensure data consistency
    • MySQL Failover
    • Relay Logs (MySQL Failover): relay log details
      • Slave B: trx2
      • Slave C: <none>
      • Slave D: trx2, trx3
      • Slaves B and C did not receive trx3
      • [Diagram: master A's binlog holds trx1, trx2 and trx3; each slave has its own relay log]
    • Relay Logs (MySQL Failover): master crashed! Which slave is the most up-to-date?
      • Check slave status: SHOW SLAVE STATUS
      • Compare the master log file and position, or the retrieved GTIDs
      • Slave D is the most up-to-date
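The file-and-position comparison on this slide can be sketched in Python. This is a minimal illustration, not the utility's implementation: it assumes each slave's SHOW SLAVE STATUS row has already been fetched into a dictionary and that binary log file names end in a numeric suffix (as in mysql-bin.000003). The helper names and sample values are hypothetical.

```python
def read_coordinates(status):
    """Return (binlog index, position) that the slave's I/O thread has read.

    `status` is one SHOW SLAVE STATUS row as a dict (assumed already fetched).
    """
    # Binlog names look like "mysql-bin.000003"; compare them by numeric suffix.
    index = int(status["Master_Log_File"].rsplit(".", 1)[1])
    return (index, int(status["Read_Master_Log_Pos"]))


def most_up_to_date(statuses):
    """Return the name of the slave that has read furthest into the master's binlog."""
    return max(statuses, key=lambda name: read_coordinates(statuses[name]))


# Illustrative values only; they are not taken from the slides.
statuses = {
    "B": {"Master_Log_File": "mysql-bin.000003", "Read_Master_Log_Pos": 420},
    "C": {"Master_Log_File": "mysql-bin.000003", "Read_Master_Log_Pos": 120},
    "D": {"Master_Log_File": "mysql-bin.000003", "Read_Master_Log_Pos": 910},
}
print(most_up_to_date(statuses))  # D
```

Comparing tuples handles binlog rotation: a slave that has moved on to a later file ranks ahead regardless of position.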
    • Relay Logs (MySQL Failover): failover
      • Promote D as the new master: STOP SLAVE, CHANGE MASTER TO, START SLAVE
      • Note: CHANGE MASTER TO deletes all relay log files and starts a new one, unless relay_log_file or relay_log_pos are specified.
      • Hum... Something is not right...
    • Relay Logs (MySQL Failover): what can happen (1)
      • "D" consumes the transactions on its relay log
      • Slaves read the transactions (trx2, trx3)
      • Slaves apply the transactions
      • The SQL thread stops on "C" with an error: conflicting transactions!
    • Relay Logs (MySQL Failover): what can happen (2)
      • The relay log on the new master is deleted (RESET SLAVE)
      • trx2 is only on slave B and trx3 is lost: transactions lost!
    • Relay Logs (MySQL Failover): solution
      • Consume all transactions on the relay logs before promoting the new master
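One way to check that a candidate has consumed its relay log is to compare what its I/O thread has read with what its SQL thread has executed. The sketch below assumes the SHOW SLAVE STATUS row is already available as a dictionary; the function name is hypothetical and this is not necessarily how mysqlfailover performs the check. With GTIDs enabled, comparing Retrieved_Gtid_Set against Executed_Gtid_Set would be an alternative.

```python
def relay_log_consumed(status):
    """True when the SQL thread has applied everything the I/O thread fetched.

    `status` is one SHOW SLAVE STATUS row as a dict (assumed already fetched).
    """
    same_file = status["Relay_Master_Log_File"] == status["Master_Log_File"]
    same_pos = int(status["Exec_Master_Log_Pos"]) == int(status["Read_Master_Log_Pos"])
    return same_file and same_pos


# Illustrative row: trx2 and trx3 were fetched but not applied yet.
candidate = {
    "Master_Log_File": "mysql-bin.000003",
    "Read_Master_Log_Pos": 910,
    "Relay_Master_Log_File": "mysql-bin.000003",
    "Exec_Master_Log_Pos": 120,
}
print(relay_log_consumed(candidate))  # False
```

A failover tool would wait (or poll) until this returns True before promoting the candidate, so that no fetched transaction is discarded along with the relay log.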
    • New Master Synchronization (MySQL Failover): the new master is not the most up-to-date
      • The "best" candidate is the one with the better hardware or location
      • It is not the most up-to-date
    • New Master Synchronization (MySQL Failover): failover
      • Slave B is the "best" candidate
      • Consume relay logs
      • Promote "B" to master
      • Data consistency issue: trx3 is only on "D"!
    • New Master Synchronization (MySQL Failover): solution
      • Synchronize the candidate before promoting it: get and apply the differential changes from the most up-to-date slave, or loop through all slaves
      • The new master must become one of the most up-to-date servers before being promoted
      • Note: If the candidate is too far behind, it might take too long to synchronize... Better to choose another one?
    • Failover With or Without GTIDs MySQL Failover  Without GTIDs: – Handle files and positions  Headache to determine file and position to synchronize candidates  Headache to determine file and position for CHANGE MASTER TO  With GTIDs (from MySQL 5.6.5): – Handle GTID sets  Simple sets manipulation to determine missing transactions  Automatic: CHANGE MASTER TO ... MASTER_AUTO_POSITION = 1  No headache to determine new master’s file and position 34 Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
    • Failover With or Without GTIDs MySQL Failover  Use of GTIDs simplifies the failover task: – Removes complexity of handling files and positions – Opens the door to support more complex multi-tier replication topologies  Still a difficult task to achieve (manually): – Detecting master failure – Requires profound knowledge of MySQL replication – Is time consuming and error-prone  Wouldn’t it be great if we had a tool to do it automatically? 35 Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
    • Introducing the mysqlfailover Utility
    • Introducing the mysqlfailover Utility: a utility to report the health of a replication topology (master and its slaves) and perform automatic failover.
      • Features (several are flagged as new on the slides):
        • Configurable timeouts to detect master failures (interval and ping)
        • Configurable failover modes (auto, elect, fail)
        • Optimized failover algorithm: http://svenmysql.blogspot.pt/2013/03/flexible-fail-over-policies-using-mysql.html
        • Automatic slave discovery
        • Logging to file
        • Two operation modes: console (default) and daemon (only on POSIX platforms)
        • Support for new authentication mechanism (login-path)
        • Detect execution of multiple instances (for the same master)
        • Detect errant transactions
        • Pedantic execution mode
        • Extension points (allow execution of external scripts)
    • Introducing the mysqlfailover Utility  Requirements – All servers with version >= MySQL 5.6.9 and GTID_MODE=ON:  --report-host, --report-port, --log-slave-updated, --enforce-gtid-consistency, and --master-info-repository=TABLE  More information – http://dev.mysql.com/doc/workbench/en/mysqlfailover.html  Example of use mysqlfailover --master=m1 --slaves=s2,s1,s3 --daemon=start --log=failover.txt Note: m1, s1, s2 and s3 are login-path in the mylogin.cnf file used for security and usability purposes, avoiding the use of plain text connection strings, such as rpluser:mypass@master_host:3306. 39 Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
    • Introducing the mysqlfailover Utility: mysqlfailover in action
      • Master fault detection ("Are you alive?")
      • Gets health data from the master and slaves
      • Simple, open source, no impact on system performance!
    • Fault-Tolerance of mysqlfailover and Taking Advantage of its Features
    • Fault-Tolerance of mysqlfailover: what will be discussed
      • How do faults affect mysqlfailover?
        • Server crash
        • Connection failure
      • How to improve fault-tolerance?
        • Tips and tricks
        • Extension points
      • Errant transactions
        • Good practice to avoid them
    • Fault-Tolerance of mysqlfailover: server crash
      • Utility stopped
      • Replication topology not affected
      • Easy to start a new instance of the utility: use the --force option to overwrite the instance registration
    • Fault-Tolerance of mysqlfailover: connection failure (1)
      • Fault detection: the master is suspected to have failed
      • Failover: a slave is promoted to new master; the old master is still alive, but abandoned
    • Fault-Tolerance of mysqlfailover: connection failure (2)
      • Fault detection: the master is suspected to have failed
      • Failover: split replication topology
    • Fault-Tolerance of mysqlfailover: improve fault-tolerance
      • Run mysqlfailover close to the master, but not on it
      • Carefully set the "interval" and "ping" values
      • Monitor mysqlfailover
      • Use extension points
    • Extension Points of mysqlfailover: allow the execution of external scripts to override the behavior of the utility at specific points, using the following options:
      • --exec-fail-check: script executed on each interval to replace the default failure detection
        • E.g., customize the failure detector according to the replication topology (test the master connection through different network paths)
      • --exec-before: script to execute before starting failover
        • E.g., shut down the master's system (to ensure only one master will be online)
      • --exec-after: script to execute at the end of failover
        • E.g., change the network settings to direct client writes to the new master
      • --exec-post-failover: script to execute after completing the failover process (successfully or not)
        • E.g., send a notification to the administrator with the failover result
      • Access to variables with the master's information from scripts
        • $1: old master host, $2: old master port, $3: new master host, $4: new master port
    • Errant Transactions: a slave transaction that is not on all slaves
      • Examples: b-1 and c-1
        • Different GTIDs
        • Can be the same SQL command: CREATE TABLE table1
        • Each one exists on only one slave
      • [Diagram: master A has a-1, a-2, a-3; slave B has a-1, b-1; slave C has a-1, a-2, c-1; slave D has a-1, a-2, a-3]
    • Errant Transactions: can lead to errors during or after failover
      • Both b-1 and c-1 are CREATE TABLE table1
      • New master B cannot apply c-1, and slave C cannot apply b-1: table "table1" already exists. Conflicting transactions!
    • Errant Transactions: how to handle this problem? Avoid it!
      • mysqlfailover identifies errant transactions (mysqlfailover is clever!)
        • Error when starting the utility
        • Warning while executing, unless the --pedantic option is used
      • Can only be fixed manually: commit an empty transaction on the slaves with the GTID of the errant one
        • SET GTID_NEXT='…'; BEGIN; COMMIT; SET GTID_NEXT='AUTOMATIC';
      • Good practice: disable the binary log (SET sql_log_bin = 0) when executing transactions locally on slaves, so that they are not replicated!
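The errant-transaction example from the slides can be reproduced with a few lines of Python. This is a sketch, not mysqlfailover's detection logic: it assumes errant transactions are those a slave executed that the master never did, it uses the slides' shorthand (a:1, b:1, c:1) in place of real `server_uuid:transaction_id` GTIDs, and the function names are hypothetical.

```python
def errant_transactions(master_executed, slaves_executed):
    """Map each slave to the GTIDs it executed that the master never did."""
    return {
        name: sorted(set(gtids) - set(master_executed))
        for name, gtids in slaves_executed.items()
    }


def empty_transaction_sql(gtid):
    """Statements that commit an empty transaction under the given GTID."""
    return ("SET GTID_NEXT='{0}'; BEGIN; COMMIT; "
            "SET GTID_NEXT='AUTOMATIC';".format(gtid))


# Topology from the slides, written as "uuid:number" shorthand.
master = {"a:1", "a:2", "a:3"}
slaves = {
    "B": {"a:1", "b:1"},
    "C": {"a:1", "a:2", "c:1"},
    "D": {"a:1", "a:2", "a:3"},
}

errant = errant_transactions(master, slaves)
print(errant)  # {'B': ['b:1'], 'C': ['c:1'], 'D': []}

# Statement to run on the servers that lack b:1, so they skip it later.
print(empty_transaction_sql("b:1"))
```

Committing the empty transaction marks the GTID as already executed on that server, so the conflicting CREATE TABLE is never applied there.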
    • Final Remarks
    • Final Remarks
      • Faults can greatly impact your system (replication topology)
        • System down
        • Revenue lost
        • Negative reputation
      • Failover
        • Fault-tolerant strategy for high availability
        • Works in most situations
      • Fault detectors make mistakes, the failover process can fail, etc.
    • Final Remarks: the mysqlfailover utility
      • Simple and easy to use
      • Advanced features
        • Performs several configuration checks (to prevent failures)
        • Optimized failover algorithm
        • Configurable monitoring timeouts and operation modes
        • Extension points
      • Building block for more sophisticated and fault-tolerant tools
        • Taking advantage of extension points or the MySQL Utilities Library (Open Source)
    • By The Way… There are many other useful MySQL Utilities (some are flagged as new on the slide)
      • Database Ops: mysqldbcompare, mysqldbcopy, mysqldbexport, mysqldbimport, mysqldiff
      • Generic/Support: mysqldiskusage, mysqlfrm, mysqlindexcheck, mysqlmetagrep, mysqlprocgrep, mysqluserclone
      • High Availability: mysqlfailover, mysqlreplicate, mysqlrpladmin, mysqlrplcheck, mysqlrplshow
      • Server Ops: mysqlserverclone, mysqlserverinfo
      • Specific (Audit log): mysqlauditadmin, mysqlauditgrep
      • Usability: mysqluc
    • Do You Want More?
      • MySQL Utilities (1.3.5 GA)
        • Download: http://dev.mysql.com/downloads/tools/utilities/
        • Launchpad: https://launchpad.net/mysql-utilities
        • Docs: http://dev.mysql.com/doc/workbench/en/mysql-utilities.html
      • Contributing ideas:
        • Community users can use http://bugs.mysql.com (MySQL Workbench: Utilities)
        • Oracle customers can use bug.oraclecorp.com (Product = MySQL Workbench, Component = WBUTILS)
      • Send us an e-mail (please use sparingly):
        • Charles A. Bell, PhD (Team Lead): chuck.bell@oracle.com
        • Paulo Jesus, PhD (Developer): paulo.jesus@oracle.com
    • Questions? Paulo Jesus, paulo.jesus@oracle.com
    • Another Fault Scenario
      • Replication topology: master A; slaves B, C and D; asynchronous replication
      • trx3 did not reach any slave (yet)
    • Another Fault Scenario: master crashed
      • trx3 didn't reach any slave
      • Is failover enough to solve this?
    • Another Fault Scenario: what can we do?
      • With failover: try to get the missing transaction from the master (if possible)
      • Semi-synchronous replication helps