Webinar slides: How to Manage Replication Failover Processes for MySQL, MariaDB & PostgreSQL

Failover is the process of moving to a healthy standby component, during a failure or maintenance event, in order to preserve uptime. The quicker it can be done, the faster you can be back online. However, failover can be tricky for transactional database systems as we strive to preserve data integrity - especially in asynchronous or semi-synchronous topologies. There are risks associated, from diverging datasets to loss of data. Failing over due to incorrect reasoning, e.g., failed heartbeats in the case of network partitioning, can also cause significant harm.

This webinar replay gives a detailed overview of what failover processes may look like in MySQL, MariaDB and PostgreSQL replication setups. We cover the dangers related to the failover process and discuss the tradeoffs between failover speed and data integrity. We look at how to shield applications from database failures with the help of proxies. Finally, we look at how ClusterControl manages the failover process, and how it can be configured for both assisted and automated failover.

So if you’re looking to minimize downtime and meet your SLAs through an automated or semi-automated approach, then this webinar replay is for you!


AGENDA

- An introduction to failover - what, when, how
  - in MySQL / MariaDB
  - in PostgreSQL
- To automate or not to automate
- Understanding the failover process
- Orchestrating failover across the whole HA stack
- Difficult problems
  - Network partitioning
  - Missed heartbeats
  - Split brain
- From assisted to fully automated failover with ClusterControl
  - Demo


SPEAKER

Krzysztof Książek, Senior Support Engineer at Severalnines, is a MySQL DBA with experience managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard.


  1. How to Manage Replication Failover Processes for MySQL, MariaDB & PostgreSQL
      Presenter: Krzysztof Książek, Senior Support Engineer @Severalnines (krzysztof@severalnines.com)
      December 11th, 2018
      Copyright 2018 Severalnines AB
  2. Your host & some logistics
      I'm JJ from the Severalnines Team and I'm your host for today's webinar! Feel free to ask any questions in the Questions section of this application or via the Chat box. You can also contact me directly via the chat box or via email: jj@severalnines.com during or after the webinar.
  3. About Severalnines & ClusterControl
  5. About ClusterControl
      - Free to Download
      - Initial 30 Days Enterprise Trial
      - Reverts to Free Community Edition
      - Enterprise / Paid Versions Available
  6. ClusterControl Automation & Management
      - Deployment (Free Community)
        - Deploy a Cluster in Minutes
          - On-Prem
          - Cloud (AWS/Azure/Google) - paid
      - Monitoring (Free Community)
        - Systems View with 1 sec Resolution
        - Agentless via SSH, or agent-based with Prometheus
        - DB / OS stats & Performance Advisors
        - Configurable Dashboards
        - Query Analyzer
        - Real-time / historical
      - Management (Paid Features)
        - Backup Management
        - Upgrades & Patching
        - Security & Compliance
        - Operational Reports
        - Automatic Recovery & Repair
        - Performance Management
        - Automatic Performance Advisors
  7. Supported Databases
  8. Our Customers
  10. Agenda
      - An introduction to failover - what, when, how
        - in MySQL / MariaDB
        - in PostgreSQL
      - To automate or not to automate
      - Understanding the failover process
      - Orchestrating failover across the whole HA stack
      - Difficult problems
        - Network partitioning
        - Missed heartbeats
        - Split brain
      - From assisted to fully automated failover with ClusterControl
        - Demo
  11. An introduction to failover - what, when, how
  12. An introduction to replication failover - what, when, how
      - A switchover is the process of moving the master role to another server by promoting a slave
      - A failover is also the process of moving the master role to another server by promoting a slave, but here
        - The old master is not available or its availability is limited
        - This is the worse scenario, as you cannot assume all the slaves are in sync
      - Today we will focus on the failover process
  13. An introduction to replication failover - what, when, how
      - A failover is performed when the old master becomes unavailable. In both MySQL and PostgreSQL replication, writes have to be sent to the master, so its crash affects the whole cluster, making it unavailable
      - Importantly, you should verify master connectivity from the point of view of the slaves
        - It may happen that the monitoring node cannot reach the master while the slaves are happily replicating from it
        - Failover should be triggered only if the master is reachable neither by the application nor by the slaves
  14. Failover in MySQL
      - After a master crash you end up with one or more slaves
      - Verify that the master is indeed not reachable
      - Decide which slave is the most up to date and pick it as the master candidate
      - Ensure there are no errant transactions on the master candidate
      - Collect missing data from the master (if possible) and replay it on the master candidate
      - Reslave all remaining slaves off the new master
      - Ensure to the best of your abilities that the old master will not be started again before it can be investigated
      - Rebuild the old master as a slave using the data from the new master
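The "most up to date slave" step can be sketched as a comparison of executed replication coordinates. This is a minimal illustration, not the procedure of any particular tool: it assumes you have already collected `Relay_Master_Log_File` and `Exec_Master_Log_Pos` from `SHOW SLAVE STATUS` on each slave, and that binary log names end in a numeric suffix such as `mysql-bin.000042`. The `ReplicaStatus` type and the host names are hypothetical.

```python
from typing import List, NamedTuple


class ReplicaStatus(NamedTuple):
    """Executed master coordinates reported by one slave (hypothetical container)."""
    host: str
    relay_master_log_file: str  # e.g. "mysql-bin.000042"
    exec_master_log_pos: int


def binlog_sequence(file_name: str) -> int:
    """Return the numeric suffix of a binary log file name."""
    return int(file_name.rsplit(".", 1)[1])


def most_advanced(replicas: List[ReplicaStatus]) -> ReplicaStatus:
    """Pick the slave that has executed the furthest position in the master's binlogs."""
    if not replicas:
        raise ValueError("no replicas to choose from")
    return max(
        replicas,
        key=lambda r: (binlog_sequence(r.relay_master_log_file), r.exec_master_log_pos),
    )


replicas = [
    ReplicaStatus("db2", "mysql-bin.000041", 98000),
    ReplicaStatus("db3", "mysql-bin.000042", 1200),
    ReplicaStatus("db4", "mysql-bin.000042", 800),
]
candidate = most_advanced(replicas)
```

The file sequence is compared before the position, because a position is only meaningful within one binary log file.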
  15. Failover in PostgreSQL
      - After an active server crash you end up with one or more standby servers
      - Verify that the active server is indeed not reachable
      - Find the most advanced standby server
      - Trigger the failover using either pg_ctl promote or the trigger_file
      - pg_rewind for remaining standby servers to bring them in sync with the new master
      - Reslave remaining standby servers to the new master
      - Ensure to the best of your abilities that the old master will not be started again before it can be investigated
      - Rebuild the old master as a slave using the data from the new master
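Finding the most advanced standby comes down to comparing WAL positions (LSNs). The sketch below assumes the LSN of each standby has already been collected as text, for example from `pg_last_wal_replay_lsn()` on PostgreSQL 10 or later; the host names and the helper functions are hypothetical. An LSN is printed as two hexadecimal numbers separated by a slash, so it has to be converted to an integer before comparing. A plain string comparison would rank `0/9000000` above `0/10000000`, which is wrong.

```python
from typing import Dict


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000060' into a comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced_standby(lsn_by_host: Dict[str, str]) -> str:
    """Return the host whose reported LSN is the furthest ahead."""
    if not lsn_by_host:
        raise ValueError("no standby servers to choose from")
    return max(lsn_by_host, key=lambda host: lsn_to_int(lsn_by_host[host]))


standbys = {"pg2": "0/9000000", "pg3": "0/10000000", "pg4": "0/8FFFFFF"}
candidate = most_advanced_standby(standbys)
```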
  16. To automate or not to automate?
  17. To automate or not to automate?
      - As shown in the last two slides, a failover requires a couple of steps to be performed
        - As usual, the more steps there are and the more complex they are, the higher the chance of human error
      - Scripts can easily perform all the required tasks, run all the checks, and do so much faster and more reliably than a human can
      - Scripts are only as smart as we wrote them, though. Humans tend to be more flexible and can handle unpredictable situations better
      - Should we automate the failover or not? That’s the question!
      - Let’s go through some pros and cons of automated failover
  18. To automate or not to automate?
      - Pros
        - Much faster reaction to the issue
        - Higher reliability in typical situations
        - When configured correctly, may handle the majority of cases properly
        - Reduces on-call burnout: even though you page your staff, it’s not as critical given that the systems are up and running
      - Cons
        - Limited situational awareness: does not understand the larger picture (or understands only what has been coded in)
        - Decisions made are not always correct
        - Requires intensive tests to ensure reliability
        - Has to be maintained (if it is your own script)
  19. To automate or not to automate?
      - The main differentiating factors are the reaction time and the lack of situational awareness
      - Automated failover will be faster but may take actions a user would not take
      - But the logic can be improved, and safety features like white/blacklists can be used in an attempt to reduce incorrect behaviour
      - Better visibility can also be implemented:
        - Access tests through multiple hosts (slaves, proxies)
        - Utilising a clustering protocol like Raft or Paxos for network split detection
      - Don’t expect automated failover to correctly cover 100% of the cases, though
      - A third way may also be applicable - assisted failover
        - Does everything automatically but is initiated by the user, after the initial assessment
  20. Understanding the failover process
  21. Understanding the failover process - Ensure that the master is indeed down
      - Ensuring that the master is indeed down is critical
      - You never want to run two writable masters at the same time!
      - You may want to implement some sort of STONITH (Shoot The Other Node In The Head) to ensure the dead master will stay dead
      - You can leverage data from multiple sources. Are slaves replicating? Do proxies see the master?
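Combining several points of view can be sketched as a simple unanimity rule: the master is treated as down only when every observer that was asked reports it unreachable. This is one possible policy under stated assumptions, not the logic of any specific failover tool; the observer names and the way reachability is collected are hypothetical.

```python
from typing import Dict


def master_confirmed_down(master_reachable_from: Dict[str, bool]) -> bool:
    """Return True only if every observer reports the master as unreachable.

    The argument maps an observer name (monitoring node, slave, proxy) to
    True when that observer can still see the master.
    """
    if not master_reachable_from:
        # No evidence at all is not evidence of a failure.
        return False
    return not any(master_reachable_from.values())


# The monitoring node lost the master, but a slave is still replicating from it.
partial_outage = {"monitor": False, "slave-1": True, "proxy-1": False}

# Nobody can see the master any more.
full_outage = {"monitor": False, "slave-1": False, "proxy-1": False}
```

With `partial_outage` the function returns `False`, which matches the case of a monitoring node that is cut off while the slaves keep replicating.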
  22. Understanding the failover process - Pick the correct slave as the master candidate
      - Picking the correct slave as the master candidate is critical
      - You want to use the most advanced slave to avoid data loss
      - You want to ensure there are no errant transactions (in a GTID setup)
      - You want to allow the slave to apply the events from its relay logs (as long as it does not take too long)
      - You want to try to reach the master to see if there are non-replicated binary log events
        - A master failure does not always mean you cannot SSH there and parse binlogs for missing transactions
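An errant transaction check in a GTID setup amounts to a set difference: anything in the candidate's executed GTID set that a reference set does not contain is suspect. MySQL offers `GTID_SUBTRACT()` for this on the server side. The sketch below is a simplified client-side equivalent, assuming the `gtid_executed` values have been fetched as strings; the server UUIDs are shortened placeholders, and expanding intervals into Python sets is only reasonable for small examples.

```python
from typing import Dict, List, Set


def parse_gtid_set(text: str) -> Dict[str, Set[int]]:
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: {transaction ids}}."""
    result: Dict[str, Set[int]] = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ids = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            ids.update(range(int(start), int(end or start) + 1))
    return result


def errant_transactions(candidate: str, reference: str) -> Dict[str, List[int]]:
    """Return transactions present on the candidate but missing from the reference."""
    cand = parse_gtid_set(candidate)
    ref = parse_gtid_set(reference)
    errant: Dict[str, List[int]] = {}
    for uuid, ids in cand.items():
        extra = ids - ref.get(uuid, set())
        if extra:
            errant[uuid] = sorted(extra)
    return errant


# 'aaaa' stands in for the master's server UUID, 'bbbb' for the candidate's own.
found = errant_transactions("aaaa:1-10,bbbb:1-2", "aaaa:1-12")
```

Here `found` contains the two transactions that originated on the candidate itself, which is the typical signature of a write executed directly on a slave.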
  23. Understanding the failover process - Correct usage of whitelists and blacklists
      - Correct usage of whitelists and blacklists is critical
      - You may not want to promote just any slave that you have
      - Better to stay within the same datacenter to avoid a split brain scenario with two masters
      - Better to stay within the same datastore version for compatibility reasons
      - Better to stay on the same hardware for performance reasons
      - While executing a failover, use the standard procedures for marking masters and slaves
        - read_only and super_read_only = 0 or 1?
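Whitelist and blacklist filtering can be sketched as a small function applied before candidate selection. The semantics below are an assumption chosen for illustration (an empty whitelist means "no restriction", and the blacklist always wins); real tools may define them differently, and the host names are hypothetical.

```python
from typing import Iterable, List, Optional


def eligible_candidates(
    slaves: Iterable[str],
    whitelist: Optional[Iterable[str]] = None,
    blacklist: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the slaves that are allowed to be promoted, in their original order."""
    blocked = set(blacklist or [])
    allowed = set(whitelist) if whitelist else None  # None means no whitelist is set
    return [
        host
        for host in slaves
        if host not in blocked and (allowed is None or host in allowed)
    ]


slaves = ["dc1-db2", "dc1-db3", "dc2-db1", "dc1-backup"]

# Keep promotion inside datacenter 1 and never promote the backup host.
candidates = eligible_candidates(
    slaves,
    whitelist=["dc1-db2", "dc1-db3", "dc1-backup"],
    blacklist=["dc1-backup"],
)
```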
  24. Understanding the failover process - Pre- and post-failover actions
      - The automated failover process can sometimes be augmented by the use of pre- or post-failover actions
      - Do you want to perform some action when the master fails?
      - Do you need to reconfigure some application when a new master is promoted?
      - Do you want to remove the old master entry from your Consul key/value store?
      - Most of the main tools that support failover handling also support pre- and post-failover actions
        - MHA
        - Orchestrator
        - ClusterControl
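The idea of pre- and post-failover actions can be sketched as hooks wrapped around the promotion step. This is a generic illustration only and does not reflect how MHA, Orchestrator or ClusterControl implement their hooks. It assumes a pre-failover hook returns `False` to veto the failover, and that post-failover hooks receive the name of the new master; `promote` and the hooks are hypothetical callables.

```python
from typing import Callable, Iterable, Optional


def run_failover(
    promote: Callable[[], str],
    pre_hooks: Iterable[Callable[[], bool]] = (),
    post_hooks: Iterable[Callable[[str], None]] = (),
) -> Optional[str]:
    """Run pre-hooks, promote a new master, then run post-hooks.

    Returns the new master's name, or None if a pre-hook vetoed the failover.
    """
    for hook in pre_hooks:
        if not hook():
            return None  # a pre-failover check failed, do not promote
    new_master = promote()
    for hook in post_hooks:
        hook(new_master)  # e.g. update service discovery or notify the application
    return new_master


notified = []
result = run_failover(
    promote=lambda: "db2",
    pre_hooks=[lambda: True],
    post_hooks=[notified.append],
)
```

Running the pre-hooks before anything is changed keeps a veto cheap: nothing has to be rolled back if a check fails.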
  25. Orchestrating failover across the whole HA stack
  26. Orchestrating failover across the whole HA stack
      - Databases do not exist in a vacuum; they are surrounded by other services to create a highly available environment
      - Proxies need a way to distinguish between the master and a slave
        - In PostgreSQL streaming replication this is typically the existence of a recovery.conf file
        - In MySQL it can be, for example, the value of read_only and super_read_only: 1 or 0
      - When a failover is happening, you have to make sure you manage the variable’s value correctly
        - You don’t want load balancers to send traffic to your databases while a failover is happening
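How a proxy might derive roles from `read_only` can be sketched as follows. The function is a hypothetical illustration rather than the logic of any particular proxy: it assumes the `read_only` value of each MySQL host has already been collected, and it stops routing entirely while a failover is flagged as in progress.

```python
from typing import Dict, List


def routing_table(
    read_only_by_host: Dict[str, int],
    failover_in_progress: bool = False,
) -> Dict[str, List[str]]:
    """Classify hosts as writers (read_only=0) or readers (read_only=1)."""
    if failover_in_progress:
        # Send no traffic while roles are being changed.
        return {"writers": [], "readers": []}
    writers = sorted(h for h, ro in read_only_by_host.items() if ro == 0)
    readers = sorted(h for h, ro in read_only_by_host.items() if ro != 0)
    return {"writers": writers, "readers": readers}


state = {"db1": 0, "db2": 1, "db3": 1}
normal = routing_table(state)
during_failover = routing_table(state, failover_in_progress=True)
```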
  27. Orchestrating failover across the whole HA stack
  28. Orchestrating failover across the whole HA stack
  29. Orchestrating failover across the whole HA stack
      - All load balancers deployed by ClusterControl follow those rules
        - recovery.conf file on PostgreSQL
        - read_only value on MySQL
      - ClusterControl ensures that the values in MySQL are set according to the stage of the process
        - In a switchover, the master is demoted through read_only=1. In a failover this cannot be done
        - Still, read_only=1 is set in the MySQL configuration on all nodes to minimise the chance of the old master returning as a writable host
        - The new master is marked with read_only=0
      - This process works but it does not cover all situations
  30. Difficult problems
  31. Difficult problems - network issues
      - Networks can be unstable and packets may be lost in transfer
      - Replication itself is robust and will work quite well even if there are network problems
      - Health checks performed over the replication also have to take such conditions into consideration
      - Make sure you do not take any action based on just a single health check
      - Make sure you do not take any action based on just a single host’s point of view
      - Expect network problems and try to understand their severity before any action is taken
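Not acting on a single health check can be sketched as a counter of consecutive failures. The threshold of three is an arbitrary assumption for illustration, and the `HealthTracker` class is hypothetical; a real setup would combine this with checks from more than one host.

```python
class HealthTracker:
    """Report a failure only after several health checks have failed in a row."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True once the threshold is reached."""
        if check_ok:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


tracker = HealthTracker(threshold=3)
# One lost packet followed by a success does not count as an outage.
verdicts = [tracker.record(ok) for ok in [False, True, False, False, False]]
```

Only the last check in the sequence pushes the streak to three, so a single lost heartbeat never triggers an action on its own.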
  32. Difficult problems - network split
      - Every cluster type has its own problems. For MySQL and PostgreSQL replication, one of the biggest issues is the lack of cluster awareness and the lack of quorum support
      - Replication clusters are prone to network split issues
      - Automated topology detection by proxies can make things even more tricky
      - There’s no easy, standard way to avoid this problem
  33. Difficult problems - network split
      - A network split happens when there’s a lack of connectivity between one part of the cluster and the other part
        - For example, the master cannot reach the slaves and the slaves cannot reach the master
      - The master is unavailable, therefore the cluster cannot handle writes
        - Failover should be performed to restore the cluster’s ability to handle traffic
      - The master is still running though, so when the networks converge two writable hosts will show up
      - Standard topology detection logic will not be enough
        - Two nodes will have read_only=0, two nodes will not have the recovery.conf file
        - Without additional measures to ensure the old master won’t get the traffic, a split brain is imminent
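One additional measure on the proxy side can be sketched as a guard that refuses to route writes unless exactly one host claims to be writable. This is a hypothetical policy for illustration, assuming the `read_only` values have already been collected per host; it limits the damage after a network split but does not replace fencing the old master.

```python
from typing import Dict, Optional


def safe_write_target(read_only_by_host: Dict[str, int]) -> Optional[str]:
    """Return the single writable host, or None if there are zero or several."""
    writable = sorted(host for host, ro in read_only_by_host.items() if ro == 0)
    if len(writable) != 1:
        # Zero writers: nothing to route to. Several writers: possible split brain.
        return None
    return writable[0]


healthy = {"db1": 0, "db2": 1, "db3": 1}

# The old master db1 came back after db2 had been promoted.
after_split = {"db1": 0, "db2": 0, "db3": 1}
```

With `after_split` the guard returns `None`, so writes stop instead of being spread across two masters.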
  34. Difficult problems - split brain
      - Split brain is a condition in which two writable nodes take the traffic and, as a result, their data sets drift apart
      - There’s no easy way to recover from such a condition
        - Shut down the rogue master as soon as possible to minimise the data drift
        - Manual action will be required to converge the data sets
      - Make sure that whatever solution you choose, it works
        - You can do better than GitHub!
  35. Difficult problems - split brain
  36. Difficult problems - how to avoid them?
      - There are numerous ways in which you can reduce (but not eliminate) the impact of network issues and the probability that your data will be affected by them
      - Collect as much data as possible about the state of the replication topology before an action is taken
        - Utilize multiple nodes as points of view on the topology
      - Try to implement STONITH to reduce the chance that the old master will show up
        - Some kind of Lights-Out solution (iLO for example) might work in a physical environment
        - Kill scripts (destroy a given virtual instance) may work in the cloud
      - Modify the configuration of the proxies to remove the old master after it’s deemed dead
      - No solution will be 100% bulletproof
        - You may not be able to reach all the proxies, the node itself or the cloud service to kill the master
  37. Demo
  38. End of Year Promotion
      Get Three Months Free (25% in Savings) with an Annual Contract. Just Sign by December 20th!
  39. Thank you!
      - Blogs that cover failover:
        - https://severalnines.com/blog/introduction-failover-mysql-replication-101-blog
        - https://severalnines.com/blog/failover-postgresql-replication-101
        - https://severalnines.com/blog/how-control-replication-failover-mysql-and-mariadb
        - https://severalnines.com/blog/controlling-replication-failover-mysql-and-mariadb-pre-or-post-failover-scripts
      - To automate or not to automate?
        - https://severalnines.com/blog/failover-mysql-replication-and-others-should-it-be-automated
      - Contact: jj@severalnines.com
