
Webinar slides: How to Measure Database Availability?

Database availability is notoriously hard to measure and report on, although it is an important KPI in any SLA between you and your customer. We often define availability in terms of 9’s (e.g. 99.9% or 99.999%), although there is often a lack of understanding of what these numbers might mean, or how we can measure them.
Is the database available if an instance is up and running, but it is unable to serve any requests? Or if response times are excessively long, so that users consider the service unusable? Is the impact of one longer outage the same as multiple shorter outages? How do partial outages affect database availability, where some users are unable to use the service while others are completely unaffected?
Not agreeing on precise definitions with your customer might lead to dissatisfaction. The database team might be reporting that they have met their availability goals, while the customer is dissatisfied with the service. In this webinar, we will discuss the different factors that affect database availability. We will then see how you can measure your database availability in a realistic way.

AGENDA
- Defining availability targets
  - Critical business functions
  - Customer needs
  - Duration and frequency of downtime
  - Planned vs unplanned downtime
  - SLA
- Measuring the database availability
  - Failover/Switchover time
  - Recovery time
  - Upgrade time
  - Query latency
  - Restoration time from backup
  - Service outage time
- Instrumentation and tools to measure database availability:
  - Free & open-source tools
  - ClusterControl's Operational Report
  - Paid tools

SPEAKER
Bartlomiej Oles is a MySQL and Oracle DBA with over 15 years of experience in managing highly available production systems at IBM, Nordea Bank, Acxiom, Lufthansa, and other Fortune 500 companies. In the past five years, his focus has been on building and applying automation tools to manage multi-datacenter database environments.



  1. April 2018. How to Measure Database Availability? Presenter: Bart Oleś, Support Engineer (bart@severalnines.com)
  2. Your host & some logistics. Copyright 2017 Severalnines AB. I'm Jean-Jérôme from the Severalnines Team and I'm your host for today's webinar! Feel free to ask any questions in the Questions section of this application or via the Chat box. You can also contact me directly via the chat box or via email: info@severalnines.com during or after the webinar.
  3. About Severalnines and ClusterControl. Copyright 2018 Severalnines AB.
  4. What We Do: Manage, Scale, Monitor, Deploy
  5. ClusterControl Automation & Management
     - Management: Multi-Cluster / Multi-DC; Automate Repair & Recovery; Database Upgrades; Backups; Configuration Management; Database Cloning; One-Click Scaling
     - Deployment: Deploy a Cluster in Minutes; On-Premises or in the Cloud (AWS)
     - Monitoring: Systems View with 1sec Resolution; DB / OS stats & Performance Advisors; Configurable Dashboards; Query Analyzer; Real-time / historical
  6. Supported Databases
  7. Our Customers. Copyright 2012 Severalnines AB.
  8. April 2018. How to Measure Database Availability? Presenter: Bart Oleś, Support Engineer (bart@severalnines.com)
  9. Agenda
     - Introduction
     - Defining availability targets
     - Database Availability - What to measure?
     - Database Availability - How to measure?
  10. Introduction
  11. What is a high availability database? High availability databases use an architecture that is designed to continue to function normally even after hardware, software or network failures.
  12. Example high availability architecture
  13. Outage visibility
  14. Availability vs reliability
     - Availability: a measure of the percentage of time a service is in a usable state. Also measured in 9s.
     - Reliability: a measure of the probability of the service being in a usable state for a period of time. Measured as MTBF (Mean Time Between Failures) and the failure rate.
  15. Duration and frequency of downtime
     - Planned downtime is for scheduled upgrades and routine maintenance of hardware and software.
     - Unplanned downtime is when your systems crash unexpectedly, usually due to hardware/software failure, natural disaster or human error.
     - Scheduled downtime does not count towards availability, but may impact customer satisfaction metrics. There are some exceptions, such as telcos, 911, ...
  16. Connecting availability and reliability
     - During a year, a database goes down for one hour: Availability = 99.99% (or four nines), Reliability (MTBF) = 8759 hours.
     - Percentages of a particular order of magnitude are sometimes referred to by the number of nines or "class of nines" in the digits.
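The arithmetic behind slide 16 can be checked with a minimal sketch. It assumes a non-leap year of 8760 hours and takes MTBF as uptime divided by the number of failures; the function names are illustrative and not from the webinar.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumption: non-leap year

def availability_pct(period_hours, downtime_hours):
    """Share of the period during which the service was up, in percent."""
    return 100 * (period_hours - downtime_hours) / period_hours

def mtbf_hours(period_hours, downtime_hours, failures):
    """Mean time between failures: total uptime divided by failure count."""
    return (period_hours - downtime_hours) / failures

# One outage lasting one hour during the year, as on the slide.
print(round(availability_pct(HOURS_PER_YEAR, 1), 2))  # 99.99
print(mtbf_hours(HOURS_PER_YEAR, 1, 1))               # 8759.0
```

The unrounded availability is about 99.9886%, which is why the slide rounds it to four nines.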
  17. Availability as a percentage. Source: https://en.wikipedia.org/wiki/High_availability
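To relate a "number of nines" to a concrete downtime budget, a small sketch like the following can help. It assumes a 365-day year, so its results may differ slightly from tables that use a different year length; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

def allowed_downtime_minutes(availability_pct, period_hours=HOURS_PER_YEAR):
    """Minutes of downtime permitted in the period at the given availability."""
    return period_hours * 60 * (100 - availability_pct) / 100

# Print the yearly downtime budget for a few common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/year")
```

Passing a different `period_hours` (for example 30 * 24) gives the budget for a monthly reporting period instead.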
  18.
  19. Defining availability targets
  20. Why measure availability? The need for availability is governed by business objectives, and the primary goal of its measurement is:
     - To provide an availability baseline
     - To help identify where to improve the systems
     - To monitor and control improvement projects
  21. Calculating availability. AST: Agreed Service Time. DT: Downtime. If AST is 100 hours and downtime is 2 hours then the availability would be:
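A minimal sketch of this calculation, assuming the common definition availability = (AST - DT) / AST x 100 that the slide's variable names suggest. The list-of-outages input is an illustrative addition: it shows that one long outage and several short ones produce the same percentage.

```python
def availability_pct(ast_hours, outage_hours):
    """Availability in percent from Agreed Service Time and a list of outages."""
    dt = sum(outage_hours)  # DT: total downtime within the agreed service time
    return 100 * (ast_hours - dt) / ast_hours

one_long_outage = availability_pct(100, [2.0])          # a single 2-hour outage
many_short_outages = availability_pct(100, [0.25] * 8)  # eight 15-minute outages

print(one_long_outage)     # 98.0
print(many_short_outages)  # 98.0
```

Both patterns report 98%, although users are likely to experience them very differently.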
  22. Calculating availability. The trouble with this is that, while this calculation is easy enough to perform, and collecting the data to do it seems straightforward, it's really not at all clear what the number you end up with is actually telling you. Define customer needs and availability targets!
  23. Another method of calculating availability. In this example, you would calculate the availability as:
  24. Classifying business functions by criticality. Identify the critical business functions of your business. Classify these critical business functions into the following categories: high, medium, and low. Complete the critical business functions chart with each critical business function.
     - Chart columns: Function; Criticality; Maximum Downtime; Person/Team; Required Resources; Impacted Functions; Brief process to complete functions
     - Example: Insurance claims; High; 2 Days; DBA Team 1; 10 employees, claim mgmt software, paper forms; Claims assessing, filing; Take calls, document in system, file
     - Example: Open new savings act.; Low; 1 Week; DBA Team 2; 1 employee, account mgmt software; New accounts; Customer completes form onsite
  25. Mean Time Between Failure (MTBF)
  26. Mean Time To Failure (MTTF)
  27. Service Level Agreement (SLA). The SLA is a contract negotiated and agreed between a customer and a service provider.
  28. SLA objectives and lifecycle
     - Objectives: Service description; Reliability; Responsiveness; Procedure for reporting problems; Monitoring & reporting service level; Consequences for not meeting service obligations; Escape clauses or constraints
     - Lifecycle: 1. Select service provider; 2. Define SLA; 3. Establish agreement; 4. Monitor SLA violation; 5. Terminate SLA; 6. Enforce penalties for SLA violation
  29. SLA - Lifecycle: 1. Select Service Provider; 2. Define SLA; 3. Establish Agreement; 4. Monitor SLA Violation; 5. Terminate SLA; 6. Enforce Penalties for SLA Violation
  30. SLA common mistakes. Do not:
     - Allow the service level agreement to become a marketing document.
     - Leave preparation of the Service Level Agreement until the last minute.
     - Have service levels without a compensation regime of some sort.
     - Have overly long service level measurement periods.
     - Lose sight of your objectives.
  31. Database Availability - What to measure?
  32. Outage timeline
  33. Failure detection
     - Check frequency: heartbeat check; number of occurrences, counters; timeouts
     - Notification delay
     - Dashboard
     - Service desk response time
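The factors on the failure detection slide add up to the time between a failure and someone being told about it. The sketch below is a simplified model, not something presented in the webinar: it assumes the failure happens just after a successful check, that a fixed number of consecutive failed checks is required before alerting, and that the check timeout is no longer than the check interval. All parameter names are hypothetical.

```python
def worst_case_detection_seconds(check_interval, failed_checks_required,
                                 check_timeout, notification_delay):
    """Upper bound on the time from failure to notification under this model."""
    # The last required failed check starts after `failed_checks_required`
    # intervals, runs until its timeout expires, and then the alert is sent.
    return (check_interval * failed_checks_required
            + check_timeout
            + notification_delay)

# Example: checks every 10 s, 3 failures required, 5 s timeout, 30 s to notify.
print(worst_case_detection_seconds(10, 3, 5, 30))  # 65
```

Service desk response time would be added on top of this figure, since it begins only after the notification arrives.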
  34. Designing failover mechanisms. Failover is the operational process of switching between primary and secondary systems or system components in the event of failure. When designing failover mechanisms, organizations generally calculate:
     - RTO (Recovery Time Objective)
     - RPO (Recovery Point Objective)
  35. RTO & RPO
     - RTO (Recovery Time Objective): time period within which service must be restored to avoid unacceptable consequences.
     - RPO (Recovery Point Objective): maximum tolerable period in which data may be lost. RPO defines how much data an organisation can afford to lose. Based on this, optimum backup frequency and recovery speed can be determined.
  36. Defining RTO & RPO
     - RTO, time to: recall backup media; travel (on-call engineers); bring up infrastructure; restore data; bring up services; configure application; test and validate.
     - RPO: guaranteed last restorable point (PITR) (DEMO); delayed replication (DEMO).
  38. 38. Failure handling - replication Copyright 2018 Severalnines AB ● Failure Detection ● Pre-failover - find most advanced slave - wait until replication lag - failover master ● Post-failover - update application connection (or use proxy) - re-slave to new master Additionally: How much data you can lose Master (RW) Slave (RO) A B
  39. 39. Failure handling - Galera cluster Copyright 2018 Severalnines AB Reads/Writes Reads/Writes A B Reads/Writes *https://severalnines.com/blog/using-galera-replication-window-advisor-avoid-sst • Single node failure leads to partial app outage • SST vs storage snapshot • Non-blocking donor node & performance impact • Bootstrap time ○ Determining the most advanced node ○ Bootstrap process • IST & Galera cache size (Replication Window*) C
  40. 40. Failure handling - Load balancers Copyright 2018 Severalnines AB ● Need to be able to handle transaction failures and retry them. ● Ability to check the health of the database servers. ● Keepalived & VIP failover. Benchmarked failover times*: ProxySQL 1.4.6 : 11 seconds HAProxy 1.5.14 : 12 seconds MaxScale 2.1.9 : 15 seconds Load Balancer *https://severalnines.com/blog/comparing-database-proxy-failover-times-proxysql-maxscale-and-haproxy Node A Node B
  41. 41. Failure handling - InnoDB recovery time Copyright 2018 Severalnines AB ● Checkpoint interval ● Size of the logs ● Data Access Locality ● Database size ● Buffer Pool Size ● Number of dirty buffers during the crash
  42. 42. Upgrade time Copyright 2018 Severalnines AB ● Size of the database ● Backup time ● Buffer pool size It can be minimised with: ● Rolling restart (in case of distributed setup) ● Upgrade combined with replication switchover (DEMO)
  43. 43. Query latency Copyright 2018 Severalnines AB Mysql users have a number of options for monitoring query latency (DEMO): Performance schema events_statements_summary_by_digest Sys schema sys schema provides an organized set of metrics in a more human-readable format: SELECT * FROM sys.statements_with_runtimes_in_95th_percentile; Slow queries SHOW VARIABLES LIKE 'long_query_time';
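If query latencies are also sampled on the client side, the same two ideas (a 95th percentile and a slow-query threshold) can be applied to decide whether the service was usable. The sketch below is an illustration that does not query MySQL: the sample data is made up, the function names are hypothetical, and the 10-second threshold mirrors MySQL's default long_query_time.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def slow_fraction(samples, threshold_seconds):
    """Share of queries that ran longer than the threshold."""
    return sum(1 for s in samples if s > threshold_seconds) / len(samples)

# Made-up latencies in seconds: 18 fast queries, one slower, one very slow.
latencies = [0.05] * 18 + [0.5, 12.0]

print(percentile(latencies, 95))        # 0.5
print(slow_fraction(latencies, 10.0))   # 0.05
```

A latency-aware availability figure could then count any interval in which the slow fraction exceeds an agreed limit as downtime, which is a definition to settle with the customer rather than a fixed rule.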
  44. Restoration time from a backup
     - What impacts RTO: database size; network throughput; backup type; standalone or cluster
     - Type of failure: backup type (logical, physical, disk snapshot); partial restore on single node (DEMO); cluster restore and bootstrap; datacenter
  45. Service outage time. Other services that can affect the database:
     - Networking
     - OS upgrade
     - Disk resize or other system maintenance
     - Application upgrade
     - Note: define separately if not within control of the database team
  46. Instrumentation and tools to measure database availability
  47. Open-source and paid tools: Nagios; ClusterControl Community; Zabbix; PMM; Grafana; Cacti; OpenNMS; Icinga; Oracle Enterprise Manager; Monyog; MongoDB Ops Manager; ClusterControl Enterprise
  48. ClusterControl Operational Report. The idea behind creating Operational Reports is to put all of the most important data into a single document, which can be quickly reviewed to get an understanding of the state of the databases.
     - Availability Summary
     - Cluster - Availability Details
     - Cluster State History
  49. Q & A
  50. Additional Resources
     - Repair and recovery for your MySQL, MariaDB and MongoDB Clusters
     - Designing Open Source Databases for High Availability
     - HA & Load Balancing Tutorials
     - Download ClusterControl
     - Contact us: info@severalnines.com
