Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Tech Talk Series, Part 4: How do you achieve high availability in a MySQL environment?

495 views

Published on

For high-value, high-throughput sites, downtime can cost hundreds of thousands to millions of dollars. Service architectures have baked lots of resiliency into apps, but databases and their system of record design are often vulnerable to single points of failure, bringing down entire systems. Worse still, when the database is recovered, there can be missing data. How many database transactions can your workload handle losing if your primary database goes down?

There are many strategies to minimize MySQL downtime, usually using replication and redundant hardware. Often these systems involve some manual intervention and potential downtime as failover protocols take hold. Also, these strategies may be expensive and require redundant hardware.

At Clustrix, we think there are alternative strategies that may be a better fit for modern apps in a MySQL environment.

In our final Tech Talk in this series on scaling MySQL, we evaluate multiple HA strategies. We also discuss the following topics:

- The difference between fault tolerance and high availability
- Best practices for achieving high availability with MySQL
- What are the costs of achieving HA? What can be the most cost-effective strategy?
- How is it possible to survive a multi-node failure in MySQL?

View the webcast of this Tech Talk on our YouTube channel.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Tech Talk Series, Part 4: How do you achieve high availability in a MySQL environment?

  1. 1. Flexible transactional scale for the connected world. Challenges to Scaling MySQL: Best Practices for Creating High Availability Dave A. Anselmi @AnselmiDave Director of Product Management
  2. 2. Questions for Today PROPRIETARY & CONFIDENTIAL 2 o What is high availability and when is it needed? o What’s the difference between high availability and fault tolerance? o How is it possible to survive a multi-node failure in MySQL? o What are the best practices for achieving high availability with MySQL? o What are the costs of achieving HA? What can be the most cost-effective strategy??
  3. 3. HA: As You Scale, Your Exposure GrowsSCALE (GROWTH/SUCCESS) T I M E LAMP Stack AWS, Azure, RAX, GCE, etc Private Cloud REACH LIMIT App too slow; Lost users REACH LIMIT (AGAIN) App too slow; Lost users Migrate to Bigger Machine • Read slaves, then sharding, etc: • Add more hardware & DBAs • Refactor code /hardwired app  More Expensive  Higher Risk  Lost Revenue ONGOING: • Refactoring hardware • Data balancing • Shard maintenance REPEAT Migrate to Bigger Machine PROPRIETARY & CONFIDENTIAL 3
  4. 4. What do we mean by “High Availability”?
  5. 5. Availability by the 9s PROPRIETARY & CONFIDENTIAL 5https://www.percona.com/blog/2016/06/07/choosing-mysql-high-availability-solutions/
  6. 6. “High Availability” –vs- “Fault Tolerance” o High Availability – Minimize system downtime – Trade-off between as high “9s” level as can be budgeted – Goal: least amount of data loss possible o Fault Tolerance – System cannot go down – Arrays of redundant hardware, and automated failover systems – Cost is very high ORCL: – A high availability system minimizes the time when the system is down, or unavailable and maximizes the time when it is running, or available. IBM: – A fault tolerant environment has no service interruption but a significantly higher cost, – A highly available environment has a minimal service interruption. PROPRIETARY & CONFIDENTIAL 6
  7. 7. Fault Tolerance –vs– High Availability o Fault Tolerance: – Failover application processes, including heartbeat – Shared storage layer, multiple participants PROPRIETARY & CONFIDENTIAL 7 o High Availability: – Multiple redundant shared- nothing servers – Replication to keep in sync
  8. 8. High Availability rather than Fault-Tolerance o MySQL systems are rarely fault tolerant – High cost of fault tolerance is prohibitive o Most MySQL systems use replication for HA/DR o Galera isn’t fault tolerant – Certification replication provides synchronous replication between nodes – Availability is enforced over consistency: the write-set can be committed on the local node before the rest of the cluster has committed (Jepsen) PROPRIETARY & CONFIDENTIAL 8
  9. 9. Challenges to Deploying HA Systems
  10. 10. Four Challenges to HA 1. Tech/DevOps always wants HA: 1. Throughput & uptime is their core metric/KPIs 2. Business/Finance demands justification of the costs: 1. Redundant servers reflect underutilized resources 2. Redundant servers are considered “wasted budget” 3. Cloud/PaaS/IaaS can imply more HA than they provide – “Architected for 11x 9s” 4. Tension between scale and HA 1. Ideally, each new server would provide scale and redundancy 2. In practice, result is mixed; so choice is usually for scale PROPRIETARY & CONFIDENTIAL 10
  11. 11. Realities of HA in the Cloud o “Promise” –vs- “Reality” of the cloud – Promise of the cloud: web scale – Reality of the cloud: TANSTAAFL o “Doesn’t the cloud provide HA automatically?” – MBAs: literally taught “DevOps just wants to spend $$” • “We don’t need redundancy: we’re on the cloud, & the cloud is 5x 9s, right?” • “S3 is architected for 11x 9s, right?” • “We’re on Amazon, it’s backed up” o MUST deploy redundant hardware in the cloud – If it’s not on your bill, you haven’t provisioned it o “Success” of AWS Marketing  Exposed Workload – 2/28/2017 4 Hour S3 outage: Even though “the cloud” has lots of hardware that does NOT mean your systems are fault tolerant, let alone HA PROPRIETARY & CONFIDENTIAL 11
  12. 12. “Obvious” Critical Workloads needing HA o E-Commerce – Black Friday/Cyber Monday, Single’s Day, Back to School, flash sales, etc – 80% of Revenue in 2 months – Provisioning > 3x capacity for 2 months o Finance – System of Record – “Money changing hands” o Healthcare – “Life/death decisions” & DSS PROPRIETARY & CONFIDENTIAL 12
  13. 13. Assessing Your Workload’s Exposure o Downtime: how much new business lost? o How much does brand awareness/damage cost? o Lost data = what kind of cost? – Orders unfulfilled, unhappy customers – Missing/stale reports, unhappy executives o Not just e-commerce: – Internal critical DSS Reports => top bank runs 2x 100+ node sharded arrays • DSS needs to be near-real time • What if a shard fails, or the data is old? PROPRIETARY & CONFIDENTIAL 13
  14. 14. Business Case for HA The “insurance” of HA offsets multiple costs: o Opportunity cost – Each missed visitor was potentially a customer or referral o Single sale cost – Each missed sale is a tangible missed $-value o Customer lifetime cost – Unhappy customers who find sites they like better, won’t return o Market/brand cost – All customers use social media: communication “force multiplier” – “If you make customers unhappy in the physical world, they might each tell six friends. If you make customers unhappy on the internet, they can each tell 6,000.” – Jeff Bezos – W. Edwards Deming said “5” and “20”… – Call it “Customer Satisfaction at Web-Scale” PROPRIETARY & CONFIDENTIAL 14
  15. 15. Strategies to Make MySQL Deployments HA
  16. 16. MySQL HA is usually Replication-based o Redundant servers – Goal: get HA and more scale – Some level of consistency o Read slave or DR – data is still ‘seconds behind master’ – Async or Semistrict o Certification – Strong consistency as long as only a single master accepts writes o Group Replication – Strong consistency as long as only a single master accepts writes PROPRIETARY & CONFIDENTIAL 16
  17. 17. Consistency Ramifications to High Availability o Async Replication (Master/Slave): – Replication-based: latency between master and slaves – Always some number of transactions which COMMIT on Master aren’t represented on the Slave – “Trade latency for throughput.” OK for your workload? o Sync Replication: – Certification Replication: certificate is transmitted, local master commits before ACK, other nodes commit in background – Cloud Spanner & CockroachDB: time-based optimization for replicated partitions o Strong Consistency – Every node is in identical, global transactional state at all times – All nodes (at least two) containing data associated with the transaction are durably updated before application receives ACK PROPRIETARY & CONFIDENTIAL 17
  18. 18. Different Replication Strategies for HA Approach Details Pro’s Con’s Read Slave(s) Add a “Slave” read-server(s) to “Master” database server (e.g. “DR” node or cluster) • Easy setup • Single-master simplicity • Async == Slave is usually behind master • Eventually Consistent Master/ Master Both Masters are Slaves to each other • Allows updates to both masters • Async == Slave is usually behind master • Eventually Consistent Certification Replication Multi-Master cluster using synchronous Replication • Allows multiple masters to be close in state • Sync == Other nodes need to commit the certification. Window of skew exists (much shorter than async) Group Replication 1. Single-Primary, with automatic leader election 2. Multi-Primary, i.e. similar to certification replication • Allows multiple masters to be close in state • Sync == Other nodes need to commit the certification. Window of skew exists (much shorter than async)
  19. 19. MySQL Deployment Architectures PROPRIETARY & CONFIDENTIAL 19 SHARDO4SHARDO1 SHARDO2 SHARDO3 A-G H-M N-S T-Z DRDR DR DR A-G H-M N-S T-Z
  20. 20. HA Strategies per Architecture MySQL Deployment Approach Single Node Read Slave(s) Master/Master Sharding Read Slave(s) • Each read slave adds read scale + HA • Eventual consistency N/A • Secondary master is effectively same state as a read slave • Each shard has a read slave • Eventual consistency Master/ Master • No HA benefit over Read Slave • Secondary master is effectively same state as a read slave N/A • Each shard in Master/Master • Eventual consistency Certification Replication • Nodes are closer in state than read slave • Nodes are closer in state than read slave • Nodes are closer in state than Master/Master • Each shard in Master/Master using certification replication Group Replication • Automatic Master election • Group members are closer in state than read slave • Automatic Master election • Group members are closer in state than read slave • Group members are closer in state than Master/Master • Each shard using group Replication • Automatic Master election
  21. 21. How ClustrixDB Provides High Availability
  22. 22. ClustrixDB: PROPRIETARY & CONFIDENTIAL 22 ClustrixDB ACID Compliant Transactions & Joins Optimized for OLTP Built-In High Availability Flex-Up and Flex-Down Minimal DB Admin o Write + Read Linear Scale-Out o Automatically Highly Available o MySQL-Compatible
  23. 23. PROPRIETARY & CONFIDENTIAL 23 Automatic High Availability o Planned or Unplanned Outages – Planned: “soft-fail” the node(s) – Single minimal “database pause” to regain quorum o At least 2 instances of the data distributed across all the nodes – All data instances fully in sync at all times o Data is automatically rebalanced across the cluster – Tables are online for reads and writes – MVCC for lockless reads while writing S1 S2 S3 S3 S4 S4 S5 S1 ClustrixDB S2 S5
  24. 24. Questions for Today o What is high availability and when is it needed? – Redundancy to minimize downtimes – Financial, health, and other critical workloads o What’s the difference between high availability and fault tolerance? – High availability: minimize downtime – Fault tolerance: zero downtime o How is it possible to survive a multi-node failure in MySQL? – Multiple server redundancy – Maintaining strong consistency requires synchronous data replication between servers PROPRIETARY & CONFIDENTIAL 24
  25. 25. Questions for Today o What are the best practices for achieving high availability with MySQL? – Synchronous replication: can affect performance or scale – Asynchronous replication: can affect data consistency o What are the costs of achieving HA? What can be the most cost- effective strategy?? – Redundancy of servers: CAPEX & OPEX for DevOps – License/support costs: ramps up by # of servers – Ideally: each server provides scale + HA PROPRIETARY & CONFIDENTIAL 25
  26. 26. QUESTIONS?
  27. 27. THANK YOU!

×