A presentation discusses high availability (HA) strategies for MySQL databases. HA minimizes downtime while fault tolerance ensures zero downtime, but at high cost. For MySQL, HA usually uses replication across redundant servers, balancing consistency, throughput and cost. The best approach depends on the deployment, from single servers to sharded architectures. ClustrixDB provides automatic HA through synchronous multi-master replication across a cluster, minimizing administration while ensuring data consistency and continuous availability.
The Codex of Business Writing Software for Real-World Solutions 2.pptx
Tech Talk Series, Part 4: How do you achieve high availability in a MySQL environment?
1. Flexible transactional scale for the connected world.
Challenges to Scaling MySQL:
Best Practices for Creating High Availability
Dave A. Anselmi @AnselmiDave
Director of Product Management
2. Questions for Today
PROPRIETARY & CONFIDENTIAL 2
o What is high availability and when is it needed?
o What’s the difference between high availability and fault tolerance?
o How is it possible to survive a multi-node failure in MySQL?
o What are the best practices for achieving high availability with MySQL?
o What are the costs of achieving HA? What can be the most cost-effective
strategy??
3. HA: As You Scale, Your Exposure GrowsSCALE
(GROWTH/SUCCESS)
T I M E
LAMP Stack
AWS, Azure,
RAX, GCE, etc
Private Cloud
REACH LIMIT
App too slow;
Lost users
REACH LIMIT
(AGAIN)
App too slow;
Lost users
Migrate
to Bigger
Machine
• Read slaves, then
sharding, etc:
• Add more hardware &
DBAs
• Refactor code
/hardwired app
More Expensive
Higher Risk
Lost Revenue
ONGOING:
• Refactoring
hardware
• Data balancing
• Shard
maintenance
REPEAT
Migrate
to Bigger
Machine
PROPRIETARY & CONFIDENTIAL 3
5. Availability by the 9s
PROPRIETARY & CONFIDENTIAL 5https://www.percona.com/blog/2016/06/07/choosing-mysql-high-availability-solutions/
6. “High Availability” –vs- “Fault Tolerance”
o High Availability – Minimize system downtime
– Trade-off between as high “9s” level as can be budgeted
– Goal: least amount of data loss possible
o Fault Tolerance – System cannot go down
– Arrays of redundant hardware, and automated failover systems
– Cost is very high
ORCL:
– A high availability system minimizes the time when the system is down, or
unavailable and maximizes the time when it is running, or available.
IBM:
– A fault tolerant environment has no service interruption but a significantly
higher cost,
– A highly available environment has a minimal service interruption.
PROPRIETARY & CONFIDENTIAL 6
7. Fault Tolerance –vs– High Availability
o Fault Tolerance:
– Failover application processes,
including heartbeat
– Shared storage layer, multiple
participants
PROPRIETARY & CONFIDENTIAL 7
o High Availability:
– Multiple redundant shared-
nothing servers
– Replication to keep in sync
8. High Availability rather than Fault-Tolerance
o MySQL systems are rarely fault tolerant
– High cost of fault tolerance is prohibitive
o Most MySQL systems use replication for HA/DR
o Galera isn’t fault tolerant
– Certification replication provides synchronous replication
between nodes
– Availability is enforced over consistency: the write-set can be
committed on the local node before the rest of the cluster has
committed (Jepsen)
PROPRIETARY & CONFIDENTIAL 8
10. Four Challenges to HA
1. Tech/DevOps always wants HA:
1. Throughput & uptime is their core metric/KPIs
2. Business/Finance demands justification of the costs:
1. Redundant servers reflect underutilized resources
2. Redundant servers are considered “wasted budget”
3. Cloud/PaaS/IaaS can imply more HA than they provide
– “Architected for 11x 9s”
4. Tension between scale and HA
1. Ideally, each new server would provide scale and redundancy
2. In practice, result is mixed; so choice is usually for scale
PROPRIETARY & CONFIDENTIAL 10
11. Realities of HA in the Cloud
o “Promise” –vs- “Reality” of the cloud
– Promise of the cloud: web scale
– Reality of the cloud: TANSTAAFL
o “Doesn’t the cloud provide HA automatically?”
– MBAs: literally taught “DevOps just wants to spend $$”
• “We don’t need redundancy: we’re on the cloud, & the cloud is 5x 9s, right?”
• “S3 is architected for 11x 9s, right?”
• “We’re on Amazon, it’s backed up”
o MUST deploy redundant hardware in the cloud
– If it’s not on your bill, you haven’t provisioned it
o “Success” of AWS Marketing Exposed Workload
– 2/28/2017 4 Hour S3 outage: Even though “the cloud” has lots of hardware
that does NOT mean your systems are fault tolerant, let alone HA
PROPRIETARY & CONFIDENTIAL 11
12. “Obvious” Critical Workloads needing HA
o E-Commerce
– Black Friday/Cyber Monday, Single’s Day, Back to School, flash
sales, etc
– 80% of Revenue in 2 months
– Provisioning > 3x capacity for 2 months
o Finance
– System of Record
– “Money changing hands”
o Healthcare
– “Life/death decisions” & DSS
PROPRIETARY & CONFIDENTIAL 12
13. Assessing Your Workload’s Exposure
o Downtime: how much new business lost?
o How much does brand awareness/damage cost?
o Lost data = what kind of cost?
– Orders unfulfilled, unhappy customers
– Missing/stale reports, unhappy executives
o Not just e-commerce:
– Internal critical DSS Reports => top bank runs 2x 100+ node
sharded arrays
• DSS needs to be near-real time
• What if a shard fails, or the data is old?
PROPRIETARY & CONFIDENTIAL 13
14. Business Case for HA
The “insurance” of HA offsets multiple costs:
o Opportunity cost
– Each missed visitor was potentially a customer or referral
o Single sale cost
– Each missed sale is a tangible missed $-value
o Customer lifetime cost
– Unhappy customers who find sites they like better, won’t return
o Market/brand cost
– All customers use social media: communication “force multiplier”
– “If you make customers unhappy in the physical world, they might each tell six friends. If
you make customers unhappy on the internet, they can each tell 6,000.” – Jeff Bezos
– W. Edwards Deming said “5” and “20”…
– Call it “Customer Satisfaction at Web-Scale”
PROPRIETARY & CONFIDENTIAL 14
16. MySQL HA is usually Replication-based
o Redundant servers
– Goal: get HA and more scale
– Some level of consistency
o Read slave or DR – data is still ‘seconds behind master’
– Async or Semistrict
o Certification
– Strong consistency as long as only a single master accepts writes
o Group Replication
– Strong consistency as long as only a single master accepts writes
PROPRIETARY & CONFIDENTIAL 16
17. Consistency Ramifications to High Availability
o Async Replication (Master/Slave):
– Replication-based: latency between master and slaves
– Always some number of transactions which COMMIT on Master aren’t
represented on the Slave
– “Trade latency for throughput.” OK for your workload?
o Sync Replication:
– Certification Replication: certificate is transmitted, local master commits
before ACK, other nodes commit in background
– Cloud Spanner & CockroachDB: time-based optimization for replicated
partitions
o Strong Consistency
– Every node is in identical, global transactional state at all times
– All nodes (at least two) containing data associated with the transaction are
durably updated before application receives ACK
PROPRIETARY & CONFIDENTIAL 17
18. Different Replication Strategies for HA
Approach Details Pro’s Con’s
Read Slave(s) Add a “Slave” read-server(s) to
“Master” database server
(e.g. “DR” node or cluster)
• Easy setup
• Single-master simplicity
• Async == Slave is usually behind
master
• Eventually Consistent
Master/
Master
Both Masters are Slaves to each
other
• Allows updates to both masters • Async == Slave is usually behind
master
• Eventually Consistent
Certification
Replication
Multi-Master cluster using
synchronous Replication
• Allows multiple masters to be close
in state
• Sync == Other nodes need to commit
the certification. Window of skew
exists (much shorter than async)
Group
Replication
1. Single-Primary, with
automatic leader election
2. Multi-Primary, i.e. similar to
certification replication
• Allows multiple masters to be close
in state
• Sync == Other nodes need to commit
the certification. Window of skew
exists (much shorter than async)
20. HA Strategies per Architecture
MySQL Deployment
Approach Single Node Read Slave(s) Master/Master Sharding
Read
Slave(s)
• Each read slave adds
read scale + HA
• Eventual consistency
N/A
• Secondary master is
effectively same state as a
read slave
• Each shard has a read
slave
• Eventual consistency
Master/
Master
• No HA benefit over
Read Slave
• Secondary master is
effectively same state
as a read slave
N/A
• Each shard in
Master/Master
• Eventual consistency
Certification
Replication
• Nodes are closer in
state than read slave
• Nodes are closer in
state than read slave
• Nodes are closer in state
than Master/Master
• Each shard in
Master/Master using
certification replication
Group
Replication
• Automatic Master
election
• Group members are
closer in state than read
slave
• Automatic Master
election
• Group members are
closer in state than
read slave
• Group members are closer
in state than Master/Master
• Each shard using group
Replication
• Automatic Master election
22. ClustrixDB:
PROPRIETARY & CONFIDENTIAL 22
ClustrixDB
ACID Compliant
Transactions & Joins
Optimized for OLTP
Built-In High Availability
Flex-Up and Flex-Down
Minimal DB Admin
o Write + Read Linear Scale-Out
o Automatically Highly Available
o MySQL-Compatible
23. PROPRIETARY & CONFIDENTIAL 23
Automatic High Availability
o Planned or Unplanned Outages
– Planned: “soft-fail” the node(s)
– Single minimal “database pause” to regain
quorum
o At least 2 instances of the data distributed
across all the nodes
– All data instances fully in sync at all times
o Data is automatically rebalanced across
the cluster
– Tables are online for reads and writes
– MVCC for lockless reads while writing
S1
S2
S3
S3
S4
S4
S5
S1
ClustrixDB
S2
S5
24. Questions for Today
o What is high availability and when is it needed?
– Redundancy to minimize downtimes
– Financial, health, and other critical workloads
o What’s the difference between high availability and fault tolerance?
– High availability: minimize downtime
– Fault tolerance: zero downtime
o How is it possible to survive a multi-node failure in MySQL?
– Multiple server redundancy
– Maintaining strong consistency requires synchronous data replication
between servers
PROPRIETARY & CONFIDENTIAL 24
25. Questions for Today
o What are the best practices for achieving high availability with
MySQL?
– Synchronous replication: can affect performance or scale
– Asynchronous replication: can affect data consistency
o What are the costs of achieving HA? What can be the most cost-
effective strategy??
– Redundancy of servers: CAPEX & OPEX for DevOps
– License/support costs: ramps up by # of servers
– Ideally: each server provides scale + HA
PROPRIETARY & CONFIDENTIAL 25