It is said that if you are not designing for failure, you are heading for failure. How do you design a database system from the ground up to withstand failure? This can be a challenge, as failures happen in many different ways, sometimes in ways that are hard to imagine; that is a consequence of the complexity of today's database environments.
At Severalnines we’re big fans of high availability databases and have seen our fair share of failure scenarios across the thousands of database deployments we enable every year.
In this webinar replay, we'll look at the different types of failures you might encounter and what mechanisms can be used to address them. We'll also look at some of the popular HA solutions used today, and how they can help you achieve different levels of availability.
AGENDA
- Why design for High Availability?
- High availability concepts
- CAP theorem
- PACELC theorem
- Trade-offs
- Deployment and operational cost
- System complexity
- Performance issues
- Lock management
- Architecting databases for failures
- Capacity planning
- Redundancy
- Load balancing
- Failover and switchover
- Quorum and split brain
- Fencing
- Multi-datacenter and multi-cloud setups
- Recovery policy
- High availability solutions
- Database architecture determines availability
- Active-Standby failover solution with shared storage or DRBD
- Master-slave replication
- Master-master cluster
- Failover and switchover mechanisms
- Reverse proxy
- Caching
- Virtual IP address
- Application connector
SPEAKER
Ashraf Sharif is a System Support Engineer at Severalnines. He previously worked in the hosting world on the LAMP stack, where he was a principal consultant and head of the support team, delivering clustering solutions for large websites in the Southeast Asia region. His professional interests are system scalability and high availability.
2. Your host & some logistics
I'm Jean-Jérôme from the Severalnines team, and I'm your host for today's webinar! Feel free to ask questions in the Questions section of this application or via the chat box. You can also contact me directly via the chat box or by email at info@severalnines.com, during or after the webinar.
11. What is High Availability?
A characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.
12. Database Architecture Evolution
[Diagram: database architecture evolution - 1990's: a single RDBMS; 2000's: RDBMS with a cache, reverse proxy, virtual IP, and shared disk; 2010's: multiple RDBMS and NoSQL nodes with distributed caches and reverse proxies.]
13. Why Design for HA?
Why do we need to design a highly available database system?
[Word cloud: scalability, resilience, reliability, performance, growth, hardware failures, network partition, reconciliation, parallelism, load distribution, closer to users, business continuation, consolidation, outage protection, agility, data integrity.]
14. Planning on High Availability
"He who fails to plan is
planning to fail"
Winston Churchill
16. CAP Theorem
● "It is impossible for a distributed data store to simultaneously
provide more than two out of the following three guarantees:
Consistency. Availability. Partition tolerance." - Eric Brewer, 2000
● In the event of a network partition, one has to choose between:
○ Consistency - Enforced consistency
○ Availability - Eventual consistency
● Consistency: Commits are atomic across the entire distributed
system.
● Availability: Remains accessible and operational at all time.
● Partition Tolerance: Only a total network failure can cause the
system to respond incorrectly.
[Diagram: CAP triangle, "pick two" - CA: RDBMS (MySQL, PostgreSQL); AP: Cassandra, Dynamo, CouchDB, Riak; CP: HBase, MongoDB, Redis, Galera Cluster.]
17. CAP Theorem
[Diagram: when P, choose C over A - a 3-node MySQL Galera Cluster across 3 datacenters during a network partition.]
18. CAP Theorem
[Diagram: the same 3-node MySQL Galera Cluster across 3 datacenters - without a partition it delivers (A + C) - P; during a network partition it delivers (P + C) - A.]
19. CAP Theorem
[Diagram: when P, choose A over C - a 5-node Cassandra cluster across 3 datacenters remains fully operational during a network partition.]
20. PACELC Theorem
● "If there is a Partition, how does the system trade off Availability and Consistency, Else, when the system is
running normally in the absence of partition, how does the system trade off Latency and Consistency" -
Daniel Abadi, 2011
● Classify systems according to their behaviour during active mode (running normally) and degraded
mode (when network partitions happens).
● If Partitioned {
choose A or C; # as per CAP theorem
} Else {
choose L or C;
}
21. PACELC Theorem
[Diagram: a HA distributed database system - if Partitioned, pick one of Availability or Consistency; Else, pick one of Latency or Consistency.]
22. PACELC Theorem
[Diagram: (P + C) - A - a 3-node MySQL Galera Cluster across 3 datacenters, non-operational during a network partition.]
23. PACELC Theorem
[Diagram: when E, choose C over L - a 3-node MySQL Galera Cluster across 3 datacenters, fully operational.]
24. ACID vs BASE
● ACID (pessimistic; typical of RDBMS and OLTP):
○ Atomicity - transactions are all or nothing.
○ Consistency - only valid data is saved.
○ Isolation - transactions do not affect each other.
○ Durability - committed data is never lost.
● BASE (optimistic; typical of NoSQL and OLAP):
○ Basic Availability - the database appears to work most of the time; no SPOF.
○ Soft State - weak consistency.
○ Eventual Consistency - exhibits consistency at some later point; last write wins.
25. PACELC on Database System
[Diagram: PACELC placement - if Partitioned, trade off Availability vs Consistency; Else, trade off Latency vs Consistency.]
● BASE (PA/EL): Cassandra, Dynamo, Riak
● BASE (PA/EC): MongoDB
● ACID (PC/EC): Galera Cluster, NDB Cluster
● ACID (EC): MySQL, PostgreSQL
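The classification above can be captured as a small lookup table. Here is a minimal sketch in Python; the placements follow the slide, and real systems are tunable (e.g. Cassandra's consistency levels can shift it toward C):

```python
# (partition trade-off, else trade-off) per system, as placed on the slide.
# None means the slide does not classify the system's partition behaviour.
PACELC = {
    "Cassandra": ("A", "L"), "Dynamo": ("A", "L"), "Riak": ("A", "L"),  # PA/EL
    "MongoDB": ("A", "C"),                                              # PA/EC
    "Galera Cluster": ("C", "C"), "NDB Cluster": ("C", "C"),            # PC/EC
    "MySQL": (None, "C"), "PostgreSQL": (None, "C"),                    # EC
}

def tradeoff(system, partitioned):
    """Return what the system favours: A vs C under a partition,
    L vs C in normal operation."""
    during_p, during_e = PACELC[system]
    return during_p if partitioned else during_e

print(tradeoff("Cassandra", partitioned=True))       # 'A': stays available
print(tradeoff("Galera Cluster", partitioned=True))  # 'C': favours consistency
```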
31. Redundancy - Hardware
● Minimum number of database nodes:
○ Two (single-master)
○ Three (multi-master, quorum-based)
● General recommendations:
○ Keep host specifications uniform.
○ Spare at least one standby instance/host for emergency replacement.
○ Use redundant power supplies, disks, switches, and network interface cards.
■ Use NIC bonding if you have multiple NICs.
32. Redundancy - Location
● A separate physical site, mainly for restoring operations (disaster recovery site). It can also be used for service distribution.
● Major disasters:
○ Mother nature - flood, earthquake, hurricane
○ Building - fire, power failure, rack failure, gateway failure
○ Others - theft, sabotage
● General recommendations:
○ Allocate sufficient bandwidth to connect the sites.
○ An active-active setup requires at least 3 sites (to respect quorum).
■ Both primary sites should run on good hardware.
[Diagram: a primary site paired with a DR site; two primary sites plus an arbitrator.]
33. Redundancy - Data
● Data redundancy is achieved through replication.
● Replication methods:
○ single-leader (master/slave)
○ multi-leader (master/master)
○ quorum-based replication
● Replication synchronization:
○ asynchronous
○ semi-synchronous
○ virtually synchronous
○ synchronous (two-phase commit)
[Diagram: two nodes holding copies of the same data.]
34. Capacity Planning - Storage Space
● Storage space for data must be sufficient until the next hardware refresh cycle.
● Production storage dimensioning example:
○ Next hardware cycle: 3 years
○ Current DB size: 2048 MB
○ Current full backup size (week N): 1024 MB
○ Previous full backup size (week N-1): 768 MB
○ Delta size: 256 MB per week
○ Delta ratio: ~30% increment/week
○ Total DB size estimation: (30 x 2048 x 52 x 3)/100 = 95846 MB ~ 95 GB after 3 years
○ Add 100% more room for operation and maintenance (local backups, data staging, operation logs, OS, etc.):
■ 95 + 95 = 190 GB of storage
○ Memory-based: 16 x 16 GB RAM = 256 GB
○ Disk-based: 4 x 128 GB SSD in RAID 10 with a battery-backed RAID controller ~ 250 GB
Yearly database size estimation using weekly backups:

DBsize(Y) = ( (Bn - Bn-1) / Bn-1 ) x (DBdata + DBindex) x 52 x Y

Where:
● Bn - current week's full backup size,
● Bn-1 - previous week's full backup size,
● DBdata - total database data size,
● DBindex - total database index size,
● Y - number of years.
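A minimal sketch of this estimate in Python. The growth-rate formula above is reconstructed from the worked example (the slide rounds the 256/768 ratio to 30%, giving ~95 GB), so treat the helper as illustrative:

```python
def estimated_growth_mb(backup_now_mb, backup_prev_mb, db_size_mb, years):
    """Project database growth, assuming the weekly growth rate observed
    between two consecutive full backups stays constant."""
    weekly_ratio = (backup_now_mb - backup_prev_mb) / backup_prev_mb
    return weekly_ratio * db_size_mb * 52 * years

growth = estimated_growth_mb(1024, 768, 2048, years=3)
total = 2 * growth  # add 100% headroom for local backups, staging, logs, OS
print(f"growth ~ {growth / 1024:.0f} GB, provision ~ {total / 1024:.0f} GB")
```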
35. Capacity Planning - Processor
● Lots of cores or a faster CPU clock speed?
○ A faster CPU clock is always a better option for a DBMS.
● Understand the database software's behaviour around multi-core and parallelism, for example:
○ PostgreSQL is multi-core friendly.
○ MySQL Replication does not scale well on multi-core machines!
■ A single connection will only use a single core.
■ If the workload is IO bound, you will never use more than one core.
○ Most reverse proxies and caches are single-core.
○ Most column-based DBMS support multi-core processing and parallelism.
36. Failover & Switchover
● A procedure by which a system automatically transfers control to a duplicate system when it detects a fault or failure.
● Understanding database failover and switchover matters in order to:
○ Avoid data loss
○ Ensure data consistency
○ Eliminate the element of surprise
● There are automation tools for automatic failover and topology management (see the sketch below):
○ MySQL Replication - ClusterControl, Orchestrator, MHA, mysqlrpladmin (GTID only)
○ MariaDB Replication - MRM, MaxScale
○ PostgreSQL - PAF (Pacemaker + Corosync)
○ memcached - Membase
● Manual failover: MySQL/MariaDB Replication, PostgreSQL logical/streaming replication, memcached.
● Automatic failover: MySQL Cluster (NDB), Galera Cluster, MongoDB, MySQL Group Replication, Riak, Redis, Cassandra, ZooKeeper, etcd, HBase.
Further reading: MySQL High Availability tools - Comparing MHA, MRM, ClusterControl
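The gist of what these topology managers automate can be sketched as follows. This is a minimal illustration with hypothetical helpers (is_alive, promote, repoint, and the replicated_position attribute), not how ClusterControl, Orchestrator, or MHA are actually implemented:

```python
import time

def failover_loop(master, replicas, is_alive, promote, repoint,
                  check_interval=1.0, misses_allowed=3):
    """Poll the master; after several consecutive missed health checks,
    promote the most up-to-date replica and repoint the others to it."""
    misses = 0
    while True:
        if is_alive(master):
            misses = 0
        else:
            misses += 1
            if misses >= misses_allowed:  # debounce: don't act on one blip
                # Pick the replica that has applied the most replication
                # events (e.g. the largest GTID set or WAL position).
                new_master = max(replicas, key=lambda r: r.replicated_position)
                promote(new_master)       # e.g. clear read_only on the node
                for replica in replicas:
                    if replica is not new_master:
                        repoint(replica, new_master)
                return new_master
        time.sleep(check_interval)
```

Real tools layer post-failover verification, flapping detection, and fencing of the old master on top of this skeleton.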
37. Failover & Switchover Tool Placement
● The topology manager must be placed in a good location to:
○ Monitor topology changes
○ Recognize failure scenarios correctly (holistic approach)
○ Promote the right node
○ Repoint slaves to the new master
○ Detect flapping
○ Perform post-failover verification
○ Send notifications and alerts
● General recommendations:
○ Place it in the same network segment as the application tier, on the primary site.
○ Have a backup line between geographical sites.
○ It must be available at all times.
[Diagram: a topology manager overseeing a master with slaves in the primary DC and an intermediate master with a slave in the secondary DC, with application servers on both sites.]
38. Reverse Proxy
● Distributes workloads across multiple database nodes.
● Popular reverse proxies for open-source DBMS:
○ HAProxy, ProxySQL, MariaDB MaxScale, nginx, IPVS, pen
● Algorithms (one of them is sketched below):
○ source (IP hash)
○ least connection (weighted)
○ round-robin (weighted)
○ random
○ least latency
● The importance of a reverse proxy:
○ Stabilizes the cluster
○ Simplifies the overall architecture
○ Connection queueing and overload protection
○ Transparency to the upper layer
[Diagram: a forward proxy sits between clients and the Internet; a reverse proxy sits between the Internet and servers.]
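As a flavour of what these balancing algorithms do, here is a minimal sketch of smooth weighted round-robin in Python; it illustrates the idea, not HAProxy's or ProxySQL's actual implementation:

```python
class WeightedRoundRobin:
    """Smooth weighted round-robin: on each pick, every backend gains
    its weight in credit, and the backend with the most credit wins."""
    def __init__(self, backends):          # backends: {name: weight}
        self.weights = dict(backends)
        self.credit = {name: 0 for name in backends}

    def pick(self):
        total = sum(self.weights.values())
        for name, weight in self.weights.items():
            self.credit[name] += weight
        chosen = max(self.credit, key=self.credit.get)
        self.credit[chosen] -= total       # pay for being chosen
        return chosen

lb = WeightedRoundRobin({"db1": 3, "db2": 1})
print([lb.pick() for _ in range(8)])       # db1 appears ~3x as often as db2
```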
39. Reverse Proxy Placement
● Centralized:
○ Tier-based
○ Simple and easy to manage
○ Requires an additional device or host
○ Usually tied to a virtual IP to eliminate the SPOF
● Distributed:
○ Co-located with the application servers
○ Harder to manage
○ Faster from the application's standpoint (caching, query rerouting, query firewall)
○ Too many proxy instances might affect health-check performance
[Diagram: centralized reverse proxies behind a virtual IP vs. distributed reverse proxies co-located with each application server.]
40. Application Driver
● Drivers that connect applications and databases.
● The high availability logic is embedded in the application, skipping the proxy tier.
● Some connectors support database high availability components (see the sketch below):
○ php-mysqlnd - load balancing, r/w splitting, persistent connections, cache.
○ Connector/J - load balancing, r/w splitting, connection pooling.
○ MySQL Fabric - framework for MySQL HA and sharding. Supports PHP, Python, Java.
○ php-mongodb - member auto-discovery.
○ MongoDB Java Driver - member auto-discovery, r/w splitting, automatic failover.
[Diagram: Java apps via Connector/J, PHP web apps via php-mysqlnd, and Python apps via MySQL Fabric, all talking to a MySQL Replication setup (one master, two slaves).]
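To illustrate the connector-level approach with a MySQL example, MySQL Connector/Python accepts a failover list of candidate servers. Hosts and credentials below are placeholders; check your connector version's documentation for the exact option name and semantics:

```python
import mysql.connector

# If the connection drops, the connector retries against the next
# server in the failover list: the HA logic lives in the driver,
# with no proxy tier in between.
conn = mysql.connector.connect(
    user="app", password="secret", database="shop",
    failover=[
        {"host": "db-master.example.com", "port": 3306},
        {"host": "db-backup.example.com", "port": 3306},
    ],
)
cur = conn.cursor()
cur.execute("SELECT @@hostname")
print(cur.fetchone())
```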
41. Quorum
● The minimum number of members required to be available (usually the majority).
● Quorum is important to:
○ Maintain consensus among DB nodes
○ Resolve network partitioning
○ Maintain data consistency
● Quorum-based clusters use heartbeats to check on each other:
○ A periodic signal generated by the DB software to indicate normal operation.
○ Happens at a regular interval.
○ An election is forced if heartbeats are skipped or inconsistent.
Quorum calculation (Galera-style weighted quorum; sketched in code below):

sum(wi, i in m) > ( sum(wi, i in p) - sum(wi, i in l) ) / 2

Where:
● pi - members of the last seen primary component,
● li - members that are known to have left gracefully,
● mi - current component members,
● wi - member weights.
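A sketch of that weighted-quorum inequality in Python. The formula is reconstructed from the variable definitions on the slide, in the style of Galera's weighted quorum, so treat it as illustrative:

```python
def has_quorum(current, last_primary, left_gracefully, weight):
    """The current component keeps quorum if its weight exceeds half the
    weight of the last primary component, not counting graceful leavers."""
    remaining = sum(weight[n] for n in last_primary if n not in left_gracefully)
    return sum(weight[n] for n in current) > remaining / 2

weight = {"db1": 1, "db2": 1, "db3": 1}
# db3 is cut off by a network partition: the {db1, db2} side keeps quorum.
print(has_quorum({"db1", "db2"}, {"db1", "db2", "db3"}, set(), weight))  # True
print(has_quorum({"db3"}, {"db1", "db2", "db3"}, set(), weight))         # False
```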
42. Quorum and Split Brain
● Split-brain is the state in which two partitioned sides cannot determine quorum and both remain available:
○ Data diverges.
○ Pretty hard to roll back once it happens; data loss is possible.
● Avoiding split-brain:
○ Use an odd total number of nodes (3, 5, 7, ...).
○ Otherwise, use:
■ Weighted quorum (e.g. 1 node = 2 votes)
■ An arbitrator node (a vote-only node)
○ Always start a node in the secondary role (read_only=ON) before promoting it to the primary role (read_only=OFF), as sketched below.
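A sketch of that last recommendation, using MySQL-flavoured statements over a generic DB-API connection (catch_up is a hypothetical wait-for-replication helper):

```python
def promote_safely(conn, catch_up):
    """Bring the node up read-only first and only open it for writes
    once it has caught up, so two writable primaries never coexist."""
    cur = conn.cursor()
    cur.execute("SET GLOBAL super_read_only = ON")  # implies read_only = ON
    catch_up(conn)             # wait until replication has fully drained
    cur.execute("SET GLOBAL read_only = OFF")       # also clears super_read_only
```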
43. Split Brain in Single Master
[Diagram: a single-master topology behind a reverse proxy. The backup master is promoted (read_only=OFF) while the old master still runs with read_only=OFF, leaving the application with two writable masters.]
44. Split Brain in Multi-Master
[Diagram: a two-node multi-master cluster partitions into two halves (1/2 vs 1/2); neither side has a majority, and the reverse proxy can switch the application between both writable masters.]
45. External Cache
● An additional tier in front of the database tier that stores frequently accessed data.
● An external cache can be useful to (see the sketch below):
○ Offload the database server by caching expensive queries or heavy datasets
○ Eliminate bottlenecks
○ Serve data while the database server is unavailable, thus improving availability
● DBMS:
○ Redis
○ memcached
● Reverse proxy:
○ ProxySQL
○ MariaDB MaxScale
[Diagram: application servers behind load balancers and a VIP, with a cache tier in front of the database.]
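A minimal cache-aside sketch with Redis, using the redis-py client (db.run_expensive_query is a hypothetical stand-in for the real database call):

```python
import json
import redis

r = redis.Redis(host="cache.example.com", port=6379)

def get_report(db, report_id, ttl=300):
    """Serve from the cache when possible; on a miss, query the
    database and repopulate the cache with a TTL."""
    key = f"report:{report_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)              # hit: database is offloaded
    row = db.run_expensive_query(report_id)    # miss: fall back to the DB
    r.setex(key, ttl, json.dumps(row))         # expire so stale data ages out
    return row
```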
46. Monitoring and Trending
● Monitor every database component - replication state, queries, disk space, IO, logs, load, locks, threads or processes, memory usage, latency.
● Benefits:
○ Get alerts when thresholds are exceeded (see the sketch below).
○ Trends database usage over time.
○ Helps with capacity planning.
○ Fast detection of database outages, failures, and table corruption.
● Popular database monitoring tools:
○ Nagios
○ PMM
○ Zabbix
○ Prometheus
○ ClusterControl
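At its core, the alerting benefit is a threshold rule evaluated on a schedule. A toy sketch (collect_metrics and send_alert are hypothetical hooks; tools like Nagios or Prometheus provide the real machinery):

```python
def check_thresholds(collect_metrics, send_alert, disk_pct=85, lag_s=30):
    """Compare collected metrics against thresholds and alert on breach."""
    for host, m in collect_metrics().items():  # e.g. {"db1": {...}, ...}
        if m["disk_used_pct"] >= disk_pct:
            send_alert(f"{host}: disk at {m['disk_used_pct']:.0f}%")
        if m["replication_lag_s"] >= lag_s:
            send_alert(f"{host}: replication lag {m['replication_lag_s']}s")
```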
48. Replication Category
[Diagram: database replication categories - logical replication, physical replication, multi-site replication, multi-master replication, replica set.]
49. Logical Replication
● A group of database servers that replicate changes from another node at the database level:
○ Master - receives reads and writes.
○ Intermediate master - replicates data from a master. Read-only.
○ Slave - replicates data from a master or intermediate master. Read-only.
● Pulls replication data from archive logs (binary log, WAL, oplog).
● Possible to replicate only a subset of the data.
● No database-level conflict resolution.
● Minimum 2 hosts. Maximum is unlimited.
● DBMS:
○ MySQL/MariaDB Replication
■ master-slave, chain, multi-source, master-master
○ PostgreSQL logical replication
[Diagram: replication data flow - an intermediate master replicates from the master; slaves replicate from the master and from the intermediate master.]
50. Physical Replication
● A group of database servers that replicate binary-level data from another node:
○ Primary - receives reads and writes.
○ Secondary - replicates data from the primary at the block level. Doesn't serve data; cold standby only.
● All nodes hold the same data set and must run the same version.
● Replication or synchronization is performed by an external process.
● Minimum 2 hosts.
● Example: MySQL with DRBD, PostgreSQL physical replication, embedded databases like SQLite with rsync.
[Diagram: replication data flow - primary and secondary block devices kept in sync by an external synchronization process.]
51. Replica Set
● A group of database servers that hold the same set of data:
○ Primary - receives writes.
○ Secondary/Replica - replicates data from the primary. Read-only.
● Quorum-based with automatic election & failover.
● No database-level conflict resolution.
● Minimum 3 hosts; the maximum is limited.
● DBMS: MongoDB
[Diagram: replication data flow - two replicas replicate from the primary over a group communication channel.]
52. Multi-Master Replication
● A group of database servers that replicate from and to one another:
○ Multi-master - all nodes are equal and capable of serving reads/writes.
○ Communication happens through a group communication protocol.
○ Transaction ordering coordination:
■ Transaction certification
■ 2-phase commit
● Database-level conflict handling and/or resolution.
● Quorum-based with automatic failover.
● Minimum 3 hosts.
● Example: Galera Cluster, MySQL Group Replication, MySQL Cluster, Postgres-BDR, CouchDB
● Note: asynchronous MySQL replication is NOT suitable for multi-master replication.
[Diagram: replication data flow - three masters replicating to and from each other via group communication.]
53. Multi-Site Replication
● Multi-site replication:
○ A group of database servers that hold the same set of logical data but are located in different physical locations.
○ Based on a replication factor.
○ Masterless architecture:
■ All replicas are equally important.
■ Write to any node.
■ The node coordinates with the replicas.
○ Eventual consistency; uses clocks for conflict resolution.
○ Built for geographical redundancy.
○ Example: Cassandra, Riak
[Diagram: coordinator nodes write to replicas across sites; replication factor = 3.]
54. High Availability Solutions
● Relational: MySQL Replication, MariaDB Replication, Galera Cluster, MySQL Cluster, MySQL Group Replication, PostgreSQL Replication
● Document: MongoDB, MarkLogic, Couchbase
● Columnar: ClickHouse, Apache HBase, Cassandra
● Key-Value: Redis, Riak KV, etcd
55. Example: MySQL Replication Scale Out
[Diagram: MySQL Replication scale-out - from a single master-slave pair, to master-slave behind ProxySQL (RW to the master, R to the slaves), to a master with an intermediate master and slaves managed by a topology manager.]
56. Example: Galera Cluster Scale Out
[Diagram: Galera Cluster scale-out - multi-writer; multi-writer with a replication slave; single-writer with a replication slave, all fronted by ProxySQL.]
57. Example: MongoDB Scale Out
[Diagram: MongoDB scale-out - standalone; a replica set (primary + 2 secondaries); a sharded cluster with each shard as a replica set, a config replica set, and mongos routers.]
59. Trade Offs
Trade-offs when running a highly available database system: cost, complexity, performance, locking.
60. Locking
● Problems:
○ Lock contention (hotspots)
○ Long-term blocking (unreleased locks)
○ Database deadlocks (2 or more transactions locking each other)
○ System deadlocks (locks outside of the database: read-only filesystem, snapshots)
● Tips:
○ Use a lower isolation level (better concurrency, weaker consistency).
○ Change the application behaviour:
■ Chunk big transactions into smaller ones (see the sketch below).
○ Avoid hotspots from multi-writers:
■ Forward the queries to a single writer.
○ Use a hot-backup utility, or back up from a replica.
[Diagram: single-writer Galera Cluster with a replication slave behind ProxySQL, as in the earlier scale-out example.]
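For the chunking tip, a minimal sketch of purging rows in small batches (MySQL-flavoured SQL over a DB-API connection; the events table is made up):

```python
def delete_in_chunks(conn, chunk=1000):
    """Delete expired rows in many small transactions instead of one
    huge one, so locks are held briefly and released between batches."""
    cur = conn.cursor()
    while True:
        cur.execute(
            "DELETE FROM events WHERE expires_at < NOW() LIMIT %s", (chunk,)
        )
        conn.commit()              # release row locks before the next batch
        if cur.rowcount < chunk:   # last batch was partial: nothing left
            break
```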
61. Deployment and Operational Cost
● Capital expenditure (capex) is high:
○ Hardware preparation
○ Cluster deployment
○ Training
○ Testing and integration
○ Tuning
○ Multiple sets of clusters (staging/test/dev)
● Operational expenditure (opex) gets lower over time with:
○ Automation
○ Skills and expertise
○ Experience
○ Monitoring & alerting
○ Consolidation
● General recommendations for cost justification:
○ Calculate the ROI.
○ Estimate the planned and unplanned downtime cost per hour (see the sketch below).
○ Evaluate the existing infrastructure.
○ Plan for growth.
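For the downtime-cost estimate, the arithmetic is simple enough to sketch (the hourly cost figure is a made-up input):

```python
def downtime_per_year(availability_pct, cost_per_hour):
    """Hours of downtime implied by an availability target over a year,
    and what it costs at a given hourly downtime cost."""
    hours = (1 - availability_pct / 100) * 365 * 24
    return hours, hours * cost_per_hour

for target in (99.0, 99.9, 99.99):
    hours, cost = downtime_per_year(target, cost_per_hour=10_000)
    print(f"{target}%: {hours:.1f} h/year -> ${cost:,.0f}/year")
```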
62. Performance Issues
● Latency is often problematic in distributed systems, especially ACID DBMS.
● Issues:
○ Network saturation (heartbeating, replication, certification, syncing)
○ Unbalanced distribution
○ Non-uniform hardware
○ Large query performance (batches, routines, etc.)
● General recommendations:
○ Use a reliable LAN (and/or WAN).
○ Use the right database for the right job.
○ Monitor everything.
○ Cache expensive queries.
○ Run a performance assessment every quarter.
63. System Complexity
● Operational tasks become more complex:
○ Repetition (config changes, backups, logs, troubleshooting, upgrades)
○ Additional environments (testing/staging/development)
○ Knowledge transfer
○ Scaling (auto-scaling, sharding)
● General recommendations:
○ Strict user access.
○ Document, log, and audit everything.
○ Embrace automation tools.
○ Always test before pushing changes to the production environment.