It is said that if you are not designing for failure, you are heading for failure. How do you design a database system from the ground up to withstand failure? This can be a challenge, as failures happen in many different ways, sometimes in ways that are hard to imagine; that is a consequence of the complexity of today's database environments.
At Severalnines we’re big fans of high availability databases and have seen our fair share of failure scenarios across the thousands of database deployments we enable every year.
In this webinar replay, we'll look at the different types of failures you might encounter and what mechanisms can be used to address them. We'll also look at some of the popular HA solutions used today, and how they can help you achieve different levels of availability.
AGENDA
- Why design for High Availability?
- High availability concepts
- CAP theorem
- PACELC theorem
- Trade-offs
- Deployment and operational cost
- System complexity
- Performance issues
- Lock management
- Architecting databases for failures
- Capacity planning
- Redundancy
- Load balancing
- Failover and switchover
- Quorum and split brain
- Fencing
- Multi-datacenter and multi-cloud setups
- Recovery policy
- High availability solutions
- Database architecture determines availability
- Active-Standby failover solution with shared storage or DRBD
- Master-slave replication
- Master-master cluster
- Failover and switchover mechanisms
- Reverse proxy
- Caching
- Virtual IP address
- Application connector
SPEAKER
Ashraf Sharif is a System Support Engineer at Severalnines. He previously worked in the hosting world on the LAMP stack, where he was a principal consultant and head of the support team, delivering clustering solutions for large websites in the Southeast Asia region. His professional interests are system scalability and high availability.
2. Your host & some logistics
I'm Jean-Jérôme from the Severalnines team, and I'm your host for today's webinar! Feel free to ask questions in the Questions section of this application or via the chat box. You can also contact me directly via the chat box or by email at info@severalnines.com, during or after the webinar.
11. What is High Availability?
A characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.
12. Database Architecture Evolution
[Diagram: database architecture evolution - 1990's: a single RDBMS; 2000's: RDBMS with a cache, reverse proxy, virtual IP, and shared disk; 2010's: multiple RDBMS and NoSQL nodes with distributed caches and reverse proxies.]
13. Why Design for HA?
Why do we need to design a highly available database system?
[Word cloud: scalability, resilience, reliability, performance, growth, hardware failures, network partition, reconciliation, parallelism, load distribution, closer to users, business continuation, consolidation, outage protection, agility, data integrity.]
14. Planning on High Availability
"He who fails to plan is
planning to fail"
Winston Churchill
16. CAP Theorem
● "It is impossible for a distributed data store to simultaneously
provide more than two out of the following three guarantees:
Consistency. Availability. Partition tolerance." - Eric Brewer, 2000
● In the event of a network partition, one has to choose between:
○ Consistency - Enforced consistency
○ Availability - Eventual consistency
● Consistency: Commits are atomic across the entire distributed
system.
● Availability: Remains accessible and operational at all time.
● Partition Tolerance: Only a total network failure can cause the
system to respond incorrectly.
[Diagram: CAP triangle, "pick two" - CA: RDBMS (MySQL, PostgreSQL); AP: Cassandra, Dynamo, CouchDB, Riak; CP: HBase, MongoDB, Redis, Galera Cluster.]
17. CAP Theorem
[Diagram: when P, choose C over A - a 3-node MySQL Galera Cluster across 3 datacenters during a network partition.]
18. CAP Theorem
[Diagram: the same 3-node MySQL Galera Cluster across 3 datacenters - without a partition it delivers (A + C) - P; during a network partition it delivers (P + C) - A.]
19. CAP Theorem
[Diagram: when P, choose A over C - a 5-node Cassandra cluster across 3 datacenters remains fully operational during a network partition.]
20. PACELC Theorem
● "If there is a Partition, how does the system trade off Availability and Consistency, Else, when the system is
running normally in the absence of partition, how does the system trade off Latency and Consistency" -
Daniel Abadi, 2011
● Classify systems according to their behaviour during active mode (running normally) and degraded
mode (when network partitions happens).
● If Partitioned {
choose A or C; # as per CAP theorem
} Else {
choose L or C;
}
21. PACELC Theorem
[Diagram: a HA distributed database system - if Partitioned, pick one of Availability or Consistency; Else, pick one of Latency or Consistency.]
22. PACELC Theorem
[Diagram: (P + C) - A - a 3-node MySQL Galera Cluster across 3 datacenters, non-operational during a network partition.]
23. PACELC Theorem
[Diagram: when E, choose C over L - a 3-node MySQL Galera Cluster across 3 datacenters, fully operational.]
24. ACID vs BASE
● ACID (pessimistic; typical of RDBMS and OLTP):
○ Atomicity - transactions are all or nothing.
○ Consistency - only valid data is saved.
○ Isolation - transactions do not affect each other.
○ Durability - committed data is never lost.
● BASE (optimistic; typical of NoSQL and OLAP):
○ Basic Availability - the database appears to work most of the time; no SPOF.
○ Soft State - weak consistency.
○ Eventual Consistency - exhibits consistency at some later point; last write wins.
25. PACELC on Database System
[Diagram: PACELC placement - if Partitioned, trade off Availability vs Consistency; Else, trade off Latency vs Consistency.]
● BASE (PA/EL): Cassandra, Dynamo, Riak
● BASE (PA/EC): MongoDB
● ACID (PC/EC): Galera Cluster, NDB Cluster
● ACID (EC): MySQL, PostgreSQL
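The classification above can be captured as a small lookup table. Here is a minimal sketch in Python; the placements follow the slide, and real systems are tunable (e.g. Cassandra's consistency levels can shift it toward C):

```python
# (partition trade-off, else trade-off) per system, as placed on the slide.
# None means the slide does not classify the system's partition behaviour.
PACELC = {
    "Cassandra": ("A", "L"), "Dynamo": ("A", "L"), "Riak": ("A", "L"),  # PA/EL
    "MongoDB": ("A", "C"),                                              # PA/EC
    "Galera Cluster": ("C", "C"), "NDB Cluster": ("C", "C"),            # PC/EC
    "MySQL": (None, "C"), "PostgreSQL": (None, "C"),                    # EC
}

def tradeoff(system, partitioned):
    """Return what the system favours: A vs C under a partition,
    L vs C in normal operation."""
    during_p, during_e = PACELC[system]
    return during_p if partitioned else during_e

print(tradeoff("Cassandra", partitioned=True))       # 'A': stays available
print(tradeoff("Galera Cluster", partitioned=True))  # 'C': favours consistency
```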
31. Redundancy - Hardware
● Minimum number of database nodes:
○ Two (single-master)
○ Three (multi-master, quorum-based)
● General recommendations:
○ Keep host specifications uniform.
○ Spare at least one standby instance/host for emergency replacement.
○ Use redundant power supplies, disks, switches, and network interface cards.
■ Use NIC bonding if you have multiple NICs.
32. Redundancy - Location
● A separate physical site, mainly for restoring operations (disaster recovery site). It can also be used for service distribution.
● Major disasters:
○ Mother nature - flood, earthquake, hurricane
○ Building - fire, power failure, rack failure, gateway failure
○ Others - theft, sabotage
● General recommendations:
○ Allocate sufficient bandwidth to connect the sites.
○ An active-active setup requires at least 3 sites (to respect quorum).
■ Both primary sites should run on good hardware.
[Diagram: a primary site paired with a DR site; two primary sites plus an arbitrator.]
33. Redundancy - Data
● Data redundancy is achieved through replication.
● Replication methods:
○ single-leader (master/slave)
○ multi-leader (master/master)
○ quorum-based replication
● Replication synchronization:
○ asynchronous
○ semi-synchronous
○ virtually synchronous
○ synchronous (two-phase commit)
[Diagram: two nodes holding copies of the same data.]
34. Capacity Planning - Storage Space
● Storage space for data must be sufficient until the next hardware refresh cycle.
● Production storage dimensioning example:
○ Next hardware cycle: 3 years
○ Current DB size: 2048 MB
○ Current full backup size (week N): 1024 MB
○ Previous full backup size (week N-1): 768 MB
○ Delta size: 256 MB per week
○ Delta ratio: ~30% increment/week
○ Total DB size estimation: (30 x 2048 x 52 x 3)/100 = 95846 MB ~ 95 GB after 3 years
○ Add 100% more room for operation and maintenance (local backups, data staging, operation logs, OS, etc.):
■ 95 + 95 = 190 GB of storage
○ Memory-based: 16 x 16 GB RAM = 256 GB
○ Disk-based: 4 x 128 GB SSD in RAID 10 with a battery-backed RAID controller ~ 250 GB
Yearly database size estimation using weekly backups:

DBsize(Y) = ( (Bn - Bn-1) / Bn-1 ) x (DBdata + DBindex) x 52 x Y

Where:
● Bn - current week's full backup size,
● Bn-1 - previous week's full backup size,
● DBdata - total database data size,
● DBindex - total database index size,
● Y - number of years.
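A minimal sketch of this estimate in Python. The growth-rate formula above is reconstructed from the worked example (the slide rounds the 256/768 ratio to 30%, giving ~95 GB), so treat the helper as illustrative:

```python
def estimated_growth_mb(backup_now_mb, backup_prev_mb, db_size_mb, years):
    """Project database growth, assuming the weekly growth rate observed
    between two consecutive full backups stays constant."""
    weekly_ratio = (backup_now_mb - backup_prev_mb) / backup_prev_mb
    return weekly_ratio * db_size_mb * 52 * years

growth = estimated_growth_mb(1024, 768, 2048, years=3)
total = 2 * growth  # add 100% headroom for local backups, staging, logs, OS
print(f"growth ~ {growth / 1024:.0f} GB, provision ~ {total / 1024:.0f} GB")
```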
35. Capacity Planning - Processor
● Lots of cores or a faster CPU clock speed?
○ A faster CPU clock is always a better option for a DBMS.
● Understand the database software's behaviour around multi-core and parallelism, for example:
○ PostgreSQL is multi-core friendly.
○ MySQL Replication does not scale well on multi-core machines!
■ A single connection will only use a single core.
■ If the workload is IO bound, you will never use more than one core.
○ Most reverse proxies and caches are single-core.
○ Most column-based DBMS support multi-core processing and parallelism.
36. Failover & Switchover
● A procedure by which a system automatically transfers control to a duplicate system when it detects a fault or failure.
● Understanding database failover and switchover matters in order to:
○ Avoid data loss
○ Ensure data consistency
○ Eliminate the element of surprise
● There are automation tools for automatic failover and topology management (see the sketch below):
○ MySQL Replication - ClusterControl, Orchestrator, MHA, mysqlrpladmin (GTID only)
○ MariaDB Replication - MRM, MaxScale
○ PostgreSQL - PAF (Pacemaker + Corosync)
○ memcached - Membase
● Manual failover: MySQL/MariaDB Replication, PostgreSQL logical/streaming replication, memcached.
● Automatic failover: MySQL Cluster (NDB), Galera Cluster, MongoDB, MySQL Group Replication, Riak, Redis, Cassandra, ZooKeeper, etcd, HBase.
Further reading: MySQL High Availability tools - Comparing MHA, MRM, ClusterControl
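The gist of what these topology managers automate can be sketched as follows. This is a minimal illustration with hypothetical helpers (is_alive, promote, repoint, and the replicated_position attribute), not how ClusterControl, Orchestrator, or MHA are actually implemented:

```python
import time

def failover_loop(master, replicas, is_alive, promote, repoint,
                  check_interval=1.0, misses_allowed=3):
    """Poll the master; after several consecutive missed health checks,
    promote the most up-to-date replica and repoint the others to it."""
    misses = 0
    while True:
        if is_alive(master):
            misses = 0
        else:
            misses += 1
            if misses >= misses_allowed:  # debounce: don't act on one blip
                # Pick the replica that has applied the most replication
                # events (e.g. the largest GTID set or WAL position).
                new_master = max(replicas, key=lambda r: r.replicated_position)
                promote(new_master)       # e.g. clear read_only on the node
                for replica in replicas:
                    if replica is not new_master:
                        repoint(replica, new_master)
                return new_master
        time.sleep(check_interval)
```

Real tools layer post-failover verification, flapping detection, and fencing of the old master on top of this skeleton.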
37. Failover & Switchover Tool Placement
● The topology manager must be placed in a good location to:
○ Monitor topology changes
○ Recognize failure scenarios correctly (holistic approach)
○ Promote the right node
○ Repoint slaves to the new master
○ Detect flapping
○ Perform post-failover verification
○ Send notifications and alerts
● General recommendations:
○ Place it in the same network segment as the application tier, on the primary site.
○ Have a backup line between geographical sites.
○ It must be available at all times.
[Diagram: a topology manager overseeing a master with slaves in the primary DC and an intermediate master with a slave in the secondary DC, with application servers on both sites.]
38. Reverse Proxy
● Distributes workloads across multiple database nodes.
● Popular reverse proxies for open-source DBMS:
○ HAProxy, ProxySQL, MariaDB MaxScale, nginx, IPVS, pen
● Algorithms (one of them is sketched below):
○ source (IP hash)
○ least connection (weighted)
○ round-robin (weighted)
○ random
○ least latency
● The importance of a reverse proxy:
○ Stabilizes the cluster
○ Simplifies the overall architecture
○ Connection queueing and overload protection
○ Transparency to the upper layer
[Diagram: a forward proxy sits between clients and the Internet; a reverse proxy sits between the Internet and servers.]
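As a flavour of what these balancing algorithms do, here is a minimal sketch of smooth weighted round-robin in Python; it illustrates the idea, not HAProxy's or ProxySQL's actual implementation:

```python
class WeightedRoundRobin:
    """Smooth weighted round-robin: on each pick, every backend gains
    its weight in credit, and the backend with the most credit wins."""
    def __init__(self, backends):          # backends: {name: weight}
        self.weights = dict(backends)
        self.credit = {name: 0 for name in backends}

    def pick(self):
        total = sum(self.weights.values())
        for name, weight in self.weights.items():
            self.credit[name] += weight
        chosen = max(self.credit, key=self.credit.get)
        self.credit[chosen] -= total       # pay for being chosen
        return chosen

lb = WeightedRoundRobin({"db1": 3, "db2": 1})
print([lb.pick() for _ in range(8)])       # db1 appears ~3x as often as db2
```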
39. Reverse Proxy Placement
● Centralized:
○ Tier-based
○ Simple and easy to manage
○ Requires an additional device or host
○ Usually tied to a virtual IP to eliminate the SPOF
● Distributed:
○ Co-located with the application servers
○ Harder to manage
○ Faster from the application's standpoint (caching, query rerouting, query firewall)
○ Too many proxy instances might affect health-check performance
[Diagram: centralized reverse proxies behind a virtual IP vs. distributed reverse proxies co-located with each application server.]
40. Application Driver
● Drivers that connect applications and databases.
● The high availability logic is embedded in the application, skipping the proxy tier.
● Some connectors support database high availability components (see the sketch below):
○ php-mysqlnd - load balancing, r/w splitting, persistent connections, cache.
○ Connector/J - load balancing, r/w splitting, connection pooling.
○ MySQL Fabric - framework for MySQL HA and sharding. Supports PHP, Python, Java.
○ php-mongodb - member auto-discovery.
○ MongoDB Java Driver - member auto-discovery, r/w splitting, automatic failover.
[Diagram: Java apps via Connector/J, PHP web apps via php-mysqlnd, and Python apps via MySQL Fabric, all talking to a MySQL Replication setup (one master, two slaves).]
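To illustrate the connector-level approach with a MySQL example, MySQL Connector/Python accepts a failover list of candidate servers. Hosts and credentials below are placeholders; check your connector version's documentation for the exact option name and semantics:

```python
import mysql.connector

# If the connection drops, the connector retries against the next
# server in the failover list: the HA logic lives in the driver,
# with no proxy tier in between.
conn = mysql.connector.connect(
    user="app", password="secret", database="shop",
    failover=[
        {"host": "db-master.example.com", "port": 3306},
        {"host": "db-backup.example.com", "port": 3306},
    ],
)
cur = conn.cursor()
cur.execute("SELECT @@hostname")
print(cur.fetchone())
```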
41. Quorum
● The minimum number of members required to be available (usually the majority).
● Quorum is important to:
○ Maintain consensus among DB nodes
○ Resolve network partitioning
○ Maintain data consistency
● Quorum-based clusters use heartbeats to check on each other:
○ A periodic signal generated by the DB software to indicate normal operation.
○ Happens at a regular interval.
○ An election is forced if heartbeats are skipped or inconsistent.
Quorum calculation (Galera-style weighted quorum; sketched in code below):

sum(wi, i in m) > ( sum(wi, i in p) - sum(wi, i in l) ) / 2

Where:
● pi - members of the last seen primary component,
● li - members that are known to have left gracefully,
● mi - current component members,
● wi - member weights.
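A sketch of that weighted-quorum inequality in Python. The formula is reconstructed from the variable definitions on the slide, in the style of Galera's weighted quorum, so treat it as illustrative:

```python
def has_quorum(current, last_primary, left_gracefully, weight):
    """The current component keeps quorum if its weight exceeds half the
    weight of the last primary component, not counting graceful leavers."""
    remaining = sum(weight[n] for n in last_primary if n not in left_gracefully)
    return sum(weight[n] for n in current) > remaining / 2

weight = {"db1": 1, "db2": 1, "db3": 1}
# db3 is cut off by a network partition: the {db1, db2} side keeps quorum.
print(has_quorum({"db1", "db2"}, {"db1", "db2", "db3"}, set(), weight))  # True
print(has_quorum({"db3"}, {"db1", "db2", "db3"}, set(), weight))         # False
```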
42. Quorum and Split Brain
● Split-brain is the state in which two partitioned sides cannot determine quorum and both remain available:
○ Data diverges.
○ Pretty hard to roll back once it happens; data loss is possible.
● Avoiding split-brain:
○ Use an odd total number of nodes (3, 5, 7, ...).
○ Otherwise, use:
■ Weighted quorum (e.g. 1 node = 2 votes)
■ An arbitrator node (a vote-only node)
○ Always start a node in the secondary role (read_only=ON) before promoting it to the primary role (read_only=OFF), as sketched below.
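A sketch of that last recommendation, using MySQL-flavoured statements over a generic DB-API connection (catch_up is a hypothetical wait-for-replication helper):

```python
def promote_safely(conn, catch_up):
    """Bring the node up read-only first and only open it for writes
    once it has caught up, so two writable primaries never coexist."""
    cur = conn.cursor()
    cur.execute("SET GLOBAL super_read_only = ON")  # implies read_only = ON
    catch_up(conn)             # wait until replication has fully drained
    cur.execute("SET GLOBAL read_only = OFF")       # also clears super_read_only
```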
43. Split Brain in Single Master
[Diagram: a single-master topology behind a reverse proxy. The backup master is promoted (read_only=OFF) while the old master still runs with read_only=OFF, leaving the application with two writable masters.]
44. Split Brain in Multi-Master
[Diagram: a two-node multi-master cluster partitions into two halves (1/2 vs 1/2); neither side has a majority, and the reverse proxy can switch the application between both writable masters.]
45. External Cache
● An additional tier in front of the database tier that stores frequently accessed data.
● An external cache can be useful to (see the sketch below):
○ Offload the database server by caching expensive queries or heavy datasets
○ Eliminate bottlenecks
○ Serve data while the database server is unavailable, thus improving availability
● DBMS:
○ Redis
○ memcached
● Reverse proxy:
○ ProxySQL
○ MariaDB MaxScale
[Diagram: application servers behind load balancers and a VIP, with a cache tier in front of the database.]
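A minimal cache-aside sketch with Redis, using the redis-py client (db.run_expensive_query is a hypothetical stand-in for the real database call):

```python
import json
import redis

r = redis.Redis(host="cache.example.com", port=6379)

def get_report(db, report_id, ttl=300):
    """Serve from the cache when possible; on a miss, query the
    database and repopulate the cache with a TTL."""
    key = f"report:{report_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)              # hit: database is offloaded
    row = db.run_expensive_query(report_id)    # miss: fall back to the DB
    r.setex(key, ttl, json.dumps(row))         # expire so stale data ages out
    return row
```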
46. Monitoring and Trending
● Monitor every database component - replication state, queries, disk space, IO, logs, load, locks, threads or processes, memory usage, latency.
● Benefits:
○ Get alerts when thresholds are exceeded (see the sketch below).
○ Trends database usage over time.
○ Helps with capacity planning.
○ Fast detection of database outages, failures, and table corruption.
● Popular database monitoring tools:
○ Nagios
○ PMM
○ Zabbix
○ Prometheus
○ ClusterControl
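At its core, the alerting benefit is a threshold rule evaluated on a schedule. A toy sketch (collect_metrics and send_alert are hypothetical hooks; tools like Nagios or Prometheus provide the real machinery):

```python
def check_thresholds(collect_metrics, send_alert, disk_pct=85, lag_s=30):
    """Compare collected metrics against thresholds and alert on breach."""
    for host, m in collect_metrics().items():  # e.g. {"db1": {...}, ...}
        if m["disk_used_pct"] >= disk_pct:
            send_alert(f"{host}: disk at {m['disk_used_pct']:.0f}%")
        if m["replication_lag_s"] >= lag_s:
            send_alert(f"{host}: replication lag {m['replication_lag_s']}s")
```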
48. Replication Category
[Diagram: database replication categories - logical replication, physical replication, multi-site replication, multi-master replication, replica set.]
49. Logical Replication
● A group of database servers that replicate changes from another node at the database level:
○ Master - receives reads and writes.
○ Intermediate master - replicates data from a master. Read-only.
○ Slave - replicates data from a master or intermediate master. Read-only.
● Pulls replication data from archive logs (binary log, WAL, oplog).
● Possible to replicate only a subset of the data.
● No database-level conflict resolution.
● Minimum 2 hosts. Maximum is unlimited.
● DBMS:
○ MySQL/MariaDB Replication
■ master-slave, chain, multi-source, master-master
○ PostgreSQL logical replication
[Diagram: replication data flow - an intermediate master replicates from the master; slaves replicate from the master and from the intermediate master.]
50. Physical Replication
● A group of database servers that replicate binary-level data from another node:
○ Primary - receives reads and writes.
○ Secondary - replicates data from the primary at the block level. Doesn't serve data; cold standby only.
● All nodes hold the same data set and must run the same version.
● Replication or synchronization is performed by an external process.
● Minimum 2 hosts.
● Example: MySQL with DRBD, PostgreSQL physical replication, embedded databases like SQLite with rsync.
[Diagram: replication data flow - primary and secondary block devices kept in sync by an external synchronization process.]
51. Replica Set
● A group of database servers that hold the same set of data:
○ Primary - receives writes.
○ Secondary/Replica - replicates data from the primary. Read-only.
● Quorum-based with automatic election & failover.
● No database-level conflict resolution.
● Minimum 3 hosts; the maximum is limited.
● DBMS: MongoDB
[Diagram: replication data flow - two replicas replicate from the primary over a group communication channel.]
52. Multi-Master Replication
● A group of database servers that replicate from and to one another:
○ Multi-master - all nodes are equal and capable of serving reads/writes.
○ Communication happens through a group communication protocol.
○ Transaction ordering coordination:
■ Transaction certification
■ 2-phase commit
● Database-level conflict handling and/or resolution.
● Quorum-based with automatic failover.
● Minimum 3 hosts.
● Example: Galera Cluster, MySQL Group Replication, MySQL Cluster, Postgres-BDR, CouchDB
● Note: asynchronous MySQL replication is NOT suitable for multi-master replication.
[Diagram: replication data flow - three masters replicating to and from each other via group communication.]
53. Multi-Site Replication
● Multi-site replication:
○ A group of database servers that hold the same set of logical data but are located in different physical locations.
○ Based on a replication factor.
○ Masterless architecture:
■ All replicas are equally important.
■ Write to any node.
■ The node coordinates with the replicas.
○ Eventual consistency; uses clocks for conflict resolution.
○ Built for geographical redundancy.
○ Example: Cassandra, Riak
[Diagram: coordinator nodes write to replicas across sites; replication factor = 3.]
54. High Availability Solutions
● Relational: MySQL Replication, MariaDB Replication, Galera Cluster, MySQL Cluster, MySQL Group Replication, PostgreSQL Replication
● Document: MongoDB, MarkLogic, Couchbase
● Columnar: ClickHouse, Apache HBase, Cassandra
● Key-Value: Redis, Riak KV, etcd
55. Example: MySQL Replication Scale Out
[Diagram: MySQL Replication scale-out - from a single master-slave pair, to master-slave behind ProxySQL (RW to the master, R to the slaves), to a master with an intermediate master and slaves managed by a topology manager.]
56. Example: Galera Cluster Scale Out
[Diagram: Galera Cluster scale-out - multi-writer; multi-writer with a replication slave; single-writer with a replication slave, all fronted by ProxySQL.]
57. Example: MongoDB Scale Out
[Diagram: MongoDB scale-out - standalone; a replica set (primary + 2 secondaries); a sharded cluster with each shard as a replica set, a config replica set, and mongos routers.]
59. Trade Offs
Trade-offs when running a highly available database system: cost, complexity, performance, locking.
60. Locking
● Problems:
○ Lock contention (hotspots)
○ Long-term blocking (unreleased locks)
○ Database deadlocks (2 or more transactions locking each other)
○ System deadlocks (locks outside of the database: read-only filesystem, snapshots)
● Tips:
○ Use a lower isolation level (better concurrency, weaker consistency).
○ Change the application behaviour:
■ Chunk big transactions into smaller ones (see the sketch below).
○ Avoid hotspots from multi-writers:
■ Forward the queries to a single writer.
○ Use a hot-backup utility, or back up from a replica.
[Diagram: single-writer Galera Cluster with a replication slave behind ProxySQL, as in the earlier scale-out example.]
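For the chunking tip, a minimal sketch of purging rows in small batches (MySQL-flavoured SQL over a DB-API connection; the events table is made up):

```python
def delete_in_chunks(conn, chunk=1000):
    """Delete expired rows in many small transactions instead of one
    huge one, so locks are held briefly and released between batches."""
    cur = conn.cursor()
    while True:
        cur.execute(
            "DELETE FROM events WHERE expires_at < NOW() LIMIT %s", (chunk,)
        )
        conn.commit()              # release row locks before the next batch
        if cur.rowcount < chunk:   # last batch was partial: nothing left
            break
```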
61. Deployment and Operational Cost
● Capital expenditure (capex) is high:
○ Hardware preparation
○ Cluster deployment
○ Training
○ Testing and integration
○ Tuning
○ Multiple sets of clusters (staging/test/dev)
● Operational expenditure (opex) gets lower over time with:
○ Automation
○ Skills and expertise
○ Experience
○ Monitoring & alerting
○ Consolidation
● General recommendations for cost justification:
○ Calculate the ROI.
○ Estimate the planned and unplanned downtime cost per hour (see the sketch below).
○ Evaluate the existing infrastructure.
○ Plan for growth.
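For the downtime-cost estimate, the arithmetic is simple enough to sketch (the hourly cost figure is a made-up input):

```python
def downtime_per_year(availability_pct, cost_per_hour):
    """Hours of downtime implied by an availability target over a year,
    and what it costs at a given hourly downtime cost."""
    hours = (1 - availability_pct / 100) * 365 * 24
    return hours, hours * cost_per_hour

for target in (99.0, 99.9, 99.99):
    hours, cost = downtime_per_year(target, cost_per_hour=10_000)
    print(f"{target}%: {hours:.1f} h/year -> ${cost:,.0f}/year")
```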
62. Performance Issues
● Latency is often problematic in distributed systems, especially ACID DBMS.
● Issues:
○ Network saturation (heartbeating, replication, certification, syncing)
○ Unbalanced distribution
○ Non-uniform hardware
○ Large query performance (batches, routines, etc.)
● General recommendations:
○ Use a reliable LAN (and/or WAN).
○ Use the right database for the right job.
○ Monitor everything.
○ Cache expensive queries.
○ Run a performance assessment every quarter.
63. System Complexity
● Operational tasks become more complex:
○ Repetition (config changes, backups, logs, troubleshooting, upgrades)
○ Additional environments (testing/staging/development)
○ Knowledge transfer
○ Scaling (auto-scaling, sharding)
● General recommendations:
○ Strict user access.
○ Document, log, and audit everything.
○ Embrace automation tools.
○ Always test before pushing changes to the production environment.