Webinar slides: Designing Open Source Databases for High Availability

  1. Designing Open-Source Databases for High Availability (Mar 2018). Presenter: Ashraf Sharif, Support Engineer (ashraf@severalnines.com)
  2. Your host & some logistics: I'm Jean-Jérôme from the Severalnines team and I'm your host for today's webinar! Feel free to ask questions in the Questions section of this application or via the chat box. You can also contact me directly via the chat box or by email (info@severalnines.com) during or after the webinar.
  3. About Severalnines and ClusterControl
  4. What We Do: Deploy, Monitor, Manage, Scale
  5. ClusterControl Automation & Management
     ● Deployment: deploy a cluster in minutes, on-premises or in the cloud (AWS)
     ● Monitoring: systems view with 1-second resolution, DB/OS stats & performance advisors, configurable dashboards, query analyzer, real-time and historical data
     ● Management: multi-cluster / multi-DC, automated repair & recovery, database upgrades, backups, configuration management, database cloning, one-click scaling
  6. Supported Databases
  7. Our Customers
  8. Designing Open-Source Databases for High Availability (Mar 2018). Presenter: Ashraf Sharif, Support Engineer (ashraf@severalnines.com)
  9. Agenda: ● Introduction ● HA Concepts ● Architecting Database for Failures ● HA Solutions ● Trade-Offs
  10. Introduction
  11. What is High Availability? A characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period.
  12. Database Architecture Evolution [architecture diagram]: 1990's - RDBMS pairs with shared disk and a virtual IP; 2000's - an RDBMS fronted by a reverse proxy and a cache; 2010's - mixed RDBMS/NoSQL fleets with distributed caches.
  13. Why Design for HA? Why do we need to design a highly available database system? Key drivers: Scalability, Resilience, Reliability, Performance. Motivations include growth, hardware failures, network partitions, reconciliation, parallelism, load distribution, moving data closer to users, business continuation, consolidation, outage protection, agility and data integrity.
  14. Planning on High Availability: "He who fails to plan is planning to fail" - Winston Churchill
  15. High Availability Concepts
  16. CAP Theorem
     ● "It is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: Consistency, Availability, Partition tolerance." - Eric Brewer, 2000
     ● In the event of a network partition, one has to choose between:
       ○ Consistency - enforced consistency
       ○ Availability - eventual consistency
     ● Consistency: commits are atomic across the entire distributed system.
     ● Availability: the system remains accessible and operational at all times.
     ● Partition tolerance: only a total network failure can cause the system to respond incorrectly.
     ● Pick two:
       ○ CA - RDBMS: MySQL, PostgreSQL
       ○ AP - Cassandra, Dynamo, CouchDB, Riak
       ○ CP - HBase, MongoDB, Redis, Galera Cluster
  17. CAP Theorem: when P, choose C over A. Example: a 3-node MySQL Galera Cluster across 3 datacenters during network partitioning.
  18. CAP Theorem: the same 3-node MySQL Galera Cluster across 3 datacenters; without a partition it provides A + C, and during network partitioning it provides P + C at the expense of A.
  19. CAP Theorem: when P, choose A over C. Example: a 5-node Cassandra Cluster across 3 datacenters stays fully operational during network partitioning.
  20. PACELC Theorem
     ● "If there is a Partition, how does the system trade off Availability and Consistency; Else, when the system is running normally in the absence of partitions, how does the system trade off Latency and Consistency?" - Daniel Abadi, 2011
     ● Classifies systems by their behaviour in active mode (running normally) and degraded mode (when a network partition happens).
     ● In pseudocode: if Partitioned { choose A or C; # as per the CAP theorem } else { choose L or C; }
  21. PACELC Theorem [diagram]: for an HA distributed database system - if Partitioned, pick one of Availability or Consistency; Else, pick one of Latency or Consistency.
  22. PACELC Theorem: (P + C) - A. Example: a 3-node MySQL Galera Cluster across 3 datacenters, non-operational during a network partition.
  23. PACELC Theorem: when E, choose C over L. Example: a fully operational 3-node MySQL Galera Cluster across 3 datacenters.
  24. ACID vs BASE
     ● ACID (pessimistic; typical of RDBMS and OLTP):
       ○ Atomicity - transactions are all or nothing
       ○ Consistency - only valid data is saved
       ○ Isolation - transactions do not affect each other
       ○ Durability - committed data is never lost
     ● BASE (optimistic; typical of NoSQL and OLAP):
       ○ Basic Availability - the database appears to work most of the time; no SPOF
       ○ Soft State - weak consistency
       ○ Eventual Consistency - exhibits consistency at some later point; last write wins
  25. PACELC on Database Systems
     ● BASE (PA/EL): Cassandra, Dynamo, Riak
     ● BASE (PA/EC): MongoDB
     ● ACID (PC/EC): Galera Cluster, NDB Cluster
     ● ACID (EC): MySQL, PostgreSQL
  26. Architecting Database for Failures
  27. High Availability Principles: elimination of SPOF, failure detection, failover/switchover.
  28. Multi-tier Architecture
     ● At least two layers:
       ○ Presentation tier
       ○ Data tier
     ● Common changes to the application:
       ○ Session management
       ○ File storage (NFS, clustered FS, object storage)
       ○ Application interaction with the data tier (see the read/write-splitting sketch below):
         ■ Query splitting
         ■ Connection pooling
         ■ Query hints
     ● Common changes to the database:
       ○ Reverse proxy
       ○ Virtual IP address
       ○ Schema change operations
       ○ Database maintenance (backup, config, etc.)
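     To make "query splitting" concrete, here is a minimal, hypothetical sketch (not from the slides) of how an application might route statements to a writer or a pool of read replicas. The DSN strings, the read-prefix list and the round-robin choice are all illustrative assumptions.

        import itertools

        # Hypothetical endpoints: a master for writes and replicas for reads
        # (in a real deployment a reverse proxy could stand in for these).
        WRITER_DSN = "mysql://db-master:3306"
        READER_DSNS = itertools.cycle([
            "mysql://db-replica1:3306",
            "mysql://db-replica2:3306",
        ])

        READ_PREFIXES = ("select", "show")

        def route(statement: str) -> str:
            """Pick the endpoint for a statement (naive read/write split)."""
            first_word = statement.lstrip().split(None, 1)[0].lower()
            if first_word in READ_PREFIXES:
                return next(READER_DSNS)   # reads rotate across replicas
            return WRITER_DSN              # everything else goes to the master

        print(route("SELECT * FROM t"))     # -> a replica
        print(route("UPDATE t SET x = 1"))  # -> the master

     A connection pool would sit behind route(); real drivers and proxies also handle transactions and session state, which this sketch ignores.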
  29. Read/Write Ratio
     ● Is your database workload read-intensive or write-intensive?
     ● R/W ratio estimation (reproduced in plain arithmetic below):
       ○ Read % = read_operations / total_operations * 100
       ○ Write % = write_operations / total_operations * 100
     ● Read-intensive:
       ○ Focus on read scalability and availability
       ○ Add more replicas
       ○ Cache resource-intensive queries
       ○ Indexing, query optimization, parallelism
     ● Write-intensive:
       ○ Focus on write scalability and availability
       ○ Add more writers, shards
       ○ Conflict resolution
     ● Example (MySQL):
       mysql> SELECT
                @total_com := SUM(IF(variable_name IN ('Com_select', 'Com_delete', 'Com_insert', 'Com_update', 'Com_replace'), variable_value, 0)) AS `Total`,
                @total_reads := SUM(IF(variable_name = 'Com_select', variable_value, 0)) AS `Total reads`,
                @total_writes := SUM(IF(variable_name IN ('Com_delete', 'Com_insert', 'Com_update', 'Com_replace'), variable_value, 0)) AS `Total writes`,
                ROUND((@total_reads / @total_com * 100), 2) AS `Reads %`,
                ROUND((@total_writes / @total_com * 100), 2) AS `Writes %`
              FROM information_schema.GLOBAL_STATUS\G
       ****************** 1. row **********************
                Total: 2581119
          Total reads: 1643609
         Total writes: 937510
              Reads %: 63.68
             Writes %: 36.32
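     The same estimation in plain arithmetic; a small sketch that reproduces the percentages from the slide's counter values.

        def rw_ratio(reads: int, writes: int) -> tuple[float, float]:
            """Return (read %, write %) of total operations."""
            total = reads + writes
            return (round(reads / total * 100, 2),
                    round(writes / total * 100, 2))

        # Counter values taken from the slide's MySQL output:
        print(rw_ratio(1643609, 937510))  # -> (63.68, 36.32)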
  30. Redundancy: hardware, location, data.
  31. Redundancy - Hardware
     ● Minimum number of database nodes:
       ○ Two (single-master)
       ○ Three (multi-master, quorum-based)
     ● General recommendations:
       ○ Keep host specifications uniform.
       ○ Spare at least one standby instance/host for emergency replacement.
       ○ Use redundant power supplies, disks, switches and network interface cards.
         ■ Use NIC bonding if you have multiple NICs.
  32. Redundancy - Location
     ● A separate physical site, mainly for restoring operations (a disaster recovery site); it can also be used for service distribution.
     ● Major disasters:
       ○ Mother nature - flood, earthquake, hurricane
       ○ Building - fire, power failure, rack failure, gateway failure
       ○ Others - theft, sabotage
     ● General recommendations:
       ○ Allocate sufficient bandwidth to connect the sites.
       ○ An active-active setup requires at least 3 sites to respect quorum (e.g. two primary sites plus an arbitrator).
         ■ Both main sites should run on good hardware.
  33. Redundancy - Data
     ● Data redundancy is achieved through replication.
     ● Replication methods:
       ○ Single-leader (master/slave)
       ○ Multi-leader (master/master)
       ○ Quorum-based replication
     ● Replication synchronization:
       ○ Asynchronous
       ○ Semi-synchronous
       ○ Virtually synchronous
       ○ Synchronous (two-phase commit)
  34. Capacity Planning - Storage Space
     ● Storage space for data must be sufficient until the next hardware refresh cycle.
     ● Production storage dimensioning example (a runnable version follows below):
       ○ Next hardware cycle: 3 years
       ○ Current DB size: 2048 MB
       ○ Current full backup size (week N): 1024 MB
       ○ Previous full backup size (week N-1): 768 MB
       ○ Delta size: 256 MB per week
       ○ Delta ratio: ~30% increment/week
       ○ Total DB size estimation: (30 x 2048 x 52 x 3) / 100 = 95846 MB ~ 95 GB after 3 years
       ○ Add 100% more room for operation and maintenance (local backup, data staging, operation logs, OS, etc.):
         ■ 95 + 95 = 190 GB of storage
       ○ Memory-based: 16 x 16 GB RAM = 256 GB
       ○ Disk-based: 4 x 128 GB SSD in RAID 10 with a battery-backed RAID controller ~ 250 GB
     ● Yearly database size estimation from weekly backups, where Bn is the current week's full backup size, Bn-1 the previous week's, Dbdata the total database data size, Dbindex the total database index size, and Y the number of years. (The formula itself was shown as an image; from the worked example it is roughly: weekly growth ratio R ~ (Bn - Bn-1) / Bn, projected size ~ R x (Dbdata + Dbindex) x 52 x Y.)
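     A minimal sketch of that estimation, assuming the formula reconstructed above; the 100% operational headroom from the slide is a parameter.

        def estimate_storage_mb(db_size_mb: float, backup_now_mb: float,
                                backup_prev_mb: float, years: int,
                                headroom: float = 1.0) -> float:
            """Project storage needs from weekly backup growth (slide 34's method)."""
            weekly_ratio = (backup_now_mb - backup_prev_mb) / backup_now_mb
            projected = weekly_ratio * db_size_mb * 52 * years
            return projected * (1 + headroom)  # add room for ops/maintenance

        # With the slide's numbers this yields ~156 GB; the slide rounds the
        # weekly growth ratio up to 30% and arrives at ~190 GB.
        print(round(estimate_storage_mb(2048, 1024, 768, 3) / 1024), "GB")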
  35. Capacity Planning - Processor
     ● Lots of cores or a faster CPU clock?
       ○ A faster CPU clock is generally the better option for a DBMS.
     ● Understand the database software's behaviour around multi-core and parallelism, for example:
       ○ PostgreSQL is multi-core friendly.
       ○ MySQL Replication does not scale well on multi-core machines:
         ■ A single connection will only use a single core.
         ■ If workloads are IO-bound, you will never use more than one core.
       ○ Most reverse proxies and caches are single-core.
       ○ Most column-based DBMS support multi-core and parallelism.
  36. Failover & Switchover
     ● Failover: a procedure by which a system automatically transfers control to a duplicate system when it detects a fault or failure.
     ● Understanding database failover and switchover matters in order to:
       ○ Avoid data loss
       ○ Ensure data consistency
       ○ Eliminate the element of surprise
     ● Manual failover: MySQL/MariaDB Replication, PostgreSQL logical/streaming replication, memcached.
     ● Automatic failover: MySQL Cluster (NDB), Galera Cluster, MongoDB, MySQL Group Replication, Riak, Redis, Cassandra, ZooKeeper, etcd, HBase.
     ● Automation tools for automatic failover and topology management (a naive detection loop is sketched below):
       ○ MySQL Replication - ClusterControl, Orchestrator, MHA, mysqlrpladmin (GTID only)
       ○ MariaDB Replication - MRM, MaxScale
       ○ PostgreSQL - PAF (Pacemaker + Corosync)
       ○ memcached - Membase
     ● Further reading: MySQL High Availability tools - Comparing MHA, MRM, ClusterControl
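     To illustrate what such a topology manager does at its core, here is a deliberately naive, hypothetical sketch; none of the tools above work exactly this way. It polls the master, tolerates a few missed health checks before declaring failure, then promotes a replica.

        import time
        from itertools import count

        # Simulated probe: healthy for the first five polls, then down. A real
        # check would open a connection, run a query and inspect replication.
        _polls = count()
        def healthy(node: str) -> bool:
            return next(_polls) < 5

        def failover_loop(master: str, replicas: list[str],
                          interval: float = 0.1, max_misses: int = 3) -> str:
            """Promote a replica after max_misses consecutive failed checks."""
            misses = 0
            while True:
                if healthy(master):
                    misses = 0         # reset on success (anti-flapping)
                else:
                    misses += 1
                    if misses >= max_misses:
                        # A real tool would pick the most up-to-date replica
                        # (comparing GTIDs/log positions) and repoint the rest.
                        return replicas[0]
                time.sleep(interval)

        print(failover_loop("db1", ["db2", "db3"]))  # -> db2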
  37. Failover & Switchover Tool Placement
     ● The topology manager must be in a good location to:
       ○ Monitor topology changes
       ○ Recognize failure scenarios correctly (holistic approach)
       ○ Promote the right node
       ○ Repoint slaves to the new master
       ○ Detect flapping
       ○ Perform post-failover verification
       ○ Send notifications and alerts
     ● General recommendations:
       ○ Place it in the same network segment as the application tier, on the primary site.
       ○ Have a backup line between geographical sites.
       ○ It must be available at all times.
  38. Reverse Proxy
     ● Distributes workloads across multiple database nodes (a forward proxy sits in front of clients; a reverse proxy sits in front of servers).
     ● Popular reverse proxies for open-source DBMS: HAProxy, ProxySQL, MariaDB MaxScale, nginx, IPVS, pen.
     ● Balancing algorithms (two are sketched below):
       ○ Source (IP hash)
       ○ Least connections (weighted)
       ○ Round-robin (weighted)
       ○ Random
       ○ Least latency
     ● Why a reverse proxy matters:
       ○ Stabilizes the cluster
       ○ Simplifies the overall architecture
       ○ Connection queueing and overload protection
       ○ Transparency to the upper layer
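     A toy sketch of two of those algorithms; real proxies such as HAProxy or ProxySQL add health checks, weights and connection tracking on top. The server names are made up.

        import itertools

        SERVERS = ["db1:3306", "db2:3306", "db3:3306"]

        # Round-robin: hand out backends in a fixed rotation.
        _rotation = itertools.cycle(SERVERS)
        def round_robin() -> str:
            return next(_rotation)

        # Least connections: pick the backend with the fewest active sessions.
        active = {s: 0 for s in SERVERS}
        def least_connections() -> str:
            server = min(active, key=active.get)
            active[server] += 1  # caller decrements when the session closes
            return server

        for _ in range(4):
            print(round_robin(), least_connections())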
  39. Reverse Proxy Placement
     ● Centralized (tier-based):
       ○ Simple and easy to manage
       ○ Needs an additional device or host
       ○ Usually tied together with a virtual IP to eliminate the SPOF
     ● Distributed (co-located with the application servers):
       ○ Harder to manage
       ○ Faster from the application standpoint (caching, query rerouting, query firewall)
       ○ Too many proxy instances might affect health-check performance
  40. Application Driver
     ● Drivers connect applications directly to the databases; the high-availability logic is embedded into the application, skipping the proxy tier.
     ● Some connectors support database high-availability features:
       ○ php-mysqlnd - load balancing, r/w splitting, persistent connections, cache
       ○ Connector/J - load balancing, r/w splitting, connection pooling
       ○ MySQL Fabric - framework for MySQL HA and sharding; supports PHP, Python, Java
       ○ php-mongodb - member auto-discovery
       ○ MongoDB Java Driver - member auto-discovery, r/w splitting, automatic failover
  41. Quorum
     ● The minimum number of members required to be available (usually a majority).
     ● Quorum is important to:
       ○ Maintain consensus among DB nodes
       ○ Resolve network partitioning
       ○ Maintain data consistency
     ● Quorum-based clusters use heartbeats to check on each other:
       ○ A periodic signal generated by the DB software to indicate normal operation.
       ○ Sent at a regular interval.
       ○ An election is forced if heartbeats are skipped or inconsistent.
     ● Weighted quorum calculation (the slide's formula was an image; Galera's documented inequality, which these variables match, is approximately: sum over pi of wi - sum over li of wi < 2 x sum over mi of wi), where:
       ○ pi - members of the last seen primary component
       ○ li - members known to have left gracefully
       ○ mi - current component members
       ○ wi - member weights
     ● A majority check in code is sketched below.
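     A small sketch of the weighted majority check, assuming the inequality reconstructed above; node names and weights are hypothetical.

        def has_quorum(prev_primary: dict[str, int], left_gracefully: set[str],
                       current: dict[str, int]) -> bool:
            """Current members must outweigh half of the last primary
            component, excluding nodes that left gracefully."""
            prev = sum(w for n, w in prev_primary.items()
                       if n not in left_gracefully)
            return 2 * sum(current.values()) > prev

        nodes = {"db1": 1, "db2": 1, "db3": 1}
        # A partition cuts off db3: the {db1, db2} side keeps quorum (2 of 3)...
        print(has_quorum(nodes, set(), {"db1": 1, "db2": 1}))  # True
        # ...while the isolated {db3} side loses it (1 of 3).
        print(has_quorum(nodes, set(), {"db3": 1}))            # False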
  42. Quorum and Split-Brain
     ● Split-brain is the state in which two partitioned sides cannot determine quorum and both remain available:
       ○ Data diverges.
       ○ Very hard to roll back once it happens; possible data loss.
     ● Avoiding split-brain:
       ○ Use an odd total number of nodes (3, 5, 7, ...).
       ○ Otherwise, use:
         ■ A weighted quorum (e.g. 1 node = 2 votes)
         ■ An arbitrator node (a vote-only node)
       ○ Always start a node in the secondary role (read_only=ON) before promoting it to the primary role (read_only=OFF).
  43. Split-Brain in Single-Master [diagram]: a master with a backup master and slaves behind a reverse proxy; when the backup master is promoted while the old master is still alive, both end up with read_only = OFF and the application writes to both.
  44. Split-Brain in Multi-Master [diagram]: two masters connected through a switch, with a reverse proxy splitting application traffic 1/2 and 1/2 between them; if the link between the masters breaks, each side keeps accepting writes and they diverge.
  45. External Cache
     ● An additional tier in front of the database tier that stores frequently accessed data (the cache-aside pattern is sketched below).
     ● An external cache can be useful to:
       ○ Offload the database server by caching expensive queries or heavy datasets
       ○ Eliminate bottlenecks
       ○ Serve data while the database server is unavailable, thus improving availability
     ● DBMS-side: Redis, memcached
     ● Reverse-proxy-side: ProxySQL, MariaDB MaxScale
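     The usual way to use such a cache is the cache-aside pattern. A minimal sketch, with an in-memory dict standing in for Redis/memcached and a stubbed query function (both are assumptions, not from the slides):

        import time

        CACHE: dict[str, tuple[float, object]] = {}  # key -> (expiry, value)
        TTL = 60.0                                   # seconds

        def run_query(sql: str) -> str:
            """Stub for the real (expensive) database call."""
            return f"result of {sql!r}"

        def cached_query(sql: str) -> object:
            now = time.time()
            hit = CACHE.get(sql)
            if hit and hit[0] > now:         # fresh hit: skip the database
                return hit[1]
            value = run_query(sql)           # miss or stale: query the DB
            CACHE[sql] = (now + TTL, value)  # populate for later reads
            return value

        print(cached_query("SELECT * FROM hot_table"))  # hits the database
        print(cached_query("SELECT * FROM hot_table"))  # served from cache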
  46. Monitoring and Trending
     ● Monitor every database component: replication state, queries, disk space, IO, logs, load, locks, threads/processes, memory usage, latency.
     ● Benefits:
       ○ Alerts when thresholds are exceeded
       ○ Trends database usage over time
       ○ Helps with capacity planning
       ○ Fast detection of database outages, failures and table corruption
     ● Popular database monitoring tools: Nagios, PMM, Zabbix, Prometheus, ClusterControl
  47. High Availability Solutions
  48. Replication Categories: logical replication, physical replication, replica set, multi-master replication, multi-site replication.
  49. Logical Replication
     ● A group of database servers that replicate changes from another node at the database level:
       ○ Master - receives reads and writes.
       ○ Intermediate master - replicates data from a master; read-only.
       ○ Slave - replicates data from a master or intermediate master; read-only.
     ● Pulls replication data from archive logs (binary log, WAL, oplog).
     ● Can replicate only a subset of the data.
     ● No database-level conflict resolution.
     ● Minimum 2 hosts; the maximum is unlimited.
     ● DBMS:
       ○ MySQL/MariaDB Replication (master-slave, chain, multi-source, master-master)
       ○ PostgreSQL logical replication
  50. Physical Replication
     ● A group of database servers that replicate binary-level data from another node:
       ○ Primary - receives reads and writes.
       ○ Secondary - replicates data from the primary at the block level; does not serve data (cold standby only).
     ● All nodes hold the same data set and must run the same version.
     ● Replication or synchronization is performed by an external process.
     ● Minimum 2 hosts.
     ● Examples: MySQL with DRBD, PostgreSQL physical replication, embedded databases like SQLite with rsync.
  51. Replica Set
     ● A group of database servers that hold the same set of data:
       ○ Primary - receives writes.
       ○ Secondary/replica - replicates data from the primary; read-only.
     ● Quorum-based, with automatic election and failover.
     ● No database-level conflict resolution.
     ● Minimum 3 hosts; the maximum is limited.
     ● DBMS: MongoDB
  52. Multi-Master Replication
     ● A group of database servers that replicate from and to one another:
       ○ All nodes are equal and capable of serving reads and writes.
       ○ Nodes communicate through a group communication protocol.
       ○ Transaction ordering coordination: transaction certification or 2-phase commit.
     ● Database-level conflict handling and/or resolution.
     ● Quorum-based, with automatic failover.
     ● Minimum 3 hosts.
     ● Examples: Galera Cluster, MySQL Group Replication, MySQL Cluster, Postgres-BDR, CouchDB
     ● Note: asynchronous MySQL replication is NOT suitable for multi-master replication.
  53. Multi-Site Replication
     ● A group of database servers that hold the same set of logical data but are located in different physical locations.
     ● Based on a replication factor (e.g. replication factor = 3).
     ● Masterless architecture:
       ○ All replicas are equally important.
       ○ Write to any node; the node coordinates with the replicas.
     ● Eventual consistency; uses clocks for conflict resolution.
     ● Built for geographical redundancy.
     ● Examples: Cassandra, Riak
  54. High Availability Solutions by data model:
     ● Relational: MySQL Replication, MariaDB Replication, Galera Cluster, MySQL Cluster, MySQL Group Replication, PostgreSQL Replication
     ● Document: MongoDB, MarkLogic, Couchbase
     ● Columnar: ClickHouse, Apache HBase, Cassandra
     ● Key-Value: Redis, Riak KV, etcd
  55. Example: MySQL Replication Scale-Out [diagram]: from a single master-slave pair, to master-slave behind ProxySQL (RW to the master, R to the slaves), to master / intermediate master / slaves with ProxySQL and a topology manager.
  56. Example: Galera Cluster Scale-Out [diagram]: three topologies - multi-writer; multi-writer with a replication slave; single-writer with a replication slave (fronted by ProxySQL with an RW/R split).
  57. Example: MongoDB Scale-Out [diagram]: from a standalone instance, to a replica set (primary + secondaries), to a sharded cluster in which mongos routers direct traffic to replica-set shards plus a config replica set.
  58. Trade-Offs
  59. Trade-offs when running a highly available database system: cost, complexity, performance, locking.
  60. Locking
     ● Problems:
       ○ Lock contention (hotspots)
       ○ Long-term blocking (unreleased locks)
       ○ Database deadlocks (2 or more transactions locking each other)
       ○ System deadlocks (locks outside the database: read-only filesystem, snapshots)
     ● Tips:
       ○ Use a lower isolation level (better concurrency, weaker consistency).
       ○ Change the application behaviour:
         ■ Chunk a big transaction into smaller ones (see the sketch below).
       ○ Avoid hotspots with multiple writers:
         ■ Forward the queries to a single writer.
       ○ Use a hot-backup utility, or back up from a replica.
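     A minimal sketch of the chunking tip, using Python's bundled sqlite3 so it runs anywhere; the table and batch size are hypothetical. Deleting in small batches keeps each transaction, and therefore its locks, short.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
        conn.executemany("INSERT INTO events (payload) VALUES (?)",
                         [("x",)] * 10000)
        conn.commit()

        BATCH = 1000
        while True:
            # Each batch is its own short transaction, so locks are released
            # quickly instead of being held for one huge DELETE.
            cur = conn.execute("DELETE FROM events WHERE id IN "
                               "(SELECT id FROM events LIMIT ?)", (BATCH,))
            conn.commit()
            if cur.rowcount < BATCH:  # fewer rows than a full batch: done
                break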
  61. Deployment and Operational Cost
     ● Capital expenditure (capex) is high:
       ○ Hardware preparation
       ○ Cluster deployment
       ○ Training
       ○ Testing and integration
       ○ Tuning
       ○ Multiple sets of clusters (staging/test/dev)
     ● Operational expenditure (opex) gets lower over time with:
       ○ Automation
       ○ Skills and expertise
       ○ Experience
       ○ Monitoring & alerting
       ○ Consolidation
     ● General recommendations for cost justification:
       ○ Calculate the ROI.
       ○ Estimate the planned and unplanned downtime cost per hour.
       ○ Evaluate the existing infrastructure.
       ○ Plan for growth.
  62. Performance Issues
     ● Latency is often problematic in distributed systems, especially ACID DBMS.
     ● Issues:
       ○ Network saturation (heartbeating, replication, certification, syncing)
       ○ Unbalanced distribution
       ○ Non-uniform hardware
       ○ Large-query performance (batches, routines, etc.)
     ● General recommendations:
       ○ Use a reliable LAN (and/or WAN).
       ○ Use the right database for the right job.
       ○ Monitor everything.
       ○ Cache expensive queries.
       ○ Run a performance assessment every quarter.
  63. System Complexity
     ● Operational tasks become more complex:
       ○ Repetition (config changes, backups, logs, troubleshooting, upgrades)
       ○ Additional environments (testing/staging/development)
       ○ Knowledge transfer
       ○ Scaling (auto-scaling, sharding)
     ● General recommendations:
       ○ Enforce strict user access.
       ○ Document, log and audit everything.
       ○ Embrace automation tools.
       ○ Always test before pushing changes to the production environment.
  64. Q & A
  65. Additional Resources
     ● Repair and recovery for your MySQL, MariaDB and MongoDB clusters
     ● HA & load balancing tutorials
     ● Download ClusterControl
     ● Contact us: info@severalnines.com