M|18 Under the Hood: Galera Cluster

Published in: Data & Analytics

  1. 1. MariaDB Galera Cluster: State of the Art. Seppo Jaakola, Codership
  2. 2. www.galeracluster.com ➢ Seppo Jaakola ➢ One of the founders of Codership ➢ Codership: developers of Galera Replication ➢ Partner of MariaDB for developing and supporting MariaDB Galera Cluster ➢ Galera releases since 2009
  3. 3. Agenda ● Galera Cluster Overview ● Galera Cluster Upgrading ● Release History ● Galera with MariaDB ● Galera 4 Major Features ● Huge Transaction Support ● Inconsistency Voting ● Non-Blocking DDL ● Galera 4 Road Map
  4. 4. Synchronous Multi-Master Replication (diagram: three MariaDB nodes joined by Galera Replication, each serving read & write) ● Replication is synchronous ● Client can connect to any node ● There can be several nodes ● Read & write access to any node
  5. 5. Synchronous Multi-Master Replication (diagram: MariaDB nodes, each serving read & write) ● A multi-master cluster looks like one big database with multiple entry points
  6. 6. Synchronous Multi-Master Replication (diagram: a fourth read & write node added) ● Adding more nodes opens new connection ports
  7. 7. Galera Cluster ➢ Good Performance ➢ Optimistic concurrency control ➢ Parallel Replication ➢ Optimized Group Communication ➢ 99.99% transparent ➢ InnoDB look & feel ➢ Automatic node joining ➢ Works in LAN / WAN / Cloud
  8. 8. Galera Cluster ➢ Synchronous multi-master cluster ➢ No data loss ➢ No slave lag ➢ No slave failover ➢ For MySQL/InnoDB ➢ 3 or more nodes needed for HA ➢ No single point of failure
  9. 9. Galera Replication Plugin (architecture diagram; labels: DBMS, wsrep hooks, wsrep API v25, Galera plugin as the wsrep provider v25, certification, replication, GCS framework, vsbes, gcomm, spread)
  10. 10. Galera Rolling Upgrades: upgrade with API #25 (diagram: three MariaDB nodes on API 25, each serving read & write; Galera Replication on API 25)
  11. 11. Galera Rolling Upgrades: upgrade with API #25 ● Upgrade all nodes one by one (diagram: all nodes and Galera Replication stay on API 25)
  12. 12. Galera Rolling Upgrades: upgrade with API #26 (diagram: three MariaDB nodes on API 25; Galera Replication on API 25)
  13. 13. Galera Rolling Upgrades: upgrade with API #26 ● One node upgraded to API #26 (diagram: nodes on API 25, 26 and 25; Galera Replication still on API 25)
  14. 14. Galera Rolling Upgrades: upgrade with API #26 ● All nodes upgraded to API #26 ● API #26 features now enabled in replication (diagram: all nodes and Galera Replication on API 26)
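
The upgrade sequence on the preceding slides can be modelled in a few lines. The following is a minimal Python sketch, not Galera code. It assumes the cluster replicates with the lowest API version that any node runs, which is how the diagrams read: replication stays on API 25 until the last node reaches 26. The function names `effective_api_version` and `rolling_upgrade` are hypothetical.

```python
def effective_api_version(node_versions):
    """Return the API version the whole cluster can use.

    Assumption of this sketch: the cluster is limited by its oldest node,
    so the effective version is the minimum over all nodes.
    """
    if not node_versions:
        raise ValueError("cluster has no nodes")
    return min(node_versions)


def rolling_upgrade(node_versions, target):
    """Upgrade nodes one at a time and record the cluster-wide API version.

    Returns a list holding the effective version before the upgrade starts
    and after each single-node upgrade step.
    """
    versions = list(node_versions)
    history = [effective_api_version(versions)]
    for i in range(len(versions)):
        versions[i] = max(versions[i], target)  # never downgrade a node
        history.append(effective_api_version(versions))
    return history
```

For a three-node cluster on API 25, `rolling_upgrade([25, 25, 25], 26)` returns `[25, 25, 25, 26]`: the new API only becomes effective once the last node has been upgraded.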
  15. 15. Galera Releases (timeline, 2009 to 2017; releases: 1.0, 2.0, 3.0, 5.7 release; features: GTID, DDL replication, TOI, SST scripts, Parallel Replication, gcache, Foreign keys, IST, DDL with RSU, WAN replication, Async replication, Cluster Crash Recovery, Donor Selection)
  16. 16. MariaDB Galera Cluster
  17. 17. Galera Project (diagram; labels: R&D; WSREP API, MySQL Community edition and Galera Replication Plugin make up Galera Cluster for MySQL; WSREP API and MariaDB merge make up MariaDB Galera Cluster)
  18. 18. MariaDB Galera Cluster ● MariaDB Galera Cluster releases were based on MariaDB 5.5 and 10.0 ● Since MariaDB 10.1, Galera is built into MariaDB ● A MariaDB 10.1 server operates as a Galera Cluster node if the wsrep plugin is configured with wsrep_provider ● If the wsrep plugin is not configured, the server works as a native MariaDB server ● Galera is also present in MariaDB 10.2 and will be present in MariaDB 10.3 as well
  19. 19. Galera in MariaDB 10.3 ● MariaDB 10.3 has a huge feature set: ● Oracle server compatibility ● System-versioned tables ● Spider sharding as GA version
  20. 20. Galera in MariaDB 10.3 ● Overall, Galera version 3 works fine in the new 10.3.5 RC version; however, note the following: ● In a Galera cluster, a sequence has to be declared with increment 0: CREATE SEQUENCE `my-seq` INCREMENT=0; ● Galera Cluster is not supported on Spider data nodes in the general case
  21. 21. Galera Development After Version 3
  22. 22. New Features Since 3.0 ● Galera 3.*: intelligent donor selection, cluster crash recovery ● Galera 4.0: Non-Blocking DDL, huge transactions by streaming replication, inconsistency voting protocol ● Galera 4.1: XA transaction support, SELECT FOR UPDATE support
  23. 23. Galera 4
  24. 24. Huge Transaction Support
  25. 25. Huge Transaction Support ● In Galera 3, a transaction is processed only on the master node until commit time ● For large transactions, the write set will be big and hard to handle ● There are means to prevent too-large transactions: ● wsrep_max_ws_size
  26. 26. Huge Transaction Replication (diagram: Node A and Node B joined by Galera Replication; huge trx running)
  27. 27. Huge Transaction Replication (diagram: Node A and Node B joined by Galera Replication; huge trx running)
  28. 28. Huge Transaction Replication (diagram: Node A and Node B joined by Galera Replication; huge trx, WS, commit)
  29. 29. Huge Transaction Replication (diagram: Node A and Node B joined by Galera Replication; huge trx)
  30. 30. Huge Transaction Replication (diagram: Node A and Node B joined by Galera Replication; huge trx; slave queue holding several WS)
  31. 31. Huge Transaction Demo Setup 1. Two nodes 2. Steady load of pure autocommit updates to measure trx throughput 3. A huge table with ~1.5M rows 4. Run an update on the huge table to modify all rows → monitor the trx/sec rate in the cluster when the huge transaction kicks in
  32. 32. Impact of Huge Transaction (chart, axis 0 to 4500; annotations: huge transaction, slave lag, trx in master 24 secs, trx in slave 9 secs)
  33. 33. How to Deal with It? ● Skip flow control (use wsrep_desync) ➔ Flow control will no longer slow down the master ➔ Slave queues will grow (very) long ➔ Only the master can push more transactions ● Relax commit order ➔ Later transactions can go ahead of the huge trx ➔ Database states will be momentarily different
  34. 34. How to Deal with It? ● Skip flow control (use wsrep_desync) ➔ Flow control will no longer slow down the master ➔ Slave queues will grow (very) long ➔ Only the master can push more transactions ● Relax commit order ➔ Later transactions can go ahead of the huge trx ➔ Database states will be momentarily different
  35. 35. Streaming Replication ● Streaming replication is a new technology developed for the Galera Cluster 4.0 release to enable running transactions of unlimited size in the cluster ● Transaction size limits will remain, and the cluster can still reject too-large transactions
  36. 36. Streaming Replication ● A transaction is replicated gradually, in small fragments, during transaction processing ● I.e., before the actual commit, a number of small fragments are replicated ● The size threshold for fragment replication is configurable ● Replicated fragments are applied by slave transactions in all cluster nodes ➔ Fragments hold locks in all nodes and cannot be conflicted later
  37. 37. Streaming Replication (diagram: Node A and Node B joined by Galera Replication; huge trx: update, update, update....)
  38. 38. Streaming Replication (diagram: Node A and Node B joined by Galera Replication; huge trx: update, update, update....; WS)
  39. 39. Streaming Replication (diagram: Node A and Node B joined by Galera Replication; huge trx: update, update, update....; WS; other trx with Update)
  40. 40. Streaming Replication (diagram: Node A and Node B joined by Galera Replication; huge trx, WS, commit)
  41. 41. Streaming Replication (diagram: Node A and Node B joined by Galera Replication; OK)
  42. 42. Configuring Streaming Replication ● wsrep_trx_fragment_unit: the unit used for fragmenting; options are: ● bytes: WS size in bytes ● events: number of binlog events ● rows: number of rows modified ● statements: number of SQL statements issued ● wsrep_trx_fragment_size: ● Threshold size (in units) at which a fragment will be replicated ● 0 = no streaming
  43. 43. Streaming Replication Demo Setup 1. Same scenario as before 2. Configure node1 to fragment the huge transaction in 10K batches ● wsrep_trx_fragment_unit = bytes ● wsrep_trx_fragment_size = 10000 → monitor the trx/sec rate in the cluster while streaming replication progresses
  44. 44. Streaming Replication (chart: trx/sec over time, y-axis 0 to 4500; annotation: streaming replication, 70 secs)
  45. 45. Streaming Replication (chart: trx/sec over time, y-axis 0 to 4500)
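
How a fragment threshold cuts a transaction into pieces can be illustrated with a small model. This is a Python sketch under stated assumptions, not the Galera implementation: it models only the `bytes` and `rows` units, and it assumes a fragment is cut as soon as the accumulated amount reaches the threshold. The function name `fragment_transaction` and its parameters are hypothetical; only the unit names and the "0 = no streaming" rule come from the slides.

```python
def fragment_transaction(row_sizes, fragment_unit="bytes", fragment_size=10000):
    """Group a transaction's row changes into replication fragments.

    row_sizes: write-set size in bytes of each modified row, in order.
    fragment_unit: "bytes" or "rows" (the other units are not modelled here).
    fragment_size: threshold in units; 0 disables streaming, so the whole
    transaction is one write set that is replicated at commit time.
    """
    if fragment_unit not in ("bytes", "rows"):
        raise ValueError("this sketch models only 'bytes' and 'rows'")
    if fragment_size < 0:
        raise ValueError("fragment_size must be >= 0")
    if fragment_size == 0:
        return [list(row_sizes)] if row_sizes else []

    fragments, current, amount = [], [], 0
    for size in row_sizes:
        current.append(size)
        amount += size if fragment_unit == "bytes" else 1
        if amount >= fragment_size:  # assumed cut rule: threshold reached
            fragments.append(current)
            current, amount = [], 0
    if current:
        fragments.append(current)  # remainder goes out with the commit
    return fragments
```

With 250 rows of 100 bytes each and the demo settings (`bytes`, 10000), `[len(f) for f in fragment_transaction([100] * 250)]` gives `[100, 100, 50]`: two full fragments are replicated while the transaction is still running, and the remainder goes out at commit.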
  46. 46. Streaming Replication for Conflict Prevention ● Transaction fragments hold InnoDB row-level locks in all cluster nodes – New write attempts for these rows are blocked by row locks
  47. 47. Streaming Replication (diagram: Node A and Node B joined by Galera Replication; huge trx, WS; Update t1..... on both nodes)
  48. 48. Streaming Replication for Conflict Prevention ● The replicated part of a transaction is protected from conflicts ● The part currently being processed is vulnerable to conflicts ● Streaming replication can be used to run hot-spot transactions effectively – Requires rewriting the hot-spot transaction
  49. 49. Optimizing Inconsistency Shutdown
  50. 50. Optimizing Inconsistency Shutdown ● Current policy for inconsistency: ● On suspected inconsistency, a cluster node does an emergency shutdown ● (However, DDL failures are logged only as warnings) ● Inconsistency injected in one node can cause all other nodes to shut down
  51. 51. Inconsistency Shutdown (diagram: Nodes A, B and C joined by Galera Replication; CREATE TABLE t1 (i INT); t1 exists on all three nodes)
  52. 52. Inconsistency Shutdown (diagram: Nodes A, B and C joined by Galera Replication; SET wsrep_on=OFF; INSERT INTO t1 VALUES (8); row 8 exists in one node's t1 only)
  53. 53. Inconsistency Shutdown (diagram: Nodes A, B and C; SET wsrep_on=ON; DELETE FROM t1; Del 8 replicated to the two other nodes)
  54. 54. Inconsistency Shutdown (diagram: Nodes A, B and C; SET wsrep_on=ON; DELETE FROM t1; Del 8 arrives at the two other nodes)
  55. 55. Inconsistency Shutdown (diagram: Nodes A, B and C; only one t1 remains)
  56. 56. Optimizing Consistency Shutdown ● Galera Cluster 4.0 will optimize shutdowns due to suspected inconsistency ● Nodes will communicate through a consistency voting protocol if inconsistency is observed ● The target is to shut down a minimal number of nodes
  57. 57. Inconsistency Shutdown (diagram: Nodes A, B and C, each with t1; row 8 on one node; SET wsrep_on=ON; DELETE FROM t1; Del 8 on the two other nodes; Consistency Voting)
  58. 58. Consistency Voting ● Node A: success ● Node B: error 'row not found' ● Node C: error 'row not found'
  59. 59. Inconsistency Shutdown (diagram: Nodes A, B and C, each with t1; Consistency Voting)
  60. 60. Galera Consistency Voting Protocol ● With consistency voting, Galera Cluster can mitigate the harm that inconsistency causes to the cluster ● In the best case, only one node has to abort, and the majority can continue operating normally ● However, the database has been inconsistent for an indefinitely long period, and application business logic may have been hurt
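
The voting outcome on the slides (Node A reports success, Nodes B and C report 'row not found') can be reproduced with a simple majority count. This is a hedged Python sketch, not the actual Galera protocol: it assumes each node reports one outcome for the same write set and that nodes outside a strict majority must abort. What the real protocol does when there is no majority is not covered by the slides, so the sketch raises an error in that case. The name `consistency_vote` is hypothetical.

```python
from collections import Counter


def consistency_vote(results):
    """Decide which nodes must abort after applying the same write set.

    results: dict mapping node name -> outcome reported by that node.
    Returns (winning_outcome, nodes_to_abort), where nodes_to_abort lists
    the nodes whose outcome differs from the majority outcome.
    Raises ValueError if no outcome has a strict majority (an assumption of
    this sketch; the slides do not describe that case).
    """
    if not results:
        raise ValueError("no votes")
    counts = Counter(results.values())
    outcome, votes = counts.most_common(1)[0]
    if votes * 2 <= len(results):
        raise ValueError("no strict majority")
    losers = sorted(node for node, r in results.items() if r != outcome)
    return outcome, losers
```

For the example above, `consistency_vote({"A": "success", "B": "row not found", "C": "row not found"})` returns `("row not found", ["A"])`: only Node A, which still held the stray row, has to abort.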
  61. 61. Non Blocking DDL
  62. 62. Non-Blocking DDL ● Current TOI replication blocks the whole cluster for the duration of DDL statement processing ● Galera Cluster 4.0 optimizes DDL replication (TOI) to lock only the affected table
  63. 63. Non-Blocking DDL - NBO (diagram: Node A and Node B; ALTER TABLE t1 issued)
  64. 64. DDL Start Phase (diagram: Node A and Node B; ALTER TABLE t1 issued; ALTER t1; WS seqno)
  65. 65. Non-Blocking DDL - NBO (diagram: Node A and Node B; ALTER t1 on both nodes; WS seqno)
  66. 66. Non-Blocking DDL - NBO (diagram: Node A and Node B; ALTER t1 on both nodes; UPDATE t2)
  67. 67. Non-Blocking DDL - NBO (diagram: Node A and Node B; ALTER t1 on both nodes; UPDATE t1)
  68. 68. Non-Blocking DDL - NBO (diagram: Node A and Node B; ALTER t1 on both nodes; UPDATE t1 on both nodes)
  69. 69. Non-Blocking DDL - NBO (diagram: Node A and Node B; ALTER t1 on both nodes; UPDATE t1, UPDATE t3, UPDATE t1)
  70. 70. DDL End Phase (diagram: Node A and Node B; ALTER t1 on both nodes; DDL end protocol; WS seqno; UPDATE t1 on both nodes)
  71. 71. Non-Blocking DDL ● Replication sets no locks on DDL ● But the affected table is locked in all cluster nodes, in MySQL terms ● InnoDB Online DDL can make some parts of DDL processing completely non-blocking ● Online DDL will be supported case by case ● All other tables are accessible during DDL processing ● The DDL replication method of choice is declared by the variable: wsrep_OSU_method = TOI | RSU | NBO
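
The difference between TOI and NBO, as described on these slides, comes down to which writes have to wait while a DDL statement runs. The following Python sketch is a toy model of that rule only, under the assumption that TOI blocks every write in the cluster and NBO blocks only writes to the table being altered. RSU is left out because the slides do not describe its blocking behaviour. The function name `write_is_blocked` is hypothetical.

```python
def write_is_blocked(osu_method, ddl_table, write_table):
    """Tell whether a write must wait while DDL on ddl_table is running.

    Toy model of the slides: TOI blocks the whole cluster for the duration
    of the DDL; NBO locks only the affected table, so writes to all other
    tables proceed.
    """
    if osu_method == "TOI":
        return True
    if osu_method == "NBO":
        return write_table == ddl_table
    raise ValueError("this sketch models only TOI and NBO")
```

During `ALTER TABLE t1` under NBO, `write_is_blocked("NBO", "t1", "t2")` is `False` while `write_is_blocked("NBO", "t1", "t1")` is `True`, matching the diagrams where updates to t2 and t3 go through and updates to t1 wait.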
  72. 72. Galera 4 Road Map
  73. 73. 4.0 Release Status ● Galera 4 development is complete in the MySQL version ● The MariaDB merge into MariaDB 10.3 is in the testing phase ● The actual MariaDB version for Galera 4 has not been assigned yet (10.3, 10.3.x, 10.4…)
  74. 74. Later Road Map ● XA transaction support ● Locking reads (SELECT FOR UPDATE) ● And more features coming for Galera 5.0
  75. 75. Happy Clustering :-) Thank you for listening!
