Galera explained 3


This presentation aims to give an initial understanding of how MySQL/Galera works, along with some advice.

Published in: Data & Analytics

  1. 1. Galera Explained A Beginner (?) Level Tutorial Marco “The Grinch” Tusa 2015
  2. 2. About me Marco “The Grinch” • Former UN, MySQL AB, Pythian, Percona • 2 kids, 1 wife • Ex-MySQL AB employee • History of Religions; Ski; Snowboard; Scuba Diving;
  3. 3. My Motto Use the Right Tool for the Job
  4. 4. Why you are here • You want to understand what Galera Cluster is • You know what it is, but want to know more • You’d like to grill the speaker with some nasty questions about it (wait for the end!) • You’re bored, with nothing better to do (a special welcome to you!)
  5. 5. Agenda • What is Galera? • How does Galera work? • What is a Node? • Node Status • Primary Component • Quorum • Data Replication (Synch.) • Optimistic & Pessimistic locking • Write-set Cache • State Transfer • Flow Control • Apply DDL • Geographic Distribution • Galera & Binary Logs • What to keep an Eye on • Well-known issues
  6. 6. What is Galera? (Virtually) Synchronous Replication: – True multi-master – No slave lag – No master-slave failover or VIP – Multi-threaded app layers – Automatic node provisioning – Elastic scale (in – out) – Geographic distributed (with segments) – Mix with Async replication Galera Balancer Web traffic
  7. 7. Data Replication (sync) Pros – High Availability: synchronous replication provides highly available clusters and guarantees 24/7 service availability, given that: » No data is lost when nodes crash. » Data replicas remain consistent. » There are no complex, time-consuming failovers. – Improved Performance: replication allows you to execute transactions on all nodes in the cluster in parallel, increasing performance. – Causality across the Cluster: synchronous replication guarantees causality across the whole cluster.
  8. 8. What is Galera NOT? • Not a write-scalable solution • Not great for a high volume of parallel, small requests • Not great for working with Foreign Keys • Not good for sharding data (each node has the entire dataset) (Diagram labels: Galera, Balancer, Web traffic)
  9. 9. Data Replication (sync) (adv) Cons – Does not scale on writes – Uses a two-phase commit or distributed locking, with the capacity formula m = n × o × t (a system with n nodes that must process o operations at a throughput of t transactions per second generates m messages per second) – More nodes, more deadlocks and conflicts
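The capacity formula on this slide can be made concrete with a little arithmetic. The following is a minimal sketch; the function name and the sample numbers (10 operations, 200 transactions per second) are illustrative assumptions, not figures from the presentation.

```python
def replication_messages_per_second(nodes, operations, throughput_tps):
    """Evaluate m = n * o * t from the slide.

    nodes          -- n, number of nodes in the cluster
    operations     -- o, operations to process (assumed here: per transaction)
    throughput_tps -- t, transactions per second
    """
    return nodes * operations * throughput_tps


if __name__ == "__main__":
    # Hypothetical workload: 10 operations per transaction at 200 tps.
    for n in (3, 5, 7):
        m = replication_messages_per_second(n, 10, 200)
        print(f"{n} nodes -> {m} messages/sec")
```

The message count grows linearly with the number of nodes, which is one way to read the "does not scale on writes" bullet.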
  10. 10. Comparing Galera with: MHA – Each Slave has its own position – Data is replicated asynchronously – In case of a crash, ONLY one server can be elected, and in some cases it needs to wait for updates from the binlog Galera – Data is the same at each finalized commit – All Nodes share the same position – Any Node can be written to at any time Master Log_pos=1000 Slave Log_pos=995 Slave Log_pos=993 Slave Log_pos=980 Slave Log_pos=998 Async Replication In Case of Master crash Election by position
  11. 11. Comparing Galera with: Continuent Enterprise – Applications connect to an entry point – All data is distributed asynchronously – A central point keep information on all Galera – Application can connect any node – Data is shared using XA transactions – Status and State is at cluster level Async Replication Canada Italy Entry point (man in the middle)
  12. 12. Galera and HAProxy Two friends working together • Automatic Donor/fail/resurrection identification • Automatic write distribution • Light process scaling on Application server (no single point of failure)
  13. 13. Galera minimal requirements • Transactional Database It requires that the database is transactional. Specifically, that the database can roll back uncommitted changes. • Atomic Changes It requires that replication events change the database atomically. Specifically, that the series of database operations must either all occur, or else nothing occurs. • Global Ordering It requires that replication events are ordered globally. Specifically, that they are applied on all instances in the same order.
  14. 14. How does Galera work? 1 Main components corresponding to code blocks • Database Management System (DBMS) The database server that runs on the individual node. • wsrep API The interface and the responsibilities for the database server and replication provider • Galera Replication Plugin The plugin that enables write-set replication service functionality. • Group Communication plugins The group communication systems available to Galera Cluster.
  15. 15. How does Galera work? 2 Main components (WSREP API) • Is a generic replication plugin interface for databases • Database servers have a state • State refers to the contents of the database • Changes in the database state are represented as a series of atomic changes, or transactions • In a database cluster, all nodes always have the same state
  16. 16. How does Galera work? 3 Main components (Galera Replication Plugin) The Galera Replication Plugin implements the wsrep API. It operates as the wsrep provider. • Certification Layer This layer prepares the write-sets and performs the certification checks on them, ensuring that they can be applied. • Replication Layer This layer manages the replication protocol and provides the total ordering capability. • Group Communication Framework This layer provides a plugin architecture for the various group communication systems that connect to Galera Cluster.
  17. 17. How does Galera work? 4 Main components (Group communication plugin) • Implements a virtual synchrony QoS (Quality of Service) • Implements its own runtime-configurable temporal flow control. Flow control keeps nodes synchronized to within a fraction of a second • Provides a total ordering of messages from multiple sources. It uses this to generate Global Transaction IDs in a multi-master cluster • Is a symmetric undirected graph. All database nodes connect to each other over a TCP connection
  18. 18. What is a Node? 1 • Standard MySQL Replication Master Slave Slave • Galera MySQL Replication Node Node Node 9cba28fa-a8be-11e4-8f41-9f963e1dbf4f
  19. 19. What is a Node? 2 • Standard MySQL Replication – Each MySQL instance is independent – Data can be different per node (schema, engine, content) • Galera MySQL Replication – Data is the center – Nodes connect and share the same data – Nodes cannot (should not) be different, and must have the same STATE
  20. 20. What is a Node? 3 • Data is the center – Data has a UUID = • 9cba28fa-a8be-11e4-8f41-9f963e1dbf4f – Data has a Position (seq number) • wsrep_last_committed | 1398 | • Position is the same in ANY Synchronized node – Node has a UUID • 8186a31a-a8bf-11e4-9d19-6bd85d36493b A Node belongs to a cluster/Data and NOT vice versa.
  21. 21. What is a Node 4 1. A connecting node talks to one node in the cluster 2. A DONOR is elected 3. The Donor shares Status and Starts Synchronization uuid: 9cba28fa-a8be-11e4-8f41-9f963e1dbf4f seqno: 1950 New cluster view: global state: 9cba28fa-a8be-11e4-8f41-9f963e1dbf4f:2037, view# 9: Primary, number of nodes: 5, my index: 2, protocol version 3
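The "global state" shown in the log line above pairs the data UUID with a position (seqno). As a small illustration of how the two values relate, here is a sketch that parses such a string and measures how far a node is behind; the helper names are hypothetical and are not part of Galera.

```python
import uuid


def parse_state_id(text):
    """Split a 'UUID:seqno' global state string into (UUID, int)."""
    uuid_part, _, seqno_part = text.rpartition(":")
    return uuid.UUID(uuid_part), int(seqno_part)


def seqno_gap(node_state, cluster_state):
    """Return how many write-sets the node is behind the cluster.

    Raises ValueError when the UUIDs differ, because positions are only
    comparable within the same dataset.
    """
    node_uuid, node_seqno = parse_state_id(node_state)
    cluster_uuid, cluster_seqno = parse_state_id(cluster_state)
    if node_uuid != cluster_uuid:
        raise ValueError("different data UUID: positions are not comparable")
    return cluster_seqno - node_seqno


if __name__ == "__main__":
    # Values taken from the slide: the joining node is at 1950, the cluster at 2037.
    data = "9cba28fa-a8be-11e4-8f41-9f963e1dbf4f"
    print(seqno_gap(f"{data}:1950", f"{data}:2037"))
```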
  22. 22. Segments • A segment is a logical grouping of nodes. • Replication between segments is optimized • Traffic and messaging are reduced • In case of SST, the donor is chosen by proximity
  23. 23. Node Status 1. Node connects and sends its status 2. Cluster provides a DONOR 3. Status (data) exchange starts (node is Joiner) 4. Donor ends transmission, applies the “delta” and rejoins 5. Joiner -> Joined: checks seq_num and becomes Synced
  24. 24. Primary Component Under normal operations, the Primary Component is the whole cluster. When cluster partitioning occurs, Galera Cluster invokes a special quorum algorithm to select one component as the Primary Component. This guarantees that there is never more than one Primary Component in the cluster. Primary component
  25. 25. Primary Component 2 In case of a network issue, the cluster might be split. If the pc.weight and segments are set up correctly, the nodes in the Non-Primary state will attempt to rejoin the cluster. This is an automatic recovery that may trigger: • IST • SST Primary Non-Primary
  26. 26. Primary Component 3 When the cluster is NOT able to manage WHO is the primary correctly, a so-called “split brain” issue may occur. Split Brain: • Cannot be automatically recovered from • Puts all nodes in READ ONLY mode Non-Primary Non-Primary Split Brain
  27. 27. Quorum Quorum can be managed using: • pc.weight • Segments Segments do not modify the quorum calculation but are useful to logically group servers. • Zone 1: Segment=1, weight = 2 • Zone 2: Segment=2, weight = 1
  28. 28. Quorum (adv) •
  29. 29. Quorum (adv) Galera organizes the presence/modification of nodes in VIEWS: WSREP: view(view_id(PRIM,28b4b776,78) memb { 28b4b776,1 79cc1886,1 8637105e,2 f218f33d,2} joined {} left {} partitioned { b9aabaa5,1 <--- node is shutting down}) 78 is the VIEW number; PRIM defines the view as the Primary Component; the number after each node ID (1 or 2) is the segment identifier.
  30. 30. Quorum (adv) Assuming 2 Segments with 3 nodes each:
      node  seg  weight  View 1 active  View 2 active  View 3 active
      n1    1    1       1              0              1
      n2    1    1       1              0              1
      n3    1    1       1              0              1
      n4    2    1       1              1              0
      n5    2    1       1              1              0
      n6    2    1       1              1              0
      View 2: Segment 2 Quorum 0. View 3: Segment 1 Quorum 0. In this case, in VIEW 2|3 we will not have a quorum and the Segments will become NON-PRIMARY.
  31. 31. Quorum (adv) Assuming 2 Segments with 3 nodes each, plus an arbitrator (n7) in a third segment:
      node  seg  weight  View 1 active  View 2 active  View 3 active
      n1    1    1       1              1              0
      n2    1    1       1              1              0
      n3    1    1       1              1              0
      n4    2    1       1              0              1
      n5    2    1       1              0              1
      n6    2    1       1              0              1
      n7    3    1       1              1              1
      View 2: Segment 1 Quorum 1. View 3: Segment 2 Quorum 1. Using an arbitrator we can have the quorum. BUT what if both segments can reach the arbitrator but not the other segment? SPLIT BRAIN!!!
  32. 32. Quorum (adv) Assuming 2 Segments with 3 nodes each:
      node  seg  weight  View 1 active  View 2 (1) active  View 2 (2) active
      n1    1    4       1/0            1                  0
      n2    1    3       1              1                  0
      n3    1    1       1              1                  0
      n4    2    5       1              0                  1
      n5    2    1       1              0                  1
      n6    2    1       1              0                  1
      View 2 (1): Segment 1 Quorum 1. View 2 (2): Segment 2 Quorum 1. In this case, in VIEW 2 (1|2) we will have a quorum • Segment 1 always wins and will have the quorum • Segment 2 will have the quorum in case of a planned switch, otherwise NON-PRIMARY
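The three quorum scenarios above can be checked numerically. The sketch below assumes the weighted quorum rule as commonly described for Galera Cluster: a component keeps quorum when (sum of weights in the last Primary Component minus the weights of nodes that left gracefully) / 2 is strictly less than the sum of weights of the nodes still reachable. That rule is not spelled out in this transcript, so treat it as an assumption; the function and variable names are illustrative.

```python
def has_quorum(weights, reachable, left_gracefully=()):
    """Weighted quorum check (assumed rule, see lead-in).

    weights         -- dict: node name -> pc.weight, for the last Primary Component
    reachable       -- nodes that can still see each other in the new view
    left_gracefully -- nodes that announced their departure (planned shutdown)
    """
    total = sum(weights.values())
    left = sum(weights[n] for n in left_gracefully)
    present = sum(weights[n] for n in reachable)
    return (total - left) / 2 < present


SEG1 = ["n1", "n2", "n3"]
SEG2 = ["n4", "n5", "n6"]

if __name__ == "__main__":
    equal = {n: 1 for n in SEG1 + SEG2}
    print("equal weights, segment 2 alone:", has_quorum(equal, SEG2))

    with_arbitrator = dict(equal, n7=1)
    print("arbitrator, segment 1 + n7:", has_quorum(with_arbitrator, SEG1 + ["n7"]))

    weighted = {"n1": 4, "n2": 3, "n3": 1, "n4": 5, "n5": 1, "n6": 1}
    print("weighted, segment 1 alone:", has_quorum(weighted, SEG1))
    print("weighted, segment 2 alone (crash):", has_quorum(weighted, SEG2))
    print("weighted, segment 2 alone (planned):",
          has_quorum(weighted, SEG2, left_gracefully=SEG1))
```

Under this assumed rule the results agree with the slides: equal weights leave a lone segment without quorum, the arbitrator tips the balance, and with the 4/3/1 versus 5/1/1 weights Segment 2 keeps quorum only when Segment 1 leaves in a planned way.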
  33. 33. Quorum Summary • Number of Nodes, Even/Odd, is not really relevant • Quorum weight is relevant • Remember the View quorum calculation • A witness node will NOT guarantee split-brain prevention; a real node should • HAProxy can help (a lot) to manage Segments • Plan your cluster carefully, and check the View status before maintenance
  34. 34. Data Replication (sync) • Replication happens at commit time, before the commit is finalized • Transaction changes are ordered by PK and collected in a write-set • The write-set is certified on each node (including the originator) for apply/reject • On failure, the originator rolls back, while the others discard the write-set
  35. 35. Data Replication (sync)(adv) Local Certification issues – Each re-ordered transaction (deterministic) has a Seq_no – Galera evaluates all transactions in the queue from the last successfully committed one – If another write-set in the queue conflicts, the write-set under evaluation is discarded and rolled back on the originator – The wsrep_local_cert_failures counter is incremented only on the originator (Figure: cluster commit queue with a conflict; write-set 5 is discarded)
  36. 36. Data Replication (sync) (adv) Local certification issues (2) – A local transaction is started but not yet committed – An incoming write-set is applied – A lock conflict with the local open transaction is raised – The incoming transaction (write-set) always wins; the aborted local transaction is counted in wsrep_local_bf_aborts
  37. 37. Data Replication (sync) • Certification takes place on write-sets • Each write-set contains references for each affected key: – Primary – Unique – Foreign key • Keys are also maintained in a local certification index for multi-master conflict resolution
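The certification idea described in the last few slides can be illustrated with a toy model. This is a simplified sketch and not Galera's implementation: it assumes each write-set carries its global seqno, the last seqno its originator had seen when the transaction started, and the set of keys it touches; all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSet:
    seqno: int         # position in the cluster-wide total order
    last_seen: int     # last seqno the originator had applied at transaction start
    keys: frozenset    # (table, key) pairs touched by the transaction


def certify(write_set, certified):
    """Deterministic test: fail if an already certified write-set that the
    originator had not seen yet touches one of the same keys."""
    for earlier in certified:
        if earlier.seqno > write_set.last_seen and earlier.keys & write_set.keys:
            return False
    return True


def run_queue(queue):
    """Process write-sets in seqno order; return (certified, rejected)."""
    certified, rejected = [], []
    for ws in sorted(queue, key=lambda w: w.seqno):
        (certified if certify(ws, certified) else rejected).append(ws)
    return certified, rejected


if __name__ == "__main__":
    row_10 = frozenset({("t1", 10)})
    row_11 = frozenset({("t1", 11)})
    queue = [
        WriteSet(seqno=4, last_seen=3, keys=row_10),
        WriteSet(seqno=5, last_seen=3, keys=row_10),  # conflicts with seqno 4
        WriteSet(seqno=6, last_seen=3, keys=row_11),
    ]
    ok, failed = run_queue(queue)
    print("certified:", [w.seqno for w in ok])
    print("rejected:", [w.seqno for w in failed])
```

Because every node processes the same write-sets in the same order with the same rule, each node reaches the same verdict without any extra round trip.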
  38. 38. Optimistic & Pessimistic locking 1. The originator has all internal locks 2. Originator ignores other nodes 3. On Commit, it optimistically sends the modification 4. The write-set is reordered and goes through a deterministic certification test 5. In the presence of a conflict, the last commit loses
  39. 39. Write-set Cache GCache A library that provides a transparent on-disk memory buffer cache. Its purpose is to allow an (almost) arbitrarily big action cache without RAM consumption. Permanent Ring-Buffer File Here, write-sets are pre-allocated to disk during cache initialization.
  40. 40. State Transfer 1 The process of replicating data from the cluster to an individual node, bringing that node into sync with the cluster. AKA Provisioning. Two ways of doing it: • Incremental State Transfers (IST) Where only the missing transactions transfer. • State Snapshot Transfers (SST) Where a snapshot of the entire node state transfers.
  41. 41. State Transfer 2 State transfers always require a: – Donor – Joiner A Joiner is the node that requests the ST: Member 0.1 (node3) requested state transfer from 'node5' A Donor is the node providing the data; the donor can be blocked from serving incoming queries.
  42. 42. State Transfer 3 IST Incremental State Transfer transfers the missing delta (Δ) between the Joiner and the Donor. • State UUID must be the same as that of the group • All missing write-sets are available in the donor’s write-set cache • Much faster and non-blocking operation on the Donor • IST has a well-known interval: WSREP: IST request: 9cba28fa-a8be-11e4-8f41-9f963e1dbf4f:77030-85722|tcp:// • IST picks the donor that can provide the full WS range (also, if defined, the donor can change)
  43. 43. State Transfer 4 SST State Snapshot Transfer is a full data copy from one node to another. This may happen because: • A new Node joins the cluster • No Donor has enough WS data in its GCache Two approaches: • Logical (mysqldump; export) • Physical (rsync; xtrabackup)
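The IST-versus-SST conditions listed on the State Transfer slides can be summarized as a small decision function. This is a sketch under simplifying assumptions: each candidate donor is modeled only by the lowest seqno still held in its GCache, and the function name, argument names, and node names are hypothetical.

```python
def choose_state_transfer(joiner_uuid, joiner_seqno, cluster_uuid, donor_lowest_cached):
    """Return ("IST", donor) when some donor can serve the missing range,
    otherwise ("SST", None).

    donor_lowest_cached -- dict: donor name -> lowest seqno still in its GCache
    """
    # A different state UUID means a different dataset: only a full copy works.
    if joiner_uuid != cluster_uuid:
        return ("SST", None)

    first_missing = joiner_seqno + 1
    for donor, lowest in donor_lowest_cached.items():
        # The donor must still cache every write-set from first_missing onward.
        if lowest <= first_missing:
            return ("IST", donor)
    return ("SST", None)


if __name__ == "__main__":
    data = "9cba28fa-a8be-11e4-8f41-9f963e1dbf4f"
    donors = {"node2": 80000, "node5": 70000}  # hypothetical cache contents

    # Joiner last applied 77029, so it needs 77030 onward (as in the IST request log).
    print(choose_state_transfer(data, 77029, data, donors))
    # Joiner is too far behind for any cache.
    print(choose_state_transfer(data, 1000, data, donors))
```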
  44. 44. Flow Control Galera Cluster manages the replication process using a feedback mechanism named FLOW CONTROL. • Allows any node to pause and resume replication • Prevents any node from lagging too far behind Modes – No Flow Control – Write-set Caching – Catching Up – Cluster Sync
  45. 45. How Flow Control Works 1. Galera Cluster synchronously replicates write-sets on a cluster-wide ordering. 2. Transactions received but not yet applied and committed are placed in the receive queue (wsrep_local_recv_queue) 3. When the size of the queue exceeds the Flow Control Limit, the node will send a FC pause. 4. When the queue is manageable again (below the limit), the node removes the pause.
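The pause/resume steps above can be sketched as a tiny simulation. It is a simplified model, not Galera's code: the parameters are named after the gcs.fc_limit and gcs.fc_factor options that appear later in the deck, and the resume condition (queue below fc_limit × fc_factor) is an assumption made for illustration.

```python
def simulate_flow_control(queue_lengths, fc_limit=16, fc_factor=1.0):
    """Return the (step, event) list produced by a series of receive-queue sizes.

    A pause is sent when the queue exceeds fc_limit; the pause is removed
    once the queue drops below fc_limit * fc_factor (assumed rule).
    """
    paused = False
    events = []
    for step, length in enumerate(queue_lengths):
        if not paused and length > fc_limit:
            paused = True
            events.append((step, "FC_PAUSE"))
        elif paused and length < fc_limit * fc_factor:
            paused = False
            events.append((step, "FC_RESUME"))
    return events


if __name__ == "__main__":
    samples = [5, 17, 20, 16, 15, 3]  # hypothetical wsrep_local_recv_queue sizes
    print(simulate_flow_control(samples))
    print(simulate_flow_control(samples, fc_factor=0.5))
```

With a lower factor the node waits until the queue has drained further before resuming, which trades longer pauses for fewer pause/resume cycles.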
  46. 46. Flow Control States 1 Write-set Caching • Happens when the node is a: – Joiner – Donor • Write-set will be locally cached and applied later
  47. 47. Flow Control States 2 Catching up • Happens when the node is: – Joined • Nodes in this state can apply write-sets but are still making up the gap • Cluster rate replication is tuned to the Joined Node’s capacity • Applying a write-set is faster than executing a transaction • On empty Buffer Pool operations will be slower
  48. 48. Flow Control States 3 Cluster Sync • Happens when the node is: – Synced • By far the most common state • Node enters in FC to limit the receiving queue • Can be tuned with gcs.fc_limit, gcs.fc_factor
  49. 49. Flow Control How small should my fc_limit be? • Small enough to keep low the delay any node in the cluster might have when applying cluster transactions • Small enough to keep the certification interval small, which minimizes replication conflicts on a cluster where writes happen on all nodes – A small fc_limit keeps the certification index smaller in memory
  50. 50. Manage Flow Control What to check? • wsrep_flow_control_sent; wsrep_flow_control_recv; • wsrep_flow_control_paused; wsrep_flow_control_paused_ns What can be tuned? • Replication Rate (expert feature, do not touch) • Flow control – gcs.fc_limit (default 16, way too low for every real production) – gcs.fc_factor (default 1, means resume replication as soon as we go below fc_limit)
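One way to use the counters listed above is to sample wsrep_flow_control_paused_ns twice and turn the difference into a fraction of time spent paused. The sketch below assumes that counter is a cumulative number of nanoseconds spent in the paused state; how the samples are collected (for example with SHOW GLOBAL STATUS) is left out, and the function name is hypothetical.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def paused_fraction(paused_ns_before, paused_ns_after, interval_seconds):
    """Fraction of the sampling interval during which replication was paused."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_ns = paused_ns_after - paused_ns_before
    return delta_ns / (interval_seconds * NANOSECONDS_PER_SECOND)


if __name__ == "__main__":
    # Hypothetical samples taken 60 seconds apart.
    fraction = paused_fraction(12_000_000_000, 15_000_000_000, 60)
    print(f"paused {fraction:.1%} of the time")
```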
  51. 51. Flow Control Badly tuned flow control (?)
  52. 52. Apply DDL • Any DDL is a non-transactional operation • Modification raises meta-lock/Server/Schema In a Galera Cluster, you can choose to run DDL in • TOI Total Order Isolation • RSU Rolling Schema Upgrade • pt-online-schema-change (recommended for large tables)
  53. 53. Apply DDL TOI When using Total Order Isolation, the cluster will work as a single server until the end of the process on ALL nodes. Cluster will stay locked: • Server Level For CREATE SCHEMA, GRANT and similar queries, where the cluster cannot apply concurrently any other transactions. • Schema Level For CREATE TABLE and similar queries, where the cluster cannot apply concurrently any transactions that access the schema. • Table Level For ALTER TABLE and similar queries, where the cluster cannot apply concurrently any other transactions that access the table.
  54. 54. Apply DDL RSU When using Rolling Schema Upgrade, each modification will apply ONLY on the node where the command is executed. • Different structure between Nodes • Data inconsistency • Dangerous use of WSREP_ON ( one-command-in-mysqlgalera.html) In short, this is potentially unsafe.
  55. 55. Apply DDL PT-OSC When using pt-online-schema-change, the cluster will block the nodes for a very short period of time: at the start and at the end of the process. • Data is replicated as a normal transaction • Nodes maintain consistency • No locking during the copy • Is recoverable
  56. 56. Geographic distribution Galera Cluster is well suited to cover a geographically distributed scenario. • Use a combination of Asynchronous and Synchronous replication • Use Master/Slave settings inside Galera • Use of Segments
  57. 57. Galera and Binary logs Not needed? For a long while I stated so, but today I am older and wiser. • Useful to identify which transaction corresponds to a seq_no • Required when using a slave • Must be enabled on at least 2 Nodes when using a slave • Still an option in case of DR (trust me, I saw it!!)
  58. 58. Galera and Binary logs Understand the differences between SQL_LOG_BIN & WSREP_ON • Disabling SQL_LOG_BIN prevents ANY DML from being replicated (NOTE: in standard MySQL it excludes both DML and DDL) • Disabling WSREP_ON prevents ANY DML & DDL from being replicated • Using GLOBAL in this context will cause data inconsistency 99% of the time
  59. 59. What to keep an eye on Like any complex system, Galera Cluster requires your attention in many areas. The most critical are: • Certification • Network performance • Proper schema design (PK/UK/FK) • Number of nodes (write distribution, not write scaling) • Correctly planned schema modifications
  60. 60. Well known Issues • Foreign Keys • Small (very small) transactions and highly parallel committing • WSREP_ON (global) == SQL_LOG_BIN=0 • Master/Slave is ok, but be careful when using filters • Locks/Deadlocks can become more frequent • Network support (documentation)
  61. 61. What Next? Galera Operations: • Installation, simple and distributed • Add/remove a node • Data consistency • Debug issues using the log • Data export/Load • Backups • Monitoring
  62. 62. Q & A
  63. 63. Thank you To contact us 1-877-PYTHIAN To follow us Group/163902527671 @pythian To contact Me To follow me @marcotusa