MySQL Group Replication



MySQL Group Replication is a new 'synchronous', multi-master, auto-everything replication plugin for MySQL introduced with MySQL 5.7. It is the perfect tool for small 3-20 machine MySQL clusters to gain high availability and high performance. It stands for high availability because the failure of a replica does not stop the cluster. Failed nodes can rejoin the cluster and new nodes can be added in a fully automatic way - no DBA intervention required. It stands for high performance because multiple masters process writes, not just one as with MySQL Replication. Running applications on it is simple: no read-write splitting, no fiddling with eventual consistency and stale data. The cluster offers strong consistency (generalized snapshot isolation).

It is based on Group Communication principles, hence the name.



  1. 1. MySQL Group Replication: 'Synchronous', multi-master, auto-everything Ulf Wendel, MySQL/Oracle
  2. 2. The speaker says... MySQL 5.7 introduces a new kind of replication: MySQL Group Replication. At the time of writing (10/2014) MySQL Group Replication is available as a preview release. In common user terms it features (virtually) synchronous, multi-master, auto-everything replication.
  3. 3. Proper wording... An eager update everywhere system based on the database state machine approach atop of a group communication system offering virtual synchrony and reliable total ordering messaging. MySQL Group Replication offers generalized snapshot isolation.
  4. 4. The speaker says... And here is a more technical description....
  5. 5. WHAT ?! Hmm, how does it compare?
  6. 6. The speaker says... The technical description given for MySQL Group Replication may sound confusing because it has elements from distributed systems and database systems theory. Between around 1996 and 2006 the two research communities jointly formulated the replication method implemented by MySQL Group Replication. As a web developer or MySQL DBA you are not expected to know distributed systems theory inside out. Yet to understand the properties of MySQL Group Replication and to get the most out of it, we'll have to touch some of the concepts. Let's see first how the new stuff compares to the existing.
  7. 7. Goals of distributed databases Availability • Cluster as a whole unaffected by loss of nodes Scalability • Geographic distribution • Scale size in terms of users and data • Database specific: read and/or write load Distribution Transparency • Access, Location, Migration, Relocation (while in use) • Replication • Concurrency, Failure
  8. 8. The speaker says... MySQL Group Replication is about building a distributed database. To catalog it and compare it with the existing MySQL solutions in this area, we can ask what the goals of distributed databases are. The goals lead to criteria that are used to give a first, brief overview. Goal: a distributed database cluster strives for maximum availability and scalability while maintaining distribution transparency. Criteria: availability, scalability, distribution transparency.
  9. 9. MySQL clustering cheat sheet MySQL Replication MySQL Cluster MySQL Fabric Availability Primary = SpoF, no auto failover Shared nothing, auto failover SpoF monitored, auto failover Scalability Reads Partial replication, node limit Partial replication, no node limit Scale on WAN Asynchronous Synchronous (WAN option) Asynchronous (depends) Distribution Transparency R/W splitting SQL: yes (low level: no) Special clients No distributed queries
  10. 10. The speaker says... Already today MySQL has three solutions to build a distributed MySQL cluster: MySQL Replication, MySQL Cluster and MySQL Fabric. Each system has different optimizations, none can achieve all the goals of a distributed cluster at once. Some goals are orthogonal. Take MySQL Cluster. MySQL Cluster is a shared nothing system. Data storage is redundant, nodes fail independently. Transparent sharding (partial replication) ensures read and write scalability until the maximum number of nodes is reached. Great for clients: any SQL node runs any SQL, synchronous updates become visible immediately everywhere. But, it won't scale on slow WAN connections.
  11. 11. How Group Replication fits in Repl. Cluster Group Repl. Fabric Availability Shared nothing, auto failover Shared nothing, auto failover/join Scalability Partial replication, node limit Full replication, read and some write scalability Scale on WAN Synchronous (WAN option) (Virtually) Synchronous Distribution Transparenc y SQL: yes (low level: no) All nodes run all SQL
  12. 12. The speaker says... MySQL Group Replication has many of the desirable properties of MySQL Cluster. It's strong on availability and client friendly due to the distribution transparency. No complex client or application logic is required to use the cluster. So, how do the two differ? Unlike MySQL Cluster, MySQL Group Replication supports the InnoDB storage engine. InnoDB is the dominant storage engine for web applications. This makes MySQL Group Replication a very attractive choice for small clusters (3-7 nodes) running Drupal, WordPress, … in LAN settings! Also, Group Replication is not synchronous in a technical way. For practical matters it is.
  13. 13. Group Replication (vs. Cluster) Availability • Nodes fail independently • Cluster continues operation in case of node failures Scalability • Geographic distribution: n/a, needs fast messaging • All nodes accept writes, mild write scalability • All nodes accept reads, full read scalability Distribution Transparency • Full replication: all nodes have all the data • Fail stop model: developer free'd to worry about consistency
  14. 14. The speaker says... Another major difference between MySQL Cluster and MySQL Group Replication is the use of partial replication versus full replication. MySQL Cluster has transparent sharding (partial replication) built-in. On the inside, on the level of so-called MySQL Cluster data nodes, not every node has all the data. Writes don't add work to all nodes of the cluster but only to a subset of them. Partial replication is the only known solution to write scalability. With MySQL Group Replication all nodes have all the data. Writes can be executed concurrently on different nodes but each write must be coordinated with every other node. … time to dig deeper >:).
  15. 15. Eager update everywhere... ?!
  16. 16. A developers categorization... Where are transactions run? Primary Copy Update Everywhere When does synchronizatio n happen? Eager (MySQL semi-synch Replication) MySQL Cluster MySQL Group 3rd party: Galera Lazy MySQL Replication/Fabric 3rd party: Tungsten MySQL Cluster Replication
  17. 17. The speaker says... I've described MySQL Group Replication as „an eager update everywhere system". The term comes from a categorization of different database replication systems by two questions: - where can a transaction be run? - when are transactions synchronized between nodes? The answers to the questions tell a developer which challenges to expect. The answers determine which additional tasks an application must handle when it's run on a cluster instead of a single server.
  18. 18. Lazy causes work... Set price = 1.23 Node price = 1.23 Node Node Node price = 1.00 price = 1.23 price = 0.98
  19. 19. The speaker says... When you try to scale an application by running it on a lazy (asynchronous) replication cluster instead of a single server you will soon have users complaining about outdated and „incorrect" data. Depending on which node the application connects to after a write, a user may or may not see his own updates. This can neither happen on a single server system nor on an eager (synchronous) replication cluster. Lazy replication causes extra work for the developer. BTW, have a look at PECL/mysqlnd_ms. It abstracts the problem of consistency for you. Things like read-your-writes boil down to a single function call.
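PECL/mysqlnd_ms does this consistency bookkeeping for you in PHP. Purely to illustrate the idea (this is not mysqlnd_ms code; all names and the lag-window heuristic are made up), a minimal Python sketch of session-level read-your-writes routing on a lazy cluster could look like this:

```python
import time

class SessionRouter:
    """Toy read-your-writes router for a lazy (asynchronous) cluster:
    after a write, pin reads to the primary until the replicas are
    assumed to have caught up."""

    def __init__(self, primary, replicas, lag_window=1.0):
        self.primary = primary
        self.replicas = replicas
        self.lag_window = lag_window        # assumed maximum replication lag (s)
        self.last_write = float("-inf")     # no write seen yet

    def pick_for_write(self):
        self.last_write = time.monotonic()
        return self.primary

    def pick_for_read(self):
        # Inside the lag window, a replica may still serve stale data,
        # so read your own writes from the primary.
        if time.monotonic() - self.last_write < self.lag_window:
            return self.primary
        return self.replicas[0]

router = SessionRouter("primary", ["replica1", "replica2"])
assert router.pick_for_read() == "replica1"   # no write yet: a replica is fine
assert router.pick_for_write() == "primary"
assert router.pick_for_read() == "primary"    # read-your-writes after a write
```

On an eager cluster this whole class collapses to "pick any node".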
  20. 20. Primary Copy causes work... Primary Write Copy Copy Copy Read Read Read Read
  21. 21. The speaker says... Judging from the developer perspective only, primary copy is an undesired replication solution. In a primary copy system only one node accepts writes. The other nodes copy the updates performed on the primary. Because of the read-write splitting, the replication system does not need to coordinate conflicting operations. Great for the replication system author, bad for the developer. As a developer you must ensure that all write operations are directed to the primary node... Again, have a look at PECL/mysqlnd_ms. MySQL Replication follows this approach. Worse, MySQL Replication is a lazy primary copy system.
  22. 22. Love: Eager Update Everywhere Node Write Read price = 1.23 price = 1.23 price = 1.23 Node Node Write Read Write Read
  23. 23. The speaker says... From a developer perspective an eager update anywhere system, like MySQL Group Replication, is indistinguishable from a single node. The only extra work it brings you is load balancing, but that is the case with any cluster. An eager update anywhere cluster improves distribution transparency and removes the risk of reading stale data. Transparency and flexibility is improved because any transaction can be directed to any replica. (Sometimes synchronization happens as part of the commit, thus strong consistency can be achieved.) Fault tolerance is better than with Primary Copy. There is no single point of failure – a single primary - that can cause a total outage of the cluster. Nodes may fail individually without bringing the cluster down immediately.
  24. 24. HOW? Distributed + DB? Database state machine?
  25. 25. The speaker says... In the mid-1990s two observations made the database and distributed systems theory communities wonder if they could develop a joint replication approach. First, Gray et al. (database community) showed that common two-phase locking has an expected deadlock rate that grows with the third power of the number of replicas. Second, Schiper and Raynal noted that transactions have common properties with group communication principles (distributed systems) such as ordering, agreement/'all-or-nothing' and even durability.
  26. 26. Three building blocks State machine replication • … trivial to understand Atomic Broadcast • … database meets distributed systems community • … OMG, how easy state machine replication is to implement! Deferred Update Database Replication • … database meets distributed systems community • … how we gain high availability and high performance • … what those MySQL Replication team blogs talk about ;-)
  27. 27. The speaker says... Finally, in 1999 Pedone, Guerraoui and Schiper published the paper „The Database State Machine Approach". The paper combines two well known building blocks for replication with a messaging primitive common in the distributed systems world: atomic broadcast. MySQL Group Replication is slightly different from this 1999 version, more following a later refinement from 2005 plus a bit of additional ease-of-use. However, by the end of this chapter you will have learned how MySQL Cluster and MySQL Group Replication differ beyond InnoDB support and built-in sharding.
  28. 28. State machine replication Input Set A = 1 Replica Replica Replica Output A = 1 A = 1 A = 1 Output Output
  29. 29. The speaker says... The first building block is trivial: a state machine. A state machine takes some input and produces some output. Assume your state machines are deterministic. Then, if you have a set of replicas all running the same state machine and they all get the same input, they all will produce the same output. On an aside: state machine replication is also known as active replication. Active means that every replica executes all the operations, so active replication adds compute load to every replica. With passive replication, also called primary-backup replication, one replica (the primary) executes the operations and forwards the results to the others. Passive replication is limited by the primary's availability and possibly by network bandwidth.
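The core of state machine replication fits in a few lines. A Python toy (illustration only, not MySQL code): three deterministic replicas, fed the same input in the same order, end up in the same state:

```python
class Replica:
    """A deterministic state machine: same input sequence -> same state."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        name, value = op          # ("A", 1) stands for "Set A = 1"
        self.state[name] = value

replicas = [Replica() for _ in range(3)]
inputs = [("A", 1), ("B", 2)]

# Active replication: every replica executes every operation itself.
for op in inputs:
    for replica in replicas:
        replica.apply(op)

assert all(r.state == {"A": 1, "B": 2} for r in replicas)
```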
  30. 30. Requirement: Agreement Input Set A = 1 Replica Replica Replica Output A = 1 A = NULL
  31. 31. The speaker says... Here's more trivia about the state machine replication approach. There are two requirements for it to work. Quite obviously, every replica has to receive all input to come to the same output. And the precondition for receiving input is that the replica is still alive. In academic words the requirement is: agreement. Every non-faulty replica receives every request. Non-faulty replicas must agree on the input.
  32. 32. Requirement: Order 1) Set A = 1 2) Set B = 1 3) Set B = A *2 Input: 1, 2, 3 Input: 1, 3, 2 Input: 3, 1, 2 Replica Replica Replica A = 1 A = 1 B = 2 B = 1 A = 1 B = 1
  33. 33. The speaker says... The second trivial requirement for state machine replication is ordering. To produce the same output any two state machines must execute the very same input – including the ordering of input operations. The academic wording goes: if a replica processes request r1 before r2, then no replica processes request r2 before r1. Note that if operations commute, some reordering may still lead to correct output. The sequence A = 1, B = 1, B = A * 2 and the sequence B = 1, A = 1, B = A * 2 produce the same output. (Unrelated here: the database scaling talk touches the fancy commutative replicated data types Riak offers... hot!)
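The slide's example is easy to replay in code. A Python toy (illustration only) showing that the order of input matters, except where operations commute:

```python
def run(ops):
    """Execute a list of operations against a fresh state, return the state."""
    state = {}
    for op in ops:
        op(state)
    return state

set_a = lambda s: s.__setitem__("A", 1)            # Set A = 1
set_b = lambda s: s.__setitem__("B", 1)            # Set B = 1
derive = lambda s: s.__setitem__("B", s["A"] * 2)  # Set B = A * 2

# Same operations, different order, different output:
assert run([set_a, set_b, derive]) == {"A": 1, "B": 2}
assert run([set_a, derive, set_b]) == {"A": 1, "B": 1}

# set_a and set_b commute, so swapping only those two is harmless:
assert run([set_b, set_a, derive]) == {"A": 1, "B": 2}
```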
  34. 34. Atomic Broadcast Distributed systems messaging abstraction • Meets all replicated state machine requirements Agreement • If a site delivers a message m then every site delivers m Order • No two sites deliver any two messages in different orders Termination • If a site broadcasts message m and does not fail, then every site eventually delivers m • We need this in asynchronous environments
  35. 35. The speaker says... State machine replication is the first building block for understanding the database state machine approach. The second building block is a messaging abstraction from the distributed systems world called atomic broadcast. Atomic broadcast provides all the properties required for state machine replication: agreement and ordering. It adds a property needed for communication in an asynchronous system, such as a system communicating via network messages: termination. All in all, this greatly simplifies state machine replication and contributes to a simple, layered design.
  36. 36. Delivery, durability, group Client Replica Replica Replica Mr. X Replica Replica Replica Group Send first, possibly delivered second
  37. 37. The speaker says... The atomic broadcast properties given are literally copied from the original paper describing the database state machine replication approach. There are two things in it not explained yet. First, atomic broadcast defines properties in terms of message delivery. The delivery property not only ensures total ordering despite slow transport but also covers message loss (MySQL desires uniform agreement here, something better than Corosync) and even the crash and recovery of processors (durability)! A recovering processor must first deliver outstanding messages before it continues. Second, note that atomic broadcast introduces the notion of a group. Only (correct) members of a group can exchange messages.
  38. 38. Deferred Update: the best? Client Client Replica Replica Replica Replica Replica Replica Client Request Server Coordination Execution Agreement Client Response
  39. 39. The speaker says... We are almost there. The third building block of the database state machine replication is deferred update database replication. The slide shows a generic functional model used by Pedone and Schiper in 2010 to illustrate their choice of deferred update. The argument goes that deferred update combines the best of the two most prominent object replication techniques: active and passive replication. Only the combination of the best of the two will give both high availability and high performance. Translation: MySQL Group Replication can – in theory - have higher overall throughput than MySQL Replication. Do you love the theory ;-) ? As a DBA you should.
  40. 40. Active Replication (SM) Replica Replica Replica Replica Replica Replica Client Client Client sends op to all Requests get ordered Execution All reply to client
  41. 41. The speaker says... In an active replication system, a pure state machine replication system, the client operations are forwarded to all replicas and each replica individually executes the operation. The two challenges are to ensure all replicas execute requests in the same order and all replicas decide the same. Recall that we are talking about multi-threaded database servers here. A downside is that every replica has to execute the operation. If the operation is expensive in terms of CPU, this can be a waste of CPU time.
  42. 42. Passive Replication Backup Primary Backup Replica Replica Replica Client Client Client sends op to primary Only primary executes Primary forwards changes Primary replies to client
  43. 43. The speaker says... The alternative is passive replication or primary-backup replication. Here, the client talks to only one server, the primary. Only the primary server executes client operations. After computation of the result, the primary forwards the changes to the backups which apply them. The problem here is that the primary determines the system's throughput. None of the backups can contribute its computing power to the overall system throughput.
  44. 44. Multi-primary (pass.) replication What we want... • … for performance: more than one primary • … for scalability: no distributed locking • .. and of course: transactions • Two-staged transaction protocol Client Primary Primary Primary Transaction processing Transaction termination
  45. 45. The speaker says... Multi-primary (passive) replication has all the ingredients desired. Transaction processing is two staged. First, a client picks any replica to execute a transaction. This replica becomes the primary of the transaction. The transaction executes locally, the stage is called transaction processing. In the second stage, during transaction termination, the primaries jointly decide whether the transaction can commit or must abort. Because updates are not immediately applied, database folks call this deferred update – our last building block.
  46. 46. Deferred Update DB Replication Deterministic certification • Reads execute locally, Updates get certified • Certification ensures transaction serializability • Replicas decide independently about certification result Read Primary Write Primary Primary Primary Rs/Ws/U
  47. 47. The speaker says... One property of transactions is isolation. Isolation is also known as serializability: the concurrent execution of transactions should be equivalent to a serial execution of the same transactions. In a deferred update system, read transactions are processed and terminated on one replica and serialized locally. Updates must be certified. After transaction processing the readset, writeset and updates are sent to all other replicas. The servers then decide in a deterministic procedure whether (one-copy) serializability holds, i.e. whether the transaction commits. Because it's a deterministic procedure, the servers can certify transactions independently!
  48. 48. Options for termination Atomic Broadcast based • … this is what is used, by MySQL, by DBSM Optimization: Reordering (atop of Atomic Broadcast) • … in theory it means less transaction aborts Optimization limit: Generic Broadcast based • … this has issues, which make it nasty Atomic Commit based • … more transactions than atomic broadcast
  49. 49. The speaker says... There are several ways of implementing the termination protocol and the certification. There are two truly distinct choices: atomic broadcast and atomic commit. Atomic commit causes more transaction aborts than atomic broadcast. So, it's out and atomic broadcast remains. Atomic broadcast can – in theory – be further optimized towards fewer transaction aborts using reordering. For practical matters, this is about where the optimizations end. A weaker (and possibly faster) generic broadcast causes problems in the transactional model. For databases, it could be an over-optimization.
  50. 50. Generic certification test Transactions have a state • Executing, Committing, Committed, Aborted Reads are handled locally Updates are sent to all replicas • Readset and writeset are forwarded On each replica: search for 'conflicting' transactions • Can be serialized with all previous transactions? Commit! • Commit? Abort local transactions that overlap with the update
  51. 51. The speaker says... No matter what termination procedure is used, the basic procedure for certification in the deferred update model is always the same. Updates/writes need certification. The data read and the data written by a transaction is forwarded to all other replicas. Every replica searches for potentially 'conflicting' transactions, the details depend on the termination procedure. A transaction is decided to commit if it does not violate serializability with all previous transactions. Any local transaction currently running and conflicting with the update is aborted.
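The deterministic certification procedure can be sketched in a few lines. This is an illustration of the generic write-read conflict check described above, not MySQL's actual algorithm:

```python
def certify(committing, committed):
    """Deterministic certification sketch: the committing transaction passes
    if its readset does not overlap the writeset of any transaction that
    committed after its snapshot was taken (write-read conflict check)."""
    for other in committed:
        if committing["readset"] & other["writeset"]:
            return "ABORT"
    return "COMMIT"

# t1 committed elsewhere and wrote 'price'; t2 read 'price' concurrently.
t1 = {"readset": set(),     "writeset": {"price"}}
t2 = {"readset": {"price"}, "writeset": {"discount"}}
t3 = {"readset": {"stock"}, "writeset": {"stock"}}

assert certify(t2, [t1]) == "ABORT"   # t2 read what t1 wrote: not serializable
assert certify(t3, [t1]) == "COMMIT"  # disjoint data items: fine
```

Because `certify` is deterministic and every replica feeds it the same totally ordered input, all replicas reach the same verdict without exchanging further messages.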
  52. 52. Database State Machine Deferred Update Database Replication as a state machine • Atomic Broadcast based termination Plugin Services MySQL Transaction hooks Plugins MySQL Group Replication Capture Apply Recover Replication Protocol incl. termination protocol/certifier Group Communication System
  53. 53. The speaker says... The Database State Machine Approach combines all the bits and pieces. Let's do a bottom up summary. Atomic broadcast not only frees the database developer from bothering about networking APIs, it also solves the nasty bits of communicating in an asynchronous network. It provides properties that meet the requirements of state machine replication. A deterministic state machine is what one needs to implement the termination protocol within deferred update replication. Deferred update replication does not use distributed locking, which Gray proved problematic, and it combines the best of active and passive replication. Side effects: simple replication protocol, layered code.
  54. 54. The termination algorithm Updates are sent to all replicas • Readset and writeset are forwarded Step 1 - On each replica: certify • Is there any committed transaction that conflicts? (In the original paper: check for write-read conflicts between the committing transaction and committed transactions. Does the committing transaction's readset overlap with any committed transaction's writeset? Works slightly differently in MySQL.) Step 2 – On each replica: commitment • Apply transactions decided to commit • Handle concurrent local transactions: remote wins
  55. 55. The speaker says... The termination process has two logical steps, just like the general one presented earlier. The very details of how exactly two transactions are checked for conflicts in the first step don't matter here. MySQL Group Replication is using a refinement of the algorithm tailored to its own needs. As a developer all you need to know is: a remote transaction always wins no matter how expensive local transactions are. And, keep conflicting writes on one replica. It's faster. The puzzling bit on the slide is the rule to check a committing transaction against any committed transaction for conflicts. Any!? Not any... only concurrent ones.
  56. 56. What's concurrent? Any other transaction that does not precede the current one • Recall: total ordering • Recall: asynchronous, delay between broadcast and delivery Replica Replica Replica Replica Replica Broadcast Delivery 1 Total order 1 2 1 2 2 1 2
  57. 57. The speaker says... The definition of what concurrent means is a bit tricky. It's defined through a negation and that's confusing at first look but becomes – hopefully – clear on the next slide. Concurrent to a transaction is any other transaction that does not precede it. If we know the order of all transactions – in the entire cluster – then we can tell which transactions precede one another. Atomic broadcast ensures total order on delivery. Some implementations decide on ordering when sending and that number (logical clock) could be used. Any logical clock works.
  58. 58. Certify against all previous? Replica Replica Replica Replica Replica Transaction(2) 2 Total order 3 Certification 2 2 3 4 3 4 4 Broadcast: Transaction 4 is based on all previous up to 2 Certification when 4 is delivered: Check conflicts with trx >2 and trx < 4
  59. 59. The speaker says... The slide has an example of how to find the transactions that do not precede a given one. When a transaction enters the committing state and is broadcast, the broadcast includes the logical time (= total order number on the slide) of the latest transaction committed on the replica. Eventually the transaction is delivered on all sites. Upon delivery, certification considers all transactions that happened after the logical time of the to-be-certified transaction. None of those transactions precedes the one to be certified; they executed concurrently at different replicas. We don't have to look further into the past. Further in the past is stuff that's been decided on already.
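In code, the certification window from the slide can be sketched like this (Python, illustration only, not MySQL code):

```python
def certification_window(delivered, committing):
    """Return the transactions the committing one must be checked against:
    everything delivered after its snapshot time but before itself in the
    total order. Those are the concurrent transactions."""
    snapshot = committing["based_on"]   # logical time of the latest local commit
    own = committing["order"]           # its own total-order number
    return [t for t in delivered if snapshot < t["order"] < own]

# Transaction 4 executed on a replica whose latest committed trx was 2.
delivered = [{"order": 1}, {"order": 2}, {"order": 3}]
trx4 = {"order": 4, "based_on": 2}

window = certification_window(delivered, trx4)
assert [t["order"] for t in window] == [3]   # only trx 3 is concurrent
```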
  60. 60. TIME TO BREATHE MySQL is different anyway...
  61. 61. The speaker says... Good news! The algorithm used by MySQL Group Replication is different and simpler. For correctness, the precedes relation is still relevant. But it comes for free...
  62. 62. A developers view on commit Replica Replica Replica Replica Replica BEGIN COMMIT Result t(3) 4 Certify 4 Certify Apply Client Execute
  63. 63. The speaker says... We are not done with the theory yet but let's do some slides that take the developer's perspective. Assuming you have to scale a PHP application, assuming a small cluster of a handful of MySQL servers is enough and assuming these servers are co-located on racks, then MySQL Group Replication is your best possible choice. Did you get this from the theory? Replication is 'synchronous'. On commit you wait only for the server you are connected to. Once your transaction is broadcasted, you are done. You don't wait for the other servers to execute the transaction. With uniform atomic broadcast, once your transaction is broadcasted, it cannot get lost. (That's why I torture you with theory.)
  64. 64. MySQL Replication Master Slave Replica Replica Fetch Replica BEGIN COMMIT OK Bin log etc. Apply Client execute
  65. 65. The speaker says... If your network is slow, or mother earth, the speed of light and network message round trip times add too much to your transaction execution time, then asynchronous MySQL Replication is a better choice. In MySQL Replication the master (primary) never waits for the network. Not even to broadcast updates. Slaves asynchronously pull changes. Despite pushing work on the developer this approach has the downside that a hardware crash on the master can cause transaction loss. Slaves may or may not have pulled the latest data.
  66. 66. MySQL Semi-sync Replication Master Slave Replica Replica BEGIN COMMIT OK Wait for first ACK Fetch Replica Bin log Apply Client Execute Slave Fetch Apply Replica
  67. 67. The speaker says... In the times of MySQL 5.0 the MySQL Community suggested that to avoid transaction loss the master should wait for one slave to acknowledge it has fetched the update from the master. The fact that it's fetched does not mean that it's been applied. The update may not be visible to clients yet. It is a back and forth whether database replication should be asynchronous or not. It depends on your needs. Back to theory after this break.
  68. 68. Back to theory! Virtual Synchrony?
  69. 69. Virtual Synchrony Groups and views • A turbo-charged version of Atomic Broadcast P1 P2 P3 P4 M1 M2 VC M3 M4 G1 = {P1, P2, P3} G2 = {P1, P2, P3, P4}
  70. 70. The speaker says... Good news! Virtual Synchrony and Atomic Broadcast are closely related. Our Atomic Broadcast definition assumes a static group. Adding group members, removing members or detecting failed ones is not covered. Virtual Synchrony handles all these membership changes. Whenever an existing group agrees on changes, a new view is installed through a view change (VC) event. (The term 'virtual': it's not truly synchronous. There are short message delays we don't want to wait for. Yet, the system appears to be synchronous to most real-life observers.)
  71. 71. Virtual Synchrony View changes act as a message barrier • That's a case causing troubles in Two-Phase Commit P1 P2 P3 P4 M5 VC M6 M7 M8 G2 = {P1, P2, P3, P4} G3 = {P1, P2, P3}
  72. 72. The speaker says... View changes are message barriers. If the group members suspect a member to have failed they install a new view. Maybe the former member was not dead but just too slow to respond, or disconnected for a brief period. False alarm. The former member then tries to broadcast some updates. Virtual Synchrony ensures that the updates will not be seen by the remaining members. Furthermore the former member will realize that it was excluded. Some GCS implementing virtual synchrony even provide abstractions that ensure a joining member learns all updates it missed (state transfer) before it rejoins.
  73. 73. Auto-everything: failover MySQL Group Replication has a pluggable GCS API • Split brain handling? Depends on GCS and/or GCS config • Default GCS is Corosync MySQL MySQL MySQL MySQL MySQL MySQL
  74. 74. The speaker says... Good news! The Virtual Synchrony group membership advantages are fully exposed to the user level: node failures are detected and handled automatically. PECL/mysqlnd_ms can help you with the client side. It's a minor tweak to have it automatically learn about the remaining MySQL servers. Expect an update release soon. MySQL Group Replication works with any Group Communication System that can be accessed from C and implements Virtual Synchrony. The default choice is Corosync. Split brain handling is GCS dependent. MySQL follows the view change notifications of the GCS.
  75. 75. Auto-everything: joining Elastic cluster grows and shrinks on demand • State transfer done via asynch replication channel MySQL MySQL MySQL MySQL MySQL MySQL Donor State transfer Joiner
  76. 76. The speaker says... Good news! When adding a server you don't fiddle with the very details. You start the server, tell it to join the cluster and wait for it to catch up. The server picks a donor, begins fetching updates using much of the existing MySQL Replication code infrastructure and that's it.
  77. 77. Back to theory! Generalized Snapshot Isolation
  78. 78. Deferred Update tweak Transaction read set does not need to be broadcasted • Readset is hard to extract and can be huge • Weaker serializability level than 1SR • Sufficient for InnoDB default isolation Read Primary Write Primary Primary Primary V/Ws/U
  79. 79. The speaker says... Good news! This is the last bit of theory. The original Database State Machine proposal was followed by a simpler-to-implement proposal in 2005. If the cluster's serialization level is marginally lowered to snapshot isolation, certification becomes easier. Generalized snapshot isolation can be achieved without having to broadcast the readset of transactions. Recording the readset of a transaction is difficult in most existing databases. Also, readsets can be huge. Snapshot isolation is an isolation level for multi-version concurrency control. MVCC? InnoDB! Somehow... Whatever, this is the base algorithm of MySQL Group Replication's termination protocol.
  80. 80. Snapshot Isolation Concurrent and write conflict? First committer wins! • Reads use snapshot from the beginning of the transaction First committer Conflict (both change x) T1 T2 T1 T2 BEGIN(v1), W(v1, x=1), COMMIT!, x:v2=1 BEGIN(v1), W(v1, x=2), …, …, COMMIT? Concurrent write (version 1)
  81. 81. The speaker says... In snapshot isolation, transactions take a snapshot when they begin. All reads return data from this snapshot. Although any other concurrent transaction may update the underlying data while the transaction still runs, the change is invisible; the transaction runs in isolation. If two concurrent transactions change the same data item they conflict. In case of conflicts, the first committer wins. MVCC requires that, as part of updating a data item, its version is incremented. Future transactions will base their snapshot on the new version.
  82. 82. The actual termination protocol Replica Replica Replica Replica Replica Write(v2, x=1) Certification Object Latest version x 1 y 13 OK
  83. 83. The speaker says... Every replica checks the version of a write during certification. It compares the write's data item version number with the latest it knows of. If the version is higher than or equal to the one found in the replica's certification index, the write is accepted. A lower number indicates that someone has already updated the data item before. Because the first committer must win, a write showing a lower version number than is in the certification index must abort. (The certification index fills over time and is truncated periodically by MySQL. MySQL reports its size through Performance Schema tables.)
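The version check is easy to sketch. A Python toy of first-committer-wins certification (illustration only; MySQL's actual certification index differs in its details):

```python
class CertificationIndex:
    """First committer wins: remember the latest committed version per data
    item and reject writes that are based on an older snapshot."""

    def __init__(self):
        self.latest = {}                # data item -> latest committed version

    def certify(self, item, based_on):
        current = self.latest.get(item, 0)
        if based_on < current:
            return "ABORT"              # someone committed a newer version first
        self.latest[item] = current + 1 # accept the write, bump the version
        return "COMMIT"

idx = CertificationIndex()
assert idx.certify("x", 0) == "COMMIT"   # first write to x
assert idx.certify("x", 1) == "COMMIT"   # based on the new version: fine
assert idx.certify("x", 1) == "ABORT"    # stale snapshot: first committer won
```

Every replica runs the same deterministic check on the same totally ordered stream of writes, so all replicas abort and commit the same transactions.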
  84. 84. Hmm... Does it work?
  85. 85. It's a preview – there are limits General • InnoDB only • Corosync lacks uniform agreement • No rules to prevent split-brain (it's a preview, you're allowed to fool yourself if you misconfigure the GCS!) Isolation level • Primary Key based • Foreign Keys and Unique Keys not supported yet No concurrent DDL
  86. 86. That's it, folks! Questions?
  87. 87. The speaker says... (Oh, a question. Flips slide)
  88. 88. Network messages – pffft! MySQL super hero at Facebook @markcallaghan Sep 30 For MySQL sync replication, when all commits originate from 1 master is there 1 network round trip or 2? replication-hello-world … @Ulf_Wendel @markcallaghan AFAIK, on the logical level, there should be one. Some of your questions might depend on the GCS used. The GCS is pluggable @markcallaghan @Ulf_Wendel @h_ingo Henrik tells me it is "certification based" so I remain confused
  89. 89. GCS != MySQL Semi-sync It's many round trips; how many depends on the GCS • Default GCS is Corosync, Corosync is Totem Ring • Corosync uses a privilege-based approach for total ordering • Many options: fixed sequencer, moving sequencer, ... • Where you run your updates only impacts the collision rate MySQL MySQL Corosync Corosync MySQL Corosync
  90. 90. The speaker says... No Mark, MySQL Group Replication cannot be understood as a replacement for MySQL Semi-sync Replication. The question about network round trips is hard to answer. Atomic Broadcast and Virtual Synchrony stack many subprotocols together. Let's consider a stable group, no network failure, Totem. Totem orders messages using a token that circulates along a virtual ring of all members. Whoever has the token has the privilege to broadcast. Others wait for the token to appear. Atomic Broadcast gives us all-or-nothing messaging. It takes at least another full round on the ring to be sure the broadcast has been received by all. How many round trips is that? Welcome to distributed systems...
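To get a feeling for why the round trip count is hard to pin down, here is a Python toy of token-based ordering on a ring (illustration only; the real Totem protocol is far more involved):

```python
def token_ring_broadcast(members, sender):
    """Totem-style sketch: a token circulates a virtual ring; only the token
    holder may broadcast, and it takes a further full rotation before the
    sender can be sure every member has received the message."""
    hops = 0
    pos = 0                                  # assume the token starts at member 0
    # 1) wait for the token to reach the sender
    while members[pos] != sender:
        pos = (pos + 1) % len(members)
        hops += 1
    # 2) broadcast, then one full rotation to confirm delivery by all
    hops += len(members)
    return hops

ring = ["A", "B", "C"]
assert token_ring_broadcast(ring, "A") == 3   # token already here + 1 rotation
assert token_ring_broadcast(ring, "C") == 5   # 2 hops to fetch token + rotation
```

The message cost thus depends on where the token happens to be, on the ring size, and on which subprotocols sit on top - there is no single "1 or 2 round trips" answer.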
  91. 91. THE END Contact:
  92. 92. The speaker says... Thank you for your attendance! Upcoming shows: Talk&Show! - YourPlace, any time