MySQL Group Replication: 
'Synchronous', 
multi-master, 
auto-everything 
Ulf Wendel, MySQL/Oracle
The speaker says... 
MySQL 5.7 introduces a new kind of replication: MySQL 
Group Replication. At the time of writing (10/2014) 
MySQL Group Replication is available as a preview release 
on labs.mysql.com. In common user terms it features 
(virtually) synchronous, multi-master, auto-everything 
replication.
Proper wording... 
An eager update everywhere system based 
on the database state machine approach 
atop of a group communication system 
offering virtual synchrony and 
reliable total ordering messaging. 
MySQL Group Replication offers 
generalized snapshot isolation.
The speaker says... 
And here is a more technical description....
WHAT ?! 
Hmm, how does it compare?
The speaker says... 
The technical description given for MySQL Group 
Replication may sound confusing because it has elements 
from the distributed systems and database systems theory. 
Between roughly 1996 and 2006, the two research communities
jointly formulated the replication method implemented by
MySQL Group Replication.
As a web developer or MySQL DBA you are not expected to
know distributed systems theory inside out. Yet to
understand the properties of MySQL Group Replication and
to get the most out of it, we'll have to touch on some of the concepts.
Let's first see how the new stuff compares to the existing.
Goals of distributed databases 
Availability 
• Cluster as a whole unaffected by loss of nodes 
Scalability 
• Geographic distribution 
• Scale size in terms of users and data 
• Database specific: read and/or write load 
Distribution Transparency 
• Access, Location, Migration, Relocation (while in use) 
• Replication 
• Concurrency, Failure
The speaker says... 
MySQL Group Replication is about building a distributed 
database. To catalog it and compare it with the existing 
MySQL solutions in this area, we can ask what the goals of 
distributed databases are. The goals lead to some criteria
that are used to give a first, brief overview.
Goal: a distributed database cluster strives for maximum 
availability and scalability while maintaining distribution 
transparency. 
Criteria: availability, scalability, distribution transparency.
MySQL clustering cheat sheet
                          | MySQL Replication                | MySQL Cluster                   | MySQL Fabric
Availability              | Primary = SPoF, no auto failover | Shared nothing, auto failover   | SPoF monitored, auto failover
Scalability               | Reads                            | Partial replication, node limit | Partial replication, no node limit
Scale on WAN              | Asynchronous                     | Synchronous (WAN option)        | Asynchronous (depends)
Distribution Transparency | R/W splitting                    | SQL: yes (low level: no)        | Special clients, no distributed queries
The speaker says... 
Already today MySQL has three solutions to build a 
distributed MySQL cluster: MySQL Replication, MySQL 
Cluster and MySQL Fabric. Each system has different 
optimizations, none can achieve all the goals of a distributed 
cluster at once. Some goals are orthogonal. 
Take MySQL Cluster. MySQL Cluster is a shared nothing 
system. Data storage is redundant, nodes fail independently.
Transparent sharding (partial replication) ensures read and 
write scalability until the maximum number of nodes is 
reached. Great for clients: any SQL node runs any SQL, 
synchronous updates become visible immediately 
everywhere. But, it won't scale on slow WAN connections.
How Group Replication fits in
                          | Repl. | Cluster                         | Group Repl.                                           | Fabric
Availability              |       | Shared nothing, auto failover   | Shared nothing, auto failover/join                    |
Scalability               |       | Partial replication, node limit | Full replication, read and some write scalability     |
Scale on WAN              |       | Synchronous (WAN option)        | (Virtually) Synchronous                               |
Distribution Transparency |       | SQL: yes (low level: no)        | All nodes run all SQL                                 |
The speaker says... 
MySQL Group Replication has many of the desirable
properties of MySQL Cluster. It's strong on availability and
client friendly due to the distribution transparency. No
complex client or application logic is required to use the 
cluster. So, how do the two differ? 
Unlike MySQL Cluster, MySQL Group Replication supports 
the InnoDB storage engine. InnoDB is the dominant storage 
engine for web applications. This makes MySQL Group 
Replication a very attractive choice for small clusters (3-7 
nodes) running Drupal, WordPress, … in LAN settings! Also, 
Group Replication is not synchronous in the strict technical sense. For
practical purposes it is.
Group Replication (vs. Cluster) 
Availability 
• Nodes fail independently 
• Cluster continues operation in case of node failures 
Scalability 
• Geographic distribution: n/a, needs fast messaging 
• All nodes accept writes, mild write scalability 
• All nodes accept reads, full read scalability 
Distribution Transparency 
• Full replication: all nodes have all the data 
• Fail stop model: developer freed from worrying about consistency
The speaker says... 
Another major difference between MySQL Cluster and 
MySQL Group Replication is the use of partial replication 
versus full replication. MySQL Cluster has transparent 
sharding (partial replication) built in. On the inside, on the
level of so-called MySQL Cluster data nodes, not every node 
has all the data. Writes don't add work to all nodes of the 
cluster but only a subset of them. Partial replication is the 
only known solution to write scalability. With MySQL Group 
Replication all nodes have all the data. Writes can be 
executed concurrently on different nodes but each write 
must be coordinated with every other node. 
… time to dig deeper >:).
Eager update everywhere... ?!
A developer's categorization...
(Columns: where are transactions run? Rows: when does synchronization happen?)
      | Primary Copy                                  | Update Everywhere
Eager | (MySQL semi-synch Replication)                | MySQL Cluster; MySQL Group Replication; 3rd party: Galera
Lazy  | MySQL Replication/Fabric; 3rd party: Tungsten | MySQL Cluster Replication
The speaker says... 
I've described MySQL Group Replication as "an eager
update everywhere system". The term comes from a
categorization of database replication systems by
two questions:
- where can transactions be run?
- when are transactions synchronized between nodes?
The answers to these questions tell a developer which
challenges to expect. They determine which
additional tasks an application must handle when it's run on
a cluster instead of a single server.
Lazy causes work... 
[Diagram: a client sets price = 1.23 on one node, which then holds price = 1.23. The three other nodes hold price = 1.00, price = 1.23 and price = 0.98.]
The speaker says... 
When you try to scale an application running it on a lazy 
(asynchronous) replication cluster instead of a single server 
you will soon have users complaining about outdated and
"incorrect" data. Depending on which node the application
connects to after a write, a user may or may not see his own
updates. This can neither happen on a single server system 
nor on an eager (synchronous) replication cluster. Lazy 
replication causes extra work for the developer. 
BTW, have a look at PECL/mysqlnd_ms. It abstracts the 
problem of consistency for you. Things like read-your-writes 
boil down to a single function call.
Primary Copy causes work... 
Primary 
Write 
Copy Copy Copy 
Read 
Read 
Read Read
The speaker says... 
Judging from the developer perspective only, primary copy is 
an undesired replication solution. In a primary copy system 
only one node accepts writes. The other nodes copy the 
updates performed on the primary. Because of the read-write 
splitting, the replication system does not need to 
coordinate conflicting operations. Great for the replication 
system author, bad for the developer. As a developer you 
must ensure that all write operations are directed to the 
primary node... Again, have a look at PECL/mysqlnd_ms. 
MySQL Replication follows this approach. Worse, MySQL 
Replication is a lazy primary copy system.
Love: Eager Update Everywhere 
[Diagram: three nodes, each accepting both writes and reads, all holding price = 1.23.]
The speaker says... 
From a developer perspective an eager update everywhere
system, like MySQL Group Replication, is indistinguishable
from a single node. The only extra work it brings you is load
balancing, but that is the case with any cluster. An eager
update everywhere cluster improves distribution transparency
and removes the risk of reading stale data. Transparency
and flexibility are improved because any transaction can be
directed to any replica. (Sometimes synchronization
happens as part of the commit, thus strong consistency can
be achieved.) Fault tolerance is better than with Primary
Copy. There is no single point of failure, such as a single primary,
that can cause a total outage of the cluster. Nodes may fail
individually without bringing the cluster down immediately.
HOW? Distributed + DB? 
Database state machine?
The speaker says... 
In the mid-1990s two observations made the database and
distributed systems theory communities wonder if they
could develop a joint replication approach.
First, Gray et al. (database community) showed that the
common two-phase locking has an expected deadlock rate 
that grows with the third power of the number of replicas. 
Second, Schiper and Raynal noted that transactions have 
common properties with group communication principles 
(distributed systems) such as ordering, agreement/'all-or-nothing' 
and even durability.
Three building blocks 
State machine replication 
• … trivial to understand 
Atomic Broadcast 
• … database meets distributed systems community 
• … OMG, how easy state machine replication is to implement! 
Deferred Update Database Replication 
• … database meets distributed systems community 
• … how we gain high availability and high performance 
• … what those MySQL Replication team blogs talk about ;-)
The speaker says... 
Finally, in 1999 Pedone, Guerraoui and Schiper published
the paper "The Database State Machine Approach". The
paper combines two well-known building blocks for
replication with a messaging primitive common in the
distributed systems world: atomic broadcast.
MySQL Group Replication is slightly different from this 1999
version; it follows a later refinement from 2005 more closely and adds a
bit of ease-of-use. By the end of this chapter
you will have learned how MySQL Cluster and MySQL Group
Replication differ beyond InnoDB support and built-in
sharding.
State machine replication 
[Diagram: the input "Set A = 1" is sent to three replicas; each replica produces the output A = 1.]
The speaker says... 
The first building block is trivial: a state machine. A state
machine takes some input and produces some output.
Assume your state machines are deterministic. Then, if you
have a set of replicas all running the same state machine
and they all get the same input, they all will produce the
same output. As an aside: state machine replication is also
known as active replication. Active means that every replica
executes all the operations, so active replication adds compute load to
every replica. With passive replication, also called primary-backup
replication, one replica (primary) executes the
operations and forwards the results to the others. Passive
replication is limited by primary availability and possibly network
bandwidth.
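A minimal sketch of the idea in Python, not taken from the talk: the `Replica` class and the `broadcast` helper are hypothetical names. Every replica is the same deterministic state machine, so feeding all of them the same input yields the same output everywhere.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A deterministic state machine whose state is a dict of variables."""
    state: dict = field(default_factory=dict)

    def apply(self, op):
        # op is a (variable, value) pair, e.g. ("A", 1) for "Set A = 1".
        var, value = op
        self.state[var] = value
        return dict(self.state)  # the output produced for this input


def broadcast(replicas, op):
    """Active replication: every replica executes every operation itself."""
    return [replica.apply(op) for replica in replicas]


replicas = [Replica() for _ in range(3)]
outputs = broadcast(replicas, ("A", 1))
print(outputs)  # [{'A': 1}, {'A': 1}, {'A': 1}]
```

Note that the work is done three times, once per replica. That is the compute cost of active replication mentioned above.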
Requirement: Agreement 
[Diagram: the input "Set A = 1" is sent to three replicas; the outputs shown are A = 1 and A = NULL.]
The speaker says... 
Here's more trivia about the state machine replication 
approach. There are two requirements for it to work. Quite 
obviously, every replica has to receive all input to come to 
the same output. And the precondition for receiving input is 
that the replica is still alive. 
In academic words the requirement is: agreement. Every 
non-faulty replica receives every request. Non-faulty replicas 
must agree on the input.
Requirement: Order 
1) Set A = 1
2) Set B = 1
3) Set B = A * 2
Replica with input 1, 2, 3: A = 1, B = 2
Replica with input 1, 3, 2: A = 1, B = 1
Replica with input 3, 1, 2: A = 1, B = 1
The speaker says... 
The second trivial requirement for state machine replication 
is ordering. To produce the same output any two state 
machines must execute the very same input – including the 
ordering of input operations. The academic wording goes: if
a replica processes request r1 before r2, then no replica
processes request r2 before r1. Note that if operations
commute, some reordering may still lead to correct output.
The sequence A = 1, B = 1, B = A * 2 and the sequence B = 
1, A = 1, B = A * 2 produce the same output. 
(Unrelated here: the database scaling talk touches the fancy 
commutative replicated data types Riak offers... hot!)
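The effect of ordering can be replayed with a small Python sketch. The `run` function is a hypothetical helper, and treating an unset variable as `None` (SQL NULL) is an assumption of this sketch.

```python
def run(order):
    """Apply the three operations from the slide in the given order."""
    ops = {
        1: lambda s: s.update(A=1),                        # 1) Set A = 1
        2: lambda s: s.update(B=1),                        # 2) Set B = 1
        # 3) Set B = A * 2; the result stays NULL (None) if A is not set yet
        3: lambda s: s.update(B=None if s["A"] is None else s["A"] * 2),
    }
    state = {"A": None, "B": None}
    for number in order:
        ops[number](state)
    return state


print(run([1, 2, 3]))  # {'A': 1, 'B': 2}
print(run([1, 3, 2]))  # {'A': 1, 'B': 1}
print(run([3, 1, 2]))  # {'A': 1, 'B': 1}
# Operations 1 and 2 commute, so swapping them does not change the result:
print(run([2, 1, 3]))  # {'A': 1, 'B': 2}
```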
Atomic Broadcast 
Distributed systems messaging abstraction 
• Meets all replicated state machine requirements 
Agreement 
• If a site delivers a message m then every site delivers m 
Order 
• No two sites deliver any two messages in different orders 
Termination 
• If a site broadcasts message m and does not fail, then every 
site eventually delivers m 
• We need this in asynchronous environments
The speaker says... 
State machine replication is the first building block for 
understanding the database state machine approach. The 
second building block is a messaging abstraction from the 
distributed systems world called atomic broadcast. Atomic 
broadcast provides all the properties required for state 
machine replication: agreement and ordering. It adds a 
property needed for communication in an asynchronous 
system, such as a system communicating via network 
messages: termination. 
All in all, this greatly simplifies state machine replication and 
contributes to a simple, layered design.
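To make the ordering property tangible, here is a toy Python sketch of total order delivery using a fixed sequencer. This is only one of several possible designs and not necessarily what any particular GCS does. `Sequencer` and `Site` are hypothetical names, and failures are not modelled, so agreement and termination under crashes are out of scope.

```python
import random


class Sequencer:
    """Assigns a global sequence number to every broadcast message."""

    def __init__(self):
        self.next_seq = 0

    def order(self, msg):
        seq = self.next_seq
        self.next_seq += 1
        return seq, msg


class Site:
    """Buffers received messages and delivers them strictly in sequence order."""

    def __init__(self):
        self.pending = {}
        self.next_to_deliver = 0
        self.delivered = []

    def receive(self, seq, msg):
        self.pending[seq] = msg
        # Deliver as long as the next expected message has arrived.
        while self.next_to_deliver in self.pending:
            self.delivered.append(self.pending.pop(self.next_to_deliver))
            self.next_to_deliver += 1


sequencer = Sequencer()
stamped = [sequencer.order(m) for m in ["Set A = 1", "Set B = 1", "Set B = A * 2"]]

sites = [Site() for _ in range(3)]
rng = random.Random(42)
for site in sites:
    arrival = stamped[:]
    rng.shuffle(arrival)  # the network may reorder messages per site
    for seq, msg in arrival:
        site.receive(seq, msg)

# Regardless of arrival order, all sites deliver in the same total order.
for site in sites:
    print(site.delivered)
```

The distinction between receiving and delivering a message is the important bit: a site may receive messages in any order, but it hands them to the state machine only in the agreed order.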
Delivery, durability, group 
Client 
Replica 
Replica 
Replica 
Mr. X 
Replica 
Replica 
Replica 
Group 
Send first, possibly delivered second
The speaker says... 
The atomic broadcast properties given are literally copied
from the original paper describing the database state
machine replication approach. There are two things in them not
explained yet. First, atomic broadcast defines properties in
terms of message delivery. The delivery property not only 
ensures total ordering despite slow transport but also covers 
message loss (MySQL desires uniform agreement here, 
something better than Corosync) and even the crash and 
recovery of processors (durability)! A recovering processor 
must first deliver outstanding messages before it continues. 
Second, note that atomic broadcast introduces the notion of 
a group. Only (correct) members of a group can exchange 
messages.
Deferred Update: the best? 
Client Client 
Replica 
Replica 
Replica 
Replica 
Replica 
Replica 
Client Request 
Server Coordination 
Execution 
Agreement 
Client Response
The speaker says... 
We are almost there. The third building block to the 
database state machine replication is deferred update 
database replication. The slide shows a generic functional
model used by Pedone and Schiper in 2010 to illustrate their
choice of deferred update. The argument goes that deferred
update combines the best of the two most prominent object
replication techniques: active and passive replication. Only
the combination of the best from the two will give both high
availability and high performance.
Translation: MySQL Group Replication can – in theory - 
have higher overall throughput than MySQL Replication. Do 
you love the theory ;-) ? As a DBA you should.
Active Replication (SM) 
Replica 
Replica 
Replica 
Replica 
Replica 
Replica 
Client Client 
Client sends op to all 
Requests get ordered 
Execution 
All reply to client
The speaker says... 
In an active replication system, a pure state machine 
replication system, the client operations are forwarded to all 
replicas and each replica individually executes the operation. 
The two challenges are to ensure all replicas execute 
requests in the same order and all replicas decide the same. 
Recall that we are talking about multi-threaded database servers here.
A downside is that every replica has to execute the 
operation. If the operation is expensive in terms of CPU, this 
can be a waste of CPU time.
Passive Replication 
Backup 
Primary 
Backup 
Replica 
Replica 
Replica 
Client Client 
Client sends op to primary 
Only primary executes 
Primary forwards changes 
Primary replies to client
The speaker says... 
The alternative is passive replication or primary-backup 
replication. Here, the client talks to only one server, the 
primary. Only the primary server executes client operations. 
After computation of the result, the primary forwards the 
changes to the backups which apply them.
The problem here is that the primary determines the
system's throughput. None of the backups can contribute its
computing power to the overall system throughput.
Multi-primary (pass.) replication 
What we want... 
• … for performance: more than one primary 
• … for scalability: no distributed locking 
• … and of course: transactions
• Two-staged transaction protocol 
Client Primary 
Primary 
Primary 
Transaction processing Transaction termination
The speaker says... 
Multi-primary (passive) replication has all the ingredients 
desired. 
Transaction processing is two staged. First, a client picks 
any replica to execute a transaction. This replica becomes 
the primary of the transaction. The transaction executes 
locally, the stage is called transaction processing. In the 
second stage, during transaction termination, the primaries 
jointly decide whether the transaction can commit or must 
abort. 
Because updates are not immediately applied, database 
folks call this deferred update – our last building block.
Deferred Update DB Replication 
Deterministic certification 
• Reads execute locally, Updates get certified 
• Certification ensures transaction serializability 
• Replicas decide independently about certification result 
Read Primary 
Write Primary 
Primary 
Primary 
Rs/Ws/U
The speaker says... 
One property of transactions is isolation. Isolation is also
known as serializability: the concurrent execution of
transactions should be equivalent to a serial execution of the
same transactions. In a deferred update system, read
transactions are processed and terminated on one replica
and serialized locally.
Updates must be certified. After transaction processing,
the readset, writeset and updates are sent to all other
replicas. The servers then decide in a deterministic
procedure whether (one-copy) serializability holds, that is, whether the
transaction commits. Because it's a deterministic procedure,
the servers can certify transactions independently!
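A rough Python sketch of deterministic certification follows, under stated assumptions. The `Txn` and `Replica` names are hypothetical, transactions arrive in the same total order everywhere, and `snapshot` counts how many transactions were committed when the transaction started executing. The conflict rule is the readset versus writeset check from the original paper, not MySQL's variant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    name: str
    readset: frozenset
    writeset: frozenset
    snapshot: int  # number of committed transactions seen when execution began


class Replica:
    def __init__(self):
        self.committed = []  # writesets of committed transactions, in commit order

    def certify(self, txn):
        # Only transactions committed after txn took its snapshot are concurrent.
        concurrent = self.committed[txn.snapshot:]
        if any(txn.readset & writeset for writeset in concurrent):
            return False  # txn read something a concurrent transaction overwrote
        self.committed.append(txn.writeset)
        return True


# The same stream, in the same order, is delivered to every replica.
stream = [
    Txn("T1", frozenset({"x"}), frozenset({"x"}), snapshot=0),
    Txn("T2", frozenset({"x"}), frozenset({"y"}), snapshot=0),
    Txn("T3", frozenset({"z"}), frozenset({"z"}), snapshot=0),
]

replicas = [Replica() for _ in range(3)]
decisions = [[r.certify(t) for t in stream] for r in replicas]
print(decisions)  # every replica prints [True, False, True]
```

No replica asks another one for its verdict. Identical input plus a deterministic test is enough for all of them to reach the same decision.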
Options for termination 
Atomic Broadcast based 
• … this is what is used, by MySQL, by DBSM 
Optimization: Reordering (atop of Atomic Broadcast) 
• … in theory it means fewer transaction aborts
Optimization limit: Generic Broadcast based
• … this has issues, which make it nasty
Atomic Commit based
• … more transaction aborts than atomic broadcast
The speaker says... 
There are several ways of implementing the termination 
protocol and the certification. There are two truly distinct 
choices: atomic broadcast and atomic commit. Atomic 
commit causes more transaction aborts than atomic 
broadcast. So, it's out and atomic broadcast remains. 
Atomic broadcast can – in theory – be further optimized 
towards fewer transaction aborts using reordering. For
practical matters, this is about where the optimizations
end. A weaker (and possibly faster) generic broadcast 
causes problems in the transactional model. For databases, 
it could be an over-optimization.
Generic certification test 
Transactions have a state 
• Executing, Committing, Committed, Aborted
Reads are handled locally
Updates are sent to all replicas
• Readset and writeset are forwarded
On each replica: search for 'conflicting' transactions
• Can be serialized with all previous transactions? Commit!
• Commit? Abort local transactions that overlap with update
The speaker says... 
No matter what termination procedure is used, the basic 
procedure for certification in the deferred update model is 
always the same. Updates/writes need certification. The 
data read and the data written by a transaction are forwarded
to all other replicas. 
Every replica searches for potentially 'conflicting' 
transactions, the details depend on the termination 
procedure. A transaction is decided to commit if it does not 
violate serializability with all previous transactions. Any local 
transaction currently running and conflicting with the update 
is aborted.
Database State Machine 
Deferred Update Database Replication as a state 
machine 
• Atomic Broadcast based termination 
Plugin Services 
MySQL 
Transaction hooks 
Plugins 
MySQL Group Replication 
Capture Apply Recover 
Replication Protocol incl. termination protocol/certifier 
Group Communication System
The speaker says... 
The Database State Machine Approach combines all the bits 
and pieces. Let's do a bottom up summary. Atomic 
broadcast not only frees the database developer from bothering
with networking APIs, it also solves the nasty bits of
communicating in an asynchronous network. It provides 
properties that meet the requirements of the state machine 
replication. A deterministic state machine is what one needs 
to implement the termination protocol within deferred update 
replication. Deferred update replication does not use 
distributed locking which Gray proved problematic and it 
combines the best of active and passive replication. Side 
effects: simple replication protocol, layered code.
The termination algorithm 
Updates are sent to all replicas
• Readset and writeset are forwarded
Step 1 - On each replica: certify
• Is there any committed transaction that conflicts?
(In the original paper: check for write-read conflicts between the
committing transaction and committed transactions. Does
the committing transaction's readset overlap with any committed
transaction's writeset? Works slightly differently in MySQL.)
Step 2 – On each replica: commitment 
• Apply transactions decided to commit 
• Handle concurrent local transactions: remote wins
The speaker says... 
The termination process has two logical steps, just like the 
general one presented earlier. The very details of how 
exactly two transactions are checked for conflicts in the first 
step don't matter here. MySQL Group Replication is using a 
refinement of the algorithm tailored to its own needs. As a 
developer all you need to know is: a remote transaction 
always wins no matter how expensive local transactions are. 
And, keep conflicting writes on one replica. It's faster. 
The puzzling bit on the slide is the rule to check a
committing transaction against any committed transaction for
conflicts. Any !? Not any... only concurrent.
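The "remote wins" rule from step 2 might look like the Python sketch below. `LocalTxn` and `apply_remote` are hypothetical names, and treating any overlap between the remote writeset and the items a local transaction has touched as a conflict is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class LocalTxn:
    name: str
    touched: set              # data items read or written so far
    state: str = "executing"


def apply_remote(writeset, local_txns):
    """Apply a certified remote transaction and abort overlapping local work."""
    aborted = []
    for txn in local_txns:
        if txn.state == "executing" and txn.touched & writeset:
            txn.state = "aborted"  # remote wins, however expensive txn was
            aborted.append(txn.name)
    return aborted


local = [
    LocalTxn("report", touched={"x", "y", "z"}),
    LocalTxn("unrelated", touched={"q"}),
]
print(apply_remote({"x"}, local))  # ['report']
```

This is why it pays to keep conflicting writes on one replica: there they queue up behind ordinary local locks instead of being aborted after having done their work.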
What's concurrent? 
Any other transaction that precedes the current one 
• Recall: total ordering 
• Recall: asynchronous, delay between broadcast and delivery 
Replica 
Replica 
Replica 
Replica 
Replica 
Broadcast Delivery 
1 
Total order 1 
2 
1 2 2 
1 2
The speaker says... 
The definition of what concurrent means is a bit tricky. It's
defined through a negation, which is confusing at first
glance but becomes, hopefully, clear on the next slide.
Concurrent to a transaction is any other transaction that
does precede it. If we know the order of all transactions in
the entire cluster, then we can tell which transactions precede
one another.
Atomic broadcast ensures total order on delivery. Some
implementations decide on ordering when sending and that
number (logical clock) could be used. Any logical clock
works.
Certify against all previous? 
Replica 
Replica 
Replica 
Replica 
Replica 
Transaction(2) 
2 
Total order 3 
Certification 
2 
2 
3 
4 
3 
4 
4 
Broadcast: 
Transaction 4 is based 
on all previous up to 2 
Certification when 4 is delivered: 
Check conflicts with trx >2 and trx < 4
The speaker says... 
The slide has an example of how to find any other transaction
that precedes one. When a transaction enters the
committing state and is broadcast, the broadcast includes
the logical time (= total order number on the slide) of the
latest transaction committed on the replica.
Eventually the transaction is delivered on all sites. Upon
delivery the certification considers all transactions that
happened after the logical time of the transaction to be
certified. All those transactions precede the one to be
certified; they executed concurrently at different replicas. We
don't have to look further in the past. Further in the past is
stuff that's been decided on already.
TIME TO BREATHE
MySQL is different anyway...
The speaker says... 
Good news! The algorithm used by MySQL Group 
Replication is different and simpler. For correctness, the 
precedes relation is still relevant. But it comes for free...
A developer's view on commit
Replica 
Replica 
Replica 
Replica 
Replica 
BEGIN COMMIT Result 
t(3) 
4 Certify 
4 Certify 
Apply 
Client Execute
The speaker says... 
We are not done with the theory yet but let's do some slides
that take the developer's perspective. Assuming you have to
scale a PHP application, assuming a small cluster of a
handful of MySQL servers is enough and assuming these
servers are co-located on racks, then MySQL Group
Replication is your best possible choice.
Did you get this from the theory? Replication is
'synchronous'. On commit you wait only for the server you
are connected to. Once your transaction is broadcast, you
are done. You don't wait for the other servers to execute the
transaction. With uniform atomic broadcast, once your
transaction is broadcast, it cannot get lost. (That's why I
torture you with theory.)
MySQL Replication 
Master 
Slave 
Replica 
Replica 
Fetch Replica 
BEGIN COMMIT OK 
Bin log etc. 
Apply 
Client execute
The speaker says... 
If your network is slow, or if mother earth, the speed of light
and network round trip time add too much to
your transaction execution time, then asynchronous MySQL
Replication is a better choice.
In MySQL Replication the master (primary) never waits for
the network. Not even to broadcast updates. Slaves
asynchronously pull changes. Besides pushing work onto the
developer, this approach has the downside that a hardware
crash on the master can cause transaction loss. Slaves may
or may not have pulled the latest data.
MySQL Semi-sync Replication 
Master 
Slave 
Replica 
Replica 
BEGIN COMMIT OK 
Wait for first ACK 
Fetch Replica 
Bin log 
Apply 
Client Execute 
Slave Fetch Apply Replica
The speaker says... 
In the times of MySQL 5.0 the MySQL Community 
suggested that to avoid transaction loss the master should 
wait for one slave to acknowledge it has fetched the update 
from the master. The fact that it's fetched does not mean 
that it's been applied. The update may not be visible to 
clients yet. 
It is a back and forth whether database replication should be 
asynchronous or not. It depends on your needs. 
Back to theory after this break.
Back to theory! 
Virtual Synchrony?
Virtual Synchrony 
Groups and views 
• A turbo-charged version of Atomic Broadcast
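[Diagram: timelines for processes P1 to P4. Messages M1 and M2 are exchanged in view G1 = {P1, P2, P3}; a view change (VC) installs G2 = {P1, P2, P3, P4}; messages M3 and M4 follow.]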
The speaker says... 
Good news! Virtual Synchrony and Atomic Broadcast are
almost the same. Our Atomic Broadcast definition assumes a static
group. Adding group members, removing members or
detecting failed ones is not covered.
Virtual Synchrony handles all these membership changes.
Whenever an existing group agrees on changes, a new view
is installed through a view change (VC) event.
(The term 'virtual': it's not synchronous. Message delivery takes a
short while and we don't want to wait for it. Yet, the system
appears to be synchronous to most real life observers.)
Virtual Synchrony 
View changes act as a message barrier 
• That's a case causing troubles in Two-Phase Commit 
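[Diagram: timelines for processes P1 to P4. Message M5 is exchanged in view G2 = {P1, P2, P3, P4}; a view change (VC) installs G3 = {P1, P2, P3}; messages M6, M7 and M8 follow.]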
The speaker says... 
View changes are message barriers. If the group members 
suspect a member to have failed they install a new view. 
Maybe the former member was not dead but just too slow to 
respond, or disconnected for a brief period. False alarm. The 
former member then tries to broadcast some updates. 
Virtual Synchrony ensures that the updates will not be seen 
by the remaining members. Furthermore the former member 
will realize that it was excluded. 
Some GCS implementing virtual synchrony even provide 
abstractions that ensure a joining member learns all updates 
it missed (state transfer) before it rejoins.
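The barrier effect can be illustrated with a deliberately simplified Python sketch. `Member`, `install_view` and `receive` are hypothetical names; a real group communication system also flushes in-flight messages before installing a view, which is not modelled here.

```python
class Member:
    def __init__(self, name, view_id, members):
        self.name = name
        self.view_id = view_id
        self.members = set(members)
        self.delivered = []

    def install_view(self, view_id, members):
        """View change (VC): from now on only this view's messages count."""
        self.view_id = view_id
        self.members = set(members)

    def receive(self, sender, sent_in_view, msg):
        # Messages from an older view or from an excluded sender are dropped.
        if sent_in_view != self.view_id or sender not in self.members:
            return False
        self.delivered.append(msg)
        return True


p1 = Member("P1", view_id=2, members={"P1", "P2", "P3", "P4"})
print(p1.receive("P4", 2, "M5"))  # True: P4 is still a member of the view

# P4 is suspected to have failed; the remaining members install a new view.
p1.install_view(3, {"P1", "P2", "P3"})

# P4 was merely slow and now broadcasts an update, still believing in view 2.
print(p1.receive("P4", 2, "M6"))  # False: the update is not seen
print(p1.delivered)               # ['M5']
```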
Auto-everything: failover 
MySQL Group Replication has a pluggable GCS API 
• Split brain handling? Depends on GCS and/or GCS config
• Default GCS is Corosync 
[Diagram: a group of six MySQL servers.]
The speaker says... 
Good news! The Virtual Synchrony group membership 
advantages are fully exposed to the user level: node failures 
are detected and handled automatically. PECL/mysqlnd_ms
can help you on the client side. It's a minor tweak to have it
automatically learn about the remaining MySQL servers. Expect
an update release soon.
MySQL Group Replication works with any Group 
Communication system that can be accessed from C and 
implements Virtual Synchrony. The default choice is 
Corosync. Split brain handling is GCS dependent. MySQL 
follows view change notifications of the GCS.
Auto-everything: joining 
Elastic cluster grows and shrinks on demand 
• State transfer done via asynch replication channel 
[Diagram: a group of MySQL servers; a donor performs state transfer to a joiner.]
The speaker says... 
Good news! When adding a server you don't fiddle with the 
very details. You start the server, tell it to join the cluster and 
wait for it to catch up. The server picks a donor, begins 
fetching updates using much of the existing MySQL 
Replication code infrastructure and that's it.
Back to theory! 
Generalized Snapshot Isolation
Deferred Update tweak 
Transaction read set does not need to be broadcasted 
• Readset is hard to extract and can be huge 
• Weaker serializability level than 1SR 
• Sufficient for InnoDB default isolation 
Read Primary 
Write Primary 
Primary 
Primary 
V/Ws/U
The speaker says... 
Good news! This is the last bit of theory. The original Database
State Machine proposal was followed by a simpler to
implement proposal in 2005. If the cluster's serialization level
is marginally lowered to snapshot, certification becomes
easier. Generalized snapshot isolation can be achieved
without having to broadcast the readset of transactions.
Recording the readset of a transaction is difficult in most
existing databases. Also, readsets can be huge.
Snapshot isolation is an isolation level for multi-version
concurrency control. MVCC? InnoDB! Somehow... Whatever,
this is the base algorithm for termination in MySQL Group
Replication.
Snapshot Isolation 
Concurrent and write conflict? First committer wins!
• Reads use snapshot from the beginning of the transaction 
T1 (first committer): BEGIN(v1), W(v1, x=1), COMMIT!, x:v2=1
T2 (concurrent write based on version 1): BEGIN(v1), W(v1, x=2), …, …, COMMIT?
Conflict: both change x
The speaker says... 
Under snapshot isolation, transactions take a snapshot when
they begin. All reads return data from this snapshot.
Although any other concurrent transaction may update the
underlying data while the transaction still runs, the change is
invisible; the transaction runs in isolation. If two concurrent
transactions change the same data item they conflict. In
case of conflicts, the first committer wins.
MVCC requires that, as part of the update of a data item, its
version is incremented. Future transactions will base their
snapshot on the new version.
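A single-node Python sketch of first committer wins, replaying the T1/T2 example. `Store` and `Txn` are hypothetical names, and per-item version counters starting at 1 are an assumption of this sketch, not InnoDB's actual mechanism.

```python
class Txn:
    def __init__(self, values, versions):
        self.snapshot = values          # values as of transaction begin
        self.start_versions = versions  # item versions as of transaction begin
        self.writes = {}

    def read(self, item):
        # A transaction sees its own writes, otherwise its snapshot.
        return self.writes.get(item, self.snapshot.get(item))

    def write(self, item, value):
        self.writes[item] = value


class Store:
    def __init__(self, initial):
        self.values = dict(initial)
        self.versions = {item: 1 for item in initial}

    def begin(self):
        return Txn(dict(self.values), dict(self.versions))

    def commit(self, txn):
        # Conflict: a written item changed version since the snapshot was taken.
        for item in txn.writes:
            if self.versions.get(item, 0) != txn.start_versions.get(item, 0):
                return False  # a concurrent transaction committed first
        for item, value in txn.writes.items():
            self.values[item] = value
            self.versions[item] = self.versions.get(item, 0) + 1
        return True


store = Store({"x": 0})
t1, t2 = store.begin(), store.begin()   # both BEGIN at version 1 of x
t1.write("x", 1)
print(store.commit(t1))                 # True: first committer, x is now version 2
print(t2.read("x"))                     # 0: t2 still reads from its snapshot
t2.write("x", 2)
print(store.commit(t2))                 # False: concurrent write conflict on x
```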
The actual termination protocol 
[Diagram: Write(v2, x=1) is delivered to the replicas and certified against the certification index.]
Object | Latest version
x      | 1
y      | 13
Result: OK
The speaker says... 
Every replica checks the version of a write during 
certification. It compares the version number of the data item 
in the write with the latest version it knows of. If the version is 
higher than or equal to the one found in the replica's certification 
index, the write is accepted. A lower number indicates that 
someone has already updated the data item. 
Because the first committer must win, a write showing a lower 
version number than the one in the certification index must abort. 
(The certification index fills over time and is truncated 
periodically by MySQL. MySQL reports the size through 
Performance Schema tables.)
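The version check described above can be sketched as a small Python function. This is an illustration only, not the MySQL implementation. The function name certify and the dictionary layout are hypothetical. The slide does not say how the index advances after a write is accepted, so the sketch assumes the entry is set to one above the version the write was based on.

```python
def certify(index, writeset):
    """Certify one write against a replica's certification index.

    index:    dict mapping item -> latest version known to this replica
    writeset: dict mapping item -> version the write is based on
    Returns True if the write is accepted, False if it must abort.
    """
    # A lower version than the index means someone updated the item first.
    for item, version in writeset.items():
        if version < index.get(item, 0):
            return False
    # Assumption: an accepted write advances the item to the next version.
    for item, version in writeset.items():
        index[item] = version + 1
    return True


# Every replica receives the same writes in the same total order,
# so every replica reaches the same decisions.
replica_a = {"x": 1, "y": 13}
replica_b = {"x": 1, "y": 13}
ordered_writes = [{"x": 1}, {"x": 1}]   # two concurrent writes to x

for replica in (replica_a, replica_b):
    decisions = [certify(replica, w) for w in ordered_writes]
    print(decisions)
```

Because certification is deterministic and the input order is identical everywhere, the replicas do not need to exchange their decisions. Each one accepts the first write and aborts the second on its own.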
Hmm... 
Does it work?
It's a preview – there are limits 
General 
• InnoDB only 
• Corosync lacks uniform agreement 
• No rules to prevent split-brain (it's a preview, you're allowed to 
fool yourself if you misconfigure the GCS!) 
Isolation level 
• Primary Key based 
• Foreign Keys and Unique Keys not supported yet 
No concurrent DDL
That's it, folks! 
Questions?
The speaker says... 
(Oh, a question. Flips slide)
Network messages – pffft! 
MySQL super hero at Facebook 
@markcallaghan Sep 30 
For MySQL sync replication, when all commits originate from 1 master is 
there 1 network round trip or 2? http://mysqlhighavailability.com/mysql-group-replication-hello-world … 
@Ulf_Wendel 
@markcallaghan AFAIK, on the logical level, there should be one. Some 
of your questions might depend on the GCS used. The GCS is 
pluggable 
@markcallaghan 
@Ulf_Wendel @h_ingo Henrik tells me it is "certification based" so I 
remain confused
GCS != MySQL Semi-sync 
It's many round trips, how many depends on GCS 
• Default GCS is Corosync, Corosync is Totem Ring 
• Corosync uses a privilege-based approach for total ordering 
• Many options: fixed sequencer, moving sequencer, ... 
• Where you run your updates only impacts collision rate 
MySQL 
MySQL 
Corosync 
Corosync 
MySQL 
Corosync
The speaker says... 
No Mark, MySQL Group Replication cannot be understood 
as a replacement for MySQL Semi-sync Replication. The 
question about network round trips is hard to answer. Atomic 
Broadcast and Virtual Synchrony stack many subprotocols 
together. Let's consider a stable group, no network failure, 
Totem. Totem orders messages using a token that circulates 
along a virtual ring of all members. Whoever has the token 
has the privilege to broadcast. Others wait for the token to 
appear. Atomic Broadcast gives us all or nothing messaging. 
It takes at least another full round on the ring to be sure the 
broadcast has been received by all. How many round trips 
is that? Welcome to distributed systems...
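The privilege-based ordering described above can be illustrated with a toy simulation in Python. It is a heavily simplified sketch and not Corosync or Totem code: it assumes a stable group, no message loss, no retransmission and no acknowledgement round. The function name run_token_ring is hypothetical.

```python
from collections import deque


def run_token_ring(pending, rounds=1):
    """Circulate a token along a ring and order the pending messages.

    pending: list of deques, one per member, in ring order
    Returns (delivered, hops). delivered is a list of
    (sequence number, sending member, message) tuples; hops counts
    how often the token was passed on.
    """
    seq = 0          # the token carries the next sequence number
    hops = 0
    delivered = []
    for _ in range(rounds):
        for member, queue in enumerate(pending):
            # Only the token holder may broadcast. It stamps each
            # message with the next global sequence number.
            while queue:
                seq += 1
                delivered.append((seq, member, queue.popleft()))
            hops += 1    # pass the token to the next member
    return delivered, hops


pending = [deque(["a"]), deque(), deque(["b", "c"])]
delivered, hops = run_token_ring(pending)
for entry in delivered:
    print(entry)
print("token hops:", hops)
```

Even in this failure-free toy, a member may wait for almost a full circulation of the token before it is allowed to send. The extra round needed to confirm that all members received a broadcast comes on top of that and is not modelled here.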
THE END 
Contact: ulf.wendel@oracle.com
The speaker says... 
Thank you for your attendance! 
Upcoming shows: 
Talk&Show! - YourPlace, any time
MySQL Group Replication

  • 1. MySQL Group Replication: 'Synchronous', multi-master, auto-everything Ulf Wendel, MySQL/Oracle
  • 2. The speaker says... MySQL 5.7 introduces a new kind of replication: MySQL Group Replication. At the time of writing (10/2014) MySQL Group Replication is available as a preview release on labs.mysql.com. In common user terms it features (virtually) synchronous, multi-master, auto-everything replication.
  • 3. Proper wording... An eager update everywhere system based on the database state machine approach atop of a group communication system offering virtual synchrony and reliable total ordering messaging. MySQL Group Replication offers generalized snapshot isolation.
  • 4. The speaker says... And here is a more technical description....
  • 5. WHAT ?! Hmm, how does it compare?
  • 6. The speaker says... The technical description given for MySQL Group Replication may sound confusing because it has elements from the distributed systems and database systems theory. From around 1996 and 2006 the two research communities jointly formulated the replication method implemented by MySQL Group Replication. As a web developer or MySQL DBA you are not expected to know distributed systems theory inside out. Yet to understand the properties of MySQL Group Replication and to get most of it, we'll have to touch some of the concepts. Let's see first how the new stuff compares to the existing.
  • 7. Goals of distributed databases Availability • Cluster as a whole unaffected by loss of nodes Scalability • Geographic distribution • Scale size in terms of users and data • Database specific: read and/or write load Distribution Transparency • Access, Location, Migration, Relocation (while in use) • Replication • Concurrency, Failure
  • 8. The speaker says... MySQL Group Replication is about building a distributed database. To catalog it and compare it with the existing MySQL solutions in this area, we can ask what the goals of distributed databases are. The goals lead to some criteria that is used to give a first, brief overview. Goal: a distributed database cluster strives for maximum availability and scalability while maintaining distribution transparency. Criteria: availability, scalability, distribution transparency.
  • 9. MySQL clustering cheat sheet MySQL Replication MySQL Cluster MySQL Fabric Availability Primary = SpoF, no auto failover Shared nothing, auto failover SpoF monitored, auto failover Scalability Reads Partial replication, node limit Partial replication, no node limit Scale on WAN Asynchronous Synchronous (WAN option) Asynchronous (depends) Distribution Transparency R/W splitting SQL: yes (low level: no) Special clients No distributed queries
  • 10. The speaker says... Already today MySQL has three solutions to build a distributed MySQL cluster: MySQL Replication, MySQL Cluster and MySQL Fabric. Each system has different optimizations, none can achieve all the goals of a distributed cluster at once. Some goals are orthogonal. Take MySQL Cluster. MySQL Cluster is a shared nothing system. Data storage is reundant, nodes fail independently. Transparent sharding (partial replication) ensures read and write scalability until the maximum number of nodes is reached. Great for clients: any SQL node runs any SQL, synchronous updates become visible immediately everywhere. But, it won't scale on slow WAN connections.
  • 11. How Group Replication fits in Repl. Cluster Group Repl. Fabric Availability Shared nothing, auto failover Shared nothing, auto failover/join Scalability Partial replication, node limit Full replication, read and some write scalability Scale on WAN Synchronous (WAN option) (Virtually) Synchronous Distribution Transparenc y SQL: yes (low level: no) All nodes run all SQL
  • 12. The speaker says... MySQL Group Replication has many of the desireable properties of MySQL Cluster. Its strong on availability and client friendly due to the distribution transparency. No complex client or application logic is required to use the cluster. So, how do the two differ? Unlike MySQL Cluster, MySQL Group Replication supports the InnoDB storage engine. InnoDB is the dominant storage engine for web applications. This makes MySQL Group Replication a very attractive choice for small clusters (3-7 nodes) running Drupal, WordPress, … in LAN settings! Also, Group Replication is not synchronous in a technical way. For practical matters it is.
  • 13. Group Replication (vs. Cluster) Availability • Nodes fail independently • Cluster continues operation in case of node failures Scalability • Geographic distribution: n/a, needs fast messaging • All nodes accept writes, mild write scalability • All nodes accept reads, full read scalability Distribution Transparency • Full replication: all nodes have all the data • Fail stop model: developer free'd to worry about consistency
  • 14. The speaker says... Another major difference between MySQL Cluster and MySQL Group Replication is the use of partial replication versus full replication. MySQL Cluster has transparent sharding (partial replication) build-in. On the inside, on the level of so-called MySQL Cluster data nodes, not every node has all the data. Writes don't add work to all nodes of the cluster but only a subset of them. Partial replication is the only known solution to write scalability. With MySQL Group Replication all nodes have all the data. Writes can be executed concurrently on different nodes but each write must be coordinated with every other node. … time to dig deeper >:).
  • 16. A developers categorization... Where are transactions run? Primary Copy Update Everywhere When does synchronizatio n happen? Eager (MySQL semi-synch Replication) MySQL Cluster MySQL Group 3rd party: Galera Lazy MySQL Replication/Fabric 3rd party: Tungsten MySQL Cluster Replication
  • 17. The speaker says... I've described MySQL Group Replication as „ an eager update everywhere system“. The term comes from a categorization of different database replication systems by the two questions: - where can transaction every be run? - when are transactions synchronized between nodes? The answers to the questions tells a developer which challenges to expect. The answers determine which additional tasks an application must handle when its run on a cluster instead of a single server.
  • 18. Lazy causes work... 010101001011010 101010110100101 101010010101010 101010110101011 101010110111101 Set price = 1.23 Node price = 1.23 Node Node Node price = 1.00 price = 1.23 price = 0.98
  • 19. The speaker says... When you try to scale an application running it on a lazy (asynchronous) replication cluster instead of a single server you will soon have users complaining about outdated and „incorrect“ data. Depending which node the application connects to after a write, a user may or may not see his own updates. This can neither happen on a single server system nor on an eager (synchronous) replication cluster. Lazy replication causes extra work for the developer. BTW, have a look at PECL/mysqlnd_ms. It abstracts the problem of consistency for you. Things like read-your-writes boil down to a single function call.
  • 20. Primary Copy causes work... Primary Write Copy Copy Copy Read Read Read Read
  • 21. The speaker says... Judging from the developer perspective only, primary copy is an undesired replication solution. In a primary copy system only one node accepts writes. The other nodes copy the updates performed on the primary. Because of the read-write splitting, the replication system does not need to coordinate conflicting operations. Great for the replication system author, bad for the developer. As a developer you must ensure that all write operations are directed to the primary node... Again, have a look at PECL/mysqlnd_ms. MySQL Replication follows this approach. Worse, MySQL Replication is a lazy primary copy system.
  • 22. Love: Eager Update Everywhere Node Write Read price = 1.23 price = 1.23 price = 1.23 Node Node Write Read Write Read
  • 23. The speaker says... From a developer perspective an eager update anywhere system, like MySQL Group Replication, is indistinguishable from a single node. The only extra work it brings you is load balancing, but that is the case with any cluster. An eager update anywhere cluster improves distribution transparency and removes the risk of reading stale data. Transparency and flexibility is improved because any transaction can be directed to any replica. (Sometimes synchronization happens as part of the commit, thus strong consistency can be achieved.) Fault tolerance is better than with Primary Copy. There is no single point of failure – a single primary - that can cause a total outage of the cluster. Nodes may fail individually without bringing the cluster down immediately.
  • 24. HOW? Distributed + DB? Database state machine?
  • 25. The speaker says... In the mid-1990s two observations made the database and distributed systems theory communities wonder if they could develop a joint replication approach. First, Gray et al. (database community) showed that the common two-phase locking has an expected deadlock rate that grows with the third power of the number of replicas. Second, Schiper and Raynal noted that transactions have common properties with group communication principles (distributed systems) such as ordering, agreement/'all-or-nothing' and even durability.
  • 26. Three building blocks State machine replication • … trivial to understand Atomic Broadcast • … database meets distributed systems community • … OMG, how easy state machine replication is to implement! Deferred Update Database Replication • … database meets distributed systems community • … how we gain high availability and high performance • … what those MySQL Replication team blogs talk about ;-)
  • 27. The speaker says... Finally, in 1999 Pedone, Guerraoui and Schiper published the paper "The Database State Machine Approach". The paper combines two well known building blocks for replication with a messaging primitive common in the distributed systems world: atomic broadcast. MySQL Group Replication is slightly different from this 1999 version; it follows a later refinement from 2005 and adds a bit of ease-of-use. However, by the end of this chapter you will have learned how MySQL Cluster and MySQL Group Replication differ beyond InnoDB support and built-in sharding.
  • 28. State machine replication Input Set A = 1 Replica Replica Replica Output A = 1 A = 1 A = 1 Output Output
  • 29. The speaker says... The first building block is trivial: a state machine. A state machine takes some input and produces some output. Assume your state machines are deterministic. Then, if you have a set of replicas all running the same state machine and they all get the same input, they all will produce the same output. As an aside: state machine replication is also known as active replication. Active means that every replica executes all the operations, which adds compute load to every replica. With passive replication, also called primary-backup replication, one replica (primary) executes the operations and forwards the results to the others. Passive replication suffers from its dependence on primary availability and possibly on network bandwidth.
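To make the "same input, same output" claim concrete, here is a minimal sketch in Python. The `Replica` class and the `(variable, value)` operation format are assumptions made for illustration; they are not part of MySQL Group Replication.

```python
# Toy illustration only: three replicas running the same deterministic
# state machine. The Replica class and the op format are hypothetical.
class Replica:
    def __init__(self):
        self.state = {}

    def apply(self, op):
        # op is a (variable, value) pair, e.g. ("A", 1) for "Set A = 1"
        variable, value = op
        self.state[variable] = value
        return f"{variable} = {value}"


replicas = [Replica() for _ in range(3)]
inputs = [("A", 1)]

# Every replica receives the very same input sequence ...
outputs = [[replica.apply(op) for op in inputs] for replica in replicas]

# ... and, being deterministic, every replica produces the same output.
print(outputs)
```

Because `apply` depends on nothing but the current state and the operation, the replicas cannot diverge as long as they all see the same input.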
  • 30. Requirement: Agreement Input Set A = 1 Replica Replica Replica Output A = 1 A = NULL
  • 31. The speaker says... Here's more trivia about the state machine replication approach. There are two requirements for it to work. Quite obviously, every replica has to receive all input to come to the same output. And the precondition for receiving input is that the replica is still alive. In academic words the requirement is: agreement. Every non-faulty replica receives every request. Non-faulty replicas must agree on the input.
  • 32. Requirement: Order 1) Set A = 1 2) Set B = 1 3) Set B = A *2 Input: 1, 2, 3 Input: 1, 3, 2 Input: 3, 1, 2 Replica Replica Replica A = 1 A = 1 B = 2 B = 1 A = 1 B = 1
  • 33. The speaker says... The second trivial requirement for state machine replication is ordering. To produce the same output any two state machines must execute the very same input – including the ordering of input operations. The academic wording goes: if a replica processes request r1 before r2, then no replica processes request r2 before r1. Note that if operations commute, some reordering may still lead to correct output. The sequence A = 1, B = 1, B = A * 2 and the sequence B = 1, A = 1, B = A * 2 produce the same output. (Unrelated here: the database scaling talk touches the fancy commutative replicated data types Riak offers... hot!)
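The three operations from the slide can be replayed in different orders to see the effect. This is a small sketch; the `run` helper is hypothetical, and an unset variable is modelled as `None` (NULL), so `B = A * 2` yields `None` while `A` is still unset.

```python
# Toy replay of the slide's three operations in different orders.
def run(order):
    state = {"A": None, "B": None}
    operations = {
        1: lambda s: s.update(A=1),                                      # Set A = 1
        2: lambda s: s.update(B=1),                                      # Set B = 1
        3: lambda s: s.update(B=None if s["A"] is None else s["A"] * 2), # Set B = A * 2
    }
    for number in order:
        operations[number](state)
    return state


in_order = run([1, 2, 3])    # the intended order
swapped_tail = run([1, 3, 2])  # operations 2 and 3 swapped
three_first = run([3, 1, 2])   # operation 3 runs before A is set
commuted = run([2, 1, 3])      # operations 1 and 2 commute

print(in_order, swapped_tail, three_first, commuted)
```

Only the first and the last order agree on `B = 2`. The last one differs from the first only in the order of two operations that commute.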
  • 34. Atomic Broadcast Distributed systems messaging abstraction • Meets all replicated state machine requirements Agreement • If a site delivers a message m then every site delivers m Order • No two sites deliver any two messages in different orders Termination • If a site broadcasts message m and does not fail, then every site eventually delivers m • We need this in asynchronous environments
  • 35. The speaker says... State machine replication is the first building block for understanding the database state machine approach. The second building block is a messaging abstraction from the distributed systems world called atomic broadcast. Atomic broadcast provides all the properties required for state machine replication: agreement and ordering. It adds a property needed for communication in an asynchronous system, such as a system communicating via network messages: termination. All in all, this greatly simplifies state machine replication and contributes to a simple, layered design.
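One way to picture the ordering property is a fixed sequencer with a hold-back queue at every site. The sketch below is a toy under strong assumptions: the `Sequencer` and `Site` classes are hypothetical, nothing fails, and agreement and termination under failures are not modelled at all. It only shows how receipt order and delivery order can differ.

```python
import itertools


class Sequencer:
    """Hypothetical fixed sequencer: stamps each broadcast with a global number."""

    def __init__(self):
        self._counter = itertools.count(1)

    def stamp(self, message):
        return (next(self._counter), message)


class Site:
    """Receives stamped messages in any order, delivers them in stamp order."""

    def __init__(self):
        self.next_seq = 1
        self.holdback = {}   # seq -> message received but not yet deliverable
        self.delivered = []

    def receive(self, stamped):
        seq, message = stamped
        self.holdback[seq] = message
        # Deliver for as long as the next expected number is available.
        while self.next_seq in self.holdback:
            self.delivered.append(self.holdback.pop(self.next_seq))
            self.next_seq += 1


sequencer = Sequencer()
m1 = sequencer.stamp("Set A = 1")
m2 = sequencer.stamp("Set B = 1")
m3 = sequencer.stamp("Set B = A * 2")

site_a, site_b = Site(), Site()
for stamped in (m1, m2, m3):   # network keeps the order
    site_a.receive(stamped)
for stamped in (m3, m1, m2):   # network reorders the messages
    site_b.receive(stamped)

print(site_a.delivered == site_b.delivered)
```

The distinction between receiving and delivering a message is the point here: a site may hold a message back until everything ordered before it has been delivered.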
  • 36. Delivery, durability, group Client Replica Replica Replica Mr. X Replica Replica Replica Group Send first, possibly delivered second
  • 37. The speaker says... The Atomic broadcast properties given are literally copied from the original paper describing the database state machine replication approach. There are two things in it not yet explained. First, atomic broadcast defines properties in terms of message delivery. The delivery property not only ensures total ordering despite slow transport but also covers message loss (MySQL desires uniform agreement here, something better than Corosync) and even the crash and recovery of processors (durability)! A recovering processor must first deliver outstanding messages before it continues. Second, note that atomic broadcast introduces the notion of a group. Only (correct) members of a group can exchange messages.
  • 38. Deferred Update: the best? Client Client Replica Replica Replica Replica Replica Replica Client Request Server Coordination Execution Agreement Client Response
  • 39. The speaker says... We are almost there. The third building block to the database state machine replication is deferred update database replication. The slide shows a generic functional model used by Pedone and Schiper in 2010 to illustrate their choice of deferred update. The argument goes that deferred update combines the best of the two most prominent object replication techniques: active and passive replication. Only the combination of the best from the two will give both high availability and high performance. Translation: MySQL Group Replication can – in theory – have higher overall throughput than MySQL Replication. Do you love the theory ;-) ? As a DBA you should.
  • 40. Active Replication (SM) Replica Replica Replica Replica Replica Replica Client Client Client sends op to all Requests get ordered Execution All reply to client
  • 41. The speaker says... In an active replication system, a pure state machine replication system, the client operations are forwarded to all replicas and each replica individually executes the operation. The two challenges are to ensure all replicas execute requests in the same order and all replicas decide the same. Recall that we are talking about multi-threaded database servers here. A downside is that every replica has to execute the operation. If the operation is expensive in terms of CPU, this can be a waste of CPU time.
  • 42. Passive Replication Backup Primary Backup Replica Replica Replica Client Client Client sends op to primary Only primary executes Primary forwards changes Primary replies to client
  • 43. The speaker says... The alternative is passive replication or primary-backup replication. Here, the client talks to only one server, the primary. Only the primary server executes client operations. After computation of the result, the primary forwards the changes to the backups, which apply them. The problem here is that the primary determines the system's throughput. None of the backups can contribute its computing power to the overall system throughput.
  • 44. Multi-primary (pass.) replication What we want... • … for performance: more than one primary • … for scalability: no distributed locking • … and of course: transactions • Two-staged transaction protocol Client Primary Primary Primary Transaction processing Transaction termination
  • 45. The speaker says... Multi-primary (passive) replication has all the ingredients desired. Transaction processing is two staged. First, a client picks any replica to execute a transaction. This replica becomes the primary of the transaction. The transaction executes locally, the stage is called transaction processing. In the second stage, during transaction termination, the primaries jointly decide whether the transaction can commit or must abort. Because updates are not immediately applied, database folks call this deferred update – our last building block.
  • 46. Deferred Update DB Replication Deterministic certification • Reads execute locally, Updates get certified • Certification ensures transaction serializability • Replicas decide independently about certification result Read Primary Write Primary Primary Primary Rs/Ws/U
  • 47. The speaker says... One property of transactions is isolation. Isolation is also known as serializability: the concurrent execution of transactions should be equivalent to a serial execution of the same transactions. In a deferred update system, read transactions are processed and terminated on one replica and serialized locally. Updates must be certified. After the transaction processing, the readset, writeset and updates are sent to all other replicas. The servers then decide in a deterministic procedure whether (one-copy) serializability holds, that is, whether the transaction commits. Because it's a deterministic procedure, the servers can certify transactions independently!
  • 48. Options for termination Atomic Broadcast based • … this is what is used, by MySQL, by DBSM Optimization: Reordering (atop of Atomic Broadcast) • … in theory it means fewer transaction aborts Optimization limit: Generic Broadcast based • … this has issues, which make it nasty Atomic Commit based • … more transaction aborts than atomic broadcast
  • 49. The speaker says... There are several ways of implementing the termination protocol and the certification. There are two truly distinct choices: atomic broadcast and atomic commit. Atomic commit causes more transaction aborts than atomic broadcast. So, it's out and atomic broadcast remains. Atomic broadcast can – in theory – be further optimized towards fewer transaction aborts using reordering. For practical matters, this is about where the optimizations end. A weaker (and possibly faster) generic broadcast causes problems in the transactional model. For databases, it could be an over-optimization.
  • 50. Generic certification test Transactions have a state • Executing, Committing, Committed, Aborted Reads are handled locally Updates are sent to all replicas • Readset and writeset are forwarded On each replica: search for 'conflicting' transactions • Can be serialized with all previous transactions? Commit! • Commit? Abort local transactions that overlap with update
  • 51. The speaker says... No matter what termination procedure is used, the basic procedure for certification in the deferred update model is always the same. Updates/writes need certification. The data read and the data written by a transaction are forwarded to all other replicas. Every replica searches for potentially 'conflicting' transactions; the details depend on the termination procedure. A transaction is decided to commit if it does not violate serializability with all previous transactions. Any local transaction currently running and conflicting with the update is aborted.
  • 52. Database State Machine Deferred Update Database Replication as a state machine • Atomic Broadcast based termination Plugin Services MySQL Transaction hooks Plugins MySQL Group Replication Capture Apply Recover Replication Protocol incl. termination protocol/certifier Group Communication System
  • 53. The speaker says... The Database State Machine Approach combines all the bits and pieces. Let's do a bottom up summary. Atomic broadcast not only frees the database developer from bothering about networking APIs, it also solves the nasty bits of communicating in an asynchronous network. It provides properties that meet the requirements of the state machine replication. A deterministic state machine is what one needs to implement the termination protocol within deferred update replication. Deferred update replication does not use distributed locking, which Gray proved problematic, and it combines the best of active and passive replication. Side effects: simple replication protocol, layered code.
  • 54. The termination algorithm Updates are sent to all replicas • Readset and writeset are forwarded Step 1 - On each replica: certify • Is there any committed transaction that conflicts? (In the original paper: check for write-read conflicts between the committing transaction and committed transactions: does the committing transaction's readset overlap with any committed transaction's writeset? Works slightly differently in MySQL.) Step 2 – On each replica: commitment • Apply transactions decided to commit • Handle concurrent local transactions: remote wins
  • 55. The speaker says... The termination process has two logical steps, just like the general one presented earlier. The very details of how exactly two transactions are checked for conflicts in the first step don't matter here. MySQL Group Replication is using a refinement of the algorithm tailored to its own needs. As a developer all you need to know is: a remote transaction always wins no matter how expensive local transactions are. And, keep conflicting writes on one replica. It's faster. The puzzling bit on the slide is the rule to check a committing transaction against any committed transaction for conflicts. Any !? Not any... only concurrent.
  • 56. What's concurrent? Any other transaction that precedes the current one • Recall: total ordering • Recall: asynchronous, delay between broadcast and delivery Replica Replica Replica Replica Replica Broadcast Delivery 1 Total order 1 2 1 2 2 1 2
  • 57. The speaker says... The definition of what concurrent means is a bit tricky. It's defined through a negation, and that's confusing at first look but becomes – hopefully – clear on the next slide. Concurrent to a transaction is any other transaction that does precede it. If we know the order of all transactions – in the entire cluster –, then we can tell which transactions precede one another. Atomic broadcast ensures total order on delivery. Some implementations decide on ordering when sending and that number (logical clock) could be used. Any logical clock works.
  • 58. Certify against all previous? Replica Replica Replica Replica Replica Transaction(2) 2 Total order 3 Certification 2 2 3 4 3 4 4 Broadcast: Transaction 4 is based on all previous up to 2 Certification when 4 is delivered: Check conflicts with trx >2 and trx < 4
  • 59. The speaker says... The slide has an example of how to find any other transaction that precedes one. When a transaction enters the committing state and is broadcast, the broadcast includes the logical time (= total order number on the slide) of the latest transaction committed on the replica. Eventually the transaction is delivered on all sites. Upon delivery the certification considers all transactions that happened after the logical time of the to-be-certified transaction. All those transactions precede the one to be certified; they executed concurrently at different replicas. We don't have to look further in the past. Further in the past is stuff that's been decided on already.
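The certification window from the slide can be sketched as follows. This is a rough model of the readset/writeset test described for the original paper, not of what MySQL Group Replication does. The `Transaction` and `Certifier` names and the `based_on` field are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    name: str
    based_on: int          # total order number of the latest transaction
                           # committed at the origin when this one was broadcast
    readset: frozenset
    writeset: frozenset


class Certifier:
    """Hypothetical per-replica certifier fed by totally ordered delivery."""

    def __init__(self):
        self.order = 0        # position in the total order
        self.committed = []   # (order number, writeset) of committed transactions

    def deliver(self, trx):
        self.order += 1
        # Only look at transactions committed after the snapshot the
        # transaction is based on; older ones have been decided already.
        for number, writeset in self.committed:
            if number > trx.based_on and writeset & trx.readset:
                return False  # write-read conflict with a concurrent transaction
        self.committed.append((self.order, trx.writeset))
        return True


def fs(*items):
    return frozenset(items)


total_order = [
    Transaction("t1", 0, fs("x"), fs("x")),
    Transaction("t2", 0, fs("y"), fs("y")),
    Transaction("t3", 2, fs("z"), fs("x")),
    Transaction("t4", 2, fs("x"), fs("y")),  # read x, but t3 wrote x meanwhile
]

replica_a, replica_b = Certifier(), Certifier()
decisions_a = [replica_a.deliver(t) for t in total_order]
decisions_b = [replica_b.deliver(t) for t in total_order]
print(decisions_a, decisions_b)
```

Transaction `t4` is based on everything up to 2, so it is only checked against `t3`. Both replicas reach the same decisions without talking to each other, which is the benefit of a deterministic test on top of total order delivery.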
  • 60. TIME TO BREATHE MySQL is different anyway...
  • 61. The speaker says... Good news! The algorithm used by MySQL Group Replication is different and simpler. For correctness, the precedes relation is still relevant. But it comes for free...
  • 62. A developer's view on commit Replica Replica Replica Replica Replica BEGIN COMMIT Result t(3) 4 Certify 4 Certify Apply Client Execute
  • 63. The speaker says... We are not done with the theory yet but let's do some slides that take the developer's perspective. Assuming you have to scale a PHP application, assuming a small cluster of a handful of MySQL servers is enough and assuming these servers are co-located on racks, then MySQL Group Replication is your best possible choice. Did you get this from the theory? Replication is 'synchronous'. On commit you wait only for the server you are connected to. Once your transaction is broadcast, you are done. You don't wait for the other servers to execute the transaction. With uniform atomic broadcast, once your transaction is broadcast, it cannot get lost. (That's why I torture you with theory.)
  • 64. MySQL Replication Master Slave Replica Replica Fetch Replica BEGIN COMMIT OK Bin log etc. Apply Client execute
  • 65. The speaker says... If your network is slow, or if mother earth, the speed of light and network message round trip time add too much to your transaction execution time, then asynchronous MySQL Replication is a better choice. In MySQL Replication the master (primary) never waits for the network. Not even to broadcast updates. Slaves asynchronously pull changes. Besides pushing work onto the developer, this approach has the downside that a hardware crash on the master can cause transaction loss. Slaves may or may not have pulled the latest data.
  • 66. MySQL Semi-sync Replication Master Slave Replica Replica BEGIN COMMIT OK Wait for first ACK Fetch Replica Bin log Apply Client Execute Slave Fetch Apply Replica
  • 67. The speaker says... In the times of MySQL 5.0 the MySQL Community suggested that to avoid transaction loss the master should wait for one slave to acknowledge it has fetched the update from the master. The fact that it's fetched does not mean that it's been applied. The update may not be visible to clients yet. It is a back and forth whether database replication should be asynchronous or not. It depends on your needs. Back to theory after this break.
  • 68. Back to theory! Virtual Synchrony?
  • 69. Virtual Synchrony Groups and views • A turbo-charged version of Atomic Broadcast P1 P2 P3 P4 M1 M2 VC M3 M4 G1 = {P1, P2, P3} G2 = {P1, P2, P3, P4}
  • 70. The speaker says... Good news! Virtual Synchrony and Atomic Broadcast are the same. Our Atomic Broadcast definition assumes a static group. Adding group members, removing members or detecting failed ones is not covered. Virtual Synchrony handles all these membership changes. Whenever an existing group agrees on changes, a new view is installed through a view change (VC) event. (The term 'virtual': it's not synchronous. There is a delay: we don't want to wait, not even for short message delays. Yet, the system appears to be synchronous to most real life observers.)
  • 71. Virtual Synchrony View changes act as a message barrier • That's a case causing troubles in Two-Phase Commit P1 P2 P3 P4 M5 VC M6 M7 M8 G2 = {P1, P2, P3, P4} G3 = {P1, P2, P3}
  • 72. The speaker says... View changes are message barriers. If the group members suspect a member to have failed they install a new view. Maybe the former member was not dead but just too slow to respond, or disconnected for a brief period. False alarm. The former member then tries to broadcast some updates. Virtual Synchrony ensures that the updates will not be seen by the remaining members. Furthermore the former member will realize that it was excluded. Some GCS implementing virtual synchrony even provide abstractions that ensure a joining member learns all updates it missed (state transfer) before it rejoins.
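The barrier effect of a view change can be illustrated with a toy model in which every message carries the view it was sent in. The `Member` class and the message format are assumptions made for illustration. A real group communication system does much more, for example flushing outstanding messages of the old view before the new one is installed.

```python
class Member:
    """Hypothetical group member that tags messages with its current view."""

    def __init__(self, name, view_id, members):
        self.name = name
        self.view_id = view_id
        self.members = set(members)
        self.delivered = []

    def install_view(self, view_id, members):
        self.view_id = view_id
        self.members = set(members)

    def broadcast(self, payload):
        return {"sender": self.name, "view_id": self.view_id, "payload": payload}

    def receive(self, message):
        # Messages from another view, or from a non-member, are not delivered.
        if message["view_id"] != self.view_id or message["sender"] not in self.members:
            return False
        self.delivered.append(message["payload"])
        return True


group2 = ["P1", "P2", "P3", "P4"]
p1, p2, p4 = (Member(name, 2, group2) for name in ("P1", "P2", "P4"))

# P4 is suspected to have failed: the remaining members install view 3.
for member in (p1, p2):
    member.install_view(3, ["P1", "P2", "P3"])

late_update = p4.broadcast("M8")    # P4 was only slow and still thinks it is in view 2
fresh_update = p1.broadcast("M7")

seen_late = p1.receive(late_update)
seen_fresh = p2.receive(fresh_update)
print(seen_late, seen_fresh)
```

The late update from the excluded member is dropped by the remaining members, while traffic inside the new view flows as usual.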
  • 73. Auto-everything: failover MySQL Group Replication has a pluggable GCS API • Split brain handling? Depends on GCS and/or GCS config • Default GCS is Corosync MySQL MySQL MySQL MySQL MySQL MySQL
  • 74. The speaker says... Good news! The Virtual Synchrony group membership advantages are fully exposed to the user level: node failures are detected and handled automatically. PECL/mysqlnd_ms can help you with the client side. It's a minor tweak to have it automatically learn about the remaining MySQL servers. Expect an update release soon. MySQL Group Replication works with any Group Communication system that can be accessed from C and implements Virtual Synchrony. The default choice is Corosync. Split brain handling is GCS dependent. MySQL follows view change notifications of the GCS.
  • 75. Auto-everything: joining Elastic cluster grows and shrinks on demand • State transfer done via asynch replication channel MySQL MySQL MySQL MySQL MySQL MySQL Donor State transfer Joiner
  • 76. The speaker says... Good news! When adding a server you don't fiddle with the very details. You start the server, tell it to join the cluster and wait for it to catch up. The server picks a donor, begins fetching updates using much of the existing MySQL Replication code infrastructure and that's it.
  • 77. Back to theory! Generalized Snapshot Isolation
  • 78. Deferred Update tweak Transaction read set does not need to be broadcasted • Readset is hard to extract and can be huge • Weaker serializability level than 1SR • Sufficient for InnoDB default isolation Read Primary Write Primary Primary Primary V/Ws/U
  • 79. The speaker says... Good news! This is the last bit of theory. The original Database State Machine proposal was followed by a simpler to implement proposal in 2005. If the cluster's serialization level is marginally lowered to snapshot, certification becomes easier. Generalized snapshot isolation can be achieved without having to broadcast the readset of transactions. Recording the readset of a transaction is difficult in most existing databases. Also, readsets can be huge. Snapshot isolation is an isolation level for multi-version concurrency control. MVCC? InnoDB! Somehow... Whatever, this is the MySQL Group Replication termination base algorithm.
  • 80. Snapshot Isolation Concurrent and write conflict? First committer wins! • Reads use snapshot from the beginning of the transaction First committer Conflict (both change x) T1 T2 T1 T2 BEGIN(v1), W(v1, x=1), COMMIT!, x:v2=1 BEGIN(v1), W(v1, x=2), …, …, COMMIT? Concurrent write (version 1)
  • 81. The speaker says... In Snapshot Isolation, transactions take a snapshot when they begin. All reads return data from this snapshot. Although any other concurrent transaction may update the underlying data while the transaction still runs, the change is invisible; the transaction runs in isolation. If two concurrent transactions change the same data item they conflict. In case of conflicts, the first committer wins. MVCC requires that, as part of the update of a data item, its version is incremented. Future transactions will base their snapshot on the new version.
  • 82. The actual termination protocol Replica Replica Replica Replica Replica Write(v2, x=1) Certification Object Latest version x 1 y 13 OK
  • 83. The speaker says... Every replica checks the version of a write during certification. It compares the version number of the write's data item with the latest it knows of. If the version is higher than or equal to the one found in the replica's certification index, the write is accepted. A lower number indicates that someone has already updated the data item before. Because the first committer must win, a write showing a lower version number than is in the certification index must abort. (The certification index fills over time and is truncated periodically by MySQL. MySQL reports the size through Performance Schema tables.)
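The version check can be sketched in a few lines. This is a simplified model, assuming one integer version per data item that is incremented by one on every accepted write. The `CertificationIndex` class is hypothetical and says nothing about how MySQL actually encodes versions.

```python
class CertificationIndex:
    """Hypothetical per-replica index: data item -> latest known version."""

    def __init__(self, latest=None):
        self.latest = dict(latest or {})

    def certify(self, writes):
        # writes maps each written data item to the version the
        # transaction's snapshot was based on.
        for item, seen_version in writes.items():
            if seen_version < self.latest.get(item, 0):
                return False  # someone committed a newer version first
        for item, seen_version in writes.items():
            self.latest[item] = seen_version + 1  # assumed: update bumps the version
        return True


index = CertificationIndex({"x": 1, "y": 13})

t1_commits = index.certify({"x": 1})   # T1: W(v1, x=1), first committer
t2_commits = index.certify({"x": 1})   # T2: W(v1, x=2), based on a stale version
t3_commits = index.certify({"x": 2})   # later transaction based on the new version

print(t1_commits, t2_commits, t3_commits, index.latest)
```

T1 and T2 both started from version 1 of `x`. T1 is certified first and wins, T2 shows a lower version than the index and has to abort, matching the first committer wins rule.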
  • 84. Hmm... Does it work?
  • 85. It's a preview – there are limits General • InnoDB only • Corosync lacks uniform agreement • No rules to prevent split-brain (it's a preview, you're allowed to fool yourself if you misconfigure the GCS!) Isolation level • Primary Key based • Foreign Keys and Unique Keys not supported yet No concurrent DDL
  • 86. That's it, folks! Questions?
  • 87. The speaker says... (Oh, a question. Flips slide)
  • 88. Network messages – pffft! MySQL super hero at Facebook @markcallaghan Sep 30 For MySQL sync replication, when all commits originate from 1 master is there 1 network round trip or 2? http://mysqlhighavailability.com/mysql-group-replication-hello-world … @Ulf_Wendel @markcallaghan AFAIK, on the logical level, there should be one. Some of your questions might depend on the GCS used. The GCS is pluggable @markcallaghan @Ulf_Wendel @h_ingo Henrik tells me it is "certification based" so I remain confused
  • 89. GCS != MySQL Semi-sync It's many round trips, how many depends on GCS • Default GCS is Corosync, Corosync is Totem Ring • Corosync uses a privilege-based approach for total ordering • Many options: fixed sequencer, moving sequencer, ... • Where you run your updates only impacts collision rate MySQL MySQL Corosync Corosync MySQL Corosync
  • 90. The speaker says... No Mark, MySQL Group Replication cannot be understood as a replacement for MySQL Semi-sync Replication. The question about network round trips is hard to answer. Atomic Broadcast and Virtual Synchrony stack many subprotocols together. Let's consider a stable group, no network failure, Totem. Totem orders messages using a token that circulates along a virtual ring of all members. Whoever has the token has the privilege to broadcast. Others wait for the token to appear. Atomic Broadcast gives us all or nothing messaging. It takes at least another full round on the ring to be sure the broadcast has been received by all. How many round trips is that? Welcome to distributed systems...
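The privilege-based ordering idea can be pictured with a toy token ring. The sketch assumes a stable group, no failures and at most one broadcast per token visit. The `token_ring_order` function is hypothetical and is not a model of the real Totem protocol, which also handles acknowledgements, retransmission and membership.

```python
from collections import deque


def token_ring_order(pending):
    """Circulate a token along the ring; only the holder may broadcast.

    pending maps member name -> list of messages waiting to be sent.
    Returns the resulting total order and the number of token hops.
    """
    ring = list(pending)                       # ring order = insertion order
    queues = {member: deque(messages) for member, messages in pending.items()}
    token_seq = 0                              # sequence number carried by the token
    total_order = []
    hops = 0
    while any(queues.values()):
        holder = ring[hops % len(ring)]
        hops += 1
        if queues[holder]:
            token_seq += 1
            total_order.append((token_seq, holder, queues[holder].popleft()))
    return total_order, hops


total_order, hops = token_ring_order({
    "P1": ["a1", "a2"],
    "P2": [],
    "P3": ["c1"],
})
print(total_order, hops)
```

Member `P1` has to wait for the token to travel around the whole ring before it can send its second message. This is one reason why counting "round trips" per commit is not straightforward.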
  • 91. THE END Contact: ulf.wendel@oracle.com
  • 92. The speaker says... Thank you for your attendance! Upcoming shows: Talk&Show! - YourPlace, any time