DIY: A distributed database cluster, or: MySQL Cluster

MySQL Cluster talk
DIY
No Best Practices
No Product Presentation
… you have been warned.
No marketing fluff

Foreword and disclaimer
Do it yourself, become a maker, get famous!
In this course you will learn how to create an eager update
anywhere cluster. You need:
● A soldering iron, solder
● Wires (multiple colors recommended)
● A collection of computers
By the end of the talk you can either challenge MySQL, or
get MySQL Cluster for free – it's Open Source, as it has always been.
Get armed with the distributed system theory you, as a
developer, need to master any distributed database.

DIY – Distributed Database
Cluster, or: MySQL Cluster
Ulf Wendel, MySQL/Oracle
No marketing fluff

Live on stage:
Making a Cluster

The speaker says...
Beautiful work, but unfortunately the DIY troubles begin
before the first message has been delivered in our cluster.
Long before we can speak about the latest hat fashion, we
have to fix wiring and communication! Communication
should be:
• Fast
• Reliable (loss, retransmission, checksum, ordering)
• Secure
Network performance is a limiting factor for
distributed systems. Hmm, we better go back to the
drawing board before we mess up more computers...

Availability
• Cluster as a whole unaffected by loss of nodes
Scalability
• Geographic distribution
• Scale size in terms of users and data
• Database specific: read and/or write load
Distribution Transparency
• Access, Location, Migration, Relocation (while in use)
• Replication
• Concurrency, Failure
Back to the beginning: goals

The speaker says...
A distributed database cluster strives for maximum
availability and scalability while maintaining distribution
transparency.
MySQL Cluster has a shared-nothing design good enough
for 99.999% (five minutes downtime per year). It scales
from Raspberry Pi run in a briefcase to 1.2 billion write
transactions per minute on a 30 data node cluster (if using
possibly unsupported bleeding edge APIs.) It offers full
distribution transparency with the exception of partition
relocation to be triggered manually but performed
transparently by the cluster. That's to beat. Let's learn what
kind of clusters exist, how they tick and what the best
algorithms are.

What kind of cluster?
Columns: where are transactions run? Rows: when does synchronization happen?
      | Primary Copy                 | Update Anywhere
Eager | Not available for MySQL      | MySQL Cluster, 3rd party
Lazy  | MySQL Replication, 3rd party | MySQL Cluster Replication

The speaker says...
A wide range of clusters can be categorized by asking
where transactions are run and when replicas
synchronize their data. Any eager solution ensures that all
replicas are synchronized at any time: it offers strong
consistency. A transaction cannot commit before
synchronization is done. Please note what this means for
transaction rates:
• Single computer tx rate ~ disk/fsync rate
• Lazy cluster tx rate ~ disk/fsync rate
• Eager cluster tx rate ~ network round-trip time (RTT)
Test: Would you deploy MySQL Cluster on Amazon EC2 :-) ?
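The three rules of thumb can be turned into a back-of-the-envelope calculation. This is only a sketch: the latency figures are illustrative assumptions, not measurements, and the model assumes strictly serial commits that each wait for exactly one synchronization step.

```python
def max_serial_tx_rate(sync_latency_seconds: float) -> float:
    """Upper bound on strictly serial commits per second when every
    commit waits for one synchronization step of the given latency."""
    return 1.0 / sync_latency_seconds


# Assumed, illustrative latencies (not measured values).
FSYNC_LATENCY = 0.005   # 5 ms disk flush
LAN_RTT = 0.0002        # 0.2 ms round trip inside one data center
WAN_RTT = 0.050         # 50 ms round trip between distant sites

single_or_lazy = max_serial_tx_rate(FSYNC_LATENCY)
eager_lan = max_serial_tx_rate(LAN_RTT)
eager_wan = max_serial_tx_rate(WAN_RTT)

print(f"single node or lazy cluster: {single_or_lazy:.0f} tx/s")
print(f"eager cluster, fast LAN:     {eager_lan:.0f} tx/s")
print(f"eager cluster, slow WAN:     {eager_wan:.0f} tx/s")
```

Under these assumptions an eager cluster is fine on a fast local network and suffers badly once the round trip grows, which is the point of the EC2 question.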

Lazy Primary Copy we have...
[Figure: writes go to the Master (Primary); reads are served by the three Slaves (Copies)]
Lazy synchronization: eventual consistency
Primary Copy: where any transaction may run

The speaker says...
MySQL Replication falls into the category of lazy Primary
Copy clusters. It is a rather inflexible solution as all
updates must be sent to the primary. However, this
simplifies concurrency control of conflicting, concurrent
update transactions. Concurrency control is no different
from a single database.
Lazy replication can be fast. Transactions don't have to
wait for synchronization of replicas. The price of the fast
execution is the risk of stale reads and eventual
consistency. Transactions can be lost when the primary
crashes after commit and before any copy has been
updated. (This is something you can avoid by using MySQL
semi-sync replication, which delays the commit until delivery
to copy.)
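A toy model of the stale read described above. The class and its method names are made up for illustration and have nothing to do with how MySQL Replication is implemented; the sketch only shows that a commit on the primary returns before any copy has seen the write.

```python
class LazyPrimaryCopy:
    """Toy model: writes commit on the primary first and reach the copies later."""

    def __init__(self, copies: int) -> None:
        self.primary = {}
        self.copies = [{} for _ in range(copies)]
        self.pending = []  # committed on the primary, not yet shipped

    def write(self, key: str, value: int) -> None:
        # The commit returns immediately; no copy has been contacted yet.
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_copy(self, index: int, key: str):
        return self.copies[index].get(key)

    def synchronize(self) -> None:
        # Lazy synchronization: ship the backlog to all copies in commit order.
        for key, value in self.pending:
            for copy in self.copies:
                copy[key] = value
        self.pending.clear()


cluster = LazyPrimaryCopy(copies=3)
cluster.write("a", 1)
stale = cluster.read_from_copy(0, "a")   # the copy has not seen the write yet
cluster.synchronize()
fresh = cluster.read_from_copy(0, "a")
print(stale, fresh)
```

Whatever is still in `pending` when the primary crashes is exactly the set of transactions that can be lost.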

BTW, confusing: Multi-Master
Master (Primary)
Slave (Copy)
Master (Primary)
Slave (Copy)
SET A = 1 SET B = 1
A, B A, B

The speaker says...
Be careful with the term Multi-Master. The MySQL community
sometimes uses it to describe a set of Primary Copy
clusters where the primaries (masters) replicate from each
other. This is one of the many possible topologies that you
can build with MySQL Replication. In the example, the PC
cluster on the left manages table A and the PC cluster on
the right manages table B. The primaries copy table A and
table B, respectively, from each other. There is no
concurrency control and conflicts can arise. There is no
distribution transparency. This is not a kind of cluster of its
own with regard to our where and when criteria. And, it is
rarely what you want...
Not a good goal for DIY – let's move on.

Let's do Eager Update Anywhere
[Figure: four replicas; writes and reads can be sent to any replica]
Eager synchronization: strong consistency
Update Anywhere: any transaction can run on any replica

The speaker says...
An eager update anywhere cluster improves
distribution transparency and removes the risk of
reading stale data. Transparency and flexibility are improved
because any transaction can be directed to any
replica. Synchronization happens as part of the commit,
thus strong consistency is achieved. Remember:
transaction rate ~ network RTT. Failure tolerance is
better than with Primary Copy. There is no single point of
failure – the primary - that can cause a total outage of the
cluster. Nodes may fail without bringing the cluster down
immediately. Concurrency control (synchronization) is
complex as concurrent transactions from different replicas
may conflict.

Concurrency Control: 1SR
[Figure: at time t0 one replica runs SET a = 1 while another runs SET a = 2; the replicas see the two updates in different orders]
One-Copy-Serializability (1SR) for correctness
• All replicas must decide on the same transaction order

The speaker says...
Concurrent ACID transactions must be isolated from each
other to ensure correctness. The database system needs a
mechanism to detect conflicts. If any, transactions need to
be serialized. The challenge is to have all replicas commit
transactions in the same serial order. One-Copy-
Serializability (1SR) demands the concurrent
execution of transactions in a replicated database
to be equivalent to a serial execution of these
transactions over a single logical copy of the
database. 1SR is the highest level of consistency; lower
levels exist, for example snapshot isolation. Given that, the
questions are:
• How to detect conflicting transactions?
• How to enforce a global total order?

Certification: detect conflict
Replica
Update transaction
Replica
Read query
Replica
Read set: a = 1
Write set: b = 12
Transactions get executed and certified before commit
• Conflict detection is based on read and write sets
• Multi-Primary deferred update
Certification Certification

The speaker says...
(For brevity we discuss multi-primary deferred update only.)
In a multi-primary deferred update system a read
query can be served by a replica without consulting
any of the other replicas. A write transaction must be
certified by all other replicas before it can commit.
During the execution of the transaction, the replica records
all data items read and written. The read/write sets are then
forwarded by the replica to all other replicas to certify the
remote transaction. The other replicas check whether the
remote transaction includes data items modified by an
active local transaction. The outcome of the certification
decides on commit or abort. Either symmetric (statement
based) or asymmetric (row based) replication can be used.
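The certification check itself is little more than a set intersection. The sketch below assumes data items are identified by simple names; the `Transaction` class and the `certify` function are hypothetical helpers, not part of any MySQL API.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    name: str
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)


def certify(remote: Transaction, active_local: list) -> bool:
    """Vote to commit only if no active local transaction has modified
    a data item that the remote transaction has read or written."""
    touched = remote.read_set | remote.write_set
    return all(touched.isdisjoint(local.write_set) for local in active_local)


# The remote transaction from the slide: it read a and wants to write b.
remote = Transaction("T1", read_set={"a"}, write_set={"b"})
harmless = Transaction("T2", read_set={"b"}, write_set={"c"})
conflicting = Transaction("T3", write_set={"a"})

vote_ok = certify(remote, [harmless])
vote_abort = certify(remote, [harmless, conflicting])
print(vote_ok, vote_abort)
```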

Concurrency Control
[Figure: at time t0 one replica runs SET a = 1 while another runs SET a = 2; the replicas see the two updates in different orders]
Various synchronization mechanisms
• Atomic commit
• Atomic broadcast
• Strict two-phase locking (2PL)
• Optimistic, Physical clock, Lamport's clock, vector clock...

The speaker says...
One challenge remains: replicas must agree on a global
total order for committing transactions no matter in
which order they receive messages.
We will discuss atomic commit (two-phase commit) and
atomic broadcast. The other approaches are out of scope.

Atomic commit for CC
States: Execute → Committing → PreCommit → Committed (or Aborted)
Formula (background): serial execution, unnecessary
aborts

The speaker says...
Atomic commit can be expressed as a state machine with
the final states abort and commit. Once a transaction has
been executed, it enters the committing state in which
certification/voting takes place. Given the absence of
conflicting concurrent transactions, a replica sets the
transaction's status to precommit. If all replicas precommit,
the transaction is committed, otherwise it is aborted.
Don't worry about the formula. It checks for concurrent
transactions – as we did before – and ensures, in case of
conflicts, that only one transaction can commit at a time.
Problem: it may also do unnecessary aborts
depending on message delivery order as it requires all
servers to precommit->commit in the same order.

Atomic broadcast for CC
Atomic broadcast guarantees
• Agreement: if one server delivers a message, all will
• Total order: all servers deliver messages in the same order
Greatly simplified concurrency check
• Deterministic: no extra communication after local decision

The speaker says...
Atomic broadcast ensures that transactions are delivered in
the same order to all replicas. Thus, certification of
transactions is deterministic: all replicas will make the same
decision about commit or abort because they all base their
decision on the same facts. This in turn means that there is
no need to coordinate the decisions of all replicas – all
replicas will make the same decision.
A transaction does not conflict and thus will commit, if it is
executed after the commit of any other transaction, or its
read set does not overlap with the write set of any other
transaction. The formula is greatly simplified! Great for DIY!
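A sketch of the simplified check, assuming every transaction carries the commit position it started from plus its read and write sets. The class name and the tuple layout are invented for this example. Every replica runs the same code on the same ordered stream, so no votes have to be exchanged.

```python
class CertifyingReplica:
    """Certifies transactions in delivery order against already committed ones."""

    def __init__(self) -> None:
        self.committed = []   # list of (commit position, write set)
        self.position = 0     # number of transactions committed so far

    def deliver(self, start_position: int, read_set: set, write_set: set) -> bool:
        # Conflict: a transaction committed after this one started
        # and wrote an item this one has read.
        for commit_position, committed_writes in self.committed:
            if commit_position > start_position and committed_writes & read_set:
                return False  # abort, and every replica decides the same
        self.position += 1
        self.committed.append((self.position, frozenset(write_set)))
        return True


# The same totally ordered stream, as atomic broadcast would deliver it.
stream = [
    (0, {"a"}, {"a"}),   # T1: started at position 0, reads and writes a
    (0, {"a"}, {"b"}),   # T2: concurrent with T1, has read an outdated a
    (1, {"a"}, {"c"}),   # T3: started after T1 committed
]

replicas = [CertifyingReplica() for _ in range(3)]
decisions = [[replica.deliver(*tx) for tx in stream] for replica in replicas]
print(decisions)
```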

Voting quorum: ROWA, or...?
Read-One Write-All is a special quorum
• Quorum constraints: N_R + N_W > N, N_W > N/2
Example: N = 12, read quorum N_R = 3, write quorum N_W = 10
Example: N = 3, read quorum N_R = 2, write quorum N_W = 2
[Figure: one group of twelve replicas and one group of three replicas]

The speaker says...
So far we have silently assumed a Read-One Write-All
(ROWA) quorum for voting. Reads could be served locally
because updates have been applied to all replicas.
Alternatively, we could make a rule that an update has to be
agreed by and applied to half of the replicas plus one. This
may be faster than achieving agreement among all replicas.
However, for a correct read we now have to contact half of
the replicas plus one and check whether they all give the
same reply. If so, we must have read the latest version as
the remaining, unchecked replicas form a minority that
cannot be updated. The read quorum overlaps the write
quorum by at least one element.
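The two quorum constraints from the slide are easy to check mechanically. The function names below are made up for this sketch; the numbers are the examples from the slide plus one invalid configuration for contrast.

```python
def is_valid_quorum(n: int, read_quorum: int, write_quorum: int) -> bool:
    """Check N_R + N_W > N (every read overlaps the latest write) and
    N_W > N/2 (two writes cannot succeed on disjoint sets of replicas)."""
    return read_quorum + write_quorum > n and 2 * write_quorum > n


def rowa(n: int) -> tuple:
    """Read-One Write-All expressed as a quorum: N_R = 1, N_W = N."""
    return (1, n)


checks = {
    "N=12, NR=3, NW=10": is_valid_quorum(12, 3, 10),
    "N=3, NR=2, NW=2": is_valid_quorum(3, 2, 2),
    "ROWA, N=5": is_valid_quorum(5, *rowa(5)),
    "N=4, NR=2, NW=2": is_valid_quorum(4, 2, 2),
}
for label, valid in checks.items():
    print(label, valid)
```

The last configuration fails because two halves of two replicas each could accept different writes without ever overlapping.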

Voting quorum: ROWA!
ROWA almost always performs better
• Are Quorums an Alternative for Data Replication?
(Jimenez-Peris et al.)
• „The obvious conclusion from these results is that ROWAA is
the best choice for a wide range of application scenarios. It offers
good scalability (within the limitations of replication protocols),
very good availability, and an acceptable communication
overhead. It also has the significant advantage of being very
simple to implement and very easy to adapt to configuration
changes. For very peculiar loads and configurations, it is possible
that some variation of quorum does better than ROWAA.“
• Background: scale out results from study

The speaker says...
Judging from the paper, ROWA, or rather Read-One
Write-All-Available (ROWAA), is a promising
approach. For example, it offers linear scalability for
read-only workloads but still remains competitive for mixed
update and read loads. It requires a high write-to-read ratio
before the various Quorum algorithms outperform ROWA on
scalability. In sum: ROWA beats Quorums by an order of
magnitude for reads but does not drop by an order of
magnitude for writes, and the web is read dominated.
Scalability is one aspect. Quorums also help with
availability – the study's finding is similar: ROWA is fine.
DIY decision on concurrency control: ROWA, atomic broadcast.
Quiz: name a system using Quorums? Riak! Next:
Availability and Fault Tolerance.

Complex failure handling required
• Later evolution: Three-Phase Commit (3PC)
Fault Tolerance: 2PC
Coordinator Participant Participant
Vote Request
PreCommit
PreCommit
Vote Request
Global Commit
Commit

The speaker says...
When discussing atomic commit we have effectively shown
the Two-Phase Commit (2PC) protocol. 2PC starts with a
vote request multicast from a coordinator to all
participants. The participants either vote to commit
(precommit) or abort. Then, the coordinator checks the
voting result. If all voted to commit, it sends a global
commit message and the participants commit. Otherwise
the coordinator sends a global abort command. Various
issues may arise in case of network or process
failures. Some cannot be cured using timeouts. For
example, consider the situation when a participant
precommits but gets no global commit or global abort. The
participant cannot unilaterally leave the state. At best, it can
ask another participant what to do.

Two-Phase Commit is a blocking protocol
Fault Tolerance: 2PC
Coordinator Participant Participant
Vote Request
PreCommit
PreCommit
Vote Request

The speaker says...
The worst case scenario is a crash of the coordinator after
all participants have voted to precommit. The participants
cannot leave the precommit state before the coordinator has
recovered. They do not know whether all of them have
voted to commit or not. Thus, they do not know whether a
global commit or global abort has to be performed.
As none of them has received a message about the outcome
of the voting, the participants cannot contact one another
and ask for the outcome.
Two-Phase Commit is also known as a blocking
protocol.
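The protocol and its blocking case fit into a short simulation. This is a sketch under simplifying assumptions: messages are plain method calls, nothing is ever lost, and the coordinator crash is modeled by a flag. All names are invented for the example.

```python
from enum import Enum


class State(Enum):
    INIT = "init"
    PRECOMMIT = "precommit"
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    def __init__(self, will_vote_commit: bool = True) -> None:
        self.will_vote_commit = will_vote_commit
        self.state = State.INIT

    def on_vote_request(self) -> bool:
        self.state = State.PRECOMMIT if self.will_vote_commit else State.ABORTED
        return self.will_vote_commit

    def on_decision(self, commit: bool) -> None:
        # A participant that voted to abort stays aborted.
        if self.state is State.PRECOMMIT:
            self.state = State.COMMITTED if commit else State.ABORTED


def two_phase_commit(participants, coordinator_crashes_after_voting=False):
    votes = [p.on_vote_request() for p in participants]   # phase 1: voting
    if coordinator_crashes_after_voting:
        return None                                       # no decision is ever sent
    decision = all(votes)
    for p in participants:                                # phase 2: global decision
        p.on_decision(decision)
    return decision


happy = [Participant(), Participant()]
happy_result = two_phase_commit(happy)

mixed = [Participant(), Participant(will_vote_commit=False)]
mixed_result = two_phase_commit(mixed)

blocked = [Participant(), Participant()]
blocked_result = two_phase_commit(blocked, coordinator_crashes_after_voting=True)
print([p.state.value for p in blocked])
```

In the last run both participants are left in the precommit state: neither knows how the other voted, so neither may commit or abort on its own.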

Reliable multicast/broadcast
• Build on the idea of group views and view changes
Virtual Synchrony
P1
P2
P3
P4
M1
M2
VC
M3
M4
G1 = {P1, P2, P3} G2 = {P1, P2, P3, P4}

The speaker says...
Virtual Synchrony is a mechanism that does not block. It is
built around the idea of associating multicast messages with
the notion of a group. A message is delivered to all
members of a group but no other processes. Either the
message is delivered to all members of a group or to none
of them. All members of the group agree that they are part
of the group before the message is multicast (group
view). In the example, M1 to M3 are associated with the group
G1 = {P1, P2, P3}. If a process wants to join or leave a
group a view change message is multicast. In the
example, P4 wants to join the group and a VC message is
sent while M3 is still being delivered. Virtual Synchrony
requires that either M3 is delivered to all of G1 before the
view change takes place or to none.

View changes act as a message barrier
• Remember the issues with 2PC …?
Virtual Synchrony
P1
P2
P3
P4
M5
VC
M6
G2 = {P1, P2, P3, P4} G3 = {P1, P2, P3}
M7
M8

The speaker says...
There is only one condition under which a multicast
message is allowed not to be delivered: if the sender
crashed. Assume the processes continue working and
multicast messages M5, M6, M7 to group G2 = {P1, P2, P3,
P4}. While P4 sends M7 it crashes. P4 has managed to
deliver its message to {P3}. The crash of P4 is noticed and a
view change is triggered. Because Virtual Synchrony
requires a message to be delivered to all members of the
group associated with it but the sender crashed, P3 is free
to drop M7 and the view change can take place.
A new group view G3 is established and messages can be
exchanged again.
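The all-or-nothing rule at a view change can be written down as a small decision function. This is a strongly simplified sketch: the function name, the "wait" outcome for a sender that is still alive, and the choice of P1 as the sender of M3 are assumptions made for illustration, not details taken from Isis.

```python
def fate_at_view_change(view: set, received_by: set, sender: str,
                        sender_crashed: bool) -> str:
    """Decide what happens to an in-flight multicast when the view changes."""
    receivers = view - {sender}
    if receivers <= received_by:
        return "deliver"   # everyone has it: deliver before the view change
    if sender_crashed:
        return "drop"      # partial delivery by a crashed sender is discarded
    return "wait"          # sender alive: the view change waits for delivery


G1 = {"P1", "P2", "P3"}
G2 = {"P1", "P2", "P3", "P4"}

# M3 reaches all of G1 before P4 joins (sender P1 is an assumption).
m3 = fate_at_view_change(G1, received_by={"P2", "P3"}, sender="P1",
                         sender_crashed=False)

# M7 from the crashed P4 has reached P3 only.
m7 = fate_at_view_change(G2, received_by={"P3"}, sender="P4",
                         sender_crashed=True)
print(m3, m7)
```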

Wire: message ordering and fault tolerance
• Common choices: UDP or TCP over IP
Reliable, delivered vs. received
[Figure: two replicas send Update 1 and Update 2; one receiver sees Update 1 at t1 and Update 2 at t2, another sees Update 2 at t1 and Update 1 (lost) at t2]

The speaker says...
Virtual Synchrony offers reliable multicast. Reliability can be
best achieved using a protocol higher up on the OSI model.
Isis, an early framework implementing Virtual Synchrony,
has used TCP point to point connections if reliable service
was requested. TCP is a connection oriented protocol
(endpoint failures can be detected easily) with error handling
and message delivery in the order sent. However, with
TCP alone there are no ordering constraints between
messages from any two senders. Those ordering
constraints have to be implemented at the application layer.
We say a message can be received on the network layer
in a different order than it is delivered to the application
by the model discussed. Vector clocks can be used for
global total ordering.
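The difference between received and delivered is easiest to see with a hold-back queue. The sketch assumes that every message already carries a global sequence number; how that number is agreed on (sequencer, logical clocks, ...) is left out, and the class name is made up.

```python
class OrderedDelivery:
    """Receives numbered messages in any order, delivers them in sequence."""

    def __init__(self) -> None:
        self.next_expected = 1
        self.held_back = {}    # sequence number -> payload, received only
        self.delivered = []    # payloads handed to the application

    def receive(self, sequence_number: int, payload: str) -> None:
        self.held_back[sequence_number] = payload
        # Deliver as long as the next message in the agreed order is available.
        while self.next_expected in self.held_back:
            self.delivered.append(self.held_back.pop(self.next_expected))
            self.next_expected += 1


replica = OrderedDelivery()
replica.receive(2, "Update 2")          # received first, but held back
after_first = list(replica.delivered)
replica.receive(1, "Update 1")          # gap closed, both are delivered
after_second = list(replica.delivered)
print(after_first, after_second)
```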

AB = Virtual Synchrony offering total-order delivery
• „Synchrony“ does not refer to temporal aspects
Atomic broadcast definition
P1
P2
P3
P4
M1
M2
Unordered delivery Ordered delivery
P1
P2
P3
P4
M1
M2

The speaker says...
Atomic broadcast means Virtual Synchrony used with total-
order message ordering. When Virtual Synchrony was
introduced back in the mid 80s, it was explicitly designed to
allow other message orderings. For example, it should be
able to support distributed applications that have a notion of
finding messages that commute, and thus may be applied in
an order different from the order sent to improve
performance. If events are applied in different order on
different processes, the system cannot be called
synchronous any more – the inventors called it virtually
synchronous.
However, recall we are only after total-ordering for 1SR.

Wash the brain without marketing fluff, split brain, done!
• System dependent... E.g. Isis failure detector was very basic
How to cook brains
P1
P2
P3
P4
M1
M2
n1({P1, P2, P3, P4}) = 4
VC
Split brain – Connection lost
n2({P1, P2}) = 2 < (n1/2 + 1) = 3

The speaker says...
The failure of individual processes – or database replicas –
has been discussed. The model handles them following a
fail-stop approach.
To conclude the discussion of fault tolerance we look at a
situation called split brain: one half of the cluster has lost
its connection to the other half. Which shall survive? The
answer is often implementation dependent. For
example, the early Virtual Synchrony framework Isis has a
rule that a new group view can only be installed if it
contains n / 2 + 1 members with n being the number of
members in the current group. In the example both halves
would shut down. Brain splitting question: how many
replicas would you project for a cluster if you don't know
split brain implementation details?
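The majority rule described for Isis can be tried out for different cluster sizes. This is a minimal sketch of the rule as stated in the notes, with integer division assumed for n / 2; the function name is hypothetical.

```python
def can_install_view(new_view_size: int, current_view_size: int) -> bool:
    """A new group view needs at least n / 2 + 1 members,
    with n being the size of the current group (integer division)."""
    return new_view_size >= current_view_size // 2 + 1


# Four replicas split into two halves of two: neither half may continue.
even_split = [can_install_view(2, 4), can_install_view(2, 4)]

# Five replicas split into three and two: the larger part survives.
odd_split = [can_install_view(3, 5), can_install_view(2, 5)]
print(even_split, odd_split)
```

Under this rule an odd number of replicas guarantees that at most one side of a two-way split keeps running, and that one side always does.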

In-core architecture
DIY: Hack MySQL (oh, oh), or...?
MySQL DBMS MySQL DBMS
Load Balancer
PECL/mysqlnd_ms MySQL Proxy
PHP PHP PHP
Reflector Reflector
Replicator Replicator
GCS

The speaker says...
Here's a generic architecture made of six components:
• Clients (PHP, Java, …) using well known interfaces
• Load Balancer (for example PECL/mysqlnd_ms)
• The actual database system
• The reflector allows inspection and modification of
ongoing transactions
• The (distributed) replicator handling concurrency
control
• The Group Communication System (GCS) provides
communication primitives such as multicast (GCS
examples: Appia, JGroups – Java, Spread – C/C++)

Middleware architecture
Virtual DBMS Virtual DBMS
Load Balancer
Clients
Reflector Reflector
GCS
DBMS DBMS

The speaker says...
An in-core design requires support for a reflector by the
database. Strictly speaking there is no API inside MySQL one
can use. The APIs used for MySQL Replication are not
sufficient. Nonetheless, MySQL Replication can be
classified as in-core in our model. Due to the lack of a
reflector API, the only third party product following an
in-core design (Galera by Codership) has to patch the
MySQL core.
Tungsten Replicator by Continuent is a Middleware
design. Clients contact a virtual database. Requests are
intercepted, parsed and replicated. The challenge is in the
interception: statements using non-deterministic calls such
as NOW() and TIME() must be taken care of.
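One way a middleware could take care of such calls is to replace them with a literal before the statement is sent to the replicas. The sketch below is deliberately naive and is not how Tungsten works: it handles NOW() only, uses a regular expression instead of a SQL parser, and would also rewrite a NOW() that appears inside a string literal.

```python
import re
from datetime import datetime

NOW_CALL = re.compile(r"\bNOW\(\)", re.IGNORECASE)


def make_deterministic(statement: str, timestamp: datetime) -> str:
    """Replace every NOW() call with one fixed timestamp literal so that
    all replicas execute exactly the same statement."""
    literal = timestamp.strftime("'%Y-%m-%d %H:%M:%S'")
    return NOW_CALL.sub(literal, statement)


fixed_time = datetime(2013, 1, 2, 3, 4, 5)
rewritten = make_deterministic(
    "INSERT INTO log(created, updated) VALUES (NOW(), now())", fixed_time
)
print(rewritten)
```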

Hybrid architecture
DBMS DBMS
Load Balancer
Clients
Reflector Plugin Reflector Plugin
GCS

The speaker says...
In a hybrid architecture the reflector runs within the
database process but the replicator layer is using extra
processes.
It is not a perfect comparison as we will see later but for
the sake of our model, we can classify MySQL Cluster as a
hybrid architecture. The reflector is implemented as a
storage engine. The replicator layer is using extra processes.
This design has some neat MySQL NDB Cluster specific
benefits. If any MySQL product has NoSQL genes, it is
MySQL Cluster.

DIY: Summary
      | Primary Copy                                      | Update Anywhere
Eager | Not available for MySQL                           | MySQL Cluster (Hybrid), Galera (In-core)
Lazy  | MySQL Replication (In-core), Tungsten (Middleware) | MySQL Cluster Replication (Hybrid)

The speaker says...
Time for a summary before coding ants and compilers start
their work. From a DIY perspective we can skip Lazy
Primary Copy: it has simple concurrency control, it
does not depend on network speed, it is great for flaky
and slow WAN connections but it offers eventual
consistency only (hint: enjoy PECL/mysqlnd_ms!), it has
no means to scale writes. And, it exists – no karma...
An eager update anywhere solution offering the highest
level of correctness (1SR) gives you strong consistency. It
scales writes to some degree because they can be
executed on any replica, which parallelizes execution load.
Commit performance is network bound.

DIY: The Master Class
[Figure: read and write scale-out capability under full replication vs. partial replication]
• MySQL Replication (Lazy Primary Copy, In-core)
• MySQL Cluster (Eager Update Anywhere, Hybrid)
• Tungsten (Primary Copy, Middleware)
• Galera (Eager Update Anywhere, In-core)
• If 1SR: hard limit

The speaker says...
The DIY Master Class for maximum karma is a partial
replication solution offering strong consistency. Partial
replication is the only way to ultimately scale write
requests. The explanation is simple: every write adds load
to the entire cluster. Remember that writes need to be
coordinated, remember that concurrency control involves all
replicas (ROWA) or a good number of them (Quorum).
Thus, every additional replica adds load to all others. The
solution is to partition the data set and keep each partition
on a subset of all replicas only. NoSQL calls it sharding,
MySQL Cluster calls it partitioning. Partial replication –
that's the DIY masterpiece, that will give you KARMA.
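The load argument can be put into numbers. The sketch assumes a uniform distribution of writes, a made-up workload of 10,000 client writes, and two copies per data item in the partial case; it counts only how often a write has to be applied, not the coordination messages.

```python
def writes_applied_per_node(client_writes: int, nodes: int,
                            copies_per_item: int) -> float:
    """Average number of write applications per node when every client
    write has to be applied on copies_per_item nodes."""
    return client_writes * copies_per_item / nodes


CLIENT_WRITES = 10_000   # assumed workload
SIZES = (2, 4, 8)

# Full replication: every node holds a copy of everything.
full = {n: writes_applied_per_node(CLIENT_WRITES, n, copies_per_item=n)
        for n in SIZES}

# Partial replication: every item lives on two nodes only.
partial = {n: writes_applied_per_node(CLIENT_WRITES, n, copies_per_item=2)
           for n in SIZES}

print("full:   ", full)
print("partial:", partial)
```

With full replication the per-node write load stays the same no matter how many nodes are added; with partial replication it shrinks as the cluster grows.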

Availability
• Shared-nothing, High Availability (99.999%)
• WAN Replication to secondary data centers
Scalability
• Read and write through partial replication (partitioning)
• Distributed queries (parallelize work), real-time guarantees
• Focus In-Memory with disk storage extension
• Sophisticated thread model for multi-core CPU
• Optimized for short transactions (hundreds of operations)
Distribution Transparency
• SQL level: 100%, low-level interfaces available
MySQL (NDB) Cluster goals

The speaker says...
I am not aware of text books discussing partial
replication theory in-depth. Thus, we have to reverse
engineer an existing system. As this is a talk about
MySQL Cluster, how about talking about MySQL Cluster
finally?
MySQL Cluster was originally developed to serve
telecommunication systems. It aims to parallelize work as
much as possible, hence it is a distributed database. It
started as an in-memory solution but can now store data on
disk as well. It runs best in environments offering low
network latency, high network throughput and issuing short
transactions. Applications should not demand complex joins.
There is no chance you throw Drupal at it and Drupal runs
super-fast out of the box! Let's see why...

SQL view: Cluster is yet another table storage engine
MySQL Cluster is a hybrid
MySQL MySQL
Load Balancer
Clients
Reflector Plugin = NDB Storage Engine
Replicator = NDB Data Node
GCS

The speaker says...
MySQL Cluster has a hybrid architecture. It consists of the
green elements on the slide. The Reflector is
implemented as a MySQL storage engine. From a SQL
user's perspective, it is just another storage engine, similar
to MyISAM, InnoDB or others (Distribution Transparency).
From a SQL client perspective there is no change: all MySQL
APIs can be used. The Reflector (NDB Storage Engine) runs
as part of the MySQL process. The Replicator is a
separate process called NDB data node. Please note,
node means process not machine. MySQL Cluster does not
fit perfectly in the model: an NDB data node combines
Replicator and storage tasks.
BTW, what happens to Cluster if a MySQL Server fails?

Fast low-level access: bypassing the SQL layer
MySQL Cluster is a beast
MySQL MySQL
Load Balancer
Clients
Reflector Plugin = NDB Storage Engine
Replicator = NDB Data Node
GCS
Clients
4.3b read tx/min
1.2b write tx/min
(in 2012)

The speaker says...
From the perspective of MySQL Cluster, a MySQL Server is
yet another application client. MySQL Server happens to be
an application that implements a SQL view on the relational
data stored inside the cluster.
MySQL Cluster users often bypass the SQL layer by
implementing application clients on their own. SQL is a rich
query language but parsing a SQL query can take 30...50%
of the total runtime of a query. Thus, bypassing is a good
idea. The top benchmark results we show for Cluster are
achieved using C/C++ clients directly accessing MySQL
Cluster. There are many extra APIs for this special case:
NDB API (C/C++, low level), ClusterJ (ORM style),
ClusterJPA (low level), … - even for node.js (ORM style)

Partitioning (auto-sharding)
NDB Data Node 1 NDB Data Node 2
Partition 0, Primary
Partition 2, Copy
Partition 0, Copy
Partition 1, Primary Partition 1, Copy
Partition 3, Copy Partition 3, Primary
Node Group 1
Node Group 0

The speaker says...
There is a lot to say about how MySQL Cluster partitions a
table and spreads it over nodes. The manual has all details,
just all...
The key idea is to use an eager primary copy approach for
partitions combined with a mindful distribution of each
partition's primary and its copies. NDB supports zero or one
copies (replication factor). The failure of a partition's primary
does not cause a failure of the Cluster. In the example, the
failure of any one node has no impact. Also, node 1 and 4
may fail without a stop of the Cluster (fail stop model). But
the cluster shuts down if all nodes of a node group fail.
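The survival rule from the last sentence can be checked with a one-liner. The four-node layout with two node groups is a hypothetical example and the function name is made up; the sketch ignores arbitration and only encodes "at least one live node per node group".

```python
def cluster_survives(node_groups: dict, failed_nodes: set) -> bool:
    """The cluster stays up while every node group has at least one live node."""
    return all(set(members) - failed_nodes for members in node_groups.values())


# Hypothetical layout: node group 0 = nodes 1 and 2, node group 1 = nodes 3 and 4.
groups = {0: [1, 2], 1: [3, 4]}

one_node_down = cluster_survives(groups, {1})
one_per_group_down = cluster_survives(groups, {1, 4})
whole_group_down = cluster_survives(groups, {1, 2})
print(one_node_down, one_per_group_down, whole_group_down)
```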

Concurrency Control: 2PL, "2PC"
Partition 2, Copy
Partition 0, Copy
W
R
R

The speaker says...
Buuuuh? Two-Phase-Locking (2PL) and Two-Phase-Commit
(2PC) are used for concurrency control. Cluster is using
traditional row locking to isolate transactions. Read and
write locks can be distributed throughout the cluster. The
locks are set on the primary partitions. Transactions are
serialized during execution. When a transaction commits, an
optimized Two-Phase-Commit is used to synchronize the
partition copies.
The SQL layer recognizes the commit as soon as the copies
are updated (and before logs have been written to disk).
The low-level NDB C/C++ application API is asynchronous.
Fire and forget is possible: your application can continue
before transaction processing has even begun!

Brain Masala
Partition 2, Copy
Partition 0, Copy
Arbitrator

The speaker says...
The failure of a single node is detected using a heartbeat
protocol: details are documented, future improvements are
possible. Both MySQL Cluster and Virtual Synchrony
separate message delivery from node failure detection.
The worst case scenario of a brain split is cured by the
introduction of arbitrators. If the nodes split and each half
is able to keep the Cluster up, the nodes try to contact the
arbitrator. It is then up to the arbitrator to decide who stays
up and who shuts down. Arbitrators are extra processes,
ideally run on extra machines. Management nodes can act
as arbitrators too. You need at least one management node
for administration, thus you always have an arbitrator
readily available.

Drupal? Sysbench? Oh, oh...
Partition 2, Copy
Partition 0, Copy
MySQL

The speaker says...
Partial replication (here: partitioning, sharding) is the only
known solution to the write scale out problem. But, it comes
at the high price of distributed queries.
A SQL query may require reading data from many partitions.
On the one hand, work is nicely parallelized over many nodes;
on the other hand, records found have to be transferred
within the cluster from one node to another. Although
Cluster tries to batch requests efficiently together to
minimize communication delays, transferring data from node
to node to answer questions remains an expensive
operation.

Oh, oh... tune your partitions!
Partition 2, Copy
Partition 0, Copy
MySQL
CREATE TABLE cities (
  id INT NOT NULL,
  Population INT UNSIGNED,
  city_name VARCHAR(100),
  PRIMARY KEY(city_name, id)
);
SELECT id FROM cities
WHERE
city_name = 'Kiel';

The speaker says...
How much traffic and latency occurs depends on the actual
SQL query and the partitioning scheme. By default a table
is partitioned into 3840 virtual fragments (think
vBuckets) using its primary key. The partitioning can
and should be tuned.
Try to find partitioning keys that make your common,
expensive or time-critical queries run on a single node.
Assume you have a list of cities. City names are not unique,
thus you have introduced a numeric primary key. It is likely
that your most common query filters on the city name, not
only on the numeric primary key. Therefore, your
partitioning should be based on city name as well.
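The effect of the partitioning key can be simulated with a plain hash function. This is a sketch only: the number of partitions, the MD5-based mapping and the helper name are assumptions for illustration and do not reproduce the distribution logic of MySQL Cluster.

```python
import hashlib

PARTITIONS = 4  # assumed, kept small for readability


def partition_of(*key_parts) -> int:
    """Map a partitioning key to a partition number using a stable hash."""
    key = "|".join(map(str, key_parts)).encode()
    return int(hashlib.md5(key).hexdigest(), 16) % PARTITIONS


rows = [(1, "Kiel"), (2, "Berlin"), (3, "Kiel"), (4, "Hamburg"), (5, "Kiel")]

# Partitioned by city_name only: all 'Kiel' rows share one partition,
# so WHERE city_name = 'Kiel' can be answered by a single node.
kiel_by_name = {partition_of(name) for row_id, name in rows if name == "Kiel"}

# Partitioned by (city_name, id): 'Kiel' rows may end up anywhere, so the
# same query has to ask every partition.
kiel_by_full_key = {partition_of(name, row_id)
                    for row_id, name in rows if name == "Kiel"}

print(len(kiel_by_name), "partition(s) when partitioned by city_name")
print(len(kiel_by_full_key), "partition(s) hit when partitioned by (city_name, id)")
```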

The ultimate Key-Value-Store?
Partition 2, Copy
Partition 0, Copy
MySQL
CREATE TABLE cities (
  id INT NOT NULL,
  city_name VARCHAR(100),
  PRIMARY KEY(id)
);
SELECT * FROM cities
WHERE id = 1;
SELECT * FROM cities
WHERE id = 100;

The speaker says...
I may have stated it before: if there is any product at
MySQL that can compete with NoSQL (as in Key-Value-
Store) on the issue of distributed data stores, it is MySQL
Cluster.
An optimal query load for MySQL Cluster is one that
primarily performs lookups on partition keys. Each query will
execute on one node only. There is little traffic within the
cluster – little network overhead. Work load is perfectly
parallelized.
Will your unmodified PHP application perform on Cluster?

Joins: 24...70x faster
SELECT t1.a, t2.b FROM t1, t2
WHERE t1.pk = 1 AND t1.a = t2.pk
Then
NDB_API> read a from table t1 where pk = 1
[round trip]
(a = 15)
NDB_API> read b from table t2 where pk = 15
[round trip]
(b = 30)
[return a = 15, b = 30]
Now
NDB_API> read @a=a from table t1 where pk = 1;
read b from table t2 where pk = @a
[round trip]

The speaker says...
In 7.2 we claim certain joins to execute 24...70x faster by
the help of AQL (condition push-down)! How come?
Partial replication does not go together well with joins. Take
this simple nested join as an example. There are two tables
to join. The join condition of the second table depends on
the values of the first table. Thus, t1 has to be searched
before t2 can be searched and the result can be returned to
the user. That makes two operations and two round trips.
As of 7.2, there is a new batched way of doing it. It saves
round trips. Some round trips avoided means – at the
extreme - 24...70x faster: the network is your enemy #1.

Benchmark pitfall: connections
MySQL
Load Balancer
Many, many clients
MySQL
NDB Storage Engine NDB Storage Engine

The speaker says...
If you ever come to the point of evaluating MySQL Cluster,
make sure you configure MySQL to Cluster connections
appropriately (ndb_cluster_connection_pool).
A MySQL Server with only one connection (default setting)
from itself to the cluster may not be able to serve many
concurrent clients at the rate the Cluster part itself might be
able to handle them. The connection may impose an
artificial limitation on the cluster throughput.

Adding nodes, rebalancing
Partitions Partitions
Partitions Partitions

The speaker says...
Adding nodes, growing the capacity of your cluster in terms
of size and computing power, is an online operation. At any
time you can add nodes to your cluster.
New nodes do not immediately participate in
operations. You have to tell the cluster what to do with
them: use for new tables, or use for growing the capacity
available to existing tables. When growing existing tables,
data needs to be redistributed to the new nodes.
Rebalancing is an online operation: it does not block
clients. The partitioning algorithm used by Cluster ensures
that data is copied to new nodes only, there is no
traffic between nodes currently holding fragments of
the table to be rebalanced.

We shall...
• Code an Eager Update-Anywhere Cluster
• Prefer a hybrid design to avoid getting too deep into MySQL
• Do not fear the lack of text books on partial replication
• Read CPU vendor tuning guides like comics
• Like Sweden or Finland
Send your application to the MySQL Cluster team.
Cluster is different. MySQL Cluster is perfect for web
session storage. Whether your Drupal, WordPress, …
runs faster is hard to tell – possibly not faster.
PS (marketing fluff): ask Sales for a show!
DIY - Summary

The speaker says...
By the end of this talk you should remember at least this:
● There are four kinds of replication solutions based on a
matrix asking „where can all transactions run“ and „when
are replicas synchronized“
● Clusters don't make everything faster – the network is
your enemy. For read scale out there are proven
solutions.
● Write scale out is only possible through partial replication
(Small write Quorum would impact read performance)

THE END
Contact: ulf.wendel@oracle.com

The speaker says...
Thank you for your attendance!
Upcoming shows:
Talk&Show! (ask... :-))
YourPlace, any time
PHP Summit
Munich, December 2013
