Successfully reported this slideshow.
Your SlideShare is downloading. ×

DIY: A distributed database cluster, or: MySQL Cluster

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad

Check these out next

1 of 81 Ad

DIY: A distributed database cluster, or: MySQL Cluster

Download to read offline

Live from the International PHP Conference 2013: MySQL Cluster is a distributed, auto-sharding database offering 99,999% high availability. It runs on Rasperry PI as well as on a cluster of multi-core machines. A 30 node cluster was able to deliver 4.3 billion (not million) read transactions per second in 2012. Take a deeper look into the theory behind all the MySQL replication/clustering solutions (including 3rd party) and learn how they differ.

Live from the International PHP Conference 2013: MySQL Cluster is a distributed, auto-sharding database offering 99,999% high availability. It runs on Rasperry PI as well as on a cluster of multi-core machines. A 30 node cluster was able to deliver 4.3 billion (not million) read transactions per second in 2012. Take a deeper look into the theory behind all the MySQL replication/clustering solutions (including 3rd party) and learn how they differ.

Advertisement
Advertisement

More Related Content

Slideshows for you (20)

Viewers also liked (20)

Advertisement

Similar to DIY: A distributed database cluster, or: MySQL Cluster (20)

Advertisement

Recently uploaded (20)

DIY: A distributed database cluster, or: MySQL Cluster

  1. 1. MySQL Cluster talk DIY No Best Practices No Product Presentation … you have been warned. N marketing fluff
  2. 2. Foreword and disclaimer Do it yourself, become a maker, get famous! In this course you will learn how to create an eager update anywhere cluster. You need: ● A soldering iron, solder ● Wires (multiple colors recommended) ● A collection of computers By the end of the talk you can either challenge MySQL, or get MySQL Cluster for free – it's Open Source, as ever since. Get armed with the distributed system theory you, as a developer, need to master any distributed database.
  3. 3. DIY – Distributed Database Cluster, or: MySQL Cluster Ulf Wendel, MySQL/Oracle N marketing fluff
  4. 4. Live on stage: Making a Cluster
  5. 5. The speaker says... Beautiful work, but unfortunately the DIY troubles begin before the first message has been delivered in our cluster. Long before we can speak about the latest hats fashion, we have to fix wiring and communication! Communication should be: • Fast • Reliable (loss, retransmission, checksum, ordering) • Secure Network performance is a limiting factor for distributed systems. Hmm, we better go back to the drawing board before we mess up more computers...
  6. 6. Availability • Cluster as a whole unaffected by loss of nodes Scalability • Geographic distribution • Scale size in terms of users and data • Database specific: read and/or write load Distribution Transparency • Access, Location, Migration, Relocation (while in use) • Replication • Concurrency, Failure Back to the beginning: goals
  7. 7. The speaker says... A distributed database cluster strives for maximum availability and scalability while maintaining distribution transparency. MySQL Cluster has a shared-nothing design good enough for 99,999% (five minutes downtime per year). It scales from Rasperry Pi run in a briefcase to 1.2 billion write transactions per second on a 30 data nodes cluster (if using possibly unsupported bleeding edge APIs.) It offers full distribution transparency with the exception of partition relocation to be triggered manually but performed transparently by the cluster. That's to beat. Let's learn what kind of clusters exist, how they tick and what the best algorithms are.
  8. 8. Where are transactions run? Primary Copy Update Anywhere When does synchronization happen? Eager Not available for MySQL MySQL Cluster 3rd party Lazy MySQL Replication 3rd party MySQL Cluster Replication What kind of cluster?
  9. 9. The speaker says... A wide range of clusters can be categorized by asking where transactions are run and when replicas synchronize their data. Any eager solution ensures that all replicas are synchronized at any time: it offers strong consistency. A transaction cannot commit before synchronization is done. Please note, what it means to transaction rates: • Single computer tx rate ~ disk/fsync rate • Lazy cluster tx rate ~ disk/fsync rate • Eager cluster tx rate ~ network round-trip time (RTT) Test: Would you deploy MySQL Cluster on Amazon EC2 :-) ?
  10. 10. Lazy Primary Copy we have... 010101001011010 101010110100101 101010010101010 101010110101011 101010110111101 Master (Primary) Write Slave (Copy) Slave (Copy) Slave (Copy) Read Read Lazy synchronization: eventual consistency Primary Copy: where any transaction may run
  11. 11. The speaker says... MySQL Replication falls into the category of lazy Primary Copy clusters. It is a rather unflexible solution as all updates must be sent to the primary. However, this simplifies concurrency control of conflicting, concurrent update transactions. Concurrency control is no different from a single database. Lazy replication can be fast. Transactions don't have to wait for synchronization of replicas. The price of the fast execution is the risk of stale reads and eventual consistency. Transactions can be lost when the primary crashes after commit and before any copy has been updated. (This is something you can avoid by using MySQL semi-sync replication, which delays the commit until delivery to copy.)
  12. 12. BTW, confusing: Multi-Master Master (Primary) Slave (Copy) Master (Primary) Slave (Copy) SET A = 1 SET B = 1 A, B A, B
  13. 13. The speaker says... Be aware of the term Multi-Master. MySQL Community sometimes uses it to describe a set of Primary Copy clusters where primaries (master) replicate from each other. This is one of the many possible topologies that you can build with MySQL Replication. In the example, the PC cluster on the left manages table A and the PC cluster on the right manages table B. The Primaries copy table A respectively table B from each other. There is no concurrency control and conflicts can arise. There is no distribution transparency. This is not an own kind of cluster with regards to our where and when criteria. And, it is rarely what you want... Not a good goal for DIY – let's move on.
  14. 14. Let's do Eager Update Anywhere 010101001011010 101010110100101 101010010101010 101010110101011 101010110111101 Replica Write Replica Replica Replica Read Eager synchronization: strong consistency Update Anywhere: any transaction can run on any replica
  15. 15. The speaker says... An eager update anywhere cluster improves distribution transparency and removes the risk of reading stale data. Transparency and flexibility is improved because any transaction can be directed to any replica. Synchronization happens as part of the commit, thus strong consistency is achieved. Remember: transaction rate ~ network RTT. Failure tolerance is better than with Primary Copy. There is no single point of failure – the primary - that can cause a total outage of the cluster. Nodes may fail without bringing the cluster down immediately. Concurrency control (synchronization) is complex as concurrent transactions from different replicas may conflict.
  16. 16. Concurrency Control: 1SR 010101001011010 101010110100101 101010010101010 101010110101011 101010110111101 Replicat0 : SET a = 1 Replica t0 : SET a = 2 One-Copy-Serializability (1SR) for correctness • All replicas must decide on the same transaction order a = 1 a = 2 a = 2a = 1 a = 1 010101001011010 101010110100101 101010010101010 101010110101011 101010110111101
  17. 17. The speaker says... Concurrent ACID transactions must be isolated from each other to ensure correctness. The database system needs a mechanism to detect conflicts. If any, transactions need to be serialized. The challenge is to have all replicas commit transactions in the same serial order. One-Copy- Serializability (1SR) demands the concurrent execution of transactions in an replicated database to be equivalent to a serial execution of these transactions over a single logical copy of the database. 1SR is the highest level of consistency, lower exist, for example, snapshot isolation. Given that, the questions are: • How to detect conflicting transactions? • How to enforce a global total order?
  18. 18. Certification: detect conflict Replica Update transaction Replica Read query Replica Read set: a = 1 Write set: b = 12 Transactions get executed and certified before commit • Conflict detection is based on read and write sets • Multi-Primary deferred update Certification Certification
  19. 19. The speaker says... (For brevity we discuss multi-primary deferred update only.) In a multi-primary deferred update system a read query can be served by a replica without consulting any of the other replicas. A write transaction must be certified by all other replicas before it can commit. During the execution of the transaction, the replica records all data items read and written. The read/write sets are then forwarded by the replica to all other replicas to certify the remote transaction. The other replicas check whether the remote transaction includes data items modified by an active local transaction. The outcome of the certification decides on commit or abort. Either symetric (statement based) or asymetric (row based) replication can be used.
  20. 20. Concurrency Control 010101001011010 101010110100101 101010010101010 101010110101011 101010110111101 Replicat0 : SET a = 1 Replica t0 : SET a = 2 Various synchronization mechanisms • Atomic commit • Atomic broadcast • Strict two-phase locking (2PL) • Optimistic, Physical clock, Lamport's clock, vector clock... a = 1 a = 2 a = 1a = 1 a = 2
  21. 21. The speaker says... One challenge remains: replicas must agree on a global total order for comitting transactions no matter in which order they receive messages. We will discuss atomic commit (two-phase-locking) and atomic broadcast. The other approaches are out of scope.
  22. 22. Atomic commit for CC Execute Committing PreCommit Aborted Comitted Formula (background): serial execution, unnecessary aborts
  23. 23. The speaker says... Atomic commit can be expressed as a state machine with the final states abort and commit. Once a transaction has been executed, it enters the committing state in which certification/voting takes place. Given the absence of conflicting concurrent transactions, a replica sets the transactions status to precommit. If all replicas precommit, the transaction is comitted, otherwise it is aborted. Don't worry about the formula. It checks for concurrent transactions – as we did before – and ensures, in case of conflicts, that only one transaction can commit at a time. Problem: it may also do unnecessary aborts depending on message delivery order as it requires all servers to precommit->commit in the same order.
  24. 24. Atomic broadcast for CC Atomic broadcast guarantees • Agreement: if one server delivers a message, all will • Total order: all servers deliver messages in the same order Greatly simplified concurrency check • Deterministic: no extra communication after local decision
  25. 25. The speaker says... Atomic broadcast ensures that transaction are delivered in the same order to all replicas. Thus, certification of transactions is deterministic: all replicas will make the same decision about commit or abort because they all base their decision on the same facts. This in turn means that there is no need to coordinate the decisions of all replicas – all replicas will make the same decision. A transaction does not conflict and thus will commit, if its executed after the commit of any other transaction, or its read set does not overlap with the write set of any other transaction. The formula is greatly simplified! Great for DIY!
  26. 26. Voting quorum: ROWA, or...? Read-One Write-All is a special quorum • Quorum constraints: NR + NW > N, NW > N/2 Replica Replica Replica Replica Replica Replica Replica Replica Replica Replica Replica Replica Example: N= 12, read quorum NR = 3, write quorum NW = 10 Replica Replica Replica Example: N= 3, read quorum NR = 2, write quorum NW = 2
  27. 27. The speaker says... So far we have silently assumed a Read-One Write-All (ROWA) quorum for voting. Reads could be served locally because updates have been applied to all replicas. Alternatively, we could make a rule that an update has to be agreed by and applied to half of the replicas plus one. This may be faster than achieving agreement among all replicas. However, for a correct read we now have to contact half of the replicas plus one and check whether they all give the same reply. If so, we must have read the latest version as the remaining, unchecked replicas form a minority that cannot be updated. The read quorum overlaps the write quorum by at least one element.
  28. 28. Voting quorum: ROWA! ROWA almost always performs better • Are Quorums an Alternative for Data Replication? (Jimenez-Peris et.al.) • „The obvious conclusion from these results is that ROWAA is the best choice for a wide range of application scenarios. It offers good scalability (within the limitations of replication protocols), very good availability, and an acceptable communication overhead. It also has the significant advantage of being very simple to implement and very easy to adapt to configuration changes. For every peculiar loads and configurations, it is possible that some variation of quorum does better than ROWAA.“ • Background: scale out results from study
  29. 29. The speaker says... Judging from the paper ROWA respectively Read- One Write-All-Available (ROWAA) is a promising approach. For example, it offers linear scalability for read only workloads but still remains competitive for mixed update and read loads. It requires a high write-to-read ratio before the various Quorum algorithms outperform ROWA on scalability. In sum: ROWA beats Quorums by a magnitude for read but does not drop by a magniture for write, and the web is read dominated. Scalability is one aspect. Quorums also help with availability – the studies finding is similar: ROWA is fine. DIY decision on currency control: ROWA, atomic broadcast. Quiz: name a system using Quorums? Riak! Next: Availability and Fault Tolerance.
  30. 30. Complex failure handling required • Later evolution: Three-Phase Commit (3PC) Fault Tolerance: 2PC Coordinator Participant Participant Vote Request PreCommit PreCommit Vote Request Global Commit Commit
  31. 31. The speaker says... When discussing atomic commit we have effectively shown the Two-Phase Commit (2PC) protocol. 2PC starts with a vote request multicasted from a coordinator to all participants. The participants either vote to commit (precommit) or abort. Then, the coordinator checks the voting result. If all voted to commit, it sends a global commit messages and the participants commit. Otherwise the coordinator sends a global abort command. Various issues may arise in case of network or process failures. Some cannot be cured using timeouts. For example, consider the situation when a participant precommits but gets no global commit or global abort. The participant cannot uniliterally leave the state. At best, it can ask another participant what to do.
  32. 32. Two-Phase Commit is a blocking protocol Fault Tolerance: 2PC Coordinator Participant Participant Vote Request PreCommit PreCommit Vote Request
  33. 33. The speaker says... The worst case scenario is a crash of the coordinator after all participants have voted to precommit. The participants cannot leave the precommit state before the coordinator has recovered. They do not know whether all of them have voted to commit or not. Thus, they do not know whether a global commit or global abort has to be performed. As none of them has received a message about the outcome of the voting, the participants cannot contact one another and ask for the outcome. Two-Phase Commit is also known as a blocking protocol.
  34. 34. Reliable multicast/broadcast • Build on the idea of group views and view changes Virtual Synchrony P1 P2 P3 P4 M1 M2 VC M3 M4 G1 = {P1, P2, P3} G2 = {P1, P2, P3, P4}
  35. 35. The speaker says... Virtual Synchrony is a mechanism that does not block. It is build around the idea of associating multicast messages with the notion of a group. A message is delivered to all members of a group but no other processes. Either the message is delivered to all members of a group or to none of them. All members of the group agree that they are part of the group before the message is multicasted (group view). In the example, M1...3 are associated with the group G1 = {P1, P2, P3}. If a process wants to join or leave a group a view change message is multicated. In the example, P4 wants to join the group and a VC message is send while M3 is still being delivered. Virtual Synchrony requires that either M3 is delivered to all of G1 before the view change takes place or to none.
  36. 36. View changes act as a message barrier • Remember the issues with 2PC …? Virtual Synchrony P1 P2 P3 P4 M5 VC M6 G2 = {P1, P2, P3, P4} G3 = {P1, P2, P3} M7 M8
  37. 37. The speaker says... There is only one condition under which a multicast message is allowed not to be delivered: if the sender crashed. Assume the processes continue working and multicast messages M5, M6, M7 to group G2 = {P1, P2, P3, P4}. While P4 sends M7 it crashes. P4 has managed to deliver its message to {P3}. The crash of P4 is noticed and a view change is triggered. Because Virtual Synchrony requires a message to be delivered to all members of the group associated with it but the sender crashed, P3 is free to drop M7 and the view change can take place. A new group view G3 is established and messages can be exchanged again.
  38. 38. Wire: message ordering and fault tolerance • Common choices: UDP or TCP over IP Reliable, delivered vs. received 010101001011010 101010110100101 101010010101010 101010110101011 101010110111101 ReplicaReplica Update 1 Update 2 t1 : Update 1 t2 : Update 2 t1 : Update 2 t2 : Update 1 (lost)
  39. 39. The speaker says... Virtual Synchrony offers reliable multicast. Reliability can be best achieved using a protocol higher up on the OSI model. Isis, an early framework implementing Virtual Synchrony, has used TCP point to point connections if reliable service was requested. TCP is a connection oriented protocol (endpoint failures can be deteted easily) with error handling and message delivery in the order sent. However, using TCP only there are no ordering constraints between messages from any two senders. Those ordering constraints have to be implemented at the application layer. We say a message can be recieved on the network layer in a different order than its delivered to the application by the model discussed. Vector clocks can be used for global total ordering.
  40. 40. AB = Virtual Synchrony offering total-order delivery • „Synchrony“ does not refer to temporal aspects Atomic broadcast definition P1 P2 P3 P4 M1 M2 Unordered delivery Ordered delivery P1 P2 P3 P4 M1 M2
  41. 41. The speaker says... Atomic broadcast means Virtual Synchrony used with total- order message ordering. When Virtual Synchrony was introduced back in the mid 80s, it was explicitly designed to allow other message orderings. For example, it should be able to support distributed applications that have a notion of finding messages that commute, and thus may be applied in an order different from the order sent to improve performace. If events are applied in different order on different processes, the system cannot be called synchronous any more – the inventors called it virtually synchronous. However, recall we are only after total-ordering for 1SR.
  42. 42. Wash the brain without marketing fluff, split brain, done! • System dependent... E.g. Isis failure detector was very basic How to cook brains P1 P2 P3 P4 M1 M2 n1({P1, P2, P3, P4]) = 4 VC Split brain – Connection lost n2({P1, P2}) = 2 < (n1/2)
  43. 43. The speaker says... The failure of individual processes – or database replicas – has been discussed. The model has measures to handle them following using a fail stop approach. To conclude the discussion of fault tolerance we look at a situation called split brain: one half of the cluster lost connection to another half. Which shall survive? The answer is often implementation dependent. For example, the early Virtual Synchrony framework Isis has a rule that a new group view can only be installed if it contains n / 2 + 1 members with n being the number of members in the current group. In the example both halves would shut down. Brain splitting question: how many replicas would you project for a cluster if you don't know split brain implementation details?
  44. 44. In-core architecture DIY: Hack MySQL (oh, oh), or...? MySQL DBMS MySQL DBMS Load Balancer PECL/mysqlnd_ms MySQL Proxy PHP PHP PHP Reflector Reflector Replicator Replicator GCS
  45. 45. The speaker says... Here's a generic architecture made of five components: • Clients (PHP, Java, …) using well known interfaces • Load Balancer (for example PECL/mysqlnd_ms) • The actual database system • The reflector allows inspection and modification of on- going transactions • The (distributed) replicator handling concurrency control • The Group Communication System (GCS) provides communication primitives such as multicast (GCS examples: Appia, JGroups – Java, Spread – C/C++)
  46. 46. Middleware architecture DIY: Hack MySQL (oh, oh), or...? Virtual DBMS Virtual DBMS Load Balancer Clients Reflector Reflector Replicator Replicator GCS DBMS DBMS
  47. 47. The speaker says... An in-core design requires support for a reflector by the database. Strictly speaking there is no API inside MySQL one can use. The APIs used for MySQL Replication are not sufficient. Nonetheless, MySQL Replication can be classified as in-core in our model. Due to the lack of an reflector API, the only third party product following an in- core design (Galera by Codership) has to patch the MySQL core. Tungsten Replicator by Continuent is a Middleware design. Clients contact a virtual database. Requests are intercepted, parsed and replicated. The challenge is in the interception: statements using non-deterministic calls such as NOW() and TIME() must be taken care of.
  48. 48. Hybrid architecture DIY: Hack MySQL (oh, oh), or...? DBMS DBMS Load Balancer Clients Reflector Plugin Reflector Plugin Replicator Replicator GCS
  49. 49. The speaker says... In a hybrid architecture the reflector runs within the database process but the replicator layer is using extra processes. It is not a perfect comparison as we will see later but for the sake of our model, we can classify MySQL Cluster as a hybrid architecture. The reflector is implemented as a storage engine. The replicator layer is using extra processes. This design has some neat MySQL NDB Cluster specific benefits. If any MySQL product has NoSQL genes, it is MySQL Cluster.
  50. 50. Primary Copy Update Anywhere Eager Not available for MySQL MySQL Cluster (Hybrid) Galera (In-core) Lazy MySQL Replication (In-core) Tungsten (Middleware) MySQL Cluster Replication (Hybrid) DIY: Summary
  51. 51. The speaker says... Time for a summary before coding ants and compilers start their work. From a DIY perspective we can skip Lazy Primary Copy: it has simple concurrency control, it does not depend on network speed, it is great for flacky and slow WAN connections but it offers eventual consistency only (hint: enjoy PECL/mysqlnd_ms!), it has no means to scale writes. And, it exists – no karma... An eager update anywhere solution offering the highest level of correctness (1SR) gives you strong consistency. It scales writes to some degree because they can be executed on any replica, which parallizes execution load. Commit performance is network bound.
  52. 52. Full Replication Partitial Replication Read Scale Out Write Scale Out Capability MySQL Replication (Lazy Primary Copy, In-core) MySQL Cluster (Eager Update Anywhere, Hybrid) Tungsten (Primary Copy, Middleware) Galera (Eager Update Anywhere, In-core) If 1SR - hard limit DIY: The Master Class
  53. 53. The speaker says... The DIY Master Class for maximum karma is a partial replication solution offering strong consistency. Partial replication is the only way to ultimately scale write requests. The explanation is simple: every write adds load to the entire cluster. Remember that writes need to be coordinated, remember that concurrency control involves all replicas (ROWA) or a good number of them (Quorum). Thus, every additional replica adds load to all others. The solution is to partition the data set and keep each partition on a subset of all replicas only. NoSQL calls it sharding, MySQL Cluster calls it partitioning. Partial replication – that's the DIY master piece, that will give you KARMA.
  54. 54. Availability • Shared-nothing, High Availability (99,999%) • WAN Replication to secondary data centers Scalability • Read and write through partial replication (partitioning) • Distributed queries (parallize work), real-time guarantees • Focus In-Memory with disk storage extension • Sophisticated thread model for multi-core CPU • Optimized for short transaction (hundrets of operations) Distribution Transparency • SQL level: 100%, low-level interfaces available MySQL (NDB) Cluster goals
  55. 55. The speaker says... I am not aware of text books discussing partial replication theory in-depth. Thus, we have to reverse engineer an existing system. As this is a talk about MySQL Cluster, how about talking about MySQL Cluster finally?MySQL Cluster has originally been developed to serve telecommunication systems. It aims to parallize work as much as possible, hence it is a distributed database. It started as an in-memory solution but can store data on disk meanwhile. It runs best in environments offering low network latency, high network throughput and issuing short transactions. Applications should not demand complex joins. There is no chance you throw Drupal at it and Drupal runs super-fast out of the box! Let's see why...
  56. 56. SQL view: Cluster is yet another table storage engine MySQL Cluster is a hybrid MySQL MySQL Load Balancer Clients Reflector Plugin = NDB Storage Engine Replicator = NDB Data Node GCS
  57. 57. The speaker says... MySQL Cluster has a hybrid architecture. It consists of the green elements on the slide. The Reflector is implemented as a MySQL storage engine. From a SQL user's perspective, it is just another storage engine, similar to MyISAM, InnoDB or others (Distribution Transparency). From a SQL client perspective there is no change: all MySQL APIs can be used. The Reflector (NDB Storage Engine) runs as part of the MySQL process. The Replicator is a seperate process called NDB data node. Please note, node means process not machine. MySQL Cluster does not fit perfectly in the model: an NDB data node combines Replicator and storage tasks. BTW, what happens to Cluster if a MySQL Server fails?
  58. 58. Fast low-level access: bypassing the SQL layer MySQL Cluster is a beast MySQL MySQL Load Balancer Clients Reflector Plugin = NDB Storage Engine Replicator = NDB Data Node GCS Clients 4.3b read tx/s 1.2b write tx/s (in 2012)
  59. 59. The speaker says... From the perspective of MySQL Cluster, a MySQL Server is yet another application client. MySQL Server happens to be an application that implements a SQL view on the relational data stored inside the cluster. MySQL Cluster users often bypass the SQL layer by implementing application clients on their own. SQL is a rich query language but parsing a SQL query can take 30...50% of the total runtime of a query. Thus, bypassing is a good idea. The top benchmark results we show for Cluster are achieved using C/C++ clients directly accessing MySQL Cluster. There are many extra APIs for this special case: NDB API (C/C++, low level), ClusterJ (ORM style), ClusterJPA (low level), … - even for node.js (ORM style)
  60. 60. Partitioning (auto-sharding) NDB Data Node 1 NDB Data Node 2 NDB Data Node 3 NDB Data Node 4 Partition 0, Primary Partition 2, Copy Partition 0, Copy Partition 2, Primary Partition 1, Primary Partition 1, Copy Partition 3, Copy Partition 3, Primary Node Group 1 Node Group 0
  61. 61. The speaker says... There is a lot to say about how MySQL Cluster partitions a table and spreads it over nodes. The manual has all details, just all... The key idea is to use an eager primary copy approach for partitions combined with a mindful distribution of each partitions primary and its copies. NDB supports zero or one copies (replication factor). The failure of a partitions primary does not cause a failure of the Cluster. In the example, the failure of any one node has no impact. Also, node 1 and 4 may fail without a stop of the Cluster (fail stop model). But the cluster shuts down if all nodes of a node group fail.
  62. 62. Concurrency Control: 2PL,“2PC“ NDB Data Node 1 NDB Data Node 2 NDB Data Node 3 NDB Data Node 4 Partition 0, Primary Partition 2, Copy Partition 0, Copy Partition 2, Primary Partition 1, Primary Partition 1, Copy Partition 3, Copy Partition 3, Primary W R R
  63. 63. The speaker says... Buuuuh? Two-Phase-Locking (2PL) and Two-Phase-Commit (2PC) are used for concurrency control. Cluster is using traditional row locking to isolate transactions. Read and write locks can be distributed throughout the cluster. The locks are set on the primary partitions. Transactions are serialized during execution. When a transaction commits, an optimized Two-Phase-Commit is used to synchronize the partition copies. The SQL layer recognizes the commit as soon as the copies are updated (and before logs have been written to disk). The low-level NDB C/C++ application API is asynchronous. Fire and forget is possible: your application can continue before transaction processing as even begun!
  64. 64. Brain Masala NDB Data Node 1 NDB Data Node 2 NDB Data Node 3 NDB Data Node 4 Partition 0, Primary Partition 2, Copy Partition 0, Copy Partition 2, Primary Partition 1, Primary Partition 1, Copy Partition 3, Copy Partition 3, Primary Arbitrator
  65. 65. The speaker says... The failure of a single node is detected using a hearthbeat protocol: details are documented, future improvements are possible. Both MySQL Cluster and Virtual Synchrony seperate message delivery from node failure detection. The worst case scenario of a brain split is cured by the introduction of arbitrators. If the nodes split and each half is able to keep the Cluster up, the nodes try to contact the arbitrator. It is then up to the arbitrator to decide who stays up and who shuts down. Arbitrators are extra processes, ideally run on extra machines. Management nodes can act as arbitrators too. You need at least one management node for administration, thus you always have an arbitrator readily available.
  66. 66. Drupal? Sysbench? Oh, oh... NDB Data Node 1 NDB Data Node 2 NDB Data Node 3 NDB Data Node 4 Partition 0, Primary Partition 2, Copy Partition 0, Copy Partition 2, Primary Partition 1, Primary Partition 1, Copy Partition 3, Copy Partition 3, Primary MySQL
  67. 67. The speaker says... Partial replication (here: partitioning, sharding) is the only known solution to the write scale out problem. But, it comes at the high price of distributed queries. A SQL query may require reading data from many partitions. One the one hand work is nicely parallized over many nodes on the other hand, records found have to be transferred within the cluster from one node to another. Although Cluster tries to batch requests efficiently together to minimize communication delays, transferring data from node to node to answer questions remains an expensive operation.
  68. 68. Oh, oh... tune your partitions! NDB Data Node 1 NDB Data Node 2 NDB Data Node 3 NDB Data Node 4 Partition 0, Primary Partition 2, Copy Partition 0, Copy Partition 2, Primary Partition 1, Primary Partition 1, Copy Partition 3, Copy Partition 3, Primary MySQL CREATE TABLE cities { id INT NOT NULL, Population INT UNSIGNED, city_name VARCHAR(100), PRIMARY KEY(city_name, id) } SELECT id FROM cities WHERE city_name = 'Kiel'
  69. 69. The speaker says... How much traffic and latency occurs depends on the actual SQL query and the partitioning scheme. By default a table is partitioned into 3840 virtual fragments (think vBuckets) using its primary key. The partitioning can and should be tuned. Try to find partitioning keys that make your common, expensive or time-criticial queries run on a single node. Assume you have a list of cities. City names are not unique, thus you have introduced a numeric primary key. It is likely that your most common query checks for the city name not for the numeric primary key only. Therefore, your partitioning should be based on city name as well.
  70. 70. The ultimate Key-Value-Store? NDB Data Node 1 NDB Data Node 2 NDB Data Node 3 NDB Data Node 4 Partition 0, Primary Partition 2, Copy Partition 0, Copy Partition 2, Primary Partition 1, Primary Partition 1, Copy Partition 3, Copy Partition 3, Primary MySQL CREATE TABLE cities { id INT NOT NULL, city_name VARCHAR(100), PRIMARY KEY(id) } SELECT FROM cities WHERE id = 1 SELECT FROM citites WHERE id = 100
  71. 71. The speaker says... I may have stated it before: if there is any product at MySQL that can compete with NoSQL (as in Key-Value- Store) on the issue of distributed data stores, it is MySQL Cluster. An optimal query load for MySQL Cluster is one that primarily performs lookups on partition keys. Each query will execute on one node only. There is little traffic within the cluster – little network overhead. Work load is perfectly parallized. Will your unmodified PHP application perform on Cluster?
  72. 72. Joins: 24...70x faster Then Now NDB_API> read a from table t1 where pk = 1 [round trip] (a = 15) NDB_API> read b from table t2 where pk = 15 [round trip] (b = 30) [return a = 15, b = 30] SELECT t1.a, t2.b FROM t1, t2 WHERE t1.pk = 1 AND t1.a = t2.pk NDB_API> read @a=a from table t1 where pk = 1; read b from table t2 where pk = @a [round trip]
  73. 73. The speaker says... In 7.2 we claim certain joins to execute 24...70x faster by the help of AQL (condition push-down)! How come? Partial replication does not go together well with joins. Take this simple nested join as an example. There are two tables to join. The join condition of the second table depends on the values of the first table. Thus, t1 has to be searched before t2 can be searched and the result can be returned to the user. That makes two operations and two round trips. As of 7.2, there is a new batched way of doing it. It saves round trips. Some round trips avoided means – at the extreme - 24...70x faster: the network is your enemy #1.
  74. 74. Benchmark pitfall: connections NDB Data Node 1 NDB Data Node 2 NDB Data Node 3 NDB Data Node 4 MySQL Load Balancer Many, many clients MySQL NDB Storage Engine NDB Storage Engine
  75. 75. The speaker says... If you ever come to the point of evaluating MySQL Cluster, make sure you configure MySQL to Cluster connections appropriately (ndb_cluster_connection_pool). A MySQL Server with only one connection (default setting) from itself to the cluster may not be able to serve many concurrent clients at the rate the Cluster part itself might be able to handle them. The connection may an impose an artifical limitation on the cluster throughput.
  76. 76. Adding nodes, rebalancing NDB Data Node 1 NDB Data Node 2 NDB Data Node 3 NDB Data Node 4 Partitions Partitions Partitions Partitions NDB Data Node 5 NDB Data Node 6
  77. 77. The speaker says... Adding nodes, growing the capacity of your cluster in terms of size and computing power, is an online operation. At any time you can add nodes to your cluster. New nodes do not immediately participate in operations. You have to tell the cluster what to do with them: use for new tables, or use for growing the capacity available to existing tables. When growing existing tables, data needs to be redistributed to the new nodes. Rebalancing is an online operation: it does not block clients. The partitioning algorithm used by Cluster ensures that data is copied to new nodes only, there is no traffic between nodes currently holding fragments of the table to be rebalanced.
  78. 78. We shall... • Code an Eager Update-Anywhere Cluster • Prefer an hybrid design to get not too deep into MySQL • Do not fear the lack of text books on partital replication • Read CPU vendor tuning guides like comics • Like Sweden or Finland Send your application to the MySQL Cluster team. Cluster is different. MySQL Cluster is perfect for web session storage. Whether your Drupal, WordPress, … runs faster is hard to tell – possibly not faster. PS (marketing fluff): ask Sales for a show! DIY - Summary
  79. 79. The speaker says... By the end of this talk you should remember at least this: ● There are four kinds of replication solutions based on a matrix asking „where can all transactions run“ and „when are replicas synchronized“ ● Clusters don't make everything faster – the network is your enemy. For read scale out there are proven solutions. ● Write scale out is only possible through partial replication (Small write Quorum would impact read performance)
  80. 80. THE END Contact: ulf.wendel@oracle.com
  81. 81. The speaker says... Thank you for your attendance! Upcoming shows: Talk&Show! (ask... :-)) YourPlace, any time PHP Summit Munich, December 2013

×