Group Replication: A Journey to the Group Communication Core

Describes the design decisions on the paxos-based implementation that is used by Group Replication.

  1. Group Replication: A Journey to the Group Communication Core
     Alfranio Correia (alfranio.correia@oracle.com), Principal Software Engineer
     4th of February, Oracle / Fosdem 2017
     Copyright © 2017, Oracle and/or its affiliates. All rights reserved.
  2. Safe Harbor Statement
     The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  3. Program Agenda
  4. Program Agenda
     1. Background
     2. Group Communication Interface
     3. Group Communication Engine
     4. Performance
     5. Conclusion
  5. Agenda item 1: Background
  6. MySQL InnoDB Cluster
     [Diagram: applications connect through MySQL Connector and MySQL Router to three highly available (HA) ReplicaSets (ReplicaSet1, ReplicaSet 2, ReplicaSet 3), each made up of servers S1, S2, S3, S4, …; MySQL Shell is shown alongside.]
  7. MySQL Group Replication
     • What is MySQL Group Replication?
       “Multi-master update everywhere replication plugin for MySQL with built-in automatic distributed recovery, conflict detection and group membership.”
     • What does the MySQL Group Replication plugin do for the user?
       – Automates server failover in Single Primary
       – Provides fault tolerance
       – Enables update everywhere setups
       – Automates group reconfiguration (handling of crashes, failures, re-connects)
       – Provides a highly available replicated database
  8. Major Building Blocks
     [Diagram labels: five group members (M); MySQL Server; API; Replication Plugin; Com. API; Group Com. Engine; Group Comm. System (Corosync).]
  9. The Complete Stack
     [Diagram labels: MySQL Server (Performance Schema Tables: Monitoring; MySQL APIs: Lifecycle / Capture / Applier; InnoDB); API; Replication Plugin (Capture, Applier, Conflicts Handler, Recovery, Replication Protocol); Group Com. API; Group Com. Binding; Group Com. Engine; Group Comm. System (Corosync); Network.]
  10. Agenda item 2: Group Communication Interface
  11. Design
     • Abstract interface to support different solutions
       – Reconfigure the group and get membership information
       – Send and receive messages
     • Uses the observer pattern
       – MySQL Group Replication listens to events
     • A different implementation for each communication system
     • Made the transition from Corosync easy
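The design bullets can be pictured with a small Python sketch. It is not the plugin's real API: `GroupCommunication`, `LoopbackGroup`, and every method name are hypothetical, chosen only to show an abstract interface whose users register as observers of message and membership events.

```python
from abc import ABC, abstractmethod
from typing import Callable, List

# Hypothetical listener signatures (not taken from the slides).
MessageListener = Callable[[str, bytes], None]   # (sender, payload)
ViewListener = Callable[[List[str]], None]       # (current members)


class GroupCommunication(ABC):
    """Abstract interface; one subclass per communication system."""

    def __init__(self) -> None:
        self._message_listeners: List[MessageListener] = []
        self._view_listeners: List[ViewListener] = []

    # Observer registration: the replication layer listens to events.
    def on_message(self, listener: MessageListener) -> None:
        self._message_listeners.append(listener)

    def on_view_change(self, listener: ViewListener) -> None:
        self._view_listeners.append(listener)

    @abstractmethod
    def join(self, member: str) -> None: ...

    @abstractmethod
    def leave(self, member: str) -> None: ...

    @abstractmethod
    def send(self, sender: str, payload: bytes) -> None: ...

    def _notify_message(self, sender: str, payload: bytes) -> None:
        for listener in self._message_listeners:
            listener(sender, payload)

    def _notify_view(self, members: List[str]) -> None:
        for listener in self._view_listeners:
            listener(members)


class LoopbackGroup(GroupCommunication):
    """Toy in-process binding used only to exercise the interface."""

    def __init__(self) -> None:
        super().__init__()
        self.members: List[str] = []

    def join(self, member: str) -> None:
        if member not in self.members:
            self.members.append(member)
            self._notify_view(list(self.members))

    def leave(self, member: str) -> None:
        if member in self.members:
            self.members.remove(member)
            self._notify_view(list(self.members))

    def send(self, sender: str, payload: bytes) -> None:
        # Closed group: only members may send.
        if sender not in self.members:
            raise PermissionError("sender is not a group member")
        self._notify_message(sender, payload)
```

Swapping the communication system then means providing another subclass; code that only registers listeners and calls `send` does not change.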
  12. Semantics
     • Closed Group
       – Only group members can send and receive messages
     • Total Order
       – Messages are totally ordered among each other
     • Safe Delivery
       – One cannot deliver a message if the majority can’t do so
     • View Synchrony
       – Changes to membership are totally ordered with messages
  13. Agenda item 3: Group Communication Engine
  14. Built-in Communication Engine
     • Based on proven distributed systems algorithms (Paxos)
       – Compression, multi-platform, dynamic membership, SSL, IP whitelisting
     • No third-party software required
     • No network multicast support required
       – MySQL Group Replication can operate on cloud-based installations where multicast is unsupported
  15. Paxos Family and Friends
     Multi-Paxos, Fast Paxos, Disk Paxos, Cheap Paxos, Vertical Paxos, Generalized Paxos, Raft, Mencius, Flexible Paxos, Egalitarian Paxos, Byzantine Paxos
  16. Basic Paxos
     • Get agreement on a value:
       – Next message/transaction to be delivered
     • Members may have different roles:
       – Usually all members are proposers, acceptors and learners
     • Need a quorum to make progress
       – Usually a majority
     [Diagram: members M0, M1, M2 in three phases: 1. Prepare/Election Phase, 2. Accept Phase, 3. Learn Phase.]
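To make the quorum bullet concrete, here is a minimal sketch of majority arithmetic. The function names are illustrative only and do not come from the slides.

```python
def majority(group_size: int) -> int:
    """Smallest number of members that forms a majority quorum."""
    if group_size < 1:
        raise ValueError("group_size must be positive")
    return group_size // 2 + 1


def tolerated_failures(group_size: int) -> int:
    """Members that may fail while a majority can still be formed."""
    return group_size - majority(group_size)


def has_quorum(acks: int, group_size: int) -> bool:
    """True when enough members answered to make progress."""
    return acks >= majority(group_size)
```

For groups of 3, 5, and 7 members this gives majorities of 2, 3, and 4, so 1, 2, and 3 members respectively can fail without blocking progress.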
  17. Prepare Phase
     • Proposer sends a prepare request with number “n” to members (i.e. acceptors)
     • If an acceptor has not received a request with a number greater than “n”, it will respond
     • It will promise not to accept a request numbered less than “n”
     • If any reply carries a non-empty value, the leader will use the value with the highest number
     [Diagram: 1.1 Prepare: M0 sends (n) to M1 and M2; 1.2 Promise: M1 and M2 reply with (x, value) and (y, value).]
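The acceptor side of the prepare phase can be sketched as follows. This is a textbook-style illustration, not XCOM code; the `Acceptor` class and its field names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                     # highest prepare number seen so far
    accepted_n: int = -1                   # number of the last accepted proposal
    accepted_value: Optional[str] = None   # value of the last accepted proposal

    def on_prepare(self, n: int) -> Optional[Tuple[int, Optional[str]]]:
        """Return a promise (accepted_n, accepted_value), or None to stay silent."""
        if n < self.promised:
            # A request with a greater number was already seen: do not respond.
            return None
        # Promise not to accept anything numbered less than n.
        self.promised = n
        return (self.accepted_n, self.accepted_value)
```

An acceptor that has never accepted anything replies with an empty value (`None` here); one that has accepted a proposal reports it so that the leader can reuse that value.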
  18. Accept Phase
     • If the leader finds out that a non-empty value has been previously proposed, it will use it
     • Otherwise, it will propose a new value
     • Requires a network round-trip to get agreement
     [Diagram: 2.1 Accept: M0 sends (n, value) to M1 and M2; 2.2 Accepted: M1 and M2 reply with (ack).]
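The leader's choice of value in the accept phase can be sketched as below. The `Promise` tuple shape and the function names are assumptions for illustration; the rule itself (reuse the previously accepted value with the highest number, otherwise propose a new one) is the one stated on the slides.

```python
from typing import List, Optional, Tuple

# (accepted_n, accepted_value) as returned by each acceptor's promise.
Promise = Tuple[int, Optional[str]]


def choose_value(promises: List[Promise], own_value: str) -> str:
    """Pick the value to send in the accept request."""
    previously_accepted = [p for p in promises if p[1] is not None]
    if not previously_accepted:
        # Nothing was proposed before: the leader is free to use its own value.
        return own_value
    # Otherwise reuse the value attached to the highest proposal number.
    best = max(previously_accepted, key=lambda p: p[0])
    return str(best[1])


def is_chosen(acks: int, group_size: int) -> bool:
    """A value is agreed once a majority has acknowledged the accept request."""
    return acks > group_size // 2
```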
  19. Learn Phase
     • The leader informs the other members about the decision
     • Only one learner is required to make progress
     • If the member already has the value, an ack is enough
     [Diagram: 3 Learn: M0 sends (value) to M1 and M2.]
  20. Multi-Paxos
     • Consensus round to decide on each slot’s content
     • Replicated Log Stream
     [Diagram: a log of slots (slot 0, slot 1, slot 2, slot 3, ...); members 0, 1, 2 run an Accept/Learn round for each slot, with occasional Election rounds.]
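One way to picture the replicated log stream: each slot is decided by its own consensus round, and values are handed to the application strictly in slot order. The sketch below assumes that reading; `ReplicatedLog` and its methods are hypothetical names.

```python
from typing import Dict, List


class ReplicatedLog:
    """Collects decided slots and delivers them in slot order."""

    def __init__(self) -> None:
        self.decided: Dict[int, str] = {}
        self.next_to_deliver = 0

    def learn(self, slot: int, value: str) -> List[str]:
        """Record a decided slot; return the values that became deliverable."""
        # A decided slot never changes, so a repeated learn is ignored.
        self.decided.setdefault(slot, value)
        delivered: List[str] = []
        # Deliver the longest gap-free prefix of the log.
        while self.next_to_deliver in self.decided:
            delivered.append(self.decided[self.next_to_deliver])
            self.next_to_deliver += 1
        return delivered
```

A gap in the log blocks delivery of every later slot, which is why an undecided slot matters even when the slots after it are already decided.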
  21. So what?
     • Leaders can easily become a bottleneck
     • Multiple leaders: eXtended COMmunications (XCOM)
  22. How does XCOM work?
     • Every member is a leader so no leader election
     • Every member owns an in-memory replicated log
     [Diagram: slots 0 to 5 and onwards; members 0, 1, 2 run an Accept/Learn round for each slot, with no Election rounds.]
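The slides do not spell out how slots map to leaders. A minimal sketch, assuming a round-robin assignment in the style of Mencius (which the deck lists among the Paxos family), could look like this; both function names are hypothetical.

```python
def slot_owner(slot: int, group_size: int) -> int:
    """Member that leads the given slot under the assumed round-robin mapping."""
    return slot % group_size


def next_own_slot(member: int, after_slot: int, group_size: int) -> int:
    """First slot greater than after_slot that belongs to member."""
    slot = after_slot + 1
    while slot_owner(slot, group_size) != member:
        slot += 1
    return slot
```

With three members, slots 0 to 5 are led by members 0, 1, 2, 0, 1, 2, so each member can propose in its own slots without first winning an election.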
  23. Nothing to Propose
     • Only a learn message with a “nop” is enough
     [Diagram: slots 0, 2, 3 and 5 go through Accept/Learn among members 0, 1, 2; slots 1 and 4 hold a “nop” and need only a Learn.]
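The effect of “nop” entries can be sketched as follows. The example assumes a round-robin slot assignment (slot modulo group size), which the slides do not state explicitly, and uses hypothetical names throughout.

```python
from collections import deque
from typing import Deque, Dict, List

NOP = "nop"


def fill_slots(pending: Dict[int, Deque[str]], group_size: int, slots: int) -> List[str]:
    """Build the log: each slot takes its owner's next message, or a nop."""
    log: List[str] = []
    for slot in range(slots):
        owner = slot % group_size            # assumed round-robin ownership
        queue = pending.get(owner)
        # An owner with nothing to propose fills its slot with a nop,
        # so the slots after it are not blocked.
        log.append(queue.popleft() if queue else NOP)
    return log


def deliverable(log: List[str]) -> List[str]:
    """Messages handed to the application; nops are skipped."""
    return [value for value in log if value != NOP]
```

With members 0 and 2 sending and member 1 idle, slots 1 and 4 become nops, matching the pattern in the slide's diagram.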
  24. How is the optimization possible?
     • Member “1” sends a learn message “(0, nop)” to member “4” and dies
     • Non-leaders can only propose “nop”(s) on behalf of others
     • They must go through all Paxos phases
     [Diagram: five members, 0 to 4. A Learn (0, nop) message is followed by a full round: Prepare (1), Promise (0, -), Accept (1, nop), Accepted (ack), Learn (nop).]
  25. Handling Failures/Suspicions
     [Diagram: slots 0 to 5 and onwards among members 0, 1, 2; most slots need only Accept/Learn, while one slot goes through Prep./Accept/Learn and is filled with a “nop”.]
  26. Implemented Optimizations in XCOM
     • Pipeline
       – Proposes several “transactions” in parallel
       – Improves performance in high latency networks
       – Current value is “10”
     • Batch
       – Improves CPU usage
       – Improves performance in high latency/low bandwidth networks
       – Current value is “5”
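A rough model of why batching and pipelining help on high latency networks. It assumes that the batch value of 5 means messages per proposal and that the pipeline value of 10 means proposals in flight at once; the slides do not define the units, and the function names are hypothetical.

```python
from typing import List

BATCH_SIZE = 5        # assumed: messages packed into one proposal
PIPELINE_DEPTH = 10   # assumed: proposals that may be in flight at once


def make_batches(messages: List[bytes], batch_size: int = BATCH_SIZE) -> List[List[bytes]]:
    """Group pending messages into proposals of at most batch_size messages."""
    return [messages[i:i + batch_size] for i in range(0, len(messages), batch_size)]


def round_trips(n_messages: int,
                batch_size: int = BATCH_SIZE,
                pipeline_depth: int = PIPELINE_DEPTH) -> int:
    """Idealized count of network round trips needed to agree on all messages."""
    proposals = -(-n_messages // batch_size)     # ceiling division
    return -(-proposals // pipeline_depth)       # one round trip per wave
```

Under this idealized model, 100 messages need 100 round trips with no batching or pipelining, but only 2 with the values from the slide. Real behavior depends on message sizes, timing, and bandwidth.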
  27. Implemented Optimizations in Binding
     • Compression
       – Reduces bandwidth consumption
     • Automatically reconfigure a group
       – Faulty members are expelled
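Payload compression at the binding layer can be illustrated with a short sketch. The slides do not name the algorithm, so `zlib` from the Python standard library is used here purely as a stand-in; the one-byte flag, the `threshold` parameter, and the function names are assumptions.

```python
import zlib


def encode_payload(payload: bytes, threshold: int = 0) -> bytes:
    """Prefix a flag byte (1 = compressed, 0 = raw) to the payload."""
    if len(payload) >= threshold:
        packed = zlib.compress(payload)
        # Keep the compressed form only when it actually saves bandwidth.
        if len(packed) < len(payload):
            return b"\x01" + packed
    return b"\x00" + payload


def decode_payload(frame: bytes) -> bytes:
    """Reverse encode_payload on the receiving side."""
    flag, body = frame[:1], frame[1:]
    return zlib.decompress(body) if flag == b"\x01" else body
```

Small or already dense payloads are sent raw, since compressing them would add overhead without reducing the bytes on the wire.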
  28. Agenda item 4: Performance
  29. Configuration
     • Multiple writers
       – One per Server
     • Single writer
       – Just one client
     • Oracle Server X5-2L with two Intel Xeon E5-2660-V3 processors
       – 20 Cores
       – 40 Hardware Threads
     • Oracle Enterprise Linux 7, kernel 3.8.13-118.13.3
     • 10 Gbps ethernet
     • Used “tc” to throttle network
  30. Multiple writers (256 Bytes)
     • Compression improves performance in the metropolitan network scenario
     • Headers are not compressed (~200 bytes) though
     [Chart: messages per second sent (axis 0 to 160000) for 3, 5 and 7 members, uncompressed vs. compressed 256 byte payload, on a 10Gbps network with 0.1ms latency and a 200Mbps network with 7ms latency.]
  31. Multiple writers (1K Bytes)
     • Check whether compression may help or not
     • Usually helps when bandwidth is a problem
     [Chart: messages per second sent (axis 0 to 120000) for 3, 5 and 7 members, uncompressed vs. compressed 1K payload, on a 10Gbps network with 0.1ms latency and a 200Mbps network with 7ms latency.]
  32. Single Writer (1K Bytes)
     • The scale out effect with multiple writers is small
     • Compression does not help here
     [Chart: messages per second sent (axis 0 to 120000) for 3, 5 and 7 members, uncompressed vs. compressed 1K payload, on a 10Gbps network with 0.1ms latency and a 200Mbps network with 7ms latency.]
  33. Agenda item 5: Conclusion
  34. Current Status
     • Made it into the MySQL 5.7.17 release
     • GA in December 2016
  35. Future
     • Configurable Paxos role(s)
       – Leader/Acceptor/Learner or Acceptor/Learner or Learner
     • Multiple leaders only if needed:
       – Avoids the skip message
       – Improves CPU and network usage
     • Not all members need to make messages network durable
       – Reduces resilience but improves performance
  36. Future
     • Expose some configuration options:
       – Batch
       – Pipeline
     • Compression at low level layers as well
     • Write to network in parallel
     • Overlay networks
  37. Where to go from here?
     • Packages
       – http://www.mysql.com/downloads/
     • Documentation
       – http://dev.mysql.com/doc/refman/5.7/en/group-replication.html
     • Blogs from the Engineers (news, technical information, and much more)
       – http://mysqlhighavailability.com