
Cluster Consensus: when Aeron met Raft

Cluster Consensus: when Aeron met Raft by Martin Thompson

Want to know what is possible in the field of high-performance consensus algorithms for distributed systems? If the answer is yes, then come to this talk.


Cluster Consensus: when Aeron met Raft

  1. 1. Cluster Consensus When Aeron Met Raft Martin Thompson - @mjpt777
  2. 2. What does “Consensus” mean?
  3. 3. con•sen•sus noun kən-ˈsen(t)-səs : general agreement : unanimity Source: http://www.merriam-webster.com/
  4. 4. con•sen•sus noun kən-ˈsen(t)-səs : general agreement : unanimity : the judgment arrived at by most of those concerned Source: http://www.merriam-webster.com/
  5. 5. Consensus on what?
  6. 6. Message - 1 Message - 2 Message - 3 Message - 4 Message - 5 Message - 6
  7. 7. Message - 1 Message - 2 Message - 3 Message - 4 Message - 5 Message - 6 State Machine In Memory Model
  8. 8. Message - 1 Message - 2 Message - 3 Message - 4 Message - 5 Message - 6 State Machine In Memory Model Actors Event Sourcing CQRS Command Sourcing WAL Immutable Log
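
To make the idea behind these slides concrete: the cluster only has to agree on the order of messages in the log; every node then applies that log to a deterministic state machine and ends up with the same in-memory model. A minimal illustrative sketch (the CounterStateMachine here is hypothetical, not taken from the deck):

      import java.util.List;

      // Replicas that apply the same log in the same order must reach the same state.
      final class CounterStateMachine
      {
          private long balance;

          void apply(final long amount)
          {
              balance += amount;
          }

          long balance()
          {
              return balance;
          }

          public static void main(final String[] args)
          {
              final List<Long> agreedLog = List.of(10L, -3L, 42L); // order decided by consensus
              final CounterStateMachine stateMachine = new CounterStateMachine();
              agreedLog.forEach(stateMachine::apply);
              System.out.println(stateMachine.balance()); // 49 on every replica
          }
      }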
  9. 9. https://raft.github.io/raft.pdf
  10. 10. https://www.cl.cam.ac.uk/~ms705/pub/papers/2015-osr-raft.pdf
  11. 11. Raft in a Nutshell
  12. 12. Roles: Follower, Candidate, Leader
  13. 13. RPCs 1. RequestVote RPC Invoked by candidates to gather votes
  14. 14. RPCs 1. RequestVote RPC Invoked by candidates to gather votes 2. AppendEntries RPC Invoked by leader to replicate and heartbeat
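
For reference, the Raft paper defines the payloads of these two RPCs roughly as follows. The records below are only an illustration of the fields described in the paper, not Aeron Cluster's actual wire format:

      // Field names follow the Raft paper (Ongaro & Ousterhout); types are illustrative.
      record RequestVote(long term, int candidateId, long lastLogIndex, long lastLogTerm) {}
      record RequestVoteReply(long term, boolean voteGranted) {}

      record AppendEntries(
          long term,
          int leaderId,
          long prevLogIndex,
          long prevLogTerm,
          byte[][] entries,     // empty for a heartbeat
          long leaderCommit) {}
      record AppendEntriesReply(long term, boolean success) {}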
  15. 15. Safety Guarantees • Election Safety • Leader Append-Only • Log Matching • Leader Completeness • State Machine Safety
  16. 16. Monotonic Functions
  17. 17. Version all the things!
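
"Monotonic" here means values such as the term and the log position only ever move forward, which makes stale messages trivial to reject. A hypothetical guard, purely as a sketch (not Aeron Cluster code):

      // Illustrative only: reject anything that would move the term or position backwards.
      final class MonotonicState
      {
          private long term;
          private long logPosition;

          boolean tryAdvance(final long newTerm, final long newPosition)
          {
              if (newTerm < term || (newTerm == term && newPosition < logPosition))
              {
                  return false; // stale update from an old leader or a delayed message
              }

              term = newTerm;
              logPosition = newPosition;
              return true;
          }
      }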
  18. 18. Clustering Aeron
  19. 19. Is it Guaranteed Delivery™ ???
  20. 20. What is the “Architect” really looking for?
  21. 21. “Guaranteed Processing™”
  22. 22. There are many design approaches
  23. 23. Client Client Client Client Client Service
  24. 24. Client Client Client Client Client Service
  25. 25. Client Client Client Client Client Consensus Module Service Consensus Module Service
  26. 26. Client Client Client Client Client Consensus Module Service Consensus Module Service Consensus Module Service
  27. 27. NIO Pain!
  28. 28. Do servers crash?
  29. 29. FileChannel channel = null;
      try
      {
          channel = FileChannel.open(directory.toPath());
      }
      catch (final IOException ignore)
      {
      }

      if (null != channel)
      {
          channel.force(true);
      }
  32. 32. Directory Sync Files.force(directory.toPath(), true);
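
The JDK has no single call for syncing a directory entry, so the usual trick is to open a FileChannel on the directory itself and force it. This is a sketch of that general technique, not necessarily the helper shown on the slide:

      import java.io.IOException;
      import java.nio.channels.FileChannel;
      import java.nio.file.Path;
      import java.nio.file.StandardOpenOption;

      final class DirectorySync
      {
          // Works on Linux; Windows does not allow opening a directory as a channel.
          static void syncDirectory(final Path directory) throws IOException
          {
              try (FileChannel channel = FileChannel.open(directory, StandardOpenOption.READ))
              {
                  channel.force(true);
              }
          }
      }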
  33. 33. Performance
  34. 34. Let’s consider an RPC design approach
  35. 35. Client Client Client Client Client Consensus Module Service Consensus Module Service Consensus Module Service
  44. 44. Concurrency and/or Parallelism?
  45. 45. 1. Parallel is the opposite of Serial 2. Concurrent is the opposite of Sequential 3. Vector is the opposite of Scalar – John Gustafson
  46. 46. Fetch Time Instruction Pipelining
  47. 47. Fetch Decode Time Instruction Pipelining
  48. 48. Fetch Decode Execute Time Instruction Pipelining
  49. 49. Fetch Decode Execute Retire Time Instruction Pipelining
  50. 50. Fetch Decode Execute Retire Time Fetch Decode Execute Retire Instruction Pipelining
  51. 51. Fetch Decode Execute Retire Time Fetch Decode Execute Retire Fetch Decode Execute Retire Instruction Pipelining
  52. 52. Fetch Decode Execute Retire Time Fetch Decode Execute Retire Fetch Decode Execute Retire Fetch Decode Execute Retire Instruction Pipelining
  53. 53. Order Time Consensus Pipeline
  54. 54. Order Log Time Consensus Pipeline
  55. 55. Order Log Transmit Time Consensus Pipeline
  56. 56. Order Log Transmit Commit Time Consensus Pipeline
  57. 57. Order Log Transmit Commit Time Consensus Pipeline Execute
  58. 58. Order Log Transmit Commit Time Consensus Pipeline Execute Order Log Transmit Commit Execute
  59. 59. Order Log Transmit Commit Time Consensus Pipeline Execute Order Log Transmit Commit Execute Order Log Transmit Commit Execute
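
One way to get this overlap in practice is a single duty cycle that polls every stage on each pass, so while one message is being executed the next is being committed, transmitted, logged, and so on. The stage names and types below are hypothetical, a sketch rather than Aeron Cluster's real components:

      // Each poll does a bounded amount of work and reports how much it did,
      // so the loop keeps all pipeline stages busy at once.
      interface Stage
      {
          int doWork();
      }

      final class ConsensusPipeline
      {
          private final Stage[] stages; // order, log, transmit, commit, execute

          ConsensusPipeline(final Stage... stages)
          {
              this.stages = stages;
          }

          void dutyCycle()
          {
              while (!Thread.currentThread().isInterrupted())
              {
                  int workCount = 0;
                  for (final Stage stage : stages)
                  {
                      workCount += stage.doWork();
                  }

                  if (0 == workCount)
                  {
                      Thread.onSpinWait(); // idle strategy when nothing progressed
                  }
              }
          }
      }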
  60. 60. Client Client Client Client Client Consensus Module Service Consensus Module Service Consensus Module Service
  69. 69. NIO Pain!
  70. 70. ByteBuffer byte copies ByteBuffer byteBuffer = ByteBuffer.allocate(64 * 1024); byteBuffer.putInt(index, value);
  71. 71. ByteBuffer byte copies ByteBuffer byteBuffer = ByteBuffer.allocate(64 * 1024); byteBuffer.putBytes(index, bytes);
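
A common alternative in this ecosystem is Agrona's UnsafeBuffer (a MutableDirectBuffer), which provides absolute-index primitive and bulk accessors over heap, direct, or memory-mapped storage without touching a position. This sketch only illustrates that API; it is not the code from the slide:

      import java.nio.ByteBuffer;
      import org.agrona.concurrent.UnsafeBuffer;

      final class BufferExample
      {
          public static void main(final String[] args)
          {
              // Assumes the org.agrona:agrona dependency on the classpath.
              final UnsafeBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(64 * 1024));

              final byte[] bytes = { 1, 2, 3, 4 };
              buffer.putInt(0, 42);       // absolute index, no position to manage
              buffer.putBytes(4, bytes);  // bulk put at an absolute index

              System.out.println(buffer.getInt(0));
          }
      }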
  73. 73. How can Aeron help?
  74. 74. Message Index => Byte Index
  75. 75. Multicast, MDC, and Spy-based Messaging
  76. 76. Counters => Bounded Consumption
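
The idea is that a consumer never reads past a position published in a shared counter, such as the commit position, so consumption is bounded by consensus progress. A rough sketch with hypothetical names (the real mechanism uses Aeron position counters in shared memory, not an AtomicLong):

      import java.util.concurrent.atomic.AtomicLong;

      final class BoundedConsumer
      {
          private final AtomicLong commitPosition; // advanced once a quorum has the data
          private long consumedPosition;

          BoundedConsumer(final AtomicLong commitPosition)
          {
              this.commitPosition = commitPosition;
          }

          long pollLimit()
          {
              // Never consume beyond what consensus has made safe to apply.
              return Math.max(0, commitPosition.get() - consumedPosition);
          }

          void onConsumed(final long bytes)
          {
              consumedPosition += bytes;
          }
      }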
  77. 77. Batching – Amortising Costs [chart: average overhead per item or operation vs. number of items in the batch]
  78. 78. Batching – Amortising Costs • System calls • Network round trips • Disk writes • Expensive computations
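
Per-item overhead falls roughly as 1/n with batch size because the fixed cost (a syscall, a round trip, an fsync) is paid once per batch rather than once per item. A minimal sketch of the pattern, with hypothetical names and not Aeron code:

      import java.util.ArrayList;
      import java.util.List;
      import java.util.Queue;
      import java.util.function.Consumer;

      final class Batcher
      {
          // Drain whatever has accumulated and pay the expensive operation once
          // for the whole batch instead of once per item.
          static <T> int drainAndProcess(final Queue<T> queue, final Consumer<List<T>> expensiveOp)
          {
              final List<T> batch = new ArrayList<>();
              T item;
              while (null != (item = queue.poll()))
              {
                  batch.add(item);
              }

              if (!batch.isEmpty())
              {
                  expensiveOp.accept(batch); // e.g. one write + fsync for the batch
              }

              return batch.size();
          }
      }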
  79. 79. Interesting Features
  80. 80. Timers
  81. 81. All state must enter the system as a message!
  82. 82. Timers
      public void foo()
      {
          // Decide to schedule a timer
          cluster.scheduleTimer(
              correlationId, cluster.timeMs() + TimeUnit.SECONDS.toMillis(5));
      }

      public void onTimerEvent(final long correlationId, final long timestampMs)
      {
          // Look up the correlationId associated with the timer
      }
  85. 85. Back Pressure and Stashed Work
  86. 86. Back Pressure
      public ControlledFragmentAssembler.Action onSessionMessage(
          final DirectBuffer buffer,
          final int offset,
          final int length,
          final long clusterSessionId,
          final long correlationId)
      {
          final ClusterSession session = sessionByIdMap.get(clusterSessionId);
          if (null == session || session.state() == CLOSED)
          {
              return ControlledFragmentHandler.Action.CONTINUE; // drop: unknown or closed session
          }

          final long nowMs = cachedEpochClock.time();
          if (session.state() == OPEN && logPublisher.appendMessage(buffer, offset, length, nowMs))
          {
              session.lastActivity(nowMs, correlationId);
              return ControlledFragmentHandler.Action.CONTINUE;
          }

          // Not appended: abort so the fragment is not consumed and will be
          // redelivered, pushing back pressure onto the source.
          return ControlledFragmentHandler.Action.ABORT;
      }
  91. 91. Log Replay and Snapshots
  92. 92. Log Replay and Snapshots Distributed File System?
  93. 93. Log Replay and Snapshots Distributed File System? Aeron Archive Recorded Streams
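
The recovery model is: load the most recent snapshot, then replay the recorded log from the snapshot's position. A schematic sketch with hypothetical types; the real implementation records and replays streams via Aeron Archive:

      import java.util.List;

      final class Recovery
      {
          record Snapshot(long logPosition, long state) {}

          // state = snapshot + replay of entries recorded after the snapshot position
          static long recover(final Snapshot snapshot, final List<long[]> recordedLog)
          {
              long state = snapshot.state();

              for (final long[] entry : recordedLog) // entry = {position, amount}
              {
                  if (entry[0] > snapshot.logPosition())
                  {
                      state += entry[1]; // re-apply only entries after the snapshot
                  }
              }

              return state;
          }
      }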
  94. 94. Multiple Services on the same stream
  95. 95. Client Client Client Client Client Consensus Module Service Consensus Module Service Consensus Module Service
  96. 96. Client Client Client Client Client Consensus Module Service Consensus Module Service Consensus Module Service Service Service Service Service Service Service
  97. 97. NIO Pain!
  98. 98. [diagram: MappedByteBuffer and DirectByteBuffer]
  99. 99. [diagram: DirectByteBuffer, MappedByteBuffer, MappedByteBuffer, DirectByteBuffer]
  101. 101. In Closing
  102. 102. What’s the Roadmap?
  103. 103. https://github.com/real-logic/aeron Twitter: @mjpt777 “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Leslie Lamport Questions?
