
Hermes Reliable Replication Protocol - ASPLOS'20 Presentation

"Hermes: A Fast, Fault-Tolerant and Linearizable Replication Protocol" -- ASPLOS'20



  1. Hermes: A Fast, Fault-tolerant and Linearizable Replication Protocol. Antonios Katsarakis, V. Gavrielatos, S. Katebzadeh, A. Joshi*, B. Grot, V. Nagarajan, A. Dragojevic† (University of Edinburgh, *Intel, †Microsoft Research). hermes-protocol.com
  2. Distributed datastores are in-memory stores with a read/write API and the backbone of online services. They need both high performance and fault tolerance, which mandates data replication.
  3. Replication 101. Data is typically kept on 3 to 7 replicas. Consistency can be weak (good performance but nasty surprises) or strong (programmable and intuitive). Reliable replication protocols provide strong consistency even under faults and define the actions executed on reads and writes; these actions determine a datastore's performance. Can reliable protocols also provide high performance?
  4. Paxos. The gold standard for strong consistency and fault tolerance, but low performance: reads require inter-replica communication, and writes take multiple RTTs over the network, so common-case performance (no faults) is as bad as worst-case performance (under faults). State-of-the-art reliable protocols instead exploit failure-free operation for performance.
  5. Performance of state-of-the-art protocols. ZAB: local reads from all replicas, so reads are fast, but writes serialize on the leader, so write throughput is low. CRAQ: local reads from all replicas, so reads are fast, but writes traverse the length of the chain from head to tail, so write latency is high. Both offer fast reads but poor write performance (a rough hop count follows).
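
To see why both designs hurt writes, a back-of-the-envelope count of one-way message delays ("hops") per write is helpful. This is a simplified model under assumed message patterns, not a measurement:

```python
# Rough one-way message delays on the write path for n replicas,
# assuming the client contacts any replica and ignoring queueing.

def zab_write_hops(n: int) -> int:
    # ZAB: forward to the leader, leader proposes, followers ack,
    # leader commits. Latency is constant (independent of n), but every
    # write funnels through one leader, which caps write throughput.
    return 4

def craq_write_hops(n: int) -> int:
    # CRAQ: the write flows head -> ... -> tail and the acknowledgment
    # flows back, so write latency grows with the chain length.
    return 2 * (n - 1)

for n in (3, 5, 7):
    print(f"{n} replicas: ZAB ~{zab_write_hops(n)} hops, CRAQ ~{craq_write_hops(n)} hops")
```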
  6. Key protocol features for high performance. Goal: low latency and high throughput. Reads: local from all replicas. Writes: fast (minimize network hops), decentralized (no serialization points), and fully concurrent (any replica can service a write). Existing replication protocols are deficient on writes.
  7. Enter Hermes: a broadcast-based, invalidating replication protocol inspired by multiprocessor cache-coherence protocols. Each replica holds each object in either the Valid or the Invalid state. Fault-free operation of write(A=3): (1) the coordinator, which is simply the replica servicing the write, broadcasts Invalidations, and followers move A to Invalid; from this point no stale reads can be served, preserving strong consistency. (2) Followers acknowledge the Invalidation. (3) Once all Acks arrive the write commits and the coordinator broadcasts Validations, after which all replicas can again serve reads for the object. This yields the strongest consistency, linearizability, with local reads from all replicas: a Valid object always holds the latest value (see the sketch below). But what about concurrent writes?
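
To make the three steps concrete, here is a minimal single-process sketch of the fault-free write and read paths, assuming a toy in-memory replica model (class and function names are illustrative, not taken from the paper's artifact):

```python
from dataclasses import dataclass, field

VALID, INVALID = "Valid", "Invalid"

@dataclass
class Replica:
    node_id: int
    store: dict = field(default_factory=dict)  # key -> (value, ts, state)

    def recv_inv(self, key, value, ts):
        # Step 1, follower side: invalidate the key and record the new
        # TS and value; reads of this key now block instead of going stale.
        self.store[key] = (value, ts, INVALID)
        return ("ACK", key, ts)  # step 2: acknowledge the Invalidation

    def recv_val(self, key, ts):
        # Step 3, follower side: a Validation with a matching TS makes
        # the new value locally readable again.
        value, cur_ts, _ = self.store[key]
        if cur_ts == ts:
            self.store[key] = (value, ts, VALID)

    def read(self, key):
        # Local reads are served only for Valid objects, which is what
        # makes them safe: Valid implies latest value.
        value, _, state = self.store.get(key, (None, None, INVALID))
        return value if state == VALID else None  # None: caller retries

def write(coordinator, followers, key, value, ts):
    """Any replica may coordinate a write: Inv -> Acks -> Val, one RTT."""
    coordinator.store[key] = (value, ts, INVALID)
    acks = [f.recv_inv(key, value, ts) for f in followers]  # broadcast Inv
    assert len(acks) == len(followers)  # all followers acknowledged
    for r in [coordinator, *followers]:  # commit and broadcast Val
        r.recv_val(key, ts)

replicas = [Replica(i) for i in range(3)]
write(replicas[0], replicas[1:], "A", 3, ts=(1, 0))  # TS = (version, node_id)
assert all(r.read("A") == 3 for r in replicas)  # local reads at every replica
```

Note that the sketch already ships the value inside the Invalidation; as the slides explain later, that choice (early value propagation) is what makes fault handling cheap.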
  8. Concurrent writes are the challenge: how to efficiently order concurrent writes to the same object? Solution: store a logical timestamp (TS) along with each object. Upon a write, the coordinator increments the TS and sends it with its Invalidations; upon receiving an Invalidation, a follower updates the object's TS. When two writes to the same object race (e.g., Inv(TS1) against Inv(TS4)), node IDs are used to order them (see the comparison sketch below). Broadcast + Invalidations + TS → high-performance writes.
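
A sketch of the ordering rule, assuming a (version, node_id) timestamp layout (the exact encoding here is an illustrative assumption):

```python
from typing import NamedTuple

class TS(NamedTuple):
    version: int  # incremented by the coordinator on every write
    node_id: int  # breaks the tie when two coordinators race

def should_apply(incoming: TS, current: TS) -> bool:
    # NamedTuples compare lexicographically, so every replica applies
    # the same deterministic order: higher version wins, then node ID.
    return incoming > current

assert should_apply(TS(4, 1), TS(1, 2))      # higher version wins
assert should_apply(TS(4, 2), TS(4, 1))      # equal versions: node ID decides
assert not should_apply(TS(4, 1), TS(4, 2))  # the losing racer is discarded
```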
  9. Writes in Hermes (Broadcast + Invalidations + TS) are: (1) decentralized, with fully distributed write ordering at the endpoints; (2) fully concurrent, since any replica can coordinate a write and writes to different objects proceed in parallel; and (3) fast, committing in one RTT and never aborting. Awesome! But what about fault tolerance?
  10. Handling faults in Hermes. Problem: a failure in the middle of a write can permanently leave replicas in the Invalid state. For example, if the coordinator of write(A=3) fails after sending its Invalidations but before its Validations, surviving followers hold A as Invalid and a subsequent read(A) blocks forever. Idea: allow any Invalidated replica to replay the write and unblock. Insight: replaying a write needs the write's original TS (for ordering) and the write's value; the TS is already sent with the Invalidation, but the value is not. Solution: send the write value with the Invalidation, i.e., early value propagation. With Inv(3, TS) delivered, any invalidated follower can replay the write, complete it with Validations, and then serve the read (a replay sketch follows). Early value propagation enables write replays.
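
Continuing the toy model from the earlier sketch (the same illustrative Replica class), a replay could look as follows; because the Invalidation already carried (value, TS), the replaying follower needs nothing from the failed coordinator:

```python
def replay(follower, live_followers, key):
    """A replica stuck in Invalid re-runs the write with its original TS."""
    value, ts, state = follower.store[key]
    if state == INVALID:  # a blocked read detected the stuck write
        # Re-broadcast Inv with the SAME (value, TS): the message is
        # idempotent, so replicas that saw the original Inv are unharmed.
        for f in live_followers:
            f.recv_inv(key, value, ts)
        for r in [follower, *live_followers]:
            r.recv_val(key, ts)  # commit and unblock local reads

# The slide's failure scenario: coordinator r0 dies after its
# Invalidations reach r1 and r2 but before any Validation is sent.
r0, r1, r2 = Replica(0), Replica(1), Replica(2)
ts = (2, 0)
r1.recv_inv("A", 3, ts)
r2.recv_inv("A", 3, ts)                         # Invs delivered, then r0 dies
assert r1.read("A") is None                     # read blocks: A is Invalid
replay(r1, [r2], "A")                           # r1 takes over the write
assert r1.read("A") == 3 and r2.read("A") == 3  # survivors converge
```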
  11. Hermes recap: strong consistency through cache-coherence-inspired Invalidations; fault tolerance through write replays, enabled by early value propagation; high performance through local reads at all replicas plus fast, decentralized, fully distributed writes. In short: Broadcast + Invalidations + TS + early value propagation. In the paper: protocol details, RMWs, and other goodies.
  12. Evaluation. State-of-the-art hardware testbed: 5 servers, each with two 10-core Intel Xeon E5-2630 v4 CPUs and a 56 Gb/s InfiniBand NIC. KVS workload: uniform access distribution over a million KV pairs (8 B keys, 32 B values). Evaluated protocols: ZAB, CRAQ, and Hermes.
  13. Performance. [Figures: throughput in million requests/sec for Hermes (high-perf. writes + local reads), CRAQ (conc. writes + local reads), and ZAB (local reads), annotated with a roughly 40% lead over CRAQ and a 4x lead over ZAB; and write latency at a 5% write ratio, normalized to Hermes, where the others reach up to 6x.] Write performance matters even at low write ratios. Hermes: highest throughput and lowest latency.
  14. Conclusion. Hermes = Broadcast + Invalidations + TS + early value propagation: strong consistency; fault tolerance via write replays; high performance, with local reads from all replicas and fast, decentralized, fully concurrent writes. Code and a TLA+ verification are available at hermes-protocol.com. Need reliability and performance? Choose Hermes! Q&A.
