The document describes Hermes, a new replication protocol that aims to provide high performance, strong consistency, and fault tolerance for distributed datastores. Hermes uses an invalidating broadcast-based approach where writes are coordinated by a replica that broadcasts invalidations and value updates to other replicas. It allows for local reads from any replica and fast, decentralized, fully concurrent writes that commit in one network round trip. To handle faults, Hermes propagates write values with invalidations to allow replicas to recover from failures without blocking. The goal is to improve on existing protocols by avoiding serialization bottlenecks while maintaining strong consistency under both normal operation and replica failures.
1. Hermes: A Fast, Fault-tolerant and Linearizable Replication Protocol
Antonios Katsarakis, V. Gavrielatos, S. Katebzadeh,
A. Joshi*, B. Grot, V. Nagarajan, A. Dragojevic†
University of Edinburgh, *Intel, †Microsoft Research
hermes-protocol.com
2. Distributed Datastores
In-memory with read/write API
Backbone of online services
Need:
- High performance
- Fault tolerance
Mandates data replication
8. Replication 101
Typically 3 to 7 replicas
Consistency
- Weak: performance but nasty surprises
- Strong: programmable and intuitive
Reliable replication protocols
• Strong consistency even under faults
• Define the actions to execute reads & writes
→ these determine a datastore's performance
Reliable Replication Protocol
Can reliable protocols provide high performance?
13. Paxos
Gold standard: strong consistency and fault tolerance
Low performance
- reads → inter-replica communication
- writes → multiple RTTs over the network
Common-case performance (i.e., no faults) is as bad as worst-case (under faults)
State-of-the-art reliable protocols exploit
failure-free operation for performance
19. Performance of state-of-the-art protocols
ZAB (leader-based)
- Local reads from all replicas → fast
- Writes serialize on the leader → low throughput
CRAQ (chain: Head … Tail)
- Local reads from all replicas → fast
- Writes traverse the length of the chain → high latency
Fast reads but poor write performance
24. Key protocol features for high performance
Goal: low latency + high throughput
Reads
- Local from all replicas
Writes
- Fast: minimize network hops
- Decentralized: no serialization points
- Fully concurrent: any replica can service a write
Existing replication protocols are deficient
30. Enter Hermes
Broadcast-based, invalidating replication protocol
Inspired by multiprocessor cache-coherence protocols
Fault-free operation (states of an object A: Valid, Invalid):
1. Coordinator broadcasts Invalidations
- The coordinator is the replica servicing the write
- Once invalidated, a replica serves no stale reads → strong consistency!
2. Followers Acknowledge the invalidation
3. Coordinator commits and broadcasts Validations
- All replicas can now serve reads for this object
Strongest consistency: linearizability
Local reads from all replicas → a valid object holds the latest value
What about concurrent writes?
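The fault-free write path above can be sketched as a minimal, single-process simulation. The class and function names here are illustrative, not from the paper:

```python
# Minimal simulation of Hermes' fault-free write path (Invalidate → Ack → Validate).
# Names (Replica, write, read) are illustrative, not from the paper.

VALID, INVALID = "V", "I"

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}          # key -> (value, state)

    def invalidate(self, key, value):
        # Step 1 (receive): mark the object Invalid; stale reads are now blocked.
        self.store[key] = (value, INVALID)
        return "ACK"             # Step 2: acknowledge the invalidation

    def validate(self, key):
        # Step 3 (receive): the object becomes Valid and readable again.
        value, _ = self.store[key]
        self.store[key] = (value, VALID)

    def read(self, key):
        value, state = self.store[key]
        if state != VALID:
            raise RuntimeError("object invalidated: read must wait")
        return value             # local read: any Valid replica has the latest value

def write(coordinator, followers, key, value):
    # The coordinator is itself a replica servicing the write.
    coordinator.store[key] = (value, INVALID)
    acks = [f.invalidate(key, value) for f in followers]   # broadcast Invalidations
    assert all(a == "ACK" for a in acks)                   # wait for all Acks
    for f in followers:                                    # broadcast Validations
        f.validate(key)
    coordinator.validate(key)                              # commit

replicas = [Replica(n) for n in "ABC"]
write(replicas[0], replicas[1:], "A", 3)
print(replicas[2].read("A"))  # any replica serves the read locally -> 3
```

The commit takes one round trip (Invalidations out, Acks back); the Validations piggyback on subsequent traffic in the real protocol.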
39. Concurrent writes = challenge
Challenge
How to efficiently order concurrent writes to an object?
Solution
Store a logical timestamp (TS) along with each object
- Upon a write: the coordinator increments the TS and sends it with its Invalidations
- Upon receiving an Invalidation: a follower updates the object's TS
- When two writes to the same object race: use node IDs to order them
Broadcast + Invalidations + TS → high-performance writes
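The timestamp scheme above can be sketched as follows; representing the TS as a (version, node_id) pair is an assumption of this sketch, with the node ID breaking ties between racing writes:

```python
# Sketch of Hermes-style write ordering with per-object logical timestamps.
# A TS is a (version, node_id) pair; node_id breaks ties between racing
# writes so every replica resolves the race identically. Names are illustrative.

def newer(ts_a, ts_b):
    # Lexicographic comparison: higher version wins; ties broken by node id.
    return ts_a > ts_b

class Object:
    def __init__(self):
        self.value, self.ts = None, (0, 0)

    def coordinate_write(self, node_id, value):
        # Coordinator increments the version and tags it with its own id,
        # then sends this TS with its Invalidations.
        self.ts = (self.ts[0] + 1, node_id)
        self.value = value
        return self.ts

    def on_invalidation(self, ts, value):
        # Follower adopts the invalidation only if its TS is higher;
        # concurrent writes thus commit in the same order everywhere.
        if newer(ts, self.ts):
            self.ts, self.value = ts, value

# Two racing writes with the same version: node id 2 wins on every replica.
follower = Object()
follower.on_invalidation((1, 1), 3)   # write(A=3) from node 1
follower.on_invalidation((1, 2), 1)   # write(A=1) from node 2, same version
print(follower.value, follower.ts)    # -> 1 (1, 2)
```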
45. Writes in Hermes
Broadcast + Invalidations + TS
1. Decentralized
- Fully distributed write ordering at the endpoints
2. Fully concurrent
- Any replica can coordinate a write
- Writes to different objects proceed in parallel
3. Fast
- Writes commit in 1 RTT
- Writes never abort
Awesome! But what about fault tolerance?
50. Handling faults in Hermes
Problem
A failure in the middle of a write (e.g., a coordinator that fails after sending its Invalidations but before its Validations) can permanently leave a replica in the Invalid state, blocking reads.
Idea
Allow any invalidated replica to replay the write and unblock.
How?
Insight: to replay a write, a replica needs
- the write's original TS (for ordering)
- the write's value
The TS is already sent with the Invalidation, but the write value is not.
Solution: send the write value with the Invalidation → early value propagation
With Inv(value, TS), a follower that loses its coordinator replays the write itself: it re-broadcasts the Invalidation with the original TS, validates, and completes the write, after which reads proceed.
Early value propagation enables write replays
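The write replay enabled by early value propagation can be sketched as below; the replay and failure-detection hooks are illustrative, not the paper's exact mechanism:

```python
# Sketch of early value propagation: Invalidations carry both the value and
# the TS, so any invalidated follower can replay a write whose coordinator
# failed. Names are illustrative, not from the paper.

VALID, INVALID = "V", "I"

class Replica:
    def __init__(self):
        self.store = {}  # key -> [value, ts, state]

    def on_invalidation(self, key, value, ts):
        cur = self.store.get(key)
        if cur is None or ts > cur[1]:
            self.store[key] = [value, ts, INVALID]
        return "ACK"

    def on_validation(self, key, ts):
        if self.store[key][1] == ts:   # validate only the matching write
            self.store[key][2] = VALID

    def replay(self, key, others):
        # Coordinator failed mid-write: this replica already holds the write's
        # (value, TS) from the early-propagated Invalidation, so it re-runs
        # the write itself with the original TS (idempotent on every replica).
        value, ts, _ = self.store[key]
        for r in others:
            r.on_invalidation(key, value, ts)
        for r in others:
            r.on_validation(key, ts)
        self.store[key][2] = VALID     # write completes; reads unblock

a, b = Replica(), Replica()
# A failed coordinator invalidated both followers, then crashed before validating.
a.on_invalidation("A", 3, (1, 0))
b.on_invalidation("A", 3, (1, 0))
# Follower a replays the write and unblocks reads everywhere.
a.replay("A", [b])
print(a.store["A"], b.store["A"])  # both end Valid with value 3
```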
63. Hermes recap
Broadcast + Invalidations + TS + early value propagation
Strong consistency
- through cache-coherence-inspired Invalidations
Fault tolerance
- write replays via early value propagation
High performance
- Local reads at all replicas
- High-performance writes: fast, decentralized, fully concurrent
In the paper: protocol details, RMWs, other goodies
69. Performance
[Figure: throughput (million requests/sec) at a 5% write ratio and write latency normalized to Hermes, comparing local reads only, concurrent writes + local reads, and high-performance writes + local reads]
Write performance matters even at low write ratios
Hermes: highest throughput & lowest latency
72. Conclusion
Hermes = Broadcast + Invalidations + TS + early value propagation
- Strong consistency
- Fault tolerance via write replays
- High performance
  - Local reads from all replicas
  - High-performance writes: fast, decentralized, fully concurrent
hermes-protocol.com: code available, TLA+ verification
Need reliability and performance? Choose Hermes!
Q&A