Hermes
A Fast, Fault-tolerant and Linearizable Replication Protocol

Antonios Katsarakis, V. Gavrielatos, S. Katebzadeh, A. Joshi*, B. Grot, V. Nagarajan, A. Dragojevic†
University of Edinburgh, *Intel, †Microsoft Research
hermes-protocol.com
Distributed Datastore
In-memory with read/write API
Backbone of online services
Need:
- High performance
- Fault tolerance
Mandates data replication
Replication 101
Typically 3 to 7 replicas
Consistency
- Weak: performance but nasty surprises
- Strong: programmable and intuitive
Reliable replication protocols
- Strong consistency even under faults
- Define the actions to execute reads & writes → these determine a datastore's performance
Can reliable protocols provide high performance?
Paxos
Gold standard: strong consistency and fault tolerance
Low performance
- reads → inter-replica communication
- writes → multiple RTTs over the network
Common-case performance (i.e., no faults) is as bad as worst-case (under faults)

State-of-the-art reliable protocols exploit failure-free operation for performance
Performance of state-of-the-art protocols
ZAB (leader-based)
- Local reads from all replicas → fast
- Writes serialize on the leader → low throughput
CRAQ (chain replication: head → … → tail)
- Local reads from all replicas → fast
- Writes traverse the length of the chain → high latency
Fast reads but poor write performance
Key protocol features for high performance
Goal: low latency + high throughput
Reads
- Local from all replicas
Writes
- Fast: minimize network hops (avoid long chain latencies)
- Decentralized: no serialization points (avoid write serialization at a leader)
- Fully concurrent: any replica can service a write
Existing replication protocols are deficient
Enter Hermes
Broadcast-based, invalidating replication protocol, inspired by multiprocessor cache-coherence protocols
Each replica tracks a per-object state: Valid or Invalid
Fault-free write operation (the coordinator is the replica servicing the write):
1. Coordinator broadcasts Invalidations
   - Once a replica is Invalidated, it serves no stale reads → strong consistency!
2. Followers acknowledge the Invalidation
3. Coordinator broadcasts Validations (commit)
   - All replicas can now serve reads for this object
Strongest consistency: linearizability
- Local reads from all replicas → a Valid object holds the latest value
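To make the three steps concrete, here is a minimal synchronous sketch of the fault-free write path. This is an illustration, not the paper's implementation: the class and message names are invented, ACKs are implicit in the synchronous calls, and delivering the value with the Validation is a simplifying assumption (the deck adds timestamps and early value propagation next).

```python
# Minimal sketch of Hermes' fault-free write path (illustrative only).
from dataclasses import dataclass
from enum import Enum

class State(Enum):
    VALID = "V"
    INVALID = "I"

@dataclass
class Entry:
    value: object
    state: State = State.VALID

class Replica:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.peers: list["Replica"] = []
        self.store: dict[str, Entry] = {}

    # --- coordinator side: any replica may coordinate a write ---
    def write(self, key: str, value) -> None:
        # 1. Invalidate the object everywhere; an Invalid copy serves
        #    no stale reads.
        self.store[key] = Entry(value, State.INVALID)
        acks = [p.on_invalidation(key) for p in self.peers]
        # 2. In this synchronous sketch every call returns an ACK;
        #    a real coordinator waits for ACKs from *all* replicas.
        assert all(acks)
        # 3. Validate (commit): ship the value and re-enable local reads.
        self.store[key].state = State.VALID
        for p in self.peers:
            p.on_validation(key, value)

    # --- follower side ---
    def on_invalidation(self, key: str) -> bool:
        old = self.store.get(key)
        self.store[key] = Entry(old.value if old else None, State.INVALID)
        return True  # ACK

    def on_validation(self, key: str, value) -> None:
        self.store[key] = Entry(value, State.VALID)

    def read(self, key: str):
        e = self.store.get(key)
        # A real replica would block/retry while Invalid; we return None.
        return e.value if e and e.state is State.VALID else None

# Three replicas; any of them can coordinate a write.
a, b, c = Replica(1), Replica(2), Replica(3)
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
a.write("A", 3)
assert b.read("A") == 3 and c.read("A") == 3   # local reads everywhere
```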
What about concurrent writes?
Concurrent writes = challenge
Challenge
How to efficiently order concurrent writes to an object?
Solution
Store a logical timestamp (TS) along with each object
- Upon a write: the coordinator increments the TS and sends it with its Invalidations
- Upon receiving an Invalidation: a follower updates the object's TS
- When two writes to the same object race: use the node ID to order them
Broadcast + Invalidations + TS → high-performance writes
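As a sketch of that ordering rule: treat the TS as a (version, node-ID) pair compared lexicographically, so every replica resolves a race between two coordinators identically. The names below are illustrative, not from the paper's code.

```python
# Sketch of Hermes-style write ordering with per-object logical timestamps.
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Timestamp:
    version: int   # incremented by the coordinator on each write
    node_id: int   # coordinator's ID; deterministic tie-breaker

def next_ts(current: Timestamp, my_node_id: int) -> Timestamp:
    """Coordinator: bump the version and stamp it with our node ID."""
    return Timestamp(current.version + 1, my_node_id)

def accept_invalidation(local_ts: Timestamp, inv_ts: Timestamp) -> bool:
    """Follower: advance the object's TS only if the incoming TS is higher.
    Comparison is lexicographic: version first, then node ID."""
    return inv_ts > local_ts

# Two replicas race to write object A (both saw version 4):
base = Timestamp(4, 0)
ts_n1 = next_ts(base, my_node_id=1)   # (5, 1)
ts_n2 = next_ts(base, my_node_id=2)   # (5, 2)
# Every replica resolves the race the same way: node 2's write wins.
assert accept_invalidation(ts_n1, ts_n2) and not accept_invalidation(ts_n2, ts_n1)
```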
Writes in Hermes
Broadcast + Invalidations + TS
1. Decentralized
- Fully distributed write ordering at the endpoints
2. Fully concurrent
- Any replica can coordinate a write
- Writes to different objects proceed in parallel
3. Fast
- Writes commit in 1 RTT
- Writes never abort
Awesome! But what about fault tolerance?
Handling faults in Hermes
Problem
A failure in the middle of a write can permanently leave a replica in the Invalid state
- e.g., the coordinator fails after Invalidating the followers: reads of the object block indefinitely
Idea
Allow any Invalidated replica to replay the write and unblock
How?
Insight: to replay a write, a replica needs
- the write's original TS (for ordering)
- the write value
The TS is sent with the Invalidation, but the write value is not
Solution: send the write value with the Invalidation, e.g., Inv(3, TS) for write(A=3) → early value propagation
Early value propagation enables write replays
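A minimal sketch of a write replay under early value propagation; failure detection and membership changes are assumed to be handled elsewhere, and every name here is hypothetical rather than the paper's API.

```python
# Sketch of a Hermes-style write replay (illustrative only). Because every
# Invalidation now carries the write's TS *and* its value (early value
# propagation), an Invalidated follower has everything it needs to take
# over a write whose coordinator died.
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Timestamp:
    version: int
    node_id: int

@dataclass
class PendingWrite:
    value: object
    ts: Timestamp      # the write's *original* TS, kept for ordering

class Follower:
    def __init__(self):
        self.pending: dict[str, PendingWrite] = {}   # Invalid objects

    def on_invalidation(self, key, value, ts):
        # Stash TS and value on every Invalidation: enough to replay later.
        self.pending[key] = PendingWrite(value, ts)

    def replay(self, key, broadcast_invalidation):
        """Suspecting the coordinator has failed, act as the new coordinator:
        re-broadcast the Invalidation with the original TS so the replayed
        write orders exactly like the original one (replays are idempotent),
        then proceed with ACKs and Validations as in the fault-free path."""
        w = self.pending[key]
        broadcast_invalidation(key, w.value, w.ts)

# Usage: the follower re-coordinates with the stashed (value, TS).
f = Follower()
f.on_invalidation("A", 3, Timestamp(5, 1))
f.replay("A", broadcast_invalidation=lambda k, v, ts: print("re-INV", k, v, ts))
```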
Hermes recap
Broadcast + Invalidations + TS + early value propagation
Strong consistency
- through cache-coherence-inspired Invalidations
Fault tolerance
- write replays via early value propagation
High performance
- Local reads at all replicas
- High-performance writes: fast, decentralized, fully concurrent
In the paper: protocol details, RMWs, other goodies
Evaluation
State-of-the-art hardware testbed
- 5 servers
- 2x 10-core Intel Xeon E5-2630 v4 per server
- 56 Gb/s InfiniBand NICs
KVS workload
- Uniform access distribution
- A million KV pairs (8 B keys, 32 B values)
Evaluated protocols: ZAB, CRAQ, Hermes
Performance
[Throughput figure (million requests/sec) vs. write ratio: Hermes (high-perf. writes + local reads), CRAQ (conc. writes + local reads), ZAB (local reads); annotated gaps: 40% and 4x]
Write performance matters even at low write ratios
[Write-latency figure at a 5% write ratio, normalized to Hermes: the annotated gap reaches 6x]
Hermes: highest throughput & lowest latency
Conclusion
Hermes
Broadcast + Invalidations + TS + early value propagation
- Strong consistency
- Fault tolerance via write replays
- High performance
  - Local reads from all replicas
  - High-performance writes: fast, decentralized, fully concurrent
hermes-protocol.com: code available, TLA+ verification
Need reliability and performance? Choose Hermes!
Q&A
