4. Distributed datastores
Data: in-memory, sharded across servers within a datacenter (DC)
Offer a single-object read/write or multi-object transactions API
Backbone of online services and cloud applications
Must provide:
- High performance
- Fault tolerance
→ Mandate data replication
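To make the read/write and transaction interfaces concrete, here is a minimal Python sketch of such a datastore API; the class and method names are illustrative assumptions, not taken from any specific system, and sharding/replication are elided.

    # Minimal sketch of an in-memory key-value datastore API (illustrative names).
    from contextlib import contextmanager

    class KVStore:
        def __init__(self):
            self._data = {}                     # in practice: sharded across servers in a DC

        def read(self, key):                    # single-object read
            return self._data.get(key)

        def write(self, key, value):            # single-object write
            self._data[key] = value

        @contextmanager
        def txn(self):                          # multi-object transaction (atomicity elided)
            staged = {}
            yield staged                        # caller stages writes into `staged`
            self._data.update(staged)           # commit: apply all staged writes together

    store = KVStore()
    store.write("A", 1)
    with store.txn() as t:                      # multi-object tx: read A, write B
        t["B"] = store.read("A") + 1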
7. Replication 101
Performance: a single node may not keep up with the load
Fault tolerance: data remain available despite failures
Typically 3 to 7 replicas
Consistency
- Weak: performance, but nasty surprises
- Strong: intuitive, broadest spectrum of apps
Replication protocols
- Strong consistency even under faults – if fault tolerant
- Define the actions that execute reads/writes or transactions (txs)
→ determine the datastore's performance
Can strongly consistent protocols offer fault tolerance and high performance?
10. Strongly consistent replication
Datastores: replication protocols
- data replicated across nodes
- Fault tolerance
- Performance:
  - reads/writes: sacrifice concurrency or speed
  - txs: cannot exploit locality
Multiprocessor: coherence / HTM
- data replicated across caches
- Fault tolerance
- Performance via invalidations (low-latency interconnect):
  - reads/writes: concurrency & speed
  - txs: fully exploit locality
Replication protocols inside a DC
- Network: fast, remote direct memory access (RDMA)
- Faults are rare within a replica group: a server fails at most twice a year
  → fault-free operation >> operation under faults
The common operation of replication protocols resembles the multiprocessor!
11. Thesis overview
Adapting multiprocessor-inspired invalidating protocols to intra-DC replicated datastores enables: strong consistency, fault tolerance, high performance.
Primary contributions
4 invalidating protocols → 3 most common replication uses in datastores
- Scale-out ccNUMA [Eurosys'18]: Galene protocol – performant read/write replication (skew)
- Hermes [ASPLOS'20]: Hermes protocol – fast fault-tolerant read/write replication
- Zeus [Eurosys'21]: Zeus ownership, Zeus reliable commit – replicated fault-tolerant distributed txs
14. Performant read/write replication for skew
Many workloads exhibit skewed data accesses
→ a few servers are overloaded, most are underutilized
State-of-the-art skew mitigation: distributes accesses across all servers and uses RDMA
- No locality: most requests need a remote access
  → increased latency, bottlenecked by network bandwidth
Symmetric caching: all servers cache the same hottest objects
- Throughput scales with the number of servers
- Less network bandwidth: most requests served locally
- Challenge: efficiently keep the caches consistent
Existing protocols: serialize writes at a physical point = hotspot
Galene protocol: invalidations + logical timestamps = fully distributed writes
[Figures: RDMA-based access distribution vs. symmetric caching and Galene]
100s of millions of ops/sec & up to 3x the state-of-the-art!
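A minimal sketch of the symmetric-caching idea stated above: every server caches the same hot objects, and a write from any server invalidates its peers, ordered by a per-object logical timestamp so there is no central serialization point. This is only an illustration of the mechanism; the names and message handling are assumptions, not Galene's actual interface.

    # Sketch: symmetric caching with invalidations + logical timestamps.
    # Any server may coordinate a write locally and invalidate its peers;
    # writes are ordered by a (counter, node_id) timestamp, so there is no
    # single serialization point. Names are illustrative assumptions.
    class HotObjectCache:
        def __init__(self, node_id, peers=()):
            self.node_id = node_id
            self.peers = list(peers)              # the other servers' caches
            self.cache = {}                       # key -> (value, timestamp, valid?)

        def read(self, key):
            value, _, valid = self.cache[key]
            if valid:
                return value                      # hot object served locally, no remote access
            raise RuntimeError("invalidated: wait for / fetch the newly written value")

        def write(self, key, value):
            _, (counter, _), _ = self.cache.get(key, (None, (0, 0), True))
            ts = (counter + 1, self.node_id)      # logical timestamp; node id breaks ties
            self.cache[key] = (value, ts, True)
            for peer in self.peers:
                peer.invalidate(key, ts)          # fully distributed writes: no hotspot

        def invalidate(self, key, ts):
            _, local_ts, _ = self.cache.get(key, (None, (0, 0), True))
            if ts > local_ts:                     # apply only if this write is newer
                self.cache[key] = (None, ts, False)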
16. Hmmm …
Invalidating protocols give good read/write performance when replicating under skew.
Can they maintain high read/write performance while also providing fault tolerance?
(reliable = strongly consistent + fault tolerant)
2nd primary contribution: Hermes!
What is the issue with existing reliable protocols?
18. Paxos
Gold standard: strong consistency and fault tolerance
Low performance:
- reads → inter-replica communication
- writes → multiple RTTs over the network
Common-case performance (i.e., no faults) is as bad as the worst case (under faults)
State-of-the-art replication protocols exploit failure-free operation for performance
21. Performance of state-of-the-art protocols
ZAB (leader-based)
- Local reads from all replicas → fast
- Writes serialize on the leader → low throughput
CRAQ (chain: head … tail)
- Local reads from all replicas → fast
- Writes traverse the length of the chain → high latency
Fast reads but poor write performance
25. Key protocol features for high performance
Goal: low latency + high throughput
Reads: local from all replicas
Writes:
- Fast: minimize network hops
- Decentralized: no serialization points
- Fully concurrent: any replica can service a write
In short: local reads from all replicas; fast, decentralized, fully concurrent writes
Existing replication protocols are deficient
29. Enter Hermes
Broadcast-based, invalidating replication protocol
Inspired by multiprocessor cache-coherence protocols
Fault-free operation of a write, e.g. write(A=3) (states of A: Valid, Invalid):
1. The coordinator (the replica servicing the write) broadcasts Invalidations
   At this point, no stale reads can be served → strong consistency!
2. Followers Acknowledge the Invalidation
3. The coordinator commits and broadcasts Validations
   All replicas can now serve reads for this object
Strongest consistency: linearizability
Local reads from all replicas → valid objects hold the latest value
[Figure: coordinator and followers exchanging Invalidation, Ack, and Validation messages for write(A=3)]
What about concurrent writes?
31. Concurrent writes = challenge
Challenge: how to efficiently order concurrent writes to an object?
Solution: store a logical timestamp (TS) along with each object
- Upon a write: the coordinator increments the TS and sends it with its Invalidations
- Upon receiving an Invalidation: a follower updates the object's TS
- When two writes to the same object race: use the node ID to order them
[Figure: two racing writes, write(A=3) and write(A=1), sending Inv(TS1) and Inv(TS4)]
Broadcast + Invalidations + TS → high-performance writes
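The ordering rule above can be sketched in a few lines of Python: each object carries a (counter, node id) timestamp, the coordinator bumps the counter for its write, and racing writes with equal counters are ordered by node id. The type and function names are illustrative, not the actual code.

    # Sketch: ordering concurrent writes with logical timestamps (TS).
    # A TS is a (counter, node_id) pair; tuple comparison means a higher counter
    # wins, and the node id breaks ties between racing writes.
    from typing import NamedTuple

    class TS(NamedTuple):
        counter: int
        node_id: int

    def next_ts(current: TS, coordinator_id: int) -> TS:
        # The coordinator increments the counter, tags the write with its node id,
        # and sends the new TS along with its Invalidations.
        return TS(current.counter + 1, coordinator_id)

    def newer(incoming: TS, local: TS) -> bool:
        # A follower applies an Invalidation (and updates the object's TS)
        # only if the incoming TS orders after its local one.
        return incoming > local

    # Two coordinators (node ids 1 and 4) race to write the same object:
    a = next_ts(TS(0, 0), 1)                 # TS(counter=1, node_id=1)
    b = next_ts(TS(0, 0), 4)                 # TS(counter=1, node_id=4)
    assert newer(b, a) and not newer(a, b)   # equal counters: node id 4 orders last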
33. Writes in Hermes (Broadcast + Invalidations + TS)
1. Decentralized: fully distributed write ordering at the endpoints
2. Fully concurrent: any replica can coordinate a write; writes to different objects proceed in parallel
3. Fast: commit in 1 RTT; never abort
Awesome! But what about fault tolerance?
37. Handling faults in Hermes
Problem: a failure in the middle of a write can permanently leave a replica in the Invalid state
Idea: allow any Invalidated replica to replay the write and unblock
How? Insight: to replay a write, a replica needs
- the write's original TS (for ordering)
- the write's value
The TS is sent with the Invalidation, but the write value is not.
Solution: send the write value with the Invalidation → early value propagation
[Figure: the coordinator fails after sending Inv(3,TS); an Invalidated follower replays the write and completes it, unblocking the pending read(A)]
Early value propagation enables write replays
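Putting the pieces together, here is a rough Python sketch of the write path with early value propagation and of the replay an Invalidated replica can perform if the coordinator fails. It assumes synchronous message delivery and no membership changes; the class and method names are illustrative, not Hermes' actual code.

    # Sketch: Hermes-style writes with early value propagation and write replay.
    # Invalidations carry both the new value and the TS, so any Invalidated
    # follower holds everything needed to replay an interrupted write.
    # Failure detection, retries, and membership changes are elided.
    VALID, INVALID = "V", "I"

    class Replica:
        def __init__(self, node_id, peers=()):
            self.node_id, self.peers = node_id, list(peers)
            self.store = {}                              # key -> [value, ts, state]

        def read(self, key):
            value, _, state = self.store[key]
            if state == INVALID:                         # coordinator suspected failed
                self._replay(key)                        # (in reality, after a timeout)
            return self.store[key][0]

        # --- coordinator side -------------------------------------------------
        def write(self, key, value):
            _, (counter, _), _ = self.store.get(key, [None, (0, 0), VALID])
            ts = (counter + 1, self.node_id)
            self.store[key] = [value, ts, INVALID]
            acks = [p.on_invalidation(key, value, ts)    # early value propagation:
                    for p in self.peers]                 # value travels with the INV
            if all(acks):
                self._validate(key, ts)                  # commit in 1 RTT

        # --- follower side ----------------------------------------------------
        def on_invalidation(self, key, value, ts):
            _, local_ts, _ = self.store.get(key, [None, (0, 0), VALID])
            if ts > local_ts:                            # TS (with node-id tiebreak) orders
                self.store[key] = [value, ts, INVALID]   # racing writes to the same object
            return True                                  # Ack

        def on_validation(self, key, ts):
            if key in self.store and self.store[key][1] == ts:
                self.store[key][2] = VALID               # object readable locally again

        # --- shared -------------------------------------------------------------
        def _validate(self, key, ts):
            self.store[key][2] = VALID
            for p in self.peers:
                p.on_validation(key, ts)

        def _replay(self, key):
            value, ts, _ = self.store[key]               # value + original TS already local
            if all(p.on_invalidation(key, value, ts) for p in self.peers):
                self._validate(key, ts)                  # the write completes despite the failure

In this sketch, a read that finds the object Invalid drives the pending write to completion rather than blocking forever, which is exactly what early value propagation makes possible.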
40. Performance
[Figures: throughput (million requests/sec) vs. % write ratio for local reads, concurrent writes + local reads, and high-performance writes + local reads, annotated 4x at 5% and 40% write ratios; write latency normalized to Hermes, annotated 6x]
Write performance matters even at low write ratios
Hermes: highest throughput & lowest latency
42. Hermes recap
Strong consistency: through multiprocessor-inspired Invalidations
Fault tolerance: write replays via early value propagation
High performance:
- local reads at all replicas
- high-performance writes: fast, decentralized, fully concurrent
In one line: Broadcast + Invalidations + TS + early value propagation
What about reliable txs? … 3rd primary contribution (1 slide)!
49. Reliable replicated transactions
Many tx workloads exhibit locality in their accesses
State-of-the-art datastores rely on static sharding
- Reliable txs regardless of access pattern
- Objects randomly sharded onto fixed nodes
  - remote accesses to execute
  - expensive distributed commit
  → costly txs that cannot exploit locality
[Figure: example tx "if (p) b++;" requiring remote accesses and a distributed commit; adapted from FaSST [OSDI'16]]
Zeus – locality-aware reliable txs:
- Each object has a node owner = data + exclusive write access; the owner changes dynamically
- The coordinator becomes the owner of all of a tx's objects → single-node commit
- Ownership stays with the coordinator → future txs make local accesses
- Reliable ownership (1.5 RTT): alters replica placement and access levels
- Reliable commit:
  - read-only txs: local from all replicas
  - fast write txs: pipelined, 1 RTT to commit
10s of millions of txs/sec & up to 2x the state-of-the-art!
Two invalidating protocols!
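A rough Python sketch of the locality-aware pattern described above: the coordinator first acquires ownership of every object the transaction touches, then executes and commits on a single node, and repeated transactions over the same objects stay entirely local. The reliable ownership (1.5 RTT) and reliable commit (1 RTT) protocols are hidden behind single calls, and all names are illustrative assumptions rather than Zeus' actual interface.

    # Sketch: Zeus-style locality-aware transaction execution.
    # The coordinator acquires ownership (data + exclusive write access) of each
    # object in the tx, then executes and commits on a single node. Ownership
    # moves dynamically, so repeated txs on the same objects become purely local.
    class Node:
        def __init__(self, node_id):
            self.node_id = node_id
            self.owned = {}                        # key -> value for objects this node owns

        def acquire_ownership(self, key, directory):
            if key not in self.owned:              # remote only on first access:
                owner = directory[key]             # reliable ownership change (1.5 RTT)
                self.owned[key] = owner.owned.pop(key)
                directory[key] = self
            return self.owned[key]

        def run_tx(self, keys, body, directory):
            snapshot = {k: self.acquire_ownership(k, directory) for k in keys}
            updates = body(dict(snapshot))         # execute locally on owned objects
            self.owned.update(updates)             # single-node commit (reliable, 1 RTT)
            return updates

    # Example: the tx "if (p) b++;" run twice from node n1.
    n0, n1 = Node(0), Node(1)
    n0.owned = {"p": True, "b": 7}
    directory = {"p": n0, "b": n0}                 # who currently owns each object

    def tx_body(objs):
        return {"b": objs["b"] + 1} if objs["p"] else {}

    n1.run_tx(["p", "b"], tx_body, directory)      # first run: pulls ownership of p and b
    n1.run_tx(["p", "b"], tx_body, directory)      # second run: all accesses are local
    assert n1.owned["b"] == 9 and not n0.owned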
51. Thesis summary
Replicated datastores powered by multiprocessor-inspired invalidating protocols can deliver strong consistency, fault tolerance, and high performance.
4 invalidating protocols → 3 most common replication uses in datastores
- High performance (10s–100s M ops/sec)
- Strong consistency under concurrency & faults (formally verified in TLA+)
Scale-out ccNUMA [Eurosys'18]: Galene protocol – performant read/write replication for skew
Hermes [ASPLOS'20]: Hermes protocol – fast reliable read/write replication
Zeus [Eurosys'21]: Zeus ownership, Zeus reliable commit – locality-aware reliable txs with dynamic sharding
Is this the end ??
Is this the end ??
52. Follow up research
52
• The L2AW theorem
[to be submitted]
• Hardware offloading
• Replication across datacenters
• Single-shot reliable writes from external clients
• Non-blocking reconfiguration on node crashes
…
53. Follow-up research
• The L2AW theorem [to be submitted]
• Hardware offloading
• Replication across datacenters
• Single-shot reliable writes from external clients
• Non-blocking reconfiguration on node crashes
• …
Thank you! Questions?