4. Distributed datastores
Data: in-memory, sharded across servers within a datacenter (DC)
Offer a single-object read/write or multi-object transactions API
Backbone of online services and cloud applications
Must provide:
- High performance
- Fault tolerance
→ Mandate data replication
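To make the read/write and transaction interfaces concrete, here is a minimal Python sketch of such a datastore API; the class and method names are illustrative assumptions, not taken from any specific system, and sharding/replication are elided.

    # Minimal sketch of an in-memory key-value datastore API (illustrative names).
    from contextlib import contextmanager

    class KVStore:
        def __init__(self):
            self._data = {}                     # in practice: sharded across servers in a DC

        def read(self, key):                    # single-object read
            return self._data.get(key)

        def write(self, key, value):            # single-object write
            self._data[key] = value

        @contextmanager
        def txn(self):                          # multi-object transaction (atomicity elided)
            staged = {}
            yield staged                        # caller stages writes into `staged`
            self._data.update(staged)           # commit: apply all staged writes together

    store = KVStore()
    store.write("A", 1)
    with store.txn() as t:                      # multi-object tx: read A, write B
        t["B"] = store.read("A") + 1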
7. Replication 101
Performance: a single node may not keep up with the load
Fault tolerance: data remain available despite failures
Typically 3 to 7 replicas
Consistency
- Weak: performance, but nasty surprises
- Strong: intuitive, broadest spectrum of apps
Replication protocols
- Strong consistency even under faults – if fault tolerant
- Define the actions that execute reads/writes or transactions (txs)
→ determine the datastore's performance
Can strongly consistent protocols offer fault tolerance and high performance?
10. Strongly consistent replication
Datastores: replication protocols
- data replicated across nodes
- Fault tolerance
- Performance:
  - reads/writes: sacrifice concurrency or speed
  - txs: cannot exploit locality
Multiprocessor: coherence / HTM
- data replicated across caches
- Fault tolerance
- Performance via invalidations (low-latency interconnect):
  - reads/writes: concurrency & speed
  - txs: fully exploit locality
Replication protocols inside a DC
- Network: fast, remote direct memory access (RDMA)
- Faults are rare within a replica group: a server fails at most twice a year
  → fault-free operation >> operation under faults
The common operation of replication protocols resembles the multiprocessor!
11. Thesis overview
Adapting multiprocessor-inspired invalidating protocols to intra-DC replicated datastores enables: strong consistency, fault tolerance, high performance.
Primary contributions
4 invalidating protocols → 3 most common replication uses in datastores
- Scale-out ccNUMA [Eurosys'18]: Galene protocol – performant read/write replication (skew)
- Hermes [ASPLOS'20]: Hermes protocol – fast fault-tolerant read/write replication
- Zeus [Eurosys'21]: Zeus ownership, Zeus reliable commit – replicated fault-tolerant distributed txs
14. Performant read/write replication for skew
Many workloads exhibit skewed data accesses
→ a few servers are overloaded, most are underutilized
State-of-the-art skew mitigation: distributes accesses across all servers and uses RDMA
- No locality: most requests need a remote access
  → increased latency, bottlenecked by network bandwidth
Symmetric caching: all servers cache the same hottest objects
- Throughput scales with the number of servers
- Less network bandwidth: most requests served locally
- Challenge: efficiently keep the caches consistent
Existing protocols: serialize writes at a physical point = hotspot
Galene protocol: invalidations + logical timestamps = fully distributed writes
[Figures: RDMA-based access distribution vs. symmetric caching and Galene]
100s of millions of ops/sec & up to 3x the state-of-the-art!
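A minimal sketch of the symmetric-caching idea stated above: every server caches the same hot objects, and a write from any server invalidates its peers, ordered by a per-object logical timestamp so there is no central serialization point. This is only an illustration of the mechanism; the names and message handling are assumptions, not Galene's actual interface.

    # Sketch: symmetric caching with invalidations + logical timestamps.
    # Any server may coordinate a write locally and invalidate its peers;
    # writes are ordered by a (counter, node_id) timestamp, so there is no
    # single serialization point. Names are illustrative assumptions.
    class HotObjectCache:
        def __init__(self, node_id, peers=()):
            self.node_id = node_id
            self.peers = list(peers)              # the other servers' caches
            self.cache = {}                       # key -> (value, timestamp, valid?)

        def read(self, key):
            value, _, valid = self.cache[key]
            if valid:
                return value                      # hot object served locally, no remote access
            raise RuntimeError("invalidated: wait for / fetch the newly written value")

        def write(self, key, value):
            _, (counter, _), _ = self.cache.get(key, (None, (0, 0), True))
            ts = (counter + 1, self.node_id)      # logical timestamp; node id breaks ties
            self.cache[key] = (value, ts, True)
            for peer in self.peers:
                peer.invalidate(key, ts)          # fully distributed writes: no hotspot

        def invalidate(self, key, ts):
            _, local_ts, _ = self.cache.get(key, (None, (0, 0), True))
            if ts > local_ts:                     # apply only if this write is newer
                self.cache[key] = (None, ts, False)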
16. Hmmm …
Invalidating protocols give good read/write performance when replicating under skew.
Can they maintain high read/write performance while also providing fault tolerance?
(reliable = strongly consistent + fault tolerant)
2nd primary contribution: Hermes!
What is the issue with existing reliable protocols?
18. Paxos
Gold standard: strong consistency and fault tolerance
Low performance:
- reads → inter-replica communication
- writes → multiple RTTs over the network
Common-case performance (i.e., no faults) is as bad as the worst case (under faults)
State-of-the-art replication protocols exploit failure-free operation for performance
21. Performance of state-of-the-art protocols
ZAB (leader-based)
- Local reads from all replicas → fast
- Writes serialize on the leader → low throughput
CRAQ (chain: head … tail)
- Local reads from all replicas → fast
- Writes traverse the length of the chain → high latency
Fast reads but poor write performance
25. Key protocol features for high performance
Goal: low latency + high throughput
Reads: local from all replicas
Writes:
- Fast: minimize network hops
- Decentralized: no serialization points
- Fully concurrent: any replica can service a write
In short: local reads from all replicas; fast, decentralized, fully concurrent writes
Existing replication protocols are deficient
29. Enter Hermes
Broadcast-based, invalidating replication protocol
Inspired by multiprocessor cache-coherence protocols
Fault-free operation of a write, e.g. write(A=3) (states of A: Valid, Invalid):
1. The coordinator (the replica servicing the write) broadcasts Invalidations
   At this point, no stale reads can be served → strong consistency!
2. Followers Acknowledge the Invalidation
3. The coordinator commits and broadcasts Validations
   All replicas can now serve reads for this object
Strongest consistency: linearizability
Local reads from all replicas → valid objects hold the latest value
[Figure: coordinator and followers exchanging Invalidation, Ack, and Validation messages for write(A=3)]
What about concurrent writes?
31. Concurrent writes = challenge
Challenge: how to efficiently order concurrent writes to an object?
Solution: store a logical timestamp (TS) along with each object
- Upon a write: the coordinator increments the TS and sends it with its Invalidations
- Upon receiving an Invalidation: a follower updates the object's TS
- When two writes to the same object race: use the node ID to order them
[Figure: two racing writes, write(A=3) and write(A=1), sending Inv(TS1) and Inv(TS4)]
Broadcast + Invalidations + TS → high-performance writes
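The ordering rule above can be sketched in a few lines of Python: each object carries a (counter, node id) timestamp, the coordinator bumps the counter for its write, and racing writes with equal counters are ordered by node id. The type and function names are illustrative, not the actual code.

    # Sketch: ordering concurrent writes with logical timestamps (TS).
    # A TS is a (counter, node_id) pair; tuple comparison means a higher counter
    # wins, and the node id breaks ties between racing writes.
    from typing import NamedTuple

    class TS(NamedTuple):
        counter: int
        node_id: int

    def next_ts(current: TS, coordinator_id: int) -> TS:
        # The coordinator increments the counter, tags the write with its node id,
        # and sends the new TS along with its Invalidations.
        return TS(current.counter + 1, coordinator_id)

    def newer(incoming: TS, local: TS) -> bool:
        # A follower applies an Invalidation (and updates the object's TS)
        # only if the incoming TS orders after its local one.
        return incoming > local

    # Two coordinators (node ids 1 and 4) race to write the same object:
    a = next_ts(TS(0, 0), 1)                 # TS(counter=1, node_id=1)
    b = next_ts(TS(0, 0), 4)                 # TS(counter=1, node_id=4)
    assert newer(b, a) and not newer(a, b)   # equal counters: node id 4 orders last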
33. Writes in Hermes (Broadcast + Invalidations + TS)
1. Decentralized: fully distributed write ordering at the endpoints
2. Fully concurrent: any replica can coordinate a write; writes to different objects proceed in parallel
3. Fast: commit in 1 RTT; never abort
Awesome! But what about fault tolerance?
37. Handling faults in Hermes
Problem: a failure in the middle of a write can permanently leave a replica in the Invalid state
Idea: allow any Invalidated replica to replay the write and unblock
How? Insight: to replay a write, a replica needs
- the write's original TS (for ordering)
- the write's value
The TS is sent with the Invalidation, but the write value is not.
Solution: send the write value with the Invalidation → early value propagation
[Figure: the coordinator fails after sending Inv(3,TS); an Invalidated follower replays the write and completes it, unblocking the pending read(A)]
Early value propagation enables write replays
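Putting the pieces together, here is a rough Python sketch of the write path with early value propagation and of the replay an Invalidated replica can perform if the coordinator fails. It assumes synchronous message delivery and no membership changes; the class and method names are illustrative, not Hermes' actual code.

    # Sketch: Hermes-style writes with early value propagation and write replay.
    # Invalidations carry both the new value and the TS, so any Invalidated
    # follower holds everything needed to replay an interrupted write.
    # Failure detection, retries, and membership changes are elided.
    VALID, INVALID = "V", "I"

    class Replica:
        def __init__(self, node_id, peers=()):
            self.node_id, self.peers = node_id, list(peers)
            self.store = {}                              # key -> [value, ts, state]

        def read(self, key):
            value, _, state = self.store[key]
            if state == INVALID:                         # coordinator suspected failed
                self._replay(key)                        # (in reality, after a timeout)
            return self.store[key][0]

        # --- coordinator side -------------------------------------------------
        def write(self, key, value):
            _, (counter, _), _ = self.store.get(key, [None, (0, 0), VALID])
            ts = (counter + 1, self.node_id)
            self.store[key] = [value, ts, INVALID]
            acks = [p.on_invalidation(key, value, ts)    # early value propagation:
                    for p in self.peers]                 # value travels with the INV
            if all(acks):
                self._validate(key, ts)                  # commit in 1 RTT

        # --- follower side ----------------------------------------------------
        def on_invalidation(self, key, value, ts):
            _, local_ts, _ = self.store.get(key, [None, (0, 0), VALID])
            if ts > local_ts:                            # TS (with node-id tiebreak) orders
                self.store[key] = [value, ts, INVALID]   # racing writes to the same object
            return True                                  # Ack

        def on_validation(self, key, ts):
            if key in self.store and self.store[key][1] == ts:
                self.store[key][2] = VALID               # object readable locally again

        # --- shared -------------------------------------------------------------
        def _validate(self, key, ts):
            self.store[key][2] = VALID
            for p in self.peers:
                p.on_validation(key, ts)

        def _replay(self, key):
            value, ts, _ = self.store[key]               # value + original TS already local
            if all(p.on_invalidation(key, value, ts) for p in self.peers):
                self._validate(key, ts)                  # the write completes despite the failure

In this sketch, a read that finds the object Invalid drives the pending write to completion rather than blocking forever, which is exactly what early value propagation makes possible.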
40. Performance
[Figures: throughput (million requests/sec) vs. % write ratio for local reads, concurrent writes + local reads, and high-performance writes + local reads, annotated 4x at 5% and 40% write ratios; write latency normalized to Hermes, annotated 6x]
Write performance matters even at low write ratios
Hermes: highest throughput & lowest latency
42. Hermes recap
Strong consistency: through multiprocessor-inspired Invalidations
Fault tolerance: write replays via early value propagation
High performance:
- local reads at all replicas
- high-performance writes: fast, decentralized, fully concurrent
In one line: Broadcast + Invalidations + TS + early value propagation
What about reliable txs? … 3rd primary contribution (1 slide)!
49. Reliable replicated transactions
Many tx workloads exhibit locality in their accesses
State-of-the-art datastores rely on static sharding
- Reliable txs regardless of access pattern
- Objects randomly sharded onto fixed nodes
  - remote accesses to execute
  - expensive distributed commit
  → costly txs that cannot exploit locality
[Figure: example tx "if (p) b++;" requiring remote accesses and a distributed commit; adapted from FaSST [OSDI'16]]
Zeus – locality-aware reliable txs:
- Each object has a node owner = data + exclusive write access; the owner changes dynamically
- The coordinator becomes the owner of all of a tx's objects → single-node commit
- Ownership stays with the coordinator → future txs make local accesses
- Reliable ownership (1.5 RTT): alters replica placement and access levels
- Reliable commit:
  - read-only txs: local from all replicas
  - fast write txs: pipelined, 1 RTT to commit
10s of millions of txs/sec & up to 2x the state-of-the-art!
Two invalidating protocols!
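A rough Python sketch of the locality-aware pattern described above: the coordinator first acquires ownership of every object the transaction touches, then executes and commits on a single node, and repeated transactions over the same objects stay entirely local. The reliable ownership (1.5 RTT) and reliable commit (1 RTT) protocols are hidden behind single calls, and all names are illustrative assumptions rather than Zeus' actual interface.

    # Sketch: Zeus-style locality-aware transaction execution.
    # The coordinator acquires ownership (data + exclusive write access) of each
    # object in the tx, then executes and commits on a single node. Ownership
    # moves dynamically, so repeated txs on the same objects become purely local.
    class Node:
        def __init__(self, node_id):
            self.node_id = node_id
            self.owned = {}                        # key -> value for objects this node owns

        def acquire_ownership(self, key, directory):
            if key not in self.owned:              # remote only on first access:
                owner = directory[key]             # reliable ownership change (1.5 RTT)
                self.owned[key] = owner.owned.pop(key)
                directory[key] = self
            return self.owned[key]

        def run_tx(self, keys, body, directory):
            snapshot = {k: self.acquire_ownership(k, directory) for k in keys}
            updates = body(dict(snapshot))         # execute locally on owned objects
            self.owned.update(updates)             # single-node commit (reliable, 1 RTT)
            return updates

    # Example: the tx "if (p) b++;" run twice from node n1.
    n0, n1 = Node(0), Node(1)
    n0.owned = {"p": True, "b": 7}
    directory = {"p": n0, "b": n0}                 # who currently owns each object

    def tx_body(objs):
        return {"b": objs["b"] + 1} if objs["p"] else {}

    n1.run_tx(["p", "b"], tx_body, directory)      # first run: pulls ownership of p and b
    n1.run_tx(["p", "b"], tx_body, directory)      # second run: all accesses are local
    assert n1.owned["b"] == 9 and not n0.owned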
51. Thesis summary
Replicated datastores powered by multiprocessor-inspired invalidating protocols can deliver strong consistency, fault tolerance, and high performance.
4 invalidating protocols → 3 most common replication uses in datastores
- High performance (10s–100s M ops/sec)
- Strong consistency under concurrency & faults (formally verified in TLA+)
Scale-out ccNUMA [Eurosys'18]: Galene protocol – performant read/write replication for skew
Hermes [ASPLOS'20]: Hermes protocol – fast reliable read/write replication
Zeus [Eurosys'21]: Zeus ownership, Zeus reliable commit – locality-aware reliable txs with dynamic sharding
Is this the end ??
Is this the end ??
52. Follow up research
52
• The L2AW theorem
[to be submitted]
• Hardware offloading
• Replication across datacenters
• Single-shot reliable writes from external clients
• Non-blocking reconfiguration on node crashes
…
53. Follow-up research
• The L2AW theorem [to be submitted]
• Hardware offloading
• Replication across datacenters
• Single-shot reliable writes from external clients
• Non-blocking reconfiguration on node crashes
• …
Thank you! Questions?