Invalidation-Based Protocols
for Replicated Datastores
Antonios Katsarakis
Doctor of Philosophy
The University of Edinburgh
Distributed datastores
Data: in-memory, sharded across servers within a datacenter (DC)
Offer a single-object read/write or multi-object transactions API
Backbone of online services and cloud applications
Must provide:
- High performance
- Fault tolerance
[Figure: a distributed datastore]
These requirements mandate data replication.
Replication 101
Performance: a single node may not keep up with the load
Fault tolerance: data remain available despite failures
Typically 3 to 7 replicas

Consistency
- Weak: performance, but nasty surprises
- Strong: intuitive, broadest spectrum of apps

Replication protocols
- Strong consistency even under faults, if fault tolerant
- Define the actions to execute reads/writes or transactions (txs)
  → determine the datastore's performance
[Figure: a replication protocol coordinating the replicas]
Can strongly consistent protocols offer
fault tolerance and high performance?
Strongly consistent replication

Datastores: replication protocols
- data replicated across nodes
- Fault tolerance: yes
- Performance: limited
  - reads/writes: sacrifice concurrency or speed
  - txs: cannot exploit locality

Multiprocessor: coherence / HTM
- data replicated across caches
- Fault tolerance: no
- Performance: high, via Invalidations (low-latency interconnect)
  - reads/writes: concurrency & speed
  - txs: fully exploit locality

Replication protocols inside a DC
- Network: fast, remote direct memory access (RDMA)
- Faults are rare within a replica group
  - a server fails at most twice a year
  - fault-free operation >> operation under faults
The common operation of replication protocols
resembles the multiprocessor!
Thesis overview
Adapting multiprocessor-inspired invalidating protocols to intra-DC replicated datastores enables: strong consistency, fault tolerance, high performance

Primary contributions
4 invalidating protocols → 3 most common replication uses in datastores
- Scale-out ccNUMA [Eurosys'18]: Galene protocol, for performant read/write replication under skew (1-slide summary)
- Hermes [ASPLOS'20]: Hermes protocol, for fast fault-tolerant read/write replication (N slides)
- Zeus [Eurosys'21]: Zeus ownership and Zeus reliable commit, for replicated fault-tolerant distributed txs (1-slide summary)
Performant read/write replication for skew

Many workloads exhibit skewed data accesses
- a few servers are overloaded, most are underutilized

State-of-the-art skew mitigation
- distributes accesses across all servers & uses RDMA
- No locality: most requests need remote access
  → increased latency, bottlenecked by network bandwidth

Symmetric caching
- all servers keep a cache of the same hottest objects
- Throughput scales with the number of servers
- Less network bandwidth: most requests are served locally
- Challenge: efficiently keep the caches consistent

Existing protocols
- serialize writes at a physical point = hotspot

Galene protocol
- invalidations + logical timestamps = fully distributed writes
[Figure: symmetric caching and Galene over RDMA]

100s of millions of ops/sec & up to 3x the state-of-the-art!
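Below is a minimal sketch of the symmetric-caching read path described above. It is illustrative only: the Node class, the hot_set, and the rdma_read helper are hypothetical names, and Galene's consistency machinery (invalidations plus logical timestamps for fully distributed writes, similar in spirit to the Hermes sketches later) is omitted.

```python
# Sketch of a symmetric-caching read path (assumptions: Node, hot_set, rdma_read).
class Node:
    def __init__(self, node_id, num_nodes, hot_set):
        self.node_id = node_id
        self.num_nodes = num_nodes
        self.hot_set = hot_set   # hottest objects, cached by every server
        self.cache = {}          # local copies of the hot objects
        self.store = {}          # this server's shard of the remaining data

    def home_of(self, key):
        # static sharding for the non-hot objects
        return hash(key) % self.num_nodes

    def read(self, key, rdma_read):
        if key in self.hot_set and key in self.cache:
            return self.cache[key]                  # local hit: no network
        if self.home_of(key) == self.node_id:
            return self.store.get(key)              # object lives on this shard
        return rdma_read(self.home_of(key), key)    # one-sided remote access
```

With every server caching the same hot objects, most requests under skew become local hits, which is why throughput can scale with the number of servers; the hard part, as the slide notes, is keeping those caches consistent on writes.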
Hmmm …
Invalidating protocols give good read/write performance when replicating under skew.
Can they maintain high read/write performance while also providing fault tolerance?
(reliable = strongly consistent + fault tolerant)

2nd primary contribution: Hermes!
What is the issue with existing reliable protocols?
Paxos
Golden standard: strong consistency and fault tolerance
Low performance
- reads → inter-replica communication
- writes → multiple RTTs over the network
Common-case performance (i.e., no faults) is as bad as the worst case (under faults)
State-of-the-art replication protocols exploit
failure-free operation for performance
Performance of state-of-the-art protocols

ZAB
- Local reads from all replicas → Fast
- Writes serialize on the leader → Low throughput

CRAQ (Head … Tail chain)
- Local reads from all replicas → Fast
- Writes traverse the length of the chain → High latency

[Figure: ZAB leader broadcast vs. CRAQ chain; legend: write, read, bcast, ucast]

Fast reads but poor write performance
Key protocol features for high performance
Goal: low latency + high throughput

Reads
- Local from all replicas

Writes
- Fast: minimize network hops, avoid long latencies (no Head→Tail chain)
- Decentralized: no serialization points (no leader)
- Fully concurrent: any replica can service a write

Local reads from all replicas + fast, decentralized, fully concurrent writes
→ Existing replication protocols are deficient
Enter Hermes
Broadcast-based, invalidating replication protocol
Inspired by multiprocessor cache-coherence protocols
States of an object A: Valid, Invalid

Fault-free operation of a write, e.g., write(A=3):
1. The Coordinator (the replica servicing the write) broadcasts Invalidations
   - followers move A to Invalid; from this point no stale reads can be served → strong consistency!
2. Followers Acknowledge the Invalidation
3. The Coordinator commits and broadcasts Validations
   - all replicas return A to Valid and can again serve reads for this object

Strongest consistency: Linearizability
- Local reads from all replicas → a valid object holds the latest value

[Figure: Coordinator and Followers exchanging Invalidation, Ack, and Validation messages; A transitions I → V; commit]

What about concurrent writes?
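To make the three steps concrete, here is a minimal Python sketch of the coordinator's write path and the local-read rule. It is an illustration of the slides, not the thesis implementation: the transport helper (send/gather_acks), the simplified behaviour of reads on Invalid objects, and all class and field names are assumptions.

```python
# Sketch of Hermes' fault-free write path (assumptions: transport helper, field names).
VALID, INVALID = "V", "I"

class HermesReplica:
    def __init__(self, node_id, peers, transport):
        self.node_id = node_id
        self.peers = peers          # the other replicas in the group
        self.transport = transport  # hypothetical helper: send() / gather_acks()
        self.obj = {}               # key -> {"value", "ts", "state"}

    # Coordinator side: any replica may coordinate a write.
    def write(self, key, value):
        entry = self.obj.setdefault(
            key, {"value": None, "ts": (0, self.node_id), "state": VALID})
        ts = (entry["ts"][0] + 1, self.node_id)   # higher logical timestamp
        entry.update(value=value, ts=ts, state=INVALID)
        # 1. broadcast Invalidations; the new value is piggybacked
        #    (early value propagation, used later for write replays)
        self.transport.send(self.peers, ("INV", key, value, ts))
        # 2. wait for Acks from the followers
        self.transport.gather_acks(self.peers, key, ts)
        # 3. commit locally and broadcast Validations
        entry["state"] = VALID
        self.transport.send(self.peers, ("VAL", key, ts))

    # Any replica: reads are purely local.
    def read(self, key):
        entry = self.obj.get(key)
        if entry is not None and entry["state"] == VALID:
            return entry["value"]   # a Valid copy holds the latest value
        # Invalid: in Hermes the read stalls until the Validation arrives
        # (or the write is replayed); omitted in this sketch.
        return None
```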
Concurrent writes = challenge
Challenge
- How to efficiently order concurrent writes to an object?

Solution
- Store a logical timestamp (TS) along with each object
- Upon a write: the coordinator increments the TS and sends it with its Invalidations
- Upon receiving an Invalidation: a follower updates the object's TS
- When two writes to the same object race: use the node ID to order them
[Figure: two racing writes, write(A=3) with Inv(TS1) and write(A=1) with Inv(TS4)]

Broadcast + Invalidations + TS → high-performance writes
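A sketch of the follower-side ordering rule follows, assuming the logical timestamp is represented as a (counter, node_id) pair so that Python's tuple comparison yields the "higher TS wins, node ID breaks ties" order from the slide; the entry layout matches the earlier sketch, and the message format is an assumption.

```python
# Sketch of write ordering at a follower (assumption: TS is a (counter, node_id) tuple).
INVALID = "I"

def newer(ts_a, ts_b):
    """True if logical timestamp ts_a orders after ts_b.
    Tuple compare: counter first, then node ID as the tie-breaker."""
    return ts_a > ts_b

def on_invalidation(replica, key, value, ts):
    """Follower-side handler for an Invalidation carrying (value, ts)."""
    entry = replica.obj.setdefault(
        key, {"value": None, "ts": (0, 0), "state": "V"})
    if newer(ts, entry["ts"]):
        # Adopt the higher-timestamped write (value kept for possible replays).
        entry.update(value=value, ts=ts, state=INVALID)
    # A racing, lower-timestamped Invalidation is ignored; its Validation
    # will carry a stale TS and be ignored as well.
    return ("ACK", key, entry["ts"])   # ack returned to the coordinator
```

Because ordering is decided entirely by comparing (TS, node ID) pairs at the endpoints, no central serialization point is needed, which is what makes the writes fully distributed.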
Writes in Hermes
Broadcast + Invalidations + TS
1. Decentralized
   - Fully distributed write ordering at the endpoints
2. Fully concurrent
   - Any replica can coordinate a write
   - Writes to different objects proceed in parallel
3. Fast
   - Commit in 1 RTT
   - Never abort

Awesome! But what about fault tolerance?
Handling faults in Hermes

Problem
- A failure in the middle of a write can permanently leave a replica in the Invalid state

Idea
- Allow any Invalidated replica to replay the write and unblock. How?

Insight: to replay a write, a replica needs
- the write's original TS (for ordering)
- the write value
The TS is sent with the Invalidation, but the write value is not.

Solution: send the write value with the Invalidation → early value propagation

[Figure: the coordinator of write(A=3) fails after sending Inv(3, TS); an Invalid follower that must serve read(A) replays the write by re-broadcasting Inv(3, TS), all replicas validate, and the write completes]

Early value propagation enables write replays
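A sketch of the replay step, reusing the HermesReplica layout from the earlier sketch; the failure-detection trigger and the transport helpers are placeholders, and only the replay logic itself follows the slide.

```python
# Sketch of a write replay after a coordinator failure (early value propagation).
VALID, INVALID = "V", "I"

def replay_write(replica, key):
    """Run by an Invalid replica once the original coordinator is suspected
    to have failed, e.g., because a local read(A) is blocked on state I."""
    entry = replica.obj[key]
    assert entry["state"] == INVALID
    # Both ingredients of the replay are already local, thanks to
    # early value propagation: the write value and its original TS.
    value, ts = entry["value"], entry["ts"]
    # Re-issue the write with the ORIGINAL timestamp so that every replica
    # orders the replayed write exactly like the interrupted one.
    replica.transport.send(replica.peers, ("INV", key, value, ts))
    replica.transport.gather_acks(replica.peers, key, ts)
    entry["state"] = VALID
    replica.transport.send(replica.peers, ("VAL", key, ts))
    return value   # the blocked read(A) can now complete locally
```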
Evaluation

Evaluated protocols: ZAB, CRAQ, Hermes

State-of-the-art hardware testbed
- 5 servers
- 56 Gb/s InfiniBand NICs
- 2x 10-core Intel Xeon E5-2630 v4 per server

KVS workload
- Uniform access distribution
- Million key-value pairs: 8 B keys, 32 B values
Performance

[Figure: throughput (million requests/sec) and write latency (normalized to Hermes) across write ratios for ZAB, CRAQ, and Hermes; series labels: high-perf. writes + local reads, conc. writes + local reads, local reads; plot annotations: 4x and 40% near the 5% write ratio (throughput), 6x (write latency)]

Write performance matters even at low write ratios
Hermes: highest throughput & lowest latency
Hermes recap
Broadcast + Invalidations + TS + early value propagation

Strong Consistency
- through multiprocessor-inspired Invalidations
Fault tolerance
- write replays via early value propagation
High Performance
- Local reads at all replicas
- High-performance writes: fast, decentralized, fully concurrent

[Figure: Coordinator and Followers; write(A=3) committed via Inv(3, TS), object state I → V]

What about reliable txs? … 3rd primary contribution (1-slide)!
Reliable replicated transactions

Many tx workloads exhibit locality in their accesses, yet state-of-the-art datastores rely on static sharding.

Reliable txs regardless of access pattern (example adapted from FaSST [OSDI'16], e.g., tx: if (p) b++;)
- Objects are randomly sharded on fixed nodes
- remote accesses to execute
- expensive distributed commit
→ costly txs that cannot exploit locality

Zeus: locality-aware reliable txs
- Each object has a node owner = data + exclusive write access, and the owner changes dynamically
- The coordinator becomes the owner of all of a tx's objects → single-node commit
- Ownership stays with the coordinator → future txs run with local accesses
- Reliable ownership (1.5 RTT): alters replica placement and access levels
- Reliable commit:
  - read-only txs: local from all replicas
  - fast write txs: pipelined, 1 RTT to commit

10s of millions of txs/sec & up to 2x the state-of-the-art!
Two Invalidating protocols!
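To make the localize-then-commit flow concrete, here is a minimal sketch under stated assumptions: owns, acquire_ownership, local_read, and replicate_commit are hypothetical stand-ins for Zeus' reliable ownership (1.5 RTT) and pipelined reliable commit (1 RTT) protocols, which the slide only summarizes.

```python
# Sketch of Zeus-style locality-aware transaction execution (helper names are assumptions).
def run_tx(coordinator, tx_keys, tx_logic):
    # 1. Localize: become the owner of every object the tx touches.
    for key in tx_keys:
        if not coordinator.owns(key):
            coordinator.acquire_ownership(key)   # reliable ownership, ~1.5 RTT
    # 2. Execute against local copies only: no remote reads, no distributed commit.
    writes = tx_logic({k: coordinator.local_read(k) for k in tx_keys})
    # 3. Commit: replicate the write-set to the replicas in one pipelined RTT.
    coordinator.replicate_commit(writes)
    # Ownership stays with this node, so a subsequent tx over the same
    # objects runs with purely local accesses.

# Hypothetical usage, mirroring the slide's "tx: if (p) b++;":
# run_tx(node, ["p", "b"], lambda v: {"b": v["b"] + 1} if v["p"] else {})
```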
Thesis summary

Replicated datastores powered by multiprocessor-inspired invalidating protocols can deliver: strong consistency, fault tolerance, high performance

4 invalidating protocols → 3 most common replication uses in datastores
- High performance (10s–100s of millions of ops/sec)
- Strong consistency under concurrency & faults (formally verified in TLA+)

Scale-out ccNUMA [Eurosys'18]: Galene protocol (performant read/write replication for skew)
Hermes [ASPLOS'20]: Hermes protocol (fast reliable read/write replication)
Zeus [Eurosys'21]: Zeus ownership + Zeus reliable commit (locality-aware reliable txs with dynamic sharding)

Is this the end ??
Follow-up research
• The L2AW theorem [to be submitted]
• Hardware offloading
• Replication across datacenters
• Single-shot reliable writes from external clients
• Non-blocking reconfiguration on node crashes
…
Thank you! Questions?