RICON keynote: outwards from the middle of the maze

slides from my RICON keynote

  1. 1. Outwards from the middle of the maze Peter Alvaro UC Berkeley
  2. 2. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  3. 3. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  4. 4. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  5. 5. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  6. 6. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
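     A minimal runnable sketch of the DEBIT_CREDIT pattern above, in Python with sqlite3. The table and column names (accounts, history, tellers, branches) are illustrative assumptions, not Gray's schema; the point is only that a single BEGIN/COMMIT scope carries the whole application-level contract.

        import sqlite3

        def debit_credit(conn, account_number, delta, teller, branch):
            # One transaction wraps the read, the invariant check, and all updates.
            try:
                with conn:  # commits on success, rolls back on any exception
                    row = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                                       (account_number,)).fetchone()
                    if row is None or row[0] + delta < 0:
                        return "NEGATIVE RESPONSE"
                    conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                                 (delta, account_number))
                    conn.execute("INSERT INTO history (account, delta) VALUES (?, ?)",
                                 (account_number, delta))
                    conn.execute("UPDATE tellers SET cash = cash + ? WHERE id = ?",
                                 (delta, teller))
                    conn.execute("UPDATE branches SET balance = balance + ? WHERE id = ?",
                                 (delta, branch))
                    return "NEW BALANCE = %s" % (row[0] + delta)
            except sqlite3.Error:
                return "NEGATIVE RESPONSE"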
  7. 7. The “top-down” ethos
  8. 8. The “top-down” ethos
  9. 9. The “top-down” ethos
  10. 10. The “top-down” ethos
  11. 11. The “top-down” ethos
  12. 12. The “top-down” ethos
  13. 13. Transactions: a holistic contract Write Read Application Opaque store Transactions
  14. 14. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  15. 15. Transactions: a holistic contract Assert: balance > 0 Write Read Application Opaque store Transactions
  16. 16. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  17. 17. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  18. 18. Incidental complexities • The “Internet.” Searching it. • Cross-datacenter replication schemes • CAP Theorem • Dynamo & MapReduce • “Cloud”
  19. 19. Fundamental complexity “[…] distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.” Jim Waldo et al., A Note on Distributed Computing (1994)
  20. 20. A holistic contract …stretched to the limit Write Read Application Opaque store Transactions
  21. 21. A holistic contract …stretched to the limit Write Read Application Opaque store Transactions
  22. 22. Are you blithely asserting that transactions aren’t webscale? Some people just want to see the world burn. Those same people want to see the world use inconsistent databases. - Emin Gun Sirer
  23. 23. Alternative to top-down design? The “bottom-up,” systems tradition: Simple, reusable components first. Semantics later.
  24. 24. Alternative: the “bottom-up,” systems ethos
  25. 25. The “bottom-up” ethos
  26. 26. The “bottom-up” ethos
  27. 27. The “bottom-up” ethos
  28. 28. The “bottom-up” ethos
  29. 29. The “bottom-up” ethos
  30. 30. The “bottom-up” ethos
  31. 31. The “bottom-up” ethos “‘Tis a fine barn, but sure ‘tis no castle, English”
  32. 32. The “bottom-up” ethos Simple, reusable components first. Semantics later. This is how we live now. Question: Do we ever get those application-level guarantees back?
  33. 33. Low-level contracts Write Read Application Distributed store KVS
  34. 34. Low-level contracts Write Read Application Distributed store KVS
  35. 35. Low-level contracts Write Read Application Distributed store KVS R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  36. 36. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  37. 37. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 causal? PRAM? delta? fork/join? red/blue? Release? R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
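     A toy illustration of why the choice of low-level guarantee matters to the application (this is not any real store's API): two clients run the same read-check-write withdrawal against different replicas of an eventually consistent store. Each replica's local history looks reasonable, yet the application-level assertion balance > 0 is lost once both effects are combined.

        class Replica:
            def __init__(self, balance):
                self.kv = {"balance": balance}
            def read(self, key):
                return self.kv[key]
            def write(self, key, value):
                self.kv[key] = value

        def withdraw(replica, amount):
            bal = replica.read("balance")
            if bal - amount >= 0:          # each client checks the invariant locally
                replica.write("balance", bal - amount)

        r1, r2 = Replica(100), Replica(100)   # two replicas of the same account
        withdraw(r1, 80)                      # client 1 talks to replica 1
        withdraw(r2, 80)                      # client 2 talks to replica 2
        # apply both decrements, as an op-based merge eventually would
        final = 100 - (100 - r1.read("balance")) - (100 - r2.read("balance"))
        assert final == -60                   # balance > 0 no longer holds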
  38. 38. When do contracts compose? Application Distributed service Assert: balance > 0
  39. 39. Ew, did I get mongo in my riak? Assert: balance > 0
  40. 40. Composition is the last hard problem Composing modules is hard enough We must learn how to compose guarantees
  41. 41. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  42. 42. Why distributed systems are hard² Asynchrony Partial Failure Fundamental Uncertainty
  43. 43. Asynchrony isn’t that hard Amelioration: Logical timestamps Deterministic interleaving
  44. 44. Partial failure isn’t that hard Amelioration: Replication Replay
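     A minimal sketch of the logical-timestamp amelioration (a Lamport clock), assuming nothing beyond the textbook rule: bump the clock on every local event or send, and on receipt take the max of the local and message clocks plus one.

        class LamportClock:
            def __init__(self):
                self.time = 0
            def tick(self):                 # local event or send
                self.time += 1
                return self.time
            def recv(self, msg_time):       # merge the sender's timestamp
                self.time = max(self.time, msg_time) + 1
                return self.time

        a, b = LamportClock(), LamportClock()
        t = a.tick()        # a sends a message stamped 1
        b.recv(t)           # b's clock jumps to 2
        # Timestamps give every replica the same order in which to apply updates,
        # which tames asynchrony -- but they say nothing about partial failure.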
  45. 45. (asynchrony * partial failure) = hard² Logical timestamps Deterministic interleaving Replication Replay
  46. 46. (asynchrony * partial failure) = hard² Logical timestamps Deterministic interleaving Replication Replay
  47. 47. (asynchrony * partial failure) = hard² Tackling one clown at a time Poor strategy for programming distributed systems Winning strategy for analyzing distributed programs
  48. 48. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  49. 49. Distributed consistency Today: A quick summary of some great work.
  50. 50. Consider a (distributed) graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  51. 51. Partitioned, for scalability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  52. 52. Replicated, for availability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  53. 53. Deadlock detection Task: Identify strongly-connected components Waits-for graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  54. 54. Garbage collection Task: Identify nodes not reachable from Root. Root Refers-to graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
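     For concreteness, minimal single-machine sketches of the two graph queries just described, with adjacency in a plain dict (the distributed versions are the interesting part; a crude cycle check stands in for full strongly-connected-component computation).

        def reachable(graph, root):
            """Garbage collection view: any node NOT returned here is collectable."""
            seen, stack = set(), [root]
            while stack:
                node = stack.pop()
                if node not in seen:
                    seen.add(node)
                    stack.extend(graph.get(node, []))
            return seen

        def on_a_cycle(graph):
            """Deadlock detection view: nodes on some waits-for cycle (crude check)."""
            return {n for n in graph
                      for m in reachable(graph, n)
                      if n in graph.get(m, [])}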
  55. 55. T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory
  56. 56. T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory
  57. 57. Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Root
  58. 58. Consistency at the extremes Application Language Custom solutions? Flow Object Storage Linearizable key-value store?
  59. 59. Consistency at the extremes Application Language Custom solutions? Flow Object Storage Linearizable key-value store?
  60. 60. Consistency at the extremes Application Language Custom solutions? Flow Efficient Object Correct Storage Linearizable key-value store?
  61. 61. Object-level consistency Capture semantics of data structures that • allow greater concurrency • maintain guarantees (e.g. convergence) Application Language Flow Object Storage
  62. 62. Object-level consistency
  63. 63. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  64. 64. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  65. 65. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  66. 66. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence Reordering Batching Retry/duplication Tolerant to
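     A minimal grow-only set CRDT (G-Set) sketch: merge is set union, which is commutative, associative, and idempotent, so replicas converge under reordering, batching, and retry/duplication, exactly the tolerances listed above.

        class GSet:
            def __init__(self):
                self.elems = set()
            def insert(self, x):
                self.elems.add(x)
            def read(self):
                return frozenset(self.elems)
            def merge(self, other):          # union: commutative, associative, idempotent
                self.elems |= other.elems

        a, b = GSet(), GSet()
        a.insert(1)
        b.insert(2)
        a.merge(b); b.merge(a); b.merge(a)   # any order, any number of times
        assert a.read() == b.read() == frozenset({1, 2})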
  67. 67. Object-level composition? Application Convergent data structures Assert: Graph replicas converge
  68. 68. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed Assert: Graph replicas converge
  69. 69. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed ? ? Assert: Graph replicas converge
  70. 70. Flow-level consistency Application Language Flow Object Storage
  71. 71. Flow-level consistency Capture semantics of data in motion • Asynchronous dataflow model • component properties → system-wide guarantees Graph store Transaction manager Transitive closure Deadlock detector
  72. 72. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  73. 73. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  74. 74. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  75. 75. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  76. 76. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) =
  77. 77. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) { } = { }
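     A small sketch of what order-insensitivity buys: a confluent component's output set is a function of its input set alone, so every arrival order yields the same answer; an order-sensitive component has no such property.

        from itertools import permutations

        def confluent_filter(inputs):        # a monotonic selection over the input set
            return {x for x in inputs if x > 10}

        def first_seen(inputs):              # depends on arrival order: not confluent
            return {inputs[0]} if inputs else set()

        arrivals = [5, 11, 42]
        outs = {frozenset(confluent_filter(list(p))) for p in permutations(arrivals)}
        assert len(outs) == 1                # same output set for every ordering
        outs = {frozenset(first_seen(list(p))) for p in permutations(arrivals)}
        assert len(outs) > 1                 # different orders, different "results"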
  78. 78. Confluence is compositional output set = f ∘ g(input set)
  79. 79. Confluence is compositional output set = f ∘ g(input set)
  80. 80. Confluence is compositional output set = f ∘ g(input set)
  81. 81. Graph queries as dataflow Graph store Memory allocator Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent
  82. 82. Graph queries as dataflow Graph store Memory allocator Confluent Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Coordinate here
  83. 83. Coordination: what is that? Strategy 1: Establish a total order Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent
  84. 84. Coordination: what is that? Strategy 2: Establish a producer-consumer Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent barrier
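     A sketch of the producer-consumer idea behind Strategy 2 (my own toy framing, not the talk's mechanism): the non-confluent consumer refuses to run until the producer seals its output, i.e. declares the input set complete.

        class SealedStream:
            def __init__(self):
                self.items, self.sealed = set(), False
            def produce(self, item):
                assert not self.sealed, "no new input after the barrier"
                self.items.add(item)
            def seal(self):                  # the single coordination point
                self.sealed = True
            def consume(self, query):        # for non-confluent queries like GC
                assert self.sealed, "non-confluent consumers must wait for the seal"
                return query(self.items)

        refs = SealedStream()
        refs.produce(("Root", "T1"))
        refs.produce(("T1", "T2"))
        refs.seal()
        nodes = refs.consume(lambda edges: {n for e in edges for n in e})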
  85. 85. Fundamental costs: FT via replication (mostly) free! Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Graph store Transitive closure Deadlock detector Confluent Confluent Confluent
  86. 86. Fundamental costs: FT via replication global synchronization! Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Garbage Collector Confluent Not Confluent Confluent Paxos Not Confluent
  87. 87. Fundamental costs: FT via replication The first principle of successful scalability is to batter the consistency mechanisms down to a minimum. – James Hamilton Garbage Collector Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Confluent Not Confluent Confluent Barrier Not Confluent Barrier
  88. 88. Language-level consistency DSLs for distributed programming? • Capture consistency concerns in the type system Application Language Flow Object Storage
  89. 89. Language-level consistency CALM Theorem: Monotonic → confluent Conservative, syntactic test for confluence
  90. 90. Language-level consistency Deadlock detector Garbage collector
  91. 91. Language-level consistency Deadlock detector Garbage collector nonmonotonic
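     A concrete way to see why the deadlock detector is monotonic and the garbage collector is not (my own illustration of the CALM intuition): transitive closure only grows as edges arrive, but "not reachable" can be retracted by a later edge, so GC must not answer until it has seen all of its input.

        def transitive_closure(edges):
            reach = set(edges)
            while True:
                new = {(a, d) for (a, b) in reach for (c, d) in reach if b == c}
                if new <= reach:
                    return reach
                reach |= new

        def garbage(nodes, edges, root):     # "NOT reachable from root": nonmonotonic
            reach = {root} | {b for (a, b) in transitive_closure(edges) if a == root}
            return nodes - reach

        nodes = {"Root", "T1", "T2"}
        e1 = {("Root", "T1")}
        e2 = e1 | {("T1", "T2")}
        assert transitive_closure(e1) <= transitive_closure(e2)   # output only grows
        assert garbage(nodes, e1, "Root") == {"T2"}               # T2 looks dead...
        assert garbage(nodes, e2, "Root") == set()                # ...until more input arrives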
  92. 92. Let’s review • Consistency is tolerance to asynchrony • Tricks: – focus on data in motion, not at rest – avoid coordination when possible – choose coordination carefully otherwise (Tricks are great, but tools are better)
  93. 93. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  94. 94. Grand challenge: composition Hard problem: Is a given component fault-tolerant? Much harder: Is this system (built up from components) fault-tolerant?
  95. 95. Example: Atomic multi-partition update T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Two-phase commit
  96. 96. Example: replication T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Reliable broadcast
  97. 97. Popular wisdom: don’t reinvent
  98. 98. Example: Kafka replication bug Three “correct” components: 1. Primary/backup replication 2. Timeout-based failure detectors 3. Zookeeper One nasty bug: Acknowledged writes are lost
  99. 99. A guarantee would be nice Bottom up approach: • use formal methods to verify individual components (e.g. protocols) • Build systems from verified components Shortcomings: • Hard to use • Hard to compose Investment Returns
  100. 100. Bottom-up assurances Formal verification Environment Program Correctness Spec
  101. 101. Composing bottom-up assurances
  102. 102. Composing bottom-up assurances Issue 1: incompatible failure models e.g., crash failure vs. omissions Issue 2: Specs do not compose (FT is an end-to-end property) If you take 10 components off the shelf, you are putting 10 world views together, and the result will be a mess. -- Butler Lampson
  103. 103. Composing bottom-up assurances
  104. 104. Composing bottom-up assurances
  105. 105. Composing bottom-up assurances
  106. 106. Top-down “assurances”
  107. 107. Top-down “assurances” Testing
  108. 108. Top-down “assurances” Fault injection Testing
  109. 109. Top-down “assurances” Fault injection Testing
  110. 110. End-to-end testing would be nice Top-down approach: • Build a large-scale system • Test the system under faults Shortcomings: • Hard to identify complex bugs • Fundamentally incomplete Investment Returns
  111. 111. Lineage-driven fault injection Goal: top-down testing that • finds all of the fault-tolerance bugs, or • certifies that none exist
  112. 112. Lineage-driven fault injection Correctness Specification Malevolent sentience Molly
  113. 113. Lineage-driven fault injection Molly Correctness Specification Malevolent sentience
  114. 114. Lineage-driven fault injection (LDFI) Approach: think backwards from outcomes Question: could a bad thing ever happen? Reframe: • Why did a good thing happen? • What could have gone wrong along the way?
  115. 115. Thomasina: What a faint-heart! We must work outward from the middle of the maze. We will start with something simple.
  116. 116. The game • Both players agree on a failure model • The programmer provides a protocol • The adversary observes executions and chooses failures for the next execution.
  117. 117. Dedalus: it’s about data log(B, “data”)@5 What Where When Some data
  118. 118. Dedalus: it’s like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload);
  119. 119. Dedalus: it’s like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload); (Which is like SQL) create view log as select Node, Pload from bcast;
  120. 120. Dedalus: it’s about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
  121. 121. Dedalus: it’s about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); State change Natural join (bcast.Node1 == node.Node1) Communication
  122. 122. The match Protocol: Reliable broadcast Specification: Pre: A correct process delivers a message m Post: All correct processes deliver m Failure Model: (Permanent) crash failures Message loss / partitions
  123. 123. Round 1 node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node, Pload)@next :- log(Node, Pload); log(Node, Pload) :- bcast(Node, Pload); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); “An effort” delivery protocol
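     A toy Python rendering of Round 1's “an effort” protocol (a hand simulation of the intuition, not Dedalus semantics): the sender broadcasts exactly once, and the adversary may drop any message, so the reliable-broadcast spec can be violated.

        def round1(dropped):                     # dropped: set of (sender, receiver) pairs
            log = {"a": {"data"}, "b": set(), "c": set()}
            for dest in ("b", "c"):              # a's single best-effort broadcast
                if ("a", dest) not in dropped:
                    log[dest].add("data")
            return log

        assert all("data" in entries for entries in round1(set()).values())
        assert "data" not in round1({("a", "b")})["b"]   # drop a->b: b never delivers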
  124. 124. Round 1 in space / time [spacetime diagram: process a broadcasts at time 1; processes b and c log the message at time 2]
  125. 125. Round 1: Lineage log(B, data)@5
  126. 126. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  127. 127. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3
  128. 128. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B,data)@2
  129. 129. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B,data)@2 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1; log(A, data)@1
  130. 130. An execution is a (fragile) “proof” of an outcome [proof-tree diagram: log(B, data)@5 is derived by persistence from log(B, data)@2, which was derived from log(A, data)@1 and node(A, B)@1 via message AB1] (which required a message from A to B at time 1)
  131. 131. Valentine: “The unpredictable and the predetermined unfold together to make everything the way it is.”
  132. 132. Round 1: counterexample [spacetime diagram: a broadcasts at time 1; the message to b is LOST, while c logs at time 2] The adversary wins!
  133. 133. Round 2 Same as Round 1, but A retries. bcast(N, P)@next :- bcast(N, P);
  134. 134. Round 2 in spacetime [spacetime diagram: a rebroadcasts at every timestep from 1 to 4; b and c log the message at timesteps 2 through 5]
  135. 135. Round 2 log(B, data)@5
  136. 136. Round 2 log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  137. 137. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
  138. 138. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3
  139. 139. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2
  140. 140. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  141. 141. Round 2 Retry provides redundancy in time log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  142. 142. Traces are forests of proof trees [diagram: four proof trees for log(B, data)@5, one per retry; the k-th tree depends on message ABk, so breaking every proof requires dropping AB1 ∧ AB2 ∧ AB3 ∧ AB4]
  143. 143. Traces are forests of proof trees [diagram: four proof trees for log(B, data)@5, one per retry; the k-th tree depends on message ABk, so breaking every proof requires dropping AB1 ∧ AB2 ∧ AB3 ∧ AB4]
  144. 144. Round 2: counterexample [spacetime diagram: a broadcasts at time 1; the message to b is LOST and a CRASHES at time 2, so b never logs while c does] The adversary wins!
  145. 145. Round 3 Same as in Round 2, but symmetrical. bcast(N, P)@next :- log(N, P);
  146. 146. Round 3 in space / time [spacetime diagram: every process that has logged the message rebroadcasts it at every timestep, so all three processes keep relaying and logging] Redundancy in space and time
  147. 147. Round 3 -- lineage log(B, data)@5
  148. 148. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4
  149. 149. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3
  150. 150. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B,data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  151. 151. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B,data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  152. 152. Round 3 The programmer wins!
  153. 153. Let’s reflect Fault-tolerance is redundancy in space and time. Best strategy for both players: reason backwards from outcomes using lineage Finding bugs: find a set of failures that “breaks” all derivations Fixing bugs: add additional derivations
  154. 154. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. (AB1 ∨ BC2) Disjunction
  155. 155. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. (AB1 ∨ BC2) Disjunction ∧ (AC1) ∧ (AC2) Conjunction of disjunctions (AKA CNF)
  156. 156. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. (AB1 ∨ BC2) Disjunction ∧ (AC1) ∧ (AC2) Conjunction of disjunctions (AKA CNF)
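     A brute-force sketch of the adversary's search (Molly hands the CNF to a solver instead): each proof tree contributes the set of messages it depends on, and a counterexample candidate is any set of dropped messages that intersects every proof.

        from itertools import combinations

        def candidate_faults(proofs, max_drops):
            messages = sorted(set().union(*proofs))
            for k in range(1, max_drops + 1):
                for drops in combinations(messages, k):
                    if all(set(drops) & proof for proof in proofs):  # breaks every proof
                        yield set(drops)

        # Round 2's four proof trees each hinge on one retry message ABt:
        proofs = [{"AB1"}, {"AB2"}, {"AB3"}, {"AB4"}]
        print(next(candidate_faults(proofs, max_drops=4)))  # must drop all four retries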
  157. 157. Molly, the LDFI prototype Molly finds fault-tolerance violations quickly or guarantees that none exist. Molly finds bugs by explaining good outcomes – then it explains the bugs. Bugs identified: 2pc, 2pc-ctp, 3pc, Kafka Certified correct: paxos (synod), Flux, bully leader election, reliable broadcast
  158. 158. Commit protocols Problem: Atomically change things Correctness properties: 1. Agreement (All or nothing) 2. Termination (Something)
  159. 159. Two-phase commit [spacetime diagram: the coordinator sends prepare to agents a, b, and d; each agent replies with a vote; the coordinator sends commit to all]
  160. 160. Two-phase commit [spacetime diagram: the coordinator sends prepare to agents a, b, and d; each agent replies with a vote; the coordinator sends commit to all] Can I kick it?
  161. 161. Two-phase commit [spacetime diagram: the coordinator sends prepare to agents a, b, and d; each agent replies with a vote; the coordinator sends commit to all] Can I kick it? YES YOU CAN
  162. 162. Two-phase commit [spacetime diagram: the coordinator sends prepare to agents a, b, and d; each agent replies with a vote; the coordinator sends commit to all] Can I kick it? YES YOU CAN Well I’m gone
  163. 163. Two-phase commit [spacetime diagram: the coordinator sends prepare, collects votes, then CRASHES before sending a decision; the agents block] Violation: Termination
  164. 164. The collaborative termination protocol Basic idea: Agents talk amongst themselves when the coordinator fails. Protocol: On timeout, ask other agents about decision.
  165. 165. 2PC - CTP [spacetime diagram: the coordinator sends prepare, collects votes, then CRASHES; agents a, b, and d time out and exchange decision_req messages among themselves]
  166. 166. 2PC - CTP [spacetime diagram: the coordinator sends prepare, collects votes, then CRASHES; agents a, b, and d time out and exchange decision_req messages among themselves] Can I kick it? YES YOU CAN ……?
  167. 167. 3PC Basic idea: Add a round, a state, and simple failure detectors (timeouts). Protocol: 1. Phase 1: Just like in 2PC – Agent timeout → abort 2. Phase 2: send preCommit, collect acks – Agent timeout → commit 3. Phase 3: Just like phase 2 of 2PC
  168. 168. 3PC [spacetime diagram: processes a, b, C, and d; the coordinator sends cancommit, collects vote_msg replies, sends precommit, collects acks, then sends commit]
  169. 169. 3PC [spacetime diagram: processes a, b, C, and d; the coordinator sends cancommit, collects vote_msg replies, sends precommit, collects acks, then sends commit] Timeout → Abort Timeout → Commit
  170. 170. Network partitions make 3pc act crazy [spacetime diagram: the coordinator sends cancommit and then precommit; process d CRASHES after voting; the coordinator sends abort, but the aborts to a and b are LOST; a and b ack and later commit]
  171. 171. Network partitions make 3pc act crazy [spacetime diagram: the coordinator sends cancommit and then precommit; process d CRASHES after voting; the coordinator sends abort, but the aborts to a and b are LOST; a and b ack and later commit] Agent crash Agents learn commit decision
  172. 172. Network partitions make 3pc act crazy [spacetime diagram: the coordinator sends cancommit and then precommit; process d CRASHES after voting; the coordinator sends abort, but the aborts to a and b are LOST; a and b ack and later commit] Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  173. 173. Network partitions make 3pc act crazy [spacetime diagram: the coordinator sends cancommit and then precommit; process d CRASHES after voting; the coordinator sends abort, but the aborts to a and b are LOST; a and b ack and later commit] Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  174. 174. Network partitions make 3pc act crazy [spacetime diagram: the coordinator sends cancommit and then precommit; process d CRASHES after voting; the coordinator sends abort, but the aborts to a and b are LOST; a and b ack and later commit] Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort Agents A & B decide to commit
  175. 175. Kafka durability bug [spacetime diagram: replicas b and c, Zookeeper, replica a, and the client; a brief partition separates b and c; a acknowledges the client write and then CRASHES]
  176. 176. Kafka durability bug [spacetime diagram: replicas b and c, Zookeeper, replica a, and the client; a brief partition separates b and c; a acknowledges the client write and then CRASHES] Brief network partition
  177. 177. Kafka durability bug [spacetime diagram: replicas b and c, Zookeeper, replica a, and the client; a brief partition separates b and c; a acknowledges the client write and then CRASHES] Brief network partition a becomes leader and sole replica
  178. 178. Kafka durability bug [spacetime diagram: replicas b and c, Zookeeper, replica a, and the client; a brief partition separates b and c; a acknowledges the client write and then CRASHES] Brief network partition a becomes leader and sole replica a ACKs client write
  179. 179. Kafka durability bug [spacetime diagram: replicas b and c, Zookeeper, replica a, and the client; a brief partition separates b and c; a acknowledges the client write and then CRASHES] Brief network partition a becomes leader and sole replica a ACKs client write Data loss
  180. 180. Molly summary Lineage allows us to reason backwards from good outcomes Molly: surgically-targeted fault injection Investment similar to testing Returns similar to formal methods
  181. 181. Where we’ve been; where we’re headed 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  182. 182. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  183. 183. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  184. 184. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. (asynchrony X partial failure) = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  185. 185. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  186. 186. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  187. 187. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  188. 188. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes
  189. 189. Remember 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes Composition is the hardest problem
  190. 190. A happy crisis Valentine: “It makes me so happy. To be at the beginning again, knowing almost nothing.... It's the best possible time of being alive, when almost everything you thought you knew is wrong.”
