Outwards 
from the middle of the maze 
Peter Alvaro 
UC Berkeley
Outline 
1. Mourning the death of transactions 
2. What is so hard about distributed systems? 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
The transaction concept 
DEBIT_CREDIT:
  BEGIN_TRANSACTION;
  GET MESSAGE;
  EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE;
  FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE;
  IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN
    PUT NEGATIVE RESPONSE;
  ELSE DO;
    ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA;
    POST HISTORY RECORD ON ACCOUNT (DELTA);
    CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA;
    BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA;
    PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE);
  END;
  COMMIT;
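For readers who want to run the DEBIT_CREDIT idea, here is a minimal sketch in Python using the standard-library sqlite3 module. The table names, column names, and the helper make_bank are assumptions made for illustration; they are not part of the original slide.

```python
import sqlite3

def make_bank():
    """Build a tiny in-memory bank (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE account(id INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE history(account INTEGER, delta INTEGER);
        CREATE TABLE cash_drawer(teller INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE branch(id INTEGER PRIMARY KEY, balance INTEGER);
        INSERT INTO account VALUES (1, 100);
        INSERT INTO cash_drawer VALUES (7, 500);
        INSERT INTO branch VALUES (3, 1000);
    """)
    return conn

def debit_credit(conn, account, delta, teller, branch):
    """Return the new balance, or None for the 'negative response' case."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            row = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (account,)
            ).fetchone()
            if row is None or row[0] + delta < 0:
                raise ValueError("negative response")
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (delta, account))
            conn.execute("INSERT INTO history(account, delta) VALUES (?, ?)",
                         (account, delta))
            conn.execute("UPDATE cash_drawer SET balance = balance + ? WHERE teller = ?",
                         (delta, teller))
            conn.execute("UPDATE branch SET balance = balance + ? WHERE id = ?",
                         (delta, branch))
            return row[0] + delta
    except ValueError:
        return None
```

The application states what should happen to the account, the drawer, and the branch; the store is responsible for making all four writes happen together or not at all.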
The “top-down” ethos
Transactions: a holistic contract 
Write 
Read 
Application 
Opaque 
store 
Transactions
Transactions: a holistic contract 
Write 
Read 
Application 
Opaque 
store 
Transactions 
Assert: 
balance > 0
Incidental complexities 
• The “Internet.” Searching it. 
• Cross-datacenter replication schemes 
• CAP Theorem 
• Dynamo & MapReduce 
• “Cloud”
Fundamental complexity 
“[…] distributed systems require that the 
programmer be aware of latency, have a different 
model of memory access, and take into account 
issues of concurrency and partial failure.” 
Jim Waldo et al., 
A Note on Distributed Computing (1994)
A holistic contract 
…stretched to the limit 
Write 
Read 
Application 
Opaque 
store 
Transactions
Are you blithely asserting 
that transactions aren’t webscale? 
Some people just want to see the world burn. 
Those same people want to see the world use inconsistent databases. 
- Emin Gun Sirer
Alternative to top-down design? 
The “bottom-up,” systems tradition: 
Simple, reusable components first. 
Semantics later.
Alternative: 
the “bottom-up,” systems ethos
The “bottom-up” ethos
The “bottom-up” ethos 
“‘Tis a fine barn, but sure ‘tis no castle, English”
The “bottom-up” ethos 
Simple, reusable components first. 
Semantics later. 
This is how we live now. 
Question: Do we ever get those 
application-level guarantees back?
Low-level contracts 
Write 
Read 
Application 
Distributed 
store KVS
Low-level contracts
Write
Read
Application
Distributed store
KVS
Assert: balance > 0
causal? PRAM? delta? fork/join? red/blue? Release?
R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
When do contracts compose? 
Application 
Distributed 
service 
Assert: 
balance > 0
Ew, did I get mongo in my riak?
Assert: 
balance > 0
Composition is the last hard 
problem 
Composing modules is hard enough 
We must learn how to compose guarantees
Outline 
1. Mourning the death of transactions 
2. What is so hard about distributed systems? 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
Why distributed systems are hard²
Asynchrony Partial Failure 
Fundamental Uncertainty
Asynchrony isn’t that hard 
Amelioration:
Logical timestamps 
Deterministic interleaving
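As one illustration of the "logical timestamps" bullet, here is a minimal sketch of a Lamport-style logical clock in Python. The class and function names are assumptions chosen for this example; the slide does not prescribe a particular scheme.

```python
class LamportClock:
    """A per-process counter that respects the order of sends and receives."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event and return the new time."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp an outgoing message."""
        return self.tick()

    def receive(self, message_time):
        """Move past both the local time and the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time


def total_order(events):
    """Order (timestamp, process_id, label) events; process id breaks ties."""
    return sorted(events)
```

Sorting by (timestamp, process id) gives every replica the same interleaving of the same events, which is one way to read "deterministic interleaving".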
Partial failure isn’t that hard 
Amelioration:
Replication 
Replay
(asynchrony * partial failure) = hard²
Logical timestamps 
Deterministic interleaving 
Replication 
Replay
(asynchrony * partial failure) = hard²
Tackling one clown at a time 
Poor strategy for programming distributed systems 
Winning strategy for analyzing distributed programs
Outline 
1. Mourning the death of transactions 
2. What is so hard about distributed systems? 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
Distributed consistency 
Today: A quick summary of some great work.
Consider a (distributed) graph 
[Figure: a graph over nodes T1 through T14]
Partitioned, for scalability 
[Figure: the graph over nodes T1 through T14, split into partitions]
Replicated, for availability 
[Figure: two replicas of the graph over nodes T1 through T14]
Deadlock detection 
Task: Identify strongly-connected 
components 
Waits-for graph 
[Figure: waits-for graph over nodes T1 through T14]
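The deadlock-detection task can be sketched on a single machine as follows. This uses Kosaraju's two-pass algorithm as one possible way to find strongly connected components; the function names and the dict-of-lists graph encoding are assumptions for illustration, and the distributed setting of the talk is deliberately ignored here.

```python
def strongly_connected_components(graph):
    """graph maps each node to the nodes it points at. Returns a list of sets."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    order, seen = [], set()

    def visit(u):
        # First pass: record nodes in depth-first post-order.
        seen.add(u)
        for v in graph.get(u, ()):
            if v not in seen:
                visit(v)
        order.append(u)

    for u in sorted(nodes):
        if u not in seen:
            visit(u)

    reverse = {u: [] for u in nodes}
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)

    # Second pass: walk the reversed graph in decreasing post-order.
    components, assigned = [], set()
    for u in reversed(order):
        if u in assigned:
            continue
        component, stack = set(), [u]
        assigned.add(u)
        while stack:
            x = stack.pop()
            component.add(x)
            for y in reverse[x]:
                if y not in assigned:
                    assigned.add(y)
                    stack.append(y)
        components.append(component)
    return components


def deadlocks(waits_for):
    """A deadlock is a component with a cycle: several nodes, or a self-wait."""
    return [c for c in strongly_connected_components(waits_for)
            if len(c) > 1 or any(u in waits_for.get(u, ()) for u in c)]
```

For example, with T1 waiting for T2, T2 for T3, and T3 for T1, `deadlocks` reports the component {T1, T2, T3}; a transaction that merely waits on that cycle is not part of it.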
Garbage collection 
Task: Identify nodes not reachable from Root.
Refers-to graph
[Figure: refers-to graph over nodes T1 through T14, with a Root node]
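The garbage-collection task can be sketched the same way: compute what is reachable from Root, then take the complement. The function names and graph encoding below are assumptions for illustration, and the sketch runs on one machine over a complete snapshot of the graph.

```python
def reachable(refers_to, root):
    """Return every node reachable from root by following references."""
    seen, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for v in refers_to.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def garbage(refers_to, root):
    """Nodes that appear in the graph but cannot be reached from root."""
    nodes = set(refers_to) | {v for targets in refers_to.values() for v in targets}
    return nodes - reachable(refers_to, root)
```

Note the set difference in `garbage`: the answer depends on which edges are absent, so an edge that arrives late can turn "garbage" back into live memory.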
Correctness 
Deadlock detection 
• Safety: No false positives 
• Liveness: Identify all deadlocks 
Garbage collection 
• Safety: Never GC live memory! 
• Liveness: GC all orphaned memory
Consistency at the extremes 
Application 
Language 
Custom solutions?
Flow 
Object 
Storage 
Linearizable 
key-value store?
Consistency at the extremes 
Application 
Language 
Custom solutions?
Flow 
Efficient Object 
Correct 
Storage 
Linearizable 
key-value store?
Object-level consistency 
Capture semantics of data structures that 
• allow greater concurrency 
• maintain guarantees (e.g. convergence) 
Application 
Language 
Flow 
Object 
Storage
Object-level consistency 
Insert 
Read 
Convergent 
data structure 
(e.g., Set CRDT) 
Insert 
Read 
Commutativity 
Associativity 
Idempotence
Object-level consistency 
Insert 
Read 
Convergent 
data structure 
(e.g., Set CRDT) 
Insert 
Read 
Commutativity 
Associativity 
Idempotence 
Tolerant to:
Reordering
Batching
Retry/duplication
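To make the three properties concrete, here is a minimal sketch of a grow-only set, one simple instance of a Set CRDT. The class name GSet and its methods are assumptions for illustration, not an API from the talk.

```python
class GSet:
    """Grow-only set: replicas merge by set union."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def insert(self, x):
        """Return a new replica state that also contains x."""
        return GSet(self.items | {x})

    def merge(self, other):
        """Combine two replica states; union is the join of the two sets."""
        return GSet(self.items | other.items)

    def read(self):
        return set(self.items)

    def __eq__(self, other):
        return isinstance(other, GSet) and self.items == other.items

    def __hash__(self):
        return hash(self.items)
```

Because union is commutative, associative, and idempotent, replicas that receive the same inserts reordered, batched, or duplicated still read the same set.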
Object-level composition?
Application
Convergent data structures
GC Assert: No live nodes are reclaimed
?
Assert: Graph replicas converge
Flow-level consistency 
Application 
Language 
Flow 
Object 
Storage
Flow-level consistency 
Capture semantics of data in motion 
• Asynchronous dataflow model 
• component properties → system-wide guarantees 
Graph 
store 
Transaction 
manager 
Transitive 
closure 
Deadlock 
detector
Flow-level consistency 
Order-insensitivity (confluence) 
output set = f(input set)
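One way to read the equation: a component is confluent when the set of outputs depends only on the set of inputs, not on their arrival order. The sketch below checks that by brute force over every ordering of a small input; all names are hypothetical, and passing the check for one input is evidence, not a proof.

```python
from itertools import permutations


def run(operator_factory, inputs):
    """Feed inputs one at a time to a fresh operator; collect all its outputs."""
    step = operator_factory()
    outputs = set()
    for x in inputs:
        outputs |= step(x)
    return frozenset(outputs)


def is_confluent_on(operator_factory, inputs):
    """True if every arrival order of these inputs yields the same output set."""
    return len({run(operator_factory, p) for p in permutations(inputs)}) == 1


def evens():
    """Stateless filter: order cannot matter."""
    return lambda x: {x} if x % 2 == 0 else set()


def first_seen():
    """Emits only the first input it receives, so order decides the output."""
    state = {"done": False}

    def step(x):
        if state["done"]:
            return set()
        state["done"] = True
        return {x}

    return step
```

Here `evens` passes the check and `first_seen` fails it, since which input arrives first is decided by the network.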
Confluence is compositional 
output set = f ∘ g(input set)
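A small sketch of why this matters: if g's output set does not depend on arrival order, and f's does not either, then feeding g's output into f gives an order-insensitive pipeline. The stream components below are hypothetical examples, and the check enumerates orderings of one small input only.

```python
from itertools import permutations


def dedupe(stream):
    """g: drop repeated items (the output order follows arrival order)."""
    seen, out = set(), []
    for x in stream:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def squares(stream):
    """f: square every item."""
    return [x * x for x in stream]


def take_first(stream):
    """An order-sensitive component, for contrast."""
    return list(stream)[:1]


def compose(f, g):
    return lambda stream: f(g(stream))


def output_sets(component, inputs):
    """The distinct output sets seen across every arrival order of inputs."""
    return {frozenset(component(p)) for p in permutations(inputs)}
```

Composing the two confluent components yields a single output set for every ordering; swapping in `take_first` yields several, so the pipeline as a whole loses the guarantee.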
Graph queries as dataflow 
Graph 
store 
Memory 
allocator 
Transitive 
closure 
Garbage 
collector 
Confluent Not 
Confluent 
Confluent 
Graph 
store 
Transaction 
manager 
Transitive 
closure 
Deadlock 
detector 
Confluent Confluent Confluent
Graph queries as dataflow 
Graph 
store 
Memory 
allocator 
Confluent 
Transitive 
closure 
Garbage 
collector 
Confluent Not 
Confluent 
Confluent 
Graph 
store 
Transaction 
manager 
Transitive 
closure 
Deadlock 
detector 
Confluent Confluent Confluent 
Coordinate 
here
Coordination: what is that? 
Strategy 1: Establish a total order 
Graph 
store 
Memory 
allocator 
Coordinate 
here 
Transitive 
closure 
Garbage 
collector 
Confluent Not 
Confluent 
Confluent
Coordination: what is that? 
Strategy 2: Establish a producer-consumer barrier
Graph store
Memory allocator
Coordinate here
Transitive closure
Garbage collector
Confluent Not Confluent Confluent
Fundamental costs: FT via replication 
(mostly) free! 
Graph 
store 
Transaction 
manager 
Transitive 
closure 
Deadlock 
detector 
Confluent Confluent Confluent 
Graph 
store 
Transitive 
closure 
Deadlock 
detector 
Confluent Confluent Confluent
Fundamental costs: FT via replication 
global synchronization! 
Graph 
store 
Transaction 
manager 
Transitive 
closure 
Garbage 
Collector 
Confluent Confluent 
Graph 
store 
Transitive 
closure 
Garbage 
Collector 
Confluent Not 
Confluent 
Confluent 
Paxos 
Not 
Confluent
Fundamental costs: FT via replication 
The first principle of successful scalability is to batter the 
consistency mechanisms down to a minimum. 
– James Hamilton 
Garbage 
Collector 
Graph 
store 
Transaction 
manager 
Transitive 
closure 
Garbage 
Collector 
Confluent Confluent 
Graph 
store 
Transitive 
closure 
Confluent Not 
Confluent 
Confluent 
Barrier 
Not 
Confluent 
Barrier
Language-level consistency 
DSLs for distributed programming? 
• Capture consistency concerns in the 
type system 
Application 
Language 
Flow 
Object 
Storage
Language-level consistency 
CALM Theorem: 
Monotonic → confluent 
Conservative, syntactic test for confluence
Language-level consistency 
Deadlock detector 
Garbage collector
Language-level consistency 
Deadlock detector 
Garbage collector 
nonmonotonic
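The contrast between the two queries can be sketched with plain sets. A monotonic query only ever grows its output as input arrives; a nonmonotonic one may have to retract an earlier answer. The function names below are assumptions for illustration.

```python
def reachable_pairs(edges):
    """Transitive closure of a set of (src, dst) edges: a monotonic query."""
    closure = set(edges)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


def unreachable(nodes, edges, root):
    """Nodes not reachable from root: uses set difference, so nonmonotonic."""
    live = {root} | {b for (a, b) in reachable_pairs(edges) if a == root}
    return set(nodes) - live
```

Adding an edge can only add pairs to `reachable_pairs`, so early answers stay valid. Adding the same edge can remove a node from `unreachable`, so an answer computed before all edges have arrived may be wrong, which is where coordination comes in.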
Let’s review 
• Consistency is tolerance to asynchrony 
• Tricks: 
– focus on data in motion, not at rest 
– avoid coordination when possible 
– choose coordination carefully otherwise 
(Tricks are great, but tools are better)
Outline 
1. Mourning the death of transactions 
2. What is so hard about distributed systems? 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
Grand challenge: composition 
Hard problem: 
Is a given component fault-tolerant? 
Much harder: 
Is this system (built up from components) 
fault-tolerant?
Example: Atomic 
multi-partition update 
[Figure: partitioned graph over nodes T1 through T14]
Two-phase 
commit
Example: replication 
[Figure: two replicas of the graph over nodes T1 through T14]
Reliable 
broadcast
Popular wisdom: don’t reinvent
Example: Kafka replication bug 
Three “correct” components: 
1. Primary/backup replication 
2. Timeout-based failure detectors 
3. Zookeeper 
One nasty bug: 
Acknowledged writes are lost
A guarantee would be nice 
Bottom up approach: 
• use formal methods to verify individual 
components (e.g. protocols) 
• Build systems from verified components 
Shortcomings: 
• Hard to use 
• Hard to compose 
Investment 
Returns
Bottom-up assurances 
Formal verification 
Environment 
Program 
Correctness 
Spec
Composing bottom-up 
assurances
Composing bottom-up 
assurances 
Issue 1: incompatible failure models 
e.g., crash failure vs. omissions 
Issue 2: Specs do not compose 
(FT is an end-to-end property) 
If you take 10 components off the shelf, you are putting 10 world views 
together, and the result will be a mess. -- Butler Lampson
Top-down “assurances”
Testing
Fault injection
End-to-end testing 
would be nice 
Top-down approach: 
• Build a large-scale system 
• Test the system under faults 
Shortcomings: 
• Hard to identify complex bugs 
• Fundamentally incomplete 
Investment 
Returns
Lineage-driven fault injection 
Goal: top-down testing that 
• finds all of the fault-tolerance bugs, or 
• certifies that none exist
Lineage-driven fault injection 
Correctness 
Specification 
Malevolent 
sentience 
Molly
Lineage-driven fault injection 
(LDFI) 
Approach: think backwards from outcomes 
Question: could a bad thing ever happen? 
Reframe: 
• Why did a good thing happen? 
• What could have gone wrong along the way?
Thomasina: What a faint-heart! We must 
work outward from the middle of the 
maze. We will start with something simple.
The game 
• Both players agree on a failure model 
• The programmer provides a protocol 
• The adversary observes executions and 
chooses failures for the next execution.
Dedalus: it’s about data 
log(B, “data”)@5 
What 
Where 
When 
Some data
Dedalus: it’s like Datalog 
consequence :- premise[s]

log(Node, Pload) :- bcast(Node, Pload);
(Which is like SQL)
create view log as
select Node, Pload from bcast;
Dedalus: it’s about time 
consequence@when :- premise[s]

node(Node, Neighbor)@next :- node(Node, Neighbor);

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);
State change: the @next rule
Natural join (bcast.Node1 == node.Node1)
Communication: the @async rule
The match 
Protocol: 
Reliable broadcast 
Specification: 
Pre: A correct process delivers a message m 
Post: All correct processes deliver m 
Failure Model: 
(Permanent) crash failures 
Message loss / partitions
Round 1 
node(Node, Neighbor)@next :- node(Node, Neighbor);
log(Node, Pload)@next :- log(Node, Pload);

log(Node, Pload) :- bcast(Node, Pload);

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);
“An effort” delivery protocol
Round 1 in space / time 
[Figure: space-time diagram; process a sends log messages at time 1, which reach processes b and c at time 2]
Round 1: Lineage
log(B, data)@5
log(B, data)@4
log(B, data)@3
log(B, data)@2
log(A, data)@1

log(Node, Pload)@next :- log(Node, Pload);
log(B, data)@5 :- log(B, data)@4;

log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1;
An execution is a (fragile) “proof” of an outcome
[Figure: proof tree; log(A, data)@1 and node(A, B)@1 derive log(B, data)@2 via message AB1 (rule r2), then rule r1 carries it through log(B, data)@3 and log(B, data)@4 to log(B, data)@5]
(which required a message from A to B at time 1)
Valentine: “The unpredictable and the 
predetermined unfold together to make 
everything the way it is.”
Round 1: counterexample 
[Figure: space-time diagram; process a sends two log messages at time 1, and one of them is LOST]
The adversary wins!
Round 2
Same as Round 1, but A retries.
bcast(N, P)@next :- bcast(N, P);
Round 2 in spacetime 
[Figure: space-time diagram over times 1 through 5; process a resends its log messages to processes b and c at every step]
Round 2
Retry provides redundancy in time
log(B, data)@5
log(B, data)@4   log(A, data)@4
log(B, data)@3   log(A, data)@3
log(B, data)@2   log(A, data)@2
log(A, data)@1

log(Node, Pload)@next :- log(Node, Pload);
log(B, data)@5 :- log(B, data)@4;

log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
Traces are forests of proof trees
Tree 1: log(A, data)@1, node(A, B)@1 --AB1, r2--> log(B, data)@2 --r1--> log(B, data)@3 --r1--> log(B, data)@4 --r1--> log(B, data)@5
Tree 2: log(A, data)@2, node(A, B)@2 --AB2, r2--> log(B, data)@3 --r1--> log(B, data)@4 --r1--> log(B, data)@5
Tree 3: log(A, data)@3, node(A, B)@3 --AB3, r2--> log(B, data)@4 --r1--> log(B, data)@5
Tree 4: log(A, data)@4, node(A, B)@4 --AB4, r2--> log(B, data)@5
In each tree, log(A, data)@k follows from log(A, data)@1 by rule r1, and node(A, B)@k follows from node(A, B)@1 by rule r3.
AB1 ∧ AB2 ∧ AB3 ∧ AB4
Round 2: counterexample
[Figure: space-time diagram; one of process a's log messages sent at time 1 is LOST, and process a has CRASHED by time 2, before it can retry]
The adversary wins!
Round 3
Same as in Round 2, but symmetrical.
bcast(N, P)@next :- log(N, P);
Round 3 in space / time 
[Figure: space-time diagram over times 1 through 5; every process that has logged the payload sends log messages to the other processes at each step]
Redundancy in space and time
Round 3 -- lineage
log(B, data)@5
log(B, data)@4   log(A, data)@4   log(C, data)@4
log(B, data)@3   log(A, data)@3   log(C, data)@3
log(B, data)@2   log(A, data)@2   log(C, data)@2
log(A, data)@1
Round 3 
The programmer wins!
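The three rounds can be replayed with a small simulator. This is a sketch under stated assumptions, not Dedalus or Molly: time is a global step counter, an @async message sent at time t arrives at t + 1 unless the adversary drops it, a crashed node takes no steps from its crash time onward, and the names simulate and spec_holds are hypothetical.

```python
def simulate(protocol, nodes, origin, dropped=frozenset(), crashes=None, eot=5):
    """Run one execution; return {node: first time it logged the payload, or None}.

    protocol: "best_effort" (Round 1), "retry" (Round 2), or "symmetric" (Round 3).
    dropped:  set of (src, dst, send_time) messages the adversary loses.
    crashes:  {node: time}; the node takes no steps from that time onward.
    """
    crashes = crashes or {}

    def alive(node, t):
        return node not in crashes or t < crashes[node]

    logged = {n: None for n in nodes}
    broadcasting = {origin}   # bcast(origin, payload)@1
    arriving = set()          # nodes with a log message arriving at the current time
    for t in range(1, eot + 1):
        senders = {n for n in broadcasting if alive(n, t)}
        # log(Node, Pload) :- bcast(Node, Pload);  plus delivery of @async messages
        for n in nodes:
            if logged[n] is None and alive(n, t) and (n in senders or n in arriving):
                logged[n] = t
        # log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
        arriving = {dst for src in senders for dst in nodes
                    if dst != src and (src, dst, t) not in dropped}
        # Who broadcasts at t + 1 depends on the round.
        if protocol == "best_effort":
            broadcasting = set()
        elif protocol == "retry":          # bcast(N, P)@next :- bcast(N, P);
            broadcasting = senders
        else:                              # bcast(N, P)@next :- log(N, P);
            broadcasting = {n for n in nodes if logged[n] is not None}
    return logged


def spec_holds(logged, crashes=None):
    """If any correct (never crashed) process logged m, all correct processes did."""
    correct = [n for n in logged if n not in (crashes or {})]
    delivered = [n for n in correct if logged[n] is not None]
    return not delivered or len(delivered) == len(correct)
```

Dropping the message from a to b at time 1 breaks Round 1 but not Round 2; adding a crash of a at time 2 breaks Round 2 but not Round 3, because c relays what it logged. Surviving this one fault scenario does not certify Round 3 in general.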
Let’s reflect 
Fault-tolerance is redundancy in space and 
time. 
Best strategy for both players: reason 
backwards from outcomes using lineage 
Finding bugs: find a set of failures that 
“breaks” all derivations 
Fixing bugs: add additional derivations
The role of the adversary 
can be automated 
1. Break a proof by dropping any contributing 
message. 
(AB1 ∨ BC2) 
Disjunction
The role of the adversary 
can be automated 
1. Break a proof by dropping any contributing 
message. 
2. Find a set of failures that breaks all proofs 
of a good outcome. 
(AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
Each clause is a disjunction; the whole formula is a conjunction of disjunctions (AKA CNF)
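The adversary's search can be sketched as a brute-force hitting-set problem: each proof contributes one clause listing the messages it depends on, and a fault set must contain at least one message from every clause. The function name and the exhaustive search are assumptions for illustration; a practical tool would presumably hand the formula to a solver instead.

```python
from itertools import combinations


def smallest_fault_sets(proofs):
    """proofs: list of sets of message names, one set per proof of the good outcome.

    Returns every minimum-size set of messages that intersects all proofs,
    i.e. every smallest satisfying assignment of the CNF formula.
    """
    messages = sorted(set().union(*proofs))
    for size in range(1, len(messages) + 1):
        hits = [set(candidate) for candidate in combinations(messages, size)
                if all(set(candidate) & proof for proof in proofs)]
        if hits:
            return hits
    return []
```

For the formula on the slide, `smallest_fault_sets([{"AB1", "BC2"}, {"AC1"}, {"AC2"}])` returns the two three-message candidates {AB1, AC1, AC2} and {AC1, AC2, BC2}. Each candidate is only a hypothesis: it still has to be injected into a new execution to see whether the protocol finds another way to succeed.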
Molly, the LDFI prototype 
Molly finds fault-tolerance violations quickly 
or guarantees that none exist. 
Molly finds bugs by explaining good 
outcomes – then it explains the bugs. 
Bugs identified: 2pc, 2pc-ctp, 3pc, Kafka 
Certified correct: paxos (synod), Flux, bully 
leader election, reliable broadcast
Commit protocols 
Problem: 
Atomically change things 
Correctness properties: 
1. Agreement (All or nothing) 
2. Termination (Something)
Two-phase commit
[Figure: space-time diagram over times 1 through 5; the Coordinator sends prepare to Agents a, b, and d, each agent replies with a vote, and the Coordinator then sends commit to each agent]
Can I kick it?
YES YOU CAN
Well I’m gone
Two-phase commit 
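[Figure: space-time diagram; the Coordinator sends prepare (p) to the agents at time 1, the agents send their votes (v) at time 2, and the Coordinator has CRASHED before it sends a decision]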
Violation: Termination
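The termination violation can be sketched without any networking. The model below is a deliberate simplification with hypothetical names: it only records what each agent can conclude once the votes are in, with or without a coordinator crash before the decision is sent.

```python
def two_phase_commit(votes, coordinator_crashes_before_decision=False):
    """votes: {agent: True for yes, False for no}. Returns {agent: outcome}."""
    if coordinator_crashes_before_decision:
        # An agent that voted no can abort on its own. An agent that voted
        # yes cannot tell whether the others did too, so it must wait.
        return {agent: "blocked" if vote else "abort" for agent, vote in votes.items()}
    decision = "commit" if all(votes.values()) else "abort"
    return {agent: decision for agent in votes}


def agreement(outcomes):
    """All or nothing: no two agents reach different decisions."""
    return len({o for o in outcomes.values() if o != "blocked"}) <= 1


def termination(outcomes):
    """Something: every agent eventually decides."""
    return "blocked" not in outcomes.values()
```

With three yes votes and a coordinator crash, `agreement` still holds but `termination` does not: every agent is stuck waiting for a decision that never arrives.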
The collaborative termination protocol 
Basic idea: 
Agents talk amongst themselves when the 
coordinator fails. 
Protocol: On timeout, ask other agents 
about decision.
2PC - CTP
[Figure: space-time diagram over times 1 through 7; the Coordinator sends prepare to Agents a, b, and d, the agents vote, and the Coordinator has CRASHED before deciding; the agents then exchange decision_req messages among themselves]
Can I kick it?
YES YOU CAN
……?
3PC 
Basic idea: 
Add a round, a state, and simple failure 
detectors (timeouts). 
Protocol: 
1. Phase 1: Just like in 2PC
– Agent timeout → abort
2. Phase 2: send canCommit, collect acks
– Agent timeout → commit
3. Phase 3: Just like phase 2 of 2PC
3PC
[Figure: space-time diagram over times 1 through 7; Process C sends cancommit to Processes a, b, and d and collects a vote_msg from each, then sends precommit and collects an ack from each, then sends commit]
Timeout → Abort
Timeout → Commit
Network partitions make 3pc act crazy
[Figure: space-time diagram over times 1 through 8 for Processes a, b, C, and d; messages shown are cancommit, vote_msg, precommit, ack, abort (two of them LOST), and commit; one process is marked CRASHED]
Agent crash
Agents learn commit decision
d is dead; coordinator decides to abort
Brief network partition
Agents A & B decide to commit
Kafka durability bug
[Figure: space-time diagram for Replica b, Replica c, Zookeeper, Replica a, and Client, with messages labeled m, l, a, c, and w; one process is marked CRASHED]
Brief network partition
a becomes leader and sole replica
a ACKs client write
Data loss
Molly summary 
Lineage allows us to reason backwards 
from good outcomes 
Molly: surgically-targeted fault injection 
Investment similar to testing 
Returns similar to formal methods
Where we’ve been; where we’re headed 
1. Mourning the death of transactions 
2. What is so hard about distributed systems? 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
Remember 
1. We need application-level guarantees 
2. asynchrony X partial failure = too hard to hide! We 
need tools to manage it. 
3. Focus on flow: data in motion 
4. Backwards from outcomes 
Composition is the hardest problem
A happy crisis 
Valentine: “It makes me so happy. To be at 
the beginning again, knowing almost 
nothing.... It's the best possible time of 
being alive, when almost everything you 
thought you knew is wrong.”

RICON keynote: outwards from the middle of the maze

  • 1. Outwards from the middle of the maze Peter Alvaro UC Berkeley
  • 2. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 3. The transaction concept
      DEBIT_CREDIT:
        BEGIN_TRANSACTION;
        GET MESSAGE;
        EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE;
        FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE;
        IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN
          PUT NEGATIVE RESPONSE;
        ELSE DO;
          ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA;
          POST HISTORY RECORD ON ACCOUNT (DELTA);
          CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA;
          BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA;
          PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE);
        END;
        COMMIT;
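The DEBIT_CREDIT transaction can be approximated in a few lines with Python's built-in `sqlite3` module. This is only a sketch: the table layout and the names `make_bank` and `debit_credit` are assumptions made for illustration, not something the slide prescribes. The point is that every write inside the `with conn:` block commits together or not at all.

```python
import sqlite3

def make_bank():
    """Create a tiny in-memory bank (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE account(id INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE history(account INTEGER, delta INTEGER);
        CREATE TABLE cash_drawer(teller INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE branch(id INTEGER PRIMARY KEY, balance INTEGER);
        INSERT INTO account VALUES (1, 100);
        INSERT INTO cash_drawer VALUES (7, 0);
        INSERT INTO branch VALUES (3, 0);
    """)
    return conn

def debit_credit(conn, account, delta, teller, branch):
    """Return the new balance, or None for a 'negative response'."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            row = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (account,)
            ).fetchone()
            if row is None or row[0] + delta < 0:
                raise ValueError("negative response")
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (delta, account))
            conn.execute("INSERT INTO history(account, delta) VALUES (?, ?)",
                         (account, delta))
            conn.execute("UPDATE cash_drawer SET balance = balance + ? WHERE teller = ?",
                         (delta, teller))
            conn.execute("UPDATE branch SET balance = balance + ? WHERE id = ?",
                         (delta, branch))
            return row[0] + delta
    except ValueError:
        return None
```

The application states its invariant (the balance never goes negative) once, and the store is responsible for making the whole update atomic.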
  • 13. Transactions: a holistic contract Write Read Application Opaque store Transactions
  • 14. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  • 15. Transactions: a holistic contract Assert: balance > 0 Write Read Application Opaque store Transactions
  • 18. Incidental complexities • The “Internet.” Searching it. • Cross-datacenter replication schemes • CAP Theorem • Dynamo & MapReduce • “Cloud”
  • 19. Fundamental complexity “[…] distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.” Jim Waldo et al., A Note on Distributed Computing (1994)
  • 20. A holistic contract …stretched to the limit Write Read Application Opaque store Transactions
  • 22. Are you blithely asserting that transactions aren’t webscale? Some people just want to see the world burn. Those same people want to see the world use inconsistent databases. - Emin Gun Sirer
  • 23. Alternative to top-down design? The “bottom-up,” systems tradition: Simple, reusable components first. Semantics later.
  • 31. The “bottom-up” ethos “‘Tis a fine barn, but sure ‘tis no castle, English”
  • 32. The “bottom-up” ethos Simple, reusable components first. Semantics later. This is how we live now. Question: Do we ever get those application-level guarantees back?
  • 33. Low-level contracts Write Read Application Distributed store KVS
  • 35. Low-level contracts Write Read Application Distributed store KVS R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 36. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 37. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 causal? PRAM? delta? fork/join? red/blue? Release? R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 38. When do contracts compose? Application Distributed service Assert: balance > 0
  • 39. Ew, did I get mongo in my riak? Assert: balance > 0
  • 40. Composition is the last hard problem Composing modules is hard enough We must learn how to compose guarantees
  • 41. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 42. Why distributed systems are hard² Asynchrony Partial Failure Fundamental Uncertainty
  • 43. Asynchrony isn’t that hard Amelioration: Logical timestamps Deterministic interleaving
  • 44. Partial failure isn’t that hard Amelioration: Replication Replay
  • 45. (asynchrony * partial failure) = hard² Logical timestamps Deterministic interleaving Replication Replay
  • 47. (asynchrony * partial failure) = hard² Tackling one clown at a time Poor strategy for programming distributed systems Winning strategy for analyzing distributed programs
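Logical timestamps, listed above as an amelioration for asynchrony, can be sketched as a Lamport-style counter. The class and method names here are hypothetical; the slides name the technique but show no implementation.

```python
class LamportClock:
    """A per-process logical clock (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Any local event advances the clock.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the returned timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time):
        # Jump past the sender's timestamp so that receive is ordered after send.
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()                # local event at process a
stamp = a.send()        # a sends a message
arrival = b.receive(stamp)
assert arrival > stamp  # the receive is ordered after the send
```

The clock gives a consistent ordering of causally related events without wall-clock time. It does nothing about lost messages or crashed processes, which is the other factor in the product above.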
  • 48. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 49. Distributed consistency Today: A quick summary of some great work.
  • 50. Consider a (distributed) graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 51. Partitioned, for scalability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 52. Replicated, for availability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 53. Deadlock detection Task: Identify strongly-connected components Waits-for graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 54. Garbage collection Task: Identify nodes not reachable from Root. Root Refers-to graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 55. T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory
  • 57. Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Root
  • 58. Consistency at the extremes Application Language Custom solutions? Flow Object Storage Linearizable key-value store?
  • 60. Consistency at the extremes Application Language Custom solutions? Flow Efficient Object Correct Storage Linearizable key-value store?
  • 61. Object-level consistency Capture semantics of data structures that • allow greater concurrency • maintain guarantees (e.g. convergence) Application Language Flow Object Storage
  • 63. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  • 66. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence Tolerant to: Reordering, Batching, Retry/duplication
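A grow-only set is probably the smallest convergent data structure that shows why these three algebraic properties buy tolerance to reordering, batching and duplication. The sketch below is an assumption-laden illustration: `GSet` and `converge` are hypothetical names, and real CRDT libraries offer richer types.

```python
from itertools import permutations

class GSet:
    """Grow-only set: merge is set union, which is commutative,
    associative and idempotent."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def insert(self, x):
        return GSet(self.items | {x})

    def merge(self, other):
        return GSet(self.items | other.items)

    def read(self):
        return self.items


def converge(updates):
    """Fold a sequence of replica states into one, in the order received."""
    state = GSet()
    for update in updates:
        state = state.merge(update)
    return state.read()


updates = [GSet({"a"}), GSet({"b"}), GSet({"a", "c"})]

# Reordering: every delivery order yields the same final state.
assert len({converge(order) for order in permutations(updates)}) == 1

# Retry/duplication: delivering everything twice changes nothing.
assert converge(updates + updates) == converge(updates)

# Batching: merging two updates first, then the third, gives the same result.
assert converge([updates[0].merge(updates[1]), updates[2]]) == converge(updates)
```

Convergence of the replicas is a guarantee about the object alone. Whether an application-level assertion built on top of it holds is a separate question.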
  • 67. Object-level composition? Application Convergent data structures Assert: Graph replicas converge
  • 68. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed Assert: Graph replicas converge
  • 69. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed ? ? Assert: Graph replicas converge
  • 70. Flow-level consistency Application Language Flow Object Storage
  • 71. Flow-level consistency Capture semantics of data in motion • Asynchronous dataflow model • component properties → system-wide guarantees Graph store Transaction manager Transitive closure Deadlock detector
  • 72. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 76. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) =
  • 77. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) { } = { }
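Order-insensitivity can be checked directly for a small component. The sketch below, with the hypothetical name `transitive_closure_stream`, consumes edges one at a time and maintains their transitive closure; feeding it every permutation of the same input set yields the same output set, which is the confluence property stated above.

```python
from itertools import permutations

def transitive_closure_stream(edges):
    """Consume edges in arrival order; return the set of (x, y) pairs
    such that y is reachable from x."""
    closure = set()
    for a, b in edges:
        # Everything that already reaches a (plus a itself) now reaches
        # b and everything b already reaches.
        preds = {a} | {x for (x, y) in closure if y == a}
        succs = {b} | {y for (x, y) in closure if x == b}
        closure |= {(x, y) for x in preds for y in succs}
    return closure


edges = [("T1", "T2"), ("T2", "T3"), ("T3", "T1"), ("T3", "T4")]
outputs = {frozenset(transitive_closure_stream(order))
           for order in permutations(edges)}
assert len(outputs) == 1  # output set = f(input set), whatever the order
```

Intermediate states differ between orders; only the final output set is required to agree.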
  • 78. Confluence is compositional output set = f ∘ g(input set)
  • 81. Graph queries as dataflow Graph store Memory allocator Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent
  • 82. Graph queries as dataflow Graph store Memory allocator Confluent Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Coordinate here
  • 83. Coordination: what is that? Strategy 1: Establish a total order Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent
  • 84. Coordination: what is that? Strategy 2: Establish a producer-consumer Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent barrier
  • 85. Fundamental costs: FT via replication (mostly) free! Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Graph store Transitive closure Deadlock detector Confluent Confluent Confluent
  • 86. Fundamental costs: FT via replication global synchronization! Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Garbage Collector Confluent Not Confluent Confluent Paxos Not Confluent
  • 87. Fundamental costs: FT via replication The first principle of successful scalability is to batter the consistency mechanisms down to a minimum. – James Hamilton Garbage Collector Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Confluent Not Confluent Confluent Barrier Not Confluent Barrier
  • 88. Language-level consistency DSLs for distributed programming? • Capture consistency concerns in the type system Application Language Flow Object Storage
  • 89. Language-level consistency CALM Theorem: Monotonic → confluent Conservative, syntactic test for confluence
  • 90. Language-level consistency Deadlock detector Garbage collector
  • 91. Language-level consistency Deadlock detector Garbage collector nonmonotonic
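The difference between the two queries shows up when the output on a prefix of the input is compared with the output on the full input. In the sketch below (function names and the tiny graphs are assumptions for illustration), deadlock detection only ever adds answers as edges arrive, while garbage collection must retract an earlier answer. That retraction is what makes it nonmonotonic.

```python
def reachable(edges, start):
    """Nodes reachable from start, including start itself."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for a, b in edges:
            if a == node and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

def deadlocked(waits_for):
    """Nodes that lie on a cycle of the waits-for graph."""
    return {a for (a, b) in waits_for if a in reachable(waits_for, b)}

def garbage(refers_to, nodes, root):
    """Nodes not reachable from root in the refers-to graph."""
    return set(nodes) - reachable(refers_to, root)


# Deadlock detection: more input can only add answers (monotonic).
early = {("T1", "T2")}
late = early | {("T2", "T1")}
assert deadlocked(early) <= deadlocked(late)

# Garbage collection: more input can invalidate an answer (nonmonotonic).
nodes = {"Root", "T1", "T2"}
refs_early = {("Root", "T1")}
refs_late = refs_early | {("T1", "T2")}
assert "T2" in garbage(refs_early, nodes, "Root")
assert "T2" not in garbage(refs_late, nodes, "Root")
```

A garbage collector acting on the early answer would reclaim live memory, so it has to know its input is complete before it acts. The deadlock detector can safely report whatever it has found so far.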
  • 92. Let’s review • Consistency is tolerance to asynchrony • Tricks: – focus on data in motion, not at rest – avoid coordination when possible – choose coordination carefully otherwise (Tricks are great, but tools are better)
  • 93. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 94. Grand challenge: composition Hard problem: Is a given component fault-tolerant? Much harder: Is this system (built up from components) fault-tolerant?
  • 95. Example: Atomic multi-partition update T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Two-phase commit
  • 96. Example: replication T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Reliable broadcast
  • 98. Example: Kafka replication bug Three “correct” components: 1. Primary/backup replication 2. Timeout-based failure detectors 3. Zookeeper One nasty bug: Acknowledged writes are lost
  • 99. A guarantee would be nice Bottom up approach: • use formal methods to verify individual components (e.g. protocols) • Build systems from verified components Shortcomings: • Hard to use • Hard to compose Investment Returns
  • 100. Bottom-up assurances Formal verification Environment Program Correctness Spec
  • 102. Composing bottom-up assurances Issue 1: incompatible failure models e.g., crash failure vs. omissions Issue 2: Specs do not compose (FT is an end-to-end property) If you take 10 components off the shelf, you are putting 10 world views together, and the result will be a mess. -- Butler Lampson
  • 108. Top-down “assurances” Fault injection Testing
  • 110. End-to-end testing would be nice Top-down approach: • Build a large-scale system • Test the system under faults Shortcomings: • Hard to identify complex bugs • Fundamentally incomplete Investment Returns
  • 111. Lineage-driven fault injection Goal: top-down testing that • finds all of the fault-tolerance bugs, or • certifies that none exist
  • 112. Lineage-driven fault injection Correctness Specification Malevolent sentience Molly
  • 113. Lineage-driven fault injection Molly Correctness Specification Malevolent sentience
  • 114. Lineage-driven fault injection (LDFI) Approach: think backwards from outcomes Question: could a bad thing ever happen? Reframe: • Why did a good thing happen? • What could have gone wrong along the way?
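One way to make "what could have gone wrong along the way?" concrete is the following. Treat each derivation of the good outcome as the set of messages it depended on, then look for small sets of message losses that break every derivation. The sketch below does this by brute force. The representation, the name `candidate_faults` and the example messages are all assumptions for illustration, and a real tool would likely hand the problem to a solver rather than enumerate.

```python
from itertools import combinations

def candidate_faults(supports, max_faults):
    """supports: list of sets, each the messages one derivation relied on.
    Returns minimal sets of messages whose loss breaks every derivation."""
    events = sorted(set().union(*supports))
    hits = []
    for k in range(1, max_faults + 1):
        for combo in combinations(events, k):
            chosen = set(combo)
            breaks_all = all(chosen & support for support in supports)
            is_minimal = not any(hit <= chosen for hit in hits)
            if breaks_all and is_minimal:
                hits.append(chosen)
    return hits


# Hypothetical lineage: b logged the data either directly from a,
# or via a relay through c. Messages are (sender, receiver, send time).
supports = [
    {("a", "b", 1)},
    {("a", "c", 1), ("c", "b", 2)},
]
for faults in candidate_faults(supports, max_faults=2):
    print(sorted(faults))
```

Each candidate is a hypothesis to test in the next execution. If the outcome still occurs, the new execution contributes more lineage and the search repeats.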
  • 115. Thomasina: What a faint-heart! We must work outward from the middle of the maze. We will start with something simple.
  • 116. The game • Both players agree on a failure model • The programmer provides a protocol • The adversary observes executions and chooses failures for the next execution.
  • 117. Dedalus: it’s about data log(B, “data”)@5 What Where When Some data
  • 118. Dedalus: it’s like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload);
  • 119. Dedalus: it’s like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload); (Which is like SQL) create view log as select Node, Pload from bcast;
  • 120. Dedalus: it’s about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
  • 121. Dedalus: it’s about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); State change Natural join (bcast.Node1 == node.Node1) Communication
  • 122. The match Protocol: Reliable broadcast Specification: Pre: A correct process delivers a message m Post: All correct processes deliver m Failure Model: (Permanent) crash failures Message loss / partitions
  • 123. Round 1 node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node, Pload)@next :- log(Node, Pload); log(Node, Pload) :- bcast(Node, Pload); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); “An effort” delivery protocol
  • 124. Round 1 in space / time Process b Process a Process c 2 1 2 log log
  • 125. Round 1: Lineage log(B, data)@5
  • 126. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  • 127. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3
  • 128. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B,data)@2
  • 129. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B, data)@2 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1; log(A, data)@1
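The backwards walk on the Round 1 lineage slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function name `lineage`, and the tuple encoding of facts are hypothetical, while the rule names (r1, r2) and the message label AB1 are taken from the proof-tree slides.

```python
# Hypothetical encoding of the Round 1 derivation graph shown on the slides.
# Facts are tuples (relation, arg1, arg2, time). Each derived fact maps to
# (rule name, premises, message label or None). Only the @async rule (r2)
# crosses the network, so only it carries a message label.
DERIVATIONS = {
    ("log", "B", "data", 5): ("r1", [("log", "B", "data", 4)], None),
    ("log", "B", "data", 4): ("r1", [("log", "B", "data", 3)], None),
    ("log", "B", "data", 3): ("r1", [("log", "B", "data", 2)], None),
    ("log", "B", "data", 2): (
        "r2",
        [("bcast", "A", "data", 1), ("node", "A", "B", 1)],
        "AB1",
    ),
}


def lineage(fact):
    """Walk backwards from `fact`; return (base facts, messages it depends on)."""
    if fact not in DERIVATIONS:
        return {fact}, set()  # a leaf: given as input, not derived
    _rule, premises, message = DERIVATIONS[fact]
    base, messages = set(), set()
    if message is not None:
        messages.add(message)
    for premise in premises:
        sub_base, sub_messages = lineage(premise)
        base |= sub_base
        messages |= sub_messages
    return base, messages


base_facts, messages = lineage(("log", "B", "data", 5))
print(sorted(messages))  # ['AB1']
```

The single message in the result is the one the adversary needs to drop to invalidate this derivation.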
  • 130. An execution is a (fragile) “proof” of an outcome log(A, data)@1 node(A, B)@1 AB1 r2 log(B, data)@2 r1 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 (which required a message from A to B at time 1)
  • 131. Valentine: “The unpredictable and the predetermined unfold together to make everything the way it is.”
  • 132. Round 1: counterexample Process b Process a Process c 1 2 log (LOST) log The adversary wins!
  • 133. Round 2 Same as Round 1, but A retries. bcast(N, P)@next :- bcast(N, P);
  • 134. Round 2 in spacetime Process b Process a Process c 2 3 4 5 1 2 3 4 2 3 4 5 log log log log log log log log
  • 135. Round 2 log(B, data)@5
  • 136. Round 2 log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  • 137. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
  • 138. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3
  • 139. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2
  • 140. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  • 141. Round 2 Retry provides redundancy in time log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  • 142. Traces are forests of proof trees log(A, data)@1 node(A, B)@1 AB1 r2 log(B, data)@2 r1 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 node(A, B)@1 r3 node(A, B)@2 AB2 r2 log(B, data)@3 r1 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 AB3 r2 log(B, data)@4 r1 log(B, data)@5 log(A, data)@1 r1 log(A, data)@2 r1 log(A, data)@3 r1 log(A, data)@4 node(A, B)@1 r3 node(A, B)@2 r3 node(A, B)@3 r3 node(A, B)@4 AB4 r2 log(B, data)@5 AB1 ^ AB2 ^ AB3 ^ AB4
  • 144. Round 2: counterexample Process b Process a Process c 1 log (LOST) log CRASHED 2 The adversary wins!
  • 145. Round 3 Same as in Round 2, but symmetrical. bcast(N, P)@next :- log(N, P);
  • 146. Round 3 in space / time Process b Process a Process c 2 3 4 5 1 log log 2 3 4 5 2 3 4 5 log log log log log log log log log log log log log log log log log log Redundancy in space and time
  • 147. Round 3 -- lineage log(B, data)@5
  • 148. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4
  • 149. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3
  • 150. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B, data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  • 151. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B, data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  • 152. Round 3 The programmer wins!
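The three rounds of the game can be replayed with a small simulator. The sketch below is not Molly and not a Dedalus evaluator. It is a hand-written Python model that assumes three fully connected nodes, one-step message delivery, and an end of time of 5, as drawn on the slides. The names `run`, `spec_holds`, `dropped`, and `crashes` are hypothetical.

```python
NODES = ["A", "B", "C"]
EOT = 5  # last timestep drawn on the slides


def run(round_no, dropped=frozenset(), crashes=None):
    """Return the set of nodes whose log holds the payload at time EOT.

    dropped: set of (sender, receiver, send_time) messages the adversary loses.
    crashes: dict mapping a node to the first timestep at which it is down.
    """
    crashes = crashes or {}

    def up(node, t):
        return t < crashes.get(node, EOT + 1)

    log = set()       # log(Node, data); persists via the @next rule
    bcast = {"A"}     # bcast(A, data)@1 starts the protocol
    arriving = set()  # receivers of messages sent at the previous timestep
    for t in range(1, EOT + 1):
        # log(Node, Pload) :- bcast(Node, Pload), plus @async deliveries.
        log |= {n for n in arriving | bcast if up(n, t)}
        # log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2)
        arriving = {
            r
            for s in bcast if up(s, t)
            for r in NODES if r != s and (s, r, t) not in dropped
        }
        if round_no == 1:
            bcast = set()      # Round 1: a single send, no retry
        elif round_no == 3:
            bcast = set(log)   # Round 3: bcast(N, P)@next :- log(N, P)
        # Round 2: bcast(N, P)@next :- bcast(N, P), so bcast is unchanged
    return log


def spec_holds(final_log, crashes=None):
    """Pre: some correct process delivered m. Post: all correct processes did."""
    correct = [n for n in NODES if n not in (crashes or {})]
    pre = any(n in final_log for n in correct)
    post = all(n in final_log for n in correct)
    return (not pre) or post


loss = {("A", "B", 1)}  # the LOST message on the counterexample slides
crash = {"A": 2}        # A crashes before it can retry

print(spec_holds(run(1, loss)))                # False: Round 1 counterexample
print(spec_holds(run(2, loss)))                # True: retry masks a single loss
print(spec_holds(run(2, loss, crash), crash))  # False: Round 2 counterexample
print(spec_holds(run(3, loss, crash), crash))  # True: C relays the payload to B
```

The last line shows only that the two earlier counterexamples no longer break Round 3. It says nothing about failures outside the agreed failure model.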
  • 153. Let’s reflect Fault-tolerance is redundancy in space and time. Best strategy for both players: reason backwards from outcomes using lineage Finding bugs: find a set of failures that “breaks” all derivations Fixing bugs: add additional derivations
  • 154. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. (AB1 ∨ BC2) Disjunction
  • 155. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. (AB1 ∨ BC2) Disjunction ∧ (AC1) ∧ (AC2) Conjunction of disjunctions (AKA CNF)
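The adversary's task above amounts to picking at least one message from every proof. The sketch below uses brute-force enumeration as a stand-in for a real SAT solver, which is a deliberate simplification. The function name `minimal_fault_sets` is hypothetical, and the message labels are the ones shown on the slides.

```python
from itertools import combinations


def minimal_fault_sets(proofs):
    """Return every minimal set of dropped messages that breaks all proofs.

    proofs: list of sets; each set holds the messages one derivation relies on.
    Dropping any one message of a proof breaks it (a disjunction), and all
    proofs must be broken (a conjunction), so the input is a CNF formula.
    """
    universe = sorted(set().union(*proofs))
    found = []
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = set(candidate)
            if any(smaller <= chosen for smaller in found):
                continue  # contains an already-found solution, so not minimal
            if all(chosen & proof for proof in proofs):
                found.append(chosen)
    return found


# (AB1 v BC2) ^ (AC1) ^ (AC2), the formula from the slide
slide_formula = [{"AB1", "BC2"}, {"AC1"}, {"AC2"}]
for faults in minimal_fault_sets(slide_formula):
    print(sorted(faults))
# ['AB1', 'AC1', 'AC2']
# ['AC1', 'AC2', 'BC2']

# Round 2: four independent proofs, one per retry
print(minimal_fault_sets([{"AB1"}, {"AB2"}, {"AB3"}, {"AB4"}]))
```

Enumeration is exponential in the number of messages, so it is only suitable for examples of this size.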
  • 157. Molly, the LDFI prototype Molly finds fault-tolerance violations quickly or guarantees that none exist. Molly finds bugs by explaining good outcomes – then it explains the bugs. Bugs identified: 2pc, 2pc-ctp, 3pc, Kafka Certified correct: paxos (synod), Flux, bully leader election, reliable broadcast
  • 158. Commit protocols Problem: Atomically change things Correctness properties: 1. Agreement (All or nothing) 2. Termination (Something)
  • 159. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit
  • 160. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it?
  • 161. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it? YES YOU CAN
  • 162. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it? YES YOU CAN Well I’m gone
  • 163. Two-phase commit Agent a Agent b Coordinator Agent d 2 2 1 p p p 3 CRASHED 2 v v v Violation: Termination
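The termination violation above can be reproduced with a toy model of two-phase commit. This is a minimal sketch under stated assumptions: every agent votes yes, no messages are lost, and the timestep numbering follows the diagram. The function names and state labels are hypothetical.

```python
def two_phase_commit(agents, coordinator_crash_at=None):
    """Run one all-yes instance of 2PC and return each agent's final state.

    Timesteps follow the slide: prepare is sent at 1, agents vote at 2,
    the coordinator collects votes at 3 and sends the decision at 4.
    """

    def coordinator_up(t):
        return coordinator_crash_at is None or t < coordinator_crash_at

    state = {agent: "init" for agent in agents}
    if coordinator_up(1):
        for agent in agents:
            state[agent] = "prepared"  # agent received prepare and voted yes
    votes_in = coordinator_up(3) and all(s == "prepared" for s in state.values())
    if votes_in and coordinator_up(4):
        for agent in agents:
            state[agent] = "committed"
    return state


def agreement(state):
    """All or nothing: no two agents reach different decisions."""
    decided = {s for s in state.values() if s in ("committed", "aborted")}
    return len(decided) <= 1


def termination(state):
    """Something: every agent eventually decides."""
    return all(s in ("committed", "aborted") for s in state.values())


healthy = two_phase_commit(["a", "b", "d"])
stuck = two_phase_commit(["a", "b", "d"], coordinator_crash_at=3)
print(agreement(healthy), termination(healthy))  # True True
print(agreement(stuck), termination(stuck))      # True False
```

In the second run the agents have voted and can neither commit nor abort on their own, which is the blocking behavior the collaborative termination protocol tries to address.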
  • 164. The collaborative termination protocol Basic idea: Agents talk amongst themselves when the coordinator fails. Protocol: On timeout, ask other agents about decision.
  • 165. 2PC - CTP Agent a Agent b Coordinator Agent d 2 3 4 5 6 7 prepare prepare prepare 2 3 4 5 6 7 1 2 3 CRASHED 2 3 4 5 6 7 vote decision_req decision_req vote decision_req decision_req vote decision_req decision_req
  • 166. 2PC - CTP Agent a Agent b Coordinator Agent d 2 3 4 5 6 7 prepare prepare prepare 2 3 4 5 6 7 1 2 3 CRASHED 2 3 4 5 6 7 vote decision_req decision_req vote decision_req decision_req vote decision_req decision_req Can I kick it? YES YOU CAN ……?
  • 167. 3PC Basic idea: Add a round, a state, and simple failure detectors (timeouts). Protocol: 1. Phase 1: Just like in 2PC – Agent timeout → abort 2. Phase 2: send canCommit, collect acks – Agent timeout → commit 3. Phase 3: Just like phase 2 of 2PC
  • 168. 3PC Process a Process b Process C Process d 2 4 7 2 4 7 1 cancommit cancommit cancommit 3 vote_msg precommit precommit precommit 5 6 2 4 7 vote_msg ack vote_msg ack ack commit commit commit
  • 169. 3PC Process a Process b Process C Process d 2 4 7 2 4 7 1 cancommit cancommit cancommit 3 vote_msg precommit precommit precommit 5 6 2 4 7 vote_msg ack vote_msg ack ack commit commit commit Timeout → Abort Timeout → Commit
  • 170. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg
  • 171. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Agent crash Agents learn commit decision
  • 172. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  • 173. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  • 174. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort Agents A & B decide to commit
  • 175. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w
  • 176. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition
  • 177. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica
  • 178. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica a ACKs client write
  • 179. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica a ACKs client write Data loss
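The sequence of events annotated on the Kafka slides can be walked through with a toy model. This sketch does not use or imitate any Kafka API. It assumes an acknowledgement rule of "ack once every replica in the current in-sync set has the write", and it assumes that replica b takes over after a crashes. All names are hypothetical.

```python
def kafka_durability_scenario():
    """Replay the slide's failure sequence; return (acked writes, lost writes)."""
    logs = {"a": [], "b": [], "c": []}
    acked = []

    # Brief network partition: timeouts make b and c look dead, so the
    # in-sync set shrinks and a becomes leader and sole replica.
    in_sync = {"a"}
    leader = "a"

    # Client write: the leader appends, replicates to the in-sync set,
    # and acknowledges once every in-sync replica holds the write.
    write = "w"
    logs[leader].append(write)
    for replica in in_sync:
        if write not in logs[replica]:
            logs[replica].append(write)
    if all(write in logs[replica] for replica in in_sync):
        acked.append(write)

    # a crashes; the partition heals and a surviving replica takes over.
    del logs["a"]
    leader = "b"

    lost = [w for w in acked if w not in logs[leader]]
    return acked, lost


acked, lost = kafka_durability_scenario()
print(acked, lost)  # ['w'] ['w']
```

Each step is reasonable for its own component, yet the composition loses an acknowledged write, which is the point of the slide.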
  • 180. Molly summary Lineage allows us to reason backwards from good outcomes Molly: surgically-targeted fault injection Investment similar to testing Returns similar to formal methods
  • 181. Where we’ve been; where we’re headed 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 182. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 184. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. (asynchrony X partial failure) = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 185. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 186. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  • 187. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  • 188. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes
  • 189. Remember 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes Composition is the hardest problem
  • 190. A happy crisis Valentine: “It makes me so happy. To be at the beginning again, knowing almost nothing.... It's the best possible time of being alive, when almost everything you thought you knew is wrong.”

Editor's Notes

  1. USER-CENTRIC
  2. OMG pause here. Remember Brewer 2012? Top-down vs bottom-up designs? We had this top-down thing and it was beautiful.
  3. It was so beautiful that it didn’t matter that it was somewhat ugly
  4. The abstraction was so beautiful, IT DOESN’T MATTER WHAT’S UNDERNEATH. Wait, or does it? When does it?
  5. We’ve known for a long time that it is hard to hide the complexities of distribution
  6. Focus not on semantics, but on the properties of components: thin interfaces, understandable latency & failure modes. DEV-centric But can we ever recover those guarantees? I mean real guarantees, at the application level? Are my (app-level) constraints upheld? No? What can go wrong?
  7. FIX ME: joe’s idea: sketch of a castle being filled in, vs bricks But can we ever recover those guarantees? I mean real guarantees, at the application level? Are my (app-level) constraints upheld? No? What can go wrong?
  8. In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  9. In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  10. In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  11. In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  12. In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  13. Meaning: translation
  14. Distributed systems are hard because of uncertainty – nondeterminism – which is fundamental to the environment and can “leak” into the results. It’s astoundingly difficult to face these demons at the same time; it is tempting to try to defeat them one at a time.
  15. Async isn’t a problem: just need to be careful to number messages and interleave correctly. Ignore arrival order. Whoa, this is easy so far.
  16. Failure isn’t a problem: just do redundant computation and store redundant data. Make more copies than there will be failures. I win.
  17. We can’t do deterministic interleaving if producers may fail. Nondeterministic message order makes it hard to keep replicas in agreement.
  18. We can’t do deterministic interleaving if producers may fail. Nondeterministic message order makes it hard to keep replicas in agreement.
  19. We can’t do deterministic interleaving if producers may fail. Nondeterministic message order makes it hard to keep replicas in agreement.
  20. To guard against failures, we replicate. NB: asynchrony => replicas might not agree
  21. Very similar looking criteria (1 safe 1 live). Takes some work, even on a single site. But hard in our scenario: disorder => replica disagreement, partial failure => missing partitions
  22. Very similar looking criteria (1 safe 1 live). Takes some work, even on a single site. But hard in our scenario: disorder => replica disagreement, partial failure => missing partitions
  23. Very similar looking criteria (1 safe 1 live). Takes some work, even on a single site. But hard in our scenario: disorder => replica disagreement, partial failure => missing partitions
  24. FIX: make it about translation vs. prayer
  25. FIX: make it about translation vs. prayer
  26. FIX: make it about translation vs. prayer
  27. I.e., reorderability, batchability, tolerance to duplication / retry. Now the programmer must map from application invariants to an object API (with richer semantics than read/write).
  28. I.e., reorderability, batchability, tolerance to duplication / retry. Now the programmer must map from application invariants to an object API (with richer semantics than read/write).
  29. Convergence is a property of component state. It rules out divergence, but it does not readily compose.
  30. Convergence is a property of component state. It rules out divergence, but it does not readily compose.
  31. Convergence is a property of component state. It rules out divergence, but it does not readily compose.
  32. Convergence is a property of component state. It rules out divergence, but it does not readily compose.
  33. However, not sufficient to synchronize GC. Perhaps more importantly, not *compositional* -- what guarantees does my app – pieced together from many convergent objects – give? To reason compositionally, need guarantees about what comes OUT of my objects, and how it transits the app. *** main point to make here: we’d like to reason backwards from the outcomes, at the level of abstraction of the application.
  34. However, not sufficient to synchronize GC. Perhaps more importantly, not *compositional* -- what guarantees does my app – pieced together from many convergent objects – give? To reason compositionally, need guarantees about what comes OUT of my objects, and how it transits the app. *** main point to make here: we’d like to reason backwards from the outcomes, at the level of abstraction of the application.
  35. However, not sufficient to synchronize GC. Perhaps more importantly, not *compositional* -- what guarantees does my app – pieced together from many convergent objects – give? To reason compositionally, need guarantees about what comes OUT of my objects, and how it transits the app. *** main point to make here: we’d like to reason backwards from the outcomes, at the level of abstraction of the application.
  36. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  37. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  38. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  39. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  40. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  41. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  42. Confluence is compositional: Composing confluent components yields a confluent dataflow
  43. Confluence is compositional: Composing confluent components yields a confluent dataflow
  44. Confluence is compositional: Composing confluent components yields a confluent dataflow
  45. All of these components are confluent! Composing confluent components yields a confluent dataflow But annotations are burdensome
  46. All of these components are confluent! Composing confluent components yields a confluent dataflow But annotations are burdensome
  47. A separate question is choosing a coordination strategy that “fits” the problem without “overpaying.” For example, we could establish a global ordering of messages, but that would essentially cost us what linearizable storage cost us. We can solve the GC problem with SEALING: establishing a big barrier; damming the stream.
  48. A separate question is choosing a coordination strategy that “fits” the problem without “overpaying.” For example, we could establish a global ordering of messages, but that would essentially cost us what linearizable storage cost us. We can solve the GC problem with SEALING: establishing a big barrier; damming the stream.
  49. M – a semantic property of code – implies confluence An appropriately constrained language provides a conservative syntactic test for M.
  50. M – a semantic property of code – implies confluence An appropriately constrained language provides a conservative syntactic test for M.
  51. Also note that a data-centric language gives us the dataflow graph automatically, via dependencies (across LOC, modules, processes, nodes, etc.)
  52. Also note that a data-centric language gives us the dataflow graph automatically, via dependencies (across LOC, modules, processes, nodes, etc.)
  53. Try to not use it! Learn how to choose it. Tools help!
  54. Start with a hard problem. Hard problem: does my FT protocol work? Harder: is the composition of my components FT?
  55. Point: we need to replicate data to both copies of a replica We need to commit multiple partitions together
  56. Start with a hard problem. Hard problem: does my FT protocol work? Harder: is the composition of my components FT?
  57. Examples! 2pc and replication. Properties, etc etc
  58. Talk about speed too.
  59. After all, FT is an end-to-end concern.
  60. (synchronous)
  61. (synchronous)
  62. (synchronous)
  63. TALK ABOUT SAT!!!
  64. TALK ABOUT SAT!!!