Distributed Systems Theory
for Mere Mortals
Ensar Basri Kahveci
Distributed Systems Engineer, Hazelcast
In this presentation, I talk about distributed systems theory based on my own understanding.
First of all, distributed systems theory is hard, and it covers a wide range of topics.
So, my statements might be wrong or incomplete!
Please raise any point you find confusing or think is wrong.
- Defining distributed systems
- System Models
- Time and Order
- Consensus, FLP Result, Failure Detectors
- Consensus Algorithms: 2PC, 3PC, Paxos and others...
“A DISTRIBUTED SYSTEM IS ONE IN WHICH THE
FAILURE OF A COMPUTER YOU DID NOT EVEN
KNOW EXISTED CAN RENDER YOUR OWN
COMPUTER UNUSABLE.” — Leslie Lamport
What is a distributed system?
- Collection of entities (machines, nodes, processes...)
- trying to solve a common problem,
- linked by a network and communicating via passing messages,
- having uncertain and partial knowledge of the system.
About being distributed…
- Independent failures
- Some servers might fail while others work correctly.
- Non-negligible message transmission delays
- The interconnection between servers has lower bandwidth and higher latency than that
available within a single server.
- Unreliable communication
- The connections between servers are unreliable compared to the connections within a
single server.
Time and Order
- We use time to:
- order events
- measure the duration between events
- In the asynchronous model, nodes have local clocks, which can drift without bound.
- Components of a distributed system behave in an unpredictable manner.
- Failures, rates of clock advance, delays in network packets, etc.
- We cannot assume synchronized clocks while designing our algorithms in the asynchronous model.
- Clock synchronization methods help us a lot but don’t fix the problem completely.
The Idea: Ordering Events
- We don’t have the notion of “now” in distributed systems.
- To what extent do we need it?
- We don’t need absolute clock synchronization.
- If machines don’t interact with each other, why bother synchronizing their clocks?
- For a lot of problems, processes need to agree on the order in which
events occur, rather than the time at which they occur
Ordering Events: Logical Clocks
- We can use Logical Clocks (= Lamport Clocks) to order events in a distributed system.
- Logical clocks rely on counters and the communication between nodes.
- Each node maintains a local counter value.
- happened-before relationship ( “→” )
- If events a and b are events in the same process, and a comes before b, then a → b
- If a is the sending of a message and b is its receipt, then a → b
- If a → b and b → c, then a → c
- If neither of a → b or b → a holds, a and b are concurrent.
- Partial ordering and total ordering of the events
- For any events a, b: if a → b, then C(a) < C(b).
- Can we also infer the reverse?
- p1 → q2 and q2 → q3, then C(q3) > C(p1)
- Causality: p1 causes q2 and q2 causes q3, then p1 causes q3.
- p3 and q3 are concurrent events: neither p3 → q3 nor q3 → p3 holds
in the happened-before relationship.
- So we cannot infer causality just by comparing C(p3) and C(q3).
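The counter rules above can be sketched in a few lines. This is a minimal, illustrative Lamport clock (the class and method names are my own, not from any library):

```python
# A minimal Lamport clock sketch (illustrative; names are my own).
class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self):
        """Local event: increment the counter."""
        self.counter += 1
        return self.counter

    def send(self):
        """Attach the current counter to an outgoing message."""
        return self.tick()

    def receive(self, msg_counter):
        """On receipt, jump past the sender's counter."""
        self.counter = max(self.counter, msg_counter) + 1
        return self.counter

# If a -> b, then C(a) < C(b); the reverse does not hold.
p, q = LamportClock(), LamportClock()
a = p.send()      # event a on p
b = q.receive(a)  # event b on q: a -> b, so C(a) < C(b)
assert a < b
```

Note that two unrelated events on different nodes may still end up with ordered counter values, which is exactly why C(a) < C(b) alone does not imply a → b.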
Vector Clocks and Causality
- We use vector clocks to infer
causalities by comparing clock values.
- If V(a) < V(b) then a causally precedes b
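A vector clock extends the counter to one slot per node, so the comparison itself reveals causality. A hedged sketch (names are my own; fixed number of nodes for simplicity):

```python
# A minimal vector clock sketch (illustrative; names are my own).
class VectorClock:
    def __init__(self, node_id, n):
        self.node_id = node_id
        self.clock = [0] * n

    def tick(self):
        """Local or send event: bump our own slot."""
        self.clock[self.node_id] += 1
        return list(self.clock)

    def receive(self, msg_clock):
        """Element-wise max with the incoming clock, then tick locally."""
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        return self.tick()

def precedes(va, vb):
    """True if the event stamped va causally precedes vb, i.e. V(a) < V(b)."""
    return all(a <= b for a, b in zip(va, vb)) and va != vb

p, q = VectorClock(0, 2), VectorClock(1, 2)
va = p.tick()       # event a on p
vb = q.receive(va)  # event b on q, after receiving a's clock
vc = p.tick()       # event c on p, concurrent with b
assert precedes(va, vb)
assert not precedes(vc, vb) and not precedes(vb, vc)  # concurrent
```

Unlike Lamport clocks, incomparable vectors tell us the events are concurrent, so the reverse inference works.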
Are Logical Clocks our only option?
- Google Spanner  uses NTP, GPS, and atomic clocks to synchronize the
local clocks of the machines as much as possible.
- It doesn’t pretend that clocks are perfectly synchronized.
- It introduces the uncertainty of clocks into its TrueTime API.
- CockroachDB  uses Hybrid Logical Clocks , which combine logical
clocks and physical clocks to infer causalities.
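To make the hybrid idea concrete, here is a hedged sketch following the update rules of the Hybrid Logical Clocks paper by Kulkarni et al. (the class shape and the injected `now` callable are my own, added to keep the example deterministic):

```python
# A sketch of a Hybrid Logical Clock, after Kulkarni et al. (illustrative).
# `now` is an injected physical-clock source so the example is deterministic.
class HybridLogicalClock:
    def __init__(self, now):
        self.now = now  # callable returning the physical time
        self.l = 0      # logical component: max physical time seen so far
        self.c = 0      # counter to break ties within the same l

    def tick(self):
        """Local or send event."""
        pt = self.now()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, l_m, c_m):
        """Merge a remote timestamp (l_m, c_m) into the local clock."""
        pt = self.now()
        l_new = max(self.l, l_m, pt)
        if l_new == self.l == l_m:
            self.c = max(self.c, c_m) + 1
        elif l_new == self.l:
            self.c += 1
        elif l_new == l_m:
            self.c = c_m + 1
        else:
            self.c = 0
        self.l = l_new
        return (self.l, self.c)

clock = HybridLogicalClock(now=lambda: 100)  # frozen physical clock
t1 = clock.tick()           # (100, 0)
t2 = clock.receive(105, 3)  # remote clock is ahead: (105, 4)
assert t1 < t2              # HLC timestamps respect happened-before
```

The timestamps stay close to physical time (useful for snapshots) while still ordering causally related events, which is the property CockroachDB relies on.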
Consensus
- The problem of having a set of processes agree on a value.
- leader election, state machine replication, deciding to commit a transaction, etc.
- Validity: the value agreed upon must have been proposed by some process
- Termination: at least one non-faulty process eventually decides
- Agreement: all deciding processes agree on the same value
Liveness and Safety Properties
- Liveness: A “good” thing happens during execution of an algorithm
- Safety: Some “bad” thing never happens during execution of an algorithm
FLP Result (Fischer, Lynch and Paterson) 
- Distributed consensus is not always possible ...
- with reliable message delivery
- with a single crash-stop failure
- … in the asynchronous model, because we cannot differentiate between a crashed
process and a slow process.
- No algorithm can always guarantee termination in the presence of crashes.
- It is related to the liveness property, not the safety property.
Detecting failures: Why aren’t you “talking to me”?
Unreliable Failure Detectors by Chandra and Toueg 
- Distributed failure detectors which are allowed to make mistakes
- Each process has a local state to keep the list of processes that it suspects have failed
- A local failure detector can make 2 types of mistakes
- suspecting processes that haven’t actually crashed ⇒ ACCURACY property
- not-suspecting processes that have actually crashed ⇒ COMPLETENESS property
- Degrees of completeness
- strong completeness, weak completeness
- Degrees of accuracy
- strong accuracy, weak accuracy, eventually strong accuracy, eventually weak accuracy
Classes of Failure Detectors
- Perfect Failure Detector (P)
- Strongly Complete: Every faulty process is eventually permanently suspected by every non-faulty process.
- Strongly Accurate: No process is suspected (by anybody) before it crashes.
- Eventually Strong Failure Detector (⋄S)
- Strongly Complete
- Eventually Weakly Accurate: After some initial period of confusion, some non-faulty process is never suspected.
- The consensus problem can be solved with an Eventually Strong Failure Detector (⋄S)
with f < n / 2 failures in the asynchronous model. , 
- As long as you hear from the majority, you can solve consensus. ⇒ SAFETY
- Every correct process eventually decides. No blocking forever. ⇒ LIVENESS
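The suspect-list idea above can be sketched as a simple heartbeat-based detector. This is illustrative only (names, the injected clock, and the fixed timeout are my own; real detectors adapt timeouts, which is how the "eventually accurate" part is achieved):

```python
# A minimal heartbeat-style failure detector sketch (illustrative).
# It suspects a process after `timeout` time units without a heartbeat,
# and revises its suspicion on a late heartbeat: it is allowed to make
# mistakes, matching the Chandra-Toueg unreliable-detector model.
class HeartbeatFailureDetector:
    def __init__(self, peers, timeout, now):
        self.timeout = timeout
        self.now = now  # injected clock for determinism
        self.last_seen = {p: now() for p in peers}

    def heartbeat(self, peer):
        """Record a heartbeat; un-suspect the peer if it was suspected."""
        self.last_seen[peer] = self.now()

    def suspected(self):
        """Processes we currently suspect (possibly wrongly)."""
        t = self.now()
        return {p for p, seen in self.last_seen.items()
                if t - seen > self.timeout}

t = [0]
fd = HeartbeatFailureDetector({"A", "B"}, timeout=5, now=lambda: t[0])
t[0] = 10        # time passes with no heartbeats
assert fd.suspected() == {"A", "B"}
fd.heartbeat("A")  # A speaks up: the mistake about A is corrected
assert fd.suspected() == {"B"}
```

Suspecting a merely slow process (the first assertion, if "A" was alive all along) is an accuracy mistake; never permanently suspecting a crashed process would be a completeness mistake.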
2PC, 3PC, Paxos, Raft and the others
Two-Phase Commit (2PC) 
- With no failures, it satisfies Validity,
Termination, and Agreement.
- C (the coordinator) crashes before Phase 1: No problem.
- C crashes before Phase 2: Participant A can ask B
what it has voted for.
- C and A crash before Phase 2: The
remaining participant B blocks, as it cannot learn the outcome.
- The protocol blocks with fail-stop failures
(the simplest failure model).
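The two phases look like this in a happy-path sketch. This assumes synchronous, reliable calls and deliberately omits the failure handling discussed above (class and function names are my own):

```python
# A single-round 2PC sketch: happy path plus a vote-abort (illustrative).
# Assumes reliable, synchronous calls; no crash handling, which is
# exactly where the real protocol blocks.
class Participant:
    def __init__(self, will_commit=True):
        self.will_commit = will_commit
        self.state = "INIT"

    def prepare(self):
        """Phase 1: vote yes/no (a real participant also logs a prepare record)."""
        self.state = "PREPARED" if self.will_commit else "ABORTED"
        return self.will_commit

    def finish(self, decision):
        """Phase 2: apply the coordinator's decision."""
        self.state = decision

def two_phase_commit(participants):
    # Phase 1: collect votes; all must vote yes to commit.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(votes) else "ABORT"
    # Phase 2: broadcast the decision.
    for p in participants:
        p.finish(decision)
    return decision

a, b = Participant(), Participant(will_commit=False)
assert two_phase_commit([a, b]) == "ABORT"
assert two_phase_commit([Participant(), Participant()]) == "COMMIT"
```

The blocking case corresponds to a participant stuck in `"PREPARED"` while the coordinator is down: it can neither commit nor abort on its own.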
Three-Phase Commit (3PC) 
- The main problem of 2PC is that
the participants don’t know the outcome
of the voting before they actually
take action (commit / abort).
- We add a new step for this ⇒
- 3PC is non-blocking and it
handles fail-stop failures.
- What about fail-recover failures, network
partitions, or the asynchronous
model? 3PC cannot handle these.
Paxos , 
- It chooses to sacrifice liveness to maintain safety
- It doesn’t terminate when the network behaves asynchronously and
terminates only when synchronicity returns.
- It doesn’t block when the majority is available.
- The correct run is similar to 2PC.
- 2 new mechanisms:
- An ordering over proposals, so that we can determine which proposal should
be accepted: sequence numbers
- Requiring only a majority, instead of all participants
- The original paper “The Part-time Parliament”  is difficult to read as it explains the
algorithm using an analogy with Greek democracy.
- Submitted in 1990, published in 1998, after being explained in another paper  in 1996.
- “The Paxos algorithm, when presented in plain English, is very simple” Paxos Made Simple 
- Cheap Paxos , Fast Paxos  and many other variations…
- Paxos Made Live : There are significant gaps between the description of the Paxos algorithm and the
needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas
scattered in the literature and make several relatively small protocol extensions. The cumulative effort will
be substantial and the final system will be based on an unproven protocol.
- Paxos Made Moderately Complex : For anybody who has ever tried to implement it, Paxos is by no
means a simple protocol, even though it is based on relatively simple invariants. This paper provides
imperative pseudo-code for the full Paxos (or Multi-Paxos) protocol without shying away from discussing
various implementation details.
Raft: In search of an understandable consensus algorithm 
- A new consensus algorithm with understandability as one of its design goals
- It divides the problem into parts:
- leader election, log replication, safety and membership changes
- Also discusses implementation details
- More than 80 implementations are listed on its website 
Other Consensus Algorithms
- Viewstamped Replication , 
- Another consensus algorithm. It is less popular than Paxos.
- Raft has a lot of similarities to it.
- Zab 
- Implemented in ZooKeeper
- Many variants of Paxos...
 Lamport, Leslie. "Time, clocks, and the ordering of events in a distributed system." Communications of the ACM 21.7 (1978): 558-565.
 Raynal, Michel, and Mukesh Singhal. "Logical time: Capturing causality in distributed systems." Computer 29.2 (1996): 49-56.
 Corbett, James C., et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31.3 (2013): 8.
 Kulkarni, Sandeep S., et al. "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases." (2014).
 Fischer, Michael J., Nancy A. Lynch, and Michael S. Paterson. "Impossibility of distributed consensus with one faulty process." Journal of the ACM (JACM) 32.2 (1985): 374-382.
 Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
 Chandra, Tushar Deepak, Vassos Hadzilacos, and Sam Toueg. "The weakest failure detector for solving consensus." Journal of the ACM (JACM) 43.4 (1996): 685-722.
 Gray, James N. "Notes on database operating systems." Operating Systems. Springer Berlin Heidelberg, 1978. 393-481.
 Skeen, Dale. "Nonblocking commit protocols." Proceedings of the 1981 ACM SIGMOD international conference on Management of data. ACM, 1981.
 Lamport, Leslie. "The part-time parliament." ACM Transactions on Computer Systems (TOCS) 16.2 (1998): 133-169.
 Lamport, Leslie. "Paxos made simple." ACM Sigact News 32.4 (2001): 18-25.
 Lamport, Leslie, and Mike Massa. "Cheap paxos." Dependable Systems and Networks, 2004 International Conference on. IEEE, 2004.
 Lamport, Leslie. "Fast paxos." Distributed Computing 19.2 (2006): 79-103.
 Chandra, Tushar D., Robert Griesemer, and Joshua Redstone. "Paxos made live: an engineering perspective." Proceedings of the twenty-sixth annual ACM symposium on
Principles of distributed computing. ACM, 2007.
 Van Renesse, Robbert, and Deniz Altinbuken. "Paxos made moderately complex." ACM Computing Surveys (CSUR) 47.3 (2015): 42.
 Lampson, Butler. "How to build a highly available system using consensus." Distributed Algorithms (1996): 1-17.
 Ongaro, Diego, and John Ousterhout. "In search of an understandable consensus algorithm." 2014 USENIX Annual Technical Conference (USENIX ATC 14). 2014.
 Oki, Brian M., and Barbara H. Liskov. "Viewstamped replication: A new primary copy method to support highly-available distributed systems." Proceedings of the seventh
annual ACM Symposium on Principles of distributed computing. ACM, 1988.
 Liskov, Barbara, and James Cowling. "Viewstamped replication revisited." (2012).
 Junqueira, Flavio P., Benjamin C. Reed, and Marco Serafini. "Zab: High-performance broadcast for primary-backup systems." 2011 IEEE/IFIP 41st International Conference
on Dependable Systems & Networks (DSN). IEEE, 2011.
 http://the-paper-trail.org/blog/consensus-protocols-paxos/
Stay tuned for the next episode...