Distributed Systems Theory for Mere Mortals

A quick walk-through of fundamental topics in distributed systems

@ Hazelcast Dev Days, July 2016, Istanbul

Distributed Systems Theory for Mere Mortals
Ensar Basri Kahveci
Distributed Systems Engineer, Hazelcast
Disclaimer Notice
In this presentation, I talk about distributed systems theory based on my own understanding. First of all, distributed systems theory is hard. It also covers a wide range of topics. So, my statements might be wrong or incomplete! Please discuss any point you are confused about or think I am wrong about.
Agenda
- Defining distributed systems
- System Models
- Time and Order
- Consensus, FLP Result, Failure Detectors
- Consensus Algorithms: 2PC, 3PC, Paxos and others...
"A distributed system is one in which the failure of a computer you did not even know existed can render your own computer unusable."
Leslie Lamport
What is a distributed system?
- A collection of entities (machines, nodes, processes...)
- trying to solve a common problem,
- linked by a network and communicating via message passing,
- having uncertain and partial knowledge of the system.
About being distributed…
- Independent failures
  - Some servers might fail while others work correctly.
- Non-negligible message transmission delays
  - The interconnection between servers has lower bandwidth and higher latency than what is available within a single server.
- Unreliable communication
  - The connections between servers are unreliable compared to the connections within a server.
System Models
Interaction Models
- Synchronous
- Asynchronous
- Partially synchronous
Failure Modes
- Fail-stop
- Fail-recover
- Omission failures
- Arbitrary failures (Byzantine)
Time and Order
Time and Order
- We use time to:
  - order events
  - measure the duration between events
- In the asynchronous model, nodes have local clocks, which can drift unboundedly.
- Components of a distributed system behave in an unpredictable manner.
  - Failures, rates of advance, delays in network packets, etc.
- We cannot assume synchronized clocks while designing our algorithms in the asynchronous model.
- Clock synchronization methods help us a lot but don't fix the problem completely.
The Idea: Ordering Events
- We don't have the notion of "now" in distributed systems.
- To what extent do we need it?
  - We don't need absolute clock synchronization.
  - If machines don't interact with each other, why bother synchronizing their clocks?
- For many problems, processes need to agree on the order in which events occur, rather than the time at which they occur.
Ordering Events: Logical Clocks
- We can use Logical Clocks (= Lamport Clocks) [1] to order events in a distributed system (see the sketch below).
- Logical clocks rely on counters and the communication between nodes.
- Each node maintains a local counter value.
- The happened-before relationship ("→"):
  - If a and b are events in the same process, and a comes before b, then a → b.
  - If a is the sending and b is the receipt of a message, then a → b.
  - If a → b and b → c, then a → c.
  - If neither a → b nor b → a holds, a and b are concurrent.
- Partial ordering and total ordering of the events
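As a minimal sketch of the counter rules above (the class and method names are mine, not from the talk), a Lamport clock in Java looks like this:

```java
// Minimal Lamport clock sketch. The counter advances on every local event
// and message send, and jumps past the sender's timestamp on every receipt.
public final class LamportClock {
    private long counter;

    // Local event or message send: advance the clock and return the timestamp.
    public synchronized long tick() {
        return ++counter;
    }

    // Message receipt: move past both our clock and the sender's timestamp.
    public synchronized long onReceive(long senderTimestamp) {
        counter = Math.max(counter, senderTimestamp) + 1;
        return counter;
    }
}
```

These two rules are enough to guarantee the Clock Condition on the next slide: whenever a → b, the timestamp assigned to a is strictly smaller than the timestamp assigned to b.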
Clock Condition
- For any events a, b: if a → b, then C(a) < C(b).
- Can we also infer the reverse?
- If p1 → q2 and q2 → q3, then C(q3) > C(p1).
  - Causality: p1 causes q2 and q2 causes q3, so p1 causes q3.
- p3 and q3 are concurrent events under the happened-before relationship.
- Can we infer any causality by comparing C(p3) and C(q3)?
(Image taken from [1])
Vector Clocks and Causality
- We use vector clocks to infer causality by comparing clock values (see the sketch below).
- If V(a) < V(b), then a causally precedes b.
(Image taken from [2])
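A minimal vector clock sketch follows (names are mine). Unlike Lamport clocks, comparing two vector timestamps tells us whether the events are causally related or concurrent:

```java
// Minimal vector clock sketch for a fixed set of n processes. Entry i
// counts the events that process i has generated, as observed locally.
public final class VectorClock {
    private final long[] entries;
    private final int ownerIndex; // index of the process owning this clock

    public VectorClock(int numProcesses, int ownerIndex) {
        this.entries = new long[numProcesses];
        this.ownerIndex = ownerIndex;
    }

    // Local event or message send: advance our own entry, snapshot the clock.
    public synchronized long[] tick() {
        entries[ownerIndex]++;
        return entries.clone();
    }

    // Message receipt: element-wise maximum, then advance our own entry.
    public synchronized void onReceive(long[] senderEntries) {
        for (int i = 0; i < entries.length; i++) {
            entries[i] = Math.max(entries[i], senderEntries[i]);
        }
        entries[ownerIndex]++;
    }

    // V(a) < V(b): no entry of a exceeds the matching entry of b, and at
    // least one entry is strictly smaller. If this holds in neither
    // direction, a and b are concurrent.
    public static boolean causallyPrecedes(long[] a, long[] b) {
        boolean strictlySmaller = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) {
                return false;
            }
            if (a[i] < b[i]) {
                strictlySmaller = true;
            }
        }
        return strictlySmaller;
    }
}
```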
Are Logical Clocks our only option?
- Google Spanner [3] uses NTP, GPS, and atomic clocks to synchronize the local clocks of the machines as much as possible.
  - It doesn't pretend that clocks are perfectly synchronized.
  - It exposes the uncertainty of clocks through its TrueTime API.
- CockroachDB [4] uses Hybrid Logical Clocks [5], which combine logical clocks and physical clocks to infer causality.
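A condensed sketch of the Hybrid Logical Clock update rules from [5] (class and method names are mine). The logical part l tracks the largest physical timestamp seen so far, and the counter c breaks ties among events that share the same l:

```java
// Hybrid Logical Clock sketch after the rules in [5]. Timestamps are
// (l, c) pairs compared lexicographically; l stays close to physical time.
public final class HybridLogicalClock {
    private long l; // largest physical timestamp observed so far
    private long c; // logical counter for events sharing the same l

    // Local event or message send, given the current physical clock reading.
    public synchronized long[] tick(long physicalNow) {
        long previousL = l;
        l = Math.max(previousL, physicalNow);
        c = (l == previousL) ? c + 1 : 0;
        return new long[]{l, c};
    }

    // Message receipt: also take the sender's (l, c) pair into account.
    public synchronized long[] onReceive(long senderL, long senderC, long physicalNow) {
        long previousL = l;
        l = Math.max(Math.max(previousL, senderL), physicalNow);
        if (l == previousL && l == senderL) {
            c = Math.max(c, senderC) + 1;
        } else if (l == previousL) {
            c = c + 1;
        } else if (l == senderL) {
            c = senderC + 1;
        } else {
            c = 0;
        }
        return new long[]{l, c};
    }
}
```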
Consensus
Consensus
- The problem of having a set of processes agree on a value.
  - leader election, state machine replication, deciding to commit a transaction, etc.
- Validity: the value agreed upon must have been proposed by some process.
- Termination: at least one non-faulty process eventually decides.
- Agreement: all deciding processes agree on the same value.
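As a hypothetical way to pin the problem statement down (this interface is mine, not a standard API), consensus can be stated in Java like this:

```java
// A hypothetical interface capturing the consensus problem. A correct
// implementation must satisfy Validity, Termination, and Agreement.
public interface Consensus<T> {
    // Each process proposes a value; Validity says the decided value
    // must be one of the proposed values.
    void propose(T value);

    // Blocks until a decision is known. Termination says non-faulty
    // processes eventually return; Agreement says they all return the
    // same value.
    T awaitDecision() throws InterruptedException;
}
```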
Liveness and Safety Properties
- Liveness: a "good" thing eventually happens during the execution of an algorithm.
- Safety: some "bad" thing never happens during the execution of an algorithm.
FLP Result (Fischer, Lynch, and Paterson) [6]
- Distributed consensus is not always possible...
  - even with reliable message delivery
  - even with a single crash-stop failure
- ... in the asynchronous model, because we cannot differentiate between a crashed process and a slow process.
- No algorithm can always guarantee termination in the presence of crashes.
- It relates to the liveness property, not the safety property.
Detecting failures: why aren't you "talking to me"?
Unreliable Failure Detectors by Chandra and Toueg [7]
- Distributed failure detectors that are allowed to make mistakes (see the sketch below).
- Each process keeps a local list of the processes it suspects to have crashed.
- A local failure detector can make two types of mistakes:
  - suspecting processes that haven't actually crashed ⇒ ACCURACY property
  - not suspecting processes that have actually crashed ⇒ COMPLETENESS property
- Degrees of completeness: strong completeness, weak completeness
- Degrees of accuracy: strong accuracy, weak accuracy, eventually strong accuracy, eventually weak accuracy
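A minimal heartbeat-based detector sketch (mine, not from [7]) makes the two mistake types concrete. It is unreliable by design: a slow process whose heartbeats are delayed beyond the timeout is wrongly suspected (an accuracy mistake), and the suspicion is withdrawn once a heartbeat arrives again:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Heartbeat-based failure detector sketch. Each process runs one of these
// locally and suspects peers that have been silent longer than the timeout.
public final class HeartbeatFailureDetector {
    private final Map<String, Long> lastHeartbeats = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    public HeartbeatFailureDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message arrives from a peer process.
    public void onHeartbeat(String processId) {
        lastHeartbeats.put(processId, System.currentTimeMillis());
    }

    // The local suspicion check: true for peers silent beyond the timeout.
    public boolean isSuspected(String processId) {
        Long last = lastHeartbeats.get(processId);
        return last == null || System.currentTimeMillis() - last > timeoutMillis;
    }
}
```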
Classes of Failure Detectors
- Perfect Failure Detector (P)
  - Strongly Complete: every faulty process is eventually permanently suspected by every non-faulty process.
  - Strongly Accurate: no process is suspected (by anybody) before it crashes.
- Eventually Strong Failure Detector (⋄S)
  - Strongly Complete
  - Eventually Weakly Accurate: after some initial period of confusion, some non-faulty process is never suspected.
- The consensus problem can be solved with an Eventually Strong Failure Detector (⋄S) with f < n / 2 failures in the asynchronous model. [7], [8]
  - As long as you hear from the majority, you can solve consensus. ⇒ SAFETY
  - Every correct process eventually decides; no blocking forever. ⇒ LIVENESS
Consensus Algorithms
2PC, 3PC, Paxos, Raft, and the others
Two-Phase Commit (2PC) [9]
- With no failures, it satisfies Validity, Termination, and Agreement. (C denotes the coordinator; A and B are participants.)
- C crashes before Phase 1: no problem.
- C crashes before Phase 2: A can ask B what it has voted for.
- C and A crash before Phase 2: the protocol blocks!
- The protocol blocks even with fail-stop failures (the simplest failure model). A bare-bones coordinator is sketched below.
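A bare-bones 2PC coordinator sketch (the types and method names are illustrative, not from the talk). It shows the two phases and where the blocking comes from: a participant that voted YES must wait for the coordinator's decision, and if the coordinator crashes between the phases, that decision may never arrive:

```java
import java.util.List;

// Two-phase commit coordinator sketch: collect votes, then broadcast the
// decision. Crash handling and logging are deliberately omitted.
public final class TwoPhaseCommitCoordinator {

    public interface Participant {
        boolean prepare();   // Phase 1: vote YES (true) or NO (false)
        void commit();       // Phase 2: apply the transaction
        void abort();        // Phase 2: roll the transaction back
    }

    public boolean run(List<Participant> participants) {
        // Phase 1: collect votes from all participants.
        boolean allVotedYes = true;
        for (Participant p : participants) {
            if (!p.prepare()) {
                allVotedYes = false;
                break;
            }
        }
        // Phase 2: broadcast the decision. If the coordinator crashes
        // before this loop runs, participants that voted YES are stuck
        // waiting: this is exactly where 2PC blocks.
        for (Participant p : participants) {
            if (allVotedYes) {
                p.commit();
            } else {
                p.abort();
            }
        }
        return allVotedYes;
    }
}
```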
Three-Phase Commit (3PC) [10]
- The main problem of 2PC is that the participants don't know the outcome of the voting before they actually take action (commit / abort).
- We add a new step (pre-commit) for this ⇒ 3PC.
- 3PC is non-blocking and handles fail-stop failures.
- What about fail-recover, network partitions, and the asynchronous model?
Paxos [11], [12]
- It chooses to sacrifice liveness to maintain safety.
  - It doesn't terminate while the network behaves asynchronously; it terminates only when synchrony returns.
- It doesn't block as long as a majority is available.
- The normal, failure-free run is similar to 2PC.
- Two new mechanisms (see the acceptor sketch below):
  - Proposals are ordered by sequence numbers, so we can tell which proposal should be accepted.
  - A majority of participants is enough, instead of all of them.
(Image taken from [23])
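A compressed sketch of a single-decree Paxos acceptor, after the description in [12] (the proposer side, message plumbing, and persistence are omitted; names are mine). Both mechanisms from the slide are visible: proposal numbers order proposals, and a proposer only needs promises and accepts from a majority of such acceptors:

```java
// Single-decree Paxos acceptor sketch. State must survive restarts in a
// real system; here it lives in memory for brevity.
public final class PaxosAcceptor {
    private long promisedNumber = -1;  // highest proposal number promised
    private long acceptedNumber = -1;  // proposal number of the accepted value
    private Object acceptedValue;      // the accepted value, if any

    // Phase 1 (prepare): promise not to accept proposals numbered below n,
    // and report any already-accepted value so the proposer can adopt it.
    public synchronized PromiseResult prepare(long n) {
        if (n > promisedNumber) {
            promisedNumber = n;
            return new PromiseResult(true, acceptedNumber, acceptedValue);
        }
        return new PromiseResult(false, acceptedNumber, acceptedValue);
    }

    // Phase 2 (accept): accept unless a higher-numbered promise was made.
    public synchronized boolean accept(long n, Object value) {
        if (n >= promisedNumber) {
            promisedNumber = n;
            acceptedNumber = n;
            acceptedValue = value;
            return true;
        }
        return false;
    }

    public record PromiseResult(boolean promised, long acceptedNumber, Object acceptedValue) {}
}
```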
Paxos
- The original paper, "The Part-Time Parliament" [11], is difficult to read because it explains the algorithm through an analogy with Greek democracy.
  - Submitted in 1990, published in 1998, after being explained in another paper [17] in 1996.
- "The Paxos algorithm, when presented in plain English, is very simple." (Paxos Made Simple [12])
- Cheap Paxos [13], Fast Paxos [14], and many other variations...
- Paxos Made Live [15]: there are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.
- Paxos Made Moderately Complex [16]: for anybody who has ever tried to implement it, Paxos is by no means a simple protocol, even though it is based on relatively simple invariants. This paper provides imperative pseudo-code for the full Paxos (or Multi-Paxos) protocol without shying away from discussing various implementation details.
Raft: In Search of an Understandable Consensus Algorithm [18]
- A new consensus algorithm with understandability as one of its design goals.
- It divides the problem into parts (see the voting sketch below):
  - leader election, log replication, safety, and membership changes
- It also discusses implementation details.
- More than 80 implementations are listed on its website [19].
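A tiny sketch of Raft's vote-granting rule (illustrative; the log up-to-date check from [18] is omitted, and names are mine). It shows the term mechanism that drives leader election: a server grants at most one vote per term, so at most one candidate can collect a majority in that term:

```java
// Raft RequestVote handling sketch. currentTerm and votedFor must be
// persisted in a real implementation; they are in-memory here for brevity.
public final class RaftVoter {
    private long currentTerm;
    private String votedFor; // candidate granted our vote in currentTerm

    public synchronized boolean requestVote(long candidateTerm, String candidateId) {
        if (candidateTerm < currentTerm) {
            return false; // stale candidate from an old term
        }
        if (candidateTerm > currentTerm) {
            currentTerm = candidateTerm; // newer term: reset our vote
            votedFor = null;
        }
        if (votedFor == null || votedFor.equals(candidateId)) {
            votedFor = candidateId;
            return true; // a candidate with a majority of votes becomes leader
        }
        return false; // already voted for someone else in this term
    }
}
```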
Other Consensus Algorithms
- Viewstamped Replication [20], [21]
  - Another consensus algorithm; less popular than Paxos.
  - Raft has a lot of similarities to it.
- Zab [22]
  - Implemented in ZooKeeper.
- Many variants of Paxos...
References
[1] Lamport, Leslie. "Time, clocks, and the ordering of events in a distributed system." Communications of the ACM 21.7 (1978): 558-565.
[2] Raynal, Michel, and Mukesh Singhal. "Logical time: Capturing causality in distributed systems." Computer 29.2 (1996): 49-56.
[3] Corbett, James C., et al. "Spanner: Google's globally distributed database." ACM Transactions on Computer Systems (TOCS) 31.3 (2013): 8.
[4] https://github.com/cockroachdb/cockroach
[5] Kulkarni, Sandeep S., et al. "Logical physical clocks and consistent snapshots in globally distributed databases." (2014).
[6] Fischer, Michael J., Nancy A. Lynch, and Michael S. Paterson. "Impossibility of distributed consensus with one faulty process." Journal of the ACM (JACM) 32.2 (1985): 374-382.
[7] Chandra, Tushar Deepak, and Sam Toueg. "Unreliable failure detectors for reliable distributed systems." Journal of the ACM (JACM) 43.2 (1996): 225-267.
[8] Chandra, Tushar Deepak, Vassos Hadzilacos, and Sam Toueg. "The weakest failure detector for solving consensus." Journal of the ACM (JACM) 43.4 (1996): 685-722.
[9] Gray, James N. "Notes on database operating systems." Operating Systems. Springer Berlin Heidelberg, 1978. 393-481.
[10] Skeen, Dale. "Nonblocking commit protocols." Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data. ACM, 1981.
[11] Lamport, Leslie. "The part-time parliament." ACM Transactions on Computer Systems (TOCS) 16.2 (1998): 133-169.
[12] Lamport, Leslie. "Paxos made simple." ACM SIGACT News 32.4 (2001): 18-25.
[13] Lamport, Leslie, and Mike Massa. "Cheap Paxos." Dependable Systems and Networks, 2004 International Conference on. IEEE, 2004.
[14] Lamport, Leslie. "Fast Paxos." Distributed Computing 19.2 (2006): 79-103.
[15] Chandra, Tushar D., Robert Griesemer, and Joshua Redstone. "Paxos made live: An engineering perspective." Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing. ACM, 2007.
[16] Van Renesse, Robbert, and Deniz Altinbuken. "Paxos made moderately complex." ACM Computing Surveys (CSUR) 47.3 (2015): 42.
[17] Lampson, Butler. "How to build a highly available system using consensus." Distributed Algorithms (1996): 1-17.
[18] Ongaro, Diego, and John Ousterhout. "In search of an understandable consensus algorithm." 2014 USENIX Annual Technical Conference (USENIX ATC 14). 2014.
[19] https://raft.github.io/
[20] Oki, Brian M., and Barbara H. Liskov. "Viewstamped replication: A new primary copy method to support highly-available distributed systems." Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing. ACM, 1988.
[21] Liskov, Barbara, and James Cowling. "Viewstamped replication revisited." (2012).
[22] Junqueira, Flavio P., Benjamin C. Reed, and Marco Serafini. "Zab: High-performance broadcast for primary-backup systems." 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN). IEEE, 2011.
[23] http://the-paper-trail.org/blog/consensus-protocols-paxos/
Thank you!
Stay tuned for the next episode...
