From Mainframe to Microservice: An Introduction to Distributed Systems

An introductory overview of distributed systems—what they are and why they're difficult to build. We explore fundamental ideas and practical concepts in distributed programming. What is the CAP theorem? What is distributed consensus? What are CRDTs? We also look at options for solving the split-brain problem while considering the trade-off of high availability as well as options for scaling shared data.

1. From Mainframe to Microservice
An Introduction to Distributed Systems
@tyler_treat
Workiva

2. An Introduction to Distributed Systems
❖ Building a foundation of understanding
❖ Why distributed systems?
❖ Universal fallacies
❖ Characteristics and the CAP theorem
❖ Common pitfalls
❖ Digging deeper
❖ Byzantine Generals Problem and consensus
❖ Split-brain
❖ Hybrid consistency models
❖ Scaling shared data and CRDTs

3. “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” –Leslie Lamport

4. Scale Up vs. Scale Out
Vertical Scaling
❖ Add resources to a node
❖ Increases node capacity, load is unaffected
❖ System complexity unaffected
Horizontal Scaling
❖ Add nodes to a cluster
❖ Decreases load, capacity is unaffected
❖ Availability and throughput w/ increased complexity

5. A distributed system is a collection of independent computers that behave as a single coherent system.

6. Why Distributed Systems?
❖ Availability: serve every request
❖ Fault Tolerance: resilient to failures
❖ Throughput: parallel computation
❖ Architecture: decoupled, focused services
❖ Economics: scale-out becoming manageable/cost-effective

7. oh shit…

8. “You have to design distributed systems with the expectation of failure.” –Ken Arnold

9. Distributed systems engineers are the world’s biggest pessimists.

10. Universal Fallacy #1: The network is reliable.
❖ Message delivery is never guaranteed
❖ Best effort
❖ Is it worth it?
❖ Resiliency/redundancy/failover

11. Universal Fallacy #2: Latency is zero.
❖ We cannot defy the laws of physics
❖ LAN to WAN deteriorates quickly
❖ Minimize network calls (batch)
❖ Design asynchronous systems

12. Universal Fallacy #3: Bandwidth is infinite.
❖ Out of our control
❖ Limit message sizes
❖ Use message queueing

13. Universal Fallacy #4: The network is secure.
❖ Everyone is out to get you
❖ Build in security from day 1
❖ Multi-layered
❖ Encrypt, pentest, train developers

14. Universal Fallacy #5: Topology doesn’t change.
❖ Network topology is dynamic
❖ Don’t statically address hosts
❖ Collection of services, not nodes
❖ Service discovery

15. Universal Fallacy #6: There is one administrator.
❖ May integrate with third-party systems
❖ “Is it our problem or theirs?”
❖ Conflicting policies/priorities
❖ Third parties constrain; weigh the risk

16. Universal Fallacy #7: Transport cost is zero.
❖ Monetary and practical costs
❖ Building/maintaining a network is not trivial
❖ The “perfect” system might be too costly
17. Universal Fallacy #8: The network is homogeneous.
❖ Networks are almost never homogeneous
❖ Third-party integration?
❖ Consider interoperability
❖ Avoid proprietary protocols
18. These problems apply to LAN and WAN systems (single-data-center and cross-data-center). No one is safe.

19. “Anything that can go wrong will go wrong.” –Murphy’s Law

20. Characteristics of a Reliable Distributed System
❖ Fault-tolerant: nodes can fail
❖ Available: serve all the requests, all the time
❖ Scalable: behave correctly with changing topologies
❖ Consistent: state is coordinated across nodes
❖ Secure: access is authenticated
❖ Performant: it’s fast!

21. Distributed systems are all about trade-offs.

22. CAP Theorem
❖ Presented in 1998 by Eric Brewer
❖ Impossible to guarantee all three:
   ❖ Consistency
   ❖ Availability
   ❖ Partition tolerance

23. Consistency

24. Consistency
❖ Linearizable - there exists a total order of all state updates and each update appears atomic
❖ E.g. mutexes make operations appear atomic
❖ When operations are linearizable, we can assign a unique “timestamp” to each one (total order)
❖ A system is consistent if every node shares the same total order
❖ Consistency which is both global and instantaneous is impossible

25. Consistency
❖ Eventual consistency: replicas allowed to diverge, eventually converge
❖ Strong consistency: replicas can’t diverge; requires linearizability

26. Availability
❖ Every request received by a non-failing node must be served
❖ If a piece of data required for a request is unavailable, the system is unavailable
❖ 100% availability is a myth

27. Partition Tolerance
❖ A partition is a split in the network—many causes
❖ Partition tolerance means partitions can happen
❖ CA is easy when your network is perfectly reliable
❖ Your network is not perfectly reliable

28. Partition Tolerance

29. Common Pitfalls
❖ Halting failure - machine stops
❖ Network failure - network connection breaks
❖ Omission failure - messages are lost
❖ Timing failure - clock skew
❖ Byzantine failure - arbitrary failure

30. Digging Deeper: Exploring some higher-level concepts

31. Byzantine Generals Problem
❖ Consider a city under siege by two allied armies
❖ Each army has a general
❖ One general is the leader
❖ Armies must agree when to attack
❖ Must use messengers to communicate
❖ Messengers can be captured by defenders

32. Byzantine Generals Problem
33. Byzantine Generals Problem
❖ Send 100 messages, attack no matter what
   ❖ A might attack without B
❖ Send 100 messages, wait for acks, attack if confident
   ❖ B might attack without A
❖ Messages have overhead
❖ Can’t reliably make a decision (provably impossible)
34. Distributed Consensus
❖ Replace 2 generals with N generals
❖ Nodes must agree on data value
❖ Solutions:
   ❖ Multi-phase commit
   ❖ State replication

35. Two-Phase Commit
❖ Blocking protocol
❖ Coordinator waits for cohorts
❖ Cohorts wait for commit/rollback
❖ Can deadlock
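To make the blocking behavior on this slide concrete, here is a minimal, single-process Python sketch of the two-phase commit flow. The in-memory Cohort class and function names are illustrative only, not code from the talk: the coordinator collects a vote from every cohort in phase one, then broadcasts commit or rollback in phase two, so a single unresponsive cohort stalls the whole transaction.

```python
# Minimal two-phase commit sketch (illustrative only): the coordinator
# collects votes in phase 1, then broadcasts the decision in phase 2.
# The key property is that it blocks on every cohort's vote.

class Cohort:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "init"

    def prepare(self, txn):
        # Phase 1: vote yes/no and hold locks while waiting for the decision.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def commit(self, txn):
        self.state = "committed"

    def rollback(self, txn):
        self.state = "aborted"


def two_phase_commit(txn, cohorts):
    # Phase 1: collect votes. In a real system each call is a network RPC
    # that can hang; without timeouts the coordinator blocks indefinitely.
    votes = [c.prepare(txn) for c in cohorts]

    # Phase 2: broadcast the decision to every cohort.
    if all(votes):
        for c in cohorts:
            c.commit(txn)
        return "committed"
    for c in cohorts:
        c.rollback(txn)
    return "aborted"


if __name__ == "__main__":
    cohorts = [Cohort("db"), Cohort("cache"), Cohort("index", will_commit=False)]
    print(two_phase_commit("txn-42", cohorts))  # -> "aborted"
```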
36. Three-Phase Commit
❖ Non-blocking protocol
❖ Abort on timeouts
❖ Susceptible to network partitions

37. State Replication
❖ E.g. Paxos, Raft protocols
❖ Elect a leader (coordinator)
❖ All changes go through leader
❖ Each change appends log entry
❖ Each node has log replica

38. State Replication
❖ Must have quorum (majority) to proceed
❖ Commit once quorum acks
❖ Quorums mitigate partitions
❖ Logs allow state to be rebuilt
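A rough sketch of the quorum rule these two slides describe, assuming a leader that appends to its own log and counts follower acknowledgements. It only illustrates the majority arithmetic; it is not a faithful Paxos or Raft implementation (no terms, elections, or retries).

```python
# Leader-based log replication sketch: an entry counts as committed only
# once a majority (quorum) of replicas, including the leader, has acked it.

class Replica:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.log = []

    def append(self, entry):
        # True simulates an ack; False simulates a replica that is down or
        # on the wrong side of a partition.
        if not self.reachable:
            return False
        self.log.append(entry)
        return True


class Leader:
    def __init__(self, followers):
        self.log = []
        self.followers = followers
        self.commit_index = -1

    def replicate(self, entry):
        self.log.append(entry)
        acks = 1 + sum(f.append(entry) for f in self.followers)  # leader acks itself
        cluster_size = 1 + len(self.followers)
        quorum = cluster_size // 2 + 1
        if acks >= quorum:
            self.commit_index = len(self.log) - 1  # durable on a majority
            return True
        return False  # no quorum, cannot commit


if __name__ == "__main__":
    followers = [Replica("b"), Replica("c"),
                 Replica("d", reachable=False), Replica("e", reachable=False)]
    leader = Leader(followers)
    print(leader.replicate("set x=1"))  # True: 3 of 5 acks is a majority
```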
39. Split-Brain

40. Split-Brain

41. Split-Brain

42. Split-Brain
❖ Optimistic (AP) - let partitions work as usual
❖ Pessimistic (CP) - quorum partition works, fence others

43. Hybrid Consistency Models
❖ Weak == available, low latency, stale reads
❖ Strong == fresh reads, less available, high latency
❖ How do you choose a consistency model?
❖ Hybrid models
   ❖ Weaker models when possible (likes, followers, votes)
   ❖ Stronger models when necessary
❖ Tunable consistency models (Cassandra, Riak, etc.)
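One way tunable consistency shows up in Dynamo-style stores such as Cassandra and Riak is per-request replica counts: with N replicas, a write acknowledged by W nodes and a read that contacts R nodes must share at least one node whenever R + W > N, which is what lets quorum reads observe the latest acknowledged write. A tiny illustrative check (the function name is mine, not an API of either database):

```python
def read_sees_latest_write(n, r, w):
    """With n replicas, read quorum r and write quorum w overlap when
    r + w > n, so a quorum read observes the latest acknowledged write."""
    return r + w > n

print(read_sees_latest_write(n=3, r=1, w=1))  # False: fast, eventually consistent
print(read_sees_latest_write(n=3, r=2, w=2))  # True: quorum reads and writes overlap
```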
44. Scaling Shared Data
❖ Sharing mutable data at large scale is difficult
❖ Solutions:
   ❖ Immutable data
   ❖ Last write wins
   ❖ Application-level conflict resolution
   ❖ Causal ordering (e.g. vector clocks)
   ❖ Distributed data types (CRDTs)

45. Scaling Shared Data
Imagine a shared, global counter…
“Get, add 1, and put” transaction will not scale
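The reason "get, add 1, and put" breaks down is the classic lost update: two clients read the same value, both increment it, and one increment vanishes. A minimal illustration, with an in-memory dictionary standing in for the shared store (purely hypothetical, no real database involved):

```python
# Lost update: two clients each do "get, add 1, put" on the same key.
# Both read 0 and both write 1, so one increment is silently lost.
# Serializing the transactions fixes correctness but makes the hot
# counter a coordination bottleneck.

store = {"counter": 0}

a = store["counter"]      # client A: get -> 0
b = store["counter"]      # client B: get -> 0
store["counter"] = a + 1  # client A: put 1
store["counter"] = b + 1  # client B: put 1 (A's increment is lost)

print(store["counter"])   # 1, not 2
```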
46. CRDT
❖ Conflict-free Replicated Data Type
❖ Convergent: state-based
❖ Commutative: operations-based
❖ E.g. distributed sets, lists, maps, counters
❖ Update concurrently w/o writer coordination

47. CRDT
❖ CRDTs always converge (provably)
❖ Operations commute (order doesn’t matter)
❖ Highly available, eventually consistent
❖ Always reach consistent state
❖ Drawbacks:
   ❖ Requires knowledge of all clients
   ❖ Must be associative, commutative, and idempotent

48. G-Counter
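The G-Counter (grow-only counter) on this slide is the standard convergent-CRDT answer to that global counter: each replica increments only its own slot, the value is the sum of all slots, and merging two states takes the element-wise maximum. A compact sketch, with replica IDs and class shape chosen for illustration:

```python
# G-Counter: a state-based (convergent) CRDT. Each replica increments only
# its own entry; merge takes the per-replica max, which is associative,
# commutative, and idempotent, so replicas converge no matter how often or
# in what order they exchange state.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> increments observed from that replica

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


if __name__ == "__main__":
    a, b = GCounter("a"), GCounter("b")
    a.increment()
    a.increment()               # a sees 2 local increments
    b.increment()               # b sees 1 local increment
    a.merge(b)
    b.merge(a)                  # exchange state in either order
    print(a.value(), b.value()) # 3 3: both replicas converge
```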
49. CRDT
❖ Add to set is associative, commutative, idempotent
   ❖ add(“a”), add(“b”), add(“a”) => {“a”, “b”}
❖ Adding and removing items is not
   ❖ add(“a”), remove(“a”) => {}
   ❖ remove(“a”), add(“a”) => {“a”}
❖ CRDTs require interpretation of common data structures w/ limitations

50. Two-Phase Set
❖ Use two sets, one for adding, one for removing
❖ Elements can be added once and removed once
❖ { “a”: [“a”, “b”, “c”], “r”: [“a”] } => {“b”, “c”}
❖ add(“a”), remove(“a”) => {“a”: [“a”], “r”: [“a”]}
❖ remove(“a”), add(“a”) => {“a”: [“a”], “r”: [“a”]}
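A minimal sketch of the two-phase set just described: adds go into one grow-only set, removes into a second tombstone set, and membership is the difference of the two. Because both underlying sets only grow, add and remove of the same element yield the same state in either order, matching the slide's example; class and method names here are mine.

```python
# Two-phase set (2P-Set): two grow-only sets, one for adds and one for
# removes (tombstones). An element is a member if it was added and not
# removed; once removed it can never be re-added, which is what makes
# concurrent add/remove commute.

class TwoPhaseSet:
    def __init__(self):
        self.added = set()
        self.removed = set()

    def add(self, element):
        self.added.add(element)

    def remove(self, element):
        # The tombstone is permanent; the element can never come back.
        self.removed.add(element)

    def contains(self, element):
        return element in self.added and element not in self.removed

    def value(self):
        return self.added - self.removed

    def merge(self, other):
        # Union of both grow-only sets: associative, commutative, idempotent.
        self.added |= other.added
        self.removed |= other.removed


if __name__ == "__main__":
    s = TwoPhaseSet()
    for e in ("a", "b", "c"):
        s.add(e)
    s.remove("a")
    print(s.value())        # {'b', 'c'} (set order may vary)
    s.add("a")              # re-adding has no effect: the tombstone wins
    print(s.contains("a"))  # False
```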
51. Let’s Recap...

52. Distributed architectures allow us to build highly available, fault-tolerant systems.

53. We can't live in this fantasy land where everything works perfectly all of the time.

54. Shit happens — network partitions, hardware failure, GC pauses, latency, dropped packets…

55. Build resilient systems.

56. Design for failure.

57. kill -9

58. Consider the trade-off between consistency and availability.

59. Partition tolerance is not an option, it’s required. (if you’re building a distributed system)

60. Use weak consistency when possible, strong when necessary.

61. Sharing data at scale is hard, let’s go shopping. (or consider your options)

62. State is hell.

63. Further Readings
❖ Jepsen series, Kyle Kingsbury (aphyr)
❖ A Comprehensive Study of Convergent and Commutative Replicated Data Types, Shapiro et al.
❖ In Search of an Understandable Consensus Algorithm, Ongaro et al.
❖ CAP Twelve Years Later, Eric Brewer
❖ Many, many more…

64. Thanks!
@tyler_treat
github.com/tylertreat
bravenewgeek.com
