This talk introduces concepts essential to understanding distributed systems, including the eight fallacies of distributed computing, the anatomy of a distributed system, system models, the CAP theorem, consistency models, partitioning, replication, leader election, failure detection, and consensus algorithms. It is the first in a three-part series designed to familiarize the audience with the design and use of distributed systems.
3. I N T R O D U C T I O N
3
W H AT I S A D I S T R I B U T E D S Y S T E M ?
A N AT O M Y O F A D I S T R I B U T E D S Y S T E M
FA L L A C I E S O F D I S T R I B U T E D C O M P U T I N G
6. A collection of independent computers
that appear to users as a single coherent
system
W H AT I S A D I S T R I B U T E D S Y S T E M ?
6
7. “A collection of independent
computers that appear to the users
of the system as a single computer”
— Andrew Tanenbaum
W H AT I S A D I S T R I B U T E D S Y S T E M ?
7
8. “You know you have a distributed
system when the crash of a
computer you’ve never heard of
stops you from getting any work
done”
— Leslie Lamport
W H AT I S A D I S T R I B U T E D S Y S T E M ?
8
9. • Scalability and fault tolerance
• Memory, disk, and CPU are finite resources
• Computers crash and networks fail
• Science hasn’t kept up with technological needs
W H AT I S A D I S T R I B U T E D S Y S T E M ?
9
10. B U T
D I S T R I B U T E D
S Y S T E M S A R E
H A R D !
1 0
11. T H E T W O G E N E R A L S P R O B L E M
1 1
• Two generals on the opposite sides of a valley have to
coordinate to decide when to attack
• Each general must be sure the other made the same
decision
• Generals can only communicate through messages
• Messengers sent through the valley can be captured
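The bind the generals are in can be illustrated with a small simulation (a sketch in Python; the drop probability, round count, and function names are illustrative, not from the talk). However many confirmation rounds the generals run, there are always executions where the last messenger is captured and one general is left uncertain:

```python
import random

def exchange(drop_probability, rounds):
    """Simulate generals A and B exchanging confirmations over a lossy
    valley. Returns (a_confident, b_confident): whether each general
    received the other's latest message."""
    a_confident = False  # A heard back from B
    b_confident = False  # B heard A's latest message
    for _ in range(rounds):
        # A sends a messenger; the messenger may be captured.
        if random.random() < drop_probability:
            return a_confident, b_confident
        b_confident = True
        # B sends an acknowledgment; that messenger may also be captured.
        if random.random() < drop_probability:
            return a_confident, b_confident
        a_confident = True
    return a_confident, b_confident

random.seed(42)
outcomes = [exchange(drop_probability=0.3, rounds=5) for _ in range(10_000)]
# Runs exist where one general is confident and the other is not --
# adding more rounds shrinks the window but never closes it.
disagreements = sum(1 for a, b in outcomes if a != b)
print(disagreements > 0)
```

Adding more rounds only moves the uncertainty to the latest message, which is the heart of the impossibility argument.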
12. A N AT O M Y O F A
D I S T R I B U T E D S Y S T E M
1 2
13. Nodes
A N AT O M Y O F A D I S T R I B U T E D
S Y S T E M
1 3
16. • Each independent component of a distributed
system is called a node
• Also known as a process, agent or actor
• Operations within a node are fast
• Communication between nodes is slow
• Operations generally occur in order
N O D E S
1 6
17. S Y S T E M M O D E L
1 7
SYNCHRONOUS
• Bounded message delays
• Accurate global clock
• Easy to reason about
• You don’t have one
ASYNCHRONOUS
• Processes execute independently
• Unbounded message delays
• No global clock
• Difficult to reason about
• You have one
18. • Nodes communicate via messages
• Example: UDP, TCP, HTTP
M E S S A G E PA S S I N G
1 8
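Message passing can be demonstrated in a few lines (a minimal sketch using Python's standard `socket` module; the loopback address and `b"ping"` payload are illustrative). UDP makes none of the guarantees the fallacies below warn about: datagrams on a real network may be lost, duplicated, or reordered.

```python
import socket

# One socket plays the receiver, another the sender, both on loopback.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))   # let the OS pick a free port
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"ping", addr)      # fire-and-forget: no delivery guarantee

message, _ = receiver.recvfrom(1024)
print(message)                    # b'ping' -- on loopback, delivery is near-certain
sender.close()
receiver.close()
```

TCP layers retries, ordering, and flow control on top of the same unreliable substrate, which is why it costs more latency.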
19. FA L L A C I E S O F
D I S T R I B U T E D
C O M P U T I N G
1 9
20. FA L L A C I E S O F D I S T R I B U T E D C O M P U T I N G
2 0
The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn’t change
There is one administrator
Transport cost is zero
The network is homogeneous
28. FALLACY #1
THE NETWORK IS
RELIABLE
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
2 8
• On average 5.2 devices and 40.8 links fail per day in
Microsoft data centers
• The majority of Google’s outages that lasted more than 30
seconds were due to network maintenance or connectivity
issues
• If network hardware doesn’t fail, software will
• We cannot rely on the network to deliver our communications
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
2 9
30. FALLACY #2
LATENCY IS ZERO
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
3 0
31. • Latency is the time it takes for a signal to travel from one
computer to another
• Latency is bounded below by the speed of light
• It takes 40 milliseconds for light to travel from New York to
Paris and back
• The JVM executes billions of instructions per second
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
3 1
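The 40-millisecond figure can be checked with back-of-the-envelope arithmetic (the great-circle distance and the fiber refraction factor below are approximations, not numbers from the talk):

```python
# Round-trip time between New York and Paris at the speed of light.
SPEED_OF_LIGHT_KM_S = 299_792
NY_PARIS_KM = 5_837          # approximate great-circle distance

rtt_vacuum_ms = 2 * NY_PARIS_KM / SPEED_OF_LIGHT_KM_S * 1000
print(round(rtt_vacuum_ms))  # ~39 ms: light in a vacuum, round trip

# Light in optical fiber travels at roughly 2/3 c, so a real link is slower.
rtt_fiber_ms = rtt_vacuum_ms / (2 / 3)
print(round(rtt_fiber_ms))   # ~58 ms before any routing or queueing delay
```

A CPU executing billions of instructions per second can do tens of millions of units of work in the time one such round trip takes, which is why chatty protocols dominate distributed-system cost.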
33. • Bandwidth is roughly the amount of information that can
be transmitted each second
• Networks are limited by hardware
• Applications are limited by software
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
3 3
34. FALLACY #4
THE NETWORK IS
SECURE
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
3 4
35. • We see hacks of major corporations’ networks seemingly
on a weekly basis
• In 2015, Foxglove Security discovered a major
vulnerability in Java’s serialization framework
• Allowing remote access to friendly users opens systems up
to unfriendly ones
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
3 5
36. FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
3 6
D ATA B R E A C H E S S I N C E 2 0 0 5
38. • Administrators add and remove servers from networks
• We cannot depend on machines always being in the same
place
• Service discovery and routing layers solve this problem
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
3 8
39. FALLACY #6
THERE IS ONE
ADMINISTRATOR
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
3 9
40. • Production systems are often maintained and managed by
numerous people
• Multiple administrators may institute conflicting policies
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
4 0
42. • Local processing is cheap
• Network communication is expensive
• Latency and bandwidth ensure transport cost is never zero
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
4 2
43. FALLACY #8
THE NETWORK IS
HOMOGENEOUS
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
4 3
44. • Applications must be designed to work in a variety of
environments
• Wired networks
• Wireless networks
• Cellular networks
• Satellite networks
FA L L A C I E S O F D I S T R I B U T E D
C O M P U T I N G
4 4
46. C O N C E P T S
4 6
T I M E I N D I S T R I B U T E D S Y S T E M S
C O N S I S T E N C Y I N D I S T R I B U T E D S Y S T E M S
PA R T I T I O N I N G A N D R E P L I C AT I O N
49. T H E C A P T H E O R E M
T R A D E O F F S I N D I S T R I B U T E D S Y S T E M S
4 9
            CONSISTENCY        AVAILABILITY   PARTITION TOLERANCE
ZOOKEEPER   STRONG             QUORUM         YES
DYNAMO      EVENTUALLY STRONG  HIGH           YES
MYSQL       STRONG             HIGH           NO
50. O R D E R I N D I S T R I B U T E D
S Y S T E M S
5 0
51. • Order is necessary to enforce causal relationships
• Two types of order in distributed systems
• Partial order
• Order of dependent events
• Total order
• Order of all events
• Single-threaded applications are totally ordered
O R D E R I N D I S T R I B U T E D S Y S T E M S
5 1
52. T I M E I N D I S T R I B U T E D
S Y S T E M S
5 2
53. • Time can be used to enforce order
• Time can be used to enforce bounds on communications
• But time progresses independently in asynchronous
systems
• Clocks suffer from clock drift
• Even NTP can only synchronize clocks to within a few
milliseconds of each other
T I M E I N D I S T R I B U T E D S Y S T E M S
5 3
54. T I M E I N D I S T R I B U T E D S Y S T E M S
5 4
55. • “Time, Clocks, and the Ordering of Events in a Distributed
System”
• Developed by Leslie Lamport in 1978
• One of the seminal papers in distributed systems
• Determines partial ordering of events in a distributed
system
• Also referred to as logical clocks
T I M E I N D I S T R I B U T E D S Y S T E M S
5 5
L A M P O R T C L O C K S
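Lamport's two rules fit in a few lines (a minimal Python sketch; the class and variable names are illustrative): increment the counter on every local event or send, and on receipt take the maximum of the local and received timestamps plus one.

```python
class LamportClock:
    """Lamport logical clock: tick on local events and sends,
    merge on receives."""
    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance for a local event or an outgoing message."""
        self.time += 1
        return self.time

    def receive(self, sent_time):
        """Merge the sender's timestamp: max(local, received) + 1."""
        self.time = max(self.time, sent_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()           # a's send event: time 1
t_recv = b.receive(t_send)  # b's receive: max(0, 1) + 1 = 2
print(t_send, t_recv)       # 1 2 -- the send is ordered before the receive
```

Lamport clocks give a partial order: if event x caused event y, then clock(x) < clock(y), but the converse does not hold, which is what motivates vector clocks.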
56. T I M E I N D I S T R I B U T E D S Y S T E M S
5 6
57. • “Timestamps in Message Passing Systems That Preserve the
Partial Ordering” - Colin J. Fidge
• “Virtual Time and Global States of Distributed Systems” -
Friedemann Mattern
• Independently developed by Fidge (1988) and Mattern (1989)
• Determines causal ordering of events in a distributed system
• Closely related to version vectors
T I M E I N D I S T R I B U T E D S Y S T E M S
5 7
V E C T O R C L O C K S
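Unlike a Lamport clock, a vector clock keeps one counter per node, which lets it distinguish causally ordered events from concurrent ones. A minimal sketch (the dict-based representation and node names here are illustrative):

```python
def vc_increment(clock, node):
    """Advance this node's entry in its vector clock."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_merge(local, received, node):
    """On receive: take the element-wise max, then tick the local entry."""
    merged = {n: max(local.get(n, 0), received.get(n, 0))
              for n in local.keys() | received.keys()}
    return vc_increment(merged, node)

def happened_before(a, b):
    """True iff every entry of a is <= b and at least one is strictly less."""
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))

x = vc_increment({}, "A")    # {"A": 1}
y = vc_merge({}, x, "B")     # B observed A's event: {"A": 1, "B": 1}
z = vc_increment({}, "C")    # {"C": 1}, independent of x and y
print(happened_before(x, y))                          # True: causally ordered
print(happened_before(x, z), happened_before(z, x))   # False False: concurrent
```

When neither clock happened before the other, the events are concurrent, and a data store like Dynamo must surface or resolve the conflict.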
58. T I M E I N D I S T R I B U T E D S Y S T E M S
5 8
60. • Linearizability
• Sequential consistency
• Causal consistency
• Eventual strong consistency
• Eventual consistency
C O N S I S T E N C Y M O D E L S
6 0
61. • Monotonic read consistency
• Monotonic write consistency
• Read-your-writes consistency
• Writes follow reads consistency
• Serializability
C O N S I S T E N C Y M O D E L S
6 1
M O R E C O N S I S T E N C Y M O D E L S
63. • Split data across multiple machines
• Reduces the amount of data each node must handle
• Reduces the amount of network I/O for certain algorithms
PA R T I T I O N I N G
6 3
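The simplest way to split data is hash partitioning (a sketch; `sha256` is used here because Python's built-in `hash()` is salted per process and would not give stable placement):

```python
import hashlib

def partition_for(key, num_partitions):
    """Map a key deterministically to one of N partitions using a
    stable cryptographic hash."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

keys = ["user:1", "user:2", "user:3", "order:99"]
placement = {k: partition_for(k, 4) for k in keys}
print(placement)  # each key lands deterministically on one of 4 partitions
```

The weakness of the modulo scheme is that changing the partition count remaps most keys, which is the problem consistent hashing (covered below) addresses.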
65. • Sharing information to ensure consistency between
redundant services
• Active replication — push
• Passive replication — pull
• Quorum-based
• Gossip
R E P L I C AT I O N
6 5
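The quorum-based approach can be captured in one inequality (a sketch; the Dynamo-style N/W/R parameter names are conventional, not from the talk): with N replicas, writes acknowledged by W nodes, and reads served by R nodes, a read is guaranteed to intersect the latest write when R + W > N.

```python
def quorum_overlaps(n, w, r):
    """True iff any read quorum of size r must intersect any write
    quorum of size w among n replicas, i.e. r + w > n."""
    return r + w > n

# Classic configuration: 3 replicas, writes to 2, reads from 2.
print(quorum_overlaps(3, 2, 2))  # True: read and write sets must share a node
print(quorum_overlaps(3, 1, 1))  # False: a read can miss the only written replica
```

Tuning W down favors write latency, tuning R down favors read latency, and breaking the inequality trades the overlap guarantee away for availability.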
66. R E P L I C AT I O N
6 6
SYNCHRONOUS
• Nodes updated between the request and response
• Consistency over performance
ASYNCHRONOUS
• State persisted locally and replicated after response
• Performance over consistency
67. R E P L I C AT I O N
T R A D E O F F S I N D I S T R I B U T E D S Y S T E M S
6 7
[Table comparing primary-backup, gossip, 2PC, and quorum replication
across consistency, transactions, latency, throughput, data loss, and
read-only availability; the individual cell values were lost in extraction]
68. • Gossip is one of the simplest distributed communication
algorithms
• Inspired by the gossip that takes place in human communication
• Each node periodically chooses a random set of neighbors with
which to exchange information
• Information propagates through the system quickly
• Version vectors can be used to resolve conflicts in updates
R E P L I C AT I O N
6 8
G O S S I P
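A toy push-gossip simulation shows how quickly information spreads (a sketch; the fanout, cluster size, and seed are illustrative parameters, and real implementations gossip digests and use anti-entropy rather than whole rumors):

```python
import random

def gossip_rounds(num_nodes, fanout, seed=1):
    """Each round, every informed node pushes the rumor to `fanout`
    random peers. Returns the number of rounds until every node is
    informed -- typically O(log n)."""
    rng = random.Random(seed)
    informed = {0}                        # node 0 starts with the update
    rounds = 0
    while len(informed) < num_nodes:
        for node in list(informed):
            for _ in range(fanout):
                informed.add(rng.randrange(num_nodes))
        rounds += 1
    return rounds

rounds_needed = gossip_rounds(1000, fanout=3)
print(rounds_needed)  # a 1000-node cluster converges in a handful of rounds
```

The number of informed nodes grows roughly geometrically per round, which is why gossip scales to large clusters with modest per-node traffic.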
70. • Map each object to a point on the edge of a circle
• Map each machine to a pseudo-random point on the same
circle
• To find the node on which an object is stored, find the
location of the object on the edge of the circle and walk
around the circle until the first node is found
C O N S I S T E N T H A S H I N G
7 0
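The circle-walking procedure above translates almost directly into code (a minimal sketch without virtual nodes; the hash truncation and node names are illustrative choices):

```python
import bisect
import hashlib

def _point(key):
    """Map a string to a point on the ring (a 32-bit hash)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

class ConsistentHashRing:
    """Machines sit at pseudo-random points on a circle; an object
    belongs to the first machine found walking clockwise from the
    object's own point."""
    def __init__(self, nodes):
        self._ring = sorted((_point(n), n) for n in nodes)

    def node_for(self, key):
        points = [p for p, _ in self._ring]
        i = bisect.bisect_right(points, _point(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in ("x", "y", "z")}
bigger = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in before if bigger.node_for(k) != before[k]]
# Adding a node moves only the keys that fall in the new node's arc.
print(before, moved)
```

This is the property modulo hashing lacks: growing the cluster relocates only a fraction of the keys. Production systems add many virtual points per machine to even out arc sizes.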
73. • Failure detectors are characterized in terms of completeness and
accuracy
• In a synchronous system, failure detection is solvable
• Certain problems are not solvable without failure detection in an
asynchronous system
• A partitioned process is indistinguishable from a crashed process
• Thus reliable failure detection is impossible in an asynchronous system
• Failure detection is usually based on time
FA I L U R E D E T E C T I O N
7 3
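The usual time-based detector is a heartbeat with a timeout (a sketch; the node names, timestamps, and timeout value are illustrative). The timeout is exactly where the slide's trade-off lives: a short timeout suspects slow-but-alive nodes (hurting accuracy), while a long one delays detecting real crashes (hurting completeness in practice).

```python
def suspected(last_heartbeat, now, timeout):
    """A node is suspected when its last heartbeat is older than the
    timeout. A slow or partitioned node is indistinguishable from a
    crashed one -- this detector can only ever be approximate."""
    return now - last_heartbeat > timeout

heartbeats = {"node-a": 100.0, "node-b": 91.5}  # last heartbeat times (seconds)
now = 101.0
status = {n: suspected(t, now, timeout=5.0) for n, t in heartbeats.items()}
print(status)  # {'node-a': False, 'node-b': True}
```

Accrual failure detectors refine this by emitting a suspicion level instead of a binary verdict, letting each consumer pick its own threshold.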
75. L E A D E R E L E C T I O N
7 5
• The process of selecting a single node to coordinate a
cluster
• Difficult to account for failures
• Electing a leader allows a single process to control a
cluster
• Frequently used in consensus algorithms
• But a single leader can limit throughput
76. L E A D E R E L E C T I O N
7 6
B U L LY A L G O R I T H M
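The bully algorithm's behavior can be sketched in a few lines (an illustrative simplification: real implementations exchange ELECTION, ANSWER, and COORDINATOR messages with timeouts, while this sketch only models the outcome of that exchange among live nodes):

```python
def bully_elect(node_ids, crashed):
    """Toy bully election: each candidate challenges all higher IDs;
    if any live higher node answers, it takes over the election.
    The highest live ID always ends up as leader. Assumes at least
    one node is alive."""
    live = sorted(set(node_ids) - set(crashed))
    candidate = live[0]                # the node that notices the leader is gone
    while True:
        higher = [n for n in live if n > candidate]
        if not higher:
            return candidate           # nobody outranks us: declare victory
        candidate = min(higher)        # a live higher node bullies its way in

print(bully_elect({1, 2, 3, 4, 5}, crashed={5}))     # 4
print(bully_elect({1, 2, 3, 4, 5}, crashed={4, 5}))  # 3
```

The name comes from the highest-numbered node "bullying" lower nodes out of the election; its weakness is the message storm when several nodes detect the failure at once.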
78. • Single-system view, shared state
• Key to building consistent storage systems
C O N S E N S U S
7 8
79. • Agreement — every correct process must agree on the same value
• Integrity — every correct process decides at most one value
• Termination — every correct process eventually decides some value
• Validity — if all correct processes propose the same value v,
then every correct process decides v
C O N S E N S U S
7 9
80. • “Impossibility of Distributed Consensus with One Faulty
Process” — Fischer, Lynch, and Paterson
• Commonly referred to as the FLP Impossibility Result
• Consensus is impossible to guarantee in a fault-tolerant
asynchronous system
• In practice, consensus can be reached
C O N S E N S U S
8 0
81. ZooKeeper Atomic
Broadcast
“Wait-free Coordination for Internet
Scale Systems” — Hunt, Konar et al
Viewstamped Replication
“Viewstamped Replication” — Brian
M. Oki and Barbara H. Liskov
Raft
“In Search of an Understandable
Consensus Algorithm” — Diego
Ongaro and John Ousterhout
C O N S E N S U S
8 1
Paxos
“The Part-Time Parliament” — Leslie
Lamport
“Paxos Made Simple” — Leslie Lamport
82. • Leader election
• Log replication
• Failure detection
• Log compaction
• Membership changes
C O N S E N S U S
8 2