A talk about some of the core theoretical topics of distributed systems. It discusses system models, failure modes, time, the consensus problem, consistency models, and high-level design principles such as CAP and PACELC.
2. WHO AM I?
ENSAR BASRI KAHVECI
▸ Distributed Systems Engineer @ Hazelcast
▸ twitter & github: metanet
▸ linkedin.com/in/basrikahveci
▸ basrikahveci.com
3. HAZELCAST
LEADING OPEN SOURCE JAVA IMDG
▸ Distributed Java collections, JCache, …
▸ Distributed computation and messaging
▸ Embedded or client-server deployment
▸ Integration modules & cloud friendly
4. HAZELCAST
ELASTICITY AND HIGH AVAILABILITY
▸ Scale up & scale out
▸ Dynamic clustering & elasticity
▸ Data partitioning & replication
▸ Fault tolerance & high availability
5. DISTRIBUTED SYSTEMS
COLLECTION OF ENTITIES SOLVING A COMMON PROBLEM
▸ Shared nothing
▸ Communication via messaging
▸ Uncertain and partial knowledge
▸ Main motivations are scalability, fault tolerance, availability, economics, etc.
7. MODELS AND ABSTRACTIONS
ABSTRACTIONS SIMPLIFY REASONING
▸ Timing assumptions
▸ Failure modes
▸ Notion of time
▸ Design principles
▸ Consistency models
8. TIMING ASSUMPTIONS
ASYNCHRONY IS INHERENTLY PRESENT IN OUR SYSTEMS
▸ A message can be delayed in the network or in a process.
▸ Local clocks can drift arbitrarily.
▸ Asynchrony makes dealing with failures difficult.
[Diagram: NODE A sends a request to NODE B but receives no timely response.]
9. TIMING ASSUMPTIONS
OUR SYSTEMS WORK WITH PARTIAL SYNCHRONY
▸ The time window of synchrony is expected to be long enough for an algorithm to terminate.
▸ OperationTimeoutException in Hazelcast
[Diagram: NODE A sends an operation to NODE B and throws OperationTimeoutException if no timely response arrives.]
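As a rough illustration (not taken from the slides), a caller can treat an OperationTimeoutException as "no timely response" rather than as proof of failure. The Hazelcast 3.x-style imports and the map name below are assumptions:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;
    import com.hazelcast.core.OperationTimeoutException;

    public class TimeoutDemo {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, String> map = hz.getMap("users"); // hypothetical map name
            try {
                map.put("key", "value");
            } catch (OperationTimeoutException e) {
                // No timely response: the operation may or may not have been applied.
                // The caller decides whether to retry, verify, or give up.
            }
        }
    }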
10. FAILURE MODES
A PROCESS CRASHES AND NEVER COMES BACK
▸ Crash-stop
▸ Default failure model of Hazelcast
[Diagram: KEY1 and KEY2 are replicated on NODE A and NODE B; after NODE A crashes and never comes back, NODE B continues to serve both keys.]
11. FAILURE MODES
A MESSAGE NEVER ARRIVES AT ITS DESTINATION
▸ Omission faults
[Diagram: the PRIMARY sends backup operations v=5, v=6, v=7 to the BACKUP; one of them never arrives, so the backup's expected version falls out of step and the BACKUP IS DIRTY.]
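A minimal sketch (not Hazelcast internals) of how a backup can notice an omission fault: backup operations carry increasing version numbers, and a gap means a message never arrived. All names here are invented:

    // Sketch of gap detection on the backup side of a primary-backup pair.
    public class BackupReplica {
        private long expectedVersion = 5; // next version expected from the primary
        private long value;

        public synchronized void applyBackup(long version, long newValue) {
            if (version != expectedVersion) {
                // An earlier backup message was omitted; applying this one would
                // leave the replica dirty, so signal the problem instead.
                throw new IllegalStateException(
                        "expected version " + expectedVersion + " but received " + version);
            }
            value = newValue;
            expectedVersion++;
        }
    }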
12. FAILURE MODES
A PROCESS CRASHES, AND RECOVERS AFTER SOME TIME
▸ Crash-recover
▸ Hazelcast can perform crash-recover with the Hot Restart feature.
[Diagram: NODE A and NODE B each hold KEY1 and KEY2; NODE A crashes and, after it recovers, both nodes hold the keys again.]
13. FAILURE MODES
A PROCESS ARBITRARILY DEVIATES FROM ITS ALGORITHM
▸ Byzantine failures
[Diagram: NODE A (MASTER) sends the member list; NODE B responds with REJECT.]
14. TIME AND ORDER
WE USE TIME TO ORDER EVENTS IN A SYSTEM
▸ Physical timestamps and the latest-update-wins approach
▸ LatestUpdateMapMergePolicy of Hazelcast
▸ Clock drifts make our clocks unreliable.
[Diagram: NODE A holds name: ensar with time 11:10 and NODE B holds name: basri with time 11:11; after the merge, both nodes hold name: basri with time 11:11.]
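A rough sketch of the latest-update-wins idea behind LatestUpdateMapMergePolicy: compare the physical timestamps of the two conflicting entries and keep the newer one. The Entry class is a stand-in, not Hazelcast's merge-policy interface:

    final class Entry {
        final String value;
        final long timestampMillis;
        Entry(String value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    final class LastUpdateWins {
        // Keep whichever entry carries the later physical timestamp.
        // The result is only as trustworthy as the clocks that produced the timestamps.
        static Entry merge(Entry existing, Entry merging) {
            return merging.timestampMillis >= existing.timestampMillis ? merging : existing;
        }
    }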
15. TIME AND ORDER
GOOGLE TRUETIME
▸ Special hardware to bound clock drifts
▸ Clock uncertainty is exposed.
▸ CASE 1 and CASE 2: [Diagram: the uncertainty intervals of events E1 and E2, once non-overlapping (the order of E1 and E2 is known) and once overlapping (the order is uncertain).]
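A small illustration (not the TrueTime API, which is internal to Google) of how exposed clock uncertainty can be used: each event gets an [earliest, latest] interval, and two events can only be ordered when their intervals do not overlap:

    // Invented types for the example; TrueTime itself is not publicly available.
    final class Interval {
        final long earliest;
        final long latest;
        Interval(long earliest, long latest) {
            this.earliest = earliest;
            this.latest = latest;
        }

        // True only if this event certainly happened before the other,
        // i.e. the two uncertainty intervals do not overlap.
        boolean definitelyBefore(Interval other) {
            return this.latest < other.earliest;
        }
    }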
16. TIME AND ORDER
LOGICAL CLOCKS (LAMPORT CLOCKS)
▸ Relative order of events is defined based on local counters and communication.
▸ The happens-before relationship (i.e., causality)
▸ Hazelcast uses logical clocks extensively.
[Diagram: the PRIMARY sends backup operations v=5, v=6, and v=7 to the BACKUP in order.]
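A minimal Lamport clock sketch (illustrative only, not how Hazelcast implements its logical clocks): the counter advances on local events and jumps past any counter received with a message:

    // Minimal Lamport clock; message transport and thread-safety are omitted.
    final class LamportClock {
        private long counter = 0;

        // Called for a local event or just before sending a message.
        long tick() {
            return ++counter;
        }

        // Called when a message carrying the sender's timestamp is received.
        long onReceive(long receivedTimestamp) {
            counter = Math.max(counter, receivedTimestamp) + 1;
            return counter;
        }
    }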
17. TIME AND ORDER
VECTOR CLOCKS
▸ Lamport clocks do not let us infer causality: L(a) < L(b) does not imply that a happened before b.
▸ Vector clocks are used to infer causality.
[Diagram: events E1, E2, and E3 on NODE A, NODE B, and NODE C plotted against physical time.]
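A sketch of the comparison vector clocks enable: event a happened before event b only if a's vector is less than or equal to b's in every entry and strictly less in at least one; otherwise the two events are concurrent. Representing the clock as a map from node id to counter is just one common choice:

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative vector-clock comparison; not taken from any particular library.
    final class VectorClocks {
        // True if the event stamped with 'a' happened before the event stamped with 'b'.
        static boolean happenedBefore(Map<String, Long> a, Map<String, Long> b) {
            boolean strictlySmallerSomewhere = false;
            Set<String> nodes = new HashSet<>(a.keySet());
            nodes.addAll(b.keySet());
            for (String node : nodes) {
                long av = a.getOrDefault(node, 0L);
                long bv = b.getOrDefault(node, 0L);
                if (av > bv) {
                    return false; // not <= in every entry
                }
                if (av < bv) {
                    strictlySmallerSomewhere = true;
                }
            }
            return strictlySmallerSomewhere;
        }

        // Two events are concurrent if neither happened before the other.
        static boolean concurrent(Map<String, Long> a, Map<String, Long> b) {
            return !happenedBefore(a, b) && !happenedBefore(b, a);
        }
    }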
18. CONSENSUS
THE PROBLEM OF HAVING A SET OF PROCESSES AGREE ON A VALUE
▸ Fault tolerant leader election
▸ Achieving strong consistency on replicated data
▸ Committing distributed transactions
▸ Safety and liveness properties
19. CONSENSUS
FLP RESULT
▸ In an asynchronous system with reliable message delivery:
▸ Distributed consensus cannot be guaranteed to terminate in bounded time if even one process can fail with crash-stop :(
▸ The reason is that we cannot differentiate between a slow process and a crashed process.
20. CONSENSUS
END OF THE STORY?
▸ The FLP result is about liveness, not safety.
▸ TCP gives a good degree of reliability for message delivery.
▸ If we make timing assumptions, the consensus problem becomes solvable.
▸ Unreliable failure detectors
21. CONSENSUS ALGORITHMS
TWO-PHASE COMMIT AND THREE-PHASE COMMIT
▸ 2PC preserves safety, but it may lose liveness.
▸ 2PC is a blocking protocol.
▸ 3PC resolves the liveness problem with timeouts, but it may lose safety on crash-recover failures or network partitions.
▸ 3PC is a non-blocking protocol.
[Diagram: the COORDINATOR asks the COHORT to vote, receives yes / no, and then sends commit / rollback; 3PC adds a pre-commit phase between the vote and the commit.]
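A stripped-down sketch of the coordinator's side of 2PC; the Cohort interface and its methods are invented for the example. Every cohort must vote yes before commit is sent, and a single no leads to rollback:

    import java.util.List;

    // Simplified two-phase commit coordinator; error handling and timeouts are omitted.
    interface Cohort {
        boolean prepare();   // phase 1: ask for a yes/no vote
        void commit();       // phase 2: apply the transaction
        void rollback();     // phase 2: abort the transaction
    }

    final class TwoPhaseCommitCoordinator {
        boolean run(List<Cohort> cohorts) {
            // Phase 1: collect votes; any "no" aborts the transaction.
            for (Cohort cohort : cohorts) {
                if (!cohort.prepare()) {
                    cohorts.forEach(Cohort::rollback);
                    return false;
                }
            }
            // Phase 2: everyone voted yes, so commit everywhere.
            // If the coordinator crashes at this point, cohorts block until it recovers.
            cohorts.forEach(Cohort::commit);
            return true;
        }
    }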
22. CONSENSUS ALGORITHMS
MAJORITY-BASED CONSENSUS ALGORITHMS
▸ The majority approach preserves safety and liveness.
▸ (2f + 1) nodes tolerate failure of f nodes.
▸ Resiliency to crash-stop, network partitions, and crash-recover failures
▸ Paxos, Zab, Raft, VR, …
[Diagram: a CLIENT sends set x = 1 to the LEADER, which replicates x = 1 to FOLLOWER 1 and FOLLOWER 2.]
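The arithmetic behind the (2f + 1) rule, as a tiny illustrative helper: a majority quorum is floor(n / 2) + 1, so a cluster of 2f + 1 nodes still has a quorum after f of them fail:

    // Illustrative quorum arithmetic for majority-based consensus.
    final class Quorum {
        // Smallest majority of a cluster with the given size.
        static int majority(int clusterSize) {
            return clusterSize / 2 + 1;
        }

        // Maximum number of failures a majority quorum tolerates:
        // with 2f + 1 nodes, f may fail and the remaining f + 1 still form a majority.
        static int toleratedFailures(int clusterSize) {
            return (clusterSize - 1) / 2;
        }

        public static void main(String[] args) {
            System.out.println(majority(5));          // prints 3
            System.out.println(toleratedFailures(5)); // prints 2
        }
    }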
23. CAP PRINCIPLE
CP VERSUS AP
▸ Proposed by Eric Brewer in 2000
▸ A shared-data system cannot achieve perfect consistency and perfect availability at the same time in the presence of network partitions.
[Diagram: CLIENT1 and CLIENT2 access NODE A, NODE B, and NODE C across a network partition.]
24. THE SPECTRUM OF CONSISTENCY AND AVAILABILITY
DATA-CENTRIC AND CLIENT-CENTRIC CONSISTENCY MODELS
[Diagram: a spectrum of consistency models: LINEARIZABLE, SEQUENTIAL, CAUSAL, PRAM, WRITES FOLLOWING READS, MONOTONIC WRITES, MONOTONIC READS, and READ YOUR WRITES. Linearizable and sequential consistency are data-centric models (CP); the weaker client-centric models can be offered with sticky availability or high availability (AP).]
25. PACELC
CONSISTENCY / LATENCY TRADEOFF
▸ Proposed by Daniel Abadi in 2010
▸ PACELC
▸ If there is a network partition (P), how does the system trade off availability and consistency (A and C)?
▸ Else (E), during normal operation, how does the system trade off latency and consistency (L and C)?
26. HAZELCAST AND PACELC
FAVOURING CONSISTENCY DURING NORMAL OPERATION (EC)
▸ Hazelcast uses the primary-copy replication technique.
[Diagram: KEY1, KEY2, and KEY3 are spread across NODE A, NODE B, and NODE C; the CLIENT's get KEY1, get KEY2, and get KEY3 calls each go to the member that owns the key's primary replica.]
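A small configuration sketch of the EC behaviour, assuming a Hazelcast 3.x-style API and a made-up map name: with a synchronous backup and read-from-backup left disabled, a write is replicated before it completes and reads are served by the key's primary replica:

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class PrimaryCopyConfig {
        public static void main(String[] args) {
            MapConfig mapConfig = new MapConfig("orders"); // hypothetical map name
            mapConfig.setBackupCount(1);        // one synchronous backup per partition
            mapConfig.setAsyncBackupCount(0);   // no fire-and-forget backups
            mapConfig.setReadBackupData(false); // reads go to the primary replica

            Config config = new Config();
            config.addMapConfig(mapConfig);
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        }
    }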
27. HAZELCAST AND PACELC
FAVOURING LATENCY DURING NORMAL OPERATION (EL)
▸ A client can use a near cache to scale reads.
[Diagram: NODE A, NODE B, and NODE C hold KEY1, KEY2, and KEY3; the CLIENT keeps local copies of the keys in its near cache.]
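A client-side near cache takes only a few lines of configuration (again a Hazelcast 3.x-style API with a made-up map name); cached reads are served locally at the cost of possibly stale values:

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.client.config.ClientConfig;
    import com.hazelcast.config.NearCacheConfig;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class NearCacheClient {
        public static void main(String[] args) {
            ClientConfig clientConfig = new ClientConfig();
            clientConfig.addNearCacheConfig(new NearCacheConfig("orders")); // hypothetical map name

            HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
            IMap<String, String> orders = client.getMap("orders");
            orders.get("KEY1"); // first read goes to the cluster
            orders.get("KEY1"); // later reads are served from the local near cache
        }
    }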
28. HAZELCAST AND PACELC
FAVOURING AVAILABILITY DURING NETWORK PARTITIONS (PA)
▸ Hazelcast remains available during network partitions by default.
[Diagram: a network partition splits NODE A, NODE B, and NODE C; CLIENT1 and CLIENT2 keep reading and writing KEY1, KEY2, and KEY3 on both sides of the partition.]
29. HAZELCAST AND PACELC
FAVOURING CONSISTENCY DURING NETWORK PARTITIONS (PC)
▸ The Split-Brain Protection Feature
[Diagram: the same partitioned cluster with split-brain protection enabled; only the side that retains the configured minimum number of members keeps accepting operations from CLIENT1 and CLIENT2.]
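A configuration sketch of split-brain protection, assuming the Hazelcast 3.x-era API where the feature is configured as a quorum; the quorum and map names are made up. Map operations are rejected unless the member's side of the partition has at least the configured number of members:

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapConfig;
    import com.hazelcast.config.QuorumConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class SplitBrainProtectedMap {
        public static void main(String[] args) {
            // Require at least 3 live members before operations on the map are allowed.
            QuorumConfig quorumConfig = new QuorumConfig();
            quorumConfig.setName("threeMembers");
            quorumConfig.setEnabled(true);
            quorumConfig.setSize(3);

            MapConfig mapConfig = new MapConfig("orders"); // hypothetical map name
            mapConfig.setQuorumName("threeMembers");       // guard this map with the quorum

            Config config = new Config();
            config.addQuorumConfig(quorumConfig);
            config.addMapConfig(mapConfig);
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        }
    }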
30. HAZELCAST AND PACELC
PC / EC AND PA / EL ARE MORE COMMON IN PRACTICE
▸ Hazelcast is PA / EC by default.
▸ Hazelcast can work in the PA / EL mode with some features, such as Near Cache, Read from Backups, and WAN Replication.
▸ Hazelcast can work in the PC / EC mode with the Split-Brain Protection feature to maintain its baseline consistency with a best-effort approach.
31. RECAP
LEARN THE FUNDAMENTALS, THE REST WILL CHANGE ANYWAY
▸ Take the core limitations into consideration
▸ Pick a coherent set of abstractions and models
▸ Define your trade-offs
▸ Many systems can use a mix of models