A talk about some of the core theoretical topics of distributed systems. It discusses system models, failure modes, time, the consensus problem, consistency models, and high-level design principles such as CAP and PACELC.
2. WHO AM I?
ENSAR BASRI KAHVECI
▸ Distributed Systems Engineer @ Hazelcast
▸ twitter & github: metanet
▸ linkedin.com/in/basrikahveci
▸ basrikahveci.com
3. HAZELCAST
LEADING OPEN SOURCE JAVA IMDG
▸ Distributed Java collections, JCache, …
▸ Distributed computation and messaging
▸ Embedded or client-server deployment
▸ Integration modules & cloud friendly
4. HAZELCAST
ELASTICITY AND HIGH AVAILABILITY
▸ Scale up & scale out
▸ Dynamic clustering & elasticity
▸ Data partitioning & replication
▸ Fault tolerance & high availability
5. DISTRIBUTED SYSTEMS
COLLECTION OF ENTITIES SOLVING A COMMON PROBLEM
▸ Shared nothing
▸ Communication via messaging
▸ Uncertain and partial knowledge
▸ Main motivations are scalability, fault tolerance, availability, economics, etc.
7. MODELS AND ABSTRACTIONS
ABSTRACTIONS SIMPLIFY REASONING
▸ Timing assumptions
▸ Failure modes
▸ Notion of time
▸ Design principles
▸ Consistency models
8. TIMING ASSUMPTIONS
ASYNCHRONY IS INHERENTLY PRESENT IN OUR SYSTEMS
▸ A message can be delayed in the network or in a process.
▸ Local clocks can drift arbitrarily.
▸ Asynchrony makes dealing with failures difficult.
[Diagram: NODE A sends a request to NODE B but receives no timely response.]
9. TIMING ASSUMPTIONS
OUR SYSTEMS WORK WITH PARTIAL SYNCHRONY
▸ The time window of synchrony is expected to be long enough for an algorithm to terminate.
▸ OperationTimeoutException in Hazelcast
[Diagram: NODE A sends an operation to NODE B and throws OperationTimeoutException if no timely response arrives.]
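As a rough illustration (not taken from the slides), a caller can treat an OperationTimeoutException as "no timely response" rather than as proof of failure. The Hazelcast 3.x-style imports and the map name below are assumptions:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;
    import com.hazelcast.core.OperationTimeoutException;

    public class TimeoutDemo {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, String> map = hz.getMap("users"); // hypothetical map name
            try {
                map.put("key", "value");
            } catch (OperationTimeoutException e) {
                // No timely response: the operation may or may not have been applied.
                // The caller decides whether to retry, verify, or give up.
            }
        }
    }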
10. FAILURE MODES
A PROCESS CRASHES AND NEVER COMES BACK
▸ Crash-stop
▸ Default failure model of Hazelcast
[Diagram: KEY1 and KEY2 are replicated on NODE A and NODE B; after NODE A crashes and never comes back, NODE B continues to serve both keys.]
11. FAILURE MODES
A MESSAGE NEVER ARRIVES AT ITS DESTINATION
▸ Omission faults
[Diagram: the PRIMARY sends backup operations v=5, v=6, v=7 to the BACKUP; one of them never arrives, so the backup's expected version falls out of step and the BACKUP IS DIRTY.]
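A minimal sketch (not Hazelcast internals) of how a backup can notice an omission fault: backup operations carry increasing version numbers, and a gap means a message never arrived. All names here are invented:

    // Sketch of gap detection on the backup side of a primary-backup pair.
    public class BackupReplica {
        private long expectedVersion = 5; // next version expected from the primary
        private long value;

        public synchronized void applyBackup(long version, long newValue) {
            if (version != expectedVersion) {
                // An earlier backup message was omitted; applying this one would
                // leave the replica dirty, so signal the problem instead.
                throw new IllegalStateException(
                        "expected version " + expectedVersion + " but received " + version);
            }
            value = newValue;
            expectedVersion++;
        }
    }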
12. FAILURE MODES
A PROCESS CRASHES, AND RECOVERS AFTER SOME TIME
▸ Crash-recover
▸ Hazelcast can perform crash-recover with the Hot Restart feature.
[Diagram: NODE A and NODE B each hold KEY1 and KEY2; NODE A crashes and, after it recovers, both nodes hold the keys again.]
13. FAILURE MODES
A PROCESS ARBITRARILY DEVIATES FROM ITS ALGORITHM
▸ Byzantine failures
[Diagram: NODE A (MASTER) sends the member list; NODE B responds with REJECT.]
14. TIME AND ORDER
WE USE TIME TO ORDER EVENTS IN A SYSTEM
▸ Physical timestamps and the latest-update-wins approach
▸ LatestUpdateMapMergePolicy of Hazelcast
▸ Clock drifts make our clocks unreliable.
[Diagram: NODE A holds name: ensar with time 11:10 and NODE B holds name: basri with time 11:11; after the merge, both nodes hold name: basri with time 11:11.]
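A rough sketch of the latest-update-wins idea behind LatestUpdateMapMergePolicy: compare the physical timestamps of the two conflicting entries and keep the newer one. The Entry class is a stand-in, not Hazelcast's merge-policy interface:

    final class Entry {
        final String value;
        final long timestampMillis;
        Entry(String value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    final class LastUpdateWins {
        // Keep whichever entry carries the later physical timestamp.
        // The result is only as trustworthy as the clocks that produced the timestamps.
        static Entry merge(Entry existing, Entry merging) {
            return merging.timestampMillis >= existing.timestampMillis ? merging : existing;
        }
    }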
15. TIME AND ORDER
GOOGLE TRUETIME
▸ Special hardware to bound clock drifts
▸ Clock uncertainty is exposed.
▸ CASE 1 and CASE 2: [Diagram: the uncertainty intervals of events E1 and E2, once non-overlapping (the order of E1 and E2 is known) and once overlapping (the order is uncertain).]
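A small illustration (not the TrueTime API, which is internal to Google) of how exposed clock uncertainty can be used: each event gets an [earliest, latest] interval, and two events can only be ordered when their intervals do not overlap:

    // Invented types for the example; TrueTime itself is not publicly available.
    final class Interval {
        final long earliest;
        final long latest;
        Interval(long earliest, long latest) {
            this.earliest = earliest;
            this.latest = latest;
        }

        // True only if this event certainly happened before the other,
        // i.e. the two uncertainty intervals do not overlap.
        boolean definitelyBefore(Interval other) {
            return this.latest < other.earliest;
        }
    }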
16. TIME AND ORDER
LOGICAL CLOCKS (LAMPORT CLOCKS)
▸ Relative order of events is defined based on local counters and communication.
▸ The happens-before relationship (i.e., causality)
▸ Hazelcast uses logical clocks extensively.
[Diagram: the PRIMARY sends backup operations v=5, v=6, and v=7 to the BACKUP in order.]
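A minimal Lamport clock sketch (illustrative only, not how Hazelcast implements its logical clocks): the counter advances on local events and jumps past any counter received with a message:

    // Minimal Lamport clock; message transport and thread-safety are omitted.
    final class LamportClock {
        private long counter = 0;

        // Called for a local event or just before sending a message.
        long tick() {
            return ++counter;
        }

        // Called when a message carrying the sender's timestamp is received.
        long onReceive(long receivedTimestamp) {
            counter = Math.max(counter, receivedTimestamp) + 1;
            return counter;
        }
    }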
17. TIME AND ORDER
VECTOR CLOCKS
▸ Lamport clocks do not let us infer causality: L(a) < L(b) does not imply that a happened before b.
▸ Vector clocks are used to infer causality.
[Diagram: events E1, E2, and E3 on NODE A, NODE B, and NODE C plotted against physical time.]
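A sketch of the comparison vector clocks enable: event a happened before event b only if a's vector is less than or equal to b's in every entry and strictly less in at least one; otherwise the two events are concurrent. Representing the clock as a map from node id to counter is just one common choice:

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative vector-clock comparison; not taken from any particular library.
    final class VectorClocks {
        // True if the event stamped with 'a' happened before the event stamped with 'b'.
        static boolean happenedBefore(Map<String, Long> a, Map<String, Long> b) {
            boolean strictlySmallerSomewhere = false;
            Set<String> nodes = new HashSet<>(a.keySet());
            nodes.addAll(b.keySet());
            for (String node : nodes) {
                long av = a.getOrDefault(node, 0L);
                long bv = b.getOrDefault(node, 0L);
                if (av > bv) {
                    return false; // not <= in every entry
                }
                if (av < bv) {
                    strictlySmallerSomewhere = true;
                }
            }
            return strictlySmallerSomewhere;
        }

        // Two events are concurrent if neither happened before the other.
        static boolean concurrent(Map<String, Long> a, Map<String, Long> b) {
            return !happenedBefore(a, b) && !happenedBefore(b, a);
        }
    }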
18. CONSENSUS
THE PROBLEM OF HAVING A SET OF PROCESSES AGREE ON A VALUE
▸ Fault tolerant leader election
▸ Achieving strong consistency on replicated data
▸ Committing distributed transactions
▸ Safety and liveness properties
19. CONSENSUS
FLP RESULT
▸ In an asynchronous system with reliable message delivery:
▸ Distributed consensus cannot be guaranteed to terminate in bounded time if even one process can fail with crash-stop :(
▸ The reason is that we cannot differentiate between a slow process and a crashed process.
20. CONSENSUS
END OF THE STORY?
▸ The FLP result is about liveness, not safety.
▸ TCP gives a good degree of reliability for message delivery.
▸ If we make timing assumptions, the consensus problem becomes solvable.
▸ Unreliable failure detectors
21. CONSENSUS ALGORITHMS
TWO-PHASE COMMIT AND THREE-PHASE COMMIT
▸ 2PC preserves safety, but it may lose liveness.
▸ 2PC is a blocking protocol.
▸ 3PC resolves the liveness problem with timeouts, but it may lose safety on crash-recover failures or network partitions.
▸ 3PC is a non-blocking protocol.
[Diagram: the COORDINATOR asks the COHORT to vote, receives yes / no, and then sends commit / rollback; 3PC adds a pre-commit phase between the vote and the commit.]
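A stripped-down sketch of the coordinator's side of 2PC; the Cohort interface and its methods are invented for the example. Every cohort must vote yes before commit is sent, and a single no leads to rollback:

    import java.util.List;

    // Simplified two-phase commit coordinator; error handling and timeouts are omitted.
    interface Cohort {
        boolean prepare();   // phase 1: ask for a yes/no vote
        void commit();       // phase 2: apply the transaction
        void rollback();     // phase 2: abort the transaction
    }

    final class TwoPhaseCommitCoordinator {
        boolean run(List<Cohort> cohorts) {
            // Phase 1: collect votes; any "no" aborts the transaction.
            for (Cohort cohort : cohorts) {
                if (!cohort.prepare()) {
                    cohorts.forEach(Cohort::rollback);
                    return false;
                }
            }
            // Phase 2: everyone voted yes, so commit everywhere.
            // If the coordinator crashes at this point, cohorts block until it recovers.
            cohorts.forEach(Cohort::commit);
            return true;
        }
    }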
22. CONSENSUS ALGORITHMS
MAJORITY-BASED CONSENSUS ALGORITHMS
▸ The majority approach preserves safety and liveness.
▸ (2f + 1) nodes tolerate failure of f nodes.
▸ Resiliency to crash-stop, network partitions, and crash-recover failures
▸ Paxos, Zab, Raft, VR, …
[Diagram: a CLIENT sends set x = 1 to the LEADER, which replicates x = 1 to FOLLOWER 1 and FOLLOWER 2.]
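The arithmetic behind the (2f + 1) rule, as a tiny illustrative helper: a majority quorum is floor(n / 2) + 1, so a cluster of 2f + 1 nodes still has a quorum after f of them fail:

    // Illustrative quorum arithmetic for majority-based consensus.
    final class Quorum {
        // Smallest majority of a cluster with the given size.
        static int majority(int clusterSize) {
            return clusterSize / 2 + 1;
        }

        // Maximum number of failures a majority quorum tolerates:
        // with 2f + 1 nodes, f may fail and the remaining f + 1 still form a majority.
        static int toleratedFailures(int clusterSize) {
            return (clusterSize - 1) / 2;
        }

        public static void main(String[] args) {
            System.out.println(majority(5));          // prints 3
            System.out.println(toleratedFailures(5)); // prints 2
        }
    }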
23. CAP PRINCIPLE
CP VERSUS AP
▸ Proposed by Eric Brewer in 2000
▸ A shared-data system cannot achieve perfect consistency and perfect availability at the same time in the presence of network partitions.
[Diagram: CLIENT1 and CLIENT2 access NODE A, NODE B, and NODE C across a network partition.]
24. THE SPECTRUM OF CONSISTENCY AND AVAILABILITY
DATA-CENTRIC AND CLIENT-CENTRIC CONSISTENCY MODELS
[Diagram: a spectrum of consistency models: LINEARIZABLE, SEQUENTIAL, CAUSAL, PRAM, WRITES FOLLOWING READS, MONOTONIC WRITES, MONOTONIC READS, and READ YOUR WRITES. Linearizable and sequential consistency are data-centric models (CP); the weaker client-centric models can be offered with sticky availability or high availability (AP).]
25. PACELC
CONSISTENCY / LATENCY TRADEOFF
▸ Proposed by Daniel Abadi in 2010
▸ PACELC
▸ If there is a network partition (P), how does the system trade off availability and consistency (A and C)?
▸ Else (E), during normal operation, how does the system trade off latency and consistency (L and C)?
26. HAZELCAST AND PACELC
FAVOURING CONSISTENCY DURING NORMAL OPERATION (EC)
▸ Hazelcast uses the primary-copy replication technique.
[Diagram: KEY1, KEY2, and KEY3 are spread across NODE A, NODE B, and NODE C; the CLIENT's get KEY1, get KEY2, and get KEY3 calls each go to the member that owns the key's primary replica.]
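A small configuration sketch of the EC behaviour, assuming a Hazelcast 3.x-style API and a made-up map name: with a synchronous backup and read-from-backup left disabled, a write is replicated before it completes and reads are served by the key's primary replica:

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class PrimaryCopyConfig {
        public static void main(String[] args) {
            MapConfig mapConfig = new MapConfig("orders"); // hypothetical map name
            mapConfig.setBackupCount(1);        // one synchronous backup per partition
            mapConfig.setAsyncBackupCount(0);   // no fire-and-forget backups
            mapConfig.setReadBackupData(false); // reads go to the primary replica

            Config config = new Config();
            config.addMapConfig(mapConfig);
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        }
    }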
27. HAZELCAST AND PACELC
FAVOURING LATENCY DURING NORMAL OPERATION (EL)
▸ A client can use a near cache to scale reads.
[Diagram: NODE A, NODE B, and NODE C hold KEY1, KEY2, and KEY3; the CLIENT keeps local copies of the keys in its near cache.]
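A client-side near cache takes only a few lines of configuration (again a Hazelcast 3.x-style API with a made-up map name); cached reads are served locally at the cost of possibly stale values:

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.client.config.ClientConfig;
    import com.hazelcast.config.NearCacheConfig;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class NearCacheClient {
        public static void main(String[] args) {
            ClientConfig clientConfig = new ClientConfig();
            clientConfig.addNearCacheConfig(new NearCacheConfig("orders")); // hypothetical map name

            HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
            IMap<String, String> orders = client.getMap("orders");
            orders.get("KEY1"); // first read goes to the cluster
            orders.get("KEY1"); // later reads are served from the local near cache
        }
    }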
28. HAZELCAST AND PACELC
FAVOURING AVAILABILITY DURING NETWORK PARTITIONS (PA)
▸ Hazelcast remains available during network partitions by default.
[Diagram: a network partition splits NODE A, NODE B, and NODE C; CLIENT1 and CLIENT2 keep reading and writing KEY1, KEY2, and KEY3 on both sides of the partition.]
29. HAZELCAST AND PACELC
FAVOURING CONSISTENCY DURING NETWORK PARTITIONS (PC)
▸ The Split-Brain Protection Feature
[Diagram: the same partitioned cluster with split-brain protection enabled; only the side that retains the configured minimum number of members keeps accepting operations from CLIENT1 and CLIENT2.]
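A configuration sketch of split-brain protection, assuming the Hazelcast 3.x-era API where the feature is configured as a quorum; the quorum and map names are made up. Map operations are rejected unless the member's side of the partition has at least the configured number of members:

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapConfig;
    import com.hazelcast.config.QuorumConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class SplitBrainProtectedMap {
        public static void main(String[] args) {
            // Require at least 3 live members before operations on the map are allowed.
            QuorumConfig quorumConfig = new QuorumConfig();
            quorumConfig.setName("threeMembers");
            quorumConfig.setEnabled(true);
            quorumConfig.setSize(3);

            MapConfig mapConfig = new MapConfig("orders"); // hypothetical map name
            mapConfig.setQuorumName("threeMembers");       // guard this map with the quorum

            Config config = new Config();
            config.addQuorumConfig(quorumConfig);
            config.addMapConfig(mapConfig);
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        }
    }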
30. HAZELCAST AND PACELC
PC / EC AND PA / EL ARE MORE COMMON IN PRACTICE
▸ Hazelcast is PA / EC by default.
▸ Hazelcast can work in the PA / EL mode with some features, such as Near Cache, Read from Backups, and WAN Replication.
▸ Hazelcast can work in the PC / EC mode with the Split-Brain Protection feature to maintain its baseline consistency with a best-effort approach.
31. RECAP
LEARN THE FUNDAMENTALS, THE REST WILL CHANGE ANYWAY
▸ Take the core limitations into consideration
▸ Pick a coherent set of abstractions and models
▸ Define your trade-offs
▸ Many systems can use a mix of models