DISTRIBUTED SYSTEMS THEORY
FOR MERE MORTALS
ENSAR BASRI KAHVECI
1
WHO AM I?
ENSAR BASRI KAHVECI
▸ Distributed Systems Engineer @ Hazelcast
▸ twitter & github: metanet
▸ linkedin.com/in/basrikahveci
▸ basrikahveci.com
2
HAZELCAST
LEADING OPEN SOURCE JAVA IMDG
▸ Distributed Java collections, JCache, …
▸ Distributed computation and messaging
▸ Embedded or client-server deployment
▸ Integration modules & cloud friendly
3
HAZELCAST
ELASTICITY AND HIGH AVAILABILITY
▸ Scale up & scale out
▸ Dynamic clustering & elasticity
▸ Data partitioning & replication
▸ Fault tolerance & high availability
4
DISTRIBUTED SYSTEMS
COLLECTION OF ENTITIES SOLVING A COMMON PROBLEM
▸ Shared nothing
▸ Communication via messaging
▸ Uncertain and partial knowledge
▸ Main motivations are scalability, fault tolerance, availability,
economics, etc.
5
DISTRIBUTED SYSTEMS
FUNDAMENTAL DIFFICULTIES
▸ Independent and partial failures
▸ Non-negligible communication delays
▸ Unreliable communication
6
MODELS AND ABSTRACTIONS
ABSTRACTIONS SIMPLIFY REASONING
▸ Timing assumptions
▸ Failure modes
▸ Notion of time
▸ Design principles
▸ Consistency models
7
TIMING ASSUMPTIONS
ASYNCHRONY IS INHERENTLY PRESENT IN OUR SYSTEMS
▸ A message can be delayed in the network or in a process.
▸ Local clocks can drift arbitrarily.
▸ Asynchrony makes dealing with failures difficult.
[Diagram: NODE A sends a request to NODE B and gets no timely response]
8
TIMING ASSUMPTIONS
OUR SYSTEMS WORK WITH PARTIAL SYNCHRONY
▸ Time window of synchrony is expected to be long enough
for an algorithm to terminate.
▸ OperationTimeoutException in Hazelcast
[Diagram: NODE A sends an operation to NODE B, and throws OperationTimeoutException if there is no timely response]
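As a concrete illustration, here is a minimal sketch of handling such a timeout on the caller side. It assumes the Hazelcast 3.x API of this deck's era; the map name and key are made up.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.OperationTimeoutException;

public class TimeoutDemo {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> map = hz.getMap("users"); // hypothetical map
        try {
            // Blocks until the partition owner responds, or until the
            // configured operation timeout elapses.
            String value = map.get("user-42");
            System.out.println("value = " + value);
        } catch (OperationTimeoutException e) {
            // No timely response: the owner may be slow, crashed, or
            // unreachable -- under asynchrony the caller cannot tell which.
            System.err.println("no timely response: " + e.getMessage());
        }
    }
}
```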
9
FAILURE MODES
A PROCESS CRASHES AND NEVER COMES BACK
▸ Crash-stop
▸ Default failure model of Hazelcast
[Diagram: NODE A and NODE B each hold KEY1 and KEY2; NODE B crashes and never comes back]
10
FAILURE MODES
A MESSAGE NEVER ARRIVES AT ITS DESTINATION
▸ Omission faults
[Diagram: the primary sends backup updates v=5, v=6, and v=7 to the backup, which expects v=5 and then v=6; one backup message is lost, so the backup is dirty]
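The gap detection in the diagram fits in a few lines. The following is a simplified illustration of the idea, not Hazelcast's actual replication code:

```java
// A backup replica detects an omission fault with version numbers.
class BackupReplica {
    private long expectedVersion = 5; // matches the diagram
    private Object value;

    // Returns false when a gap is detected, i.e. an earlier backup
    // message was lost and the replica must re-sync from the primary.
    boolean applyBackup(long version, Object newValue) {
        if (version != expectedVersion) {
            // e.g. v=7 arrives while we still expect v=6: we are dirty.
            return false;
        }
        value = newValue;
        expectedVersion++;
        return true;
    }
}
```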
11
FAILURE MODES
A PROCESS CRASHES AND RECOVERS AFTER SOME TIME
▸ Crash-recover
▸ Hazelcast can perform crash-recover with the Hot Restart
feature.
[Diagram: NODE A and NODE B each hold KEY1 and KEY2; NODE B crashes and later recovers, rejoining the cluster]
12
FAILURE MODES
A PROCESS ARBITRARILY DEVIATES FROM ITS ALGORITHM
▸ Byzantine failures
[Diagram: NODE A (master) distributes the member list to NODE B and NODE C; NODE B rejects it]
13
TIME AND ORDER
WE USE TIME TO ORDER EVENTS IN A SYSTEM
▸ Physical timestamps and the latest-update-wins approach
▸ LatestUpdateMapMergePolicy of Hazelcast
▸ Clock drifts make our clocks unreliable.
[Diagram: NODE A holds name: ensar at time 11:10 and NODE B holds name: basri at time 11:11; after the merge, both nodes hold name: basri with time 11:11]
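The merge in the diagram boils down to comparing physical timestamps. Below is a minimal last-update-wins sketch of the idea behind LatestUpdateMapMergePolicy (not its actual implementation); note how clock drift between nodes can silently pick the wrong winner:

```java
class TimestampedEntry {
    final String value;
    final long updateTimeMillis; // wall-clock time of the last update

    TimestampedEntry(String value, long updateTimeMillis) {
        this.value = value;
        this.updateTimeMillis = updateTimeMillis;
    }
}

class LastUpdateWins {
    // Picks the entry with the newer physical timestamp. If the two
    // nodes' clocks have drifted apart, the "winner" may actually be
    // the older update -- exactly the unreliability on this slide.
    static TimestampedEntry merge(TimestampedEntry a, TimestampedEntry b) {
        return a.updateTimeMillis >= b.updateTimeMillis ? a : b;
    }
}
```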
14
TIME AND ORDER
GOOGLE TRUETIME
▸ Special hardware to bound clock drifts
▸ Clock uncertainty is exposed.
[Diagram: Case 1: the uncertainty intervals of events E1 and E2 do not overlap, so their order is known; Case 2: the intervals overlap, so the order is ambiguous]
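The two cases can be expressed as a comparison of uncertainty intervals. The sketch below is illustrative only; the names do not come from Google's API:

```java
// A TrueTime-style timestamp is an interval [earliest, latest]; two
// events are ordered only when their intervals do not overlap.
class TTInterval {
    final long earliest, latest;

    TTInterval(long earliest, long latest) {
        this.earliest = earliest;
        this.latest = latest;
    }

    // Case 1: true, the order is known. Case 2 (overlap): false.
    boolean definitelyBefore(TTInterval other) {
        return this.latest < other.earliest;
    }
}
```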
15
TIME AND ORDER
LOGICAL CLOCKS (LAMPORT CLOCKS)
▸ Relative order of events is defined based on local counters
and communication.
▸ The happens-before relationship (i.e., causality)
▸ Hazelcast uses logical clocks extensively.
[Diagram: the primary orders backup updates v=5, v=6, and v=7 with a monotonically increasing version counter]
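A Lamport clock itself fits in a few lines; here is a minimal sketch:

```java
// A local counter that ticks on every event and jumps forward when a
// message stamped with a larger timestamp arrives.
class LamportClock {
    private long time = 0;

    // Called for a local event or before sending a message.
    synchronized long tick() {
        return ++time;
    }

    // Called when a message stamped with senderTime is received.
    synchronized long onReceive(long senderTime) {
        time = Math.max(time, senderTime) + 1;
        return time;
    }
}
```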
16
TIME AND ORDER
VECTOR CLOCKS
▸ Lamport clocks do not let us infer causality: comparing two timestamps cannot tell whether the events are causally related or concurrent.
▸ Vector clocks are used to infer causality.
[Diagram: events E1, E2, and E3 occurring on NODE A, NODE B, and NODE C along physical time]
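A minimal vector clock sketch for a fixed set of n processes follows; compare() distinguishes happens-before from concurrency, which a single Lamport counter cannot do:

```java
class VectorClock {
    private final long[] clock;
    private final int myIndex; // index of the local process

    VectorClock(int n, int myIndex) {
        this.clock = new long[n];
        this.myIndex = myIndex;
    }

    // Local event: advance only our own component.
    void tick() { clock[myIndex]++; }

    // Merge the sender's clock on message receipt, then tick.
    void onReceive(long[] senderClock) {
        for (int i = 0; i < clock.length; i++) {
            clock[i] = Math.max(clock[i], senderClock[i]);
        }
        clock[myIndex]++;
    }

    // -1: this happened-before other; 1: other happened-before this;
    // 0: concurrent (neither clock dominates the other).
    int compare(VectorClock other) {
        boolean less = false, greater = false;
        for (int i = 0; i < clock.length; i++) {
            if (clock[i] < other.clock[i]) less = true;
            if (clock[i] > other.clock[i]) greater = true;
        }
        if (less && !greater) return -1;
        if (greater && !less) return 1;
        return 0;
    }
}
```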
17
CONSENSUS
THE PROBLEM OF HAVING A SET OF PROCESSES AGREE ON A VALUE
▸ Fault-tolerant leader election
▸ Achieving strong consistency on replicated data
▸ Committing distributed transactions
▸ Safety and liveness properties
18
CONSENSUS
FLP RESULT
▸ In an asynchronous system with reliable message delivery:
▸ Distributed consensus cannot be solved within bounded time if even one process can fail by crash-stop :(
▸ The reason is that we cannot differentiate between a slow process and a crashed process.
19
CONSENSUS
END OF THE STORY?
▸ The FLP result is about liveness, not safety.
▸ TCP gives a good degree of reliability for message delivery.
▸ If we make timing assumptions, the consensus problem
becomes solvable.
▸ Unreliable failure detectors
20
CONSENSUS ALGORITHMS
TWO-PHASE COMMIT AND THREE-PHASE COMMIT
▸ 2PC preserves safety, but it may lose liveness.
▸ 2PC is a blocking protocol.
▸ 3PC resolves the liveness problem with timeouts, but it may lose safety on crash-recover failures or network partitions.
▸ 3PC is a non-blocking protocol.
[Diagram: the coordinator asks the cohort to vote; the cohort answers yes / no; the coordinator then sends commit / rollback. 3PC adds a pre-commit phase between the vote and the commit / rollback.]
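The message flow above can be sketched as coordinator logic. This is a bare-bones illustration; Cohort is a hypothetical interface, and a real coordinator must also log its decision and handle timeouts, which is exactly where 2PC blocks:

```java
import java.util.List;

interface Cohort {
    boolean prepare();  // phase 1: vote yes (true) or no (false)
    void commit();      // phase 2, if all voted yes
    void rollback();    // phase 2, otherwise
}

class TwoPhaseCommitCoordinator {
    boolean execute(List<Cohort> cohorts) {
        // Phase 1: collect votes. A single "no" (or a timeout,
        // omitted here) aborts the transaction.
        boolean allYes = true;
        for (Cohort c : cohorts) {
            if (!c.prepare()) {
                allYes = false;
                break;
            }
        }
        // Phase 2: everyone commits or everyone rolls back.
        for (Cohort c : cohorts) {
            if (allYes) c.commit(); else c.rollback();
        }
        return allYes;
    }
}
```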
21
CONSENSUS ALGORITHMS
MAJORITY-BASED CONSENSUS ALGORITHMS
▸ The majority approach preserves
safety and liveness.
▸ (2f + 1) nodes tolerate failure of f nodes.
▸ Resiliency to crash-stop, network
partitions, and crash-recover
failures
▸ Paxos, Zab, Raft, VR, …
[Diagram: a client sends set x = 1 to the LEADER, which replicates x=1 to FOLLOWER 1 and FOLLOWER 2]
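The (2f + 1) arithmetic is worth spelling out: any two majorities of 2f + 1 nodes intersect, so a value accepted by f + 1 nodes survives the failure of any f nodes. A tiny sketch:

```java
class Quorum {
    // With n replicas, a majority is n / 2 + 1 nodes.
    static int majority(int n) {
        return n / 2 + 1;
    }

    static boolean isCommitted(int acks, int n) {
        return acks >= majority(n);
    }

    public static void main(String[] args) {
        int n = 5; // n = 2f + 1 with f = 2: tolerates 2 failures
        System.out.println(isCommitted(3, n)); // true: 3 >= 3
        System.out.println(isCommitted(2, n)); // false: no majority
    }
}
```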
22
CAP PRINCIPLE
CP VERSUS AP
▸ Proposed by Eric Brewer in 2000
▸ A shared-data system cannot achieve perfect consistency
and perfect availability at the same time in the presence of
network partitions.
[Diagram: CLIENT1 and CLIENT2 operate on a cluster of NODE A, NODE B, and NODE C while a network partition splits the cluster]
23
THE SPECTRUM OF CONSISTENCY AND AVAILABILITY
DATA-CENTRIC AND CLIENT-CENTRIC CONSISTENCY MODELS
[Diagram: the consistency spectrum from strongest to weakest: linearizable, sequential, causal, and PRAM (data-centric consistency, CP); writes-follow-reads, monotonic writes, monotonic reads, and read-your-writes (client-centric consistency, offered with high or sticky availability, AP)]
24
PACELC
CONSISTENCY / LATENCY TRADEOFF
▸ Proposed by Daniel Abadi in 2010
▸ PACELC
▸ If there is a network partition ( P ), how does the system
trade off availability and consistency (A and C) ?
▸ Else ( E ), during normal operation, how does the system
trade off latency and consistency ( L and C ) ?
25
HAZELCAST AND PACELC
FAVOURING CONSISTENCY DURING NORMAL OPERATION (EC)
▸ Hazelcast uses the primary-copy replication technique.
[Diagram: KEY1, KEY2, and KEY3 are partitioned across NODE A, NODE B, and NODE C; the client sends get KEY1, get KEY2, and get KEY3 to the primary replica of each key]
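Primary-copy replication keeps normal operation consistent because every read and write of a key goes to that key's single primary replica. A minimal routing sketch; the partition count of 271 matches Hazelcast's default, the rest is illustrative:

```java
class PartitionRouter {
    static final int PARTITION_COUNT = 271;

    // Hazelcast hashes the serialized key; hashCode() stands in here.
    static int partitionId(Object key) {
        return Math.abs(key.hashCode() % PARTITION_COUNT);
    }

    // partitionTable[p] = address of the primary replica of partition p.
    static String primaryFor(Object key, String[] partitionTable) {
        return partitionTable[partitionId(key)];
    }
}
```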
26
HAZELCAST AND PACELC
FAVOURING LATENCY DURING NORMAL OPERATION (EL)
▸ A client can use a near cache to scale reads.
[Diagram: NODE A, NODE B, and NODE C hold KEY1, KEY2, and KEY3; the client keeps local copies of all three keys in its near cache and serves reads from them]
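A near cache trades consistency for latency: reads are served from a local copy that may be stale. A minimal sketch, where RemoteMap is a hypothetical stand-in for the cluster:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface RemoteMap<K, V> {
    V get(K key);
}

class NearCachedMap<K, V> {
    private final RemoteMap<K, V> remote;
    private final Map<K, V> nearCache = new ConcurrentHashMap<>();

    NearCachedMap(RemoteMap<K, V> remote) {
        this.remote = remote;
    }

    V get(K key) {
        // Local hit: no network round trip, but possibly stale.
        V cached = nearCache.get(key);
        if (cached != null) return cached;
        // Miss: fetch from the cluster and cache locally.
        V value = remote.get(key);
        if (value != null) nearCache.put(key, value);
        return value;
    }
}
```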
27
HAZELCAST AND PACELC
FAVOURING AVAILABILITY DURING NETWORK PARTITIONS (PA)
▸ Hazelcast remains available during network partitions by default.
[Diagram: a network partition splits NODE A, NODE B, and NODE C into two sides; CLIENT1 and CLIENT2 keep operating on their own sides of the partition]
28
HAZELCAST AND PACELC
FAVOURING CONSISTENCY DURING NETWORK PARTITIONS (PC)
▸ The Split-Brain Protection Feature
[Diagram: the same partitioned cluster, with split-brain protection rejecting operations on the smaller side while the larger side keeps serving its client]
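The essence of split-brain protection is a pre-operation check against a configured minimum cluster size. The sketch below shows the idea with made-up names, not Hazelcast's actual API:

```java
class SplitBrainGuard {
    private final int minClusterSize;

    SplitBrainGuard(int minClusterSize) {
        this.minClusterSize = minClusterSize;
    }

    // Called before applying an operation.
    void checkQuorum(int observedClusterSize) {
        if (observedClusterSize < minClusterSize) {
            // Likely on the minority side of a partition:
            // reject the operation instead of diverging.
            throw new IllegalStateException("minimum cluster size not met");
        }
    }
}
```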
29
HAZELCAST AND PACELC
PC / EC AND PA / EL ARE MORE COMMON IN PRACTICE
▸ Hazelcast is PA / EC by default.
▸ Hazelcast can work in the PA / EL mode with some features,
such as Near Cache, Read from Backups, and WAN
Replication.
▸ Hazelcast can work in the PC / EC mode with the Split-Brain
Protection feature to maintain its baseline consistency with a
best-effort approach.
30
RECAP
LEARN THE FUNDAMENTALS, THE REST WILL CHANGE ANYWAY
▸ Take the core limitations into consideration
▸ Pick a coherent set of abstractions and models
▸ Define your trade-offs
▸ Many systems can use a mix of models
31
THANK YOU
32