Paxos
Building Reliable System
2015-07-02 @drdrxp
Background
Several processes do one thing.
The only problem in a distributed system is
achieving consensus.
Paxos: the core of distributed systems.
Agenda
1. Problem
2. Replication is not enough
3. Paxos Algorithm
4. Paxos Optimization
Problem
Required:
Durability: 99.99999999%
Availability: 99.99%
What we have:
Hard drive: 4% annual failure rate
Server downtime: 0.1% or more
Packet loss between IDCs: 5% ~ 30%
Solution(Maybe)
Multiple Replicas
No data loss as long as fewer than n replicas are lost
Probability of losing all replicas:
1 replica: ~ 0.63%
2 replicas: ~ 0.00395%
3 replicas: ~ 0.000025%
n replicas: x^n, so durability = 1 - x^n /* x = failure rate of a single replica */
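The replica arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes replica failures are independent (the slide does not say so) and uses x = 0.63% from the single-replica figure; the function names are illustrative.

```python
def loss_probability(x: float, n: int) -> float:
    """Probability that all n replicas are lost, assuming independent failures."""
    return x ** n


def durability(x: float, n: int) -> float:
    """Durability as defined on the slide: 1 - x^n."""
    return 1.0 - loss_probability(x, n)


def report(x: float = 0.0063) -> None:
    # x = 0.0063 is the 0.63% single-replica figure from the slide.
    for n in (1, 2, 3):
        print(f"{n} replicas: loss ~ {loss_probability(x, n) * 100:.6f}%")
```

Calling report() prints the loss probability for 1, 2 and 3 replicas as percentages.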
Solution.
How to replicate data?
Besides number of replicas:
Availability
Atomicity
Consistency
...
Fundamental Replication Algorithms
Master-Slave Async
Master-Slave Sync
Master-Slave Semi-Sync
Quorum Write and Read
Master-Slave Async
The MySQL way.
1. Master receives a write op.
2. Master writes it to disk.
3. Master responds ‘OK’.
4. Master replicates to slaves.
If the master's disk fails before replication
→ Data loss.
[Diagram: timeline of Client, Master, Slave.1 and Slave.2; the master's disk fails before the write is replicated]
Master-Slave Sync
1. Master receives a write op.
2. Master replicates the log to slaves.
3. A slave may block...
4. Client won’t receive ‘OK’ until
all slaves respond.
One unreachable node
halts the entire system.
Pro: No data loss.
Con: Low availability.
[Diagram: timeline of Client, Master, Slave.1 and Slave.2]
Master-Slave Semi-Sync
1. Master receives a write op.
2. Master replicates the log to slaves.
3. A slave may block...
4. Client receives ‘OK’ once [1,n)
slaves have responded.
Pro: High durability.
Pro: High availability.
Con: No single slave is guaranteed to have all data
→ We need Quorum Write
[Diagram: timeline of Client, Master, Slave.1 and Slave.2]
Quorum Write and Read
Dynamo / Cassandra
Write to W >= N/2+1 nodes.
No master required.
Read from R >= N/2+1 nodes.
W + R > N
Tolerates up to (N-1)/2 failed
nodes.
[Diagram: timeline of Client, Node.1, Node.2 and Node.3]
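Why W + R > N matters can be checked by brute force: every possible write set must share at least one node with every possible read set. The sketch below is only an illustration; the function name is made up for this example.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every write set of size w intersects every read set of size r."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)  # a non-empty intersection is truthy
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )
```

For example, quorums_always_overlap(3, 2, 2) holds, while W=1, R=2 on 3 nodes does not, because a read can miss the only node that was written.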
Quorum Write and Read. Last-Win
The last write wins.
Writes are totally ordered by
timestamp.
[Diagram: timeline of Client, Node.1, Node.2 and Node.3]
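A last-write-wins quorum read can be sketched as follows. This is a simplified in-memory model, not Dynamo's or Cassandra's actual code: each node is a list slot holding a (timestamp, value) pair, and the helper names are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Cell = Optional[Tuple[int, str]]  # (timestamp, value), or None if never written


def quorum_write(nodes: List[Cell], targets: List[int], ts: int, value: str) -> None:
    """Store (ts, value) on each target node unless it already holds a newer write."""
    for i in targets:
        current = nodes[i]
        if current is None or ts > current[0]:
            nodes[i] = (ts, value)


def quorum_read(nodes: List[Cell], targets: List[int]) -> Optional[str]:
    """Return the value with the greatest timestamp among the target nodes."""
    seen = [nodes[i] for i in targets if nodes[i] is not None]
    if not seen:
        return None
    return max(seen, key=lambda cell: cell[0])[1]
```

With 3 nodes, writing "a" at timestamp 1 to nodes 0 and 1 and then "b" at timestamp 2 to nodes 1 and 2 makes any 2-node read return "b", even though node 0 still holds "a".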
Quorum Write and Read..
Pro: High durability.
Pro: High availability.
Pro: Data completeness is guaranteed.
Is it enough?
Quorum Write and Read...
W + R > N
Consistency:
Eventual
Transactionality:
Non-Atomic-Update
Dirty-Read
Lost-Update
http://en.wikipedia.org/wiki/Concurrency_control
An Imaginary Storage Service
● A storage system with 3 nodes(processes).
● Policy: Quorum RW.
● It stores only one variable “i”.
● “i” has multiple versions: i1, i2, i3…
● Commands:
get /* read latest “i” */
set <n> /* assign <n> to “i” */
inc <n> /* increment “i” by <n> */
It shows the deficiencies of Quorum RW
and how Paxos solves them.
An Imaginary Storage Service.
"set" → Quorum Write.
"inc" → the simplest transactional operation:
1. Read latest “i” with Quorum Read: i1
2. Let i2 = i1 + n
3. set i2
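[Diagram: the three nodes start as (i1=2, i1=2, i0=0). X runs “get i” and reads i1=2, computes i2 = i1 + 1, then runs “set i2=3”, leaving the nodes as (i2=3, i1=2, i2=3).]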
An Imaginary Storage Service..
X: get i1=2, i2 = i1 + 1, set i2=3 → OK
Y: get i1=2, i2 = i1 + 2, set i2=4 → Must fail, or the existing value will be overwritten.
Y should run Quorum Read and Quorum Write again...
We expect X to be able to get i3=5.
This requires Y to “fail” after X wrote i2. How do we do that?
[Diagram: the nodes go from (i1=2, i1=2, i0=0) to (i2=3, i1=2, i2=3) after X's write, and should reach (i3=5, i1=2, i3=5) after Y's retry.]
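The lost update can be reproduced in a toy model of the imaginary service. The sketch assumes plain Quorum RW with no write rejection; each node holds a (version, value) pair and the helper names are hypothetical.

```python
from typing import List, Tuple

Versioned = Tuple[int, int]  # (version, value)


def quorum_read(nodes: List[Versioned], targets: List[int]) -> Versioned:
    """Return the entry with the highest version among the target nodes."""
    return max(nodes[i] for i in targets)


def quorum_write(nodes: List[Versioned], targets: List[int], item: Versioned) -> None:
    """Blind overwrite: nothing rejects a second write of the same version."""
    for i in targets:
        nodes[i] = item


def run_lost_update() -> Versioned:
    nodes = [(1, 2), (1, 2), (0, 0)]                       # i1=2, i1=2, i0=0
    x_ver, x_val = quorum_read(nodes, [0, 1])              # X reads i1=2
    y_ver, y_val = quorum_read(nodes, [0, 1])              # Y reads i1=2 as well
    quorum_write(nodes, [0, 2], (x_ver + 1, x_val + 1))    # X: set i2=3
    quorum_write(nodes, [0, 2], (y_ver + 1, y_val + 2))    # Y: set i2=4, overwrites X
    return quorum_read(nodes, [0, 1])
```

run_lost_update() ends with version 2 holding the value 4 rather than version 3 holding 5: X's increment is lost because Y's write of i2 was not refused.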
An Imaginary Storage Service...
In order to correctly get i3 after 2 “inc” operations:
There can be only ONE successful “write” operation
to a given version of “i” (in our case: i2).
Generalization:
One value (one version of a variable) must not be
modified any more after it is determined (the client received
“OK” and believes it is stored).
How to define “determined”?
How to avoid changing a “determined” value?
Determine a Value
[Diagram: X asks “Any value set?”, is told “No”, and writes X to two of the three nodes. Y then asks “Any value set?”, is told “Yes”, and gives up.]
Solution: Before writing a value, run a Quorum Read
round to check whether a value already exists (or may exist).
Determine a Value.
[Diagram: X and Y both ask “Any value set?” before either has written, and both are told “No”. Each then writes its own value to a quorum.]
But both X and Y would believe there is no value set.
X and Y will both start to write at the same time.
→ Lost Update
Determine a Value..
Solution improved: remember who did the last read, and
deny writes from previous readers.
[Diagram: X asks “Any value set?” using a Quorum Read+Write, so nodes 1 and 2 remember X as the last reader and answer “No”; now nodes 1 and 2 will only accept requests from X. Y then does the same on nodes 2 and 3, which remember Y as the last reader and answer “No”; now nodes 2 and 3 will only accept requests from Y.]
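The "remember the last reader" rule can be sketched as below. This is a simplified model of the idea on the slide, not Paxos itself; the Node class and both helper functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    last_reader: Optional[str] = None
    value: Optional[str] = None


def read_and_mark(nodes: List[Node], targets: List[int], reader: str) -> bool:
    """Quorum Read+Write: record the reader on each node; True if a value is already set."""
    found = False
    for i in targets:
        if nodes[i].value is not None:
            found = True
        nodes[i].last_reader = reader
    return found


def write(nodes: List[Node], targets: List[int], writer: str, value: str) -> int:
    """Write only where the writer is still the last reader; return the number of accepts."""
    accepted = 0
    for i in targets:
        if nodes[i].last_reader == writer:
            nodes[i].value = value
            accepted += 1
    return accepted
```

If X reads nodes 1 and 2 and Y then reads nodes 2 and 3, X's later write is accepted by one node only (fewer than a quorum of 2), while Y's write is accepted by both of its nodes.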
Determine a Value...
By applying this policy, a value (each version of “i” in our
case) can be stored safely and consistently.
Leslie Lamport wrote a paper on this policy.
Paxos
What is Paxos
● A reliable storage based on Quorum RW.
● Each Paxos instance stores only 1 value.
● 2 rounds are required to determine 1 value.
● A value can’t be modified after it is determined.
● “Determined” means accepted by a
quorum (> n/2).
● Immediate consistency.
Paxos
Classic Paxos
2 rounds per instance.
Multi Paxos
~1 round per instance.
Fast Paxos
1 round per instance ( without conflict ).
2 rounds per instance ( with conflict ).
Paxos: Precondition
Storage must be reliable:
No Data loss
/* Or it falls back to Byzantine Paxos */
Must tolerate:
Message loss
Messages arriving in arbitrary order
Paxos: Concepts
Proposer: a process that starts a Paxos round to write a value.
Acceptor: a process that receives and stores messages.
Quorum (of Acceptors): n/2+1 Acceptors.
Round: consists of 2 phases, Phase-1 and Phase-2.
Round Number (rnd):
ID of a round.
Monotonically increasing; last-win; universally unique.
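One common way to obtain round numbers that are both increasing and unique is to pair a local counter with a proposer ID and compare the pairs lexicographically. The slides do not say how rnd is generated, so the scheme and the RoundGenerator name below are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

Rnd = Tuple[int, int]  # (counter, proposer_id), compared lexicographically


@dataclass
class RoundGenerator:
    proposer_id: int  # assumed to be unique per Proposer
    counter: int = 0

    def next_rnd(self, seen: Rnd = (0, 0)) -> Rnd:
        """Return a rnd greater than any rnd this Proposer has issued or seen."""
        self.counter = max(self.counter, seen[0]) + 1
        return (self.counter, self.proposer_id)
```

Two Proposers never produce the same rnd because their IDs differ, and passing the largest rnd seen in a refusal lets a Proposer jump ahead of it.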
Paxos: Concepts.
Last Round Number (last_rnd):
The greatest rnd an Acceptor has ever seen;
it identifies the Proposer from which the Acceptor will
accept a write request.
Value (v): the value an Acceptor has accepted.
Value round number (vrnd):
The round in which the Acceptor accepted v.
Value determined:
A value that has been accepted by a quorum of Acceptors.
Illustration of Acceptor
In the following slides, an Acceptor has 3 attributes
saved on it: last_rnd, v and vrnd.
Example: “5,x3” means last_rnd=5, v=x, vrnd=3.
Paxos: Classic - phase 1
[Diagram: Proposer X sends phase-1 with rnd=1 to Acceptors 1 and 2; each replies last_rnd=0, v=nil, vrnd=0 and now stores last_rnd=1. Acceptor 3 is untouched.]
When an Acceptor receives a phase-1 request from a Proposer:
● Refuse the request if its rnd < last_rnd.
● Save the rnd from the phase-1 request into its last_rnd.
● From now on, accept only phase-2 requests carrying this
last_rnd.
● Respond with the last_rnd, v and vrnd it held before this
request.
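The Acceptor side of phase 1 can be sketched in a few lines. This is a minimal in-memory model of the rules listed on this slide; the class layout and the None return for a refusal are choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    last_rnd: int = 0
    v: Optional[str] = None
    vrnd: int = 0

    def phase1(self, rnd: int) -> Optional[Tuple[int, Optional[str], int]]:
        """Handle a phase-1 request; return (last_rnd, v, vrnd) or None if refused."""
        if rnd < self.last_rnd:
            return None  # refuse requests from an older round
        reply = (self.last_rnd, self.v, self.vrnd)  # state held before this request
        self.last_rnd = rnd  # from now on only this rnd may write in phase 2
        return reply
```

A fresh Acceptor answers phase1(1) with (0, None, 0) and remembers last_rnd=1; a later phase1(2) raises last_rnd to 2, after which a request with rnd=1 is refused.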
Paxos: Classic - phase 1.
[Diagram: Proposer X sent phase-1 with rnd=1; Acceptors 1 and 2 each replied last_rnd=0, v=nil, vrnd=0.]
When the Proposer receives replies from Acceptors:
● If fewer than a quorum (n/2+1) of replies arrive, fail this round.
● If any reply carries last_rnd > rnd: discard this round.
● If any reply has a non-nil v: choose the v with the greatest vrnd.
● Otherwise: choose the v that the Proposer wants to write.
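The Proposer's handling of phase-1 replies can be sketched as a single function. It follows the rules on this slide; the function name, the reply tuple layout, and the use of None to signal a failed round are assumptions of the example (my_value is assumed to be non-nil).

```python
from typing import List, Optional, Tuple

Reply = Tuple[int, Optional[str], int]  # (last_rnd, v, vrnd) from one Acceptor


def choose_value(rnd: int, my_value: str, replies: List[Reply], n: int) -> Optional[str]:
    """Return the value to send in phase 2, or None if this round must be abandoned."""
    if len(replies) < n // 2 + 1:
        return None  # no quorum of replies
    if any(last_rnd > rnd for last_rnd, _, _ in replies):
        return None  # a newer round has been seen by some Acceptor
    accepted = [(vrnd, v) for _, v, vrnd in replies if v is not None]
    if accepted:
        # Respect an existing value: take the one accepted in the greatest round.
        return max(accepted, key=lambda pair: pair[0])[1]
    return my_value
```

With replies carrying v="y", vrnd=2 and v="x", vrnd=1, the function returns "y" regardless of what the Proposer wanted to write.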
Paxos: Classic - phase 2
[Diagram: Proposer X sends phase-2 with v="x", rnd=1 to Acceptors 1 and 2; both store v=x, vrnd=1 (state “1,x1”) and reply Accepted. Acceptor 3 is untouched.]
Proposer:
Send phase-2 with the v chosen in the previous step to the
Acceptors.
Paxos: Classic - phase 2.
[Diagram: Proposer X sends phase-2 with v="x", rnd=1; Acceptors 1 and 2 store v=x, vrnd=1 and reply Accepted.]
Acceptor:
● Accept a request only if its rnd equals the Acceptor's last_rnd.
last_rnd==rnd guarantees that no other Proposer has
touched this Acceptor since phase 1.
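The Acceptor side of phase 2 is equally short. This sketch uses the strict rule from this slide (rnd must equal last_rnd); the class and method names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    last_rnd: int = 0
    v: Optional[str] = None
    vrnd: int = 0

    def phase2(self, rnd: int, v: str) -> bool:
        """Accept v only if rnd is the round this Acceptor last promised to."""
        if rnd != self.last_rnd:
            return False  # another Proposer has touched this Acceptor
        self.v = v
        self.vrnd = rnd
        return True
```

An Acceptor with last_rnd=1 accepts v="x" for rnd=1 and ends in state “1,x1”; an Acceptor whose last_rnd has moved to 2 refuses the same request and keeps v=nil.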
Paxos: Case 1: Classic, no Conflict
[Diagram: Phase 1: Proposer X sends rnd=1 to Acceptors 1 and 2, which reply last_rnd=0, v=nil, vrnd=0. Phase 2: X sends v="x", rnd=1; both store v=x, vrnd=1 and reply Accepted. Acceptor 3 is never contacted.]
Paxos: Case 2.1: Resolve Conflict
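[Diagram, over time:
round=1, Phase 1 for X: X sends rnd=1; Acceptors 1 and 2 set last_rnd=1.
round=2, Phase 1 for Y: Y sends rnd=2; Acceptors 1 and 3 set last_rnd=2 (“OK, forget X”).
Phase 2: X sends v="x", rnd=1 and fails, because only Acceptor 2 still has last_rnd=1 and stores “1,x1”. Y sends v="y", rnd=2 and gets OK; Acceptors 1 and 3 store “2,y2”.]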
Paxos: Case 2.2: Respect Existing v
[Diagram, round=3:
Start: the Acceptors hold “2,y2”, “1,x1”, “2,y2”.
Phase 1: X sends rnd=3 to Acceptors 1 and 2, which now hold “3,y2” and “3,x1” and reply v="y",vrnd=2 and v="x",vrnd=1. X chooses 'y'.
Phase 2: X sends v="y", rnd=3 and gets OK; the Acceptors end as “3,y3”, “3,y3”, “3,y3”.]
v=“y” must be chosen by Proposer X because “y” may be a determined value and should not be overwritten, even though, without checking the 3rd Acceptor, we do not know whether “y” is actually determined (accepted by a quorum).
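Cases 2.1 and 2.2 can be replayed end to end with a small in-memory simulation. It is a sketch under assumptions: there is no real network, the Acceptors are list entries, phase 2 uses the strict rnd == last_rnd rule, and in round 3 X contacts only the two Acceptors it reached in phase 1. All names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Acceptor:
    last_rnd: int = 0
    v: Optional[str] = None
    vrnd: int = 0

    def phase1(self, rnd: int):
        if rnd < self.last_rnd:
            return None
        reply = (self.v, self.vrnd)
        self.last_rnd = rnd
        return reply

    def phase2(self, rnd: int, v: str) -> bool:
        if rnd != self.last_rnd:
            return False
        self.v, self.vrnd = v, rnd
        return True


def run_phase1(acc: List[Acceptor], targets: List[int], rnd: int, my_value: str) -> Optional[str]:
    replies = [r for r in (acc[i].phase1(rnd) for i in targets) if r is not None]
    if len(replies) < len(acc) // 2 + 1:
        return None
    accepted = [(vrnd, v) for v, vrnd in replies if v is not None]
    return max(accepted, key=lambda p: p[0])[1] if accepted else my_value


def run_phase2(acc: List[Acceptor], targets: List[int], rnd: int, v: str) -> bool:
    oks = sum(acc[i].phase2(rnd, v) for i in targets)
    return oks >= len(acc) // 2 + 1


def replay_cases() -> dict:
    acc = [Acceptor() for _ in range(3)]
    vx = run_phase1(acc, [0, 1], 1, "x")    # X, rnd=1, Acceptors 1 and 2
    vy = run_phase1(acc, [0, 2], 2, "y")    # Y, rnd=2, Acceptors 1 and 3
    x_ok = run_phase2(acc, [0, 1], 1, vx)   # only Acceptor 2 accepts: no quorum
    y_ok = run_phase2(acc, [0, 2], 2, vy)   # Acceptors 1 and 3 accept "y"
    v3 = run_phase1(acc, [0, 1], 3, "x")    # X retries with rnd=3 and sees y2 and x1
    x3_ok = run_phase2(acc, [0, 1], 3, v3)
    state = [(a.last_rnd, a.v, a.vrnd) for a in acc]
    return {"x_ok": x_ok, "y_ok": y_ok, "v3": v3, "x3_ok": x3_ok, "state": state}
```

In this replay X's first phase 2 fails, Y's succeeds, and X's third round writes "y" instead of "x", which is the behaviour the slide describes.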
Paxos........
Learner:
● Acceptors send a phase-3 message to Learners to announce
that a value has been determined.
● Most of the time a Proposer is also a Learner.
Livelock:
Proposers keep raising their rnd and overwriting each other's
last_rnd on Acceptors, so no phase-2 can complete
successfully.
Multi Paxos
Combine the phase-1 requests of multiple instances into one
message.
Send each phase-2 request separately.
Applications:
Chubby, ZooKeeper, Megastore, Spanner
Fast Paxos
● Proposers send phase-2 without sending phase-1.
● rnd in a Fast Paxos phase-2 is 0.
rnd=0 because it must be lower than any Classic rnd,
so a Proposer can fall back to Classic Paxos safely.
● An Acceptor accepts a fast phase-2 only when its v=nil.
● If a conflict happens, the Proposer should fall back to Classic
Paxos with a rnd > 0.
Is Fast Paxos as cheap as Classic Paxos?
Fast Paxos Quorum
[Diagram: 5 Acceptors. X sends a fast phase-2 (rnd=0) and gets OK from 3 Acceptors, which store “0,x0”. Y sends a fast phase-2 (rnd=0), is accepted by only 2 of 5 (“0,y0”), and fails.]
If the Fast Paxos quorum is n/2+1 = 3:
When Y finds a conflict and falls back to Classic Paxos,
there is no way for Y to know whether x0 or y0 is a determined value.
Solution: an undetermined value must not occupy half of the n/2+1 Acceptors:
→ Fast quorum > n*¾;
→ A value is determined in a Fast Round if it is accepted by n*¾+1 Acceptors.
Fast Paxos Quorum.
Fast Paxos Quorum = n*¾+1
Availability is lower because Fast Paxos requires
more Acceptors to be working.
Fast Paxos requires at least 5 Acceptors in order to tolerate
one failed Acceptor.
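The quorum sizes can be compared numerically. The sketch reads the slides' "n*¾+1" with integer division, which is an interpretation, and the function names are made up for the example.

```python
def classic_quorum(n: int) -> int:
    """Classic quorum from the slides: n/2+1."""
    return n // 2 + 1


def fast_quorum(n: int) -> int:
    """Fast quorum as stated on the slides: n*3/4+1 (integer division assumed)."""
    return n * 3 // 4 + 1


def tolerated_failures(n: int, quorum: int) -> int:
    """Number of Acceptors that may be down while a quorum can still be formed."""
    return n - quorum
```

With 5 Acceptors the classic quorum is 3 and the fast quorum is 4, so a fast round tolerates one failure; with 4 Acceptors the fast quorum is all 4 and no failure is tolerated.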
Fast Paxos ⅘: Y has a Conflict
[Diagram, 5 Acceptors:
X sends a fast phase-2 (rnd=0) and gets OK; 4 Acceptors store “0,x0”.
Y sends a fast phase-2 (rnd=0); only 1 of 5 accepts (“0,y0”), so Y fails.
Y falls back to classic rnd=2. Phase 1 on 3 Acceptors: OK, "x".
Y's phase 2: OK, writes "x"; those 3 Acceptors end as “2,x2”.]
Y saw x0 on two of the 3 Acceptors it contacted.
Y must choose x0 because x0 might be a determined value.
y0 cannot be determined: even if the other two
untouched Acceptors both hold y0, there are not enough y0
to form a fast quorum (> 5*¾).
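The counting argument Y uses can be written down directly: a value seen c times might be determined only if c plus the number of uncontacted Acceptors reaches the fast quorum. The sketch below is a simplification that looks only at fast-round values (vrnd=0) and assumes the fast quorum is n*3/4+1 with integer division; the names are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def fast_quorum(n: int) -> int:
    return n * 3 // 4 + 1


def choose_after_fast_conflict(seen: List[Optional[str]], n: int, my_value: str) -> str:
    """Pick the value for the classic round after a fast-round conflict.

    seen holds the v reported by each contacted Acceptor (None for nil).
    """
    unseen = n - len(seen)
    counts = Counter(v for v in seen if v is not None)
    # A value may already be determined only if the untouched Acceptors could
    # complete a fast quorum for it.
    candidates = [v for v, c in counts.items() if c + unseen >= fast_quorum(n)]
    if candidates:
        # With a classic quorum of replies, at most one value can qualify.
        return candidates[0]
    return my_value
```

For the slide's case, choose_after_fast_conflict(["y", "x", "x"], 5, "y") returns "x": two x0 plus two untouched Acceptors could make 4, while one y0 plus two untouched makes only 3.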
Fast Paxos ⅘: X Y conflicts
[Diagram, 5 Acceptors:
X sends a fast phase-2 (rnd=0): 3 Acceptors store “0,x0”. Y sends a fast phase-2 (rnd=0): 2 Acceptors store “0,y0”. Both detect a conflict.
X falls back to classic rnd=1. Phase 1: OK, sees only "x".
Y falls back to classic rnd=2. Phase 1: OK, chooses "y".
Y's phase 2 succeeds: all 5 Acceptors end as “2,y2”.
X fails in phase 2.]
Note
In phase-2, it is also correct if the Acceptor accepts a
request with rnd >= last_rnd.
Q&A
Thanks
drdr.xp@gmail.com
http://drmingdrmer.github.io
weibo.com: @drdrxp
