Cornelia Davis (Sr Director of Technology, Pivotal); Meaghan Kjelland (Software Engineer, Google); Erin Schnabel (Senior Technical Staff Member, IBM); Therese Stowell (Staff Product Manager, Pivotal); and Mathangi Venkatesan (Senior Member of Technical Staff, VMware) speak on the Main Stage at SpringOne Platform 2017.
12. Where is Leader Election Used?
• Identical processes that need coordination
• Resource sharing
• Decomposable problems
• No natural choice for leader
21. Why Paxos?
What do distributed systems achieve?
• High availability
• Eliminate single point of failure
How?
• Replicated servers in a cluster
• Consensus module
- Consistency of data across replicas
- Same order of commands applied to every replica (see the sketch below)
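To make the "same order of commands" point concrete, here is a minimal sketch, assuming the consensus module is mocked as an already-agreed command log; the Replica class, agreed_log, and the commands are illustrative, not from the talk. Replicas that apply identical commands in identical order end up with identical state.

```python
# A minimal sketch (not from the talk) of why "same order of commands"
# keeps replicas consistent. The consensus module is mocked as a shared,
# already-agreed log; each replica applies that log in order.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}          # replicated key/value state

    def apply(self, command):
        op, key, value = command
        if op == "set":
            self.data[key] = value
        elif op == "delete":
            self.data.pop(key, None)

# Pretend the consensus module produced this agreed order of commands.
agreed_log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]

replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
for command in agreed_log:
    for r in replicas:
        r.apply(command)        # same commands, same order -> same state

assert all(r.data == replicas[0].data for r in replicas)
```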
22. Paxos Consensus Protocol
Safety rules
• System must agree on a value as “chosen” value
• Only one value is ever chosen
• A value is chosen once it is accepted by a majority of servers (a quorum; see the sketch below)
Actors
• Proposers
• Acceptors
• Learners
* https://angus.nyc/2012/paxos-by-example/
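A tiny illustration of that quorum rule, with made-up names (is_chosen, the acceptor ids) and none of the actual message exchange: a value counts as chosen once a strict majority of acceptors have accepted it.

```python
# A tiny illustration (not the full protocol) of the quorum rule:
# a value is "chosen" once a majority of acceptors have accepted it.

def is_chosen(value, accepted_by, num_acceptors):
    """accepted_by maps acceptor id -> value it has accepted (or None)."""
    votes = sum(1 for v in accepted_by.values() if v == value)
    return votes > num_acceptors // 2          # strict majority = quorum

accepted = {"a1": "X", "a2": "X", "a3": None, "a4": "Y", "a5": "X"}
print(is_chosen("X", accepted, 5))   # True: 3 of 5 acceptors accepted "X"
print(is_chosen("Y", accepted, 5))   # False
```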
23. Problems to solve
• Agree on one value
• Do not allow multiple chosen values
How can this be solved?
• Reject competing proposals on a chosen value: proposal numbers
• Two-phase approach (see the sketch below)
✓ Prepare (find out any chosen value, block older proposals)
✓ Accept
Proposal number = (global sequence number, unique server ID)
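Below is a condensed, hedged sketch of that two-phase flow, assuming a single value to choose, in-process calls instead of real messages, and no learners; the Acceptor and propose names are illustrative. Proposal numbers are (global sequence number, unique server id) pairs, so Python's tuple comparison gives the required ordering.

```python
# A condensed sketch of single-decree Paxos: Prepare then Accept.

class Acceptor:
    def __init__(self):
        self.promised = None          # highest proposal number promised
        self.accepted = None          # (number, value) last accepted, if any

    def prepare(self, n):
        # Phase 1: promise to ignore proposals older than n,
        # and report any value already accepted.
        if self.promised is None or n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        # Phase 2: accept unless a newer proposal has been promised.
        if self.promised is None or n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, number, value):
    # Phase 1: gather promises from a majority.
    promises = [a.prepare(number) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None                                   # no quorum, retry later
    # If any acceptor already accepted a value, we must re-propose that value.
    prior = [p[1] for p in granted if p[1] is not None]
    if prior:
        value = max(prior)[1]                         # value with highest number
    # Phase 2: ask acceptors to accept (number, value).
    acks = sum(1 for a in acceptors if a.accept(number, value))
    return value if acks > len(acceptors) // 2 else None


acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, (1, "server-A"), "blue"))    # 'blue' is chosen
print(propose(acceptors, (2, "server-B"), "green"))   # still 'blue': already chosen
```

The second call illustrates the safety rule: once a majority has accepted "blue", a later proposer must discover and re-propose it, so no second value can ever be chosen.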
29. You might have heard…
Google owns their own fibre
Google gets redundancy
(redundant fibre, redundant switches, redundant everything)
Google gets failure boundaries
(limits the blast radius)
So arguably, for much of the time, Spanner is AC!!!!
34. Two Phase Commit
Technically correct: Failures can be nasty
Worst case:
RMs blocked waiting for TM!
[Diagram: three RMs (resource managers) and a TM (transaction manager). One RM reports Prepared (1); the TM sends Prepare to the other RMs (1a); they answer Prepared (1b); the TM then sends Commit (2) and the RMs respond Committed (2b).]
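The worst case is easier to see in code. A hedged sketch, with in-process stand-ins for the RMs and TM and no durable log: once an RM votes prepared it holds its locks until it hears the decision, so a TM crash at that point leaves every RM blocked.

```python
# Toy two-phase commit, plus its blocking failure mode: prepared RMs
# cannot release their locks until the TM tells them the outcome.

class ResourceManager:
    def __init__(self, name):
        self.name = name
        self.state = "working"          # working -> prepared -> committed/aborted

    def prepare(self):
        self.state = "prepared"         # locks are held from here on
        return True                     # vote yes

    def finish(self, decision):
        self.state = decision           # "committed" or "aborted"; locks released


def two_phase_commit(tm_alive_after_prepare, rms):
    # Phase 1: TM asks every RM to prepare.
    votes = [rm.prepare() for rm in rms]
    if not tm_alive_after_prepare:
        # TM crashed after collecting votes: RMs stay "prepared" indefinitely.
        return "blocked"
    # Phase 2: TM broadcasts the decision.
    decision = "committed" if all(votes) else "aborted"
    for rm in rms:
        rm.finish(decision)
    return decision


rms = [ResourceManager(n) for n in ("rm1", "rm2", "rm3")]
print(two_phase_commit(tm_alive_after_prepare=True, rms=rms))    # committed
rms = [ResourceManager(n) for n in ("rm1", "rm2", "rm3")]
print(two_phase_commit(tm_alive_after_prepare=False, rms=rms))   # blocked
print([rm.state for rm in rms])                                  # all 'prepared'
```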
35. Paxos Commit:
2PC without the horrible failure mode
ATOMICITY
(safely all or nothing)
But.. Async?! (!!)
What about ISOLATION?
But what if we used Paxos for consensus?
[Diagram (WARNING: OVERSIMPLIFICATION): the client issues Prepare as a Paxos request (1); the replicas reach consensus and report Prepared (2); Commit is issued (3); the client is told DONE (4).]
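A very rough sketch of the Paxos Commit idea, with a toy stand-in for consensus (a simple majority over coordinator replicas) rather than the real protocol: because the commit/abort decision is chosen by a quorum instead of held by a single TM, losing any one coordinator no longer blocks the prepared RMs.

```python
# Toy stand-in for Paxos Commit's key property: the decision survives
# as long as a majority of coordinator replicas recorded it.

def consensus_decide(replicas_alive, proposed_decision):
    # The decision is durable once a majority of coordinator replicas hold it;
    # any surviving replica can then report the outcome to the RMs.
    votes = sum(1 for alive in replicas_alive if alive)
    if votes > len(replicas_alive) // 2:
        return proposed_decision
    return None                       # no quorum: outcome genuinely unknown

# Three coordinator replicas; one crashes after the RMs prepared.
print(consensus_decide([True, False, True], "commit"))   # 'commit' still known
# With a single TM (one "replica"), the same crash would block everyone.
print(consensus_decide([False], "commit"))               # None -> blocked
```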
36. TrueTime
“Global wall clock time” with bounded uncertainty — you can wait it out!
TrueTime is its own distributed system:
GPS and atomic-clock time masters in every data center
Client contacts several, computes reference [earliest, latest] = now ± ε
TrueTime — monotonically increasing timestamps
[Diagram: TT.now() returns an interval [earliest, latest] whose width is 2ε]
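A hypothetical sketch of what a TrueTime-style API looks like to a caller; the names (TTInterval, tt_now, EPSILON) and the uncertainty bound are made up for illustration, not Spanner's actual interface. now() returns an interval of width 2ε that is guaranteed to contain true wall-clock time.

```python
# Hypothetical TrueTime-style interface: time as an interval, not a point.

import time
from dataclasses import dataclass

EPSILON = 0.007   # assumed uncertainty bound in seconds (illustrative only)

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now():
    t = time.time()
    return TTInterval(earliest=t - EPSILON, latest=t + EPSILON)

iv = tt_now()
print(f"true time lies somewhere in [{iv.earliest:.3f}, {iv.latest:.3f}]")
```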
37. TrueTime
Invariant: timestamp order == commit order; ISOLATION & external consistency
All transaction logs for the commit have the same timestamp
There it is: Two-phase commit for a large-scale distributed system
Commit wait: pick s = TT.now().latest, then wait until TT.now().earliest > s
[Diagram: transaction timeline T: acquire locks → start consensus → achieve consensus → commit wait → notify participants → release locks]
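A sketch of commit wait under those assumptions (same toy tt_now() interval API as above, made-up names): the transaction picks s = TT.now().latest as its commit timestamp and keeps its locks until TT.now().earliest > s, roughly 2ε later, which is what makes timestamp order match commit order.

```python
# Toy commit wait: hold locks until the chosen timestamp is safely in the past.

import time

EPSILON = 0.007                      # illustrative uncertainty bound, seconds

def tt_now():
    t = time.time()
    return {"earliest": t - EPSILON, "latest": t + EPSILON}

def commit_with_wait():
    s = tt_now()["latest"]           # commit timestamp chosen at commit point
    while tt_now()["earliest"] <= s: # commit wait: roughly 2*epsilon
        time.sleep(0.001)
    # Only now notify participants and release locks: s is safely in the past,
    # so any later transaction anywhere gets a strictly larger timestamp.
    return s

print(commit_with_wait())
```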
52. Couple
the rarity of network partitions
with
common distributed systems algorithms
and
a new view of time
and you get a
globally distributed,
highly available,
ACID-compliant database