Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
How Raft consensus algorithm will make
replication even better in MongoDB 3.2
Speaker: Henrik Ingo, @h_ingo
Author: Spence...
Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• I...
Why use Replication?
• Data redundancy
• High availability
• Geographical distribution
• Latency
What is Consensus?
• Getting multiple processes/servers to agree
on state machine state
• Must handle a wide range of fail...
Basic consensus
State
Machine
X 3
Y 2
Z 7
State
Machine
X 3
Y 2
Z 7
State
Machine
X 3
Y 2
Z 7
Leader Based Consensus
State
Machine
X 3
Y 2
Z 7
Replicated
Log
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
State
Machine
X 3
Y 2
Z 7
Replicated
...
Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• I...
Elections
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X...
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅...
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅...
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅...
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅...
Key requirements for elections
• There must be exactly 1 leader at any time
• Majority election: there can only be 1 major...
Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• I...
Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• I...
Goals for MongoDB 3.2
• Decrease failover time
• Speed up detection and resolution of false
primary situations
Finding Inspiration in Raft
• “In Search of an Understandable Consensus
Algorithm” by Diego Ongaro:
https://ramcloud.stanf...
Raft Concepts
• Term (election) IDs
• Heartbeats using existing data replication
channel
• Asymmetric election timeouts
Preventing Double Voting
• Can’t vote for 2 nodes in the same election
• Pre-3.2: 30 second vote timeout
• Post-3.2: Term ...
MongoDB 3.0 OpLog entry
> db.oplog.rs.find().sort( { $natural : -1 }
).limit(1).pretty()
{
"ts" : Timestamp(1444466011, 1)...
MongoDB 3.2 OpLog entry
> db.oplog.rs.find().sort( { $natural : -1 }
).limit(1).pretty()
{
"ts" : Timestamp(1445632081, 86...
Heartbeats
• MongoDB 3.0: Heartbeats
• Sent every two seconds from every node to every
other node
• Volume increases quadr...
Daisy chaining
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
...
Election timeouts
• Tradeoff between failover time and spurious
failovers
• Node calls for an election when it hasn’t hear...
Conclusion
• MongoDB 3.2 will have
• Faster failovers
• Faster error detection
• More control to prevent spurious failover...
Questions?
Upcoming SlideShare
Loading in …5
×

How Raft consensus algorithm will make replication even better in MongoDB 3.2 / Henrik Ingo (MongoDB)

1,072 views

Published on

MongoDB exhibits a fairly classic leader-based replication architecture, with builtin automatic provisioning and failovers. Since it was developed, an academic paper from Stanford has introduced a distributed consensus algorithm called Raft — which happens to be rather similar to the home grown algorithm of MongoDB. This talk will give a brief overview of Raft and how we are retrofitting some of it into MongoDB 3.2 to make our replication even more robust.

Published in: Engineering
  • Be the first to comment

How Raft consensus algorithm will make replication even better in MongoDB 3.2 / Henrik Ingo (MongoDB)

  1. 1. How Raft consensus algorithm will make replication even better in MongoDB 3.2 Speaker: Henrik Ingo, @h_ingo Author: Spencer T Brody, @stbrody
  2. 2. Agenda • Introduction to consensus • Leader-based replicated state machine • Elections and data replication in MongoDB • Improvements coming in MongoDB 3.2
  3. 3. Why use Replication? • Data redundancy • High availability • Geographical distribution • Latency
  4. 4. What is Consensus? • Getting multiple processes/servers to agree on state machine state • Must handle a wide range of failure modes • Disk failure • Network partitions • Machine freezes • Clock skews • Bugs?
  5. 5. Basic consensus State Machine X 3 Y 2 Z 7 State Machine X 3 Y 2 Z 7 State Machine X 3 Y 2 Z 7
  6. 6. Leader Based Consensus State Machine X 3 Y 2 Z 7 Replicated Log X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 State Machine X 3 Y 2 Z 7 Replicated Log X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 State Machine X 3 Y 2 Z 7 Replicated Log X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  7. 7. Agenda • Introduction to consensus • Leader-based replicated state machine • Elections and data replication in MongoDB • Improvements coming in MongoDB 3.2
  8. 8. Elections X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  9. 9. Elections X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  10. 10. Elections X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  11. 11. Elections X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  12. 12. Elections X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  13. 13. Key requirements for elections • There must be exactly 1 leader at any time • Majority election: there can only be 1 majority • Note: 50% is not a majority! • Old leader must know to step down, even if disconnected
  14. 14. Agenda • Introduction to consensus • Leader-based replicated state machine • Elections and data replication in MongoDB • Improvements coming in MongoDB 3.2
  15. 15. Agenda • Introduction to consensus • Leader-based replicated state machine • Elections and data replication in MongoDB • Improvements coming in MongoDB 3.2 • Goals and inspiration from Raft Consensus Algorithm • Preventing double voting • Monitoring node status • Calling for elections
  16. 16. Goals for MongoDB 3.2 • Decrease failover time • Speed up detection and resolution of false primary situations
  17. 17. Finding Inspiration in Raft • “In Search of an Understandable Consensus Algorithm” by Diego Ongaro: https://ramcloud.stanford.edu/raft.pdf • Designed to address the shortcomings of Paxos • Easier to understand • Easier to implement in real applications • Provably correct • Remarkably similar to what we’re doing already
  18. 18. Raft Concepts • Term (election) IDs • Heartbeats using existing data replication channel • Asymmetric election timeouts
  19. 19. Preventing Double Voting • Can’t vote for 2 nodes in the same election • Pre-3.2: 30 second vote timeout • Post-3.2: Term IDs • Term: • Monotonically increasing ID • Incremented on every election *attempt* • Lets voters distinguish elections so they can vote twice quickly in different elections • Logical clock
  20. 20. MongoDB 3.0 OpLog entry > db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty() { "ts" : Timestamp(1444466011, 1), "h" : NumberLong("-6240522391332325619"), "v" : 2, "op" : "i", "ns" : "test.nulltest", "o" : { "_id" : 3, "a" : 3 } } millisecond + counter Random hash = gtid
  21. 21. MongoDB 3.2 OpLog entry > db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty() { "ts" : Timestamp(1445632081, 861), "t" : NumberLong(42), "h" : NumberLong("5466055178864103715"), "v" : 2, "op" : "u", "ns" : "ycsb.usertable", "o2" : { "_id" : "user645414720104877157" }, "o" : { "$set" : { "field4" : BinData(0,"KT5sN1wrPl...Ny9wIg==") } } millisecond + Raft counter Random hash = gtid Raft Term
  22. 22. Heartbeats • MongoDB 3.0: Heartbeats • Sent every two seconds from every node to every other node • Volume increases quadratically as nodes are added to the replica set • Odd behaviors with asymmetric network partitions, corner cases • MongoDB 3.2: Extra metadata sent via existing data replication channel • Utilizes chained replication • Faster heartbeats = faster elections and detection of false primaries
  23. 23. Daisy chaining X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  24. 24. Election timeouts • Tradeoff between failover time and spurious failovers • Node calls for an election when it hasn’t heard from the primary within the election timeout • Starting in 3.2: • Single election timeout • Election timeout is configurable • Election timeout is varied randomly for each node • Varying timeouts help reduce tied votes • Fewer tied votes = faster failover rs.config(). settings. electionTimeoutMillis
  25. 25. Conclusion • MongoDB 3.2 will have • Faster failovers • Faster error detection • More control to prevent spurious failovers • Robust also in corner cases • This means your systems are • More resilient to failure • More uptime • Easier to maintain
  26. 26. Questions?

×