How Raft consensus algorithm will make replication even better in MongoDB 3.2 / Henrik Ingo (MongoDB)

MongoDB has a fairly classic leader-based replication architecture, with built-in automatic provisioning and failover. Since that architecture was first developed, an academic paper from Stanford has introduced a distributed consensus algorithm called Raft, which happens to be rather similar to MongoDB's home-grown algorithm. This talk gives a brief overview of Raft and of how we are retrofitting parts of it into MongoDB 3.2 to make our replication even more robust.

  1. How Raft consensus algorithm will make replication even better in MongoDB 3.2
     Speaker: Henrik Ingo, @h_ingo
     Author: Spencer T Brody, @stbrody
  2. Agenda
     • Introduction to consensus
     • Leader-based replicated state machine
     • Elections and data replication in MongoDB
     • Improvements coming in MongoDB 3.2
  3. Why use Replication?
     • Data redundancy
     • High availability
     • Geographical distribution (see the setup sketch below)
     • Latency
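     As a minimal sketch of what such a deployment could look like, here is how a
     three-member replica set might be initiated from the mongo shell. The host
     names and the set name "rs0" are illustrative assumptions, not taken from the
     slides.

         // Run once on one member; the others join when the config is accepted.
         rs.initiate({
           _id: "rs0",                                            // assumed set name
           members: [
             { _id: 0, host: "mongo-dc1-a.example.net:27017" },   // hypothetical hosts
             { _id: 1, host: "mongo-dc1-b.example.net:27017" },
             { _id: 2, host: "mongo-dc2-a.example.net:27017" }    // remote DC for geographical distribution
           ]
         })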
  4. What is Consensus?
     • Getting multiple processes/servers to agree on state machine state
     • Must handle a wide range of failure modes
       • Disk failure
       • Network partitions
       • Machine freezes
       • Clock skews
       • Bugs?
  5. Basic consensus
     [Diagram: three nodes, each holding an identical state machine (X=3, Y=2, Z=7)]
  6. Leader Based Consensus
     [Diagram: three nodes, each holding the same replicated log (X ← 1, Y ← 2, X ← 3)
      feeding an identical state machine (X=3, Y=2, Z=7); see the sketch below]
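     A minimal sketch of the idea, in plain JavaScript rather than MongoDB code:
     every node applies the same ordered log of writes to its own copy of the
     state machine, so all copies converge to the same state.

         // Illustrative only: the replicated log from the diagram above.
         var log = [
           { op: "set", key: "X", value: 1 },
           { op: "set", key: "Y", value: 2 },
           { op: "set", key: "X", value: 3 }
         ];

         // Each node replays the log, in order, against its local state machine.
         function applyLog(stateMachine, entries) {
           entries.forEach(function (entry) {
             if (entry.op === "set") {
               stateMachine[entry.key] = entry.value;
             }
           });
           return stateMachine;
         }

         var nodeA = applyLog({}, log);   // { X: 3, Y: 2 }
         var nodeB = applyLog({}, log);   // identical state on every node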
  7. Agenda
     • Introduction to consensus
     • Leader-based replicated state machine
     • Elections and data replication in MongoDB
     • Improvements coming in MongoDB 3.2
  8–12. Elections
     [Diagram sequence: a five-node replica set, every node holding the same
      replicated log (X ← 1, Y ← 2, X ← 3) and state machine (X=3, Y=2, Z=7);
      successive slides step through one node calling for an election and the
      other nodes voting]
  13. Key requirements for elections
     • There must be exactly 1 leader at any time
     • Majority election: there can only be 1 majority
       • Note: 50% is not a majority! (see the check below)
     • Old leader must know to step down, even if disconnected
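     A minimal sketch of the majority rule, in plain JavaScript: a strict majority
     needs more than half of the voting members, which is why exactly 50% is not
     enough.

         // true only when strictly more than half of the voting members agree
         function hasMajority(votesReceived, votingMembers) {
           return votesReceived > votingMembers / 2;
         }

         hasMajority(2, 4);   // false: exactly 50% is not a majority
         hasMajority(3, 5);   // true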
  14. Agenda
     • Introduction to consensus
     • Leader-based replicated state machine
     • Elections and data replication in MongoDB
     • Improvements coming in MongoDB 3.2
  15. Agenda
     • Introduction to consensus
     • Leader-based replicated state machine
     • Elections and data replication in MongoDB
     • Improvements coming in MongoDB 3.2
       • Goals and inspiration from Raft Consensus Algorithm
       • Preventing double voting
       • Monitoring node status
       • Calling for elections
  16. Goals for MongoDB 3.2
     • Decrease failover time
     • Speed up detection and resolution of false primary situations
  17. Finding Inspiration in Raft
     • “In Search of an Understandable Consensus Algorithm” by Diego Ongaro:
       https://ramcloud.stanford.edu/raft.pdf
     • Designed to address the shortcomings of Paxos
       • Easier to understand
       • Easier to implement in real applications
       • Provably correct
     • Remarkably similar to what we’re doing already
  18. Raft Concepts
     • Term (election) IDs
     • Heartbeats using existing data replication channel
     • Asymmetric election timeouts
  19. Preventing Double Voting
     • Can’t vote for 2 nodes in the same election
     • Pre-3.2: 30 second vote timeout
     • Post-3.2: Term IDs
     • Term:
       • Monotonically increasing ID
       • Incremented on every election *attempt*
       • Lets voters distinguish between elections, so they can vote again quickly in a later election (see the sketch below)
       • Logical clock
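     A minimal sketch of how term IDs avoid both double voting and long vote
     timeouts, in plain JavaScript (illustrative only, not MongoDB internals):
     a voter remembers the highest term it has voted in, so it grants at most one
     vote per term yet can vote again immediately in a newer term.

         var lastVotedTerm = -1;   // highest term this node has voted in

         function grantVote(candidateTerm) {
           if (candidateTerm > lastVotedTerm) {
             lastVotedTerm = candidateTerm;   // remember the term, not a wall-clock timeout
             return true;                     // first and only vote in this term
           }
           return false;                      // already voted in this term (or a newer one)
         }

         grantVote(42);   // true: vote granted for term 42
         grantVote(42);   // false: no double voting within term 42
         grantVote(43);   // true: a new election attempt, vote granted immediately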
  20. MongoDB 3.0 OpLog entry

      > db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty()
      {
        "ts" : Timestamp(1444466011, 1),            // millisecond + counter
        "h" : NumberLong("-6240522391332325619"),   // random hash = gtid
        "v" : 2,
        "op" : "i",
        "ns" : "test.nulltest",
        "o" : { "_id" : 3, "a" : 3 }
      }
  21. MongoDB 3.2 OpLog entry

      > db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty()
      {
        "ts" : Timestamp(1445632081, 861),          // millisecond + counter
        "t" : NumberLong(42),                       // Raft term
        "h" : NumberLong("5466055178864103715"),    // random hash = gtid
        "v" : 2,
        "op" : "u",
        "ns" : "ycsb.usertable",
        "o2" : { "_id" : "user645414720104877157" },
        "o" : { "$set" : { "field4" : BinData(0,"KT5sN1wrPl...Ny9wIg==") } }
      }
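     The same term is also visible in the replica set status. A quick way to check
     it from the shell on a 3.2+ member (assuming a running replica set):

         rs.status().term   // current election term, incremented on every election attempt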
  22. Heartbeats
     • MongoDB 3.0: Heartbeats
       • Sent every two seconds from every node to every other node (see the config snippet below)
       • Volume increases quadratically as nodes are added to the replica set
       • Odd behaviors with asymmetric network partitions, corner cases
     • MongoDB 3.2: Extra metadata sent via existing data replication channel
       • Utilizes chained replication
       • Faster heartbeats = faster elections and detection of false primaries
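     The classic heartbeat interval can be inspected in the replica set
     configuration; the two-second value in the comment matches the slide, but
     check your own deployment:

         rs.conf().settings.heartbeatIntervalMillis   // e.g. 2000, i.e. two seconds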
  23. Daisy chaining
     [Diagram: five nodes with the same replicated log (X ← 1, Y ← 2, X ← 3) and
      state machine (X=3, Y=2, Z=7), replicating in a chain rather than every node
      pulling directly from the primary]
  24. Election timeouts
     • Tradeoff between failover time and spurious failovers
     • Node calls for an election when it hasn’t heard from the primary within the election timeout
     • Starting in 3.2:
       • Single election timeout
       • Election timeout is configurable: rs.config().settings.electionTimeoutMillis (see the reconfig sketch below)
       • Election timeout is varied randomly for each node
       • Varying timeouts help reduce tied votes
       • Fewer tied votes = faster failover
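     A minimal sketch of changing that setting via reconfiguration; the 10000 ms
     value is only an example, not a recommendation:

         var cfg = rs.conf();                          // fetch the current configuration
         cfg.settings = cfg.settings || {};
         cfg.settings.electionTimeoutMillis = 10000;   // example value, in milliseconds
         rs.reconfig(cfg);                             // apply the new configuration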
  25. Conclusion
     • MongoDB 3.2 will have
       • Faster failovers
       • Faster error detection
       • More control to prevent spurious failovers
       • Robustness also in corner cases
     • This means your systems get
       • More resilience to failure
       • More uptime
       • Easier maintenance
  26. Questions?

Editor's Notes

  • Every write requires participation from all nodes – like direct democracy
    A lot of network traffic, poor performance
    This is like pure Paxos
  • Leader = Primary
    This is what most real-world deployments do
    In mongodb…
    Oplog = logical transformation of data
  • One node nominates itself and contacts others
    Pre-election check – will you vote for me?
    Election check – do you vote for me?
    Election notifications – I am now primary
  • Nodes might vote no: a primary already exists, or there is replication lag
  • Question break
  • Define:
    Failover time. Failover time > election time
    False primaries
    Rollbacks
  • Raft paper presented at USENIX 2014 by PhD student Diego Ongaro and professor John Ousterhout of Stanford University
    Provably correct
    Paxos - 1989
    Paxos hard to use – “Paxos made simple”, “Paxos made practical”
  • Continue sending heartbeats
    Initial sync source, spanning tree maintenance
    Even less frequent now
  • Define “election timeout” - *not* time limit of election duration
    One possibility: Base timeout <1second: ~500ms, plus a few milliseconds
