Replica sets in MongoDB are built on consensus algorithms for leader election and operation-log replication. We will start with an overview of MongoDB's replication architecture, with a special focus on the leader election protocol. We will then dive into some of the changes coming in the 3.2 release and discuss how they lead to faster elections and more robust systems, and how they lay the groundwork for new replication features going forward.
2. Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
4. What is Consensus?
• Getting multiple processes/servers to agree on something
• Must handle a wide range of failure modes:
  • Disk failures
  • Network partitions
  • Machine freezes
  • Clock skew
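To make "agree on something" concrete, here is a minimal sketch (not from the talk) of the majority-quorum rule most consensus protocols build on: a decision counts only once a strict majority of all nodes, not just the reachable ones, has accepted it.

```python
def has_quorum(acks: int, total_nodes: int) -> bool:
    """Safe only with a strict majority of ALL nodes, so two
    disjoint quorums can never both decide during a partition."""
    return acks > total_nodes // 2

# A 5-node set partitioned 3/2: only the 3-node side can decide.
assert has_quorum(3, 5)
assert not has_quorum(2, 5)
```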
6. Leader Based Consensus
[Diagram: three nodes, each holding an identical replicated log (X ← 1, Y ← 2, X ← 3) and the identical state machine it produces (X = 3, Y = 2, Z = 7). The leader appends entries to the log; the followers replay them.]
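A minimal sketch of the replicated-state-machine idea in the diagram (illustrative, not MongoDB's implementation): every node applies the same log of assignments in the same order, so every node ends up with an identical state machine. (The diagram's Z = 7 would come from earlier log entries not shown.)

```python
# Each log entry is a deterministic operation: "set key to value".
log = [("X", 1), ("Y", 2), ("X", 3)]

def apply_log(entries):
    """Applying the same entries in the same order always yields
    the same state: the core replicated-state-machine guarantee."""
    state = {}
    for key, value in entries:
        state[key] = value
    return state

# Three "nodes" replaying the same log converge on the same state.
nodes = [apply_log(log) for _ in range(3)]
assert nodes[0] == nodes[1] == nodes[2] == {"X": 3, "Y": 2}
```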
7. Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
8–12. Elections
[Diagram sequence spanning five slides: a five-node replica set, each node shown with its state machine and replicated log, walks through an election. Per the speaker notes: one node nominates itself and contacts the others, runs a pre-election check ("will you vote for me?"), holds the election proper ("do you vote for me?"), and finally notifies the set that it is now primary. Nodes might vote no if a primary already exists or the candidate lags too far behind in replication.]
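The speaker notes (reproduced at the end) spell out the flow these slides animate: a node nominates itself, runs a pre-election check, holds the binding election, and then announces itself as primary. Below is a hypothetical sketch of that two-phase flow; the class and function names are illustrative, not MongoDB's actual code.

```python
class Peer:
    """Illustrative voter; the refusal reasons mirror the speaker notes."""
    def __init__(self, sees_primary=False, candidate_lagging=False):
        self.sees_primary = sees_primary
        self.candidate_lagging = candidate_lagging

    def would_vote(self):
        # "Will you vote for me?": the non-binding pre-election check.
        return not (self.sees_primary or self.candidate_lagging)

    def vote(self):
        # "Do you vote for me?": the binding vote.
        return self.would_vote()

def run_election(peers, total_nodes):
    """The candidate counts its own vote plus yes-votes from peers;
    it wins (and announces "I am now primary") with a majority."""
    majority = total_nodes // 2 + 1

    # Phase 1: pre-election check; abort cheaply if it cannot win.
    if 1 + sum(p.would_vote() for p in peers) < majority:
        return False

    # Phase 2: the real election.
    return 1 + sum(p.vote() for p in peers) >= majority

# Two healthy peers in a 3-node set: the candidate wins.
assert run_election([Peer(), Peer()], total_nodes=3)
# Both peers refuse (existing primary, replication lag): no election win.
assert not run_election([Peer(sees_primary=True),
                         Peer(candidate_lagging=True)], total_nodes=3)
```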
13. Data Replication
[Diagram: the primary records each write as an entry in its replicated log (X ← 1, Y ← 2, X ← 3) and the secondaries pull and apply the same entries, converging on the same state (X = 3, Y = 2, Z = 7).]
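A hedged sketch of the pull model pictured above, assuming (per the speaker notes) that the oplog is a logical description of each change: a secondary remembers the last log position it applied and fetches everything newer from its sync source. Names are illustrative.

```python
class Node:
    def __init__(self):
        self.log = []          # list of (index, op) pairs
        self.applied_to = 0    # index of the last applied entry

    def append(self, op):
        """Primary side: record a write as a logical log entry."""
        self.log.append((len(self.log) + 1, op))

    def pull_from(self, source):
        """Secondary side: fetch and apply, in order, every entry
        newer than our last applied position."""
        for index, op in source.log:
            if index > self.applied_to:
                self.log.append((index, op))
                self.applied_to = index

primary, secondary = Node(), Node()
for op in [("X", 1), ("Y", 2), ("X", 3)]:
    primary.append(op)
secondary.pull_from(primary)
assert secondary.log == primary.log
```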
15. Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
  • Goals and inspiration from the Raft consensus algorithm
  • Preventing double voting
  • Monitoring node status
  • Calling for elections
16. Goals for MongoDB 3.2
• Decrease failover time
• Speed up detection and resolution of false primary situations
17. Finding Inspiration in Raft
• “In Search of an Understandable Consensus Algorithm” by Diego Ongaro and John Ousterhout: https://ramcloud.stanford.edu/raft.pdf
• Designed to address the shortcomings of Paxos:
  • Easier to understand
  • Easier to implement in real applications
  • Provably correct
• Remarkably similar to what we’re doing already
18. Raft Concepts
• Term (election) IDs
• Monitoring node status using existing data replication channel
• Asymmetric election timeouts
19. Preventing Double Voting
• A node can’t vote for two candidates in the same election
• Pre-3.2: a 30-second vote timeout after each vote
• Post-3.2: term IDs
• Term:
  • A monotonically increasing ID
  • Incremented on every election *attempt*
  • Lets voters distinguish elections, so a node can vote again quickly in a different election (see the sketch below)
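A minimal sketch of the term rule (illustrative, not MongoDB's code): a voter records the highest term it has voted in and grants at most one vote per term, so it can vote again immediately in a later attempt instead of sitting out a 30-second timeout.

```python
class Voter:
    def __init__(self):
        self.last_voted_term = 0   # highest term this node has voted in

    def grant_vote(self, term: int) -> bool:
        """At most one 'yes' per term: a second request in the same
        term is refused, but a request in a newer term succeeds
        immediately, with no 30-second cooldown."""
        if term <= self.last_voted_term:
            return False
        self.last_voted_term = term
        return True

v = Voter()
assert v.grant_vote(term=5)        # first election attempt
assert not v.grant_vote(term=5)    # same term: would be a double vote
assert v.grant_vote(term=6)        # next attempt: can vote right away
```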
20. Monitoring Node Status
• Pre-3.2: heartbeats
  • Sent every two seconds from every node to every other node
  • Volume increases quadratically as nodes are added to the replica set (see the sketch below)
• Post-3.2: extra metadata sent via the existing data replication channel
  • Takes advantage of chained replication
• Faster heartbeats = faster elections and faster detection of false primaries
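A back-of-the-envelope illustration (not from the talk) of the quadratic-growth bullet: with all-to-all heartbeats, each of n nodes pings the other n − 1 every interval.

```python
def heartbeats_per_interval(n: int) -> int:
    """All-to-all heartbeats: each of n nodes pings the other n - 1."""
    return n * (n - 1)

for n in (3, 5, 7):
    print(n, "nodes ->", heartbeats_per_interval(n), "heartbeats / 2s")
# 3 nodes -> 6, 5 nodes -> 20, 7 nodes -> 42: quadratic growth, which
# is why 3.2 piggybacks status metadata on the replication channel
# instead of relying on heartbeat frequency alone.
```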
21. Data Replication
[Diagram: data replication with chaining: one secondary syncs from another secondary instead of directly from the primary, reducing bandwidth across long hops and read load on the primary.]
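A hypothetical sketch of why chaining helps: a secondary may sync from the nearest node that is ahead of it rather than always from the primary, cutting bandwidth across long hops and read load on the primary. The selection rule and fields below are illustrative, not MongoDB's exact algorithm.

```python
def choose_sync_source(me, candidates):
    """Pick the lowest-ping node that is further ahead in the log
    than we are; the primary is no longer the only option."""
    eligible = [c for c in candidates if c["applied"] > me["applied"]]
    return min(eligible, key=lambda c: c["ping_ms"]) if eligible else None

me = {"applied": 10, "ping_ms": 0}
candidates = [
    {"name": "primary",     "applied": 15, "ping_ms": 90},  # remote DC
    {"name": "secondary-b", "applied": 14, "ping_ms": 2},   # same rack
]
assert choose_sync_source(me, candidates)["name"] == "secondary-b"
```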
22. Determining When To Call For An Election
• A tradeoff between failover time and spurious failovers
• A node calls for an election when it hasn’t heard from the primary within the election timeout
• Starting in 3.2:
  • The election timeout is configurable
  • The election timeout is varied randomly on each node
• Varying the timeouts helps reduce tied votes
• Fewer tied votes = faster failover (see the sketch below)
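Two hedged sketches: first, the Raft-style randomized timeout described above (the speaker notes float a base under one second, around 500 ms, plus a small random offset); second, one way the configurable timeout could be set in 3.2 through the replica set configuration's `settings.electionTimeoutMillis` field, using pymongo's generic `command` helper against a running replica set. The values are illustrative.

```python
import random
from pymongo import MongoClient

def election_timeout_ms(base_ms: float = 500, jitter_ms: float = 100) -> float:
    """Each node draws base + random jitter, so two nodes rarely
    time out at the same instant and split the vote."""
    return base_ms + random.uniform(0, jitter_ms)

# Adjusting the configurable timeout (requires a running replica set;
# the 10-second value here is illustrative).
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
config = client.admin.command("replSetGetConfig")["config"]
config["version"] += 1
config.setdefault("settings", {})["electionTimeoutMillis"] = 10000
client.admin.command("replSetReconfig", config)
```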
23. Conclusion
• MongoDB 3.2 will have:
  • Faster failovers
  • Faster error detection
  • More control to prevent spurious failovers
• This means your systems are:
  • More stable
  • More resilient to failure
  • Easier to maintain
Speaker notes
• Requiring every write to involve all nodes is like direct democracy: a lot of network traffic and poor performance. This is like pure Paxos.
• Leader = primary. Leader-based replication is what most real-world deployments use.
• In MongoDB, the oplog is a logical transformation of the data.
• Election flow: one node nominates itself and contacts the others; pre-election check (“will you vote for me?”); the election itself (“do you vote for me?”); election notifications (“I am now primary”). Nodes might vote no if a primary already exists or the candidate’s replication lag is too high.
• Chaining: less bandwidth across long hops, less read load on the primary.
• Question break.
• Define: failover time (failover time > election time), false primaries, rollbacks.
• The Raft paper was presented at USENIX ATC 2014 by Diego Ongaro (then a Stanford PhD student) and John Ousterhout (his advisor). Raft is provably correct.
• Paxos dates to 1989 and is famously hard to use, hence follow-ups like “Paxos Made Simple” and “Paxos Made Practical”.
• Nodes continue sending heartbeats in 3.2 (still needed for initial sync source selection and spanning-tree maintenance), just less frequently now.
• Define “election timeout”: it is *not* a time limit on how long an election may run. One possibility: a base timeout under one second (~500 ms), plus a few milliseconds of random variation.