Replica sets in MongoDB are built on consensus algorithms for leader election and operation-log replication. We will start with an overview of MongoDB's replication architecture, with a special focus on the leader election protocol. We will then dive into some of the changes coming in the 3.2 release and discuss how they lead to faster elections and more robust systems, and how they lay the groundwork for new replication features going forward.
2. Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
4. What is Consensus?
• Getting multiple processes/servers to agree on something
• Must handle a wide range of failure modes:
  • Disk failures
  • Network partitions
  • Machine freezes
  • Clock skew
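To make "agree on something" concrete, here is a minimal sketch (not from the talk) of the majority-quorum rule most consensus protocols build on: a decision counts only once a strict majority of all nodes, not just the reachable ones, has accepted it.

```python
def has_quorum(acks: int, total_nodes: int) -> bool:
    """Safe only with a strict majority of ALL nodes, so two
    disjoint quorums can never both decide during a partition."""
    return acks > total_nodes // 2

# A 5-node set partitioned 3/2: only the 3-node side can decide.
assert has_quorum(3, 5)
assert not has_quorum(2, 5)
```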
6. Leader Based Consensus
[Diagram: three nodes, each holding an identical replicated log (X ← 1, Y ← 2, X ← 3) and the identical state machine it produces (X = 3, Y = 2, Z = 7). The leader appends entries to the log; the followers replay them.]
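A minimal sketch of the replicated-state-machine idea in the diagram (illustrative, not MongoDB's implementation): every node applies the same log of assignments in the same order, so every node ends up with an identical state machine. (The diagram's Z = 7 would come from earlier log entries not shown.)

```python
# Each log entry is a deterministic operation: "set key to value".
log = [("X", 1), ("Y", 2), ("X", 3)]

def apply_log(entries):
    """Applying the same entries in the same order always yields
    the same state: the core replicated-state-machine guarantee."""
    state = {}
    for key, value in entries:
        state[key] = value
    return state

# Three "nodes" replaying the same log converge on the same state.
nodes = [apply_log(log) for _ in range(3)]
assert nodes[0] == nodes[1] == nodes[2] == {"X": 3, "Y": 2}
```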
7. Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
8–12. Elections
[Diagram sequence spanning five slides: a five-node replica set, each node shown with its state machine and replicated log, walks through an election. Per the speaker notes: one node nominates itself and contacts the others, runs a pre-election check ("will you vote for me?"), holds the election proper ("do you vote for me?"), and finally notifies the set that it is now primary. Nodes might vote no if a primary already exists or the candidate lags too far behind in replication.]
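The speaker notes (reproduced at the end) spell out the flow these slides animate: a node nominates itself, runs a pre-election check, holds the binding election, and then announces itself as primary. Below is a hypothetical sketch of that two-phase flow; the class and function names are illustrative, not MongoDB's actual code.

```python
class Peer:
    """Illustrative voter; the refusal reasons mirror the speaker notes."""
    def __init__(self, sees_primary=False, candidate_lagging=False):
        self.sees_primary = sees_primary
        self.candidate_lagging = candidate_lagging

    def would_vote(self):
        # "Will you vote for me?": the non-binding pre-election check.
        return not (self.sees_primary or self.candidate_lagging)

    def vote(self):
        # "Do you vote for me?": the binding vote.
        return self.would_vote()

def run_election(peers, total_nodes):
    """The candidate counts its own vote plus yes-votes from peers;
    it wins (and announces "I am now primary") with a majority."""
    majority = total_nodes // 2 + 1

    # Phase 1: pre-election check; abort cheaply if it cannot win.
    if 1 + sum(p.would_vote() for p in peers) < majority:
        return False

    # Phase 2: the real election.
    return 1 + sum(p.vote() for p in peers) >= majority

# Two healthy peers in a 3-node set: the candidate wins.
assert run_election([Peer(), Peer()], total_nodes=3)
# Both peers refuse (existing primary, replication lag): no election win.
assert not run_election([Peer(sees_primary=True),
                         Peer(candidate_lagging=True)], total_nodes=3)
```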
13. Data Replication
[Diagram: the primary records each write as an entry in its replicated log (X ← 1, Y ← 2, X ← 3) and the secondaries pull and apply the same entries, converging on the same state (X = 3, Y = 2, Z = 7).]
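A hedged sketch of the pull model pictured above, assuming (per the speaker notes) that the oplog is a logical description of each change: a secondary remembers the last log position it applied and fetches everything newer from its sync source. Names are illustrative.

```python
class Node:
    def __init__(self):
        self.log = []          # list of (index, op) pairs
        self.applied_to = 0    # index of the last applied entry

    def append(self, op):
        """Primary side: record a write as a logical log entry."""
        self.log.append((len(self.log) + 1, op))

    def pull_from(self, source):
        """Secondary side: fetch and apply, in order, every entry
        newer than our last applied position."""
        for index, op in source.log:
            if index > self.applied_to:
                self.log.append((index, op))
                self.applied_to = index

primary, secondary = Node(), Node()
for op in [("X", 1), ("Y", 2), ("X", 3)]:
    primary.append(op)
secondary.pull_from(primary)
assert secondary.log == primary.log
```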
15. Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
  • Goals and inspiration from the Raft consensus algorithm
  • Preventing double voting
  • Monitoring node status
  • Calling for elections
16. Goals for MongoDB 3.2
• Decrease failover time
• Speed up detection and resolution of false primary situations
17. Finding Inspiration in Raft
• “In Search of an Understandable Consensus Algorithm” by Diego Ongaro and John Ousterhout: https://ramcloud.stanford.edu/raft.pdf
• Designed to address the shortcomings of Paxos:
  • Easier to understand
  • Easier to implement in real applications
  • Provably correct
• Remarkably similar to what we’re doing already
18. Raft Concepts
• Term (election) IDs
• Monitoring node status using existing data replication channel
• Asymmetric election timeouts
19. Preventing Double Voting
• A node can’t vote for two candidates in the same election
• Pre-3.2: a 30-second vote timeout after each vote
• Post-3.2: term IDs
• Term:
  • A monotonically increasing ID
  • Incremented on every election *attempt*
  • Lets voters distinguish elections, so a node can vote again quickly in a different election (see the sketch below)
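A minimal sketch of the term rule (illustrative, not MongoDB's code): a voter records the highest term it has voted in and grants at most one vote per term, so it can vote again immediately in a later attempt instead of sitting out a 30-second timeout.

```python
class Voter:
    def __init__(self):
        self.last_voted_term = 0   # highest term this node has voted in

    def grant_vote(self, term: int) -> bool:
        """At most one 'yes' per term: a second request in the same
        term is refused, but a request in a newer term succeeds
        immediately, with no 30-second cooldown."""
        if term <= self.last_voted_term:
            return False
        self.last_voted_term = term
        return True

v = Voter()
assert v.grant_vote(term=5)        # first election attempt
assert not v.grant_vote(term=5)    # same term: would be a double vote
assert v.grant_vote(term=6)        # next attempt: can vote right away
```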
20. Monitoring Node Status
• Pre-3.2: heartbeats
  • Sent every two seconds from every node to every other node
  • Volume increases quadratically as nodes are added to the replica set (see the sketch below)
• Post-3.2: extra metadata sent via the existing data replication channel
  • Takes advantage of chained replication
• Faster heartbeats = faster elections and faster detection of false primaries
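A back-of-the-envelope illustration (not from the talk) of the quadratic-growth bullet: with all-to-all heartbeats, each of n nodes pings the other n − 1 every interval.

```python
def heartbeats_per_interval(n: int) -> int:
    """All-to-all heartbeats: each of n nodes pings the other n - 1."""
    return n * (n - 1)

for n in (3, 5, 7):
    print(n, "nodes ->", heartbeats_per_interval(n), "heartbeats / 2s")
# 3 nodes -> 6, 5 nodes -> 20, 7 nodes -> 42: quadratic growth, which
# is why 3.2 piggybacks status metadata on the replication channel
# instead of relying on heartbeat frequency alone.
```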
21. Data Replication
[Diagram: data replication with chaining: one secondary syncs from another secondary instead of directly from the primary, reducing bandwidth across long hops and read load on the primary.]
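A hypothetical sketch of why chaining helps: a secondary may sync from the nearest node that is ahead of it rather than always from the primary, cutting bandwidth across long hops and read load on the primary. The selection rule and fields below are illustrative, not MongoDB's exact algorithm.

```python
def choose_sync_source(me, candidates):
    """Pick the lowest-ping node that is further ahead in the log
    than we are; the primary is no longer the only option."""
    eligible = [c for c in candidates if c["applied"] > me["applied"]]
    return min(eligible, key=lambda c: c["ping_ms"]) if eligible else None

me = {"applied": 10, "ping_ms": 0}
candidates = [
    {"name": "primary",     "applied": 15, "ping_ms": 90},  # remote DC
    {"name": "secondary-b", "applied": 14, "ping_ms": 2},   # same rack
]
assert choose_sync_source(me, candidates)["name"] == "secondary-b"
```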
22. Determining When To Call For An Election
• A tradeoff between failover time and spurious failovers
• A node calls for an election when it hasn’t heard from the primary within the election timeout
• Starting in 3.2:
  • The election timeout is configurable
  • The election timeout is varied randomly on each node
• Varying the timeouts helps reduce tied votes
• Fewer tied votes = faster failover (see the sketch below)
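Two hedged sketches: first, the Raft-style randomized timeout described above (the speaker notes float a base under one second, around 500 ms, plus a small random offset); second, one way the configurable timeout could be set in 3.2 through the replica set configuration's `settings.electionTimeoutMillis` field, using pymongo's generic `command` helper against a running replica set. The values are illustrative.

```python
import random
from pymongo import MongoClient

def election_timeout_ms(base_ms: float = 500, jitter_ms: float = 100) -> float:
    """Each node draws base + random jitter, so two nodes rarely
    time out at the same instant and split the vote."""
    return base_ms + random.uniform(0, jitter_ms)

# Adjusting the configurable timeout (requires a running replica set;
# the 10-second value here is illustrative).
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
config = client.admin.command("replSetGetConfig")["config"]
config["version"] += 1
config.setdefault("settings", {})["electionTimeoutMillis"] = 10000
client.admin.command("replSetReconfig", config)
```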
23. Conclusion
• MongoDB 3.2 will have:
  • Faster failovers
  • Faster error detection
  • More control to prevent spurious failovers
• This means your systems are:
  • More stable
  • More resilient to failure
  • Easier to maintain
Speaker notes
• Requiring every write to involve all nodes is like direct democracy: a lot of network traffic and poor performance. This is like pure Paxos.
• Leader = primary. Leader-based replication is what most real-world deployments use.
• In MongoDB, the oplog is a logical transformation of the data.
• Election flow: one node nominates itself and contacts the others; pre-election check (“will you vote for me?”); the election itself (“do you vote for me?”); election notifications (“I am now primary”). Nodes might vote no if a primary already exists or the candidate’s replication lag is too high.
• Chaining: less bandwidth across long hops, less read load on the primary.
• Question break.
• Define: failover time (failover time > election time), false primaries, rollbacks.
• The Raft paper was presented at USENIX ATC 2014 by Diego Ongaro (then a Stanford PhD student) and John Ousterhout (his advisor). Raft is provably correct.
• Paxos dates to 1989 and is famously hard to use, hence follow-ups like “Paxos Made Simple” and “Paxos Made Practical”.
• Nodes continue sending heartbeats in 3.2 (still needed for initial sync source selection and spanning-tree maintenance), just less frequently now.
• Define “election timeout”: it is *not* a time limit on how long an election may run. One possibility: a base timeout under one second (~500 ms), plus a few milliseconds of random variation.