Advanced Replication Internals
Design and Goals
Goals
■ Highly Available
■ Consistent Data
■ Automatic Failover
■ Multi-Region/DC
■ Dynamic Reads
Design
● All DBs, each node
● Quorum/Election
● Smart clients
● Source selection
● Read Preferences
● Record operations
● Asynchronous
● Write/Replication acknowledgements
Goal: High Availability
● Node Redundancy: Duplicate Data
● Record Write Operations
● Apply Write Operations
● Use capped collection called "oplog"
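The oplog is an ordinary (capped) collection in the local database, so it can be inspected directly. A quick sketch (mongo shell) for peeking at it on any member:

db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(3).pretty()  // newest entries
db.printReplicationInfo()   // configured oplog size and the time window it currently covers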
Replication Operations
Insert
● oplog entry (fields):
○ op, o
{
"ns" : "test.gamma",
"op" : "i", "v" : 2,
"ts" : Timestamp(1350504342, 5),
"o" : { "_id" : 2, "x" : "hi"} }
Replication Operations
Update
● oplog entry (fields):
○ o = update, o2 = query
{
"ns" : "test.tags",
"op" : "u", "v" : 2,
"ts": Timestamp(1368049619, 1),
"o2" : { "_id" : 1 },
"o" : { "$set" : { "tags.4" : "e" } } }
Operation Transformation
● Idempotent (update by _id)
● Multi-update/delete (results in many ops)
● Array modifications (replacement)
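A sketch (mongo shell) of how non-idempotent operators are rewritten before logging; the collection and documents mirror the test.tags example above:

// a $push of "e" onto an array that already has 4 elements ...
db.getSiblingDB("test").tags.update({ _id: 1 }, { $push: { tags: "e" } })
// ... is logged as the position-specific, idempotent $set shown on the update slide:
//     "o2" : { "_id" : 1 },  "o" : { "$set" : { "tags.4" : "e" } }

// a multi-update is logged as one idempotent op per matched document,
// e.g. each $inc becomes a $set of that document's resulting value
db.getSiblingDB("test").tags.update({}, { $inc: { n: 1 } }, { multi: true })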
Interchangeable
● All members maintain oplog + dbs
● All able to take over, or be used for same functions
Replication Process
● Record oplog entry on write
● Idempotent entries
● Pulled by replicas
1. Read over network
2. Buffer locally
3. Apply in batch
4. Repeat
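The real loop is server code, but its shape can be sketched in the shell (syncSource.example.net is a hypothetical host, and applyOps stands in for the internal batch applier):

var source = new Mongo("syncSource.example.net:27017");
var last = db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1).next().ts;
while (true) {
  // 1. read new entries over the network from the source's oplog
  var batch = source.getDB("local").oplog.rs
                    .find({ ts: { $gt: last } }).sort({ $natural: 1 }).limit(1000).toArray();
  if (batch.length === 0) { sleep(1000); continue; }   // nothing new yet
  // 2. "batch" is the local buffer; 3. apply it in one go; 4. repeat
  db.adminCommand({ applyOps: batch });
  last = batch[batch.length - 1].ts;
}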
Read + Apply Decoupled
● Background oplog reader thread
● Pool of oplog applier threads (by collection)
[Diagram: oplog entries flow from the repl source over the network into a local buffer; an applier thread pool (16 threads) applies each batch across DB1-DB4 and writes the local oplog; batch complete]
Replication Metrics
"network": {
"bytes": 103830503805,
"readersCreated": 2248,
"getmores": {
"totalMillis": 257461206,
"num": 2152267 },
"ops": 7285440 }
"buffer": {
"sizeBytes": 0,
"maxSizeBytes": 268435456,
"count": 0},
"preload": { "docs": {
"totalMillis":0,"num":0},
"indexes": {
"totalMillis": 23142318,
"num": 14560667 } },
"apply": {
"batches": {
"totalMillis": 231847,
"num": 1797105},
"ops": 7285440 },
"oplog": {
"insertBytes": 106866610253,
"insert": {
"totalMillis": 1756725,
"num": 7285440 } }
Good Replication States
● Initial Sync
○ Record oplog start position
○ Clone/copy all dbs
○ Set minvalid, apply oplog since start
○ Build indexes
● Replication Batch: MinValid
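A sketch for watching a member move through these states from the shell:

rs.status().members.forEach(function (m) {
  print(m.name + "  " + m.stateStr + "  " + m.optimeDate);
});
// a member doing initial sync reports STARTUP2 while cloning and applying the
// initial oplog, then SECONDARY once its data is consistent (minvalid reached)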
Goal: Consistent Data
● Single Master
● Quorum (majority)
● Ordered Oplog
Consistent Data
Why a single master?
Election Events
Election events:
● Primary failure
● Stepdown (manual)
● Reconfigure
● Quorum loss
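Two of these triggers can be exercised by hand from the shell (newConfig below stands for whatever config document you pass):

rs.stepDown(60)          // primary steps down and stays unelectable for 60 seconds
rs.reconfig(newConfig)   // pushing a new configuration can also force an election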
Election Nomination
Disqualifications
A replica will nominate itself unless:
● Priority:0 or arbiter
● Not freshest
● Just stepped down (in unelectable state)
● Would be vetoed by anyone because
○ There is a Primary already
○ They don't have us in their config
○ Higher priority member out there
○ Higher config version out there
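A minimal sketch of a configuration in which two of these disqualifications apply (hostnames are hypothetical):

rs.reconfig({
  _id: "rs0", version: 2,
  members: [
    { _id: 0, host: "a.example.net:27017", priority: 2 },
    { _id: 1, host: "b.example.net:27017", priority: 0 },      // priority:0 -- never nominates itself
    { _id: 2, host: "c.example.net:27017", arbiterOnly: true } // arbiter -- votes, never a candidate
  ]
})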
The Election
Nomination:
● If it looks like a tie, sleep random time (unless first node)
Voting:
● If all goes well, only one nominee
● All voting members vote for one nominee
● Majority of votes wins
Goal: Automatic Failover
● Single Master
● Smart Clients
● Discovery
Discovery
isMaster command:
setName: <name>,
ismaster: true, secondary: false, arbiterOnly: false,
hosts: [ <visible nodes> ],
passives: [ <prio:0 nodes> ],
arbiters: [ <nodes> ],
primary: <active primary>,
tags: {<tags>},
me: <me>
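Drivers run this discovery for you, but the same command can be issued from the shell against any member:

db.isMaster()
// equivalently:
db.adminCommand({ isMaster: 1 })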
Failover Scenario
[Diagram: the client discovers the active primary P via isMaster; two secondaries S replicate from it]
Failover Scenario
[Diagram: the primary fails; one of the secondaries is elected and becomes the new active primary]
Failover Scenario
[Diagram: the client re-runs discovery (isMaster) against the set and switches to the new active primary]
Replication Source Selection
● Select closest source
○ Prefer members that are not hidden or slave-delayed
○ If none qualify, retry including hidden/slave-delayed members
○ Select node with fastest "ping" time
○ Must be fresher (its oplog is ahead of ours)
● Choose source when
○ Starting
○ Any error with existing source (network, query)
○ Any member is 30s ahead of current source
● Manual override
○ replSetSyncFrom -- good until we choose again
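The shell helper for that command (hypothetical hostname below):

rs.syncFrom("preferred-source.example.net:27017")   // wraps replSetSyncFrom; honored until the
                                                    // member picks a new source on its own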
Goal: Datacenter Aware
● Dynamic replication topologies
● Beachhead data center server
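A sketch of setting up a beachhead: chained replication must be allowed (settings.chainingAllowed, true by default), and the other members in the remote data center are pointed at the beachhead member (hypothetical hostname):

rs.conf().settings                                     // chainingAllowed defaults to true when unset
rs.syncFrom("beachhead.remote-dc.example.net:27017")   // run on each other remote-DC member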
Goal: Dynamic Reads
Controls for consistency
● Default to Primary
● Non-primary allowed
● Based on
○ Locality (ping/tags)
○ Tags
[Diagram: a client choosing between the primary P and secondaries S tagged {A, B} and {B, C}]
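A sketch (mongo shell) of routing a read by locality and tags; the users collection and the dc tag are hypothetical:

db.users.find({ active: true }).readPref("nearest", [ { dc: "east" } ])
// modes: primary, primaryPreferred, secondary, secondaryPreferred, nearest
// -- optionally constrained by one or more tag sets, tried in order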
Asynchronous Replication
● Important considerations
● Additional requirements
● System/Application controls
Write Propagation
● Write Concern
● Replication requirements
● Timing
● Dynamic requirements
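A sketch of requiring replication acknowledgement (the orders collection is hypothetical; the first form is the 2.6+ shell syntax, the second is the getLastError style of this deck's era):

db.orders.insert({ _id: 1, total: 9.99 },
                 { writeConcern: { w: "majority", wtimeout: 5000 } })

db.orders.insert({ _id: 2, total: 19.99 })
db.runCommand({ getLastError: 1, w: "majority", wtimeout: 5000 })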
Exceptional Conditions
● Multiple Primaries
● Rollback
● Too stale
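A sketch for spotting a member that is falling behind (and may end up too stale) from the shell:

var s = rs.status();
var primaryOptime = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0].optimeDate;
s.members.forEach(function (m) {
  print(m.name + "  state: " + m.stateStr + "  lag: " + (primaryOptime - m.optimeDate) / 1000 + "s");
});
// a member that can no longer find the entries it needs in any source's oplog
// goes into RECOVERING and must be re-synced ("too stale")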
Thanks
Questions?
