Advanced Replication Internals
Design and Goals
Goals
§  Highly Available
§  Consistent Data
§  Automatic Failover
§  Multi-Region/DC
§  Dynamic Reads
Design
•  All DBs, each node
•  Quorum/Election
•  Smart clients
•  Source selection
•  Read Preferences
•  Record operations
•  Asynchronous
•  Write/Replication acknowledgements
Goal: High Availability
•  Node Redundancy: Duplicate Data
•  Record Write Operations
•  Apply Write Operations
•  Use capped collection called "oplog"
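The oplog is an ordinary capped collection in the local database (local.oplog.rs on a replica set member), so it can be inspected directly from the mongo shell; a minimal sketch:

db.getSiblingDB("local").getCollection("oplog.rs").find().sort({ $natural: -1 }).limit(1)   // newest entry
db.getSiblingDB("local").getCollection("oplog.rs").stats().capped                           // true: fixed size, oldest entries roll off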
Replication Operations
Insert
•  oplog entry (fields):
o  op = "i" (insert), o = document to insert
{
  "ns" : "test.gamma",
  "op" : "i",
  "v" : 2,
  "ts" : Timestamp(1350504342, 5),
  "o" : { "_id" : 2, "x" : "hi" }
}
Replication Operations
Update
•  oplog entry (fields):
o  o = update, o2 = query
{
  "ns" : "test.tags",
  "op" : "u",
  "v" : 2,
  "ts" : Timestamp(1368049619, 1),
  "o2" : { "_id" : 1 },
  "o" : { "$set" : { "tags.4" : "e" } }
}
Operation Transformation
•  Idempotent (rewritten as updates by _id)
•  Multi-update/delete (expanded into many ops, one per document)
•  Array modifications (logged as a replacement $set)
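A sketch of what that transformation looks like (collection name and values are hypothetical): a single multi-document $inc on the primary is written to the oplog as one idempotent, _id-targeted $set per matched document.

// issued on the primary
db.scores.update({ round: 1 }, { $inc: { points: 1 } }, { multi: true })

// recorded in the oplog as one entry per matched document, e.g.
{ "ns" : "test.scores", "op" : "u", "o2" : { "_id" : 7 }, "o" : { "$set" : { "points" : 12 } } }
{ "ns" : "test.scores", "op" : "u", "o2" : { "_id" : 9 }, "o" : { "$set" : { "points" : 3 } } }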
Interchangeable
•  All members maintain oplog + dbs
•  All able to take over, or be used for the same functions
Replication Process
•  Record oplog entry on write
•  Idempotent entries
•  Pulled by replicas
1.  Read over network
2.  Buffer locally
3.  Apply in batch
4.  Repeat
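Step 1 can be approximated from the shell; the secondary itself issues roughly this query against its sync source as a tailable, awaitData cursor:

var oplog = db.getSiblingDB("local").getCollection("oplog.rs")
var last  = oplog.find().sort({ $natural: -1 }).limit(1).next().ts   // newest (last applied) entry
oplog.find({ ts: { $gt: last } })
    .addOption(DBQuery.Option.tailable)
    .addOption(DBQuery.Option.awaitData)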
Read + Apply Decoupled
•  Background oplog reader thread
•  Pool of oplog applier threads (by collection)
[Diagram: the replication source feeds oplog entries over the network into a local buffer; an applier thread pool (16 threads in the diagram) drains the buffer and applies operations to the databases (DB1-DB4) and the local oplog]
Replication Metrics
"network": {
"bytes": 103830503805,
"readersCreated": 2248,
"getmores": {
"totalMillis": 257461206,
"num": 2152267 },
"ops": 7285440 }
"buffer": {
"sizeBytes": 0,
"maxSizeBytes": 268435456,
"count": 0},
"preload": { "docs": {
"totalMillis":0,"num":0},
"indexes": {
"totalMillis": 23142318,
"num": 14560667 } },
"apply": {
"batches": {
"totalMillis": 231847,
"num": 1797105},
"ops": 7285440 },
"oplog": {
"insertBytes": 106866610253,
"insert": {
"totalMillis": 1756725,
"num": 7285440 } }
Good Replication States
•  Initial Sync
o  Record oplog start position
o  Clone/copy all dbs
o  Set minvalid, apply oplog since start
o  Build indexes
•  Replication Batch: MinValid
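The minvalid marker lives in a small collection in the local database; it can be inspected from the shell (the exact document shape varies by version):

db.getSiblingDB("local").replset.minvalid.findOne()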
Goal: Consistent Data
•  Single Master
•  Quorum (majority)
•  Ordered Oplog
Consistent Data
Why a single master?
Election Events
Election events:
•  Primary failure
•  Stepdown (manual)
•  Reconfigure
•  Quorum loss
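Two of these events can be triggered by hand from the shell; rs.reconfig() takes a configuration document you have prepared (newConfig below is a placeholder):

rs.stepDown(60)        // primary relinquishes and stays unelectable for ~60 seconds
rs.reconfig(newConfig) // push a new replica set configuration (newConfig: placeholder)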
Election Nomination Disqualifications
A replica will nominate itself unless:
•  Priority:0 or arbiter
•  Not freshest
•  Just stepped down (in unelectable state)
•  Would be vetoed by anyone because
o  There is a Primary already
o  They don't have us in their config
o  Higher priority member out there
o  Higher config version out there
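Most of the disqualifiers above are visible in the replica set configuration; a quick shell check of priorities, arbiters, and the config version:

var cfg = rs.conf()
print("config version:", cfg.version)
cfg.members.forEach(function (m) {
    // omitted fields take their defaults: priority 1, arbiterOnly false
    print(m.host,
          "priority:", (m.priority === undefined ? 1 : m.priority),
          "arbiter:", (m.arbiterOnly === true))
})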
The Election
Nomination:
•  If it looks like a tie, sleep a random amount of time (unless first node)
Voting:
•  If all goes well, only one nominee
•  All voting members vote for one nominee
•  Majority of votes wins
Goal: Automatic Failover
•  Single Master
•  Smart Clients
•  Discovery
Discovery
isMaster command:
  setName: <name>,
  ismaster: true, secondary: false, arbiterOnly: <bool>,
  hosts: [ <visible nodes> ],
  passives: [ <prio:0 nodes> ],
  arbiters: [ <arbiter nodes> ],
  primary: <active primary>,
  tags: { <tags> },
  me: <me>
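Drivers run this command against each seed host; the same output is available from the shell:

db.isMaster()                    // shell helper
db.runCommand({ ismaster: 1 })   // equivalent raw command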
Failover Scenario
[Diagram: the client runs isMaster discovery and finds the active primary (P) alongside two secondaries (S)]
Failover Scenario
[Diagram: the primary fails; one of the secondaries is elected and becomes the new active primary]
Failover Scenario
[Diagram: the client re-runs isMaster discovery and connects to the new primary]
Replication Source Selection
•  Select closest source
o  Limit to members that are neither hidden nor slave delayed
o  If nothing, try again with hidden/slave delayed
o  Select node with fastest "ping" time
o  Must be fresher
•  Choose source when
o  Starting
o  Any error with existing source (network, query)
o  Any member is 30s ahead of current source
•  Manual override
o  replSetSyncFrom -- good until we choose again
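The override is exposed in the shell as rs.syncFrom(); a sketch (the host name is hypothetical):

rs.syncFrom("node2.example.net:27017")
// equivalent to: db.adminCommand({ replSetSyncFrom: "node2.example.net:27017" })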
Goal: Datacenter Aware
•  Dynamic replication topologies
•  Beachhead data center server
Goal: Dynamic Reads
Controls for consistency
•  Default to Primary
•  Non-primary allowed
•  Based on
o  Locality (ping/tags)
o  Tags
[Diagram: the client reads from the primary (P) or from secondaries (S) tagged {A, B} and {B, C}, according to its read preference]
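In the shell, a read preference mode and an optional tag set can be attached to the connection; a sketch (the dc tag and collection name are hypothetical):

db.getMongo().setReadPref("secondaryPreferred", [ { "dc" : "east" } ])
db.users.find()   // now eligible to run on a matching secondary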
Asynchronous Replication
•  Important considerations
•  Additional requirements
•  System/Application controls
Write Propagation
•  Write Concern
•  Replication requirements
•  Timing
•  Dynamic requirements
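The replication requirement is expressed per write through write concern; for example, waiting for a majority of members with a timeout (collection and values are illustrative):

db.orders.insert({ _id: 1, total: 9.99 })
db.runCommand({ getLastError: 1, w: "majority", wtimeout: 5000 })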
Exceptional Conditions
•  Multiple Primaries
•  Rollback
•  Too stale
Design and Goals
Goals
§  Highly Available
§  Consistent Data
§  Automatic Failover
§  Multi-Region/DC
§  Dynamic Reads
Design
•  All DBs, each node
•  Quorum/Election
•  Smart clients
•  Source selection
•  Read Preferences
•  Record operations
•  Asynchronous
•  Write/Replication acknowledgements
Thanks
Questions?
