Advanced Replication Internals
Design and Goals
Goals
■ Highly Available
■ Consistent Data
■ Automatic Failover
■ Multi-Region/DC
■ Dynamic Reads
Design
● All DBs, each node
● Quorum/Election
● Smart clients
● Source selection
● Read Preferences
● Record operations
● Asynchronous
● Write/Replication acknowledgements
Goal: High Availability
● Node Redundancy: Duplicate Data
● Record Write Operations
● Apply Write Operations
● Use capped collection called "oplog"
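The oplog is an ordinary (capped) collection in the local database, so it can be inspected directly. A quick sketch (mongo shell) for peeking at it on any member:

db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(3).pretty()  // newest entries
db.printReplicationInfo()   // configured oplog size and the time window it currently covers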
Replication Operations
Insert
● oplog entry (fields):
○ op, o
{
"ns" : "test.gamma",
"op" : "i", "v" : 2,
"ts" : Timestamp(1350504342, 5),
"o" : { "_id" : 2, "x" : "hi"} }
Replication Operations
Update
● oplog entry (fields):
○ o = update, o2 = query
{
"ns" : "test.tags",
"op" : "u", "v" : 2,
"ts": Timestamp(1368049619, 1),
"o2" : { "_id" : 1 },
"o" : { "$set" : { "tags.4" : "e" } } }
Operation Transformation
● Idempotent (update by _id)
● Multi-update/delete (results in many ops)
● Array modifications (replacement)
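A sketch (mongo shell) of how non-idempotent operators are rewritten before logging; the collection and documents mirror the test.tags example above:

// a $push of "e" onto an array that already has 4 elements ...
db.getSiblingDB("test").tags.update({ _id: 1 }, { $push: { tags: "e" } })
// ... is logged as the position-specific, idempotent $set shown on the update slide:
//     "o2" : { "_id" : 1 },  "o" : { "$set" : { "tags.4" : "e" } }

// a multi-update is logged as one idempotent op per matched document,
// e.g. each $inc becomes a $set of that document's resulting value
db.getSiblingDB("test").tags.update({}, { $inc: { n: 1 } }, { multi: true })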
Interchangeable
● All members maintain oplog + dbs
● All able to take over, or be used for same functions
Replication Process
● Record oplog entry on write
● Idempotent entries
● Pulled by replicas
1. Read over network
2. Buffer locally
3. Apply in batch
4. Repeat
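The real loop is server code, but its shape can be sketched in the shell (syncSource.example.net is a hypothetical host, and applyOps stands in for the internal batch applier):

var source = new Mongo("syncSource.example.net:27017");
var last = db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1).next().ts;
while (true) {
  // 1. read new entries over the network from the source's oplog
  var batch = source.getDB("local").oplog.rs
                    .find({ ts: { $gt: last } }).sort({ $natural: 1 }).limit(1000).toArray();
  if (batch.length === 0) { sleep(1000); continue; }   // nothing new yet
  // 2. "batch" is the local buffer; 3. apply it in one go; 4. repeat
  db.adminCommand({ applyOps: batch });
  last = batch[batch.length - 1].ts;
}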
Read + Apply Decoupled
● Background oplog reader thread
● Pool of oplog applier threads (by collection)
[Diagram: oplog entries flow from the repl source over the network into a local buffer; an applier thread pool (16 threads) applies each batch across DB1-DB4 and writes the local oplog; batch complete]
Replication Metrics
"network": {
"bytes": 103830503805,
"readersCreated": 2248,
"getmores": {
"totalMillis": 257461206,
"num": 2152267 },
"ops": 7285440 }
"buffer": {
"sizeBytes": 0,
"maxSizeBytes": 268435456,
"count": 0},
"preload": { "docs": {
"totalMillis":0,"num":0},
"indexes": {
"totalMillis": 23142318,
"num": 14560667 } },
"apply": {
"batches": {
"totalMillis": 231847,
"num": 1797105},
"ops": 7285440 },
"oplog": {
"insertBytes": 106866610253,
"insert": {
"totalMillis": 1756725,
"num": 7285440 } }
Good Replication States
● Initial Sync
○ Record oplog start position
○ Clone/copy all dbs
○ Set minvalid, apply oplog since start
○ Build indexes
● Replication Batch: MinValid
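A sketch for watching a member move through these states from the shell:

rs.status().members.forEach(function (m) {
  print(m.name + "  " + m.stateStr + "  " + m.optimeDate);
});
// a member doing initial sync reports STARTUP2 while cloning and applying the
// initial oplog, then SECONDARY once its data is consistent (minvalid reached)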
Goal: Consistent Data
● Single Master
● Quorum (majority)
● Ordered Oplog
Consistent Data
Why a single master?
Election Events
Election events:
● Primary failure
● Stepdown (manual)
● Reconfigure
● Quorum loss
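Two of these triggers can be exercised by hand from the shell (newConfig below stands for whatever config document you pass):

rs.stepDown(60)          // primary steps down and stays unelectable for 60 seconds
rs.reconfig(newConfig)   // pushing a new configuration can also force an election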
Election Nomination
Disqualifications
A replica will nominate itself unless:
● Priority:0 or arbiter
● Not freshest
● Just stepped down (in unelectable state)
● Would be vetoed by anyone because
○ There is a Primary already
○ They don't have us in their config
○ Higher priority member out there
○ Higher config version out there
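A minimal sketch of a configuration in which two of these disqualifications apply (hostnames are hypothetical):

rs.reconfig({
  _id: "rs0", version: 2,
  members: [
    { _id: 0, host: "a.example.net:27017", priority: 2 },
    { _id: 1, host: "b.example.net:27017", priority: 0 },      // priority:0 -- never nominates itself
    { _id: 2, host: "c.example.net:27017", arbiterOnly: true } // arbiter -- votes, never a candidate
  ]
})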
The Election
Nomination:
● If it looks like a tie, sleep random time (unless first node)
Voting:
● If all goes well, only one nominee
● All voting members vote for one nominee
● Majority of votes wins
Goal: Automatic Failover
● Single Master
● Smart Clients
● Discovery
Discovery
isMaster command:
setName: <name>,
ismaster: true, secondary: false, arbiterOnly: false,
hosts: [ <visible nodes> ],
passives: [ <prio:0 nodes> ],
arbiters: [ <nodes> ],
primary: <active primary>,
tags: {<tags>},
me: <me>
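Drivers run this discovery for you, but the same command can be issued from the shell against any member:

db.isMaster()
// equivalently:
db.adminCommand({ isMaster: 1 })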
Failover Scenario
[Diagram: the client discovers the active primary P via isMaster; two secondaries S replicate from it]
Failover Scenario
[Diagram: the primary fails; one of the secondaries is elected and becomes the new active primary]
Failover Scenario
[Diagram: the client re-runs discovery (isMaster) against the set and switches to the new active primary]
Replication Source Selection
● Select closest source
○ Prefer members that are not hidden or slave-delayed
○ If none qualify, retry including hidden/slave-delayed members
○ Select node with fastest "ping" time
○ Must be fresher (its oplog is ahead of ours)
● Choose source when
○ Starting
○ Any error with existing source (network, query)
○ Any member is 30s ahead of current source
● Manual override
○ replSetSyncFrom -- good until we choose again
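The shell helper for that command (hypothetical hostname below):

rs.syncFrom("preferred-source.example.net:27017")   // wraps replSetSyncFrom; honored until the
                                                    // member picks a new source on its own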
Goal: Datacenter Aware
● Dynamic replication topologies
● Beachhead data center server
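A sketch of setting up a beachhead: chained replication must be allowed (settings.chainingAllowed, true by default), and the other members in the remote data center are pointed at the beachhead member (hypothetical hostname):

rs.conf().settings                                     // chainingAllowed defaults to true when unset
rs.syncFrom("beachhead.remote-dc.example.net:27017")   // run on each other remote-DC member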
Goal: Dynamic Reads
Controls for consistency
● Default to Primary
● Non-primary allowed
● Based on
○ Locality (ping/tags)
○ Tags
[Diagram: a client choosing between the primary P and secondaries S tagged {A, B} and {B, C}]
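A sketch (mongo shell) of routing a read by locality and tags; the users collection and the dc tag are hypothetical:

db.users.find({ active: true }).readPref("nearest", [ { dc: "east" } ])
// modes: primary, primaryPreferred, secondary, secondaryPreferred, nearest
// -- optionally constrained by one or more tag sets, tried in order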
Asynchronous Replication
● Important considerations
● Additional requirements
● System/Application controls
Write Propagation
● Write Concern
● Replication requirements
● Timing
● Dynamic requirements
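A sketch of requiring replication acknowledgement (the orders collection is hypothetical; the first form is the 2.6+ shell syntax, the second is the getLastError style of this deck's era):

db.orders.insert({ _id: 1, total: 9.99 },
                 { writeConcern: { w: "majority", wtimeout: 5000 } })

db.orders.insert({ _id: 2, total: 19.99 })
db.runCommand({ getLastError: 1, w: "majority", wtimeout: 5000 })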
Exceptional Conditions
● Multiple Primaries
● Rollback
● Too stale
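A sketch for spotting a member that is falling behind (and may end up too stale) from the shell:

var s = rs.status();
var primaryOptime = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0].optimeDate;
s.members.forEach(function (m) {
  print(m.name + "  state: " + m.stateStr + "  lag: " + (primaryOptime - m.optimeDate) / 1000 + "s");
});
// a member that can no longer find the entries it needs in any source's oplog
// goes into RECOVERING and must be re-synced ("too stale")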
Thanks
Questions?
