Replication & Durability
@spf13

AKA Steve Francia
15+ years building the internet
Father, husband, skateboarder

Chief Solutions Architect @
responsible for drivers, integrations, web & docs
Agenda
• Intro to replication
• How MongoDB does Replication
• Configuring a ReplicaSet
• Advanced Replication
• Durability
• High Availability Scenarios
Replication
Use cases
• High Availability (auto-failover)
• Read Scaling (extra copies to read from)
• Backups
  • Delayed Copy (fat finger)
  • Online, Point in Time (PiT) backups
• Use (hidden) replica for secondary workload
  • Analytics
  • Data-processing
  • Integration with external systems
Types of outage
Planned
 • Hardware upgrade
 • O/S or file-system tuning
 • Relocation of data to new file-system /
   storage
 • Software upgrade
Unplanned
 • Hardware failure
 • Data center failure
 • Region outage
 • Human error
 • Application corruption
Replica Set features
• A cluster of N servers
• Any (one) node can be primary
• Consensus election of primary
• Automatic failover
• Automatic recovery
• All writes to primary
• Reads can be to primary (default) or a secondary
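A minimal shell sketch of sending reads to a secondary, assuming a connection to a secondary member (collection name is hypothetical):

  rs.slaveOk()                      // allow reads on this secondary connection
  db.users.find({active: true})     // example query, for illustration only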
How MongoDB Replication works

[Diagram: Member 1, Member 2, Member 3]
• Set is made up of 2 or more nodes

[Diagram: Member 1, Member 2 (Primary), Member 3]
• Election establishes the PRIMARY
• Data replication from PRIMARY to SECONDARY

[Diagram: Member 2 DOWN; Member 1 and Member 3 negotiate new master]
• PRIMARY may fail
• Automatic election of new PRIMARY if a majority of members is still up

[Diagram: Member 1, Member 2 DOWN, Member 3 (Primary)]
• New PRIMARY elected
• Replica Set re-established

[Diagram: Member 1, Member 2 (Recovering), Member 3 (Primary)]
• Automatic recovery

[Diagram: Member 1, Member 2, Member 3 (Primary)]
• Replica Set re-established
How Is Data Replicated?
• Change operations are written to the oplog
  • The oplog is a capped collection (fixed size)
    • Must have enough space to allow new secondaries to catch up (from scratch or from a backup)
    • Must have enough space to cope with any applicable slaveDelay
• Secondaries query the primary’s oplog and apply what they find
  • All replicas contain an oplog
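A quick way to peek at the oplog from the shell (sketch; in a replica set the collection is local.oplog.rs):

  use local
  db.oplog.rs.find().sort({$natural: -1}).limit(1)   // most recent oplog entry
  db.printReplicationInfo()                           // oplog size and time window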
Configuring a Replica Set
Creating a Replica Set
$ ./mongod --replSet <name>

> cfg = {
  _id : "<name>",
  members : [
    { _id : 0, host : "sf1.acme.com" },
    { _id : 1, host : "sf2.acme.com" },
    { _id : 2, host : "sf3.acme.com" }
  ]
}
> use admin
> rs.initiate(cfg)
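Clients then connect with a replica-set aware connection string; a sketch using the example hostnames above (the replicaSet value must match the --replSet name):

  mongodb://sf1.acme.com:27017,sf2.acme.com:27017,sf3.acme.com:27017/?replicaSet=<name>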
Managing a Replica Set
rs.conf()
   Shell helper: get current configuration
rs.initiate(<cfg>);
   Shell helper: initiate replica set
rs.reconfig(<cfg>)
   Shell helper: reconfigure a replica set
rs.add("hostname:<port>")
   Shell helper: add a new member
rs.remove("hostname:<port>")
   Shell helper: remove a member
Managing a Replica Set
 rs.status()
    Reports status of the replica set from one
    node's point of view
 rs.stepDown(<secs>)
    Request the primary to step down
 rs.freeze(<secs>)
    Prevents any changes to the current replica
    set configuration (primary/secondary status)
    Use during backups
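A sketch of using these helpers for planned maintenance (the second values are arbitrary):

  rs.stepDown(60)   // on the primary: step down, stay ineligible for ~60 seconds
  rs.freeze(30)     // on a secondary: do not seek election for 30 seconds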
Advanced Replication
Lots of Features

• Delayed
• Hidden
• Priorities
• Tags
Slave Delay
• Lags behind master by configurable
  time delay

• Automatically hidden from clients
• Protects against operator errors
 • Fat fingering
 • Application corrupts data
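A sketch of adding a delayed member (hostname is hypothetical; slaveDelay is in seconds and the member must have priority 0):

  rs.add({
    _id: 3,
    host: "sf4.acme.com:27017",
    priority: 0,       // a delayed member must not become primary
    hidden: true,      // keep it out of client reads
    slaveDelay: 3600   // stay one hour behind the primary
  })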
Other member types
• Arbiters
  • Don’t store a copy of the data
  • Vote in elections
  • Used as a tie breaker
• Hidden
  • Not reported in isMaster
  • Hidden from slaveOk reads
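Sketches of adding these member types (hostnames are hypothetical):

  // arbiter: votes in elections, stores no data
  rs.addArb("sf5.acme.com:27017")

  // hidden member: holds data, not reported to clients, never elected
  rs.add({ _id: 5, host: "sf6.acme.com:27017", priority: 0, hidden: true })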
Priorities
• Priority: a number between 0 and 100
• Used during an election:
 • Most up to date
 • Highest priority
 • Less than 10s behind failed Primary
• Allows weighting of members during
  failover
Priorities - example
    A      B      C      D      E
   p:10   p:10   p:1    p:1    p:0

• Assuming all members are up to date
• Members A or B will be chosen first
  • Highest priority
• Members C or D will be chosen when:
  • A and B are unavailable
  • A and B are not up to date
• Member E is never chosen
  • priority:0 means it cannot be elected
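A sketch of setting the priorities from the example above by reconfiguring the set (member order assumed to match A–E):

  cfg = rs.conf()                 // fetch the current configuration
  cfg.members[0].priority = 10    // A
  cfg.members[1].priority = 10    // B
  cfg.members[2].priority = 1     // C
  cfg.members[3].priority = 1     // D
  cfg.members[4].priority = 0     // E: never eligible to become primary
  rs.reconfig(cfg)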
Durability
Durability Options
• Fire and forget
• Write Concern
Write Concern
• If a write requires a return trip
• What the return trip should depend on
Write Concern
w:
   the number of servers to replicate to (or majority)
wtimeout:
   timeout in ms waiting for replication
j:
   wait for journal sync
tags:
   ensure replication to n nodes of given tag
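A sketch of requesting these guarantees from the shell with getLastError (collection name and values are illustrative):

  db.orders.insert({item: "abc", qty: 1})
  db.runCommand({
    getLastError: 1,
    w: "majority",     // replicate to a majority of members
    wtimeout: 5000,    // give up waiting after 5 seconds
    j: true            // also wait for the journal sync on the primary
  })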
Fire and Forget
[Diagram: driver sends write to primary; primary applies it in memory]
• Operations are applied in memory
• No waiting for persistence to disk
• MongoDB clients do not block waiting to confirm the operation completed

Wait for error
[Diagram: driver sends write, then getLastError; primary applies the write in memory]
• Operations are applied in memory
• No waiting for persistence to disk
• MongoDB clients do block waiting to confirm the operation completed

Wait for journal sync
[Diagram: driver sends write, then getLastError with j:true; primary applies the write in memory and writes to the journal]
• Operations are applied in memory
• Wait for persistence to journal
• MongoDB clients do block waiting to confirm the operation completed

Wait for fsync
[Diagram: driver sends write, then getLastError with fsync:true; primary applies the write in memory, writes to the journal (if enabled), then fsyncs]
• Operations are applied in memory
• Wait for persistence to journal
• Wait for persistence to disk
• MongoDB clients do block waiting to confirm the operation completed

Wait for replication
[Diagram: driver sends write, then getLastError with w:2; primary applies the write in memory and replicates to a secondary]
• Operations are applied in memory
• No waiting for persistence to disk
• Waiting for replication to n nodes
• MongoDB clients do block waiting to confirm the operation completed
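The same pattern via the shell helper, as a sketch (waits for two members, up to 5 seconds; collection name hypothetical):

  db.orders.insert({item: "abc", qty: 1})
  db.getLastErrorObj(2, 5000)     // w:2, wtimeout:5000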
Tagging
• Control over where data is written to.
• Each member can have one or more tags:

  tags: {dc: "stockholm"}

  tags: {dc: "stockholm",
         ip: "192.168",
         rack: "row3-rk7"}


• Replica set defines rules for where data resides
• Rules defined in RS config... can change
  without changing application code
Tagging - example
    {
        _id : "someSet",
        members : [
            {_id : 0, host : "A", tags : {"dc":   "ny"}},
            {_id : 1, host : "B", tags : {"dc":   "ny"}},
            {_id : 2, host : "C", tags : {"dc":   "sf"}},
            {_id : 3, host : "D", tags : {"dc":   "sf"}},
            {_id : 4, host : "E", tags : {"dc":   "cloud"}}
        ],
        settings : {
            getLastErrorModes : {
                veryImportant : {"dc" : 3},
                sortOfImportant : {"dc" : 2}
            }
        }
}
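With those modes defined, a write can ask for one by name; a sketch (collection name hypothetical):

  db.events.insert({type: "audit"})
  db.runCommand({
    getLastError: 1,
    w: "veryImportant",   // replicate to members in 3 distinct dc tag values
    wtimeout: 10000
  })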
High Availability Scenarios
Single Node
• Downtime inevitable
• If node crashes, human intervention might be needed
• Should absolutely run with journaling to prevent data loss / corruption
Replica Set 1
[Diagram: replica set including an arbiter]
• Single datacenter
• Single switch & power
• One node failure
• Automatic recovery of single node crash

• Points of failure:
  • Power
  • Network
  • Datacenter
Replica Set 2
[Diagram: replica set including an arbiter]
• Single datacenter
• Multiple power/network zones
• Automatic recovery of single node crash
• w=2 not viable as losing 1 node means no writes

• Points of failure:
  • Datacenter
  • Two node failure
Replica Set 3
     • Single datacenter
     • Multiple power/network
       zones

     • Automatic recovery of
       single node crash

     • w=2 viable as 2/3 online
     • Points of failure:
      • Datacenter
      • Two node failure
When disaster strikes
Replica Set 4
     • Multi datacenter
     • DR node for safety

     • Can't do multi data
       center durable write
       safely since only 1 node
       in distant DC
Replica Set 5
     • Three data centers
     • Can survive full data
       center loss


     • Can do w= { dc : 2 } to
       guarantee write in 2
       data centers
Set size | Use?    | Data Protection | High Availability | Notes
One      | X       | No              | No                | Must use --journal to protect against crashes
Two      |         | Yes             | No                | On loss of one member, surviving member is read only
Three    | Typical | Yes             | Yes - 1 failure   | On loss of one member, surviving two members can elect a new primary
Four     | X       | Yes             | Yes - 1 failure*  | * On loss of two members, surviving two members are read only
Five     |         | Yes             | Yes - 2 failures  | On loss of two members, surviving three members can elect a new primary
http://spf13.com
http://github.com/s
@spf13

Questions?
download at mongodb.org
We’re hiring!! Contact us at jobs@10gen.com