Replication & Durability
@spf13

AKA Steve Francia
15+ years building the internet
Father, husband, skateboarder

Chief Solutions Architect @
responsible for drivers, integrations, web & docs
Agenda
• Intro to replication
• How MongoDB does Replication
• Configuring a ReplicaSet
• Advanced Replication
• Durability
• High Availability Scenarios
Replication
Use cases
• High Availability (auto-failover)
• Read Scaling (extra copies to read from)
• Backups
  • Delayed Copy (fat finger)
  • Online, Point in Time (PiT) backups
• Use (hidden) replica for secondary workload
  • Analytics
  • Data-processing
  • Integration with external systems
Types of outage
Planned
 • Hardware upgrade
 • O/S or file-system tuning
 • Relocation of data to new file-system /
   storage
 • Software upgrade
Unplanned
 • Hardware failure
 • Data center failure
 • Region outage
 • Human error
 • Application corruption
Replica Set features
• A cluster of N servers
• Any (one) node can be primary
• Consensus election of primary
• Automatic failover
• Automatic recovery
• All writes to primary
• Reads can be to primary (default) or a secondary
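A minimal shell sketch of sending reads to a secondary, assuming a connection to a secondary member (collection name is hypothetical):

  rs.slaveOk()                      // allow reads on this secondary connection
  db.users.find({active: true})     // example query, for illustration only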
How MongoDB Replication works

[Diagram: Member 1, Member 2, Member 3]
• Set is made up of 2 or more nodes

[Diagram: Member 1, Member 2 (Primary), Member 3]
• Election establishes the PRIMARY
• Data replication from PRIMARY to SECONDARY

[Diagram: Member 2 DOWN; Member 1 and Member 3 negotiate new master]
• PRIMARY may fail
• Automatic election of new PRIMARY if a majority of members is still up

[Diagram: Member 1, Member 2 DOWN, Member 3 (Primary)]
• New PRIMARY elected
• Replica Set re-established

[Diagram: Member 1, Member 2 (Recovering), Member 3 (Primary)]
• Automatic recovery

[Diagram: Member 1, Member 2, Member 3 (Primary)]
• Replica Set re-established
How Is Data Replicated?
• Change operations are written to the oplog
  • The oplog is a capped collection (fixed size)
    • Must have enough space to allow new secondaries to catch up (from scratch or from a backup)
    • Must have enough space to cope with any applicable slaveDelay
• Secondaries query the primary’s oplog and apply what they find
  • All replicas contain an oplog
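A quick way to peek at the oplog from the shell (sketch; in a replica set the collection is local.oplog.rs):

  use local
  db.oplog.rs.find().sort({$natural: -1}).limit(1)   // most recent oplog entry
  db.printReplicationInfo()                           // oplog size and time window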
Configuring a Replica Set
Creating a Replica Set
$ ./mongod --replSet <name>

> cfg = {
  _id : "<name>",
  members : [
    { _id : 0, host : "sf1.acme.com" },
    { _id : 1, host : "sf2.acme.com" },
    { _id : 2, host : "sf3.acme.com" }
  ]
}
> use admin
> rs.initiate(cfg)
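Clients then connect with a replica-set aware connection string; a sketch using the example hostnames above (the replicaSet value must match the --replSet name):

  mongodb://sf1.acme.com:27017,sf2.acme.com:27017,sf3.acme.com:27017/?replicaSet=<name>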
Managing a Replica Set
rs.conf()
   Shell helper: get current configuration
rs.initiate(<cfg>);
   Shell helper: initiate replica set
rs.reconfig(<cfg>)
   Shell helper: reconfigure a replica set
rs.add("hostname:<port>")
   Shell helper: add a new member
rs.remove("hostname:<port>")
   Shell helper: remove a member
Managing a Replica Set
 rs.status()
    Reports status of the replica set from one
    node's point of view
 rs.stepDown(<secs>)
    Request the primary to step down
 rs.freeze(<secs>)
    Prevents any changes to the current replica
    set configuration (primary/secondary status)
    Use during backups
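A sketch of using these helpers for planned maintenance (the second values are arbitrary):

  rs.stepDown(60)   // on the primary: step down, stay ineligible for ~60 seconds
  rs.freeze(30)     // on a secondary: do not seek election for 30 seconds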
Advanced Replication
Lots of Features

• Delayed
• Hidden
• Priorities
• Tags
Slave Delay
• Lags behind master by configurable
  time delay

• Automatically hidden from clients
• Protects against operator errors
 • Fat fingering
 • Application corrupts data
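A sketch of adding a delayed member (hostname is hypothetical; slaveDelay is in seconds and the member must have priority 0):

  rs.add({
    _id: 3,
    host: "sf4.acme.com:27017",
    priority: 0,       // a delayed member must not become primary
    hidden: true,      // keep it out of client reads
    slaveDelay: 3600   // stay one hour behind the primary
  })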
Other member types
• Arbiters
  • Don’t store a copy of the data
  • Vote in elections
  • Used as a tie breaker
• Hidden
  • Not reported in isMaster
  • Hidden from slaveOk reads
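Sketches of adding these member types (hostnames are hypothetical):

  // arbiter: votes in elections, stores no data
  rs.addArb("sf5.acme.com:27017")

  // hidden member: holds data, not reported to clients, never elected
  rs.add({ _id: 5, host: "sf6.acme.com:27017", priority: 0, hidden: true })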
Priorities
• Priority: a number between 0 and 100
• Used during an election:
 • Most up to date
 • Highest priority
 • Less than 10s behind failed Primary
• Allows weighting of members during
  failover
Priorities - example
    A      B      C      D      E
   p:10   p:10   p:1    p:1    p:0

• Assuming all members are up to date
• Members A or B will be chosen first
  • Highest priority
• Members C or D will be chosen when:
  • A and B are unavailable
  • A and B are not up to date
• Member E is never chosen
  • priority:0 means it cannot be elected
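A sketch of setting the priorities from the example above by reconfiguring the set (member order assumed to match A–E):

  cfg = rs.conf()                 // fetch the current configuration
  cfg.members[0].priority = 10    // A
  cfg.members[1].priority = 10    // B
  cfg.members[2].priority = 1     // C
  cfg.members[3].priority = 1     // D
  cfg.members[4].priority = 0     // E: never eligible to become primary
  rs.reconfig(cfg)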
Durability
Durability Options
• Fire and forget
• Write Concern
Write Concern
• If a write requires a return trip
• What the return trip should depend on
Write Concern
w:
   the number of servers to replicate to (or majority)
wtimeout:
   timeout in ms waiting for replication
j:
   wait for journal sync
tags:
   ensure replication to n nodes of given tag
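A sketch of requesting these guarantees from the shell with getLastError (collection name and values are illustrative):

  db.orders.insert({item: "abc", qty: 1})
  db.runCommand({
    getLastError: 1,
    w: "majority",     // replicate to a majority of members
    wtimeout: 5000,    // give up waiting after 5 seconds
    j: true            // also wait for the journal sync on the primary
  })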
Fire and Forget
[Diagram: driver sends write to primary; primary applies it in memory]
• Operations are applied in memory
• No waiting for persistence to disk
• MongoDB clients do not block waiting to confirm the operation completed

Wait for error
[Diagram: driver sends write, then getLastError; primary applies the write in memory]
• Operations are applied in memory
• No waiting for persistence to disk
• MongoDB clients do block waiting to confirm the operation completed

Wait for journal sync
[Diagram: driver sends write, then getLastError with j:true; primary applies the write in memory and writes to the journal]
• Operations are applied in memory
• Wait for persistence to journal
• MongoDB clients do block waiting to confirm the operation completed

Wait for fsync
[Diagram: driver sends write, then getLastError with fsync:true; primary applies the write in memory, writes to the journal (if enabled), then fsyncs]
• Operations are applied in memory
• Wait for persistence to journal
• Wait for persistence to disk
• MongoDB clients do block waiting to confirm the operation completed

Wait for replication
[Diagram: driver sends write, then getLastError with w:2; primary applies the write in memory and replicates to a secondary]
• Operations are applied in memory
• No waiting for persistence to disk
• Waiting for replication to n nodes
• MongoDB clients do block waiting to confirm the operation completed
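The same pattern via the shell helper, as a sketch (waits for two members, up to 5 seconds; collection name hypothetical):

  db.orders.insert({item: "abc", qty: 1})
  db.getLastErrorObj(2, 5000)     // w:2, wtimeout:5000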
Tagging
• Control over where data is written to.
• Each member can have one or more tags:

  tags: {dc: "stockholm"}

  tags: {dc: "stockholm",
         ip: "192.168",
         rack: "row3-rk7"}


• Replica set defines rules for where data resides
• Rules defined in RS config... can change
  without changing application code
Tagging - example
    {
        _id : "someSet",
        members : [
            {_id : 0, host : "A", tags : {"dc":   "ny"}},
            {_id : 1, host : "B", tags : {"dc":   "ny"}},
            {_id : 2, host : "C", tags : {"dc":   "sf"}},
            {_id : 3, host : "D", tags : {"dc":   "sf"}},
            {_id : 4, host : "E", tags : {"dc":   "cloud"}}
        ],
        settings : {
            getLastErrorModes : {
                veryImportant : {"dc" : 3},
                sortOfImportant : {"dc" : 2}
            }
        }
}
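With those modes defined, a write can ask for one by name; a sketch (collection name hypothetical):

  db.events.insert({type: "audit"})
  db.runCommand({
    getLastError: 1,
    w: "veryImportant",   // replicate to members in 3 distinct dc tag values
    wtimeout: 10000
  })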
High Availability Scenarios
Single Node
• Downtime inevitable
• If node crashes, human intervention might be needed
• Should absolutely run with journaling to prevent data loss / corruption
Replica Set 1
[Diagram: replica set including an arbiter]
• Single datacenter
• Single switch & power
• One node failure
• Automatic recovery of single node crash

• Points of failure:
  • Power
  • Network
  • Datacenter
Replica Set 2
[Diagram: replica set including an arbiter]
• Single datacenter
• Multiple power/network zones
• Automatic recovery of single node crash
• w=2 not viable as losing 1 node means no writes

• Points of failure:
  • Datacenter
  • Two node failure
Replica Set 3
     • Single datacenter
     • Multiple power/network
       zones

     • Automatic recovery of
       single node crash

     • w=2 viable as 2/3 online
     • Points of failure:
      • Datacenter
      • Two node failure
When disaster strikes
Replica Set 4
     • Multi datacenter
     • DR node for safety

     • Can't do multi data
       center durable write
       safely since only 1 node
       in distant DC
Replica Set 5
     • Three data centers
     • Can survive full data
       center loss


     • Can do w= { dc : 2 } to
       guarantee write in 2
       data centers
Set size | Use?    | Data Protection | High Availability | Notes
One      | X       | No              | No                | Must use --journal to protect against crashes
Two      |         | Yes             | No                | On loss of one member, surviving member is read only
Three    | Typical | Yes             | Yes - 1 failure   | On loss of one member, surviving two members can elect a new primary
Four     | X       | Yes             | Yes - 1 failure*  | * On loss of two members, surviving two members are read only
Five     |         | Yes             | Yes - 2 failures  | On loss of two members, surviving three members can elect a new primary
http://spf13.com
http://github.com/s
@spf13

Questions?
download at mongodb.org
We’re hiring!! Contact us at jobs@10gen.com