Advanced Replication Internals


Published in: Technology

  1. Advanced Replication Internals
  2. Design and Goals
     Goals
     §  Highly Available
     §  Consistent Data
     §  Automatic Failover
     §  Multi-Region/DC
     §  Dynamic Reads
     Design
     •  All DBs, each node
     •  Quorum/Election
     •  Smart clients
     •  Source selection
     •  Read Preferences
     •  Record operations
     •  Asynchronous
     •  Write/Replication acknowledgements
  3. Goal: High Availability
     •  Node Redundancy: Duplicate Data
     •  Record Write Operations
     •  Apply Write Operations
     •  Use capped collection called "oplog"
  4. Replication Operations
     Insert
     •  oplog entry (fields):
        o  op, o
     {
       "ns" : "test.gamma",
       "op" : "i", "v" : 2,
       "ts" : Timestamp(1350504342, 5),
       "o" : { "_id" : 2, "x" : "hi" }
     }
  5. Replication Operations
     Update
     •  oplog entry (fields):
        o  o = update, o2 = query
     {
       "ns" : "test.tags",
       "op" : "u", "v" : 2,
       "ts" : Timestamp(1368049619, 1),
       "o2" : { "_id" : 1 },
       "o" : { "$set" : { "tags.4" : "e" } }
     }
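
The two oplog entries above can be replayed against a toy in-memory store to see how the `ns`, `op`, `o`, and `o2` fields are used. This is only a sketch under stated assumptions: `apply_oplog_entry` and `set_path` are hypothetical helpers, `ts` is written as a plain tuple instead of a `Timestamp`, and only inserts and `$set` updates by `_id` are handled. It is not how the server applies operations.

```python
def set_path(doc, dotted_path, value):
    """Set a value at a dotted path such as "tags.4" (hypothetical helper)."""
    parts = dotted_path.split(".")
    node = doc
    for part in parts[:-1]:
        node = node[int(part)] if isinstance(node, list) else node.setdefault(part, {})
    last = parts[-1]
    if isinstance(node, list):
        index = int(last)
        while len(node) <= index:
            node.append(None)  # pad so the indexed write always succeeds
        node[index] = value
    else:
        node[last] = value


def apply_oplog_entry(collections, entry):
    """Apply one simplified oplog entry to {namespace: {_id: document}}."""
    docs = collections.setdefault(entry["ns"], {})
    if entry["op"] == "i":
        doc = dict(entry["o"])
        docs[doc["_id"]] = doc  # re-inserting overwrites, so replay is harmless
    elif entry["op"] == "u":
        target = docs[entry["o2"]["_id"]]  # o2 is the query, here always by _id
        for path, value in entry["o"].get("$set", {}).items():
            set_path(target, path, value)
    else:
        raise ValueError(f"unsupported op in this sketch: {entry['op']}")


store = {"test.tags": {1: {"_id": 1, "tags": ["a", "b", "c", "d"]}}}
insert_entry = {"ns": "test.gamma", "op": "i", "v": 2,
                "ts": (1350504342, 5), "o": {"_id": 2, "x": "hi"}}
update_entry = {"ns": "test.tags", "op": "u", "v": 2,
                "ts": (1368049619, 1), "o2": {"_id": 1},
                "o": {"$set": {"tags.4": "e"}}}
for entry in (insert_entry, update_entry, update_entry):  # update replayed twice
    apply_oplog_entry(store, entry)
```

Replaying the update a second time leaves the document unchanged, which is the property the slides call idempotent.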
  6. Operation Transformation
     •  Idempotent (update by _id)
     •  Multi-update/delete (results in many ops)
     •  Array modifications (replacement)
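
One way to picture the transformation is a multi-document array append being rewritten as one `$set`-by-`_id` entry per matching document. The sketch below assumes that shape, based on the `"tags.4"` update entry shown earlier in the deck; `transform_multi_push` and its parameters are hypothetical names, and the real rewriting rules are not given in the slides.

```python
def transform_multi_push(ns, documents, matches, field, value):
    """Rewrite 'append value to field on every match' as per-document ops.

    Each resulting entry targets one _id and sets an explicit array index,
    so applying it twice gives the same result as applying it once.
    """
    ops = []
    for doc in documents:
        if not matches(doc):
            continue
        next_index = len(doc.get(field, []))
        ops.append({
            "ns": ns,
            "op": "u",
            "o2": {"_id": doc["_id"]},
            "o": {"$set": {f"{field}.{next_index}": value}},
        })
    return ops


docs = [
    {"_id": 1, "tags": ["a", "b", "c", "d"]},
    {"_id": 2, "tags": []},
    {"_id": 3, "kind": "untagged"},
]
ops = transform_multi_push("test.tags", docs, lambda d: "tags" in d, "tags", "e")
```

A single logical write thus produces as many oplog entries as documents it touched, which is what "results in many ops" refers to.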
  7. Interchangeable
     •  All members maintain oplog + dbs
     •  All able to take over, or be used for same functions
  8. Replication Process
     •  Record oplog entry on write
     •  Idempotent entries
     •  Pulled by replicas
        1.  Read over network
        2.  Buffer locally
        3.  Apply in batch
        4.  Repeat
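
The four pull steps can be sketched as a single function that a replica would call repeatedly. Everything here is illustrative: `replicate_once` is a hypothetical name, the source oplog is a plain list rather than a network cursor, and `ts` values are tuples that compare in order.

```python
from collections import deque


def replicate_once(source_oplog, last_applied_ts, apply_entry, batch_size=100):
    """Run one read/buffer/apply round and return the new last-applied ts."""
    # 1. Read: fetch everything newer than what was already applied.
    fetched = [e for e in source_oplog if e["ts"] > last_applied_ts]
    # 2. Buffer locally.
    buffer = deque(fetched)
    # 3. Apply in batches, advancing the position after each batch.
    while buffer:
        batch = [buffer.popleft() for _ in range(min(batch_size, len(buffer)))]
        for entry in batch:
            apply_entry(entry)
        last_applied_ts = batch[-1]["ts"]
    # 4. Repeat: the caller invokes this function again later.
    return last_applied_ts


source = [{"ts": (100, i), "op": "i", "o": {"_id": i}} for i in range(1, 6)]
applied = []
position = replicate_once(source, (0, 0), applied.append, batch_size=2)
position_again = replicate_once(source, position, applied.append, batch_size=2)
```

Because the position only advances after a batch has been applied, a crash mid-batch leads to some entries being applied again, which is safe only because entries are idempotent.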
  9. Read + Apply Decoupled
     •  Background oplog reader thread
     •  Pool of oplog applier threads (by collection)
     [Diagram labels: Repl Source, Network, Buffer, Applier Thread Pool (16), DB1, DB2, DB3, DB4, Local Oplog]
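
The applier side of this split can be sketched by grouping a batch by namespace and handing each group to a worker, so entries for one collection keep their order while different collections proceed in parallel. This is a minimal sketch, not the server's implementation: `apply_batch_by_collection` is a hypothetical name, and the default of 16 workers simply mirrors the number on the slide's diagram.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def apply_batch_by_collection(batch, apply_entry, workers=16):
    """Apply a batch with one task per namespace; returns entries applied."""
    groups = defaultdict(list)
    for entry in batch:
        groups[entry["ns"]].append(entry)  # preserves per-collection order

    def apply_group(entries):
        for entry in entries:
            apply_entry(entry)
        return len(entries)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(apply_group, groups.values()))


seen = {"test.a": [], "test.b": []}
batch = [{"ns": "test.a" if i % 2 == 0 else "test.b", "n": i} for i in range(6)]
applied_count = apply_batch_by_collection(
    batch, lambda e: seen[e["ns"]].append(e["n"]))
```

The reader thread is left out here; in the slide's picture it would keep filling the buffer while the pool drains it.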
  10. Replication Metrics
      "network": {
        "bytes": 103830503805,
        "readersCreated": 2248,
        "getmores": { "totalMillis": 257461206, "num": 2152267 },
        "ops": 7285440
      },
      "buffer": { "sizeBytes": 0, "maxSizeBytes": 268435456, "count": 0 },
      "preload": {
        "docs": { "totalMillis": 0, "num": 0 },
        "indexes": { "totalMillis": 23142318, "num": 14560667 }
      },
      "apply": {
        "batches": { "totalMillis": 231847, "num": 1797105 },
        "ops": 7285440
      },
      "oplog": {
        "insertBytes": 106866610253,
        "insert": { "totalMillis": 1756725, "num": 7285440 }
      }
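
A few derived figures make these counters easier to read. The sketch below assumes that each `totalMillis`/`num` pair is a cumulative total since startup, so their ratio is an average; that reading, and the helper name `average_millis`, are assumptions rather than something the slide states.

```python
metrics = {
    "network": {
        "bytes": 103830503805,
        "getmores": {"totalMillis": 257461206, "num": 2152267},
        "ops": 7285440,
    },
    "apply": {
        "batches": {"totalMillis": 231847, "num": 1797105},
        "ops": 7285440,
    },
}


def average_millis(section):
    """Average duration per event for a {"totalMillis", "num"} pair."""
    return section["totalMillis"] / section["num"] if section["num"] else 0.0


avg_getmore_ms = average_millis(metrics["network"]["getmores"])
avg_batch_ms = average_millis(metrics["apply"]["batches"])
ops_per_batch = metrics["apply"]["ops"] / metrics["apply"]["batches"]["num"]
bytes_per_op = metrics["network"]["bytes"] / metrics["network"]["ops"]

print(f"avg getmore: {avg_getmore_ms:.1f} ms")
print(f"avg apply batch: {avg_batch_ms:.3f} ms")
print(f"ops per batch: {ops_per_batch:.2f}")
print(f"network bytes per op: {bytes_per_op:.0f}")
```

Note that `network.ops` and `apply.ops` are equal in the sample, and `buffer.count` is 0, which is consistent with a replica that has applied everything it has read.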
  11. Good Replication States
      •  Initial Sync
         o  Record oplog start position
         o  Clone/copy all dbs
         o  Set minvalid, apply oplog since start
         o  Build indexes
      •  Replication Batch: MinValid
  12. Goal: Consistent Data
      •  Single Master
      •  Quorum (majority)
      •  Ordered Oplog
  13. Consistent Data
      Why a single master?
  14. Election Events
      Election events:
      •  Primary failure
      •  Stepdown (manual)
      •  Reconfigure
      •  Quorum loss
  15. Election Nomination Disqualifications
      A replica will nominate itself unless:
      •  Priority:0 or arbiter
      •  Not freshest
      •  Just stepped down (in unelectable state)
      •  Would be vetoed by anyone because
         o  There is a Primary already
         o  They don't have us in their config
         o  Higher priority member out there
      •  Higher config version out there
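
The disqualification list reads naturally as a chain of early returns. The sketch below is a simplification under stated assumptions: `may_nominate_self` and every field name (`priority`, `optime`, `config_version`, `known_members`, `arbiter`) are hypothetical, and the real checks carry more conditions than the slide lists.

```python
def may_nominate_self(me, others, primary_exists, recently_stepped_down=False):
    """Return True only if none of the slide's disqualifications apply."""
    if me["priority"] == 0 or me.get("arbiter", False):
        return False  # priority 0 members and arbiters never stand
    if recently_stepped_down:
        return False  # still in the unelectable window
    if primary_exists:
        return False  # would be vetoed: there is a primary already
    for other in others:
        if other["optime"] > me["optime"]:
            return False  # not freshest
        if me["name"] not in other["known_members"]:
            return False  # vetoed: they don't have us in their config
        if other["priority"] > me["priority"]:
            return False  # vetoed: higher priority member out there
        if other["config_version"] > me["config_version"]:
            return False  # higher config version out there
    return True


me = {"name": "a", "priority": 1, "optime": 10, "config_version": 3}
peer = {"name": "b", "priority": 1, "optime": 10, "config_version": 3,
        "known_members": {"a", "b"}}
can_stand = may_nominate_self(me, [peer], primary_exists=False)
```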
  16. The Election
      Nomination:
      •  If it looks like a tie, sleep random time (unless first node)
      Voting:
      •  If all goes well, only one nominee
      •  All voting members vote for one nominee
      •  Majority of votes wins
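
The "majority of votes wins" rule is a strict majority of all voting members, not of the votes actually cast. A minimal sketch, with `majority` and `tally` as hypothetical helper names:

```python
from collections import Counter


def majority(voting_members):
    """Smallest vote count that is more than half of the voting members."""
    return voting_members // 2 + 1


def tally(votes, voting_members):
    """Return the winning nominee, or None if nobody reached a majority.

    votes maps each voter that responded to the nominee it voted for.
    """
    counts = Counter(votes.values())
    if not counts:
        return None
    nominee, count = counts.most_common(1)[0]
    return nominee if count >= majority(voting_members) else None


winner = tally({"a": "a", "b": "a", "c": "a"}, voting_members=3)
no_winner = tally({"a": "a"}, voting_members=3)  # two members unreachable
```

With only one of three members reachable no primary can be elected, which is why quorum loss appears among the election events.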
  17. Goal: Automatic Failover
      •  Single Master
      •  Smart Clients
      •  Discovery
  18. Discovery
      isMaster command:
        setName: <name>,
        ismaster: true, secondary: false, arbiterOnly:
        hosts: [ <visible nodes> ],
        passives: [ <prio:0 nodes> ],
        arbiters: [ <nodes> ],
        primary: <active primary>,
        tags: {<tags>},
        me: <me>
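
A smart client can use these fields to walk from any seed host to the current primary. The sketch below assumes the reply is available as a dict with the field names shown above; `discover_primary` is a hypothetical function, the `is_master` callable stands in for a real network call, and the host names are made up.

```python
def discover_primary(seeds, is_master):
    """Follow isMaster replies from seed hosts until a primary answers.

    is_master(host) returns a reply dict or raises ConnectionError.
    """
    to_visit = list(seeds)
    seen = set()
    while to_visit:
        host = to_visit.pop(0)
        if host in seen:
            continue
        seen.add(host)
        try:
            reply = is_master(host)
        except ConnectionError:
            continue  # unreachable member, try the next candidate
        if reply.get("ismaster"):
            return host  # only trust a node that itself claims to be primary
        hint = reply.get("primary")
        if hint and hint not in seen:
            to_visit.insert(0, hint)  # ask the hinted primary first
        to_visit.extend(h for h in reply.get("hosts", []) if h not in seen)
    return None


hosts = ["a:27017", "b:27017", "c:27017"]
replies = {
    "b:27017": {"setName": "rs0", "ismaster": False, "secondary": True,
                "hosts": hosts, "primary": "c:27017", "me": "b:27017"},
    "c:27017": {"setName": "rs0", "ismaster": True, "secondary": False,
                "hosts": hosts, "primary": "c:27017", "me": "c:27017"},
}


def fake_is_master(host):
    if host not in replies:
        raise ConnectionError(host)  # "a" plays the failed member
    return replies[host]


primary = discover_primary(["a:27017", "b:27017"], fake_is_master)
```

After a failover the client would run the same walk again, since the `primary` hint from a secondary can be stale.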
  19. Failover Scenario
      [Diagram: Client; members P, S, S; labels: Discovery (isMaster), Active Primary]
  20. Failover Scenario
      [Diagram: Client; members P, S, S, P; labels: Active Primary, Failed Primary]
  21. Failover Scenario
      [Diagram: Client; members Failed, P, S; label: Discovery (isMaster)]
  22. Replication Source Selection
      •  Select closest source
         o  Limit to members that are neither hidden nor slave delayed
         o  If nothing, try again with hidden/slave delayed
         o  Select node with fastest "ping" time
         o  Must be fresher
      •  Choose source when
         o  Starting
         o  Any error with existing source (network, query)
         o  Any member is 30s ahead of current source
      •  Manual override
         o  replSetSyncSource -- good until we choose again
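
The selection rules can be sketched as a two-pass filter followed by a minimum over ping time. The functions `choose_sync_source` and `should_change_source`, and the fields `optime` (seconds), `ping_ms`, `hidden`, and `slave_delay`, are hypothetical; this follows the slide's bullets, not the server's code.

```python
def choose_sync_source(my_optime, candidates):
    """Pick the closest member that is fresher than we are, or None."""
    def pick(pool):
        fresher = [c for c in pool if c["optime"] > my_optime]
        return min(fresher, key=lambda c: c["ping_ms"], default=None)

    preferred = [c for c in candidates
                 if not c["hidden"] and c["slave_delay"] == 0]
    # First pass skips hidden and delayed members; second pass allows them.
    return pick(preferred) or pick(candidates)


def should_change_source(current, members, max_lead_secs=30):
    """True if any member is more than max_lead_secs ahead of the source."""
    return any(m["optime"] - current["optime"] > max_lead_secs for m in members)


members = [
    {"name": "near_stale", "optime": 90, "ping_ms": 1,
     "hidden": False, "slave_delay": 0},
    {"name": "far_fresh", "optime": 120, "ping_ms": 40,
     "hidden": False, "slave_delay": 0},
    {"name": "hidden_fresh", "optime": 130, "ping_ms": 2,
     "hidden": True, "slave_delay": 0},
]
source = choose_sync_source(100, members)
```

In this example the nearest member is rejected for being stale, and the hidden member is only considered when no regular member is fresher.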
  23. Goal: Datacenter Aware
      •  Dynamic replication topologies
      •  Beachhead data center server
      [Diagram label: P]
  24. Goal: Dynamic Reads
      Controls for consistency
      •  Default to Primary
      •  Non-primary allowed
      •  Based on
         o  Locality (ping/tags)
         o  Tags
      [Diagram: Client; members S, P, S; tag sets "A, B" and "B, C"]
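
Read routing by tags and locality can be sketched as a filter plus a sort. This is a toy model: `select_read_members` is a hypothetical function, tags are treated as plain labels as in the diagram (real tag sets are key/value pairs), and which member carries which tags is invented for the example.

```python
def select_read_members(members, mode="primary", required_tags=frozenset()):
    """Return members eligible for a read, nearest first.

    mode "primary" ignores tags and returns only the primary;
    mode "secondary" returns secondaries carrying all required tags.
    """
    if mode == "primary":
        return [m for m in members if m["state"] == "P"]
    if mode != "secondary":
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    eligible = [m for m in members
                if m["state"] == "S" and set(required_tags) <= m["tags"]]
    return sorted(eligible, key=lambda m: m["ping_ms"])


members = [
    {"name": "p", "state": "P", "ping_ms": 20, "tags": set()},
    {"name": "s1", "state": "S", "ping_ms": 5, "tags": {"A", "B"}},
    {"name": "s2", "state": "S", "ping_ms": 12, "tags": {"B", "C"}},
]
default_targets = select_read_members(members)
tagged_targets = select_read_members(members, "secondary", {"B"})
```

Reads sent to a secondary may return older data than the primary because replication is asynchronous, which is why the default stays on the primary.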
  25. Asynchronous Replication
      •  Important considerations
      •  Additional requirements
      •  System/Application controls
  26. Write Propagation
      •  Write Concern
      •  Replication requirements
      •  Timing
      •  Dynamic requirements
  27. Exceptional Conditions
      •  Multiple Primaries
      •  Rollback
      •  Too stale
  28. Design and Goals
      Goals
      §  Highly Available
      §  Consistent Data
      §  Automatic Failover
      §  Multi-Region/DC
      §  Dynamic Reads
      Design
      •  All DBs, each node
      •  Quorum/Election
      •  Smart clients
      •  Source selection
      •  Read Preferences
      •  Record operations
      •  Asynchronous
      •  Write/Replication acknowledgements
  29. Thanks
      Questions?
