Advanced Replication Internals

Internals of replication in MongoDB. These internals cover replication source selection, the replication process, elections (and the rules that govern them), and oplog transformation.

This presentation was given at the MongoDB San Francisco conference.

  1. Advanced Replication Internals
  2. Design and Goals
     Goals
     ■ Highly Available
     ■ Consistent Data
     ■ Automatic Failover
     ■ Multi-Region/DC
     ■ Dynamic Reads
     Design
     ● All DBs, each node
     ● Quorum/Election
     ● Smart clients
     ● Source selection
     ● Read Preferences
     ● Record operations
     ● Asynchronous
     ● Write/Replication acknowledgements
  3. Goal: High Availability
     ● Node Redundancy: Duplicate Data
     ● Record Write Operations
     ● Apply Write Operations
     ● Use capped collection called "oplog"
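     A quick way to see that capped collection in action: the shell can report the oplog's size and the time window it currently covers. This is a minimal check, assuming the shell is connected to a mongod already running as a replica set member.

     // Prints the configured oplog size and the span between its first and
     // last entries for the connected member.
     db.printReplicationInfo()

     // The oplog itself is just a capped collection in the "local" database:
     db.getSiblingDB("local").oplog.rs.stats()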
  4. Replication Operations: Insert
     ● oplog entry (fields):
     ○ op, o

     { "ns" : "test.gamma",
       "op" : "i", "v" : 2,
       "ts" : Timestamp(1350504342, 5),
       "o"  : { "_id" : 2, "x" : "hi" } }
  5. Replication Operations: Update
     ● oplog entry (fields):
     ○ o = update, o2 = query

     { "ns" : "test.tags",
       "op" : "u", "v" : 2,
       "ts" : Timestamp(1368049619, 1),
       "o2" : { "_id" : 1 },
       "o"  : { "$set" : { "tags.4" : "e" } } }
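     Entries like the two above can be inspected directly from the mongo shell; this is a minimal sketch, and the namespace filter is just an example.

     // Show the most recent oplog entries for one namespace, newest first.
     db.getSiblingDB("local").oplog.rs
       .find({ ns: "test.gamma" })     // namespace from the insert slide (example)
       .sort({ $natural: -1 })         // reverse natural order = newest first
       .limit(5)
       .pretty()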
  6. Operation Transformation
     ● Idempotent (update by _id)
     ● Multi-update/delete (results in many ops)
     ● Array modifications (replacement)
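     To make the first two points concrete, here is an illustrative sketch of how a single non-idempotent, multi-document update is transformed. The collection, values, and resulting entries are made up for the example, not captured output.

     // One client update that matches two documents and is not idempotent:
     db.scores.update({ level: 1 }, { $inc: { points: 5 } }, { multi: true })

     // The primary logs one oplog entry per matched document, each targeted
     // by _id and rewritten as a $set of the resulting value, so replaying it
     // a second time produces the same state:
     //   { "op": "u", "ns": "test.scores", "o2": { "_id": 7 }, "o": { "$set": { "points": 105 } } }
     //   { "op": "u", "ns": "test.scores", "o2": { "_id": 9 }, "o": { "$set": { "points": 42 } } }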
  7. Interchangeable
     ● All members maintain oplog + dbs
     ● All able to take over, or be used for the same functions
  8. Replication Process
     ● Record oplog entry on write
     ● Idempotent entries
     ● Pulled by replicas:
       1. Read over network
       2. Buffer locally
       3. Apply in batch
       4. Repeat
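     The "pull" step is an ordinary tailable cursor on the source's oplog. The toy loop below runs against the local oplog just to show the mechanism; it is a sketch only, since a real secondary reads from its sync source over the network, buffers, and batch-applies.

     // Find the newest timestamp, then tail the oplog for anything newer.
     var lastTs = db.getSiblingDB("local").oplog.rs
                    .find().sort({ $natural: -1 }).limit(1).next().ts;

     var cur = db.getSiblingDB("local").oplog.rs
                 .find({ ts: { $gt: lastTs } })
                 .addOption(DBQuery.Option.tailable)    // keep the cursor open on the capped oplog
                 .addOption(DBQuery.Option.awaitData);  // wait briefly for new entries

     while (cur.hasNext()) {
         printjson(cur.next());   // a real replica would buffer here, then apply in batches
     }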
  9. Read + Apply Decoupled
     ● Background oplog reader thread
     ● Pool of oplog applier threads (by collection)
     [Diagram: repl source -> network -> local buffer -> applier thread pool (16) -> DB1-DB4, with "batch complete" recorded in the local oplog]
  10. Replication Metrics

     "network": { "bytes": 103830503805,
                  "readersCreated": 2248,
                  "getmores": { "totalMillis": 257461206, "num": 2152267 },
                  "ops": 7285440 },
     "buffer":  { "sizeBytes": 0, "maxSizeBytes": 268435456, "count": 0 },
     "preload": { "docs": { "totalMillis": 0, "num": 0 },
                  "indexes": { "totalMillis": 23142318, "num": 14560667 } },
     "apply":   { "batches": { "totalMillis": 231847, "num": 1797105 },
                  "ops": 7285440 },
     "oplog":   { "insertBytes": 106866610253,
                  "insert": { "totalMillis": 1756725, "num": 7285440 } }
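     These counters are the repl section of the server's status metrics and can be pulled from the shell at any time:

     // Returns the network / buffer / preload / apply / oplog sub-documents
     // shown on the slide for the connected member.
     db.serverStatus().metrics.repl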
  11. Good Replication States
     ● Initial Sync
     ○ Record oplog start position
     ○ Clone/copy all dbs
     ○ Set minvalid, apply oplog since start
     ○ Build indexes
     ● Replication Batch: MinValid
  12. Goal: Consistent Data
     ● Single Master
     ● Quorum (majority)
     ● Ordered Oplog
  13. Consistent Data
     Why a single master?
  14. Election Events
     Election events:
     ● Primary failure
     ● Stepdown (manual)
     ● Reconfigure
     ● Quorum loss
  15. Election Nomination: Disqualifications
     A replica will nominate itself unless:
     ● Priority: 0 or arbiter
     ● Not freshest
     ● Just stepped down (in unelectable state)
     ● Would be vetoed by anyone because:
     ○ There is a Primary already
     ○ They don't have us in their config
     ○ Higher priority member out there
     ● Higher config version out there
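     Roughly, the checks above amount to a single predicate. The sketch below is illustrative pseudocode only; the object shapes (me, members) and helpers such as configKnowsAbout are hypothetical stand-ins for internal state, not MongoDB APIs.

     // Hypothetical shape of the self-nomination check described on this slide.
     function canNominateSelf(me, members) {
         if (me.priority === 0 || me.arbiterOnly) return false;            // priority:0 or arbiter
         if (members.some(function (m) { return m.optime > me.optime; }))
             return false;                                                 // not freshest
         if (me.justSteppedDown) return false;                             // unelectable window
         // Any member may veto the nomination:
         return !members.some(function (m) {
             return m.isPrimary ||                       // a primary already exists
                    !m.configKnowsAbout(me) ||           // their config omits us (hypothetical helper)
                    m.priority > me.priority ||          // higher-priority member out there
                    m.configVersion > me.configVersion;  // higher config version out there
         });
     }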
  16. The Election
     Nomination:
     ● If it looks like a tie, sleep a random time (unless first node)
     Voting:
     ● If all goes well, only one nominee
     ● All voting members vote for one nominee
     ● Majority of votes wins
  17. Goal: Automatic Failover
     ● Single Master
     ● Smart Clients
     ● Discovery
  18. Discovery
     isMaster command:

     setName: <name>,
     ismaster: true, secondary: false, arbiterOnly: false,
     hosts: [ <visible nodes> ],
     passives: [ <prio:0 nodes> ],
     arbiters: [ <nodes> ],
     primary: <active primary>,
     tags: { <tags> },
     me: <me>
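     Every member answers with a document shaped like the one above; from the shell it can be requested in any of these equivalent ways:

     db.runCommand({ isMaster: 1 })   // raw command form
     db.isMaster()                    // helper on the db object
     rs.isMaster()                    // replica set shell helper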
  19. Failover Scenario
     [Diagram: client performs discovery (isMaster) against a P-S-S set and identifies the active primary]
  20. Failover Scenario
     [Diagram: the primary fails; a former secondary becomes the new active primary]
  21. Failover Scenario
     [Diagram: the client re-runs discovery (isMaster), skips the failed node, and finds the new active primary]
  22. Replication Source Selection
     ● Select closest source
     ○ Limit to non-hidden, non-slave-delayed members
     ○ If nothing, try again with hidden/slave-delayed
     ○ Select node with fastest "ping" time
     ○ Must be fresher
     ● Choose source when:
     ○ Starting
     ○ Any error with existing source (network, query)
     ○ Any member is 30s ahead of current source
     ● Manual override
     ○ replSetSyncFrom -- good until we choose again
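     The manual override is exposed in the shell; the target host:port below is a placeholder.

     // Force this member to pull from a specific source until the next
     // automatic re-selection (e.g. on an error with that source).
     rs.syncFrom("host2.example.net:27017")
     // Equivalent raw command form:
     db.adminCommand({ replSetSyncFrom: "host2.example.net:27017" })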
  23. Goal: Datacenter Aware
     ● Dynamic replication topologies
     ● Beachhead data center server
     [Diagram: primary (P) with a beachhead server replicating into the remote data center]
  24. Goal: Dynamic Reads
     Controls for consistency
     ● Default to Primary
     ● Non-primary allowed
     ● Based on:
     ○ Locality (ping/tags)
     ○ Tags
     [Diagram: client reads routed across P and two secondaries carrying tag sets "A, B" and "B, C"]
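     From the shell, a read preference mode and tag set can be attached per query. The mode and the tag document below are illustrative; tag keys and values are whatever the replica set configuration defines.

     // Prefer a secondary whose member tags match { dc: "west" } (example tag).
     db.orders.find({ status: "open" })
              .readPref("secondaryPreferred", [ { dc: "west" } ])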
  25. Asynchronous Replication
     ● Important considerations
     ● Additional requirements
     ● System/Application controls
  26. Write Propagation
     ● Write Concern
     ● Replication requirements
     ● Timing
     ● Dynamic requirements
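     Write concern is how an application states those replication requirements per write. In the shell generation this deck targets, the write is followed by getLastError; the collection, document, and numbers here are illustrative.

     // Insert, then wait for the write to replicate to a majority of the set,
     // giving up (but not undoing the write) after 5 seconds.
     db.orders.insert({ _id: 1, status: "open" })
     db.runCommand({ getLastError: 1, w: "majority", wtimeout: 5000 })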
  27. Exceptional Conditions
     ● Multiple Primaries
     ● Rollback
     ● Too stale
  28. Design and Goals
     Goals
     ■ Highly Available
     ■ Consistent Data
     ■ Automatic Failover
     ■ Multi-Region/DC
     ■ Dynamic Reads
     Design
     ● All DBs, each node
     ● Quorum/Election
     ● Smart clients
     ● Source selection
     ● Read Preferences
     ● Record operations
     ● Asynchronous
     ● Write/Replication acknowledgements
  29. Thanks
     Questions?
