Sep 6 CDM

PPT of the Ceph CDM of September 6th, 2017

  1. RADOS level replication
  2. ISSUES
     • Maintaining causal ordering
     • Guaranteeing replication correctness when some OSDs go down
     • Some other considerations
  3. System run time
     [Diagram: the system run time split into a sequence of consecutive time slices]
     Sage Weil suggested (to preserve causal order):
     • Split the whole system run time into a series of time slices
     • All client ops during the same time slice are in the same transaction
     • At time slice boundaries, clients have to pause for some time so that the physical clock advances far enough to prevent causal order from being violated across the time slice boundary
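     As a rough sketch of the idea (the slice length and the op representation below are assumptions, not part of the slides): ops are grouped by the time slice their timestamp falls into, and each slice is replicated as one transaction.

       from collections import defaultdict

       TIME_SLICE = 5.0  # assumed slice length in seconds, purely illustrative

       def slice_index(op_timestamp, epoch_start):
           """Map an op's wall-clock timestamp to the index of its time slice."""
           return int((op_timestamp - epoch_start) // TIME_SLICE)

       def group_ops_into_transactions(ops, epoch_start):
           """Ops whose timestamps fall into the same slice form one transaction,
           so the backup cluster applies a slice either completely or not at all."""
           transactions = defaultdict(list)
           for timestamp, payload in ops:   # an op is assumed to be (timestamp, payload)
               transactions[slice_index(timestamp, epoch_start)].append((timestamp, payload))
           return transactions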
  4. Some more details
     • How to make all clients pause at exactly every time slice boundary?
     • How to make all client ops in the same time slice either all replicated or all not-replicated?
  5. Making all clients pause at time slice boundaries
     • The monitor sends a timestamp Tts_bound to all clients for every time slice boundary.
     • All clients set their pause timers according to the following rule:
       • CONDITION: the local system clock has to be synchronized with the same NTP server as the monitors, and the time skew has to be small enough.
       • Tpause_timer_expire = Tts_bound + TimeSlice – Tlocal
     • When the pause timer expires, if the client's local system clock still satisfies the CONDITION, its worker threads have to pause for the same period of time, Ppause.
     • Ppause must be much larger than the time synchronization error bound.
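     A minimal sketch of the timer rule on this slide, assuming illustrative values for TimeSlice, Ppause and the tolerated skew; a real client would pause its worker threads rather than sleep.

       import time

       TIME_SLICE = 5.0   # assumed slice length in seconds (illustrative)
       P_PAUSE = 0.5      # pause period; must be much larger than the NTP error bound
       MAX_SKEW = 0.05    # assumed tolerated skew against the monitors' NTP server

       def pause_timer_delay(t_ts_bound):
           """Tpause_timer_expire = Tts_bound + TimeSlice - Tlocal."""
           t_local = time.time()            # NTP-synchronized wall clock
           return t_ts_bound + TIME_SLICE - t_local

       def on_pause_timer_expired(clock_still_synced):
           """Pause for Ppause only if the CONDITION still holds; returns whether we paused."""
           if not clock_still_synced:       # e.g. the observed skew exceeds MAX_SKEW
               return False
           time.sleep(P_PAUSE)              # stand-in for pausing all worker threads
           return True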
  6. Making all clients pause at time slice boundaries
     [Diagram: Client1's, Client2's and Client3's system clocks around Tts_bound and Tts_bound + TimeSlice; each client's Ppause straddles the boundary, and their offsets stay within the time sync error bound]
  7. Making all ops in the same time slice part of the same transaction
     • All clients, after being paused for Ppause, report the pause to the monitor together with the last Tts_bound_last according to which they set this pause.
     • OSDs in the master cluster periodically report their latest replicated client op to the monitor.
     • When all clients have finished the pauses they set according to Tts_bound_last, and all OSDs have started to replicate client ops whose timestamps are later than Tts_bound_last, the monitor sends Tts_bound_last + TimeSlice to the backup cluster.
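     The monitor-side release condition can be sketched as below; the class and field names are assumptions made for illustration only.

       class MonitorState:
           """Tracks what the slide says the monitor must see before it may
           release the next confirmed boundary to the backup cluster."""
           def __init__(self, clients, osds, time_slice):
               self.clients = clients              # set of client ids
               self.osds = osds                    # set of OSD ids in the master cluster
               self.time_slice = time_slice
               self.paused_for = {}                # client id -> Tts_bound_last it paused for
               self.latest_replicated = {}         # OSD id -> timestamp of latest replicated op

           def may_release_boundary(self, t_ts_bound_last):
               all_clients_paused = all(
                   self.paused_for.get(c) == t_ts_bound_last for c in self.clients)
               all_osds_past_boundary = all(
                   self.latest_replicated.get(o, float("-inf")) > t_ts_bound_last
                   for o in self.osds)
               return all_clients_paused and all_osds_past_boundary

           def boundary_to_send(self, t_ts_bound_last):
               # The value sent to the backup cluster once may_release_boundary() is true.
               return t_ts_bound_last + self.time_slice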
  8. Making all ops in the same time slice part of the same transaction
     • OSDs in the master cluster keep replicating client ops to the backup cluster regardless of the time slice constraints.
     • OSDs in the backup cluster cache the ops in their journal, and write them back to the backing store only when a confirmed time slice boundary (Tts_bound + TimeSlice) is sent from the monitor of the master cluster.
     • When the time slice boundary (Tts_bound + TimeSlice) is received, all ops whose timestamps are earlier than that boundary are written back to the backing store.
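     A sketch of this write-back rule, using an in-memory heap as a stand-in for the backup OSD's journal; the callback name is an assumption.

       import heapq
       import itertools

       class BackupOsdJournal:
           """Caches replicated ops and applies them to the backing store only up
           to a confirmed time slice boundary received from the master's monitor."""
           def __init__(self, apply_to_backing_store):
               self._pending = []                     # min-heap of (timestamp, seq, op)
               self._seq = itertools.count()          # tie-breaker for equal timestamps
               self._apply = apply_to_backing_store   # callback into the backing store

           def on_replicated_op(self, timestamp, op):
               # Ops are journaled (cached) but not applied yet.
               heapq.heappush(self._pending, (timestamp, next(self._seq), op))

           def on_confirmed_boundary(self, boundary):
               # Write back every cached op whose timestamp is earlier than the boundary.
               while self._pending and self._pending[0][0] < boundary:
                   _, _, op = heapq.heappop(self._pending)
                   self._apply(op)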
  9. Put together
     [Diagram: clients, the master cluster (monitor + OSDs) and the backup cluster (monitor + OSDs), connected by transfer nodes that carry the replicated ops; annotated with the messages Tts_bound, Tts_bound_last, Top_latest and Tts_bound_last + TimeSlice]
  10. Key points
     • The expiration time of clients' pause timers must NOT be influenced by a time synchronization service like ntp or chrony.
     • If one client fails to pause correctly, this time slice should be merged into later time slices instead of being replicated.
     • The OSD journal space of the OSDs in the backup cluster should be large enough to cache multiple time slices' ops, since the sending of the confirmed time slice boundary can be delayed.
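     One possible way to satisfy the first key point, as a sketch: read the NTP-synchronized wall clock once to compute the remaining delay, then count it down on the monotonic clock, which ntp/chrony cannot step. The helper below is illustrative.

       import time

       def pause_at_boundary(t_ts_bound, time_slice, p_pause):
           """Wall clock is read only once; the countdown uses time.monotonic()."""
           delay = t_ts_bound + time_slice - time.time()
           deadline = time.monotonic() + max(delay, 0.0)   # monotonic deadline
           while True:
               remaining = deadline - time.monotonic()
               if remaining <= 0:
                   break
               time.sleep(min(remaining, 0.1))             # coarse wait, unaffected by clock steps
           time.sleep(p_pause)                             # stand-in for pausing the worker threads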
  11. ISSUES
     • Maintaining causal ordering
     • Guaranteeing replication correctness when some OSDs go down
     • Some other considerations
  12. Guaranteeing replication correctness when some OSDs go down
     • The following two conditions should be enough to guarantee correctness:
       • For all OSDs in the acting set, an "original op journal" entry is removed only after its corresponding "original op" has been replicated.
       • In the recovery/backfill phase, the recovery source replicates all journal entries related to the recovering object before pushing it.
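     A sketch of the first condition, with an illustrative journal structure that is not Ceph's actual journal code: entries may only be trimmed up to the point that has been replicated.

       class OriginalOpJournal:
           """Illustrative journal whose entries may be removed only after the
           corresponding original op has been replicated to the backup cluster."""
           def __init__(self):
               self.entries = []            # list of (op_seq, op), in submission order
               self.replicated_upto = 0     # highest op_seq confirmed as replicated

           def mark_replicated(self, op_seq):
               self.replicated_upto = max(self.replicated_upto, op_seq)

           def trim(self, trim_upto):
               # Condition 1: never remove an entry whose op is not yet replicated.
               safe_upto = min(trim_upto, self.replicated_upto)
               self.entries = [(seq, op) for seq, op in self.entries if seq > safe_upto]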
  13. Guaranteeing replication correctness when some OSDs go down
     • Can we use the current OSD journal?
       • Op_replication_head – op_replication_tail <= some_threshold
       • If the condition above holds, journal_head should point to the same "op" as op_replication_head.
       • Otherwise, only journal_head moves forward.
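     One possible reading of the pointer rule above, sketched with an assumed threshold value; the exact pointer semantics are interpreted from the slide, not stated by it.

       SOME_THRESHOLD = 1024   # assumed bound on the un-replicated part of the journal

       def advance_heads(journal_head, op_replication_head, op_replication_tail):
           """While the un-replicated span stays under the threshold, journal_head and
           op_replication_head point at the same newest op; otherwise only
           journal_head keeps moving forward."""
           if op_replication_head - op_replication_tail <= SOME_THRESHOLD:
               op_replication_head += 1
               journal_head = op_replication_head
           else:
               journal_head += 1
           return journal_head, op_replication_head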
  14. Guaranteeing replication correctness when some OSDs go down
     • Can we use the current OSD journal?
       • Replication is controlled by the acting primary.
       • Replica OSDs should report their journal space usage info to the acting primary, for example by adding this info to the reply to the CEPH_OSD_OP_REPOP message.
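     A generic sketch of the proposal to piggyback journal usage on the replica's acknowledgement; this is not Ceph's actual CEPH_OSD_OP_REPOP structure, and the field and function names are assumptions.

       from dataclasses import dataclass

       @dataclass
       class RepOpReply:
           """Stand-in for a replica's ack of a replicated op, extended with
           the replica's journal space usage as the slide proposes."""
           op_seq: int
           journal_bytes_used: int
           journal_bytes_total: int

       def replication_must_catch_up(replies, high_watermark=0.8):
           """The acting primary could use the piggybacked info, e.g. to force
           replication to catch up before any replica's journal fills up."""
           return any(r.journal_bytes_used / r.journal_bytes_total >= high_watermark
                      for r in replies)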
  15. Guaranteeing replication correctness when some OSDs go down
     • Details about this issue are in the document from the last CDM: http://tracker.ceph.com/attachments/download/2903/ceph_rados-level_replication.pdf :-)
  16. ISSUES
     • Maintaining causal ordering
     • Guaranteeing replication correctness when some OSDs go down
     • Some other considerations
  17. Some other considerations
     • Making all clients pause periodically does not seem necessary; maybe we can let clients identify whether or not they need point-in-time consistency, so that only those clients that need point-in-time consistency have to pause.
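     A sketch of the per-client flag suggested here; the session structure and names are assumptions for illustration.

       from dataclasses import dataclass

       @dataclass
       class ClientSession:
           """Only clients that ask for point-in-time consistency take part in
           the periodic pauses; everyone else keeps running through boundaries."""
           client_id: str
           needs_point_in_time_consistency: bool = False

       def clients_that_must_pause(sessions):
           return [s.client_id for s in sessions if s.needs_point_in_time_consistency]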
