An Efficient Backup and Replication of Storage

Talk slides for LinuxCon Japan 2013.

  1. An Efficient Backup and Replication of Storage – Takashi HOSHINO, Cybozu Labs, Inc. – 2013-05-29 @ LinuxCon Japan – rev.20130529a
  2. Self-introduction • Takashi HOSHINO – Works at Cybozu Labs, Inc. • Technical interests – Database, storage, distributed algorithms • Current work – WalB: today’s talk!
  3. Contents • Motivation • Architecture • Alternative solutions • Algorithm • Performance evaluation • Summary
  4. Contents • Motivation • Architecture • Alternative solutions • Algorithm • Performance evaluation • Summary
  5. Motivation • Backup and replication are vital – We run our own cloud infrastructure – High availability of customers’ data is required – With cost-effective commodity hardware and software – [diagram: operating data at the primary DC; backup data and replicated data at the secondary DC]
  6. Requirements • Functionality – Getting consistent diff data for backup/replication, achieving a small RPO • Performance – Without the usual full scans – Small overhead • Support for various kinds of data – Databases – Blob data – Full-text-search indexes – ...
  7. A solution: WalB • A Linux kernel device driver that provides wrapper block devices, plus related userland tools • Provides consistent diff extraction for efficient backup and replication • “WalB” means “Block-level WAL” – [diagram: diffs flow from the WalB storage to the backup storage and the replicated storage]
  8. WAL (Write-Ahead Logging) – [diagram: timeline comparing an ordinary storage, which overwrites blocks in place, with a WAL storage, which appends each write (“write at 0”, “write at 2”, “read at 2”, “write at 2”) to its log]
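To make the write-ahead-logging idea above concrete, here is a minimal userspace sketch: each block write is appended to a log as an (address, data) record before the data image is updated in place. It only illustrates the principle on the slide; the names, sizes, and in-memory arrays are made up and are not walb code.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 8
#define N_BLOCKS   4
#define LOG_CAP    16

struct log_rec { uint64_t addr; uint8_t data[BLOCK_SIZE]; };

static uint8_t data_img[N_BLOCKS][BLOCK_SIZE];   /* stands in for the data device */
static struct log_rec log_img[LOG_CAP];          /* stands in for the log device   */
static uint64_t next_lsid;                       /* next log sequence id           */

static void wal_write(uint64_t addr, const uint8_t *buf)
{
    struct log_rec *r = &log_img[next_lsid % LOG_CAP];
    r->addr = addr;
    memcpy(r->data, buf, BLOCK_SIZE);            /* 1. append to the log           */
    next_lsid++;
    memcpy(data_img[addr], buf, BLOCK_SIZE);     /* 2. update the data image       */
}

int main(void)
{
    uint8_t buf[BLOCK_SIZE] = "written";
    wal_write(0, buf);
    wal_write(2, buf);
    printf("lsid=%llu, block 2 = %s\n",
           (unsigned long long)next_lsid, (const char *)data_img[2]);
    return 0;
}
```

Because every change is captured in the log in write order, the log alone is enough to reconstruct the diffs discussed on the next slide.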
  9. How to get diffs (data at t0 → data at t1) • Full-scan and compare the two images • Partial scan with bitmaps (or indexes) that mark the changed blocks • Scan the logs directly from the WAL storage – [diagram: each approach extracting the changed blocks b’ and e’ at addresses 1 and 4]
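As a side note on the bitmap approach (not the one walb takes), the generic sketch below marks one bit per written block, so the diff from t0 to t1 is simply the set of marked blocks; all names here are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define N_BLOCKS 64

static uint64_t dirty;                            /* one bit per block */

static void mark_write(unsigned block) { dirty |= 1ULL << block; }

int main(void)
{
    mark_write(1);                                /* b  -> b' */
    mark_write(4);                                /* e  -> e' */
    for (unsigned i = 0; i < N_BLOCKS; i++)
        if (dirty & (1ULL << i))
            printf("block %u changed\n", i);      /* the diff to copy */
    return 0;
}
```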
  10. Contents • Motivation • Architecture • Alternative solutions • Algorithm • Performance evaluation • Summary
  11. WalB architecture • A walb device works as a wrapper: any application (file system, DBMS, etc.) reads and writes it like an ordinary block device • Underneath, it uses two block devices – a block device for data (data device), in no special format – a block device for logs (log device), in an original format • Userland tools – the walb log extractor reads logs from the log device; the walb device controller controls the driver – [architecture diagram]
  12. Log device format • The log device consists of metadata and a ring buffer – Metadata: a superblock (one physical block) and an unused 4KiB area – Log address = (log sequence id) % (ring buffer size) + (ring buffer start address) – [diagram: log device address layout]
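The address mapping on this slide can be written directly as a small function. The sketch below uses the slide’s formula; the parameter names and block-based units are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Map a log sequence id (lsid) to an address on the log device, following
 * the formula on the slide:
 *   log address = lsid % ring_buffer_size + ring_buffer_start.
 * Units (blocks) and names are illustrative assumptions. */
static uint64_t log_addr(uint64_t lsid,
                         uint64_t ring_buffer_size,
                         uint64_t ring_buffer_start)
{
    return lsid % ring_buffer_size + ring_buffer_start;
}

int main(void)
{
    /* Example: a 1 GiB ring buffer of 4 KiB blocks starting at block 2. */
    uint64_t rb_size = (1ULL << 30) / 4096;
    uint64_t rb_start = 2;
    for (uint64_t lsid = rb_size - 2; lsid < rb_size + 2; lsid++)
        printf("lsid %llu -> block %llu\n",
               (unsigned long long)lsid,
               (unsigned long long)log_addr(lsid, rb_size, rb_start));
    return 0;
}
```

The modulo makes the log wrap around to the start of the ring buffer once the lsid exceeds the buffer size, which is why the ring buffer capacity bounds how far back diffs can be extracted.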
  13. Ring buffer inside • The ring buffer holds log packs, from the oldest logpack to the latest • A log pack is a logpack header block followed by the written data blocks (1st written data, 2nd written data, ...) • The logpack header block contains a checksum, the logpack lsid, the number of records, the total IO size, and one log record (IO address, IO size, ...) per write – [diagram: ring buffer and log pack layout]
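The fields listed on this slide can be pictured as a C struct. The layout below is only a hypothetical rendering of those fields; the field names, widths, and ordering are assumptions, not the actual walb on-disk format (which is defined in the walb source).

```c
#include <stdint.h>

/* Hypothetical shape of a logpack header block, based only on the fields
 * named on the slide. */
struct log_record {
    uint64_t io_address;     /* first sector of the write */
    uint32_t io_size;        /* size of the write */
};

struct logpack_header {
    uint32_t checksum;       /* checksum of the header block */
    uint64_t logpack_lsid;   /* lsid of this logpack */
    uint16_t n_records;      /* number of log records that follow */
    uint32_t total_io_size;  /* total size of the written data blocks */
    struct log_record record[]; /* n_records entries, one per write IO */
};
```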
  14. Redo/undo logs • Redo logs – to go forward • Undo logs – to go backward – [diagram: redo and undo logs between the data at t0 and the data at t1]
  15. How to create redo/undo logs • Generating redo logs: (1) write IOs are submitted, (2) the redo logs are written to the log device • Generating undo logs: (1) write IOs are submitted, (2) the current data is read, (3) the undo logs are written – so undo logs require additional read IOs • Walb devices do not generate undo logs
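A small sketch makes the asymmetry visible: a redo record only needs a copy of the new data, while an undo record needs the current on-disk contents, i.e. an extra read IO. The helpers and layout below are illustrative, not driver code.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* "disk" stands in for the data device image. */
struct log_entry { uint64_t addr; uint8_t data[BLOCK_SIZE]; };

static void log_redo(struct log_entry *e, uint64_t addr, const uint8_t *new_data)
{
    e->addr = addr;
    memcpy(e->data, new_data, BLOCK_SIZE);                  /* no read IO needed */
}

static void log_undo(struct log_entry *e, uint64_t addr, const uint8_t *disk)
{
    e->addr = addr;
    memcpy(e->data, disk + addr * BLOCK_SIZE, BLOCK_SIZE);  /* extra read of the old data */
}
```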
  16. Consistent full backup • (A) Read the online volume between t0 and t1 (the image read this way is inconsistent) • (B) Get the logs generated from t0 to t1 from the log device • (C) Apply the logs to the copied image on the backup host to get a consistent image at t1 • The ring buffer must have enough capacity to store the logs generated from t0 to t1
  17. Consistent incremental backup • (A) Get the logs generated since the previous backup timestamp (from t0 to t1) • (B) Apply them to the full archive on the backup host – application can be deferred • To manage multiple backup generations, defer application or generate undo logs during application
  18. Backup and replication – [diagram: the logs retrieved from the primary host are applied to the full archive on the backup host (backup) and also forwarded to and applied on a remote host (replication), which therefore lags by some replication delay]
  19. Contents • Motivation • Architecture • Alternative solutions • Algorithm • Performance evaluation • Summary
  20. Alternative solutions • DRBD – Specialized for replication – DRBD Proxy is required for long-distance replication • dm-snap – Snapshot management using COW (copy-on-write) – Full scans are required to get diffs • dm-thin – Snapshot management using COW and reference counters – Fragmentation is inevitable
  21. Alternatives comparison (WalB / dm-snap / dm-thin / DRBD) • Capability: incremental backup, synchronous replication, asynchronous replication • Read response overhead – negligible for WalB and DRBD; an index search for dm-snap and dm-thin • Write response overhead – WalB writes the log instead of the data; dm-snap and dm-thin modify the index (+COW); DRBD sends IOs to slaves (async replication) • Fragmentation – never for WalB and DRBD; never for dm-snap (it keeps the original LV); inevitable for dm-thin
  22. WalB pros and cons • Pros – Small response overhead – Fragmentation never occurs – Logs can be retrieved with sequential scans • Cons – 2x bandwidth is required for writes (each write goes to both the log and the data device)
  23. WalB source code statistics – [chart: lines of code of walb compared with other block device wrappers and with file systems]
  24. Contents • Motivation • Architecture • Alternative solutions • Algorithm • Performance evaluation • Summary
  25. Requirements to be a block device • Read/write consistency – A read must return the latest written data • Storage state uniqueness – Replaying the logs must not change the history of states • Durability of flushed data – Flushed write IOs must be made persistent • Crash recovery without undo – The data must be recoverable using redo logs only
  26. WalB algorithm • Two IO processing methods – Easy: very simple, large overhead – Fast: a bit complex, small overhead • Overlapped IO serialization • Flush/FUA for durability • Crash recovery and checkpointing
  27. IO processing flow (easy algorithm) • Write: the IO is packed and its log IO is submitted; after the log IO completes, the log is flushed, and overlapped IOs are done, the data IO is submitted; the WalB write IO response is returned once the data IO completes • Read: submitted to and completed on the data device directly – [timing diagram]
  28. IO processing flow (fast algorithm) • Write: the IO is packed, inserted into the pending data (pdata), and its log IO is submitted; the WalB write IO response is returned when the log IO completes; the data IO is submitted after the log is flushed and overlapped IOs are done, and the pdata entry is deleted when the data IO completes • Read: data that overlaps the pending data is copied from pdata; the rest is read from the data device – [timing diagram]
  29. Pending data • A red-black tree – provided as a kernel library • Sorted by IO address – to find overlapped IOs quickly • A spinlock – for exclusive access – [diagram: nodes holding address, size, and data]
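A rough userspace sketch of the pending-data lookup is shown below: entries are kept ordered by IO address so a search for overlaps can stop early. The real driver uses a kernel red-black tree under a spinlock; here a sorted array stands in for the tree, and all names are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

struct pending_io {
    uint64_t addr;   /* start sector */
    uint32_t size;   /* length in sectors */
    /* the actual driver also keeps the written data here */
};

static bool ranges_overlap(uint64_t a, uint32_t alen, uint64_t b, uint32_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Return the index of the first entry (in address order) that overlaps
 * [addr, addr+size), or -1 if none does. With a real tree this would be a
 * logarithmic descent instead of a linear scan. */
static int find_overlap(const struct pending_io *ios, int n,
                        uint64_t addr, uint32_t size)
{
    for (int i = 0; i < n; i++) {
        if (ios[i].addr >= addr + size)
            break;                 /* sorted: no later entry can overlap */
        if (ranges_overlap(ios[i].addr, ios[i].size, addr, size))
            return i;
    }
    return -1;
}
```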
  30. Overlapped IO serialization • Required for storage state uniqueness • Oldata (overlapped data) – a structure similar to pdata – a counter for each IO – a FIFO constraint – [timing diagram: a data IO waits for notices from earlier overlapped IOs before being submitted, and sends a notice when it completes]
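One way to picture the per-IO counter and the FIFO constraint is the sketch below: a new write counts the earlier overlapping IOs still in flight, and each completion decrements the counters of later overlapping IOs, which may be submitted once their counter reaches zero. The structure and names are illustrative, not the driver's actual implementation.

```c
#include <stdint.h>
#include <stdbool.h>

struct ol_io {
    uint64_t addr;
    uint32_t size;
    int n_waiting;          /* earlier overlapping IOs still in flight */
    bool done;
};

static bool overlap(const struct ol_io *a, const struct ol_io *b)
{
    return a->addr < b->addr + b->size && b->addr < a->addr + a->size;
}

/* 'ios' is kept in arrival (FIFO) order; index order == submission order. */

static void ol_insert(struct ol_io *ios, int idx)      /* new IO placed at ios[idx] */
{
    ios[idx].n_waiting = 0;
    ios[idx].done = false;
    for (int i = 0; i < idx; i++)
        if (!ios[i].done && overlap(&ios[i], &ios[idx]))
            ios[idx].n_waiting++;
    /* ios[idx] may go to the data device immediately iff n_waiting == 0 */
}

static void ol_complete(struct ol_io *ios, int n, int idx)  /* ios[idx] finished */
{
    ios[idx].done = true;
    for (int i = idx + 1; i < n; i++)
        if (overlap(&ios[i], &ios[idx]) && ios[i].n_waiting > 0)
            ios[i].n_waiting--;        /* submit ios[i] when its counter hits 0 */
}
```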
  31. Flush/FUA for durability • The condition for a log to be persistent – the log itself and all logs before it are persistent • REQ_FLUSH – property to guarantee: all write IOs submitted before the flush IO are persistent – what WalB does: set the FLUSH flag on the corresponding log IOs • REQ_FUA – property to guarantee: the FUA IO is persistent – what WalB does: set the FLUSH and FUA flags on the corresponding log IOs • Neither FLUSH nor FUA is required for data device IOs – data device persistence is guaranteed by checkpointing
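The flag mapping described on this slide can be summarized as a small function. The bit values below are stand-ins rather than the kernel's REQ_* definitions, and the function is only a sketch of the rule: a FLUSH request maps to a flushed log IO, and a FUA request maps to a log IO that is both flushed and forced to stable storage.

```c
#include <stdint.h>

#define IO_FLUSH (1u << 0)      /* stand-in flag values, not kernel REQ_* bits */
#define IO_FUA   (1u << 1)

static uint32_t log_io_flags(uint32_t write_io_flags)
{
    uint32_t f = 0;
    if (write_io_flags & IO_FLUSH)
        f |= IO_FLUSH;                 /* flush the log device as well */
    if (write_io_flags & IO_FUA)
        f |= IO_FLUSH | IO_FUA;        /* force the log record itself down */
    /* data-device IOs get neither flag: checkpointing covers them */
    return f;
}
```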
  32. Crash recovery and checkpointing • Crash recovery – A crash may leave recent write IOs non-persistent in the data device – Generate write IOs from the recent logs and execute them on the data device • Checkpointing – Periodically sync the data device and update the superblock – The superblock contains information about the recent logs
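Redo-only recovery can be sketched as a simple replay loop: starting from the lsid recorded in the superblock, every valid log record is re-executed against the data device. The helpers and record layout below are hypothetical; the point is that only redo records from the log are needed, so no undo pass exists.

```c
#include <stdint.h>

struct record { uint64_t io_address; uint32_t io_size; const void *data; };

/* Assumed helpers for illustration only. */
extern int  read_log_record(uint64_t lsid, struct record *rec);            /* 0 = valid record */
extern void write_data_block(uint64_t addr, uint32_t size, const void *d); /* redo one write   */

static void redo_recovery(uint64_t checkpoint_lsid, uint64_t latest_lsid)
{
    for (uint64_t lsid = checkpoint_lsid; lsid < latest_lsid; lsid++) {
        struct record rec;
        if (read_log_record(lsid, &rec) != 0)
            break;                     /* invalid record: end of the valid log */
        write_data_block(rec.io_address, rec.io_size, rec.data);
    }
    /* after replay, sync the data device and advance the superblock lsid */
}
```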
  33. Contents • Motivation • Architecture • Alternative solutions • Algorithm • Performance evaluation • Summary
  34. Experimental environment • Host – CPU: Intel Core i7-3930K – Memory: DDR3-10600 32GB (8GB x 4) – OS: Ubuntu 12.04 x86_64 – Kernel: custom-built 3.2.24 • Storage HBA – Intel X79 internal SATA controller – SATA2 interfaces were used for the experiment
  35. Benchmark software • Self-developed IO benchmark for block devices – https://github.com/starpos/ioreth – Uses direct IO (O_DIRECT) to eliminate buffer cache effects • Parameters – Pattern: random/sequential – Mode: read/write – Block size: 512B, 4KB, 32KB, 256KB – Concurrency: 1-32
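For reference, this is roughly what a direct-IO access looks like in userspace: the device is opened with O_DIRECT and reads go through a suitably aligned buffer, so the page cache is bypassed. This is a generic sketch, not code from ioreth; the device path and block size are placeholders, and error handling is reduced to the essentials.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/sdb";        /* placeholder device */
    const size_t bs = 4096;              /* must suit the device's sector size */
    void *buf;

    if (posix_memalign(&buf, bs, bs) != 0)
        return 1;                        /* O_DIRECT needs an aligned buffer */

    int fd = open(dev, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = pread(fd, buf, bs, 0);   /* read block 0, bypassing the cache */
    printf("read %zd bytes\n", n);

    close(fd);
    free(buf);
    return 0;
}
```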
  36. Target storage devices • MEM – Memory block devices (self-implemented) • HDD – Seagate Barracuda 500GB (ST500DM002) – Up to 140MB/s • SSD – Intel 330 60GB (SSDSC2CT060A3) – Up to 250MB/s with SATA2
  37. Storage settings • Baseline – a raw storage device • Wrap – a self-developed simple wrapper – request/bio interfaces • WalB – easy/easy-ol/fast/fast-ol variants – request/bio interfaces – [diagram: baseline uses the data device directly; wrap stacks the wrap driver on the data device; WalB stacks the WalB driver on the log and data devices]
  38. WalB parameters • Log-flush-disabled experiments – Target storage: MEMs/HDDs/SSDs – Pdata size: 4-128MiB • Log-flush-enabled experiments – Target storage: HDDs/SSDs – Pdata size: 4-64MiB – Flush interval size: 4-32MiB – Flush interval period: 10ms, 100ms, 1s
  39. MEM 512B random: response – [charts: read and write response times vs. number of threads] • Smaller overhead with the bio interface than with the request interface • Serializing WalB’s write IOs seems to enlarge the response time as the number of threads increases
  40. MEM 512B random: IOPS – [charts: read and write IOPS vs. queue length] • Large overhead with the request interface • The pdata search seems to be the main overhead of walb-bio • The IOPS of walb decreases as the number of threads increases, due to a lower cache-hit ratio or longer spinlock wait times
  41. Pdata and oldata overhead – [chart: MEM 512B random write, kernel 3.2.24, walb req prototype; overhead of oldata and of pdata vs. queue length]
  42. HDD 4KB random: IOPS – [charts: read and write IOPS vs. queue length] • Negligible overhead • An IO scheduling effect was observed, especially with walb-req
  43. HDD 256KB sequential: Bps – [charts: read and write throughput vs. queue length] • The request interface is better • IO size is limited to 32KiB with the bio interface • Additional log header blocks decrease throughput
  44. SSD 4KB random: IOPS – [charts: read and write IOPS vs. queue length] • WalB performance is almost the same as that of wrap-req • The fast algorithm is better than the easy algorithm • IOPS overhead is large with a smaller number of threads
  45. SSD 4KB random: IOPS (two partitions in one SSD) – [charts: read and write IOPS vs. queue length] • Almost the same result as with two SSD drives • Half the throughput was observed • The bandwidth of a single SSD is the bottleneck
  46. SSD 256KB sequential: Bps – [charts: read and write throughput vs. queue length] • Non-negligible overhead was observed with queue length 1 • Fast is better than easy • The overhead of walb is larger with a smaller queue length
  47. Log-flush effects with HDDs – [charts: IOPS (4KB random write) and Bps (256KB sequential write) vs. queue length] • IO sorting for the data device is effective for random writes • Enough memory for pdata is required to minimize the log-flush overhead
  48. Log-flush effects with SSDs – [charts: IOPS (4KB random write) and Bps (32KB sequential write) vs. queue length] • A small flush interval is better • Log flush increases IOPS with a single thread • Overall, the log-flush effect is negligible
  49. Evaluation summary • WalB overhead – Non-negligible for writes with small concurrency – Log-flush overhead is large with HDDs, negligible with SSDs • Request vs. bio interface – bio is better, except for workloads with large IO sizes • Easy vs. fast algorithm – Fast is better
  50. Contents • Motivation • Architecture • Alternative solutions • Algorithm • Performance evaluation • Summary
  51. WalB summary • A wrapper block device driver for – incremental backup – asynchronous replication • Small performance overhead with – no persistent indexes – no undo logs – no fragmentation
  52. Current status • Version 1.0 – For Linux kernel 3.2+ and the x86_64 architecture – Userland tools are minimal • Improve userland tools – Faster extraction/application of logs – Logical/physical compression – Backup/replication managers • Submit kernel patches
  53. Future work • Add an all-zero flag to the log record format – to avoid storing all-zero blocks in the log device • Add bitmap management – to avoid full scans after a ring buffer overflow • (Support snapshot access) – by implementing pluggable persistent address indexes • (Support thin provisioning) – if a clever defragmentation algorithm becomes available
  54. Thank you for your attention! • GitHub repository – https://github.com/starpos/walb/ • Contact – Email: hoshino AT labs.cybozu.co.jp – Twitter: @starpoz (hashtag: #walbdev)