
An Efficient Backup and Replication of Storage


Talk slides for LinuxCon Japan 2013.



  1. An Efficient Backup and Replication of Storage
     Takashi HOSHINO, Cybozu Labs, Inc.
     2013-05-29 @ LinuxCon Japan (rev. 20130529a)
  2. Self-introduction
     • Takashi HOSHINO
       – Works at Cybozu Labs, Inc.
     • Technical interests
       – Database, storage, distributed algorithms
     • Current work
       – WalB: today's talk!
  3. Contents
     • Motivation
     • Architecture
     • Alternative solutions
     • Algorithm
     • Performance evaluation
     • Summary
  4. Contents
     • Motivation
     • Architecture
     • Alternative solutions
     • Algorithm
     • Performance evaluation
     • Summary
  5. Motivation
     • Backup and replication are vital
       – We have our own cloud infrastructure
       – Requiring high availability of customers' data
       – With cost-effective commodity hardware and software
     (Figure: operating data in the primary DC; backup data and replicated data in the
     secondary DC)
  6. Requirements
     • Functionality
       – Getting consistent diff data for backup/replication, achieving a small RPO
     • Performance
       – Without the usual full scans
       – Small overhead
     • Support for various kinds of data
       – Databases
       – Blob data
       – Full-text-search indexes
       – ...
  7. A solution: WalB
     • A Linux kernel device driver that provides wrapper block devices, plus related
       userland tools
     • Provides consistent diff extraction for efficient backup and replication
     • "WalB" means "block-level WAL"
     (Figure: diffs flow from WalB storage to backup storage and replicated storage)
  8. WAL (Write-Ahead Logging)
     (Figure: the sequence "write at 0", "write at 2", "read at 2", "write at 2" over time
     on ordinary storage, which overwrites blocks in place, vs. WAL storage, which appends
     every write to a log)
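The contrast on this slide can be sketched in a few lines. This is a toy model (not WalB code): both devices accept the same writes, but the WAL device keeps an append-only history instead of overwriting in place.

```python
# Illustrative sketch: in-place writes vs. write-ahead logging on a toy
# block device with integer block addresses.

class OrdinaryStorage:
    """Writes overwrite blocks in place; a read returns the current block."""
    def __init__(self):
        self.blocks = {}

    def write(self, addr, data):
        self.blocks[addr] = data

    def read(self, addr):
        return self.blocks.get(addr)

class WalStorage:
    """Every write is appended to a log; a read finds the latest record."""
    def __init__(self):
        self.log = []  # append-only sequence of (addr, data) records

    def write(self, addr, data):
        self.log.append((addr, data))

    def read(self, addr):
        # The latest record for the address wins.
        for rec_addr, data in reversed(self.log):
            if rec_addr == addr:
                return data
        return None

ord_s, wal_s = OrdinaryStorage(), WalStorage()
for s in (ord_s, wal_s):
    s.write(0, "a")
    s.write(2, "b")
    s.write(2, "c")  # overwrite: in place vs. a new log record

assert ord_s.read(2) == wal_s.read(2) == "c"  # reads agree
assert len(wal_s.log) == 3                    # WAL keeps the full write history
```

The retained history is exactly what makes diff extraction cheap on the next slide.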
  9. How to get diffs
     • Full-scan and compare
     • Partial scan with bitmaps (or indexes)
     • Scan logs directly from WAL storage
     (Figure: data "a b c d e" at t0 vs. "a b' c d e'" at t1; the diffs from t0 to t1 are
     b' at address 1 and e' at address 4, obtainable by each of the three methods)
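The three methods on this slide can be compared on the slide's own example. A toy sketch, where lists model the block contents at t0 and t1 (the bitmap and log contents are assumptions matching the figure):

```python
# Block contents at two points in time (from the slide's figure).
t0 = ["a", "b", "c", "d", "e"]
t1 = ["a", "b'", "c", "d", "e'"]
log = [(1, "b'"), (4, "e'")]               # WAL records written between t0 and t1
dirty = [False, True, False, False, True]  # bitmap maintained at write time

# 1) Full-scan and compare: reads both full images.
diff_scan = {i: t1[i] for i in range(len(t0)) if t0[i] != t1[i]}

# 2) Partial scan with a bitmap: read only the blocks marked dirty.
diff_bitmap = {i: t1[i] for i, d in enumerate(dirty) if d}

# 3) Scan the logs directly: no access to the data images at all.
diff_log = dict(log)

assert diff_scan == diff_bitmap == diff_log == {1: "b'", 4: "e'"}
```

All three agree on the diff; only the third avoids touching the data volume, which is why WalB extracts diffs from the log device.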
  10. Contents
      • Motivation
      • Architecture
      • Alternative solutions
      • Algorithm
      • Performance evaluation
      • Summary
  11. WalB architecture
      (Figure: a walb device wraps two block devices: a data device, which keeps no
      special format, and a log device, which uses an original format. Any application
      such as a file system or DBMS reads and writes through the walb device; a walb-log
      extractor reads the logs, and a walb device controller controls the driver)
  12. Log device format
      • Superblock (one physical block) and metadata, a 4KiB unused area, then the
        ring buffer
      • Log address = (log sequence id) % (ring buffer size) + (ring buffer start address)
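The slide's address mapping is simple modular arithmetic. A minimal sketch, where the ring start address and ring size are made-up values for illustration:

```python
RING_START = 16   # ring buffer start address in blocks (assumed value)
RING_SIZE = 1024  # ring buffer size in blocks (assumed value)

def log_address(lsid):
    """Map a log sequence id to an address on the log device."""
    return lsid % RING_SIZE + RING_START

assert log_address(0) == 16
assert log_address(1023) == 16 + 1023
assert log_address(1024) == 16  # the ring wraps: the oldest logs get overwritten
```

The wrap-around is why the ring buffer must be large enough to hold all logs between two backup points (slide 16).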
  13. Ring buffer inside
      • The ring buffer holds logpacks, from the oldest to the latest
      • Each logpack is a logpack header block followed by the written data
      • The header block holds the logpack lsid, the number of records, the total IO size,
        a checksum, and the log records (the IO address and IO size of each write)
  14. Redo/undo logs
      • Redo logs
        – to go forward
      • Undo logs
        – to go backward
      (Figure: redo logs transform the data at t0 into the data at t1; undo logs transform
      the data at t1 back into the data at t0)
  15. How to create redo/undo logs
      • Undo logs require additional read IOs
      • Walb devices do not generate undo logs
      (Figure: generating redo logs: (1) write IOs are submitted, (2) redo logs are
      written. Generating undo logs: (1) write IOs are submitted, (2) the current data is
      read, (3) undo logs are written)
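The extra read IO that undo logging needs shows up clearly in a toy model. This sketch (a dict stands in for the data device, not WalB code) records the new value for redo but must first read the old value for undo:

```python
device = {0: "old0", 2: "old2"}  # toy data device: address -> block content

def write_with_redo(dev, addr, data, log):
    log.append(("redo", addr, data))  # write the redo log: new data only
    dev[addr] = data

def write_with_undo(dev, addr, data, log):
    current = dev[addr]                  # extra read IO of the current data
    log.append(("undo", addr, current))  # write the undo log: old data
    dev[addr] = data

redo_log, undo_log = [], []
write_with_redo(device, 0, "new0", redo_log)
write_with_undo(device, 2, "new2", undo_log)

# Redo logs roll a t0 image forward; undo logs roll a t1 image back.
assert redo_log == [("redo", 0, "new0")]
assert undo_log == [("undo", 2, "old2")]
```

Skipping the undo path is how walb devices avoid the read-before-write penalty on every write.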
  16. Consistent full backup
      • The ring buffer must have enough capacity to store the logs generated from
        t0 to t1
      (Figure: (A) read the online volume on the primary host between t0 and t1; the image
      is inconsistent. (B) Get the logs from t0 to t1. (C) Apply the logs to the full
      archive on the backup host to get a consistent image at t1)
  17. Consistent incremental backup
      • To manage multiple backup generations, defer log application or generate undo
        logs during application
      (Figure: (A) get the logs from the previous backup timestamp t0 up to t1. (B) Apply
      them to the full archive on the backup host; application can be deferred)
  18. Backup and replication
      (Figure: (A) the primary host sends the logs from t0 to t1 to the backup host, which
      applies them to its full archive; (A') the logs are also forwarded to a remote host,
      which (B/B') applies them to its own full archive with some replication delay)
  19. Contents
      • Motivation
      • Architecture
      • Alternative solutions
      • Algorithm
      • Performance evaluation
      • Summary
  20. Alternative solutions
      • DRBD
        – Specialized for replication
        – DRBD Proxy is required for long-distance replication
      • dm-snap
        – Snapshot management using COW (copy-on-write)
        – Full scans are required to get diffs
      • dm-thin
        – Snapshot management using COW and reference counters
        – Fragmentation is inevitable
  21. Alternatives comparison
      • Capability rows: incremental backup, synchronous replication, asynchronous
        replication
      • Performance rows (WalB / DRBD / dm-snap / dm-thin):
        – Read response overhead: negligible / negligible / search idx / search idx
        – Write response overhead: write log instead of data / send IOs to slaves
          (async repl.) / modify idx (+COW) / modify idx (+COW)
        – Fragmentation: never / never / never (original lv) / inevitable
  22. WalB pros and cons
      • Pros
        – Small response overhead
        – Fragmentation never occurs
        – Logs can be retrieved with sequential scans
      • Cons
        – 2x bandwidth is required for writes
  23. WalB source code statistics
      (Figure: lines-of-code comparison of file systems and block device wrappers)
  24. Contents
      • Motivation
      • Architecture
      • Alternative solutions
      • Algorithm
      • Performance evaluation
      • Summary
  25. Requirements to be a block device
      • Read/write consistency
        – It must read the latest written data
      • Storage state uniqueness
        – It must replay logs without changing the history
      • Durability of flushed data
        – It must make flushed write IOs persistent
      • Crash recovery without undo
        – It must recover the data using redo logs only
  26. WalB algorithm
      • Two IO processing methods
        – Easy: very simple, large overhead
        – Fast: a bit complex, small overhead
      • Overlapped IO serialization
      • Flush/fua for durability
      • Crash recovery and checkpointing
  27. IO processing flow (easy algorithm)
      (Figure: a write IO is packed and submitted to the log device; after the log IO
      completes, the device waits until the log is flushed and overlapped IOs are done,
      then submits the data IO; the WalB write IO response is returned when the data IO
      completes. A read IO is simply passed through to the data device)
  28. IO processing flow (fast algorithm)
      (Figure: a write IO is packed, inserted into the pending data (pdata), and submitted
      to the log device; the WalB write IO response is returned as soon as the log IO
      completes. The data IO is submitted after the log is flushed and overlapped IOs are
      done, and the pdata entry is deleted when it completes. A read IO is served by
      copying from pdata where possible, falling back to the data device)
  29. Pending data
      • A red-black tree
        – provided as a kernel library
      • Sorted by IO address
        – to find overlapped IOs quickly
      • A spinlock
        – for exclusive accesses
      (Figure: tree nodes, each holding an address, a size, and the data)
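The pending-data lookup can be sketched as follows. The driver uses the kernel's red-black tree; here a sorted list kept with `bisect` stands in for the address-ordered index, and the overlap check scans linearly for clarity (the real tree narrows the search in O(log n)):

```python
import bisect

pdata = []  # sorted list of (addr, size, data) entries, ordered by address

def pdata_insert(addr, size, data):
    """Insert a pending write, keeping the list sorted by address."""
    bisect.insort(pdata, (addr, size, data))

def find_overlaps(addr, size):
    """Return entries whose block range [addr, addr+size) overlaps the query."""
    return [(a, s, d) for a, s, d in pdata
            if a < addr + size and addr < a + s]

pdata_insert(0, 4, "w0")
pdata_insert(8, 4, "w1")
pdata_insert(10, 4, "w2")

assert find_overlaps(2, 4) == [(0, 4, "w0")]
assert find_overlaps(9, 2) == [(8, 4, "w1"), (10, 4, "w2")]
assert find_overlaps(4, 4) == []  # adjacent ranges do not overlap
```

The same overlap test underpins both pdata reads (slide 28) and the overlapped IO serialization on the next slide.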
  30. Overlapped IO serialization
      • Required for storage state uniqueness
      • Oldata (overlapped data)
        – similar to pdata
        – a counter for each IO
        – FIFO constraint
      (Figure: a data IO is inserted into oldata, waits until the overlapped IOs it was
      notified about are done, then is submitted; on completion its oldata entry is
      deleted and notices are sent to waiters)
  31. Flush/fua for durability
      • The condition for a log to be persistent
        – All logs before it, and the log itself, are persistent
      • Neither FLUSH nor FUA is required for data device IOs
        – Data device persistence is guaranteed by checkpointing
      • REQ_FLUSH
        – Property to guarantee: all write IOs submitted before the flush IO are
          persistent
        – What WalB does: set the FLUSH flag on the corresponding log IOs
      • REQ_FUA
        – Property to guarantee: the fua IO is persistent
        – What WalB does: set the FLUSH and FUA flags on the corresponding log IOs
  32. Crash recovery and checkpointing
      • Crash recovery
        – A crash can leave recent write IOs not persistent in the data device
        – Generate write IOs from the recent logs and execute them on the data device
      • Checkpointing
        – Sync the data device and update the superblock periodically
        – The superblock contains information about the recent logs
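Redo-only recovery is a short loop. A minimal sketch, assuming the superblock records the lsid up to which the data device is known to be synced (the checkpoint), and a dict stands in for the data device:

```python
def recover(data_dev, log, checkpoint_lsid):
    """Re-execute logged writes at and after the checkpoint on the data device.
    Replaying already-applied writes is harmless: redo is idempotent."""
    for lsid, addr, value in log:
        if lsid >= checkpoint_lsid:
            data_dev[addr] = value
    return data_dev

# The data device lost the writes after the last checkpoint (lsid 2) in a crash.
data_dev = {0: "a", 1: "b"}
log = [(1, 0, "a"), (2, 1, "b"), (3, 1, "b2"), (4, 2, "c")]
assert recover(data_dev, log, 2) == {0: "a", 1: "b2", 2: "c"}
```

Because redo records carry the new data, no undo information is needed, matching the "crash recovery without undo" requirement on slide 25.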
  33. Contents
      • Motivation
      • Architecture
      • Alternative solutions
      • Algorithm
      • Performance evaluation
      • Summary
  34. Experimental environment
      • Host
        – CPU: Intel Core i7 3930K
        – Memory: DDR3-10600 32GB (8GB x 4)
        – OS: Ubuntu 12.04 x86_64
        – Kernel: custom build 3.2.24
      • Storage HBA
        – Intel X79 internal SATA controller
        – Using SATA2 interfaces for the experiment
  35. Benchmark software
      • Self-developed IO benchmark for block devices
        –
        – uses DIRECT_IO to eliminate buffer cache effects
      • Parameters
        – Pattern: random/sequential
        – Mode: read/write
        – Block size: 512B, 4KB, 32KB, 256KB
        – Concurrency: 1-32
  36. Target storage devices
      • MEM
        – Memory block devices (self-implemented)
      • HDD
        – Seagate Barracuda 500GB (ST500DM002)
        – Up to 140MB/s
      • SSD
        – Intel 330 60GB (SSDSC2CT060A3)
        – Up to 250MB/s with SATA2
  37. Storage settings
      • Baseline
        – A raw storage device
      • Wrap
        – Self-developed simple wrapper
        – Request/bio interfaces
      • WalB
        – Easy/easy-ol/fast/fast-ol
        – Request/bio interfaces
      (Figure: baseline uses the data device directly; wrap adds the wrap driver on top;
      WalB adds the WalB driver with a log device and a data device)
  38. WalB parameters
      • Log-flush-disabled experiments
        – Target storage: MEMs/HDDs/SSDs
        – Pdata size: 4-128MiB
      • Log-flush-enabled experiments
        – Target storage: HDDs/SSDs
        – Pdata size: 4-64MiB
        – Flush interval size: 4-32MiB
        – Flush interval period: 10ms, 100ms, 1s
  39. MEM 512B random: response
      (Graphs: read and write response times vs. queue length)
      • Smaller overhead with the bio interface than with the request interface
      • Serializing WalB's write IOs seems to enlarge the response time as the number of
        threads increases
  40. MEM 512B random: IOPS
      (Graphs: read and write IOPS vs. queue length)
      • Large overhead with the request interface
      • The pdata search seems to be the overhead of walb-bio
      • IOPS of walb decreases as the number of threads increases, due to a decreasing
        cache-hit ratio or increasing spinlock wait time
  41. Pdata and oldata overhead
      (Graph: MEM 512B random write vs. queue length, kernel 3.2.24, walb req prototype,
      showing the overhead of oldata and the overhead of pdata)
  42. HDD 4KB random: IOPS
      (Graphs: read and write IOPS vs. queue length)
      • Read: negligible overhead
      • Write: an IO scheduling effect was observed, especially with walb-req
  43. HDD 256KB sequential: Bps
      (Graphs: read and write throughput vs. queue length)
      • The request interface is better
      • IO size is limited to 32KiB with the bio interface
      • Additional log header blocks decrease throughput
  44. SSD 4KB random: IOPS
      (Graphs: read and write IOPS vs. queue length)
      • WalB performance is almost the same as that of wrap-req
      • The fast algorithm is better than the easy algorithm
      • IOPS overhead is large with a smaller number of threads
  45. SSD 4KB random: IOPS (two partitions in a SSD)
      (Graphs: read and write IOPS vs. queue length)
      • Almost the same result as with two SSD drives
      • Half the throughput was observed
      • The bandwidth of a single SSD is the bottleneck
  46. SSD 256KB sequential: Bps
      (Graphs: read and write throughput vs. queue length)
      • Non-negligible overhead was observed with queue length 1
      • Fast is better than easy
      • Larger overhead of walb with smaller queue lengths
  47. Log-flush effects with HDDs
      (Graphs: IOPS for 4KB random write and Bps for 256KB sequential write, vs. queue
      length)
      • IO sorting for the data device is effective for random writes
      • Enough memory for pdata is required to minimize the log-flush overhead
  48. Log-flush effects with SSDs
      (Graphs: IOPS for 4KB random write and Bps for 32KB sequential write, vs. queue
      length)
      • A small flush interval is better
      • Log flush increases IOPS with a single thread
      • The log-flush effect is negligible
  49. Evaluation summary
      • WalB overhead
        – Non-negligible for writes with small concurrency
        – Log-flush overhead is large with HDDs, negligible with SSDs
      • Request vs. bio interface
        – bio is better, except for workloads with a large IO size
      • Easy vs. fast algorithm
        – Fast is better
  50. Contents
      • Motivation
      • Architecture
      • Alternative solutions
      • Algorithm
      • Performance evaluation
      • Summary
  51. WalB summary
      • A wrapper block device driver for
        – incremental backup
        – asynchronous replication
      • Small performance overhead, with
        – no persistent indexes
        – no undo logs
        – no fragmentation
  52. Current status
      • Version 1.0
        – For Linux kernel 3.2+ and the x86_64 architecture
        – Userland tools are minimal
      • Improve userland tools
        – Faster extraction/application of logs
        – Logical/physical compression
        – Backup/replication managers
      • Submit kernel patches
  53. Future work
      • Add an all-zero flag to the log record format
        – to avoid storing all-zero blocks in the log device
      • Add bitmap management
        – to avoid full scans on ring buffer overflow
      • (Support snapshot access)
        – by implementing pluggable persistent address indexes
      • (Support thin provisioning)
        – if a clever defragmentation algorithm becomes available
  54. Thank you for your attention!
      • GitHub repository:
        –
      • Contact:
        – Email: hoshino AT
        – Twitter: @starpoz (hashtag: #walbdev)