An Efficient Backup and Replication of Storage


Talk slides for LinuxCon Japan 2013.


  • 1. An Efficient Backup and Replication of Storage
    Takashi HOSHINO, Cybozu Labs, Inc.
    2013-05-29 @ LinuxCon Japan (rev.20130529a)
  • 2. Self-introduction
    – Takashi HOSHINO, works at Cybozu Labs, Inc.
    – Technical interests: database, storage, distributed algorithms
    – Current work: WalB (today's talk!)
  • 3. Contents
    – Motivation
    – Architecture
    – Alternative solutions
    – Algorithm
    – Performance evaluation
    – Summary
  • 4. Contents (section marker: Motivation)
  • 5. Motivation
    – Backup and replication are vital
    – We have our own cloud infrastructure
    – Requiring high availability of customers' data
    – With cost-effective commodity hardware and software
    [Figure: operating data in the primary DC, with backup data and replicated data in a secondary DC]
  • 6. Requirements
    – Functionality: getting consistent diff data for backup/replication, achieving a small RPO
    – Performance: no routine full scans; small overhead
    – Support for various kinds of data: databases, blob data, full-text-search indexes, ...
  • 7. A solution: WalB
    – A Linux kernel device driver that provides wrapper block devices, plus related userland tools
    – Provides consistent diff extraction for efficient backup and replication
    – "WalB" means "Block-level WAL"
    [Figure: a WalB storage ships diffs to backup storage and replicated storage]
  • 8. WAL (Write-Ahead Logging)
    [Figure: ordinary storage overwrites blocks in place (write at 0, write at 2, read at 2, write at 2), while WAL storage appends every write to a log over time]
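The idea on this slide can be sketched as a toy model: every write appends an (address, data) record to a log, and the current content of a block is whatever the newest log record for that address says. This is an illustration of block-level WAL in general, not WalB driver code; all names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Toy block-level WAL: writes append to a log; a read returns the
 * newest log record for the address. */
struct log_rec { unsigned addr; int data; };

struct wal {
    struct log_rec log[64];  /* append-only log              */
    size_t n;                /* number of records so far     */
};

static void wal_write(struct wal *w, unsigned addr, int data)
{
    w->log[w->n].addr = addr;
    w->log[w->n].data = data;
    w->n++;
}

/* Scan backwards so that the newest record wins. */
static int wal_read(const struct wal *w, unsigned addr, int missing)
{
    for (size_t i = w->n; i > 0; i--)
        if (w->log[i - 1].addr == addr)
            return w->log[i - 1].data;
    return missing;
}
```

Note that a read may have to scan the log; this is why WalB keeps a separate data device for reads and uses the log only for diff extraction and recovery.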
  • 9. How to get diffs
    – Full-scan and compare
    – Partial-scan with bitmaps (or indexes)
    – Scan logs directly from WAL storage
    [Figure: data at t0 vs data at t1, and the three ways to derive the diffs from t0 to t1]
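The "partial-scan with bitmaps" option can be sketched as follows: keep a dirty bit per block, set it on every write, and at extraction time read only the blocks whose bit is set. This is an illustration of the approach, not code from WalB; the names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Collect the addresses of blocks changed since the last backup,
 * using a per-block dirty bitmap. Returns the number of dirty blocks. */
static size_t collect_dirty_blocks(const unsigned char *bitmap,
                                   size_t nblocks, size_t *out_addrs)
{
    size_t n = 0;
    for (size_t i = 0; i < nblocks; i++)
        if (bitmap[i])          /* block i was written since t0 */
            out_addrs[n++] = i;
    return n;
}
```

A bitmap avoids the full compare but still costs a scan over the bitmap itself and read IOs for the dirty blocks; scanning the WAL directly avoids even that, which is the option WalB takes.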
  • 10. Contents (section marker: Architecture)
  • 11. WalB architecture
    – A walb device acts as a wrapper over two block devices: one for data (the data device) and one for logs (the log device)
    – Any application (file system, DBMS, etc.) issues reads and writes to the walb device
    – The data device keeps no special format; the log device uses an original format
    – Userland tools: the walb log extractor and the walb device controller
  • 12. Log device format
    – Superblock (one physical block), a 4 KiB unused area, metadata, and a ring buffer
    – Log address = (log sequence id) % (ring buffer size) + (ring buffer start address)
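The log address formula on this slide can be written out directly as a function. This is a minimal sketch with illustrative parameter names; the real driver takes the ring buffer size and start address from the superblock.

```c
#include <assert.h>
#include <stdint.h>

/* log address = (log sequence id) % (ring buffer size)
 *             + (ring buffer start address)               */
static uint64_t log_addr(uint64_t log_seq_id,
                         uint64_t ring_buf_size,
                         uint64_t ring_buf_start)
{
    return log_seq_id % ring_buf_size + ring_buf_start;
}
```

Because the lsid grows monotonically while the address wraps via the modulo, old logs are overwritten in place once the ring buffer fills, which is why later slides stress that the ring buffer must be large enough to hold the logs between two backup points.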
  • 13. Inside the ring buffer
    – The ring buffer holds log packs, from the oldest to the latest
    – Each log pack is a logpack header block followed by the written data
    – The header block holds a checksum, the logpack lsid, the number of records, and the total IO size
    – Each log record holds an IO address, an IO size, ...
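The fields named on this slide can be sketched as C structs. This is a hypothetical layout for illustration only, based solely on the field names above; the actual on-disk format in the WalB driver differs in detail.

```c
#include <assert.h>
#include <stdint.h>

struct log_record {
    uint64_t io_address;    /* address of the write IO      */
    uint32_t io_size;       /* size of the write IO         */
};

struct logpack_header {
    uint32_t checksum;      /* checksum of the header block */
    uint64_t logpack_lsid;  /* lsid of this logpack         */
    uint16_t n_records;     /* number of log records        */
    uint32_t total_io_size; /* sum of the record IO sizes   */
};

/* The total IO size is redundant with the records themselves: */
static uint32_t sum_io_sizes(const struct log_record *rec, uint16_t n)
{
    uint32_t total = 0;
    for (uint16_t i = 0; i < n; i++)
        total += rec[i].io_size;
    return total;
}
```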
  • 14. Redo/undo logs
    – Redo logs: to go forward (from the data at t0 to the data at t1)
    – Undo logs: to go backward (from the data at t1 to the data at t0)
  • 15. How to create redo/undo logs
    – Generating redo logs: (1) write IOs are submitted, (2) the redo logs are written
    – Generating undo logs: (1) write IOs are submitted, (2) the current data is read, (3) the undo logs are written
    – Undo logs require additional read IOs, so walb devices do not generate undo logs
  • 16. Consistent full backup
    – (A) Read the online volume from t0 to t1 (this image alone is inconsistent)
    – (B) Get the logs from t0 to t1
    – (C) Apply the logs to the full archive on the backup host to get a consistent image at t1
    – The ring buffer must have enough capacity to store the logs generated from t0 to t1
  • 17. Consistent incremental backup
    – (A) Get the logs from the previous backup timestamp t0 to t1
    – (B) Apply them to the full archive on the backup host; application can be deferred
    – To manage multiple backup generations, defer application or generate undo logs during application
  • 18. Backup and replication
    – (A) Get the logs from t0 to t1 and apply them to the full archive on the backup host
    – (A') Replicate the same logs to a remote host, then (B, B') apply them on both hosts
    – The lag between the backup host and the remote host is the replication delay
  • 19. Contents (section marker: Alternative solutions)
  • 20. Alternative solutions
    – DRBD: specialized for replication; DRBD Proxy is required for long-distance replication
    – dm-snap: snapshot management using COW (copy-on-write); full scans are required to get diffs
    – dm-thin: snapshot management using COW and reference counters; fragmentation is inevitable
  • 21. Alternatives comparison
                           WalB            DRBD             dm-snap         dm-thin
    Capability             incr. backup,   sync/async       incr. backup    incr. backup
                           async repl.     replication
    Read resp. overhead    negligible      negligible       search idx      search idx
    Write resp. overhead   write log       send IOs to      modify idx      modify idx
                           instead of data slaves (async)   (+COW)          (+COW)
    Fragmentation          never           never            never           inevitable
                                                            (original lv)
  • 22. WalB pros and cons
    – Pros: small response overhead; fragmentation never occurs; logs can be retrieved with sequential scans
    – Cons: 2x bandwidth is required for writes
  • 23. WalB source code statistics
    [Chart: WalB's code size compared with file systems and other block device wrappers]
  • 24. Contents (section marker: Algorithm)
  • 25. Requirements to be a block device
    – Read/write consistency: it must read the latest written data
    – Storage state uniqueness: it must replay logs without changing the history
    – Durability of flushed data: it must make flushed write IOs persistent
    – Crash recovery without undo: it must recover the data using redo logs only
  • 26. WalB algorithm
    – Two IO processing methods: easy (very simple, large overhead) and fast (a bit complex, small overhead)
    – Overlapped IO serialization
    – Flush/FUA for durability
    – Crash recovery and checkpointing
  • 27. IO processing flow (easy algorithm)
    – Write: submitted, packed, log submitted, log completed, wait for the log to be flushed and for overlapped IOs to finish, data submitted, data completed; the WalB write IO response spans both the log IO and the data IO responses
    – Read: submitted, data submitted, data completed; the response is just the data IO response
  • 28. IO processing flow (fast algorithm)
    – Write: as in the easy algorithm, but the write data is also inserted into pending data (pdata) when packed and deleted from it after the data IO completes
    – Read: the data is copied from pdata when possible, so the data device IO may be skipped
  • 29. Pending data
    – A red-black tree, provided as a kernel library
    – Sorted by IO address, to find overlapped IOs quickly
    – A spinlock for exclusive access
    [Figure: tree nodes each holding an address, a size, and the data]
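The core of "find overlapped IOs quickly" is an interval-overlap test over (address, size) pairs: two IOs overlap exactly when each one starts before the other ends. A minimal sketch of that predicate, with illustrative names rather than the driver's actual types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct io_range {
    uint64_t addr;   /* start address of the IO   */
    uint32_t size;   /* length of the IO          */
};

/* Half-open intervals [addr, addr + size) overlap iff each
 * starts strictly before the other ends. */
static bool io_overlaps(struct io_range a, struct io_range b)
{
    return a.addr < b.addr + b.size && b.addr < a.addr + a.size;
}
```

Because the tree is sorted by address, the driver only needs to test the few neighboring nodes whose ranges could satisfy this predicate instead of scanning all pending IOs.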
  • 30. Overlapped IO serialization
    – Required for storage state uniqueness
    – Oldata (overlapped data): similar to pdata, with a counter for each IO and a FIFO constraint
    – A write waits for the overlapped IOs before it: oldata inserted, got notice, data submitted, data completed, sent notice, oldata deleted
  • 31. Flush/FUA for durability
    – The condition for a log to be persistent: all logs before it, and the log itself, are persistent
    – REQ_FLUSH guarantees that all write IOs submitted before the flush IO are persistent; WalB sets the FLUSH flag on the corresponding log IOs
    – REQ_FUA guarantees that the FUA IO itself is persistent; WalB sets the FLUSH and FUA flags on the corresponding log IOs
    – Neither FLUSH nor FUA is required for data device IOs; data device persistence is guaranteed by checkpointing
  • 32. Crash recovery and checkpointing
    – Crash recovery: a crash can leave recent write IOs not persistent in the data device; generate write IOs from the recent logs and execute them on the data device
    – Checkpointing: sync the data device and update the superblock periodically; the superblock contains information about the recent logs
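The recovery step can be sketched as a toy redo replay: write records since the last checkpoint are simply re-executed against the data device, which is safe because redo records are plain idempotent writes. The int array standing in for the data device and all names are illustrative only, not WalB code.

```c
#include <assert.h>
#include <stddef.h>

struct redo_rec { size_t addr; int data; };

/* Re-issue every logged write against the data device, in log order.
 * Replaying a record twice gives the same result as once. */
static void redo_replay(int *data_dev,
                        const struct redo_rec *recs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data_dev[recs[i].addr] = recs[i].data;
}
```

Checkpointing bounds how much of this replay is needed: after a sync, the superblock records that everything up to some lsid is already in the data device, so recovery only replays the logs after that point.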
  • 33. Contents (section marker: Performance evaluation)
  • 34. Experimental environment
    – Host: Intel Core i7 3930K CPU; 32 GB DDR3-10600 memory (8 GB x 4); Ubuntu 12.04 x86_64; custom-built kernel 3.2.24
    – Storage HBA: Intel X79 internal SATA controller, using SATA2 interfaces for the experiment
  • 35. Benchmark software
    – A self-developed IO benchmark for block devices; it uses direct IO to eliminate buffer cache effects
    – Parameters: pattern (random/sequential), mode (read/write), block size (512 B, 4 KB, 32 KB, 256 KB), concurrency (1-32)
  • 36. Target storage devices
    – MEM: self-implemented memory block devices
    – HDD: Seagate Barracuda 500 GB (ST500DM002), up to 140 MB/s
    – SSD: Intel 330 60 GB (SSDSC2CT060A3), up to 250 MB/s with SATA2
  • 37. Storage settings
    – Baseline: a raw storage device
    – Wrap: a self-developed simple wrapper, with request and bio interfaces
    – WalB: easy/easy-ol/fast/fast-ol variants, with request and bio interfaces
    [Figure: baseline uses the data device directly; wrap adds the wrap driver; WalB adds the WalB driver over a log device and a data device]
  • 38. WalB parameters
    – Log-flush-disabled experiments: target storage MEMs/HDDs/SSDs; pdata size 4-128 MiB
    – Log-flush-enabled experiments: target storage HDDs/SSDs; pdata size 4-64 MiB; flush interval size 4-32 MiB; flush interval period 10 ms, 100 ms, 1 s
  • 39. MEM 512 B random: response
    [Charts: read and write response time vs queue length]
    – Smaller overhead with the bio interface than with the request interface
    – Serializing WalB's write IOs seems to enlarge response time as the number of threads increases
  • 40. MEM 512 B random: IOPS
    [Charts: read and write IOPS vs queue length]
    – Large overhead with the request interface
    – The pdata search seems to be the overhead of walb-bio
    – WalB's IOPS decreases as the number of threads increases, due to a lower cache-hit ratio or longer spinlock wait times
  • 41. Pdata and oldata overhead
    [Chart: MEM 512 B random write, kernel 3.2.24, walb req prototype; the gaps between curves show the overhead of oldata and of pdata vs queue length]
  • 42. HDD 4 KB random: IOPS
    [Charts: read and write IOPS vs queue length]
    – Read: negligible overhead
    – Write: an IO scheduling effect was observed, especially with walb-req
  • 43. HDD 256 KB sequential: Bps
    [Charts: read and write throughput vs queue length]
    – The request interface is better here: IO size is limited to 32 KiB with the bio interface
    – Additional log header blocks decrease throughput
  • 44. SSD 4 KB random: IOPS
    [Charts: read and write IOPS vs queue length]
    – WalB performance is almost the same as that of wrap-req
    – The fast algorithm is better than the easy algorithm
    – IOPS overhead is large with smaller numbers of threads
  • 45. SSD 4 KB random: IOPS (two partitions in one SSD)
    [Charts: read and write IOPS vs queue length]
    – Almost the same result as with two SSD drives, but half the throughput was observed: the bandwidth of the single SSD is the bottleneck
  • 46. SSD 256 KB sequential: Bps
    [Charts: read and write throughput vs queue length]
    – Non-negligible overhead was observed with queue length 1; WalB's overhead grows as the queue length shrinks
    – Fast is better than easy
  • 47. Log-flush effects with HDDs
    [Charts: IOPS for 4 KB random writes and Bps for 256 KB sequential writes vs queue length]
    – IO sorting for the data device is effective for random writes
    – Enough memory for pdata is required to minimize the log-flush overhead
  • 48. Log-flush effects with SSDs
    [Charts: IOPS for 4 KB random writes and Bps for 32 KB sequential writes vs queue length]
    – A small flush interval is better; log flushing increases IOPS with a single thread
    – Overall, the log-flush effect is negligible
  • 49. Evaluation summary
    – WalB overhead: non-negligible for writes with small concurrency; the log-flush overhead is large with HDDs and negligible with SSDs
    – Request vs bio interface: bio is better, except for workloads with large IO sizes
    – Easy vs fast algorithm: fast is better
  • 50. Contents (section marker: Summary)
  • 51. WalB summary
    – A wrapper block device driver for incremental backup and asynchronous replication
    – Small performance overhead, with no persistent indexes, no undo logs, and no fragmentation
  • 52. Current status
    – Version 1.0: for Linux kernel 3.2+ and the x86_64 architecture; userland tools are minimal
    – Plans: improve the userland tools (faster extraction/application of logs, logical/physical compression, backup/replication managers) and submit kernel patches
  • 53. Future work
    – Add an all-zero flag to the log record format, to avoid storing all-zero blocks in the log device
    – Add bitmap management, to avoid full scans on ring buffer overflow
    – (Support snapshot access) by implementing pluggable persistent address indexes
    – (Support thin provisioning) if a clever defragmentation algorithm becomes available
  • 54. Thank you for your attention!
    – GitHub repository:
    – Contact: email hoshino AT, Twitter @starpoz (hashtag: #walbdev)