bup backup system (2011-04)

Avery's presentation about bup at LinuxFest Northwest 2011 in Bellingham, WA.

Transcript

  • bup: the git-based backup system (Avery Pennarun, 2011-04-30)
  • The Challenge
    • Back up entire filesystems (> 1TB)
    • Including huge VM disk images (files > 100GB)
    • Lots of separate files (500k or more)
    • Calculate/store incrementals efficiently
    • Create backups in O(n), where n = number of changed bytes
    • Incremental backup direct to a remote computer (no local copy)
    • ...and don't forget metadata
  • The Result
    • bup is extremely fast: ~80 MB/sec in Python
    • Sub-file incrementals are very space efficient
      • >5x better than rsnapshot in real-life use
    • VMs compress smaller and faster than with gzip
    • Dedupe between different client machines
    • O(log n) seek times to any part of any file
    • You can mount your backup history as a filesystem or browse it as a web page
  • The Design
  • Why git?
    • Easily handles file renames
    • Easily deduplicates between identical files and trees
    • Debug my code using git commands
    • Someone already thought about the repo format (packs, idxes)
    • Three-level “work tree” vs “index” vs “commit”
  • Problem 1: Large files
    • 'git gc' explodes badly on large files; totally unusable
    • The git-bigfiles fork “solves” the problem by simply never deltifying large objects: lame
    • zlib window size is very small: lousy compression on VM images
  • Digression: zlib window size
    • gzip only does two things:
      • backref: copy some bytes from the preceding 64k window
      • huffman code: a dictionary of common words
    • That 64k window is a serious problem!
    • Duplicated data >64k apart can't be compressed
    • cat file.tar file.tar | gzip -c | wc -c
      • surprisingly, twice the size of a single tar.gz (see the sketch below)
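
The duplicated-tar effect is easy to reproduce. A minimal sketch using Python's standard zlib module (the variable names are mine, not from the talk): a block duplicated farther back than the deflate window compresses to roughly twice the size of a single copy, because backrefs can't reach that far.

    import os
    import zlib

    chunk = os.urandom(200 * 1024)              # one incompressible 200 kB "file"
    single = len(zlib.compress(chunk))
    double = len(zlib.compress(chunk + chunk))  # same data twice, far beyond the window
    print(single, double, round(double / single, 2))  # ratio comes out near 2.0
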
  • bupsplit
    • Uses a rolling checksum to --
  • Digression: rolling checksums
    • Popularized by rsync (Andrew Tridgell, the Samba guy)
    • He wrote a (readable) Ph.D. thesis about it
    • bup uses a variant of the rsync algorithm to --
  • Double Digression: rsync algorithm
    • First player:
      • Divide the file into fixed-size chunks
      • Send the list of all chunk checksums
    • Second player:
      • Look through existing files for any blocks that have those checksums
      • But any n-byte subsequence might be the match
      • Searching naively is about O(n^2)... ouch.
      • So we use a rolling checksum instead
  • Digression: rolling checksums (cont'd)
    • Calculate the checksum of bytes 0..n
    • Remove byte 0, add byte n+1, to get the checksum of bytes 1..n+1
      • And so on
    • Searching is now more like O(n)... vastly faster
    • Requires a special “rollable” checksum (adler32; see the sketch below)
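
To make the O(1) update concrete, here is a minimal sketch (my own illustration, not rsync's or bup's actual code; naive and roll are hypothetical helpers) of an adler32-style pair of running sums over a fixed window: s1 is the plain sum of the window's bytes, s2 a position-weighted sum, and rolling one byte forward updates both without rescanning the window. The assert checks the rolled value against a from-scratch recomputation.

    def naive(window):
        # recompute both sums from scratch: O(window) work
        s1 = sum(window) & 0xffff
        s2 = sum((len(window) - i) * b for i, b in enumerate(window)) & 0xffff
        return s1, s2

    def roll(s1, s2, dropped, added, window_len):
        # O(1) update: drop the oldest byte, append the newest one
        s1 = (s1 - dropped + added) & 0xffff
        s2 = (s2 - window_len * dropped + s1) & 0xffff
        return s1, s2

    data = bytes(range(256)) * 64
    WIN = 64
    s1, s2 = naive(data[:WIN])
    for pos in range(WIN, len(data)):
        s1, s2 = roll(s1, s2, data[pos - WIN], data[pos], WIN)
        assert (s1, s2) == naive(data[pos - WIN + 1:pos + 1])
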
  • Digression: gzip --rsyncable
    • You can't rsync gzipped files efficiently
    • Changing a byte early in a file changes the compression dictionary, so the rest of the file is different
    • --rsyncable resets the compression dictionary whenever the low bits of adler32 == 0 (see the sketch below)
    • Fraction of a percent overhead on file size
    • But now your gzip files are rsyncable!
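
The same trick can be approximated with Python's zlib. A minimal sketch, assuming nothing about gzip's actual --rsyncable code (rsyncable_compress and its window/mask parameters are made up for illustration): whenever the low bits of an adler32 over recent input hit zero, emit a Z_FULL_FLUSH, which resets the deflate dictionary so everything after that point compresses independently of earlier edits.

    import zlib

    def rsyncable_compress(data, window=4096, mask=0xfff):
        comp = zlib.compressobj()
        out, block = [], bytearray()
        for byte in data:
            block.append(byte)
            # content-defined reset point: low 12 bits of adler32 are zero
            # (recomputing adler32 per byte is slow; fine for illustration)
            if zlib.adler32(bytes(block[-window:])) & mask == 0:
                out.append(comp.compress(bytes(block)))
                out.append(comp.flush(zlib.Z_FULL_FLUSH))  # reset the dictionary
                block.clear()
        out.append(comp.compress(bytes(block)))
        out.append(comp.flush())
        return b"".join(out)
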
  • bupsplit
    • Based on gzip --rsyncable
    • Instead of a compression dictionary, we break the file into blocks on adler32 boundaries
      • If the low 13 checksum bits are all 1, end this chunk
      • Average chunk: 2**13 = 8192 bytes
    • Now we have a list of chunks and their sums
    • Inserting/deleting bytes changes at most two chunks! (see the sketch below)
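
A minimal sketch of that splitting rule (split_chunks and its WINDOW constant are my own; bup's real rollsum lives in C and differs in details): keep rolling sums over the last few bytes, and end a chunk whenever the low 13 bits of the checksum are all ones. Because boundaries depend only on nearby bytes, an insert in the middle of a file only disturbs the chunks around the edit.

    import os

    CHUNK_BITS = 13
    MASK = (1 << CHUNK_BITS) - 1
    WINDOW = 64

    def split_chunks(data):
        chunks, start = [], 0
        s1 = s2 = 0                      # rolling sums over the last WINDOW bytes
        for pos, byte in enumerate(data):
            if pos >= WINDOW:            # roll the oldest byte out of the window
                dropped = data[pos - WINDOW]
                s1 -= dropped
                s2 -= WINDOW * dropped
            s1 += byte
            s2 += s1
            if (s2 & MASK) == MASK:      # low 13 bits all 1: end the chunk here
                chunks.append(data[start:pos + 1])
                start = pos + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    blob = os.urandom(1 << 20)
    edited = blob[:500000] + b"hello" + blob[500000:]
    a, b = split_chunks(blob), split_chunks(edited)
    print(sum(len(c) for c in a) // len(a))         # average chunk size, roughly 8 kB
    print(len(set(a) & set(b)), "of", len(a), "chunks unchanged")
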
  • bupsplit trees
    • Inspired by the “Tiger tree” hashing used in some P2P systems
    • Arrange the chunk list into trees:
      • If the low 17 checksum bits are 1, end a superchunk
      • If the low 21 bits are 1, end a superduperchunk
      • and so on.
    • A superchunk boundary is also a chunk boundary
    • Inserting/deleting bytes changes at most 2*log(n) chunks! (see the sketch below)
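
A minimal sketch of the fanout rule (fanout_level and group_superchunks are hypothetical helpers, not bup's actual code): reuse the checksum value that ended each chunk and count how many extra groups of low bits are also all ones. Level 0 ends only a chunk, level 1 also ends a superchunk (17 bits), level 2 a superduperchunk (21 bits), and so on, so every higher-level boundary is automatically a chunk boundary too.

    CHUNK_BITS = 13    # 13 low bits of ones end a chunk
    FANOUT_BITS = 4    # each extra 4 bits of ones climbs one tree level

    def fanout_level(boundary_sum):
        level, bits = 0, CHUNK_BITS + FANOUT_BITS
        while (boundary_sum & ((1 << bits) - 1)) == (1 << bits) - 1:
            level += 1
            bits += FANOUT_BITS
        return level

    def group_superchunks(chunks_with_sums):
        # one level of the tree: close a superchunk when the boundary's
        # low 17 bits (13 + 4 extra) are all ones
        supers, current = [], []
        for chunk, boundary_sum in chunks_with_sums:
            current.append(chunk)
            if fanout_level(boundary_sum) >= 1:
                supers.append(current)
                current = []
        if current:
            supers.append(current)
        return supers
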
  • Advantages of bupsplit
    • Never loads the whole file into RAM
    • Compresses most VM images more (and faster) than gzip
    • Works well on binary and text files
    • Don't need to teach it about file formats
    • Diff huge files in about O(log n) time
    • Seek to any offset in a file in O(log n) time
  • Problem 2: Millions of objects
    • Plain git format:
      • 1TB of data / 8k per chunk: 122 million chunks
      • x 20 bytes per SHA1: 2.4GB of index
      • Divided into 2GB packs: 500 .idx files of 5MB each
      • 8-bit prefix lookup table in each .idx
    • Adding a new chunk means searching all 500 .idx files: log2(5MB) - 8 ≈ 14 binary-search hops each, about 7 page touches each, so roughly 500 x 7 = 3500 pages (the size arithmetic is sketched below)
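
The sizes on this slide follow from a little arithmetic; a quick sketch using the slide's assumed numbers (variable names are mine):

    TB = 10 ** 12
    chunks = TB // 8192                  # ≈ 122 million chunks in 1 TB
    sha1_index = chunks * 20             # 20-byte SHA1 each ≈ 2.4 GB of .idx data
    packs = TB // (2 * 10 ** 9)          # 2 GB packs -> 500 packs, hence 500 .idx files
    idx_size = sha1_index // packs       # ≈ 5 MB per .idx
    print(chunks, sha1_index, packs, idx_size)
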
  • Millions of objects (cont'd)
    • bup .midx files: merge all the .idx files into a single .midx
    • Larger initial lookup table to immediately narrow the search to the last 7 bits
    • log2(2.4GB) ≈ 31 bits
    • 24-bit lookup table x 4 bytes per entry: 64 MB (see the sketch below)
    • Adding a new chunk always touches only two pages
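
A minimal sketch of the lookup-table idea (build_fanout and contains are hypothetical helpers; this is not the real .midx file layout): a fanout table indexed by the first 24 bits of a hash records where that prefix's entries end in the sorted hash list, so a membership test only binary-searches the handful of nearby entries.

    import bisect

    PREFIX_BITS = 24

    def build_fanout(sorted_hashes):
        # fanout[p] = number of hashes whose 24-bit prefix is <= p
        # (a 16M-entry table: 4 bytes per entry is the slide's 64 MB)
        fanout = [0] * (1 << PREFIX_BITS)
        for h in sorted_hashes:
            fanout[int.from_bytes(h[:3], "big")] += 1
        total = 0
        for p in range(len(fanout)):
            total += fanout[p]
            fanout[p] = total
        return fanout

    def contains(sorted_hashes, fanout, h):
        p = int.from_bytes(h[:3], "big")
        lo = fanout[p - 1] if p else 0
        hi = fanout[p]
        i = bisect.bisect_left(sorted_hashes, h, lo, hi)
        return i < hi and sorted_hashes[i] == h
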
  • Bloom Filters
    • Problems with .midx:
      • Have to rewrite the whole file to merge in a new .idx
      • Storing 20 bytes per hash is wasteful; even 48 bits would be unique
      • Paging in two pages per chunk is maybe too much
    • Solution: Bloom filters
      • Idea borrowed from the Plan 9 WORM fs
      • Create a big hash table
      • Hash each block several ways
      • Gives false positives, never false negatives
      • Can be updated incrementally (see the sketch below)
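
A minimal sketch of the Bloom-filter idea (the BloomFilter class and its parameters are illustrative, not bup's .bloom file format): hash each chunk id several ways, set one bit per hash when adding, and report "maybe present" only if every one of those bits is set. Absent ids occasionally collide on all bits (a false positive, resolved by checking the real index), but a present id is never missed, and adding new ids just sets more bits, so updates are incremental.

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 23, k=5):
            self.size = size_bits
            self.k = k
            self.bits = bytearray(size_bits // 8)

        def _positions(self, key):
            digest = hashlib.sha1(key).digest()
            for i in range(self.k):
                # carve 4 bytes per "hash function" out of one SHA1 digest
                yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    f = BloomFilter()
    f.add(b"some chunk sha1")
    print(b"some chunk sha1" in f, b"another sha1" in f)  # True, almost surely False
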
  • bup command line: indexed mode
    • Saving:
      • bup index -vux /
      • bup save -vn mybackup /
    • Restoring:
      • bup restore (extract multiple files)
      • bup ftp (command-line browsing)
      • bup web
      • bup fuse (mount backups as a filesystem)
  • Not implemented yet
    • Pruning old backups
      • 'git gc' is far too simple-minded to handle a 1TB backup set
    • Metadata
      • Almost done: 'bup meta' and the .bupmeta files
  • Crazy Ideas
    • Encryption
    • 'bup restore' that updates the index, like 'git checkout' does
    • Distributed filesystem: bittorrent with a smarter data structure
  • Questions?
