
Backing up thousands of containers


Information about how you can build a backup architecture for thousands of containers or machines.

Published in: Engineering

  1. Backing up thousands of containers, OR How to fail miserably at copying data (OpenFest 2015)
  2. Talk about backup systems... Why? ➢ First backup system built in 1999 ➢ Since then, 10 different systems ➢ But why build your own? ➢ Simple: SCALE ➢ I'm very proud of the design of the last two systems my team and I built
  3. Backup considerations ➢ Storage capacity ➢ Number of backup copies ➢ HDD and RAID speeds ➢ Almost never the network
  4. Networking... ➢ Typical transfer speed over 1Gbit/s: ~24MB/s ➢ Typical transfer speed over 10Gbit/s: ~110MB/s ➢ Restoring an 80% full 2TB drive ➢ ~21h over 1Gbit/s at 24MB/s ➢ ~4.5h over 10Gbit/s at 110MB/s ➢ Overlapping backups on the same network equipment ➢ Overlapping backups and restores ➢ Switch uplinks
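As a rough check of the restore times on this slide, the sketch below computes the idealized transfer time for a given amount of data at a sustained throughput. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and no protocol or filesystem overhead; the function name and parameters are illustrative and not from the talk.

```python
def restore_hours(drive_tb: float, fill_ratio: float, throughput_mb_s: float) -> float:
    """Idealized hours needed to copy the used part of a drive."""
    bytes_to_copy = drive_tb * 1e12 * fill_ratio       # decimal terabytes (assumption)
    seconds = bytes_to_copy / (throughput_mb_s * 1e6)  # decimal megabytes per second
    return seconds / 3600


if __name__ == "__main__":
    # Throughput figures are the ones quoted on the slide.
    for label, speed in (("1Gbit/s", 24), ("10Gbit/s", 110)):
        print(f"{label}: {restore_hours(2, 0.8, speed):.1f} h")
```

Under these assumptions the results are roughly 18.5 h and 4.0 h, somewhat below the ~21 h and ~4.5 h quoted on the slide; the difference presumably reflects real-world overhead.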
  5. Architecture of container backups ➢ Designed for 100,000 containers ➢ Back up each container at least once a day ➢ 30 incremental copies ➢ Now I'll explain HOW :)
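To get a feel for what "100,000 containers, each at least once a day" implies, the sketch below applies Little's law (average concurrency = arrival rate × duration). The average backup duration and the number of slots per backup node are hypothetical inputs chosen for illustration; the talk does not state them.

```python
import math


def required_concurrency(containers: int, avg_backup_minutes: float,
                         window_hours: float = 24.0) -> float:
    """Average number of backups that must be running at any moment."""
    backups_per_minute = containers / (window_hours * 60)
    return backups_per_minute * avg_backup_minutes


def nodes_needed(concurrency: float, slots_per_node: int) -> int:
    """Backup nodes needed if each node runs at most slots_per_node backups."""
    return math.ceil(concurrency / slots_per_node)


if __name__ == "__main__":
    # 10 minutes per backup and 20 slots per node are assumptions, not talk data.
    running = required_concurrency(100_000, avg_backup_minutes=10)
    print(f"{running:.0f} concurrent backups, {nodes_needed(running, 20)} nodes")
```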
  6. Host machine architecture ➢ We use LVM ➢ RAID array which exposes a single drive ➢ Set up a single Physical Volume on that drive ➢ Set up a single Volume Group using the above PV ➢ Thin-provisioned VG ➢ Each container with its own Logical Volume
  7. Backup node architecture ➢ Again we use LVM ➢ RAID array which exposes a single drive ➢ 5 equally sized Physical Volumes ➢ On each PV we create a VG with a thin pool ➢ Each container has a single LV ➢ Each incremental backup is a new snapshot of the LV ➢ When the maximum number of incremental backups is reached, we remove the first LV
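The snapshot rotation described on this slide could be sketched as follows. This is not the speaker's code: the function only builds the LVM command lines without running them, the volume and snapshot names are hypothetical, and it assumes that "remove the first LV" means dropping the oldest snapshot. On a thin LV, `lvcreate --snapshot` without a size creates a thin snapshot in the same pool.

```python
from typing import List, Sequence


def rotation_commands(vg: str, lv: str, existing_snapshots: Sequence[str],
                      new_name: str, max_copies: int = 30) -> List[List[str]]:
    """Build the commands for one backup cycle.

    existing_snapshots must be ordered oldest first.
    """
    # Take a new thin snapshot of the container's LV.
    commands = [["lvcreate", "--snapshot", "--name", new_name, f"{vg}/{lv}"]]
    snapshots = list(existing_snapshots) + [new_name]
    # Drop the oldest snapshots until we are back at the retention limit.
    while len(snapshots) > max_copies:
        oldest = snapshots.pop(0)
        commands.append(["lvremove", "--yes", f"{vg}/{oldest}"])
    return commands


if __name__ == "__main__":
    old = [f"c1001-snap{i}" for i in range(30)]  # hypothetical names
    for cmd in rotation_commands("vgbackup1", "c1001", old, "c1001-snap30"):
        print(" ".join(cmd))
```

Returning the commands instead of executing them keeps the retention logic testable without root access or a real volume group.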
  8. For now, there is nothing really new or very interesting here. So let me start with the fun part.
  9. ➢ We use rsync (nothing revolutionary here) ➢ We need the size of the deleted files ➢ Restore files directly in clients' containers, no SSH into them
  10. Backup system architecture ➢ One central database ➢ Public/private IP addresses ➢ Maximum slots per machine ➢ Gearman as the messaging layer ➢ Scheduler for backups ➢ Backup worker
  11. The Scheduler ➢ Check if we have to back up the container ➢ Get the last backup timestamp ➢ Check if the host node has available backup slots ➢ Schedule a 'start-backup' job on the Gearman server of the backup node
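The scheduler steps above can be sketched as a pure function. Everything here is an assumption made for illustration: the `Container` fields, the 24-hour interval, and the shape of the returned job dictionaries. The sketch deliberately stops short of talking to Gearman or the database; submitting the jobs is left to the caller.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Container:
    name: str
    host_node: str
    backup_node: str
    last_backup_ts: Optional[float]  # Unix time; None if never backed up


def plan_backups(containers: Iterable[Container], free_slots: Dict[str, int],
                 now: float, interval_s: float = 86400.0) -> List[dict]:
    """Pick the containers to back up now, respecting per-host slot limits."""
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    jobs = []
    # Containers with the oldest (or missing) backups go first.
    for c in sorted(containers, key=lambda c: c.last_backup_ts or 0.0):
        due = c.last_backup_ts is None or now - c.last_backup_ts >= interval_s
        if not due:
            continue
        if slots.get(c.host_node, 0) <= 0:
            continue  # host node is busy; retry on the next scheduler pass
        slots[c.host_node] -= 1
        jobs.append({"function": "start-backup",
                     "server": c.backup_node,
                     "container": c.name})
    return jobs
```

Keeping the decision logic separate from the messaging layer makes it possible to test the slot accounting without a running job server.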
  12. start-backup worker ➢ Runs on each backup node ➢ Started as many times as the backup server can handle ➢ Handles the actual backup ➢ Creates snapshots ➢ Monitors rsync ➢ Removes snapshots ➢ Updates the database
  13. No problems... they say :) ➢ We lost ALL of our backups from TWO nodes ➢ Corrupted VG metadata ➢ VG metadata space is not enough for more than 2000 LVs ➢ Create the VGs a little bit smaller than the total size of the PV ➢ Separate the VGs to lose less
  14. No problems... they say :) ➢ LV creation becomes sluggish because LVM scans for devices in /dev ➢ obtain_device_list_from_udev = 1 ➢ write_cache_state = 0 ➢ Specify the devices in scan = [ "/dev" ] ➢ lvmetad and dmeventd break... ➢ When they break, they corrupt the metadata of all currently opened containers ➢ lvcreate leaks file descriptors ➢ Once lvmetad or dmeventd are out of FDs, everything breaks
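Since running out of file descriptors is what breaks the daemons, one way to watch for it is to compare a process's open descriptors against its soft limit. The sketch below reads the Linux `/proc/<pid>/fd` directory and `/proc/<pid>/limits` file; the function names and the 80% threshold are assumptions, and finding the PIDs of lvmetad or dmeventd is left to the caller.

```python
import os
from typing import Optional, Tuple


def fd_usage(pid: int, proc_root: str = "/proc") -> Tuple[int, Optional[int]]:
    """Return (open_fds, soft_limit); soft_limit is None if unlimited or unknown.

    Reading another user's process usually requires root.
    """
    open_fds = len(os.listdir(os.path.join(proc_root, str(pid), "fd")))
    soft_limit = None
    with open(os.path.join(proc_root, str(pid), "limits")) as fh:
        for line in fh:
            if line.startswith("Max open files"):
                value = line.split()[3]  # 4th field is the soft limit
                soft_limit = None if value == "unlimited" else int(value)
                break
    return open_fds, soft_limit


def near_fd_limit(open_fds: int, soft_limit: Optional[int],
                  threshold: float = 0.8) -> bool:
    """True when usage has reached the given fraction of the soft limit."""
    return soft_limit is not None and open_fds >= threshold * soft_limit


if __name__ == "__main__":
    used, limit = fd_usage(os.getpid())
    print(used, limit, near_fd_limit(used, limit))
```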
  15. Then the Avatar came ➢ We wanted to reduce the restore time from 4h to under 1h, even under 30min ➢ So instead of backing up whole containers... ➢ We now back up accounts ➢ Soon we will be able to do distributed restore ➢ Single host node backup ➢ From multiple backup nodes ➢ To multiple host nodes
  16. Layered backups ➢ Sparse File ➢ Physical Volume ➢ Volume Group ➢ ThinPool ➢ Logical Volume ➢ Snapshot6 ➢ Snapshot5 ➢ Snapshot4 ➢ Snapshot3 ➢ Snapshot2 ➢ Snapshot1 ➢ Snapshot0 ➢ Loop mount
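One possible reading of this layering is a sparse file attached as a loop device, which then carries its own PV, VG, thin pool and thin LV. The sketch below lists the commands such a stack would need; it is an interpretation of the diagram, not the speaker's setup. All paths, names and sizes are hypothetical, and the choice to leave 10% of the VG free for the pool is an assumption. The loop device name is passed in because `losetup --find --show` only reports it at run time.

```python
from typing import List


def layered_stack_commands(image_path: str, image_size: str, loop_dev: str,
                           vg: str, pool: str = "pool", lv: str = "data",
                           virtual_size: str = "50G") -> List[List[str]]:
    """Build (but do not run) the commands for a file-backed thin LVM stack."""
    return [
        # 1. Sparse file: takes disk space only as data is written.
        ["truncate", "-s", image_size, image_path],
        # 2. Attach it as a block device; prints the device it picked.
        ["losetup", "--find", "--show", image_path],
        # 3. Physical Volume and Volume Group on the loop device.
        ["pvcreate", loop_dev],
        ["vgcreate", vg, loop_dev],
        # 4. Thin pool that leaves some extents free in the VG.
        ["lvcreate", "--type", "thin-pool", "--extents", "90%FREE",
         "--name", pool, vg],
        # 5. Thin Logical Volume; snapshots are later taken from this LV.
        ["lvcreate", "--type", "thin", "--virtualsize", virtual_size,
         "--thinpool", pool, "--name", lv, vg],
    ]


if __name__ == "__main__":
    for cmd in layered_stack_commands("/backup/acct.img", "100G",
                                      "/dev/loop0", "vgacct"):
        print(" ".join(cmd))
```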
  17. Issues here ➢ We can't keep a machine UP for more than 19 hours: LVM kernel BUG ➢ Kernels 2.6 through 4.3 crash when discarding data ➢ Removing old snapshots does not discard the data ➢ LVM unmounts a volume when dmeventd reaches the limit of FDs ➢ It does umount -l, the bastard
  18. Issues here ➢ LVM's dmeventd tries to extend the volume, but if you don't have free extents it will silently umount -l your LV ➢ Monitor your thin pool metadata ➢ Make your thin pool smaller than the VG and always plan to have a few spare PEs for extending the pool ➢ kabbi__ #lvm
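Monitoring thin pool data and metadata usage could look like the sketch below. It relies on the `data_percent` and `metadata_percent` report fields of `lvs`; the 80% thresholds, the `|` separator and the function names are assumptions chosen for illustration. Only LVs that report both fields are kept, which in practice filters for pools, since plain thin LVs leave the metadata field empty.

```python
import os
import subprocess
from typing import Dict, List, Tuple


def read_lvs(vg: str) -> str:
    """Run lvs for one VG and return its raw output (needs LVM and privileges)."""
    result = subprocess.run(
        ["lvs", "--noheadings", "--separator", "|",
         "-o", "lv_name,data_percent,metadata_percent", vg],
        check=True, capture_output=True, text=True,
        env={**os.environ, "LC_ALL": "C"},  # force '.' as the decimal mark
    )
    return result.stdout


def parse_lvs(output: str) -> Dict[str, Tuple[float, float]]:
    """Map LV name to (data %, metadata %) for LVs that report both."""
    usage = {}
    for line in output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 3 or not fields[1] or not fields[2]:
            continue  # blank line, or an LV without metadata usage
        usage[fields[0]] = (float(fields[1]), float(fields[2]))
    return usage


def pools_over(usage: Dict[str, Tuple[float, float]],
               max_data: float = 80.0, max_meta: float = 80.0) -> List[str]:
    """Names of pools whose data or metadata usage exceeds a threshold."""
    return sorted(name for name, (data, meta) in usage.items()
                  if data > max_data or meta > max_meta)
```

A periodic job could call `pools_over(parse_lvs(read_lvs("vgbackup1")))`, where the VG name is hypothetical, and alert before dmeventd has to extend a pool that has no spare extents.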
  19. Any Questions?