The Flying Circus is an Operations-as-a-Service platform that supports project development teams in running their custom-developed software for clients. Earlier in 2014 we experienced a major data loss and had to perform massive disaster recovery. Unfortunately, our Bacula setup was not up to the task, and restoring the data took longer and required more effort than we and our customers expected.
In this case study I’d like to present our public and very honest root cause analysis of how we managed to lose a lot of VMs’ data, how the restore happened, what we learned, and how we’re trying to get better. After investigating our options for the future, we decided to move away from Bacula’s file- and VTL-oriented model and are currently implementing a solution based on CoW filesystems (ZFS/btrfs), block-layer snapshots and diffing, and a small utility to glue things together.
18. 24 hours are not a sufficient RPO in quite a few cases
19. Paper cuts
• Hard link farms
• Boot loaders
• The director as a “most valuable bottleneck”
20. Recap
• Restore fiddly to script
• Undetected inconsistency that was hard to deal with
• Blind spots
• Daily Interval
• Overall complexity, performance and the VTL
• Paper cuts
23. Reliability
• Verification / Scrubbing / (Repair)
• High frequency
• Integration with storage snapshots
• Not inventing new formats
24. Operability
• Avoid bottlenecks / head-of-line blocking
• Efficient deltas for large files (ZODB)
• Parallelisation (multiple jobs and multiple servers)
• Simple scripting and environment-specific integration
• Coordination: pre/post actions on storage, hypervisor, VM …
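To make the "efficient deltas for large files" requirement concrete, here is a minimal sketch of block-level diffing: split an image into fixed-size chunks, hash each chunk, and transfer only the chunks whose hash changed since the previous run. This is an illustration under stated assumptions, not the utility described in the talk; the function names, the SHA-256 choice, and the fixed chunk size are all hypothetical.

```python
import hashlib


def chunk_hashes(data: bytes, chunk_size: int) -> list[str]:
    """Hash every fixed-size chunk of data (the last chunk may be shorter)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old_hashes: list[str], new_hashes: list[str]) -> list[int]:
    """Return indices of chunks in the new image that are new or modified.

    Chunks that only exist in the old image (truncation) are not reported;
    a real tool would have to handle that case separately.
    """
    changed = []
    for idx, digest in enumerate(new_hashes):
        if idx >= len(old_hashes) or old_hashes[idx] != digest:
            changed.append(idx)
    return changed
```

With a large file such as a ZODB data file, a small in-place change then costs one chunk of backup storage rather than a full copy of the file, which is the property the bullet above asks for.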
25. Operability II
• Simple Nagios integration to ensure we notice RPO/SLA failures
• RTO-compliance during mass-restore
• Self-service for customers to restore files or VMs
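A Nagios check for RPO compliance can be very small. The sketch below compares the age of the newest backup against the RPO and maps the result to the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The function name, the `warn_ratio` parameter, and the way the timestamp of the last backup is obtained are assumptions for illustration, not part of the original setup.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_backup_age(last_backup_ts, rpo_seconds, now=None, warn_ratio=0.8):
    """Return (exit_code, message) for the age of the newest backup.

    last_backup_ts: Unix timestamp of the newest successful backup.
    rpo_seconds:    the agreed recovery point objective, in seconds.
    warn_ratio:     hypothetical threshold; warn once this fraction of
                    the RPO has elapsed without a new backup.
    """
    now = time.time() if now is None else now
    age = now - last_backup_ts
    if age > rpo_seconds:
        return CRITICAL, f"CRITICAL - newest backup is {age:.0f}s old (RPO {rpo_seconds}s)"
    if age > warn_ratio * rpo_seconds:
        return WARNING, f"WARNING - newest backup is {age:.0f}s old (RPO {rpo_seconds}s)"
    return OK, f"OK - newest backup is {age:.0f}s old (RPO {rpo_seconds}s)"
```

A wrapper script would print the message and call `sys.exit()` with the returned code, once per VM, so that a missed backup shows up in monitoring instead of being discovered during a restore.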
51. What did we leave out?
• Physical host backup
• Guesstimating achievable backup storage ratio
52. Future
• trim-ready - waiting for our whole stack (Guest, Hypervisor, Ceph, …) to pass this through
• Hot reload of scheduler
• Ensuring we can move VM backup directories between different backup hosts