1. Causality-Based Versioning
Kiran-Kumar Muniswamy-Reddy and David A. Holland
Slides By Authors And Aleatha Parker-Wood
Tuesday, June 1, 2010
2. Versioning
• Already popular
• Saves backup “versions” of files as they change
• Two flavors: versioning (event based) and snapshotting (time based)
• Snapshots: WAFL, Venti...
• Versioning: Elephant, VersionFS...
3. Why Version/Snapshot?
• Disaster recovery is baked into the file system
• “Oops, I needed that...”
• “Oops, I didn’t mean to click that virus...”
• “Oops, that new driver patch broke everything...”
• Maintains backup files from which you can recover (without going offsite)
4. Causality
• Depends on time (to cause Y, X must be before it)
• Uni-directional (If X causes Y, Y cannot cause X)
• Defined in terms of data flow
• A reads B ⇒ B causes A
• A writes B ⇒ A causes B
• PASS, Intrusion Detection Systems (BackTracker, Taser...)
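The data-flow rules above can be sketched as a small mapping from observed events to cause/effect pairs. This is only an illustration: the event tuple layout and the name `causal_edges` are assumptions made here, not part of PASS or the other systems listed.

```python
def causal_edges(events):
    """Turn (actor, op, target) events into (cause, effect) pairs.

    Hypothetical event format: actor and target are plain names,
    op is either "read" or "write".
    """
    edges = []
    for actor, op, target in events:
        if op == "read":
            # A reads B => B causes A
            edges.append((target, actor))
        elif op == "write":
            # A writes B => A causes B
            edges.append((actor, target))
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return edges


if __name__ == "__main__":
    trace = [("cat", "read", "a.txt"), ("cat", "write", "b.txt")]
    for cause, effect in causal_edges(trace):
        print(f"{cause} causes {effect}")
```

Chaining the pairs gives the propagation path: here `a.txt` causes `cat`, which in turn causes `b.txt`.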
5. Why Causality?
• Track propagation of data
• Find out what files were modified by what processes
• Reconstruct the scene of the crime
6. Causality-Based Versioning
• Decide when to version using causal relationships between two files
• Has advantages of versioning file systems or snapshots
• Eases recovery from corruption, viruses, and user mistakes
• In addition, creates causal links between files
• Easier to decide what to restore
• Sort of like transactions on steroids
7. Applications
• Intrusion Recovery
• System configuration management
• IP compliance
• Reproduction of research results
8. A Scenario...
• Apache split-logfile Vulnerability
• Vulnerability in Apache 1.3
• Vulnerability allows attacker to overwrite any file with a .log extension
• Let’s look at the current versioning options...
15. The Goal
• One of these has too much information
• The other not enough
• Can we leverage causality to create just enough versions?
16. Creating Just Enough Versions
• Building on top of the Provenance Aware Storage System (PASS)
• Two options
• Cycle Avoidance
• Graph Finesse
17. How PASS works
• Translates system calls to provenance records (read/write become edges in a dependency graph)
• Maintains provenance for transient objects such as pipes and processes, and creates virtual objects as needed
• Analyzes the records to ensure there are no cyclic dependencies between objects
• Causality-based versioning extends the analysis phase
18. The big idea
• Cycles are violations of causality
• The creation of a cycle is an indicator that this is an interesting event
• We can prevent cycles by creating a new version every time a cycle is about to occur
48. Version-On-Write?
• We could remove cycles using Version-On-Write
• Every read creates a new version of the process
• Every write creates a new version of the file
• But this results in 8 versions
• Huge management overhead
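To make the overhead concrete, here is a minimal sketch of Version-On-Write bookkeeping. The event format, the `(name, version)` node representation, and the function name `version_on_write` are assumptions for illustration. It also simplifies by starting every object at version 1 and by not linking a new version to its predecessor.

```python
def version_on_write(events):
    """Apply Version-On-Write to (process, op, file) events.

    Returns (versions, edges): versions maps each name to its latest
    version number, and edges holds (dependent, ancestor) node pairs.
    """
    versions = {}
    edges = []

    def current(name):
        # Objects implicitly start at version 1 (assumption).
        return (name, versions.setdefault(name, 1))

    def bump(name):
        versions[name] = versions.get(name, 1) + 1
        return (name, versions[name])

    for proc, op, file in events:
        if op == "read":
            # Every read creates a new version of the process.
            ancestor = current(file)
            edges.append((bump(proc), ancestor))
        elif op == "write":
            # Every write creates a new version of the file.
            ancestor = current(proc)
            edges.append((bump(file), ancestor))
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return versions, edges


if __name__ == "__main__":
    trace = [("P", "read", "A"), ("P", "write", "B"),
             ("P", "read", "B"), ("P", "write", "A")]
    versions, _ = version_on_write(trace)
    print(versions)
```

Every event bumps a version, so the number of versions grows linearly with the number of reads and writes. No edge can ever point back to an older node, which is why cycles disappear.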
49. Cycle Avoidance Algorithm
• Uses local information about the object
• Create a new version of an object whenever a new ancestor is added
• Different versions are considered to be “new” ancestors
• Not every write causes a new version
50. The Algorithm
• Assume new data: A1 depends on B2
• If B is not in A’s dependencies, create a new version of A
• Else if B is already in A’s dependencies:
• If B2 is in dependencies, discard (no new information)
• If B3 is in dependencies, discard (no new causality)
• If B1 is in dependencies, create new version of A
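The decision rules above can be sketched as follows. This is an illustrative reading of the slide, not the PASS implementation: the class and function names are hypothetical, and the sketch assumes a new version inherits the ancestors recorded for its predecessor, which the slides do not specify.

```python
class VersionedObject:
    """An object that tracks only local information: its own ancestors."""

    def __init__(self, name):
        self.name = name
        self.version = 1
        # ancestor name -> newest version of that ancestor seen so far
        self.deps = {}


def record_dependency(obj, ancestor_name, ancestor_version):
    """Record "obj depends on ancestor_name at ancestor_version".

    Returns True if a new version of obj was created, False if the
    record was discarded.
    """
    known = obj.deps.get(ancestor_name)
    if known is not None and known >= ancestor_version:
        # Same version: no new information.
        # Newer version already recorded: no new causality.
        return False
    # Either the ancestor is entirely new, or only an older version of
    # it was recorded: treat it as a new ancestor and version the object.
    obj.version += 1
    obj.deps[ancestor_name] = ancestor_version
    return True


if __name__ == "__main__":
    a = VersionedObject("A")
    print(record_dependency(a, "B", 2), a.version)  # B is a new ancestor
    print(record_dependency(a, "B", 2), a.version)  # duplicate, discarded
    print(record_dependency(a, "B", 3), a.version)  # newer B, new version
```

Only the object's own ancestor table is consulted, so no graph traversal is needed. The cost is that some versions are created when no cycle would actually have formed.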
66. Graph Finesse
• As before: A1 depends on B2
• If B2 is already in A’s history, discard
• Otherwise, check for a path from B2 to A1
• If yes, we have a cycle. Make a new version of A
• Otherwise, add A1 → B2 to the dependency graph
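A minimal sketch of these steps, using a global dependency graph and a depth-first path check. The class `ProvenanceGraph`, its method names, and the `(name, version)` node representation are assumptions for illustration. Linking a new version to its predecessor is also a choice made here, not something the slides state.

```python
from collections import defaultdict


class ProvenanceGraph:
    def __init__(self):
        self.deps = defaultdict(set)  # node -> set of ancestor nodes
        self.current = {}             # name -> current version number

    def node(self, name):
        """Return the current (name, version) node, starting at version 1."""
        return (name, self.current.setdefault(name, 1))

    def _reaches(self, start, goal):
        """Depth-first search along dependency edges from start to goal."""
        stack, seen = [start], set()
        while stack:
            n = stack.pop()
            if n == goal:
                return True
            if n in seen:
                continue
            seen.add(n)
            stack.extend(self.deps.get(n, ()))
        return False

    def add_dependency(self, name, ancestor):
        """Record that the current version of name depends on ancestor.

        Returns the node that ends up holding the dependency.
        """
        target = self.node(name)
        # Already in the object's history (any version): discard.
        for v in range(1, target[1] + 1):
            if ancestor in self.deps.get((name, v), ()):
                return target
        if self._reaches(ancestor, target):
            # The edge would close a cycle: create a new version instead.
            self.current[name] += 1
            new = (name, self.current[name])
            self.deps[new].update({target, ancestor})
            return new
        self.deps[target].add(ancestor)
        return target


if __name__ == "__main__":
    g = ProvenanceGraph()
    print(g.add_dependency("P", g.node("A")))  # P reads A
    print(g.add_dependency("A", g.node("P")))  # P writes A: would cycle
```

Unlike the local check in Cycle Avoidance, a version is created only when a path actually exists, at the price of a graph search on each new dependency.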
78. Evaluation
• Run-time overhead
• Space overhead
• Recovery costs
• All results are the average of 5 runs
• Less than 5% standard deviation
79. Workloads used
• Linux compile (CPU intensive)
• Postmark (I/O intensive)
• Applying patches with Mercurial (developer workload)
• BLAST protein sequencing (scientific workload)
80. Algorithms used
• Without causal data:
• Ext2: Baseline (Lasagna, Harvard’s versioning FS, on top of ext2)
• VER: Plain open-close versioning
• With causal data:
• OC: Open-close
• CA: Cycle-Avoidance
• GF: Graph Finesse
• ALL: version on every write
105. Conclusions
• Both algorithms require less time and space than Version-On-Write
• Both algorithms offer finer grained control than Open-Close
• Graph-Finesse creates fewer unnecessary versions
• Cycle-Avoidance has overhead comparable to Open-Close
106. Expanding on it
• Not just good for disaster recovery
• Search
• Social network analysis