
Progressive Provenance Capture Through Re-computation



Provenance capture relies upon instrumentation of processes (e.g. probes or extensive logging). The more instrumentation we can add to processes the richer our provenance traces can be, for example, through the addition of comprehensive descriptions of steps performed, mapping to higher levels of abstraction through ontologies, or distinguishing between automated or user actions. However, this instrumentation has costs in terms of capture time/overhead and it can be difficult to ascertain what should be instrumented upfront. In this talk, I'll discuss our research on using record-replay technology within virtual machines to incrementally add additional provenance instrumentation by replaying computations after the fact.

Published in: Technology


  1. Progressive Provenance Capture Through Re-computation. Paul Groth, Elsevier Labs (@pgroth). Joint work with Manolis Stamatogiannakis and Herbert Bos, Vrije Universiteit Amsterdam. Incremental Re-computation Workshop, Provenance Week 2018.
  2. What to capture? Simon Miles, Paul Groth, Steve Munroe, Luc Moreau. PrIMe: A methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology, 20(3), 2011.
  3. Provenance is Post-Hoc • What if we missed something? • Disclosed provenance systems: – Re-apply the methodology (e.g. PrIMe) and produce a new application version. – Time-consuming. • Observed provenance systems: – Update the applied instrumentation. – Instrumentation becomes progressively more intense.
  4. Provenance is Post-Hoc • Aim: Eliminate the need for developers to know up front what provenance needs to be captured.
  5. Re-execution • Common tactic in disclosed provenance: – DB: Reenactment queries (Glavic ‘14) – DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13), DistTape (Zhao ‘12) – Workflows: Pegasus (Groth ‘09) – PL: Slicing (Perera ‘12) – Desktop: Excel (Asuncion ‘11) • Can we extend this idea to observed provenance systems?
  6. Full-system logging and replay
  7. Methodology [Diagram; stage labels: Selection, Provenance analysis, Instrumentation, Execution Capture]
  8. Prototype Implementation • PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ‘14). • Based on the QEMU virtualization platform.
  9. Prototype Implementation (2/3) • PANDA logs self-contained execution traces: – An initial RAM snapshot. – Non-deterministic inputs. • Logging happens at virtual CPU I/O ports. – Virtual device state is not logged, so a replayed trace can’t “go live”. [Diagram: inputs, interrupts, and DMA reaching the PANDA CPU and RAM are written to a non-determinism log; together with the initial RAM snapshot this forms the PANDA execution trace]
  10. Prototype Implementation (3/3) • Analysis plugins – Read-only access to the VM state. – Invoked per instruction, memory access, context switch, etc. – Can be combined to implement complex functionality. – OSI Linux, PROV-Tracer, ProcStrMatch, Taint tracking • Debian Linux guest. • Provenance stored as PROV/RDF triples, queried with SPARQL. [Diagram: the PANDA execution trace is replayed in PANDA (CPU, RAM) with Plugins A, B, and C attached, writing to a triple store] [Diagram: PROV core model with Entity, Activity, and Agent related by used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith, wasAttributedTo, and actedOnBehalfOf; Activity has startedAtTime and endedAtTime of type xsd:dateTime]
  11. OS Introspection • What processes are currently executing? • Which libraries are used? • What files are used? • Possible approaches: – Execute code inside the guest OS. – Reproduce guest-OS semantics purely from the hardware state (RAM/registers).
  12. An example (1) Alice downloads the front page of (2) Alice edits the document and fixes a link that points to the wrong page. (3) Alice re-uploads the HTML document and the image. (4) Bob downloads the front page of (5) Bob removes a paragraph of text. (6) Bob re-uploads the HTML document.
  13. Select Replay
  14. Thoughts • Decoupling provenance analysis from execution is possible through the use of VM record & replay. • Execution traces can be used for post-hoc provenance analysis. • 24/7 execution recording seems possible. • Can we extend this notion of instrumentation to other capture systems? Manolis Stamatogiannakis, Elias Athanasopoulos, Herbert Bos, Paul Groth: PROV2R: Practical Provenance Analysis of Unstructured Processes. ACM Transactions on Internet Technology 17(4): 37:1-37:24 (2017)
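
The record-and-replay idea from the slides can be illustrated with a toy sketch. This is not PANDA and does not use its API: the `Trace`, `record`, `replay`, and `ProvPlugin` names, the event tuples, and the version-counter "machine" are all hypothetical stand-ins chosen for illustration. What it tries to show is only the shape of the approach: the live run logs an initial snapshot plus its non-deterministic inputs with no analysis attached, and a provenance plugin is added later, during a deterministic replay.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Toy execution trace: an initial snapshot plus the logged inputs."""
    initial_state: dict
    inputs: list = field(default_factory=list)


def step(state, event):
    """Deterministic transition of the toy machine for one input event."""
    _proc, op, path = event
    if op == "write":
        # Each write bumps a per-file version counter.
        state["files"][path] = state["files"].get(path, 0) + 1
    return state


def record(initial_state, event_source):
    """Live run: no analysis attached, only the inputs are logged."""
    trace = Trace(initial_state=copy.deepcopy(initial_state))
    state = copy.deepcopy(initial_state)
    for event in event_source:
        trace.inputs.append(event)
        step(state, event)
    return trace, state


def replay(trace, plugins=()):
    """Re-run the trace from its snapshot, invoking plugins after each step."""
    state = copy.deepcopy(trace.initial_state)
    for event in trace.inputs:
        step(state, event)
        for plugin in plugins:
            plugin(event, state)  # plugins only read the state here
    return state


class ProvPlugin:
    """Hypothetical analysis plugin that emits PROV-style triples."""

    def __init__(self):
        self.triples = []

    def __call__(self, event, state):
        proc, op, path = event
        if op == "read":
            self.triples.append((proc, "prov:used", path))
        elif op == "write":
            version = f"{path}#v{state['files'][path]}"
            self.triples.append((version, "prov:wasGeneratedBy", proc))


# Record once without instrumentation (made-up events for illustration).
events = [
    ("wget", "write", "index.html"),
    ("editor", "read", "index.html"),
    ("editor", "write", "index.html"),
]
trace, live_state = record({"files": {}}, events)

# Later: decide what to capture and replay with a plugin attached.
plugin = ProvPlugin()
replayed_state = replay(trace, [plugin])
```

Because the replay is driven entirely by the recorded trace, it can be repeated with different or additional plugins without re-running the original workload, which is the sense in which instrumentation becomes "progressive" in this sketch.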