Progressive Provenance Capture Through Re-computation

Provenance capture relies upon instrumentation of processes (e.g. probes or extensive logging). The more instrumentation we add to processes, the richer our provenance traces can be, for example through comprehensive descriptions of the steps performed, mappings to higher levels of abstraction through ontologies, or the distinction between automated and user actions. However, this instrumentation has costs in terms of capture time and overhead, and it can be difficult to ascertain upfront what should be instrumented. In this talk, I'll discuss our research on using record-replay technology within virtual machines to incrementally add provenance instrumentation by replaying computations after the fact.
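
To make the record-replay idea concrete, here is a minimal Python sketch. It is not PANDA's mechanism: it assumes a toy "machine" whose only non-determinism is its input stream, and `step`, `record` and `replay` are hypothetical names invented for this illustration. The recording keeps only the initial state and the logged inputs; a replay reproduces the same final state and can carry a probe that did not exist when the recording was made.

```python
import random
from typing import Callable, List, Optional, Tuple

State = int

def step(state: State, nd_input: int) -> State:
    """Deterministic transition: same state + same input gives the same result."""
    return (state * 31 + nd_input) % 1_000_003

def record(initial: State, n_steps: int, seed: int) -> Tuple[State, List[int], State]:
    """Run 'live': draw non-deterministic inputs and log them as we go."""
    rng = random.Random(seed)  # stands in for devices, interrupts, DMA
    nd_log: List[int] = []
    state = initial
    for _ in range(n_steps):
        nd_input = rng.randrange(256)
        nd_log.append(nd_input)
        state = step(state, nd_input)
    return initial, nd_log, state

def replay(snapshot: State, nd_log: List[int],
           probe: Optional[Callable[[int, State, int], None]] = None) -> State:
    """Re-run from the snapshot, feeding the logged inputs; optionally instrument."""
    state = snapshot
    for i, nd_input in enumerate(nd_log):
        if probe is not None:
            probe(i, state, nd_input)  # read-only observation of the replay
        state = step(state, nd_input)
    return state

snapshot, nd_log, final_live = record(initial=7, n_steps=5, seed=42)

# First replay: no instrumentation at all.
assert replay(snapshot, nd_log) == final_live

# Later replay: add a probe that was not there at record time.
observed: List[Tuple[int, int]] = []
final_probed = replay(snapshot, nd_log, probe=lambda i, s, x: observed.append((i, x)))
assert final_probed == final_live
```

Because the probe only observes, every replay ends in the same state as the live run, which is what allows instrumentation to be added after the fact.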

  1. Progressive Provenance Capture Through Re-computation
     Paul Groth, Elsevier Labs (@pgroth | pgroth.com)
     Joint work with Manolis Stamatogiannakis and Herbert Bos, Vrije Universiteit Amsterdam
     Incremental Re-computation Workshop - Provenance Week 2018
  2. What to capture?
     Simon Miles, Paul Groth, Steve Munroe, Luc Moreau. PrIMe: A methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology, 20(3), 2011.
  3. Provenance is Post-Hoc
     • What if we missed something?
     • Disclosed provenance systems:
       – Re-apply the methodology (e.g. PrIMe), produce a new application version.
       – Time consuming.
     • Observed provenance systems:
       – Update the applied instrumentation.
       – Instrumentation becomes progressively more intense.
  4. Provenance is Post-Hoc
     Aim: Eliminate the need for developers to know what provenance needs to be captured.
  5. Re-execution
     • Common tactic in disclosed provenance:
       – DB: Reenactment queries (Glavic ‘14)
       – DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13), DistTape (Zhao ‘12)
       – Workflows: Pegasus (Groth ‘09)
       – PL: Slicing (Perera ‘12)
       – Desktop: Excel (Asuncion ‘11)
     • Can we extend this idea to observed provenance systems?
  6. Full-system logging and replay
  7. Methodology
     [Figure: cycle of Execution Capture → Instrumentation → Provenance analysis → Selection]
  8. Prototype Implementation
     • PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis. (Dolan-Gavitt ‘14)
     • Based on the QEMU virtualization platform.
  9. Prototype Implementation (2/3)
     • PANDA logs self-contained execution traces.
       – An initial RAM snapshot.
       – Non-deterministic inputs.
     • Logging happens at virtual CPU I/O ports.
       – Virtual device state is not logged → can’t “go-live”.
     [Figure: PANDA (CPU, RAM) with Input, Interrupt and DMA events; a PANDA Execution Trace consisting of an Initial RAM Snapshot and a Non-determinism log]
  10. Prototype Implementation (3/3)
      • Analysis plugins
        – Read-only access to the VM state.
        – Invoked per instruction, memory access, context switch, etc.
        – Can be combined to implement complex functionality.
        – OSI Linux, PROV-Tracer, ProcStrMatch, Taint tracking
      • Debian Linux guest.
      • Provenance stored as PROV/RDF triples, queried with SPARQL.
      [Figure: a PANDA Execution Trace replayed in PANDA (CPU, RAM), with Plugins A, B and C writing to a Triple Store]
      [Figure: PROV data model with Entity, Activity and Agent, linked by used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAttributedTo, wasAssociatedWith and actedOnBehalfOf; Activity carries startedAtTime and endedAtTime (xsd:dateTime)]
  11. OS Introspection
      • What processes are currently executing?
      • Which libraries are used?
      • What files are used?
      • Possible approaches:
        – Execute code inside the guest-OS.
        – Reproduce guest-OS semantics purely from the hardware state (RAM/registers).
  12. An example
      (1) Alice downloads the front page of example.org.
      (2) Alice edits the document and fixes a link that points to the wrong page.
      (3) Alice re-uploads the HTML document and the image.
      (4) Bob downloads the front page of example.org.
      (5) Bob removes a paragraph of text.
      (6) Bob re-uploads the HTML document.
  13. Select Replay
  14. Thoughts
      • Decoupling provenance analysis from execution is possible by the use of VM record & replay.
      • Execution traces can be used for post-hoc provenance analysis.
      • 24/7 execution recording seems possible.
      • Can we extend this notion of instrumentation to other capture systems?
      Manolis Stamatogiannakis, Elias Athanasopoulos, Herbert Bos, Paul Groth: PROV2R: Practical Provenance Analysis of Unstructured Processes. ACM Transactions on Internet Technology 17(4): 37:1-37:24 (2017)
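
The capture, instrumentation, analysis and selection cycle from the Methodology slide can be sketched roughly as follows. This is an illustrative Python toy, not the PANDA plugin API: the trace format, the `coarse_plugin`, `fine_plugin` and `select` functions, and the `ex:` identifiers are all hypothetical. Only the PROV relation names (`used`, `wasGeneratedBy`, `wasDerivedFrom`) come from the slides.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Event = Dict[str, str]

# A recorded "execution trace": hypothetical, greatly simplified events.
TRACE: List[Event] = [
    {"proc": "ex:wget",   "op": "write", "file": "ex:index.html"},
    {"proc": "ex:editor", "op": "read",  "file": "ex:index.html"},
    {"proc": "ex:editor", "op": "write", "file": "ex:index_v2.html"},
    {"proc": "ex:backup", "op": "read",  "file": "ex:notes.txt"},
    {"proc": "ex:backup", "op": "write", "file": "ex:notes.bak"},
]

def coarse_plugin(trace: List[Event]) -> Set[Triple]:
    """Cheap first pass: which process read or wrote which file."""
    triples: Set[Triple] = set()
    for ev in trace:
        if ev["op"] == "read":
            triples.add((ev["proc"], "prov:used", ev["file"]))
        elif ev["op"] == "write":
            triples.add((ev["file"], "prov:wasGeneratedBy", ev["proc"]))
    return triples

def fine_plugin(trace: List[Event]) -> Set[Triple]:
    """Costlier pass: link each written file to files the process read earlier.
    A crude stand-in for taint tracking, which would follow actual data flow."""
    triples: Set[Triple] = set()
    seen_reads: Dict[str, List[str]] = {}
    for ev in trace:
        if ev["op"] == "read":
            seen_reads.setdefault(ev["proc"], []).append(ev["file"])
        elif ev["op"] == "write":
            for src in seen_reads.get(ev["proc"], []):
                triples.add((ev["file"], "prov:wasDerivedFrom", src))
    return triples

def select(trace: List[Event], triples: Set[Triple], target: str) -> List[Event]:
    """Selection: keep only events of processes that generated the target."""
    procs = {o for s, p, o in triples if s == target and p == "prov:wasGeneratedBy"}
    return [ev for ev in trace if ev["proc"] in procs]

# Round 1: replay the whole trace with light instrumentation.
store = coarse_plugin(TRACE)
# Analysis and selection: focus on how index_v2.html came to be.
subset = select(TRACE, store, "ex:index_v2.html")
# Round 2: replay only the selected part with heavier instrumentation.
store |= fine_plugin(subset)
```

The heavier plugin runs only on the selected events, so the unrelated backup process never pays its cost. That is the sense in which instrumentation becomes progressively more intense.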

Editor's Notes

  • A big problem for systems capturing provenance is deciding what to capture.
    For disclosed provenance systems we can apply some methodology to decide what to capture.
  • The root of the problem is that provenance is post-hoc.
    Deciding what to capture in advance will always miss something.

    Ideally, we would like to…
  • Decouple analysis from execution.
    This has been proposed for security analysis on mobile phones (Paranoid Android, Portokalidis ‘10).
  • Execution Capture: happens in real time.
    Instrumentation: applied to the captured trace to generate provenance information.
    Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries).
    Selection: a subset of the execution trace is selected, and we start again with more intensive instrumentation.
  • We implemented our methodology using PANDA.
  • PANDA is based on QEMU.

    Input includes both executed instructions and data.

    RAM snapshot + ND log are enough to accurately replay the whole execution.

    The ND log consists of inputs to CPU/RAM; other device status is not logged → we can replay but we cannot “go live” (i.e. resume execution).
  • Note: Technically, plugins can modify VM state. However, this will eventually crash the execution, as the trace will be out of sync with the replay state.

    Plugins are implemented as dynamic libraries.

    We focus on the highlighted plugins in this presentation.
  • Typical information that can be retrieved through VM introspection.

    In general, executing code inside the guest OS is complex.
    Moreover, in the case of PANDA we don’t have access to the state of devices. This makes injection and execution of new code even more complex and also more limited.
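
The second approach, reproducing guest-OS semantics from hardware state, amounts to decoding kernel data structures out of raw memory. The Python sketch below illustrates the idea on a fabricated RAM image. The record layout (`<I16sI`: pid, name, offset of the next record), the `END` sentinel and the helper names are assumptions made for this example; they do not correspond to the Linux `task_struct` or to PANDA's OSI plugin.

```python
import struct
from typing import List, Tuple

# Hypothetical process record: pid (u32), name (16 bytes), next offset (u32).
RECORD = struct.Struct("<I16sI")
END = 0xFFFFFFFF  # sentinel marking the end of the list

def build_fake_ram() -> Tuple[bytes, int]:
    """Fabricate a RAM image holding three linked process records."""
    ram = bytearray(256)
    layout = [
        (0x40, 1, b"init", 0x80),
        (0x80, 214, b"wget", 0x10),
        (0x10, 305, b"vim", END),
    ]
    for offset, pid, name, nxt in layout:
        RECORD.pack_into(ram, offset, pid, name, nxt)
    return bytes(ram), 0x40  # the image and the offset of the list head

def list_processes(ram: bytes, head: int) -> List[Tuple[int, str]]:
    """Walk the linked list using only memory contents (read-only)."""
    procs: List[Tuple[int, str]] = []
    visited = set()
    offset = head
    while offset != END and offset not in visited:
        visited.add(offset)  # guard against corrupted, cyclic lists
        pid, raw_name, next_offset = RECORD.unpack_from(ram, offset)
        procs.append((pid, raw_name.rstrip(b"\x00").decode("ascii")))
        offset = next_offset
    return procs

ram, head = build_fake_ram()
processes = list_processes(ram, head)
```

Nothing runs inside the guest here: the process list is recovered purely by interpreting bytes. This is why the approach still works during replay, when no new code can be injected.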
