
Tradeoffs in Automatic Provenance Capture

Presentation from IPAW 2016. Preprint: http://dare.ubvu.vu.nl/handle/1871/54358
DOI: http://doi.org/10.1007/978-3-319-40593-3_3

  1. Trade-offs in Automatic Provenance Capture. Manolis Stamatogiannakis, Hasanat Kazmi, Hashim Sharif, Remco Vermeulen, Ashish Gehani, Herbert Bos, and Paul Groth
  2. Capturing Provenance • Disclosed provenance: + accuracy, + high-level semantics; – intrusive, – manual effort. Examples: CPL (Macko ’12), Trio (Widom ’09), PrIME (Miles ’09), Taverna (Oinn ’06), VisTrails (Freire ’06). • Observed provenance: + non-intrusive, + minimal manual effort; – false positives, – semantic gap. Examples: ES3 (Frew ’08), TREC (Vahdat ’98), PASSv2 (Holland ’08), DTrace tool (Gessiou ’12), OPUS (Balakrishnan ’13).
  3. SPADEv2 – Provenance Collection (https://github.com/ashish-gehani/SPADE/wiki) • Strace Reporter: programs run under strace; the produced log is parsed to extract provenance (a sketch of this parsing idea follows this slide). • LLVMTrace: instrumentation added at function boundaries at compile time. • DataTracker: dynamic taint analysis; bytes are associated with metadata that is propagated as the program executes.
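The Strace Reporter idea above can be made concrete with a minimal sketch, assuming the program has already been run as `strace -f -e trace=openat -o trace.log ./prog`. This is an editor's illustration, not SPADE's actual parser; the regex and the edge format are assumptions.

```python
# Minimal sketch (not the SPADE Strace Reporter): turn an strace log into
# coarse file-level provenance edges based on how files were opened.
import re
import sys

# Matches lines such as:  openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
OPEN_RE = re.compile(r'openat\(.*?"(?P<path>[^"]+)",\s*(?P<flags>[A-Z_|]+)')

def parse_edges(log_path):
    """Yield (direction, file) pairs from an strace log."""
    with open(log_path) as log:
        for line in log:
            m = OPEN_RE.search(line)
            if not m:
                continue
            # Opens flagged O_RDONLY are treated as reads, everything else as writes.
            direction = "read" if "O_RDONLY" in m.group("flags") else "write"
            yield direction, m.group("path")

if __name__ == "__main__":
    for direction, path in parse_edges(sys.argv[1]):
        print(f"process --{direction}--> {path}")
```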
  4. SPADEv2 flow
  5. Current Intuition
  6. Current Intuition
  7. Incomplete Picture • Faster, but how much? • What is the performance “price” for fewer false positives? • Is a compile-time solution worth the effort?
  8. How can one get more insight? Run a benchmark!
  9. Which one? • LMBench, UnixBench, Postmark, BLAST, SPECint… • [Traeger 08]: “Most popular benchmarks are flawed.” • No matter what you choose, there will be blind spots.
  10. Start simple: UnixBench • Well-understood sub-benchmarks. • Emphasizes the performance of system calls. • System calls are commonly used for extracting provenance. • Gives more insight into which collection backend would suit specific applications. • Provides a performance baseline against which to improve the specific implementations. (A rough overhead-measurement sketch follows this slide.)
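For a rough sense of what such a baseline looks like, one can time a syscall-heavy workload natively and again under strace. This is a back-of-the-envelope sketch with an assumed workload and iteration count, not the benchmark configuration used in the paper.

```python
# Back-of-the-envelope sketch: compare a syscall-heavy workload run natively
# and under strace. The workload and strace flags are illustrative only.
import subprocess
import sys
import time

# Child process issuing roughly 100k stat() syscalls.
WORKLOAD = [sys.executable, "-c",
            "import os\nfor _ in range(100_000): os.stat('/')"]

def timed(cmd):
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

native = timed(WORKLOAD)
traced = timed(["strace", "-f", "-o", "/dev/null"] + WORKLOAD)
print(f"native: {native:.2f}s, under strace: {traced:.2f}s, "
      f"slowdown: {traced / native:.1f}x")
```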
  11. UnixBench Results
  12. TRADEOFFS
  13. Performance vs. Integration Effort • Capturing provenance from completely unmodified programs may degrade performance. • Modification of either the source (LLVMTrace) or the platform (LPM, Hi-Fi) should be considered for a production deployment.
  14. Performance vs. Provenance Granularity • We couldn’t verify this intuition for the case of the strace reporter compared to LLVMTrace. – The strace reporter implementation is not optimal. • Tracking fine-grained provenance may interfere with existing optimizations. – E.g., buffered I/O does not benefit DataTracker (see the sketch after this slide).
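To see why buffering is an optimization worth preserving, the sketch below compares many one-byte writes through a buffered versus an unbuffered file object. It is purely illustrative; the iteration count and file handling are arbitrary choices.

```python
# Illustrative sketch: buffering batches many small writes into few write()
# syscalls, while unbuffered output issues one syscall per byte. A per-byte
# taint tracker still has to track every byte either way, so it cannot cash
# in on this optimization.
import os
import tempfile
import time

def one_byte_writes(buffering):
    fd, path = tempfile.mkstemp()
    os.close(fd)
    start = time.perf_counter()
    with open(path, "wb", buffering=buffering) as f:
        for _ in range(200_000):
            f.write(b"x")
    elapsed = time.perf_counter() - start
    os.unlink(path)
    return elapsed

print(f"buffered (default): {one_byte_writes(-1):.3f}s")
print(f"unbuffered:         {one_byte_writes(0):.3f}s")
```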
  15. Performance vs. False Positives/Analysis Scope • “Brute-forcing” a low false-positive ratio with the “track everything” approach of DataTracker is prohibitively expensive. • Limiting the analysis scope gives a performance boost. • If we exploit known semantics, we can have the best of both worlds. – Pre-existing semantic knowledge: LLVMTrace. – Dynamically acquired knowledge: ProTracer [Ma 2016].
  16. TAKEAWAYS
  17. Takeaway: System Event Tracing • A good start for quick deployments. • Simple versions may be expensive. • But what happens inside the binary?
  18. Takeaway: Compile-time Instrumentation • A middle ground between disclosed and automatic provenance collection. • But you must have access to the source code.
  19. Takeaway: Taint Analysis • Prohibitively expensive for computation-intensive programs. • Likely to remain so, even after optimizations. • Best reserved for provenance analysis of unknown/legacy software. • An offline approach is possible (Stamatogiannakis, TAPP ’15).
  20. Generalizing the Results • Only one implementation was tested for each method. • Repeating the tests with alternative implementations would increase confidence in the insights gained. • More confidence when choosing a specific collection method. (Diagram: different methods vs. different implementations.)
  21. Implementation Details Matter • Our results are influenced by the specifics of the implementation. • Anecdote: the initial implementation of LLVMTrace was actually slower than the strace reporter.
  22. Provenance Quality • Qualitative features of the provenance are also very important. • How many vertices/edges are contained in the generated provenance graph? • Precision/recall based on provenance ground truth (a scoring sketch follows this slide). (Diagram: performance benchmarks vs. qualitative benchmarks.)
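One simple way to make the precision/recall comparison concrete is to represent both the captured graph and the ground-truth graph as sets of edges and score one against the other, as in the minimal sketch below. The example edges are invented for illustration.

```python
# Sketch: score a captured provenance graph against a ground-truth graph,
# with both graphs represented as sets of edges. Example edges are made up.
def precision_recall(captured, ground_truth):
    true_positives = len(captured & ground_truth)
    precision = true_positives / len(captured) if captured else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall

ground_truth = {("prog", "read", "in.txt"), ("prog", "wrote", "out.txt")}
captured = {("prog", "read", "in.txt"), ("prog", "wrote", "out.txt"),
            ("prog", "read", "/etc/ld.so.cache")}  # an extra (false-positive) edge

precision, recall = precision_recall(captured, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=1.00
```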
  23. Where to go next? • UnixBench is a basic benchmark. • SPEC: comprehensive in terms of performance evaluation. – Hard to obtain the provenance ground truth needed to assess the quality of the captured provenance. • Better directions: – Coreutils-based micro-benchmarks (a hypothetical harness is sketched after this slide). – Macro-benchmarks (e.g., Postmark, compilation benchmarks).
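A coreutils-based micro-benchmark could be as small as the hypothetical harness below: run a handful of fixed commands under each collection setup and record wall-clock time. The commands and the tracer prefixes are placeholders, not the actual SPADE reporter invocations.

```python
# Hypothetical micro-benchmark harness: time a few coreutils commands under
# different tracing setups. Tracer prefixes are placeholders, not the actual
# SPADE reporter invocations.
import subprocess
import time

COMMANDS = [
    ["cp", "/etc/services", "/tmp/services.copy"],
    ["sort", "-o", "/tmp/services.sorted", "/etc/services"],
    ["grep", "-c", "tcp", "/etc/services"],
]

TRACERS = {
    "baseline": [],
    "strace":   ["strace", "-f", "-o", "/dev/null"],
}

for tracer, prefix in TRACERS.items():
    for cmd in COMMANDS:
        start = time.perf_counter()
        subprocess.run(prefix + cmd, check=True, stdout=subprocess.DEVNULL)
        elapsed = (time.perf_counter() - start) * 1000
        print(f"{tracer:8s} {cmd[0]:5s} {elapsed:7.1f} ms")
```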
  24. Conclusion • Automatic provenance capture is an important part of the provenance ecosystem. • There are trade-offs between the different capture modes. • Benchmarking helps inform the choice among them. • Common platforms are essential.
  25. The End
  26. UnixBench Results
