This document discusses auditing and maintaining provenance in software packages. It presents CDE-SP, an enhancement to the CDE system that captures additional details about software dependencies so that authorship can be attributed as software packages are combined and merged into pipelines. CDE-SP uses a lightweight LevelDB store to encode process and file provenance within software packages, and it provides queries to retrieve dependency information and to validate authorship by matching provenance graphs. Experiments show that CDE-SP introduces negligible overhead compared to the original CDE system.
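To make the storage idea concrete, here is a minimal sketch of how process-to-file provenance edges could be encoded as key-value pairs in LevelDB (via the plyvel Python binding). The key layout and the helper names (record_file_read, files_read_by) are hypothetical illustrations for this summary, not the schema CDE-SP actually uses.

```python
# Hypothetical sketch of CDE-SP-style provenance storage in LevelDB.
# The key layout and metadata fields are illustrative assumptions.
import json
import plyvel  # Python binding for LevelDB

db = plyvel.DB('provenance.db', create_if_missing=True)

def record_file_read(pid, path, package=None):
    # One key per (process, file) edge; the value carries edge metadata,
    # including which shared package the file came from (for authorship).
    key = f'read:{pid}:{path}'.encode()
    db.put(key, json.dumps({'pid': pid, 'path': path, 'package': package}).encode())

def files_read_by(pid):
    # Prefix scan over all "read" edges recorded for one process.
    prefix = f'read:{pid}:'.encode()
    return [json.loads(value) for _, value in db.iterator(prefix=prefix)]

record_file_read(4242, '/usr/lib/libgsl.so', package='package-A')
print(files_read_by(4242))
```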
IPAW14 presentation Quan, Tanu, Ian
1. Auditing and Maintaining Provenance in Software Packages
Quan Pham (1), Tanu Malik (2), Ian Foster (1,2)
Department of Computer Science (1) and Computation Institute (2),
The University of Chicago,
Chicago, IL 60637, USA
quanpt@cs.uchicago.edu, tanum@ci.uchicago.edu
Presented by Boris Glavic
Illinois Institute of Technology
IPAW14
June 10th, 2014
2. Outline
1 Introduction
2 Software Pipeline Usecase
3 CDE-SP: Software Provenance in CDE
4 Experiment and Evaluation
5 Related Work
6 Conclusion
3. Current Solutions for Ensuring Reproducibility and Issues
1 Publish source code and data
− GitHub, Figshare, Research Compendia
Pros: (in many cases) easy to accomplish
× Cons: need to recompile and re-execute
2 Publish software package including source code, data, and
environment dependencies
− CDE, RunMyCode.org
Pros: re-execute without installation
× Cons: not easy to combine and merge shared packages
3 Publish a virtual machine image (VMI) that includes OS, source code,
data, and environment
− Cloud BioLinux (NEBC), Swift Appliance (RDCEP)
Pros: no additional modules or components needed to rerun
× Cons: too hard to provision and understand
4. Reproducibility Problem
Our philosophy:
”... releasing shoddy VMs is easy to do, but it doesn’t help you learn how
to do a better job of reproducibility along the way. Releasing software
pipelines, however crappy, is on the path towards better reproducibility.”
C. Titus Brown [1]
Reproducibility problem: How can we make it easy to combine and
merge shared packages, while correctly attributing authorship of software
packages?
No need to provision VMIs or to simply publish source code and data.
[1] http://ivory.idyll.org/blog/vms-considered-harmful.html
5. Problem Scope
Use CDE [2] to capture and create portable software packages
Extend, partially re-use, and combine CDE packages to create new
reproducible software pipelines
Attribute authorship of software packages in new software pipelines
CDE has an OVERLAP conflict!
[2] Guo, P.J., Engler, D.: CDE: using system call interposition to automatically create portable software packages. USENIX Association, Portland, OR (2011)
6. CDE
Create a portable software package
without installation, configuration, or privilege permissions
Audit mode to create a CDE package
20. Software Pipelines Contain CDE packages
A software pipeline consists of many individual software modules
A software module depends on externally-developed libraries
A software module is often packaged together with specific versions of
libraries
21. RDCEP Usecase
Alice, Bob, and Charlie are scientists at the Center for Robust Decision
Making on Climate and Energy Policy (RDCEP)
A develops data integration methods to produce higher-resolution
datasets depicting inferred land use over time.
B develops computational models to do model-based comparative
analysis. B’s software environment consists of A’s software modules
to produce high-resolution datasets.
C uses A's and B's software modules within data-intensive
computing methods to run them in parallel.
The Center wants to predict future yields of staple agricultural
commodities given changes in the climate.
[Pipeline diagram] A's package runs Retrieve data → Aggregation → Generate images; B's package (built from A's) adds Model-based analysis; C's package (merged from B's) wraps the pipeline as Parallel init → Aggregation → Generate images → Model-based analysis → Parallel summary.
22. A’s Experiment & Package
A’s package
cde-root
path to A’s files
a-experiment.sh
retrieve-data
aggregation
generate-image
f1, f2, a-output
path to common libs
libc.so
Re-execute A’s experiment:
cde-exec a-experiment.sh
cat a-experiment.sh
./retrieve-data f1
./aggregation f1 f2
./generate-image f2 a-output
23. B’s Experiment & Package
B’s package
cde-root
path to A’s files
[...]
path to B’s files
b-experiment.sh
analysis
b-output
path to common libs
libc.so
Re-execute B’s experiment:
cde-exec b-experiment.sh
cat b-experiment.sh
cd path to A’s experiment
cde-exec a-experiment.sh
cd path to B’s files
./analysis path to A’s files/a-output b-output
24. C’s Experiment & Package
C’s package
cde-root
path to A’s files
[...]
path to B’s files
[...]
path to C’s files
c-experiment.sh
parallel-init
parallel-summary
c-output
path to common libs
libc.so
Re-execute C’s experiment:
cde-exec c-experiment.sh
cat c-experiment.sh
parallel-init path to A’s files/f4
cd path to A’s files
cde-exec ./aggregation f4 f5
cde-exec ./generate-image f5 f6
cd path to B’s files
cde-exec ./analysis path to A’s files/f6 f7
cd path to C’s files
./parallel-summary path to B’s files/f7 c-output
26. File Overlap of Different Linux Distributions
        RH           SUSE         U12           U13
Amz     5498 / 23k   3184 / 11k   1203 / 5.4k   1819 / 5.5k
RH                   3861 / 12k   1654 / 6.6k   2223 / 6.3k
SUSE                              1245 / 3.9k   2085 / 6.4k
U12                                             8226 / 24k
Table 1: Ratio of differing files having the same path across 5 popular AMIs. The denominator is the number of files having the same path in two distributions, and the numerator is the number of files with the same path but a different md5 checksum. Manual pages in the /usr/share/ directory are omitted.
Amz Amazon Linux AMI
RH Red Hat Enterprise Linux 6.4
SUSE SUSE Linux Enterprise Server 11
U12 Ubuntu Server 12.04.3 LTS
U13 Ubuntu Server 13.10
27. Re-direction in Multiple cde-root Directories
28. CDE-SP
CDE-SP: Enhanced CDE that includes software provenance
Describe tools and methods to audit, store, and query provenance
Provenance queries
Determine the environment under which a dependency was built
Examine the dependencies that must be present
Answer whether the packages in a pipeline can satisfy a new package
Attribute authorship of software packages in a pipeline
Combine and validate authorship from stored provenance
29. CDE-SP Audit
Objectives
Capture additional details of the origins of a library or a binary
Use these details for compiling and creating software pipelines
Methods
Create a dependency tree
Process system calls are monitored
Whenever a process executes a file system call, a dependency of that
process is recorded
Dependency can be a data file or a shared library
Extract information about binaries and required shared libraries
via the file, ldd, strings, and objdump UNIX commands
and uname -a plus the getpwuid(getuid()) function (a sketch of this step follows below)
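The following is a minimal Python sketch of the kind of origin details this audit step gathers by shelling out to standard UNIX tools. It is illustrative only, not CDE-SP's actual implementation (which extends CDE's C code); the describe_binary helper and the example path are assumptions.

```python
# Illustrative sketch only: collect origin details for a binary or shared
# library with standard UNIX tools, roughly mirroring what the audit records.
import os
import pwd
import platform
import subprocess

def describe_binary(path):
    def run(*cmd):
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    return {
        "file": run("file", path).strip(),           # file type and architecture
        "ldd": run("ldd", path),                     # required shared libraries
        "objdump": run("objdump", "-p", path),       # dynamic section (NEEDED, SONAME)
        "machine": str(platform.uname()),            # analogue of `uname -a`
        "agent": pwd.getpwuid(os.getuid()).pw_name,  # analogue of getpwuid(getuid())
    }

print(describe_binary("/bin/ls")["file"])
```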
32. Storage
Store provenance within the package itself
Use LevelDB: a fast and light-weight key-value storage library
Encode in the key the UNIX process identifier along with spawn time
Key                         Value        Explanation
pid.PID1.exec.TIME          PID2         PID1 wasTriggeredBy PID2
pid.PID.[path, pwd, args]   VALUES       Other properties of PID
io.PID.action.IO.TIME       FILE(PATH)   PID wasGeneratedBy / wasUsedBy FILE(PATH)
meta.agent                  USERNAME     User information
meta.machine                OSNAME       Operating system distribution
Table 2: LevelDB key-value pairs that store file and process provenance. Capital-letter words are arguments.
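As a hedged illustration of this key scheme, the sketch below uses the plyvel Python binding for LevelDB. The key encoding, database path, and sample PIDs are assumptions that mirror Table 2 rather than CDE-SP's exact on-disk format.

```python
# Sketch of the Table 2 key scheme using the plyvel LevelDB binding.
# Key encoding, paths, and sample values are illustrative only.
import time
import plyvel

db = plyvel.DB("cde-package/provenance.db", create_if_missing=True)

def record_exec(child_pid, parent_pid, spawn_time):
    # pid.PID1.exec.TIME -> PID2  (PID1 wasTriggeredBy PID2, i.e. PID2 spawned PID1)
    db.put(f"pid.{child_pid}.exec.{spawn_time}".encode(), str(parent_pid).encode())

def record_io(pid, action, path):
    # io.PID.action.IO.TIME -> FILE(PATH)  (wasGeneratedBy / wasUsedBy)
    db.put(f"io.{pid}.action.{action}.{time.time()}".encode(), path.encode())

record_exec(5678, 1234, 1402400000)
record_io(5678, "read", "cde-root/f1")
for key, value in db.iterator(prefix=b"io.5678."):
    print(key.decode(), "->", value.decode())
```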
33. Query
LevelDB provides a minimal API for querying
Simple, light-weight query interface
Input: a program whose dependencies need to be retrieved
Output: a GraphViz file displaying file and process dependencies
Use a depth-first search to build a dependency tree with the
input program as its root (a sketch follows below)
Exclusion option to remove uninteresting dependencies:
/lib/, /usr/lib/, /usr/share/, /etc/
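The sketch below illustrates such a depth-first query over the key layout of Table 2, again assuming the plyvel binding; the real CDE-SP query tool is part of the package and its output details may differ. Process identifiers and the database path are hypothetical.

```python
# Illustrative DFS over the provenance store: build a dependency tree rooted
# at a given process and emit GraphViz "dot" text. Key layout follows Table 2.
from collections import defaultdict
import plyvel

EXCLUDE = ("/lib/", "/usr/lib/", "/usr/share/", "/etc/")

def dependency_dot(db, root_pid):
    # One pass to invert "child wasTriggeredBy parent" into parent -> children.
    children = defaultdict(list)
    for key, parent in db.iterator(prefix=b"pid."):
        parts = key.decode().split(".")
        if len(parts) >= 3 and parts[2] == "exec":
            children[parent.decode()].append(parts[1])

    edges, stack, seen = [], [str(root_pid)], set()
    while stack:  # depth-first traversal from the root process
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        for child in children[pid]:
            edges.append((f"p{pid}", f"p{child}"))
            stack.append(child)
        for _, path in db.iterator(prefix=f"io.{pid}.".encode()):
            path = path.decode()
            if not path.startswith(EXCLUDE):  # exclusion option
                edges.append((f"p{pid}", f'"{path}"'))
    return "digraph deps {\n" + "\n".join(f"  {a} -> {b};" for a, b in edges) + "\n}"

print(dependency_dot(plyvel.DB("cde-package/provenance.db"), 5678))
```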
34. Authorship of Software Modules
Combine authorship of the contributing packages
Validate authorship from the provenance stored in the original
package
Generate the provenance subgraph associated with the relevant part of the new package
Use subgraph isomorphism (NP-hard in general) to validate it against the original
provenance graph (a sketch follows below)
Match process nodes whose binary paths and working directories are the same
Match file nodes whose paths are the same
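A hedged sketch of this matching step is given below, using networkx's VF2 subgraph-isomorphism matcher as a stand-in for the validation described here; the node attribute names (kind, binary, cwd, path) are assumptions, not CDE-SP's actual schema.

```python
# Sketch of authorship validation: check that the provenance subgraph taken
# from the new package matches a subgraph of the original package's provenance.
import networkx as nx
from networkx.algorithms import isomorphism

def node_match(orig, cand):
    if orig.get("kind") != cand.get("kind"):
        return False
    if orig["kind"] == "process":
        # process nodes match on binary path and working directory
        return orig["binary"] == cand["binary"] and orig["cwd"] == cand["cwd"]
    # file nodes match on path
    return orig["path"] == cand["path"]

def authorship_validates(original: nx.DiGraph, new_subgraph: nx.DiGraph) -> bool:
    matcher = isomorphism.DiGraphMatcher(original, new_subgraph, node_match=node_match)
    return matcher.subgraph_is_isomorphic()
```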
35. Experiments
Performance of CDE-SP
Auditing performance overhead
Disk storage increase
Provenance query runtime
Redirection overhead when multiple UUID-based directories are
created
Compare the lightweight virtualization approach of CDE-SP with
Kameleon [3], a heavyweight virtualization approach used for
reproducibility
Experiments were run on an Ubuntu 12.04 LTS workstation with 8 GB of
RAM and an 8-core Intel(R) processor clocked at 1600 MHz.
[3] Emeras, J., Richard, O., Bzeznik, B.: Reconstructing the software environment of an experiment with Kameleon (2011)
36. Performance & Size Overhead
Pipeline with two applications: Aggregation and Generate Image
2.1% slowdown of CDE-SP vs. 0-30% CDE virtualization overhead [4]
LevelDB database size 236kB (0.03% package size increase) contains
approximately 12,000 key-value pairs
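The 2.1% figure follows from the package-creation timings in Table 3: (870.5 - 852.6) / 852.6 ≈ 0.021, i.e. roughly a 2.1% slowdown.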
         Create Package (s)   Execution (s)   Disk Usage        Provenance Query (s)
CDE      852.6 ± 2.4          568.8 ± 2.4     732 MB            -
CDE-SP   870.5 ± 2.5          569.5 ± 1.8     732 MB + 236 kB   0.4 ± 0.03
Table 3: The increase in CDE-SP overhead is negligible in comparison with CDE.
[4] Guo, P.J., Engler, D.: CDE: using system call interposition to automatically create portable software packages. USENIX Association, Portland, OR (2011)
37. Redirection Overhead in CDE-SP
Pipelined output of Aggregation to input of Generate Image
3 output files of Aggregation package were moved to Generate Image
package
2 cross-package execve() system calls
Less than a 1% slowdown of CDE-SP
38. Kameleon
Use the Kameleon engine to make a bare-bones VM appliance
Self-written YAML-formatted recipes
Self-written macrosteps and microsteps
Kameleon can create virtual machine appliances in different formats
for different Linux distributions
Generates bash scripts to create an initial virtual image of a Linux
distribution
Populates the image with more Linux packages
Populates with content of a CDE-SP package
40. Related Work
Research Objects: package scientific workflows with auxiliary
information, including provenance and metadata such as the
authors and the version
CDE and Sumatra can capture an execution environment in a
lightweight fashion
SystemTap, being a kernel-based tracing mechanism, has better
performance compared to ptrace but needs to run at a higher
privilege level
Provenance-to-Use (PTU) and ReproZip include provenance in
self-contained software packages
41. Conclusion
CDE does not encapsulate provenance of associated dependencies in
a software package
The lack of information about the origins of dependencies in a
software package creates issues when constructing software pipelines
from packages
CDE-SP can include software provenance as part of a software
package
CDE-SP can use software package provenance to build software
pipelines
CDE-SP can maintain provenance when used to construct software
pipelines
42. Acknowledgments
Neil Best at The University of Chicago
Joshua Elliott at Columbia University
Justin Wozniak at Argonne National Laboratory
Allison Brizius at the RDCEP Center
NSF grants SES-0951576 and GEO-1343816