Using Provenance for Repeatability

Quan Pham¹, Tanu Malik², Ian Foster¹,²
¹Department of Computer Science, University of Chicago
²Computation Institute, University of Chicago and Argonne National Laboratory

TaPP 2013
Publication Process
• Traditional academic publication process
  • Authors submit a paper
  • Reviewers review the ideas and experiments
  • Readers learn novel methods

• Emerging academic publication process
  • Authors submit a paper
  • Reviewers review the ideas and experiments, and validate the software
  • Readers ask: are we reading something that is repeatable and reproducible?
Repeatability Testing

• Scientific progress relies on novel claims and verifiable
  results
• Scientific paper reviewers
  • Validate announced results
  • Validate for different
    data and parameters
  • Validate under different
    conditions and environments

• Challenge: Work under
  time & budget constraints
  Image: from http://catsandtheirmews.blogspot.com/2012/05/update-on-computer-crash.html
Repeatability Testing
       Challenges & Constraints
•   Repeatability requirements
    • Hardware: single machine or clusters?
    • Software
      • Operating system: which operating system was used? (Ubuntu/RedHat/Debian/Gentoo)
      • Environment: how to capture all the environment variables?
      • Tools & library installation: how to precisely know all the dependencies?

•   Knowledge constraints
    • Experiment setup: how to set up the experiment?
    • Experiment usage: how is the experiment run?

•   Resource constraints: the experiment
    • Requires massive processing power.
    • Operates on large amounts of data.
    • Performs significant network communication.
    • Is long-running.
An Approach to Repeatability Testing

Challenges & Constraints → Possible Solution
• Repeatability requirements (hardware, software) → Provide a virtual machine or a portable software package
• Knowledge constraints (experiment setup and usage) → Provide a reference execution
• Resource constraints → Provide selective replay
PTU – Provenance-To-Use
• PTU
  • Minimizes computation time during repeatability testing
  • Guarantees that events are processed in the same order
    using the same data

• Authors build a package that includes:
  • Software program
  • Input data
  • Provenance trace

• Testers may select a subset of the package’s
  processes for a partial deterministic replay
PTU Functionalities

• ptu-audit tool
  • Builds a package of the authors' source code, data, and environment variables
  • Records process- and file-level details about a reference execution
       % ptu-audit java TextAnalyzer news.txt

• PTU package
  • Displays the provenance graph and accompanying run-time details

• ptu-exec tool
  • Re-executes a specified part of the provenance graph
       % ptu-exec java TextAnalyzer news.txt
ptu-audit

• Uses ptrace to monitor
  system calls
  • execve, sys_fork
  • read, write, sys_io
  • bind, connect, socket

• Collects provenance
• Collects runtime
  information
• Builds the package
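
The ptrace monitoring listed above can be pictured with a minimal sketch. The C program below is illustrative only, not PTU's code: it runs a command under PTRACE_SYSCALL on x86-64 Linux and prints the handful of provenance-relevant system calls named on this slide.

    /*
     * Minimal sketch of ptrace-based system-call auditing (illustrative
     * only, not PTU's implementation).  Runs a command under
     * PTRACE_SYSCALL on x86-64 Linux and logs a few provenance-relevant
     * calls: execve, fork/clone, read, write, connect.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <sys/syscall.h>

    int main(int argc, char *argv[])
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }

        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
            execvp(argv[1], &argv[1]);               /* run the audited program */
            perror("execvp");
            _exit(1);
        }

        int status;
        waitpid(child, &status, 0);                  /* child stops after execvp */

        while (!WIFEXITED(status)) {
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);

            /* On x86-64 the system-call number is in orig_rax.  Note that
             * PTRACE_SYSCALL stops at both call entry and exit, so each
             * call is reported twice in this simplified sketch. */
            long nr = (long)regs.orig_rax;
            if (nr == SYS_execve || nr == SYS_fork || nr == SYS_clone ||
                nr == SYS_read   || nr == SYS_write || nr == SYS_connect)
                printf("pid %d: syscall %ld\n", child, nr);

            ptrace(PTRACE_SYSCALL, child, NULL, NULL);  /* resume to next stop */
            waitpid(child, &status, 0);
        }
        return 0;
    }

A real auditor such as ptu-audit would additionally decode the call arguments (file paths, socket addresses) and record the process and file metadata listed above, rather than just printing syscall numbers.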
PTU Package

• [Figure 2. The PTU package. The tester chooses to run the sub-graph rooted at /bin/calculate]
ptu-exec

• [Figure 3. ptu-exec re-runs part of the application
  from /bin/calculate. It uses CDE to re-route file
  dependencies]
Current PTU Components

• Uses the CDE (Code, Data, and Environment) tool to create a package
  • CDE copies each accessed file into a self-contained package directory (cde-package/cde-root/), preserving sub-directories and symbolic links

• Uses ptrace to create a provenance graph representing a reference run-time execution
• Uses SQLite to store the provenance graph
• Uses Graphviz for graph presentation
• Enhances CDE to run the package
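
As a rough illustration of the SQLite storage choice above, the sketch below persists one process node and one file-access edge. The schema (tables process and file_edge and their columns, and the example values) is hypothetical, not PTU's actual layout.

    /*
     * Sketch of storing provenance in SQLite (hypothetical schema; table
     * and column names are illustrative, not PTU's actual layout).
     * Records one process node and one file-access edge.
     */
    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;

        if (sqlite3_open("provenance.db", &db) != SQLITE_OK) {
            fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            return 1;
        }

        const char *sql =
            /* process nodes and file edges of the provenance graph */
            "CREATE TABLE IF NOT EXISTS process ("
            "  pid INTEGER, name TEXT, cmdline TEXT, start_time TEXT);"
            "CREATE TABLE IF NOT EXISTS file_edge ("
            "  pid INTEGER, path TEXT, mode TEXT);"    /* mode: 'read' or 'write' */
            /* one audited event: /bin/calculate read input.tif */
            "INSERT INTO process VALUES "
            "  (1234, '/bin/calculate', '/bin/calculate input.tif', '2013-03-28T14:37');"
            "INSERT INTO file_edge VALUES (1234, 'input.tif', 'read');";

        if (sqlite3_exec(db, sql, NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "sql failed: %s\n", err);
            sqlite3_free(err);
        }

        sqlite3_close(db);
        return 0;
    }

Compile with cc prov.c -lsqlite3. Queries over tables like these are what would let a tester select a sub-graph (for example, the one rooted at /bin/calculate) for replay.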
PEEL0

• Best, N., et al., Synthesis of a Complete Land Use/Land Cover Dataset for the Conterminous United States. RDCEP Working Paper 12(08), 2012.



     • Step 1 (download): Wget, Bash script
     • Step 2 (reclassify): R, raster, rgdal
     • Step 3 (calculate): R, geospatial algorithm
PEEL0

• [Figure 4: Time reduction in testing PEEL0 using
  PTU]

TextAnalyzer

• Murphy, J., et al., Textual Hydraulics: Mining Online Newspapers to Detect Physical, Social, and Institutional Water Management Infrastructure. Technical Report, Argonne National Laboratory, 2013.

• Runs a named-entity recognition analysis program using several data dictionaries

• Splits the input file into multiple input files on which it runs the analysis in parallel
TextAnalyzer

• [Figure 5. Time reduction in testing TextAnalyzer
  using PTU]
Conclusion

• PTU is a step toward repeatability testing of software programs submitted to conference proceedings and journals

• Easy and attractive for authors to adopt

• Fine-grained control and an efficient workflow for testers
Future Work

• Other workflow types
  • Distributed workflows

• Improved performance
  • Decide how to store provenance compactly in a package

• Presentation
  • Improve the graphical user interface and presentation
Acknowledgements

• Neil Best

• Jonathan Ozik

• Center for Robust Decision Making on Climate and Energy Policy (NSF grant number 0951576)

• Contractors of the US Government under contract number DE-AC02-06CH11357


Editor's Notes

  • #2 Hi everyone, my name is Quan Pham. In this presentation I'd like to introduce a system that uses provenance for repeatability. The work is done with Tanu Malik at the Computation Institute, University of Chicago, and Ian Foster at Argonne National Laboratory.
  • #4 What is the problem with repeatability in the scientific community? Process of publication: the author submits; reviewers find interesting claims and want to verify them by re-running with different data/parameters and under different conditions/environments. So many things to validate, so little time and budget (hardware). The emerging publication process introduces the roles of author, tester, and reader (authors -> testers -> readers), with repeatability in between.
  • #9 ptu-audit uses ptrace to monitor ~50 system calls, including process system calls such as execve() and sys_fork() for collecting process provenance; file system calls such as read(), write(), and sys_io() for collecting file provenance; and network calls such as bind(), connect(), socket(), and send() for auditing network activity. It obtains the process name, owner, group, parent, host, creation time, command line, and environment variables; the file name, path, host, size, and modification time; and the memory footprint, CPU consumption, and I/O activity of each process from /proc/$pid/stat. It also copies each accessed file into a package directory that mirrors the sub-directories and symbolic links of the original file's location.
  • #11 Why choose /bin/calculate: it is memory intensive and long-running. When the entire reference run finishes, PTU builds a reference execution file consisting of the topological sort of the provenance graph. The nodes of the graph enumerate run-time details, such as process memory consumption and file sizes. The tester, as described next, can use the database and the graph for efficient re-execution. Testers can examine the provenance graph contained in a package to study the accompanying reference execution, and can request a re-execution either by specifying nodes in the provenance graph or by modifying a run configuration file that is included in the package.
  • #12 ptu-exec obtains the run configuration and environment variables for each process from the SQLite database, monitors the process via ptrace, and re-uses the CDE functionality of replacing path arguments to refer to the corresponding path within the package (cde-package/cde-root/). It provides fast audit and re-execution independent of the application; the profiling of processes enables testers to choose which processes to run.
  • #14 PEEL0 is a three-step workflow implemented as an R program; the classification step is memory intensive. [Picture: PEEL0 workflow, get -> reclassify -> calculate -> final result; testers look at (calculate) with question marks.]
  • #15 Five process nodes, 10,000 exclusive file reads (based on the number of files in the dataset), and 422 exclusive file writes for the aggregated dataset. Slowdown when using PTU: ~35% for PEEL0.
  • #17 Eight process nodes that in aggregate conduct 616 exclusive file reads, 124 exclusive file writes, and 50 file nodes that are read and written again. Slowdown when using PTU: ~15% for TextAnalyzer. TextAnalyzer shows a particularly large improvement (>98%) since the entire process is run on a much smaller file.
  • #18 PTU is a step toward testing software programs that are submitted to conference proceedings and journals to conduct repeatability tests. Peer reviewers often must review these programs in a short period of time. By providing one utility for packaging the software pro- grams and its reference execution without modifying the application, we have made it easy and attractive for authors to use it and a fine control, efficient way for testers to use PTU.