PTU: Using Provenance for Repeatability

792 views
686 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
792
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • Hi everyone,My name is QP,In this presentation, I’d like to introduce a system that use provenance for repeatability.The work is done with TMalik @ CI, UoC and Ifoster @ ANL.----- Meeting Notes (3/28/13 14:37) -----no abbreviationlet's the slide talk: n & v should be there
  • What is the problem with repeatability in scientific community?Process of publication:+ I submit + Reviewers find interesting claims+ want to verify by re-run, with different data/param, different cond/envSo many things to validate, so little time and budget (hardware)===Problems: r u facing this? 1sldWhy? 1sldChallenges 1sld10-15 sld totalThis is what I presentSteps & stepsHigh lvl overview----- Meeting Notes (3/28/13 14:37) -----1 slide: what is a "new" publication process (authors -> (tester) -> readers)concept of author, tester and repeatability
  • + uses ptrace to monitor ~50 system calls including process system calls, such as execve() and sys_fork(), for collecting process provenance; file system calls, such as read(), write(), and sys_io(), for collecting file provenance; and network calls, such as bind(), connect(), socket(), and send() for auditing network activity. + obtain process name, owner, group, parent, host, creation time, command line, and environment variables; and file name, path, host, size, and modification time+obtain memory footprint, CPU consumption, and I/O activity data for each process from /proc/$pid/stat + copies the accessed file into a package directory that consists of all sub- directories and symbolic links to the original file’s location.
  • + uses ptrace to monitor ~50 system calls including process system calls, such as execve() and sys_fork(), for collecting process provenance; file system calls, such as read(), write(), and sys_io(), for collecting file provenance; and network calls, such as bind(), connect(), socket(), and send() for auditing network activity. + obtain process name, owner, group, parent, host, creation time, command line, and environment variables; and file name, path, host, size, and modification time+obtain memory footprint, CPU consumption, and I/O activity data for each process from /proc/$pid/stat + copies the accessed file into a package directory that consists of all sub- directories and symbolic links to the original file’s location.
  • Reason: why choose /bin/cal (mem intensive, run-time)+ When the entire reference run finishes, PTU builds a reference execution file consisting of the topological sort of the provenance graph. The nodes of the graph enumerate run-time details, such as process memory consumption, and file sizes. The tester, as described next, can utilize the database and the graph for efficient re-execution. + Testers can - examine the provenance graph contained in a package to study the accompanying reference execution. - request a re-execution, either by specifying nodes in the provenance graph or by modifying a run configuration file that is included in the package.
  • + obtains run configuration and environment variables for each process from the SQLite database + monitors it via ptrace and re-executes CDE functionality of replacing path argument(s) to refer to the corresponding path within the package cde-package/cde-root/. +provides fast audit and re-execution independent of the application. The profiling of processes enables testers to choose the processes to run.
  • PEEL0Three step workflow processImplemented as an R-program Classification is memory intensive[pic: PEEL0 workflowget -> reclassify -> calculate -> final resulttesters look at (calculate) with question marks]
  • five process nodes, 10000 exclusive file reads based on the number of files in the dataset, and 422 exclusive file writes for the aggregated dataset Slowdownwhenuse PTU: ~35% for PEEL0
  • + eight process nodes that in aggregate conduct 616 exclusive file reads, 124 exclusive file writes, and 50 file nodes that are read and written again. + Slowdownwhenuse PTU: ~15% for TextAnalyzer+ TextAnalyzer has a particularly large improvement (>98%) since the entire process is run on a much smaller file.
  • PTU is a step toward testing software programs that are submitted to conference proceedings and journals to conduct repeatability tests. Peer reviewers often must review these programs in a short period of time. By providing one utility for packaging the software pro- grams and its reference execution without modifying the application, we have made it easy and attractive for authors to use it and a fine control, efficient way for testers to use PTU.
  • PTU: Using Provenance for Repeatability

    1. 1. Using Provenance for Repeatability Quan Pham1, Tanu Malik2, Ian Foster1,2 Department of Computer Science1,§ and ComputationInstitute2,¶ University of Chicago§,¶ and Argonne National Laboratory¶ TaPP 2013
    2. 2. Publication Process• Traditional academic publication process • Submit paper • Review ideas • Learn novel &experiment methods. s• Emerging academic publication process • Submit paper • Review ideas • Are we reading &experiments something that is • Validate repeatable and software reproducible?
    3. 3. Repeatability Testing• Scientific progress relies on novel claims and verifiable results• Scientific paper reviewers • Validate announced results • Validate for different data and parameters • Validate under different conditions and environments• Challenge: Work under time & budget constraints Image: from http://catsandtheirmews.blogspot.com/2012/05/update-on-computer-crash.html
    4. 4. Repeatability Testing Challenges & Constraints• Repeatability requirements • Hardware : Single machine/Clusters • Software • Operating System : Which operating system was used? (Ubuntu/RedHat/Debian/Gentoo) • Environment: How to capture all environment variables? • Tools & libraries installation: How to precisely know all the dependencies?• Knowledge constraints • Experiment setup: how to setup the experiment? • Experiment usage: how the experiment is run?• Resource constraints • Requires massive processing power. • Operates on large amounts of data. • Performs significant network communication. • Is long-running.
    5. 5. An Approach to Repeatability TestingChallenges & Constraints Possible Solution• Repeatability • Provide a virtual requirements machine • Hardware requirement • Provide a portable • Software requirement software• Knowledge constraints Provide a reference • Experiment setup execution • Experiment usage• Resource constraints Provide selective replay
    6. 6. PTU – Provenance-To- Use• PTU • Minimizes computation time during repeatability testing • Guarantees that events are processed in the same order using the same data• Authors build a package that includes: • Software program • Input data • Provenance trace• Testers may select a subset of the package’s processes for a partial deterministic replay
    7. 7. PTU Functionalities• ptu-audit tool • Build a package of authors’ source code, data, and environment variables • Record process- and file-level details about a reference execution % ptu-audit java TextAnalyzer news.txt• PTU package • Display the provenance graph and accompanying run-time details• ptu-exec tool • Re-execute specified part of the provenance graph % ptu-exec java TextAnalyzer news.txt
    8. 8. ptu-audit• Uses ptrace to monitor system calls • execve, sys_fork • read, write, sys_io • bind, connect, socket• Collects provenance• Collects runtime information• Makes package
    9. 9. ptu-audit• Use ptrace to monitor system calls • execve, sys_fork • read, write, sys_io • bind, connect, socket• Collect provenance• Collect runtime info• Make package
    10. 10. PTU Package• [Figure 2. The PTU package. The tester chooses to run the sub-graph rooted at /bin/calculate ]
    11. 11. ptu-exec• [Figure 3. ptu-exec re-runs part of the application from /bin/calculate. It uses CDE to re-route file dependencies]
    12. 12. Current PTU Components• Uses CDE (Code-Data-Environment) tool to create a package • Details CDE• Uses ptrace to create a provenance graph representing a reference run-time execution• Uses SQLite to store the provenance graph• Uses graphviz for graph presentation• Enhances CDE to run the package
    13. 13. PEEL0• Best, N., et. al., Synthesis of a Complete Land Use/Land Cover Dataset for the Conterminous United States. RDCEP Working Paper, 2012. 12(08). • Wget • R • R • Bash • Raster • Geo script • Rgdal algorithm • Reclassify
    14. 14. PEEL0• [Figure 4: Time reduction in testing PEEL0 using PTU]• Or use the actual execution graph??
    15. 15. TextAnalyzer• Murphy, J., et. al., Textual Hydraulics: Mining Online Newspapers to Detect Physical, Social, and Institutional Water Management Infrastructure, 2013, Technical Report, Argonne National Lab.• runs a named-entity recognition analysis program using several data dictionaries• splits the input file into multiple input files on which it runs a parallel analysis
    16. 16. TextAnalyzer• [Figure 5. Time reduction in testing TextAnalyzer using PTU]
    17. 17. Conclusion• PTU is a step toward testing software programs that are submitted to conference proceedings and journals to conduct repeatability tests• Easy and attractive for authors• Fine control, efficient way for testers
    18. 18. Future Works• Other workflow types • Distributed workflows.• Improve performance • Decide how to store provenance compactly in a packge.• Presentation • Improve graphic-user-interface and presentation
    19. 19. Acknowledgements• Neil Best• Jonathan Ozik• Center for Robust Decision making on Climate and Energy Policy (NSF grant number 0951576)• Contractors of the US Government under contract number DEAC02-06CH11357

    ×