This document proposes an approach called PTU (Provenance-To-Use) to improve the repeatability of scientific experiments by minimizing computation time during repeatability testing. PTU builds a package containing the software, input data, and provenance trace from a reference execution. Testers can then selectively replay parts of the provenance graph using the ptu-exec tool, reducing testing time compared to full re-execution. The document describes the PTU components, including tools for auditing reference runs, building provenance packages, and selectively replaying parts of the provenance graph. Examples applying PTU to the PEEL0 and TextAnalyzer applications show reductions in testing time.
LDV: Light-weight Database Virtualization (Tanu Malik)
The document summarizes the Light-weight Database Virtualization (LDV) framework. LDV aims to enable easy and efficient sharing of database applications by capturing an application's execution provenance and dependencies. It uses application virtualization techniques to package the application binaries, libraries, and data. For applications that interact with a database, it also records the interactions between the application and database using system call monitoring and SQL logging. This combined provenance allows recreating the application's execution environment and replaying the database interactions to validate or reproduce results. Key components of LDV include provenance modeling, package creation with necessary files and traces, and runtime redirection to reconstruct the environment.
The document proposes a reproducible framework powered by Globus to help researchers share and reproduce scientific models and simulations. It describes challenges with current methods for sharing work, like dependencies or configuration issues, that prevent easy reproducibility. The framework aims to [1] capture scientific activities, code, data and environments; [2] preserve them as standardized packages called "SciUnits"; [3] share and distribute SciUnits so others can [4] re-execute and re-analyze the work without installation or configuration problems. Key components are outlined for establishing this framework to support reproducibility across different scientific domains.
GeoDataspace: Simplifying Data Management Tasks with Globus (Tanu Malik)
This document describes GeoDataspace, a framework for enabling data and model sharing in computational geosciences. GeoDataspace uses geounits, which package code, data, and environment, to capture scientific activities and research outputs. The Globus Catalog provides a flexible metadata catalog for hosting and querying geounits. GeoDataspace aims to improve reproducibility and enable validation of shared models and data through re-executing geounits. Several geoscience applications including plate tectonics, hydrology, and space science are discussed as potential adopters.
GEN: A Database Interface Generator for HPC Programs (Tanu Malik)
GEN is a database interface generator that takes user-supplied C declarations and provides an interface to load scientific array data into databases without requiring changes to source code. It works by wrapping POSIX I/O calls at runtime to generate database schema definitions and load data. Experiments show it can reduce the time needed to reorganize data in the database compared to loading data from files and reorganizing outside the database. Current work aims to relax GEN's assumptions and improve data loading performance.
Benchmarking Cloud-based Tagging Services (Tanu Malik)
This document presents a framework for benchmarking cloud-based tagging services. It proposes a tagging data model and generates workloads using a Markov chain process. Experiments were conducted using OLTP-Bench to evaluate the performance of MySQL, Postgres and SQL Server under different tagging workloads and schemas on Amazon EC2. The results show that for sparse datasets, a vertically partitioned schema outperforms horizontal schemas, and the best database depends on the sparseness of the data. The benchmark aims to help determine the most efficient cloud platform for supporting dynamic, sparse tagging data.
The document outlines plans for improving access to geoscience data through the EarthCube initiative. It summarizes discussions from a workshop focused on developing use cases and standardizing the data lifecycle. The workshop highlighted that access is crucial to research and identified challenges around social and technical issues. It also described plans for a "DataSpace" platform to enable open, extensible sharing of data and services and emphasized usability over infrastructure development. Lastly, it discussed next steps from concept meetings focused on converging roadmaps, architectures, and future directions.
The Galaxy bioinformatics workflow environment (Rutger Vos)
The document provides an overview of Galaxy, a bioinformatics workflow environment. Galaxy allows users to analyze genomic and other biological data through an easy-to-use web interface without requiring programming skills. It tracks the provenance of analyses and allows workflows to be run, shared, and reproduced. Galaxy supports common data types and file formats, runs tools through job queuing, and stores metadata in a database. The community is involved through mailing lists, tutorials, and events.
This presentation was part of the workshop on Materials Project software infrastructure conducted for the Materials Virtual Lab on Nov 10, 2014. It presents an introduction to the Python Materials Genomics (pymatgen) materials analysis library. Pymatgen is a robust, open-source Python library for materials analysis. It currently powers the public Materials Project (http://www.materialsproject.org), an initiative to make calculated properties of all known inorganic materials available to materials researchers. These are some of the main features:
1. Highly flexible classes for the representation of Element, Site, Molecule, and Structure objects.
2. Extensive I/O capabilities to manipulate many VASP (http://cms.mpi.univie.ac.at/vasp/) and ABINIT (http://www.abinit.org/) input and output files and the crystallographic information file format, including generating Structure objects from VASP input and output. There is also support for Gaussian input files and XYZ files for molecules.
3. Comprehensive tools to generate and view compositional and grand canonical phase diagrams.
4. Electronic structure analyses (DOS and band structure).
5. Integration with the Materials Project REST API.
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data (Anubhav Jain)
The DuraMat Data Hub and Analytics Capability provides a centralized resource for sharing solar PV data. It collects performance, materials properties, meteorological, and other data through a central Data Hub. A data analytics thrust works with partners to provide software, visualization, and data mining capabilities. The goal is to enhance efficiency, reproducibility, and new analyses by combining multiple data sources in one location. Examples of ongoing projects using the hub include clear sky detection modeling to automatically classify sky conditions from irradiance data.
The document provides an introduction and overview of FireWorks workflow software. Some key points:
- FireWorks is an open-source, Python-based workflow management software that uses MongoDB and is pip-installable.
- It is used by several large DOE projects and materials science groups for tasks like materials modeling, machine learning, and document processing. Over 100 million CPU-hours have been used with everyday production use.
- FireWorks allows for very dynamic workflows that can modify themselves intelligently and add/remove tasks over long periods of time in response to results. It also features job detection and status persistence.
Deep learning is finding applications in science such as predicting material properties. DLHub is being developed to facilitate sharing of deep learning models, data, and code for science. It will collect, publish, serve, and enable retraining of models on new data. This will help address challenges of applying deep learning to science like accessing relevant resources and integrating models into workflows. The goal is to deliver deep learning capabilities to thousands of scientists through software for managing data, models and workflows.
Atomate: a high-level interface to generate, execute, and analyze computation... (Anubhav Jain)
Atomate is a high-level interface that makes it easy to generate, execute, and analyze computational materials science workflows. It contains a library of simulation procedures for different packages like VASP. Each procedure translates instructions into workflows of jobs and tasks. Atomate encodes expertise to run simulations and allows customizing workflows. It integrates with FireWorks to execute workflows on supercomputers and store results in databases for further analysis. The goal is to automate simulations and scale to millions of calculations.
SDCSB Advanced Tutorial: Reproducible Data Visualization Workflow with Cytosc... (Keiichiro Ono)
This document provides an overview of a tutorial on building reproducible network data visualization workflows using Cytoscape and IPython Notebook. The tutorial will cover integrating data, analyzing networks, visualizing results, and preparing outputs for publication. It will demonstrate setting up a portable data analysis environment using Docker and sharing work through GitHub. The bulk of the tutorial will focus on using IPython Notebook as an electronic lab notebook for interactive and reproducible experiments with Cytoscape.
Building Reproducible Network Data Analysis / Visualization Workflows (Keiichiro Ono)
The document discusses building reproducible network data analysis and visualization workflows using REST APIs and containerization. It aims to solve problems with complex software stacks that are difficult to set up and not reproducible. The goal is to create reproducible and scalable "dry experiments" using Docker containers, GitHub for source code sharing, Jupyter notebooks as electronic lab notebooks, and the cyREST module for the Cytoscape network analysis software. Examples of scenarios using local workstations and cloud computing are presented, as well as a demo and future plans.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Cytoscape Tutorial Session 1 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014) (Keiichiro Ono)
This document outlines a tutorial on biological data analysis and visualization using Cytoscape. The tutorial covers basic concepts like networks and tables in Cytoscape, data import, network analysis features, and visualization techniques. It discusses loading sample network data, calculating network statistics, filtering networks, basic search functionality, and applying visual styles. The tutorial is intended to provide a practical introduction to Cytoscape's core features through examples and demos.
Automating materials science workflows with pymatgen, FireWorks, and atomate (Anubhav Jain)
FireWorks is a workflow management system that allows researchers to define and execute complex computational materials science workflows on local or remote computing resources in an automated manner. It provides features such as error detection and recovery, job scheduling, provenance tracking, and remote file access. The atomate library builds on FireWorks to provide a high-level interface for common materials simulation procedures like structure optimization, band structure calculation, and property prediction using popular codes like VASP. Together, these tools aim to make high-throughput computational materials discovery and design more accessible to researchers.
The Materials Project is an open initiative that makes calculated materials property data publicly available to accelerate materials innovation. It has calculated properties for over 30,000 materials using over 10 million CPU hours. The project provides a Python library and API to access and analyze materials data, as well as a workflow manager to run calculations on supercomputers. It aims to calculate all known inorganic materials and establish collaborations to develop new materials design tools.
Apache Airavata is an open source science gateway software framework that allows users to compose, manage, execute, and monitor distributed computational workflows. It provides tools and services to register applications, schedule jobs on various resources, and manage workflows and generated data. Airavata is used across several domains to support scientific workflows and is largely derived from academic research funded by the NSF.
WBDB 2015 Performance Evaluation of Spark SQL using BigBench (t_ivanov)
In this paper we present the initial results of our work to run BigBench on Spark. First, we evaluated the data scalability behavior of the existing MapReduce implementation of BigBench. Next, we executed the group of 14 pure HiveQL queries on Spark SQL and compared the results with the respective Hive results. Our experiments show that (1) for both MapReduce and Spark SQL, BigBench queries scale on average better than linearly as the data size increases, and (2) pure HiveQL queries run faster on Spark SQL than on Hive.
http://clds.sdsc.edu/wbdb2015.ca/program
Lessons Learned on Benchmarking Big Data Platforms (t_ivanov)
The document discusses benchmarking different big data platforms and SQL-on-Hadoop engines. It evaluates the performance of Hadoop using the TPCx-HS benchmark with different network configurations. It also compares the performance of SQL query engines like Hive, Spark SQL, Impala, and file formats like ORC and Parquet using the TPC-H benchmark on a 1TB dataset. The results show that a dedicated 1Gb network is 5 times faster than a shared network. For SQL query engines, Hive with ORC format is on average 1.44 times faster than with Parquet. Spark SQL could only run 12 queries and was faster on 5 queries compared to Hive.
Accelerating Discovery via Science Services (Ian Foster)
[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.
Materials Data Facility: Streamlined and automated data sharing, discovery, ... (Ian Foster)
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4... (Keiichiro Ono)
This document summarizes a presentation given by Keiichiro Ono on the open source software platform Cytoscape. Ono introduced Cytoscape as a tool for biological network analysis and visualization. He discussed how it can integrate network and attribute data, perform network analysis functions like filtering and calculating statistics, and visualize networks through customizable layouts and visual styles. Ono also highlighted Cytoscape's ecosystem of apps that extend its functionality and its use of open standards to import a variety of network and attribute data formats.
A personal journey towards more reproducible networking research (Olivier Bonaventure)
The document discusses reproducibility in networking research. It summarizes a study on the accessibility of software artifacts from papers presented at SIGCOMM, CoNEXT, and Hotnets conferences between 2013-2014. The study found that only a small portion had their software artifacts publicly available either through a URL in the paper or by contacting the authors. It provides recommendations to improve reproducibility, such as encouraging authors to release source code and data with their papers. The document also discusses challenges around handling and sharing privacy-sensitive network data.
Reproducibility of computational workflows is automated using continuous anal... (Kento Aoyama)
The document summarizes a proposed method called "Continuous Analysis" that aims to improve reproducibility in computational research. Continuous Analysis combines continuous integration (CI) practices with computational research to automatically verify reproducibility. It proposes using Docker containers to capture computational environments and CI tools like Drone to automatically rebuild images and rerun analyses on code changes, flagging any differences in results. An experiment applying Continuous Analysis to two genomic analyses demonstrated it could more easily detect changes between versions.
PBS and Scheduling at NCI: The past, present and future (inside-BigData.com)
In this deck from the 2015 PBS Works User Group, Andrew Wellington, Senior HPC Systems Administrator, from NCI presents: PBS and Scheduling at NCI: The past, present and future.
Watch the video presentation: http://insidehpc.com/2015/09/video-pbs-and-scheduling-at-nci-the-past-present-and-future/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Delivered at the FISL13 conference in Brazil: http://www.youtube.com/watch?v=K9w2cipqfvc
This talk introduces the USE Method: a simple strategy for performing a complete check of system performance health, identifying common bottlenecks and errors. This methodology can be used early in a performance investigation to quickly identify the most severe system performance issues, and is a methodology the speaker has used successfully for years in both enterprise and cloud computing environments. Checklists have been developed to show how the USE Method can be applied to Solaris/illumos-based and Linux-based systems.
Many hardware and software resource types have been commonly overlooked, including memory and I/O busses, CPU interconnects, and kernel locks. Any of these can become a system bottleneck. The USE Method provides a way to find and identify these.
This approach focuses on the questions to ask of the system, before reaching for the tools. Tools that are ultimately used include all the standard performance tools (vmstat, iostat, top), and more advanced tools, including dynamic tracing (DTrace), and hardware performance counters.
Other performance methodologies are included for comparison: the Problem Statement Method, Workload Characterization Method, and Drill-Down Analysis Method.
Enjoying the Journey from Puppet 3.x to Puppet 4.x (PuppetConf 2016) (Robert Nelson)
Let's describe the process for upgrading from Puppet 3 to 4, list some common pitfalls and how to avoid them, and be sure to enjoy ourselves in the process!
2019-03-11 Bio-IT World West GenePattern Notebook slides (Michael Reich)
The GenePattern Notebook Environment for Open Science and Reproducible Research, presentation by Michael Reich, UCSD, at BioITWorld West, San Francisco, March 11, 2019
Virtual Puppet User Group: Puppet Development Kit (PDK) and Puppet Platform 6... (Puppet)
This document discusses using the Puppet Development Kit (PDK) to improve the testing and development of Puppet modules. The PDK provides tools and configuration to make module development easier and shift testing left. It allows generating module skeletons with tests, validating code quality, and testing modules against different Puppet versions. The latest PDK releases add support for testing modules against multiple Puppet versions simultaneously.
This document discusses openness and reproducibility in computational science. It begins with an introduction and background on the challenges of analyzing non-model organisms. It then describes the goals and challenges of shotgun sequencing analysis, including assembly, counting, and variant calling. It emphasizes the need for efficient data structures, algorithms, and cloud-based analysis to handle large datasets. The document advocates for open science practices like publishing code, data, and analyses to ensure reproducibility of computational results.
The document discusses strategies for managing information processing projects. It covers documentation, evaluating efficiency, and techniques for handling information. Key points:
- Internal documentation explains a system to staff while external documentation explains it to users. Electronic documentation like help menus and email are more efficient than paper-based forms.
- Efficiency is measured by savings in time and effort. Choosing the right equipment and software impacts effectiveness.
- Strategies for managing projects include top-down, bottom-up, and middle-out planning. Procedures cover research, design, production, and evaluation to meet user needs on time and on budget.
- The document describes an experimentation platform built at Staples-SparX to run thousands of experiments in parallel daily, processing 8GB of data and 500 requests per second with a latency SLA of 10ms.
- It discusses the infrastructure implemented including a PostgreSQL cluster, real-time reporting, and simulation testing to validate the system's behavior under different conditions.
- Clojure was used to build the platform for its expressiveness and to leverage existing Java tooling, with an emphasis on focusing on the problem domain rather than the implementation.
Reactive Documents and Computational Pipelines - Bridging the Gap (Globus)
As scientific discovery and experimentation become increasingly reliant on computational methods, the static nature of traditional publications renders them progressively fragmented and unreproducible. How can workflow automation tools, such as Globus, be leveraged to address these issues and potentially create a new, higher-value form of publication? LivePublication leverages Globus’s custom Action Provider integrations and Compute nodes to capture semantic and provenance information during distributed flow executions. This information is then embedded within an RO-crate and interfaced with a programmatic document, creating a seamless pipeline from instruments, to computation, to publication.
PuppetConf 2016: Enjoying the Journey from Puppet 3.x to 4.x – Rob Nelson, AT&T (Puppet)
Here are the slides from Rob Nelson's PuppetConf 2016 presentation called Enjoying the Journey from Puppet 3.x to 4.x. Watch the videos at https://www.youtube.com/playlist?list=PLV86BgbREluVjwwt-9UL8u2Uy8xnzpIqa
Guider: An Integrated Runtime Performance Analyzer on AGL (Peace Lee)
Guider is an integrated runtime performance analyzer for Linux that collects system resource and task data in real-time. It traces numerous system operations and visualizes complex performance data. Guider provides highly readable reports and debugging features to help optimize performance. It is open source, system-wide, easy to use, accurate, and light on system resources. Guider can monitor and collect system stats, trace threads, functions, and system calls, and control tasks for testing. Future work includes real-time user-level function tracing and a GUI client for remote control and visualization.
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit... (Ilkay Altintas, Ph.D.)
Scientific workflows are used by many scientific communities to capture, automate, and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, and application-specific purpose and programmability, leading to provenance-aware archival and publication of results. This talk summarizes varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures, and presents a methodology for workflow-driven science based on these maturing requirements.
What is reproducible research? Why should I use it? What tools should I use? This session will show you how to use scripts, version control, and markdown to do better research.
PuppetConf 2017: Puppet Development Kit: A Seamless Workflow for Module Devel... (Puppet)
The Puppet Development Kit (PDK) provides tools to help developers easily create high-quality Puppet modules. It includes tools for testing, linting, and validating modules. Using the PDK improves the module development experience and helps ensure modules are configured correctly and ready to use. The PDK is designed for both new and experienced Puppet users and makes it easier to get started with best practices like testing and code quality checks.
This document discusses load testing web applications. It explains why load testing is important, such as verifying performance requirements and planning resources. It describes what to test, like designing test cases and selecting appropriate metrics. Finally, it outlines how to perform load testing, including using tools to record and execute tests, preparing test environments, and analyzing results to check requirements and identify bottlenecks.
StarWest 2019 - End to end testing: Stupid or Legit? (mabl)
This document discusses end-to-end testing and how traditional testing tools are not keeping up with modern development practices like continuous integration and delivery (CI/CD). It introduces the concept of using a testing platform called mabl that enables easy creation of reusable modular end-to-end tests. Mabl provides features like root cause analysis, visual testing, performance monitoring and data-driven parameterized testing to help scale testing in a DevTestOps environment. A live demo is shown of creating data-driven tests using mabl.
PTU: Using Provenance for Repeatability
1. Using Provenance for Repeatability
Quan Pham¹, Tanu Malik², Ian Foster¹,²
¹Department of Computer Science, University of Chicago; ²Computation Institute, University of Chicago and Argonne National Laboratory
TaPP 2013
2. Publication Process
• Traditional academic publication process
  • Submit paper & experiments • Review ideas • Learn novel methods
• Emerging academic publication process
  • Submit paper & experiments • Review ideas • Validate software • Are we reading something that is repeatable and reproducible?
3. Repeatability Testing
• Scientific progress relies on novel claims and verifiable results
• Scientific paper reviewers
  • Validate announced results
  • Validate for different data and parameters
  • Validate under different conditions and environments
• Challenge: work under time & budget constraints
Image: from http://catsandtheirmews.blogspot.com/2012/05/update-on-computer-crash.html
4. Repeatability Testing: Challenges & Constraints
• Repeatability requirements
  • Hardware: single machine / clusters
  • Software
    • Operating system: which operating system was used? (Ubuntu/RedHat/Debian/Gentoo)
    • Environment: how to capture all environment variables?
    • Tools & libraries installation: how to precisely know all the dependencies?
• Knowledge constraints
  • Experiment setup: how to set up the experiment?
  • Experiment usage: how is the experiment run?
• Resource constraints
  • Requires massive processing power
  • Operates on large amounts of data
  • Performs significant network communication
  • Is long-running
5. An Approach to Repeatability Testing
Challenges & Constraints → Possible Solution
• Repeatability requirements (hardware requirement → provide a virtual machine; software requirement → provide portable software)
• Knowledge constraints (experiment setup, experiment usage) → provide a reference execution
• Resource constraints → provide selective replay
6. PTU – Provenance-To-Use
• PTU
  • Minimizes computation time during repeatability testing
  • Guarantees that events are processed in the same order using the same data
• Authors build a package that includes:
  • Software program
  • Input data
  • Provenance trace
• Testers may select a subset of the package's processes for a partial deterministic replay
7. PTU Functionalities
• ptu-audit tool
  • Build a package of authors' source code, data, and environment variables
  • Record process- and file-level details about a reference execution
    % ptu-audit java TextAnalyzer news.txt
• PTU package
  • Display the provenance graph and accompanying run-time details
• ptu-exec tool
  • Re-execute specified parts of the provenance graph
    % ptu-exec java TextAnalyzer news.txt
8. ptu-audit
• Uses ptrace to monitor system calls
  • execve, sys_fork
  • read, write, sys_io
  • bind, connect, socket
• Collects provenance
• Collects runtime information
• Makes package
9. ptu-audit
• Use ptrace to monitor system calls
  • execve, sys_fork
  • read, write, sys_io
  • bind, connect, socket
• Collect provenance
• Collect runtime info
• Make package
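To make the auditing mechanism on this slide concrete, here is a minimal, self-contained sketch of ptrace-based system-call monitoring on x86-64 Linux. It is an illustration of the technique, not the actual ptu-audit source: it prints only a handful of the roughly 50 calls PTU watches, and it omits the metadata collection and file copying described in the notes.

/* audit_sketch.c - minimal syscall auditing with ptrace (x86-64 Linux).
 * Illustrative only; not PTU's implementation. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <sys/syscall.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
        execvp(argv[1], &argv[1]);               /* run the audited program */
        perror("execvp");
        _exit(127);
    }
    int status;
    waitpid(child, &status, 0);                  /* child stops at its initial exec */
    int entering = 1;
    while (1) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* run to the next syscall stop */
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        if (entering) {
            long n = (long)regs.orig_rax;        /* syscall number at entry */
            if (n == SYS_execve || n == SYS_read || n == SYS_write ||
                n == SYS_openat || n == SYS_bind || n == SYS_connect)
                printf("pid %d: syscall %ld\n", (int)child, n);
        }
        entering = !entering;                    /* stops alternate entry/exit */
    }
    return 0;
}

Compiled with gcc and run as, for example, ./audit_sketch java TextAnalyzer news.txt, it prints the raw call stream that a tool like ptu-audit would turn into process- and file-level provenance records.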
10. PTU Package
• [Figure 2. The PTU package. The tester chooses to run the sub-graph rooted at /bin/calculate]
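A small sketch of what choosing the sub-graph rooted at /bin/calculate amounts to: a graph walk from the tester-selected node that marks everything downstream for re-execution, while upstream artifacts are served directly from the package. The node names and edges below are invented for illustration and are not PTU's internal representation.

/* replay_select_sketch.c - pick the portion of a provenance graph to replay. */
#include <stdio.h>

#define MAX_NODES 16

static int edge[MAX_NODES][MAX_NODES];   /* edge[i][j]=1: node i produced or spawned node j */
static int selected[MAX_NODES];
static const char *name[MAX_NODES];

/* Depth-first walk: everything reachable from the chosen root must be re-run;
 * everything upstream is read as-is from the package. */
static void select_subgraph(int root, int n) {
    if (selected[root]) return;
    selected[root] = 1;
    for (int j = 0; j < n; j++)
        if (edge[root][j])
            select_subgraph(j, n);
}

int main(void) {
    /* Hypothetical fragment: fetch -> raw data -> reclassify -> calculate -> result */
    name[0] = "wget"; name[1] = "raw.dat"; name[2] = "/bin/reclassify";
    name[3] = "reclassified.dat"; name[4] = "/bin/calculate"; name[5] = "result.out";
    edge[0][1] = edge[1][2] = edge[2][3] = edge[3][4] = edge[4][5] = 1;

    int n = 6;
    select_subgraph(4, n);   /* the tester picks the node for /bin/calculate */

    printf("nodes included in the partial replay:\n");
    for (int i = 0; i < n; i++)
        if (selected[i])
            printf("  %s\n", name[i]);
    return 0;
}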
11. ptu-exec
• [Figure 3. ptu-exec re-runs part of the application from /bin/calculate. It uses CDE to re-route file dependencies]
12. Current PTU Components
• Uses the CDE (Code-Data-Environment) tool to create a package
  • Details CDE
• Uses ptrace to create a provenance graph representing a reference run-time execution
• Uses SQLite to store the provenance graph
• Uses graphviz for graph presentation
• Enhances CDE to run the package
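As a rough illustration of the SQLite and graphviz bullets above, the sketch below stores a tiny provenance graph in an in-memory SQLite database and prints it as Graphviz DOT, which dot -Tpdf can render. The two-table schema and the example rows are assumptions made for this sketch, not PTU's actual schema; build with gcc and link against -lsqlite3.

/* prov_store_sketch.c - store a tiny provenance graph in SQLite, emit DOT. */
#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

    /* Invented schema: one table of process/file nodes, one table of edges. */
    const char *ddl =
        "CREATE TABLE node(id INTEGER PRIMARY KEY, label TEXT, kind TEXT);"
        "CREATE TABLE edge(src INTEGER, dst INTEGER, op TEXT);"
        "INSERT INTO node VALUES (1,'java TextAnalyzer','process');"
        "INSERT INTO node VALUES (2,'news.txt','file');"
        "INSERT INTO node VALUES (3,'entities.out','file');"
        "INSERT INTO edge VALUES (2,1,'read');"
        "INSERT INTO edge VALUES (1,3,'wrote');";
    char *err = NULL;
    if (sqlite3_exec(db, ddl, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "sqlite error: %s\n", err);
        sqlite3_free(err);
        return 1;
    }

    /* Walk the edges and print a DOT digraph for graphviz. */
    sqlite3_stmt *st;
    sqlite3_prepare_v2(db,
        "SELECT a.label, b.label, e.op FROM edge e "
        "JOIN node a ON a.id = e.src JOIN node b ON b.id = e.dst;",
        -1, &st, NULL);
    printf("digraph provenance {\n");
    while (sqlite3_step(st) == SQLITE_ROW)
        printf("  \"%s\" -> \"%s\" [label=\"%s\"];\n",
               (const char *)sqlite3_column_text(st, 0),
               (const char *)sqlite3_column_text(st, 1),
               (const char *)sqlite3_column_text(st, 2));
    printf("}\n");

    sqlite3_finalize(st);
    sqlite3_close(db);
    return 0;
}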
13. PEEL0
• Best, N., et al., Synthesis of a Complete Land Use/Land Cover Dataset for the Conterminous United States. RDCEP Working Paper, 2012. 12(08).
• Three-step workflow: (1) Wget, Bash script; (2) R, Raster, Rgdal, Reclassify; (3) R, Geo algorithm
14. PEEL0
• [Figure 4: Time reduction in testing PEEL0 using PTU]
• Or use the actual execution graph??
15. TextAnalyzer
• Murphy, J., et al., Textual Hydraulics: Mining Online Newspapers to Detect Physical, Social, and Institutional Water Management Infrastructure, 2013, Technical Report, Argonne National Lab.
• Runs a named-entity recognition analysis program using several data dictionaries
• Splits the input file into multiple input files on which it runs a parallel analysis
17. Conclusion
• PTU is a step toward conducting repeatability tests on software programs submitted to conference proceedings and journals
• Easy and attractive for authors
• Fine-grained control and an efficient workflow for testers
18. Future Work
• Other workflow types
  • Distributed workflows
• Improve performance
  • Decide how to store provenance compactly in a package
• Presentation
  • Improve the graphical user interface and presentation
19. Acknowledgements
• Neil Best
• Jonathan Ozik
• Center for Robust Decision Making on Climate and Energy Policy (NSF grant number 0951576)
• Contractors of the US Government under contract number DE-AC02-06CH11357
Editor's Notes
Hi everyone, my name is QP. In this presentation, I'd like to introduce a system that uses provenance for repeatability. The work was done with TMalik @ CI, UoC and IFoster @ ANL. ----- Meeting Notes (3/28/13 14:37) ----- No abbreviations; let the slides talk; n & v should be there.
What is the problem with repeatability in the scientific community? Process of publication: + I submit + reviewers find interesting claims + want to verify by re-running, with different data/params, different conditions/environments. So many things to validate, so little time and budget (hardware). === Problems: are you facing this? 1 slide. Why? 1 slide. Challenges: 1 slide. 10-15 slides total. This is what I present: steps & steps, high-level overview. ----- Meeting Notes (3/28/13 14:37) ----- 1 slide: what is a "new" publication process (authors -> (tester) -> readers); the concepts of author, tester, and repeatability.
+ Uses ptrace to monitor ~50 system calls, including process system calls, such as execve() and sys_fork(), for collecting process provenance; file system calls, such as read(), write(), and sys_io(), for collecting file provenance; and network calls, such as bind(), connect(), socket(), and send(), for auditing network activity. + Obtains process name, owner, group, parent, host, creation time, command line, and environment variables; and file name, path, host, size, and modification time. + Obtains memory footprint, CPU consumption, and I/O activity data for each process from /proc/$pid/stat. + Copies the accessed file into a package directory that consists of all sub-directories and symbolic links to the original file's location.
Reason: why choose /bin/cal (memory intensive, run-time). + When the entire reference run finishes, PTU builds a reference execution file consisting of the topological sort of the provenance graph. The nodes of the graph enumerate run-time details, such as process memory consumption and file sizes. The tester, as described next, can utilize the database and the graph for efficient re-execution. + Testers can examine the provenance graph contained in a package to study the accompanying reference execution, and request a re-execution, either by specifying nodes in the provenance graph or by modifying a run configuration file that is included in the package.
+ Obtains run configuration and environment variables for each process from the SQLite database. + Monitors it via ptrace and re-executes the CDE functionality of replacing path argument(s) to refer to the corresponding path within the package cde-package/cde-root/. + Provides fast audit and re-execution independent of the application. The profiling of processes enables testers to choose the processes to run.
PEEL0: three-step workflow process, implemented as an R program. Classification is memory intensive. [Pic: PEEL0 workflow: get -> reclassify -> calculate -> final result; testers look at (calculate) with question marks]
Five process nodes, 10,000 exclusive file reads (based on the number of files in the dataset), and 422 exclusive file writes for the aggregated dataset. Slowdown when using PTU: ~35% for PEEL0.
+ Eight process nodes that in aggregate conduct 616 exclusive file reads and 124 exclusive file writes, plus 50 file nodes that are read and written again. + Slowdown when using PTU: ~15% for TextAnalyzer. + TextAnalyzer shows a particularly large improvement (>98%) since the entire process is re-run on a much smaller file.
PTU is a step toward conducting repeatability tests on software programs that are submitted to conference proceedings and journals. Peer reviewers often must review these programs in a short period of time. By providing one utility for packaging a software program and its reference execution without modifying the application, we have made PTU easy and attractive for authors to use, and a fine-grained, efficient way for testers to work.