This document discusses auditing and maintaining provenance in software packages. It presents CDE-SP, an enhancement to the CDE system that captures additional details about software dependencies so that authorship can be attributed as software packages are combined and merged into pipelines. CDE-SP uses a lightweight LevelDB store to encode process and file provenance within software packages, and it provides queries to retrieve dependency information and to validate authorship by matching provenance graphs. Experiments show that CDE-SP introduces negligible overhead compared to the original CDE system.
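To make the storage idea concrete, here is a minimal sketch of how process-to-file provenance edges could be encoded as key-value pairs in LevelDB (via the plyvel Python binding). The key layout and the helper names (record_file_read, files_read_by) are hypothetical illustrations for this summary, not the schema CDE-SP actually uses.

```python
# Hypothetical sketch of CDE-SP-style provenance storage in LevelDB.
# The key layout and metadata fields are illustrative assumptions.
import json
import plyvel  # Python binding for LevelDB

db = plyvel.DB('provenance.db', create_if_missing=True)

def record_file_read(pid, path, package=None):
    # One key per (process, file) edge; the value carries edge metadata,
    # including which shared package the file came from (for authorship).
    key = f'read:{pid}:{path}'.encode()
    db.put(key, json.dumps({'pid': pid, 'path': path, 'package': package}).encode())

def files_read_by(pid):
    # Prefix scan over all "read" edges recorded for one process.
    prefix = f'read:{pid}:'.encode()
    return [json.loads(value) for _, value in db.iterator(prefix=prefix)]

record_file_read(4242, '/usr/lib/libgsl.so', package='package-A')
print(files_read_by(4242))
```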
IPAW14 presentation Quan, Tanu, Ian
1. Auditing and Maintaining Provenance in Software Packages
Quan Pham (1), Tanu Malik (2), Ian Foster (1,2)
Department of Computer Science (1) and Computation Institute (2),
The University of Chicago,
Chicago, IL 60637, USA
quanpt@cs.uchicago.edu, tanum@ci.uchicago.edu
Presented by Boris Glavic
Illinois Institute of Technology
IPAW14
June 10th, 2014
2. Outline
1 Introduction
2 Software Pipeline Usecase
3 CDE-SP: Software Provenance in CDE
4 Experiment and Evaluation
5 Related Work
6 Conclusion
3. Current Solutions for Ensuring Reproducibility and Issues
1 Publish source code and data
− GitHub, Figshare, Research Compendia
Pros: (in many cases) easy to accomplish
× Cons: need to recompile and re-execute
2 Publish software package including source code, data, and
environment dependencies
− CDE, RunMyCode.org
Pros: re-execute without installation
× Cons: not easy to combine and merge shared packages
3 Publish a virtual machine image (VMI) that includes OS, source code,
data, and environment
− Cloud BioLinux (NEBC), Swift Appliance (RDCEP)
Pros: no additional modules or components needed to rerun
× Cons: too hard to provision and understand
4. Reproducibility Problem
Our philosophy:
”... releasing shoddy VMs is easy to do, but it doesn’t help you learn how
to do a better job of reproducibility along the way. Releasing software
pipelines, however crappy, is on the path towards better reproducibility.”
C. Titus Brown [1]
Reproducibility problem: How can we make it easy to combine and
merge shared packages, while correctly attributing authorship of software
packages?
No need to provision VMIs or to simply publish source code and data.
[1] http://ivory.idyll.org/blog/vms-considered-harmful.html
5. Problem Scope
Use CDE [2] to capture and create portable software packages
Extend, partially re-use, and combine CDE packages to create new
reproducible software pipelines
Attribute authorship of software packages in new software pipelines
CDE has an OVERLAP conflict!
[2] Guo, P.J., Engler, D.: CDE: using system call interposition to automatically create portable software packages. USENIX Association, Portland, OR (2011)
6. CDE
Create a portable software package
without installation, configuration, or privilege permissions
Audit mode to create a CDE package
20. Software Pipelines Contain CDE packages
A software pipeline consists of many individual software modules
A software module depends on externally-developed libraries
A software module is often packaged together with specific versions of
libraries
21. RDCEP Usecase
Alice, Bob, and Charlie are scientists at the Center for Robust Decision
Making on Climate and Energy Policy (RDCEP)
A develops data integration methods to produce higher-resolution
datasets depicting inferred land use over time.
B develops computational models to do model-based comparative
analysis. B’s software environment consists of A’s software modules
to produce high-resolution datasets.
C uses A's and B's software modules within data-intensive
computing methods to run them in parallel.
The Center wants to predict future yields of staple agricultural
commodities given changes in the climate.
[Pipeline diagram] A's package runs Retrieve data → Aggregation → Generate images; B's package (built from A's) adds Model-based analysis; C's package (merged from B's) wraps the pipeline as Parallel init → Aggregation → Generate images → Model-based analysis → Parallel summary.
22. A’s Experiment & Package
A’s package
cde-root
path to A’s files
a-experiment.sh
retrieve-data
aggregation
generate-image
f1, f2, a-output
path to common libs
libc.so
Re-execute A’s experiment:
cde-exec a-experiment.sh
cat a-experiment.sh
./retrieve-data f1
./aggregation f1 f2
./generate-image f2 a-output
23. B’s Experiment & Package
B’s package
cde-root
path to A’s files
[...]
path to B’s files
b-experiment.sh
analysis
b-output
path to common libs
libc.so
Re-execute B’s experiment:
cde-exec b-experiment.sh
cat b-experiment.sh
cd path to A’s experiment
cde-exec a-experiment.sh
cd path to B’s files
./analysis path to A’s files/a-output b-output
24. C’s Experiment & Package
C’s package
cde-root
path to A’s files
[...]
path to B’s files
[...]
path to C’s files
c-experiment.sh
parallel-init
parallel-summary
c-output
path to common libs
libc.so
Re-execute C’s experiment:
cde-exec c-experiment.sh
cat c-experiment.sh
parallel-init path to A’s files/f4
cd path to A’s files
cde-exec ./aggregation f4 f5
cde-exec ./generate-image f5 f6
cd path to B’s files
cde-exec ./analysis path to A’s files/f6 f7
cd path to C’s files
./parallel-summary path to B’s files/f7 c-output
26. File Overlap of Different Linux Distributions
        RH           SUSE         U12           U13
Amz     5498 / 23k   3184 / 11k   1203 / 5.4k   1819 / 5.5k
RH                   3861 / 12k   1654 / 6.6k   2223 / 6.3k
SUSE                              1245 / 3.9k   2085 / 6.4k
U12                                             8226 / 24k
Table 1: Ratio of differing files having the same path across 5 popular AMIs. The denominator is the number of files having the same path in two distributions, and the numerator is the number of files with the same path but a different md5 checksum. Manual pages in the /usr/share/ directory are omitted.
Amz Amazon Linux AMI
RH Red Hat Enterprise Linux 6.4
SUSE SUSE Linux Enterprise Server 11
U12 Ubuntu Server 12.04.3 LTS
U13 Ubuntu Server 13.10
27. Re-direction in Multiple cde-root Directories
28. CDE-SP
CDE-SP: Enhanced CDE that includes software provenance
Describe tools and methods to audit, store, and query provenance
Provenance queries
Determine the environment under which a dependency was built
Examine the dependencies that must be present
Answer whether the packages in a pipeline can satisfy a new package
Attribute authorship of software packages in a pipeline
Combine and validate authorship from stored provenance
29. CDE-SP Audit
Objectives
Capture additional details of the origins of a library or a binary
Use these details for compiling and creating software pipelines
Methods
Create a dependency tree
Process system calls are monitored
Whenever a process executes a file system call, a dependency of that
process is recorded
Dependency can be a data file or a shared library
Extract information about binaries and required shared libraries
via the file, ldd, strings, and objdump UNIX commands
and uname -a plus the getpwuid(getuid()) function (a sketch of this step follows below)
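The following is a minimal Python sketch of the kind of origin details this audit step gathers by shelling out to standard UNIX tools. It is illustrative only, not CDE-SP's actual implementation (which extends CDE's C code); the describe_binary helper and the example path are assumptions.

```python
# Illustrative sketch only: collect origin details for a binary or shared
# library with standard UNIX tools, roughly mirroring what the audit records.
import os
import pwd
import platform
import subprocess

def describe_binary(path):
    def run(*cmd):
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    return {
        "file": run("file", path).strip(),           # file type and architecture
        "ldd": run("ldd", path),                     # required shared libraries
        "objdump": run("objdump", "-p", path),       # dynamic section (NEEDED, SONAME)
        "machine": str(platform.uname()),            # analogue of `uname -a`
        "agent": pwd.getpwuid(os.getuid()).pw_name,  # analogue of getpwuid(getuid())
    }

print(describe_binary("/bin/ls")["file"])
```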
32. Storage
Store provenance within the package itself
Use LevelDB: a fast and light-weight key-value storage library
Encode in the key the UNIX process identifier along with spawn time
Key                         Value        Explanation
pid.PID1.exec.TIME          PID2         PID1 wasTriggeredBy PID2
pid.PID.[path, pwd, args]   VALUES       Other properties of PID
io.PID.action.IO.TIME       FILE(PATH)   PID wasGeneratedBy / wasUsedBy FILE(PATH)
meta.agent                  USERNAME     User information
meta.machine                OSNAME       Operating system distribution
Table 2: LevelDB key-value pairs that store file and process provenance. Capital-letter words are arguments.
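As a hedged illustration of this key scheme, the sketch below uses the plyvel Python binding for LevelDB. The key encoding, database path, and sample PIDs are assumptions that mirror Table 2 rather than CDE-SP's exact on-disk format.

```python
# Sketch of the Table 2 key scheme using the plyvel LevelDB binding.
# Key encoding, paths, and sample values are illustrative only.
import time
import plyvel

db = plyvel.DB("cde-package/provenance.db", create_if_missing=True)

def record_exec(child_pid, parent_pid, spawn_time):
    # pid.PID1.exec.TIME -> PID2  (PID1 wasTriggeredBy PID2, i.e. PID2 spawned PID1)
    db.put(f"pid.{child_pid}.exec.{spawn_time}".encode(), str(parent_pid).encode())

def record_io(pid, action, path):
    # io.PID.action.IO.TIME -> FILE(PATH)  (wasGeneratedBy / wasUsedBy)
    db.put(f"io.{pid}.action.{action}.{time.time()}".encode(), path.encode())

record_exec(5678, 1234, 1402400000)
record_io(5678, "read", "cde-root/f1")
for key, value in db.iterator(prefix=b"io.5678."):
    print(key.decode(), "->", value.decode())
```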
33. Query
LevelDB provides a minimal API for querying
Simple, light-weight query interface
Input: a program whose dependencies need to be retrieved
Output: a GraphViz file displaying file and process dependencies
Use a depth-first search to build a dependency tree with the
input program as its root (a sketch follows below)
Exclusion option to remove uninteresting dependencies:
/lib/, /usr/lib/, /usr/share/, /etc/
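The sketch below illustrates such a depth-first query over the key layout of Table 2, again assuming the plyvel binding; the real CDE-SP query tool is part of the package and its output details may differ. Process identifiers and the database path are hypothetical.

```python
# Illustrative DFS over the provenance store: build a dependency tree rooted
# at a given process and emit GraphViz "dot" text. Key layout follows Table 2.
from collections import defaultdict
import plyvel

EXCLUDE = ("/lib/", "/usr/lib/", "/usr/share/", "/etc/")

def dependency_dot(db, root_pid):
    # One pass to invert "child wasTriggeredBy parent" into parent -> children.
    children = defaultdict(list)
    for key, parent in db.iterator(prefix=b"pid."):
        parts = key.decode().split(".")
        if len(parts) >= 3 and parts[2] == "exec":
            children[parent.decode()].append(parts[1])

    edges, stack, seen = [], [str(root_pid)], set()
    while stack:  # depth-first traversal from the root process
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        for child in children[pid]:
            edges.append((f"p{pid}", f"p{child}"))
            stack.append(child)
        for _, path in db.iterator(prefix=f"io.{pid}.".encode()):
            path = path.decode()
            if not path.startswith(EXCLUDE):  # exclusion option
                edges.append((f"p{pid}", f'"{path}"'))
    return "digraph deps {\n" + "\n".join(f"  {a} -> {b};" for a, b in edges) + "\n}"

print(dependency_dot(plyvel.DB("cde-package/provenance.db"), 5678))
```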
34. Authorship of Software Modules
Combine authorship of the contributing packages
Validate authorship from the provenance stored in the original
package
Generate the provenance subgraph associated with the relevant part of the new package
Use subgraph isomorphism (NP-hard in general) to validate it against the original
provenance graph (a sketch follows below)
Match process nodes whose binary paths and working directories are the same
Match file nodes whose paths are the same
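A hedged sketch of this matching step is given below, using networkx's VF2 subgraph-isomorphism matcher as a stand-in for the validation described here; the node attribute names (kind, binary, cwd, path) are assumptions, not CDE-SP's actual schema.

```python
# Sketch of authorship validation: check that the provenance subgraph taken
# from the new package matches a subgraph of the original package's provenance.
import networkx as nx
from networkx.algorithms import isomorphism

def node_match(orig, cand):
    if orig.get("kind") != cand.get("kind"):
        return False
    if orig["kind"] == "process":
        # process nodes match on binary path and working directory
        return orig["binary"] == cand["binary"] and orig["cwd"] == cand["cwd"]
    # file nodes match on path
    return orig["path"] == cand["path"]

def authorship_validates(original: nx.DiGraph, new_subgraph: nx.DiGraph) -> bool:
    matcher = isomorphism.DiGraphMatcher(original, new_subgraph, node_match=node_match)
    return matcher.subgraph_is_isomorphic()
```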
35. Experiments
Performance of CDE-SP
Auditing performance overhead
Disk storage increase
Provenance query runtime
Redirection overhead when multiple UUID-based directories are
created
Compare the lightweight virtualization approach of CDE-SP with
Kameleon [3], a heavyweight virtualization approach used for
reproducibility
Experiments were run on an Ubuntu 12.04 LTS workstation with 8 GB of
RAM and an 8-core Intel(R) processor clocked at 1600 MHz.
[3] Emeras, J., Richard, O., Bzeznik, B.: Reconstructing the software environment of an experiment with Kameleon (2011)
36. Performance & Size Overhead
Pipeline with two applications: Aggregation and Generate Image
2.1% slowdown of CDE-SP vs. 0-30% CDE virtualization overhead [4]
LevelDB database size 236kB (0.03% package size increase) contains
approximately 12,000 key-value pairs
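The 2.1% figure follows from the package-creation timings in Table 3: (870.5 - 852.6) / 852.6 ≈ 0.021, i.e. roughly a 2.1% slowdown.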
         Create Package (s)   Execution (s)   Disk Usage        Provenance Query (s)
CDE      852.6 ± 2.4          568.8 ± 2.4     732 MB            -
CDE-SP   870.5 ± 2.5          569.5 ± 1.8     732 MB + 236 kB   0.4 ± 0.03
Table 3: The increase in CDE-SP overhead is negligible in comparison with CDE.
[4] Guo, P.J., Engler, D.: CDE: using system call interposition to automatically create portable software packages. USENIX Association, Portland, OR (2011)
37. Redirection Overhead in CDE-SP
Pipelined output of Aggregation to input of Generate Image
3 output files of Aggregation package were moved to Generate Image
package
2 cross-package execve() system calls
Less than a 1% slowdown of CDE-SP
38. Kameleon
Use the Kameleon engine to make a bare-bones VM appliance
Self-written YAML-formatted recipes
Self-written macrosteps and microsteps
Kameleon can create virtual machine appliances in different formats
for different Linux distributions
Generates bash scripts to create an initial virtual image of a Linux
distribution
Populates the image with more Linux packages
Populates with content of a CDE-SP package
40. Related Work
Research Objects: package scientific workflows with auxiliary
information, including provenance and metadata such as the
authors and the version
CDE and Sumatra can capture an execution environment in a
lightweight fashion
SystemTap, being a kernel-based tracing mechanism, has better
performance compared to ptrace but needs to run at a higher
privilege level
Provenance-to-Use (PTU) and ReproZip include provenance in
self-contained software packages
41. Conclusion
CDE does not encapsulate provenance of associated dependencies in
a software package
The lack of information about the origins of dependencies in a
software package creates issues when constructing software pipelines
from packages
CDE-SP can include software provenance as part of a software
package
CDE-SP can use software package provenance to build software
pipelines
CDE-SP can maintain provenance when used to construct software
pipelines
42. Acknowledgments
Neil Best at The University of Chicago
Joshua Elliott at Columbia University
Justin Wozniak at Argonne National Laboratory
Allison Brizius at the RDCEP Center
NSF grants SES-0951576 and GEO-1343816