Brief description of the OPMW ontology for publishing workflow results.
The two OPM slides are from the OPM tutorial: http://openprovenance.org/tutorial/
(Edit 01-10-2013): OPMW has been evolving constantly since it was released in 2011, so some details of this presentation may be out of date. For the latest version, please see: http://www.opmw.org/model/OPMW/
This document summarizes a talk given at EuroPython 2012 on contributing to a Python C extension project using GitHub and the Nose testing framework. The talk discusses how Roberto Polli extended the PySmbC library to support extended permissions (ACLs) by wrapping two additional SMB functions into the library and contributing the changes using GitHub. It emphasizes that writing tests with Nose before making code changes makes the contribution process easier. The document provides examples of how functions were wrapped in C and mapped to Python, along with how exceptions were defined and handled.
Generators take a high-level software specification and produce an implementation. GenVoca is an approach to building generators that composes reusable component layers. It models software as realms of components with vertical and horizontal parameters. Components are implemented as C++ templates containing member classes. Composition validation ensures semantics are correct. Aspect-oriented programming (AOP) and GenVoca both aim to improve code reuse but differ in focus, concepts, and implementation mechanisms like aspect languages versus type expressions. Generators automate implementation through transformations while GenVoca provides a systematic approach through composable and customizable components.
This document provides an overview of basic concepts in operating systems, including subroutine linkage, thread linkage, input/output, and dynamic storage allocation. It discusses subroutine linkage on Intel x86 and SPARC architectures, including how stack frames are used. It also covers the implementation of threads using control blocks and stacks for each thread context. Finally, it discusses input/output architectures and system calls.
Detecting Occurrences of Refactoring with Heuristic Search (Shinpei Hayashi)
This document describes a technique for detecting refactorings between two versions of a program using heuristic search. Refactorings are detected by generating intermediate program states through applying refactorings, and finding a path from the original to modified program that minimizes differences. Structural differences are used to identify likely refactorings. Candidate refactorings are evaluated and applied to generate new states, with the search terminating when the state matches the modified program. A supporting tool was developed and a case study found the technique could correctly detect an actual series of refactorings between program versions.
OPM is a model for representing the provenance of processes on digital artifacts. It defines a graph-based data model consisting of nodes for artifacts, processes, and agents connected by edges describing their relationships. The graph can contain multiple overlapping or hierarchical "accounts" from different observers. OPM aims to define the model in a precise yet technology-agnostic way and does not specify internal system representations.
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles (dgarijo)
This document describes FOOPS, an ontology validation service that checks ontologies for adherence to the FAIR principles. FOOPS tests ontologies against criteria related to findability, accessibility, interoperability, and reusability. It provides explanations for test failures to help users improve their ontologies. FOOPS validation results include an overall FAIRness score and coverage of FAIR categories to assess ontology quality, though there is no single threshold for what makes an ontology fully FAIR. The document demonstrates FOOPS and lists the types of tests it supports under each FAIR category. It invites feedback to help further improve FOOPS.
FAIR Workflows: A step closer to the Scientific Paper of the Future (dgarijo)
Keynote presented at the Computational and Autonomous Workflows workshop (CAW-2021) at the Oak Ridge National Laboratory. The keynote gives an overview of the different aspects to take into account when aiming to create FAIR workflows and associated resources.
An increasing number of researchers rely on computational methods to generate the results described in their publications. Research software created to this end is heterogeneous (e.g., scripts, libraries, packages, notebooks, etc.) and usually difficult to find, reuse, compare and understand due to its disconnected documentation (dispersed in manuals, readme files, web sites, and code comments) and a lack of structured metadata to describe it. In this talk I will describe the main challenges for finding, comparing and reusing research software; how structured metadata can help to address some of them; the best practices being proposed by the community; and current initiatives to aid their adoption by researchers within EOSC.
Impact: The talk addresses an important aspect of the EOSC infrastructure for quality research software by ensuring that software contributed to the EOSC ecosystem can be found, compared and reused by researchers. The talk also aims to address metadata quality of current research products, which is critical for successful adoption.
Presented at the EOSC symposium
SOMEF: a metadata extraction framework from software documentation (dgarijo)
Presentation given at the Council of Software Registries in March 2021. SOMEF is a Python package for automatically extracting over 25 metadata categories from a readme file. The output is then exported in JSON, or in JSON-LD using the CodeMeta representation.
A Template-Based Approach for Annotating Long-Tailed Datasets (dgarijo)
An increasing amount of data is shared on the Web through heterogeneous spreadsheets and CSV files. In order to homogenize and query these data, the scientific community has developed Extract, Transform and Load (ETL) tools and services that help make these files machine readable in Knowledge Graphs (KGs). However, tabular data may be complex, and the level of expertise required by existing ETL tools makes it difficult for users to describe their own data. In this paper we propose a simple annotation schema to guide users when transforming complex tables into KGs. We have implemented our approach by extending T2WML, a table annotation tool designed to help users annotate their data and upload the results to a public KG. We have evaluated our effort with six non-expert users, obtaining promising preliminary results.
OBA: An Ontology-Based Framework for Creating REST APIs for Knowledge Graphs (dgarijo)
In this presentation we describe the Ontology-Based APIs framework (OBA), our approach to automatically create REST APIs from ontologies while following RESTful API best practices. Given an ontology (or ontology network), OBA uses standard technologies familiar to web developers (OpenAPI Specification, JSON) and combines them with W3C standards (OWL, JSON-LD frames and SPARQL) to create maintainable APIs with documentation, unit tests, automated validation of resources, and clients (in Python, JavaScript, etc.) that let non-Semantic-Web experts access the contents of a target knowledge graph. We showcase OBA with three examples that illustrate the capabilities of the framework for different ontologies.
Towards Knowledge Graphs of Reusable Research Software Metadata (dgarijo)
Research software is a key asset for understanding, reusing and reproducing results in computational sciences. An increasing amount of software is stored in code repositories, which usually contain human readable instructions indicating how to use it and set it up. However, developers and researchers often need to spend a significant amount of time to understand how to invoke a software component, prepare data in the required format, and use it in combination with other software. In addition, this time investment makes it challenging to discover and compare software with similar functionality. In this talk I will describe our efforts to address these issues by creating and using Open Knowledge Graphs that describe research software in a machine readable manner. Our work includes: 1) an ontology that extends schema.org and codemeta, designed to describe software and the specific data formats it uses; 2) an approach to publish software metadata as an open knowledge graph, linked to other Web of Data objects; 3) a framework for automatically extracting metadata from software repositories; and 4) a framework to curate, query, explore and compare research software metadata in a collaborative manner. The talk will illustrate our approach with real-world examples, including a domain application for inspecting and discovering hydrology, agriculture, and economic software models; and the results of our framework when enriching the research software entries in Zenodo.org.
Scientific Software Registry Collaboration Workshop: From Software Metadata r... (dgarijo)
In this talk I briefly describe our work on OntoSoft for easy software metadata representation, and how new requirements for software reusability are moving us towards knowledge graphs of scientific software metadata.
WDPlus: Leveraging Wikidata to Link and Extend Tabular Data (dgarijo)
Today, data about any domain can be found on the web in data repositories, web APIs and many millions of spreadsheets and CSV files. Researchers and organizations make these data available in a myriad of formats, layouts, terminologies and levels of cleanliness that make them difficult to integrate. As a result, researchers aiming to use data in their analyses face three main challenges. The first one is finding datasets related to a feature, variable or topic of interest. For example, climate scientists need to look for years of observational data from authoritative sources when estimating the climate of a region. The second challenge is completing a given dataset with existing knowledge: machine learning applications are data hungry and require as many data points and features as possible to improve their predictions, which often requires integrating data from different sources. The third challenge is sharing integrated results: once several datasets have been merged together, how can they be made available to the rest of the community?
OKG-Soft: An Open Knowledge Graph With Machine Readable Scientific Software M... (dgarijo)
Scientific software is crucial for understanding, reusing and reproducing results in computational sciences. Software is often stored in code repositories, which may contain human readable instructions necessary to use it and set it up. However, a significant amount of time is usually required to understand how to invoke a software component, prepare data in the format it requires, and use it in combination with other software. In this presentation we introduce OKG-Soft, an open knowledge graph that describes scientific software in a machine readable manner. OKG-Soft includes: 1) an ontology designed to describe software and the specific data formats it uses; 2) an approach to publish software metadata as an open knowledge graph, linked to other Web of Data objects; and 3) a framework to annotate, query, explore and curate scientific software metadata.
Towards Human-Guided Machine Learning - IUI 2019 (dgarijo)
Automated Machine Learning (AutoML) systems are emerging that automatically search for possible solutions from a large space of possible kinds of models. Although fully automated machine learning is appropriate for many applications, users often have knowledge that supplements and constrains the available data and solutions. This paper proposes human-guided machine learning (HGML) as a hybrid approach where a user interacts with an AutoML system and tasks it to explore different problem settings that reflect the user’s knowledge about the data available. We present: 1) a task analysis of HGML that shows the tasks that a user would want to carry out, 2) a characterization of two scientific publications, one in neuroscience and one in political science, in terms of how the authors would search for solutions using an AutoML system, 3) requirements for HGML based on those characterizations, and 4) an assessment of existing AutoML systems in terms of those requirements.
Capturing Context in Scientific Experiments: Towards Computer-Driven Science (dgarijo)
Scientists publish computational experiments in ways that do not facilitate reproducibility or reuse. Significant domain expertise, time and effort are required to understand scientific experiments and their research outputs. In order to improve this situation, mechanisms are needed to capture the exact details and the context of computational experiments. Only then will intelligent systems be able to help researchers understand, discover, link and reuse the products of existing research.
In this presentation I will introduce my work and vision towards enabling scientists to share, link, curate and reuse their computational experiments and results. In the first part of the talk, I will present my work on capturing and sharing the context of scientific experiments by using scientific workflows and machine readable representations. Thanks to this approach, experiment results are described in an unambiguous manner, have a clear trace of their creation process, and include a pointer to the sources used for their generation. In the second part of the talk, I will describe examples of how the context of scientific experiments may be exploited to browse, explore and inspect research results. I will end the talk by presenting new ideas for improving and benefiting from the capture of context of scientific experiments, and how to involve scientists in the process of curating and creating abstractions on available research metadata.
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met... (dgarijo)
Traditional approaches to ontology development involve a long lapse between the time a user finds a need to extend the ontology and the time it actually gets extended. For scientists, this delay can be weeks or months and can be a significant barrier for adoption. We present a new approach to ontology development and data annotation enabling users to add new metadata properties on the fly as they describe their datasets, creating terms that can be immediately adopted by others and eventually become standardized. This approach combines a traditional, consensus-based approach to ontology development, and a crowdsourced approach where expert users (the crowd) can dynamically add terms as needed to support their work. We have implemented this approach as a socio-technical system that includes: 1) a crowdsourcing platform to support metadata annotation and addition of new terms, 2) a range of social editorial processes to make standardization decisions for those new terms, and 3) a framework for ontology revision and updates to the metadata created with the previous version of the ontology. We present a prototype implementation for the paleoclimate community, the Linked Earth Framework, currently containing 700 datasets and engaging over 50 active contributors. Users exploit the platform to do science while extending the metadata vocabulary, thereby producing useful and practical metadata.
WIDOCO: A Wizard for Documenting Ontologies (dgarijo)
WIDOCO is a WIzard for DOCumenting Ontologies that guides users through the documentation process of their vocabularies. Given an RDF vocabulary, WIDOCO detects missing vocabulary metadata and creates documentation with diagrams, human readable descriptions of the ontology terms, and a summary of changes with respect to previous versions of the ontology. The documentation consists of a set of linked, enriched HTML pages that can be further extended by end users. WIDOCO is open source and builds on well-established Semantic Web tools. So far, it has been used to document more than one hundred ontologies in different domains.
We propose a new area of research on automating data narratives. Data narratives are containers of information about computationally generated research findings. They have three major components: 1) a record of events that describes a new result through a workflow and/or the provenance of all the computations executed; 2) persistent entries for the key entities involved, such as data, software versions, and workflows; 3) a set of narrative accounts that are automatically generated, human-consumable renderings of the record and entities and can be included in a paper. Different narrative accounts can be used for different audiences with different content and details, based on the level of interest or expertise of the reader. Data narratives can make science more transparent and reproducible, because they ensure that the text description of the computational experiment reflects with high fidelity what was actually done. Data narratives can be incorporated in papers, either in the methods section or as supplementary materials. We introduce DANA, a prototype that illustrates how to generate data narratives automatically, and describe the information it uses from the computational records. We also present a formative evaluation of our approach and discuss potential uses of automated data narratives.
Automated Hypothesis Testing with Large Scale Scientific Workflows (dgarijo)
(Credit to Varun Ratnakar and Yolanda Gil).
The automation of important aspects of scientific data analysis would significantly accelerate the pace of science and innovation. Although important aspects of data analysis can be automated, the hypothesize-test-evaluate discovery cycle is largely carried out by hand by researchers. This introduces a significant human bottleneck, which is inefficient and can lead to erroneous and incomplete explorations. We introduce a novel approach to automate the hypothesize-test-evaluate discovery cycle with an intelligent system that a scientist can task to test hypotheses of interest in a data repository. Our approach captures three types of data analytics knowledge: 1) common data analytic methods represented as semantic workflows; 2) meta-analysis methods that aggregate those results, represented as meta-workflows; and 3) data analysis strategies that specify for a type of hypothesis what data and methods to use, represented as lines of inquiry. Given a hypothesis specified by a scientist, appropriate lines of inquiry are triggered, which lead to retrieving relevant datasets, running relevant workflows on that data, and finally running meta-workflows on workflow results. The scientist is then presented with a level of confidence on the initial hypothesis (or a revised hypothesis) based on the data and methods applied. We have implemented this approach in the DISK system, and applied it to multi-omics data analysis.
OntoSoft: A Distributed Semantic Registry for Scientific Software (dgarijo)
Credit to Yolanda Gil.
OntoSoft is a distributed semantic registry for scientific software. This paper describes three major novel contributions of OntoSoft: 1) a software metadata registry designed for scientists, 2) a distributed approach to software registries that targets communities of interest, and 3) metadata crowdsourcing through access control. Software metadata is organized using the OntoSoft ontology along six dimensions that matter to scientists: identify software, understand and assess software, execute software, get support for the software, do research with the software, and update the software. OntoSoft is a distributed registry where each site is owned and maintained by a community of interest, with a distributed semantic query capability that allows users to search across all sites. The registry has metadata crowdsourcing capabilities, supported through access control so that software authors can allow others to expand on specific metadata properties.
OEG tools for supporting Ontology Engineering (dgarijo)
The document summarizes several tools developed by the Ontology Engineering Group (OEG) to support ontology engineering, including Vocabularium for serving ontologies online, OnToology for evaluation reports, documentation and publishing ontologies, AR2DTool for ontology diagrams, Widoco for HTML documentation, and OOPS! for ontology quality evaluations. It provides an overview of the capabilities of each tool and URLs for their websites and GitHub repositories.
Software Metadata: Describing "dark software" in GeoSciences (dgarijo)
This document discusses describing "dark software" or unshared scientific software in geosciences. It proposes using the OntoSoft ontology to capture standardized metadata about scientific software. This would allow software to be more discoverable, reusable and reproducible. The document outlines the types of metadata captured by OntoSoft and demonstrates how it can be used to describe software and facilitate search and comparison of different tools.
Reproducibility Using Semantics: An Overview (dgarijo)
Overview of the different approaches for addressing reproducibility (using semantics) in laboratory protocols, workflow description and publication, and workflow infrastructure. Furthermore, Research Objects are introduced as a means to capture the context and annotations of scientific experiments, together with the privacy and IPR concerns that may arise. This presentation was given at Dagstuhl Seminar 16041: http://www.dagstuhl.de/16041
1. OPMW
Daniel Garijo
Ontology Engineering Group, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid
Yolanda Gil
Information Sciences Institute, University of Southern California, Marina del Rey
Date: 14/11/2011
2. Index of contents
Overview:
1. What are we exporting with OPMW?
• Publish abstract workflow in addition to executed workflow
2. OPM Overview
3. OPMW: Extending OPM to represent abstract workflows
• Representing the process
• Representing attribution
3. Abstract workflow and concrete workflow
We export the abstract workflow in addition to the executed workflow. The abstract workflow has conceptual steps and is independent of execution codes.
4. Executed workflow and execution-ready workflow
We export the abstract workflow in addition to the executed workflow.
[Figure: executed workflow diagram, with execution nodes such as SigR110293, FList100283, ChList1288, cOutPut09, NonSigResults1 and SigResults1.]
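To make the template/execution split concrete, the sketch below shows, in Python with rdflib, how an executed step can point back to its abstract template step. This is a minimal illustration under assumptions, not OPMW's official example: the resource names (SortTemplate, SortStep, SortRun42) are hypothetical, and the OPMW class and property names (WorkflowTemplate, WorkflowTemplateProcess, isStepOfTemplate, WorkflowExecutionProcess, correspondsToTemplateProcess) are quoted from memory and should be verified against the latest spec at http://www.opmw.org/model/OPMW/.

```python
# Hypothetical sketch: linking an executed workflow step to its abstract
# template step with OPMW terms (IRIs quoted from memory; verify against
# http://www.opmw.org/model/OPMW/).
from rdflib import Graph, Namespace, RDF

OPMW = Namespace("http://www.opmw.org/ontology/")
EX = Namespace("http://example.org/workflows/")  # hypothetical namespace

g = Graph()
g.bind("opmw", OPMW)
g.bind("ex", EX)

# Abstract side: a template with one conceptual step, independent of codes.
g.add((EX.SortTemplate, RDF.type, OPMW.WorkflowTemplate))
g.add((EX.SortStep, RDF.type, OPMW.WorkflowTemplateProcess))
g.add((EX.SortStep, OPMW.isStepOfTemplate, EX.SortTemplate))

# Executed side: one concrete run of that step, tied back to the template.
g.add((EX.SortRun42, RDF.type, OPMW.WorkflowExecutionProcess))
g.add((EX.SortRun42, OPMW.correspondsToTemplateProcess, EX.SortStep))

print(g.serialize(format="turtle"))
```

Publishing both levels lets consumers query the conceptual method (the template) independently of any particular run, while still tracing each result to the exact execution that produced it.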
5. OPM Overview
Nodes
• Artifact (A): immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system.
• Process (P): action or series of actions performed on or caused by artifacts, and resulting in new artifacts.
• Agent (Ag): contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, affecting its execution.
6. OPM Overview
Edges
used(R)
A P
wasTriggeredBy
P1 P2
wasGeneratedBy(R)
P A
wasDerivedFrom
A1 A2
wasControlledBy(R)
Ag P
Edge labels are in the past tense to express that they describe past executions.
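As a minimal sketch of these nodes and edges in RDF, the following Python (rdflib) snippet builds a small provenance graph using the OPM Vocabulary (OPMV, http://purl.org/net/opmv/ns#), one RDF rendering of OPM; the example resources (rawData, analysis, plotting, results, alice) are hypothetical.

```python
# Hypothetical sketch of an OPM graph using the OPM Vocabulary (OPMV).
from rdflib import Graph, Namespace, RDF

OPMV = Namespace("http://purl.org/net/opmv/ns#")
EX = Namespace("http://example.org/provenance/")  # hypothetical namespace

g = Graph()
g.bind("opmv", OPMV)
g.bind("ex", EX)

# Nodes: two artifacts, two processes, one agent.
for node, cls in [(EX.rawData, OPMV.Artifact), (EX.results, OPMV.Artifact),
                  (EX.analysis, OPMV.Process), (EX.plotting, OPMV.Process),
                  (EX.alice, OPMV.Agent)]:
    g.add((node, RDF.type, cls))

# Edges point from effect to cause; labels are in the past tense.
g.add((EX.analysis, OPMV.used, EX.rawData))             # P -> A
g.add((EX.plotting, OPMV.wasTriggeredBy, EX.analysis))  # P2 -> P1
g.add((EX.results, OPMV.wasGeneratedBy, EX.analysis))   # A -> P
g.add((EX.results, OPMV.wasDerivedFrom, EX.rawData))    # A2 -> A1
g.add((EX.analysis, OPMV.wasControlledBy, EX.alice))    # P -> Ag

print(g.serialize(format="turtle"))
```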
9. OPMW
Daniel Garijo
Ontology Engineering Group, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid
Yolanda Gil
Information Sciences Institute, University of Southern California, Marina del Rey
Date: 14/11/2011