This document discusses research objects and scientific workflows. It introduces research objects as a way to aggregate all elements needed to understand a research investigation, including datasets, results, experiments, and provenance. Scientific workflows are presented as tools for automating data-intensive scientific activities, with prospective and retrospective provenance capturing the intended and actual methods. The document outlines an approach to summarizing complex workflows using semantic annotations of workflow motifs and reduction primitives like collapse and eliminate. This distills provenance traces for improved understanding and querying.
Resource Classification as the Basis for a Visualization Pipeline in LOD Scen... by Oscar Peña del Rio
This document proposes a visualization pipeline for Linked Open Data (LOD) scenarios that begins with classifying resource features in a dataset. It extracts features like property usage, completeness ratios, and inferred primitive datatypes. It tested this approach on five datasets with over 10 million triples, achieving over 80% agreement with expert classification. The goal is to provide coherent, understandable visualizations of Semantic Web data for non-technical users by classifying resources first. Future work includes using the extracted features to create Entity Visualization Templates and recommend visual representations.
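As a rough illustration of the kind of feature extraction this pipeline starts from, the sketch below computes per-class property usage and completeness ratios with rdflib. The data and class names are invented, and this is not the authors' implementation.

```python
# Minimal sketch (not the authors' pipeline) of extracting per-class resource
# features from RDF data: property usage counts and a completeness ratio, i.e.
# the fraction of a class's instances that use each property. Data is invented.
from collections import Counter, defaultdict
from rdflib import Graph, RDF

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:alice a ex:Person ; ex:name "Alice" ; ex:homepage <http://example.org/alice> .
ex:bob   a ex:Person ; ex:name "Bob" .
""", format="turtle")

# Group subjects by rdf:type so features are computed per class.
subjects_by_class = defaultdict(set)
for s, _, cls in g.triples((None, RDF.type, None)):
    subjects_by_class[cls].add(s)

for cls, subjects in subjects_by_class.items():
    usage = Counter()
    for s in subjects:
        for p in {p for _, p, _ in g.triples((s, None, None))}:
            usage[p] += 1                      # number of instances using the property
    completeness = {p: n / len(subjects) for p, n in usage.items()}
    print(cls, completeness)                   # e.g. ex:name -> 1.0, ex:homepage -> 0.5
```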
These slides present a novel collaborative document ranking model aimed at solving a complex information retrieval task involving a multi-faceted information need. For this purpose, we consider a group of users, viewed as experts, who collaborate by addressing the different query facets. We propose a two-step algorithm based on a relevance feedback process which first scores documents towards each expert and then allocates documents to the most suitable experts using the Expectation-Maximisation learning method; a toy sketch of this allocation step is given after the links below. The performance improvement is demonstrated through experiments on the TREC interactive benchmark.
This paper received an award at the AIRS 2013 conference.
http://link.springer.com/chapter/10.1007%2F978-3-642-45068-6_10
ftp://ftp.irit.fr/IRIT/SIG/2013_AIRS_STB.pdf
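For readers unfamiliar with the allocation step, here is a heavily simplified, EM-style soft assignment of scored documents to experts. The scoring matrix and the mixture formulation are illustrative assumptions, not the authors' exact model.

```python
# Toy sketch of the two-step idea: (1) score each document toward each expert,
# (2) run an EM-style soft assignment over the scores and allocate each document
# to its most suitable expert. Scores and the mixture form are illustrative.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((6, 3))          # step 1: 6 documents scored against 3 experts

weights = np.full(3, 1 / 3)          # prior probability of each expert
for _ in range(20):
    # E-step: responsibility of expert e for document d
    resp = weights * np.exp(scores)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: update expert weights from the responsibilities
    weights = resp.mean(axis=0)

allocation = resp.argmax(axis=1)     # step 2: allocate each document to an expert
print(allocation)
```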
Benchmarking Domain-specific Expert Search using Workshop Program Committees by Toine Bogers
The document summarizes the creation of three new domain-specific test collections for evaluating expert search systems in the domains of information retrieval, semantic web, and computational linguistics. The collections were created using workshop program committees and publications from relevant conferences and journals to represent experts, documents, and topics. The collections were then benchmarked using state-of-the-art expert search approaches, finding that term extraction methods outperformed language modeling on these domain-centered collections. Future work is discussed to expand the collections and incorporate additional evidence like citations.
This document discusses the process of classroom action research which involves 6 main steps: 1) focusing the inquiry by formulating a research question, 2) reviewing relevant literature and resources related to the question, 3) collecting relevant data, 4) analyzing and interpreting the collected data, 5) reporting the results, and 6) reflecting on the entire process and results to improve teaching and learning. The goal is to help teachers better understand and improve their own practices through systematic study.
This document discusses reproducible research and provides guidance on how to conduct research in a reproducible manner. It covers:
1. The importance of reproducible research due to large datasets, computational analyses, and the potential for human error. Ensuring reproducibility requires new expertise and infrastructure.
2. Key aspects of reproducible research include data management plans, version control, use of file formats and software/tools that allow reproducibility, and publishing data and code to allow others to replicate results.
3. Reproducible research benefits the scientific community by increasing transparency and by allowing researchers to re-analyze their own data in the future. Journals and funders are increasingly requiring reproducibility.
A Sightseeing Tour of Prov and Some of its Extensions by Khalid Belhajjame
This document provides an overview of the PROV provenance model and some of its extensions. It discusses the motivation for provenance, the history and development of the PROV model, its key concepts of entities, activities, and agents. It also describes extensions like ProvONE and PAV that build upon PROV to model workflow and scientific provenance.
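To make the three core concepts concrete, a minimal PROV-O sketch in rdflib follows. The resource names (ex:chart1, ex:plotting, ex:alice) are invented for illustration; the PROV terms are standard PROV-O.

```python
# Minimal PROV-O sketch of the three core concepts: an Entity generated by an
# Activity that was associated with an Agent. Resource names are illustrative.
from rdflib import Graph, Namespace, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

g.add((EX.chart1, RDF.type, PROV.Entity))
g.add((EX.plotting, RDF.type, PROV.Activity))
g.add((EX.alice, RDF.type, PROV.Agent))

g.add((EX.chart1, PROV.wasGeneratedBy, EX.plotting))      # entity <- activity
g.add((EX.plotting, PROV.wasAssociatedWith, EX.alice))    # activity <- agent
g.add((EX.chart1, PROV.wasAttributedTo, EX.alice))        # entity <- agent
g.add((EX.plotting, PROV.used, EX.dataset1))              # retrospective trace

print(g.serialize(format="turtle"))
```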
This is a keynote that I gave at the polyweb workshop on the state of the art of data science reproducibility. In the first part, I review tools that have been developed over the last few years. In the second part, I focus on proposals that I have been involved in to facilitate workflow reproducibility and preservation.
From Scientific Workflows to Research Objects: Publication and Abstraction of... by dgarijo
Overview of my current work done at the Ontology Engineering Group. This presentation is similar to http://www.slideshare.net/dgarijo/from-scientific-workflows-to-research-objects-publication-and-abstraction-of-scientific-experiments, with a couple of extra slides with some details of my future plans.
From Scientific Workflows to Research Objects: Publication and Abstraction of... by dgarijo
Presentation of my PhD work to the UPM group on the 12th of Feb of 2014. Summary of goals, motivation, OPMW, Standards, PROV, p-plan, Workflow Motifs, Workflow fragment detection and Research Objects.
Is preserving data enough? Towards the preservation of scientific methods by dgarijo
In recent years there have been many efforts towards the preservation of data from scientific research. Institutions like the Virtual Observatory and journals like PLOS ONE, the Geoscience Data Journal and Ecological Archives accept datasets that support or were produced in scientific publications. Other efforts like Figshare allow citing data from unpublished research and research in progress, allowing authors to be acknowledged and improving the shareability of their work. At the same time, many of the challenges associated with the preservation and sharing of data have been a topic of discussion in international initiatives like the Research Data Alliance, which through its working and interest groups aims at identifying requirements and proposing reference solutions for tasks such as data citation and the provision of appropriate e-infrastructure for repositories.
However, data per se is often not relevant without proper descriptive metadata, its provenance and the software used for its creation. In fact, scientists are becoming more concerned about the preservation of the software and methods used to deliver a particular scientific result. Reproducibility and inspectability are crucial for enabling the interpretation and reusability of a given dataset. In "in vitro" and "in vivo" sciences, protocols exist to capture the methods necessary to reproduce an experiment. In computational sciences this is achieved with scientific workflows, which capture the method (i.e., steps and data dependencies) used to obtain a specific result. In this short talk we introduce the set of checklists we have developed for the proper conservation of scientific workflows, encapsulated as Research Objects, by adapting existing standards for data preservation.
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat... by Khalid Belhajjame
Scientific Workflows have become the workhorse of BigData analytics for scientists. As well as being repeatable and optimizable pipelines that bring together datasets and analysis tools, workflows make up an important part of the provenance of the data generated from their execution. By faithfully capturing all stages in the analysis, workflows play a critical part in building up the audit-trail (a.k.a. provenance) metadata for derived datasets and contribute to the veracity of results. Provenance is essential for reporting results, reporting the method followed, and adapting to changes in the datasets or tools. These functions, however, are hampered by the complexity of workflows and consequently the complexity of the data-trails generated from their instrumented execution. In this paper we propose the generation of workflow description summaries in order to tackle workflow complexity. We elaborate reduction primitives for summarizing workflows, and show how the primitives, as building blocks, can be used in conjunction with semantic workflow annotations to encode different summarization strategies. We report on the effectiveness of the method through experimental evaluation using real-world workflows from the Taverna system.
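To illustrate what such reduction primitives can look like, here is a toy sketch over a workflow represented as a labelled DAG. The motif labels and the specific collapse/eliminate rules are illustrative assumptions, not the paper's exact strategy.

```python
# Toy sketch (not the paper's implementation) of two reduction primitives over a
# workflow DAG whose steps are annotated with motifs. "eliminate" removes a step
# and rewires its neighbours; "collapse" merges steps that share a motif label.
edges = {"fetch": ["clean"], "clean": ["format"], "format": ["analyse"], "analyse": []}
motif = {"fetch": "data retrieval", "clean": "data preparation",
         "format": "data preparation", "analyse": "data analysis"}

def eliminate(edges, step):
    """Remove `step`, linking each of its predecessors to each of its successors."""
    successors = edges.pop(step)
    for preds in edges.values():
        if step in preds:
            preds.remove(step)
            preds.extend(s for s in successors if s not in preds)
    return edges

def collapse(edges, motif, label):
    """Merge every step annotated with `label` into one composite step."""
    group = [s for s in edges if motif.get(s) == label]
    if len(group) < 2:
        return edges
    merged = "+".join(group)
    out = []
    for s in group:
        out.extend(t for t in edges.pop(s) if t not in group)
    edges[merged] = out
    for succs in edges.values():
        # redirect edges pointing into the group toward the composite step
        kept = [t for t in succs if t not in group]
        if len(kept) != len(succs):
            kept.append(merged)
        succs[:] = kept
    return edges

collapse(edges, motif, "data preparation")   # fetch -> clean+format -> analyse
eliminate(edges, "fetch")                    # clean+format -> analyse
print(edges)
```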
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012 by Idafen Santana Pérez
Slides from a weekly talk at OEG in December 2012, focusing on defining the main concepts for my thesis, including what we understand by reproducibility, replicability, conservation, and preservation.
Research Objects in Scientific Publications by dgarijo
This document discusses research objects (ROs), which aggregate resources to bundle the contents of a research work. ROs allow for process preservation, reusability, repeatability, traceability, understandability, and curation of research. The RO model uses the Open Annotation vocabulary and can be extended for different domains. An example of a life sciences RO is provided. Creating ROs involves assigning permanent identifiers to resources, describing relationships between resources, and adding descriptions to a manifest file. Tools exist to help with RO creation and annotation, and ROs can be shared through repositories like RO Portal and ROHub.
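As an illustration of the manifest-building step described above, the sketch below assigns identifiers to the aggregated resources, records a relationship, and writes a manifest file. The JSON field names and paths are illustrative, not the exact RO manifest vocabulary.

```python
# Minimal sketch of building a research-object manifest: give every aggregated
# resource an identifier, describe one relationship, and write a manifest file.
import json

resources = {
    "workflow.t2flow": "workflow",
    "input/genes.csv": "dataset",
    "results/plot.png": "result",
}

manifest = {
    "id": "https://example.org/ro/my-experiment/",   # permanent identifier (illustrative)
    "aggregates": [
        {"uri": path, "type": kind} for path, kind in resources.items()
    ],
    "annotations": [
        {   # relationship between two aggregated resources
            "about": "results/plot.png",
            "content": {"prov:wasDerivedFrom": "input/genes.csv"},
        }
    ],
}

with open("manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```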
Capturing Context in Scientific Experiments: Towards Computer-Driven Science by dgarijo
Scientists publish computational experiments in ways that do not facilitate reproducibility or reuse. Significant domain expertise, time and effort are required to understand scientific experiments and their research outputs. In order to improve this situation, mechanisms are needed to capture the exact details and the context of computational experiments. Only then will Intelligent Systems be able to help researchers understand, discover, link and reuse the products of existing research.
In this presentation I will introduce my work and vision towards enabling scientists to share, link, curate and reuse their computational experiments and results. In the first part of the talk, I will present my work on capturing and sharing the context of scientific experiments by using scientific workflows and machine-readable representations. Thanks to this approach, experiment results are described in an unambiguous manner, have a clear trace of their creation process and include a pointer to the sources used for their generation. In the second part of the talk, I will describe examples of how the context of scientific experiments may be exploited to browse, explore and inspect research results. I will end the talk by presenting new ideas for improving and benefiting from the capture of the context of scientific experiments and how to involve scientists in the process of curating and creating abstractions over available research metadata.
Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
This document summarizes the results of an empirical analysis of 177 scientific workflows from Taverna and Wings systems. The analysis identified common motifs in data-oriented activities and workflow implementation styles. For data activities, motifs included data preparation, data transformation, data movement and data visualization. For workflows, motifs involved different ways activities were combined and implemented. The identified motifs could help inform workflow design practices and tools to generate workflow abstractions, improving understanding and reusability of workflows.
RQ1. What are the differences between e-commerce and s-commerce?
RQ2. What are the characteristics of s-commerce?
RQ3. What are the activities of s-commerce?
RQ4. What are the research themes that are addressed in s-commerce studies?
RQ5. What are the limitations and gaps in current research of s-commerce?
The document outlines the procedures for conducting a systematic literature review on social commerce (s-commerce), including developing research questions, defining a search strategy, selecting studies, assessing study quality, and extracting and synthesizing data. The review aims to understand the key concepts of s-commerce, explore common research themes, and identify the limitations and gaps in current s-commerce research.
Reproducibility in human cognitive neuroimaging: a community-driven data sha... by Nolan Nichols
The document summarizes Nolan Nichols' dissertation defense on a community-driven data sharing framework for integrating and interoperating neuroimaging provenance information. His research aimed to enhance the reusability of neuroimaging data and workflows by advancing data exchange standards that incorporate provenance. Through two phases involving multiple collaborations, he extended existing standards and developed neuroimaging data models and web services to compute and discover provenance from brain imaging workflows in order to improve reproducibility in cognitive neuroimaging research.
Research Objects for improved sharing and reproducibility by Oscar Corcho
Presentation about the usage of Research Objects to improve scientific experiment sharing and reproducibility, given at the Dagstuhl Perspective Workshop on the intersection between Computer Sciences and Psychology (July 2015)
A task-based scientific paper recommender system for literature review and ma... by Aravind Sesagiri Raamkumar
My PhD oral defense presentation (as of Oct 3rd 2017)
The dissertation can be requested at this link https://www.researchgate.net/publication/323308750_A_task-based_scientific_paper_recommender_system_for_literature_review_and_manuscript_preparation
1) The document discusses research objects (ROs) which aim to document the full scientific process in a digital environment, including workflows, data, software, and provenance.
2) ROs in the Wf4Ever project contain detailed semantic annotations and can be aggregated into templates to help complete the scientific record.
3) Incentives for using ROs include improved reproducibility, credit for researchers, and increased citations of papers that link to their underlying data and methods.
5 minute presentation during the SC13 Birds of a Feather Session on the relationship between the Research Data Alliance and High Performance Computing.
Creating abstractions from scientific workflows: PhD symposium 2015 by dgarijo
This document discusses the creation of abstractions in scientific workflows. It hypothesizes that it is possible to automatically extract reusable patterns and abstractions from scientific workflow repositories that could be useful for developers. The document outlines challenges in workflow representation, abstraction, reuse, and annotation. It then describes an approach to define vocabularies and methodologies for publishing workflows as linked data. This includes defining a catalog of common workflow abstractions and techniques for finding and evaluating these abstractions across different workflow corpora. Evaluation shows the extracted patterns are similar to those defined by users and are considered useful.
Keynote: SemSci 2017: Enabling Open Semantic Science
1st International Workshop co-located with ISWC 2017, October 2017, Vienna, Austria,
https://semsci.github.io/semSci2017/
Abstract
We have all grown up with the research article and article collections (let’s call them libraries) as the prime means of scientific discourse. But research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
We can think of “Research Objects” as packages, of different types, that gather all the components of an investigation. If we stop thinking of publishing papers and start thinking of releasing Research Objects (as we release software), then scholarly exchange is a new game: ROs and their content evolve; they are multi-authored and their authorship evolves; they are a mix of virtual and embedded, and so on.
But first, some baby steps before we get carried away with a new vision of scholarly communication. Many journals (e.g. eLife, F1000, Elsevier) are just figuring out how to package together the supplementary materials of a paper. Data catalogues are figuring out how to virtually package multiple datasets scattered across many repositories to keep the integrated experimental context.
Research Objects [1] (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described. The brave new world of containerisation provides the containers and Linked Data provides the metadata framework for the container manifest construction and profiles. It’s not just theory, but also in practice with examples in Systems Biology modelling, Bioinformatics computational workflows, and Health Informatics data exchange. I’ll talk about why and how we got here, the framework and examples, and what we need to do.
[1] Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble, Why linked data is not enough for scientists, Future Generation Computer Systems, Volume 29, Issue 2, 2013, Pages 599-611, ISSN 0167-739X, https://doi.org/10.1016/j.future.2011.08.004
This document discusses research objects (ROs) and their role in reproducible science. It makes three key points:
1. Publications should convince readers of validity through reproducible results, but current systems do not fully facilitate reproducibility. ROs can address this by explicitly representing methods used.
2. Reproducibility reinforces results and is a key factor in scientific discovery. ROs provide a reproducible representation of methods.
3. ROs bundle together essential resources from a computational study, such as data, results, methods, people involved, and annotations for understanding, interpretation, and reuse. They support the full experimental lifecycle from problem definition to publication.
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows by Khalid Belhajjame
I gave this talk at the EDBT'2020 conference. It shows how the provenance of workflows can be anonymized without compromising lineage relationships between the data records that are used and generated by the modules that compose the workflow.
Privacy-Preserving Data Analysis Workflows for eScience by Khalid Belhajjame
This document discusses an approach for preserving privacy in scientific workflows that use large datasets. It proposes using k-anonymity to anonymize sensitive workflow data. Parameter dependencies are leveraged to identify sensitive parameters and infer appropriate anonymity degrees. The approach was tested on 20 workflows, with overhead less than 1 millisecond. This preliminary work aims to assist scientists in anonymizing workflow data while enabling exploration of provenance and data products.
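A toy sketch of the k-anonymity idea, applied to a single sensitive numeric workflow parameter, is given below. Fixed-width binning is an illustrative generalisation scheme, not the paper's method.

```python
# Toy sketch of k-anonymising one sensitive numeric workflow parameter by
# coarsening it into ever wider bins until every bin holds at least k records.
from collections import Counter

def k_anonymise(values, k, width=1):
    """Return binned values (as ranges) with every bin containing >= k items."""
    while True:
        bins = [(v // width) * width for v in values]
        if min(Counter(bins).values()) >= k:
            return [f"[{b}, {b + width})" for b in bins]
        width *= 2      # coarsen and try again

ages = [23, 24, 25, 31, 33, 37, 41, 42]
print(k_anonymise(ages, k=3))
```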
- The document discusses evaluating "why-not" queries against scientific workflow provenance. Why-not queries help understand why a data item was not returned by a workflow execution.
- It proposes a solution for evaluating why-not queries in workflows with black-box modules that do not preserve attribute information from inputs. The solution explores workflow modules from sink to source to identify "picky" modules responsible for a data item not appearing in results.
- To identify picky modules, it harvests information from the web by searching for traces of scientific module invocations to find valid candidate inputs and determine whether a module accepts them or is likely picky. An experiment using real workflows tests the effectiveness of the approach (a toy sketch of the sink-to-source traversal follows below).
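The sketch below walks back from the workflow output and reports the first module whose filtering predicate rejects the missing item. The module names and predicates are invented stand-ins for the behaviour harvested from the web in the paper.

```python
# Toy sketch of the sink-to-source exploration: walk back from the workflow
# output and report the first module that rejects the missing item.
predecessor = {"report": "filter_quality", "filter_quality": "normalise", "normalise": None}
accepts = {
    "report": lambda item: True,
    "filter_quality": lambda item: item["score"] >= 0.8,   # illustrative predicate
    "normalise": lambda item: True,
}

def find_picky_module(sink, item):
    """Return the module, nearest the sink, that rejects `item` (None if all accept)."""
    module = sink
    while module is not None:
        if not accepts[module](item):
            return module
        module = predecessor[module]
    return None

print(find_picky_module("report", {"score": 0.5}))   # -> filter_quality
```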
Converting scripts into reproducible workflow research objects by Khalid Belhajjame
1) The document presents a methodology to convert script-based experiments into reproducible workflow research objects (WROs). This addresses issues of understanding, reusing, and reproducing experiments conducted through scripts.
2) The methodology involves 5 steps: generate an abstract workflow, create an executable workflow, refine the workflow, record provenance data, and annotate and check the quality of the conversion.
3) Applying the methodology to a molecular dynamics simulation case study, the authors demonstrate how scripts can be transformed into WROs containing workflows, annotations, provenance data, and other resources needed for reproducibility.
The document discusses assisting designers in composing workflows through the reuse of frequent workflow fragments mined from repositories. It proposes an approach that involves mining fragments, representing workflows as graphs, homogenizing activity labels, and allowing users to search for fragments using keywords and activities from their initial workflow. Fragments are retrieved based on relevance to keywords and compatibility to specified activities, then ranked and presented to users for composition. Experiments assess different graph representations for mining fragments in terms of effectiveness, size and runtime. The approach aims to help designers reuse best practices from repositories when specifying new workflows.
This document proposes a method to improve the reuse of workflow fragments by mining workflow repositories. It evaluates different graph representations of workflows and uses the SUBDUE algorithm to identify recurrent fragments. An experiment compares representations on precision, recall, memory usage, and time. Representation D1, which labels edges and nodes, performed best. A second experiment assesses how filtering workflows by keywords impacts finding relevant fragments for a user query. The method aims to incorporate workflow fragment search capabilities into the design lifecycle to promote reuse.
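The sketch below is not SUBDUE itself, but a toy illustration of the underlying idea: represent each workflow as labelled activity-to-activity edges and count which two-step fragments recur across the repository. The activity labels and support threshold are invented for the example.

```python
# Toy illustration of fragment mining (not SUBDUE): count which two-step
# fragments recur across a small repository of labelled workflows.
from collections import Counter

workflows = [
    [("fetch_sequences", "align"), ("align", "build_tree")],
    [("fetch_sequences", "align"), ("align", "render_plot")],
    [("fetch_sequences", "align"), ("align", "build_tree"), ("build_tree", "render_plot")],
]

# Deduplicate edges within each workflow so support counts whole workflows.
fragment_counts = Counter(edge for wf in workflows for edge in set(wf))
min_support = 2
frequent = [frag for frag, n in fragment_counts.items() if n >= min_support]
print(frequent)   # fragments worth suggesting for reuse, e.g. fetch_sequences -> align
```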
Linking the prospective and retrospective provenance of scripts by Khalid Belhajjame
Scripting languages like Python, R, and MATLAB have seen significant use across a variety of scientific domains. To assist scientists in the analysis of script executions, a number of mechanisms, e.g., noWorkflow, have recently been proposed to capture the provenance of script executions. The provenance information recorded can be used, e.g., to trace the lineage of a particular result by identifying the data inputs and the processing steps that were used to produce it. By and large, the provenance information captured for scripts is fine-grained in the sense that it captures data dependencies at the level of script statements, and does so for every variable within the script. While useful, the amount of recorded provenance information can be overwhelming for users and cumbersome to use. This suggests the need for abstraction mechanisms that focus attention on specific parts of the provenance relevant for analyses. Toward this goal, we advocate that fine-grained provenance information recorded as the result of script execution can be abstracted using user-specified, workflow-like views. Specifically, we show how the provenance traces recorded by noWorkflow can be mapped to the workflow specifications generated by YesWorkflow from scripts based on user annotations. We examine the issues in constructing a successful mapping, provide an initial implementation of our solution, and present competency queries illustrating how a workflow view generated from the script can be used to explore the provenance recorded during script execution.
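As a rough illustration of the abstraction step, the sketch below groups statement-level provenance records under user-annotated workflow steps so that queries can be answered at the workflow level. The record format and annotations are illustrative, not noWorkflow's or YesWorkflow's actual data model.

```python
# Toy sketch of abstracting statement-level provenance into a workflow-level
# view: each script line is mapped, via user annotations, to a named workflow
# step, and fine-grained records are grouped under those steps.
from collections import defaultdict

# User annotations: which script lines belong to which workflow step.
step_of_line = {1: "load_data", 2: "load_data", 3: "clean", 4: "clean", 5: "plot"}

# Fine-grained trace: (line, variable written, variables read).
trace = [
    (1, "raw", []), (2, "df", ["raw"]),
    (3, "df2", ["df"]), (4, "df3", ["df2"]),
    (5, "figure", ["df3"]),
]

view = defaultdict(lambda: {"reads": set(), "writes": set()})
for line, written, read in trace:
    step = step_of_line[line]
    view[step]["writes"].add(written)
    view[step]["reads"].update(read)

for step, io in view.items():
    # A workflow-level answer to "what did this step use and produce?"
    print(step, "used", io["reads"] - io["writes"], "produced", io["writes"])
```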
These slides introduce the second edition of ProvBench, which I am leading to collect a corpus of provenance data for benchmarking in the provenance (and scientific) community.
I gave this talk at TAPP 2014, during the provenance week in Cologne, on inferring fine-grained dependencies between data (ports) in scientific workflows. -- khalid
I gave this talk at the EDBT 2014 conference, which took place in Athens, Greece.
I show how data examples can be used to characterize the behavior of scientific modules. I present a new method that automatically generates the data examples, and show that such data examples help the human user understand the task of the modules, and that they can be used to assist curators in repairing broken workflows (i.e., workflows for which one or more modules are no longer supplied by their providers).
A use case designed in the context of the DataONE provenance working group, illustrating how the provenance traces generated by different workflow engines can be queried via the D-PROV model.
This document proposes representing scientific workflows as first-class citizens called research objects. It presents a model for workflow research objects that aggregates all necessary elements to understand an investigation, including experiments, annotations, results, datasets and provenance. Research objects are encoded using semantic technologies like RDF and follow standards such as the ORE (Object Reuse and Exchange) model. The lifecycle of research objects is also described.
A talk given at the EDBT/ICDT 2010 conference. For more details, visit the project website at http://img.cs.manchester.ac.uk/dataspaces/dataspaces.html
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Project Management Semester Long Project - Acuityjpupo2018
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Webinar: Designing a schema for a Data WarehouseFederico Razzoli
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas, which means, denormalised databases where each table represents a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
2. Storyline
• Why Research Objects?
• Overview of the Research Objects
• Portfolio of Research Object Management Tools
• Provenance Distillation Through Workflow Summarization
4. Electronic papers are not enough
[Diagram: a Research Object aggregating Datasets, Results, Scientists, Hypothesis, Experiments, Annotations and Provenance around the Electronic paper]
5. Benefits of Research Objects
• A research object aggregates all elements that are necessary to understand research investigations.
• Methods (experiments) are viewed as first-class citizens.
• Promote reuse.
• Enable the verification of the reproducibility of the results.
7. 47 of 53 “landmark” publications could not be replicated: inadequate cell lines and animal models (Nature, 483, 2012). Credit to Carole Goble, JCDL 2012 Keynote.
8. Overview of the Research Object Model
Fig. 2. Overview of the Research Object Model with the different ontologies that encode it: Core RO for describing the basic Research Object structure, roevo for tracking Research Object evolution, and wfprov and wfdesc for describing workflows and their provenance.
9. Research Object as an ORE Aggregation
Fig. 3. Research Object as an ORE aggregation.
Research Objects use annotations as a means for decorating a resource (or a set of resources) with metadata information. An annotation comprises ao:Target, which refers to the resource subject to annotation, and ao:Body, which comprises a description of the target. The body is specified in an RDF document, for example:
@base <http://example.com/ro/389/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix ore: <http://www.openarchives.org/ore/terms/> .
10. Scientific Workflows
• Data-driven analysis pipelines
• Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving
• Tools for automating frequently performed data-intensive activities
• Provenance for the resulting datasets:
– The method followed
– The resources used
– The datasets used
11. PROV Primer, Gil et al.
• WF Execution Trace (Retrospective Provenance): the actual data used, actual invocations, timestamps and the data derivation trace
• WF Description (Prospective Provenance): the intended method for analysis
16. Workflows can get complex!
• Overwhelming for users who are not the developers
• Abstractions required for reporting
• Lineage queries result in very long trails
25. Analysis Data Set
• 30 workflows from the Taverna system
• Entire dataset & queries accessible from http://www.myexperiment.org/packs/467.html
• Manual annotation using the Motif Vocabulary
28. Highlights
• Research Object model and associated management tools
• Annotation of workflows using motifs
• Methods for summarizing workflows and distilling their provenance traces
• Algorithms for repairing workflows
• Validation of the workflow summarization
• Querying of workflow execution provenance using summaries
Ongoing Work
29. References
• P. Alper, K. Belhajjame, C. Goble, P. Karagoz. Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotations. IEEE International Congress on Big Data, 2013.
• P. Alper, K. Belhajjame, C. A. Goble, P. Karagoz. Enhancing and abstracting scientific workflow provenance for data publishing. EDBT/ICDT Workshops, 2013.
• K. Belhajjame, O. Corcho, D. Garijo, et al. Workflow-Centric Research Objects: A First Class Citizen in the Scholarly Discourse. Proceedings of the ESWC 2012 Workshop on the Future of Scholarly Communication in the Semantic Web (SePublica 2012), Heraklion, Greece, May 2012.
• K. Belhajjame, J. Zhao, D. Garijo, et al. The Research Object Suite of Ontologies: Sharing and Exchanging Research Data and Methods on the Open Web. Submitted to the Journal of Web Semantics.
• D. Garijo, P. Alper, K. Belhajjame, Ó. Corcho, Y. Gil, C. A. Goble. Common motifs in scientific workflows: An empirical analysis. eScience 2012.
• J. Zhao, J. M. Gómez-Pérez, K. Belhajjame, G. Klyne, et al. Why workflows break: Understanding and combating decay in Taverna workflows. IEEE eScience 2012.
Research is increasingly digital. Most research results are disseminated in the form of electronic papers through traditional communication channels, such as conferences and journals, or through newer media such as microblogging. While electronic papers have played and continue to play a primordial role in the dissemination of research results, researchers now recognize that they are by no means sufficient to communicate and share information about research investigations. Indeed, the hypothesis investigated during the research, the experiment designed to assess the validity of the hypothesis, the process (workflow) used to run the experiment, the datasets used and the results produced by the experiment, and the conclusions drawn by the scientist are all elements that may be needed to understand, assess the claims, or re-use the results of previous research investigations. In the context of the Wf4Ever project, we are investigating a new abstraction, which we name research objects, that aggregates all these elements. Research objects provide an entry point to the elements that are necessary or useful to understand and reuse research results. A particular feature of the research objects we are studying is that they contain workflows that specify and implement data-intensive scientific experiments.
In this age of data-intensive science we are witnessing the unprecedented generation and sharing of large scientific datasets, where the pace of data generation has far surpassed the pace of conducting analysis over the data. Scientific Workflows [6] are a recent but very popular method for task automation and resource integration. Using workflows, scientists are able to systematically weave datasets and analytical tools into pipelines, represented as networks of data processing operations connected with dataflow links. (Figure 1 illustrates a workflow from genomics, which, “from a given set of gene ids, retrieves corresponding enzyme ids and finds the biological pathways involving them, then for each pathway retrieves its diagram with a designated coloring scheme”.) As well as being automation pipelines, workflows are of paramount importance for the provenance of the data generated from their execution [6]. Provenance refers to data’s derivation history starting from the original sources, namely its lineage.
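As a rough illustration (not the authors' or Taverna's data model), such a pipeline can be sketched as a small directed graph of operations connected by dataflow links. The operation names below come from the running genomics example, but the wiring between them is only approximate and the Workflow class itself is hypothetical.

# Minimal sketch of a workflow as a dataflow graph: operations are nodes,
# dataflow links are directed edges between them.
from collections import namedtuple

Link = namedtuple("Link", ["source", "target"])  # one dataflow link

class Workflow:
    def __init__(self, name):
        self.name = name
        self.operations = set()
        self.links = []

    def add_link(self, source, target):
        # Register both operations and the dataflow link between them.
        self.operations.update([source, target])
        self.links.append(Link(source, target))

    def successors(self, op):
        return [l.target for l in self.links if l.source == op]

# Fragment of the genomics pipeline from the running example (wiring approximate).
wf = Workflow("gene_to_pathway_diagram")
wf.add_link("get enzymes by gene", "get genes by enzyme")
wf.add_link("get genes by enzyme", "color pathway by objects")
wf.add_link("color pathway by objects", "Get Image From URL 2")

print(wf.successors("get enzymes by gene"))  # ['get genes by enzyme']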
By explicitly outlining the analysis process, workflow descriptions make up the prospective part of provenance. The instrumented execution of the workflow generates a veracious and elaborate data lineage trail outlining the derivation relations among resultant, intermediary and initial data artifacts, which makes up the retrospective part of provenance.
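To make the distinction concrete, here is a small hypothetical sketch: the workflow description plays the role of prospective provenance, while a recorded trace of one run (invocations, timestamps, derivations) plays the role of retrospective provenance. The structures and the data values (gene, enzyme and pathway identifiers) are illustrative only and do not follow the PROV or wfprov vocabularies.

import datetime

# Prospective provenance: the intended method, i.e. the workflow description itself.
wf_description = {
    "steps": ["get enzymes by gene", "color pathway by objects", "Get Image From URL 2"],
    "links": [("get enzymes by gene", "color pathway by objects"),
              ("color pathway by objects", "Get Image From URL 2")],
}

# Retrospective provenance: what actually happened during one execution.
trace = []

def record_invocation(step, inputs, outputs):
    """Append one invocation record: actual data used, timestamp, derivations."""
    trace.append({
        "step": step,
        "used": inputs,        # actual input artifacts
        "generated": outputs,  # actual output artifacts
        "time": datetime.datetime.now().isoformat(),
        "derivations": [(o, i) for o in outputs for i in inputs],
    })

record_invocation("get enzymes by gene", ["gene:example-id"], ["enzyme:example-ec"])
record_invocation("color pathway by objects", ["enzyme:example-ec"], ["pathway:example-map"])
record_invocation("Get Image From URL 2", ["pathway:example-map"], ["image:diagram.png"])

# A lineage query over the retrospective trace: from what was the image derived?
print([d for rec in trace for d in rec["derivations"] if d[0] == "image:diagram.png"])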
http://alpha.myexperiment.org/packs/405.html is an example of a research object, published as a pack in myExperiment, for a Genome-Wide Association Study: the workflow is broken. The workflow was repaired by the user and uploaded as part of a new pack. Once the user uploads the workflow, it is transformed into a format that is compatible with wfdesc. The idea here is that, regardless of the workflow language, the workflow is translated into a simple workflow language, which in this case is wfdesc. You can also download workflow runs, and on this page you will have a summary of their characterization. You can browse the workflow in the RO portal, which is the back-end library that stores information about the Research Objects. Here you can find information about the evolution of a research object, in particular which pack was used to create which pack.
Workflows are typically complex artifacts containing several processing steps and dataflow links. Complexity is characterized by the environment in which workflows operate. Obfuscation degrades the reportability of the scientific intent embedded in the workflow. Complex workflows, combined with the faithful collection of execution provenance for each step, result in even more complex provenance traces containing long data trails that overwhelm users who explore or query them [3], [1].
Scientific Workflow Motifs. At the heart of our approach lies the exploitation of information about the data-processing nature of activities in workflows. We propose that this information is provided through semantic annotations, which we call motifs. We use a lightweight ontology, developed in previous work [8] based on an analysis of 200+ workflows (accessible at http://purl.org/net/wf-motifs). The motif ontology characterizes operations with respect to their Data-Oriented functional nature and their Resource-Oriented implementation nature. We refer the reader to [8] for the details; here we briefly introduce some of the motifs in light of our running example.
• Data-Oriented nature. Certain activities in workflows, such as DataRetrieval, Analysis or Visualization, perform the scientific heavy lifting in a workflow. The gene annotation pipeline in Figure 1 collects data from various resources through several retrieval steps (see the “get enzymes by gene” and “get genes by enzyme” operations). The data retrieval steps are pipelined to each other through adapter steps, which are categorized with the DataPreparation motif. Augmentation is a sub-category of preparation operations; it decorates data with resource- or protocol-specific padding or formatting (the sub-workflow “Workflow89” in our example is an augmenter that builds a well-formed query request out of given parameters). Extraction activities perform the inverse of augmentation: they extract data from raw results returned from analyses or retrievals (e.g. SOAP XML messages). Splitting and Merging are also sub-categories of data preparation; they consume and produce the same data but change its cardinality (see the “Flatten List” steps in our example). Another general category is DataOrganization activities, which perform querying and organization functions over data, such as Filtering, Joining and Grouping; the “Remove String Duplicates” method in our example is an instance of the Filtering motif. Another frequent motif is DataMoving: in order to make data accessible to external analysis tools, data may need to be moved in and out of the workflow execution environment, and such activities fall into this category (e.g. upload to web, write to file). The “Get Image From URL 2” operation in our genomics workflow is an instance of this motif.
• Resource-Oriented nature, which outlines the implementation aspects and reflects the characteristics of the resources that underpin the operation. Classifications in this category are: Human-Interaction vs. entirely Computational steps, Stateful vs. Stateless invocations, and the use of Internal (e.g. local scripts, sub-workflows) vs. External (e.g. web services, or external command-line tools) computational artifacts. In our example, all data retrieval operations are realized by external, stateless resources (web services), whereas all data preparation operations are realized with internal resources (local scripts and sub-workflows).
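The sketch below shows one way such annotations might be attached to the running example's operations; it uses only motif names mentioned in the text, with class names flattened and approximated (the exact hierarchy lives in the motif ontology at http://purl.org/net/wf-motifs), so it is an illustration rather than the ontology itself.

# Hypothetical, flattened view of a few motif classes from the motif ontology.
DATA_ORIENTED = {"DataRetrieval", "Analysis", "Visualization",
                 "DataPreparation", "Augmentation", "Extraction",
                 "Splitting", "Merging", "DataOrganization",
                 "Filtering", "Joining", "Grouping", "DataMoving"}
RESOURCE_ORIENTED = {"HumanInteraction", "Computational",
                     "Stateful", "Stateless", "Internal", "External"}

# Annotations for operations of the running example, as described in the text.
annotations = {
    "get enzymes by gene":      {"DataRetrieval", "External", "Stateless"},
    "Workflow89":               {"Augmentation", "Internal"},
    "Flatten List":             {"Splitting"},
    "Remove String Duplicates": {"Filtering"},
    "Get Image From URL 2":     {"DataMoving"},
}

def motif_kind(motif):
    """Classify a motif label as data- or resource-oriented (illustrative only)."""
    return "data" if motif in DATA_ORIENTED else "resource"

for op, motifs in annotations.items():
    print(op, {m: motif_kind(m) for m in motifs})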
A dataflow decorated with motif annotations is a 3-tuple W = ⟨N, E, motifs⟩, where motifs is a function that maps each graph node n ∈ N, be it an operation or a port, to its motifs. Operation nodes can have motif annotations from the operation and resource perspectives, and port nodes can have annotations from the data perspective (omitted in this paper due to space limitations). As an example, we illustrate below two motif annotations of two operations in our running example; note that a given node can be associated with multiple motifs.
motifs(color pathway by objects) = {⟨m1 : Augmenter⟩}
motifs(Get Image From URL 2) = {⟨m2 : DataRetrieval⟩}
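The 3-tuple can be encoded quite directly; the sketch below is one convenient encoding (a guess, not the authors' code), storing motifs as a mapping from node names to sets of motif labels and mirroring the two example annotations above.

# W = (N, E, motifs): nodes, dataflow edges, and a motif-assignment function.
N = {"color pathway by objects", "Get Image From URL 2"}       # graph nodes
E = {("color pathway by objects", "Get Image From URL 2")}     # dataflow links

_motif_table = {
    "color pathway by objects": {"Augmenter"},
    "Get Image From URL 2":     {"DataRetrieval"},
}

def motifs(node):
    """Map a graph node to its (possibly multiple) motifs; unannotated nodes get none."""
    return _motif_table.get(node, set())

W = (N, E, motifs)

print(motifs("color pathway by objects"))  # {'Augmenter'}
print(motifs("Get Image From URL 2"))      # {'DataRetrieval'}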
2) Elimination: We define the function eliminate_op : ⟨W, op⟩ → ⟨W′, M⟩ as one which takes an annotated workflow definition and eliminates a designated operation node from it. Elimination results in a set of indirect data links connecting the predecessors of the deleted operation to its successors. The source and target ports of these new indirect links are the cartesian product of the to-be-deleted operation’s in-links’ sources and its out-links’ targets. As with the other primitives, elimination also returns a set of mappings between the ports of the resulting workflow and those of the input workflow.
Motif Behavior: The designated operation is eliminated together with its motifs; these are not propagated to any other node in the workflow graph.
Constraint Checks: There are no particular constraint checks for the elimination primitive.
Dataflow Characteristic: Within the conventional definition of a workflow, ports are connected with direct data links, which correspond to data transfers, i.e. the data artifact appearing at the source port is transferred to the target port as-is. In the indirect-link case, however, there is no direct data transfer: the source and target artifacts are dissimilar, and some “eliminated” data processing occurs in between.
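A minimal sketch of the elimination primitive under simplified assumptions: operations are linked directly, without modelling ports as separate nodes, so the returned mapping is over operations rather than ports. The cartesian-product rewiring follows the description above; the function name and data layout are otherwise assumptions.

# Simplified elimination primitive: drop an operation and connect each of its
# predecessors to each of its successors with an indirect link.
def eliminate_op(workflow, op):
    nodes, links = workflow
    in_links  = [(s, t) for (s, t) in links if t == op]
    out_links = [(s, t) for (s, t) in links if s == op]

    kept = [(s, t) for (s, t) in links if op not in (s, t)]
    # Cartesian product of the in-links' sources and the out-links' targets.
    indirect = [(s, t) for (s, _) in in_links for (_, t) in out_links]

    new_nodes = nodes - {op}
    new_links = kept + indirect
    # Mapping from surviving nodes of the summary back to the input workflow.
    mapping = {n: n for n in new_nodes}
    return (new_nodes, new_links), mapping

nodes = {"get enzymes by gene", "Flatten List", "color pathway by objects"}
links = [("get enzymes by gene", "Flatten List"),
         ("Flatten List", "color pathway by objects")]

summary, mapping = eliminate_op((nodes, links), "Flatten List")
print(summary[1])  # [('get enzymes by gene', 'color pathway by objects')]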
Summary-By-Elimination: Requires a minimal amount of annotation effort and operates with a single rule. We classify only the scientifically significant operations with their motifs, and all remaining operations in the workflow are designated to be DataPreparation steps. The following reduction rule is then used to summarize the workflow graph W.
Summary-By-Collapse: Requires annotating the workflow with more specific motif classes and uses a more fine-grained set of rules. Consequently, annotations on data preparation steps are more specific, designating what kind of preparation activity is occurring (e.g. Merging, Splitting, Augmentation). We use these annotations to remove those data preparation steps from the workflow with the collapse primitive, collapsing in the direction that retains the ports carrying the more reporting-friendly artifacts. We collapse Augmentation, Splitting and ReferenceGeneration operations downstream, as their outputs are less favored than their inputs, whereas we collapse Extraction, Merging, ReferenceResolution and all DataOrganization steps upstream, as their outputs are preferred over their inputs in a summary. Two rules in this approach are given below:
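As a separate illustrative sketch (not the paper's rule syntax), Summary-By-Elimination can be imitated by eliminating every operation that carries no "scientifically significant" motif; which motifs count as significant is an assumption here, as are the helper names.

# Summary-By-Elimination sketch: eliminate every operation whose only role is
# data preparation, i.e. every operation without a "significant" motif.
SIGNIFICANT = {"DataRetrieval", "Analysis", "Visualization", "DataMoving"}  # assumed set

def eliminate_op(workflow, op):
    nodes, links = workflow
    ins  = [s for (s, t) in links if t == op]
    outs = [t for (s, t) in links if s == op]
    links = [(s, t) for (s, t) in links if op not in (s, t)]
    links += [(s, t) for s in ins for t in outs]       # indirect links
    return (nodes - {op}, links)

def summarize_by_elimination(workflow, motif_table):
    nodes, _ = workflow
    for op in list(nodes):
        if not (motif_table.get(op, set()) & SIGNIFICANT):  # unannotated => DataPreparation
            workflow = eliminate_op(workflow, op)
    return workflow

nodes = {"get enzymes by gene", "Workflow89", "Flatten List", "Get Image From URL 2"}
links = [("get enzymes by gene", "Workflow89"),
         ("Workflow89", "Flatten List"),
         ("Flatten List", "Get Image From URL 2")]
motif_table = {"get enzymes by gene": {"DataRetrieval"},
               "Get Image From URL 2": {"DataMoving"}}

print(summarize_by_elimination((nodes, links), motif_table))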
In our evaluation we annotated the workflow corpus manually using the motif ontology. The effort that goes into the design of workflows can be significant (up to several days for a workflow). When compared to design, the cost of annotation is minor (as it amounts to a single attribute setup per operation). Annotation can be (semi-)automated through the application of mining techniques to workflows and execution traces. It is also possible to pool annotations in service registries or module libraries and automatically propagate annotations to operations that are re-used from the registry. Instead of the motif ontology, it is also possible to use other ontologies [10] to characterize operations in workflows.
Motifs are used in conjunction with reduction primitives to build summarization rules. The objective of rules is to determine how to reduce the workflow when certain motifs, or combinations of motifs, are encountered. We propose graph transformation rules of the form L → P, where L specifies a pattern to be matched in the host graph, expressed in terms of a workflow-specific graph model and motif decorations, and P specifies a workflow reduction primitive to be enacted upon the node(s) identified by the left-hand side L. The primitive P captures both the replacement graph pattern and how the replacement is to be embedded into the host graph. Our approach is a more controlled, primitive-based re-writing mechanism, rather than a fully declarative one (as in traditional graph re-writing). This allows us to: 1) preserve data causality relations among operations, together with the constraints that such relations have (e.g. acyclicity for scientific dataflows); and 2) let the user specify high-level reduction policies by creating simple ⟨motif, primitive⟩ pairs. It would be a big undertaking to expect users, who are not computer scientists or software engineers, to specify re-writes in a fully declarative manner (i.e. a matching pattern, a replacement pattern and embedding information).
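The ⟨motif, primitive⟩ pairing could be realized as a simple policy table driving the rewriting loop. The sketch below is only meant to show the shape of such a policy under assumed primitive names (eliminate, collapse_downstream, collapse_upstream) and placeholder bodies; it is not the actual rule syntax or implementation.

# Sketch: high-level reduction policies as simple <motif, primitive> pairs.
# L (the match) is "an operation carrying this motif"; P is the primitive to apply.
def eliminate(wf, op):                 # placeholder primitives; real ones rewrite the graph
    print(f"eliminate {op}")

def collapse_downstream(wf, op):
    print(f"collapse {op} into its successor")

def collapse_upstream(wf, op):
    print(f"collapse {op} into its predecessor")

POLICY = {                             # hypothetical Summary-By-Collapse style policy
    "Augmentation": collapse_downstream,
    "Splitting":    collapse_downstream,
    "Extraction":   collapse_upstream,
    "Merging":      collapse_upstream,
    "Filtering":    collapse_upstream,
}

def apply_policy(wf, annotations):
    for op, op_motifs in annotations.items():
        for motif in op_motifs:
            primitive = POLICY.get(motif)
            if primitive:              # match on motif (L), apply primitive (P)
                primitive(wf, op)

apply_policy(None, {"Workflow89": {"Augmentation"}, "Flatten List": {"Splitting"}})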