Slides of the presentation for my PhD dissertation. I strongly recommend downloading the slides, as they have animations that are easier to see in PowerPoint. The abstract of the thesis is as follows: "Scientific workflows have been adopted in the last decade to represent the computational methods used in in silico scientific experiments and their associated research products. Scientific workflows have proven useful for sharing and reproducing scientific experiments, allowing scientists to visualize, debug and save time when re-executing previous work. However, scientific workflows may be difficult to understand and reuse. The large number of workflows available in repositories, together with their heterogeneity and lack of documentation and usage examples, may become an obstacle for a scientist aiming to reuse the work of other scientists. Furthermore, given that it is often possible to implement a method using different algorithms or techniques, seemingly disparate workflows may be related at a higher level of abstraction, based on their common functionality. In this thesis we address the issues of reusability and abstraction by exploring how workflows relate to one another in a workflow repository, mining abstractions that may be helpful for workflow reuse. In order to do so, we propose a simple model for representing and relating workflows and their executions; we analyze the typical common abstractions that can be found in workflow repositories; we explore the current practices of users regarding workflow reuse; and we describe a method for discovering useful abstractions for workflows based on existing graph mining techniques. Our results expose the common abstractions and practices of users in terms of workflow reuse, and show how our proposed abstractions have the potential to become useful for users designing new workflows."
Reproducibility Using Semantics: An Overview (dgarijo)
Overview of the different approaches for addressing reproducibility (using semantics) in laboratory protocols, workflow description and publication, and workflow infrastructure. Furthermore, Research Objects are introduced as a means to capture the context and annotations of scientific experiments, together with the privacy and IPR concerns that may arise. This presentation was given at Dagstuhl Seminar 16041: http://www.dagstuhl.de/16041
Software Metadata: Describing "dark software" in GeoSciences (dgarijo)
Credit to Yolanda Gil.
In this talk I provide an overview of the current state of the art for software description in geosciences, along with our approach to facilitate this task in OntoSoft, a distributed semantic registry for scientific software. Three key aspects of OntoSoft are: a software metadata ontology designed for scientists, a distributed approach to software registries that targets communities of interest, and metadata crowdsourcing through access control. Software metadata is organized using the OntoSoft ontology, which is designed to support scientists in sharing, documenting, and reusing software along six dimensions: identify software, understand and assess software, execute software, get support for the software, do research with the software, and update the software.
Keynote: SemSci 2017: Enabling Open Semantic Science
1st International Workshop co-located with ISWC 2017, October 2017, Vienna, Austria
https://semsci.github.io/semSci2017/
Abstract
We have all grown up with the research article and article collections (let’s call them libraries) as the prime means of scientific discourse. But research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
We can think of “Research Objects” as coming in different types and as packaging all the components of an investigation. If we stop thinking of publishing papers and start thinking of releasing Research Objects (as we release software), then scholarly exchange is a new game: ROs and their content evolve; they are multi-authored and their authorship evolves; they are a mix of virtual and embedded, and so on.
But first, some baby steps before we get carried away with a new vision of scholarly communication. Many journals (e.g. eLife, F1000, Elsevier) are just figuring out how to package together the supplementary materials of a paper. Data catalogues are figuring out how to virtually package multiple datasets scattered across many repositories to keep the integrated experimental context.
Research Objects [1] (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described. The brave new world of containerisation provides the containers, and Linked Data provides the metadata framework for the container manifest construction and profiles. It’s not just theory but also practice, with examples in Systems Biology modelling, Bioinformatics computational workflows, and Health Informatics data exchange. I’ll talk about why and how we got here, the framework and examples, and what we need to do.
[1] Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Philip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble, Why linked data is not enough for scientists, Future Generation Computer Systems 29(2), 2013, pp. 599-611, ISSN 0167-739X, https://doi.org/10.1016/j.future.2011.08.004
Being Reproducible: SSBSS Summer School 2017 (Carole Goble)
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns of credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
Being FAIR: FAIR data and model management, SSBSS 2017 Summer School (Carole Goble)
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training, and create a platform for dataset interoperability. As the Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform I will show how this work relates to your projects.
[1] Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016), doi:10.1038/sdata.2016.18
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe... (Carole Goble)
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
The FAIR Guiding Principles for scientific data management and stewardship (http://www.nature.com/articles/sdata201618) have been an effective rallying-cry for EU and USA Research Infrastructures. The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has 8 years of experience of asset sharing and data infrastructure ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (de.NBI, the German Virtual Liver Network, UK SynBio centres) and PIs' labs. It aims to support Systems and Synthetic Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Sys Bio, highlighting the challenges of and approaches to sharing, credit, citation and asset infrastructures in practice. I'll also highlight recent experiments in affecting sharing using behavioural interventions.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Presented at COMBINE 2016, Newcastle, 19 September.
http://co.mbine.org/events/COMBINE_2016
Findable Accessible Interoperable Reusable <data | models | SOPs | samples | articles | *>. FAIR is a mantra; a meme; a myth; a mystery; a moan. For the past 15 years I have been working on FAIR in a range of Life Science projects and initiatives. Some are top-down, like the Life Science European Research Infrastructures ELIXIR and ISBE, and some are bottom-up, supporting research projects in Systems and Synthetic Biology (FAIRDOM), Biodiversity (BioVeL), and Pharmacology (Open PHACTS), for example. Some have become movements, like Bioschemas, the Common Workflow Language and Research Objects. Others focus on cross-cutting approaches in reproducibility, computational workflows, metadata representation and scholarly sharing & publication. In this talk I will relate a series of FAIRy tales. Some of them are Grimm. Some have happy endings. Who are the villains and who are the heroes? What are the morals we can draw from these stories?
Being FAIR: Enabling Reproducible Data Science (Carole Goble)
Talk presented at Early Detection of Cancer Conference, OHSU, Portland, Oregon USA, 2-4 Oct 2018, http://earlydetectionresearch.com/ in the Data Science session
Research Objects: more than the sum of the parts (Carole Goble)
Workshop on Managing Digital Research Objects in an Expanding Science Ecosystem, 15 Nov 2017, Bethesda, USA
https://www.rd-alliance.org/managing-digital-research-objects-expanding-science-ecosystem
Research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
A first step is to think of Digital Research Objects as a broadening out to embrace these artefacts or assets of research. The next is to recognise that investigations use multiple, interlinked, evolving artefacts. Multiple datasets and multiple models support a study; each model is associated with datasets for construction, validation and prediction; an analytic pipeline has multiple codes and may be made up of nested sub-pipelines, and so on. Research Objects (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described.
Results may vary: Collaborations Workshop, Oxford 2014 (Carole Goble)
Thoughts on computational science reproducibility with a focus on software. Given at the Software Sustainability Institute's 2014 Collaborations Workshop
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks (Carole Goble)
Keynote presentation at iConference 2015, Newport Beach, California, 26 March 2015.
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
http://ischools.org/the-iconference/
BEWARE: presentation includes hidden slides AND in situ build animations - best viewed by downloading.
RARE and FAIR Science: Reproducibility and Research Objects (Carole Goble)
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material is reusable, except perhaps the IT Crowd screengrab.
Reproducibility, Research Objects and Reality, Leiden 2016 (Carole Goble)
Presented at the Leiden Bioscience Lecture, 24 November 2016, Reproducibility, Research Objects and Reality
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. It all sounds very laudable and straightforward. BUT…..
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns of credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in data-driven computational life sciences through examples and stories from initiatives that I am involved in, and Leiden is too, including:
· FAIRDOM, which has built a Commons for Systems and Synthetic Biology projects, with an emphasis on standards smuggled in by stealth and efforts to affect sharing practices using behavioural interventions
· ELIXIR, the EU Research Data Infrastructure, and its efforts to exchange workflows
· Bioschemas.org, an ELIXIR-NIH-Google effort to support the finding of assets.
Project Website: http://www.researchobject.org/
researchobject.org is a community project that has developed an approach to describe and package up all resources used as part of an investigation as Research Objects (ROs).
ROs provide two main features: a manifest, a consistent way to provide a well-typed, structured description of the resources used in an investigation; and a ‘bundle’, a mechanism for packaging up manifests with resources as a single, publishable unit.
ROs therefore carry the research context of an experiment - data, software, standard operating procedures (SOPs), models etc. - and gather together the components of an experiment so that they are findable, accessible, interoperable and reusable (FAIR). ROs combine software and data into an aggregative data structure consisting of well-described, reconstructable parts.
ROs have the potential to address a number of challenges pertinent to open research, including: a) supporting interoperability between infrastructures by using ROs as a primary mechanism for exchange and publication; b) supporting the evolution of research objects as a living collection, enabling provenance tracking; and c) providing the ability to pivot around research object components (data, software, models) that are not restricted to the traditional publication.
Here we present work towards the development and adoption of ROs:
(i) A series of specifications and conventions, using community standards, for the RO manifest and RO bundles;
(ii) implementations of Java, Python and Ruby APIs and tooling against those specifications;
(iii) examples of representations of the RO models in various languages (e.g. JSON-LD, RDF, HTML).
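To make the manifest-plus-bundle split concrete, here is a minimal Python sketch assuming the RO Bundle convention of a zip archive carrying its manifest at .ro/manifest.json; the helper name, file names and media types are illustrative, not the project's official tooling.

```python
# Minimal sketch (not official RO tooling) of packaging resources into an
# RO-style bundle: a zip whose .ro/manifest.json aggregates typed resources.
import json
import zipfile

def write_ro_bundle(path, resources):
    """resources: list of (archive_name, local_path, mediatype) tuples."""
    manifest = {
        # Context URL follows the RO Bundle convention; treat as illustrative.
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [
            {"uri": name, "mediatype": mediatype}
            for name, _, mediatype in resources
        ],
    }
    with zipfile.ZipFile(path, "w") as bundle:
        for name, local_path, _ in resources:
            bundle.write(local_path, arcname=name)  # the aggregated resources
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))

# Example usage (assumes results.csv and protocol.txt exist locally):
write_ro_bundle("experiment.bundle.zip", [
    ("data/results.csv", "results.csv", "text/csv"),
    ("sop/protocol.txt", "protocol.txt", "text/plain"),
])
```

The point of the design is that the bundle travels as one publishable unit while the manifest stays machine-readable Linked Data.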
Recent improvements in positioning technology have led to a much wider availability of massive moving object data. One of the objectives of spatio-temporal data mining is to analyze such datasets to extract moving objects that travel together. Naturally, the moving objects in a cluster may diverge temporarily and congregate at certain timestamps, so there are time gaps among moving object clusters. Existing approaches either impose a strong constraint (no time gap at all) or are completely relaxed (any time gap) when dealing with the gaps, which may result in the loss of interesting patterns or the extraction of a huge amount of extraneous patterns, making it difficult for analysts to understand the object movement behavior. Motivated by this issue, we propose the concept of fuzzy swarm, which softens the time gap constraint. The goal of our paper is to find all non-redundant fuzzy swarms, namely fuzzy closed swarms. As a contribution, we propose the fCS-Miner algorithm, which enables us to efficiently extract all the fuzzy closed swarms. Experiments conducted on real and large synthetic datasets demonstrate the effectiveness, parameter sensitivity and efficiency of our methods.
NhatHai Phan
CIS Department,
University of Oregon, Eugene, OR
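To illustrate the fuzzy time-gap idea (this is a toy sketch, not the paper's fCS-Miner algorithm), the snippet below scores pairs of co-travelling objects with a membership that decays as the gaps between their shared timestamps grow; gap_membership, the tolerance parameter and the sample clusters are all invented for illustration.

```python
# Toy illustration of softening the time-gap constraint: rather than
# forbidding gaps (too strict) or ignoring them (too loose), weight each
# gap by a fuzzy membership that decays with its length.
from itertools import combinations

def gap_membership(gap, tolerance=3):
    """1.0 for consecutive timestamps, linearly decaying to 0.0 as the
    gap approaches the tolerance (values here are arbitrary choices)."""
    return max(0.0, 1.0 - (gap - 1) / tolerance)

def fuzzy_swarm_score(timestamps):
    """Average gap membership over the co-travel timestamps."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return sum(gap_membership(g) for g in gaps) / len(gaps) if gaps else 1.0

# clusters[t] maps a timestamp to {object_id: cluster_id} assignments.
clusters = {
    1: {"o1": "A", "o2": "A", "o3": "B"},
    2: {"o1": "C", "o2": "C", "o3": "C"},
    5: {"o1": "D", "o2": "D", "o3": "E"},
}

objects = sorted({o for assign in clusters.values() for o in assign})
for pair in combinations(objects, 2):
    together = [t for t, assign in clusters.items()
                if assign.get(pair[0]) == assign.get(pair[1])]
    if len(together) >= 2:
        # ('o1', 'o2') travel together at t=1,2,5; the t=2 -> t=5 gap
        # lowers the score instead of disqualifying the swarm outright.
        print(pair, together, round(fuzzy_swarm_score(together), 2))
```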
This presentation covers the concept of Market Basket Analysis, applied in practice to real-world data from a small canteen; the analysis results are drawn entirely from this real data.
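For readers unfamiliar with the technique, here is a minimal sketch of the support/confidence computation behind market basket analysis, run on invented toy canteen transactions (not the talk's actual data).

```python
# Market basket analysis in miniature: count how often item pairs co-occur
# across baskets, then report support and confidence for frequent pairs.
from itertools import combinations
from collections import Counter

transactions = [
    {"tea", "samosa"}, {"tea", "sandwich"}, {"tea", "samosa", "juice"},
    {"coffee", "sandwich"}, {"tea", "samosa"},
]
n = len(transactions)

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

for (a, b), count in pair_counts.items():
    support = count / n                     # fraction of baskets with both items
    confidence = count / item_counts[a]     # confidence of the rule a -> b
    if support >= 0.4:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

On this toy data the rule "samosa -> tea" comes out with support 0.60 and confidence 1.00, the kind of association such an analysis surfaces.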
An Introduction to Bioinformatics
Drexel University INFO648-900-200915
A Presentation of Health Informatics Group 5
Cecilia Vernes
Joel Abueg
Kadodjomon Yeo
Sharon McDowell Hall
Terrence Hughes
From Scientific Workflows to Research Objects: Publication and Abstraction of... (dgarijo)
Presentation of my PhD work to the UPM group on 12 February 2014. Summary of goals, motivation, OPMW, standards, PROV, P-Plan, Workflow Motifs, workflow fragment detection and Research Objects.
From Scientific Workflows to Research Objects: Publication and Abstraction of... (dgarijo)
Overview of my current work at the Ontology Engineering Group. This presentation is similar to http://www.slideshare.net/dgarijo/from-scientific-workflows-to-research-objects-publication-and-abstraction-of-scientific-experiments, with a couple of extra slides giving some details of my future plans.
"Data Provenance: Principles and Why it matters for BioMedical Applications"Pinar Alper
Tutorial given at the Informatics for Health 2017 conference. These slides are for the second part of the tutorial, describing provenance capture and management tools.
Converting scripts into reproducible workflow research objects (Khalid Belhajjame)
This talk was given at the eScience 2016 conference. It presents a principled methodology for converting raw scripts into annotated workflow research objects.
Slides presented during the IEEE 12th International Conference on eScience in Baltimore, MD, USA in October 2016.
More information: http://w3id.org/w2share/s2rwro/
ML-based detection of users' anomalous activities (20th OWASP Night Tokyo, English) (Yury Leonychev)
These are the English slides of my presentation about a machine learning implementation for a model web application, with some advice for developers who decide to create a similar implementation in a real production environment.
Research Objects for improved sharing and reproducibility (Oscar Corcho)
Presentation about the usage of Research Objects to improve scientific experiment sharing and reproducibility, given at the Dagstuhl Perspective Workshop on the intersection between Computer Sciences and Psychology (July 2015)
Conference: 11th IEEE International Conference on Automation Science and Engineering, CASE 2015. Gothenburg, Sweden – August 24-28, 2015
Title of the paper: An approach for knowledge-driven product, process and resource mappings for assembly automation
Authors: Borja Ramis Ferrer, Bilal Ahmad, Andrei Lobov, Daniel Vera, José L. Martinez Lastra, Robert Harrison
The Quest for an Open Source Data Science Platform (QAware GmbH)
Cloud Native Night July 2019, Munich: Talk by Jörg Schad (@joerg_schad, Head of Engineering & ML at ArangoDB)
=== Please download slides if blurred! ===
Abstract: With the rapid and recent rise of data science, the Machine Learning Platforms being built are becoming more complex. For example, consider the various Kubeflow components: Distributed Training, Jupyter Notebooks, CI/CD, Hyperparameter Optimization, Feature store, and more. Each of these components produces metadata: different versions of datasets, different versions of a Jupyter notebook, different training parameters, test/training accuracy, different features, model serving statistics, and many more.
For production use it is critical to have a common view across all these metadata, as we have to ask questions such as: Which Jupyter notebook has been used to build model xyz currently running in production? If there is new data for a given dataset, which models (currently serving in production) have to be updated?
In this talk, we look at existing implementations, in particular MLMD as part of the TensorFlow ecosystem. Further, we propose a first draft of an (MLMD-compatible) universal Metadata API. We demo the first implementation of this API using ArangoDB.
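As a rough illustration of the lineage questions such a universal metadata API must answer, here is a hedged, in-memory Python sketch; the MetadataStore class and its methods are hypothetical stand-ins, not MLMD's or ArangoDB's actual APIs.

```python
# Toy metadata store tracking which execution produced which artifact, and
# which artifacts fed that execution, so lineage questions become lookups.
from collections import defaultdict

class MetadataStore:
    def __init__(self):
        self.artifacts = {}                  # artifact id -> properties
        self.produced_by = {}                # artifact id -> execution id
        self.inputs = defaultdict(list)      # execution id -> input artifact ids

    def record_execution(self, exec_id, input_ids, output_id, **props):
        self.artifacts[output_id] = props
        self.produced_by[output_id] = exec_id
        self.inputs[exec_id].extend(input_ids)

    def lineage(self, artifact_id):
        """Which upstream artifacts (datasets, notebooks) led to this one?"""
        exec_id = self.produced_by.get(artifact_id)
        return self.inputs.get(exec_id, [])

store = MetadataStore()
store.record_execution("train-42", ["dataset-v3", "notebook-v7"],
                       "model-xyz", accuracy=0.93)
# "Which notebook has been used to build model xyz running in production?"
print(store.lineage("model-xyz"))  # ['dataset-v3', 'notebook-v7']
```

A graph database such as ArangoDB makes the same traversal efficient at scale; the in-memory dicts here only show the shape of the problem.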
Presentation given at SPNHC 2015, Gainesville, FL
Kurator: An extensible, open-source workflow platform for users and makers of data curation tools
B. Ludäscher, J. Hanken, D. Lowery, J.A. Macklin, T. McPhillips, P.J. Morris, R.A. Morris, T. Song
The recently funded Kurator project builds upon earlier experiences with workflow-based approaches for quality control of biodiversity data. We are developing workflow components (“actors”) to examine data collections and perform checks, e.g., on scientific names, name authorship, collecting date, collector name (recorded by), georeference, locality, and phenological state (where applicable). Kurator is based on a number of ideas: 1) We allow “cleaning data with data”: in addition to checking the internal consistency of records, we can employ external resources to spot quality issues and suggest repairs. 2) Human curators remain in control: Kurator tools keep track of processing history and data lineage (computational provenance) to show original records, alternative forms and the respective sources, thus allowing human curators to make informed decisions about which suggested repairs and flagged records require action. 3) Kurator aims to serve both makers of data curation tools and end users. Initially, we are focusing on a modular, easily extensible approach to data curation workflows and scripts so that curation tool makers (ourselves included) are empowered to quickly develop new curation functionality; we also need to expose curation sources and curation logic to make programming of new features easy. In the second phase, the Kurator toolkit will also include a web interface for end users who might not be programmers or tool makers themselves. The ultimate goal is to allow users who don’t think of themselves as tool makers to build more complex curation workflows from simple components, thus diminishing the gap between makers and users.
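A hedged sketch of idea 1), "cleaning data with data", follows: a toy check actor flags a collecting date against an external resource while preserving the original record for the human curator (idea 2). The function and sample data are illustrative inventions, not Kurator's actual actors.

```python
# Toy curation "actor": flag records whose collecting year falls outside the
# years the collector is known to have been active, keeping the original
# value and a suggestion so a human curator makes the final call.
def check_collecting_date(record, known_collector_ranges):
    result = {"original": dict(record), "flags": [], "suggestions": []}
    collector = record.get("recordedBy")
    year = record.get("year")
    active = known_collector_ranges.get(collector)
    if active and year is not None and not (active[0] <= year <= active[1]):
        result["flags"].append(
            f"year {year} outside {collector}'s active range {active}")
        result["suggestions"].append("verify date against collection ledger")
    return result

# External resource: collectors' known active years (hypothetical values).
ranges = {"C. Darwin": (1831, 1882)}
print(check_collecting_date({"recordedBy": "C. Darwin", "year": 1920}, ranges))
```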
Tutorial given at the European Conference on Machine Learning (ECMLPKDD 2015). It covers OpenML, how to use it in your research, its interfaces in Java, R and Python, and its use through machine learning tools such as WEKA and MOA. It also covers topics in open science and reproducible research.
DataONE Education Module 09: Analysis and Workflows (DataONE)
Lesson 9 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Similar to PhD Thesis: Mining abstractions in scientific workflows
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles (dgarijo)
Slides presented at DBpedia Day, at the SEMANTiCS conference in 2021. FOOPS! (available at https://w3id.org/foops) is a validator based on the FAIR principles that guides users in conforming their ontologies to them. For each principle, FOOPS! runs a series of tests and reports errors, suggestions and ways to conform to best practices.
FAIR Workflows: A step closer to the Scientific Paper of the Future (dgarijo)
Keynote presented at the Computational and Autonomous Workflows workshop (CAW-2021) at the Oak Ridge National Laboratory. The keynote gives an overview of the different aspects to take into account when aiming to create FAIR workflows and associated resources.
An increasing number of researchers rely on computational methods to generate the results described in their publications. Research software created to this end is heterogeneous (e.g., scripts, libraries, packages, notebooks, etc.) and usually difficult to find, reuse, compare and understand due to its disconnected documentation (dispersed in manuals, readme files, web sites, and code comments) and a lack of structured metadata to describe it. In this talk I will describe the main challenges in finding, comparing and reusing research software; how structured metadata can help to address some of them; the best practices being proposed by the community; and current initiatives to aid their adoption by researchers within EOSC.
Impact: The talk addresses an important aspect of the EOSC infrastructure for quality research software by ensuring that software contributed to the EOSC ecosystem can be found, compared and reused by researchers. The talk also aims to address metadata quality of current research products, which is critical for successful adoption.
Presented at the EOSC symposium
SOMEF: a metadata extraction framework from software documentation (dgarijo)
Presentation given to the council of software registries in March 2021. SOMEF is a Python package for automatically extracting over 25 metadata categories from a readme file. The output is exported in JSON or in JSON-LD using the CodeMeta representation.
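As a rough sketch of the input/output shape of such extraction (the real SOMEF combines several techniques across its 25+ categories; this regex-only toy and its extract_metadata helper are hypothetical), a minimal version might look like this:

```python
# Toy README metadata extraction: pull a title and an installation section
# out of markdown text and return them as a metadata dictionary.
import re

def extract_metadata(readme_text):
    metadata = {}
    # Title: the first level-1 markdown header.
    m = re.search(r"^#\s+(.+)$", readme_text, re.MULTILINE)
    if m:
        metadata["title"] = m.group(1).strip()
    # Installation: the text under an "Installation" header, up to the
    # next level-2 header or the end of the file.
    m = re.search(r"^##\s*Installation\s*\n(.+?)(?=^##|\Z)",
                  readme_text, re.MULTILINE | re.DOTALL)
    if m:
        metadata["installation"] = m.group(1).strip()
    return metadata

readme = "# MyTool\nDoes things.\n## Installation\npip install mytool\n"
print(extract_metadata(readme))
# {'title': 'MyTool', 'installation': 'pip install mytool'}
```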
A Template-Based Approach for Annotating Long-Tailed Datasets (dgarijo)
An increasing amount of data is shared on the Web through heterogeneous spreadsheets and CSV files. In order to homogenize and query these data, the scientific community has developed Extract, Transform and Load (ETL) tools and services that help make these files machine-readable as Knowledge Graphs (KGs). However, tabular data may be complex, and the level of expertise required by existing ETL tools makes it difficult for users to describe their own data. In this paper we propose a simple annotation schema to guide users when transforming complex tables into KGs. We have implemented our approach by extending T2WML, a table annotation tool designed to help users annotate their data and upload the results to a public KG. We have evaluated our effort with six non-expert users, obtaining promising preliminary results.
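A hypothetical example of what such a block-level annotation might look like (role names, properties and cell ranges are invented for illustration; the paper's actual schema and the T2WML extension may differ):

```python
# Each block of spreadsheet cells gets a role; the roles then drive the
# table-to-knowledge-graph transformation.
annotation = [
    # The rows naming the entities the statements are about.
    {"selection": "A2:A10", "role": "main subject", "type": "country"},
    # Column headers acting as qualifiers (here, the year of each value).
    {"selection": "B1:D1", "role": "qualifier", "property": "point in time"},
    # The numeric cells that become the values of the generated statements.
    {"selection": "B2:D10", "role": "dependent variable",
     "property": "GDP", "unit": "US dollar"},
]
for block in annotation:
    print(block["selection"], "->", block["role"])
```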
OBA: An Ontology-Based Framework for Creating REST APIs for Knowledge Graphs (dgarijo)
In this presentation we describe the Ontology-Based APIs framework (OBA), our approach to automatically create REST APIs from ontologies while following RESTful API best practices. Given an ontology (or ontology network), OBA uses standard technologies familiar to web developers (OpenAPI Specification, JSON) and combines them with W3C standards (OWL, JSON-LD frames and SPARQL) to create maintainable APIs with documentation, unit tests, automated validation of resources, and clients (in Python, JavaScript, etc.) for non-Semantic-Web experts to access the contents of a target knowledge graph. We showcase OBA with three examples that illustrate the capabilities of the framework for different ontologies.
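To make the idea concrete, a minimal sketch of the class-to-paths mapping at OBA's core might look as follows; paths_for_class, the naive pluralisation and the example class names are illustrative simplifications, not the framework's actual code, which also derives schemas, SPARQL queries, tests and clients.

```python
# Derive RESTful CRUD path stubs for each ontology class; a full generator
# would fill these into a complete OpenAPI specification.
def paths_for_class(class_name):
    resource = class_name.lower() + "s"          # naive pluralisation
    return {
        f"/{resource}": {
            "get": {"summary": f"List all instances of {class_name}"},
            "post": {"summary": f"Create a {class_name}"},
        },
        f"/{resource}/{{id}}": {
            "get": {"summary": f"Get a {class_name} by id"},
            "put": {"summary": f"Update a {class_name}"},
            "delete": {"summary": f"Delete a {class_name}"},
        },
    }

openapi = {"openapi": "3.0.0", "paths": {}}
for cls in ["Region", "Model"]:                  # classes from some ontology
    openapi["paths"].update(paths_for_class(cls))
print(sorted(openapi["paths"]))
# ['/models', '/models/{id}', '/regions', '/regions/{id}']
```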
Towards Knowledge Graphs of Reusable Research Software Metadata (dgarijo)
Research software is a key asset for understanding, reusing and reproducing results in computational sciences. An increasing amount of software is stored in code repositories, which usually contain human readable instructions indicating how to use it and set it up. However, developers and researchers often need to spend a significant amount of time to understand how to invoke a software component, prepare data in the required format, and use it in combination with other software. In addition, this time investment makes it challenging to discover and compare software with similar functionality. In this talk I will describe our efforts to address these issues by creating and using Open Knowledge Graphs that describe research software in a machine readable manner. Our work includes: 1) an ontology that extends schema.org and CodeMeta, designed to describe software and the specific data formats it uses; 2) an approach to publish software metadata as an open knowledge graph, linked to other Web of Data objects; 3) a framework for automatically extracting metadata from software repositories; and 4) a framework to curate, query, explore and compare research software metadata in a collaborative manner. The talk will illustrate our approach with real-world examples, including a domain application for inspecting and discovering hydrology, agriculture, and economic software models; and the results of our framework when enriching the research software entries in Zenodo.org.
Scientific Software Registry Collaboration Workshop: From Software Metadata r... (dgarijo)
In this talk I briefly describe our work in OntoSoft for easy software metadata representation, and how new requirements for software reusability are making us move towards knowledge graphs of scientific software metadata
WDPlus: Leveraging Wikidata to Link and Extend Tabular Data (dgarijo)
Today, data about any domain can be found on the web in data repositories, web APIs and many millions of spreadsheets and CSV files. Researchers and organizations make these data available in a myriad of formats, layouts, terminologies and states of cleanliness that make them difficult to integrate. As a result, researchers aiming to use data in their analyses face three main challenges. The first one is finding datasets related to a feature, variable or topic of interest. For example, climate scientists need to look for years of observational data from authoritative sources when estimating the climate of a region. The second challenge is completing a given dataset with existing knowledge: machine learning applications are data hungry and require as many data points and features as possible to improve their predictions, which often requires integrating data from different sources. The third challenge is sharing integrated results: once several datasets have been merged together, how can they be made available to the rest of the community?
OKG-Soft: An Open Knowledge Graph With Machine Readable Scientific Software M... (dgarijo)
Scientific software is crucial for understanding, reusing and reproducing results in computational sciences. Software is often stored in code repositories, which may contain human readable instructions necessary to use it and set it up. However, a significant amount of time is usually required to understand how to invoke a software component, prepare data in the format it requires, and use it in combination with other software. In this presentation we introduce OKG-Soft, an open knowledge graph that describes scientific software in a machine readable manner. OKG-Soft includes: 1) an ontology designed to describe software and the specific data formats it uses; 2) an approach to publish software metadata as an open knowledge graph, linked to other Web of Data objects; and 3) a framework to annotate, query, explore and curate scientific software metadata.
Towards Human-Guided Machine Learning - IUI 2019 (dgarijo)
Automated Machine Learning (AutoML) systems are emerging that automatically search for possible solutions from a large space of possible kinds of models. Although fully automated machine learning is appropriate for many applications, users often have knowledge that supplements and constrains the available data and solutions. This paper proposes human-guided machine learning (HGML) as a hybrid approach where a user interacts with an AutoML system and tasks it to explore different problem settings that reflect the user’s knowledge about the data available. We present: 1) a task analysis of HGML that shows the tasks that a user would want to carry out, 2) a characterization of two scientific publications, one in neuroscience and one in political science, in terms of how the authors would search for solutions using an AutoML system, 3) requirements for HGML based on those characterizations, and 4) an assessment of existing AutoML systems in terms of those requirements.
Capturing Context in Scientific Experiments: Towards Computer-Driven Science (dgarijo)
Scientists publish computational experiments in ways that do not facilitate reproducibility or reuse. Significant domain expertise, time and effort are required to understand scientific experiments and their research outputs. In order to improve this situation, mechanisms are needed to capture the exact details and the context of computational experiments. Only then will intelligent systems be able to help researchers understand, discover, link and reuse products of existing research.
In this presentation I will introduce my work and vision towards enabling scientists to share, link, curate and reuse their computational experiments and results. In the first part of the talk, I will present my work on capturing and sharing the context of scientific experiments by using scientific workflows and machine readable representations. Thanks to this approach, experiment results are described in an unambiguous manner, have a clear trace of their creation process and include a pointer to the sources used for their generation. In the second part of the talk, I will describe examples of how the context of scientific experiments may be exploited to browse, explore and inspect research results. I will end the talk by presenting new ideas for improving and benefiting from the capture of the context of scientific experiments, and for involving scientists in the process of curating and creating abstractions over available research metadata.
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met... (dgarijo)
Traditional approaches to ontology development involve a long lapse between the time when a user of the ontology finds a need to extend it and the time when it actually gets extended. For scientists, this delay can be weeks or months and can be a significant barrier for adoption. We present a new approach to ontology development and data annotation enabling users to add new metadata properties on the fly as they describe their datasets, creating terms that can be immediately adopted by others and eventually become standardized. This approach combines a traditional, consensus-based approach to ontology development and a crowdsourced approach where expert users (the crowd) can dynamically add terms as needed to support their work. We have implemented this approach as a socio-technical system that includes: 1) a crowdsourcing platform to support metadata annotation and the addition of new terms, 2) a range of social editorial processes to make standardization decisions for those new terms, and 3) a framework for ontology revision and updates to the metadata created with the previous version of the ontology. We present a prototype implementation for the paleoclimate community, the Linked Earth Framework, currently containing 700 datasets and engaging over 50 active contributors. Users exploit the platform to do science while extending the metadata vocabulary, thereby producing useful and practical metadata.
WIDOCO: A Wizard for Documenting Ontologiesdgarijo
WIDOCO is a WIzard for DOCumenting Ontologies that guides users through the documentation process of their vocabularies. Given an RDF vocabulary, WIDOCO detects missing vocabulary metadata and creates documentation with diagrams, human readable descriptions of the ontology terms and a summary of changes with respect to previous versions of the ontology. The documentation consists of a set of linked, enriched HTML pages that can be further extended by end users. WIDOCO is open source and builds on well established Semantic Web tools. So far, it has been used to document more than one hundred ontologies in different domains.
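As a rough illustration of how such a documentation run could be scripted, the sketch below drives WIDOCO from Python through its command-line interface. The jar filename and file paths are placeholders, and the -ontFile, -outFolder and -rewriteAll flags are quoted from WIDOCO's documentation as best I recall, so verify them against the README of the release you use.

```python
# Hedged sketch: invoking WIDOCO (a Java tool) from Python via its CLI.
# The jar name and paths are hypothetical; double-check the flags
# against the WIDOCO README for your release.
import subprocess

subprocess.run([
    "java", "-jar", "widoco.jar",      # hypothetical jar filename
    "-ontFile", "myontology.owl",      # the RDF vocabulary to document
    "-outFolder", "documentation",     # output folder for the HTML pages
    "-rewriteAll",                     # overwrite any previous output
], check=True)
```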
We propose a new area of research on automating data narratives. Data narratives are containers of information about computationally generated research findings. They have three major components: 1) a record of events that describes a new result through a workflow and/or the provenance of all the computations executed; 2) persistent entries for the key entities involved, such as data, software versions, and workflows; 3) a set of narrative accounts that are automatically generated, human-consumable renderings of the record and entities, and that can be included in a paper. Different narrative accounts can be used for different audiences, with different content and details based on the level of interest or expertise of the reader. Data narratives can make science more transparent and reproducible, because they ensure that the text description of the computational experiment reflects with high fidelity what was actually done. Data narratives can be incorporated in papers, either in the methods section or as supplementary materials. We introduce DANA, a prototype that illustrates how to generate data narratives automatically, and describe the information it uses from the computational records. We also present a formative evaluation of our approach and discuss potential uses of automated data narratives.
Automated Hypothesis Testing with Large Scale Scientific Workflowsdgarijo
(Credit to Varun Ratnakar and Yolanda Gil).
The automation of important aspects of scientific data analysis would significantly accelerate the pace of science and innovation. Although important aspects of data analysis can be automated, the hypothesize-test-evaluate discovery cycle is largely carried out by hand by researchers. This introduces a significant human bottleneck, which is inefficient and can lead to erroneous and incomplete explorations. We introduce a novel approach to automate the hypothesize-test-evaluate discovery cycle with an intelligent system that a scientist can task to test hypotheses of interest in a data repository. Our approach captures three types of data analytics knowledge: 1) common data analytic methods represented as semantic workflows; 2) meta-analysis methods that aggregate those results, represented as meta-workflows; and 3) data analysis strategies that specify for a type of hypothesis what data and methods to use, represented as lines of inquiry. Given a hypothesis specified by a scientist, appropriate lines of inquiry are triggered, which lead to retrieving relevant datasets, running relevant workflows on that data, and finally running meta-workflows on workflow results. The scientist is then presented with a level of confidence on the initial hypothesis (or a revised hypothesis) based on the data and methods applied. We have implemented this approach in the DISK system, and applied it to multi-omics data analysis.
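To make the three kinds of data analytics knowledge concrete, here is a minimal sketch of the hypothesize-test-evaluate loop described above. All names and structures are my own illustration, not the DISK system's API.

```python
# Illustrative sketch only: lines of inquiry triggering workflows and
# meta-workflows for a hypothesis, as described in the abstract above.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class LineOfInquiry:
    hypothesis_pattern: str   # the type of hypothesis it can address
    data_query: str           # how to retrieve relevant datasets
    workflows: List[str]      # data analysis workflows to run
    meta_workflow: str        # meta-workflow aggregating the results

def test_hypothesis(hypothesis: str,
                    lines: List[LineOfInquiry],
                    run: Callable[[str, list], float]) -> Optional[float]:
    """Trigger every matching line of inquiry and return the best
    confidence obtained for the hypothesis (None if nothing matched)."""
    confidences = []
    for loi in lines:
        if loi.hypothesis_pattern in hypothesis:    # crude matching
            datasets = [loi.data_query]             # stand-in retrieval
            results = [run(wf, datasets) for wf in loi.workflows]
            confidences.append(run(loi.meta_workflow, results))
    return max(confidences, default=None)
```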
OntoSoft: A Distributed Semantic Registry for Scientific Softwaredgarijo
Credit to Yolanda Gil.
OntoSoft is a distributed semantic registry for scientific software. This paper describes three major novel contributions of OntoSoft: 1) a software metadata registry designed for scientists, 2) a distributed approach to software registries that targets communities of interest, and 3) metadata crowdsourcing through access control. Software metadata is organized using the OntoSoft ontology along six dimensions that matter to scientists: identify software, understand and assess software, execute software, get support for the software, do research with the software, and update the software. OntoSoft is a distributed registry where each site is owned and maintained by a community of interest, with a distributed semantic query capability that allows users to search across all sites. The registry has metadata crowdsourcing capabilities, supported through access control so that software authors can allow others to expand on specific metadata properties.
OEG tools for supporting Ontology Engineeringdgarijo
In this talk we give an overview of the suite of tools developed at the OEG for supporting ontology engineering. The tasks we support are ontology documentation, evaluation, diagramming and publication with permanent identifiers and content negotiation. All the tools are integrated in OnToology, which uses GitHub to publish the outcome produced for each ontology.
Publicación de datos y métodos científicos en investigacióndgarijo
What are the challenges of publishing research data? This presentation covers the aspects involved in publishing a scientific experiment in an academic setting, as well as the approaches we are following at the Ontology Engineering Group of Universidad Politécnica de Madrid.
This presentation gives an overview of the main concepts introduced at the EDBT 2015 Summer School, which took place in Palamós. For each area, we summarize the main issues and current approaches. We also describe the challenges and main activities that were undertaken in the summer school.
PhD Thesis: Mining abstractions in scientific workflows
1. Date: 03/12/2015
Mining Abstractions in
Scientific Workflows
Daniel Garijo *
Supervisors: Oscar Corcho *, Yolanda Gil Ŧ
* Universidad Politécnica de Madrid,
Ŧ USC Information Sciences Institute
3. Benefits of workflows
Time savings
•Copy & paste fragments of workflows
Teaching
•Reduce the learning curve of new students
Visualization
•Simplify workflows
Design for modularity
•Highlight the most relevant steps of a workflow
Design for standardization
Debugging
•Provenance exploration
Reproducibility and inspectability
4. Motivation of this work
Workflow Repositories, Workflow Systems
(Figure: scientists sharing workflows in a repository: "Let's share!", "I want to reuse…?", "I want to understand…?", "I want to repurpose…?")
5. Open research challenges
•Workflow representation heterogeneity
Workflow Repositories
How can we represent a description of workflows and their metadata?
How can we facilitate the homogeneous consumption of workflows and their resources?
6. Open research challenges
•Workflow representation heterogeneity
•Inadequate level of workflow abstraction
What are the most relevant parts of a workflow?
Are two seemingly disparate workflows related at a higher level of abstraction?
(Figure: two concrete workflows, Dataset → Porter Stemmer → Result → IDF → Final Result and Dataset → Lovins Stemmer → Result → Residual IDF → Final Result, next to their common abstraction Dataset → Stemmer → Result → Term Weighting → FinalResult)
7. Open research challenges
•Workflow representation heterogeneity
•Inadequate level of workflow abstraction
•Difficulties for workflow reuse
How is a workflow related to other workflows?
Which workflow (parts) are potentially useful for reuse?
8. Open research challenges
•Workflow representation heterogeneity
•Inadequate level of workflow abstraction
•Difficulties for workflow reuse
•Lack of support for workflow annotation
How can we facilitate the annotation process?
9. Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
10. Hypothesis
Scientific workflow repositories can be automatically analyzed to extract commonly occurring patterns and abstractions that are useful for workflow developers aiming to reuse existing workflows.
•H.1: It is possible to define a catalog of common domain independent patterns based on the common functionality of workflow steps.
•H.2: It is possible to detect commonly occurring patterns and abstractions automatically.
•H.3: Commonly occurring patterns are potentially useful for users designing workflows.
(Slide labels relating the hypotheses to the thesis areas: workflow abstraction, workflow representation, workflow reuse and workflow annotation)
11. Contributions
Workflow representation and publication
•Model for representing workflow templates and executions (OPMW)
•Methodology to publish workflows on the web (Linked Data)
Workflow abstraction
•A catalog of common domain independent workflow patterns based on the functionality of workflow steps (workflow motifs)
•A method to extract generic commonly occurring workflow fragments automatically (graph mining)
Workflow annotation
•A model and means for semi-automatically annotating the abstractions in workflows (Wf-motifs, Wf-fd)
Workflow reuse
•Metrics for assessing the usefulness of a fragment for reuse
•A model to describe and annotate workflow fragments
12. Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
a) Requirements
b) The OPMW model
c) Publishing workflows as Linked Data
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
14. Requirements
Workflow template description
• Plan: P-Plan [Garijo et al 2012], http://purl.org/net/p-plan
Workflow execution trace description
• Provenance: PROV (W3C) [Lebo et al 2013], http://www.w3.org/ns/prov#
Workflow attribution
• Dublin Core, PROV (W3C)
Workflow metadata
Link between templates and executions
Related models and languages: Scufl [Oinn et al 2004], DAX, AGWL [Fahringer et al 2005], Dispel [Atkinson et al 2013], IWIR [Plankensteiner et al 2005], OPM [Moreau et al 2011], OBI [Brinkman et al 2010], EXPO [Soldatova and King 2006], ISA [Rocca et al 2008], PAV [Ciccarese et al 2013], RO [Belhajjame et al 2012], D-PROV [Missier et al 2013]
16. Outline
1. Introduction and motivation
2. Hypothesis and work methodology
3. Workflow representation: OPMW
a) Requirements
b) The OPMW model
c) Publishing workflows as Linked Data
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
17. Publishing workflows as Linked Data
Why Linked Data?
•Facilitates exploitation of workflow resources in a homogeneous manner
Adapted methodology from [Villazón-Terrazas et al 2011]
Tested it for the Wings workflow system
Stage 1: Specification
Base URI = http://www.opmw.org/
Ontology URI = http://www.opmw.org/ontology/
Assertion URI = http://www.opmw.org/export/resource/ClassName/instanceName
Examples:
http://www.opmw.org/export/resource/WorkflowTemplate/ABSTRACTSUBWFDOCKING
http://www.opmw.org/export/resource/WorkflowExecutionAccount/ACCOUNT1348629350796
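As a small illustration of what this publication scheme enables, the sketch below dereferences one of the example template URIs with rdflib. It assumes the opmw.org server still serves RDF for these resources through content negotiation, which may no longer be the case.

```python
# Minimal sketch: consuming a workflow template published as Linked
# Data. Assumes the URI still dereferences to an RDF serialization.
from rdflib import Graph

TEMPLATE = ("http://www.opmw.org/export/resource/"
            "WorkflowTemplate/ABSTRACTSUBWFDOCKING")

g = Graph()
g.parse(TEMPLATE)  # rdflib content-negotiates an RDF representation

for s, p, o in g:  # print every retrieved statement
    print(s, p, o)
```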
18. Publishing workflows as Linked Data
Stage 2: Modeling
OPMW builds on P-Plan, OPM, PROV and Dublin Core (DC).
19. Publishing workflows as Linked Data
Stage 3: Generation
Workflow templates and workflow executions are exported from the workflow system as OPMW RDF.
20. Publishing workflows as Linked Data
Stage 4: Publication
The OPMW RDF is uploaded through an RDF upload interface into a triple store exposed via a SPARQL endpoint, plus a permanent web-accessible file store.
21. Publishing workflows as Linked Data
Stage 5: Exploitation
Resources can be consumed through curl, Linked Data browsers, the Workflow Explorer, or the SPARQL endpoint.
22. Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
a) A catalog of common workflow abstractions
b) Workflow reuse analysis
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
23. A catalog of common workflow abstractions
Generalization of workflow steps based on functionality.
Workflow motif: domain independent conceptual abstraction of the workflow steps.
1. Data-oriented motifs: What kind of manipulations does the workflow have?
•E.g.: data retrieval, data preparation, data curation, data visualization, etc.
24. A catalog of common workflow abstractions
2. Workflow-oriented motifs: How does the workflow perform its operations?
•E.g.: stateful steps, stateless steps, human interactions, etc.
25. Methodology for finding workflow motifs
Goal: Reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence
Collect workflows: 89 + 125 + 26 + 20 = 260 workflows
26. Methodology for finding workflow motifs
Preliminary workflow analysis, carried out by Researcher 1, Researcher 2 and Researcher 3
27. Methodology for finding workflow motifs
Agreement and cross validation
28. Result Summary
•Over 60% of the motifs are data preparation motifs
•Some differences are motivated by the workflow systems in the analysis
•Around 40% of workflows contain motifs related to workflow reuse (composite workflows, internal macros)
But how do users perceive workflow reuse? What about fragments of workflows?
29. Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
a) A catalog of common workflow abstractions
b) Workflow reuse survey
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
30. Use case: The LONI Pipeline
Workflow system for neuroimaging analysis
http://pipeline.loni.usc.edu/explore/library-navigator/
Process: discussions with scientists, user survey, collecting responses from users (21 responses), and discussion of the results
31. Summary results
The majority of users agree that reusing and sharing workflows is useful
Unlike workflows, reusing groupings from one's own work is more useful than reusing groupings from others
Most respondents agreed that groupings help simplify workflows. Groupings also make workflows more understandable by others
Can we detect groupings automatically?
32. Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
a) Corpus preparation
b) Graph mining
c) Fragment filtering
d) Fragment linking
6. Evaluation
7. Conclusions and future work
33. Workflow mining approaches
Clustering [Montani and Leonardi 2012], [García-Jiménez and Wilkinson 2014]
(Figure: a workflow corpus partitioned into clusters of similar workflows)
34. Workflow mining approaches
Topic modeling [Stoyanovich et al 2010]
(Figure: workflows assigned probability distributions over topics, e.g. P(Topic1)=0.7, P(Topic2)=0.3)
35. Workflow mining approaches
Case-based reasoning [Leake and Kendall-Morwick 2008], [Müller and Bergmann 2014]
(Figure: retrieving similar workflows from the corpus for a new case)
36. Workflow mining approaches
Log mining [van der Aalst et al 2003], [Gómez-Pérez and Corcho 2008]
(Figure: mining a workflow corpus from execution logs, e.g. against problem solving methods (PSM))
37. Workflow mining approaches
Clustering [Montani and Leonardi 2012], [García-Jiménez and Wilkinson 2014]
Topic modeling [Stoyanovich et al 2010]
Case-based reasoning [Leake and Kendall-Morwick 2008], [Müller and Bergmann 2014]
Log mining [van der Aalst et al 2003], [Gómez-Pérez and Corcho 2008]
Graph mining [Diamantini et al. 2012]
38. Workflow Mining in FragFlow
(Figure: the four FragFlow stages: 1. corpus preparation, 2. graph mining, 3. fragment filtering, 4. fragment linking)
39. Corpus Preparation
Workflows converted to Labeled Directed Acyclic Graphs (LDAG)
• The label of a node in the graph corresponds to the type of the step in the workflow
• Edges capture the dependencies between different steps
(Figure: the workflow Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult reduced to the labeled graph Stemmer algorithm → Term weighting algorithm)
Duplicated workflows are removed
Single-step workflows are removed
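A minimal sketch of this conversion (mine, not the FragFlow code), assuming a workflow arrives as a map from step identifiers to step types plus a list of data dependencies:

```python
# Corpus preparation sketch: a workflow becomes a labeled DAG whose
# node labels are step types and whose edges are step dependencies.
import networkx as nx

def to_labeled_dag(steps, dependencies):
    """steps: {step_id: step_type}; dependencies: iterable of (a, b),
    meaning step b consumes a data product generated by step a."""
    g = nx.DiGraph()
    for step_id, step_type in steps.items():
        g.add_node(step_id, label=step_type)
    g.add_edges_from(dependencies)
    assert nx.is_directed_acyclic_graph(g)
    return g

# The text-analytics example from the slide, with made-up step ids:
wf = to_labeled_dag(
    {"s1": "StemmerAlgorithm", "s2": "TermWeightingAlgorithm"},
    [("s1", "s2")],
)
print([wf.nodes[n]["label"] for n in nx.topological_sort(wf)])
```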
40. Graph Mining
We use popular graph mining techniques:
Inexact FSM: uses heuristics to calculate similarity between two graphs. The solution might not be complete
SUBDUE
• 2 heuristics: Minimum Description Length (MDL) and Size
Exact FSM: delivers all the possible fragments to be found in the dataset
gSpan
• Depth first search strategy
FSG
• Breadth first search strategy
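Full SUBDUE, gSpan or FSG implementations are beyond a slide, but the toy sketch below illustrates the seed iteration of an FSG-style breadth-first miner over labeled DAGs like the ones built above: counting the support of every labeled edge, i.e. every two-step fragment. Larger candidate fragments would then be grown one edge at a time and pruned by support; none of this is the FragFlow code itself.

```python
# Toy seed step of breadth-first frequent-subgraph mining: support
# counting for every labeled edge (two-step fragment) in the corpus.
from collections import Counter

def edge_fragment_support(corpus):
    """corpus: iterable of labeled DAGs (see to_labeled_dag above).
    Support = number of workflows in which a labeled edge appears."""
    support = Counter()
    for g in corpus:
        patterns = {(g.nodes[a]["label"], g.nodes[b]["label"])
                    for a, b in g.edges}
        support.update(patterns)  # count each workflow at most once
    return support
```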
41. Filtering Relevant Fragments
The number of resulting fragments can be very large. We distinguish:
Multistep fragments:
• More than one step
Filtered multistep fragments:
• Multistep fragments
• Contain all smaller fragments with the same number of occurrences
(Figure: example fragments F1-F4 built from Stemmer, Term Weighting, Filter, Sort and Query steps, found 4, 4, 10 and 3 times respectively)
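Read operationally, the rule discards a multistep fragment when a strictly larger fragment with the same number of occurrences already contains it. The sketch below is my simplification of that rule, treating fragments as multisets of labeled edges rather than full graphs, so it ignores graph structure beyond edge labels.

```python
# Hedged sketch of the fragment filter, not the FragFlow implementation.
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

Edge = Tuple[str, str]  # (source step label, target step label)

@dataclass
class Fragment:
    edges: List[Edge]
    n_steps: int
    occurrences: int

def contains(big: Fragment, small: Fragment) -> bool:
    """True if `big` includes every labeled edge of `small`."""
    b, s = Counter(big.edges), Counter(small.edges)
    return all(b[e] >= n for e, n in s.items())

def filter_multistep(fragments: List[Fragment]) -> List[Fragment]:
    multistep = [f for f in fragments if f.n_steps > 1]
    kept = []
    for f in multistep:
        subsumed = any(g is not f
                       and len(g.edges) > len(f.edges)   # strictly larger
                       and g.occurrences == f.occurrences
                       and contains(g, f)
                       for g in multistep)
        if not subsumed:
            kept.append(f)
    return kept
```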
42. Linking to the Corpus: Example
Workflow fragment description vocabulary: http://purl.org/net/wf-fd (extends P-Plan)
(Figure: Fragment1, a wffd:DetectedResultWorkflowFragment, is linked through wffd:foundAs to two wffd:TiedWorkflowFragment occurrences, Fragment1 in Wf1(1) and Fragment1 in Wf1(2), which are wffd:foundIn Workflow 1; the Stemmer, Term Weighting and Merge steps are p-plan:Steps connected by p-plan:isPrecededBy and attached to their plans with p-plan:isStepOfPlan)
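A hedged sketch of the triples such a linking step could emit, written with rdflib. The trailing '#' on the namespace URI, the example resource names and the directions of wffd:foundAs and wffd:foundIn are my reading of the figure labels, not a verified rendering of the wf-fd specification.

```python
# Sketch: describing a detected fragment and its occurrences with wf-fd.
from rdflib import Graph, Namespace, RDF

WFFD = Namespace("http://purl.org/net/wf-fd#")  # trailing '#' assumed
EX = Namespace("http://example.org/")           # hypothetical resources

g = Graph()
g.bind("wffd", WFFD)

# Fragment1 is a fragment detected by the miner...
g.add((EX.Fragment1, RDF.type, WFFD.DetectedResultWorkflowFragment))

# ...found twice in Workflow1, as two tied (bound) occurrences:
for occurrence in (EX.Fragment1_in_Wf1_1, EX.Fragment1_in_Wf1_2):
    g.add((occurrence, RDF.type, WFFD.TiedWorkflowFragment))
    g.add((EX.Fragment1, WFFD.foundAs, occurrence))
    g.add((occurrence, WFFD.foundIn, EX.Workflow1))

print(g.serialize(format="turtle"))
```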
43. Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
a) Finding generic motifs in workflows
b) Workflow fragment assessment
7. Conclusions and future work
44. Finding generic motifs in workflows
Research question: Can we find commonly occurring abstractions (composite workflows, internal macros)?
45. Finding generic motifs in workflows
Metrics used: precision and recall, comparing the detected fragments (F) against the manually annotated motifs (M)
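Spelled out (my phrasing; the slide only names the two sets), the standard definitions over the detected fragments F and the annotated motifs M would be:

$$\mathrm{precision} = \frac{|F \cap M|}{|F|} \qquad \mathrm{recall} = \frac{|F \cap M|}{|M|}$$

According to the editor's notes, precision and recall were also relaxed to an 80% similarity threshold so that near-matching fragments count as well.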
46. Finding generic motifs in workflows
Corpus: 22 templates from the same domain, annotated manually
Wings workflow corpus + domain knowledge
(Figure: the workflows Dataset → Porter Stemmer → Result → IDF → Final Result and Dataset → Lovins Stemmer → Result → Residual IDF → Final Result abstracted into Dataset → Stemmer → Result → Term Weighting → FinalResult by means of a component taxonomy: Stemmer subsumes Porter Stemmer and Lovins Stemmer; Term Weighting subsumes Inverse Document Frequency (IDF), Residual IDF and Query Term Weighting)
47. Finding generic motifs in workflows
Can we find commonly occurring abstractions? Results of the evaluation:
H.2: It is possible to detect commonly occurring patterns and abstractions automatically.
Internal macros:
• Inexact FSM: 2 out of 3 found (r=0.67); 4 out of 5 (r=0.8) when applying generalization
Composite workflows:
• Exact FSM: all motifs are found, although the precision is low (p=0.18)
48. Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
a) Finding generic motifs in workflows
b) Workflow fragment assessment
7. Conclusions and future work
49. Workflow fragment assessment
Research question: Are our proposed workflow fragments useful?
•A fragment is useful if it has been designed and (re)used by a user.
•Comparison between proposed fragments and user designed groupings and workflows
50. Workflow fragment assessment
Metrics: precision and recall, comparing the detected fragments (F) against user-defined workflows (W) and groupings (G)
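The slide only names the three sets; one plausible reading (an assumption on my part, since the exact matching criterion is not spelled out here) scores the detected fragments F against the union of the user-defined workflows W and groupings G:

$$\mathrm{precision} = \frac{|F \cap (W \cup G)|}{|F|} \qquad \mathrm{recall} = \frac{|F \cap (W \cup G)|}{|W \cup G|}$$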
51. Workflow fragment assessment
Workflow corpora
User Corpus 1 (WC1)
• Designed mostly by a single user
• 790 workflows (475 after data preparation)
User Corpus 2 (WC2)
• Created by a user, with collaborations of others
• 113 workflows (96 after data preparation)
Multi User Corpus 3 (WC3)
• Workflows submitted by 62 users during the month of Jan 2014
• 5859 workflows (357 after data preparation)
User Corpus 4 (WC4)
• Designed mostly by a single user
• 53 workflows (50 after data preparation)
52. Workflow fragment assessment
Result assessment
•30%-60% of proposed fragments are equal to user defined groupings or workflows
•40%-80% of proposed fragments are equal or similar to user defined groupings or workflows
H.3: Commonly occurring patterns are potentially useful for users designing workflows
What about the rest of the fragments? Are those useful?
53. Workflow fragment assessment
User feedback: user survey
Q1: Would you consider the proposed fragment a valuable grouping?
•I would not select it as a grouping (0)
•I would use it as a grouping with major changes (i.e., adding/removing more than 30% of the steps) (1)
•I would use it as a grouping with minor changes (i.e., adding/removing less than 30% of the steps) (2)
•I would use it as a grouping as it is (3)
Q2: What do you think about the complexity of the fragment?
•The fragment is too simple (0)
•The fragment is fine as it is (1)
•The fragment has too many steps (2)
Not enough evidence to state that all proposed workflow fragments are useful
54. Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
55. Conclusions: Results
H.1: It is possible to define a catalog of common domain independent patterns based on the common functionality of workflow steps.
Model for representing workflows (OPMW) and publishing them as Linked Data
• Daniel Garijo and Yolanda Gil. A new approach for publishing workflows: Abstractions, standards, and Linked Data. (WORKS'11)
Catalog of workflow motifs + workflow annotation
• Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis (extended version). Future Generation Computer Systems, 2013.
H.2: It is possible to detect commonly occurring patterns and abstractions automatically.
Graph mining approach + workflow generalization
• Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis. 8th IEEE International Conference on e-Science (eScience 2012)
• Daniel Garijo, Oscar Corcho and Yolanda Gil. Detecting common scientific workflow fragments using templates and execution provenance. Proceedings of the Seventh International Conference on Knowledge Capture (K-CAP 2013).
56. Conclusions: Results
H.3: Commonly occurring patterns are potentially useful for users designing workflows.
Graph mining approach + reusability metrics for assessment + workflow annotation
Reuse survey
• Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris Gutman, Ivo D. Dinov, Paul Thompson and Arthur W. Toga. FragFlow: Automated fragment detection in scientific workflows. 10th IEEE Conference on e-Science (eScience 2014)
• Daniel Garijo, Oscar Corcho, Yolanda Gil, Meredith N. Braskie, Dereck Hibar, Xie Hua, Neda Jahanshad, Paul Thompson and Arthur W. Toga. Workflow reuse in practice: A study of neuroimaging pipeline users. 10th IEEE Conference on e-Science (eScience 2014)
57. Conclusions: Impact and future work
Impact:
OPMW
•Workflow annotation [García-Jiménez and Wilkinson 2014b]
Motif catalog
•Expansion for distributed environments [Olabarriaga et al 2013]
•Workflow summarization [Alper et al 2013]
Future work:
•Towards workflow ecosystems [Garijo et al 2014] (WORKS'14)
58. Conclusions: Impact and future work
•Automatic detection of workflow abstractions
•Improvement of workflow reuse
Custom fragments
Ranking fragments
Suggestions of workflows
59. Date: 03/12/2015
Mining Abstractions in
Scientific Workflows
Daniel Garijo *
Supervisors: Oscar Corcho *, Yolanda Gil Ŧ
* Universidad Politécnica de Madrid,
Ŧ USC Information Sciences Institute
All materials are available as Research Objects
(with pointers to Figshare)
http://w3id.org/dgarijo/ro/mining-abstractions-in-scientific-wfs
61. Methodology
Problem: Workflow representation and publication (provenance, plan, publication)
• Approach: extension of existing standards and web technologies; methodology for publication
• Evaluation: requirement validation and user feedback; model competency question validation
Problem: Workflow abstraction and reuse
• Approach: empirical analysis of workflow corpora; workflow abstraction analysis for reuse
• Evaluation: agreement on a catalog of common abstractions
Problem: Automatic detection and annotation of workflow abstractions
• Approach: graph mining techniques, generalization
• Evaluation: precision, recall and user feedback
62. Provenance Models
"A record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing"
(PROV-DM: The PROV Data Model, W3C)
63. P-Plan model overview
(Figure: p-plan:Plan, subclass of prov:Plan, contains p-plan:Step (via p-plan:isStepOfPlan) and p-plan:Variable (via p-plan:isVariableOfPlan); steps are linked to variables through p-plan:hasInputVar and p-plan:isOutputVarOf and ordered with p-plan:isPrecededBy; execution activities and entities, extending prov:Activity and prov:Entity, are related by prov:used and prov:wasGeneratedBy and mapped to the plan with p-plan:correspondsToStep and p-plan:correspondsToVariable; statements are contained in a p-plan:Bundle, subclass of prov:Bundle)
64. Assumptions and restrictions
Restriction:
• Workflows are represented as directed acyclic graphs
Assumptions:
•Available workflow repositories exist for exploiting definitions of workflows and workflow executions.
•All the workflow steps can be assigned a label with their type.
•Two steps of a workflow with the same function have the same type.
•Researchers aim to reuse workflows and workflow fragments if they find them useful.
65. Other models for representing workflow instances, templates and executions
66. Publishing as LD
67. Data-Oriented Motifs
Data Retrieval
Data Preparation
Format Transformation
Input Augmentation and Output Splitting
Data Organisation
Data Analysis
Data Curation/Cleaning
Data Movement
Data Visualisation
(Slides 68-72 repeat this taxonomy, each build highlighting a different motif with an example workflow)
78. Result Summary: Data Oriented Motifs
•Over 60% of the motifs are data preparation motifs
•Some differences are motivated by the workflow systems in the analysis
•Data analysis is often the main functionality of the workflow
79. Result Summary: Workflow Oriented Motifs
• Around 40% composite workflows and internal macros
But how do users perceive workflow reuse? What about fragments of workflows?
80. Differences and commonalities of the workflow systems
•Data moving/retrieval, stateful interactions and human interaction steps are not present in Wings
•Web services (Taverna) versus software components (Wings)
•Wings has layered execution through Pegasus
•Data preparation steps are common in both systems
•Use of sub workflows is high
81. Reusing workflows…
According to the respondents, the major benefits of workflows include:
• Time savings
•Organizing and storing code
• Having a visualization of the overall analysis
• Facilitating reproducibility
82. Reusing groupings…
•Reuse is not the only reason why groupings are created. Unlike workflows, reusing groupings from one's own work is more useful than reusing groupings from others
•Most respondents agreed that groupings help simplify workflows. Groupings also make workflows more understandable by others
83. Graph Mining
We use popular graph mining techniques:
Inexact FSM: uses heuristics to calculate similarity between two graphs. The solution might not be complete
SUBDUE
• 2 heuristics: Minimum Description Length (MDL) and Size
• Frequency based
Exact FSM: delivers all the possible fragments to be found in the dataset
gSpan
• Depth first search strategy
• Support based
FSG
• Breadth first search strategy
• Support based
84. Linking to the Corpus: Workflow fragment description vocabulary
86. Conclusions: Limitations
L1: OPMW has been designed for data-intensive workflows (without loops or conditionals)
L2: When publishing as Linked Data, it is assumed that all resources will be made public (no privacy issues)
L3: Motif catalog may be expanded with additional motifs
L4: Size and time needed to calculate some workflow fragments
L5: A taxonomy of components is needed when generalizing workflows. This taxonomy is provided by domain experts modeling the domain.
Editor's Notes
Data driven, usually represented as Directed Acyclic Graphs (DAGs)
These are the points discussed with scientists, not the results of the user survey.
Sharing workflows with collaborators: Non-programmers face a barrier to running complex neuroimaging analyses, as they cannot create components or code at that level of complexity. Reusing workflows that others have created enables them to do tasks that they would not otherwise do.
Teaching: Breakpoints are often placed throughout the pipeline to serve as checkpoints and make sure that execution was performed correctly
Visualization: The hierarchical organization can be used to group functionally related tasks into a single visual element. This allows workflow developers to group complex tasks with highly-fragmented code into a single visual unit that other users can incorporate into their workflows
Modularity: Workflows provide a high-level view of the major steps involved in an analysis, and exposing those major steps drives the design of the code in a modular fashion
Representation: different types of workflows use different types of representation. Also, we miss the links to the resources associated with the workflow itself.
Reuse: workflow reused as part of other workflows. How?
Abstractions: are two seemingly disparate workflows related to each other?
Workflow template and instance: steps and their dependencies
Workflow execution trace: provenance of the results
Experiment metadata: specific methods, author contribution, etc.
P-Plan is simple and extensible (to cater to cases that require more complex wf operators)
Say that P-Plan has been used for describing scientific processes in social sciences and lab protocols
State that the focus is workflow description
Example of motif goes here on each side, instead of the big HOW and WHAT
In order to improve understandability, we have decided as a first step to identify what are the common operations in scientific workflows, by doing an empirical analysis over different domains.
There is existing work on this, but mainly tackles the structure of the workflow rather than the operation that is going on.
Thus, our approach has been to start without any initial definitions of motifs to find. Instead we have reverse-engineered the different steps in the workflows trying to create clusters with the most common motifs.
Corpus collection
Preliminary analysis of workflows
Discuss catalog of motifs
Find motifs in workflows
Cross validate annotations
Discussion until agreement
Workflow reuse is very important not because we say so, but because we have seen it in many of the workflows. Almost 40% of the workflows include some other workflow, and when they don't, they are very similar in many cases (we have just matched the exact available workflows). Internal macros, for instance, show how different parts of workflows repeat, which could lead to new workflow templates as well.
Explain some of the features of LONI. Grey circles are inputs, triangles outputs and blue circles components. The rectangles in dots are groupings.
Explain what a grouping is, and what it is for.
In general, workflows are considered generally more useful than groupings. On the other hand, more respondents said that groupings help make their code more modular and understandable
Can we automatically mine a repository of workflows to derive useful workflow fragments?
State that graph mining has only been tested recently
Clustering and topic modeling: good for stating similar workflows, but not necessarily mining common workflow fragments
Log mining: good for suggesting next steps, but bad for stating relationships among workflows
Case based reasoning: it is used for prediction mostly
Overview of the steps here. Say clearly that
Here explain what it means to capture a dependency between 2 steps: that a data product produced by the former is consumed by the latter.
Duplicated workflows are removed because if we have like 500 workflows that are the same, what we are going to find is that the common fragments are those repeated workflows themselves.
Explain in detail support based versus frequency based techniques!
Explain DFS versus BFS strategies! (don't go into too much detail).
The number of fragments can be up to millions when the common parts are of size >10.
Mention that this is done by issuing SPARQL queries to link the fragments, like the one in the figure, to the corpus
This indicates that our fragments are commonly occurring and generic. It also indicates that the rest of the fragments could be useful for users.
High recall -> expected value
State the expected value!! (High precision)
Wf ecosystems: most of the work is towards making wfs executable on other places, but forgetting about all the other apps that use and consume the wfs at different granularities.
Wf abstractions: being able to generate a domain taxonomy automatically. Detecting automatically some of the rest of the motifs.
Improvement of workflow reuse: by proposing rankings, improving the interfaces and in general exploiting directly all the data that we can discover with the thesis proposed here.
Genomics workflow - Using Biomart and EMBOSS services, this workflow retrieves a number of sequences from 3 species: mouse, human, rat; aligns them through multiple sequence alignment, and returns a plot of the alignment result. Corresponding sequence ids are also returned.
Heliophysics workflow (we counted it as astronomy) - This is a fragment of a workflow that uses several input augmentation motifs in order to create a query to be sent to the Helio Feature Catalog service to retrieve the active regions on the solar surface for a given period of time
Describe briefly each of the data oriented motifs
This workflow calculates QSAR (Quantitative structure–activity relationship) properties of a compound and saves them as a CSV file. The molecules are read iteratively from an SDF file. Additionally, it writes out the molecules with unknown atom types, salt counter ions, the curated molecule library with UUIDs, and the calculation time used by every QSAR descriptor as a CSV file. Furthermore, explicit hydrogens are added and a Hueckel aromaticity detection is performed.
Cheminformatics workflow - It curates the structural information regarding a compound that is provided in Structure Data Format (SDF) file format. This workflow generates atom signatures for individual compounds given the SDF file as input
Describe briefly each of the data oriented motifs
This workflow performs an NCBI BLAST at the EBI. It uses the new EBI services, which are asynchronous and require multiple invocations, repeatedly invoking the getStatus sub-workflow until the BLAST job is complete. So the BLAST is actually undertaken by 3 calls: RUN+GET_STATUS+GET_RESULT
Explain briefly the workflow oriented motifs:
Scientific workflows Msc course workflow - This workflow fetches the details of the countries in the world and then uses R to produce a histogram of the log of their population
Explain briefly the workflow oriented motifs:
Text-Analytics workflow- it is used for reading natural language text found within files with specified extensions in the specified directory
The first fact that we discovered about the workflows is that over 60% of the motifs in each domain are data preparation motifs. In fact, input augmentation, output splitting and reformatting steps are the most common in most workflows. This is very important because it tells us how many intermediate processing steps are in the workflow. These steps are often not relevant for explaining the functionality of the workflow. Another relevant thing to show is that between 10-15% of the motifs are data analysis. This is very important, since this is often the main step of the workflow, its main functionality. If there is only 15%, it means that the workflow could have been much smaller.
What we also noticed with this analysis is that these two workflow systems are essentially very similar. They share all the motifs except for data moving/retrieval, since Wings uses Pegasus and its infrastructure for that; stateful interactions (since Wings is oriented to use scripts and tools rather than web services); and human interaction steps. We noticed during the analysis that the typing of data often helps avoid certain intermediate steps. Workflow reuse is high in both systems, as we stated previously.
For Goal 1, we have to say that we also relaxed the first evaluation's precision and recall to 80 percent to see if similar fragments were found as well.
For Goal 3, there is nothing to quantify.