2. Storyline
• Why Research Objects?
• Overview of the Research Objects
• Portfolio of Research Object Management
Tools
• Provenance Distillation Through Workflow
Summarization
4. Electronic papers are not enough
[Figure: a Research Object aggregating the Hypothesis, Experiments, Datasets, Results, Annotations, Provenance, and Scientists around the electronic paper.]
5. Benefits Of Research Objects
• A research object aggregates all elements that are necessary to understand a research investigation
• Methods (experiments) are viewed as first-class citizens
• Promote reuse
• Enable verification of the reproducibility of results
7. 47 of 53 "landmark" publications could not be replicated, owing to inadequate cell lines and animal models (Nature, 483, 2012)
Credit to Carole Goble, JCDL 2012 Keynote
8. Overview of the Research Object Model
Fig. 2. Overview of the Research Object Model with the different ontologies that encode it: Core RO for describing the basic Research Object structure, roevo for tracking Research Object evolution, and wfprov and wfdesc for describing workflows and their executions.
9. Research Object as an ORE Aggregation
Fig. 3. Research Object as an ORE aggregation.
An annotation relates the resource that is subject to annotation and ao:Body, which comprises a description of the target. Research Objects use annotations as a means for decorating a resource (or a set of resources) with metadata information. The body is specified in the annotation, for example:
@base <http://example.com/ro/389/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix ore: <http://www.openarchives.org/ore/terms/> .
10. Scientific Workflows
• Data-driven analysis pipelines
• Systematic gathering of data and
analysis tools into computational
solutions for scientific problem-solving
• Tools for automating frequently
performed data intensive activities
• Provenance for the resulting datasets
– The method followed
– The resources used
– The datasets used
11. PROV Primer, Gil et al
WF Execution Trace
Retrospective Provenance:
Actual data used, actual
invocations, timestamps and data
derivation trace
WF Description
Prospective Provenance:
Intended method for analysis
16. Workflows can get complex!
• Overwhelming for users who are not
the developers
• Abstractions required for reporting
• Lineage queries result in very long
trails
25. Analysis Data Set
• 30 Workflows from the Taverna system
• Entire dataset & queries accessible from
http://www.myexperiment.org/packs/467.html
• Manual Annotation using Motif Vocabulary
28. Highlights
• Research Object model and associated management tools
• Annotation of workflows using motifs
• Methods for summarizing workflows and distilling their provenance traces
• Algorithms for repairing workflows
• Validation of the workflow summarization
• Querying of workflow execution provenance using summaries
Ongoing Work
29. References
• P. Alper, K. Belhajjame, C. Goble, P. Karagoz. Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotations. IEEE International Congress on Big Data, 2013.
• P. Alper, K. Belhajjame, C. A. Goble, P. Karagoz. Enhancing and Abstracting Scientific Workflow Provenance for Data Publishing. EDBT/ICDT Workshops, 2013.
• K. Belhajjame, O. Corcho, D. Garijo, et al. Workflow-Centric Research Objects: A First Class Citizen in the Scholarly Discourse. In Proceedings of the ESWC 2012 Workshop on the Future of Scholarly Communication in the Semantic Web (SePublica 2012), Heraklion, Greece, May 2012.
• K. Belhajjame, J. Zhao, D. Garijo, et al. The Research Object Suite of Ontologies: Sharing and Exchanging Research Data and Methods on the Open Web. Submitted to the Journal of Web Semantics.
• D. Garijo, P. Alper, K. Belhajjame, Ó. Corcho, Y. Gil, C. A. Goble. Common Motifs in Scientific Workflows: An Empirical Analysis. eScience 2012.
• J. Zhao, J. M. Gómez-Pérez, K. Belhajjame, G. Klyne, et al. Why Workflows Break: Understanding and Combating Decay in Taverna Workflows. IEEE eScience 2012.
Research is increasingly digital. Most research results are disseminated in the form of electronic papers through traditional communication channels, such as conferences and journals, or through newer mediums such as microblogging. While electronic papers have played and continue to play a primordial role in the dissemination of research results, researchers now recognize that they are by no means sufficient to communicate and share information about research investigations. Indeed, the hypothesis investigated during the research, the experiment designed to assess the validity of the hypothesis, the process (workflow) used to run the experiment, the datasets used and the results produced by the experiment, and the conclusions drawn by the scientist are all elements that may be needed to understand a previous investigation, assess its claims, or re-use its results. In the context of the Wf4Ever project, we are investigating a new abstraction, which we name research objects, that aggregates all these elements. Research objects provide an entry point to the elements that are necessary or useful to understand and reuse research results. A particular feature of the research objects that we are studying is that they contain workflows that specify and implement data-intensive scientific experiments.
In this age of data-intensive science we are witnessing the unprecedented generation and sharing of large scientific datasets, where the pace of data generation has far surpassed the pace of analysis over that data. Scientific workflows [6] are a recent but very popular method for task automation and resource integration. Using workflows, scientists are able to systematically weave datasets and analytical tools into pipelines, represented as networks of data processing operations connected with dataflow links. (Figure 1 illustrates a workflow from genomics, which, from a given set of gene ids, retrieves corresponding enzyme ids, finds the biological pathways involving them, and then, for each pathway, retrieves its diagram with a designated coloring scheme.) As well as being automation pipelines, workflows are of paramount importance for the provenance of the data generated from their execution [6]. Provenance refers to data's derivation history starting from the original sources, namely its lineage.
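The pipeline idea can be sketched as ordinary function composition. This is an illustrative toy, not the actual Taverna workflow: the step names loosely follow the genomics example, and every function body is a placeholder for a real retrieval or analysis service.

```python
# Illustrative sketch of a dataflow pipeline: each step consumes the
# previous step's outputs. Function names mimic the genomics example;
# their bodies are hypothetical stand-ins for web-service calls.

def get_enzymes_by_gene(gene_ids):
    # Placeholder for a data-retrieval step.
    return [f"enzyme-of-{g}" for g in gene_ids]

def get_pathways_by_enzyme(enzyme_ids):
    return [f"pathway-of-{e}" for e in enzyme_ids]

def get_pathway_diagram(pathway_id):
    # Placeholder for retrieving a pathway diagram.
    return f"diagram({pathway_id})"

def run_pipeline(gene_ids):
    # The dataflow links: retrieval -> retrieval -> per-item retrieval.
    enzymes = get_enzymes_by_gene(gene_ids)
    pathways = get_pathways_by_enzyme(enzymes)
    return [get_pathway_diagram(p) for p in pathways]
```

In a real workflow system the same structure is declared as a graph of operations and dataflow links rather than hard-coded calls, which is what makes the method machine-readable.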
By explicitly outlining the analysis process, workflow descriptions make up the prospective part of provenance. The instrumented execution of the workflow generates a veracious and elaborate data lineage trail outlining the derivation relations among resultant, intermediary, and initial data artifacts, which makes up the retrospective part of provenance.
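The prospective/retrospective split can be made concrete with a small sketch: the step list is the prospective description, and the trace recorded during execution is the retrospective provenance. The record format below is our own simplification, not the PROV, wfdesc, or wfprov vocabularies.

```python
# Sketch: instrumented execution of a workflow description.
# "workflow" (the step list) is prospective provenance; "trace"
# (what actually ran, on what data, when) is retrospective provenance.
import time

def execute(workflow, inputs):
    """workflow: ordered list of (step_name, fn) pairs."""
    trace = []
    data = inputs
    for step_name, fn in workflow:
        out = fn(data)
        trace.append({
            "step": step_name,    # actual invocation
            "used": data,         # actual data consumed
            "generated": out,     # derived artifact
            "time": time.time(),  # timestamp
        })
        data = out
    return data, trace
```

Each trace entry links a generated artifact back to the data it was derived from, which is exactly the derivation trail that lineage queries walk over.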
Example of a research object as a pack in myExperiment (http://alpha.myexperiment.org/packs/405.html) for a Genome-Wide Association Study: the workflow is broken. The workflow was repaired by the user and uploaded as part of the new pack. Once the user uploads the workflow, it is transformed into a format that is compatible with wfdesc. The idea here is that, regardless of the workflow language, the workflow is transformed into a simple workflow language, which in this case is wfdesc. You can also download workflow runs, and on this page you will have a summary of their characterization. You can browse the workflow in the RO portal, which is the back-end library that stores information about the research objects. Here you can find information about the evolution of the research object, in particular, which pack was used to create which pack.
Workflows are typically complex artifacts containing several processing steps and dataflow links. Complexity is characterized by the environment in which workflows operate. Obfuscation degrades the reportability of the scientific intent embedded in the workflow. Complex workflows, combined with the faithful collection of execution provenance for each step, result in even more complex provenance traces containing long data trails that overwhelm the users who explore or query them [3], [1].
Scientific Workflow Motifs. At the heart of our approach lies the exploitation of information on the data-processing nature of activities in workflows. We propose that this information is provided through semantic annotations, which we call motifs. We use a light-weight ontology, which we developed in previous work [8] based on an analysis of 200+ workflows (accessible at http://purl.org/net/wf-motifs). The motif ontology characterizes operations with respect to their Data-Oriented functional nature and their Resource-Oriented implementation nature. We refer the reader to [8] for the details; here we briefly introduce some of the motifs in light of our running example.
• Data-Oriented nature. Certain activities in workflows, such as DataRetrieval, Analysis, or Visualization, perform the scientific heavy lifting in a workflow. The gene annotation pipeline in Figure 1 collects data from various resources through several retrieval steps (see the "get enzymes by gene" and "get genes by enzyme" operations). The data retrieval steps are pipelined to each other through the use of adapter steps, which are categorized with the DataPreparation motif. Augmentation is a sub-category of preparation operations: it decorates data with resource- or protocol-specific padding or formatting. (The sub-workflow "Workflow89" in our example is an augmenter that builds a well-formed query request out of given parameters.) Extraction activities perform the inverse of augmentation; they extract data from raw results returned from analyses or retrievals (e.g. SOAP XML messages). Splitting and Merging are also sub-categories of data preparation; they consume and produce the same data but change its cardinality (see the "Flatten List" steps in our example). Another general category is DataOrganization activities, which perform querying and organization functions over data, such as Filtering, Joining, and Grouping. The "Remove String Duplicates" method in our example is an instance of the filtering motif.
Another frequent motif is DataMoving. In order to make data accessible to external analysis tools, data may need to be moved in and out of the workflow execution environment; such activities fall into this category (e.g. upload to web, write to file, etc.). The "Get Image From URL 2" operation in our genomics workflow is an instance of this motif.
• Resource-Oriented nature, which outlines the implementation aspects and reflects the characteristics of the resources that underpin the operation. Classifications in this category are: HumanInteraction vs. entirely Computational steps, Stateful vs. Stateless invocations, and the use of Internal (e.g. local scripts, sub-workflows) vs. External (e.g. web services, or external command-line tools) computational artifacts. In our example, all data retrieval operations are realized by external, stateless resources (web services), whereas all data preparation operations are realized with internal resources (local scripts and sub-workflows).
A dataflow decorated with motif annotations is a 3-tuple W = ⟨N, E, motifs⟩, where motifs is a function that maps each graph node n ∈ N, be it an operation or a port, to its motifs. Operation nodes can have motif annotations from the operation and resource perspectives, and port nodes can have annotations from the data perspective (omitted in this paper due to space limitations). As an example, we illustrate below two motif annotations of two operations in our running example. It is worth mentioning that a given node can be associated with multiple motifs.
motifs(color pathway by objects) = {⟨m1 : Augmenter⟩}
motifs(Get Image From URL 2) = {⟨m2 : DataRetrieval⟩}
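The 3-tuple can be written down directly as a plain data structure. The node names and motif labels below come from the running example; the edge between the two operations is assumed purely for illustration, and port-level annotations are omitted as in the text.

```python
# Sketch of W = <N, E, motifs> as a Python structure. The single
# edge shown here is a hypothetical dataflow link for illustration.

workflow = {
    "N": {"color pathway by objects", "Get Image From URL 2"},
    "E": {("color pathway by objects", "Get Image From URL 2")},
    "motifs": {
        "color pathway by objects": {"Augmenter"},
        "Get Image From URL 2": {"DataRetrieval"},
    },
}

def motifs(w, node):
    # A node may carry multiple motifs; unannotated nodes map to none.
    return w["motifs"].get(node, set())
```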
2) Elimination: We define the function eliminate_op : ⟨W, op⟩ → ⟨W′, M⟩ as one which takes an annotated workflow definition and eliminates a designated operation node from it. Elimination results in a set of indirect data links connecting the predecessors of the deleted operation to its successors. The source and target ports of these new indirect links are the Cartesian product of the to-be-deleted operation's in-links' sources and its out-links' targets. As with the other primitives, elimination also returns a set of mappings between the ports of the resulting workflow and those of the input workflow.
Motif Behavior: The designated operation is eliminated together with its motifs, and these are not propagated to any other node in the workflow graph.
Constraint Checks: There are no particular constraint checks for the elimination primitive.
Dataflow Characteristic: Within the conventional definition of a workflow, ports are connected with direct data links, which correspond to data transfers, i.e. the data artifact appearing at the source port is transferred to the target port as-is. In the indirect-link case, however, there is no direct data transfer, as the source and target artifacts are dissimilar and some "eliminated" data processing occurs in between.
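A minimal sketch of the elimination primitive, under a simplified graph model: only operation nodes and direct edges are kept, so the port-level detail and the returned port mapping M are omitted.

```python
# Eliminate one operation node: remove it and wire each predecessor
# to each successor with an indirect link. The indirect links are the
# Cartesian product of the deleted node's in-link sources and
# out-link targets, as in the primitive's definition.

def eliminate_op(nodes, edges, op):
    preds = {s for (s, t) in edges if t == op}
    succs = {t for (s, t) in edges if s == op}
    kept = {(s, t) for (s, t) in edges if op not in (s, t)}
    indirect = {(p, q) for p in preds for q in succs}  # Cartesian product
    return nodes - {op}, kept | indirect
```

For the chain a → b → c, eliminating b leaves nodes {a, c} connected by the single indirect link (a, c).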
Summary-By-Elimination: Requires a minimal amount of annotation effort and operates with a single rule. We classify only the scientifically significant operations with their motifs, and all remaining operations in the workflow are designated as DataPreparation steps. The following reduction rule is then used to summarize the workflow graph W.
Summary-By-Collapse: Requires annotating the workflow with more specific motif classes and a more fine-grained set of rules. Consequently, annotations on data preparation steps are more specific, designating what kind of preparation activity is occurring (e.g. Merging, Splitting, Augmentation, etc.). We use these annotations to remove those data preparation steps from the workflow with the collapse primitive, collapsing in the direction that retains the ports carrying the more reporting-friendly artifacts. We collapse Augmentation, Splitting, and ReferenceGeneration operations downstream, as their outputs are less favored than their inputs, whereas we collapse Extraction, Merging, ReferenceResolution, and all DataOrganization steps upstream, as their outputs are preferred over their inputs in a summary. Two rules in this approach are given below:
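Summary-By-Elimination can be sketched under the same simplified graph model: every node whose motifs mark it as a DataPreparation step is eliminated, wiring its predecessors to its successors. This is an illustrative toy, not the paper's implementation.

```python
# Summary-By-Elimination sketch: eliminate every DataPreparation node.
# The eliminate primitive is restated here so the sketch is self-contained.

def eliminate_op(nodes, edges, op):
    preds = {s for (s, t) in edges if t == op}
    succs = {t for (s, t) in edges if s == op}
    kept = {(s, t) for (s, t) in edges if op not in (s, t)}
    return nodes - {op}, kept | {(p, q) for p in preds for q in succs}

def summarize_by_elimination(nodes, edges, motifs):
    # "motifs" maps node -> set of motif labels; unannotated nodes keep
    # whatever default the annotator assigned (here: none).
    for op in [n for n in nodes if "DataPreparation" in motifs.get(n, set())]:
        nodes, edges = eliminate_op(nodes, edges, op)
    return nodes, edges
```

On a chain retrieval → preparation → analysis, the summary keeps only the retrieval and analysis steps, joined by an indirect link.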
In our evaluation we annotated the workflow corpus manually using the motif ontology. The effort that goes into the design of workflows can be significant (up to several days for a workflow). Compared to design, the cost of annotation is minor, as it amounts to setting a single attribute per operation. Annotation can be (semi-)automated through the application of mining techniques to workflows and execution traces. It is also possible to pool annotations in service registries or module libraries and automatically propagate annotations to operations that are re-used from the registry. Instead of the motif ontology, it is also possible to use other ontologies [10] to characterize operations in workflows.
Motifs are used in conjunction with reduction primitives to build summarization rules. The objective of the rules is to determine how to reduce the workflow when certain motifs, or combinations of motifs, are encountered. We propose graph transformation rules of the form L → P, where L specifies a pattern to be matched in the host graph, expressed in terms of a workflow-specific graph model and motif decorations, and P specifies a workflow reduction primitive to be enacted upon the node(s) identified by the left-hand side L. The primitive P captures both the replacement graph pattern and how the replacement is to be embedded into the host graph. Our approach is a more controlled, primitive-based re-writing mechanism, rather than a fully declarative one (as in traditional graph re-writing). This allows us to: 1) preserve the data causality relations among operations, together with the constraints that such relations have (e.g. acyclicity for scientific dataflows); and 2) let the user specify high-level reduction policies by creating simple ⟨motif, primitive⟩ pairs. It would be a big undertaking to expect users, who are not computer scientists or software engineers, to specify re-writes in a fully declarative manner (i.e. a matching pattern, a replacement pattern, and embedding information).
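The ⟨motif, primitive⟩ pairing can be sketched directly: the pattern side L degenerates to "node carries this motif", and P is a reduction primitive passed as a function. This is our own simplified reading of the rule mechanism, with elimination standing in as the only primitive.

```python
# Rule application sketch: apply <motif, primitive> pairs until no
# node matches any rule. A minimal eliminate primitive serves as P.

def eliminate_op(nodes, edges, op):
    preds = {s for (s, t) in edges if t == op}
    succs = {t for (s, t) in edges if s == op}
    kept = {(s, t) for (s, t) in edges if op not in (s, t)}
    return nodes - {op}, kept | {(p, q) for p in preds for q in succs}

def apply_rules(nodes, edges, motifs, rules):
    """rules: list of (motif_name, primitive) pairs,
    e.g. [("DataPreparation", eliminate_op)]."""
    changed = True
    while changed:
        changed = False
        for motif_name, primitive in rules:
            # L: find a node decorated with the rule's motif.
            match = next((n for n in nodes
                          if motif_name in motifs.get(n, set())), None)
            if match is not None:
                # P: enact the reduction primitive on the matched node.
                nodes, edges = primitive(nodes, edges, match)
                changed = True
    return nodes, edges
```

Because each rule is just a pair, a user can state a reduction policy without writing full replacement patterns or embedding information.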