Credible workshop

Presentation of Research Objects given at the CrEDIBLE workshop

  • Research is increasingly digital. Most research results are disseminated in the form of electronic papers through traditional communication channels, such as conferences and journals, or through new media such as microblogging. While electronic papers have played, and continue to play, a primordial role in the dissemination of research results, researchers now recognize that they are by no means sufficient to communicate and share information about research investigations. Indeed, the hypothesis investigated during the research, the experiment designed to assess the validity of the hypothesis, the process (workflow) used to run the experiment, the datasets used and the results produced by the experiment, and the conclusions drawn by the scientist are all elements that may be needed to understand or assess the claim, or to re-use the results of previous research investigations. In the context of the Wf4Ever project, we are investigating a new abstraction that we name research objects, which aggregate all these elements. Research objects provide an entry point to the elements that are necessary or useful to understand and reuse research results. A particular feature of the research objects we are studying is that they contain workflows that specify and implement data-intensive scientific experiments.
  • In this age of data-intensive science we are witnessing the unprecedented generation and sharing of large scientific datasets, where the pace of data generation has far surpassed the pace of conducting analysis over the data. Scientific workflows [6] are a recent but very popular method for task automation and resource integration. Using workflows, scientists are able to systematically weave datasets and analytical tools into pipelines, represented as networks of data processing operations connected by dataflow links. (Figure 1 illustrates a workflow from genomics, which, "from a given set of gene ids, retrieves corresponding enzyme ids and finds the biological pathways involving them, then for each pathway retrieves its diagram with a designated coloring scheme".) As well as being automation pipelines, workflows are of paramount importance for the provenance of the data generated by their execution [6]. Provenance refers to a piece of data's derivation history starting from the original sources, namely its lineage.
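The notion of a workflow as a network of operations connected by dataflow links can be sketched in a few lines of plain Python. This is an illustrative toy (not the Taverna or Wf4Ever API), and the operation names and lambdas mirror the gene-to-pathway running example only loosely:

```python
# A minimal sketch of a scientific workflow as a network of named
# operations connected by dataflow links. Only linear pipelines are
# handled; operation names are hypothetical stand-ins for the example.
from dataclasses import dataclass, field

@dataclass
class Operation:
    name: str
    func: callable

@dataclass
class Workflow:
    operations: dict = field(default_factory=dict)  # name -> Operation
    links: list = field(default_factory=list)       # (source, target) dataflow links

    def add(self, name, func):
        self.operations[name] = Operation(name, func)

    def link(self, source, target):
        self.links.append((source, target))

    def run(self, start, data):
        """Follow the chain of dataflow links from `start`, in insertion order."""
        current, result = start, self.operations[start].func(data)
        for src, tgt in self.links:
            if src == current:
                result = self.operations[tgt].func(result)
                current = tgt
        return result

wf = Workflow()
wf.add("get_enzymes_by_gene", lambda genes: [g + ":enzyme" for g in genes])
wf.add("get_pathways_by_enzyme", lambda enzymes: [e + ":pathway" for e in enzymes])
wf.link("get_enzymes_by_gene", "get_pathways_by_enzyme")
result = wf.run("get_enzymes_by_gene", ["eco:b0002"])
```

The point of the sketch is the structure: datasets flow along explicit links between processing operations, which is what later makes both summarization and provenance capture possible.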
  • By explicitly outlining the analysis process, workflow descriptions make up the prospective part of provenance. The instrumented execution of the workflow generates a veracious and elaborate data lineage trail outlining the derivation relations among resultant, intermediary, and initial data artifacts, which makes up the retrospective part of provenance.
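The prospective/retrospective split described above can be sketched as follows. The record fields (`used`, `generated`, `time`) are illustrative names only, not the wfprov/wfdesc vocabularies:

```python
# Sketch of the two parts of provenance: the workflow description is the
# prospective part; an instrumented execution appends retrospective trace
# records linking each step's inputs to its outputs. Names are illustrative.
import datetime

# Prospective provenance: the intended method, as dataflow links.
prospective = [("get_enzymes_by_gene", "get_pathways_by_enzyme")]

# Retrospective provenance: actual invocations with data derivations.
trace = []

def instrumented(step_name, func, inputs):
    """Run one step and record what it used and generated, with a timestamp."""
    outputs = func(inputs)
    trace.append({
        "step": step_name,
        "used": inputs,
        "generated": outputs,
        "time": datetime.datetime.now().isoformat(),
    })
    return outputs

enzymes = instrumented("get_enzymes_by_gene",
                       lambda g: [x + ":enzyme" for x in g],
                       ["eco:b0002"])
```

Querying `trace` then answers lineage questions ("which inputs produced this artifact?") that the prospective description alone cannot.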
  • http://alpha.myexperiment.org/packs/405.html
0.41 Example of a research object, as a pack in myExperiment, for a Genome-Wide Association Study: the workflow is broken.
2.33 The workflow was repaired by the user and uploaded as part of the new pack. Once the user uploads the workflow, it is transformed into a format that is compatible with wfdesc. The idea here is that, regardless of the workflow language, the workflow is transformed into a simple workflow language, which in this case is wfdesc.
7.24 You can also download workflow runs, and on this page you will have a summary of their characterization.
9.18 You can browse the workflow in the RO portal, which is the back-end library that stores information about the research objects. Here you can find information about the evolution of the research object, in particular which pack was used to create which pack.
  • Workflows are typically complex artifacts containing several processing steps and dataflow links. Complexity is characterized by the environment in which workflows operate. Obfuscation degrades the reportability of the scientific intent embedded in the workflow. Complex workflows, combined with the faithful collection of execution provenance for each step, result in even more complex provenance traces containing long data trails that overwhelm users who explore or query them [3], [1].
  • Scientific Workflow Motifs. At the heart of our approach lies the exploitation of information on the data-processing nature of activities in workflows. We propose that this information is provided through semantic annotations, which we call motifs. We use a lightweight ontology, which we developed in previous work [8] based on an analysis of 200+ workflows (accessible at http://purl.org/net/wf-motifs). The motif ontology characterizes operations with respect to their Data-Oriented functional nature and their Resource-Oriented implementation nature. We refer the reader to [8] for the details; here we briefly introduce some of the motifs in light of our running example.
• Data-Oriented nature. Certain activities in workflows, such as DataRetrieval, Analysis, or Visualization, perform the scientific heavy lifting in a workflow. The gene annotation pipeline in Figure 1 collects data from various resources through several retrieval steps (see the "get enzymes by gene" and "get genes by enzyme" operations). The data retrieval steps are pipelined to each other through the use of adapter steps, which are categorized with the DataPreparation motif. Augmentation is a sub-category of preparation operations: augmentation decorates data with resource- or protocol-specific padding or formatting. (The sub-workflow "Workflow89" in our example is an augmenter that builds a well-formed query request out of given parameters.) Extraction activities perform the inverse of augmentation; they extract data from the raw results returned from analyses or retrievals (e.g. SOAP XML messages). Splitting and Merging are also sub-categories of data preparation; they consume and produce the same data but change its cardinality (see the "Flatten List" steps in our example). Another general category is DataOrganization activities, which perform querying and organization functions over data, such as Filtering, Joining, and Grouping. The "Remove String Duplicates" method in our example is an instance of the Filtering motif.
Another frequent motif is DataMoving. In order to make data accessible to external analysis tools, data may need to be moved in and out of the workflow execution environment; such activities fall into this category (e.g. upload to web, write to file, etc.). The "Get Image From URL 2" operation in our genomics workflow is an instance of this motif.
• Resource-Oriented nature, which outlines the implementation aspects and reflects the characteristics of the resources that underpin the operation. Classifications in this category are: Human-Interaction vs. entirely Computational steps, Stateful vs. Stateless invocations, and the use of Internal (e.g. local scripts, sub-workflows) vs. External (e.g. web services or external command-line tools) computational artifacts. In our example, all data retrieval operations are realized by external, stateless resources (web services), whereas all data preparation operations are realized with internal artifacts.
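The two-perspective classification above can be pictured as a simple lookup table. The dictionary below is an assumption-laden sketch using motif names from the running example, not the ontology published at http://purl.org/net/wf-motifs itself:

```python
# Illustrative mapping from operations of the running example to motif
# categories, from both the Data-Oriented and Resource-Oriented
# perspectives. The table and the default are sketches, not the ontology.
motifs = {
    "get_enzymes_by_gene":      {"data": "DataRetrieval", "resource": "External/Stateless"},
    "Workflow89":               {"data": "Augmentation",  "resource": "Internal"},
    "Flatten_List":             {"data": "Splitting",     "resource": "Internal"},
    "Remove_String_Duplicates": {"data": "Filtering",     "resource": "Internal"},
    "Get_Image_From_URL_2":     {"data": "DataMoving",    "resource": "External"},
}

def data_motif(op):
    """Look up an operation's data-oriented motif; unannotated operations
    default to DataPreparation (an assumption borrowed from the
    summary-by-elimination strategy)."""
    return motifs.get(op, {}).get("data", "DataPreparation")
```

A single attribute per operation is all the annotation this requires, which is what keeps the annotation cost low relative to workflow design.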
  • A dataflow decorated with motif annotations is a 3-tuple W = ⟨N, E, motifs⟩, where motifs is a function that maps each graph node n ∈ N, be it an operation or a port, to its motifs. Operation nodes can have motif annotations from the operation and resource perspectives, and port nodes can have annotations from the data perspective (omitted in this paper due to space limitations). As an example, we illustrate below two motif annotations of two operations in our running example. It is worth mentioning that a given node can be associated with multiple motifs.
motifs(color_pathway_by_objects) = {⟨m1 : Augmenter⟩}
motifs(Get_Image_From_URL_2) = {⟨m2 : DataRetrieval⟩}
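The 3-tuple W = ⟨N, E, motifs⟩ can be rendered directly in plain Python. This encoding is illustrative rather than the paper's formal model; node and motif names follow the two annotations just given:

```python
# The annotated dataflow W = (N, E, motifs): node set, dataflow edges,
# and a motifs function mapping each node to a (possibly multi-element)
# set of motifs. An illustrative encoding, not the formal model.
N = ["color_pathway_by_objects", "Get_Image_From_URL_2"]
E = [("color_pathway_by_objects", "Get_Image_From_URL_2")]
motif_map = {
    "color_pathway_by_objects": {"Augmenter"},
    "Get_Image_From_URL_2": {"DataRetrieval"},
}
W = (N, E, motif_map)

def motifs(node):
    """The motifs function of W: nodes without annotations map to the empty set."""
    return W[2].get(node, set())
```

Using sets as the function's range is deliberate: it accommodates the remark that a single node can carry multiple motifs.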
  • Elimination: We define the function eliminate_op : ⟨W, op⟩ → ⟨W′, M⟩ as one which takes an annotated workflow definition and eliminates a designated operation node from it. Elimination results in a set of indirect data links connecting the predecessors of the deleted operation to its successors. The source and target ports of these new indirect links are the cartesian product of the to-be-deleted operation's inlinks' sources and its outlinks' targets. As with the other primitives, elimination also returns a set of mappings between the ports of the resulting workflow and those of the input workflow.
Motif Behavior: The designated operation is eliminated together with its motifs, and these are not propagated to any other node in the workflow graph.
Constraint Checks: There are no particular constraint checks for the elimination primitive.
Dataflow Characteristic: Within the conventional definition of a workflow, ports are connected with direct data links, which correspond to data transfers, i.e. the data artifact appearing at the source port is transferred to the target port as-is. In the indirect link case, however, there is no direct data transfer, as the source and target artifacts are dissimilar and some "eliminated" data processing occurs in between.
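The elimination primitive just described can be sketched over a bare graph of nodes and links (a simplification of the paper's port-level model, so the cartesian product here is over nodes rather than ports):

```python
# Sketch of the eliminate primitive: remove the designated operation and
# connect its predecessors to its successors via new indirect links, i.e.
# the cartesian product of inlink sources and outlink targets. The set of
# indirect links is returned alongside the reduced graph as the mapping.
from itertools import product

def eliminate_op(nodes, links, op):
    """nodes: list of names; links: set of (source, target) pairs."""
    in_sources  = [s for (s, t) in links if t == op]
    out_targets = [t for (s, t) in links if s == op]
    kept = {(s, t) for (s, t) in links if s != op and t != op}
    indirect = set(product(in_sources, out_targets))  # new indirect links
    new_nodes = [n for n in nodes if n != op]
    # The eliminated op's motifs are simply dropped, not propagated.
    return new_nodes, kept | indirect, indirect

nodes = ["A", "prep", "B"]
links = {("A", "prep"), ("prep", "B")}
n2, l2, mapping = eliminate_op(nodes, links, "prep")
```

After the call, `A` is linked indirectly to `B`, and the `mapping` records which links are indirect, which matters because an indirect link no longer denotes an as-is data transfer.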
  • Summary-By-Elimination: requires a minimal amount of annotation effort and operates with a single rule. We have only classified the scientifically significant operations to denote their motifs; all remaining operations in the workflow are designated to be DataPreparation steps. The following reduction rule is then used to summarize the workflow graph W.
Summary-By-Collapse: requires annotation of the workflow with motif classes of greater specificity and a more fine-grained set of rules. Consequently, annotations on data preparation steps are more specific, designating what kind of preparation activity is occurring (e.g. Merging, Splitting, Augmentation, etc.). We use these annotations to remove those data preparation steps from the workflow with the collapse primitive, collapsing in the direction that retains the ports carrying the more reporting-friendly artifacts. We collapse Augmentation, Splitting, and ReferenceGeneration operations upstream, as their outputs are less favored than their inputs, whereas we collapse Extraction, Merging, ReferenceResolution, and all DataOrganization steps downstream, as their outputs are preferred over their inputs in a summary. Two rules in this approach are given below:
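The single-rule character of summary-by-elimination makes it easy to sketch: repeatedly eliminate any node whose motif is DataPreparation until none remain. The graph encoding and helper below are illustrative, not the paper's implementation:

```python
# Summary-by-elimination as one rule applied to a fixed point: every node
# classified (or defaulted) as DataPreparation is eliminated. Sketch only;
# nodes/links are plain names and pairs, standing in for the port model.
from itertools import product

def eliminate(nodes, links, op):
    """Remove op, bridging its predecessors to its successors."""
    ins  = [s for s, t in links if t == op]
    outs = [t for s, t in links if s == op]
    kept = {(s, t) for s, t in links if op not in (s, t)}
    return [n for n in nodes if n != op], kept | set(product(ins, outs))

def summarize_by_elimination(nodes, links, motif_of):
    while True:
        prep = [n for n in nodes if motif_of(n) == "DataPreparation"]
        if not prep:
            return nodes, links  # fixed point: only significant steps remain
        nodes, links = eliminate(nodes, links, prep[0])

motif_of = {"retrieve": "DataRetrieval", "flatten": "DataPreparation",
            "analyse": "Analysis"}.get
summary_nodes, summary_links = summarize_by_elimination(
    ["retrieve", "flatten", "analyse"],
    {("retrieve", "flatten"), ("flatten", "analyse")},
    motif_of)
```

Summary-by-collapse would replace the single rule here with several motif-specific rules that also pick a collapse direction, at the cost of richer annotations.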
  • In our evaluation we annotated the workflow corpus manually using the motif ontology. The effort that goes into the design of workflows can be significant (up to several days for a workflow). Compared to design, the cost of annotation is minor, as it amounts to a single attribute setup per operation. Annotation can be (semi-)automated through the application of mining techniques to workflows and execution traces. It is also possible to pool annotations in service registries or module libraries and automatically propagate annotations to operations that are re-used from the registry. Instead of the motif ontology, it is also possible to use other ontologies [10] to characterize operations in workflows.
  • Motifs are used in conjunction with reduction primitives to build summarization rules. The objective of the rules is to determine how to reduce the workflow when certain motifs, or combinations of motifs, are encountered. We propose graph transformation rules of the form L → P, where L specifies a pattern to be matched in the host graph, given in terms of a workflow-specific graph model and motif decorations, and P specifies a workflow reduction primitive to be enacted upon the node(s) identified by the left-hand side L. The primitive P captures both the replacement graph pattern and how the replacement is to be embedded into the host graph. Our approach is a more controlled, primitive-based rewriting mechanism, rather than a fully declarative one (as in traditional graph rewriting). This allows us to: 1) preserve the data causality relations among operations, together with the constraints that such relations have (e.g. acyclicity for scientific dataflows); and 2) let the user specify high-level reduction policies by creating simple ⟨motif, primitive⟩ pairs. It would be a big undertaking to expect users, who are not computer scientists or software engineers, to specify rewrites in a fully declarative manner (i.e. a matching pattern, a replacement pattern, and embedding information).
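The ⟨motif, primitive⟩ pairing described above amounts to a small rule table plus a dispatch step. The table below and the collapse directions in it are illustrative (the dispatch function is ours, not the paper's engine):

```python
# Rules of the form L -> P as <motif, primitive> pairs: the left-hand
# side L matches a node by its motifs; the right-hand side P names the
# reduction primitive (and, for collapse, a direction). Sketch only;
# the rule contents and directions here are illustrative assumptions.
RULES = [
    ({"Augmentation", "Splitting", "ReferenceGeneration"},
     ("collapse", "up")),
    ({"Extraction", "Merging", "ReferenceResolution", "DataOrganization"},
     ("collapse", "down")),
]

def matching_primitive(node_motifs):
    """Dispatch: return the primitive P of the first rule whose L matches
    any of the node's motifs, or None if no rule fires (node is retained)."""
    for lhs_motifs, primitive in RULES:
        if node_motifs & lhs_motifs:
            return primitive
    return None
```

Because each rule is just a pair, a non-programmer can state a reduction policy by editing the table, which is exactly the advantage claimed over fully declarative graph rewriting.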

    1. 1. Research Objects: Preserving Scientific Workflows and Provenance Khalid Belhajjame Université Paris Dauphine 1
    2. 2. Storyline • Why Research Objects? • Overview of the Research Objects • Portfolio of Research Object Management Tools • Provenance Distillation Through Workflow Summarization 2
    3. 3. Electronic papers are not enough 3 Electronic paper
    4. 4. Electronic papers are not enough 4 Research Object Datasets Results Scientists Hypothesis Experiments Annotations Provenance Electronic paper
    5. 5. Benefits Of Research Objects • A research object aggregates all elements that are necessary to understand research investigations. • Methods (experiments) are viewed as first class citizens • Promote reuse • Enable the verification of reproducibility of the results 5
    6. 6. Reproducibility: a principle of the scientific method (normal people vs. scientists) 6 http://xkcd.com/242/
    7. 7. 47 of 53 “landmark” publications could not be replicated Inadequate cell lines and animal models Nature, 483, 2012 Credit to Carole Goble JCDL 2012 Keynote 7
    8. 8. Overview of the Research Object Model 8 Fig. 2. Overview of the Research Object Model with the different ontologies that encode it: Core RO for describing the basic Research Object structure, roevo for tracking Research Object evolution, and wfprov and wfdesc for describing workflows.
    9. 9. Research Object as an ORE Aggregation 9 Fig. 3. Research Object as an ORE aggregation. to annotation; and ao:Body, which comprises a description of the target. Research Objects use annotations as a means for decorating a resource (or a set of resources) with metadata information. The body is specified in the @base <http://example.com/ro/389/> . @prefix dct: <http://purl.org/dc/terms/> @prefix ore: <http://www.openarchives.org/ore/terms/>
    10. 10. Scientific Workflows • Data driven analysis pipelines • Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving • Tools for automating frequently performed data intensive activities • Provenance for the resulting datasets – The method followed – The resources used – The datasets used 10
    11. 11. PROV Primer, Gil et al WF Execution Trace Retrospective Provenance: Actual data used, actual invocations, timestamps and data derivation trace WF Description Prospective Provenance: Intended method for analysis 11
    12. 12. Specifying Workflows using WfDESC 12 Fig. 5. The wfdesc ontology and its relation to PROV-O.
    13. 13. Specifying Workflow Provenance using WfPROV 13 Fig. 7. The wfprov ontology and its relationship to PROV-O.
    14. 14. Portfolio of Research Object Tools 14
    15. 15. DEMO 15
    16. 16. Workflows can get complex! • Overwhelming for users who are not the developers • Abstractions required for reporting • Lineage queries result in very long trails 16
    17. 17. Overall Approach 17 Workflow Designer Taverna Workbench Motif WF Summary WF Description Summarizer Summarization Rules
    18. 18. PART-1: Scientific Workflow Motifs • Domain Independent categorization – Data-Oriented Nature – Resource/Implementation- Oriented Nature • Captured In a lightweight OWL Ontology 18 http://purl.org/net/wf-motifs
    19. 19. Motif annotations over operations motifs(color_pathway_by_objects) = {m1:DataRetrieval} motifs(Get_Image_From_URL_2) = {m2:DataMoving} 19 DataRetrieval DataMoving
    20. 20. PART-2: Workflow reduction primitives • Collapse (Up/Down) • Compose • Eliminate 20
    21. 21. Eliminate 21
    22. 22. Two sample strategies • By-Elimination – Minimal annotation effort – Single rule • By Collapse – More specific annotation – Multiple rules 22
    23. 23. By-Collapse23
    24. 24. By-Elimination24
    25. 25. Analysis Data Set • 30 Workflows from the Taverna system • Entire dataset & queries accessible from http://www.myexperiment.org/packs/467.html • Manual Annotation using Motif Vocabulary 25
    26. 26. Mechanistic Effect of Summarization 26
    27. 27. User Summaries vs. Summary Graphs 27
    28. 28. Highlights • Research Object model and associated management tools • Annotations of Workflow Using Motifs • Methods for Summarizing Workflow and distilling their provenance traces • Algorithms for Repairing Workflows • Validation of the workflow summarization • Querying of Workflow Execution Provenance using summaries. 28 Ongoing Work
    29. 29. References • P. Alper, K. Belhajjame, C. Goble, P. Karagoz. Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotations. IEEE International Congress on Big Data, 2013. • Pinar Alper, Khalid Belhajjame, Carole A. Goble, Pinar Karagoz: Enhancing and abstracting scientific workflow provenance for data publishing. EDBT/ICDT Workshops 2013. • Belhajjame K, Corcho O, Garijo D, et al. Workflow-Centric Research Objects: A First Class Citizen in the Scholarly Discourse. In proceedings of the ESWC 2012 Workshop on the Future of Scholarly Communication in the Semantic Web (SePublica 2012), Heraklion, Greece, May 2012. • Belhajjame K, Zhao J, Garijo D, et al. The Research Object Suite of Ontologies: Sharing and Exchanging Research Data and Methods on the Open Web, submitted to the Journal of Web Semantics. • Daniel Garijo, Pinar Alper, Khalid Belhajjame, Óscar Corcho, Yolanda Gil, Carole A. Goble: Common motifs in scientific workflows: An empirical analysis. eScience 2012. • Zhao J, Gómez-Pérez JM, Belhajjame K, Klyne G, et al. Why workflows break - Understanding and combating decay in Taverna workflows. IEEE eScience 2012. 29
    30. 30. Acknowledgement
    31. 31. Acknowledgement EU Wf4Ever project (270129) funded under EU FP7 (ICT- 2009.4.1). (http://www.wf4ever-project.org)
