PhD Thesis: Mining abstractions in scientific workflows

Slides of the presentation for my PhD dissertation. I strongly recommend downloading the slides, as they have animations that are easier to see in PowerPoint. The abstract of the thesis is as follows: "Scientific workflows have been adopted in the last decade to represent the computational methods used in in silico scientific experiments and their associated research products. Scientific workflows have proven useful for sharing and reproducing scientific experiments, allowing scientists to visualize, debug and save time when re-executing previous work. However, scientific workflows may be difficult to understand and reuse. The large number of workflows available in repositories, together with their heterogeneity and lack of documentation and usage examples, may become an obstacle for a scientist aiming to reuse the work of other scientists. Furthermore, given that it is often possible to implement a method using different algorithms or techniques, seemingly disparate workflows may be related at a higher level of abstraction, based on their common functionality. In this thesis we address the issue of reusability and abstraction by exploring how workflows relate to one another in a workflow repository, mining abstractions that may be helpful for workflow reuse. In order to do so, we propose a simple model for representing and relating workflows and their executions, we analyze the typical common abstractions that can be found in workflow repositories, we explore the current practices of users regarding workflow reuse and we describe a method for discovering useful abstractions for workflows based on existing graph mining techniques. Our results expose the common abstractions and practices of users in terms of workflow reuse, and show how our proposed abstractions have the potential to become useful for users designing new workflows".

Published in: Education


  1. Mining Abstractions in Scientific Workflows. Date: 03/12/2015. Daniel Garijo (*). Supervisors: Oscar Corcho (*), Yolanda Gil (Ŧ). Affiliations: (*) Universidad Politécnica de Madrid, (Ŧ) USC Information Sciences Institute
  2. Introduction: Lab book, Digital Log, Laboratory Protocol (recipe), Scientific Workflow, Experiment, In silico experiment. PhD Thesis: Mining Abstractions in Scientific Workflows
  3. Benefits of workflows: Time savings • Copy & paste fragments of workflows. Teaching • Reduce the learning curve of new students. Visualization • Simplify workflows. Design for modularity • Highlight the most relevant steps in a workflow. Design for standardization. Debugging • Provenance exploration. Reproducibility and inspectability
  4. Motivation of this work: Workflow Repositories, Workflow Systems. "Let’s share!" "I want to reuse…?" "I want to understand…?" "I want to repurpose…?"
  5. Open research challenges • Workflow representation heterogeneity (Workflow Repositories): How can we represent a description of workflows and their metadata? How can we facilitate the homogeneous consumption of workflows and their resources?
  6. Open research challenges • Workflow representation heterogeneity • Inadequate level of workflow abstraction: What are the most relevant parts of a workflow? Are two seemingly disparate workflows related at a higher level of abstraction? Example workflows: Dataset → Porter Stemmer → Result → IDF → Final Result; Dataset → Lovins Stemmer → Result → Residual IDF → Final Result; Dataset → Stemmer → Result → Term Weighting → FinalResult
  7. Open research challenges • Workflow representation heterogeneity • Inadequate level of workflow abstraction • Difficulties for workflow reuse: How is a workflow related to other workflows? Which workflows (or workflow parts) are potentially useful for reuse?
  8. Open research challenges • Workflow representation heterogeneity • Inadequate level of workflow abstraction • Difficulties for workflow reuse • Lack of support for workflow annotation: How can we facilitate the annotation process?
  9. Outline: 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  10. Hypothesis: Scientific workflow repositories can be automatically analyzed to extract commonly occurring patterns and abstractions that are useful for workflow developers aiming to reuse existing workflows. • H.1: It is possible to define a catalog of common domain independent patterns based on the common functionality of workflow steps. • H.2: It is possible to detect commonly occurring patterns and abstractions automatically. • H.3: Commonly occurring patterns are potentially useful for users designing workflows. Areas labeled on the slide: Workflow representation, Workflow abstraction, Workflow annotation, Workflow reuse
  11. Contributions. Workflow representation and publication: • Model for representing workflow templates and executions • Methodology to publish workflows in the web. Workflow abstraction: • A catalog of common domain independent workflow patterns based on the functionality of workflow steps • A method to extract generic commonly occurring workflow fragments automatically. Workflow annotation: • A model and means for annotating semi-automatically the abstractions in workflows. Workflow reuse: • Metrics for assessing the usefulness of a fragment for reuse • A model to describe and annotate workflow fragments. Labels on the slide: OPMW, Linked Data, Wf-motifs, Wf-fd, Workflow motifs, Graph mining
  12. Outline: 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows a) Requirements b) The OPMW model c) Publishing workflows as Linked Data 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  13. Workflow representation: Structures interchanged in the workflow lifecycle. Lifecycle stages: Design, Instantiation, Execution. Structures: Workflow Template, Workflow Instance, Workflow Execution Trace. [Figure: a Workflow Template (Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult); Workflow Instances (File: Dataset123 → LovinsStemmer algorithm → Id:resultaa1 → IDF algorithm → Id:fresultaa2, and File: Dataset124 → PorterStemmer algorithm → Id:resultaa1 → IDF algorithm → Id:fresultaa2); and several Workflow Execution Traces of each instance (File: Dataset123 → LovinsStemmer execution → Id:resultaa1 → IDF execution → Id:fresultaa2, and File: Dataset124 → PorterStemmer execution → Id:resultaa1 → IDF execution → Id:fresultaa2).]
  14. Requirements • Workflow template description. Plan: P-Plan [Garijo et al 2012] http://purl.org/net/p-plan • Workflow execution trace description. Provenance: PROV (W3C) [Lebo et al 2013] http://www.w3.org/ns/prov# • Workflow attribution. Dublin Core, PROV (W3C) • Workflow metadata • Link between templates and executions. Related models and vocabularies: Scufl, DAX, AGWL, Dispel, IWIR, OPM, OBI, EXPO, ISA, PAV, RO, D-PROV. Citations shown: [Ciccarese et al 2013] [Moreau et al 2011] [Brinkman et al 2010] [Soldatova and King 2006] [Rocca et al 2008] [Belhajjame et al 2012] [Missier et al 2013] [Oinn et al 2004] [Fahringer et al 2005] [Atkinson et al 2013] [Plankensteiner et al 2005]
  15. OPMW: Extending provenance standards and plan models (http://www.opmw.org/ontology/). [Figure: OPMW classes and example instances. Legend: class, object property, instance, instance of, subclass of.] P-Plan extension: opmw:WorkflowTemplate, opmw:WorkflowTemplateProcess and opmw:WorkflowTemplateArtifact, shown with p-plan:Plan, p-plan:Step and p-plan:Variable. PROV, OPM extension: opmw:WorkflowExecutionAccount, opmw:WorkflowExecutionProcess and opmw:WorkflowExecutionArtifact, shown with prov:Bundle, prov:Activity, prov:Entity, opmo:Account, opmv:Process and opmv:Artifact. Template example (template1): Input Dataset, Term Weighting, Topics, related by opmw:isVariableOfTemplate, opmw:isStepOfTemplate, p-plan:hasInputVar and p-plan:isOutputVarOf. Execution example (execution1): File: Dataset123, IDF (java), File: FResultaa2, related by prov:used, prov:wasGeneratedBy and opmo:account. Executions link to templates through opmw:correspondsToTemplate, opmw:correspondsToTemplateProcess and opmw:correspondsToTemplateArtifact
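The relationships named on the OPMW slide can be sketched as plain (subject, predicate, object) triples. This is only an illustration under stated assumptions, not the author's export code: the `ex:` resource names and the `objects` and `subjects` helpers are hypothetical, the property names come from the slide, and no RDF library is assumed.

```python
# Hypothetical resources (prefix ex:) described with the properties named on the slide.
TRIPLES = [
    # Template level (P-Plan extension)
    ("ex:TermWeighting", "opmw:isStepOfTemplate", "ex:template1"),
    ("ex:InputDataset", "opmw:isVariableOfTemplate", "ex:template1"),
    ("ex:Topics", "opmw:isVariableOfTemplate", "ex:template1"),
    ("ex:TermWeighting", "p-plan:hasInputVar", "ex:InputDataset"),
    ("ex:Topics", "p-plan:isOutputVarOf", "ex:TermWeighting"),
    # Execution level (PROV and OPM extension)
    ("ex:IDF_java", "prov:used", "ex:Dataset123"),
    ("ex:FResultaa2", "prov:wasGeneratedBy", "ex:IDF_java"),
    ("ex:IDF_java", "opmo:account", "ex:execution1"),
    ("ex:Dataset123", "opmo:account", "ex:execution1"),
    ("ex:FResultaa2", "opmo:account", "ex:execution1"),
    # Links from the execution back to the template
    ("ex:execution1", "opmw:correspondsToTemplate", "ex:template1"),
    ("ex:IDF_java", "opmw:correspondsToTemplateProcess", "ex:TermWeighting"),
    ("ex:Dataset123", "opmw:correspondsToTemplateArtifact", "ex:InputDataset"),
    ("ex:FResultaa2", "opmw:correspondsToTemplateArtifact", "ex:Topics"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj, triples=TRIPLES):
    """Return every subject s such that (s, predicate, obj) is asserted."""
    return [s for s, p, o in triples if p == predicate and o == obj]


# Which template step does the IDF execution implement?
step = objects("ex:IDF_java", "opmw:correspondsToTemplateProcess")
# Which resources were recorded in the account execution1?
members = subjects("opmo:account", "ex:execution1")
```

Keeping the template and the execution as separate resources joined by the `opmw:correspondsTo...` properties is what lets one template be related to many execution traces.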
  16. Outline: 1. Introduction and motivation 2. Hypothesis and work methodology 3. Workflow representation: OPMW a) Requirements b) The OPMW model c) Publishing workflows as Linked Data 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  17. Publishing workflows as Linked Data. Why Linked Data? • Facilitates exploitation of workflow resources in a homogeneous manner. Adapted methodology from [Villazón-Terrazas et al 2011]. Tested it for the Wings workflow system. Step 1, Specification: Base URI = http://www.opmw.org/ Ontology URI = http://www.opmw.org/ontology/ Assertion URI = http://www.opmw.org/export/resource/ClassName/instanceName Examples: http://www.opmw.org/export/resource/WorkflowTemplate/ABSTRACTSUBWFDOCKING http://www.opmw.org/export/resource/WorkflowExecutionAccount/ACCOUNT1348629350796
  18. Publishing workflows as Linked Data. Steps: 1. Specification 2. Modeling (OPMW, P-Plan, OPM, DC, PROV)
  19. Publishing workflows as Linked Data. Steps: 1. Specification 2. Modeling 3. Generation (Workflow system: Workflow Template and Workflow execution; OPMW export; OPMW RDF)
  20. Publishing workflows as Linked Data. Steps: 1. Specification 2. Modeling 3. Generation 4. Publication (OPMW RDF; RDF Upload Interface; RDF Triple store; Permanent web-accessible file store; SPARQL Endpoint)
  21. Publishing workflows as Linked Data. Steps: 1. Specification 2. Modeling 3. Generation 4. Publication 5. Exploitation (SPARQL endpoint; Curl; Linked Data Browser; Workflow Explorer)
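The URI scheme given in the Specification step (Base URI, Ontology URI, and Assertion URI = base + export/resource/ClassName/instanceName) can be sketched as a small helper. The function name `assertion_uri` is hypothetical and no escaping of unusual characters is attempted; the URI patterns and the example class and instance names are the ones listed on the slide.

```python
BASE_URI = "http://www.opmw.org/"
ONTOLOGY_URI = BASE_URI + "ontology/"
RESOURCE_URI = BASE_URI + "export/resource/"


def assertion_uri(class_name, instance_name):
    """Build an assertion URI of the form <base>export/resource/ClassName/instanceName."""
    return f"{RESOURCE_URI}{class_name}/{instance_name}"


# Example from the slide: a workflow template resource.
template_uri = assertion_uri("WorkflowTemplate", "ABSTRACTSUBWFDOCKING")
```

Because the class name is part of the path, a reader can tell what kind of resource a URI identifies before dereferencing it.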
  22. Outline: 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse a) A catalog of common workflow abstractions b) Workflow reuse analysis 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  23. A catalog of common workflow abstractions. Generalization of workflow steps based on functionality. Workflow motif: Domain independent conceptual abstraction on the workflow steps. 1. Data-oriented motifs: What kind of manipulations does the workflow have? E.g.: • Data retrieval • Data preparation • Data curation • Data visualization • etc.
  24. A catalog of common workflow abstractions. Generalization of workflow steps based on functionality. Workflow motif: Domain independent conceptual abstraction on the workflow steps. 1. Data-oriented motifs: What kind of manipulations does the workflow have? E.g.: • Data retrieval • Data preparation • etc. 2. Workflow-oriented motifs: How does the workflow perform its operations? E.g.: • Stateful steps • Stateless steps • Human interactions • etc.
  25. Methodology for finding workflow motifs. Goal: Reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence. Collect workflows: 89 + 125 + 26 + 20 = 260 workflows
  26. Methodology for finding workflow motifs. Preliminary workflow analysis: Researcher 1, Researcher 2, Researcher 3
  27. Methodology for finding workflow motifs. Agreement and cross validation
  28. Result Summary • Over 60% of the motifs are data preparation motifs • Some differences are motivated by the workflow systems in the analysis • Around 40% of workflows contain motifs related to workflow reuse (composite workflows, internal macros). But how do users perceive workflow reuse? What about fragments of workflows?
  29. Outline: 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse a) A catalog of common workflow abstractions b) Workflow reuse survey 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  30. Use case: The LONI Pipeline. Workflow system for neuroimaging analysis http://pipeline.loni.usc.edu/explore/library-navigator/ Process: Discussions with scientists; User survey; Collect responses from users (21 responses); Discuss results
  31. Summary results • The majority of users agree that reusing and sharing workflows is useful • Unlike with workflows, users find reusing groupings from their own work more useful than reusing groupings from others • Most respondents agreed that groupings help simplify workflows. Groupings also make workflows more understandable by others. Can we detect groupings automatically?
  32. Outline: 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques a) Corpus preparation b) Graph mining c) Fragment filtering d) Fragment linking 6. Evaluation 7. Conclusions and future work
  33. Workflow mining approaches • Clustering [Montani and Leonardi 2012], [García-Jiménez and Wilkinson, 2014]: the workflow corpus is grouped into Cluster 1, Cluster 2, Cluster 3
  34. Workflow mining approaches • Clustering [Montani and Leonardi 2012], [García-Jiménez and Wilkinson, 2014] • Topic modeling [Stoyanovich et al 2010]: Topic 1, Topic 2, with example probabilities P(Topic1) = 0.7, P(Topic2) = 0.3; P(Topic1) = 0.5, P(Topic2) = 0.5; P(Topic1) = 0.2, P(Topic2) = 0.8; …
  35. Workflow mining approaches • Clustering [Montani and Leonardi 2012], [García-Jiménez and Wilkinson, 2014] • Topic modeling [Stoyanovich et al 2010] • Case-based reasoning [Leake and Kendall-Morwick 2008], [Müller and Bergmann 2014]
  36. Workflow mining approaches • Clustering [Montani and Leonardi 2012], [García-Jiménez and Wilkinson, 2014] • Topic modeling [Stoyanovich et al 2010] • Case-based reasoning [Leake and Kendall-Morwick 2008], [Müller and Bergmann 2014] • Log mining [van der Aalst et al 2003], [Gómez-Pérez and Corcho, 2008] (PSM)
  37. Workflow mining approaches • Clustering [Montani and Leonardi 2012], [García-Jiménez and Wilkinson, 2014] • Topic modeling [Stoyanovich et al 2010] • Case-based reasoning [Leake and Kendall-Morwick 2008], [Müller and Bergmann 2014] • Log mining [van der Aalst et al 2003], [Gómez-Pérez and Corcho, 2008] • Graph mining [Diamantini et al., 2012]
  38. Workflow Mining in FragFlow (four steps, numbered 1 to 4 on the slide)
  39. Corpus Preparation. Workflows converted to Labeled Directed Acyclic Graphs (LDAG) • The label of a node in the graph corresponds to the type of the step in the workflow • Edges capture the dependencies between different steps. Example: Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult becomes Stemmer algorithm → Term weighting algorithm. Duplicated workflows are removed. Single-step workflows are removed
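The corpus preparation step can be sketched as follows. This is an illustration under stated assumptions, not FragFlow's code: the input format (a dict with `steps` and `deps`) and the function names are hypothetical, and duplicates are detected by comparing sorted label and edge lists, which is a simplification of graph equality that can confuse workflows in which several steps share a type.

```python
def to_ldag(workflow):
    """Reduce a workflow to a labeled graph: step type labels plus labeled edges.

    workflow["steps"] maps a step id to its type label;
    workflow["deps"] lists (source_id, target_id) dependencies between steps.
    """
    steps = workflow["steps"]
    labels = tuple(sorted(steps.values()))
    edges = tuple(sorted((steps[a], steps[b]) for a, b in workflow["deps"]))
    return labels, edges


def prepare_corpus(workflows):
    """Drop single-step workflows and duplicates, keeping first occurrences."""
    seen = set()
    corpus = []
    for wf in workflows:
        if len(wf["steps"]) < 2:
            continue  # single-step workflows are removed
        graph = to_ldag(wf)
        if graph in seen:
            continue  # duplicated workflows are removed
        seen.add(graph)
        corpus.append(graph)
    return corpus


workflows = [
    {"steps": {"s1": "Stemmer", "s2": "TermWeighting"}, "deps": [("s1", "s2")]},
    {"steps": {"a": "Stemmer", "b": "TermWeighting"}, "deps": [("a", "b")]},  # duplicate
    {"steps": {"only": "Sort"}, "deps": []},  # single step
]
corpus = prepare_corpus(workflows)
```

Labeling nodes by step type rather than by step identifier is what makes the second workflow a duplicate of the first, even though its identifiers differ.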
  40. Graph Mining. We use popular graph mining techniques. Inexact FSM (frequent subgraph mining): uses heuristics to calculate the similarity between two graphs, so the solution might not be complete • SUBDUE, with 2 heuristics: Minimum Description Length (MDL) and Size. Exact FSM: delivers all the possible fragments to be found in the dataset • gSpan: depth-first search strategy • FSG: breadth-first search strategy
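The counting idea behind exact frequent subgraph mining can be illustrated on the smallest possible fragments, single edges between two step types. This sketch is not gSpan, FSG or SUBDUE: those algorithms grow larger candidate fragments and handle subgraph isomorphism, which is omitted here. The function name, the corpus format and the choice to count each workflow at most once per fragment are assumptions.

```python
from collections import Counter


def frequent_two_step_fragments(corpus, min_support):
    """Count in how many workflows each labeled edge occurs; keep the frequent ones.

    corpus: list of workflows, each given as a list of (source_label, target_label) edges.
    """
    support = Counter()
    for edges in corpus:
        for edge in set(edges):  # a workflow contributes at most once per fragment
            support[edge] += 1
    return {edge: n for edge, n in support.items() if n >= min_support}


corpus = [
    [("Stemmer", "TermWeighting")],
    [("Stemmer", "TermWeighting"), ("TermWeighting", "Merge")],
    [("Filter", "Sort")],
]
frequent = frequent_two_step_fragments(corpus, min_support=2)
```

An exact miner would extend each frequent edge with further edges and recount, which is where the depth-first (gSpan) and breadth-first (FSG) strategies differ.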
  41. Filtering Relevant Fragments. The number of resulting fragments can be very large. We distinguish: • Multistep fragments: fragments with more than one step • Filtered multistep fragments: multistep fragments that contain all smaller fragments with the same number of occurrences. [Figure: example fragments F1, F2, F3 and F4 built from the steps Stemmer, Term Weighting, Filter, Sort and Query, with occurrence counts of 4, 4, 10 and 3 shown on the slide.]
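The filtering rule can be read as: a multistep fragment is dropped when a larger fragment contains it and occurs the same number of times. The sketch below follows that reading. The assignment of shapes and counts to F1 through F4 is hypothetical, since the slide text does not preserve which fragment is which, and containment is approximated by set inclusion of step labels and edges, which assumes each label appears once per fragment.

```python
def fragment(steps, edges, count):
    """Describe a fragment by its step labels, its labeled edges and its occurrence count."""
    return {"steps": frozenset(steps), "edges": frozenset(edges), "count": count}


def strictly_contains(big, small):
    """True if every step and edge of small is in big and the two fragments differ."""
    same = (small["steps"], small["edges"]) == (big["steps"], big["edges"])
    return small["steps"] <= big["steps"] and small["edges"] <= big["edges"] and not same


def filtered_multistep(fragments):
    """Names of multistep fragments not absorbed by a larger, equally frequent fragment."""
    multistep = {name: f for name, f in fragments.items() if len(f["steps"]) > 1}
    kept = []
    for name, f in multistep.items():
        absorbed = any(
            g["count"] == f["count"] and strictly_contains(g, f)
            for g in multistep.values()
        )
        if not absorbed:
            kept.append(name)
    return kept


# Hypothetical reading of the slide's example fragments.
fragments = {
    "F1": fragment(["Stemmer", "TermWeighting"], [("Stemmer", "TermWeighting")], 4),
    "F2": fragment(
        ["Stemmer", "TermWeighting", "Filter"],
        [("Stemmer", "TermWeighting"), ("TermWeighting", "Filter")],
        4,
    ),
    "F3": fragment(["Filter", "Sort"], [("Filter", "Sort")], 10),
    "F4": fragment(["Filter", "Sort", "Query"], [("Filter", "Sort"), ("Sort", "Query")], 3),
}
kept = filtered_multistep(fragments)
```

Under this reading F1 is dropped because F2 contains it with the same count, while F3 survives because it occurs more often than the larger F4.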
  42. Linking to the Corpus: Example. Workflow fragment description vocabulary: http://purl.org/net/wf-fd (extends P-Plan). [Figure: Workflow 1 (Stemmer, Term Weighting, Stemmer, Term Weighting, Merge) and Fragment1 (Stemmer, Term Weighting), which appears in the workflow as Fragment1 in Wf1 (1) and Fragment1 in Wf1 (2). Classes: p-plan:Step, wffd:TiedWorkflowFragment, wffd:DetectedResultWorkflowFragment. Properties: wffd:foundAs, wffd:foundIn, p-plan:isPrecededBy, p-plan:isStepOfPlan.]
  43. Outline: 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation a) Finding generic motifs in workflows b) Workflow fragment assessment 7. Conclusions and future work
  44. Finding generic motifs in workflows. Research question: Can we find commonly occurring abstractions? Motifs considered: composite workflows, internal macros
  45. Finding generic motifs in workflows. Metrics used: precision and recall, computed over Fragments (F) and Annotated motifs (M)
  46. Finding generic motifs in workflows. Corpus: 22 templates from the same domain annotated manually (Wings workflow corpus + domain knowledge). Example: Dataset → Porter Stemmer → Result → IDF → Final Result and Dataset → Lovins Stemmer → Result → Residual IDF → Final Result, generalized to Dataset → Stemmer → Result → Term Weighting → FinalResult. Component taxonomy terms: Stemmer, Porter Stemmer, Lovins Stemmer, Term Weighting, Inverse Document Frequency (IDF), Residual IDF, Query Term Weighting
  47. Finding generic motifs in workflows. Can we find commonly occurring abstractions? Results of the evaluation for H.2 (It is possible to detect commonly occurring patterns and abstractions automatically): • Internal Macros, Inexact FSM: 2 out of 3 found (r = 0.67); 4 out of 5 (r = 0.8) when applying generalization • Composite Workflows, Exact FSM: all motifs are found, although the precision is low (p = 0.18)
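The reported recall values follow from the usual set-based definitions of precision and recall over detected fragments (F) and annotated motifs (M). The sketch below assumes those standard definitions and exact matching between a fragment and a motif; the identifiers in the sets are invented, and only the ratios 2 out of 3 and 4 out of 5 come from the slide.

```python
def precision(found, relevant):
    """Share of detected fragments that match an annotated motif."""
    return len(found & relevant) / len(found) if found else 0.0


def recall(found, relevant):
    """Share of annotated motifs that were detected as fragments."""
    return len(found & relevant) / len(relevant) if relevant else 0.0


# Hypothetical identifiers: 3 annotated internal macros, 2 of them detected.
motifs = {"macro_a", "macro_b", "macro_c"}
fragments = {"macro_a", "macro_b", "other_1", "other_2"}
r_plain = round(recall(fragments, motifs), 2)

# With generalization the slide reports 4 out of 5 motifs found.
motifs_generalized = {"m1", "m2", "m3", "m4", "m5"}
fragments_generalized = {"m1", "m2", "m3", "m4"}
r_generalized = round(recall(fragments_generalized, motifs_generalized), 2)
```

A low precision such as the reported p = 0.18 means that most detected fragments do not correspond to an annotated motif, even when every motif is found.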
  48. Outline: 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation a) Finding generic motifs in workflows b) Workflow fragment assessment 7. Conclusions and future work
  49. Workflow fragment assessment. Research question: Are our proposed workflow fragments useful? • A fragment is useful if it has been designed and (re)used by a user. • Comparison between proposed fragments and user-designed groupings and workflows
  50. Workflow fragment assessment. Metrics: Precision and recall, computed over Fragments (F), Workflows (W) and Groupings (G)
  51. Workflow fragment assessment. Workflow corpora: User Corpus 1 (WC1) • Designed mostly by a single user • 790 workflows (475 after data preparation). User Corpus 2 (WC2) • Created by a user, with collaborations of others • 113 workflows (96 after data preparation). Multi User Corpus 3 (WC3) • Workflows submitted by 62 users during the month of Jan 2014 • 5859 workflows (357 after data preparation). User Corpus 4 (WC4) • Designed mostly by a single user • 53 workflows (50 after data preparation)
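The effect of data preparation on each corpus can be worked out from the counts on the slide. The snippet below only does that arithmetic; the dictionary layout and the function name are hypothetical.

```python
# (workflows before data preparation, workflows after), as listed on the slide.
CORPORA = {
    "WC1": (790, 475),
    "WC2": (113, 96),
    "WC3": (5859, 357),
    "WC4": (53, 50),
}


def retained_percent(before, after):
    """Percentage of workflows that remain after data preparation, to one decimal."""
    return round(100 * after / before, 1)


retained = {name: retained_percent(*counts) for name, counts in CORPORA.items()}
```

The multi-user corpus WC3 shrinks far more than the single-user corpora, keeping roughly 6% of its workflows compared with 60% to 94% for the others.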
  52. Workflow fragment assessment. Result assessment for H.3 (Commonly occurring patterns are potentially useful for users designing workflows): • 30%-60% of proposed fragments are equal to user-defined groupings or workflows • 40%-80% of proposed fragments are equal or similar to user-defined groupings or workflows. What about the rest of the fragments? Are those useful?
  53. Workflow fragment assessment. User feedback: user survey. Q1: Would you consider the proposed fragment a valuable grouping? • I would not select it as a grouping (0) • I would use it as a grouping with major changes (i.e., adding/removing more than 30% of the steps) (1) • I would use it as a grouping with minor changes (i.e., adding/removing less than 30% of the steps) (2) • I would use it as a grouping as it is (3). Q2: What do you think about the complexity of the fragment? • The fragment is too simple (0) • The fragment is fine as it is (1) • The fragment has too many steps (2). Conclusion: Not enough evidence to state that all proposed workflow fragments are useful
  54. Outline: 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  55. Conclusions: Results. H.1: It is possible to define a catalog of common domain independent patterns based on the common functionality of workflow steps. • Daniel Garijo and Yolanda Gil. A new approach for publishing workflows: Abstractions, standards, and Linked Data. (WORKS'11) • Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis (extended version). Future Generation Computer Systems. 2013. Results: Model for representing workflows (OPMW) and publishing them as Linked Data; Catalog of workflow motifs + workflow annotation. H.2: It is possible to detect commonly occurring patterns and abstractions automatically. Results: Graph mining approach + workflow generalization. • Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis. 8th IEEE International Conference on e-Science (eScience 2012) • Daniel Garijo, Oscar Corcho and Yolanda Gil. Detecting common scientific workflow fragments using templates and execution provenance. Proceedings of the seventh international conference on Knowledge capture (K-CAP 2013).
  56. Conclusions: Results. • Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris Gutman, Ivo D. Dinov, Paul Thompson and Arthur W. Toga. FragFlow: Automated fragment detection in scientific workflows. 10th IEEE Conference on e-Science (eScience 2014) • Daniel Garijo, Oscar Corcho, Yolanda Gil, Meredith N. Braskie, Dereck Hibar, Xie Hua, Neda Jahanshad, Paul Thompson and Arthur W. Toga. Workflow reuse in practice: A study of neuroimaging pipeline users. 10th IEEE Conference on e-Science (eScience 2014). H.3: Commonly occurring patterns are potentially useful for users designing workflows. Results: Graph mining approach + reusability metrics for assessment + workflow annotation; Reuse survey
  57. Conclusions: Impact and future work. Impact: OPMW • Workflow annotation [García-Jiménez and Wilkinson 2014b]. Motif catalog • Expansion for distributed environments [Olabarriaga et al 2013] • Workflow summarization [Alper et al 2013]. Future work: • Towards workflow ecosystems [Garijo et al 2014] (WORKS’14)
  58. Conclusions: Impact and future work • Automatic detection of workflow abstractions • Improvement of workflow reuse: Custom fragments, Ranking fragments, Suggestions of workflows
  59. Mining Abstractions in Scientific Workflows. Date: 03/12/2015. Daniel Garijo (*). Supervisors: Oscar Corcho (*), Yolanda Gil (Ŧ). Affiliations: (*) Universidad Politécnica de Madrid, (Ŧ) USC Information Sciences Institute. All materials are available as Research Objects (with pointers to Figshare) http://w3id.org/dgarijo/ro/mining-abstractions-in-scientific-wfs
  60. Supporting material
  61. Methodology Workflow representation and publication Approach Workflow abstraction and reuse Empirical analysis of workflow corpora Problem Evaluation Requirement validation and user feedback Model Competency question validation Provenance Plan Publication Methodology for publication Extension of existing standards and web technologies Workflow abstraction analysis for reuse Agreement on a catalog of common abstractions Automatic detection and annotation of workflow abstractions Graph mining techniques, generalization Precision, recall and user feedback
  62. Provenance Models. “A record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing” (PROV-DM: The PROV Data Model, W3C)
  63. Replace this slide with a methodological one. [Figure: P-Plan model. Legend: class, object property, subclass of. P-Plan classes: p-plan:Plan, p-plan:Step, p-plan:Variable, p-plan:Activity, p-plan:Entity, p-plan:Bundle. PROV extended classes: prov:Plan, prov:Activity, prov:Entity, prov:Bundle. Properties: p-plan:isStepOfPlan, p-plan:isVariableOfPlan, p-plan:hasInputVar, p-plan:isOutputVarOf, p-plan:isPrecededBy, p-plan:correspondsToStep, p-plan:correspondsToVariable, prov:used, prov:wasGeneratedBy. Statements are contained in a p-plan:Bundle.]
  64. Assumptions and restrictions. Restriction: • Workflows are represented as directed acyclic graphs. Assumptions: • Available workflow repositories exist for exploiting definitions of workflows and workflow executions. • All the workflow steps can be assigned a label with their type. • Two steps of a workflow with the same function have the same type. • Researchers aim to reuse workflows and workflow fragments if they find them useful.
  65. Other models for representing workflow instances, templates and executions
  66. Publishing as LD • Maybe paste here an example instead of the big picture
  67. Data-Oriented Motifs: Data Retrieval; Data Preparation; Format Transformation; Input Augmentation and Output Splitting; Data Organisation; Data Analysis; Data Curation/Cleaning; Data Movement; Data Visualisation
  73. Workflow-Oriented Motifs. Intra-Workflow Motifs: Stateful (Asynchronous) Invocations; Stateless (Synchronous) Invocations; Internal Macros; Human Interactions. Inter-Workflow Motifs: Atomic Workflows; Composite Workflows; Workflow Overloading
  78. 78. Result Summary: Data-Oriented Motifs • Over 60% of the motifs are data preparation motifs • Some differences are motivated by the workflow systems in the analysis • Data analysis is often the main functionality of the workflow
  79. 79. Result Summary: Workflow-Oriented Motifs • Around 40% composite workflows and internal macros • But how do users perceive workflow reuse? • What about fragments of workflows?
  80. 80. Differences and commonalities of the workflow systems • Data moving/retrieval, stateful interactions and human interaction steps are not present in Wings • Web services (Taverna) versus software components (Wings) • Wings has layered execution through Pegasus • Data preparation steps are common in both systems • Use of sub-workflows is high
  81. 81. Reusing workflows… According to the respondents, the major benefits of workflows include: • Time savings • Organizing and storing code • Having a visualization of the overall analysis • Facilitating reproducibility
  82. 82. Reusing groupings… • Reuse is not the only reason why groupings are created. Unlike with workflows, reusing groupings from one's own work is more useful than reusing groupings from others • Most respondents agreed that groupings help simplify workflows. Groupings also make workflows easier for others to understand
  83. 83. Graph Mining. We use popular frequent subgraph mining (FSM) techniques: • Inexact FSM: uses heuristics to calculate the similarity between two graphs, so the solution might not be complete. SUBDUE: two heuristics, Minimum Description Length (MDL) and Size; frequency based • Exact FSM: delivers all the possible fragments to be found in the dataset. gSpan: depth-first search strategy; support based. FSG: breadth-first search strategy; support based
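To make "support based" concrete, here is a minimal sketch of level-wise, support-based fragment counting. It is not gSpan or FSG: it only mines linear paths of step labels, it assumes each step label occurs at most once per workflow, and the names `label_paths`, `frequent_paths` and the toy workflows are hypothetical illustrations rather than anything from the thesis.

```python
from collections import defaultdict


def label_paths(edges, length):
    """Return all label paths with `length` edges found in one workflow graph."""
    successors = defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)
    # Start from single edges and extend each path by one edge per iteration.
    paths = {(src, dst) for src, dst in edges}
    for _ in range(length - 1):
        paths = {p + (nxt,) for p in paths for nxt in successors[p[-1]]}
    return paths


def frequent_paths(workflows, min_support, max_length=3):
    """Map each path to its support: the number of workflows containing it."""
    result = {}
    for length in range(1, max_length + 1):
        support = defaultdict(int)
        for edges in workflows:
            # Each workflow contributes at most once per path (support, not frequency).
            for path in label_paths(edges, length):
                support[path] += 1
        frequent = {p: s for p, s in support.items() if s >= min_support}
        if not frequent:
            break  # no frequent path of this length, so none of a longer one either
        result.update(frequent)
    return result


# Toy corpus (assumed): each workflow is a set of (step, next_step) edges.
workflows = [
    {("Retrieve", "Filter"), ("Filter", "Analyse"), ("Analyse", "Plot")},
    {("Retrieve", "Filter"), ("Filter", "Analyse")},
    {("Retrieve", "Sort"), ("Sort", "Analyse"), ("Analyse", "Plot")},
]
found = frequent_paths(workflows, min_support=2)
for path, count in sorted(found.items(), key=lambda item: (len(item[0]), item[0])):
    print(" -> ".join(path), "support =", count)
```

Support counts a fragment once per workflow, whereas a frequency-based measure would count every occurrence. The early exit relies on the fact that a path cannot be frequent unless its shorter sub-paths are frequent too.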
  84. 84. Linking to the Corpus: Workflow fragment description vocabulary
  85. 85. Workflow fragment assessment: Summary of results
  86. 86. Conclusions: Limitations • L1: OPMW has been designed for data-intensive workflows (without loops or conditionals) • L2: When publishing as Linked Data, it is assumed that all resources will be made public (no privacy issues) • L3: The motif catalog may be expanded with additional motifs • L4: Size and time needed to calculate some workflow fragments • L5: A taxonomy of components is needed when generalizing workflows. This taxonomy is provided by domain experts modeling the domain
