Date: 21/06/2013
Detecting common scientific
workflow fragments using
templates and execution
provenance
Daniel Garijo *, Oscar Corcho *, Yolanda Gil Ŧ
* Ontology Engineering Group
Universidad Politécnica de Madrid,
Ŧ USC Information Sciences Institute
K-CAP 2013. Banff, Canada
2
Overview
• Creation of abstractions from low level and high level tasks in
scientific workflows.
• Approach for detecting common groups of tasks among scientific
workflows.
•Discoverability, understandability, reuse and design
[Figure: Lab book, Digital Log, Laboratory Protocol (recipe), Workflow, Experiment]
3
Background
• Workflows as software artifacts that capture the scientific method
• Addition to paper publication
• Reuse
• Existing repositories of workflows (myExperiment)
• Sharing workflows
• Exploring existing workflows.
• PROBLEMS to address:
• Sometimes workflows are difficult to understand
• Provenance is captured at too low a level. How
can it be generalized?
• Workflow descriptions are hard to relate to each
other.
• What are the common fragments shared among
workflow templates?
http://www.myexperiment.org
4
Terminology: workflow templates
“A workflow template connects the steps of the workflow together, its inputs,
intermediate results and expected outputs, and defines their types and dependencies”.
•Abstract workflow template: Template with some unbound steps
•Specific workflow template: Template in which all the steps are bound to a
specific service, tool or code.
Abstract
Specific
Taxonomy of components
Problem Solving Methods
6
Terminology: workflow execution provenance traces
Workflow execution provenance trace: structured log of the workflow execution
results.
•Inputs of the run
•Outputs of the run
•Intermediate steps resulting from the run.
•Software codes used by the steps.
[Figure: example provenance trace. Labels: Porter Stemmer, Result, TF, Output, Dataset, ReutersTrain, TestDataset, A12314, TFResultRun, 21-06-2013, DataTemplate, ExecutionProcessP1, ExecutionprocessP2]
7
Internal Macro
•Same sequence of steps
in different parts of the
workflow.
•Types of data and steps
are the same.
•May or may not be found
among other workflows.
Local to a workflow.
8
Composite Workflows
•Same sequence of steps among
different workflows.
•Types of data and steps are the
same.
9
Background: Motifs
•Workflow motifs catalogue [Garijo et al. 2012]: Domain independent
conceptual abstractions on the workflow steps.
1. Data-oriented motifs: What kind of manipulations does the workflow
have?
2. Workflow-oriented motifs: How does the workflow perform its
operations?
•We aim to automatically detect two types of motifs
•Internal Macro (common sequences of steps within a workflow)
•Composite workflows (common sequences of steps among workflows)
[Garijo et al. 2012] Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An
empirical analysis. IEEE 8th International Conference on eScience 2012.
11
Motifs: Summary
Most popular HOW motifs: Atomic workflows, Composite Workflows and Internal Macro
12
Approach
[Figure: Workflow Retrieval, Common fragment detection, Result analysis]
1. Retrieval of workflow templates and execution
provenance traces from a repository of
workflows.
2. Algorithms to obtain the most common
fragments among the workflow dataset.
3. Derivation of statistics and annotation of
workflows.
13
Workflow representation
•Workflows are labeled DAGs (Directed Acyclic
Graphs)
•Representation for both templates and
workflow execution provenance traces.
•No loops
•No conditionals
•Popular representation in data oriented
scientific workflows (supported by many
workflow engines).
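As an illustration of the representation described above, here is a minimal sketch of a labeled DAG in Python. The class name `LabeledDAG`, its methods, and the node ids are hypothetical and not taken from the authors' implementation; the labels reuse the type names that appear on the SUBDUE example slides.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledDAG:
    """Directed acyclic graph whose nodes carry a type label (hypothetical helper)."""
    labels: dict = field(default_factory=dict)  # node id -> type label
    edges: set = field(default_factory=set)     # (source id, target id)

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, src, dst):
        if src not in self.labels or dst not in self.labels:
            raise KeyError("both endpoints must be added as nodes first")
        # Adding src -> dst closes a cycle if dst can already reach src.
        if self._reaches(dst, src):
            raise ValueError(f"edge {src} -> {dst} would create a cycle")
        self.edges.add((src, dst))

    def _reaches(self, start, target):
        # Iterative depth-first search over outgoing edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(d for s, d in self.edges if s == node)
        return False


# A three-node fragment: a dataset consumed by a step that produces another dataset.
workflow = LabeledDAG()
for node_id, label in [("d1", "DatasetT1"), ("p1", "ProcessType1"), ("d2", "DatasetT2")]:
    workflow.add_node(node_id, label)
workflow.add_edge("d1", "p1")
workflow.add_edge("p1", "d2")
```

Rejecting cycle-closing edges at insertion time keeps the "no loops" constraint explicit; the same structure can hold either a template or an execution trace, since only the labels differ.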
14
Challenges: Common workflow fragment detection
[Holder et al 1994]: L. B. Holder, D. J. Cook, and S. Djoko. Substructure Discovery in the SUBDUE System. AAAI Workshop on Knowledge Discovery,
pages 169-180, 1994.
•Given a collection of workflows, which are the most common fragments?
•Common sub-graphs among the collection
•Sub-graph isomorphism (NP-complete)
•We use the SUBDUE algorithm [Holder et al 1994]
•Graph Grammar learning
•The rules of the grammar are the workflow fragments
•Graph based hierarchical clustering
•Each cluster corresponds to a workflow fragment
•Iterative algorithm with two measures for compressing the graph:
•Minimum Description Length (MDL)
•Size
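The size measure can be made concrete with a small sketch. It assumes the commonly documented form of the SUBDUE size heuristic, value = size(G) / (size(S) + size(G|S)), where size counts vertices plus edges, S is the candidate fragment and G|S is the graph after replacing every instance of S with a single vertex. The function names and the example counts are illustrative assumptions; the MDL measure, which compares encoding lengths in bits, is not sketched here.

```python
def graph_size(num_vertices, num_edges):
    """Size of a graph as vertices plus edges (assumed size definition)."""
    return num_vertices + num_edges


def size_value(graph, fragment, compressed):
    """Compression value of a fragment; each argument is (num_vertices, num_edges).

    Values above 1 mean that describing the fragment once plus the compressed
    graph is smaller than the original graph.
    """
    return graph_size(*graph) / (graph_size(*fragment) + graph_size(*compressed))


# Hypothetical counts: a 15-vertex, 12-edge graph in which a 3-vertex, 2-edge
# fragment occurs three times. Collapsing each occurrence into one vertex
# leaves 9 vertices and 6 edges.
value = size_value(graph=(15, 12), fragment=(3, 2), compressed=(9, 6))
print(round(value, 2))  # 27 / (5 + 15) = 1.35
```

Under this measure, the fragment that yields the highest value is the one selected in each iteration.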
15
How does SUBDUE work?
•Input graph (node labels): ProcessType1, DatasetT1, DatasetT2, ProcessType2, DatasetT3, ProcessType3, DatasetT3, ProcessType1, DatasetT1, DatasetT2, ProcessType2, DatasetT3, DatasetT2, ProcessType2, DatasetT3
•Iteration 1: Fragment1 is selected in the input graph.
•Iteration 1 result: ProcessType1, DatasetT1, FRAG1, ProcessType3, DatasetT3, ProcessType1, DatasetT1, FRAG1, FRAG1
•Iteration 2: Fragment2 is selected in the compressed graph.
•Iteration 2 result (STOP): FRAG2, ProcessType3, DatasetT3, FRAG2, FRAG1
Results:
•Fragment 1 (FRAG1): DatasetT2, ProcessType2, DatasetT3. Occurrences: 3 times
•Fragment 2 (FRAG2): ProcessType1, DatasetT1, FRAG1. Occurrences: 2 times
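The two iterations of the walkthrough can be reproduced with a small sketch. This is not SUBDUE itself: SUBDUE searches for candidate substructures, whereas here the two fragment patterns are given by hand and only the counting and compression steps are shown. The edges of the toy graph are an assumption (three simple chains guessed from the node labels on the slides), and the function names are hypothetical.

```python
def find_chain_instances(labels, edges, pattern):
    """Return non-overlapping node tuples whose labels follow `pattern` along edges."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    used, instances = set(), []

    def extend(path):
        if len(path) == len(pattern):
            return path
        for nxt in sorted(successors.get(path[-1], [])):
            if nxt not in used and labels[nxt] == pattern[len(path)]:
                found = extend(path + [nxt])
                if found:
                    return found
        return None

    for node in sorted(labels):
        if node not in used and labels[node] == pattern[0]:
            match = extend([node])
            if match:
                instances.append(tuple(match))
                used.update(match)
    return instances


def compress(labels, edges, instances, fragment_label):
    """Replace every instance with a single node labeled `fragment_label`."""
    new_labels, owner = dict(labels), {}
    for index, instance in enumerate(instances):
        fragment_id = f"{fragment_label}_{index}"
        new_labels[fragment_id] = fragment_label
        for node in instance:
            owner[node] = fragment_id
            del new_labels[node]
    rewired = {(owner.get(s, s), owner.get(d, d)) for s, d in edges}
    return new_labels, {(s, d) for s, d in rewired if s != d}


def chain(prefix, chain_labels):
    """Build a linear chain of nodes (assumed edge structure)."""
    ids = [f"{prefix}{i}" for i in range(1, len(chain_labels) + 1)]
    return dict(zip(ids, chain_labels)), set(zip(ids, ids[1:]))


labels, edges = {}, set()
for prefix, chain_labels in [
    ("a", ["DatasetT1", "ProcessType1", "DatasetT2", "ProcessType2",
           "DatasetT3", "ProcessType3", "DatasetT3"]),
    ("b", ["DatasetT1", "ProcessType1", "DatasetT2", "ProcessType2", "DatasetT3"]),
    ("c", ["DatasetT2", "ProcessType2", "DatasetT3"]),
]:
    chain_nodes, chain_edges = chain(prefix, chain_labels)
    labels.update(chain_nodes)
    edges.update(chain_edges)

frag1 = find_chain_instances(labels, edges, ["DatasetT2", "ProcessType2", "DatasetT3"])
labels1, edges1 = compress(labels, edges, frag1, "FRAG1")
frag2 = find_chain_instances(labels1, edges1, ["DatasetT1", "ProcessType1", "FRAG1"])
labels2, edges2 = compress(labels1, edges1, frag2, "FRAG2")
print(len(frag1), len(frag2))  # 3 2
```

With these assumed edges the final graph holds two FRAG2 nodes, one leftover FRAG1 node, and the ProcessType3 and DatasetT3 nodes, which matches the node labels listed for the last iteration.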
21
Challenges: Generalization of workflows
[Figure: pipeline stages Workflow Retrieval, Common fragment detection, Result analysis, Workflow Generalization]
[Figure: component taxonomy labels Porter Stemmer, Lovins Stemmer, Term Weighting, DFTF, Stemmer, CF]
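One way to picture the generalization step is as a relabeling of specific components with their parent class in the component taxonomy, so that two runs using different stemmers become comparable. The sketch below assumes a one-level child-to-parent mapping; the dictionary contents, identifiers, and function names are illustrative guesses based on the labels on this slide, not the authors' taxonomy.

```python
# Hypothetical child -> parent entries for a component taxonomy.
TAXONOMY = {
    "PorterStemmer": "Stemmer",
    "LovinsStemmer": "Stemmer",
    "TF": "TermWeighting",
    "CF": "TermWeighting",
}


def generalize(label, taxonomy):
    """Return the parent class of a label, or the label itself if it has none."""
    return taxonomy.get(label, label)


def generalize_labels(labels, taxonomy):
    """Relabel every node of a workflow graph (node id -> label)."""
    return {node: generalize(label, taxonomy) for node, label in labels.items()}


run_a = {"s1": "PorterStemmer", "s2": "TF", "d1": "Dataset"}
run_b = {"s1": "LovinsStemmer", "s2": "CF", "d1": "Dataset"}

# After generalization both runs carry the same labels, so a fragment
# detector can treat them as occurrences of the same fragment.
print(generalize_labels(run_a, TAXONOMY) == generalize_labels(run_b, TAXONOMY))  # True
```

Labels without a taxonomy entry are left unchanged, so datasets and already-abstract steps pass through untouched.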
22
Analysis setup
Analysis performed on 22 workflow templates with 30 workflow
execution provenance traces.
•Abstract and specific workflow templates
•Several workflow executions belong to the same template
•Some workflow executions had errors during the execution.
•Workflows have been manually analyzed to find motifs.
•Internal Macros
•SubWorkflows
23
Evaluation results. Internal Macro
•Our goal is to maximize the number of filtered multi-step fragments.
•The algorithm finds more multi-step fragments due to the way it operates.
•A step for filtering the multi-step fragments must be applied on the obtained
results (some are part of others).
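The filtering step can be sketched under one possible reading of the bullets above: a fragment is "multi-step" if it contains at least two processing steps, and it is filtered out when it is strictly contained in another multi-step fragment. Both definitions, the edge-set encoding of fragments, and all names below are assumptions made for illustration.

```python
def is_multi_step(fragment_edges, step_labels):
    """True if the fragment mentions at least two distinct step labels."""
    steps = {label for edge in fragment_edges for label in edge if label in step_labels}
    return len(steps) >= 2


def filter_fragments(fragments, step_labels):
    """Keep multi-step fragments that are not a proper part of another one.

    `fragments` maps a fragment name to a frozenset of (source label, target label) edges.
    """
    multi = {name: edges for name, edges in fragments.items()
             if is_multi_step(edges, step_labels)}
    return {
        name: edges
        for name, edges in multi.items()
        # `edges < other` is the proper-subset test on frozensets.
        if not any(edges < other for other_name, other in multi.items() if other_name != name)
    }


STEP_LABELS = {"ProcessType1", "ProcessType2"}
fragments = {
    # Single step: not multi-step.
    "F1": frozenset({("DatasetT2", "ProcessType2"), ("ProcessType2", "DatasetT3")}),
    # Two steps, but fully contained in F3.
    "F2": frozenset({("ProcessType1", "DatasetT2"), ("DatasetT2", "ProcessType2")}),
    # Two steps, maximal.
    "F3": frozenset({("ProcessType1", "DatasetT2"), ("DatasetT2", "ProcessType2"),
                     ("ProcessType2", "DatasetT3")}),
}
kept = filter_fragments(fragments, STEP_LABELS)
print(sorted(kept))  # ['F3']
```

Comparing edge sets only approximates containment between graphs; a fuller check would need sub-graph matching on the fragments themselves.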
24
Evaluation results. Composite workflows
•More filtered multi-step fragments are found automatically than manually.
•Manual analysis affects sub-workflows.
•More than 50% of the filtered multi-step fragments overlap with the manual ones.
•The fragments found automatically have more occurrences than those found
manually.
25
Limitations
Overlapping fragments may not
be fully detected.
26
Conclusions & future work
•Approach for detecting commonalities among scientific
workflows.
• Workflow execution provenance traces
• Workflow templates
•Detection of the most common workflow fragments.
•Generalization of the datasets.
Future work
•Expand analysis to other domains.
•Add support for other workflow systems: Taverna, KNIME, GenePattern, Galaxy,
VisTrails, etc.
•Test other graph matching algorithms.
•Optimize the algorithm by reducing the search space.
•All inputs and results are available here: http://www.oeg-upm.net/files/dgarijo/kcap2013Eval
27
Towards automatic annotation of workflows
•Ontology for describing workflow motifs
•The Workflow Motif Ontology
•URL: http://purl.org/net/wf-motifs
•Ontology for linking fragments to
the workflows of the dataset (Work in progress).
•The Workflow Fragment Description Ontology
•URL: To be announced
28
Current improvements
•Testing other domains.
•Expanding the compatible workflow
systems: Taverna.
•Improving the workflow representation to
reduce the graph size.
29
Who are we?
•Daniel Garijo
Ontology Engineering Group, UPM
•Oscar Corcho
Ontology Engineering Group, UPM
•Yolanda Gil
Information Sciences Institute, USC
EU Wf4Ever project (270129)
funded under EU FP7 (ICT-2009.4.1).
(http://www.wf4ever-project.org)
