Date: 03/12/2015
Mining Abstractions in
Scientific Workflows
Daniel Garijo *
Supervisors: Oscar Corcho *, Yolanda Gil Ŧ
* Universidad Politécnica de Madrid,
Ŧ USC Information Sciences Institute
Introduction
Experiment → In silico experiment
Lab book → Digital Log
Laboratory Protocol (recipe) → Scientific Workflow
PhD Thesis: Mining Abstractions in Scientific Workflows
Benefits of workflows
Time savings
•Copy & paste fragments of workflows
Teaching
•Reduce the learning curve of new students
Visualization
•Simplify workflows
Design for modularity
•Highlight the most relevant steps of a workflow
Design for standardization
Debugging
•Provenance exploration
Reproducibility and inspectability
Motivation of this work
Workflow Repositories
Workflow Systems
“Let’s share!”
•I want to reuse…?
•I want to understand…?
•I want to repurpose…?
Open research challenges
•Workflow representation heterogeneity
Workflow Repositories
How can we represent a description of workflows and their metadata?
How can we facilitate the homogeneous consumption of workflows and
their resources?
•Inadequate level of workflow abstraction
What are the most relevant parts of a workflow?
Are two seemingly disparate workflows related at a higher level of abstraction?
[Figure: two workflows, Dataset → Porter Stemmer → Result → IDF → Final Result and Dataset → Lovins Stemmer → Result → Residual IDF → Final Result, shown next to the more abstract workflow Dataset → Stemmer → Result → Term Weighting → FinalResult]
•Difficulties for workflow reuse
How is a workflow related to
other workflows?
Which workflow (parts) are
potentially useful for reuse?
•Lack of support for workflow annotation
How can we facilitate the annotation process?
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
Hypothesis
Scientific workflow repositories can be automatically analyzed to
extract commonly occurring patterns and abstractions that are
useful for workflow developers aiming to reuse existing workflows.
•H.1: It is possible to define a catalog of common domain-independent
patterns based on the common functionality of workflow steps.
•H.2: It is possible to detect commonly occurring patterns and
abstractions automatically.
•H.3: Commonly occurring patterns are potentially useful for users
designing workflows.
Workflow abstraction
Workflow representation
Workflow reuse
Workflow annotation
Workflow reuse
Contributions
Workflow representation and publication
•Model for representing workflow templates and executions
•Methodology to publish workflows on the web
Workflow abstraction
•A catalog of common domain-independent workflow patterns based on the
functionality of workflow steps
•A method to extract generic, commonly occurring workflow fragments
automatically
Workflow annotation
•A model and means for semi-automatically annotating the abstractions in
workflows
Workflow reuse
•Metrics for assessing the usefulness of a fragment for reuse
•A model to describe and annotate workflow fragments
[Figure labels: OPMW, Linked Data, Wf-motifs, Wf-fd, Workflow motifs, Graph mining]
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
a) Requirements
b) The OPMW model
c) Publishing workflows as Linked Data
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
Workflow representation: Structures interchanged in the workflow lifecycle
Lifecycle: Design (workflow template) → Instantiation (workflow instance) → Execution (workflow execution trace)
•Workflow template: Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult
•Workflow instances: File: Dataset123 → LovinsStemmer algorithm → Id:resultaa1 → IDF algorithm → Id:fresultaa2; File: Dataset124 → PorterStemmer algorithm → Id:resultaa1 → IDF algorithm → Id:fresultaa2
•Workflow execution traces: File: Dataset123 → LovinsStemmer execution → Id:resultaa1 → IDF execution → Id:fresultaa2; File: Dataset124 → PorterStemmer execution → Id:resultaa1 → IDF execution → Id:fresultaa2; … (the figure shows several traces per instance)
Requirements
Workflow template description
Plan: P-Plan [Garijo et al 2012]
http://purl.org/net/p-plan
Workflow execution trace description
Provenance: PROV (W3C) [Lebo et al 2013]
http://www.w3.org/ns/prov#
Workflow attribution
Dublin Core, PROV (W3C)
Workflow metadata
Link between templates and executions
Related models and vocabularies: Scufl [Oinn et al 2004], DAX, AGWL [Fahringer et al 2005], Dispel [Atkinson et al 2013], IWIR [Plankensteiner et al 2005], OPM [Moreau et al 2011], OBI [Brinkman et al 2010], EXPO [Soldatova and King 2006], ISA [Rocca et al 2008], PAV [Ciccarese et al 2013], RO [Belhajjame et al 2012], D-PROV [Missier et al 2013]
OPMW: Extending provenance standards and plan models
http://www.opmw.org/ontology/
P-Plan extension
•Classes: opmw:WorkflowTemplate, opmw:WorkflowTemplateProcess, opmw:WorkflowTemplateArtifact
•Extended classes: p-plan:Plan, p-plan:Step, p-plan:Variable
PROV, OPM extension
•Classes: opmw:WorkflowExecutionAccount, opmw:WorkflowExecutionProcess, opmw:WorkflowExecutionArtifact
•Extended classes: prov:Bundle, prov:Activity, prov:Entity, opmo:Account, opmv:Process, opmv:Artifact
Object properties: opmw:isVariableOfTemplate, opmw:isStepOfTemplate, opmw:correspondsToTemplate, opmw:correspondsToTemplateProcess, opmw:correspondsToTemplateArtifact, p-plan:hasInputVar, p-plan:isOutputVarOf, prov:used, prov:wasGeneratedBy, opmo:account
[Figure: template1 (Input Dataset → Term Weighting → Topics) linked to execution1 (File: Dataset123 → IDF (java) → File: FResultaa2). Legend: Class, Object property, Instance, Instance of, Subclass of]
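A minimal sketch of how the template and execution in the figure could be written down as triples. Plain Python tuples stand in for RDF here. The `ex:` resource names and the two helper functions are illustrative assumptions; only the predicates are taken from the slide.

```python
# Plain (subject, predicate, object) tuples stand in for RDF triples.
# The "ex:" resource names are hypothetical; the predicates come from the slide.
TEMPLATE = [
    ("ex:inputDataset", "opmw:isVariableOfTemplate", "ex:template1"),
    ("ex:topics", "opmw:isVariableOfTemplate", "ex:template1"),
    ("ex:termWeighting", "opmw:isStepOfTemplate", "ex:template1"),
    ("ex:termWeighting", "p-plan:hasInputVar", "ex:inputDataset"),
    ("ex:topics", "p-plan:isOutputVarOf", "ex:termWeighting"),
]

EXECUTION = [
    ("ex:execution1", "opmw:correspondsToTemplate", "ex:template1"),
    ("ex:idfRun", "opmw:correspondsToTemplateProcess", "ex:termWeighting"),
    ("ex:dataset123", "opmw:correspondsToTemplateArtifact", "ex:inputDataset"),
    ("ex:fResultaa2", "opmw:correspondsToTemplateArtifact", "ex:topics"),
    ("ex:idfRun", "prov:used", "ex:dataset123"),
    ("ex:fResultaa2", "prov:wasGeneratedBy", "ex:idfRun"),
    ("ex:idfRun", "opmo:account", "ex:execution1"),
    ("ex:dataset123", "opmo:account", "ex:execution1"),
    ("ex:fResultaa2", "opmo:account", "ex:execution1"),
]


def executions_of(template, triples):
    """Return the execution accounts linked to `template`."""
    return sorted(s for s, p, o in triples
                  if p == "opmw:correspondsToTemplate" and o == template)


def template_step_for(process, triples):
    """Return the template step(s) an execution process corresponds to."""
    return sorted(o for s, p, o in triples
                  if s == process and p == "opmw:correspondsToTemplateProcess")


if __name__ == "__main__":
    graph = TEMPLATE + EXECUTION
    print(executions_of("ex:template1", graph))
    print(template_step_for("ex:idfRun", graph))
```

The link from execution back to template is what lets a consumer ask which runs exist for a given plan without knowing the workflow system that produced them.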
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: OPMW
a) Requirements
b) The OPMW model
c) Publishing workflows as Linked Data
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
Publishing workflows as Linked Data
Why Linked Data?
•Facilitates exploitation of workflow resources in a homogeneous manner
Adapted methodology from [Villazón-Terrazas et al 2011]
Tested it for the Wings workflow system
Step 1: Specification
Base URI = http://www.opmw.org/
Ontology URI = http://www.opmw.org/ontology/
Assertion URI = http://www.opmw.org/export/resource/ClassName/instanceName
Examples:
http://www.opmw.org/export/resource/WorkflowTemplate/ABSTRACTSUBWFDOCKING
http://www.opmw.org/export/resource/WorkflowExecutionAccount/ACCOUNT1348629350796
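The URI scheme above can be sketched as a small helper. This is only an illustration of the pattern `Base URI + export/resource/ClassName/instanceName`. The function name `assertion_uri` and the percent-encoding of the name parts are assumptions, not part of the published methodology.

```python
from urllib.parse import quote

BASE_URI = "http://www.opmw.org/"
ONTOLOGY_URI = BASE_URI + "ontology/"


def assertion_uri(class_name, instance_name):
    """Build an assertion URI following the scheme on the slide.

    Percent-encoding the two name parts is an assumption made here so that
    names with spaces still yield a valid URI.
    """
    return f"{BASE_URI}export/resource/{quote(class_name)}/{quote(instance_name)}"


if __name__ == "__main__":
    print(assertion_uri("WorkflowTemplate", "ABSTRACTSUBWFDOCKING"))
```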
Step 2: Modeling
Vocabularies: OPMW, P-Plan, OPM, DC, PROV
Step 3: Generation
[Figure: Workflow system (workflow template, workflow execution) → OPMW export → OPMW RDF]
Step 4: Publication
[Figure: OPMW RDF → RDF upload interface → RDF triple store (SPARQL endpoint) and permanent web-accessible file store]
Step 5: Exploitation
[Figure: Curl, Linked Data browser, Workflow Explorer, SPARQL endpoint]
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
a) A catalog of common workflow abstractions
b) Workflow reuse analysis
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
A catalog of common workflow abstractions
Generalization of workflow steps based on functionality.
Workflow motif: Domain-independent conceptual abstraction of the workflow
steps.
1. Data-oriented motifs: What kind of data manipulations
does the workflow perform?
•E.g.:
•Data retrieval
•Data preparation
•Data curation
•Data visualization
• etc.
2. Workflow-oriented motifs: How does
the workflow perform its operations?
•E.g.:
•Stateful steps
•Stateless steps
•Human interactions
•etc.
Methodology for finding workflow motifs
Goal: Reverse-engineer the set of current practices in workflow
development through an analysis of empirical evidence
Collect workflows
89 + 125 + 26 + 20 = 260 workflows
Preliminary workflow analysis
Researcher 1 Researcher 2 Researcher 3
Agreement and cross validation
Result Summary
•Over 60% of the motifs are data preparation motifs
•Some differences are motivated by the workflow systems in the
analysis
•Around 40% of workflows contain motifs related to workflow
reuse (composite workflows, internal macros)
But how do users perceive workflow reuse?
What about fragments of workflows?
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
a) A catalog of common workflow abstractions
b) Workflow reuse survey
5. Mining abstractions from workflows using graph mining techniques
6. Evaluation
7. Conclusions and future work
Use case: The LONI Pipeline
Workflow system for neuroimaging analysis
http://pipeline.loni.usc.edu/explore/library-navigator/
Discussions with scientists → User survey → Collect responses from users (21 responses) → Discuss results
Summary results
•The majority of users agree that reusing and sharing workflows is
useful
•Unlike workflows, reusing groupings from one’s own work is more
useful than reusing groupings from others
•Most respondents agreed that groupings help simplify workflows
•Groupings also make workflows more understandable by others
Can we detect groupings automatically?
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph
mining techniques
a) Corpus preparation
b) Graph mining
c) Fragment filtering
d) Fragment linking
6. Evaluation
7. Conclusions and future work
Workflow mining approaches
•Clustering [Montani and Leonardi 2012], [García-Jimenez and Wilkinson, 2014]
[Figure: a workflow corpus split into Cluster 1, Cluster 2 and Cluster 3]
•Topic modeling [Stoyanovich et al 2010]
[Figure: workflows with topic probabilities: P(Topic1) = 0.7, P(Topic2) = 0.3; P(Topic1) = 0.5, P(Topic2) = 0.5; P(Topic1) = 0.2, P(Topic2) = 0.8; …]
•Case-based reasoning [Leake and Kendall-Morwick 2008], [Müller and Bergmann 2014]
•Log mining [van der Aalst et al 2003], [Gómez-Pérez and Corcho, 2008]
[Figure label: PSM]
•Graph mining [Diamantini et al., 2012]
Workflow Mining in FragFlow
1. Corpus preparation
2. Graph mining
3. Fragment filtering
4. Fragment linking
Corpus Preparation
Workflows converted to Labeled Directed Acyclic Graphs (LDAG)
• The label of a node in the graph corresponds to the type of the step in
the workflow
• Edges capture the dependencies between different steps
[Figure: the workflow Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult becomes the graph Stemmer algorithm → Term weighting algorithm]
Duplicated workflows are removed
Single-step workflows are removed
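The preparation steps above can be sketched as follows. This is a simplified illustration, not the FragFlow implementation: the `Workflow` structure, the function names and the label-based duplicate check (which compares sorted labels and labelled edges rather than testing full graph isomorphism) are all assumptions.

```python
from typing import NamedTuple


class Workflow(NamedTuple):
    name: str
    steps: dict   # step id -> step type (the node label)
    edges: set    # (source step id, target step id) dependencies


def signature(wf):
    """Label-based signature used to spot duplicates (a simplification)."""
    labelled_edges = sorted((wf.steps[a], wf.steps[b]) for a, b in wf.edges)
    return (tuple(sorted(wf.steps.values())), tuple(labelled_edges))


def prepare_corpus(workflows):
    """Drop single-step workflows and workflows already seen."""
    seen, corpus = set(), []
    for wf in workflows:
        if len(wf.steps) < 2:      # single-step workflows are removed
            continue
        sig = signature(wf)
        if sig in seen:            # duplicated workflows are removed
            continue
        seen.add(sig)
        corpus.append(wf)
    return corpus


if __name__ == "__main__":
    wfs = [
        Workflow("a", {"s1": "Stemmer", "s2": "TermWeighting"}, {("s1", "s2")}),
        Workflow("b", {"x": "Stemmer", "y": "TermWeighting"}, {("x", "y")}),
        Workflow("c", {"s": "Sort"}, set()),
    ]
    print([wf.name for wf in prepare_corpus(wfs)])
```

Workflows "a" and "b" differ only in their step identifiers, so they share a signature and only the first is kept.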
Graph Mining
We use popular graph mining techniques:
•Inexact frequent subgraph mining (FSM): uses heuristics to calculate the similarity between two
graphs. The solution might not be complete
SUBDUE: 2 heuristics, Minimum Description Length (MDL) and Size
•Exact FSM: delivers all the fragments that can be found in the dataset
gSpan: depth-first search strategy
FSG: breadth-first search strategy
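To make the idea of support concrete, the sketch below counts in how many workflows each two-step fragment (a single labelled edge) occurs and keeps those that reach a minimum support. This is only the kind of first pass a breadth-first, support-based miner starts from. It is not gSpan, FSG or SUBDUE, and the function name and input format are assumptions.

```python
from collections import Counter


def frequent_edges(corpus, min_support):
    """Return the labelled edges that occur in at least `min_support` workflows.

    corpus: list of collections of (source label, target label) edges,
    one collection per workflow.
    """
    support = Counter()
    for edges in corpus:
        support.update(set(edges))   # count each edge once per workflow
    return {edge: n for edge, n in support.items() if n >= min_support}


if __name__ == "__main__":
    corpus = [
        {("Stemmer", "TermWeighting"), ("TermWeighting", "Filter")},
        {("Stemmer", "TermWeighting")},
        {("Filter", "Sort")},
    ]
    print(frequent_edges(corpus, min_support=2))
```

Larger fragments would then be grown from the frequent edges, which is where the exact algorithms differ in search strategy.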
Filtering Relevant Fragments
The number of resulting fragments can be very large. We distinguish:
•Multistep fragments: fragments with more than one step
•Filtered multistep fragments: multistep fragments that are not contained in a
larger fragment with the same number of occurrences
[Figure: example fragments]
•F1: Stemmer → Term Weighting (found 4 times)
•F2: Stemmer → Term Weighting → Filter (found 4 times)
•F3: Filter → Sort (found 10 times)
•F4: Filter → Sort → Query (found 3 times)
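The filtering rule can be tried on the F1 to F4 example. The sketch below assumes fragments are simple chains of step labels and treats containment as a contiguous sub-chain. Real fragments are graphs, so this is an illustration under that simplifying assumption. The fragment shapes and counts are read from the figure, and the function names are hypothetical.

```python
def is_contained(small, big):
    """True if chain `small` appears as a contiguous run inside chain `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def filter_fragments(fragments):
    """Keep multistep fragments not subsumed by a larger, equally frequent one.

    fragments: dict name -> (tuple of step labels, number of occurrences)
    """
    multistep = {k: v for k, v in fragments.items() if len(v[0]) > 1}
    kept = {}
    for name, (steps, count) in multistep.items():
        redundant = any(
            other != name
            and len(o_steps) > len(steps)
            and o_count == count
            and is_contained(steps, o_steps)
            for other, (o_steps, o_count) in multistep.items()
        )
        if not redundant:
            kept[name] = (steps, count)
    return kept


FRAGMENTS = {
    "F1": (("Stemmer", "Term Weighting"), 4),
    "F2": (("Stemmer", "Term Weighting", "Filter"), 4),
    "F3": (("Filter", "Sort"), 10),
    "F4": (("Filter", "Sort", "Query"), 3),
}

if __name__ == "__main__":
    print(sorted(filter_fragments(FRAGMENTS)))
```

F1 is dropped because F2 contains it and occurs the same number of times. F3 stays even though F4 contains it, because F3 occurs more often on its own.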
Linking to the Corpus: Example
[Figure: Workflow 1 contains two Stemmer → Term Weighting branches and a Merge step. Fragment1 (Stemmer → Term Weighting), a wffd:DetectedResultWorkflowFragment, is linked through wffd:foundAs to "Fragment1 in Wf1 (1)" and "Fragment1 in Wf1 (2)", each a wffd:TiedWorkflowFragment, which are linked through wffd:foundIn to Workflow 1. Steps are p-plan:Step instances connected with p-plan:isPrecededBy and p-plan:isStepOfPlan.]
Workflow fragment description vocabulary:
http://purl.org/net/wf-fd
(Extends P-Plan)
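A minimal sketch of the linking example, again with plain tuples standing in for RDF triples. The `ex:` names and the helper `occurrences` are hypothetical. The direction of `wffd:foundAs` (detected fragment to tied fragment) and `wffd:foundIn` (tied fragment to workflow) is assumed from the figure.

```python
# (subject, predicate, object) tuples; "ex:" resources are illustrative only.
TRIPLES = [
    ("ex:fragment1", "rdf:type", "wffd:DetectedResultWorkflowFragment"),
    ("ex:fragment1", "wffd:foundAs", "ex:fragment1_in_wf1_1"),
    ("ex:fragment1", "wffd:foundAs", "ex:fragment1_in_wf1_2"),
    ("ex:fragment1_in_wf1_1", "rdf:type", "wffd:TiedWorkflowFragment"),
    ("ex:fragment1_in_wf1_2", "rdf:type", "wffd:TiedWorkflowFragment"),
    ("ex:fragment1_in_wf1_1", "wffd:foundIn", "ex:workflow1"),
    ("ex:fragment1_in_wf1_2", "wffd:foundIn", "ex:workflow1"),
]


def occurrences(fragment, triples):
    """Count, per workflow, how many times `fragment` was found in it."""
    tied = {o for s, p, o in triples if s == fragment and p == "wffd:foundAs"}
    found = {}
    for s, p, o in triples:
        if p == "wffd:foundIn" and s in tied:
            found[o] = found.get(o, 0) + 1
    return found


if __name__ == "__main__":
    print(occurrences("ex:fragment1", TRIPLES))
```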
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph
mining techniques
6. Evaluation
a) Finding generic motifs in workflows
b) Workflow fragment assessment
7. Conclusions and future work
Finding generic motifs in workflows
Research question: Can we find commonly occurring abstractions?
Composite workflows, internal macros
Metrics used: precision and recall
Fragments (F) compared against annotated motifs (M)
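A minimal sketch of the two metrics over a set of mined fragments F and a set of annotated motifs M. Treating a fragment and a motif as a match only when they are equal is an assumption made for illustration. The slides do not state the matching criterion.

```python
def precision_recall(fragments, motifs):
    """Return (precision, recall) of mined fragments against annotated motifs.

    precision = |F ∩ M| / |F|, recall = |F ∩ M| / |M|
    """
    fragments, motifs = set(fragments), set(motifs)
    hits = fragments & motifs
    precision = len(hits) / len(fragments) if fragments else 0.0
    recall = len(hits) / len(motifs) if motifs else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical identifiers: 4 mined fragments, 3 annotated motifs, 2 in common.
    p, r = precision_recall({"a", "b", "x", "y"}, {"a", "b", "c"})
    print(round(p, 2), round(r, 2))
```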
Corpus: 22 templates from the same domain annotated manually
Wings workflow corpus + domain knowledge
[Figure: Dataset → Porter Stemmer → Result → IDF → Final Result, plus Dataset → Lovins Stemmer → Result → Residual IDF → Final Result, generalize to Dataset → Stemmer → Result → Term Weighting → FinalResult]
Component taxonomy
•Stemmer: Porter Stemmer, Lovins Stemmer
•Term Weighting: Inverse Document Frequency (IDF), Residual IDF, Query Term Weighting
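Generalization with the component taxonomy can be sketched as a label lookup: each concrete component is replaced by its parent class before mining, so that workflows using different stemmers or weighting schemes become comparable. The dictionary below mirrors the taxonomy as read from the slide, and the function name `generalize` is an assumption.

```python
# Component -> parent class, as read from the taxonomy on the slide.
TAXONOMY = {
    "Porter Stemmer": "Stemmer",
    "Lovins Stemmer": "Stemmer",
    "Inverse Document Frequency (IDF)": "Term Weighting",
    "Residual IDF": "Term Weighting",
    "Query Term Weighting": "Term Weighting",
}


def generalize(steps, taxonomy=TAXONOMY):
    """Replace each step label by its parent class; unknown labels are kept."""
    return [taxonomy.get(step, step) for step in steps]


if __name__ == "__main__":
    wf1 = ["Porter Stemmer", "Inverse Document Frequency (IDF)"]
    wf2 = ["Lovins Stemmer", "Residual IDF"]
    print(generalize(wf1))
    print(generalize(wf1) == generalize(wf2))
```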
Results of the evaluation
H.2: It is possible to detect commonly occurring patterns and abstractions
automatically.
Internal macros:
•Inexact FSM: 2 out of 3 found (r = 0.67); 4 out of 5 (r = 0.8) when applying
generalization
Composite workflows:
•Exact FSM: all motifs are found, although the precision is low (p = 0.18)
Can we find commonly occurring abstractions?
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph
mining techniques
6. Evaluation
a) Finding generic motifs in workflows
b) Workflow fragment assessment
7. Conclusions and future work
Workflow fragment assessment
Research question: Are our proposed workflow fragments useful?
•A fragment is useful if it has been designed and (re)used by a user.
•Comparison between proposed fragments and user-designed groupings
and workflows
Metrics: Precision and recall
Fragments (F) compared against workflows (W) and groupings (G)
Workflow corpora
User Corpus 1 (WC1)
• Designed mostly by a single user
• 790 workflows (475 after data preparation)
User Corpus 2 (WC2)
• Created by a user, with collaborations of others
• 113 workflows (96 after data preparation)
Multi User Corpus 3 (WC3)
• Workflows submitted by 62 users during the month of Jan 2014
• 5859 workflows (357 after data preparation)
User Corpus 4 (WC4)
• Designed mostly by a single user
• 53 workflows (50 after data preparation)
Result assessment
•30%-60% of proposed fragments are equal to user defined groupings or
workflows
•40%-80% of proposed fragments are equal or similar to user
defined groupings or workflows
H.3: Commonly occurring patterns are potentially useful for users designing workflows
What about the rest of the fragments? Are those useful?
User feedback: user survey
Q1: Would you consider the proposed fragment a valuable grouping?
•I would not select it as a grouping (0)
•I would use it as a grouping with major changes (i.e., adding/removing more than 30% of the steps) (1)
•I would use it as a grouping with minor changes (i.e., adding/removing less than 30% of the steps) (2).
•I would use it as a grouping as it is (3)
Q2: What do you think about the complexity of the fragment?
•The fragment is too simple (0)
•The fragment is fine as it is (1)
•The fragment has too many steps (2)
Not enough evidence to state that all proposed workflow fragments are useful
Outline
1. Introduction and motivation
2. Hypothesis and contributions
3. Workflow representation: Open Provenance Model for Workflows
4. Workflow abstraction and reuse
5. Mining abstractions from workflows using graph
mining techniques
6. Evaluation
7. Conclusions and future work
Conclusions: Results
H.1: It is possible to define a catalog of common domain independent patterns based on
the common functionality of workflow steps.
Daniel Garijo and Yolanda Gil. A new approach for publishing workflows: Abstractions, standards, and Linked Data. (WORKS'11)
Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis (extended
version). Future Generation Computer Systems. 2013.
Model for representing workflows (OPMW) and publishing them as Linked Data
Catalog of workflow motifs + workflow annotation
H.2: It is possible to detect commonly occurring patterns and abstractions automatically.
Graph mining approach + workflow generalization
Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis. 8th IEEE
International Conference on e-Science (eScience 2012)
Daniel Garijo, Oscar Corcho and Yolanda Gil. Detecting common scientific workflow fragments using templates and execution provenance. Proceedings of the
seventh international conference on Knowledge capture, (K-CAP 2013).
Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris Gutman, Ivo D. Dinov, Paul Thompson and Arthur W. Toga. FragFlow: Automated fragment detection in
scientific workflows. 10th IEEE Conference on e-Science, (eScience 2014)
Daniel Garijo, Oscar Corcho, Yolanda Gil, Meredith N. Braskie, Dereck Hibar, Xie Hua, Neda Jahanshad, Paul Thompson and Arthur W. Toga. Workflow reuse in
practice: A study of neuroimaging pipeline users. 10th IEEE Conference on e-Science, (eScience 2014)
H.3: Commonly occurring patterns are potentially useful for users designing workflows.
Graph mining approach + reusability metrics for assessment + workflow annotation
Reuse survey
Conclusions: Impact and future work
Impact:
OPMW
•Workflow annotation [García-Jiménez and Wilkinson 2014b]
Motif catalog
•Expansion for distributed environments [Olabarriaga et al 2013]
•Workflow summarization [Alper et al 2013]
Future work:
•Towards workflow ecosystems [Garijo et al 2014] (WORKS’14)
•Automatic detection of workflow abstractions
•Improvement of workflow reuse
Custom fragments
Ranking fragments
Suggestions of workflows
All materials are available as Research Objects
(with pointers to Figshare)
http://w3id.org/dgarijo/ro/mining-abstractions-in-scientific-wfs
Supporting material
Methodology
Workflow representation and publication
Approach
Workflow abstraction and reuse
Empirical
analysis of
workflow
corpora
Problem Evaluation
Requirement
validation and
user feedback
Model Competency
question
validation
Provenance
Plan
Publication
Methodology
for publication
Extension of
existing
standards
and web
technologies
Workflow
abstraction
analysis for
reuse
Agreement on
a catalog of
common
abstractions
Automatic detection and annotation of
workflow abstractions
Graph mining
techniques,
generalization
Precision,
recall and user
feedback
Provenance Models
“A record that describes the people, institutions, entities, and activities
involved in producing, influencing, or delivering a piece of data or a thing”
-PROV-DM: The PROV Data Model (W3C)
Replace this slide with a methodological one
prov:used
p-plan:Variable
p-plan:isStepOfPlan
p-plan:isVariableOfPlan
p-plan:hasInputVar
p-plan:isOutputVarOf
p-plan:Activity
p-plan:correspondsToStep
p-plan:Entity
prov:wasGeneratedBy
p-plan:isPrecededBy
p-plan:Bundle
Class Object property
Legend
Subclass of
prov:Bundle
prov:Plan
prov:Entity
prov:Activity
PROV extended classes
Statements contained in a p-plan:Bundle
p-plan:Step
p-plan:Plan
p-plan:correspondsToVariable
Assumptions and restrictions
Restriction:
• Workflows are represented as directed acyclic graphs
Assumptions:
•Available workflow repositories exist for exploiting definitions
of workflows and workflow executions.
•All the workflow steps can be assigned a label with their type
•Two steps of a workflow with the same function have the
same type.
•Researchers aim to reuse workflows and workflow fragments
if they find them useful.
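The DAG restriction can be checked mechanically before a workflow enters the corpus. The sketch below uses the standard library's `graphlib` (Python 3.9 or later). The function name `is_dag` and the dependency-dictionary input format are assumptions for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def is_dag(dependencies):
    """Return True if the workflow graph has no cycles.

    dependencies: dict mapping each step to the set of steps it depends on.
    """
    try:
        # Consuming static_order() forces the cycle check.
        tuple(TopologicalSorter(dependencies).static_order())
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    workflow = {"TermWeighting": {"Stemmer"}, "Stemmer": set()}
    print(is_dag(workflow))
    print(is_dag({"a": {"b"}, "b": {"a"}}))
```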
Other models for representing workflow instances, templates and executions
Publishing as LD
•Maybe paste here an example
instead of the big picture
Data Oriented Motifs
Data-Oriented Motifs
Data Retrieval
Data Preparation
Format Transformation
Input Augmentation and Output Splitting
Data Organisation
Data Analysis
Data Curation/Cleaning
Data Movement
Data Visualisation
Workflow Oriented Motifs
Workflow-Oriented Motifs
Intra-Workflow Motifs
Stateful (Asynchronous) Invocations
Stateless (Synchronous) Invocations
Internal Macros
Human Interactions
Inter-Workflow Motifs
Atomic Workflows
Composite Workflows
Workflow Overloading
Result Summary: Data Oriented Motifs
•Over 60% of the motifs are data
preparation motifs
•Some differences are motivated by the
workflow systems in the analysis
•Data analysis is often the main
functionality of the workflow
Result Summary: Workflow Oriented Motifs
• Around 40% composite workflows and internal macros
But how do users perceive workflow reuse?
•What about fragments of workflows?
Differences and commonalities of the workflow systems
•Data moving/retrieval, stateful interactions and human interaction steps are
not present in Wings
•Web services (Taverna) versus software components (Wings)
•Wings has layered execution through Pegasus
•Data preparation steps are common in both systems
•Use of sub workflows is high
Reusing workflows…
According to the respondents, the major benefits of workflows include:
• Time savings
•Organizing and storing code
• Having a visualization of the overall analysis
• Facilitating reproducibility
Reusing groupings…
•Reuse is not the only reason why groupings are created. Unlike workflows, reusing
groupings from one’s own work is more useful than reusing groupings from others
•Most respondents agreed that groupings help simplify workflows. Groupings also
make workflows more understandable by others
Graph Mining
We use popular graph mining techniques:
Inexact FSM (frequent subgraph mining): uses heuristics to calculate the similarity between two graphs. The solution might not be complete
SUBDUE
• 2 heuristics: Minimum Description Length (MDL) and Size
• Frequency based
Exact FSM: delivers all the fragments that can be found in the dataset
gSpan
• Depth first search strategy
• Support based
FSG
• Breadth first search strategy
• Support based
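As a rough illustration of the support-based, breadth-first strategy listed above, the following Python sketch grows fragments one edge at a time and keeps those that appear in at least `min_support` workflows. It is a simplification and not the FSG or gSpan implementation: it assumes each step label appears at most once per workflow, so checking whether a fragment occurs in a workflow reduces to a subset test on labelled edges. The function names, the example workflows and the `max_edges` limit are hypothetical.

```python
from collections import Counter


def support(fragment, workflows):
    """Number of workflows whose edge set contains every edge of the fragment."""
    return sum(1 for edges in workflows if fragment <= edges)


def nodes(fragment):
    """Step labels touched by the edges of a fragment."""
    return {label for edge in fragment for label in edge}


def frequent_fragments(workflows, min_support, max_edges=3):
    """Level-wise (breadth-first) search for connected frequent fragments.

    Assumption: step labels are unique inside a workflow, so no subgraph
    isomorphism test is needed (real FSM algorithms cannot assume this).
    """
    workflows = [frozenset(w) for w in workflows]
    # Each workflow is a set, so an edge is counted once per workflow.
    edge_counts = Counter(edge for w in workflows for edge in w)
    frequent_edges = [e for e, c in edge_counts.items() if c >= min_support]

    level = {frozenset([e]) for e in frequent_edges}
    found = {}
    size = 1
    while level and size <= max_edges:
        for fragment in level:
            found[fragment] = support(fragment, workflows)
        next_level = set()
        for fragment in level:
            touched = nodes(fragment)
            for edge in frequent_edges:
                # Only extend with edges that keep the fragment connected.
                if edge in fragment or not (edge[0] in touched or edge[1] in touched):
                    continue
                candidate = fragment | {edge}
                if support(candidate, workflows) >= min_support:
                    next_level.add(candidate)
        level = next_level
        size += 1
    return found


# Hypothetical workflows, written as sets of (source step, target step) edges.
chain = {("Dataset", "Stemmer"), ("Stemmer", "Result"),
         ("Result", "TermWeighting"), ("TermWeighting", "FinalResult")}
workflows = [
    chain,
    set(chain),
    {("Dataset", "Stemmer"), ("Stemmer", "Result"), ("Result", "Plot")},
]

found = frequent_fragments(workflows, min_support=3)
for fragment, count in found.items():
    print(sorted(fragment), count)
```

With `min_support=3` only the fragments built from the `Dataset`, `Stemmer` and `Result` steps survive, because the third workflow diverges after `Result`. Lowering the threshold admits longer fragments, at the cost of more candidates to check.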
Linking to the Corpus: Workflow fragment description vocabulary
Workflow fragment assessment: Summary of results
Conclusions: Limitations
L1: OPMW has been designed for data-intensive workflows (without loops or conditionals)
L2: When publishing as Linked Data, it is assumed that all resources will be made public (no privacy issues)
L3: The motif catalog may be expanded with additional motifs
L4: Size and time needed to calculate some workflow fragments
L5: A taxonomy of components is needed when generalizing workflows. This taxonomy is provided by domain experts modeling the domain.

Gender and Mental Health - Counselling and Family Therapy Applications and In...
PsychoTech Services
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
GeorgeMilliken2
 
How to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 InventoryHow to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 Inventory
Celine George
 
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptxBIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
RidwanHassanYusuf
 
Bonku-Babus-Friend by Sathyajith Ray (9)
Bonku-Babus-Friend by Sathyajith Ray  (9)Bonku-Babus-Friend by Sathyajith Ray  (9)
Bonku-Babus-Friend by Sathyajith Ray (9)
nitinpv4ai
 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
RAHUL
 
A Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two HeartsA Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two Hearts
Steve Thomason
 
Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"
National Information Standards Organization (NISO)
 
Standardized tool for Intelligence test.
Standardized tool for Intelligence test.Standardized tool for Intelligence test.
Standardized tool for Intelligence test.
deepaannamalai16
 

Recently uploaded (20)

Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...
Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...
Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...
 
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdfREASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
 
UGC NET Exam Paper 1- Unit 1:Teaching Aptitude
UGC NET Exam Paper 1- Unit 1:Teaching AptitudeUGC NET Exam Paper 1- Unit 1:Teaching Aptitude
UGC NET Exam Paper 1- Unit 1:Teaching Aptitude
 
HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.
 
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
 
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptxRESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
 
How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17
 
Wound healing PPT
Wound healing PPTWound healing PPT
Wound healing PPT
 
Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
 
Présentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptx
Présentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptxPrésentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptx
Présentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptx
 
Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
 
How to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 InventoryHow to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 Inventory
 
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptxBIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
 
Bonku-Babus-Friend by Sathyajith Ray (9)
Bonku-Babus-Friend by Sathyajith Ray  (9)Bonku-Babus-Friend by Sathyajith Ray  (9)
Bonku-Babus-Friend by Sathyajith Ray (9)
 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
 
A Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two HeartsA Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two Hearts
 
Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"
 
Standardized tool for Intelligence test.
Standardized tool for Intelligence test.Standardized tool for Intelligence test.
Standardized tool for Intelligence test.
 

PhD Thesis: Mining abstractions in scientific workflows

  • 1. Date: 03/12/2015 Mining Abstractions in Scientific Workflows Daniel Garijo * Supervisors: Oscar Corcho *, Yolanda Gil Ŧ * Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute
  • 2. Introduction: Lab book, Digital Log, Laboratory Protocol (recipe), Scientific Workflow, Experiment, In silico experiment
  • 3. Benefits of workflows Time savings •Copy & paste fragments of workflows Teaching •Reduce the learning curve of new students Visualization •Simplify workflows Design for modularity •Highlight the most relevant steps on a workflow Design for standardization Debugging •Provenance exploration Reproducibility and inspectability
  • 4. Motivation of this work Workflow Repositories Workflow Systems Let’s Share! I want to reuse…? I want to understand…? I want to repurpose…?
  • 5. Open research challenges •Workflow representation heterogeneity Workflow Repositories How can we represent a description of workflows and their metadata? How can we facilitate the homogeneous consumption of workflows and their resources?
  • 6. Open research challenges •Workflow representation heterogeneity •Inadequate level of workflow abstraction What are the most relevant parts of a workflow? Are two seemingly disparate workflows related at a higher level of abstraction? [Figure: three workflows. Dataset → Porter Stemmer → Result → IDF → Final Result; Dataset → Lovins Stemmer → Result → Residual IDF → Final Result; Dataset → Stemmer → Result → Term Weighting → FinalResult]
  • 7. Open research challenges •Workflow representation heterogeneity •Inadequate level of workflow abstraction •Difficulties for workflow reuse How is a workflow related to other workflows? Which workflow (parts) are potentially useful for reuse?
  • 8. Open research challenges •Workflow representation heterogeneity •Inadequate level of workflow abstraction •Difficulties for workflow reuse •Lack of support for workflow annotation How can we facilitate the annotation process?
  • 9. Outline 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  • 10. Hypothesis: Scientific workflow repositories can be automatically analyzed to extract commonly occurring patterns and abstractions that are useful for workflow developers aiming to reuse existing workflows. •H.1: It is possible to define a catalog of common domain independent patterns based on the common functionality of workflow steps. •H.2: It is possible to detect commonly occurring patterns and abstractions automatically. •H.3: Commonly occurring patterns are potentially useful for users designing workflows. [Figure labels: Workflow abstraction, Workflow representation, Workflow reuse, Workflow annotation, Workflow reuse]
  • 11. Contributions. Workflow representation and publication: model for representing workflow templates and executions; methodology to publish workflows on the web. Workflow abstraction: a catalog of common domain independent workflow patterns based on the functionality of workflow steps; a method to extract generic commonly occurring workflow fragments automatically. Workflow annotation: a model and means for annotating semi-automatically the abstractions in workflows. Workflow reuse: metrics for assessing the usefulness of a fragment for reuse; a model to describe and annotate workflow fragments. [Figure labels: OPMW, Linked Data, Wf-motifs, Wf-fd, Workflow motifs, Graph mining]
  • 12. Outline 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows a) Requirements b) The OPMW model c) Publishing workflows as Linked Data 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  • 13. Workflow representation: Structures interchanged in the workflow lifecycle. [Figure: Workflow Template (Design): Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult. Workflow Instance (Instantiation): File: Dataset123 → LovinsStemmer algorithm → Id:resultaa1 → IDF algorithm → Id:fresultaa2; File: Dataset124 → PorterStemmer algorithm → Id:resultaa1 → IDF algorithm → Id:fresultaa2. Workflow Execution Trace (Execution): several runs of each instance, e.g. File: Dataset123 → LovinsStemmer execution → Id:resultaa1 → IDF execution → Id:fresultaa2; File: Dataset124 → PorterStemmer execution → Id:resultaa1 → IDF execution → Id:fresultaa2]
  • 14. Requirements. Workflow template description: Plan: P-Plan [Garijo et al 2012] http://purl.org/net/p-plan. Workflow execution trace description: Provenance: PROV (W3C) [Lebo et al 2013] http://www.w3.org/ns/prov#. Workflow attribution: Dublin Core, PROV (W3C). Workflow metadata. Link between templates and executions. [Figure labels: Scufl, DAX, AGWL, Dispel, IWIR, OPM, OBI, EXPO, ISA, PAV, RO, D-PROV; [Ciccarese et al 2013] [Moreau et al 2011] [Brinkman et al 2010] [Soldatova and King 2006] [Rocca et al 2008] [Belhajjame et al 2012] [Missier et al 2013] [Oinn et al 2004] [Fahringer et al 2005] [Atkinson et al 2013] [Plankensteiner et al 2005]]
  • 15. OPMW: Extending provenance standards and plan models. http://www.opmw.org/ontology/ [Figure: P-Plan extension: opmw:WorkflowTemplate (p-plan:Plan), opmw:WorkflowTemplateProcess (p-plan:Step), opmw:WorkflowTemplateArtifact (p-plan:Variable). PROV, OPM extension: opmw:WorkflowExecutionProcess (prov:Activity, opmv:Process), opmw:WorkflowExecutionArtifact (prov:Entity, opmv:Artifact), opmw:WorkflowExecutionAccount (prov:Bundle, opmo:Account). Object properties: opmw:isVariableOfTemplate, opmw:isStepOfTemplate, p-plan:hasInputVar, p-plan:isOutputVarOf, opmw:correspondsToTemplate, opmw:correspondsToTemplateArtifact, opmw:correspondsToTemplateProcess, prov:used, prov:wasGeneratedBy, opmo:account. Example instances: template1 (Input Dataset, Term Weighting, Topics) and execution1 (File: Dataset123, IDF (java), File: FResultaa2). Legend: Class, Object property, Instance, Instance of, Subclass of]
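The link between a template and one of its executions can be sketched with plain subject-predicate-object tuples. This is only an illustration: the property names come from the slide, while the `ex:` identifiers, the direction of each statement as read from the flattened diagram, and the helper functions are assumptions made for the example, not part of OPMW itself.

```python
# Hypothetical triples mirroring the template1 / execution1 example.
# The "ex:" prefix and all helper names are illustrative assumptions.
TRIPLES = [
    ("ex:InputDataset", "opmw:isVariableOfTemplate", "ex:template1"),
    ("ex:Topics", "opmw:isVariableOfTemplate", "ex:template1"),
    ("ex:TermWeighting", "opmw:isStepOfTemplate", "ex:template1"),
    ("ex:TermWeighting", "p-plan:hasInputVar", "ex:InputDataset"),
    ("ex:Topics", "p-plan:isOutputVarOf", "ex:TermWeighting"),
    ("ex:execution1", "opmw:correspondsToTemplate", "ex:template1"),
    ("ex:IDF", "opmw:correspondsToTemplateProcess", "ex:TermWeighting"),
    ("ex:Dataset123", "opmw:correspondsToTemplateArtifact", "ex:InputDataset"),
    ("ex:FResultaa2", "opmw:correspondsToTemplateArtifact", "ex:Topics"),
    ("ex:IDF", "prov:used", "ex:Dataset123"),
    ("ex:FResultaa2", "prov:wasGeneratedBy", "ex:IDF"),
]


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def template_step_for(execution_process):
    """Follow opmw:correspondsToTemplateProcess from an execution step."""
    steps = objects(execution_process, "opmw:correspondsToTemplateProcess")
    return steps[0] if steps else None


print(template_step_for("ex:IDF"))
print(objects("ex:IDF", "prov:used"))
```

A real export would use an RDF library and full URIs; tuples are used here only to keep the sketch dependency-free.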
  • 16. Outline 1. Introduction and motivation 2. Hypothesis and work methodology 3. Workflow representation: OPMW a) Requirements b) The OPMW model c) Publishing workflows as Linked Data 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  • 17. Publishing workflows as Linked Data. Why Linked Data? •Facilitates exploitation of workflow resources in a homogeneous manner. Adapted methodology from [Villazón-Terrazas et al 2011]. Tested it for the Wings workflow system. Step 1, Specification: Base URI = http://www.opmw.org/ Ontology URI = http://www.opmw.org/ontology/ Assertion URI = http://www.opmw.org/export/resource/ClassName/instanceName Examples: http://www.opmw.org/export/resource/WorkflowTemplate/ABSTRACTSUBWFDOCKING http://www.opmw.org/export/resource/WorkflowExecutionAccount/ACCOUNT1348629350796
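The URI pattern from the specification step can be sketched as a small helper. The function name and the percent-encoding of each path segment are assumptions added for illustration; only the base URIs and the `ClassName/instanceName` pattern come from the slide.

```python
from urllib.parse import quote

BASE_URI = "http://www.opmw.org/"
ONTOLOGY_URI = BASE_URI + "ontology/"
RESOURCE_URI = BASE_URI + "export/resource/"


def assertion_uri(class_name: str, instance_name: str) -> str:
    """Build a resource URI following the ClassName/instanceName pattern."""
    # quote() percent-encodes characters that are unsafe in a path segment
    # (an assumption; the slide does not say how such names are handled).
    return (
        f"{RESOURCE_URI}{quote(class_name, safe='')}/"
        f"{quote(instance_name, safe='')}"
    )


print(assertion_uri("WorkflowTemplate", "ABSTRACTSUBWFDOCKING"))
```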
  • 18. Publishing workflows as Linked Data. Step 2, Modeling: OPMW, P-Plan, OPM, DC, PROV
  • 19. Publishing workflows as Linked Data. Step 3, Generation: Workflow system, Workflow Template, Workflow execution, OPMW export, OPMW RDF
  • 20. Publishing workflows as Linked Data. Step 4, Publication: OPMW RDF, RDF Upload Interface, RDF Triple store, Permanent web-accessible file store, SPARQL Endpoint
  • 21. Publishing workflows as Linked Data. Step 5, Exploitation: Curl, Linked Data Browser, Workflow Explorer, SPARQL endpoint
  • 22. Outline 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse a) A catalog of common workflow abstractions b) Workflow reuse analysis 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  • 23. A catalog of common workflow abstractions Generalization of workflow steps based on functionality. Workflow motif: Domain independent conceptual abstraction on the workflow steps. 1. Data-oriented motifs: What kind of manipulations does the workflow have? •E.g.: •Data retrieval •Data preparation •Data curation •Data visualization •etc.
  • 24. A catalog of common workflow abstractions Generalization of workflow steps based on functionality. Workflow motif: Domain independent conceptual abstraction on the workflow steps. 1. Data-oriented motifs: What kind of manipulations does the workflow have? •E.g.: •Data retrieval •Data preparation •etc. 2. Workflow-oriented motifs: How does the workflow perform its operations? •E.g.: •Stateful steps •Stateless steps •Human interactions •etc.
  • 25. Methodology for finding workflow motifs Goal: Reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence. Collect workflows: 89 + 125 + 26 + 20 = 260 workflows
  • 26. Methodology for finding workflow motifs. Preliminary workflow analysis: Researcher 1, Researcher 2, Researcher 3
  • 27. Methodology for finding workflow motifs. Agreement and cross validation
  • 28. Result Summary •Over 60% of the motifs are data preparation motifs •Some differences are motivated by the workflow systems in the analysis •Around 40% of workflows contain motifs related to workflow reuse (composite workflows, internal macros) But how do users perceive workflow reuse? What about fragments of workflows?
  • 29. Outline 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse a) A catalog of common workflow abstractions b) Workflow reuse survey 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  • 30. Use case: The LONI Pipeline. Workflow system for neuroimaging analysis http://pipeline.loni.usc.edu/explore/library-navigator/ Discussions with scientists; User survey; Collect responses from users (21 responses); Discuss results
  • 31. Summary results. The majority of users agree that reusing and sharing workflows is useful. Unlike workflows, reusing groupings from one’s own work is more useful than reusing groupings from others. Most respondents agreed that groupings help simplify workflows. Groupings also make workflows more understandable by others. Can we detect groupings automatically?
  • 32. Outline 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques a) Corpus preparation b) Graph mining c) Fragment filtering d) Fragment linking 6. Evaluation 7. Conclusions and future work
  • 33. Workflow mining approaches. Clustering [Montani and Leonardi 2012], [García-Jiménez and Wilkinson, 2014] [Figure: Workflow corpus split into Cluster 1, Cluster 2, Cluster 3]
  • 34. Workflow mining approaches. Topic modeling [Stoyanovich et al 2010] [Figure: Topic 1 and Topic 2, with per-workflow probabilities P(Topic1) = 0.7, P(Topic2) = 0.3; P(Topic1) = 0.5, P(Topic2) = 0.5; P(Topic1) = 0.2, P(Topic2) = 0.8; …]
  • 35. Workflow mining approaches. Case-based reasoning [Leake and Kendall-Morwick 2008], [Müller and Bergmann 2014] [Figure: Workflow corpus]
  • 36. Workflow mining approaches. Log mining [van der Aalst et al 2003], [Gómez-Pérez and Corcho, 2008] [Figure: Workflow corpus, PSM]
  • 37. Workflow mining approaches. Clustering [Montani and Leonardi 2012], [García-Jiménez and Wilkinson, 2014]. Topic modeling [Stoyanovich et al 2010]. Case-based reasoning [Leake and Kendall-Morwick 2008], [Müller and Bergmann 2014]. Log mining [van der Aalst et al 2003], [Gómez-Pérez and Corcho, 2008]. Graph mining [Diamantini et al., 2012]
  • 38. Workflow Mining in FragFlow [Figure: pipeline with four numbered steps, 1 to 4]
  • 39. Corpus Preparation. Workflows converted to Labeled Directed Acyclic Graphs (LDAG) •The label of a node in the graph corresponds to the type of the step in the workflow •Edges capture the dependencies between different steps. Duplicated workflows are removed. Single-step workflows are removed. [Figure: Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult is converted to Stemmer algorithm → Term weighting algorithm]
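The corpus preparation step can be sketched as follows. The input format (a dict with `steps` and `depends`), the function names and the label-based duplicate check are all assumptions made for this example; in particular, comparing label counts is only an approximation of graph equality and is not claimed to be what FragFlow does.

```python
from collections import Counter


def to_ldag(workflow):
    """Return (labels, edges): step id -> step type, and step-to-step edges."""
    labels = dict(workflow["steps"])
    edges = sorted(set(workflow["depends"]))  # (upstream id, downstream id)
    return labels, edges


def signature(ldag):
    """Label-based fingerprint used to spot duplicates (an approximation)."""
    labels, edges = ldag
    node_part = tuple(sorted(Counter(labels.values()).items()))
    edge_part = tuple(
        sorted(Counter((labels[a], labels[b]) for a, b in edges).items())
    )
    return node_part, edge_part


def prepare_corpus(workflows):
    seen, corpus = set(), []
    for wf in workflows:
        ldag = to_ldag(wf)
        if len(ldag[0]) < 2:  # single-step workflows are removed
            continue
        sig = signature(ldag)
        if sig in seen:  # duplicated workflows are removed
            continue
        seen.add(sig)
        corpus.append(ldag)
    return corpus


wf_a = {"steps": {"s1": "Stemmer", "s2": "TermWeighting"}, "depends": [("s1", "s2")]}
wf_b = {"steps": {"a": "Stemmer", "b": "TermWeighting"}, "depends": [("a", "b")]}
wf_c = {"steps": {"x": "Sort"}, "depends": []}

print(len(prepare_corpus([wf_a, wf_b, wf_c])))
```

Here `wf_b` is dropped as a duplicate of `wf_a` and `wf_c` is dropped for having a single step.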
  • 40. Graph Mining. We use popular graph mining techniques. Inexact FSM (frequent subgraph mining): usage of heuristics to calculate similarity between two graphs. The solution might not be complete. SUBDUE: 2 heuristics, Minimum Description Length (MDL) and Size. Exact FSM: delivers all the possible fragments to be found in the dataset. gSpan: depth-first search strategy. FSG: breadth-first search strategy
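To make the notion of a frequent fragment concrete, the sketch below counts the support of one-edge patterns, that is, the number of graphs in which a labelled edge appears. It is not gSpan, FSG or SUBDUE; it only illustrates the kind of support counting that exact FSM relies on, and the corpus format and names are assumptions for the example.

```python
from collections import Counter


def frequent_edges(corpus, min_support):
    """Return labelled edges that occur in at least min_support graphs."""
    support = Counter()
    for labels, edges in corpus:
        # A set is used so each pattern counts once per graph.
        patterns = {(labels[a], labels[b]) for a, b in edges}
        support.update(patterns)
    return {p: n for p, n in support.items() if n >= min_support}


corpus = [
    ({"1": "Stemmer", "2": "TermWeighting"}, [("1", "2")]),
    (
        {"1": "Stemmer", "2": "TermWeighting", "3": "Filter"},
        [("1", "2"), ("2", "3")],
    ),
    ({"1": "Filter", "2": "Sort"}, [("1", "2")]),
]

print(frequent_edges(corpus, min_support=2))
```

Full FSM algorithms grow such frequent patterns into larger subgraphs, differing mainly in how they enumerate candidates.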
  • 41. Filtering Relevant Fragments. The number of resulting fragments can be very large. We distinguish: Multistep fragments: more than one step. Filtered multistep fragments: multistep fragments that contain all smaller fragments with the same number of occurrences. [Figure: example fragments F1 to F4 over the steps Stemmer, Term Weighting, Filter, Sort and Query, with occurrence counts listed as 4, 4, 10 and 3]
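One possible reading of the filtering rule is sketched below: a multistep fragment is dropped when a larger fragment with the same number of occurrences contains it. This is an interpretation of the slide, not a confirmed definition. Fragments are simplified to linear chains of step labels, and the fragment contents and counts are hypothetical values loosely based on the figure.

```python
def contains(big, small):
    """True if the chain `small` appears as a contiguous run inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def filter_fragments(fragments):
    """fragments: dict name -> (chain of step labels, occurrence count)."""
    multistep = {k: v for k, v in fragments.items() if len(v[0]) > 1}
    kept = {}
    for name, (chain, count) in multistep.items():
        subsumed = any(
            other != name
            and len(big) > len(chain)
            and c == count
            and contains(big, chain)
            for other, (big, c) in multistep.items()
        )
        if not subsumed:
            kept[name] = (chain, count)
    return kept


# Hypothetical fragments; the pairing of counts to fragments is assumed.
fragments = {
    "F1": (("Stemmer", "TermWeighting"), 4),
    "F2": (("Stemmer", "TermWeighting", "Filter"), 4),
    "F3": (("Filter", "Sort"), 10),
    "F4": (("Filter", "Sort", "Query"), 3),
}

print(sorted(filter_fragments(fragments)))
```

Under these assumptions F1 is dropped because F2 contains it with the same count, while F3 is kept because F4 occurs a different number of times.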
  • 42. Linking to the Corpus: Example. Workflow fragment description vocabulary: http://purl.org/net/wf-fd (extends P-Plan). [Figure: Workflow 1 with steps Stemmer, Term Weighting, Stemmer, Term Weighting, Merge; Fragment1 with steps Stemmer, Term Weighting; Fragment1inWf1(1) and Fragment1inWf1(2). Classes: p-plan:Step, wffd:TiedWorkflowFragment, wffd:DetectedResultWorkflowFragment. Properties: wffd:foundAs, wffd:foundIn, p-plan:isPrecededBy, p-plan:isStepOfPlan]
  • 43. Outline 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation a) Finding generic motifs in workflows b) Workflow fragment assessment 7. Conclusions and future work
  • 44. Finding generic motifs in workflows. Research question: Can we find commonly occurring abstractions? (composite workflows, internal macros)
  • 45. Finding generic motifs in workflows. Metrics used: precision and recall. Fragments (F), Annotated motifs (M)
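The slide does not spell out the formulas, so the sketch below assumes the standard set-based definitions: precision is |F ∩ M| / |F| and recall is |F ∩ M| / |M|. Matching fragments to motifs by exact equality of identifiers is a further simplification, and the sample values are hypothetical.

```python
def precision_recall(fragments, motifs):
    """Standard set-based precision and recall of fragments against motifs."""
    fragments, motifs = set(fragments), set(motifs)
    hits = fragments & motifs
    precision = len(hits) / len(fragments) if fragments else 0.0
    recall = len(hits) / len(motifs) if motifs else 0.0
    return precision, recall


# Hypothetical identifiers: four detected fragments, three annotated motifs.
p, r = precision_recall({"A", "B", "C", "D"}, {"A", "B", "E"})
print(p, r)
```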
  • 46. Finding generic motifs in workflows. Corpus: 22 templates from the same domain annotated manually. Wings workflow corpus + domain knowledge. [Figure: workflows Dataset → Porter Stemmer → Result → IDF → Final Result and Dataset → Lovins Stemmer → Result → Residual IDF → Final Result, generalized to Dataset → Stemmer → Result → Term Weighting → FinalResult. Component taxonomy: Stemmer, Porter Stemmer, Lovins Stemmer, Term Weighting, Inverse Document Frequency (IDF), Residual IDF, Query Term Weighting]
  • 47. Finding generic motifs in workflows. Results of the evaluation: Can we find commonly occurring abstractions? Internal macros: inexact FSM: 2 out of 3 found (r=0.67); 4 out of 5 (r=0.8) when applying generalization. Composite workflows: exact FSM: all motifs are found, although the precision is low (p=0.18). H.2: It is possible to detect commonly occurring patterns and abstractions automatically.
  • 48. Outline 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation a) Finding generic motifs in workflows b) Workflow fragment assessment 7. Conclusions and future work
  • 49. Workflow fragment assessment. Research question: Are our proposed workflow fragments useful? •A fragment is useful if it has been designed and (re)used by a user. •Comparison between proposed fragments and user designed groupings and workflows
  • 50. Workflow fragment assessment. Metrics: precision and recall. Fragments (F), Workflows (W), Groupings (G)
  • 51. Workflow fragment assessment. Workflow corpora. User Corpus 1 (WC1): •Designed mostly by a single user •790 workflows (475 after data preparation). User Corpus 2 (WC2): •Created by a user, with collaborations of others •113 workflows (96 after data preparation). Multi User Corpus 3 (WC3): •Workflows submitted by 62 users during the month of Jan 2014 •5859 workflows (357 after data preparation). User Corpus 4 (WC4): •Designed mostly by a single user •53 workflows (50 after data preparation)
  • 52. Workflow fragment assessment. Result assessment •30%-60% of proposed fragments are equal to user defined groupings or workflows •40%-80% of proposed fragments are equal or similar to user defined groupings or workflows. H.3: Commonly occurring patterns are potentially useful for users designing workflows. What about the rest of the fragments? Are those useful?
  • 53. Workflow fragment assessment. User feedback: user survey. Q1: Would you consider the proposed fragment a valuable grouping? •I would not select it as a grouping (0) •I would use it as a grouping with major changes (i.e., adding/removing more than 30% of the steps) (1) •I would use it as a grouping with minor changes (i.e., adding/removing less than 30% of the steps) (2) •I would use it as a grouping as it is (3). Q2: What do you think about the complexity of the fragment? •The fragment is too simple (0) •The fragment is fine as it is (1) •The fragment has too many steps (2). Not enough evidence to state that all proposed workflow fragments are useful
  • 54. Outline 1. Introduction and motivation 2. Hypothesis and contributions 3. Workflow representation: Open Provenance Model for Workflows 4. Workflow abstraction and reuse 5. Mining abstractions from workflows using graph mining techniques 6. Evaluation 7. Conclusions and future work
  • 55. Conclusions: Results. H.1: It is possible to define a catalog of common domain independent patterns based on the common functionality of workflow steps. Model for representing workflows (OPMW) and publishing them as Linked Data. Catalog of workflow motifs + workflow annotation. Daniel Garijo and Yolanda Gil. A new approach for publishing workflows: Abstractions, standards, and Linked Data. (WORKS'11). Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis (extended version). Future Generation Computer Systems. 2013. Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis. 8th IEEE International Conference on e-Science (eScience 2012). H.2: It is possible to detect commonly occurring patterns and abstractions automatically. Graph mining approach + workflow generalization. Daniel Garijo, Oscar Corcho and Yolanda Gil. Detecting common scientific workflow fragments using templates and execution provenance. Proceedings of the seventh international conference on Knowledge capture (K-CAP 2013).
  • 56. Conclusions: Results
      H.3: Commonly occurring patterns are potentially useful for users designing workflows.
        • Graph mining approach + reusability metrics for assessment + workflow annotation
        • Reuse survey
      Publications:
        • Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris Gutman, Ivo D. Dinov, Paul Thompson and Arthur W. Toga. FragFlow: Automated fragment detection in scientific workflows. 10th IEEE Conference on e-Science (eScience 2014)
        • Daniel Garijo, Oscar Corcho, Yolanda Gil, Meredith N. Braskie, Dereck Hibar, Xie Hua, Neda Jahanshad, Paul Thompson and Arthur W. Toga. Workflow reuse in practice: A study of neuroimaging pipeline users. 10th IEEE Conference on e-Science (eScience 2014)
  • 57. Conclusions: Impact and future work
      Impact:
        • OPMW: workflow annotation [García-Jiménez and Wilkinson 2014b]
        • Motif catalog: expansion for distributed environments [Olabarriaga et al 2013]; workflow summarization [Alper et al 2013]
      Future work:
        • Towards workflow ecosystems [Garijo et al 2014] (WORKS’14)
  • 58. Conclusions: Impact and future work
        • Automatic detection of workflow abstractions
        • Improvement of workflow reuse (custom fragments; ranking fragments; suggestions of workflows)
  • 59. Date: 03/12/2015 Mining Abstractions in Scientific Workflows Daniel Garijo * Supervisors: Oscar Corcho *, Yolanda Gil Ŧ * Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute All materials are available as Research Objects (with pointers to Figshare) http://w3id.org/dgarijo/ro/mining-abstractions-in-scientific-wfs
  • 60. Supporting material
  • 61. Methodology Workflow representation and publication Approach Workflow abstraction and reuse Empirical analysis of workflow corpora Problem Evaluation Requirement validation and user feedback Model Competency question validation Provenance Plan Publication Methodology for publication Extension of existing standards and web technologies Workflow abstraction analysis for reuse Agreement on a catalog of common abstractions Automatic detection and annotation of workflow abstractions Graph mining techniques, generalization Precision, recall and user feedback
  • 62. Provenance Models: “A record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing” (PROV-DM: The PROV Data Model, W3C)
  • 63. [Figure: P-Plan vocabulary. Classes: p-plan:Plan, p-plan:Step, p-plan:Variable, p-plan:Activity, p-plan:Entity, p-plan:Bundle. Extended PROV classes: prov:Plan, prov:Activity, prov:Entity, prov:Bundle. Object properties: p-plan:isStepOfPlan, p-plan:isVariableOfPlan, p-plan:hasInputVar, p-plan:isOutputVarOf, p-plan:isPrecededBy, p-plan:correspondsToStep, p-plan:correspondsToVariable, prov:used, prov:wasGeneratedBy. Legend: class, object property, subclass of, statements contained in a p-plan:Bundle.] Author's note: replace this slide with a methodological one.
  • 64. Assumptions and restrictions
      Restriction:
        • Workflows are represented as directed acyclic graphs
      Assumptions:
        • Available workflow repositories exist for exploiting definitions of workflows and workflow executions.
        • All the workflow steps can be assigned a label with their type.
        • Two steps of a workflow with the same function have the same type.
        • Researchers aim to reuse workflows and workflow fragments if they find them useful.
  • 65. Other models for representing workflow instances, templates and executions
  • 66. Publishing as LD •Maybe paste here an example instead of the big picture
  • 67. Data-Oriented Motifs: Data Retrieval; Data Preparation (Format Transformation; Input Augmentation and Output Splitting; Data Organisation); Data Analysis; Data Curation/Cleaning; Data Movement; Data Visualisation
  • 73. Workflow-Oriented Motifs: Intra-Workflow Motifs (Stateful (Asynchronous) Invocations; Stateless (Synchronous) Invocations; Internal Macros; Human Interactions); Inter-Workflow Motifs (Atomic Workflows; Composite Workflows; Workflow Overloading)
  • 78. Result Summary: Data Oriented Motifs •Over 60% of the motifs are data preparation motifs •Some differences are motivated by the workflow systems in the analysis •Data analysis is often the main functionality of the workflow
  • 79. Result Summary: Workflow Oriented Motifs •Around 40% composite workflows and internal macros. But how do users perceive workflow reuse? •What about fragments of workflows?
  • 80. Differences and commonalities of the workflow systems •Data moving/retrieval, stateful interactions and human interaction steps are not present in Wings •Web services (Taverna) versus software components (Wings) •Wings has layered execution through Pegasus •Data preparation steps are common in both systems •Use of sub workflows is high
  • 81. Reusing workflows… According to the respondents, the major benefits of workflows include: •Time savings •Organizing and storing code •Having a visualization of the overall analysis •Facilitating reproducibility
  • 82. Reusing groupings… •Reuse is not the only reason why groupings are created. Unlike workflows, reusing groupings from one’s own work is more useful than reusing groupings from others •Most respondents agreed that groupings help simplify workflows. Groupings also make workflows more understandable by others
  • 83. Graph Mining. We use popular graph mining techniques:
      Inexact frequent subgraph mining (FSM): uses heuristics to calculate the similarity between two graphs. The solution might not be complete.
        • SUBDUE: 2 heuristics, Minimum Description Length (MDL) and Size; frequency based
      Exact FSM: delivers all the possible fragments to be found in the dataset.
        • gSpan: depth-first search strategy; support based
        • FSG: breadth-first search strategy; support based
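The distinction between support-based and frequency-based counting on the graph mining slide can be illustrated with a minimal sketch. It assumes a toy corpus in which each workflow is a list of edges between step types and the candidate fragment is a single edge; the corpus, the step names and the functions `support` and `frequency` are hypothetical, and the code is not an implementation of gSpan, FSG or SUBDUE.

```python
from collections import Counter

# Hypothetical toy corpus: each workflow is a list of
# (source_step_type, target_step_type) edges.
corpus = {
    "wf1": [("Stemmer", "TermWeighting"),
            ("Stemmer", "TermWeighting"),
            ("TermWeighting", "Merge")],
    "wf2": [("Stemmer", "TermWeighting")],
    "wf3": [("Tokenizer", "Stemmer")],
}

def support(fragment, workflows):
    """Support: number of workflows containing the fragment at least once."""
    return sum(1 for edges in workflows.values() if fragment in edges)

def frequency(fragment, workflows):
    """Frequency: total number of occurrences across all workflows."""
    return sum(Counter(edges)[fragment] for edges in workflows.values())

fragment = ("Stemmer", "TermWeighting")
print(support(fragment, corpus))    # 2 (present in wf1 and wf2)
print(frequency(fragment, corpus))  # 3 (twice in wf1, once in wf2)
```

A fragment repeated many times inside one workflow scores high on frequency but not on support, which is why the two criteria can rank candidate fragments differently.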
  • 84. Linking to the Corpus: Workflow fragment description vocabulary
  • 85. Workflow fragment assessment: Summary of results
  • 86. Conclusions: Limitations
      L1: OPMW has been designed for data-intensive workflows (without loops or conditionals)
      L2: When publishing as Linked Data, it is assumed that all resources will be made public (no privacy issues)
      L3: Motif catalog may be expanded with additional motifs
      L4: Size and time needed to calculate some workflow fragments
      L5: A taxonomy of components is needed when generalizing workflows. This taxonomy is provided by domain experts modeling the domain.
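Limitation L1 and the restriction that workflows are directed acyclic graphs can be checked mechanically. The following is a minimal sketch using Kahn's topological-sort algorithm; the function name `is_dag` and the example step names are assumptions made for illustration and are not part of OPMW or of any workflow system.

```python
from collections import deque

def is_dag(steps, edges):
    """Return True if the workflow graph has no cycles (Kahn's algorithm).

    steps: iterable of step identifiers (hypothetical names).
    edges: iterable of (producer_step, consumer_step) dependencies.
    """
    steps = list(steps)
    indegree = {s: 0 for s in steps}
    successors = {s: [] for s in steps}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Start from steps that depend on no other step.
    queue = deque(s for s in steps if indegree[s] == 0)
    visited = 0
    while queue:
        step = queue.popleft()
        visited += 1
        for nxt in successors[step]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Every step is visited only when no cycle exists.
    return visited == len(steps)

linear = is_dag(["Stemmer", "TermWeighting", "Merge"],
                [("Stemmer", "TermWeighting"), ("TermWeighting", "Merge")])
looped = is_dag(["A", "B"], [("A", "B"), ("B", "A")])
print(linear, looped)  # True False
```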

Editor's Notes

  1. Data driven, usually represented as Directed Acyclic Graphs (DAGs)
  2. These are the points discussed with scientists, not the results of the user survey. Sharing workflows with collaborators: Non-programmers find a barrier to running complex neuroimaging analyses, as they cannot create components or code at that level of complexity. Reusing workflows that others have created enables them to do tasks that they would not otherwise do. Teaching: Breakpoints are often placed throughout the pipeline to serve as checkpoints and make sure that execution was performed correctly. Visualization: The hierarchical organization can be used to group functionally related tasks into a single visual element. This allows workflow developers to group complex tasks with highly fragmented code into a single visual unit that other users can incorporate into their workflows. Modularity: Workflows provide a high-level view of the major steps involved in an analysis, and exposing those major steps drives the design of the code in a modular fashion.
  3. Representation: different types of workflows use different types of representation. Also, we miss the links to the resources associated to the workflow itself. Reuse: workflow reused as part of other workflows. How? Abstractions: are two seemingly disparate workflows related to each other?
  4. Workflow template and instance: steps and their dependencies Workflow execution trace: provenance of the results Experiment metadata: specific methods, author contribution, etc.
  5. P-Plan is simple and extensible (to cater to cases that require more complex wf operators) Say that P-Plan has been used for describing scientific processes in social sciences and lab protocols
  6. State that the focus is workflow description
  7. Example of motif goes here on each side, instead of the big HOW and WHAT. In order to improve understandability, we decided as a first step to identify the common operations in scientific workflows by doing an empirical analysis over different domains. There is existing work on this, but it mainly tackles the structure of the workflow rather than the operation being performed. Thus, our approach has been to start without any initial definitions of motifs to find. Instead, we reverse-engineered the different steps in the workflows, trying to create clusters with the most common motifs.
  9. Corpus collection; preliminary analysis of workflows; discuss catalog of motifs; find motifs in workflows; cross-validate annotations; discussion until agreement.
  12. Workflow reuse. It is very important, not because we say so, but because we have seen it in many of the workflows. Almost 40% of the workflows include some other workflow. And when they do not, they are very similar in many cases (we have only matched the exact available workflows). Internal macros, for instance, show how different parts of workflows repeat, which could lead to new workflow templates as well.
  13. Explain some of the features of LONI. Grey circles are inputs, triangles are outputs and blue circles are components. The dotted rectangles are groupings. Explain what a grouping is and what it is for.
  14. In general, workflows are considered more useful than groupings. On the other hand, more respondents said that groupings help make their code more modular and understandable.
  15. Can we automatically mine a repository of workflows to derive useful workflow fragments?
  16. State that graph mining has only been tested recently. Clustering and topic modeling: good for stating similar workflows, but not necessarily for mining common workflow fragments. Log mining: good for suggesting next steps, but bad for stating relationships among workflows. Case-based reasoning: it is used mostly for prediction.
  21. Overview of the steps here. Say clearly that
  22. Here explain what it means to capture a dependency between 2 steps: a data product produced by the former is consumed by the latter. Duplicated workflows are removed because, if we have for example 500 workflows that are the same, the common fragments we find will be those repeated workflows themselves.
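The two preprocessing ideas in the note above can be sketched as follows, under stated assumptions: each step is described by the sets of data products it produces and consumes, and two workflows count as duplicates when their sets of step-type edges are identical. The function names and sample data are hypothetical, and the real duplicate criterion used in the thesis may differ.

```python
def step_dependencies(produces, consumes):
    """Derive step-to-step edges: A -> B when B consumes a product of A.

    produces / consumes: dict mapping a step to a set of data products.
    """
    edges = set()
    for producer, outputs in produces.items():
        for consumer, inputs in consumes.items():
            if producer != consumer and outputs & inputs:
                edges.add((producer, consumer))
    return edges

def remove_duplicates(workflows):
    """Keep only the first workflow for each identical set of edges."""
    seen = set()
    unique = {}
    for name, edges in workflows.items():
        key = frozenset(edges)
        if key not in seen:
            seen.add(key)
            unique[name] = edges
    return unique

produces = {"Stemmer": {"stemmed_text"}, "TermWeighting": {"weights"}}
consumes = {"Stemmer": {"dataset"}, "TermWeighting": {"stemmed_text"}}
edges = step_dependencies(produces, consumes)
print(edges)  # {('Stemmer', 'TermWeighting')}

workflows = {"a": edges, "b": set(edges), "c": {("Tokenizer", "Stemmer")}}
print(sorted(remove_duplicates(workflows)))  # ['a', 'c']
```

Without the deduplication step, a workflow uploaded many times would dominate the counts and be reported as its own most common fragment.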
  23. Explain in detail support based versus frequency based techniques! Explain DFS versus BFS strategies! (don’t go into too much detail).
  24. The number of fragments can be up to millions when the common parts are of size >10.
  25. Mention that this is done by issuing SPARQL queries, like the one in the figure, to link the fragments to the corpus.
  26. This indicates that our fragments are commonly occurring and generic. It also indicates that the rest of the fragments could be useful for users.
  27. High Recall-> expected value
  31. State the expected value!! (High precision)
  35. Workflow ecosystems: most of the work goes towards making workflows executable in other places, while forgetting about all the other applications that use and consume the workflows at different granularities. Workflow abstractions: being able to generate a domain taxonomy automatically, and automatically detecting some of the remaining motifs. Improvement of workflow reuse: by proposing rankings, improving the interfaces and, in general, directly exploiting all the data that we can discover with the approach proposed in this thesis.
  37. Genomics workflow - Using Biomart and EMBOSS services, this workflow retrieves a number of sequences from 3 species (mouse, human and rat), aligns them through multiple sequence alignment, and returns a plot of the alignment result. Corresponding sequence ids are also returned.
  38. Heliophysics workflow (we counted it as astronomy) - This is a fragment of a workflow that uses several input augmentation motifs in order to create a query to be sent to the Helio Feature Catalog service to retrieve the active regions on the solar surface for a given period of time.
  39. Describe briefly each of the data oriented motifs
  40. This workflow calculates QSAR (Quantitative structure–activity relationship) properties of a compound and saves them as a CSV file. The molecules are read iteratively from an SDF file. Additionally, it writes out the molecules with unknown atom types, salt counter ions, the curated molecule library with UUIDs and the calculation time used by every QSAR descriptor as a CSV file. Furthermore, explicit hydrogens are added and a Hueckel aromaticity detection is performed.
  41. Cheminformatics workflow  - It curates the structural information regarding a compound that is provided in Structure Data Format (SDF) file format.  This workflow generates atom signatures for individual compounds given the SDF file as input
  43. This workflow performs an NCBI blast at the EBI. It uses the new EBI services, which are asynchronous and require multiple invocations, repeatedly invoking the getStatus sub-workflow until the blast job is complete. So the blast is actually undertaken by 3 calls: RUN + GET_STATUS + GET_RESULT.
  44. Explain briefly the workflow oriented motifs:
  45. Scientific workflows MSc course workflow - This workflow fetches the details of the countries in the world and then uses R to produce a histogram of the log of their population.
  47. Text-Analytics workflow- it is used for reading natural language text found within files with specified extensions in the specified directory
  48. The first fact that we discovered about the workflows is that over 60% of the motifs in each domain are data preparation motifs. In fact, input augmentation, output splitting and reformatting steps are the most common in most workflows. This is very important because it tells us how many intermediate processing steps are in the workflow. These steps are often not relevant for explaining the functionality of the workflow. Another relevant point is that between 10% and 15% of the motifs are data analysis motifs. This is very important, since data analysis is often the main step of the workflow, its main functionality. If it accounts for only 15% of the motifs, the workflow could have been much smaller.
  50. What we also noticed with this analysis is that these 2 workflow systems are essentially very similar. They share all the motifs except for data moving/retrieval (since Wings uses Pegasus and its infrastructure for that), stateful interactions (since Wings is oriented to use scripts and tools rather than web services) and human interaction steps. We noticed during the analysis that the typing of data often helps to avoid certain intermediate steps. Workflow reuse is high in both systems, as we stated previously.
  53. For Goal 1, we have to say that we also relaxed the first precision and recall to 80 percent to see if similar fragments were found as well. For Goal 3, there is nothing to quantify.
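One plausible reading of the relaxed precision and recall mentioned in the last note can be sketched as below. It is an assumption, not the thesis's definition: fragments and user-defined groupings are modeled as sets of step types, a fragment matches a grouping when their Jaccard overlap reaches a threshold (1.0 for exact matches, 0.8 for the relaxed case), and all names and sample data are hypothetical.

```python
def overlap(a, b):
    """Jaccard overlap between two sets of step types."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def precision_recall(fragments, groupings, threshold=1.0):
    """Precision: share of fragments matching some grouping.
    Recall: share of groupings matched by some fragment."""
    matched_fragments = sum(
        1 for f in fragments
        if any(overlap(f, g) >= threshold for g in groupings))
    matched_groupings = sum(
        1 for g in groupings
        if any(overlap(f, g) >= threshold for f in fragments))
    precision = matched_fragments / len(fragments) if fragments else 0.0
    recall = matched_groupings / len(groupings) if groupings else 0.0
    return precision, recall

# Hypothetical mined fragments and reference groupings.
fragments = [{"A", "B", "C", "D", "E"}, {"X", "Y"}]
groupings = [{"A", "B", "C", "D"}, {"P", "Q"}]

print(precision_recall(fragments, groupings))                 # (0.0, 0.0)
print(precision_recall(fragments, groupings, threshold=0.8))  # (0.5, 0.5)
```

In this toy case the first fragment differs from the first grouping by a single step, so it only counts as a match once the threshold is relaxed to 0.8.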