Date: 24/10/2014 
FragFlow: Automatic Fragment Detection in Scientific Workflows 
Daniel Garijo *, Oscar Corcho *, Yolanda Gil Ŧ, Boris A. Gutman ⱡ, Ivo D. Dinov ⱡ, Paul Thompson ⱡ and Arthur W. Toga ⱡ 
* Universidad Politécnica de Madrid, 
Ŧ USC Information Sciences Institute, 
ⱡ USC Laboratory of Neuroimaging
Overview 
•Detecting common groups of tasks in corpus of scientific workflows 
•Application of exact and inexact graph matching techniques 
•Filtering and linking results to the input corpus 
•Benefits: Discoverability, understandability, reuse, design, modularization, visualization 
[Figure labels: Lab book, Digital Log, Laboratory Protocol (recipe), Workflow, Experiment]
IEEE eScience 2014. Guarujá, Brasil
Background 
•Workflows are software artifacts that capture computational experiments 
•Addition to paper publication 
•Provenance of results 
•Reuse 
•Existing repositories of workflows (Galaxy, myExperiment, the LONI Pipeline, CrowdLabs, etc.) 
•Sharing workflows 
•Exploring existing workflows 
•PROBLEMS to address: 
•Workflows have many detailed steps and may be difficult to understand 
•The general method may not be apparent 
•How are different workflows related? 
•What steps do they have in common? 
Workflow Fragments and Groupings 
•Workflow Fragment: set of connected steps that are part of a workflow. 
•Common Workflow Fragment: fragments that occur more than once in a corpus of workflows 
•Grouping: Workflow fragment manually annotated by a user 
•Sub-Grouping: Grouping included as part of another grouping 
[Figure: Workflow 1, Workflow 2 and Workflow 3, built from steps labeled A to H, annotated with their common workflow fragments]
Our Goals 
Our goal is to automatically detect useful workflow fragments to be reused by scientists. In this work, given a workflow corpus… 
•Goal 1: Are automatically detected workflow fragments similar to user-defined groupings? 
•Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful? 
•Goal 3: How are workflows and groupings reused? 
The LONI Pipeline 
•Workflow system for neuroimaging analysis 
•Active community of users creating workflows 
•Enables users to define groupings in workflows 
•Has a corpus of published workflows 
•Has a library of (uniquely identified) components with well-defined functionality: http://pipeline.loni.usc.edu/explore/library-navigator/ 
Workflow Mining in FragFlow 
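[Figure: FragFlow applied to the workflow corpus in four numbered steps; the slides that follow cover corpus preparation, graph mining, filtering of relevant fragments, and linking to the corpus]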
Corpus Preparation 
Workflows converted to Labeled Directed Acyclic Graphs (LDAG) 
•The label of a node in the graph corresponds to the type of the step in the workflow 
•Edges capture the dependencies between different steps 
•Duplicated workflows are removed 
•Single-step workflows are removed 
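The preparation steps above can be sketched as follows. This is a minimal illustration, not FragFlow's actual code: the function names (`to_ldag`, `signature`, `prepare_corpus`) and the dictionary layout are assumptions made for the example, and duplicates are detected with a simple label-based fingerprint rather than a full graph isomorphism test.

```python
from collections import Counter


def to_ldag(steps, dependencies):
    """Build a labeled DAG (hypothetical layout for this sketch).

    steps: dict mapping step id -> step type (used as the node label)
    dependencies: iterable of (source id, target id) pairs
    """
    return {"labels": dict(steps), "edges": set(dependencies)}


def signature(graph):
    """Fingerprint that ignores step ids and keeps only labels.

    Assumption: two workflows with the same multiset of node labels and
    labeled edges are treated as duplicates (weaker than isomorphism).
    """
    labels = graph["labels"]
    node_part = tuple(sorted(Counter(labels.values()).items()))
    edge_part = tuple(
        sorted(Counter((labels[s], labels[t]) for s, t in graph["edges"]).items())
    )
    return node_part, edge_part


def prepare_corpus(graphs):
    """Drop single-step workflows and duplicated workflows."""
    seen, kept = set(), []
    for graph in graphs:
        if len(graph["labels"]) < 2:  # single-step workflow
            continue
        sig = signature(graph)
        if sig in seen:  # duplicated workflow
            continue
        seen.add(sig)
        kept.append(graph)
    return kept


# Toy corpus: w2 duplicates w1 under different step ids, w3 has one step.
w1 = to_ldag({1: "A", 2: "B"}, [(1, 2)])
w2 = to_ldag({"x": "A", "y": "B"}, [("x", "y")])
w3 = to_ldag({1: "A"}, [])
print(len(prepare_corpus([w1, w2, w3])))  # 1
```

Using the step type as the label is what lets the mining stage recognize the same step in different workflows, regardless of the ids assigned by the workflow system.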
Graph Mining 
We use popular graph mining techniques: 
•Inexact FGM: usage of heuristics to calculate similarity between two graphs. The solution might not be complete 
•SUBDUE 
•2 heuristics: Minimum Description Length (MDL) and Size 
•Frequency based 
•Exact FGM: delivers all the possible fragments to be found in the dataset. 
•gSpan 
•Depth first search strategy 
•Support based 
•FSG 
•Breadth first search strategy 
•Support based 
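To make "support based" concrete, the sketch below counts, for the smallest possible multistep fragments (a single labeled edge), the number of workflows in which each one appears. This is only the first level of a breadth-first search in the style of FSG, not an implementation of gSpan, FSG or SUBDUE; the function names and the toy corpus are assumptions for illustration.

```python
from collections import Counter


def edge_support(corpus):
    """Count in how many workflows each labeled edge occurs.

    Each workflow is a dict with "labels" (step id -> step type) and
    "edges" (set of (source id, target id) pairs).
    """
    support = Counter()
    for graph in corpus:
        labels = graph["labels"]
        # A set, so an edge repeated inside one workflow is counted once.
        labeled_edges = {(labels[s], labels[t]) for s, t in graph["edges"]}
        support.update(labeled_edges)
    return support


def frequent_edges(corpus, min_support):
    """Keep edges found in at least `min_support` (0..1) of the workflows."""
    threshold = min_support * len(corpus)
    return {e: n for e, n in edge_support(corpus).items() if n >= threshold}


# Hypothetical toy corpus of three workflows.
corpus = [
    {"labels": {1: "A", 2: "B", 3: "C"}, "edges": {(1, 2), (2, 3)}},
    {"labels": {1: "A", 2: "B", 3: "F"}, "edges": {(1, 2), (2, 3)}},
    {"labels": {1: "B", 2: "E"}, "edges": {(1, 2)}},
]
print(frequent_edges(corpus, 0.5))  # {('A', 'B'): 2}
```

An exact miner would go on to extend the frequent edges into larger candidate fragments and recount their support, which is where the cost and the memory pressure grow quickly.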
Filtering Relevant Fragments 
The number of resulting fragments can be very large. We distinguish: 
•Multistep fragments: 
•More than one step 
•Filtered Multistep fragments: 
•Multistep fragments 
•Contain all smaller fragments with the same number of occurrences 
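One way to read the filtering rule is that a fragment is dropped when a larger fragment contains it and occurs the same number of times, since the larger one already carries that information. The sketch below follows that reading; it is an assumption about the rule, and it approximates "contains" by comparing sets of labeled edges instead of running a subgraph test. The name `filter_fragments` is hypothetical.

```python
def filter_fragments(fragments):
    """Keep multistep fragments not absorbed by a larger, equally frequent one.

    fragments: list of (edges, occurrences), where edges is a frozenset of
    (source label, target label) pairs.
    """
    kept = []
    for edges, occurrences in fragments:
        if not edges:  # no edge means a single step: not a multistep fragment
            continue
        absorbed = any(
            edges < other_edges and occurrences == other_occurrences
            for other_edges, other_occurrences in fragments
        )
        if not absorbed:
            kept.append((edges, occurrences))
    return kept


ab = (frozenset({("A", "B")}), 3)
abc = (frozenset({("A", "B"), ("B", "C")}), 3)
bf = (frozenset({("B", "F")}), 2)

# A->B is dropped: A->B->C contains it and also occurs 3 times.
print(filter_fragments([ab, abc, bf]) == [abc, bf])  # True
```

A smaller fragment that occurs more often than every fragment containing it is kept, because it describes reuse that the larger fragments do not cover.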
Linking to the Corpus: Wf-fd 
[Figure: Wf-fd diagram]
Linking to the Corpora: Example 
[Figure: example with the labels "Corpus" and "Fragment"]
Evaluation 
Three workflow corpora: 
User Corpus 1 (WC1) 
•Designed mostly by a single user 
•General medical imaging 
•790 workflows (475 after data preparation) 
User Corpus 2 (WC2) 
•Created by a user, with collaborations of others 
•Well documented workflows, meant for reuse 
•113 workflows (96 after data preparation) 
Multi User Corpus 3 (WC3) 
•Workflows submitted by 62 users during the month of Jan 2014 
•Several executions of the same workflows 
•5859 workflows (357 after data preparation) 
Evaluation: Metrics 
Goal 1: Are automatically detected workflow fragments similar to user-defined groupings? 
Goal 2: Do users find the fragments that were NOT similar to their defined groupings useful? 
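The slides do not spell out the metric formulas in text. The figures reported for the inexact techniques are consistent with precision being the share of detected fragments that match a grouping, and recall being the matched fragments over the number of workflows plus groupings in the corpus. This is an inference from the reported numbers, not a definition taken from the slides. A minimal sketch, checked against the WC1, MDL, minimum-frequency row (264 multistep fragments, 76 exact matches, 475 workflows and 209 groupings):

```python
def precision(matched, detected):
    """Share of detected fragments that match a user-defined grouping."""
    return matched / detected if detected else 0.0


def recall(matched, workflows, groupings):
    """Matched fragments over all candidate structures (assumed: w + g)."""
    total = workflows + groupings
    return matched / total if total else 0.0


# WC1, MDL heuristic, minimum frequency, exact matches.
p = precision(76, 264)
r = recall(76, 475, 209)
print(f"{p:.1%} {r:.1%}")  # 28.8% 11.1%
```

These values line up with the 29% precision and 11% recall shown for that row.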
Evaluation: Inexact FGM techniques 
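Corpus | Workflows (w) + groupings (g) | Inexact FGM | Frequency | MultiStep frag. | Exact: fragments | Exact: precision | Exact: recall | Overlap (>80%): fragments | Overlap (>80%): precision | Overlap (>80%): recall
WC1 | 475 (w) + 209 (g) | MDL | min | 264 | 76 | 29% | 11% | 113 | 42% | 16%
WC1 | 475 (w) + 209 (g) | MDL | 2% | 64 | 21 | 32% | 3% | 27 | 42% | 3%
WC1 | 475 (w) + 209 (g) | MDL | 5% | 26 | 9 | 34% | 1% | 11 | 42% | 1%
WC1 | 475 (w) + 209 (g) | MDL | 10% | 19 | 8 | 42% | 1% | 10 | 52% | 1%
WC1 | 475 (w) + 209 (g) | Size | min | 381 | 136 | 35% | 19% | 223 | 58% | 32%
WC1 | 475 (w) + 209 (g) | Size | 2% | 52 | 20 | 38% | 2% | 32 | 61% | 4%
WC1 | 475 (w) + 209 (g) | Size | 5% | 22 | 8 | 36% | 1% | 14 | 63% | 3%
WC1 | 475 (w) + 209 (g) | Size | 10% | 10 | 3 | 30% | 0.4% | 8 | 80% | 1%
WC2 | 96 (w) + 108 (g) | MDL | min | 95 | 15 | 15% | 7% | 21 | 22% | 10%
WC2 | 96 (w) + 108 (g) | MDL | 2% | 95 | 15 | 15% | 7% | 21 | 22% | 10%
WC2 | 96 (w) + 108 (g) | MDL | 5% | 12 | 3 | 25% | 1% | 3 | 25% | 1%
WC2 | 96 (w) + 108 (g) | MDL | 10% | 5 | 2 | 40% | 1% | 2 | 40% | 1%
WC2 | 96 (w) + 108 (g) | Size | min | 88 | 17 | 19% | 8% | 34 | 38% | 16%
WC2 | 96 (w) + 108 (g) | Size | 2% | 88 | 17 | 19% | 8% | 34 | 38% | 16%
WC2 | 96 (w) + 108 (g) | Size | 5% | 14 | 4 | 28% | 2% | 9 | 64% | 4%
WC2 | 96 (w) + 108 (g) | Size | 10% | 4 | 3 | 75% | 1% | 3 | 75% | 1%
WC3 | 375 (w) + 175 (g) | MDL | min | 186 | 100 | 50% | 18% | 117 | 62% | 21%
WC3 | 375 (w) + 175 (g) | MDL | 2% | 23 | 7 | 30% | 1% | 11 | 47% | 2%
WC3 | 375 (w) + 175 (g) | MDL | 5% | 4 | 1 | 25% | 0.1% | 2 | 50% | 0.3%
WC3 | 375 (w) + 175 (g) | MDL | 10% | 0 | 0 | 0% | 0% | 0 | 0% | 0%
WC3 | 375 (w) + 175 (g) | Size | min | 178 | 101 | 56% | 18% | 119 | 66% | 22%
WC3 | 375 (w) + 175 (g) | Size | 2% | 22 | 12 | 54% | 2% | 16 | 72% | 3%
WC3 | 375 (w) + 175 (g) | Size | 5% | 8 | 3 | 37% | 0.5% | 4 | 50% | 0.7%
WC3 | 375 (w) + 175 (g) | Size | 10% | 0 | 0 | 0% | 0% | 0 | 0% | 0%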
Frequent fragments overlap with groupings in the single-user corpora: at 10% frequency, precision is 30% to 75% for exact matches and 40% to 80% with more than 80% overlap. 
Precision decreases in the Multi user corpus. Best results are 50% to 56% with minimum frequency.
Evaluation: Exact FGM techniques 
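Corpus | Wf (w) + groups. (g) | Support | MultiStep fragments | MultiStep filtered fragments | Exact: fragments | Exact: precision | Exact: recall | Overlap (>80%): fragments | Overlap (>80%): precision | Overlap (>80%): recall
WC1 | 475 (w) + 209 (g) | 5% | Out of memory | - | - | - | - | - | - | -
WC1 | 475 (w) + 209 (g) | 10% | 51613 | 16 | 1 | 6.2% | 0.1% | 11 | 69% | 1%
WC1 | 475 (w) + 209 (g) | 15% | 2264 | 8 | 6 | 75% | 0.8% | 6 | 75% | 0.8%
WC1 | 475 (w) + 209 (g) | 20% | 3 | 1 | 0 | 0% | 0% | 0 | 0% | 0%
WC2 | 96 (w) + 108 (g) | 5% | Out of memory | - | - | - | - | - | - | -
WC2 | 96 (w) + 108 (g) | 10% | 33236 | 4 | 0 | 0% | 0% | 1 | 25% | 0.4%
WC2 | 96 (w) + 108 (g) | 15% | 25 | 2 | 0 | 0% | 0% | 0 | 0% | 0%
WC2 | 96 (w) + 108 (g) | 20% | 0 | 0 | 0 | - | - | 0 | - | -
WC3 | 375 (w) + 175 (g) | 5% | 5701 | 3 | 1 | 33% | 0.1% | 1 | 33% | 0.1%
WC3 | 375 (w) + 175 (g) | 10% | 1074 | 1 | 1 | 100% | 0.1% | 1 | 100% | 0.1%
WC3 | 375 (w) + 175 (g) | 15% | 1 | 1 | 0 | 0% | 0% | 0 | 0% | 0%
WC3 | 375 (w) + 175 (g) | 20% | 0 | 0 | 0 | - | - | 0 | - | -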
Fewer results than inexact FGM, even when high numbers of fragments are found 
How users define fragments affects the results 
Preliminary Evaluation: User based evaluation 
•Manual evaluation: each user is given 16-18 common workflow fragments detected by FragFlow 
•66% and 100% accuracy respectively 
•Some of the reasons to not use fragments depended on the user preferences 
•Currently evaluating additional users 
User | Use as proposed | Use with minor changes | Use with major changes | Use
User 1 (WC1) | 11% | 16.6% | 38% | 66.6%
User 2 (WC2) | 44% | 6% | 50% | 100%
Evaluation: Grouping analysis 
•Workflows with groupings are more common in single user corpora (WC1 and WC2) 
•Groupings are reused 
•1463 groupings versus 209 unique groupings in WC1 
•302 groupings versus 108 unique groupings in WC2 
•456 groupings versus 175 unique groupings in WC3 
•Grouping size ranges from 0 to 60 steps 
•Facilitate copy paste by users (large grouping size) 
•Reducing unnecessary inputs (groupings with no steps) 
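The reuse claim can be checked with simple arithmetic on the totals quoted above: dividing the total number of groupings by the number of unique groupings gives the average number of times each grouping appears. The dictionary and function name below are illustrative only.

```python
# (total groupings, unique groupings) per corpus, as quoted in the slides.
groupings = {
    "WC1": (1463, 209),
    "WC2": (302, 108),
    "WC3": (456, 175),
}


def reuse_ratio(total, unique):
    """Average number of appearances per unique grouping."""
    return total / unique


ratios = {corpus: round(reuse_ratio(*counts), 1) for corpus, counts in groupings.items()}
print(ratios)  # {'WC1': 7.0, 'WC2': 2.8, 'WC3': 2.6}
```

WC1 gives the highest ratio, 7 appearances per unique grouping, against roughly 2.8 and 2.6 for WC2 and WC3.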
Corpus | Total group. | Unique multistep group. | Wf with group. | Avg. group. per wf | Max nº of steps in group. | Min nº of steps in group.
WC1 | 1463 | 209 | 327 | 4 | 56 | 1
WC2 | 302 | 108 | 42 | 7 | 39 | 0
WC3 | 456 | 175 | 89 | 5 | 60 | 1
Findings 
With respect to our goals… 
•Goal 1: Are automatically detected workflow fragments similar to user-defined groupings? 
•(with freq 10%, single user, inexact FGM) 30% to 75% of the total FragFlow fragments found correspond directly to user-defined groupings 
•(multi user, inexact FGM) Best results are 50% to 56% with minimum frequency. Considering an overlap of more than 80% of the steps, the precision is 62% to 66% 
•Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful? 
•For one user 66% of the proposed fragments were useful, for another 100% were useful 
•Further evaluation is needed 
•Goal 3: How are workflows and groupings reused? 
•Workflows with groupings have on average at least 4 groupings 
•Reuse of groupings (the total number of groupings is up to 7 times the number of unique groupings in the corpora) 
Limitations 
•Graph mining is an NP-Complete problem 
•Big fragments can take time to be recognized 
•Errors derived from memory heap issues 
•Detection of groupings may depend on user preferences on size and frequency 
Conclusions and Future Work 
•FragFlow: Approach to find the most common fragments in a corpus of workflows 
•Several integrated graph mining techniques 
•FragFlow can be used with different settings 
•Minimum or maximum frequency and support. 
•Size 
•Type of the graph mining algorithm to be applied 
•Evaluation of the results using corpora belonging to the LONI Pipeline system. 
•New algorithms are being integrated! 
•Sigma (inexact FGM), Gaston (exact FGM) 
•Future work 
•Test FragFlow with other workflow systems, domains, and perform further user evaluations. 
•Evaluate how workflow quality improves when automatically mined workflow fragments are proposed to users 
Evaluation and resources available here: http://purl.org/net/escience2014 
Who are we? 
•Daniel Garijo, Oscar Corcho: Ontology Engineering Group, UPM 
•Yolanda Gil: Information Sciences Institute, USC 
•Boris A. Gutman, Ivo D. Dinov, Paul Thompson, Arthur W. Toga: USC Laboratory of Neuro Imaging 
Want to collaborate? Contact me at dgarijo@fi.upm.es 
Questions? 
IEEE eScience 2014. Guarujá, Brasil
Date: 24/10/2014 
FragFlow: Automatic Fragment Detection in Scientific Workflows 
Daniel Garijo *, Oscar Corcho *, Yolanda Gil Ŧ, Boris A. Gutman ⱡ, Ivo D. Dinov ⱡ, Paul Thompson ⱡ and Arthur W. Toga ⱡ 
* Universidad Politécnica de Madrid, 
Ŧ USC Information Sciences Institute, 
ⱡ USC Laboratory of Neuroimaging

More Related Content

Similar to Frag Flow: Automated Fragment Detection in Scientific Workflows

Production Schedule Report 2016-rv
Production Schedule Report 2016-rvProduction Schedule Report 2016-rv
Production Schedule Report 2016-rv
James Thomas
 
QuestionPoint user group June 2010
QuestionPoint user group June 2010QuestionPoint user group June 2010
QuestionPoint user group June 2010
OCLC
 
What I did at DPL -Print 1
What I did at DPL -Print 1What I did at DPL -Print 1
What I did at DPL -Print 1
Dhanuka Fernando
 
NavTap: A Long Term Study with Excluded Blind Users
NavTap: A Long Term Study with Excluded Blind Users NavTap: A Long Term Study with Excluded Blind Users
NavTap: A Long Term Study with Excluded Blind Users
Hugo Nicolau
 
Waterfall Turbine Development Primer - Updated
Waterfall Turbine Development Primer - UpdatedWaterfall Turbine Development Primer - Updated
Waterfall Turbine Development Primer - Updated
Jason Rota
 
OpenStack User Committee - Havana Summit
OpenStack User Committee - Havana SummitOpenStack User Committee - Havana Summit
OpenStack User Committee - Havana Summit
OpenStack Foundation
 

Similar to Frag Flow: Automated Fragment Detection in Scientific Workflows (20)

Supply chain design and operation
Supply chain design and operationSupply chain design and operation
Supply chain design and operation
 
Rob Baarda - Are Real Test Metrics Predictive for the Future?
Rob Baarda - Are Real Test Metrics Predictive for the Future?Rob Baarda - Are Real Test Metrics Predictive for the Future?
Rob Baarda - Are Real Test Metrics Predictive for the Future?
 
Git Github GDSC.pptx
Git Github GDSC.pptxGit Github GDSC.pptx
Git Github GDSC.pptx
 
Active Portfolio Management
Active Portfolio ManagementActive Portfolio Management
Active Portfolio Management
 
Sbst2018 contest2018
Sbst2018 contest2018Sbst2018 contest2018
Sbst2018 contest2018
 
Production Schedule Report 2016-rv
Production Schedule Report 2016-rvProduction Schedule Report 2016-rv
Production Schedule Report 2016-rv
 
JTP - EV Presentation
JTP - EV PresentationJTP - EV Presentation
JTP - EV Presentation
 
QuestionPoint user group June 2010
QuestionPoint user group June 2010QuestionPoint user group June 2010
QuestionPoint user group June 2010
 
What I did at DPL -Print 1
What I did at DPL -Print 1What I did at DPL -Print 1
What I did at DPL -Print 1
 
this-is-garbage-talk-2022.pptx
this-is-garbage-talk-2022.pptxthis-is-garbage-talk-2022.pptx
this-is-garbage-talk-2022.pptx
 
Verifying Deadlock and Livelock Freedom in an SOA Scenario
Verifying Deadlock and Livelock Freedom in an SOA ScenarioVerifying Deadlock and Livelock Freedom in an SOA Scenario
Verifying Deadlock and Livelock Freedom in an SOA Scenario
 
NavTap: A Long Term Study with Excluded Blind Users
NavTap: A Long Term Study with Excluded Blind Users NavTap: A Long Term Study with Excluded Blind Users
NavTap: A Long Term Study with Excluded Blind Users
 
Scaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insertScaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insert
 
Waterfall Turbine Development Primer - Updated
Waterfall Turbine Development Primer - UpdatedWaterfall Turbine Development Primer - Updated
Waterfall Turbine Development Primer - Updated
 
CATALOGO YUPP versión PDF
CATALOGO YUPP versión PDFCATALOGO YUPP versión PDF
CATALOGO YUPP versión PDF
 
Reducing Waste in Expandable Collections
Reducing Waste in Expandable CollectionsReducing Waste in Expandable Collections
Reducing Waste in Expandable Collections
 
Value add: Single User Performance Testing (http://managingperformancetesting...
Value add: Single User Performance Testing (http://managingperformancetesting...Value add: Single User Performance Testing (http://managingperformancetesting...
Value add: Single User Performance Testing (http://managingperformancetesting...
 
TankerCalc for Bunker Surveyors
TankerCalc for Bunker SurveyorsTankerCalc for Bunker Surveyors
TankerCalc for Bunker Surveyors
 
RBK Artworks Presentation: Designing For Diversified Sales Tools.
RBK Artworks Presentation:  Designing For Diversified Sales Tools. RBK Artworks Presentation:  Designing For Diversified Sales Tools.
RBK Artworks Presentation: Designing For Diversified Sales Tools.
 
OpenStack User Committee - Havana Summit
OpenStack User Committee - Havana SummitOpenStack User Committee - Havana Summit
OpenStack User Committee - Havana Summit
 

More from dgarijo

Towards Knowledge Graphs of Reusable Research Software Metadata
Towards Knowledge Graphs of Reusable Research Software MetadataTowards Knowledge Graphs of Reusable Research Software Metadata
Towards Knowledge Graphs of Reusable Research Software Metadata
dgarijo
 
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
dgarijo
 
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
dgarijo
 
Automated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific WorkflowsAutomated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific Workflows
dgarijo
 
PhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflowsPhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflows
dgarijo
 

More from dgarijo (20)

FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
FOOPS!: An Ontology Pitfall Scanner for the FAIR principlesFOOPS!: An Ontology Pitfall Scanner for the FAIR principles
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
 
FAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the FutureFAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the Future
 
Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Software
 
SOMEF: a metadata extraction framework from software documentation
SOMEF: a metadata extraction framework from software documentationSOMEF: a metadata extraction framework from software documentation
SOMEF: a metadata extraction framework from software documentation
 
A Template-Based Approach for Annotating Long-Tailed Datasets
A Template-Based Approach for Annotating Long-Tailed DatasetsA Template-Based Approach for Annotating Long-Tailed Datasets
A Template-Based Approach for Annotating Long-Tailed Datasets
 
OBA: An Ontology-Based Framework for Creating REST APIs for Knowledge Graphs
OBA: An Ontology-Based Framework for Creating REST APIs for Knowledge GraphsOBA: An Ontology-Based Framework for Creating REST APIs for Knowledge Graphs
OBA: An Ontology-Based Framework for Creating REST APIs for Knowledge Graphs
 
Towards Knowledge Graphs of Reusable Research Software Metadata
Towards Knowledge Graphs of Reusable Research Software MetadataTowards Knowledge Graphs of Reusable Research Software Metadata
Towards Knowledge Graphs of Reusable Research Software Metadata
 
Scientific Software Registry Collaboration Workshop: From Software Metadata r...
Scientific Software Registry Collaboration Workshop: From Software Metadata r...Scientific Software Registry Collaboration Workshop: From Software Metadata r...
Scientific Software Registry Collaboration Workshop: From Software Metadata r...
 
WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
WDPlus: Leveraging Wikidata to Link and Extend Tabular DataWDPlus: Leveraging Wikidata to Link and Extend Tabular Data
WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
 
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
 
Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019
 
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
 
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
 
WIDOCO: A Wizard for Documenting Ontologies
WIDOCO: A Wizard for Documenting OntologiesWIDOCO: A Wizard for Documenting Ontologies
WIDOCO: A Wizard for Documenting Ontologies
 
Towards Automating Data Narratives
Towards Automating Data NarrativesTowards Automating Data Narratives
Towards Automating Data Narratives
 
Automated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific WorkflowsAutomated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific Workflows
 
OntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific SoftwareOntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific Software
 
Reproducibility Using Semantics: An Overview
Reproducibility Using Semantics: An OverviewReproducibility Using Semantics: An Overview
Reproducibility Using Semantics: An Overview
 
PhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflowsPhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflows
 
Semantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsSemantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologists
 

Recently uploaded

一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
cyebo
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
cyebo
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
RafigAliyev2
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
pyhepag
 

Recently uploaded (20)

2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdf
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
 

Frag Flow: Automated Fragment Detection in Scientific Workflows

  • 1. Date: 24/10/2014 FragFlow: Automatic Fragment Detection in Scientific Workflows Daniel Garijo *, Oscar Corcho *, Yolanda Gil Ŧ, Boris A. Gutman ⱡ, Ivo D. Dinov ⱡ, Paul Thompson ⱡ and Arthur W. Toga ⱡ * Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute, ⱡ USC Laboratory of Neuroimaging
  • 2. Overview •Detecting common groups of tasks in a corpus of scientific workflows •Application of exact and inexact graph matching techniques •Filtering and linking results to the input corpus •Benefits: discoverability, understandability, reuse, design, modularization, visualization [Figure labels: Lab book, Digital Log, Laboratory Protocol (recipe), Workflow, Experiment] IEEE eScience 2014. Guarujá, Brasil
  • 3. Background •Workflows are software artifacts that capture computational experiments •Addition to paper publication •Provenance of results •Reuse •Existing repositories of workflows (Galaxy, myExperiment, the LONI Pipeline, CrowdLabs, etc.) •Sharing workflows •Exploring existing workflows •PROBLEMS to address: •Workflows have many detailed steps and may be difficult to understand •The general method may not be apparent •How are different workflows related? •What steps do they have in common?
  • 4. Workflow Fragments and Groupings •Workflow Fragment: set of connected steps that are part of a workflow •Common Workflow Fragment: fragment that occurs more than once in a corpus of workflows •Grouping: workflow fragment manually annotated by a user •Sub-Grouping: grouping included as part of another grouping [Figure: three example graphs (Workflow 1, Workflow 2, Workflow 3) with steps labeled A to H, captioned "Common workflow fragments"]
  • 5. Our Goals Our goal is to automatically detect useful workflow fragments to be reused by scientists. In this work, given a workflow corpus… •Goal 1: Are automatically detected workflow fragments similar to user-defined groupings? •Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful? •Goal 3: How are workflows and groupings reused?
  • 6. The LONI Pipeline •Workflow system for neuroimaging analysis •Active community of users creating workflows •Enables users to define groupings in workflows •Has a corpus of published workflows •Has a library of (uniquely identified) components with well-defined functionality http://pipeline.loni.usc.edu/explore/library-navigator/
  • 7. Workflow Mining in FragFlow [Figure: diagram with steps numbered 1 to 4 and the label "Corpus"]
  • 8. Corpus Preparation Workflows are converted to Labeled Directed Acyclic Graphs (LDAG) •The label of a node in the graph corresponds to the type of the step in the workflow •Edges capture the dependencies between different steps •Duplicated workflows are removed •Single-step workflows are removed
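The preparation steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not FragFlow's implementation: the `Workflow` layout, the `signature` and `prepare_corpus` helpers and the component names are hypothetical, and duplicates are detected with a label-level fingerprint instead of a full graph isomorphism test.

```python
from collections import Counter
from typing import NamedTuple


class Workflow(NamedTuple):
    """Hypothetical in-memory form of one workflow."""
    steps: dict   # step id -> component type (becomes the node label)
    edges: list   # (source step id, target step id) dependencies


def signature(wf: Workflow) -> tuple:
    """Label-level fingerprint used to spot duplicates.

    Assumption: two workflows with the same multiset of node labels and
    labeled edges are treated as duplicates. This is cheaper than an
    isomorphism test but may merge distinct graphs in rare cases.
    """
    node_labels = Counter(wf.steps.values())
    edge_labels = Counter((wf.steps[s], wf.steps[t]) for s, t in wf.edges)
    return (tuple(sorted(node_labels.items())),
            tuple(sorted(edge_labels.items())))


def prepare_corpus(corpus: list) -> list:
    """Drop single-step workflows and keep one copy of each duplicate."""
    seen, prepared = set(), []
    for wf in corpus:
        if len(wf.steps) < 2:
            continue  # single-step workflows are removed
        sig = signature(wf)
        if sig not in seen:
            seen.add(sig)
            prepared.append(wf)
    return prepared


corpus = [
    Workflow({1: "Align", 2: "Smooth"}, [(1, 2)]),
    Workflow({7: "Align", 9: "Smooth"}, [(7, 9)]),   # same graph, other ids
    Workflow({1: "Align"}, []),                      # single step
]
prepared = prepare_corpus(corpus)
```

Only the first workflow survives here: the second has the same labels and edges, and the third has a single step.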
  • 9. Graph Mining We use popular graph mining techniques: •Inexact FGM: uses heuristics to calculate the similarity between two graphs. The solution might not be complete •SUBDUE •2 heuristics: Minimum Description Length (MDL) and Size •Frequency based •Exact FGM: delivers all the possible fragments to be found in the dataset •gSpan •Depth-first search strategy •Support based •FSG •Breadth-first search strategy •Support based
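To make "support based" concrete, the sketch below counts the support of the smallest multistep fragments, single labeled edges, across a toy corpus. Exact miners such as gSpan and FSG grow larger candidates from frequent seeds of this kind; that growth step is omitted here. The corpus, the labels and the `edge_support` helper are made up for illustration.

```python
from collections import Counter


def edge_support(corpus):
    """Fraction of workflows that contain each labeled edge at least once."""
    counts = Counter()
    for edges in corpus:
        counts.update(set(edges))   # set(): count each workflow only once
    return {edge: n / len(corpus) for edge, n in counts.items()}


# Each workflow is a list of (source label, target label) edges.
corpus = [
    [("A", "B"), ("B", "C")],
    [("A", "B"), ("B", "C"), ("C", "G")],
    [("A", "B"), ("B", "F")],
    [("A", "F"), ("F", "D")],
]
support = edge_support(corpus)

min_support = 0.5   # keep edges present in at least half of the workflows
frequent = sorted(e for e, s in support.items() if s >= min_support)
```

With a minimum support of 50%, only the edges A→B and B→C remain as seeds for larger fragments.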
  • 10. Filtering Relevant Fragments The number of resulting fragments can be very large. We distinguish: •Multistep fragments: •More than one step •Filtered Multistep fragments: •Multistep fragments •Contain all smaller fragments with the same number of occurrences
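One reading of the filter is that a fragment is dropped when a larger fragment contains it and occurs the same number of times, since the larger one already covers it. The sketch below follows that reading under a simplifying assumption: each fragment is a set of labeled edges and containment is checked as a proper subset, which is weaker than a true subgraph test. The `filter_fragments` helper and the counts are hypothetical.

```python
def filter_fragments(fragments):
    """Keep fragments not absorbed by a larger one with equal occurrences.

    fragments: dict mapping frozenset of (source, target) labeled edges
    to the number of occurrences in the corpus.
    """
    kept = {}
    for frag, count in fragments.items():
        absorbed = any(
            frag < other and count == other_count   # proper subset, same count
            for other, other_count in fragments.items()
        )
        if not absorbed:
            kept[frag] = count
    return kept


fragments = {
    frozenset({("A", "B")}): 5,
    frozenset({("B", "C")}): 3,
    frozenset({("A", "B"), ("B", "C")}): 3,
}
filtered = filter_fragments(fragments)
```

Here B→C is absorbed by A→B→C (both occur 3 times), while A→B is kept because it occurs more often than the larger fragment.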
  • 11. Linking to the Corpus: Wf-fd
  • 12. Linking to the Corpora: Example [Figure labels: Corpus, Fragment]
  • 13. Evaluation Three workflow corpora: User Corpus 1 (WC1) •Designed mostly by a single user •General medical imaging •790 workflows (475 after data preparation) User Corpus 2 (WC2) •Created by a user, with collaborations of others •Well documented workflows, meant for reuse •113 workflows (96 after data preparation) Multi User Corpus 3 (WC3) •Workflows submitted by 62 users during the month of Jan 2014 •Several executions of the same workflows •5859 workflows (357 after data preparation)
  • 14. Evaluation: Metrics Goal 1: Are automatically detected workflow fragments similar to user-defined groupings? Goal 2: Do users find the fragments that were NOT similar to their defined groupings useful?
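The transcript does not preserve the metric definitions, so the following is an inference from the reported numbers, not a definition taken from the slides. Precision appears to be matched fragments divided by the multistep fragments found, and recall appears to be matched fragments divided by the corpus size given as workflows (w) plus groupings (g). The check uses the WC1 row for the Size heuristic at minimum frequency (381 fragments, 136 exact matches, 475 workflows, 209 groupings).

```python
def precision(matched: int, fragments: int) -> float:
    """Share of mined fragments that match a user-defined grouping."""
    return matched / fragments


def recall(matched: int, workflows: int, groupings: int) -> float:
    """Assumption inferred from the results table: the denominator is the
    number of workflows plus the number of unique groupings."""
    return matched / (workflows + groupings)


# WC1, Size heuristic, minimum frequency, exact matches.
p = precision(136, 381)
r = recall(136, 475, 209)
print(f"precision={p:.1%} recall={r:.1%}")
```

The computed values are close to the 35% and 19% reported for that row, which is what motivates the assumed denominators.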
  • 15. Evaluation: Inexact FGM techniques
      Corpus sizes are given as workflows (w) + groupings (g).

                                      Multistep  Exact                  Overlap (>80%)
      Corpus             FGM   Freq.  frag.      Frag.  Prec.  Recall   Frag.  Prec.  Recall
      WC1 475(w)+209(g)  MDL   min    264        76     29%    11%      113    42%    16%
                         MDL   2%     64         21     32%    3%       27     42%    3%
                         MDL   5%     26         9      34%    1%       11     42%    1%
                         MDL   10%    19         8      42%    1%       10     52%    1%
                         Size  min    381        136    35%    19%      223    58%    32%
                         Size  2%     52         20     38%    2%       32     61%    4%
                         Size  5%     22         8      36%    1%       14     63%    3%
                         Size  10%    10         3      30%    0.4%     8      80%    1%
      WC2 96(w)+108(g)   MDL   min    95         15     15%    7%       21     22%    10%
                         MDL   2%     95         15     15%    7%       21     22%    10%
                         MDL   5%     12         3      25%    1%       3      25%    1%
                         MDL   10%    5          2      40%    1%       2      40%    1%
                         Size  min    88         17     19%    8%       34     38%    16%
                         Size  2%     88         17     19%    8%       34     38%    16%
                         Size  5%     14         4      28%    2%       9      64%    4%
                         Size  10%    4          3      75%    1%       3      75%    1%
      WC3 375(w)+175(g)  MDL   min    186        100    50%    18%      117    62%    21%
                         MDL   2%     23         7      30%    1%       11     47%    2%
                         MDL   5%     4          1      25%    0.1%     2      50%    0.3%
                         MDL   10%    0          0      0%     0%       0      0%     0%
                         Size  min    178        101    56%    18%      119    66%    22%
                         Size  2%     22         12     54%    2%       16     72%    3%
                         Size  5%     8          3      37%    0.5%     4      50%    0.7%
                         Size  10%    0          0      0%     0%       0      0%     0%
  • 16. Evaluation: Inexact FGM techniques (same results table as slide 15) Frequent fragments overlap with groupings in single user corpora (30% to 75% with 10% frequency, 40% to 80% overlapping)
  • 17. Evaluation: Inexact FGM techniques (same results table as slide 15) Precision decreases in the Multi user corpus. Best results are 50% to 56% with minimum frequency.
  • 18. Evaluation: Inexact FGM techniques (same results table as slide 15)
  • 19. Evaluation: Exact FGM techniques
      Corpus sizes are given as workflows (w) + groupings (g). "Filtered fragments" are the filtered multistep fragments.

                                  Multistep      Filtered  Exact                  Overlap (>80%)
      Corpus             Support  fragments      fragments Frag.  Prec.  Recall   Frag.  Prec.  Recall
      WC1 475(w)+209(g)  5%       Out of memory  -         -      -      -        -      -      -
                         10%      51613          16        1      6.2%   0.1%     11     69%    1%
                         15%      2264           8         6      75%    0.8%     6      75%    0.8%
                         20%      3              1         0      0%     0%       0      0%     0%
      WC2 96(w)+108(g)   5%       Out of memory  -         -      -      -        -      -      -
                         10%      33236          4         0      0%     0%       1      25%    0.4%
                         15%      25             2         0      0%     0%       0      0%     0%
                         20%      0              0         0      -      -        0      -      -
      WC3 375(w)+175(g)  5%       5701           3         1      33%    0.1%     1      33%    0.1%
                         10%      1074           1         1      100%   0.1%     1      100%   0.1%
                         15%      1              1         0      0%     0%       0      0%     0%
                         20%      0              0         0      -      -        0      -      -
  • 20. Evaluation: Exact FGM techniques (same results table as slide 19) Fewer results than inexact FGM, even when high numbers of fragments are found
  • 21. Evaluation: Exact FGM techniques (same results table as slide 19) How users define fragments affects the results
  • 22. Preliminary Evaluation: User based evaluation •Manual evaluation: each user is given 16-18 common workflow fragments detected by FragFlow •66% and 100% accuracy respectively •Some of the reasons to not use fragments depended on the user preferences •Currently evaluating additional users

      User          Use as proposed  Use with minor changes  Use with major changes  Use
      User 1 (WC1)  11%              16.6%                   38%                     66.6%
      User 2 (WC2)  44%              6%                      50%                     100%
  • 23. Evaluation: Grouping analysis •Workflows with groupings are more common in single user corpora (WC1 and WC2) •Groupings are reused •1463 groupings versus 209 unique groupings in WC1 •302 groupings versus 108 unique groupings in WC2 •456 groupings versus 175 unique groupings in WC3 •Grouping size ranges from 0 to 60 steps •Facilitate copy paste by users (large grouping size) •Reducing unnecessary inputs (groupings with no steps)

      Corpus  Total group.  Unique multistep group.  Wf with group.  Avg. group. per wf  Max nº of steps in group.  Min nº of steps in group.
      WC1     1463          209                      327             4                   56                         1
      WC2     302           108                      42              7                   39                         0
      WC3     456           175                      89              5                   60                         1
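The reuse figures can be checked with a little arithmetic on the numbers reported on this slide. In the sketch below the dictionary layout and variable names are illustrative only; the reuse ratio is total groupings divided by unique groupings, and the per-workflow average is total groupings divided by the number of workflows that contain groupings.

```python
# Values copied from the grouping analysis slide.
stats = {
    "WC1": {"total": 1463, "unique": 209, "wf_with_groupings": 327},
    "WC2": {"total": 302, "unique": 108, "wf_with_groupings": 42},
    "WC3": {"total": 456, "unique": 175, "wf_with_groupings": 89},
}

reuse_ratio = {c: s["total"] / s["unique"] for c, s in stats.items()}
avg_per_wf = {c: s["total"] / s["wf_with_groupings"] for c, s in stats.items()}

for corpus in stats:
    print(f"{corpus}: reuse x{reuse_ratio[corpus]:.1f}, "
          f"{avg_per_wf[corpus]:.1f} groupings per workflow")
```

WC1 gives a reuse ratio of exactly 7, and the whole-number parts of the per-workflow averages match the 4, 7 and 5 in the table.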
  • 24. Findings With respect to our goals… •Goal 1: Are automatically detected workflow fragments similar to user-defined groupings? •(with 10% frequency, single user, inexact FGM) 30% to 75% of the total FragFlow fragments found correspond directly to user-defined groupings •(multi user) Best results are 50% to 56% for inexact FGM with minimum frequency. If we consider an overlap of 80% of the steps, the precision is 62% to 66% •Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful? •For one user 66% of the proposed fragments were useful, for another 100% were useful •Further evaluation is needed •Goal 3: How are workflows and groupings reused? •Workflows with groupings have on average at least 4 groupings •Reuse of groupings (total groupings are up to 7 times the number of unique groupings in the corpora)
  • 25. Limitations •Graph mining relies on subgraph isomorphism, an NP-complete problem •Big fragments can take time to be recognized •Errors derived from heap memory issues •Detection of groupings may depend on user preferences on size and frequency
  • 26. Conclusions and Future Work •FragFlow: approach to find the most common fragments in a corpus of workflows •Several integrated graph mining techniques •FragFlow can be used with different settings •Minimum or maximum frequency and support •Size •Type of the graph mining algorithm to be applied •Evaluation of the results using corpora belonging to the LONI Pipeline system •New algorithms are being integrated! •Sigma (inexact FGM), Gaston (exact FGM) •Future work •Test FragFlow with other workflow systems and domains, and perform further user evaluations •Evaluate how workflow quality improves when automatically mined workflow fragments are proposed to users Evaluation and resources available here: http://purl.org/net/escience2014
  • 27. Who are we? •Daniel Garijo, Oscar Corcho: Ontology Engineering Group, UPM •Yolanda Gil: Information Sciences Institute, USC •Boris A. Gutman, Ivo D. Dinov, Paul Thompson, Arthur W. Toga: USC Laboratory of Neuro Imaging
  • 28. Want to collaborate? Contact me at dgarijo@fi.upm.es Questions?