FragFlow: Automated Fragment Detection in Scientific Workflows

eScience 2014, Guarujá (Brasil). Abstract: Scientific workflows provide the means to define, execute and reproduce computational experiments. However, reusing existing workflows still poses challenges for workflow designers. Workflows are often too large and too specific to reuse in their entirety, so reuse is more likely to happen for fragments of workflows. These fragments may be identified manually by users as sub-workflows, or detected automatically. In this paper we present the FragFlow approach, which detects workflow fragments automatically by analyzing existing workflow corpora with graph mining algorithms. FragFlow detects the most common workflow fragments, links them to the original workflows and visualizes them. We evaluate our approach by comparing FragFlow results against user-defined sub-workflows from three different corpora of the LONI Pipeline system. Based on this evaluation, we discuss how automated workflow fragment detection could facilitate workflow reuse.

Published in: Data & Analytics


  1. Date: 24/10/2014
     FragFlow: Automatic Fragment Detection in Scientific Workflows
     Daniel Garijo *, Oscar Corcho *, Yolanda Gil Ŧ, Boris A. Gutman ⱡ, Ivo D. Dinov ⱡ, Paul Thompson ⱡ and Arthur W. Toga ⱡ
     * Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute, ⱡ USC Laboratory of Neuroimaging
  2. Overview
     • Detecting common groups of tasks in a corpus of scientific workflows
     • Application of exact and inexact graph matching techniques
     • Filtering and linking results to the input corpus
     • Benefits: discoverability, understandability, reuse, design, modularization, visualization
     [Figure labels: Lab book, Digital Log, Laboratory Protocol (recipe), Workflow, Experiment]
     IEEE eScience 2014. Guarujá, Brasil
  3. Background
     • Workflows are software artifacts that capture computational experiments
       • Addition to paper publication
       • Provenance of results
       • Reuse
     • Existing repositories of workflows (Galaxy, myExperiment, the LONI Pipeline, CrowdLabs, etc.)
       • Sharing workflows
       • Exploring existing workflows
     • PROBLEMS to address:
       • Workflows have many detailed steps and may be difficult to understand
       • The general method may not be apparent
       • How are different workflows related?
       • What steps do they have in common?
  4. Workflow Fragments and Groupings
     • Workflow Fragment: set of connected steps that are part of a workflow
     • Common Workflow Fragment: fragment that occurs more than once in a corpus of workflows
     • Grouping: workflow fragment manually annotated by a user
     • Sub-Grouping: grouping included as part of another grouping
     [Figure: Workflow 1, Workflow 2 and Workflow 3 (steps labeled A to H), with their common workflow fragments]
  5. Our Goals
     Our goal is to automatically detect useful workflow fragments to be reused by scientists. In this work, given a workflow corpus…
     • Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?
     • Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful?
     • Goal 3: How are workflows and groupings reused?
  6. The LONI Pipeline
     • Workflow system for neuroimaging analysis
     • Active community of users creating workflows
     • Enables users to define groupings in workflows
     • Has a corpus of published workflows
     • Has a library of (uniquely identified) components with a well-defined functionality
     http://pipeline.loni.usc.edu/explore/library-navigator/
  7. Workflow Mining in FragFlow
     [Figure: diagram with steps numbered 1 to 4; label: Corpus]
  8. Corpus Preparation
     Workflows converted to Labeled Directed Acyclic Graphs (LDAG)
     • The label of a node in the graph corresponds to the type of the step in the workflow
     • Edges capture the dependencies between different steps
     • Duplicated workflows are removed
     • Single-step workflows are removed
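The preparation steps on this slide could be sketched roughly as follows. This is a minimal illustration, not the FragFlow implementation: the input format, the function names `to_ldag`, `signature` and `prepare_corpus`, and the step types are all assumptions made for the example. In particular, comparing sorted labels and labeled edges only approximates a proper graph isomorphism test for duplicates.

```python
def to_ldag(steps, edges):
    """Build a labeled DAG from a workflow.

    steps: dict mapping a step id to its step type (the node label).
    edges: iterable of (source id, target id) dependencies.
    """
    nodes = dict(steps)
    arcs = [(src, dst) for src, dst in edges]
    return nodes, arcs


def signature(ldag):
    """Label-based signature used here to spot duplicated workflows (assumption)."""
    nodes, arcs = ldag
    labels = tuple(sorted(nodes.values()))
    labeled_arcs = tuple(sorted((nodes[src], nodes[dst]) for src, dst in arcs))
    return labels, labeled_arcs


def prepare_corpus(workflows):
    """Drop single-step workflows and duplicates, keeping the first copy seen."""
    seen = set()
    prepared = {}
    for name, (steps, edges) in workflows.items():
        if len(steps) < 2:  # single-step workflows are removed
            continue
        ldag = to_ldag(steps, edges)
        sig = signature(ldag)
        if sig in seen:  # duplicated workflows are removed
            continue
        seen.add(sig)
        prepared[name] = ldag
    return prepared


# Hypothetical corpus: step ids differ between wf1 and wf1_copy, step types do not.
raw = {
    "wf1": ({"s1": "Align", "s2": "Smooth"}, [("s1", "s2")]),
    "wf1_copy": ({"a": "Align", "b": "Smooth"}, [("a", "b")]),
    "single": ({"s1": "Align"}, []),
    "wf2": ({"s1": "Align", "s2": "Segment"}, [("s1", "s2")]),
}
prepared = prepare_corpus(raw)
print(sorted(prepared))
```

Labeling nodes by step type rather than by step id is what lets the same fragment be recognized across different workflows.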
  9. Graph Mining
     We use popular graph mining techniques:
     • Inexact FGM: uses heuristics to calculate the similarity between two graphs. The solution might not be complete
       • SUBDUE
         • 2 heuristics: Minimum Description Length (MDL) and Size
         • Frequency based
     • Exact FGM: delivers all the possible fragments to be found in the dataset
       • gSpan
         • Depth-first search strategy
         • Support based
       • FSG
         • Breadth-first search strategy
         • Support based
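The slide contrasts frequency-based and support-based mining without defining the two terms. Under the common reading, taken here as an assumption, frequency counts every occurrence of a fragment in the corpus, while support counts the workflows that contain it at least once. The sketch below illustrates the difference on the smallest possible fragments (single labeled edges); the corpus and the function names are hypothetical, and real miners such as SUBDUE, gSpan or FSG work on larger subgraphs.

```python
from collections import Counter

# Hypothetical corpus: each workflow is a list of (source type, target type) edges.
corpus = {
    "wf1": [("A", "B"), ("B", "C"), ("A", "B")],  # A -> B occurs twice in wf1
    "wf2": [("A", "B"), ("B", "C")],
    "wf3": [("B", "F")],
}


def frequency(corpus):
    """Count every occurrence of each edge fragment across the corpus."""
    return Counter(edge for edges in corpus.values() for edge in edges)


def support(corpus):
    """Count the workflows in which each edge fragment appears at least once."""
    return Counter(edge for edges in corpus.values() for edge in set(edges))


def frequent(counts, minimum):
    """Keep the fragments whose count reaches the given threshold."""
    return {fragment for fragment, n in counts.items() if n >= minimum}


print(frequency(corpus)[("A", "B")], support(corpus)[("A", "B")])
print(frequent(support(corpus), 2))
```

The same fragment can therefore pass a frequency threshold while failing a support threshold when its occurrences are concentrated in few workflows.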
  10. Filtering Relevant Fragments
      The number of resulting fragments can be very large. We distinguish:
      • Multistep fragments:
        • More than one step
      • Filtered multistep fragments:
        • Multistep fragments
        • Contain all smaller fragments with the same number of occurrences
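One possible reading of the filtering rule is that a fragment is dropped when a larger fragment contains it and occurs the same number of times, so only the larger one is reported. The sketch below follows that reading and is an assumption, not the FragFlow code. Fragments are represented as sets of labeled edges, and containment is checked with a plain subset test, which simplifies the subgraph matching a real implementation would need. Every fragment with at least one edge has two or more steps, so all fragments here are multistep.

```python
def filter_fragments(fragments):
    """Keep fragments that are not absorbed by a larger fragment with the same count.

    fragments: dict mapping a frozenset of (source type, target type) edges
    to the number of occurrences of that fragment.
    """
    kept = {}
    for frag, count in fragments.items():
        absorbed = any(
            frag < other and count == other_count  # proper subset, same occurrences
            for other, other_count in fragments.items()
        )
        if not absorbed:
            kept[frag] = count
    return kept


# Hypothetical mining output.
ab = frozenset({("A", "B")})
bc = frozenset({("B", "C")})
abc = frozenset({("A", "B"), ("B", "C")})
mined = {ab: 5, bc: 8, abc: 5}

filtered = filter_fragments(mined)
print(len(mined), "->", len(filtered))
```

In this example A -> B is absorbed by A -> B -> C because both occur 5 times, while B -> C is kept because it also occurs on its own.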
  11. Linking to the Corpus: Wf-fd
  12. Linking to the Corpora: Example
      [Figure labels: Corpus, Fragment]
  13. Evaluation
      Three workflow corpora:
      • User Corpus 1 (WC1)
        • Designed mostly by a single user
        • General medical imaging
        • 790 workflows (475 after data preparation)
      • User Corpus 2 (WC2)
        • Created by a user, in collaboration with others
        • Well documented workflows, meant for reuse
        • 113 workflows (96 after data preparation)
      • Multi User Corpus 3 (WC3)
        • Workflows submitted by 62 users during January 2014
        • Several executions of the same workflows
        • 5859 workflows (357 after data preparation)
  14. Evaluation: Metrics
      • Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?
      • Goal 2: Do users find the fragments that were NOT similar to their defined groupings useful?
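The metric definitions on this slide did not survive in the transcript. As a rough sketch, precision can be read as the share of detected multistep fragments that match a user-defined grouping, which agrees with the figures in the results tables (for example, 76 matching fragments out of 264 for WC1 with MDL at minimum frequency). The `overlap` definition below, shared step types divided by the size of the larger set, is purely an assumption used to illustrate the ">80%" threshold. Recall is left out because the transcript does not show its denominator.

```python
def overlap(fragment_steps, grouping_steps):
    """Shared step types divided by the size of the larger set (assumed definition)."""
    fragment_steps, grouping_steps = set(fragment_steps), set(grouping_steps)
    larger = max(len(fragment_steps), len(grouping_steps))
    return len(fragment_steps & grouping_steps) / larger if larger else 0.0


def precision(matching_fragments, total_fragments):
    """Fraction of detected fragments that match some user-defined grouping."""
    return matching_fragments / total_fragments if total_fragments else 0.0


# Hypothetical fragment and grouping: 5 shared step types out of 6.
fragment = {"A", "B", "C", "D", "E"}
grouping = {"A", "B", "C", "D", "E", "F"}
is_match = overlap(fragment, grouping) > 0.8

# Figures taken from the inexact FGM results (WC1, MDL, minimum frequency).
wc1_mdl_min = precision(76, 264)
print(is_match, f"{wc1_mdl_min:.0%}")
```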
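  15. Evaluation: Inexact FGM techniques
      "Exact" columns: exact matches with groupings. "Overlap" columns: overlap (>80%) with groupings.
      Corpus  Workflows (w) + groupings (g)  Inexact FGM  Frequency  MultiStep frag.  Exact frag.  Exact prec.  Exact recall  Overlap frag.  Overlap prec.  Overlap recall
      WC1     475 (w) + 209 (g)              MDL          min        264              76           29%          11%           113            42%            16%
      WC1     475 (w) + 209 (g)              MDL          2%         64               21           32%          3%            27             42%            3%
      WC1     475 (w) + 209 (g)              MDL          5%         26               9            34%          1%            11             42%            1%
      WC1     475 (w) + 209 (g)              MDL          10%        19               8            42%          1%            10             52%            1%
      WC1     475 (w) + 209 (g)              Size         min        381              136          35%          19%           223            58%            32%
      WC1     475 (w) + 209 (g)              Size         2%         52               20           38%          2%            32             61%            4%
      WC1     475 (w) + 209 (g)              Size         5%         22               8            36%          1%            14             63%            3%
      WC1     475 (w) + 209 (g)              Size         10%        10               3            30%          0.4%          8              80%            1%
      WC2     96 (w) + 108 (g)               MDL          min        95               15           15%          7%            21             22%            10%
      WC2     96 (w) + 108 (g)               MDL          2%         95               15           15%          7%            21             22%            10%
      WC2     96 (w) + 108 (g)               MDL          5%         12               3            25%          1%            3              25%            1%
      WC2     96 (w) + 108 (g)               MDL          10%        5                2            40%          1%            2              40%            1%
      WC2     96 (w) + 108 (g)               Size         min        88               17           19%          8%            34             38%            16%
      WC2     96 (w) + 108 (g)               Size         2%         88               17           19%          8%            34             38%            16%
      WC2     96 (w) + 108 (g)               Size         5%         14               4            28%          2%            9              64%            4%
      WC2     96 (w) + 108 (g)               Size         10%        4                3            75%          1%            3              75%            1%
      WC3     375 (w) + 175 (g)              MDL          min        186              100          50%          18%           117            62%            21%
      WC3     375 (w) + 175 (g)              MDL          2%         23               7            30%          1%            11             47%            2%
      WC3     375 (w) + 175 (g)              MDL          5%         4                1            25%          0.1%          2              50%            0.3%
      WC3     375 (w) + 175 (g)              MDL          10%        0                0            0%           0%            0              0%             0%
      WC3     375 (w) + 175 (g)              Size         min        178              101          56%          18%           119            66%            22%
      WC3     375 (w) + 175 (g)              Size         2%         22               12           54%          2%            16             72%            3%
      WC3     375 (w) + 175 (g)              Size         5%         8                3            37%          0.5%          4              50%            0.7%
      WC3     375 (w) + 175 (g)              Size         10%        0                0            0%           0%            0              0%             0%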
  16. Evaluation: Inexact FGM techniques (same results table as slide 15)
      Frequent fragments overlap with groupings in the single-user corpora: at 10% frequency, precision is 30% to 75% for exact matches and 40% to 80% with >80% overlap.
  17. Evaluation: Inexact FGM techniques (same results table as slide 15)
      Precision decreases in the multi-user corpus. Best results are 50% to 56% with minimum frequency.
  19. Evaluation: Exact FGM techniques
      "Exact" columns: exact matches with groupings. "Overlap" columns: overlap (>80%) with groupings.
      Corpus  Wf (w) + groups. (g)  Support  MultiStep fragments  MultiStep filtered fragments  Exact frag.  Exact prec.  Exact recall  Overlap frag.  Overlap prec.  Overlap recall
      WC1     475 (w) + 209 (g)     5%       Out of memory        -                             -            -            -             -              -              -
      WC1     475 (w) + 209 (g)     10%      51613                16                            1            6.2%         0.1%          11             69%            1%
      WC1     475 (w) + 209 (g)     15%      2264                 8                             6            75%          0.8%          6              75%            0.8%
      WC1     475 (w) + 209 (g)     20%      3                    1                             0            0%           0%            0              0%             0%
      WC2     96 (w) + 108 (g)      5%       Out of memory        -                             -            -            -             -              -              -
      WC2     96 (w) + 108 (g)      10%      33236                4                             0            0%           0%            1              25%            0.4%
      WC2     96 (w) + 108 (g)      15%      25                   2                             0            0%           0%            0              0%             0%
      WC2     96 (w) + 108 (g)      20%      0                    0                             0            -            -             0              -              -
      WC3     375 (w) + 175 (g)     5%       5701                 3                             1            33%          0.1%          1              33%            0.1%
      WC3     375 (w) + 175 (g)     10%      1074                 1                             1            100%         0.1%          1              100%           0.1%
      WC3     375 (w) + 175 (g)     15%      1                    1                             0            0%           0%            0              0%             0%
      WC3     375 (w) + 175 (g)     20%      0                    0                             0            -            -             0              -              -
  20. Evaluation: Exact FGM techniques (same results table as slide 19)
      Fewer matching fragments than with inexact FGM, even when a high number of fragments is found.
  21. Evaluation: Exact FGM techniques (same results table as slide 19)
      How users define fragments affects the results.
  22. Preliminary Evaluation: User based evaluation
      • Manual evaluation: each user is given 16-18 common workflow fragments detected by FragFlow
      • 66% and 100% accuracy respectively
      • Some of the reasons to not use fragments depended on the user preferences
      • Currently evaluating additional users
      User          Use as proposed  Use with minor changes  Use with major changes  Use (total)
      User 1 (WC1)  11%              16.6%                   38%                     66.6%
      User 2 (WC2)  44%              6%                      50%                     100%
  23. Evaluation: Grouping analysis
      • Workflows with groupings are more common in single user corpora (WC1 and WC2)
      • Groupings are reused
        • 1463 groupings versus 209 unique groupings in WC1
        • 302 groupings versus 108 unique groupings in WC2
        • 456 groupings versus 175 unique groupings in WC3
      • Grouping size ranges from 0 to 60 steps
        • Facilitate copy paste by users (large grouping size)
        • Reducing unnecessary inputs (groupings with no steps)
      Corpus  Total group.  Unique multistep group.  Wf with group.  Avg. group. per wf  Max no. of steps in group.  Min no. of steps in group.
      WC1     1463          209                      327             4                   56                          1
      WC2     302           108                      42              7                   39                          0
      WC3     456           175                      89              5                   60                          1
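A quick way to read the reuse figures is to divide the total number of groupings by the number of unique groupings in each corpus. The snippet below does this with the numbers from the grouping table. The ratio is only an illustrative indicator introduced here, and it treats the two columns as comparable, which is an assumption, since the unique count covers multistep groupings only.

```python
# Figures from the grouping table: corpus -> (total groupings, unique multistep groupings).
groupings = {"WC1": (1463, 209), "WC2": (302, 108), "WC3": (456, 175)}

reuse_ratio = {corpus: total / unique for corpus, (total, unique) in groupings.items()}

for corpus, ratio in reuse_ratio.items():
    print(f"{corpus}: {ratio:.1f} groupings per unique grouping")
```

WC1 gives the highest ratio, 1463 / 209 = 7, while WC2 and WC3 stay below 3.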
  24. Findings
      With respect to our goals…
      • Goal 1: Are automatically detected workflow fragments similar to user-defined groupings?
        • (with freq 10%, single user, inexact FGM) 30% to 75% of the total FragFlow fragments found correspond directly to user-defined groupings
        • (multi user) Best results are 50% to 56% with inexact FGM at minimum frequency. If we consider the overlap of 80% of the steps, the precision is 62% to 66%
      • Goal 2: For those automatically detected fragments that were NOT similar to user-defined groupings, do users find them useful?
        • For one user 66% of the proposed fragments were useful, for another 100% were useful
        • Further evaluation is needed
      • Goal 3: How are workflows and groupings reused?
        • Those workflows with groupings have at least 4 groupings
        • Reuse of groupings (there are up to 7 times more groupings than unique groupings in the corpora)
  25. Limitations
      • Graph mining is computationally hard (subgraph isomorphism, on which it relies, is NP-complete)
      • Big fragments can take time to be recognized
      • Errors derived from memory heap issues
      • Detection of groupings may depend on user preferences on size and frequency
  26. Conclusions and Future Work
      • FragFlow: approach to find the most common fragments in a corpus of workflows
        • Several integrated graph mining techniques
      • FragFlow can be used with different settings
        • Minimum or maximum frequency and support
        • Size
        • Type of the graph mining algorithm to be applied
      • Evaluation of the results using corpora belonging to the LONI Pipeline system
      • New algorithms are being integrated!
        • Sigma (inexact FGM), Gaston (exact FGM)
      • Future work
        • Test FragFlow with other workflow systems and domains, and perform further user evaluations
        • Evaluate how workflow quality improves when automatically mined workflow fragments are proposed to users
      Evaluation and resources available here: http://purl.org/net/escience2014
  27. Who are we?
      • Daniel Garijo, Oscar Corcho: Ontology Engineering Group, UPM
      • Yolanda Gil: Information Sciences Institute, USC
      • Boris A. Gutman, Ivo D. Dinov, Paul Thompson, Arthur W. Toga: USC Laboratory of Neuro Imaging
  28. Want to collaborate? Contact me at dgarijo@fi.upm.es
      Questions?
