
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotations. IEEE BigData Conference 2013



Scientific workflows have become the workhorse of Big Data analytics for scientists. As well as being repeatable and optimizable pipelines that bring together datasets and analysis tools, workflows make up an important part of the provenance of data generated from their execution. By faithfully capturing all stages in the analysis, workflows play a critical part in building up the audit trail (a.k.a. provenance) metadata for derived datasets and contribute to the veracity of results. Provenance is essential for reporting results, reporting the method followed, and adapting to changes in the datasets or tools. These functions, however, are hampered by the complexity of workflows and, consequently, the complexity of the data trails generated from their instrumented execution. In this paper we propose the generation of workflow description summaries in order to tackle workflow complexity. We elaborate reduction primitives for summarizing workflows, and show how primitives, as building blocks, can be used in conjunction with semantic workflow annotations to encode different summarization strategies. We report on the effectiveness of the method through experimental evaluation using real-world workflows from the Taverna system.
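To make the idea of annotation-driven summarization concrete, here is a minimal sketch in Python. It is not the authors' implementation: it assumes a simplified graph with operation nodes only (no ports), and the `Workflow` class, the `eliminate` and `summarize` functions, the `show_image` operation, and the `DataVisualization` label are all hypothetical names chosen for illustration. The two other operation names and the `DataRetrieval` and `DataMoving` motifs are taken from the slides. The single rule encoded is "IF an operation carries an unwanted motif THEN eliminate it", where eliminating an operation reconnects its predecessors to its successors.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    # Hypothetical, simplified model: operation name -> set of motif labels.
    motifs: dict
    # Dataflow edges between operations, as (source, target) pairs.
    edges: set = field(default_factory=set)


def eliminate(wf, op):
    """Return a new workflow without `op`, bridging its neighbours."""
    preds = {s for (s, t) in wf.edges if t == op} - {op}
    succs = {t for (s, t) in wf.edges if s == op} - {op}
    kept = {(s, t) for (s, t) in wf.edges if op not in (s, t)}
    # Every path that went through `op` is replaced by a direct edge,
    # so no new cycle can appear in an acyclic workflow.
    bridged = {(p, s) for p in preds for s in succs}
    motifs = {k: v for k, v in wf.motifs.items() if k != op}
    return Workflow(motifs, kept | bridged)


def summarize(wf, unwanted):
    """Apply one rule: eliminate every operation annotated with an unwanted motif."""
    for op in sorted(wf.motifs):
        if wf.motifs[op] & unwanted:
            wf = eliminate(wf, op)
    return wf


example = Workflow(
    motifs={
        "color_pathway_by_objects": {"DataRetrieval"},
        "Get_Image_From_URL_2": {"DataMoving"},
        "show_image": {"DataVisualization"},  # hypothetical operation and label
    },
    edges={
        ("color_pathway_by_objects", "Get_Image_From_URL_2"),
        ("Get_Image_From_URL_2", "show_image"),
    },
)
summary = summarize(example, unwanted={"DataMoving"})
# summary.edges is now {("color_pathway_by_objects", "show_image")}
```

Returning a new `Workflow` from each rewrite, rather than mutating in place, keeps the original description available for comparison with its summary.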


  1. 1. Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotations Pinar Alper, Khalid Belhajjame, Carole A. Goble University of Manchester Pinar Karagoz Middle East Technical University IEEE 2nd International Congress on Big Data June 27-July 2, 2013
  2. 2. Pinar and her daughter Nile at the end of year school party.
  3. 3. • Data driven analysis pipelines • Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving • Tools for automating frequently performed data intensive activities • Provenance for the resulting datasets – The method followed – The resources used – The datasets used Scientific Workflows
  4. 4. Science with workflows GWAS, Pharmacogenomics Association study of Nevirapine-induced skin rash in Thai Population Trypanosomiasis (sleeping sickness parasite) in African Cattle Astronomy & HelioPhysics Library Doc Preservation Systems Biology of Micro- Organisms Observing Systems Simulation Experiments JPL, NASA BioDiversity Invasive Species Modelling [Credit Carole A. Goble]
  5. 5. Provenance is paramount for science • Reporting findings – Derivation - how did we get this result? • processes/programs used, execution trace, data lineage, source of components (data, services) – History - who did what when? • creator, contributors, timestamps. • Adapting to Change – Explanation - why did this record start to appear in the result? – Change Impact - which steps will be affected if I change this tool or data input?
  6. 6. PROV Primer, Gil et al WF Execution Trace Retrospective Provenance: Actual data used, actual invocations, timestamps and data derivation trace WF Description Prospective Provenance: Intended method for analysis
  7. 7. Workflows can get complex! • Overwhelming for users who are not the developers • Abstractions required for reporting • Lineage queries result in very long trails
  8. 8. Reason and extent of complexity • a.k.a. Shims • Dealing with data and protocol heterogeneities • Local organization of data Garijo D., Alper P., Belhajjame K. et al. D. Hull et al. ~ 60%
  9. 9. Static Ways To Tackle Complexity Process-Wise and Data-Wise abstractions • Sub-workflows – Not always a significant unit of function (e.g. aesthetic purposes) • Bookmarked data links – Cluster the output signature – Further complicates workflow • Components – Library dependent
  10. 10. Our Solution: Workflow Description Summaries • A graph model for representing workflows • Graph re-write rules for summarization: IF <performs certain function> (motifs) THEN <re-write WF graph> (reduction primitives)
  11. 11. • Domain Independent categorization – Data-Oriented Nature – Resource/Implementation-Oriented Nature • Captured In a lightweight OWL Ontology PART-1: Scientific Workflow Motifs
  12. 12. A graph model of data-driven workflows • Pure dataflows: W = <N, E> • Operation and port nodes: N = (N_op ∪ N_p) • Dataflow edges: E = (E_op→p ∪ E_p→p ∪ E_p→op)
  13. 13. Motif annotations over operations motifs(color_pathway_by_objects) = {m1:DataRetrieval} motifs(Get_Image_From_URL_2) = {m2:DataMoving}
  14. 14. PART-2: Workflow reduction primitives • Collapse (Up/Down) • Compose • Eliminate
  15. 15. Collapse Down
  16. 16. Collapse Up
  17. 17. Compose
  18. 18. Eliminate
  19. 19. How will rules be put to use • Strategies as a set of rules for summarization • Two sample strategies based on an empirical analysis of workflows • Reporting: – Process: Significant activities (Retrieval, Analysis, Visualization) – Data: • Reduced cardinality • Stripped of protocol specific payload/formatting
  20. 20. Two sample strategies • By-Elimination – Minimal annotation effort – Single rule • By-Collapse – More specific annotation – Multiple rules
  21. 21. Overall Approach [Diagram: Workflow Designer, Taverna Workbench, Motif Ontology, WF Description, Summarization Rules, Summarizer, WF Summary]
  22. 22. Analysis Data Set • 30 Workflows from the Taverna system • Entire dataset & queries accessible from • Manual Annotation using Motif Vocabulary
  23. 23. Summaries at a glance By-Collapse By-Elimination • Causal Ordering of operations • Reduced depth
  24. 24. By-Collapse
  25. 25. By-Elimination
  26. 26. Mechanistic Effect of Summarization
  27. 27. User Summaries vs. Summary Graphs
  28. 28. Related Work • User Views over provenance O. Biton, et al. – User specified significant operations – Automatic partitioning of workflow graph. • Provenance Redaction T. Cadenhead, et al. – Redaction primitives – Graph queries with regular expressions • Provenance Publishing S. C. Dey et al. – User policies on publishing (hide, retain) – Consistency checks
  29. 29. Highlights • Re-writing workflow graphs with rules • Exploiting semantic annotations of operations • Controlled, primitive-based re-writing – Preserve acyclicity • Users indirectly control the summarization – Encoding their preferences as summary rules Future Work • Querying of Workflow Execution Provenance using summaries.
  30. 30. Thank you! Carole A. GOBLE University of Manchester Khalid BELHAJJAME University of Manchester Pinar KARAGOZ Middle East Technical University Pinar ALPER University of Manchester
  31. 31. Bibliography D. Garijo, P. Alper, K. Belhajjame, O. Corcho, C. Goble, and Y. Gil. Common motifs in scientific workflows: An empirical analysis. In Proceedings of the IEEE eScience Conference, 2012. P. Alper, K. Belhajjame, C. A. Goble, and P. Senkul. Enhancing and abstracting scientific workflow provenance for data publishing. Submitted for publication to BIGProv 2013 International Workshop on Managing and Querying Provenance Data at Scale, co-located with EDBT-2013. O. Biton et al. Querying and Managing Provenance through User Views in Scientific Workflows. 2008 IEEE 24th International Conference on Data Engineering, pages 1072–1081, Apr. 2008. J. Cheney et al. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009. D. Hull et al. Treating shimantic web syndrome with ontologies. In AKT Workshop on Semantic Web Services, 2004. S. C. Dey, D. Zinn, and B. Ludäscher. Propub: towards a declarative approach for publishing customized, policy-aware provenance. In Proceedings of the 23rd international conference on Scientific and statistical database management, SSDBM’11, pages 225–243, Berlin, Heidelberg, 2011. Springer-Verlag. Y. Gil and S. Miles, editors. The PROV Model Primer. T. Cadenhead, V. Khadilkar, M. Kantarcioglu, and B. Thuraisingham. Transforming provenance using redaction. In Proceedings of the 16th ACM symposium on Access control models and technologies, SACMAT ’11, pages 93–102, New York, NY, USA, 2011. ACM. Taverna Open Source and Domain Independent Workflow Management System
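The graph model on slide 12 and the "preserve acyclicity" requirement on slide 29 can be illustrated with a short Python sketch. This is one possible reading of the slide, not the authors' code: it assumes the three edge sets stand for operation-to-port, port-to-port, and port-to-operation links, and the function names `classify_edges` and `is_acyclic` as well as the port names (`cpo_in`, `cpo_out`, `img_in`, `img_out`) are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def classify_edges(ops, ports, edges):
    """Split dataflow edges into the three assumed kinds of the model."""
    kinds = {"op->p": set(), "p->p": set(), "p->op": set()}
    for src, dst in edges:
        if src in ops and dst in ports:
            kinds["op->p"].add((src, dst))
        elif src in ports and dst in ports:
            kinds["p->p"].add((src, dst))
        elif src in ports and dst in ops:
            kinds["p->op"].add((src, dst))
        else:
            # Direct operation-to-operation links are not part of this model.
            raise ValueError(f"invalid dataflow edge: {src!r} -> {dst!r}")
    return kinds


def is_acyclic(nodes, edges):
    """Check that the workflow graph is a DAG using a topological sort."""
    preds = {n: set() for n in nodes}
    for src, dst in edges:
        preds[dst].add(src)
    try:
        list(TopologicalSorter(preds).static_order())
        return True
    except CycleError:
        return False


ops = {"color_pathway_by_objects", "Get_Image_From_URL_2"}
ports = {"cpo_in", "cpo_out", "img_in", "img_out"}  # hypothetical port names
edges = {
    ("cpo_in", "color_pathway_by_objects"),
    ("color_pathway_by_objects", "cpo_out"),
    ("cpo_out", "img_in"),
    ("img_in", "Get_Image_From_URL_2"),
    ("Get_Image_From_URL_2", "img_out"),
}
kinds = classify_edges(ops, ports, edges)
ok = is_acyclic(ops | ports, edges)
```

A check like `is_acyclic` could be run after each application of a reduction primitive to confirm that the rewritten graph is still a valid dataflow.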