Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Workflow systems support the design, configuration and execution of repetitive, multi-step pipelines and analytics. They are well established in many disciplines, notably biology and chemistry, but less so in biodiversity and ecology. From an experimental perspective, workflows are a means to handle the work of accessing an ecosystem of software and platforms, manage data and security, and handle errors. From a reporting perspective, they are a means to accurately document methodology for reproducibility, comparison, exchange and reuse, and to trace the provenance of results for review, credit, workflow interoperability and impact analysis. Workflows operate in an evolving ecosystem and are assemblages of components in that ecosystem; their provenance trails are snapshots of intermediate and final results. Taking a lifecycle perspective, what are the challenges in workflow design and use with different stakeholders? What needs to be tackled in evolution, resilience, and preservation? And what are the “mitigate or adapt” strategies adopted by workflow systems in the face of changes in the ecosystem/environment, for example when tools are deprecated or datasets become inaccessible because of funding shortfalls?

Published in: Education, Technology

  1. 1. Workflows, Provenance & Reporting A Lifecycle Perspective Professor Carole Goble FREng FBCS The University of Manchester, UK carole.goble@manchester.ac.uk 3rd – 6th September 2013, Rome, Italy
  2. 2. The Scientific and Technical Ecosystem Mobilising Big and Broad Data • Streaming • Sweeps through models • Integrative analysis • Results synthesis • Heavy compute Interoperability, plugging together • Multi step chains, Multi software / data • Mixed resources / platforms • Incompatibility smoothing • Trans-disciplinary, Alien processes [DataONE]
  3. 3. Workflow nutshell [Figure labels: BioSTIF; inputs (data, parameters, configurations); outputs; Services; Resources] • A series of automated / interactive data analysis steps • Process data at scale • Import data / codes from one’s own research and/or from existing libraries • Pipelines & analytic and synthesis procedures • Chains of components • Bridges between resources • Shield from change and operational complexity • Releasing capacity
  4. 4. Provisioning Workflows [Figure labels: Users; Appln Service; Workflows; Wfms; Composition; Incorporation; Invocation] Applications • Applications components of workflows • Compose applications into workflows • Incorporate workflows into applications Infrastructure • Provision physical resources to support application workflows • Coordinate resources through workflows • Optimise and adapt to change [Foster 2005]
  5. 5. Assembly of Components Interoperability Covering up incompatibility
  6. 6. Flexible variation Stabilising Optimising
  7. 7. Workflows: maturing approach Underpin integrative platforms. Established in many disciplines, notably chemistry and biology, esp. ‘omics: assembly, synthesis, annotation, analytics. Overlaps with metagenomics, phylogenetics and genetic ecology Powering service based science and science as a service http://www.globus.org/genomics/solution Sandve, Nekrutenko, Taylor, Hovig Ten simple rules for reproducible in silico research, PLoS Comp Bio submitted
  8. 8. Ecological Niche modelling, population modelling, Metagenomics and Phylogenetics ‘omics pipelines and analytic workflows http://www.biovel.eu Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis http://camera.calit2.net/index.shtm Combine species occurrence data with global climate, terrain and land cover information, to identify environmental correlates of species ranges. http://www.lifemapper.org/species BioDiversity
  9. 9. Taxonomic Data Refinement www.biovel.eu • Synonym expansion • Taxonomic name resolution • Occurrence retrieval • Spell checking • Geographic and taxonomic cleaning • Temporal refinement • Data processing log [Matthias Obst, INTECOL 2013]
  10. 10. Data Operations in Workflows in the Wild Analysis of 260 publicly available workflows in Taverna, WINGS, Galaxy and Vistrails Garijo et al Common Motifs in Scientific Workflows: An Empirical Analysis, in press, FGCS
  11. 11. Large Scale Ecological Niche Modeling Workflow Step 1: Explorative modeling - Use unfiltered data - Use fixed parameters: Mahalanobis distance (Farber and Kadmon 2003) - Native projections - Test the model, distribution of points, number of points Step 2: Deep modeling - Filtering environmentally unique points with BioClim algorithm (Nix 1986) - ENM with Support Vector Machine (Cristianini & Shawe-Taylor 2000) and Maximum Entropy (Phillips 2004) - Parameter optimization (if necessary) on the model test results - 2 masks (model generate, model project) Analytical cycle: Data discovery; Data assembly, cleaning, and refinement; Ecological Niche Modeling; Statistical analysis. Pilumnus hirtellus Enclosed sea problem (Ready et al., 2010) [Matthias Obst, INTECOL 2013]
  12. 12. Workflow-enabled science • Common Templates • Prepared components • Systematic assembly • (Steered) automation • Hybrid combinations • Variations • Extensibility • Customisation • Parameterisation • Repeats • Cross-run synthesis • Routine, pooled methods • Tracking
  13. 13. Repeated model sweeps Ten insect species were modelled: European spruce bark beetle – Ips typographus L. Bordered white moth (syn. pine looper) - Bupalus piniarius L., (syn. B. piniaria L.) Pine-tree lappet - Dendrolimus pini L. Mottled umber - Erannis defoliaria Clerck Nun moth - Lymantria monacha L. Winter moth - Operophtera brumata L. Pine beauty moth - Panolis flammea Den. & Schiff Green oak tortrix - Tortrix viridana L. European pine sawfly – Neodiprion sertifer Geoffr. Common pine sawfly – Diprion pini L. Tortrix viridana Image by Kimmo & Seppo Silvonen Lymantria monacha data configuration parameters steps Päivi Lyytikäinen-Saarenmaa presentation, INTECOL 2013
  14. 14. http://www.jisc.ac.uk/whatwedo/campaigns/res3/jischelp.aspx Workflows workflows results provenance process (log) results (origin) Reporting Record of science Reproducibility Transparent process Integrate with reporting systems Know how Training See Penev presentation
  15. 15. Provenance: the link between computation and results • W3C PROV model standard • record for reporting • compare diffs/discrepancies • provenance analytics • track changes, adapt • partial repeat/reproduce • carry attributions • compute credits • compute data quality/trust • select data to keep/release • optimisation and debugging [Figure: (i) Trace A and (ii) Trace B] PDIFF: comparing provenance traces to diagnose divergence across experimental results [Woodman et al, 2011]
  16. 16. [Freire]http://www.aosabook.org/en/vistrails.html Collecting -> Using Provenance Instrumenting, cross-tool interoperability Reporting at different scales
  17. 17. Publishing with Provenance
  18. 18. Summary: Infrastructure Productivity Customise Process • Environment: Legacy, others and your own software, datasets, services, codes, and platforms. Optimise and manage use of computing infrastructure, HPC, clouds and platforms. • WFMS middleware: Support the design, config. and execution of workflows. Manage utility actions for data, logging, security, compute, errors… Shield incompatibilities / complexity / change. • Workflow: Parameterised, integrative, multi-step (data) pipelines, analytics, computational protocols that can be repetitively reused. Dependency-rich interoperability. • Apps: Domain/task specific apps that incorporate (an ecosystem of) workflows. Integrate
  19. 19. Summary: User Productivity: Capability Raising • Access: Framework to access and leverage heterogeneous legacy applications, services, datasets and codes and combine with yours. Shielding from complexity. • Customise: Rapid development: Flexibility, Extensibility, Adaptability, Reuse. Reusable Workflow Components. • Process: Integration, Reusable workflows/components. Automated plumbing + Interaction. Systematic, repetitive and unbiased analysis and processing and error handling. Ensembles, comparisons, “what ifs”. • Record: Process reporting. Citation tracking. Reproducibility, Provenance, Audit. Quality Control. Standard Operating Procedures.
  20. 20. Workflow Commodities building cohorts, capturing traits, explicit reporting, clear instructions • Workflow templates • Workflow sets • Libraries of sub workflow parts • Design practices for mix, match and reuse • Future proofed design predicting need to adapt • Discovery and exchange • Workflow engineers • Workflow custodians
  21. 21. Seeding a workflow library
  22. 22. Workflow Commodities exchanging, curating, preserving, packaging, life cycle management http://www.researchobject.org http://www.dcc.ac.uk
  23. 23. Katy’s student’s 200 hours Tracking where data went Workflow Commodities getting credit, capability, engineers and custodians
  24. 24. Application Building user variety, outcome focused • Right apps, right users. • Commodity apps: – Web. Spreadsheets. R. • Customisation • Mixed workflow / scripting • Deployment / Portability – Web based / desktop – Virtualised deployments – Cloud hosted service – A cloud-enabled local host • Local ownership • Capability building [Figure labels: BioDiversity; Workflow Visibility (Low to High); Versatility (Low to High); Concept Knowledge; Technology/Infrastructure; Policy makers; Domain Scientist; Computational Scientist; Technical specialists; Custom Specific Apps; General Toolkits]
  25. 25. Who are the users? • Policy makers? • Biodiversity researcher? • Computational scientist? • Tool developer? • Service provider? • Infrastructure provider? • Digital custodian?
  26. 26. Workflow management systems • Integrated into community frameworks, coupled into tools • Virtualised (Web) Services • Scaling, Optimisation • Interoperability, Using provenance • No one workflow language/system • Specialisation & its cost • Plug-ins for common community platforms and resources • Mitigating and adapting to changes in infrastructures and resources. • Sustainability and engineering Generic Specific http://www.erflow.eu/
  27. 27. Population dynamics The life cycle of infrastructures • Dynamics: Mitigate, Adapt, Disperse, Die • Standard and maintained prog. interfaces (APIs) • Standard formats and ids • Stability, reliability, repair • Interoperability • Semantic descriptions • Sustainability of services and infrastructure • Instrument resources for citation & microattribution • Coupled services and infrastructure.
  28. 28. Impact of dependencies [Zhao et al. Why workflows break e-Science 2012]
  29. 29. Summary Scale. Standards data formats, programmatic interfaces. Governance. Workflow commodities Design practices Credit A seamless, pluggable service. Scale. Adaptability. Specific-Generic tension. Putting provenance to use for data credit. Embedding workflows in common applications Integration into reporting and publishing lifecycles
  30. 30. BioDiversity Virtual e-Laboratory www.biovel.eu Wf4Ever www.wf4ever-project.org SysMO www.sysmo-db.org SCaleable Preservation Environments http://www.scape-project.eu
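
The "workflow nutshell" on slide 3 (a chain of automated steps over input data) and the "data processing log" on slide 9 can be illustrated with a short Python sketch. This is only a toy under stated assumptions, not how Taverna, Galaxy or any BioVeL workflow is implemented: the helper `run_workflow`, the step functions `resolve_names` and `drop_duplicates`, the synonym table and the sample records are all hypothetical. The only detail taken from the slides is that Bupalus piniaria is listed as a synonym of Bupalus piniarius.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Step = Callable[[List[Record]], List[Record]]


def resolve_names(records: List[Record]) -> List[Record]:
    """Map synonyms to an accepted name (hypothetical in-memory table)."""
    synonyms = {"Bupalus piniaria": "Bupalus piniarius"}
    return [{**r, "name": synonyms.get(r["name"], r["name"])} for r in records]


def drop_duplicates(records: List[Record]) -> List[Record]:
    """Keep the first record for each (name, site) pair."""
    seen = set()
    unique = []
    for r in records:
        key = (r["name"], r["site"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def run_workflow(steps: List[Step], data: List[Record]) -> Tuple[List[Record], List[dict]]:
    """Run the steps in order and keep a simple processing log."""
    log = []
    for step in steps:
        result = step(data)
        log.append(
            {"step": step.__name__, "records_in": len(data), "records_out": len(result)}
        )
        data = result
    return data, log


# Hypothetical occurrence records, for illustration only.
occurrences = [
    {"name": "Bupalus piniaria", "site": "A"},
    {"name": "Bupalus piniarius", "site": "A"},
    {"name": "Dendrolimus pini", "site": "B"},
]

cleaned, log = run_workflow([resolve_names, drop_duplicates], occurrences)
for entry in log:
    print(entry)
```

The log is deliberately minimal: one entry per step with record counts. A workflow management system would typically also capture parameters, timestamps and the identity of the services called.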

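
Slide 15 mentions comparing two provenance traces to diagnose why results diverge. The Python sketch below illustrates the general idea in its simplest form: each trace is a list of step events with digests of their inputs and outputs, and the comparison reports the first position where the two runs differ. It is not the PDIFF algorithm of Woodman et al., which is not described in these slides; the `Event` type, the helper names, the step labels and the data values are all assumptions made for illustration.

```python
import hashlib
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    """One executed step, identified by digests of what it read and wrote."""
    step: str
    input_digest: str
    output_digest: str


def digest(value: str) -> str:
    """Short SHA-256 digest standing in for a content identifier."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def record(step: str, value_in: str, value_out: str) -> Event:
    return Event(step, digest(value_in), digest(value_out))


def first_divergence(a: List[Event], b: List[Event]) -> Optional[int]:
    """Return the index of the first differing event, or None if the traces match."""
    for i, (event_a, event_b) in enumerate(zip(a, b)):
        if event_a != event_b:
            return i
    if len(a) != len(b):
        # One trace is a strict prefix of the other.
        return min(len(a), len(b))
    return None


# Two hypothetical runs: identical until step S2 produces a different output.
trace_a = [record("S0", "d1", "d2"), record("S1", "d2", "w"), record("S2", "w", "y")]
trace_b = [record("S0", "d1", "d2"), record("S1", "d2", "w"), record("S2", "w", "y-changed")]

index = first_divergence(trace_a, trace_b)
print(index, trace_a[index].step)
```

Comparing digests rather than raw data keeps the traces small and makes the check independent of data size, at the cost of only saying that two values differ, not how.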