Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome

Workflow systems support the design, configuration and execution of repetitive, multi-step pipelines and analytics. They are well established in many disciplines, notably biology and chemistry, but less so in biodiversity and ecology. From an experimental perspective, workflows are a means to handle the work of accessing an ecosystem of software and platforms, managing data and security, and handling errors. From a reporting perspective, they are a means to accurately document methodology for reproducibility, comparison, exchange and reuse, and to trace the provenance of results for review, credit, workflow interoperability and impact analysis. Workflows operate in an evolving ecosystem and are assemblages of components in that ecosystem; their provenance trails are snapshots of intermediate and final results. Taking a lifecycle perspective, what are the challenges in workflow design and use for different stakeholders? What needs to be tackled in evolution, resilience, and preservation? And what are the “mitigate or adapt” strategies adopted by workflow systems in the face of changes in the ecosystem/environment, for example when tools are deprecated or datasets become inaccessible because of funding shortfalls?
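
As a rough illustration of the idea that a workflow both runs a chain of steps and records where each result came from, here is a minimal Python sketch. It does not represent Taverna or any other system named in this talk. The helper names (`run_workflow`, `fingerprint`), the three name-cleaning steps and the trace format are all assumptions made for the example, and the trace is a plain list of dictionaries rather than a W3C PROV document.

```python
import hashlib
import json


def fingerprint(value):
    """Return a short, stable hash of a JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_workflow(steps, data):
    """Run the steps in order and record one provenance entry per step."""
    trace = []
    for name, func in steps:
        result = func(data)
        trace.append({
            "step": name,
            "input": fingerprint(data),     # what the step consumed
            "output": fingerprint(result),  # what the step produced
        })
        data = result
    return data, trace


# Hypothetical steps, loosely modelled on cleaning a list of species names.
def strip_whitespace(names):
    return [n.strip() for n in names]


def drop_empty(names):
    return [n for n in names if n]


def deduplicate(names):
    return sorted(set(names))


STEPS = [
    ("strip_whitespace", strip_whitespace),
    ("drop_empty", drop_empty),
    ("deduplicate", deduplicate),
]

raw_names = [" Tortrix viridana", "", "Diprion pini ", "Tortrix viridana"]
result, trace = run_workflow(STEPS, raw_names)

for entry in trace:
    print(entry["step"], entry["input"], "->", entry["output"])
```

Because each entry stores a fingerprint of its input and output, the output of one step can be matched to the input of the next, which is the kind of link a provenance trail needs for later review or partial replay.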

Published in: Education, Technology
  • The Technical Environment: Challenging Areas and Promising Technologies. Workflows, provenance and reporting: a lifecycle perspective. Bio: Carole Goble is a full professor in Computer Science at the University of Manchester, UK, and a partner of the Software Sustainability Institute UK. She has an international reputation in semantic technologies, distributed computing and social computing for scientific collaboration through eLabs.
She directs the myGrid project, which produces the widely used open source Taverna workflow management system; myExperiment, a social web site for sharing scientific workflows; the BioDiversityCatalogue of web services; and the SEEK for storing, sharing and preserving Systems Biology outcomes, which is part of the ERANet e-infrastructure for EU-based Systems Biology. Her technical infrastructure underpins the EU BioVeL Project e-Laboratory. In 2008 Carole was awarded the Microsoft Jim Gray award for outstanding contributions to e-Science. In 2010 she was elected a Fellow of the Royal Academy of Engineering. In 2012 she was nominated for the Benjamin Franklin award for open science in Biology. She serves on the UK BBSRC funding agency governance Council and is the Deputy Director of the UK's Node of the ESFRI ELIXIR programme.
  • Katy Willis's talk on Wednesday shows the value of automation: data integration; standardised pipelines; automatic record of experiment and set-up; report and variant reuse. Systematically capture, coordinate, run and record the steps. Buffered infrastructure: platform libraries, plugins; infrastructure components, services.
  • Aimed at different layers of the software stack. “The Many Faces of IT as Service”, Foster, Tuecke, 2005. “Provisioning”: reservation to configuration to … make sure the resource will do what I want it to do, with the right qualities of service. Virtualization = separation of concerns between provider and consumer of “content”: client and service; service provider and resource provider. Provisioning = assemble and configure resources to meet user needs. Management = sustain desired qualities of service despite a dynamic environment.
  • Just in time interoperability by papering over the cracks.
  • Scale of data, from Matthias's talk. Geographic: we can build models in China and project them into Europe. Taxonomic: we can build models for plants (phytoplankton), animals (birds), and in one year hopefully even microbial communities. Environmental: sea and land; still very difficult for lakes and rivers.
  • Analysis factories. Typical variations in workflows. Local and global workflow population variations. Micro and macro level.
  • Came up in the policy session. Reporting perspective: accurately document methodology for reproducibility, comparison, exchange and reuse; trace the provenance of results for review, credit, workflow interoperability and impact analysis.
  • Simplify. Track: versions and retractions; error propagation; contributions and credits. Fix: workflow repair, alternate component discovery, black box annotation. Rerun and replay: partial reproducibility (replay some of the workflow); a verifiable, reviewable trace in people terms. Analyse: calculate data quality and trust; decide what data to keep or release. Compare: to find differences and discrepancies. S. Woodman, H. Hiden, P. Watson, P. Missier. Achieving Reproducibility by Combining Provenance with Service and Workflow Versioning. In: The 6th Workshop on Workflows in Support of Large-Scale Science, 2011, Seattle.
  • http://www.aosabook.org/en/vistrails.html http://biodiversity.ku.edu/blog/lab-notes/lifemapper-vistrails-better-science
  • Galaxy pages (30K users, 1K new users/month)
  • Environment: services, codes, datasets, platforms
  • Workflow templates. Workflow sets. Libraries of sub-workflow parts. Design practices for mix, match and reuse. Future-proofed design: mitigate or adapt. Discovery and exchange. Life cycle management. Curation. Packaging. Credit and publishing. Workflow engineers. Workflow custodians.
  • Local level or EU hosted
  • Reducing sensitivity, robustness to loss SHIWA and ER Flow Factories
  • Reducing mortality, invasion, predatory. Black boxes. Poor metadata: incompatibility of data formats and identifiers; poor awareness of or adherence to standards. Poor methodology: unrepeatable or unknown experimental method; black boxes; incorrect interpretations and poor quality. Poor service / tool / resource ethic: service decay, service palpability & complexity, service reliability & stability, poor diagnostics. GEO, GEOSS, ecosystems, earth observations. NextData c2012.org Encyclopedia of Life. Global BioDiversity Informatics Conference www.gbic2012.org Dawn and Cynthia Parr (EOL)
  • A virtual machine (VM) is a software implementation of a machine (i.e. a computer) that executes programs like a physical machine. Virtual machines are separated into two major classifications, based on their use and degree of correspondence to any real machine: System. Zhao, Gomez-Perez, Belhajjame, Klyne, Garcia-Cuesta, Garrido, Hettne, Roos, De Roure and Goble. Why workflows break: Understanding and combating decay in Taverna workflows, 8th Intl Conf e-Science 2012. “Reproducibility success is proportional to the number of dependent components and your control over them”. Many reasons why. Change / availability: updates to public datasets, changes to services / codes; availability of / access to components and the execution environment; platform differences on simulations, code ports. Volatile third-party resources (50%): not available, available but inaccessible, changed. Prevent, Detect, Repair.
  • Logbook data. Capacity, services, collaboration. Variation, diversity and change at all levels. Modularity. Plugins. Separate services from underlying infrastructure. Ensure service networks are built using standard Web 2.0 technologies. Separation of applications, workflows and VREs from the services.
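
The “Prevent, Detect, Repair” note above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the cited paper: the `Component` class, the `detect_decay` and `propose_repairs` helpers, the component names and the `STATUS` table are all hypothetical, and the availability probe is a stub standing in for whatever real check (a service ping, a dataset lookup) a workflow system would perform.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Component:
    """A third-party resource a workflow depends on (hypothetical model)."""
    name: str
    alternates: List[str] = field(default_factory=list)


def detect_decay(components: List[Component],
                 is_available: Callable[[str], bool]) -> List[Component]:
    """Detect: return the components whose primary resource is unavailable."""
    return [c for c in components if not is_available(c.name)]


def propose_repairs(broken: List[Component],
                    is_available: Callable[[str], bool]) -> Dict[str, Optional[str]]:
    """Repair: map each broken component to its first available alternate.

    The value is None when no alternate is available.
    """
    repairs: Dict[str, Optional[str]] = {}
    for component in broken:
        repairs[component.name] = next(
            (alt for alt in component.alternates if is_available(alt)), None
        )
    return repairs


# Stub availability table (an assumption; a real probe would query the resource).
STATUS = {
    "names-service-v1": False,
    "names-service-v2": True,
    "climate-layers": True,
    "legacy-tool": False,
}


def is_available(name: str) -> bool:
    return STATUS.get(name, False)


WORKFLOW = [
    Component("names-service-v1", alternates=["names-service-v2"]),
    Component("climate-layers"),
    Component("legacy-tool"),
]

broken = detect_decay(WORKFLOW, is_available)
repairs = propose_repairs(broken, is_available)
print(repairs)
```

Recording alternates alongside each component at design time is one way to read the “prevent” part: the workflow carries its own fallback options, so a missing resource does not automatically mean a broken run.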

    1. Workflows, Provenance & Reporting: A Lifecycle Perspective. Professor Carole Goble FREng FBCS, The University of Manchester, UK. carole.goble@manchester.ac.uk 3rd – 6th September 2013, Rome, Italy
    2. The Scientific and Technical Ecosystem. Mobilising Big and Broad Data: • Streaming • Sweeps through models • Integrative analysis • Results synthesis • Heavy compute. Interoperability, plugging together: • Multi-step chains, multi software / data • Mixed resources / platforms • Incompatibility smoothing • Trans-disciplinary, alien processes [DataONE]
    3. Workflow nutshell • A series of automated / interactive data analysis steps • Process data at scale • Import data / codes from one’s own research and/or from existing libraries • Pipelines & analytic and synthesis procedures • Chains of components • Bridges between resources • Shield from change and operational complexity • Releasing capacity. Diagram labels: BioSTIF; inputs (data, parameters, configurations); outputs; Services; Resources
    4. Provisioning Workflows. Applications: • Applications as components of workflows • Compose applications into workflows • Incorporate workflows into applications. Infrastructure: • Provision physical resources to support application workflows • Coordinate resources through workflows • Optimise and adapt to change [Foster 2005]. Diagram labels: Users; Workflows; Application Services; WfMS; Composition; Incorporation; Invocation
    5. Assembly of Components. Interoperability. Covering up incompatibility
    6. Flexible variation. Stabilising. Optimising
    7. Workflows: a maturing approach. Underpin integrative platforms. Established in many disciplines, notably chemistry and biology, esp. ‘omics: assembly, synthesis, annotation, analytics. Overlaps with metagenomics, phylogenetics and genetic ecology. Powering service-based science and science as a service http://www.globus.org/genomics/solution Sandve, Nekrutenko, Taylor, Hovig, Ten simple rules for reproducible in silico research, PLoS Comp Bio, submitted
    8. Ecological niche modelling, population modelling, metagenomics and phylogenetics. ‘Omics pipelines and analytic workflows http://www.biovel.eu Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis http://camera.calit2.net/index.shtm Combine species occurrence data with global climate, terrain and land cover information to identify environmental correlates of species ranges http://www.lifemapper.org/species BioDiversity
    9. Taxonomic Data Refinement www.biovel.eu • Synonym expansion • Taxonomic name resolution • Occurrence retrieval • Spell checking • Geographic and taxonomic cleaning • Temporal refinement • Data processing log [Matthias Obst, INTECOL 2013]
    10. Data Operations in Workflows in the Wild. Analysis of 260 publicly available workflows in Taverna, WINGS, Galaxy and Vistrails. Garijo et al., Common Motifs in Scientific Workflows: An Empirical Analysis, in press, FGCS
    11. Large Scale Ecological Niche Modeling Workflow. Step 1: Explorative modeling - Use unfiltered data - Use fixed parameters: Mahalanobis distance (Farber and Kadmon 2003) - Native projections - Test the model, distribution of points, number of points. Step 2: Deep modeling - Filtering environmentally unique points with BioClim algorithm (Nix 1986) - ENM with Support Vector Machine (Cristianini & Shawe-Taylor 2000) and Maximum Entropy (Phillips 2004) - Parameter optimization (if necessary) on the model test results - 2 masks (model generate, model project). Analytical cycle: Data discovery; Data assembly, cleaning, and refinement; Ecological Niche Modeling; Statistical analysis. Pilumnus hirtellus. Enclosed sea problem (Ready et al., 2010) [Matthias Obst, INTECOL 2013]
    12. Workflow-enabled science • Common Templates • Prepared components • Systematic assembly • (Steered) automation • Hybrid combinations • Variations • Extensibility • Customisation • Parameterisation • Repeats • Cross-run synthesis • Routine, pooled methods • Tracking
    13. Repeated model sweeps. Ten insect species were modelled: European spruce bark beetle - Ips typographus L.; Bordered white moth (syn. pine looper) - Bupalus piniarius L. (syn. B. piniaria L.); Pine-tree lappet - Dendrolimus pini L.; Mottled umber - Erannis defoliaria Clerck; Nun moth - Lymantria monacha L.; Winter moth - Operophtera brumata L.; Pine beauty moth - Panolis flammea Den. & Schiff; Green oak tortrix - Tortrix viridana L.; European pine sawfly - Neodiprion sertifer Geoffr.; Common pine sawfly - Diprion pini L. Tortrix viridana. Image by Kimmo & Seppo Silvonen. Lymantria monacha. Diagram labels: data, configuration, parameters, steps. Päivi Lyytikäinen-Saarenmaa presentation, INTECOL 2013
    14. Workflows: workflows, results, provenance: process (log), results (origin). Reporting: Record of science; Reproducibility; Transparent process; Integrate with reporting systems; Know-how; Training. See Penev presentation. http://www.jisc.ac.uk/whatwedo/campaigns/res3/jischelp.aspx
    15. Provenance: the link between computation and results. W3C PROV model standard. Record for reporting; compare diffs/discrepancies; provenance analytics; track changes, adapt; partial repeat/reproduce; carry attributions; compute credits; compute data quality/trust; select data to keep/release; optimisation and debugging. Figure: (i) Trace A: d1 S0 d2 S1 w S2 y S4 df; (ii) Trace B: d1' S0 d2 S1 z w S'2 y' S4 df'. PDIFF: comparing provenance traces to diagnose divergence across experimental results [Woodman et al, 2011]
    16. Collecting -> Using Provenance. Instrumenting, cross-tool interoperability. Reporting at different scales. [Freire] http://www.aosabook.org/en/vistrails.html
    17. Publishing with Provenance
    18. Summary: Infrastructure Productivity. Environment: legacy, others' and your own software, datasets, services, codes, and platforms; optimise and manage use of computing infrastructure, HPC, clouds and platforms. WFMS middleware: support the design, configuration and execution of workflows; manage utility actions for data, logging, security, compute, errors…; shield incompatibilities / complexity / change. Workflow: parameterised, integrative, multi-step (data) pipelines, analytics, computational protocols that can be repetitively reused; dependency-rich interoperability. Apps: domain/task-specific apps that incorporate (an ecosystem of) workflows. Integrate.
    19. Summary: User Productivity: Capability Raising. Access: framework to access and leverage heterogeneous legacy applications, services, datasets and codes, and combine them with yours; shielding from complexity. Customise: rapid development: flexibility, extensibility, adaptability, reuse; reusable workflow components. Process: integration, reusable workflows/components; automated plumbing + interaction; systematic, repetitive and unbiased analysis and processing, and error handling; ensembles, comparisons, “what ifs”. Record: process reporting; citation tracking; reproducibility, provenance, audit; quality control; standard operating procedures.
    20. Workflow Commodities: building cohorts, capturing traits, explicit reporting, clear instructions • Workflow templates • Workflow sets • Libraries of sub-workflow parts • Design practices for mix, match and reuse • Future-proofed design, predicting the need to adapt • Discovery and exchange • Workflow engineers • Workflow custodians
    21. Seeding a workflow library
    22. Workflow Commodities: exchanging, curating, preserving, packaging, life cycle management http://www.researchobject.org http://www.dcc.ac.uk
    23. Workflow Commodities: getting credit, capability, engineers and custodians. Katy’s student’s 200 hours. Tracking where data went.
    24. Application Building: user variety, outcome focused • Right apps, right users. • Commodity apps: – Web. Spreadsheets. R. • Customisation • Mixed workflow / scripting • Deployment / Portability – Web based / desktop – Virtualised deployments – Cloud hosted service – A cloud-enabled local host • Local ownership • Capability building. Chart labels: Workflow Visibility; BioDiversity; Concept Knowledge; Technology/Infrastructure; Domain Scientist; Technical specialists; Computational Scientist; Policy makers; Custom Specific Apps; General Toolkits; Versatility; Low; High
    25. Who are the users? • Policy makers? • Biodiversity researcher? • Computational scientist? • Tool developer? • Service provider? • Infrastructure provider? • Digital custodian?
    26. Workflow management systems • Integrated into community frameworks, coupled into tools • Virtualised (Web) Services • Scaling, Optimisation • Interoperability, Using provenance • No one workflow language/system • Specialisation & its cost • Plug-ins for common community platforms and resources • Mitigating and adapting to changes in infrastructures and resources • Sustainability and engineering. Generic vs. Specific. http://www.erflow.eu/
    27. Population dynamics: the life cycle of infrastructures • Dynamics: Mitigate, Adapt, Disperse, Die • Standard and maintained programmatic interfaces (APIs) • Standard formats and ids • Stability, reliability, repair • Interoperability • Semantic descriptions • Sustainability of services and infrastructure • Instrument resources for citation & microattribution • Coupled services and infrastructure
    28. Impact of dependencies [Zhao et al., Why workflows break, e-Science 2012]
    29. Summary. Scale. Standards: data formats, programmatic interfaces. Governance. Workflow commodities. Design practices. Credit. A seamless, pluggable service. Scale. Adaptability. Specific-Generic tension. Putting provenance to use for data credit. Embedding workflows in common applications. Integration into reporting and publishing lifecycles.
    30. BioDiversity Virtual e-Laboratory www.biovel.eu Wf4Ever www.wf4ever-project.org SysMO www.sysmo-db.org SCaleable Preservation Environments http://www.scape-project.eu
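
The provenance slide's idea of comparing two traces to find where results diverge can be illustrated with a small Python sketch. This is not the PDIFF algorithm from Woodman et al.; it is a much simpler linear comparison written for this note. The function name `first_divergence`, the record fields (`service`, `input`, `output`) and the two sample traces are assumptions, and the labels only loosely echo those in the slide's figure.

```python
def first_divergence(trace_a, trace_b):
    """Return a description of the first step where two linear traces differ.

    Each trace is a list of dicts with 'service', 'input' and 'output' keys.
    Returns None when the traces are identical.
    """
    for index, (a, b) in enumerate(zip(trace_a, trace_b)):
        differs_in = sorted(
            key for key in ("service", "input", "output") if a[key] != b[key]
        )
        if differs_in:
            return {"index": index, "step": a["service"], "differs_in": differs_in}
    if len(trace_a) != len(trace_b):
        # One trace is a strict prefix of the other.
        return {
            "index": min(len(trace_a), len(trace_b)),
            "step": None,
            "differs_in": ["length"],
        }
    return None


# Illustrative traces: the third service was replaced by a new version,
# so its output and everything downstream changes.
TRACE_A = [
    {"service": "S0", "input": "d1", "output": "d2"},
    {"service": "S1", "input": "d2", "output": "w"},
    {"service": "S2", "input": "w", "output": "y"},
    {"service": "S4", "input": "y", "output": "df"},
]
TRACE_B = [
    {"service": "S0", "input": "d1", "output": "d2"},
    {"service": "S1", "input": "d2", "output": "w"},
    {"service": "S2'", "input": "w", "output": "y'"},
    {"service": "S4", "input": "y'", "output": "df'"},
]

report = first_divergence(TRACE_A, TRACE_B)
print(report)
```

Reporting the earliest differing step, rather than every difference, points a reviewer at the likely cause (here a changed service version) instead of at its downstream effects.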
