Assessing Galaxy's ability to express scientific workflows in bioinformatics
Presentation Transcript

  • Introduction Biological Sequence Analysis Scientific workflow management systems The Galaxy framework Conclusion Bibliography References Assessing Galaxy’s ability to express scientific workflows in bioinformatics Peter van Heusden and Alan Christoffels South African National Bioinformatics Institute University of the Western Cape Bellville, South Africa 10th FASTAR/Espresso Workshop 2013 / 4-6 November 2013 Peter van Heusden and Alan Christoffels Assessing Galaxy’s ability to express scientific workflows in bioinformatics
  • What is bioinformatics?
    • Bioinformatics is the discipline of solving problems in biology and medicine using computational resources.
    • Within bioinformatics, biological sequence analysis (BSA) describes those analyses that “infer biological information from sequence alone” (Durbin, 1998).
    • The cost of biological sequence analysis has two parts:
      1. Cost of acquiring sequence
      2. Cost of analysing sequence
  • Cost of acquiring sequence (Wetterstrand, 2013)
  • Cost of analysing sequence
    • The “sudden reliance on computation has created an ‘informatics crisis’ for life science researchers: computational resources can be difficult to use, and ensuring that computational experiments are communicated well and hence reproducible is challenging” (Goecks et al., 2010)
    • As the cost of sequencing plummets, analysis faces two challenges:
      1. Growing data volume demands more sophisticated computational approaches
      2. Translating biological questions into computational workflows remains difficult
  • How do we do bioinformatics?
    • Example question: given a set of protein sequences from species A, which genes from species B produce similar proteins, and where are these genes located on the genome of B?
    • Analysis proceeds (Stevens et al., 2001) using:
      1. Collections of data objects
      2. Transformers that generate new collections (e.g. transform a collection of proteins into a collection of the genome regions that they match)
      3. Filters (e.g. discard low-quality matches to the genome)
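
The collection / transformer / filter vocabulary above can be sketched in a few lines of Python. This is only an illustration: the `Match` record, the `toy_aligner` stand-in and its made-up scores are assumptions introduced here, not part of the talk or of any real alignment tool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one protein-to-genome match."""
    protein_id: str
    chromosome: str
    start: int
    end: int
    score: float


def transform(proteins: Iterable[str],
              aligner: Callable[[str], List[Match]]) -> List[Match]:
    """Transformer: turn a collection of proteins into a collection of genome regions."""
    return [match for protein in proteins for match in aligner(protein)]


def keep_high_quality(matches: Iterable[Match], min_score: float) -> List[Match]:
    """Filter: discard matches scoring below min_score."""
    return [match for match in matches if match.score >= min_score]


def toy_aligner(protein_id: str) -> List[Match]:
    """Stand-in for a real alignment tool; the matches are invented."""
    table = {
        "protA1": [Match("protA1", "chr1", 100, 400, 0.95),
                   Match("protA1", "chr2", 50, 200, 0.30)],
        "protA2": [Match("protA2", "chr3", 10, 310, 0.80)],
    }
    return table.get(protein_id, [])


proteins = ["protA1", "protA2"]            # collection of data objects
regions = transform(proteins, toy_aligner)  # transformer
good = keep_high_quality(regions, 0.5)      # filter
```

Each stage consumes one collection and produces another, which is what makes this style of analysis a natural fit for a dataflow description.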
  • How do we do bioinformatics? (2)
    • Data collections typically exist as (compressed) files
    • Bioinformatics tools are typically command-line executables that accept and generate files (often using ad hoc formats)
    • Scripting languages (Perl, Python) are used to compose workflows; APIs are often used for reading and writing file formats
      1. Workflow enactment often involves manual steps and is closely tied to the execution environment
      2. Workflows are neither easily reproducible nor reusable
  • Scientific workflow management systems
    • Scientific workflow management systems (SciWMSs) have been proposed as an alternative to current script-based approaches to analysis workflows.
    • SciWMSs “provide a high-level declarative way of specifying what a particular in silico experiment modelled by a workflow is set to achieve, not how it will be executed.” (Taverna project, 2009)
    • Workflow descriptions resemble dataflow languages (McPhillips et al., 2009)
  • The promise of SciWMSs
    • In addition to workflow specification, SciWMSs sometimes offer:
      • Types that model objects of the scientific domain
      • Recording of the provenance of data objects
      • Execution of scientific workflows on diverse computing environments (desktop, cluster, grid, cloud)
  • SciWMSs for bioinformatics
    • Many workflow systems have been proposed for use in bioinformatics: Taverna, Kepler, Triana, BioOpera, Mobyle, BiosFlow, Bpipe
    • Some workflow features are also available in Galaxy
  • The Galaxy framework (section outline): Support for workflow patterns; Scientific Data Modelling; Workflow representation and use
  • What is Galaxy?
    • Galaxy emerged in 2004/5 as a web interface to bioinformatics tools and data
    • Galaxy is becoming a common platform through which to “publish” tools and data
    • More than 30 known public Galaxy servers
    • 36 000 users on the main public Galaxy server, 0.8 PB of data
  • Galaxy as an open-source project
    • Galaxy consists of c. 250 000 lines of (mostly Python) code
    • Core team includes 15 developers spread across 4 different institutes
    • Development is open source and “out in the open”, with code hosted on BitBucket and development planning on Trello and mailing lists
  • Galaxy I
  • Galaxy II
  • Galaxy workflow management features
    • Galaxy allows composition of workflows defined as a series of tasks and the related dataflow
    • Allows execution of workflows on the local machine or via various job schedulers
    • Data objects generated in Galaxy have associated provenance information
  • Limitations of Galaxy as a SciWMS
    • Limited support for scientific workflow patterns
    • Type refers to the format of data items
    • Provenance is recorded as an attribute of data files
    • Workflows are not first-class objects
    • Analysis view focuses on individual datasets
    • Execution engine schedules tasks (with limited support for task collections)
    • Galaxy can be enriched by drawing on prior research on SciWMSs
  • Scientific workflow patterns
    • Analysis of scientific workflows has yielded a set of design patterns used in workflows (Yildiz et al., 2009)
    • The Galaxy workflow language supports sequential dataflow, parallel split and synchronisation
    • The tool definition language has recently been extended to support multiple instances of task (not workflow) execution with a priori runtime knowledge
    • Tool authors can signal that the input to a tool can be split for parallel execution
    • No interface between workflow authors and multiple-instance support
    • Support for cancelling an individual task but not an entire workflow
    • No support for triggering a new thread of activity (restart)
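
The three patterns named above as supported (sequential dataflow, parallel split, synchronisation) can be illustrated in plain Python. This is a generic sketch of the patterns, not of Galaxy's workflow language; the step functions `clean`, `gc_content` and `length` are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def clean(seq: str) -> str:
    """Normalise a raw sequence string."""
    return seq.strip().upper()


def gc_content(seq: str) -> float:
    """Fraction of G and C characters in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def length(seq: str) -> int:
    """Number of characters in the sequence."""
    return len(seq)


def run_workflow(raw: str) -> dict:
    # Sequential dataflow: the cleaned sequence feeds every later step.
    cleaned = clean(raw)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Parallel split: two independent tasks start from the same input.
        gc_future = pool.submit(gc_content, cleaned)
        len_future = pool.submit(length, cleaned)
        # Synchronisation: wait for both branches before combining results.
        return {"gc": gc_future.result(), "length": len_future.result()}


summary = run_workflow("  acgtgg ")
```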
  • Scientific workflow patterns (2)
    • No support for exclusive choice (e.g. executing a different dataflow path based on different input)
    • No support for sub-workflows
    • The Galaxy workflow language is “abstraction hating” (Green and Petre, 1996)
    • Leads to workflow diagrams resembling a bowl of spaghetti for anything but the simplest cases
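
For contrast, the two missing patterns are easy to express in a general-purpose language. The sketch below shows exclusive choice (one branch chosen from the input) and a sub-workflow (a composed unit reused as a single step). All function names and the crude alphabet test are assumptions made for illustration.

```python
from typing import List


def looks_like_nucleotide(seq: str) -> bool:
    """Crude check: only nucleotide letters present (illustrative, not robust)."""
    return set(seq.upper()) <= set("ACGTUN")


def nucleotide_branch(seq: str) -> str:
    return f"nucleotide search on {len(seq)} bases"


def protein_branch(seq: str) -> str:
    return f"protein search on {len(seq)} residues"


def search(seq: str) -> str:
    # Exclusive choice: exactly one of the two dataflow paths is taken.
    if looks_like_nucleotide(seq):
        return nucleotide_branch(seq)
    return protein_branch(seq)


def annotate(seqs: List[str]) -> List[str]:
    # Sub-workflow: `search` is itself a small workflow, reused here as one step.
    return [search(seq) for seq in seqs]


results = annotate(["ACGT", "MKVL"])
```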
  • The Galaxy type system
    • Galaxy types represent file types
    • File type does not map simply to semantics
    • Collection types are not supported, although some types are “splittable” to allow parallel task execution
    • Workflow parameters are not supported via the type system
    • Cannot guarantee that a workflow is well-formed
    • Provenance recording is coarse-grained
    • What will happen if we update a single element of an input data collection?
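
One way to read "well-formed" is that every connection in a workflow joins an output type to a matching input type. The sketch below checks that property over a toy workflow. The `Step` record and the domain type names (`ProteinSet`, `GenomeRegionSet`, `NucleotideSet`) are hypothetical and are not Galaxy datatypes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    """Hypothetical workflow step with one typed input and one typed output."""
    name: str
    input_type: str
    output_type: str


def type_errors(steps: Dict[str, Step], edges: List[Tuple[str, str]]) -> List[str]:
    """Return one message per edge whose producer and consumer types disagree."""
    errors = []
    for src, dst in edges:
        produced = steps[src].output_type
        expected = steps[dst].input_type
        if produced != expected:
            errors.append(f"{src} -> {dst}: produces {produced}, expects {expected}")
    return errors


steps = {
    "align": Step("align", "ProteinSet", "GenomeRegionSet"),
    "filter": Step("filter", "GenomeRegionSet", "GenomeRegionSet"),
    "translate": Step("translate", "NucleotideSet", "ProteinSet"),
}

ok = type_errors(steps, [("align", "filter")])
bad = type_errors(steps, [("align", "filter"), ("filter", "translate")])
```

A check like this can run before any task is scheduled, which is the kind of guarantee a purely file-format-based type system cannot give.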
  • Science questions vs execution plans
    • The type system could model scientific domain objects (e.g. protein and nucleotide sequences) but . . .
    • Bioinformatics tools either do not support standard formats or support them with quirks
    • It is not clear what information to save from tool output
    • Experienced bioinformaticists want the opportunity to review “raw output” to explore factors that underpin confidence in the analysis
    • Need to support both recording and reporting of workflow output
    • Both recording a “raw” output trace and reporting the provenance of scientific domain objects are necessary features for a SciWMS
  • Workflow execution in Galaxy
    • Internally, workflows are expanded into collections of tasks at execution time
    • Tasks are executed by backend classes: either locally or via a scheduler
    • Execution parameters can be set by “dynamic job runners”
    • Allows e.g. resource requirements of a job to be signalled to the scheduler
    • Configured using a combination of XML and Python code maintained by the Galaxy administrator
    • Workflow execution leaves no visible trace in the user interface
    • At runtime, execution shows individual jobs running
    • Data objects are grouped by “history”, not associated with a workflow
    • No support for re-execution of part of a workflow
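
The idea behind a "dynamic job runner" rule, choosing where and with what resources a job runs based on properties of the job, can be sketched as a standalone Python function. This is not Galaxy's actual rule interface: the function name, the destination names `local` and `cluster_bigmem`, the resource figures and the size threshold are all assumptions for illustration.

```python
def choose_destination(input_size_bytes: int,
                       threshold_bytes: int = 1_000_000_000) -> dict:
    """Pick a (hypothetical) destination and resource request from input size."""
    if input_size_bytes > threshold_bytes:
        # Large inputs: ask the scheduler for more cores and memory.
        return {"destination": "cluster_bigmem", "cores": 16, "memory_gb": 64}
    # Small inputs: run on the local machine with minimal resources.
    return {"destination": "local", "cores": 1, "memory_gb": 4}


small_job = choose_destination(20_000_000)
large_job = choose_destination(5_000_000_000)
```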
  • Scope for workflow optimisation
    • Workflows are dataflow graphs (Johnston et al., 2004)
    • Knowledge of inputs and types can be used to plan execution efficiently, e.g. pipeline tasks and exploit opportunities for streaming
    • Collections of data objects and parameter sets can be exploited for automatic parallel enactment of tasks and sub-workflows
    • Data collections and workflows provide structures for nesting of provenance information
    • Knowledge of data provenance could facilitate the lifecycle of data products: kept for re-use or discarded as “intermediate products”
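
Treating a workflow as a dataflow graph makes planning straightforward: a topological traversal yields batches of tasks whose inputs are all available and which can therefore run in parallel. The sketch below uses the standard-library `graphlib` module (Python 3.9+); the task names in the toy workflow are invented.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Toy workflow: each task maps to the set of tasks it depends on.
workflow: Dict[str, Set[str]] = {
    "align": {"fetch_proteins", "fetch_genome"},
    "filter": {"align"},
    "report": {"filter"},
}


def parallel_batches(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into batches; tasks within a batch are mutually independent."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose inputs are done
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch as finished
    return batches


batches = parallel_batches(workflow)
```

Here the two fetch tasks land in the first batch and the remaining steps follow one at a time, which is the plan a scheduler could derive without any hints from the workflow author.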
  • Conclusion
    • Bioinformatics faces an “informatics crisis” as the cost to generate sequence has decreased while the cost to compose or reproduce analysis has remained high
    • Galaxy has emerged as a popular interface to bioinformatics tools and data with workflow management features
    • Insight from prior research on SciWMSs suggests areas for enhancement:
      • Support for additional workflow patterns
      • Extension of the type system with support for biological types, collections and parameter sets
      • Improvement of workflow execution through treating workflows as first-class objects with associated optimisation of execution and provenance storage
    • Currently being pursued as a research agenda at SANBI
  • Thanks
    • Workflows for biological sequence analysis are discussed by the “Pipelines collaboration”
    • Research on SciWMS supported by the MRC and Professor Alan Christoffels
  • Bibliography I
    • R. Durbin. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Apr. 1998. ISBN 9780521629713.
    • J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol, 11(8), 2010.
    • T. R. G. Green and M. Petre. Usability analysis of visual programming environments: a ‘cognitive dimensions’ framework. Journal of Visual Languages and Computing, 7:131–174, 1996.
    • W. M. Johnston, J. R. P. Hanna, and R. J. Millar. Advances in dataflow programming languages. ACM Computing Surveys, 36(1):1–34, Mar. 2004.
    • T. McPhillips, S. Bowers, D. Zinn, and B. Ludäscher. Scientific workflow design for mere mortals. Future Generation Computer Systems, 25(5):541–551, May 2009.
    • R. Stevens, C. Goble, P. Baker, and A. Brass. A classification of tasks in bioinformatics. Bioinformatics, 17(2):180–188, Feb. 2001.
    • Taverna project. Why use workflows?, 2009. URL http://www.taverna.org.uk/introduction/why-use-workflows/.
  • Bibliography II
    • K. Wetterstrand. DNA sequencing costs: Data from the NHGRI genome sequencing program (GSP), 2013. URL http://www.genome.gov/sequencingcosts/.
    • U. Yildiz, A. Guabtni, and A. H. H. Ngu. Towards scientific workflow patterns. In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS ’09, pages 13:1–13:10, New York, NY, USA, 2009. ACM.