The Taverna Workflow Management Software Suite - Past, Present, Future

  1. 1. The Taverna Workflow Management Software Suite: Past, Present, Future. Prof Carole Goble CBE FREng FBCS CITP The University of Manchester, UK Software Sustainability Institute UK carole.goble@manchester.ac.uk http://www.taverna.org.uk http://www.mygrid.org.uk
  2. 2. More of what we generally do!
  3. 3. e-Science, Computational Science, Scientific Computing • Support global scientific collaboration, enable large scale resource, tools and results sharing, assist scientific processing, avoid unnecessary repeated work. • Accelerate scientific discovery, improving scientific productivity, stimulate technological innovation. • Cope with scales and speed of scientific innovation and data.
  4. 4. Data-centric Computation Scientific workflows over Distributed Cyber-Infrastructure. Data sharing Social Methods libraries and catalogues for all types of scientific artefacts and all types of scientists. Knowledge Management Metadata, semantics digital exchange, preservation, publishing Software Engineering Software sustainability, software and data policy, training Products Methods Systems Biology Chemistry Astro-Physics Astronomy Biology Social Science Library Digital Preservation Biodiversity Public Health Applications
  5. 5. Computer Science Software Engineering Scientific Informatics Computational Science THEORY PRACTICE APPLICATION fundamental applied PRODUCT (Open Source) PRINCIPLE Science “USE CASE”
  6. 6. Long Tail Little science Self-organising groups Disconnected, independent, distributed scientists Disconnected, independent, distributed resources Open in the wild. Organised science Organised groups Clubs of scientists Organised, planned and in-house resources Closed and well behaved services.
  7. 7. VPH-Share Models of Human Physiology Eagle Genomics Next Generation Sequencing based Patient Diagnostics Astronomy & HelioPhysics Document Preservation Digitisation Systems Biology OpenTox Project Chemistry Development Kit Drug Toxicity Ecological Niche Modelling Population Modelling Metagenomics Phylogenetics • Data cleaning • Data movement • Data retrieval and annotation • Data analysis • Data mining • Knowledge management • Data curation and data warehouse population • Data visualisation • Parameter sweeps over simulations Drug discovery, small molecules, targets, compounds OpenPHACTS
  8. 8. BioSTIF Inputs: data, parameters, configurations Outputs Workflow in a nutshell • Orchestrate series of automated / interactive steps – Process pipelines – Analytic and synthesis procedures – Repetitive code-run sweeps • Housekeeping tasks – Process data at scale – Auto documentation • Mix in house & public resources, native hosting – Chain and choreograph components – Handle interoperability – Bridge resources – Shield operational complexity and change Services & Resources Infrastructures
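The "orchestrate series of automated steps" and "auto documentation" points on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Taverna code: `run_pipeline` and the record layout are hypothetical names invented here, and each step is assumed to be a plain function taking one value and returning one value.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_pipeline(steps: List[Step], data: Any) -> Tuple[Any, List[Dict[str, Any]]]:
    """Run named steps in order, feeding each output into the next step.

    Returns the final value plus a log with one record per step, so the
    run documents itself as a side effect of being executed.
    """
    log: List[Dict[str, Any]] = []
    for name, step in steps:
        result = step(data)
        # Record what went in and what came out of this step.
        log.append({"step": name, "input": data, "output": result})
        data = result
    return data, log
```

A call such as `run_pipeline([("clean", str.strip), ("upper", str.upper)], "  acgt ")` chains the two steps and returns both the result and the per-step record.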
  9. 9. Taverna Workflow Management http://www.taverna.org.uk • Dataflow – Computational Lambda Calculus with a monad extension* – Simple control flows, iterations over collections – Data type agnostic, domain independent – Data movement, monitoring, staging, reference – Custom (VO Tables), XML, JSON • Mixed steps – Services, codes & command line tools – SOAP + REST Web Services – Scripts: R, “In Workflow Programming” Beanshell scripting … – Codes: Java, libraries, HPC, Grid and ~Cloud platforms etc … – Nested workflows – Interactions and Batch *Turi et al Taverna Workflows: Syntax and Semantics e-Science 2007: 441-448; Sroka et al A formal semantics for the Taverna 2 workflow model J. Comput. Syst. Sci. 76(6): 490-508 (2010)
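The "iterations over collections" bullet can be made concrete with a small Python sketch. It is not Taverna's engine or API. The names `apply_step`, `cross` and `dot` are hypothetical, and the sketch assumes a step written for single values that is applied element by element when it receives a list, with two ways of combining a pair of list inputs.

```python
from typing import Any, Callable, List


def apply_step(step: Callable[[Any], Any], value: Any) -> Any:
    """Apply a single-value step to a value or, recursively, to each item of a (nested) list."""
    if isinstance(value, list):
        return [apply_step(step, item) for item in value]
    return step(value)


def cross(step: Callable[[Any, Any], Any], xs: List[Any], ys: List[Any]) -> List[List[Any]]:
    """Run a two-input step on every combination of xs and ys (result is nested one level deeper)."""
    return [[step(x, y) for y in ys] for x in xs]


def dot(step: Callable[[Any, Any], Any], xs: List[Any], ys: List[Any]) -> List[Any]:
    """Run a two-input step on items paired by position; both lists must have the same length."""
    if len(xs) != len(ys):
        raise ValueError("dot product needs lists of equal length")
    return [step(x, y) for x, y in zip(xs, ys)]
```

The step itself never mentions lists; the iteration is supplied around it, which is the property the slide describes as data type agnostic dataflow.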
  10. 10. • Computational Lambda Calculus • Visual Programming • Process mining • Adaptive & parallel computing • Cloud computing • SOA, Semantic Web Services • Data integration, data quality • Semantic representation and linked data • Reporting & tracking, credit propagation • Workflow reusability, quality, discovery • Security, monitoring, fault detection • AI planning, re-run analysis, auto-planning, auto-repair, auto-composition, auto-annotation, service discovery, service matching, auto-substitution E.Science laboris Tools Standards Services
  11. 11. Weeks -> Hours Surprise predicted result tested in lab. DAXX Gene Genetic differences between breeds Noyes, PNAS 2011 108(22) 9304-9309 BioDiversity Invasive Species Modelling American Horseshoe Crabs in the Baltic Trypanosomiasis resistance in African Cattle Software as a Service / (Cloud) Appliance Analytic bottleneck Repetitive, unbiased, accurate record, taming data, transparency, avoiding shortcuts. Interactive steps Dev. Years->Weeks Runs. Weeks -> Hours Generalised ENM data mapping and overlaying pipelines. Workflow-based Computation
  12. 12. VPH-Share @neurIST Aneurysm Morphology Workflow. [Figure labels: Patients; Patient Avatar; Disease Simulation Workflow; Patient Avatar updated; RISK. Patient Avatar fields: Patient Pseudo-identifier (PID), Demographics, Height, Weight, Vital Signs, Heart Rate, Blood Pressure, Flow Rate, Transient Pressure, Aneurysm Properties, Tissue Properties, Wall Thickness, Risk Factors, Medical Images, Medications, Systemic Factors, Gene Expression Profile. Updated avatar fields: Aneurysm Rupture Profile, Morphology Profile, Haemodynamic Profile, Mechanobiological Profile, Prediction Uncertainty.] [Susheel Varma] http://www.vph-share.eu/
  13. 13. • Morphological, hemodynamic and structural analyses have been linked to aneurysm genesis, growth and rupture. • Evidence indicating differences in morphology and flow between ruptured and unruptured aneurysms has been shown for reduced patient cohorts. • Structural wall mechanics has been used to justify the growth and remodelling happening at the aneurysm level. Confidence in physical measures + images + BC, material + BC, material Morphological analysis Direct diagnostic power + Morphological descriptors Structural descriptors Hemodynamic descriptors Haemodynamic analysis Structural analysis Practically, morphological characterizations might currently have the highest predictive capabilities with respect to the other analyses. Morphological Workflow [Susheel Varma]
  14. 14. Medical image from imaging equipment @neurIST morphological descriptors Complex indices (Zernike moment invariants) Basic size indices describing aneurysm sac depth neck Morphological Analysis Workflow [Susheel Varma]
  15. 15. Implementation in VPH-Share The @neurIST morphological workflow specification in Taverna [Susheel Varma]
  16. 16. Biodiversity marine monitoring and health assessment ecological niche modelling Data Intensive Science Collaborative Science Pilumnus hirtellus Enclosed sea problem (Ready et al., 2010) Sarah Bourlat
  17. 17. Ecological Niche Modeling. Step 1: Explorative modeling - Use unfiltered data - Use fixed parameters: Mahalanobis distance - Native projections - Test the model, distribution of points, number of points. Step 2: Deep modeling - Filtering environmentally unique points with BioClim algorithm - ENM with Support Vector Machine and Maximum Entropy - Parameter optimization (if necessary) on the model test results - 2 masks (model generate, model project). Analytical cycle: Data discovery; Data assembly, cleaning, and refinement; Ecological Niche Modeling; Statistical analysis. Pilumnus hirtellus Enclosed sea problem (Ready et al., 2010) The workflows work over large geographical, taxonomic, and environmental scales, incl. terrestrial ecosystems Baltic species invasions of various crabs/sea creatures Interactions of different forest insects and trees
  18. 18. Ecological Niche Modeling. Step 1: Explorative modeling - Use unfiltered data - Use fixed parameters: Mahalanobis distance - Native projections - Test the model, distribution of points, number of points. Step 2: Deep modeling - Filtering environmentally unique points with BioClim algorithm - ENM with Support Vector Machine and Maximum Entropy - Parameter optimization (if necessary) on the model test results - 2 masks (model generate, model project). Analytical cycle: Data discovery; Data assembly, cleaning, and refinement; Ecological Niche Modeling; Statistical analysis. Pilumnus hirtellus Enclosed sea problem (Ready et al., 2010) The workflows work over large geographical, taxonomic, and environmental scales, incl. terrestrial ecosystems Baltic species invasions of various crabs/sea creatures Interactions of different forest insects and trees BioSTIF
  19. 19. www.biovel.eu Ecological Niche Modeling Workflow (ENM)
  20. 20. data configuration parameters steps Data and Parameter Sweeps
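A data and parameter sweep of the kind named on this slide can be sketched as follows. This is a minimal illustration in Python rather than a Taverna workflow. `sweep` and `scale_and_shift` are hypothetical names, and the sketch assumes the step accepts one data item plus keyword parameters.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Sequence


def sweep(step: Callable[..., Any],
          data_items: Sequence[Any],
          parameter_grid: Dict[str, Sequence[Any]]) -> List[Dict[str, Any]]:
    """Run step once for every data item and every combination of parameter values."""
    names = sorted(parameter_grid)  # fixed order so runs are reproducible
    runs: List[Dict[str, Any]] = []
    for item in data_items:
        for values in product(*(parameter_grid[name] for name in names)):
            params = dict(zip(names, values))
            runs.append({"data": item, "params": params, "result": step(item, **params)})
    return runs


def scale_and_shift(x: float, factor: float, offset: float) -> float:
    """Toy stand-in for a simulation or model run."""
    return x * factor + offset
```

For two data items and a grid of two factors by two offsets, `sweep(scale_and_shift, [1, 2], {"factor": [1, 10], "offset": [0, 5]})` yields eight runs, each recorded with the data and parameters that produced it.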
  21. 21. Hosted installation Local installations
  22. 22. Taverna: a Knowledge Discovery Framework • Asthma sputum inflammatory phenotypes, a transcriptome analysis, Saeedeh Maleki-Dizaji, Chris Newby, Rachid Berair, Rod Smallwood, Chris Brightling 2014 (to be submitted) • A systematic approach to a transcriptome analysis to asthma sputum inflammatory phenotypes, ISMB 2014. • The Battle of the Sexes starts in the oviduct: modulation of oviductal transcriptome by X and Y-bearing spermatozoa: Almiñana C, Caballero I, Heath PR, Maleki-Dizaji S, Parrilla I, Cuello C, Gil MA, Vazquez JL, Vazquez JM, Roca J, Martinez EA, Holt WV and Fazeli A. submitted to BMC Genomics 2014 (In Press) • Transcription regulation network involving E2F6, IRF7 and STAT1, Thomas R.J. Lovewell, Andrew J.G. McDonagh, Andrew G Messenger, Saeedeh Maleki-Dizaji, Mimoun Azzouz and Rachid Tazi-Ahnini, submitted to PNAS, 2014 • Kiran, M., Bicak, M., Maleki-Dizaji, S., Holcombe, M. FLAME: A Platform for High Performance Computing of Complex Systems. Journal of Acta Physica Polonica 2011. • Maleki-Dizaji S, Holcombe M, Rolfe MD, Fisher P, Green J, Poole RK, Graham AI, A Systematic Approach to Understanding Escherichia coli Responses to Oxygen: From Microarray Raw Data to Pathways and Published Abstracts, Online J Bioinformatics, (1):51-59, 2009 [Saeedeh Maleki-Dizaji]
  23. 23. Application Runtime Middleware Resources/Codes/Services Infrastructures Repositories Execution Activity Plug-ins Application Scufl Runtime Middleware Resources/Codes/Services Platforms Repositories Taverna Desktop Workbench Taverna Online Web Tool Portals and Applications Engine Server Player Cmd line Provenance Third Party Servers BioSTIF Workflows & workflow components PROV, OPM Data Provenance Registries
  24. 24. Taverna Workflow Management Open extensibility • Plug-in framework – Command line tool – Data Services: VOTables for AstroTaverna – Optimisations: E.g. Holl. model parameter sweeps – Infrastructures: Grid, HPC, Web Services – Domains: CDK, BioMart, VOTable – Commodities: Excel Spreadsheets, Open Refine, R • Plug into other frameworks & platforms – Portals: Scratchpads – Interactive platforms: iPython Notebook – Wfms: KNIME Node, Galaxy tool, Kepler Actor • Third party applications – Taverna Online – XworX – OGC chainer
  25. 25. Taverna Online: 3rd party app Dr Vadim Surpin and Vitaly Sharanutsa, Institute for Information Transmission Problems of Russian Academy of Sciences (IITP RAS) An online, in-browser application for assembling and running Taverna Workflows over a HPC platform http://onlinehpc.com/site/main
  26. 26. Interoperability: Data format/identity mismatches Service interface handling Components: Well described, behaved, curated, annotated modularised workflow modules • Semantic annotations, prescribed failover, formats, provenance • Organised into common families
  27. 27. Taverna Directions. Access: Framework to access and leverage heterogeneous legacy applications, services, datasets and codes. Shielding from complexity. Customise: Rapid development: Flexibility, Extensibility, Adaptability, Reuse. Reusable Workflow Components. Process: Automated plumbing + Interaction. Systematic, repetitive and unbiased analysis and processing and error handling. Ensembles, comparisons, “what ifs”. Access: Cloud and Scale, Registries. Standards data formats, programmatic interfaces. Adapting to change. Security. Governance of components. Process: Seamless, pluggable wf as a service. Scale. Adaptability. Specific-Generic tension. Easier development, user experience. Workflow commodities, Research Objects. Design practices for reuse. Credit. Executable interactive notebooks. Provenance. A tool for reproducibility. Report. Embed: Workflows in common applications. Integration into reporting & publishing. Underpin integrative platforms. Service based science and science as a service
  28. 28. Fix on demand. Notify as needed. Monitor for decay Workflow/Service Monitors 3rd Party Monitors Workflow analytics Detect and Repair QUASAR toolkit [Zhao et al. Why workflows break e-Science 2012]
  29. 29. The Execution Provenance Gap Data tracking Summarisation, Labelling, Distillations, Selective tracking Filtering Big Fine grain 1 White box One System Special tools Collection A Big Graph What do I cite? What did I do? N Black boxes Many Systems My Lab Book Analytics Smart in situ Presentation Why am I citing? Pinar Alper, Khalid Belhajjame, Carole A. Goble, Pinar Karagoz: Enhancing and abstracting scientific workflow provenance for data publishing. EDBT/ICDT Workshops 2013: 313-318 Sarah Cohen Boulakia, Jiuqiang Chen, Paolo Missier, Carole A. Goble, Alan R. Williams, Christine Froidevaux: Distilling structure in Taverna scientific workflows: a refactoring approach. BMC Bioinformatics 15(S-1): S12 (2014) http://provenanceweek.dlr.de
  30. 30. Tracking Provenance File Stores Lab Books Repositories • Granularity • Scales • Blackbox • Hybrid
  31. 31. Research Objects • Bundles and relates multi-hosted digital resources of a scientific experiment or investigation using standard mechanisms • Descriptive reproducibility • Exchange, Releasing paradigm for publishing http://www.researchobject.org/
  32. 32. Flexibility Review, Revise/Discard Scale Deploy into tools Comparison Personal Group Production Research Reporting Harden
  33. 33. http://nbviewer.ipython.org/github/myGrid/DataHackLeiden/blob/alan/Player_example.ipynb https://www.youtube.com/watch?v=QVQwSOX5S08 ?
  34. 34. Archiving Publishing Component Libraries Preserving Recording Storing Exchanging Versioning Sharing PACKS
  35. 35. SEEK4Science Sharing and interlinking Methods, Models, Data… Data Model Article External Databases Metadata
  36. 36. Virtual Liver Network BMBF “Großprojekt” • ~45 organisations, ~70 groups • multiscale rep. of the liver • clinical impact • general public portal • Same key requirements: yellow pages, exchange of all sops/data/models, sharing rights • Different biology – Multiscale data – Multiscale models – Imaging • Different project structure – Hierarchies (A, A1, A1.2) – Regional groups of groups • Flexibility, extensibility, open sourceness of SEEK key
  37. 37. simulate models project mgt, access control reporting, citation governance & policies yellow pages of peers projects, experts catalogue and link data, models, samples, specimens, sops, experiments, publications using standards curate & annotate data and models using standards access, link to and deposit in public data and model repositories manage, store and exchange different types and scales of data integrate local and project tools and data systems scaled-out collection & processing
  38. 38. experimentalists, modellers, X- informaticians, computational Xs, software engineers, computer scientists, systems administrators, resource providers, tool builders social scientists, librarians, curators Social Computation Storing, Sharing and Reusing data, methods, models, between collaborating and competing scientists e-Laboratories, collaboratories, VREs, repositories An ego-system
  39. 39. Computer Scientist Software Engineer Social Engineer
  40. 40. Knowledge Computation •Accurate, intelligible and comparable descriptions •Data interoperability •Machine readable metadata Semantic technologies, Ontologies, Linked Data, Data schema
  41. 41. Semantic Description Describing and linking data in terms of shared concepts, relationships and identifiers Data object property data property subClassOf Ontology Person Organization Place State name birthdate bornIn worksFor state name phone name livesIn City Event ceo location organizer nearby startDate endDate title isPartOf postalCode Column 1 Column 2 Column 3 Column 4 Column 5 Bill Gates Oct 1955 Microsoft Seattle WA Mark Zuckerberg May 1984 Facebook White Plains NY Larry Page Mar 1973 Google East Lansing MI [Taheriyan et al adapted]
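The idea on this slide, describing table columns in terms of shared ontology concepts and relationships, can be sketched in Python. The column-to-property mapping below is an assumption made for illustration (column 4 read as the city a person was born in, column 5 as that city's state); the class and property names are taken from the slide, while `row_to_triples` and the blank-node identifiers are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# The three example rows shown on the slide.
ROWS = [
    ("Bill Gates", "Oct 1955", "Microsoft", "Seattle", "WA"),
    ("Mark Zuckerberg", "May 1984", "Facebook", "White Plains", "NY"),
    ("Larry Page", "Mar 1973", "Google", "East Lansing", "MI"),
]


def row_to_triples(index: int, row: Tuple[str, str, str, str, str]) -> List[Triple]:
    """Turn one table row into subject-predicate-object triples.

    Assumed mapping: column 1 = Person name, 2 = birthdate,
    3 = Organization (worksFor), 4 = City (bornIn), 5 = State of that city.
    """
    name, birthdate, org_name, city_name, state_name = row
    person, org = f"_:person{index}", f"_:org{index}"
    city, state = f"_:city{index}", f"_:state{index}"
    return [
        (person, "rdf:type", "Person"),
        (person, "name", name),
        (person, "birthdate", birthdate),
        (person, "worksFor", org),
        (org, "rdf:type", "Organization"),
        (org, "name", org_name),
        (person, "bornIn", city),
        (city, "rdf:type", "City"),
        (city, "name", city_name),
        (city, "state", state),
        (state, "rdf:type", "State"),
        (state, "name", state_name),
    ]
```

Once every column is tied to a property, the anonymous "Column 1 ... Column 5" headers no longer matter: any consumer that knows the ontology can interpret the data.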
  42. 42. Curation Knowledge Ramps Populous http://www.rightfield.org.uk Katy Wolstencroft
  43. 43. Pathways Pharmacological Activities Biological Processes Transcripts Pathological Processes Diseases Genes Proteins Interactions Clinical Drug Applications Indications Drugs Compounds Pharmacological data for drug discovery combining public and private datasets Pre-competitive silo-breaking for competitive analytics
  44. 44. Pathways Pharmacological Activities Biological Processes Transcripts Pathological Processes Diseases Genes Proteins Interactions Clinical Drug Applications Indications Drugs Compounds “Find me compounds that inhibit targets in NFkB pathway assayed in only functional assays with a potency <1 μM” “What is the selectivity profile of known p38 inhibitors?” “Let me compare MW, logP and PSA for known oxidoreductase inhibitors” Broad data: combining public and private datasets
  45. 45. Under the hood. [Architecture figure labels: data sources (Public Content, Commercial, Public Ontologies, User Annotations) with Nanopub, Db and VoID inputs; Data Cache (Virtuoso Triple Store); Semantic Workflow Engine; Core Platform; Identity Resolution Service; Identifier Management Service; Chemistry Registration Normalisation & Q/C; Indexing; example identifiers P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Apps: ChemBio Navigator, Target Dossier, Pipeline Pilot.]
  46. 46. Strict Relaxed Analysing Browsing Dynamic Equality skos:closeMatch (Drug Name) skos:closeMatch (Drug Name) skos:exactMatch (InChI)
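The strict versus relaxed notion of equality on this slide can be sketched as a choice of which link predicates count as "the same thing". This is an illustrative Python sketch, not the platform's implementation. `LENSES`, `linked` and the `ex:` identifiers are hypothetical, and only direct links are followed because `skos:closeMatch` is not meant to be chained transitively.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[str, str, str]  # (identifier, predicate, identifier)

# Which mapping predicates each lens accepts as equality.
LENSES = {
    "strict": {"skos:exactMatch"},                      # e.g. same InChI
    "relaxed": {"skos:exactMatch", "skos:closeMatch"},  # e.g. same drug name
}


def linked(identifier: str, links: Iterable[Link], lens: str) -> Set[str]:
    """Return identifiers directly linked to `identifier` under the chosen lens.

    Links are treated as symmetric; they are not followed transitively.
    """
    allowed = LENSES[lens]
    result: Set[str] = set()
    for left, predicate, right in links:
        if predicate not in allowed:
            continue
        if left == identifier:
            result.add(right)
        elif right == identifier:
            result.add(left)
    return result
```

Switching the lens at query time, rather than merging records up front, is what lets the same data serve both careful analysis (strict) and exploratory browsing (relaxed).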
  47. 47. CS Research Software Engineering Science Engage Delivery & Support 2001- 2006
  48. 48. CS Research Software Engineering Science Engage Delivery & Support 2006 -today
  49. 49. “Startup-Like” Balance Innovation with Usefulness
  50. 50. Software Engineering Research Software Engineers. Sustainable software.
  51. 51. Zeeya Merali , Nature 467, 775-777 (2010) | doi:10.1038/467775a Computational science: ...Error…why scientific programming does not compute.
  52. 52. Training • Training infrastructure • Scalable training approaches • Review needs • Coordinate activities and materials • Liaise with Nodes and Hub
  53. 53. Data-centric Computation Scientific workflows over Distributed Cyber-Infrastructure. Data sharing Social Methods libraries and catalogues for all types of scientific artefacts and all types of scientists. Knowledge Management Metadata, semantics digital exchange, preservation, publishing Software Engineering Software sustainability, software and data policy, training Products Methods Systems Biology Chemistry Astro-Physics Astronomy Biology Social Science Library Digital Preservation Biodiversity Public Health Applications
  54. 54. Lemberger T Mol Syst Biol 2014;10:715 ©2014 by European Molecular Biology Organization Born Reproducible | Exchangeable | Reusable Rich descriptions Open & Available Transparent Method Re-executable
  55. 55. • myGrid – http://www.mygrid.org.uk • Taverna – http://www.taverna.org.uk • myExperiment – http://www.myexperiment.org • BioCatalogue – http://www.biocatalogue.org • SEEK and SysMO-SEEK – http://www.seek4science.org – http://seek.sysmo-db.org • RightField – http://www.rightfield.org.uk • BioVeL – http://www.biovel.eu • Wf4ever – http://www.wf4ever-project.org • Research Object – http://www.researchobject.org • Software Sustainability Institute – http://www.software.ac.uk

Editor's Notes

  • Mature workflow platform – since 2004
  • Bioinformaticians in the wild
    No predetermined VOs
    Exploratory investigations
    Services in the wild
    Natively and distributedly hosted
    Data and Platform agnostic
    Production level engine to handle cross cutting concerns and large data collections
    Customisation opportunities
    Experiment with Semantic Technologies
    Domain independence
    Restrictive vs open worlds
    OPEN STUFF
    Independent life science informaticians in the field
    Expert bioinformaticians but not programmers
    An open community
    Open applications
    Independent third party world-wide service providers, local and remote over the web
    In house applications, tools and datasets
    Open (and closed) worlds.
  • Open Source; Managed worlds; Wild worlds
  • Underpin integrative platforms.
    Powering service based science and science as a service
    A tool for reproducibility
    logos
    Coordinate execution of services and codes.
    Dataflow at scale
    Reusable variants
    Comparable repetitions
    Import own data / codes + public libraries/datasets
    Honour hosted codes
    Shield operational complexity
    Auto-document provenance
    Package up dependencies
  • aimed at different layers of the software stack
    “The Many Faces of IT as Service”, Foster, Tuecke, 2005
    “Provisioning” – reservation to configuration to … … make sure resource will do what I want it to do, with the right qualities of service
    Virtualization = separation of concerns between provider & consumer of “content”
    Client and service
    Service provider and resource provider
    Provisioning = assemble & configure resources to meet user needs
    Management = sustain desired qualities of service despite dynamic environment
  • It’s a framework!
    Provenance collection…
    W3C PROV+, OPM formats
    OAuth security plug-in
    Java, Grid services, R scripts, libraries (BioConductor, libSBML…)
    Taverna 2.5 was released on 17 April; since then there have been 642 Workbench and 500 command-line tool (CLT) downloads
    100,000+ downloads over its lifetime.
    Audit last year to track startups – just under 1000 unique starts in one month
  • Interaction: Visual programming, workflow reusability, workflow quality, workflow discovery
    Service oriented computing, cloud computing, grid computing, optimisation, parallelism, adaptation, security, monitoring and fault correction
    AI & Semantics: re-run analysis, auto-planning, auto-repair, auto-composition, auto-annotation, service discovery, service matching, auto-substitution
    Data integration, data mapping, service integration, provenance tracking, credit propagation, data spaces, data quality
  • Understanding genetic differences between breeds of cattle
    Ecological niche modeling of Baltic invasives
    Collection, Preparation & Production Pipelines
    Exploratory analytics
    Simulation codes
    Text mining
    Auto recommendations
    Visual analytics
  • Morphological, hemodynamic and structural analyses have been linked to aneurysm genesis, growth and rupture.
    Evidence indicating differences in morphology and flow between ruptured and unruptured aneurysms has been shown for reduced patient cohorts.
    Structural wall mechanics has been used to justify the growth and remodelling happening at the aneurysm level.
  • Collecting, processing and management of big data
    Metagenomics, genotyping, genome sequencing, phylogenetics, gene expression analysis, proteomics, metabolomics, auto sampling
    Analytics and management of broad data from many different disciplines
    Coupling analytical metagenomics with meaningful ecological interpretations
    Continuous development of novel methods and technologies
    Functional trait-based ecology approach proposed by Barberán et al. 2012.
  • Not all things are batch
    VPH-Share opens a VNC connection to a spawned instance.
    Taverna Interaction Service
    Users interact with a workflow (wherever it is running) in a web browser.
    Interaction Service Workbench Plug-in
  • The BioVeL Ecological Niche Modelling workflow running while embedded into the AntKey Scratchpads site
  • Custom resources and platforms
    Components
    Plug-in Framework
    Infrastructures: Grid, HPC, Web Services (SOAP, REST)
    Domain: CDK, BioMart, VOTable, SADI
    Common Tools: Excel Spreadsheets, Open Refine, R
  • COMPUTING POWER
    The service provides two types of computing nodes: Amazon AWS cluster computing instances
    Automatic configuration of computer clusters on AWS cloud resources
    In-house powerful computing cluster
    Hundreds of Intel Xeon (3.00 GHz, 4 cores) nodes available
    2 CPUs per node
    8 cores per node
    2 GB of RAM per core
    100 GB of local storage per node
    Providers
    Contact us to access your own computing facilities with our service
    OnlineHPC looks for partnership with supercomputer providers all over the world. Contact us for details.
  • Large number of re-usable, versioned components
    26 ENM components
    42 components in myExperiment
    A workflow in their own right
    Test by running individually
    Annotatable for semantic description of profile
    Create new workflows remixing any components – like the ENM ones we have made.
  • Research Objects, Metadata structuring
    Annotation by Stealth, Shared Templates
    Other communities
    Workflows Apps
    Workflow commodities
    Adaptability, Tiers of infrastructure
    Computational Reproducibility is hard in the wild: description / execution
  • Added after the fact
    Shims – beanshell programming in the small
    Mapping services for names
    Curated service signatures
    Data and semantic interoperability in the services, service families and service collections (that is where your types are)
    Data agnostic, Semantic layering
    Shim services
    Workflow flexibility and reusability but makes things untidy
    Next steps – Shim libraries and packaged components
    Annotation
    What do the services DO? And HOW? Expert curation
    One size does not fit all: scientists need simplish metadata for decision support; automated validation, configuration, repair needs rich metadata decision making.
    Next steps – BioCatalogue social & auto curation through myExperiment
  • Workflow Run RO Bundle: folder structure or Zip file with some JSON. Unpack into local file system, ship to myExperiment or notebook
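The "Zip file with some JSON" description can be sketched with the Python standard library. This is a simplified illustration, not the Research Object Bundle specification or Taverna's writer: the manifest here holds only an `aggregates` list, the path `.ro/manifest.json` is assumed, and `bundle_bytes` and `list_aggregated` are hypothetical helper names.

```python
import io
import json
import zipfile
from typing import Dict, List

MANIFEST_PATH = ".ro/manifest.json"  # assumed location of the JSON manifest


def bundle_bytes(files: Dict[str, bytes]) -> bytes:
    """Pack files plus a small JSON manifest listing them into an in-memory zip."""
    manifest = {"aggregates": [{"uri": "/" + name} for name in sorted(files)]}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        for name in sorted(files):
            bundle.writestr(name, files[name])
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))
    return buffer.getvalue()


def list_aggregated(data: bytes) -> List[str]:
    """Read the manifest back out of a bundle and return the aggregated resource paths."""
    with zipfile.ZipFile(io.BytesIO(data)) as bundle:
        manifest = json.loads(bundle.read(MANIFEST_PATH))
    return [entry["uri"] for entry in manifest["aggregates"]]
```

Because the result is an ordinary zip archive, it can be unpacked into a local file system or shipped to another service without special tooling, which is the portability the note points at.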
  • 1 constantly running server for workflows that aren’t security sensitive
    Multiple commandline tools
    For secure workflows, spawn own server and own command line in a bubble
    Start-up performance issues: start server, start command line, start image, start apps.
    VPH-Share plugin exposes in Taverna Online the list of tools you can instantiate on their VM
    Execution deals with requesting the start and close down of the VM. WSDL at a specific location rebinds the tool.
    BioCatalogue work by Dimitri for unbound WSDL for the tools
  • Player needs a workflow file from the portal or myExperiment or something else.
    Rails plugin for running Taverna Workflows
    Integrates into any Rails app
    Embed workflows into any web page
    Job queuing system scales runs with the number of workflows the servers can handle. Each run in parallel with its own worker.
    Input provenance: setup, input gathering, parameters and data used
    Runs: Taverna Server operations, interactions, run workflow, re-run / restart
    Results management: storing, viewing, downloading, result type rendering
    Service credential management: for secure services within workflows
    Look and results rendering fully customizable
    LifeWatch, Scratchpads, personal web page, …
    Just like embedding a YouTube video
    Gets bigger when it needs to & tells you when it's full.
    Result type rendering: Text, XML, JSON, HTML, Images, PDF, Workflow errors, Links for types that browsers cannot show inline, more…..
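The job queuing note above (runs scale with what the servers can handle, each run in parallel with its own worker) can be sketched with a bounded worker pool. This is a Python illustration only; Taverna Player itself is a Rails plugin and its queue is not shown here. `run_all`, `execute` and `capacity` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence


def run_all(workflow_runs: Sequence[Any],
            execute: Callable[[Any], Any],
            capacity: int) -> List[Any]:
    """Execute queued runs with at most `capacity` running at the same time.

    Each submitted run is handled by its own worker thread; runs beyond the
    capacity wait in the pool's queue. Results come back in submission order.
    """
    with ThreadPoolExecutor(max_workers=capacity) as pool:
        futures = [pool.submit(execute, run) for run in workflow_runs]
        return [future.result() for future in futures]
```

Setting `capacity` to the number of runs the server can handle keeps the queue from overloading it while still letting independent runs proceed side by side.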
  • Taverna server spawns commandline tool for user separation.
    The components of the architecture:
    An OSGi platform, with the Taverna Platform API
    implemented by Taverna Core 
    executes a workflow using the Taverna Engine
    uses Activity plugins for the different service types (WSDL, REST, Biomart, R scripts, command line tools, etc)
    also implemented by the Taverna Server client which uses the Java Client library to proxy running of a workflow on the Taverna Server
    The Taverna workbench to design and run workflows
    UI plugins for each service type
    executes workflows using the Taverna platform API
    The Taverna command line which executes workflows using the Taverna platform API
    A Taverna Server, which exposes the Taverna platform API as a REST API and SOAP API for executing workflows
    Taverna Player, which uses the Ruby client library to execute workflows on the Taverna Server
    Taverna Lite, which also uses the Ruby client library to execute workflows, but also manage a repository of workflows and allow user interactions.
    The OSGi framework (OSGi being an acronym for "Open Services Gateway initiative") is a module system and service platform for the Java programming language that implements a complete and dynamic component model, something that does not exist in standalone Java/VM environments. Applications or components (coming in the form of bundles for deployment) can be remotely installed, started, stopped, updated, and uninstalled without requiring a reboot; management of Java packages/classes is specified in great detail. Application life cycle management (start, stop, install, etc.) is done via APIs that allow for remote downloading of management policies. The service registry allows bundles to detect the addition of new services, or the removal of services, and adapt accordingly.
    The OSGi specifications have moved beyond the original focus of service gateways, and are now used in applications ranging from mobile phones to the open source Eclipse IDE. Other application areas include automobiles, industrial automation, building automation, PDAs, grid computing, entertainment, fleet management and application servers.
  • ENCODE threads
    exchange between tools and researchers
    bundles and relates digital resources of a scientific experiment or investigation using standard mechanisms
  • Explore, Personal….
    Recording and reporting
    Production….
    Reporting.
  • Issues: non-secure html using http inside secure https iframe in ipython doesn’t work – need to update interaction service to deliver on https.
  • Variety:
    common metadata models
    rich metadata collection
    ecosystem
    Validity:
    auto record of experiment set-up, citable and shareable descriptions
    curation, publication,
    mixed stewardship
    third party availability
    model executability
    citability, QC/QA. trust.
    Social issues of understanding the culture of risk, reward, sharing and reporting.
  • Blending SEEK and openBIS together
  • It’s a lot like a start-up
    Software Engineering
    for Science, Software sustainability, software and data policy, training
  • Why did I start as a Computer Scientist and, proudly, end up as a Software Engineer and Social Worker?
    Web Science related activity
    Making people think it's their idea
    Nearly every time I ask people they ask for today’s and not tomorrow.
  • Sample of three commercial datasets
    Information on handful of targets only
    Gemma Sattertwaite mentioned this
  • Cache copies of data
    Chemistry data normalisation/alignment through ChemSpider
    Domain specific API
    API calls populate SPARQL queries
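"API calls populate SPARQL queries" can be sketched as a query template whose slots are filled from call parameters. This is a Python illustration under stated assumptions, not the platform's actual API: the query shape, `build_query`, and the `example.org` IRIs are all hypothetical.

```python
from string import Template

# A fixed query shape; only the marked slots vary per API call.
QUERY_TEMPLATE = Template(
    "SELECT ?compound WHERE {\n"
    "  ?compound <$predicate> <$target> .\n"
    "}\n"
    "LIMIT $limit"
)

# Characters that must not appear inside an IRI placed between angle brackets.
_FORBIDDEN = set('<>"{}|^`\\ \n\t')


def _check_iri(value: str) -> str:
    """Reject values that could break out of the <...> slot in the query."""
    if not value or any(ch in _FORBIDDEN for ch in value):
        raise ValueError(f"not a safe IRI: {value!r}")
    return value


def build_query(target: str, predicate: str, limit: int = 10) -> str:
    """Fill the template from API-call parameters and return the SPARQL text."""
    return QUERY_TEMPLATE.substitute(
        predicate=_check_iri(predicate),
        target=_check_iri(target),
        limit=int(limit),
    )
```

Keeping the SPARQL behind a domain-specific API means callers pass a target and get results without writing queries themselves, and the parameter check guards the one place where caller input meets query text.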
  • It’s like a start up
    Social Software Engineering
    T shaped people
  • “As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software”
    An aside
  • Training infrastructure
    A pilot training e-support service platform
    Share training material
    Scalable training approaches
    Training the trainers, Support network
    Trainer pool, Share know-how
    Review needs
    Cooperating training sectors
    Manage and monitor outcomes
    Coordinate activities and materials
    Workshops, bootcamps, online
    Pop-up training provision
    Liaise with Nodes and Hub
    Programmes retain branding
  • The multidimensional paper
    A scientific article can be envisioned as juxtaposed layers—Title, Abstract, Synopsis, Article, Expanded View and Datasets—that provide access to the paper with increasing resolution and allow readers to zoom in or out to access the information at the required level of granularity.