Moeller bosc2010 debian_taverna


  1. Community-driven computational biology with Debian and Taverna
     Steffen Möller, Hajo Krabbenhöft (Lübeck)
     Alan Williams, Katy Wolstencroft, Carole Goble (Manchester)
     Andreas Tille, Charles Plessy, David Paleino (Debian)
     BOSC 2010, Boston
  2. Motivation
     ● Open Source Bioinformatics continues to grow and improve
     ● steadily increasing number of tools and databases
     ● addressing more and more complex issues
     ● Bioinformatics has found its way into wet-lab routine
     ● strong service units with many diverse projects
     ● single deeply embedded individuals
     ● Wanted:
     ● Exchange of bioinformatics recipes, as a database or eventually linked from papers' method sections
     ● Reliable, instantly available, powerful external resources to perform analyses
  3. Dual role of Cloud technologies
     ● Sharing of physical resources
     ● Computation
     ● Storage
     ● Sharing of management resources
     ● Reference Images
     ● Pre-downloaded, pre-indexed data
       – Amazon public data sets
       – “whatever BOSC 2010 agrees on” for our Eucalyptus playground
  4. How to Co-Maintain Cloud Images
     ● Cloud images can be maintained just like regular machines
     ● The installation of many tools by many people
     ● works, you get somewhere, but then you don't want to touch it again
     ● is error-prone because of inter-dependencies of packages (shared files, version incompatibilities)
     ● The partial update of such co-maintained images
     ● will most likely break something somewhere → modularity
     ● you want to know what has been done to an image without a dependency on external web pages → introspection
  5. How to Co-Maintain Cloud Images
     Wanted:
     ● Mechanism to allow the individual upgrading of software tools and integrity checks
     ● Sharing of the effort
       – to compile the source code: one wants to install only the binaries whenever possible
       – to describe the packages: this should add little overhead or already be available
     This is basically what Linux distributions do.
  6. Dual role of Debian
     ● Package provider
     ● many tens of thousands of packages are offered
       – directly as a Linux distribution
       – indirectly via descendants Ubuntu or BioLinux
     ● technical excellence
       – coherent builds across many platforms (PowerPC, Intel 32 and 64 bit, AMD, MIPS) and kernels (Linux, HURD, BSD, OpenSolaris)
       – separation of documentation from binaries, GUI from command line, ...
     ● Community
     ● bug reports
     ● mailing lists, special interest groups, where you may discuss
       – packages that are missing
       – problems that many of us have that are yet unsolved
  7. bioinformatics blend
     ● subversion and git repositories for packages
     ● friendly and open community
     ● keen on close links with upstream
     ● Series of tasks within Debian Med – not only bioinformatics:
         Biology - Debian Med micro-biology packages
         Biology development - Debian Med packages for development of micro-biology applications
         Content management - Debian Med content management systems
         Medical data - Debian Med suggestions for medical databases
         Dental - Debian Med packages related to dental practice
         Epidemiology - Debian Med epidemiology related packages
         Hospital information systems - Debian Med suggestions for Hospital Information Systems
         Imaging - Cross-platform for visualizing, processing and analysing of bioimages
         Imaging development - Debian Med packages for medical image development
         Laboratory - Debian Med suggestions for medical laboratories
         Pharmacy - Debian Med packages for pharmaceutical research
         Physics - Debian Med packages for medical physicists
         Practice - Debian Med packages for practice management
         Psychology - Debian Med packages for psychology
         Statistics - Debian Med statistics
         Tools - Debian Med several tools
         Typesetting - Debian Med support for typesetting and publishing
  8. How to Co-Maintain a Debian Package
     ● Technically
     ● Do not touch the original source tree
     ● Create folder “debian” with files
       – “control” - description of package + build deps
       – “changelog” - version of package and what's new
       – “rules” - how to say “make” and “make install”
       – “install” - to split documentation from the rest
       Should not be more difficult than executing “make all” directly. Contact me or the list when running into problems.
     ● FTP-upload of package to distribution's server
     ● Sharing of “debian” folder with community with subversion/git/bazaar
     ● Community-driven security
     ● Web of trust: Creator of package signs with his GPG key prior to upload, GPG key is signed by others
     ● Bug reports may block transition of package to “stable” release
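The four files named on this slide can be sketched as follows. This is a minimal illustration in Python, not the authors' tooling: the function name `write_debian_skeleton`, the package name `mytool`, the maintainer, and the `doc/*` path are all hypothetical, and a real package needs further files that the slide does not list (for example a copyright file).

```python
import tempfile
from email.utils import formatdate
from pathlib import Path


def write_debian_skeleton(source_tree, package, version, maintainer,
                          build_deps=("debhelper (>= 7)",)):
    """Write the four files named on the slide into <source_tree>/debian."""
    debian = Path(source_tree) / "debian"
    debian.mkdir(exist_ok=True)  # the upstream files themselves stay untouched

    # control: description of the package plus its build dependencies
    (debian / "control").write_text(
        f"Source: {package}\n"
        "Section: science\n"
        "Priority: optional\n"
        f"Maintainer: {maintainer}\n"
        f"Build-Depends: {', '.join(build_deps)}\n"
        "\n"
        f"Package: {package}\n"
        "Architecture: any\n"
        "Depends: ${shlibs:Depends}, ${misc:Depends}\n"
        f"Description: placeholder one-line summary of {package}\n"
        " Placeholder long description.\n",
        encoding="utf-8",
    )

    # changelog: version of the package and what is new
    (debian / "changelog").write_text(
        f"{package} ({version}-1) unstable; urgency=low\n"
        "\n"
        "  * Initial packaging.\n"
        "\n"
        f" -- {maintainer}  {formatdate(localtime=True)}\n",
        encoding="utf-8",
    )

    # rules: a makefile that delegates the build and install steps to debhelper
    rules = debian / "rules"
    rules.write_text("#!/usr/bin/make -f\n%:\n\tdh $@\n", encoding="utf-8")
    rules.chmod(0o755)

    # install: which built files go where (hypothetical documentation path)
    (debian / "install").write_text(
        f"doc/* usr/share/doc/{package}\n", encoding="utf-8"
    )
    return debian


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tree:
        folder = write_debian_skeleton(
            tree, "mytool", "1.0", "Jane Doe <jane@example.org>"
        )
        print(sorted(p.name for p in folder.iterdir()))
```

Everything lives in the separate “debian” folder, so the upstream source tree is never modified and the folder alone can be shared through version control.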
  9. Something's missing
     ● We now have the resources.
     ● packages that auto-transform into Cloud images
     ● machines and disks to compute and store in-/output
     ● We have quite a Bio* community
     ● Wanted:
     ● Linking of cloud resources with the desktop
     ● Linking of web resources into it
     ● Exchange and reference of
       – Inter-package
       – Inter-resource
       processes that (have) work(ed for someone) and may be adapted
  10. Dual role of Taverna
      ● Technology:
      ● Connects files, web services and applications to workflows
      ● Workflows may comprise other workflows
      ● Community: Portal to complete and partial solutions as workflows on
  11. Taverna integrates command line
      ● Any command executed in the shell can be integrated
      ● local execution, remote execution with ssh or grid
      ● nicely links clouds, packages and web
      ● Introduction of UseCases as workflow elements
      ● Database with XML-specification of
        – Inputs, Outputs and their MIME types
        – Command line and tools it needs
      ● Purpose-specific wrappers around binaries or scripts
      Krabbenhöft et al., Bioinformatics, 2008
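What a UseCase description has to carry (inputs and outputs with their MIME types, the command line, and the tools it needs) can be sketched as a small data structure. This is only an illustration under stated assumptions: the class and field names (`Port`, `UseCase`, `requires`, `render`) are hypothetical and do not reproduce the actual XML schema of the Taverna UseCase plugin, and `sort` merely stands in for a bioinformatics tool.

```python
import shlex
from dataclasses import dataclass, field
from string import Template


@dataclass
class Port:
    name: str        # placeholder name used in the command template
    mime_type: str   # for example "text/plain"


@dataclass
class UseCase:
    name: str
    command: str                      # template with one $placeholder per port
    inputs: list[Port]
    outputs: list[Port]
    requires: list[str] = field(default_factory=list)  # packages the command needs

    def render(self, files: dict[str, str]) -> str:
        """Fill the template with a shell-quoted file name for every port."""
        missing = [p.name for p in self.inputs + self.outputs if p.name not in files]
        if missing:
            raise KeyError(f"no file given for: {', '.join(missing)}")
        quoted = {port: shlex.quote(path) for port, path in files.items()}
        return Template(self.command).substitute(quoted)


# Hypothetical entry: a purpose-specific wrapper around an existing binary.
sort_lines = UseCase(
    name="sort_lines",
    command="sort -o $sorted $unsorted",
    inputs=[Port("unsorted", "text/plain")],
    outputs=[Port("sorted", "text/plain")],
    requires=["coreutils"],
)

if __name__ == "__main__":
    print(sort_lines.render({"unsorted": "my reads.txt", "sorted": "out.txt"}))
```

Keeping the command as a template with named ports is what lets the same description run locally, over ssh, or on a grid: only the file names bound to the ports change.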
  12. Shared UseCase management
  13. Example: Clustering many sequences
      ● Compute times of several hours are generally not acceptable for public web services
      ● Not a problem with integrated clouds
      [Diagram: steps split between “Local” and “Cloud”: image selection, start of instance, inform Taverna about IP number, apt-get install t-coffee, workflow execution, results interpretation]
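The cloud-side part of this example (install the package on the freshly started instance, then run the tool there over ssh) might look roughly like the sketch below. It only builds the command lines and does not execute anything. The helper name `remote_steps`, the login user, the binary name `t_coffee`, and the input file name are assumptions; Taverna performs the remote execution itself, so this is purely illustrative.

```python
import shlex


def remote_steps(host, package, tool_command, user="root"):
    """Return the two ssh invocations: install the package, then run the tool."""
    target = f"{user}@{host}"
    install = ["ssh", target, "apt-get install -y " + shlex.quote(package)]
    run = ["ssh", target, shlex.join(tool_command)]
    return [install, run]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address standing in for the
    # IP number that the started instance reports.
    for argv in remote_steps("192.0.2.10", "t-coffee", ["t_coffee", "sequences.fasta"]):
        print(shlex.join(argv))
```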
  14. Remaining challenge: sharing public data
      ● Could work like the management of software, but
      ● Often large with frequent updates; users differ in their demand for the latest versions
      ● Involves post-processing; users differ in their demand to perform it
      ● Clouds could help, but
      ● one would not want to pay for everything all the time
      ● the installation process would need to be transparent to locally recreate or update or … improve the data
  15. Proposal: getData, a shared Perl script
      ● The script is a large hash table
      ● extendable by configuration files that may be contributed from various packages, like EMBOSS
      ● Every entry comprises another hash table with attributes
        – Name: full name of database
        – Source: how to retrieve it
        – Post-download: what to do once it has arrived
        – Recommends: tools suggested to install with the data
      ● All very simple and extendable
      ● Direct mirroring of effort performed on the command line
      ● The community can co-maintain this script more easily than some cloud instance
      ● More on
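The layout described on this slide, a hash table of hash tables with four attributes per entry, can be sketched as below. The real getData is a Perl script; this Python version only illustrates the data layout and is not a port of it. The entry `exampledb`, its URL, and the helper names `extend` and `plan` are hypothetical.

```python
REQUIRED = ("name", "source", "post-download", "recommends")

DATABASES = {
    "exampledb": {  # hypothetical entry; URL and file names are placeholders
        "name": "Example sequence database",
        "source": "wget -N ftp://ftp.example.org/pub/exampledb.fasta.gz",
        "post-download": "gunzip -f exampledb.fasta.gz",
        "recommends": ["emboss"],
    },
}


def extend(registry, contributed):
    """Merge entries contributed by a package's configuration file."""
    for key, entry in contributed.items():
        missing = [attr for attr in REQUIRED if attr not in entry]
        if missing:
            raise ValueError(f"{key}: missing attributes {missing}")
        registry[key] = dict(entry)
    return registry


def plan(registry, key):
    """Return the shell commands to run, in order, for one database."""
    entry = registry[key]
    return [entry["source"], entry["post-download"]]


if __name__ == "__main__":
    for command in plan(DATABASES, "exampledb"):
        print(command)
```

Because each entry simply records the commands one would otherwise type by hand, contributing a new database means adding one small table rather than rebuilding a cloud image.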
  16. Summary
      ● Debian as community and repository for bioinformatics software
      ● Mailing lists, source code management
      ● FTP servers
      ● Clouds introduce dynamics into the collaboration
      ● Data flow between packages
      ● Usability
      ● Shared maintenance of public data
      ● Taverna
      ● Connects web, grid, cloud instances and local machine
      ● Fosters exchange of experiences with various workflows
  17. References and Acknowledgements
      [1] Debian-Med
      [2] getData
      [3] Eucalyptus
      [4] Taverna
      [5] Taverna UseCases
      [6] myExperiment
      [7] Eucalyptus
      The development of the UseCases plugin to Taverna was funded by the “KnowARC” EU project.
  18. Debian/Ubuntu contributes
      ● Impressive number of packages
      ● Bioinformatics (Bio*, EMBOSS, clustering, ...)
      ● Cheminformatics (autodock, gromacs, ballview, …)
      ● General scientific computing tools and libraries
        – Clustering (Torque, Sun Grid Engine, ...)
        – Eucalyptus Cloud environment
      ● Automation of database updates and indexing with the “getData” script
  19. Concept: Distro+Workflows+Cloud
      ● Debian/Ubuntu Linux Distribution
      ● Chem- + Bioinformatics packages
      ● Friendly Community
      ● Taverna Workflow Suite
      ● Access to services in the web
      ● Access to command line tools via ssh or grids
      ● Exchange of ideas via
      ● Eucalyptus or Amazon Clouds
      ● Sharing of databases and indices
      ● Readily available or customized images to instantiate
  20. The Cloud contributes
      A platform for individuals to share
      ● Data (“download only once”)
      ● Its management (“update and index only once”)
      ● Experiences (“I show you”)
      Physical resources
      ● To be shared in community (“common cluster”)
      ● To be bought on demand (“run at”)
      Solutions
      ● Readily usable images
        – by community or industry
      ● Adaptability to local demands