Cloud BioLinux: Standardized, Pre-Configured and On-Demand
            Computing for Genomics and Beyond



                    Ntino Krampis, PhD
                         GSC 2011
                        Hinxton, UK
Expensive sequencing and large organizations
                    Commodity sequencing and small labs

●   large sequencing center, multi-million, broad-impact sequencing projects
●   dedicated bioinformatics department, coordination with other centers


●   small form-factor, bench-top sequencers available: GS Junior by 454
●   sequencing as a standard technique in basic biology and genetics research
●   RNA-seq and ChIP-seq, and each biologist will be tackling a metagenome
“Bioinformatics nation is a land of city-states” Lincoln Stein

●   smaller labs building small-scale bioinformatics infrastructures
●   duplication of effort in compiling and installing software tools
●   some labs have no hardware, expertise, or time to install and run software


●   an early pioneer in this area was NEBC BioLinux ( tinyurl.com/BioLinux-NEBC )
●   desktop Linux with 100+ pre-configured bioinformatics tools
●   examples: glimmer, hmmer, phylip, rasmol, genespring, clustalw, EMBOSS


                                  what about large-scale sequence
                                  datasets?
Cloud BioLinux
standardized, pre-configured and on-demand bioinformatics computing on the cloud


                                 ●   JCVI's cloud computing expertise
                                 ●   NEBC's bioinformatics software repository
                                 ●   community effort – ISMB / BOSC 2010
                                 ●   standardized, pre-configured Virtual Machine (VM, image)

      +                          ●   VM: emulates a computer server, encapsulates operating
                                     system, software libraries and bioinformatics tools
                                 ●   Amazon EC2 computational capacity as a utility, on-demand
                                 ●   rich interface through a remote desktop client

      =

tinyurl.com/CloudBioLinux-JCVI
http://cloudbiolinux.com
Cloud BioLinux and Genomic Standards
      framework to distribute bioinformatics tools, data and analysis results


    create cloud VM / images with standardized software configurations
● customize Cloud BioLinux VMs, based on community requirements
● share customized VMs with collaborators, avoiding effort duplication

● mix and match software from NEBC or other sources (Debian Med, Scientific Linux, etc.)




    whole system snapshot exchange (Dudley and Butte 2010)
● capture the state of the computing system and data
● software execution parameters and “massaged” input datasets

● save into cloud VM / image and share along with analysis results




    democratize access to computing resources
● large-scale computing independently of institutional or geographic boundaries
● only need a desktop computer with internet access
Cloud BioLinux and Genomic Standards
        create cloud VM / images with standard software configurations

●   framework to describe software components in cloud VM / image
●   based on python-fabric automated deployment tool
●   software components listed in simple text files
●   edit the files to mix and match software according to your community needs
●   community members use files to share descriptions of customized systems
●   start with a bare-bones VM, fabric downloads and installs specified software
●   labs with sensitive data and capacity for private clouds: works identically on
    Amazon EC2 or the Eucalyptus open-source cloud
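
The idea of listing software components in simple text files can be sketched in a few lines of Python. This is a minimal sketch, not the Cloud BioLinux fabric scripts themselves: the file format (one package per line, `#` for comments), the function names, and the use of `apt-get` on a Debian/Ubuntu VM are all assumptions made for illustration.

```python
def parse_package_list(text):
    """Return package names from a plain-text list, one per line.

    Assumed format: blank lines are ignored and '#' starts a comment.
    """
    packages = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            packages.append(name)
    return packages


def build_install_command(packages):
    """Build a single install command for a bare-bones VM (apt-get is assumed)."""
    if not packages:
        return None
    return "sudo apt-get install -y " + " ".join(packages)


# Hypothetical package list, as a lab might edit and share it.
example_list = """
# sequence analysis
hmmer
clustalw   # multiple alignment
emboss
"""

packages = parse_package_list(example_list)
command = build_install_command(packages)
print(command)
```

Keeping the list in plain text means a collaborator can review or change the software selection with any editor, and the same list can be replayed against a fresh VM.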




tinyurl.com/python-fabric       open.eucalyptus.com
software domains in bioinformatics: nextgen
sequencing, de novo assembly, annotation, phylogeny,
    molecular structures, gene expression analysis

 high-level configuration describing software groups
    for each group individual bioinformatics tools
         tinyurl.com/CloudBioLinux-github
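
The two-level layout on this slide (a high-level list of software groups, then the individual tools in each group) could be modeled as below. This is only a sketch: the group names, the assignment of tools to groups, and the `resolve_tools` helper are hypothetical and do not reproduce the actual configuration files in the repository.

```python
# Hypothetical high-level configuration: which software groups to install.
ENABLED_GROUPS = ["phylogeny", "annotation"]

# Hypothetical per-group configuration: the tools that belong to each group.
TOOLS_BY_GROUP = {
    "annotation": ["glimmer", "hmmer"],
    "phylogeny": ["phylip", "clustalw"],
    "molecular_structures": ["rasmol"],
}


def resolve_tools(enabled_groups, tools_by_group):
    """Expand the enabled groups into an ordered tool list without duplicates."""
    tools = []
    for group in enabled_groups:
        if group not in tools_by_group:
            raise KeyError(f"unknown software group: {group}")
        for tool in tools_by_group[group]:
            if tool not in tools:
                tools.append(tool)
    return tools


selected = resolve_tools(ENABLED_GROUPS, TOOLS_BY_GROUP)
print(selected)
```

Enabling or disabling a whole domain then becomes a one-line change in the high-level list, while the per-group lists stay untouched.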
Cloud BioLinux and Genomic Standards
          whole system snapshot exchange


                                                 simply sign up at

                                                aws.amazon.com
                                                      then
                                             aws.amazon.com/console
                                                      and




http://tinyurl.com/cloud-biolinux-tutorial
Cloud BioLinux and Genomic Standards
       whole system snapshot exchange



                                              find Cloud BioLinux
                                                   using ID

                                                  enter desired
                                              password for remote
                                                 desktop login

                                                leave all other
                                               settings at default
 http://tinyurl.com/cloud-biolinux-tutorial
free remote desktop client:
nomachine.com/download.php

  simply enter VM IP address
     and your password
What if I want to
    share my
alignments with
a collaborator?

save your data as
   a new VM

  $0.10 / GB /
     month

at 15 GB, it costs
  $1.50 / month
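
The storage arithmetic on this slide can be checked with a short calculation. The sketch below uses only the rate quoted on the slide (0.10 USD per GB per month); the function name is hypothetical, and current cloud prices may differ.

```python
from decimal import Decimal

# Rate quoted on the slide, in USD per GB per month.
PRICE_PER_GB_MONTH = Decimal("0.10")


def monthly_storage_cost(size_gb, price_per_gb=PRICE_PER_GB_MONTH):
    """Return the monthly cost in USD of storing a VM image of size_gb gigabytes."""
    return Decimal(size_gb) * price_per_gb


cost = monthly_storage_cost(15)
print(f"${cost:.2f} / month")
```

`Decimal` is used instead of floating point so that currency amounts are not affected by binary rounding.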
Cloud BioLinux and Genomic Standards
                                whole system snapshot exchange
share your analysis results: publicly or only with your
                     collaborators

authorized users can access the cloud VM/image with
       all the software, data, analysis results
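
The difference between sharing publicly and sharing only with collaborators can be modeled as below. This is a conceptual sketch in plain Python and does not call any cloud provider's API; the `SharedImage` class, its fields, and the account names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SharedImage:
    """Toy model of a VM image and the accounts allowed to launch it."""
    image_id: str
    owner: str
    public: bool = False
    allowed_accounts: set = field(default_factory=set)

    def share_with(self, account):
        """Grant launch access to a single collaborator account."""
        self.allowed_accounts.add(account)

    def can_launch(self, account):
        """An image is launchable if it is public, owned, or explicitly shared."""
        return (
            self.public
            or account == self.owner
            or account in self.allowed_accounts
        )


image = SharedImage(image_id="image-example", owner="researcher-a")
image.share_with("researcher-b")
print(image.can_launch("researcher-b"), image.can_launch("researcher-c"))
```

Because the image carries the software, data, and results together, granting launch access is the only step a collaborator needs before reproducing the analysis.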
Cloud BioLinux and Genomic Standards
                        whole system snapshot exchange



                start VM / image           share


                perform analysis           snapshot           researcher B
researcher A

                snapshot                   perform analysis


                share                      start VM / image
Cloud BioLinux
                                  The future


●   expand community, receive feedback, add more software to the VM

●   analysis pipelines that are used by large sequencing centers

●   actively seeking funding to put major effort in development

●   2011 ISMB/BOSC in Vienna, Austria, http://metalab.at/

●   tinyurl.com/cloudbiolinux-lists or community@cloudbiolinux.com
Acknowledgments & Credits
Brad Chapman      - development of the fabric scripts and community organizer
Tim Booth, Bela Tiwari, Dawn Field – BioLinux 6.0 development and EC2 documentation
Deepak Singh and AWS - education grant supporting ISMB / BOSC workshop
Justin Johnson    –   community and sponsorship of cloudbiolinux.com
J. Craig Venter Inst. - time allowed to work on an open-source project
D. Gomez, E. Navarro, J. Shao, I. Singh – JCVI technology innovation

Members of the Cloud BioLinux community:
Enis Afgan
Michael Heuer
Richard Holland
Mark Jensen
Dave Messina
Steffen Möller
Roman Valls

Thank you!
