Cloud BioLinux: Pre-Configured and On-Demand
High Performance Computing for the Genomics Community



                  Ntino Krampis, PhD
           Next-Gen Sequence Data Management '10
                      Providence, RI
Expensive sequencing, computing and large organizations

●   multi-million, broad-impact sequencing projects

●   large sequencing center, with a dedicated bioinformatics department

●   large-scale computations on SGE cluster, algorithm acceleration hardware
Bench-top, commodity sequencing and small labs


●   small form-factor sequencer available: GS Junior by 454

●   sequencing as a standard technique in basic biology and genetics research

●   remember microarrays and lengthy assays for protein interactions ?

●   RNA-seq and ChIP-seq, and each biologist will be tackling a metagenome
Will small labs become the long tail of sequencing ?



[Figure: long-tail curve. y-axis: amount of sequencing; x-axis: number of labs. Credit: Wikimedia Commons]


●   downstream bioinformatic analysis required for biological discovery

●   basic analysis example: large-scale BLAST to public DBs (try 0.5GB at NCBI)

●   do not have the hardware, expertise, or time to install and run software locally
Cloud Biolinux
              pre-configured and on-demand bioinformatics on the cloud


●   a public virtual machine (VM) on EC2 with 100+ bioinformatics tools

●   how it came to be, what it offers for sequence analysis

●   where and how do I run it, especially if I am not a computer expert

●   modifying and sharing VM configurations and data with your peers

●   openness and community around Cloud Biolinux
Cloud Biolinux
                                        The Biolinux part



●   an Ubuntu Linux desktop for bioinformatics

●   NEBC packages the software and maintains the repository

●   Ubuntu AMI on EC2, pulls packages from the repository

●   additional software of interest to JCVI

tinyurl.com/BioLinux-NEBC  +  Ubuntu AMI on EC2  =  tinyurl.com/CloudBioLinux-JCVI
Cloud Biolinux
                          what comes in the box

●   glimmer, hmmer, phylip, rasmol, genespring, clustalw, EMBOSS

●   mpiBLAST clusters using EC2 virtual machine instances

●   Celera whole genome shotgun assembler

●   NX remote desktop, easy to use for benchtop scientists
Cloud Biolinux
                                 The Cloud part

●   find our VM on Amazon EC2:

          Biolinux 5.0 packages (32-bit): ami-6953b200
          Biolinux 6.0 packages (64-bit): ami-6011e409 , EBS based

●   17 GB RAM / 6 core instances at $0.50 / hour, see aws.amazon.com/ec2/pricing

●   a small bacterial genome assembly costs a little over $2

●   up to 68 GB RAM / 26 core, EBS up to 1000 GB in size ($0.10 / GB / month)

●   make a copy of our public Biolinux AMI, add your data, make it private
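
The pricing above can be turned into a quick estimate. This is a minimal sketch, not an official calculator: the $0.50 / hour rate is the one quoted on this slide, the function name is hypothetical, the 4.5-hour run length is an invented example (the slide does not say how long the assembly took), and billing every started hour as a full hour is an assumption.

```python
import math
from decimal import Decimal

# Rate quoted on the slide for the 17 GB instance (USD per instance-hour).
HOURLY_RATE = Decimal("0.50")


def compute_cost(hours, rate=HOURLY_RATE):
    """Estimate the cost of one instance run.

    Assumption: every started hour is billed as a full hour.
    """
    billed_hours = math.ceil(hours)
    return rate * billed_hours


# Hypothetical 4.5-hour assembly run: 5 billed hours at $0.50.
print(compute_cost(4.5))  # 2.50
```

Under these assumptions, a run of a few hours lands in the range the slide quotes for a small bacterial genome assembly.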
Cloud Biolinux
http://tinyurl.com/cloud-biolinux-tutorial (credit to the NEBC team)




simply sign up at aws.amazon.com, then go to aws.amazon.com/console and
Cloud Biolinux
http://tinyurl.com/cloud-biolinux-tutorial (credit to the NEBC team)




●   find the Cloud Biolinux AMI using its ID

●   enter desired password for remote desktop login

●   leave all other settings at their defaults
●   get remote desktop client: nomachine.com/download.php

●   simply enter VM's IP address and your password
What if I want to share my alignments with a collaborator?

●   save your data as a new AMI

●   EBS costs $0.10 / GB / month

●   at 15 GB, it costs $1.50 / month
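
The storage arithmetic can be checked in a few lines. This is a sketch under the rate quoted on the slide ($0.10 per GB per month); the function name and the `months` parameter are hypothetical, and `Decimal` is used only to keep the cents exact.

```python
from decimal import Decimal

# EBS rate quoted on the slide (USD per GB per month).
EBS_RATE = Decimal("0.10")


def storage_cost(size_gb, months=1, rate=EBS_RATE):
    """Cost of keeping an EBS-backed AMI of size_gb for a number of months."""
    return rate * size_gb * months


print(storage_cost(15))             # 1.50, the slide's 15 GB example
print(storage_cost(15, months=12))  # 18.00, the same AMI kept for a year
```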
share your data: public or with another AWS user

●   users with access can boot the AMI with all the software + data
Cloud Biolinux
                                  The Cloud part

●   run Cloud Biolinux on your private cloud ?

●   Eucalyptus open source cloud platform

●   same API as EC2, without the usage charges

●   easy to set up on your lab's cluster, comes with Ubuntu server (UEC)

●   download VMs from Sourceforge ( tinyurl.com/CloudBiolinux-SF )



                          open.eucalyptus.com
Cloud Biolinux

●   porting VMs across cloud platforms is not trivial

●   moving Cloud Biolinux VMs from EC2 to Eucalyptus: Xen kernel and boot sector must be adapted

●   framework to share VM configurations ( tinyurl.com/bootstrap-cloudbiolinux )

●   based on python-fabric automated deployment tool

●   simply edit the software list files and share with collaborators

●   they start with a fresh VM, python-fabric replicates the VM setup on their cloud



                                tinyurl.com/python-fabric
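
The idea of replaying a shared software list on a fresh VM can be sketched without any deployment tool. This is an illustration only, not the project's actual fabric scripts: the list-file format (one package per line, `#` for comments), the function names, and the package names are assumptions. A tool such as python-fabric would then run the resulting command on the remote VM.

```python
import shlex


def parse_software_list(text):
    """Return package names from a list file.

    Assumed format: one package per line, '#' starts a comment.
    """
    packages = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            packages.append(name)
    return packages


def install_command(packages):
    """Build one apt-get command for an Ubuntu VM; nothing is executed here."""
    return shlex.join(["apt-get", "install", "-y", *packages])


# Hypothetical list file shared between collaborators.
EXAMPLE_LIST = """\
# sequence analysis (package names are illustrative)
hmmer
clustalw   # multiple alignment
emboss
"""

print(install_command(parse_software_list(EXAMPLE_LIST)))
# apt-get install -y hmmer clustalw emboss
```

Because the list file is plain text, a collaborator can edit it and rerun the same steps to obtain an equivalent VM on a different cloud.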
Cloud Biolinux
       Collaboration and open source

high-level configuration describing software groups

   for each group individual software packages

simply edit the files to change the VM configuration

        tinyurl.com/CloudBioLinux-github
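
The two-level layout described here (software groups, then packages per group) might look like the following. This is a sketch with plain Python dictionaries: the group names, the group-to-package assignments, and the function name are hypothetical and do not reproduce the files in the repository.

```python
# High-level configuration: which software groups this VM should carry.
SELECTED_GROUPS = ["alignment", "phylogeny"]

# Per-group package lists (hypothetical grouping of tools named in the talk).
PACKAGES_BY_GROUP = {
    "alignment": ["clustalw", "hmmer"],
    "phylogeny": ["phylip"],
    "visualization": ["rasmol"],
}


def resolve_packages(selected_groups, packages_by_group):
    """Expand the selected groups into an ordered, duplicate-free package list."""
    resolved = []
    for group in selected_groups:
        if group not in packages_by_group:
            raise KeyError(f"unknown software group: {group}")
        for package in packages_by_group[group]:
            if package not in resolved:
                resolved.append(package)
    return resolved


print(resolve_packages(SELECTED_GROUPS, PACKAGES_BY_GROUP))
# ['clustalw', 'hmmer', 'phylip']
```

Changing the VM then amounts to editing the group selection or a group's package list, which is the workflow the slide describes.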




Cloud Biolinux
                                 The community

●   from JCVI and NEBC to an open-source, community-based project

●   community initiated during tele-conference meeting at SC '10, Portland, OR

●   first meeting this past July in Boston, tinyurl.com/openbio-codefest-2010

●   work done: 64-bit AMIs, NX remote desktop, set up the fabric framework

●   next year's meeting at ISMB/BOSC in Vienna, Austria http://metalab.at/

●   cloudbiolinux.com and most important, tinyurl.com/cloudbiolinux-lists
Cloud Biolinux
                                  The future

●   expand community, receive feedback, add more software to the VM

●   genome assemblers, high-memory EC2 instances up to 68GB RAM

●   Hadoop / MapReduce (for those running the VM in private clouds)

●   analysis pipelines that are used by large sequencing centers

●   actively seeking funding to put major effort in development

●   tinyurl.com/cloudbiolinux-lists or community@cloudbiolinux.com
Acknowledgments & Credits
Brad Chapman - development of the fabric scripts and community organizer
Tim Booth, Bela Tiwari - BioLinux 6.0 development and EC2 documentation
Deepak Singh and AWS - education grant supporting codefest workshop
Justin Johnson - community and sponsorship of cloudbiolinux.com
J. Craig Venter Inst. - time allowed to work on an open-source project
D. Gomez, E. Navarro, J. Shao, I. Singh - JCVI technology innovation

Members of the Cloud Biolinux community:
Enis Afgan
Michael Heuer
Richard Holland
Mark Jensen
Dave Messina
Steffen Möller
Roman Valls

Thank you!
