Cloud BioLinux: open source, fully customizable bioinformatics computing on the cloud for the genomics community and beyond
BOSC 2011 - Vienna, Austria
Ntino Krampis, PhD
Asst. Professor, J. Craig Venter Institute (JCVI)
firstname.lastname@example.org

Expensive sequencing and large organizations
● large sequencing centers, multi-million, broad-impact sequencing projects
● dedicated bioinformatics department, large Sun Grid Engine cluster
Commodity sequencing and small labs
● small form-factor, bench-top sequencer available: GS Junior by 454
● sequencing as a standard technique in basic biology and genetics research
● RNA-seq and ChIP-seq, and each biologist will be tackling a metagenome

Will small labs become the long tail of sequencing?
[Figure: long-tail plot; axis labels: amount of sequencing, number of labs. Credit: Wikimedia Commons]

“Bioinformatics nation is a land of city-states” (Lincoln Stein)
● small labs building small-scale bioinformatics infrastructures
● duplication of effort in compiling and installing software tools
● some labs have no hardware, expertise, or time to install and run software
● NEBC BioLinux ( tinyurl.com/BioLinux-NEBC ): 100+ pre-configured tools
● examples: glimmer, hmmer, phylip, rasmol, genespring, clustalw, EMBOSS
How about large-scale sequence datasets?

Cloud BioLinux: pre-configured and on-demand bioinformatics computing on the cloud
● JCVI cloud computing research
+
● NEBC bioinformatics software repository
+
● community effort - Hackathon / BOSC 2010-11
=
● pre-configured Virtual Machine (VM, image)
● large-scale computing independently of institutional or geographic boundaries
● only need a desktop computer with internet access
cloudbiolinux.org

Cloud BioLinux: simple for end-users
Sign up at aws.amazon.com, then go to aws.amazon.com/console and follow http://tinyurl.com/cloud-biolinux-tutorial

Amazon EC2 → Linux desktop via remote desktop client

What if I want to share my alignments with a collaborator?
● save your data as a new VM
● storage costs $0.10 / GB / month
● at 15 GB, it costs $1.50 / month

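The storage arithmetic on this slide can be checked with a minimal sketch. The rate is the $0.10 / GB / month quoted above; the function name and the use of `Decimal` are illustrative choices, not part of Cloud BioLinux.

```python
from decimal import Decimal

# Rate quoted on the slide (USD per GB per month); actual pricing may differ.
RATE_PER_GB_MONTH = Decimal("0.10")


def monthly_storage_cost(size_gb, rate=RATE_PER_GB_MONTH):
    """Return the monthly cost in USD of storing a VM image of size_gb gigabytes."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    # Decimal avoids binary floating-point rounding in currency arithmetic.
    return Decimal(size_gb) * rate


if __name__ == "__main__":
    print(f"15 GB image: ${monthly_storage_cost(15):.2f} / month")
```

Doubling the image size doubles the bill, since cost scales linearly with stored gigabytes at a flat rate.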
“whole system snapshot exchange” (Dudley and Butte 2010)
● capture the state of the computing system and data
● software execution parameters and “massaged” input datasets

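One possible way to script such a snapshot exchange on Amazon EC2 is sketched below. It assumes a boto3-style EC2 client is passed in; the talk does not name an API client, so this is an illustration and not the method used by Cloud BioLinux. The function name and all identifiers in the usage note are hypothetical.

```python
def snapshot_and_share(ec2_client, instance_id, image_name, collaborator_account_id):
    """Save a running instance as a new machine image and share it.

    ec2_client is assumed to behave like boto3.client("ec2").
    Returns the ID of the new image.
    """
    # Capture the whole system (software, parameters, data) as a new image.
    response = ec2_client.create_image(InstanceId=instance_id, Name=image_name)
    image_id = response["ImageId"]

    # Block until the image has finished being created.
    ec2_client.get_waiter("image_available").wait(ImageIds=[image_id])

    # Grant launch permission to the collaborator's AWS account.
    ec2_client.modify_image_attribute(
        ImageId=image_id,
        LaunchPermission={"Add": [{"UserId": collaborator_account_id}]},
    )
    return image_id
```

With boto3 installed and credentials configured, a call might look like `snapshot_and_share(boto3.client("ec2"), "i-0123456789abcdef0", "my-alignments", "111122223333")`, where the instance ID and account number are placeholders. The collaborator can then launch the shared image from their own account.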
Cloud BioLinux developers framework
Create cloud VM / images with standardized software configurations
● customize Cloud BioLinux based on community requirements
● mix and match software from NEBC or other sources (DebianMed, Scientific Linux, etc.)
● share customized VMs with collaborators, avoiding effort duplication
● deploy Cloud BioLinux on private and local clouds

Cloud BioLinux developers framework
● based on the python-fabric auto-deployment tool
● software components listed in plain text files
● collaborators use files to share descriptions of cloud VM / images
● start with a bare-bones VM / image
● fabric downloads and installs specified software
tinyurl.com/python-fabric
open.eucalyptus.com

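The idea of driving installation from a plain text list can be sketched as follows. This is a simplified illustration under stated assumptions: the file format (one package name per line, `#` for comments) and both function names are hypothetical, and the actual Cloud BioLinux fabric scripts may use a different format. The sketch only builds the shell command; a deployment tool such as Fabric would then run it on the bare-bones VM.

```python
import shlex


def parse_package_list(text):
    """Return package names from a plain-text list (one per line, '#' comments)."""
    packages = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if name and name not in packages:     # skip blank lines and duplicates
            packages.append(name)
    return packages


def build_install_command(packages):
    """Build the apt-get command a deployment tool could run on the remote VM."""
    if not packages:
        raise ValueError("no packages to install")
    quoted = " ".join(shlex.quote(p) for p in packages)
    return f"sudo apt-get install -y {quoted}"


# Hypothetical package list that a collaborator might share.
EXAMPLE_LIST = """\
# phylogeny
phylip
# alignment
clustalw
hmmer   # profile HMM search
"""

if __name__ == "__main__":
    print(build_install_command(parse_package_list(EXAMPLE_LIST)))
```

Because the list is plain text, collaborators can exchange and version it like any other file, and the same list rebuilds the same software configuration on a fresh image.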
Software domains in bioinformatics: next-gen sequencing, de novo assembly, annotation, phylogeny, molecular structures, gene expression analysis
github.com/chapmanb/cloudbiolinux

Cloud BioLinux: the future
● expand community, receive feedback, add more software to the VM
● groups.google.com/cloudbiolinux and cloudbiolinux.org
● add data analysis pipelines that are used by sequencing centers
● actively seeking funding to put major effort into development
● 2011 ISMB/BOSC in Vienna, Austria, http://metalab.at/

Acknowledgments & Credits
Brad Chapman - development of the fabric scripts and community organizer
Tim Booth, Mesude Bicak, Dawn Field, Bela Tiwari - BioLinux 6.0
J. Craig Venter Inst. - time allowed to work on an open-source project
D. Gomez, E. Navarro, J. Shao, I. Singh - JCVI technology innovation
Deepak Singh and AWS - education grant supporting ISMB / BOSC workshop
Members of the Cloud BioLinux community - precious development time
Thank you!