Ntino Cloud BioLinux Barcelona Spain 2012

Transcript

  • 1. Cloud BioLinux: Pre-configured Bioinformatics Computing for the Genomics Community • Ntino Krampis, Asst. Professor - Informatics, J. Craig Venter Institute • kkrampis@jcvi.org • http://www.jcvi.org/cms/about/bios/kkrampis/ • Tuesday, November 6, 12
  • 2. J. Craig Venter Institute (JCVI) • Human Microbiome Project (Nelson et al. Science 2010; 328: 994–99) • NIH funded, launched in 2008, $115 million • metagenomic sequencing of microbial genomes from the human body • sequence everything in sample, use informatics to separate genomes
  • 3. J. Craig Venter Institute • Global Ocean Survey (first publication, Venter et al. Science 2004; 304: 66-74) • metagenomic sequencing of microbes from oceans around the world • Darwin’s route? • Numbers: HMP > 2 mil. new proteins, GOS > 1.2
  • 4. Big Data and sequencing • JCVI sequencing facility: 454, Solexa, HiSeq, and IonTorrent on the way • Processed data: size information content • But... look at SOLiD 3 • Source: http://www.politigenomics.com/next-generation-sequencing-informatics
  • 5. JCVI: sequencing and computing infrastructure • “big” sequencing needs large-scale informatics • ~1000 node Grid Engine cluster • research with Hadoop / MapReduce, and a small private cloud • 50+ bioinformaticians and software developers
  • 6. A new paradigm: Low-cost, bench-top sequencers • GS Junior - 454, MiSeq - Illumina • complete sequencing of bacterial, viral, fungal genomes • RNAseq (gene expression), ChIPseq (protein interactions), gene variant discovery • sequencing as a standard technique in basic genetics research - like PCR?
  • 7. Will smaller academic labs become the long tail of sequencing? • [Figure: amount of sequencing plotted against number of labs. Head of the curve: “sequencing factories” (JCVI, Broad Inst., Washington Univ., Inst. of Genome Sciences). Tail: small academic labs with bench-top sequencers.]
  • 8. Sequencers shipped without clusters • Problem A: sequence analysis requires computational capacity • genome assembly, BLAST, gene finders - annotation • Problem B: bioinformatics tools need software engineering expertise • unix/linux operating systems, maintaining software libraries, compiling source code
  • 9. Each lab builds a cluster? • need additional funds to buy the hardware • funds for personnel to maintain the cluster and software • duplication of effort across labs • sub-optimal utilization of the hardware
  • 10. Centralized bioinformatics services • Bioinformatic Resource Centers, e.g. GSCID • bioinformatic services usually coupled with sequencing of a genome • provide mostly data access to external PIs • cannot support every lab with a sequencer
  • 11. Problem A: sequence analysis requires computational capacity • Amazon Elastic Compute Cloud (EC2), pay-by-the-hour computing • cloud servers cost $0.085 - $2 per hour • max capacity 64GB RAM / 8 CPU (can boot hundreds of servers) • World-wide data centers • 750 hours free for new users: aws.amazon.com/free/ • free compute for teaching: aws.amazon.com/grants/
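The hourly prices on slide 11 lend themselves to a quick back-of-the-envelope estimate. The sketch below is illustrative only: it uses the 2012 price range and the 750 free hours quoted on the slide, the 48-hour workload is hypothetical, and treating the free hours as a simple deduction from billable hours is an assumption, not a statement of Amazon's billing rules.

```python
# Figures quoted on the slide (2012 prices); everything else is hypothetical.
LOW_RATE_USD = 0.085    # cheapest server, per hour
HIGH_RATE_USD = 2.00    # largest server, per hour
FREE_TIER_HOURS = 750   # free hours for new users


def ec2_cost(hours, rate_usd, servers=1, free_hours=0):
    """Estimate the charge for `servers` machines that each run for `hours`.

    Assumption: free hours are simply subtracted from the billable total.
    """
    billable_hours = max(hours * servers - free_hours, 0)
    return round(billable_hours * rate_usd, 2)


# Hypothetical workload: 48 hours on one large server,
# versus 48 hours on ten small servers.
print(ec2_cost(48, HIGH_RATE_USD))
print(ec2_cost(48, LOW_RATE_USD, servers=10))
# A small server used for 100 hours within the free allowance.
print(ec2_cost(100, LOW_RATE_USD, free_hours=FREE_TIER_HOURS))
```

Because the charge scales with hours actually used, the same function can be used to compare a short burst on many servers against a long run on one.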
  • 12. Cloud Computing and Virtualization • OS, software and data, pre-installed in Virtual Machine (VM) • cloud provider: hardware and virtualization layer • VM is a full-featured server in a single file • VM transfer on private cloud • Credit: VMware Inc.
  • 13. Problem B: bioinformatics tools need software engineering expertise • VM with pre-installed software on the cloud • avoid compiling source code, or other software dependencies • rent computational capacity, on a pay as you go basis • run the VM on the closest Amazon data center
  • 14. Solving Problems A & B: Cloud BioLinux • Cloud BioLinux: publicly accessible VM on EC2 • 100+ pre-installed bioinformatics tools • remote desktop for non-command-line experts • you can create a cluster with Cloud BioLinux - CloudMan • Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics. 2012 Mar 19; 13: 42.
  • 15. Accessing Cloud BioLinux • http://aws.amazon.com/console
  • 16. Launch through the EC2 cloud console
  • 17. Amazon EC2 VM launch wizard • cloudbiolinux.org
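Slides 15 to 17 launch the VM by clicking through the web console. The same launch can also be scripted; the following is a minimal sketch, not the method shown in the talk. It assumes the boto3 library (which postdates this 2012 deck). The AMI ID, the key pair name, the `m1.large` instance type and the region in the comments are placeholders, and `build_launch_request` is a hypothetical helper. The code only builds the request, so it runs without AWS credentials.

```python
def build_launch_request(ami_id, instance_type="m1.large", key_name=None, count=1):
    """Return keyword arguments for boto3's EC2 client method run_instances().

    ami_id identifies the VM image to boot; the values used below are
    placeholders, not real Cloud BioLinux identifiers.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    request = {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
    }
    if key_name:
        request["KeyName"] = key_name
    return request


# Placeholder AMI ID and key pair name (assumptions for illustration).
request = build_launch_request("ami-00000000", key_name="my-keypair")
print(request)

# To actually launch (needs the boto3 package and AWS credentials, and
# incurs charges), something along these lines should work:
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   response = ec2.run_instances(**request)
#   print(response["Instances"][0]["InstanceId"])
```

Keeping the request as plain data makes it easy to review the image and instance size before anything is billed.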
  • 18.
  • 19. Cloud BioLinux desktop remote connection • tinyurl.com/bootcloud1 • tinyurl.com/bootcloud2
  • 20. Cloud BioLinux desktop
  • 21. Cloud BioLinux desktop
  • 22. Data exchange on the cloud • VM snapshots
  • 23. Cloud computing research at JCVI • open-source cloud platforms, fully compatible with Amazon EC2 • active funding, NIAID viral genomics pipeline on cloud • end-to-end, sequence to assembly, annotation, visualization via Galaxy • run on Amazon, private cloud, or desktop
  • 24. Scriptable Cloud Infrastructures • Fabric framework • Cloud BioLinux VM configuration in plain text • high-level configuration, software groups • each group lists individual bioinformatics tools
  • 25. Scriptable Cloud Infrastructures • Python Fabric leverages Linux packages (APTitude repositories) • mix and match software from repositories • share VM configuration as source code • clone across clouds • Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics. 2012 Mar 19; 13: 42.
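Slides 24 and 25 describe the VM configuration as plain text: high-level software groups, each naming individual tools that are installed from Linux package repositories. The sketch below illustrates that idea only. The one-line-per-group file format, the group names and the helper functions are hypothetical and are not the project's actual configuration files. The package names are examples of packages found in Debian/Ubuntu repositories.

```python
# Hypothetical plain-text configuration: "group: package package ...".
# Format, group names and package choices are assumptions for illustration.
CONFIG_TEXT = """
# software groups
alignment: ncbi-blast+ bowtie
assembly: velvet
variants: samtools
"""


def parse_groups(text):
    """Parse the configuration text into {group name: [package, ...]}."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, packages = line.partition(":")
        groups[name.strip()] = packages.split()
    return groups


def install_commands(groups, selected):
    """Build one apt-get command per selected group that has packages."""
    return [
        "apt-get install -y " + " ".join(groups[name])
        for name in selected
        if groups.get(name)
    ]


groups = parse_groups(CONFIG_TEXT)
for command in install_commands(groups, ["alignment", "assembly"]):
    print(command)
```

In a Fabric-based setup, each generated command string would be handed to Fabric for execution with root privileges on the remote VM; that step is left out here so the sketch runs locally. Mixing and matching software then amounts to changing the list of selected groups.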
  • 26. Scalable Data Analysis • Cloud BioLinux + CloudMan • dual role: Master / Worker • Cloud BioLinux VM has CloudMan scripts that start more copies of itself • Grid Engine (SGE) cluster • http://usecloudman.org/ • Afgan, E., Chapman, B. et al. (2012). Using Cloud Computing Infrastructure with CloudBioLinux, CloudMan, and Galaxy. Current Protocols in Bioinformatics, 11-9.
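Once the master and worker VMs form a Grid Engine cluster, work is usually spread across the workers by submitting jobs to the scheduler. As a rough sketch under stated assumptions, the function below builds a `qsub` command for an array job. The script name `run_blast.sh` and the job name are hypothetical. The command is only printed, not executed, so the example also runs on a machine without Grid Engine.

```python
import shlex


def qsub_array_command(script, n_tasks, job_name="blast_chunks"):
    """Build a Grid Engine qsub command for an array job of n_tasks tasks.

    -N sets the job name, -cwd runs in the current directory, and
    -t 1-N asks for N tasks; each task can read its index from the
    SGE_TASK_ID environment variable to pick its share of the input.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    return ["qsub", "-N", job_name, "-cwd", "-t", f"1-{n_tasks}", script]


# Hypothetical example: 20 input chunks, one task per chunk.
command = qsub_array_command("run_blast.sh", 20)
print(" ".join(shlex.quote(part) for part in command))
```

On the master node the list could be passed to `subprocess.run` to submit the job; adding workers then shortens the wall-clock time without changing the script.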
  • 27. Goodies with Cloud BioLinux
  • 28. Goodies with Cloud BioLinux
  • 29. From sequencer to the cloud • Credit: basespace.illumina.com
  • 30. Acknowledgments • Cloud BioLinux community (cloudbiolinux.org, groups.google.com/group/cloudbiolinux): Brad Chapman, Enis Afgan, Tim Booth, Mesude Bicak, Dawn Field • JCVI collaborators: Alex Richter, Ravi Sanka, Andrey Tovichgrechko, Johannes Goll, Karen Nelson, Bill Nierman, JCVI IT support • NIAID for funding: Maria Giovani, Punam Mathur • tinyurl.com/cloudboot1 • tinyurl.com/cloudboot2 • kkrampis@jcvi.org • slideshare.com/agbiotec • Thank you!