Bionimbus - Northwestern CGI Workshop 4-21-2011


This is a talk I gave at a Northwestern University - Complete Genomics Workshop on April 21, 2011 about using clouds to support research in genomics and related areas.

Published in: Technology

  1. 1. Bionimbus: A Cloud-Based Infrastructure for Managing, Analyzing and Sharing Genomics Data. April 21, 2011. Robert Grossman, Institute for Genomics & Systems Biology (IGSB) and Computation Institute, University of Chicago, and Open Cloud Consortium.
  2. 2. Background
  3. 3. Growth of Genomic Data. [Figure: timeline of GenBank growth (axis values 10^5, 10^8, 10^10), marking sequencing technologies (Sanger sequencing; microarray technology; 454 and Solexa sequencing), projects (HGP, ENCODE), computing milestones (GFS, Hadoop, AWS), and the progression from "sequence species" to "sequence environment" to "sequence everything". Years shown: 1977, 1995, 2001, 2003, 2005, 2006, 2008.]
  4. 4. Source: Lincoln Stein
  5. 5. The Challenge is to Support Cubes of High Throughput Sequence Data. Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set. Labels: different developmental stages; different pathologies; perturb the environment.
  6. 6. We Have a Problem. More and more of your colleagues produce so much data that they cannot easily manage, move, analyze and share it. Centers and large projects build their own infrastructure. Everyone else is on their own.
  7. 7. Part 1. Using Bionimbus
  8. 8. Bionimbus is a community cloud for storing, analyzing and sharing genomics and related data.
  9. 9. Enabling a broad community to utilize genome research. [Diagram: User, Sequencing Partner or Center, and Bionimbus Cloud, connected by steps 1 to 3.]
  10. 10. Step 1. Prepare a sample.
  11. 11. Step 2. Log in to Bionimbus and get a Bionimbus Key.
  12. 12. Step 3. FedEx your sample to CGI.
  13. 13. Step 4. Log in to Bionimbus and view your data.
  14. 14. Step 5. Use Bionimbus to run standard and custom pipelines. Using the ability of Bionimbus to launch multiple virtual machines reduced this analysis from 25 days to 1 day.
  15. 15. Step 1. Get Bionimbus ID (BID), assign project, private/community, public cloud, etc. Step 2. Send sample to be sequenced. Step 3a. Return raw reads. Step 3b. Return variant calls, CNV, annotation… Step 4. Secure data routing to appropriate cloud based upon BID. Step 5. Cloud-based analysis using IGSB and 3rd party tools and applications. [Diagram components: BID Generator; Internal Sequencers; CGI; Bionimbus Private Cloud UC; Bionimbus Community Cloud; Bionimbus Private Cloud XY; Amazon; dbGaP.]
  16. 16. Part 2. Introduction to Clouds
  17. 17. Clouds provide on-demand computing and storage resources at the scale and with the reliability of a data center. Computer scientists were caught by surprise.
  18. 18. What is a Cloud? Software as a Service (SaaS)
  19. 19. What Else is a Cloud? Infrastructure as a Service (IaaS). Users get one or more virtual machines "on demand".
  20. 20. Are There Other Types of Clouds? Hadoop was developed for processing Internet-scale data for ad targeting and related applications but is now used for processing genomics data and many other applications.
  21. 21. What is new about clouds?
  22. 22. Scale is New
  23. 23. Elastic, On-Demand Computing with Usage Based Pricing Is New. 1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour. Data center scale computing often leverages virtualization technologies.
  24. 24. Part 3. Some Bionimbus Cases
  25. 25. Case Study: Public Datasets in Bionimbus
  26. 26.
  27. 27. Case Study: modENCODE. Bionimbus is used to process the modENCODE data from the White lab (over 1000 experiments). Bionimbus VMs were used for some of the integrative analysis. Bionimbus is used as a backup for the modENCODE DCC.
  28. 28. >300 ChIP datasets: Chromatin/RNA timecourse; CBP; PolII; Pho/silencers; HDACs; Insulators; TFs. Predictions: 537 silencers; 2,307 new promoters; 12,285 enhancers; 14,145 insulators. Negre et al. Nature 2011
  35. 35. Case Study: IGSB. All samples processed by the Institute for Genomics & Systems Biology High-Throughput Genome Analysis Core (HGAC) at the University of Chicago use Bionimbus.
  36. 36. Bionimbus Virtual Machine Releases
  37. 37. Part 4. Data Centers for Science
  38. 38. [Figure: timeline of experimental science, simulation science, and data science, with the labels 1609 (30x), 1670 (250x), 1976 (10x-100x), and 2004 (10x-100x).]
  39. 39. Open Science Data Cloud: astronomical data; biological data (Bionimbus); earth science data (& disaster relief); NSF-PIRE OSDC Data Challenge.
  40. 40. The goal is to build a data center in Chicago for biological, scientific, medical and health care data in 4 to 5 years.
  41. 41. Part 5. More About Bionimbus
  42. 42. GWT-based Front End; Elastic Cloud Services; Database Services; Analysis Pipelines & Re-analysis Services; Intercloud Services; Large Data Cloud Services; Data Ingestion Services.
  43. 43. [Diagram: the same service stack (GWT-based Front End; Elastic Cloud Services; Database Services; Analysis Pipelines & Re-analysis Services; Intercloud Services; Large Data Cloud Services; Data Ingestion Services), annotated with the technologies Eucalyptus and OpenStack; PostgreSQL; IDs, etc.; UDT and replication; Hadoop and Sector/Sphere.]
  44. 44. Bionimbus Deployment Options: Bionimbus Community; Bionimbus AMIs & Amazon hosted applications; Bionimbus Private Clouds.
  45. 45. A successful cloud will… 1. Provide long term persistent storage services at the scale of a data center. 2. Provide compute services at the scale of a data center. 3. Provide high performance ingestion and transport of data.
  46. 46. A successful cloud will… 4. Support the liberation of data. 5. Peer with public clouds. 6. Peer with private genomics clouds.
  47. 47. Bionimbus satisfies each of these six requirements.
  48. 48. Bionimbus Road Map. Over the next 3 to 4 months, we will: launch Bionimbus (we are in a pre-launch); add Galaxy-based workflow to Bionimbus; add secure routing of genomes; add more public datasets; add more pipelines.
  49. 49. For More
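
As a rough illustration of the usage-based pricing point on slide 23, the sketch below computes the machine-hours and the bill for the two configurations named there. The hourly rate and the function names are assumptions made up for this example, not figures or code from the talk, and the model is deliberately simple: cost is taken to be linear in machine-hours.

```python
# Minimal sketch of usage-based pricing: the bill depends on total
# machine-hours, not on how those hours are spread across machines.
# HOURLY_RATE_CENTS is a hypothetical price, not a figure from the talk.
HOURLY_RATE_CENTS = 10


def machine_hours(machines: int, hours: int) -> int:
    """Total machine-hours consumed by a run."""
    return machines * hours


def usage_cost_cents(machines: int, hours: int,
                     rate_cents: int = HOURLY_RATE_CENTS) -> int:
    """Cost in cents under a linear, pay-per-machine-hour model."""
    return machine_hours(machines, hours) * rate_cents


# The two configurations from slide 23.
serial = usage_cost_cents(machines=1, hours=120)
elastic = usage_cost_cents(machines=120, hours=1)

print(f"1 machine x 120 hours: {serial} cents, wall-clock 120 hours")
print(f"120 machines x 1 hour: {elastic} cents, wall-clock 1 hour")
```

Under this model both runs cost the same, but the elastic run finishes in 1 hour instead of 120. That holds only if the workload splits cleanly across machines; the sketch ignores storage and data transfer charges.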