
Bioclouds CAMDA (Robert Grossman) 09-v9p


This is a talk titled "Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack" that I gave at CAMDA 2009 on October 6, 2009.


  1. 1. Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack<br />October 6, 2009<br />Robert Grossman<br />Laboratory for Advanced Computing<br />University of Illinois at Chicago<br />Open Data Group<br />Institute for Genomics & Systems Biology<br />University of Chicago<br />
  2. 2. Cistrack Team (UIC & U. Chicago)<br />Nick Bild<br />Jia Chen<br />Robert Grossman<br />Yunhong Gu<br />David Hanley<br />Oleksiy Karpenko<br />Xiangjun Liu<br />Nicolas Negre<br />Michal Sabala<br />Damian Roqueiro<br />Parantu K Shah<br />Feng Tian<br />Kevin White<br />
  9. 9. Part 1: Biology as a Data Intensive Science.<br />Two of the four Solexa machines at the IGSB facility at Argonne National Laboratory.<br />
  10. 10. Projected sequencing capabilities (world-wide)<br />[Figure: log10 billions of base pairs sequenced, by year. Projected milestones: one of each described species, 2019; one of each species (~10M estimate), 2023; one of each species (~100M estimate), 2031; total human population, 2060.]<br />Kevin White, unpublished<br />
  11. 11. Is Biology a Large Data Science?<br />CPU performance doubles approximately every 18 months (Moore’s Law). Disk capacity doubles every 12-15 months (Johnson’s Law).<br />The amount of publicly available sequence data is doubling approximately every 12 months.<br />
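The gap between these doubling times compounds quickly. A minimal sketch of the arithmetic, using the doubling periods quoted on the slide (18 months for CPUs, 12 months for sequence data). The function name `growth_factor` and the 6-year horizon are illustrative assumptions, not part of the talk.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` for a fixed doubling period."""
    return 2.0 ** (years * 12.0 / doubling_months)


years = 6  # illustrative horizon (assumption, not from the talk)
cpu = growth_factor(years, 18)   # four doublings
data = growth_factor(years, 12)  # six doublings

print(f"CPU: {cpu:.0f}x, data: {data:.0f}x, "
      f"data per unit of CPU: {data / cpu:.0f}x")
```

Under these assumptions the data grows four times faster than the compute available to analyze it over six years, which is the sense in which the question on the slide is answered with "yes".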
  12. 12. IBM joins race for $100 personal genome.<br />
  13. 13. We Have a Problem<br />More and more of your colleagues (e.g. the biologist down the hall) with access to modern instruments are producing so much data that they cannot easily manage, analyze, and archive it.<br />Large projects build their own infrastructure.<br />Almost all other biologists are on their own.<br />
  14. 14. Point of View<br />To do research today…<br />Analytic infrastructure<br />Analytic algorithms & statistical models<br />Data<br />
  15. 15. Part 2: What is a Cloud?<br />
  16. 16. What is a Cloud?<br />Software as a Service<br />
  17. 17. Is Anything Else a Cloud?<br />Infrastructure as a Service – based upon scaling Virtual Machines (VMs)<br />
  18. 18. Are There Other Types of Clouds?<br />Ad targeting<br />Large Data Cloud Services<br />
  19. 19. What is Virtualization?<br />
  20. 20. Idea Dates Back to the 1960s<br />[Diagram: applications (App) running on guest operating systems (CMS, CMS, MVS) on top of IBM VM/370 on an IBM mainframe.]<br />Native (Full) Virtualization<br />Examples: VMware ESX<br />Virtualization was first widely deployed with IBM VM/370.<br />
  21. 21. One Definition<br />Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center.<br />No standard definition.<br />Cloud architectures are not new.<br />What is new:<br />Scale<br />Ease of use<br />Pricing model<br />
  22. 22. Scale is new.<br />
  23. 23. Elastic, Usage Based Pricing Is New<br />1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.<br />Elastic, usage based pricing turns capex into opex.<br />Clouds can be used to manage surges in computing.<br />
  24. 24. Simplicity Offered By the Cloud is New<br />... and you have a computer ready to work.<br />A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.<br />
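To make the MapReduce claim concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps, counting k-mers in a handful of reads. It is plain Python, not the Hadoop or Sector API; the function names and the toy reads are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_read(read: str, k: int = 3) -> Iterator[tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group the emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: sum the values collected for one key."""
    return key, sum(values)


def kmer_counts(reads: Iterable[str], k: int = 3) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory list of reads."""
    pairs = (pair for read in reads for pair in map_read(read, k))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


reads = ["ACGTAC", "GTACGT"]  # toy reads, invented for illustration
print(kmer_counts(reads))
```

The programmer writes only `map_read` and `reduce_counts`; in a real framework the shuffle and the distribution of work across machines are handled by the system, which is where the simplicity comes from.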
  25. 25. [Figure: timeline of scientific paradigms. Labels: experimental science, simulation science, data science. Milestones: 1609 (30x), 1670 (250x), 1976 (10x-100x), 2004 (10x-100x).]<br />
  26. 26. Clouds vs Grids<br />
  27. 27. Hadoop & Sector<br />
  28. 28. MalStone B Benchmark<br />
  29. 29. Part 3: Cistrack<br />
  30. 30. Cistrack<br />Resource for cis-regulatory data.<br />It is open source and based upon CUBioS.<br />Currently used by the White Lab at University of Chicago for managing modENCODE fly data.<br />Contains raw, intermediate, and analyzed data from approximately 240 experiments on Agilent, Affy and Solexa platforms.<br />
  31. 31. Chromatin Developmental Time-Course<br />H3K4me1: enhancers<br />H3K4me3: promoters & enhancers<br />H3K9Ac: activation<br />H3K9me3: heterochromatin<br />H3K27Ac: activation<br />H3K27me3: repression<br />PolII: transcript. & promoters<br />CBP: HAT, enhancers<br />Total RNA: expression<br />X<br />12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre)<br />8 antibodies used for ChIP experiments (Zirong Li and Bing Ren)<br />
  32. 32. 1. Cistrack Supports Cubes of Data<br />Drosophila regulatory elements from Drosophila modENCODE.<br />ChIP-chip data using Agilent 244K dual-color arrays.<br />Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP. <br />Each factor has been studied for 12 different time-points of Drosophila development. <br />
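One way to picture such a cube is a value for every (factor, time point, probe) triple. A minimal sketch, assuming a plain dictionary as storage; the class name `DataCube`, the time-point labels, the probe IDs, and the values are all hypothetical and do not reflect Cistrack's actual schema.

```python
# The eight factors named on the slide.
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
           "H3K27Ac", "H3K9Ac", "PolII", "CBP"]
# Twelve developmental time points; the labels T01..T12 are placeholders.
TIME_POINTS = [f"T{i:02d}" for i in range(1, 13)]


class DataCube:
    """Sparse cube: one value per (factor, time point, probe) cell."""

    def __init__(self) -> None:
        self._cells: dict[tuple[str, str, str], float] = {}

    def put(self, factor: str, time_point: str, probe: str, value: float) -> None:
        """Store one measurement, rejecting unknown factors or time points."""
        if factor not in FACTORS or time_point not in TIME_POINTS:
            raise KeyError(f"unknown coordinate: {factor}, {time_point}")
        self._cells[(factor, time_point, probe)] = value

    def select(self, factor=None, time_point=None, probe=None):
        """Return the cells that match every coordinate that is not None."""
        wanted = (factor, time_point, probe)
        return {
            key: value
            for key, value in self._cells.items()
            if all(w is None or w == k for w, k in zip(wanted, key))
        }


cube = DataCube()
cube.put("H3K4me3", "T01", "probe_1", 2.5)  # invented values
cube.put("H3K4me3", "T02", "probe_1", 3.1)
cube.put("PolII", "T01", "probe_1", 1.2)

print(cube.select(factor="H3K4me3"))                   # one factor over time
print(cube.select(time_point="T01", probe="probe_1"))  # all factors at one point
```

With 8 factors and 12 time points there are 96 factor/time-point slices, and each query above fixes some coordinates and leaves the others free.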
  33. 33. 2. ChIP-Seq Data Volumes are Large<br />Cistrack integrates with large data clouds.<br />
  34. 34. 3. Continuous Reanalysis is Desirable<br />In general, it is quite labor intensive to reanalyze your existing data with a new algorithm.<br />Cistrack supports VMs that can simplify re-applying Cistrack pipelines that have been updated to include a new algorithm.<br />
  35. 35. Cistrack Architecture<br />Cistrack Web Portal & Widgets<br />Cistrack Database<br />Analysis Pipelines & Re-analysis Services<br />Cistrack Cloud Services<br />Ingestion Services<br />
  36. 36. Part 4: Reanalysis<br />Can you repeat an analytic pipeline one year after a post-doc leaves your lab?<br />
  37. 37. Promoters: Use of H3K4me3, PolII & RNA to Map Active Genes<br />
  38. 38. Promoters: Use of H3K4me3, PolII & RNA to Map Active Genes<br />
  39. 39. Active Genes - Solexa Result<br />
  40. 40. Basic Idea<br />[Diagram labels: Replace; Cloud; VM, VM, VM.]<br />At any time, you can launch one or more virtual machines (VMs) to redo a pipeline analysis, persist the results to a database, and persist the VM to a cloud.<br />
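The reanalysis loop itself can be sketched without any cloud API: keep the raw data, version the pipeline, and store each result under the version that produced it. This sketch uses SQLite in place of the Cistrack database and two toy functions in place of real analysis pipelines. Every name in it is hypothetical, and launching or archiving the VMs is out of scope.

```python
import sqlite3
from statistics import mean, median

# Toy stand-ins for two versions of an analysis pipeline (assumption).
PIPELINES = {"v1": mean, "v2": median}


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the results store; one row per (experiment, pipeline version)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "experiment TEXT, pipeline_version TEXT, value REAL, "
        "PRIMARY KEY (experiment, pipeline_version))"
    )
    return db


def reanalyze(db: sqlite3.Connection, raw_data: dict, version: str) -> None:
    """Re-run one pipeline version over all raw data and persist the results."""
    pipeline = PIPELINES[version]
    for experiment, signal in raw_data.items():
        db.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
            (experiment, version, float(pipeline(signal))),
        )
    db.commit()


# Invented raw signals, for illustration only.
raw_data = {"exp_001": [1.0, 2.0, 9.0], "exp_002": [4.0, 4.0, 7.0]}

db = open_store()
for version in PIPELINES:
    reanalyze(db, raw_data, version)

rows = db.execute(
    "SELECT experiment, pipeline_version, value FROM results "
    "ORDER BY experiment, pipeline_version"
).fetchall()
for row in rows:
    print(row)
```

Because results are keyed by pipeline version, a rerun with a new algorithm adds rows instead of overwriting the earlier analysis, so old and new results can be compared side by side.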
  41. 41. Raywulf<br />We have designed a cluster (called a Raywulf Cloud) that is optimized to serve as your own private cloud.<br />About $2K/TB.<br />Will be used by the Open Science Data Cloud.<br />
  42. 42. Acknowledgements<br />
  43. 43. Cis-Regulatory Map of the Drosophila Genome (modENCODE)<br />Data Generation<br />Kevin White, U. Chicago (Antibody pipeline, ChIP-chip pipeline)<br />Bing Ren, UCSD (Antibody validation, ChIP-chip pipeline)<br />Robert Grossman, U. Illinois (LIMS, data management & analysis)<br />Computational identification of Cis-Regulatory Motifs<br />Manolis Kellis, MIT (Motif analysis, ChIP-chip data analysis)<br />Biological validation<br />Jim Posakony, UCSD (Promoters/Enhancers)<br />Steve Russell, Cambridge U. (Insulators/Silencers)<br />Hugo Bellen, Baylor (Element “necessity” validations)<br />
  44. 44. Cistrack<br />Cistrack Cloud: Yunhong Gu, Michal Sabala<br />Cistrack DB: David Hanley, Xiangjun Liu, Nicolas Negre, Michal Sabala, Parantu K Shah<br />Cistrack Analysis Pipelines & Tools: Nick Bild, Jia Chen, Xiangjun Liu, Nicolas Negre, Damian Roqueiro, Parantu K Shah, Feng Tian, Kevin White<br />
  52. 52. Thank You<br />