The Transformation of Systems Biology Into A Large Data Science


This is a talk I gave at the Institute for Genomics & Systems Biology (IGSB) on December 7, 2009. The talk looks at the role of cloud computing platforms, including private clouds, in managing the large volumes of data produced by next-generation sequencing platforms.

Published in: Technology, Education

  1. Is Systems Biology Becoming a Data Intensive Science? Assuming So, Are You Ready?
     December 7, 2009
     Robert Grossman
     Laboratory for Advanced Computing
     University of Illinois at Chicago
  2. Part 1: Biology as a Data Intensive Science
     [Photo: two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).]
  3. Growth of Genomic Data
     [Figure: timeline of GenBank growth, with marks at 10^5, 10^8, and 10^10. Labels: Sanger sequencing, microarray technology, 454 and Solexa sequencing; HGP and ENCODE; years 1977, 1995, 2001, 2003, 2005.]
  4. Growth of Genomic Data
     [Figure: the same timeline with added labels: sequence species, sequence environment, sequence individuals; GFS, Hadoop, AWS; years 2003, 2006, 2008.]
  5. The Challenge Is to Support Cubes of High Throughput Sequence Data
     Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set.
     Different developmental stages
     Different conditions
     Perturb the environment
  6. We Have a Problem
     More and more of your colleagues produce so much data that they cannot easily manage and analyze it.
     Large projects build their own infrastructure.
     Everyone else is on their own.
  7. [Figure: timeline of the eras of science. Labels: experimental science, simulation science, data science; years 1609, 1670, 1976, 2003; factors 30x, 250x, 10x-100x, 10x-100x.]
  8. To answer today’s biological questions
     Point of View
     Analytic infrastructure
     Analytic algorithms & statistical models
     Data
  9. Part 2: What is a Cloud?
  10. What is a Cloud?
      Software as a Service
  11. Is Anything Else a Cloud?
      Infrastructure as a Service – based upon scaling Virtual Machines (VMs)
  12. Are There Other Types of Clouds?
      Web search & ad targeting
      Large Data Cloud Services
  13. What is Virtualization?
  14. Idea Dates Back to the 1960s
      [Figure: stack diagram. Apps run on guest operating systems (CMS, CMS, MVS), which run on IBM VM/370 on an IBM mainframe.]
      Native (Full) Virtualization
      Examples: VMware ESX
      Virtualization was first widely deployed with IBM VM/370.
  15. What Do You Optimize?
      Goal: Minimize latency and control heat.
      Goal: Maximize data (with matching compute) and control cost.
  16. Scale is new
  17. Elastic, Usage Based Pricing Is New
      1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
      - Elastic, usage based pricing turns capex into opex.
      - Clouds can be used to manage surges in computing.
  18. Simplicity Offered By the Cloud is New
      [Figure] + ... and you have a computer ready to work.
      A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
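The MapReduce claim on slide 18 is easier to picture with a toy example. The following is a minimal sketch of the map, shuffle, and reduce steps in plain Python, not Hadoop or Sector/Sphere code. The k-mer counting task, the function names, and the sample reads are all assumptions chosen for illustration; they do not come from the talk.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def kmer_counts(reads, k=3):
    """Run map, shuffle, and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


# Toy reads (hypothetical, not real sequencing data).
print(kmer_counts(["ACGTAC", "GTACGT"]))
```

In a real large data cloud the framework runs the map and reduce functions in parallel across nodes and performs the shuffle itself, which is why the programmer only has to write the two small functions.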
  19. Clouds vs Grids
  20. Part 3: Case Studies
  21. Case Study 1: Cistrack Large Data Cloud
  22. Cistrack
      Resource for cis-regulatory data.
      It is open source and based upon CUBioS.
      Currently used by the White Lab at the University of Chicago for managing modENCODE fly data.
      Contains raw, intermediate, and analyzed data from approximately 240 experiments on Agilent, Affy, and Solexa platforms.
  23. CUBioS Applications
      Cistrack is an instance of CUBioS.
      [Figure: CUBioS architecture. Labels: Front Ends; CUBioS; Bowtie, TopHat, R pipelines, etc.; Ingestion; RNA-seq, ChIP-seq, DNA capture, etc.]
  24. Chromatin Developmental Time-Course
      H3K4me1: enhancers
      H3K4me3: promoters & enhancers
      H3K9Ac: activation
      H3K9me3: heterochromatin
      H3K27Ac: activation
      H3K27me3: repression
      PolII: transcript. & promoters
      CBP: HAT, enhancers
      Total RNA: expression
      X
      12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre)
      8 antibodies used for ChIP experiments (Zirong Li and Bing Ren)
  25. Cistrack Supports Multi-Dim. Cubes…
      Drosophila regulatory elements from Drosophila modENCODE.
      ChIP-chip data using Agilent 244K dual-color arrays.
      Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP.
      Each factor has been studied for 12 different time-points of Drosophila development.
  26. … Each Cell in a Cube Can Be Three ChIP-Seq Datasets from a Solexa
      Cistrack integrates with large data clouds.
      Cistrack uses the Sector/Sphere large data cloud.
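The cube described on slides 25 and 26 can be pictured as a mapping from (factor, time point) to the datasets stored in that cell. The sketch below is one possible in-memory layout and is not how Cistrack itself stores data. The integer time-point labels, the function names, and the dataset identifier are hypothetical.

```python
from itertools import product

# The eight factors named on slide 25: six histone modifications, PolII, and CBP.
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
           "H3K27Ac", "H3K9Ac", "PolII", "CBP"]

# Twelve developmental time points; the labels 1..12 are an assumption.
TIME_POINTS = list(range(1, 13))


def empty_cube():
    """Create one empty cell for every (factor, time point) pair."""
    return {(factor, t): [] for factor, t in product(FACTORS, TIME_POINTS)}


def add_dataset(cube, factor, time_point, dataset_id):
    """Attach a dataset identifier to one cell of the cube."""
    cube[(factor, time_point)].append(dataset_id)


def slice_by_factor(cube, factor):
    """Return the time course (time point -> datasets) for one factor."""
    return {t: datasets for (f, t), datasets in cube.items() if f == factor}


cube = empty_cube()
add_dataset(cube, "PolII", 3, "chipseq-run-001")  # hypothetical identifier
print(len(cube))  # 8 factors x 12 time points = 96 cells
```

Adding a dimension, such as the condition or perturbation axes from slide 5, would only extend the key tuple.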
  27. Hadoop vs Sector
      Source: Gu and Grossman, Sector and Sphere, Phil. Trans. Royal Society A, 2009.
  28. Cistrack Web Portal & Widgets
      Cistrack Elastic Cloud Services
      Cistrack Database
      Analysis Pipelines & Re-analysis Services
      Cistrack Large Data Cloud Services
      Ingestion Services
  29. Case Study 2: Combinatorial Analysis of Marks
  30. Active Gene - Method
      Gene Activeness: Label a transcript t as XYZ, where
      X=1 if H3K4me3 binds in [-1800, min(2200, TranscriptLength)]
      Y=1 if Pol II binds in [-1800, min(2200, TranscriptLength)]
      Z=1 if at least one exon is ≥30% covered by RNA and, in total, ≥10% is covered by RNA.
      [Figure panels: K4Me3 to TSS distance; Pol II to TSS distance.]
      Source: Jia Chen et al. (modENCODE)
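The XYZ labeling rule on slide 30 can be sketched as a small function. This is an illustration under stated assumptions, not the modENCODE pipeline: it assumes peak positions are given in bp relative to the TSS (negative values upstream), that each exon is summarized as a (length, covered bases) pair, and that the 10% total is measured over all exon bases. The function and parameter names are hypothetical.

```python
def binds_near_tss(peak_offsets, transcript_length):
    """True if any peak falls in [-1800, min(2200, transcript_length)].

    peak_offsets are positions in bp relative to the TSS (assumed convention).
    """
    upper = min(2200, transcript_length)
    return any(-1800 <= offset <= upper for offset in peak_offsets)


def rna_supported(exons):
    """True if one exon is >= 30% covered and total coverage is >= 10%.

    exons is a list of (length_bp, covered_bp) pairs (assumed representation).
    """
    total_length = sum(length for length, _ in exons)
    if total_length == 0:
        return False
    total_covered = sum(covered for _, covered in exons)
    one_exon_ok = any(length > 0 and covered / length >= 0.30
                      for length, covered in exons)
    return one_exon_ok and total_covered / total_length >= 0.10


def activity_label(k4me3_offsets, pol2_offsets, exons, transcript_length):
    """Return the three-character XYZ label for one transcript."""
    x = int(binds_near_tss(k4me3_offsets, transcript_length))
    y = int(binds_near_tss(pol2_offsets, transcript_length))
    z = int(rna_supported(exons))
    return f"{x}{y}{z}"


# Hypothetical transcript: H3K4me3 peak 500 bp upstream, Pol II peak too far
# downstream, and two exons with 40% and about 7% RNA coverage.
print(activity_label([-500], [3000], [(100, 40), (900, 60)], 5000))
```

A label of 111 then marks a transcript supported by all three lines of evidence, and the eight possible labels correspond to the regions of a three-set Venn diagram.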
  31. Promoters: Use H3K4me3, PolII & RNA to Map Active Genes
      Source: Jia Chen et al. (modENCODE)
  32. Active Genes (cont’d)
      [Figure: panels A, B, C. Venn diagram of PolII, H3K4me3, and RNA with region counts 1418, 332, 753, 6104, 680, 482, 1350; two panels plotted against bp from TSS.]
      Source: Jia Chen et al. (modENCODE)
  33. Interesting Combinatorial Combination of Marks
      [Figure axes: probes along genome; marks.]
      Item-sets are formed by sliding a window along the genome.
      The Apriori algorithm generates interesting itemsets.
      Post-processing retains itemsets of biological relevance.
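The windowing and Apriori steps on slide 33 can be sketched in a few lines. This is a minimal textbook version of Apriori and not the code used in the study. It assumes each probe is represented as the set of marks detected there, that a window's itemset is the union of the marks in it, and that "interesting" means meeting a minimum support count. The biological post-processing step is not shown.

```python
from itertools import combinations


def window_itemsets(probe_marks, width):
    """Slide a window of `width` probes along the genome.

    probe_marks is a list of sets of marks, one per probe, in genome order.
    Each window yields the union of the marks it contains.
    """
    return [frozenset().union(*probe_marks[i:i + width])
            for i in range(len(probe_marks) - width + 1)]


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets in >= min_support transactions."""
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: keep a candidate only if all its subsets are frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, size - 1))}
    return frequent


# Hypothetical marks on five consecutive probes.
probes = [{"H3K4me3"}, {"PolII"}, {"H3K4me3", "PolII"}, set(), {"H3K27me3"}]
windows = window_itemsets(probes, width=2)
print(apriori(windows, min_support=3))
```

The prune step is what makes Apriori practical: an itemset can only be frequent if every smaller subset of it is frequent, so most combinations of marks are never counted.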
  34. Case Study 3: Cistrack Elastic Cloud
  35. Cistrack Elastic Cloud
      A rack in Cistrack’s Elastic Cloud contains 128 cores and 128 TB.
      Multiple racks form a data center.
      Virtual machines can run pipelines.
      Virtual machines have access to large data services.
      No need to move large datasets in and out of the Amazon public cloud.
  36. Use VMs to Support Reanalysis
      [Figure labels: Replace; Cloud; VM, VM, VM.]
      At any time, you can launch one or more virtual machines (VMs) to redo pipeline analysis, persisting the results to a database and the VM to a cloud.
  37. Comparing Peak Calling Algorithms for modENCODE
      We’re using the Cistrack Elastic Cloud to rerun peak calls for fly data using the worm pipeline.
      Also running the worm peak calling pipeline on the fly data.
  38. Case Study 4: Ensembles of Trees on Clouds
      [Figure labels: data; 100 tree models; 10,000??? tree models.]
      Wenxuan Gao, Robert Grossman, Philip S. Yu, Yunhong Gu, Why Naïve Ensembles Do Not Work in Cloud Computing, Proceedings of LSDM, 2009.
  39. Ensembles of Trees for Clouds
      Top-k ensembles
      - Each node builds a single random tree with local data.
      - Central node picks the k best random trees to predict.
      - Lower cost with correspondingly lower accuracy.
      - Shuffling data can improve accuracy.
      Skeleton ensembles
      - Central node builds k skeletons of random trees.
      - Each local node fills in the skeletons.
      - Central node merges all trees from local nodes.
      - Greater cost, but more accurate.
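The top-k ensemble idea on slide 39 can be sketched as follows. This is a simplified illustration and not the algorithm from the cited paper: it swaps the random trees for random depth-1 trees (decision stumps) to stay short, and it assumes the central node ranks models by accuracy on a validation set that it holds. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def build_random_stump(local_data, rng):
    """Build a depth-1 random tree from one node's local data.

    local_data is a list of (features, label) pairs. The split feature and
    threshold are chosen at random; leaves predict the local majority label.
    """
    feature = rng.randrange(len(local_data[0][0]))
    threshold = rng.choice(local_data)[0][feature]
    left = Counter(y for x, y in local_data if x[feature] <= threshold)
    right = Counter(y for x, y in local_data if x[feature] > threshold)
    overall = Counter(y for _, y in local_data).most_common(1)[0][0]
    left_label = left.most_common(1)[0][0] if left else overall
    right_label = right.most_common(1)[0][0] if right else overall
    return (feature, threshold, left_label, right_label)


def predict_stump(stump, x):
    """Predict the label of one feature vector with one stump."""
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def accuracy(stump, data):
    """Fraction of (features, label) pairs the stump predicts correctly."""
    return sum(predict_stump(stump, x) == y for x, y in data) / len(data)


def top_k_ensemble(partitions, validation, k, seed=0):
    """Each partition (node) builds one stump; keep the k most accurate."""
    rng = random.Random(seed)
    stumps = [build_random_stump(part, rng) for part in partitions]
    stumps.sort(key=lambda s: accuracy(s, validation), reverse=True)
    return stumps[:k]


def predict_ensemble(stumps, x):
    """Majority vote over the selected stumps."""
    votes = Counter(predict_stump(s, x) for s in stumps)
    return votes.most_common(1)[0][0]


# Toy data: three nodes holding the same two labeled points (hypothetical).
part = [((0.0,), 0), ((1.0,), 1)]
ensemble = top_k_ensemble([part, part, part], validation=part, k=2)
print(predict_ensemble(ensemble, (0.0,)))
```

Only the fitted models travel to the central node, which is where the lower cost comes from. If the local partitions are not representative of the whole dataset, each model is biased, which is consistent with the slide's note that shuffling data can improve accuracy.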
  40. Experimental Studies
      Performed experimental studies on 4 racks (104 nodes) of the Open Cloud Testbed.
      Standard ensemble based models are more expensive than the proposed approaches and can overfit.
      Skeleton ensembles are more accurate but more expensive to build.
      Shuffling improves the accuracy of the top-k algorithm.
      For the KDDCup99 dataset, top-k ensembles with 0.1% of the data shuffled match the accuracy of the skeleton method.
      For the UCI Census income dataset, a 20% shuffle is required, which is more expensive than top-k ensemble.
      Without knowledge of the uniformity of the dataset, we recommend skeleton ensembles.
  41. [Figures: results for the KDDCup99 dataset and the Census income dataset.]
  42. Part 5. Open Cloud Consortium
      Biocloud
  43. Open Cloud Testbed
      C-Wave, CENIC, Dragon, MREN
      Phase 2: 9 racks, 250+ nodes, 1000+ cores, 10+ Gb/s
      - Hadoop
      - Sector/Sphere
      - Thrift
      - KVM VMs
      - Eucalyptus VMs
  48. Open Science Data Cloud
      sky cloud
      biocloud
      additional projects in planning…
  49. OCC Condominium Clouds
      In a condominium cloud, you buy your own rack or bunch of racks.
      The racks are managed and operated by the condominium association, in this case the OCC.
      If your rack is 120 TB, you get the rights to approx. 40 TB of storage in the cloud. The rest is a shared resource.
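The condominium example on slide 49 implies a simple split of each rack's storage. The sketch below assumes the owner's share is a fixed one third of the rack, a ratio inferred only from the single 120 TB to roughly 40 TB example; the talk does not state the actual OCC allocation rule, and the names are hypothetical.

```python
# Assumption: the owner keeps one third of the rack's storage, inferred from
# the 120 TB -> approx. 40 TB example on the slide.
OWNER_SHARE = 1 / 3


def condo_allocation(rack_tb, owner_share=OWNER_SHARE):
    """Split a rack's storage into (owner TB, shared TB)."""
    owned = rack_tb * owner_share
    return owned, rack_tb - owned


owned, shared = condo_allocation(120)
print(round(owned), round(shared))
```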
  50. Acknowledgements
  51. To Get Involved
      The Cistrack resource for transcriptional data:
      Sector/Sphere cloud:
  52. Thank You
      For more information: or