Is Systems Biology Becoming a Data Intensive Science? Assuming So, Are You Ready?
  December 7, 2009
  Robert Grossman
  Laboratory for Advanced Computing
  University of Illinois at Chicago
Part 1: Biology as a Data Intensive Science
  [Photo: Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).]
Growth of Genomic Data
  [Figure: timeline of genomic data growth. Genbank scale markers: 10^5, 10^8, 10^10.]
  • Sequencing technology: Sanger sequencing (1977), microarray technology (1995), 454 and Solexa sequencing (2005)
  • Projects: HGP (2001), ENCODE (2003)
  • Sequencing targets: species, environment, individuals
  • Computing: GFS (2003), AWS (2006), Hadoop (2008)
The Challenge is to Support Cubes of High Throughput Sequence Data
  • Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set.
  • Cube axes: different developmental stages, different conditions, perturb the environment.
We Have a Problem…
  • More and more of your colleagues produce so much data that they cannot easily manage and analyze it.
  • Large projects build their own infrastructure.
  • Everyone else is on their own.
  [Figure: instruments and eras of science.]
  • 1609: 30x (experimental science)
  • 1670: 250x (experimental science)
  • 1976: 10x-100x (simulation science)
  • 2003: 10x-100x (data science)
To Answer Today's Biological Questions
  [Figure: stacked layers.]
  • Point of view
  • Analytic infrastructure
  • Analytic algorithms & statistical models
  • Data
Part 2: What is a Cloud?

What is a Cloud?
  • Software as a Service

Is Anything Else a Cloud?
  • Infrastructure as a Service, based upon scaling Virtual Machines (VMs)

Are There Other Types of Clouds?
  • web search & ad targeting
  • Large Data Cloud Services

What is Virtualization?
Idea Dates Back to the 1960s
  [Figure: apps running on CMS, CMS and MVS guests, on IBM VM/370, on an IBM mainframe.]
  • Native (full) virtualization. Examples: VMware ESX.
  • Virtualization was first widely deployed with IBM VM/370.
What Do You Optimize?
  • Goal: Minimize latency and control heat.
  • Goal: Maximize data (with matching compute) and control cost.
Scale is new
Elastic, Usage Based Pricing Is New
  • 1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
  • Elastic, usage based pricing turns capex into opex.
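The equivalence on this slide is machine-hour arithmetic. A minimal sketch, assuming a flat price per machine-hour; the function name and the rate of 10 cents are hypothetical and not from the talk:

```python
def usage_cost_cents(machines: int, hours: int, cents_per_machine_hour: int) -> int:
    """Cost under usage based pricing: you pay only for machine-hours consumed."""
    return machines * hours * cents_per_machine_hour


RATE_CENTS = 10  # hypothetical price per machine-hour, for illustration only

one_machine_long = usage_cost_cents(machines=1, hours=120, cents_per_machine_hour=RATE_CENTS)
many_machines_short = usage_cost_cents(machines=120, hours=1, cents_per_machine_hour=RATE_CENTS)

# Both jobs consume 120 machine-hours, so they are billed the same amount.
print(one_machine_long == many_machines_short)
```

Under this assumption the only thing that changes is the wall-clock time to a result: 120 hours versus 1 hour.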
Simplicity Offered By the Cloud is New
  • Clouds can be used to manage surges in computing.
  • … and you have a computer ready to work.
  • A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
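To illustrate why MapReduce is quick to learn, here is a single-process sketch of the programming model. It is not the Hadoop or Sector/Sphere API; `map_reduce`, `count_mapper`, `sum_reducer` and the sample reads are hypothetical names and data chosen for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a mapper over every record, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # map phase
            groups[key].append(value)      # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}  # reduce phase


def count_mapper(read):
    """Emit (chromosome, 1) for each aligned read."""
    chrom, _position = read
    yield chrom, 1


def sum_reducer(_key, values):
    """Add up the counts emitted for one chromosome."""
    return sum(values)


# Hypothetical aligned reads: (chromosome, position)
reads = [("chr2L", 100), ("chr2L", 250), ("chrX", 40)]
counts = map_reduce(reads, count_mapper, sum_reducer)
print(counts)
```

The programmer writes only the mapper and the reducer; in a real large data cloud the framework distributes the map, shuffle and reduce phases across nodes.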
Clouds vs Grids

Part 3: Case Studies

Case Study 1: Cistrack Large Data Cloud
  www.cistrack.org
Cistrack
  • Resource for cis-regulatory data.
  • It is open source and based upon CUBioS.
  • Currently used by the White Lab at the University of Chicago for managing ModENCODE fly data.
  • Contains raw, intermediate, and analyzed data from approximately 240 experiments from Agilent, Affy and Solexa platforms.
CUBioS
  [Figure: CUBioS architecture. Labels: Applications; Front Ends; CUBioS; Bowtie, TopHat, R pipelines, etc.; Ingestion (RNA-seq, ChIP-seq, DNA capture, etc.).]
  • Cistrack is an instance of CUBioS.
Chromatin Developmental Time-Course
  H3K4me1     enhancers
  H3K4me3     promoters & enhancers
  H3K9Ac      activation
  H3K9me3     heterochromatin
  H3K27Ac     activation
  H3K27me3    repression
  PolII       transcription & promoters
  CBP         HAT, enhancers
  Total RNA   expression
  • × 12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre)
  • 8 antibodies used for ChIP experiments (Zirong Li and Bing Ren)
Cistrack Supports Multi-Dim. Cubes…
  • Drosophila regulatory elements from Drosophila modENCODE.
  • ChIP-chip data using Agilent 244K dual-color arrays.
  • Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP.
  • Each factor has been studied for 12 different time-points of Drosophila development.

… Each Cell in a Cube Can Be Three ChIP-Seq Datasets from a Solexa
  • Cistrack integrates with large data clouds.
  • Cistrack uses the Sector/Sphere large data cloud.
Hadoop vs Sector
  Source: Gu and Grossman, Sector and Sphere, Phil. Trans. Royal Society A, 2009.
  [Figure: Cistrack architecture.]
  • Cistrack Web Portal & Widgets
  • Cistrack Elastic Cloud Services
  • Cistrack Database
  • Analysis Pipelines & Re-analysis Services
  • Cistrack Large Data Cloud Services
  • Ingestion Services
Case Study 2: Combinatorial Analysis of Marks
Active Gene - Method
  Gene activeness: label a transcript t as XYZ, where
  • X=1 if an H3K4Me3 binds in [-1800, min(2200, TranscriptLength)]
  • Y=1 if a Pol II binds in [-1800, min(2200, TranscriptLength)]
  • Z=1 if at least one exon has ≥30% covered by RNA, and in total ≥10% covered by RNA.
  [Figures: K4Me3 to TSS distance; Pol II to TSS distance.]
  Source: Jia Chen et al. (ModENCODE)
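The XYZ labeling rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: peak positions are assumed to be given as strand-aware offsets in bp from the TSS, coverage values are assumed to be fractions in [0, 1], and all function and parameter names are hypothetical.

```python
def binds_near_tss(peak_offsets, transcript_length):
    """True if any peak offset (bp from TSS) falls in [-1800, min(2200, transcript_length)]."""
    upper = min(2200, transcript_length)
    return any(-1800 <= offset <= upper for offset in peak_offsets)


def rna_supported(exon_coverages, total_coverage):
    """True if at least one exon is >=30% covered by RNA and total coverage is >=10%."""
    best_exon = max(exon_coverages, default=0.0)
    return best_exon >= 0.30 and total_coverage >= 0.10


def label_transcript(k4me3_offsets, pol2_offsets, transcript_length,
                     exon_coverages, total_coverage):
    """Return the three-character label XYZ described on the slide."""
    x = int(binds_near_tss(k4me3_offsets, transcript_length))
    y = int(binds_near_tss(pol2_offsets, transcript_length))
    z = int(rna_supported(exon_coverages, total_coverage))
    return f"{x}{y}{z}"


# Hypothetical transcript: H3K4me3 peak 500 bp upstream, Pol II peak too far
# downstream, one well-covered exon.
label = label_transcript(
    k4me3_offsets=[-500],
    pol2_offsets=[3000],
    transcript_length=5000,
    exon_coverages=[0.5, 0.0],
    total_coverage=0.2,
)
print(label)
```

A label such as 111 then marks a transcript supported by all three lines of evidence, while 000 marks one supported by none.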
Promoters: Use H3K4me3, PolII & RNA to Map Active Genes
  Source: Jia Chen et al. (ModENCODE)

Active Genes (cont'd)
  [Figure: panels A, B and C. Overlap of PolII, H3K4me3 and RNA; two plots with x-axis "bp from TSS".]
  Source: Jia Chen et al. (ModENCODE)
Interesting Combinatorial Combination of Marks
  [Figure: marks plotted against probes along the genome.]
  • Item-sets are formed by sliding a moving window along the genome.
  • The Apriori algorithm generates interesting itemsets.
  • Post-processing retains itemsets of biological relevance.
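The first two steps above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each probe is represented by the set of marks detected there, that a window's item-set is the union of marks over its probes, and that "interesting" means meeting a minimum support threshold. The names, window width and threshold are hypothetical.

```python
from itertools import combinations


def window_itemsets(probe_marks, width):
    """Union the marks seen in each window of `width` consecutive probes."""
    return [frozenset().union(*probe_marks[i:i + width])
            for i in range(len(probe_marks) - width + 1)]


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        level = {}
        for candidate in current:
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= min_support:
                level[candidate] = support
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


# Hypothetical marks detected at four consecutive probes.
probe_marks = [{"H3K4me3", "PolII"}, {"H3K4me3"}, {"H3K27me3"}, {"H3K4me3", "PolII"}]
transactions = window_itemsets(probe_marks, width=2)
frequent = apriori(transactions, min_support=0.6)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(support, 2))
```

The post-processing step that retains biologically relevant itemsets is domain specific and is not sketched here.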
Case Study 3: Cistrack Elastic Cloud

Cistrack Elastic Cloud
  • A rack in Cistrack's Elastic Cloud contains 128 cores and 128 TB.
  • Multiple racks form a data center.
  • Virtual machines can run pipelines.
  • Virtual machines have access to large data services.
  • No need to move large datasets in and out of Amazon public cloud.
Use VMs to Support Reanalysis
  [Figure labels: Replace; Cloud; VM, VM, VM.]
  • At any time, you can launch one or more virtual machines (VMs) to redo pipeline analysis and persist the results to a database and the VM to a cloud.
Comparing Peak Calling Algorithms for ModENCODE
  • We're using the Cistrack Elastic Cloud to rerun peak calls for fly data using the worm pipeline.
  • Also running the worm peak calling pipeline on the fly data.
Case Study 4: Ensembles of Trees on Clouds
  [Figure labels: data; 100 tree models; 10,000??? tree models.]
  Wenxuan Gao, Robert Grossman, Philip S. Yu, Yunhong Gu, Why Naïve Ensembles Do Not Work in Cloud Computing, Proceedings of LSDM, 2009.
Ensembles of Trees for Clouds
  Top-k ensembles
  • Each node builds a single random tree with local data.
  • Central node picks k best random trees to predict.
  • Lower cost with corresponding lower accuracy.
  • Shuffling data can improve accuracy.
  Skeleton ensembles
  • Central node builds k skeletons of random trees.
  • Each local node fills in the skeletons.
  • Central node merges all trees from local nodes.
  • Greater cost, but more accurate.
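The top-k scheme can be sketched in a single process as follows. This is an illustration only, not the algorithm from the cited paper: a one-split random stump stands in for a random tree, the "k best" are ranked by accuracy on a validation set held by the central node (an assumption), prediction is by majority vote, and all names are hypothetical. Shuffling and skeleton ensembles are not covered.

```python
import random
from collections import Counter


def majority(labels, fallback):
    """Most common label, or `fallback` when there are no labels."""
    return Counter(labels).most_common(1)[0][0] if labels else fallback


def build_random_stump(rows, rng):
    """Stand-in for a random tree: split local data on a random feature and threshold.

    rows is a non-empty list of (features, label) pairs held by one node.
    """
    feature = rng.randrange(len(rows[0][0]))
    threshold = rng.choice(rows)[0][feature]
    overall = majority([y for _, y in rows], None)
    left = majority([y for x, y in rows if x[feature] <= threshold], overall)
    right = majority([y for x, y in rows if x[feature] > threshold], overall)
    return lambda x: left if x[feature] <= threshold else right


def accuracy(model, rows):
    """Fraction of rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def pick_top_k(models, validation, k):
    """Central node step: keep the k models with the best validation accuracy."""
    return sorted(models, key=lambda m: accuracy(m, validation), reverse=True)[:k]


def vote(models, x):
    """Predict by majority vote over the selected models."""
    return Counter(m(x) for m in models).most_common(1)[0][0]


# Hypothetical data: one feature, label is 1 when the feature is large.
rng = random.Random(0)
partitions = [[([1], 0), ([9], 1)], [([2], 0), ([8], 1)], [([3], 0), ([7], 1)]]
validation = [([1], 0), ([9], 1), ([7], 1)]

local_models = [build_random_stump(part, rng) for part in partitions]  # one per node
ensemble = pick_top_k(local_models, validation, k=2)
print(vote(ensemble, [8]))
```

Only the small models travel to the central node, never the data, which is where the lower cost of the top-k approach comes from.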
Experimental Studies
  • Performed experimental studies on 4 racks (104 nodes) of the Open Cloud Testbed.
  • Standard ensemble based models are more expensive than proposed approaches and can overfit.
  • Skeleton ensembles are more accurate but more expensive to build.
  • Shuffling improves accuracy of top-k algorithm.
  • For KDDCup99 dataset, top-k ensembles with shuffling 0.1% of data matches accuracy of skeleton method.
  • For UCI Census income dataset, 20% shuffle required, which is more expensive than top-k ensemble.
  • Without knowledge of uniformity of dataset, recommend skeleton ensembles.
  [Figures: results for the KDDCup99 dataset and the Census income dataset.]

Part 5: Open Cloud Consortium Biocloud
Open Cloud Testbed
  [Figure: testbed network map. Networks: C-Wave, CENIC, Dragon, MREN.]
  Phase 2
  • 9 racks
  • 250+ nodes
  • 1000+ cores
  • 10+ Gb/s
  • Hadoop
  • Sector/Sphere
  • Thrift
  • KVM VMs
  • Eucalyptus VMs
Open Science Data Cloud
  [Figure labels: sky cloud; biocloud; additional projects in planning…]

The Transformation of Systems Biology Into A Large Data Science

OCC Condominium Clouds
  • In a condominium cloud, you buy your own rack or bunch of racks.
  • The racks are managed and operated by the condominium association, in this case the OCC.
  • If your rack is 120 TB, you get the rights to approx. 40 TB of storage in the cloud. The rest is a shared resource.

To Get Involved
  • The Cistrack resource for transcriptional data: www.cistrack.org
  • Sector/Sphere cloud: sector.sourceforge.net

Thank You
  For more information: blog.rgrossman.com or www.rgrossman.com