The Transformation of Systems Biology Into A Large Data Science
 

This is a talk I gave at the Institute for Genomics & System Biology (IGSB) on December 7, 2009. The talk looks at the role of cloud computing platforms, including private clouds, for managing the large data produced by next generation sequencing platforms.


    Presentation Transcript

    • Is Systems Biology Becoming a Data Intensive Science? Assuming So, Are You Ready?
      December 7, 2009
      Robert Grossman
      Laboratory for Advanced Computing
      University of Illinois at Chicago
    • Part 1: Biology as a Data Intensive Science
      Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
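    • Growth of Genomic Data
      [Figure: timeline of the growth of Genbank, with scale marks 10^5, 10^8, and 10^10 and the years 1977, 1995, 2001, 2003, and 2005. Labeled milestones: Sanger sequencing; microarray technology; HGP; ENCODE; 454 and Solexa sequencing.]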
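    • Growth of Genomic Data (cont'd)
      [Figure: the same Genbank timeline (scale marks 10^5, 10^8, 10^10; years 1977, 1995, 2001, 2003, 2005; Sanger sequencing, microarray technology, HGP, ENCODE, 454 and Solexa sequencing), extended with computing milestones GFS, Hadoop, and AWS (years 2003, 2006, 2008) and with three sequencing goals: sequence species, sequence individuals, sequence environment.]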
    • The Challenge is to Support Cubes of High Throughput Sequence Data
      Each cell in the data cube can be a ChIP-chip, ChIP-seq, RNA-seq, movie, or other data set.
      Different developmental stages
      Different conditions
      Perturb the environment
    • We Have a Problem

      vs
      More and more of your colleagues produce so much data that they cannot easily manage & analyze it.
      Large projects build their own infrastructure.
      Everyone else is on their own.
    • [Figure: timeline of eras of science. Labels: experimental science, simulation science, data science. Years: 1609, 1670, 1976, 2003. Gains: 30x, 250x, 10x-100x, 10x-100x.]
    • To Answer today’s biological questions
      Point of View
      Analytic infrastructure
      Analytic algorithms & statistical models
      Data
    • Part 2: What is a Cloud?
    • What is a Cloud?
      Software as a Service
    • Is Anything Else a Cloud?
      Infrastructure as a Service – based upon scaling Virtual Machines (VMs)
    • Are There Other Types of Clouds?
      web search & ad targeting
      Large Data Cloud Services
    • What is Virtualization?
    • Idea Dates Back to the 1960s
      [Figure: virtualization stack. Three apps run on guest operating systems (CMS, CMS, MVS), hosted by IBM VM/370 on an IBM mainframe.]
      Native (full) virtualization. Example: VMware ESX.
      Virtualization first widely deployed with IBM VM/370.
    • What Do You Optimize?
      Goal: Minimize latency and control heat.
      Goal: Maximize data (with matching compute) and control cost.
    • Scale Is New
    • Elastic, Usage Based Pricing Is New
      1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
      • Elastic, usage based pricing turns capex into opex.
      • Clouds can be used to manage surges in computing.
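The pricing equivalence on this slide can be checked with a short calculation. This is a minimal sketch that assumes a flat price per machine-hour; the rate `RATE_CENTS` and the function name `usage_cost_cents` are hypothetical and chosen only for illustration.

```python
def usage_cost_cents(machines: int, hours: int, rate_cents_per_machine_hour: int) -> int:
    """Cost under usage-based pricing: you pay only for machine-hours consumed."""
    return machines * hours * rate_cents_per_machine_hour


RATE_CENTS = 10  # hypothetical flat price, in cents per machine-hour

# One computer in a rack for 120 hours ...
one_for_120_hours = usage_cost_cents(machines=1, hours=120, rate_cents_per_machine_hour=RATE_CENTS)
# ... versus 120 computers in three racks for 1 hour.
many_for_1_hour = usage_cost_cents(machines=120, hours=1, rate_cents_per_machine_hour=RATE_CENTS)

print(one_for_120_hours, many_for_1_hour)
```

Both calls consume 120 machine-hours, so under a flat rate the cost is identical; only the elapsed time differs.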
    • Simplicity Offered By the Cloud is New
      +
      ... and you have a computer ready to work.
      A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
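To give a sense of the programming model, here is a minimal single-process sketch of the map, shuffle, and reduce steps, applied to counting k-mers in sequence reads. It is an illustration of the MapReduce idea, not Hadoop or Sector/Sphere code; the function names and the example reads are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_kmers(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every length-k substring of one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the values collected for one key."""
    return key, sum(values)


def count_kmers(reads: Iterable[str], k: int) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for read in reads for pair in map_kmers(read, k))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


reads = ["ACGTAC", "GTACGT"]  # toy reads, for illustration only
kmer_counts = count_kmers(reads, k=3)
print(kmer_counts)
```

In a real large data cloud the framework distributes the map and reduce calls across nodes and performs the shuffle over the network; the programmer writes only the two small functions.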
    • Clouds vs Grids
    • Part 3: Case Studies
    • Case Study 1: Cistrack Large Data Cloud
      www.cistrack.org
    • Cistrack
      Resource for cis-regulatory data.
      It is open source and based upon CUBioS.
      Currently used by the White Lab at University of Chicago for managing ModENCODE fly data.
      Contains raw, intermediate, and analyzed data from approximately 240 experiments on Agilent, Affymetrix, and Solexa platforms.
    • CUBioS Applications
      Front Ends
      CUBioS
      Bowtie, TopHAT, R pipelines, etc…
      Ingestion
      Cistrack is an instance of CUBioS.
      RNA seq
      ChIPseq
      DNA capture
      etc.
    • Chromatin Developmental Time-Course
      H3K4me1 enhancers
      H3K4me3 promoters & enhancers
      H3K9Ac activation
      H3K9me3 heterochromatin
      H3K27Ac activation
      H3K27me3 repression
      PolII transcript. & promoters
      CBP HAT- enhancers
      Total RNA expression
      X
      12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre)
      8 antibodies used for ChIP experiments (Zirong Li and Bing Ren)
    • Cistrack Supports Multi-Dim. Cubes…
      Drosophila regulatory elements from Drosophila modENCODE.
      ChIP-chip data using Agilent 244K dual-color arrays.
      Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP.
      Each factor has been studied for 12 different time-points of Drosophila development.
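One way to picture the cube described above is as a mapping from (factor, time point) to the datasets stored in that cell. The sketch below is an assumption about representation only, not Cistrack's actual schema; the time points are placeholder indices 1 to 12 and the dataset identifier is hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# The eight factors named on the slide: six histone modifications, PolII, and CBP.
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1", "H3K27Ac", "H3K9Ac", "PolII", "CBP"]
# Placeholder indices for the 12 developmental time points (actual stages are not given).
TIME_POINTS = list(range(1, 13))

Cell = Tuple[str, int]

# Each cell of the cube holds a list of dataset identifiers.
cube: Dict[Cell, List[str]] = {cell: [] for cell in product(FACTORS, TIME_POINTS)}


def add_dataset(cube: Dict[Cell, List[str]], factor: str, time_point: int, dataset_id: str) -> None:
    """Attach one dataset to the cell for (factor, time_point)."""
    cube[(factor, time_point)].append(dataset_id)


add_dataset(cube, "H3K4me3", 1, "chip-chip-0001")  # hypothetical dataset identifier
print(len(cube))
```

Adding further dimensions, such as condition or platform, only lengthens the key tuple.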
    • … Each Cell in a Cube Can Be Three ChIP-Seq Datasets from a Solexa
      Cistrack integrates with large data clouds.
      Cistrack uses the Sector/Sphere large data cloud.
    • Hadoop vs Sector
      Source: Gu and Grossman, Sector and Sphere, Phil. Trans. Royal Society A, 2009.
    • Cistrack Web Portal & Widgets
      Cistrack Elastic Cloud Services
      Cistrack Database
      Analysis Pipelines & Re-analysis Services
      Cistrack Large Data Cloud Services
      Ingestion Services
    • Case Study 2: Combinatorial Analysis of Marks
    • Active Gene - Method
      Gene activeness: label each transcript t with a three-bit code XYZ.
      X = 1 if H3K4me3 binds in [-1800, min(2200, TranscriptLength)].
      Y = 1 if Pol II binds in [-1800, min(2200, TranscriptLength)].
      Z = 1 if at least one exon is ≥30% covered by RNA, and in total ≥10% is covered by RNA.
      [Figure panels: H3K4me3 to TSS distance; Pol II to TSS distance.]
      Source: Jia Chen et al. (ModENCODE)
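The XYZ labeling rule can be sketched as a small function. This is an illustration under stated assumptions, not the authors' pipeline: binding sites are given as offsets in bp relative to the TSS, each exon is given as a (length, covered bp) pair, and "in total" is read as coverage summed over all exons of the transcript. All function and parameter names are hypothetical.

```python
from typing import List, Tuple

WINDOW_START = -1800    # bp upstream of the TSS, from the slide
WINDOW_END_CAP = 2200   # bp downstream cap, from the slide


def binds_in_window(offsets: List[int], transcript_length: int) -> bool:
    """True if any binding offset falls in [-1800, min(2200, transcript_length)]."""
    window_end = min(WINDOW_END_CAP, transcript_length)
    return any(WINDOW_START <= offset <= window_end for offset in offsets)


def rna_supported(exons: List[Tuple[int, int]]) -> bool:
    """exons holds (exon_length_bp, covered_bp) pairs.

    Assumption: total coverage is computed over the summed exon lengths.
    """
    total_length = sum(length for length, _ in exons)
    if total_length == 0:
        return False
    total_covered = sum(covered for _, covered in exons)
    one_exon_ok = any(length > 0 and covered / length >= 0.30 for length, covered in exons)
    return one_exon_ok and total_covered / total_length >= 0.10


def activeness_label(k4me3_offsets: List[int], pol2_offsets: List[int],
                     transcript_length: int, exons: List[Tuple[int, int]]) -> str:
    """Return the three-character label XYZ for one transcript."""
    x = int(binds_in_window(k4me3_offsets, transcript_length))
    y = int(binds_in_window(pol2_offsets, transcript_length))
    z = int(rna_supported(exons))
    return f"{x}{y}{z}"


label = activeness_label([-500], [3000], 5000, [(1000, 400), (1000, 0)])
print(label)
```

In the example call, H3K4me3 binds inside the window, Pol II binds outside it, and one exon is 40% covered with 20% coverage overall, so the transcript is labeled 101.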
    • Promoters: Use H3K4me3, PolII & RNA to Map Active Genes
      Source: Jia Chen et al. (ModENCODE)
    • Active Genes (cont’d)
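      [Figure: panels A, B, and C. A three-way Venn diagram of PolII, H3K4me3, and RNA with region counts 1418, 332, 753, 6104, 680, 482, and 1350 (the assignment of counts to regions is not recoverable from the transcript), and two plots with the axis "bp from TSS".]
      Source: Jia Chen et al. (ModENCODE)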
    • Interesting Combinations of Marks
      [Figure: marks plotted against probes along the genome.]
      Itemsets are formed by sliding a window along the genome.
      The Apriori algorithm generates interesting itemsets.
      Post-processing retains itemsets of biological relevance.
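The two steps on this slide, forming itemsets with a sliding window and mining them with Apriori, can be sketched as follows. This is a minimal illustration that assumes each probe is annotated with the set of marks detected there and that "interesting" means meeting a minimum support count; the window width, the threshold, and the toy data are assumptions, and the biological post-processing step is not shown.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def window_itemsets(probe_marks: List[Set[str]], width: int) -> List[FrozenSet[str]]:
    """Slide a window of `width` probes along the genome; each window yields
    one itemset, the union of the marks seen at its probes."""
    return [frozenset().union(*probe_marks[i:i + width])
            for i in range(len(probe_marks) - width + 1)]


def apriori(transactions: List[FrozenSet[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset contained in at least `min_support` transactions."""
    def support(itemset: FrozenSet[str]) -> int:
        return sum(1 for t in transactions if itemset <= t)

    current = {frozenset([item]) for t in transactions for item in t}
    frequent: Dict[FrozenSet[str], int] = {}
    size = 1
    while current:
        level = {c: n for c, n in ((c, support(c)) for c in current) if n >= min_support}
        frequent.update(level)
        size += 1
        # Join frequent itemsets into candidates one item larger, then prune any
        # candidate that has an infrequent subset (the Apriori property).
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, size - 1))}
    return frequent


# Toy data: marks detected at six consecutive probes (illustrative only).
probe_marks = [{"H3K4me3"}, {"PolII"}, {"H3K4me3", "PolII"}, set(), {"H3K27me3"}, {"H3K27me3"}]
transactions = window_itemsets(probe_marks, width=2)
frequent = apriori(transactions, min_support=3)
print(frequent)
```

With this toy input, H3K4me3 and PolII co-occur in three windows and are reported together, while H3K27me3 falls below the support threshold.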
    • Case Study 3: Cistrack Elastic Cloud
    • Cistrack Elastic Cloud
      A rack in Cistrack’s Elastic Cloud contains 128 Cores and 128 TB.
      Multiple racks form a data center.
      Virtual machines can run pipelines.
      Virtual machines have access to large data services.
      There is no need to move large datasets in and out of the Amazon public cloud.
    • Use VMs to Support Reanalysis
      Replace
      Cloud
      VM
      VM
      VM
      At any time, you can launch one or more virtual machines (VMs) to redo a pipeline analysis, persist the results to a database, and persist the VM to a cloud.
    • Comparing Peak Calling Algorithms for ModENCODE
      We’re using the Cistrack Elastic Cloud to rerun peak calls for fly data using the worm pipeline.
      Also running the worm peak calling pipeline on the fly data.
    • Case Study 4: Ensembles of Trees on Clouds
      100 tree models
      data
      10,000??? tree models
      Wenxuan Gao, Robert Grossman, Philip S. Yu, Yunhong Gu, Why Naïve Ensembles Do Not Work in Cloud Computing, Proceedings of LSDM, 2009.
    • Ensembles of Trees for Clouds
      Top-k ensembles
        • Each node builds a single random tree with local data.
        • Central node picks the k best random trees to predict.
        • Lower cost with corresponding lower accuracy.
        • Shuffling data can improve accuracy.
      Skeleton ensembles
        • Central node builds k skeletons of random trees.
        • Each local node fills in the skeletons.
        • Central node merges all trees from local nodes.
        • Greater cost, but more accurate.
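The top-k idea can be sketched in a few lines. This is a simplified illustration, not the method from the cited paper: each "random tree" is reduced to a depth-1 random stump, the central node is assumed to rank trees by accuracy on a validation set it holds, features are assumed to lie in [0, 1], and prediction is by majority vote. All names and the synthetic data are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]        # (features, binary label)
Model = Callable[[Sequence[float]], int]


def majority(labels: List[int], default: int) -> int:
    """Majority label of a list, or `default` when the list is empty."""
    return int(sum(labels) * 2 >= len(labels)) if labels else default


def build_random_stump(local_data: List[Row], rng: random.Random) -> Model:
    """Local step: pick a random feature and threshold, then label the two
    leaves with the majority class of the node's own data."""
    feature = rng.randrange(len(local_data[0][0]))
    threshold = rng.uniform(0.0, 1.0)  # assumes features scaled to [0, 1]
    overall = majority([y for _, y in local_data], 0)
    left = majority([y for x, y in local_data if x[feature] <= threshold], overall)
    right = majority([y for x, y in local_data if x[feature] > threshold], overall)
    return lambda x: left if x[feature] <= threshold else right


def accuracy(model: Model, data: List[Row]) -> float:
    return sum(model(x) == y for x, y in data) / len(data)


def top_k_ensemble(partitions: List[List[Row]], validation: List[Row],
                   k: int, rng: random.Random) -> List[Model]:
    """Central step: collect one tree per node and keep the k most accurate."""
    local_models = [build_random_stump(part, rng) for part in partitions]
    ranked = sorted(local_models, key=lambda m: accuracy(m, validation), reverse=True)
    return ranked[:k]


def predict(ensemble: List[Model], x: Sequence[float]) -> int:
    """Majority vote over the selected trees."""
    return int(sum(m(x) for m in ensemble) * 2 >= len(ensemble))


rng = random.Random(0)


def make_rows(n: int) -> List[Row]:
    """Synthetic data: the label depends only on the first feature."""
    rows = []
    for _ in range(n):
        a, b = rng.random(), rng.random()
        rows.append(((a, b), int(a > 0.5)))
    return rows


partitions = [make_rows(50) for _ in range(10)]  # one partition per node
validation = make_rows(100)
ensemble = top_k_ensemble(partitions, validation, k=3, rng=rng)
prediction = predict(ensemble, (0.9, 0.1))
```

Only the trees travel to the central node, which is why the approach is cheap. If the local partitions are not representative of the whole dataset, the local trees suffer, which is the case the slide's shuffling step addresses.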
    • Experimental Studies
      Experimental studies were performed on 4 racks (104 nodes) of the Open Cloud Testbed.
      Standard ensemble-based models are more expensive than the proposed approaches and can overfit.
      Skeleton ensembles are more accurate but more expensive to build.
      Shuffling improves the accuracy of the top-k algorithm.
      For the KDDCup99 dataset, top-k ensembles with shuffling of 0.1% of the data match the accuracy of the skeleton method.
      For the UCI Census income dataset, a 20% shuffle is required, which is more expensive than the top-k ensemble.
      Without knowledge of the uniformity of the dataset, we recommend skeleton ensembles.
    • [Figure: experimental results for the KDDCup99 dataset and the Census income dataset.]
    • Part 5: Open Cloud Consortium
      Biocloud
    • Open Cloud Testbed
      Networks: C-Wave, CENIC, Dragon, MREN
      Phase 2: 9 racks, 250+ nodes, 1000+ cores, 10+ Gb/s
      • Hadoop
      • Sector/Sphere
      • Thrift
      • KVM VMs
      • Eucalyptus VMs
    • Open Science Data Cloud
      sky cloud
      additional projects in planning…
      biocloud
    • OCC Condominium Clouds
      In a condominium cloud, you buy your own rack or bunch of racks.
      The racks are managed and operated by the condominium association, in this case the OCC.
      If your rack is 120 TB, you get the rights to approx. 40 TB of storage in the cloud. The rest is a shared resource.
    • Acknowledgements
    • To Get Involved
      The Cistrack resource for transcriptional data: www.cistrack.org
      Sector/Sphere cloud: sector.sourceforge.net
    • Thank You
      For more information: blog.rgrossman.com or www.rgrossman.com