The Transformation of Systems Biology Into A Large Data Science
 


This is a talk I gave at the Institute for Genomics & System Biology (IGSB) on December 7, 2009. The talk looks at the role of cloud computing platforms, including private clouds, for managing the large data produced by next generation sequencing platforms.

Presentation Transcript

  • Is Systems Biology Becoming a Data Intensive Science? Assuming So, Are You Ready?
    December 7, 2009
    Robert Grossman
    Laboratory for Advanced Computing
    University of Illinois at Chicago
  • Part 1: Biology as a Data Intensive Science.
    Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
  • Growth of Genomic Data
    [Figure: timeline of Genbank growth. Scale marks: 10^5, 10^8, 10^10. Years marked: 1977, 1995, 2001, 2003, 2005. Labels: Sanger sequencing; microarray technology; 454 and Solexa sequencing; HGP; ENCODE.]
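  • Growth of Genomic Data
    [Figure: the same Genbank timeline (scale marks 10^5, 10^8, 10^10; years 1977, 1995, 2001, 2003, 2005; Sanger sequencing, microarray technology, 454 and Solexa sequencing, HGP, ENCODE), extended with the labels "sequence species", "sequence individuals" and "sequence environment", and with GFS, Hadoop and AWS marked against the years 2003, 2006 and 2008.]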
  • The Challenge is to Support Cubes of High Throughput Sequence Data
    Each cell in data cube can be ChIP-chip, ChIP-seq, RNA-seq, movie, etc. data set.
    Different developmental stages
    Different conditions
    Perturb the environment
  • We Have a Problem

    More and more of your colleagues produce so much data that they cannot easily manage & analyze it.
    Large projects build their own infrastructure.
    Everyone else is on their own.
  • [Figure: timeline of gains in scientific instrumentation. Marks: 1609 (30x), 1670 (250x), 1976 (10x-100x), 2003 (10x-100x). Eras labeled: experimental science, simulation science, data science.]
  • To Answer today’s biological questions
    Point of View
    Analytic infrastructure
    Analytic algorithms & statistical models
    Data
  • Part 2: What is a Cloud?
  • What is a Cloud?
    Software as a Service
  • Is Anything Else a Cloud?
    Infrastructure as a Service – based upon scaling Virtual Machines (VMs)
  • Are There Other Types of Clouds?
    web search & ad targeting
    Large Data Cloud Services
  • What is Virtualization?
  • Idea Dates Back to the 1960s
    [Figure: applications running on guest operating systems (CMS, CMS, MVS) hosted by IBM VM/370 on an IBM mainframe.]
    Native (Full) Virtualization
    Examples: VMware ESX
    Virtualization was first widely deployed with IBM VM/370.
  • What Do You Optimize?
    Goal: Minimize latency and control heat.
    Goal: Maximize data (with matching compute) and control cost.
  • Scale is New
  • Elastic, Usage Based Pricing Is New
    1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
    • Elastic, usage based pricing turns capex into opex.
    • Clouds can be used to manage surges in computing.
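The arithmetic behind the slide can be sketched in a few lines. This is only an illustration: the function name and the hourly rate are assumptions, not figures from the talk.

```python
def usage_cost(machines: int, hours: float, rate_per_machine_hour: float) -> float:
    """Cost under usage-based pricing: you pay only for machine-hours consumed."""
    return machines * hours * rate_per_machine_hour


RATE = 0.10  # hypothetical price in dollars per machine-hour

# One machine for 120 hours versus 120 machines for one hour.
slow = usage_cost(machines=1, hours=120, rate_per_machine_hour=RATE)
fast = usage_cost(machines=120, hours=1, rate_per_machine_hour=RATE)
```

Both runs consume 120 machine-hours, so the bill is identical; only the time to the answer changes.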
  • Simplicity Offered By the Cloud is New
    [Figure: two items joined by a plus sign; images not preserved] ... and you have a computer ready to work.
    A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
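To give a sense of how little the programmer has to write, here is a minimal single-process sketch of the MapReduce pattern. It is not Hadoop or Sector/Sphere code; the k-mer counting task and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_read(read, k=3):
    """Map step: emit (k-mer, 1) for every k-mer in one sequence read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle step: group values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def run_job(reads, k=3):
    """Run map, shuffle and reduce in sequence on a list of reads."""
    pairs = (pair for read in reads for pair in map_read(read, k))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


reads = ["ACGTAC", "GTACGT"]  # toy input
kmer_counts = run_job(reads)
```

In a real framework the programmer supplies only the map and reduce functions; partitioning, shuffling and fault tolerance are handled by the cloud.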
  • Clouds vs Grids
  • Part 3: Case Studies
  • Case Study 1: Cistrack Large Data Cloud
    www.cistrack.org
  • Cistrack
    Resource for cis-regulatory data.
    It is open source and based upon CUBioS.
    Currently used by the White Lab at University of Chicago for managing ModENCODE fly data.
    Contains raw, intermediate, and analyzed data from approximately 240 experiments on Agilent, Affy and Solexa platforms.
  • CUBioS Applications
    Front Ends
    CUBioS
    Bowtie, TopHat, R pipelines, etc…
    Ingestion
    Cistrack is an instance of CUBioS.
    RNA seq
    ChIPseq
    DNA capture
    etc.
  • Chromatin Developmental Time-Course
    H3K4me1 enhancers
    H3K4me3 promoters & enhancers
    H3K9Ac activation
    H3K9me3 heterochromatin
    H3K27Ac activation
    H3K27me3 repression
    PolII transcript. & promoters
    CBP HAT- enhancers
    Total RNA expression
    X
    12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre)
    8 antibodies used for ChIP experiments (Zirong Li and Bing Ren)
  • Cistrack Supports Multi-Dim. Cubes…
    Drosophila regulatory elements from Drosophila modENCODE.
    ChIP-chip data using Agilent 244K dual-color arrays.
    Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP.
    Each factor has been studied for 12 different time-points of Drosophila development.
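One way to picture such a cube is a mapping from (factor, time point) cells to the datasets stored for that cell. The sketch below is an assumption about layout, not Cistrack's actual schema; the integer time-point labels and the dataset identifier are hypothetical.

```python
from itertools import product

# The eight factors named on the slide: six histone modifications, PolII and CBP.
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
           "H3K27Ac", "H3K9Ac", "PolII", "CBP"]
TIME_POINTS = list(range(1, 13))  # 12 developmental time points; labels are placeholders

# Each cell maps (factor, time point) to the dataset identifiers stored for it.
cube = {cell: [] for cell in product(FACTORS, TIME_POINTS)}


def add_dataset(cube, factor, time_point, dataset_id):
    """Attach a dataset to one cell of the cube."""
    cube[(factor, time_point)].append(dataset_id)


def slice_by_factor(cube, factor):
    """Return the time course for one factor as {time point: dataset ids}."""
    return {t: ids for (f, t), ids in cube.items() if f == factor}


add_dataset(cube, "H3K4me3", 1, "chip-chip-0001")  # hypothetical identifier
h3k4me3_course = slice_by_factor(cube, "H3K4me3")
```

With 8 factors and 12 time points the cube has 96 cells, and a cell can hold more than one dataset.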
  • … Each Cell in a Cube Can Be Three ChIP-Seq Datasets from a Solexa
    Cistrack integrates with large data clouds.
    Cistrack uses the Sector/Sphere large data cloud.
  • Hadoop vs Sector
    Source: Gu and Grossman, Sector and Sphere, Phil. Trans. Royal Society A, 2009.
  • Cistrack Web Portal & Widgets
    Cistrack Elastic Cloud Services
    Cistrack Database
    Analysis Pipelines & Re-analysis Services
    Cistrack Large Data Cloud Services
    Ingestion Services
  • Case Study 2: Combinatorial Analysis of Marks
  • Active Gene - Method
    Gene activeness: label a transcript t as XYZ, where
    X=1 if H3K4me3 binds in [-1800, min(2200, TranscriptLength)]
    Y=1 if Pol II binds in [-1800, min(2200, TranscriptLength)]
    Z=1 if at least one exon is ≥30% covered by RNA, and in total ≥10% is covered by RNA.
    [Figure panels: K4Me3 to TSS distance; Pol II to TSS distance]
    Source: Jia Chen et al. (ModENCODE)
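The labeling rule can be written down directly. The sketch below is not the authors' code: it assumes that "binds in" means a peak position, given as an offset in bp from the TSS, falls inside the window, and that RNA coverage is supplied as fractions between 0 and 1. All function and parameter names are hypothetical.

```python
def in_promoter_window(peak_offsets, transcript_length):
    """True if any peak lies in [-1800, min(2200, transcript_length)] relative to the TSS."""
    upper = min(2200, transcript_length)
    return any(-1800 <= offset <= upper for offset in peak_offsets)


def rna_supported(exon_coverage, total_coverage):
    """True if some exon is >= 30% covered by RNA and total coverage is >= 10%."""
    return any(c >= 0.30 for c in exon_coverage) and total_coverage >= 0.10


def activeness_label(k4me3_offsets, pol2_offsets, exon_coverage,
                     total_coverage, transcript_length):
    """Return the three-character XYZ label for one transcript."""
    x = int(in_promoter_window(k4me3_offsets, transcript_length))
    y = int(in_promoter_window(pol2_offsets, transcript_length))
    z = int(rna_supported(exon_coverage, total_coverage))
    return f"{x}{y}{z}"


# A 1500 bp transcript: H3K4me3 peak near the TSS, Pol II peak past the window,
# one exon half covered by RNA and 20% coverage overall.
label = activeness_label([-200], [2100], [0.5, 0.05], 0.2, 1500)
```

Here the window's upper bound is capped at the transcript length (1500 bp), so the Pol II peak at +2100 does not count and the label is 101.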
  • Promoters: Use H3K4me3, PolII & RNA to Map Active Genes
    Source: Jia Chen et al. (ModENCODE)
  • Active Genes (cont’d)
    [Figure: panels A, B and C. Overlap of PolII, H3K4me3 and RNA, with region counts 1418, 332, 753, 6104, 680, 482 and 1350; two panels plotted against bp from TSS.]
    Source: Jia Chen et al. (ModENCODE)
  • Interesting Combinatorial Combination of Marks
    Probes along genome

    Marks
    Itemsets are formed by sliding a window along the genome.
    The Apriori algorithm generates interesting itemsets.
    Post-processing retains itemsets of biological relevance.
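A minimal sketch of the first two steps, under stated assumptions: each probe carries the set of marks detected there, a window's itemset is the union of marks over consecutive probes, and "interesting" is approximated by a minimum support count. The window width, threshold and toy data are hypothetical, and the post-processing step is not shown.

```python
from itertools import combinations


def window_itemsets(probe_marks, width):
    """Union the marks seen in each window of `width` consecutive probes."""
    return [frozenset().union(*probe_marks[i:i + width])
            for i in range(len(probe_marks) - width + 1)]


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    frequent = count({frozenset([m]) for t in transactions for m in t})
    result = dict(frequent)
    size = 2
    while frequent:
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, size - 1))}
        frequent = count(candidates)
        result.update(frequent)
        size += 1
    return result


# Toy data: marks detected at five consecutive probes.
probe_marks = [{"H3K4me3"}, {"H3K4me3", "PolII"}, {"PolII"}, set(), {"H3K27me3"}]
frequent_sets = apriori(window_itemsets(probe_marks, width=2), min_support=2)
```

On this toy input the pair {H3K4me3, PolII} appears in two windows and survives, while H3K27me3 appears in only one and is dropped.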
  • Case Study 3: Cistrack Elastic Cloud
  • Cistrack Elastic Cloud
    A rack in Cistrack’s Elastic Cloud contains 128 Cores and 128 TB.
    Multiple racks form a data center.
    Virtual machines can run pipelines.
    Virtual machines have access to large data services.
    No need to move large datasets in and out of Amazon public cloud.
  • Use VMs to Support Reanalysis
    [Figure: a cloud hosting several VMs]
    At any time, you can launch one or more virtual machines (VMs) to redo pipeline analysis, persist the results to a database, and persist the VM to a cloud.
  • Comparing Peak Calling Algorithms for ModENCODE
    We’re using the Cistrack Elastic Cloud to rerun peak calls for fly data using the worm pipeline.
    Also running the worm peak calling pipeline on the fly data.
  • Case Study 4: Ensembles of Trees on Clouds
    100 tree models
    data
    10,000??? tree models
    Wenxuan Gao, Robert Grossman, Philip S. Yu, Yunhong Gu, Why Naïve Ensembles Do Not Work in Cloud Computing, Proceedings of LSDM, 2009.
  • Ensembles of Trees for Clouds
    Top-k ensembles
    Each node builds single random tree with local data.
    Central node picks k best random trees to predict.
    Lower cost with corresponding lower accuracy.
    Shuffling data can improve accuracy.
    Skeleton ensembles
    Central node builds k skeletons of random trees.
    Each local node fills in the skeletons.
    Central node merges all trees from local nodes.
    Greater cost, but more accurate.
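The top-k idea can be sketched on a single machine. This is a simplified illustration, not the method from the cited paper: a one-split random stump stands in for a random tree, the "k best" are chosen by accuracy on a held-out validation set (an assumption, since the slide does not say how trees are ranked), and all names and the synthetic data are hypothetical.

```python
import random


def build_random_stump(rows, rng):
    """Stand-in for a random tree: random feature and threshold, majority label per side."""
    feature = rng.randrange(len(rows[0][0]))
    threshold = rng.choice(rows)[0][feature]

    def majority(labels, default):
        return max(set(labels), key=labels.count) if labels else default

    default = majority([y for _, y in rows], 0)
    left = majority([y for x, y in rows if x[feature] <= threshold], default)
    right = majority([y for x, y in rows if x[feature] > threshold], default)
    return feature, threshold, left, right


def predict(stump, x):
    feature, threshold, left, right = stump
    return left if x[feature] <= threshold else right


def accuracy(stump, rows):
    return sum(predict(stump, x) == y for x, y in rows) / len(rows)


def top_k_ensemble(partitions, validation, k, seed=0):
    rng = random.Random(seed)
    # Each node builds a single random model from its local partition only.
    local_models = [build_random_stump(part, rng) for part in partitions]
    # The central node keeps the k models that score best on held-out data.
    return sorted(local_models, key=lambda m: accuracy(m, validation), reverse=True)[:k]


def ensemble_predict(models, x):
    """Majority vote over the selected models."""
    votes = [predict(m, x) for m in models]
    return max(set(votes), key=votes.count)


def make_rows(n, rng):
    """Synthetic data: two random features, label depends on the first one."""
    rows = []
    for _ in range(n):
        a, b = rng.random(), rng.random()
        rows.append(((a, b), int(a > 0.5)))
    return rows


data_rng = random.Random(1)
rows = make_rows(100, data_rng)
partitions = [rows[i::4] for i in range(4)]  # four simulated nodes
validation = make_rows(20, data_rng)
models = top_k_ensemble(partitions, validation, k=2)
```

Only the small models travel to the central node, which is what keeps the cost low; the data itself stays where it is unless some of it is shuffled between nodes first.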
  • Experimental Studies
    Performed experimental studies on 4 racks (104 nodes) of Open Cloud Testbed.
    Standard ensemble based models are more expensive than proposed approaches and can overfit.
    Skeleton ensembles are more accurate but more expensive to build.
    Shuffling improves accuracy of top-k algorithm.
    For the KDDCup99 dataset, top-k ensembles with 0.1% of the data shuffled match the accuracy of the skeleton method.
    For the UCI Census income dataset, a 20% shuffle is required, which is more expensive than the top-k ensemble.
    Without knowledge of the uniformity of the dataset, we recommend skeleton ensembles.
  • KDDCup99 dataset
    Census income dataset
  • Part 5. Open Cloud Consortium
    Biocloud
  • Open Cloud Testbed
    C-Wave
    CENIC
    Dragon
    Phase 2
    9 racks
    250+ Nodes
    1000+ Cores
    10+ Gb/s
    • Hadoop
    • Sector/Sphere
    • Thrift
    • KVM VMs
    • Eucalyptus VMs
    MREN
  • Open Science Data Cloud
    sky cloud
    additional projects in planning…
    biocloud
  • OCC Condominium Clouds
    In a condominium cloud, you buy your own rack or bunch of racks.
    The racks are managed and operated by the condominium association, in this case the OCC.
    If your rack is 120 TB, you get the rights to approx. 40 TB of storage in the cloud. The rest is a shared resource.
  • Acknowledgements
  • To Get Involved
    The Cistrack resource for transcriptional data: www.cistrack.org
    Sector/Sphere cloud: sector.sourceforge.net
  • Thank You
    For more information: blog.rgrossman.com or www.rgrossman.com