Bioclouds CAMDA (Robert Grossman) 09-v9p

Cloud-Based Services for Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack. October 6, 2009. Robert Grossman, Laboratory for Advanced Computing, University of Illinois at Chicago; Open Data Group; Institute for Genomics & Systems Biology, University of Chicago.

Cistrack Team (UIC & U. Chicago)

Robert Grossman, Yunhong Gu, David Hanley, Xiangjun Liu

Projected sequencing capabilities (world-wide). Figure: projected capacity in log10 billions of base pairs, with milestones at 2019 (one of each described species), 2023 (one of each species, ~10M estimate), 2031 (one of each species, ~100M estimate) and 2060 (total human population). Source: Kevin White, unpublished.

Is Biology a Large Data Science? CPUs double approximately every 18 months (Moore’s Law). Disks double every 12-15 months (Johnson’s Law). The amount of publicly available sequence data is doubling approximately every 12 months.
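The doubling times on this slide compound quickly. As a rough illustration only, the sketch below computes growth over ten years for each doubling time. The helper names `growth_factor` and `decade_growth` are hypothetical, and 13.5 months is an assumed midpoint of the quoted 12-15 month range for disks.

```python
def growth_factor(doubling_months: float, years: float) -> float:
    """Return the multiplicative growth after `years` for a given doubling time."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Doubling times in months, taken from the slide. The disk value is an
# assumed midpoint of the quoted 12-15 month range, not a figure from the talk.
DOUBLING_MONTHS = {
    "CPUs": 18.0,
    "disks": 13.5,
    "sequence data": 12.0,
}


def decade_growth() -> dict:
    """Growth factor over 10 years for each resource."""
    return {
        name: growth_factor(months, years=10.0)
        for name, months in DOUBLING_MONTHS.items()
    }


if __name__ == "__main__":
    for name, factor in decade_growth().items():
        print(f"{name}: ~{factor:,.0f}x in 10 years")
```

Under these assumptions sequence data grows about 1,000-fold in a decade while CPUs grow about 100-fold, so the gap between data and compute widens roughly tenfold.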

IBM joins race for $100 personal genome.

We Have a Problem. More and more of your colleagues (e.g. the biologist down the hall) with access to modern instruments are producing so much data that they cannot easily manage, analyze and archive it. Large projects build their own infrastructure. Almost all other biologists are on their own.

Point of View: To do research today requires data, analytic algorithms & statistical models, and analytic infrastructure.

What is a Cloud? Software as a Service.

Is Anything Else a Cloud? Infrastructure as a Service, based upon scaling Virtual Machines (VMs).

Are There Other Types of Clouds? Large Data Cloud Services (example: ad targeting).

Idea Dates Back to the 1960s. Virtualization was first widely deployed with IBM VM/370. Diagram: apps running on guest operating systems (CMS, CMS, MVS) on top of IBM VM/370 on an IBM mainframe. This is native (full) virtualization; a current example is VMware ESX.

One Definition: Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center. There is no standard definition. Cloud architectures are not new. What is new: scale, ease of use, and the pricing model.

Elastic, Usage-Based Pricing Is New. 1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
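The arithmetic behind this slide is that usage-based cost depends only on total machine-hours. A minimal sketch, assuming a flat hourly rate per machine; the function `usage_cost` and the rate of 0.10 are hypothetical and not taken from any provider's price list:

```python
def usage_cost(machines: int, hours: float, rate_per_machine_hour: float) -> float:
    """Cost under flat usage-based pricing: machine-hours times the hourly rate."""
    return machines * hours * rate_per_machine_hour


# Hypothetical flat rate in dollars per machine-hour (illustrative only).
RATE = 0.10

one_machine_long = usage_cost(machines=1, hours=120, rate_per_machine_hour=RATE)
many_machines_short = usage_cost(machines=120, hours=1, rate_per_machine_hour=RATE)

if __name__ == "__main__":
    print(f"1 machine x 120 hours:  ${one_machine_long:.2f}")
    print(f"120 machines x 1 hour:  ${many_machines_short:.2f}")
```

Both configurations consume 120 machine-hours, so the bill is the same while the wall-clock time drops from 120 hours to 1, provided the job parallelizes.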

Clouds can be used to manage surges in computing.

Figure: timeline of advances labeled experimental science, simulation science and data science, with marked points 1609 (30x), 1670 (250x), 1976 (10x-100x) and 2004 (10x-100x).

Part 3: Cistrack. www.cistrack.org

Cistrack: Resource for cis-regulatory data. It is open source and based upon CUBioS. Currently used by the White Lab at University of Chicago for managing modENCODE fly data. Contains raw, intermediate, and analyzed data from approximately 240 experiments from Agilent, Affy and Solexa platforms.

Chromatin Developmental Time-Course
H3K4me1: enhancers
H3K4me3: promoters & enhancers
H3K9Ac: activation
H3K9me3: heterochromatin
H3K27Ac: activation
H3K27me3: repression
PolII: transcription & promoters
CBP: HAT, enhancers
Total RNA: expression
× 12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre). 8 antibodies used for ChIP experiments (Zirong Li and Bing Ren).

1. Cistrack Supports Cubes of Data. Drosophila regulatory elements from Drosophila modENCODE. ChIP-chip data using Agilent 244K dual-color arrays. Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP. Each factor has been studied at 12 different time points of Drosophila development.
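One way to picture the "cube" described on this slide is as a grid of factor × time point cells, each holding probe-level measurements. The sketch below is an illustration only, not Cistrack's actual schema: the function names, the integer time-point labels 1..12, and the probe id and value are all hypothetical.

```python
from itertools import product

# The eight factors named on the slide: six histone modifications, PolII and CBP.
FACTORS = [
    "H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
    "H3K27Ac", "H3K9Ac", "PolII", "CBP",
]
# Twelve developmental time points; the labels 1..12 are an assumption.
TIME_POINTS = list(range(1, 13))


def build_cube() -> dict:
    """Create one empty cell (a list of measurements) per (factor, time point)."""
    return {(factor, t): [] for factor, t in product(FACTORS, TIME_POINTS)}


def slice_by_factor(cube: dict, factor: str) -> dict:
    """Return the time course for one factor as {time_point: measurements}."""
    return {t: values for (f, t), values in cube.items() if f == factor}


cube = build_cube()
# Hypothetical probe id and enrichment value, for illustration only.
cube[("H3K4me3", 1)].append(("probe_0001", 2.5))

if __name__ == "__main__":
    print(len(cube), "cells")
    print(slice_by_factor(cube, "H3K4me3")[1])
```

Keying cells by `(factor, time_point)` makes both kinds of slice cheap to express: one factor across development, or all factors at one time point.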

2. ChIP-Seq Data Volumes are Large. Cistrack integrates with large data clouds.

3. Continuous Reanalysis is Desirable. In general, it is quite labor intensive to reanalyze your existing data with a new algorithm. Cistrack supports VMs that can simplify re-applying Cistrack pipelines that have been updated to include a new algorithm.
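The idea of continuous reanalysis can be sketched as version bookkeeping: each stored result records which pipeline version produced it, and anything older than the current version is re-run from raw data. This is a minimal sketch under that assumption, not Cistrack's implementation; `Result`, `needs_rerun` and `reanalyze` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Result:
    experiment_id: str
    pipeline_version: int  # version of the pipeline that produced `value`
    value: float


def needs_rerun(results: Dict[str, Result], experiment_id: str, current_version: int) -> bool:
    """True if the experiment has no result yet or its result is from an older pipeline."""
    result = results.get(experiment_id)
    return result is None or result.pipeline_version < current_version


def reanalyze(
    raw_data: Dict[str, List[float]],
    results: Dict[str, Result],
    pipeline: Callable[[List[float]], float],
    current_version: int,
) -> List[str]:
    """Re-apply `pipeline` to every stale experiment; return the ids that were re-run."""
    rerun = [eid for eid in raw_data if needs_rerun(results, eid, current_version)]
    for eid in rerun:
        results[eid] = Result(eid, current_version, pipeline(raw_data[eid]))
    return rerun
```

For example, if version 2 of a pipeline swaps a mean for `statistics.median`, calling `reanalyze(raw_data, results, statistics.median, 2)` touches only experiments whose stored result predates version 2. Keeping raw data alongside results is what makes this possible.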

Cistrack Architecture: Cistrack Web Portal & Widgets; Cistrack Database; Analysis Pipelines & Re-analysis Services; Cistrack Cloud Services; Ingestion Services.

Part 4: Reanalysis. Can you repeat an analytic pipeline one year after a post-doc leaves your lab?

Promoters: Use H3K4me3, PolII & RNA to Map Active Genes

Promoters: Use of H3K4me3, PolII & RNA to Map Active Genes
