The Cancer Genomics Cloud (CGC) pilots - an Introduction

NCI Cancer Genomics Cloud (CGC) Pilots
Steve Tsang
Attain LLC
National Cancer Institute

Disclaimer
The opinions/comments/assessment expressed in this article are the author's own and do not necessarily
reflect the view of the National Cancer Institute or National Institutes of Health.
https://ethics.od.nih.gov/topics/Disclaimer.htm

Cancer Genomic Data Challenges
● > 2.5 PB of TCGA data (WXS, RNASeq, WGS)
● Fragmented repositories of cancer genomic data
○ TCGA, TARGET and CGCI have their own data repositories (DCCs)
○ Sequencing data: BAM files at CGHub while VCF/MAF files at DCC
● Assuming the 2.5 PB TCGA data set
○ Storage and Data Protection cost approximately $2,000,000 per year
○ Downloading the TCGA data at 10 Gb/s takes about 23 days
○ Only large institutions have the ability to utilize this data
○ These data types will continue to grow
Slide Courtesy of Tanja Davidsen, NCI
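
The download figure above can be checked with a short calculation. This is a minimal sketch, assuming decimal units (1 PB = 10^15 bytes) and a link that sustains its full 10 Gb/s rate with no protocol overhead; neither assumption is stated on the slide, and the function and variable names are illustrative only. The last lines also express the storage estimate per terabyte.

```python
# Back-of-the-envelope check of the transfer-time figure on the slide.
# Assumptions (not stated in the slides): decimal units (1 PB = 10**15 bytes)
# and a link that sustains its full 10 Gb/s rate with no overhead.

DATASET_BYTES = 2.5 * 10**15        # 2.5 PB of TCGA data
LINK_BITS_PER_SECOND = 10 * 10**9   # 10 Gb/s
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_days(size_bytes: float, bits_per_second: float) -> float:
    """Return the ideal transfer time in days for a dataset of size_bytes."""
    return size_bytes * 8 / bits_per_second / SECONDS_PER_DAY


days = transfer_days(DATASET_BYTES, LINK_BITS_PER_SECOND)
print(f"{days:.1f} days")  # roughly 23 days, matching the slide

# Express the slide's annual storage estimate per terabyte.
ANNUAL_STORAGE_COST_USD = 2_000_000
cost_per_tb_year = ANNUAL_STORAGE_COST_USD / (DATASET_BYTES / 10**12)
print(f"${cost_per_tb_year:,.0f} per TB per year")
```

Real transfers rarely sustain line rate, so the 23-day figure is best read as a lower bound.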

Cloud Pilots Concept: Co-located Compute & Data

Three Cancer Genomics Cloud Pilot Awardees

http://firecloud.org
FireCloud Concepts
● Data Files reside in Google Cloud Storage
● Workspaces
● Tasks and Workflows
● Method Repositories
● Provenance captured for every analysis run (i.e. which version of which method was run on which data at what time)
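
The provenance bullet names four things to capture per run: the method, its version, the data, and the time. A minimal sketch of such a record follows. The class, field, and function names are hypothetical and do not reflect FireCloud's actual schema or API; the `gs://` path in the usage note is likewise a made-up example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record of one analysis run (not FireCloud's schema)."""
    method_name: str              # which method was run
    method_version: str           # which version of that method
    input_uris: Tuple[str, ...]   # which data it was run on
    run_at: datetime              # when it was run (UTC)


def record_run(method_name: str, method_version: str,
               input_uris: Iterable[str]) -> ProvenanceRecord:
    """Build an immutable provenance record stamped with the current UTC time."""
    return ProvenanceRecord(
        method_name=method_name,
        method_version=method_version,
        input_uris=tuple(input_uris),
        run_at=datetime.now(timezone.utc),
    )
```

For example, `record_run("align", "1.2.0", ["gs://example-bucket/sample-1.bam"])` returns a frozen record, so the captured details cannot be altered after the run.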

FireCloud Overview
● The Workspace is the organizing principle for FireCloud
○ When a workspace is created, a Google bucket is automatically attached to that workspace
● The Data Model is the backbone within the workspace
○ Holds metadata and bucket pointers to input and output
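
The relationship between a workspace, its bucket, and its data model can be sketched as below. This is an illustration only: the class names, the bucket naming scheme, and the `add_entity` helper are assumptions made for this example, not FireCloud's real API, and no Google Cloud calls are made.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Entity:
    """Hypothetical data-model entry: metadata plus pointers into the bucket."""
    metadata: Dict[str, str] = field(default_factory=dict)
    file_pointers: Dict[str, str] = field(default_factory=dict)  # label -> gs:// URI


@dataclass
class Workspace:
    """Hypothetical workspace; a bucket name is attached on creation."""
    name: str
    data_model: Dict[str, Entity] = field(default_factory=dict)
    bucket: str = field(init=False)

    def __post_init__(self) -> None:
        # Stand-in for the automatic bucket attachment described above;
        # the naming scheme is invented for this sketch.
        self.bucket = f"gs://hypothetical-{uuid.uuid4().hex[:8]}"

    def add_entity(self, entity_id: str, metadata: Dict[str, str],
                   files: Dict[str, str]) -> None:
        """Store metadata and turn relative file paths into bucket pointers."""
        pointers = {label: f"{self.bucket}/{path}" for label, path in files.items()}
        self.data_model[entity_id] = Entity(dict(metadata), pointers)


ws = Workspace("demo")
ws.add_entity("sample-1", {"sample_type": "tumor"}, {"bam": "sample-1.bam"})
print(ws.data_model["sample-1"].file_pointers["bam"])
```

The data model stores only pointers, so the files themselves stay in the bucket and are never copied into the workspace record.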

http://cgc.systemsbiology.net/
… is to make TCGA data, together with tools and
compute-power, available and accessible to a broad
range of users using multiple access modes:
❏ Interactive web application
❏ Scripting languages: R, Python, SQL
❏ Direct programmatic access

❏ Build an open platform that can grow and evolve to satisfy a broad range of users and use-cases
❏ Leverage the best existing tools and technologies, as they are released
❏ Collaborate with the research community in areas of data standards, containers, workflows, etc.
❏ Provide a range of examples and tutorials to get newcomers up and running quickly

http://www.cancergenomicscloud.org/
Seven Bridges Genomics CGC Objectives
❖ The CGC aims to provide a collaborative environment where researchers can take advantage of co-localized public data (like TCGA) and public tools, but also recombine these with their private data and tools.
❖ Guiding Principles
➢ Making data available isn’t enough to make it usable.
➢ The best science happens in teams.
➢ Reproducibility shouldn’t be hard.
➢ The impact of TCGA is extended by new data & tools.

Seven Bridges Genomics CGC Features
❖ Explore processed TCGA data for mutations, copy number variations and expression levels.
❖ Analyze data from their private cohorts alongside TCGA data.
❖ Use standard bioinformatics pipelines to perform analyses.
❖ Bring their own analysis tools directly to the TCGA dataset.
❖ Collaborate with researchers around the world.
❖ Access storage and compute resources on the cloud on demand.
❖ Access the CGC using the API as

Acknowledgement
Team CGC - https://goo.gl/f21Lqq
National Cancer Institute CBIIT
CGC Fact sheet - https://cbiit.nci.nih.gov/sites/nci-cbiit/files/Cloud_Pilot_Handout.pdf
Access Cloud Pilots - https://cbiit.nci.nih.gov/ncip/nci-cancer-genomics-cloud-pilots/access-the-cloud-pilot-platforms
Broad Institute - FireCloud - http://firecloud.org
Institute for Systems Biology - Cancer Genomics Cloud - http://cgc.systemsbiology.net/
Seven Bridges Genomics - Cancer Genomics Cloud - http://www.cancergenomicscloud.org/
Attain, LLC - http://www.attain.com/
