The document discusses the National Cancer Institute's efforts to address challenges in cancer data access and analysis through the development of the NCI Genomics Data Commons and NCI Cloud Pilots. The NCI Genomics Data Commons will provide integrated genomic and clinical cancer data from projects like TCGA to researchers. The NCI Cloud Pilots aim to explore cloud-based models for analyzing large cancer genomics datasets without having to download the full datasets locally, helping to enable more widespread data access and analysis. The goal is to build a national learning health system for cancer clinical genomics through open data sharing and cloud-based approaches.
1. National Cancer Institute
U.S. DEPARTMENT OF HEALTH AND HUMAN SERVICES
National Institutes of Health
NCI Genomics Data Commons
and Cloud Pilots
September 2014
2. Overview
• National Challenges in Cancer Data
• Disruptive Technologies
• NCI Genomics Data Commons
• NCI Cloud Pilots
• Building a national learning health system
for cancer clinical genomics
3. National Challenges in Cancer
Informatics
• Lowering barriers to data access,
analysis and modeling for cancer
research
• Integration of data and learning from
basic and clinical research with
cancer care that enable prediction
and improved outcomes
4. We need:
• Open Science (Open Access, Open Data,
Open Source) and Data Liquidity for the
cancer community
• Semantic interoperability through CDEs
and Case Report Forms mapped to
standards
• Sustainable models for informatics
infrastructure, services, data
5. Where we are
Disruptive technologies
Getting social
Open access to data
6. Disruptive Technologies
• Printing
• Steam power
• Transportation
• Electricity
• Antibiotics
• Semiconductors & VLSI design
• HTTP
• High throughput biology
Systems view - end of reductionism?
7. Disruptive Technologies
• Printing
• Steam power
• Transportation
• Electricity
• Antibiotics
• Semiconductors & VLSI design
• HTTP
• High throughput biology
• Ubiquitous computing
Everyone is a data provider
Data immersion
World:
6.6B active mobile contracts
1.9B smart phone contracts
1.1B land lines
World population 7.1B
US:
345M active mobile contracts
287M smart phone contracts
US population 313M
8. What about social media?
• Social media may be one avenue for
modifying behaviors that result in cancer
• Properly orchestrated, social media can
have dramatic impact on quality of life
for patients and survivors
• It can reach into all segments of our
society, including underserved populations
9. Public Health
• These three modifiable factors (infectious
disease, smoking, and poor nutrition combined
with lack of exercise) contribute to at least
50% of our current cancer burden. The cost
from loss of quality of life, pain, and
suffering is incalculable.
10. Some NCI Big Data activities
• TCGA, TARGET and ICGC
– Cancer Genomics Data Commons
– NCI Cloud Pilots
• Molecular Clinical Trials:
– MPACT, MATCH, Exceptional Responders
12. From the Second Machine Age
From: The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant
Technologies by Erik Brynjolfsson & Andrew McAfee
13. Molecular data is Big Data
• Brief trip down memory lane
• Sequencing and the Human Genome
Project
21. TCGA history
• Initiated in 2005
• Collaboration of NHGRI and NCI to
examine GBM, Lung and Ovarian cancer
using genomic techniques in 2006.
• Expanded to 20+ tumor types.
22. TCGA drivers
• Providing high quality reference sets for
20+ tissue types
• Providing a platform for systems biology
and hypothesis generation
• Providing a test bed for understanding the
real world implications of consent and data
access policies on genomic and clinical
data.
23. Focus on TCGA
• TCGA consortium slides
• Thanks to Lou Staudt and Jean Claude
Zenklusen
24. TCGA – Lessons from structural genomics
Jean Claude Zenklusen, Ph.D.
Director, TCGA Program Office
National Cancer Institute
25. The Mutational Burden of Human Cancer
Mike Lawrence and Gaddy Getz
[Figure labels: increasing genomic complexity; childhood cancers; carcinogens]
28. Molecular Subgroups Refine Histological Diagnosis of Endometrial Carcinoma
TCGA Nature 497:67 (2013)
[Figure. Subgroups: POLE (ultra-mutated); MSI (hypermutated); copy-number low (endometrioid); copy-number high (serous-like). Tracks: mutations per Mb; PolE; MSI / MSH2; copy #; PTEN; p53; histology (endometrioid, serous). Annotation: serous misdiagnosed as endometrioid?]
29. Molecular Diagnosis of Endometrial Cancer May Influence Choice of Therapy
TCGA Nature 497:67 (2013)
[Figure. Subgroups: POLE (ultra-mutated); MSI (hypermutated); copy-number low (endometrioid); copy-number high (serous-like). Tracks: mutations per Mb; PolE; MSI / MSH2; copy #; PTEN; p53; histology. Annotations: surgery only? adjuvant radiotherapy? adjuvant chemotherapy?]
30. NCI Cancer Genomics Data Commons
[Diagram labels: GDC, NCI Genomics Data Commons; genomic + clinical data; . . .]
31. NCI Cancer Genomics Data Commons
[Diagram labels: GDC, NCI Genomics Data Commons; genomic + clinical data; . . .; cancer information donor]
32. Utility of a Cancer Knowledge Base
• Identify low-frequency cancer drivers
• Define genomic determinants of response to therapy
• Compose clinical trial cohorts sharing targeted genetic lesions
[Diagram labels: GDC; cancer information donor]
33. [Diagram labels: DACO; ICGC; dbGaP; EGA; TCGA; ERA; BAM; Open; Germ Line + EGA id]
34. [Diagram labels: ICGC BAM/FASTQ; TCGA BAM/FASTQ; ICGC Open Data (includes TCGA Open Data); COSMIC Open Data]
35. Driver for the Cloud Pilots
• An inflection point for TCGA is looming
[Chart: gigabytes (GB), axis from 0 to 2,500,000, plotted against date from 7/1/09 to 7/1/14]
36. Local copies and computation
• Assuming the 2.5 PB TCGA data set:
• Storage and backups: ~$1M US
• Downloading TCGA data at 10 Gb/sec:
~23 days
• Size + high dimensionality = high
computational requirements that grow
quickly
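The download estimate on this slide can be checked with a quick calculation. This is a minimal sketch, assuming decimal units (1 PB = 10^15 bytes) and an ideal, fully utilized link with no protocol overhead; the function name transfer_days is illustrative and does not come from the slides.

```python
# Back-of-the-envelope transfer time for the figures quoted on the slide.
# Assumptions (not from the source): decimal units (1 PB = 10**15 bytes)
# and a fully utilized link with no protocol overhead.

SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_sec: float) -> float:
    """Return the ideal transfer time in days for a payload over a link."""
    seconds = size_bytes * 8 / link_bits_per_sec  # bytes -> bits, then divide by rate
    return seconds / SECONDS_PER_DAY


TCGA_BYTES = 2.5e15  # 2.5 PB, as stated on the slide
LINK_BPS = 10e9      # 10 Gb/sec, as stated on the slide

days = transfer_days(TCGA_BYTES, LINK_BPS)
print(f"{days:.1f} days")  # prints "23.1 days"
```

Under these assumptions the result matches the slide's 23 days. A real transfer would likely take longer, since protocol overhead and shared links reduce effective throughput.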
37. Relationship of the Cancer Genomics Data Commons and NCI Cloud Pilots
[Diagram labels: GDC, NCI Genomics Data Commons; NCI Cloud Computational Centers; periodic data freezes; search / retrieve; analysis]
39. NCI Cloud Pilots
• Funding for up to 3 cloud pilots: 24-month
pilots meant to inform the
Cancer Genomics Data Commons
– Explore models for cancer genomics APIs
– Explore cloud models for data+analysis
40. NCI Cloud Pilots
• A way to move computation to the data
• Sustainable models for providing access
to data
• Reproducible pipelines for QA, variant
calling, knowledge sharing
• Define genomics/phenomics APIs for
discovering new variants contributing to
cancer, enhancing response, modulating
risk
41. The future
• Elastic computing ‘clouds’
• Social networks
• Big Data analytics
• Precision medicine
• Measuring health
• Practicing protective medicine
Semantic and synoptic data
Intervening before health is compromised
Learning systems that enable learning from every cancer patient