How Data Commons are Changing the Way that Large Datasets Are Analyzed and Shared

How data commons are changing the way that large
genomic and clinical datasets are analyzed and shared
Robert Grossman
Center for Data Intensive Science
University of Chicago &
Open Commons Consortium
Wellcome Genome Campus Workshop on Big Data in Health and Biology
September 26, 2017

Data Commons
2014 - 2024
Data Clouds
2010 - 2020
Databases &
Repositories
1982 - present

1. Big Data in Biology, Medicine & Health Care:
A Thirty Year Perspective

1993 2004
Data Mining
& KDD
1984
Computationally
Intensive Statistics
Predictive
Analytics
2011
PageRank
Mobile
Internet
POS
Direct marketing
Big Data
& Data Science
AI (redux) /
Deep Learn.
2016
CNN, ANN
Labeled
images etc.
Hamiltonian
Monte Carlo

Deep Learning (DL) Dogma
• Use Deep Neural Networks (DNN) with lots of layers (10’s to
100’s of layers).
• Try using Convolutional Neural Networks (CNN), even if the
problem is not translation invariant.
• Represent the inputs and internal states as long vectors of
numbers (even if the input is an image, text, spoken voice,
etc., that has structure).
• Train with very large amounts of labeled data.
• Don’t worry about the internal structure of the model, just its
accuracy and coverage.

                                  | DNN (2016)                      | Data Mining (1996)
Use lots of parameters            | 10’s to 100’s of hidden layers  | Large trees, large ensembles
Long vectors for inputs           | Part of the dogma               | Generally the case
Use as much data as possible      | Yes                             | Yes
Do we care about the internal     | No                              | No
structure of the model?           |                                 |
Hardware                          | GPU & custom chips              | Clusters of workstations
Labeled data                      | Often the limiting factor       | Often the limiting factor
Features                          | Not needed                      | Generally the hard part

• In some sense, Deep Learning is
eating the world.
• Compare: Marc Andreessen, Why
Software Is Eating The World, Wall
Street Journal, August 20, 2011.
• From a broader perspective,
Machine Learning (ML) continues
to eat the world, as it has been
doing for the last 20 years, driven by
the exponential growth in the
amount of data and the
computational power available to
estimate parameters.

2. Data Commons, An Emerging Platform for
Data Science

Data Commons
Data commons co-locate data, storage and computing infrastructure
with commonly used services, tools & apps for analyzing and sharing
data to create an interoperable resource for the research community.*
*Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson and Walt Wells, A Case for Data Commons: Toward Data Science as a Service, IEEE Computing in
Science & Engineering, 2016. Source of image: The CDIS, GDC, & OCC data commons infrastructure at the University of Chicago Kenwood Data Center.

NCI Genomic Data Commons*
• Launched in 2016
with over 4 PB of
data.
• Joint project with
OICR.
• Used by 1500 -
2000+ users per
day.
• Based upon an
open source
software stack that
can be used to
build other data
commons.
*See: NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer
genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112.

System 1: Data Portals to Explore and Submit Data

System 2: Data Harmonization System to Analyze All
of the Submitted Data with Common Pipelines
• MuSE
(MD Anderson)
• VarScan2 (Washington
Univ.)
• SomaticSniper
(Washington Univ.)
• MuTect2
(Broad Institute)
Source: Zhenyu Zhang, et al. and the GDC Project Team, Uniform Genomic Data Analysis in
the NCI Genomic Data Commons, to appear.

System 3: User Defined Applications and Notebooks
to Create a Data Ecosystem
https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state
• The GDC has a REST API so that researchers can develop their own
applications.
• There are third party applications that use the REST API for Python, R,
Jupyter notebooks and Shiny.
• The REST API drives the GDC data portal, data submission system, etc.

GDC Application Programming Interface (API) –
To Build Applications
https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state
API URL: https://gdc-api.nci.nih.gov/
Endpoint: files
Optional entity ID: 5003adf1-1cfd-467d-8234-0d396422a4ee
Query parameters: fields=state
• Based upon data model
• Drives internally developed applications, e.g. data portal
• Allows third parties to develop their own applications
• Can be used by other commons, by workspaces, by other
systems, by user-developed applications and notebooks
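As a minimal sketch of how a user-developed application might assemble such a request, the Python below builds the URL from the parts named on the slide (API URL, endpoint, optional entity ID, query parameters). The helper names `build_gdc_url` and `fetch_json` are hypothetical, and the host name is copied from the slide, so it may have changed since the talk was given.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL as printed on the slide (assumption: it may have changed since 2017).
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_gdc_url(endpoint, entity_id=None, **query):
    """Join API URL + endpoint + optional entity ID + query parameters."""
    parts = [GDC_API_BASE, endpoint]
    if entity_id is not None:
        parts.append(entity_id)
    url = "/".join(parts)
    if query:
        url += "?" + urlencode(query)
    return url


def fetch_json(url, timeout=30):
    """Hypothetical helper: GET the URL and decode the JSON body (needs network)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Rebuild the example request shown on the slide.
url = build_gdc_url(
    "files",
    entity_id="5003adf1-1cfd-467d-8234-0d396422a4ee",
    fields="state",
)
print(url)
```

Calling `fetch_json(url)` would then issue the request; it is left uncalled here so that the sketch runs without network access.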

Purple balls are PCA-based analysis of RNA-seq data for lung adenocarcinoma.
Grey are associated with lung squamous cell carcinoma. Green appear to be
misdiagnosed.
The GDC enables
bioinformaticians to
build their own
applications using
the GDC API.
Source: Center for Data Intensive Science, University of Chicago.
Shiny R app
built using
the GDC API

Databases (1982 - present)
• Data repository
• Researchers download data.
Data Clouds (2010 - 2020)
• Supports big data with cloud computing
• Researchers can analyze data with collaborative tools
(workspaces), i.e. data does not have to be downloaded
Data Commons (2014 - 2024)
• Supports big data
• Workspaces
• Common data models
• Core data services
• Harmonized data
• Governance

3. GDC Gen3 Open Source Software Stack

OCC Open Science Data Cloud (2010)
OCC – NASA Project Matsu (2009)
NCI Genomic Data Commons* (2016)
OCC-NOAA Environmental Data
Commons (2016)
OCC Blood Profiling
Atlas in Cancer (2017)
Bionimbus Protected Data Cloud* (2013)
*Operated under a subcontract from NCI / Leidos Biomedical
to the University of Chicago with support from the OCC.
Brain Commons
(2017)
Kids First Data
Resource (2017)
Gen3
Gen2
Gen1

The Gen3 Data Model
Is Customizable &
Extensible
• BloodPAC
• Brain Commons
• Wellness Commons
• Kids First Data Resource

Object-based
storage with access
control lists
Scalable light
weight workflow
Community
data
products
Data Commons Framework Services (Digital ID, Metadata, Authentication, Auth.,
etc.) that support multiple data commons.
Apps
Database
services
Architecture used by
Gen3 Data Commons
Data Commons 1
Data Commons 2
Portals for
accessing &
submitting
data
Workspaces
APIs
Data Commons Framework Services
Workspaces
Workspaces
Notebooks
Apps
Apps & Notebooks

Core Data Commons Framework Services
• Digital ID services
• Metadata services
• Authentication services
• Authorization services
• Designed to span multiple data commons
• Designed to support multiple private and commercial
clouds
• In the future, we will support portable workspaces

Open Source Software for Data
Commons
Existing open
source apps
Commercial
apps
New FOSS sponsor
funded apps
Public Clouds
Data managed by the commons
Private Clouds
CSOC
(Common
Services Ops
Center)
Data Commons Management & Governance
Sponsor (e.g. funder or consortium of funders)
OCC Data Commons
Framework
1
2
3
0

Data Commons
Framework Services
Private Academic Cloud
Univ. of Chicago
CSOC (ops center)
Cross Cloud Services

NCI Clouds
Pilots
Compliant
apps
Bionimbus
PDC & other
clouds
FAIR Principles
NCI GDC
Other data commons
Data Peering
Principles
Commons
Services
Operations
Center
Commons
services
Commons Services
Framework
app
app
app

Summary
1. Designed to support disease specific, project specific or
consortium specific data commons, including governance
model.
2. Designed to support multiple data commons that peer and
interoperate.
3. Designed to support an ecosystem of FAIR-based
applications.
4. The core underlying software stack is open source.
5. Data commons governance model in which data is public and
you “pay for compute”.
6. Supported by the independent not-for-profit Open Commons
Consortium.

Data Commons
2014 - 2024
Data Clouds
2010 - 2020
Data Ecosystems
2018 - 2028
Databases
1982 - present

Three Large Scale Data Commons That are
Working Towards Common APIs
1. NCI GDC / Cloud
Resources (UChicago
/ Broad)
2. NIH All of Us (Broad /
Verily)
3. CZI HCA Data
Platform
(UCSC/Broad)
For more information, see: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten & Anthony Philippakis, A Data
Biosphere for Biomedical Research,
https://medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d.
Also available at: https://goo.gl/9CySeo

Accumulating Knowledge About Small Effects
Data Commons
Data harmonization
Data Ecosystems
Analysis reuse
Databases
Curated data

Chart (data labels as shown on the slide):
                               1995   2005   2015   2025
Hardware / data production     100%    75%    70%    50%
Software / data analysis         0%    25%    25%    25%
Data ecosystem / data reuse      0%     0%     5%    25%

Big (deep) knowledge
Big Information (informatics)
- Shared analysis
Big Data
Data
Infrastructure
Apps
Academic Data Centers
Apps for clinical researchers
Apps for bioinformaticians
Apps for system builders
Data
Commons
Data Ecosystems
Cisplatin?
Idarubicin?
Floxuridine?

Questions?
rgrossman.com
@bobgrossman

For more information:
• To learn more about data commons: Robert L. Grossman, et al. A Case for Data Commons:
Toward Data Science as a Service, Computing in Science & Engineering 18.5 (2016): 10-20.
Also https://arxiv.org/abs/1604.02608
• To learn more about large scale, secure, compliant cloud based computing environments for
biomedical data, see: Heath, Allison P., et al. "Bionimbus: a cloud for managing, analyzing and
sharing large genomics datasets." Journal of the American Medical Informatics Association
21.6 (2014): 969-975. This article describes Bionimbus Gen1.
• To learn more about the NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a
shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016):
1109-1112. The GDC was developed using Bionimbus Gen2.
• To learn more about BloodPAC: Grossman, R. L., et al. "Collaborating to compete: Blood
Profiling Atlas in Cancer (BloodPAC) Consortium." Clinical Pharmacology & Therapeutics
(2017). BloodPAC was developed using the GDC Community Edition (CE), aka Bionimbus Gen3.

Contact Information
Robert L. Grossman
cdis.uchicago.edu
rgrossman.com
@BobGrossman
robert dot grossman at uchicago.edu
