How Data Commons are Changing the Way that
Large Biomedical Datasets are Analyzed and Shared
Robert L. Grossman
Center for Data Intensive Science
University of Chicago
& Open Commons Consortium
January 10, 2018
AMIA Webinar
Learning Objectives
1. What is a data commons?
2. How does a data commons accelerate the analysis and integration of
biomedical data?
3. How does a data commons support data sharing?
4. What are some of the differences between a data cloud and a data
commons?
5. What are some emerging de facto standards for data commons?
6. How can you build your own data commons?
The slides shown today are available at:
https://www.slideshare.net/rgrossman
1. What is a Data Commons?
The challenge of big data in biomedicine…
The commoditization of sensors is
creating an explosive growth of data.
It can take weeks to download large datasets, it is difficult to
set up compliant computing infrastructure, and it can take
months to integrate & format the data for analysis.
There is not enough
funding for every
researcher to house all the
data they need
More challenges…
Data produced by different groups using different
methods is hard to integrate and compare.
There are no good software
platforms for researchers to use to
share their large datasets.
Most researchers don’t have the
bioinformatics support to process all
the data that could help their
research.
… but today, data commons are emerging as a solution.
Data commons co-locate data with cloud computing infrastructure and
commonly used software services, tools & apps for managing, analyzing and
sharing data to create an interoperable resource for the research community.*
• Data commons grew out of large-scale
commercial cloud computing
technology.
• This technology has transformed
many fields, but is only now
beginning to impact biomedical
research.
*Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson and Walt Wells, A Case for Data Commons: Toward Data Science as a Service, IEEE
Computing in Science & Engineering, 2016. Source of image: The CDIS, GDC, & OCC data commons infrastructure at a University of Chicago data center.
Research ethics
committees (RECs) review
the ethical acceptability of
research involving human
participants. Historically,
the principal emphases of
RECs have been to protect
participants from physical
harms and to provide
assurance as to
participants’ interests and
welfare.*
[The Framework] is
guided by Article 27 of
the 1948 Universal
Declaration of Human
Rights. Article 27
guarantees the rights
of every individual in
the world "to share in
scientific advancement
and its benefits"
(including to freely
engage in responsible
scientific inquiry)…*
Protect patients
The right of
patients to benefit
from research.
*GA4GH Framework for Responsible Sharing of Genomic and Health-Related Data, see goo.gl/CTavQR
Data sharing with protections provides the evidence
so patients can benefit from advances in research.
Data commons balance protecting patient data with open
research that benefits patients:
[Diagram: three pipelines feed the data commons.
• Discovery / clinical trials (research) → research databases and repositories
• Quality and patient safety → strength-of-evidence databases
• Patient care and hospital operations → quality and outcome databases
In each pipeline, raw data is aggregated and identified data is de-identified before reaching the data commons; "Translation" and "Clinical quality & outcomes" link the research and care pipelines.]
2. An Example of a Data Commons
NCI Genomic Data Commons*
• The GDC makes over 2.5 PB of data available for access via an
API, for analysis by cloud resources on public clouds, and for
downloading.
• In October 2017, the GDC was used by over 22,000 users, and
over 2.3 PB of data was downloaded.
• The GDC is based upon an open source software stack that can
be used to build other data commons.
*See: NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer
genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112.
The GDC consists of 1) a data exploration & visualization portal (DAVE), 2) a data
submission portal, 3) a data analysis and harmonization system, and 4) an API
so third parties can build applications.
Systems 1 & 2: Data Portals to Explore and Submit Data
• MuSE
(MD Anderson)
• VarScan2 (Washington
Univ.)
• SomaticSniper
(Washington Univ.)
• MuTect2
(Broad Institute)
Source: Zhenyu Zhang, et al. and the GDC Project Team, Uniform Genomic Data Analysis in
the NCI Genomic Data Commons, to appear.
System 3: Data Harmonization System to Analyze All of the
Submitted Data with Common Pipelines
System 4: An API to Support User Defined Applications and
Notebooks to Create a Data Ecosystem
https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state
• The GDC has a REST API so that researchers can develop their own
applications.
• There are third party applications that use the REST API for Python, R,
Jupyter notebooks and Shiny.
• The REST API drives the GDC data portal, data submission system, etc.
GDC Application Programming Interface (API)
https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state
API URL Endpoint Optional Entity ID Query parameters
• Based upon a (graph-based) data model
• Drives all internally developed applications, e.g. data portal
• Allows third parties to develop their own applications
• Can be used by other commons, by workspaces, by other
systems, by user-developed applications and notebooks
For more about the API, see: Shane Wilson, Michael Fitzsimons, Martin Ferguson, Allison Heath, Mark Jensen, Josh Miller, Mark W. Murphy, James
Porter, Himanso Sahni, Louis Staudt, Yajing Tang, Zhining Wang, Christine Yu, Junjun Zhang, Vincent Ferretti and Robert L. Grossman, Developing
Cancer Informatics Applications and Tools Using the NCI Genomic Data Commons API, Cancer Research, volume 77, number 21, 2017, pages e15-e18.
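The anatomy above (API URL, `files` endpoint, optional entity ID, query parameters) can be sketched in a few lines of Python. This is a minimal illustration rather than an official GDC client: the endpoint URL and the `fields=state` parameter come from the slide, while the helper names and the assumption that responses wrap results under a `"data"` key are mine.

```python
# Sketch of calling the GDC REST API files endpoint (URL from the slide).
# Helper names are illustrative, not part of any official GDC client.
import json
import urllib.parse
import urllib.request

GDC_API = "https://gdc-api.nci.nih.gov"

def gdc_file_url(file_id, fields=None):
    """Build a files endpoint URL: <API URL>/files/<entity ID>?fields=..."""
    url = "{}/files/{}".format(GDC_API, file_id)
    if fields:
        url += "?" + urllib.parse.urlencode({"fields": ",".join(fields)})
    return url

def get_file_state(file_id):
    # Assumes the JSON response wraps results under a "data" key.
    with urllib.request.urlopen(gdc_file_url(file_id, ["state"])) as resp:
        return json.load(resp)["data"]["state"]

# Reproduces the URL shown on the slide:
print(gdc_file_url("5003adf1-1cfd-467d-8234-0d396422a4ee", ["state"]))
```

The same request pattern is what the third-party Python, R, Jupyter, and Shiny applications mentioned above build on.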
Purple points are a PCA-based analysis of RNA-seq data for lung adenocarcinoma.
Grey points are associated with lung squamous cell carcinoma. Green points appear
to be misdiagnosed.
The GDC enables
bioinformaticians to
build their own
applications using
the GDC API.
Source: Center for Data Intensive Science, University of Chicago. This app was built over the GDC API.
Shiny R app
built using
the GDC API
3. Data Commons in More Detail
Databases (1982 – present)
• Data repository
• Researchers download data.
Data Clouds (2010 – 2020)
• Supports big data & data intensive computing with cloud computing
• Researchers can analyze data with collaborative tools (workspaces),
i.e. data does not have to be downloaded.
Data Commons (2014 – 2024)
• Supports big data
• Workspaces
• Common data models
• Core data services
• Data & commons governance
• Harmonized data
• Data sharing
• Reproducible research
The Commons Alliance: Three Large Scale Data Commons Working
Towards Common APIs to Create de Facto Standards
1. NCI Cloud CRDC
Framework Services /
GDC (UChicago / Broad)
2. NIH All of Us (Broad /
Verily)
3. CZI HCA Data Platform
(UCSC/Broad)
For more information, see: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten & Anthony Philippakis, A Data
Biosphere for Biomedical Research, https://medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-
d212bbfae95d. Also available at: https://goo.gl/9CySeo
Various Data Sharing Models Are Supported by Data Commons
What data?
• Counts only
• Analyzed, higher level data
• Raw data
To whom & when?
• Researcher / working group embargo
• Consortium embargo
• Broad research community
• Public
What service model?
• Infrastructure as a Service (virtual machines)
• Platform as a Service (containers)
• Software as a Service (software applications hosted by the commons)
• Query gateway (counts, approved queries)
• Approved tools & services
• Approved infrastructure
Benefits of Data Commons and Data Sharing (1 of 2)
1. The data is available to other researchers for discovery,
which moves the research field faster.
2. Data commons support repeatable, reproducible and open
research.
3. Research on some diseases depends upon having a critical
mass of data to provide the required statistical power for the
scientific evidence (e.g. to study combinations of rare
mutations in cancer).
4. With more data, smaller effects can be studied (e.g. to
understand the effect of environmental factors on disease).
Source: Robert L. Grossman, Supporting Open Data and Open Science With Data Commons: Some Suggested Guidelines for Funding Organizations,
2017, https://www.healthra.org/download-resource/?resource-url=/wp-content/uploads/2017/08/Data-Commons-
Guidelines_Grossman_8_2017.pdf
Benefits of Data Commons and Data Sharing (2 of 2)
5. Data commons enable researchers to work with large
datasets at much lower cost to the funder than if each
researcher set up their own local environment.
6. Data commons generally provide higher security and greater
compliance than most local computing environments.
7. Data commons support large scale computation so that the
latest bioinformatics pipelines can be run.
8. Data commons can interoperate with each other so that
over time data sharing can benefit from a “network effect.”
4. The Gen3 Data Commons Platform
OCC Open Science Data Cloud (2010)
OCC – NASA Project Matsu (2009)
NCI Genomic Data Commons* (2016)
OCC-NOAA Environmental Data
Commons (2016)
OCC Blood Profiling
Atlas in Cancer (2017)
Bionimbus Protected Data Cloud* (2013)
*Operated under a subcontract from NCI / Leidos Biomedical
to the University of Chicago with support from the OCC.
** CHOP is the lead, with the University of Chicago developing
a Gen3 Data Commons for the project.
Brain Commons
(2017)
Kids First Data
Resource (2017)**
Gen3
Gen2
Gen1
OCC is the Open
Commons Consortium
cdis.uchicago.edu
• Open source
• Designed to support
project specific data
commons
• Designed to support
an ecosystem of
commons, workspaces,
notebooks &
applications.
• We are building an
open source Gen3
community.
• Cloud agnostic,
including your own
private cloud.
The Gen3 Data Model
Is Customizable &
Extensible
• Extends the GDC data
model
• BloodPAC
• BRAIN Commons
• Kids First Data Resource
• Data commons
supporting several pilots
Gen3 Framework Services are designed to support multiple Gen3 Data Commons:
[Diagram: Data Commons Framework Services (digital IDs, metadata, authentication,
authorization, etc.) support multiple data commons (Data Commons 1, Data Commons 2, …).
Each commons provides object-based storage with access control lists, scalable
workflows, database services, portals for accessing & submitting data, and APIs,
which in turn support workspaces, notebooks, apps, and community data products.]
Core Gen3 Data Commons Framework Services
• Digital ID services
• Metadata services
• Authentication services
• Authorization services
• Data model driven APIs for submitting, searching & accessing data
• Designed to span multiple data commons
• Designed to support multiple private and commercial clouds
• In the future, we will support portable workspaces
[Diagram: your data commons, built on the Commons Services Framework and supported
by a Commons Services Operations Center, peers with other data commons (following
data peering principles) and interoperates with the NCI Cloud Pilots, the Bionimbus
PDC & other clouds, and compliant apps, guided by the FAIR principles.]
[Diagram: Data Commons Framework Services and cross cloud services connect a
private academic cloud, operated by the Univ. of Chicago CSOC (ops center),
with other private academic clouds.]
5. Developing Your Own Data Commons
Sharing Data with Data Commons – the Main Steps
1. Require data sharing. Put data sharing requirements into your
project or consortium agreements.
2. Build a commons. Set up, work with others to set up, or join an
existing data commons, fund it, and develop an operating plan,
governance structure, and a sustainability plan.
3. Populate the commons. Provide resources to your data
generators to get the data into data commons.
4. Interoperate with other commons. Interoperate with other
commons that can accelerate research discoveries.
5. Support commons use. Support the development of third party
apps that can make discoveries over your commons.
Open Source Software for
Data Commons
Third party open
source apps
Third party vendor
apps
Sponsor developed apps
Public Clouds
Data
Commons
Governance &
Standards
On Premise Clouds
Commons
Operations
Center
Data managed by the data commons
Sponsor or Co-Sponsors
OCC Data Commons Framework occ-data.org
Building the Data Commons
• Adapt the data model to your project
• Set up & configure the data commons (CSOC)
• Put in place the OCC data governance model
• Research groups submit data
• Clean and process the data following the standards
• Researchers use the commons for data analysis
→ New research discoveries
www.occ-data.org
• U.S.-based 501(c)(3) not-for-profit corporation founded in 2008.
• The OCC manages data commons to support medical and health care
research, including the BloodPAC Data Commons and BRAIN
Commons.
• The OCC manages data commons and cloud computing infrastructure
to support more general scientific research, including the OCC NOAA
Data Commons and the Open Science Data Cloud.
• It is international and includes universities, not-for-profits, companies
and government agencies.
6. Summary and Conclusion
Summary
1. Data commons co-locate data with cloud computing infrastructure and
commonly used software services, tools & apps for managing,
analyzing and sharing data to create an interoperable resource for the
research community.
2. Data commons provide a platform for open data, open science and
reproducible research.
3. The open source Gen3 Data Commons software platform:
a) supports disease specific, project specific or consortium specific data
commons.
b) supports an ecosystem of FAIR-based applications.
c) supports multiple data commons that peer and interoperate.
4. The independent not-for-profit Open Commons Consortium can help
you set up your own data commons.
Datasets organize the data around an experiment.
Data warehouses and databases organize the data for an organization.
Data commons organize the data for a scientific discipline or field.
Questions?
To get involved:
• Gen3 Data Commons software stack
o cdis.uchicago.edu
• Open Commons Consortium to help you build a data commons
o occ-data.org
To learn more about some of the data commons:
• NCI Genomic Data Commons
o gdc.cancer.gov
• BloodPAC
o bloodpac.org
For more information:
• To learn more about data commons: Robert L. Grossman, et al., A Case for Data Commons: Toward Data Science
as a Service, Computing in Science & Engineering 18.5 (2016): 10-20. Also https://arxiv.org/abs/1604.02608
• To learn more about large scale, secure, compliant cloud-based computing environments for biomedical data, see:
Heath, Allison P., et al. "Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets." Journal
of the American Medical Informatics Association 21.6 (2014): 969-975. This article describes Bionimbus Gen1.
• To learn more about the NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for
cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The GDC was developed
using Bionimbus Gen2.
• To learn more about BloodPAC: Grossman, R. L., et al. "Collaborating to compete: Blood Profiling Atlas in Cancer
(BloodPAC) Consortium." Clinical Pharmacology & Therapeutics (2017). BloodPAC was developed using the GDC
Community Edition (CE), aka Bionimbus Gen3.
• To learn about the GDC / Gen3 API: Shane Wilson, Michael Fitzsimons, Martin Ferguson, Allison Heath, Mark
Jensen, Josh Miller, Mark W. Murphy, James Porter, Himanso Sahni, Louis Staudt, Yajing Tang, Zhining Wang,
Christine Yu, Junjun Zhang, Vincent Ferretti and Robert L. Grossman, Developing Cancer Informatics Applications
and Tools Using the NCI Genomic Data Commons API, Cancer Research, volume 77, number 21, 2017, pages e15-
e18.
• To learn more about the de facto standards being developed by the Commons Alliance: Josh Denny, David Glazer,
Robert L. Grossman, Benedict Paten, Anthony Philippakis, A Data Biosphere for Biomedical Research,
https://medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d
cdis.uchicago.edu
Robert L. Grossman
rgrossman.com
@BobGrossman
robert.grossman@uchicago.edu
Contact Information
