Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

What is Data Commons and How Can Your Organization Build One?

869 views

Published on

This is a talk that I gave at the Molecular Medicine Tri Conference on data commons and data sharing to accelerate research discoveries and improve patient outcomes. It also covers how your organization can build a data commons using the Open Commons Consortium's Data Commons Framework and the University of Chicago's Gen3 data commons platform.

Published in: Data & Analytics
  • Be the first to comment

What is Data Commons and How Can Your Organization Build One?

  1. 1. How Data Commons Are Changing the Way That Large Biomedical Datasets are Analyzed and Shared Robert L. Grossman Center for Data Intensive Science University of Chicago & Open Commons Consortium February 12, 2018 Molecular Med Tri Conf
  2. 2. 1. What is a Data Commons?
  3. 3. NCI Genomic Data Commons* • The GDC makes over 2.5 PB of cancer genomics data available to the research community. • The data is harmonized by processing with a common set of bioinformatics pipelines. • Each month, the GDC is used by over 20,000 users and over 2 PB of data is downloaded. • The GDC is based upon an open source software stack that can be used to build other data commons.*See: NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The GDC consists of: 1) a data exploration & visualization portal (DAVE), 2) a data submission portal, 3) a data analysis and harmonization system system, 4) an API so third party can build applications.
  4. 4. A B C D
  5. 5. Systems 1 & 2: Data Portals to Explore and Submit Data See: NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112.
  6. 6. • MuSE (MD Anderson) • VarScan2 (Washington Univ.) • SomaticSniper (Washington Univ.) • MuTect2 (Broad Institute) Source: Zhenyu Zhang, et. al. and the GDC Project Team, Uniform Genomic Data Analysis in the NCI Genomic Data Commons, to appear. System 3: Data Harmonization System To Analyze all of the Submitted Data with a Common Pipelines
  7. 7. System 4: An API to Support User Defined Applications and Notebooks to Create a Data Ecosystem https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state API URL Endpoint Optional Entity ID Query parameters • Based upon a (graph-based) data model • Drives all internally developed applications, e.g. data portal • Allows third parties to develop their own applications • Can be used by other commons, by workspaces, by other systems, by user-developed applications and notebooks For more about the API, see: Shane Wilson, Michael Fitzsimons, Martin Ferguson, Allison Heath, Mark Jensen, Josh Miller, Mark W. Murphy, James Porter, Himanso Sahni, Louis Staudt, Yajing Tang, Zhining Wang, Christine Yu, Junjun Zhang, Vincent Ferretti and Robert L. Grossman, Developing Cancer Informatics Applications and Tools Using the NCI Genomic Data Commons API, Cancer Research, volume 77, number 21, 2017, pages e15-e18.
  8. 8. What is a Data Commons? Data commons co-locate data with cloud computing infrastructure and commonly used software services, tools & apps for managing, analyzing and sharing data to create an interoperable resource for the research community.* *Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson and Walt Wells, A Case for Data Commons Towards Data Science as a Service, IEEE Computing in Science and Engineer, 2016. Source of image: The CDIS, GDC, & OCC data commons infrastructure at a University of Chicago data center.
  9. 9. Research ethics committees (RECs) review the ethical acceptability of research involving human participants. Historically, the principal emphases of RECs have been to protect participants from physical harms and to provide assurance as to participants’ interests and welfare.* [The Framework] is guided by, Article 27 of the 1948 Universal Declaration of Human Rights. Article 27 guarantees the rights of every individual in the world "to share in scientific advancement and its benefits" (including to freely engage in responsible scientific inquiry)…* Protect patients The right of patients to benefit from research. *GA4GH Framework for Responsible Sharing of Genomic and Health-Related Data, see goo.gl/CTavQR Data sharing with protections provides the evidence so patients can benefit from advances in research. Data commons balance protecting patient data with open research that benefits patients:
  10. 10. • Supports big data & data intensive computing with cloud computing • Researchers can analyze data with collaborative tools (workspaces) – i. e. data does not have to be downloaded) • Data repository • Researchers download data. Databases Data Clouds Data Commons • Supports big data • Workspaces • Common data models • Core data services • Data & Commons Governance • Harmonized data • Data sharing • Reproducible research 1982 - present 2010 - 2020 2014 - 2024
  11. 11. 2. The Gen3 Data Commons Platform
  12. 12. OCC Open Science Data Cloud (2010) OCC – NASA Project Matsu (2009) NCI Genomic Data Commons* (2016) OCC-NOAA Environmental Data Commons (2016) OCC BloodPAC (2017) Bionimbus Protected Data Cloud* (2013) *Operated under a subcontract from NCI / Leidos Biomedical to the University of Chicago with support from the OCC. ** CHOP is the lead, with the University of Chicago developing a Gen3 Data Commons for the project. Kids First (2017)** Gen3 Gen2 Gen1 OCC is the Open Commons Consortium NCI CRDC (2017) Brain Commons (2017)
  13. 13. cdis.uchicago.edu • Open source • Designed to support project specific data commons • Designed to support an ecosystem of commons, workspaces, notebooks & applications. • We are starting to build an open source Gen3 community. • Cloud agnostic, including your own private cloud.
  14. 14. The Gen3 is Based on Widely Used Open Source Tech
  15. 15. Gen3 is Cloud Agnostic • The Gen3 service for Digital IDs (indexd) enables Gen3 managed digital objects to be on 1, 2, 3 or more private or public clouds. • Gen3 client applications can then retrieve the digital object from a local or remote cloud as required.
  16. 16. The Gen3 Data Model is customizable & extensible. The Gen3 Data Model extends the GDC data model.
  17. 17. Object-based storage with access control lists Scalable workflows Community data products Data Commons Framework Services (Digital ID, Metadata, Authentication, Auth., etc.) that support multiple data commons. Apps Database services Data Commons 1 Data Commons 2 Portals for accessing & submitting data Workspaces APIs Data Commons Framework Services Workspaces Workspaces Notebooks Apps Apps & Notebooks Gen3 Framework Services are designed to support multiple Gen3 Data Commons
  18. 18. Core Gen3 Data Commons Framework Services • Digital ID services • Metadata services • Authentication services • Authorization services • Data model driven APIs for submitting, searching & accessing data • Designed to span multiple data commons • Designed to support multiple private and commercial clouds • In the future, we will support portable workspaces
  19. 19. NCI Clouds Resources Compliant appsBionimbus, Cancer Collab., etc. FAIR Principles Your Data Commons Other data commons Commons Services Operations Center Commons services Commons Services Framework appapp app
  20. 20. Data Commons (bloodpac.org) BloodPAC is a public-private consortium for liquid biopsy data developed using Gen3 technology by the Open Commons Consortium (OCC) containing: 1. Datasets from circulating tumor cells, circulating tumor DNA, and exosome assays 2. Relevant clinical data (e.g. clinical diagnosis, treatment history and outcomes) 3. Sample preparation and handling protocols from different studies.
  21. 21. • The BRAIN Commons is a commons for PTSD, TBI, major depressive disorders, and other neurological diseases. • The images above show a meta-analysis of MRI structural measures in PTSD by Dr. R. Morey, Duke University performed in the commons. (www.braincommons.org)
  22. 22. • Clinical data • Imaging data • Genomic data • Biospecimen data • Wearable data Data types supported by the BRAIN Commons Linked to CDISC & other relevant standards
  23. 23. 3. Developing Your Own Data Commons Using the OCC Data Commons Framework
  24. 24. www.occ-data.org • U.S based 501(c)(3) not-for-profit corporation founded in 2008. • The OCC manages data commons to support medical and health care research, including the BloodPAC Data Commons and the BRAIN Commons. • The OCC manages data commons and cloud computing infrastructure to support more general scientific research, including the OCC NOAA Environmental Data Commons and the Open Science Data Cloud. • It is international and includes universities, not-for-profits, companies and government agencies.
  25. 25. Sharing Data with Data Commons – the Main Steps 1. Require data sharing. Put data sharing requirements into your project or consortium agreements. 2. Build a commons. Set up, work with others to set up, or join an existing data commons, fund it, and develop an operating plan, governance structure, and a sustainability plan. 3. Populate the commons. Provide resources to your data generators to get the data into data commons. 4. Interoperate with other commons. Interoperate with other commons that can accelerate research discoveries. 5. Support commons use. Support the development of third party apps that can make discoveries over your commons.
  26. 26. Open Source Software for Data Commons Third party open source apps Third party vendor apps Sponsor developed apps Public Clouds Data Commons Governance & Standards On Premise Clouds Commons Services Operations Center Data managed by the data commons Sponsor or Co-Sponsors OCC Data Commons Framework occ-data.org 1 2 3 45
  27. 27. Research groups submit data Clean and process the data following the standards Researchers use the commons for data analysis Adapt the data model to your project Example: Building the Data Commons Using the OCC Data Commons Framework Set up & configure the data commons (CSOC) Put in place the OCC data governance model New research discoveries
  28. 28. Researcher / Working Group Embargo Consortium Embargo Broad Research Community Counts only Analyzed, higher level data Raw data What data? To whom & when? Infrastructure as a Service (virtual machines) Platform as a Service (containers) Software as a Service (software applications hosted by the commons) Query Gateway Counts Approved Queries Approved Tools & Services Approved Infrastructure What service model? Public Various Data sharing Models Are Supported by Data Commons
  29. 29. • It is important to note that with the Gen3 platform, multiple geographically distributed data commons can interoperate in different ways: o through data peering o through a FAIR-based set of APIs for applications o through scattering queries/analyses and gathering the results o through a controlled and monitored query/analysis gateway
  30. 30. The Commons Alliance: Three Large Scale Data Commons Working Towards Common APIs to Create to Create de Facto Standards 1. NCI Cloud CRDC Framework Services / NCI GDC (UChicago / Broad) 2. NIH All of Us (Broad / Verily) 3. CZI HCA Data Platform (UCSC/Broad) For more information, see: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten & Anthony Philippakis, A Data Biosphere for Biomedical Research, https://medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research- d212bbfae95d. Also available at: https://goo.gl/9CySeo
  31. 31. Standards for Genomic & Health Data www.ga4gh.org
  32. 32. Six Options for Building a Data Commons 1. Add your project data to an existing data commons. 2. Join an existing partnership that serves multiple related diseases based upon the Gen3 Data Commons platform or other platform designed with similar goals 3. Build your own data commons and peer with an existing partnership of data commons developed using the Gen3 Data Commons Platform or other platform designed with similar goals. 4. Build your own data commons that follow standards for interoperating data commons, such as the Commons Alliance standards. 5. Build your own data commons that follow standards for research data, such as the FAIR standards. 6. Build your own data commons with a new architecture and new standards and get others to adapt your standards.
  33. 33. 4. Summary and Conclusion
  34. 34. Benefits of Data Commons and Data Sharing 1. The data is available to other researchers for discovery, which moves the research field faster. 2. Data commons support repeatable, reproducible and open research. 3. Some diseases are dependent upon having a critical mass of data to provide the required statistical power for the scientific evidence (e.g. to study combinations of rare mutations in cancer) 4. Data commons can interoperate with each other so that over time data sharing can benefit from a “network effect” 5. With more data, smaller effects can be studied (e.g. to understand the effect of environmental factors on disease) and machine learning techniques that require large data can developed.
  35. 35. Data Commons 2014 - 2024 Data Clouds 2010 - 2020 Data Ecosystems 2018 - 2028 Databases 1982 - present
  36. 36. Summary 1. Data commons co-locate data with cloud computing infrastructure and commonly used software services, tools & apps for managing, analyzing and sharing data to create an interoperable resource for the research community. 2. Data commons provide a platform for open data, open science and reproducible research that provide the data sharing infrastructure to support precision medicine. 3. The Gen3 platform is an open source platform that is creating an ecosystem of data commons, applications and resources. 4. The independent not-for-profit 501(c)(3) Open Commons Consortium can help you set up your own data commons to support data sharing for precision medicine.
  37. 37. Questions?
  38. 38. To get involved: • Open Commons Consortium to help you build a data commons o occ-data.org • Gen3 Data Commons software stack o cdis.uchicago.edu • NCI Genomic Data Commons o gdc.cancer.gov • BRAIN Commons o braincommons.org To learn more about some of the data commons: • BloodPAC o bloodpac.org • OCC-NOAA Environmental Data Commons o edc.occ-data.org
  39. 39. For more information: • To learn more about data commons: Robert L. Grossman, et. al. A Case for Data Commons: Toward Data Science as a Service, Computing in Science & Engineering 18.5 (2016): 10-20. Also https://arxiv.org/abs/1604.02608 • To large more about large scale, secure compliant cloud based computing environments for biomedical data, see: Heath, Allison P., et al. "Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets." Journal of the American Medical Informatics Association 21.6 (2014): 969-975. This article describes Bionimbus Gen1. • To learn more about the NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The GDC was developed using Bionimbus Gen2. • To learn more about BloodPAC: Grossman, R. L., et al. "Collaborating to compete: Blood Profiling Atlas in Cancer (BloodPAC) Consortium." Clinical Pharmacology & Therapeutics (2017). BloodPAC was developed using the GDC Community Edition (CE) aka Bionimbus Gen3 • To learn about the GDC / Gen3 API: Shane Wilson, Michael Fitzsimons, Martin Ferguson, Allison Heath, Mark Jensen, Josh Miller, Mark W. Murphy, James Porter, Himanso Sahni, Louis Staudt, Yajing Tang, Zhining Wang, Christine Yu, Junjun Zhang, Vincent Ferretti and Robert L. Grossman, Developing Cancer Informatics Applications and Tools Using the NCI Genomic Data Commons API, Cancer Research, volume 77, number 21, 2017, pages e15- e18. • To learn more about the de facto standards being developed by the Commons Alliance: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten, Anthony Philippakis, A Data Biosphere for Biomedical Research, https://medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d
  40. 40. Abstract Biomedical data has grown too large for most research groups to host and analyze the data from large projects themselves. Data commons provide an alternative by co-locating data, storage and computing resources with commonly used software services, applications and tools for analyzing, harmonizing and sharing data to create an interoperable resource for the research community. We give an overview of data commons and describe some lessons learned from the NCI Genomic Data Commons, the BloodPAC Data Commons and the BRAIN Commons. We also give an overview of how an organization can set up a commons themselves.
  41. 41. cdis.uchicago.edu Robert L. Grossman rgrossman.com @BobGrossman robert.grossman@uchicago.edu Contact Information occ-data.org

×