
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Shared

This is a talk I gave at the Wellcome Genome Campus Workshop on Big Data in Biology and Health on September 26, 2017.


  1. How data commons are changing the way that large genomic and clinical datasets are analyzed and shared. Robert Grossman, Center for Data Intensive Science, University of Chicago & Open Commons Consortium. Wellcome Genome Campus Workshop on Big Data in Health and Biology, September 26, 2017.
  2. Databases & Repositories (1982 - present); Data Clouds (2010 - 2020); Data Commons (2014 - 2024).
  3. 1. Big Data in Biology, Medicine & Health Care: A Thirty Year Perspective
  4. A timeline: Computationally Intensive Statistics (1984); Data Mining & KDD (1993); Predictive Analytics (2004); Big Data & Data Science (2011); AI (redux) / Deep Learning (2016). Driving forces and techniques along the way: direct marketing, POS data, the Internet, mobile; Hamiltonian Monte Carlo, PageRank, CNNs, ANNs, labeled images, etc.
  5. Deep Learning (DL) Dogma • Use Deep Neural Networks (DNNs) with lots of layers (10's to 100's of layers). • Try using Convolutional Neural Networks (CNNs), even if the problem is not translation invariant. • Represent the inputs and internal states as long vectors of numbers (even if the input is an image, text, spoken voice, etc. that has structure). • Train with very large amounts of labeled data. • Don't worry about the internal structure of the model, just its accuracy and coverage.
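Two points of the dogma, long input vectors pushed through many stacked layers, can be sketched with a toy forward pass (a minimal NumPy illustration; the layer widths and random weights are invented for the example, and no training is shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def deep_forward(x, n_layers=20, width=256):
    """Push a long input vector through many dense layers (toy sketch)."""
    h = x
    for _ in range(n_layers):
        # Random, untrained weights just to illustrate the shape of the computation.
        W = rng.normal(scale=1.0 / np.sqrt(h.size), size=(width, h.size))
        h = relu(W @ h)
    return h

x = rng.normal(size=784)   # e.g. a 28x28 image flattened into one long vector
out = deep_forward(x)
print(out.shape)           # (256,)
```

The point of the sketch is only structural: the input's spatial structure is discarded up front (flattened to a vector), and depth comes from repeating the same dense transformation many times.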
  6. Comparing deep neural networks (2016) with data mining (1996):
     |                                                  | DNN (2016)                     | Data Mining (1996)           |
     | Use lots of parameters                           | 10's to 100's of hidden layers | Large trees, large ensembles |
     | Long vectors for inputs                          | Part of the dogma              | Generally the case           |
     | Use as much data as possible                     | Yes                            | Yes                          |
     | Do we care about the internal structure of the model? | No                        | No                           |
     | Hardware                                         | GPU & custom chips             | Clusters of workstations     |
     | Labeled data                                     | Often the limiting factor      | Often the limiting factor    |
     | Features                                         | Not needed                     | Generally the hard part      |
  7. • In some sense, Deep Learning is eating the world. • Compare: Marc Andreessen, Why Software Is Eating The World, Wall Street Journal, August 20, 2011. • From a broader perspective, Machine Learning (ML) continues to eat the world, as it has been doing for the last 20 years, driven by the exponential growth in the amount of data and the computational power available to estimate parameters.
  8. 2. Data Commons, An Emerging Platform for Data Science
  9. Data Commons. Data commons co-locate data, storage and computing infrastructure with commonly used services, tools & apps for analyzing and sharing data to create an interoperable resource for the research community.* *Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson and Walt Wells, A Case for Data Commons: Toward Data Science as a Service, Computing in Science & Engineering, 2016. Source of image: the CDIS, GDC, & OCC data commons infrastructure at the University of Chicago Kenwood Data Center.
  10. NCI Genomic Data Commons* • Launched in 2016 with over 4 PB of data. • Joint project with OICR. • Used by 1,500 - 2,000+ users per day. • Based upon an open source software stack that can be used to build other data commons. *See: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112.
  11. System 1: Data Portals to Explore and Submit Data
  12. System 2: Data Harmonization System to Analyze All of the Submitted Data with Common Pipelines • MuSE (MD Anderson) • VarScan2 (Washington Univ.) • SomaticSniper (Washington Univ.) • MuTect2 (Broad Institute). Source: Zhenyu Zhang, et al. and the GDC Project Team, Uniform Genomic Data Analysis in the NCI Genomic Data Commons, to appear.
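The harmonization idea above, running several somatic variant callers over the same data and comparing their calls, can be sketched abstractly (the four caller names come from the slide; the consensus rule, call format, and toy stand-in callers are invented for illustration and are not the GDC pipeline):

```python
# Hypothetical sketch: each "caller" returns a set of variant calls for a
# sample; harmonization runs all callers and reports consensus calls.
def consensus_calls(callers, sample, min_agree=2):
    """Variants reported by at least `min_agree` of the callers."""
    counts = {}
    for caller in callers:
        for v in caller(sample):
            counts[v] = counts.get(v, 0) + 1
    return {v for v, n in counts.items() if n >= min_agree}

# Toy stand-ins for MuSE, VarScan2, SomaticSniper, MuTect2:
callers = [
    lambda s: {"chr1:100A>T", "chr2:200G>C"},
    lambda s: {"chr1:100A>T"},
    lambda s: {"chr1:100A>T", "chr3:300C>G"},
    lambda s: {"chr2:200G>C", "chr3:300C>G"},
]
print(sorted(consensus_calls(callers, sample=None)))
# ['chr1:100A>T', 'chr2:200G>C', 'chr3:300C>G']
```

Running every submitted sample through the same set of pipelines is what makes the resulting calls comparable across projects, which is the point of harmonization.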
  13. System 3: User Defined Applications and Notebooks to Create a Data Ecosystem • The GDC has a REST API so that researchers can develop their own applications. • There are third party applications that use the REST API from Python, R, Jupyter notebooks and Shiny. • The REST API drives the GDC data portal, data submission system, etc. Example: https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state
  14. GDC Application Programming Interface (API) – To Build Applications. Example: https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state (API URL, endpoint, optional entity ID, query parameters). • Based upon the data model • Drives internally developed applications, e.g. the data portal • Allows third parties to develop their own applications • Can be used by other commons, by workspaces, by other systems, and by user-developed applications and notebooks
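As a sketch of how the URL structure above composes (the endpoint and file UUID are taken verbatim from the slide; the helper `gdc_file_url` is a hypothetical convenience, not part of the GDC API itself):

```python
from urllib.parse import urlencode

# Base endpoint taken from the slide above.
BASE = "https://gdc-api.nci.nih.gov"

def gdc_file_url(file_id, fields=None):
    """Build a GDC REST API URL for a single file entity (hypothetical helper)."""
    url = f"{BASE}/files/{file_id}"          # endpoint + optional entity ID
    if fields:
        url += "?" + urlencode({"fields": ",".join(fields)})  # query parameters
    return url

url = gdc_file_url("5003adf1-1cfd-467d-8234-0d396422a4ee", fields=["state"])
print(url)
# https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state

# The actual request could then be made with, e.g., the requests library:
# import requests
# print(requests.get(url).json())
```

Because the whole interface is plain HTTP, the same URL works from Python, R, a Jupyter notebook, a Shiny app, or the GDC's own portal.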
  15. Shiny R app built using the GDC API. Purple balls are a PCA-based analysis of RNA-seq data for lung adenocarcinoma; grey are associated with lung squamous cell carcinoma; green appear to be misdiagnosed. The GDC enables bioinformaticians to build their own applications using the GDC API. Source: Center for Data Intensive Science, University of Chicago.
  16. Databases (1982 - present): • Data repository • Researchers download data. Data Clouds (2010 - 2020): • Supports big data with cloud computing • Researchers can analyze data with collaborative tools (workspaces), i.e. data does not have to be downloaded. Data Commons (2014 - 2024): • Supports big data • Workspaces • Common data models • Core data services • Harmonized data • Governance.
  17. 3. GDC Gen3 Open Source Software Stack
  18. Data commons lineage (Gen1 → Gen2 → Gen3): OCC – NASA Project Matsu (2009); OCC Open Science Data Cloud (2010); Bionimbus Protected Data Cloud* (2013); NCI Genomic Data Commons* (2016); OCC-NOAA Environmental Data Commons (2016); OCC Blood Profiling Atlas in Cancer (2017); Brain Commons (2017); Kids First Data Resource (2017). *Operated under a subcontract from NCI / Leidos Biomedical to the University of Chicago with support from the OCC.
  19. The Gen3 Data Model Is Customizable & Extensible • BloodPAC • Brain Commons • Wellness Commons • Kids First Data Resource
  20. Architecture used by Gen3 Data Commons. Data Commons Framework Services (Digital ID, Metadata, Authentication, Authorization, etc.) support multiple data commons (Data Commons 1, Data Commons 2). Each commons exposes portals for accessing & submitting data, workspaces, notebooks, apps, and APIs, built over object-based storage with access control lists, scalable lightweight workflow, database services, and community data products.
  21. Core Data Commons Framework Services • Digital ID services • Metadata services • Authentication services • Authorization services • Designed to span multiple data commons • Designed to support multiple private and commercial clouds • In the future, we will support portable workspaces
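To make the digital ID idea concrete: such a service maps a permanent identifier to file metadata and one or more storage locations, so data can move between clouds without breaking references. A minimal hypothetical sketch (the class name and record shape are invented for illustration and are not the Gen3 interface):

```python
import uuid

class DigitalIDService:
    """Hypothetical in-memory digital ID service: permanent GUID -> file record."""

    def __init__(self):
        self._records = {}

    def register(self, md5, size, urls):
        # Mint a permanent identifier for a file with the given checksum,
        # size, and storage locations (possibly in several clouds).
        guid = str(uuid.uuid4())
        self._records[guid] = {"md5": md5, "size": size, "urls": list(urls)}
        return guid

    def resolve(self, guid):
        # Look up where the file currently lives; callers keep only the GUID.
        return self._records[guid]

ids = DigitalIDService()
guid = ids.register(
    md5="d41d8cd98f00b204e9800998ecf8427e",
    size=0,
    urls=["s3://commons-bucket/example.bam"],
)
print(ids.resolve(guid)["urls"][0])   # s3://commons-bucket/example.bam
```

The design point is indirection: because analyses reference the GUID rather than a URL, the framework can relocate data across private and commercial clouds without invalidating anything built on top.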
  22. Open Source Software for Data Commons. A sponsor (e.g. a funder or consortium of funders) provides data commons management & governance; the CSOC (Common Services Ops Center) operates the OCC Data Commons Framework over public and private clouds holding the data managed by the commons; on top run existing open source apps, commercial apps, and new FOSS sponsor-funded apps.
  23. Data Commons Framework Services: cross cloud services, including a private academic cloud at the Univ. of Chicago, operated by the CSOC (ops center).
  24. The Commons Services Operations Center provides commons services through the Commons Services Framework; the NCI GDC, the NCI Cloud Pilots, the Bionimbus PDC & other clouds, and other data commons interoperate via Data Peering Principles, with compliant apps built on FAIR Principles.
  25. Summary 1. Designed to support disease specific, project specific or consortium specific data commons, including the governance model. 2. Designed to support multiple data commons that peer and interoperate. 3. Designed to support an ecosystem of FAIR-based applications. 4. The core underlying software stack is open source. 5. Data commons governance model in which data is public and you "pay for compute". 6. Supported by the independent not-for-profit Open Commons Consortium.
  26. 4. Towards Data Ecosystems
  27. Databases (1982 - present); Data Clouds (2010 - 2020); Data Commons (2014 - 2024); Data Ecosystems (2018 - 2028).
  28. Three Large Scale Data Commons That Are Working Towards Common APIs: 1. NCI GDC / Cloud Resources (UChicago / Broad) 2. NIH All of Us (Broad / Verily) 3. CZI HCA Data Platform (UCSC / Broad). For more information, see: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten & Anthony Philippakis, A Data Biosphere for Biomedical Research, https://medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d. Also available at: https://goo.gl/9CySeo
  29. Accumulating Knowledge About Small Effects: Databases offer curated data; Data Commons add data harmonization; Data Ecosystems add analysis reuse.
  30. [Chart: from 1995 to 2025, the share of effort shifts from hardware / data production, to software / data analysis, to data ecosystem / data reuse.]
  31. From Big Data (data infrastructure, academic data centers, apps for system builders) to Big Information (informatics, shared analysis, data commons, apps for bioinformaticians) to big (deep) knowledge (data ecosystems, apps for clinical researchers, answering questions such as: Cisplatin? Idarubicin? Floxuridine?).
  32. Questions? rgrossman.com @bobgrossman
  33. For more information: • To learn more about data commons: Robert L. Grossman, et al., "A Case for Data Commons: Toward Data Science as a Service," Computing in Science & Engineering 18.5 (2016): 10-20. Also https://arxiv.org/abs/1604.02608 • To learn more about large scale, secure, compliant cloud based computing environments for biomedical data, see: Heath, Allison P., et al. "Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets." Journal of the American Medical Informatics Association 21.6 (2014): 969-975. This article describes Bionimbus Gen1. • To learn more about the NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The GDC was developed using Bionimbus Gen2. • To learn more about BloodPAC: Grossman, R. L., et al. "Collaborating to compete: Blood Profiling Atlas in Cancer (BloodPAC) Consortium." Clinical Pharmacology & Therapeutics (2017). BloodPAC was developed using the GDC Community Edition (CE), aka Bionimbus Gen3.
  34. Contact Information: Robert L. Grossman, cdis.uchicago.edu, rgrossman.com, @BobGrossman, robert dot grossman at uchicago.edu
