NIH Data Initiatives: Harnessing Big (and small) Data to Improve Health
Presentation at the Internet2 Global Forum, April 28, 2015
Session: NIH Perspectives
The document proposes the creation of a federated cloud computing platform called "The Commons" to support biomedical data sharing and analysis across multiple cloud providers. Key points:
- The Commons would index metadata and digital objects across conformant public and private cloud providers.
- It would be funded by providing credits to investigators for storage and computing, creating competition among providers to offer better services at lower costs.
- A phased implementation is outlined to initially involve experienced users and later expand to all NIH grantees.
NIH Data Commons (note: presentation has animations) - Vivien Bonazzi
Presented at the Data Commons & Data Science Workshop (University of Chicago - Center for Data Intensive Science).
NB: there are animations in these slides, so static slides might not render well.
The NIH Data Commons - BD2K All Hands Meeting 2015 - Vivien Bonazzi
Presentation given at the BD2K All Hands meeting in Bethesda, MD, USA in November 2015
https://datascience.nih.gov/bd2k/events/NOV2015-AllHands
Videocast of this presentation:
http://videocast.nih.gov/summary.asp?Live=17480&bhcp=1
The talk starts at 2 hr 40 min (it's about 55 minutes long) and includes video.
Document describing the Commons: https://datascience.nih.gov/commons
The document discusses the need for an NIH Data Commons to address challenges with data sharing and storage. It describes how factors like increasing data volumes, availability of cloud technologies, and emphasis on FAIR data principles are driving the need for a centralized data platform. The proposed NIH Data Commons would provide findable, accessible, interoperable and reusable data through cloud-based services and tools. It would enable data-driven science by facilitating discovery, access and analysis of biomedical data across different sources. Plans are outlined to develop and test an initial Data Commons pilot using existing genomic and other biomedical datasets.
EMBL Australian Bioinformatics Resource AHM - Data Commons - Vivien Bonazzi
This document discusses the development of the NIH Data Commons, which aims to create a shared framework and infrastructure for biomedical data. It notes the increasing amounts of data being generated and the need for data sharing and interoperability. The Data Commons framework treats data, tools, and publications as digital objects that are findable, accessible, interoperable and reusable. Current pilots include deploying reference datasets in the cloud, indexing data and tools, and a credits system for cloud resources. Challenges discussed include metrics, costs, standards, incentives and sustainability. The framework's relevance for supporting open data in Australia is also addressed.
The Commons: Leveraging the Power of the Cloud for Big Data - Philip Bourne
The document discusses the need for a Commons framework to leverage cloud computing for big data in biomedicine. It describes key principles of the Commons, including supporting a digital ecosystem, treating research outputs as digital objects, and ensuring objects are FAIR (findable, accessible, interoperable, and reusable). The Commons framework exploits cloud technologies to provide access to data and tools through APIs and containers. Current pilots applying this framework include the Cloud Credits Model, BD2K Centers, model organism databases, the Human Microbiome Project, and NCI cancer genomics data. The goal is to make large biomedical datasets and associated tools broadly available for research in a standardized, interoperable manner.
The document provides an overview of the development of the NIH Data Commons. It discusses factors driving the need for a data commons, including large amounts of data being generated and increased support for data sharing. It outlines the goals of making data findable, accessible, interoperable and reusable. Several pilots are exploring the feasibility of the commons framework, including placing large datasets in the cloud and developing indexing methods. Considerations in fully realizing the commons are also discussed, such as standards, discoverability, policies and incentives.
Data Commons - NHGRI Council, Feb 2017 - Vivien Bonazzi
The NIH is developing a Data Commons to enable data-driven biomedical research. The Data Commons will treat research data, methods, and papers as digital objects stored in a shared virtual space according to FAIR principles. It will provide tools and infrastructure for users to find, deposit, manage, share, and reuse these digital objects at scale. The goal is to accelerate discoveries, therapies, and cures by enabling researchers to leverage all available data and analysis tools. The Data Commons is being designed as an interoperable platform that can integrate with other data commons through common APIs, container technologies, metadata standards, and authentication.
Big Data as a Catalyst for Collaboration & Innovation - Philip Bourne
Big data is disrupting biomedical research through digitization of data sources. The National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative to support this disruption. BD2K funds various programs including data sharing policies, data science training, and the development of shared infrastructure and standards. This infrastructure includes the "Commons" which would provide discoverable, accessible, interoperable and reusable research objects to catalyze collaboration using open APIs and computing platforms. SRP could interact with BD2K through initiatives like open science competitions, data standards development, and leadership in trans-NIH big data efforts.
Data Commons - BD2K Fundamentals of Science, Feb 2017 - Vivien Bonazzi
Vivien Bonazzi leads the Data Commons efforts within NIH. She discussed how big data is characterized by volume, velocity, variety and veracity. She explained that data is becoming the central currency of a new digital economy and organizations must leverage their digital assets through platforms like the Data Commons to transform into digital enterprises. The Data Commons platform fosters development of a digital ecosystem by enabling interactions between producers and consumers of FAIR digital objects like data, software and publications.
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Shared - Robert Grossman
Data commons are emerging as a solution to challenges in analyzing and sharing large biomedical datasets. A data commons co-locates data with cloud computing infrastructure and software tools to create an interoperable resource for the research community. Examples include the NCI Genomic Data Commons and the Open Commons Consortium. The open source Gen3 platform supports building disease- or project-specific data commons to facilitate open data sharing while protecting patient privacy. Developing interoperable data commons can accelerate research through increased access to data.
The document discusses the proposed phases of the NIH Cloud Credits Model project. Phase 1 would focus on finalizing requirements, arranging initial cloud providers, and developing a basic investigator portal. Phase 2 would involve opening the credits model to experienced cloud users and distributing an initial batch of credits. Phase 3 would include a larger scale opening and distribution of credits, along with infrastructure improvements. Phase 4 would broadly scale up credit distribution and distribution analysis.
The document discusses a meeting agenda between GBIF (Global Biodiversity Information Facility) and Elsevier to discuss opportunities for collaboration around data publishing and sharing biodiversity data. Some key points discussed in the agenda include GBIF's role in facilitating open access to biodiversity data, its data publishing framework to encourage data mobilization and sharing, and potential areas of collaboration around simultaneous publishing of data and scholarly articles.
What is a Data Commons and How Can Your Organization Build One? - Robert Grossman
1. Data commons co-locate large biomedical datasets with cloud computing infrastructure and analysis tools to create shared resources for the research community.
2. The NCI Genomic Data Commons is an example of a data commons that makes over 2.5 petabytes of cancer genomics data available through web portals, APIs, and harmonized analysis pipelines (see the sketch after this list).
3. The Gen3 platform is an open source software stack for building data commons that can interoperate through common APIs and data models to support reproducible, collaborative research across projects.
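For a concrete sense of what access "through web portals, APIs" means here, below is a minimal sketch of querying the public GDC REST API from Python. The endpoint and response shape follow the public GDC documentation; the exact field names are assumptions and may have drifted.

import requests

GDC_PROJECTS = "https://api.gdc.cancer.gov/projects"

def list_projects(n=5):
    """Fetch the first n GDC projects with a few descriptive fields."""
    params = {
        "size": n,
        "fields": "project_id,name,primary_site",
        "format": "JSON",
    }
    resp = requests.get(GDC_PROJECTS, params=params, timeout=30)
    resp.raise_for_status()
    # The GDC API wraps results as {"data": {"hits": [...], ...}}.
    return resp.json()["data"]["hits"]

for hit in list_projects():
    print(hit.get("project_id"), "-", hit.get("name"))

Run against the live API, this returns a handful of project identifiers and names; the same pattern extends to the /files and /cases endpoints with JSON filters.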
The document discusses GBIF's (Global Biodiversity Information Facility) goals of facilitating open access to biodiversity data worldwide to support scientific research. GBIF shares over 200 million biodiversity records through data publishers and resources. The document proposes a Data Publishing Framework to improve data mobilization and cultural acceptance of open data sharing. It describes challenges to the framework and its potential impacts, such as increased data usage and quality through incentives like data papers and a Data Usage Index.
This is a talk that I gave at BioIT World West on March 12, 2019. The talk was called: A Gen3 Perspective of Disparate Data: From Pipelines in Data Commons to AI in Data Ecosystems.
Smith RDAP11 NSF Data Management Plan Case Studies - ASIS&T
MacKenzie Smith, MIT; NSF Data Management Plan Case Studies; RDAP11 Summit
The 2nd Research Data Access and Preservation (RDAP) Summit
An ASIS&T Summit
March 31-April 1, 2011, Denver, CO
In cooperation with the Coalition for Networked Information
http://asist.org/Conferences/RDAP11/index.html
Big Data in Biomedicine – An NIH Perspective - Philip Bourne
Keynote at the IEEE International Conference on Bioinformatics and Biomedicine, Washington DC, November 10, 2015.
https://cci.drexel.edu/ieeebibm/bibm2015/
The document discusses a global initiative to facilitate open access to scholarly resources and research data across boundaries by building a federation of registries. It provides use cases of how such a system could help postgraduate students, research project leaders, administrators, and ICT specialists discover and monitor globally accessible data relevant to their work. The proposed strategy is to create a "Register of Registries" that would enable consistent discovery services for finding data in collections through a standardized, interoperable model. An initial scoping meeting was held in 2007 and annual meetings since to develop the strategy.
Micah Altman, Harvard; Policy-based Data Management
The 2nd Research Data Access and Preservation (RDAP) Summit
An ASIS&T Summit
March 31-April 1, 2011, Denver, CO
In cooperation with the Coalition for Networked Information
http://asist.org/Conferences/RDAP11/index.html
Integration of research literature and data (InFoLiS) - Philipp Zumstein
Talk at CNI 2015 Spring Membership Meeting in Seattle on April 14th, 2015, see http://www.cni.org/events/membership-meetings/upcoming-meeting/spring-2015/
Abstract: The goal of the InFoLiS project is to connect research data and publications. Links between data and literature are created automatically by means of text mining and made available as Linked Open Data (LOD) for seamless integration into different retrieval systems. This enables scientists to directly access information about corresponding research data in a literature information system, and, vice versa, it is possible to directly find different interpretations and analyses in the literature of the same research data. In our talk, we will describe our methods for generating the links and give insight into the Linked Data infrastructure including the services we are currently building. Most importantly, we will detail how our solutions can be used by other institutions and invite all interested participants to discuss with us their ideas and thoughts on the requirements for these services to ensure broad interoperability with existing systems and infrastructures. InFoLiS is a joint project by the GESIS – Leibniz Institute for the Social Sciences, Cologne, Mannheim University Library, and Mannheim University supported by a grant from the DFG – German Research Foundation.
The document describes the DATS (Data Tag Suite) model, which aims to provide a standardized way to index datasets through a community effort. The DATS model was developed by combining use cases and existing data schemas, and represents datasets and their metadata in a scalable way. It focuses on key descriptors like authors, datasets, publications, and funding to enable discoverability. The DATS model is serialized in JSON and JSON-LD using schema.org to increase visibility, accessibility, and search engine ranking. It is being adopted by databases and aligned with other bioscience metadata efforts.
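To make the serialization concrete, here is a minimal, illustrative schema.org "Dataset" record in JSON-LD, built from Python. This is a sketch in the spirit of DATS, not the actual DATS schema; every field value below is an assumption for illustration only.

import json

record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example biomedical dataset",             # hypothetical
    "identifier": "https://doi.org/10.xxxx/example",  # hypothetical DOI
    "creator": [{"@type": "Person", "name": "A. Researcher"}],
    "funder": {"@type": "Organization", "name": "NIH"},
    "citation": "Example et al. (2016)",              # linked publication
}

# Serialize to JSON-LD, the form search engines can index.
print(json.dumps(record, indent=2))

Embedding a record like this in a dataset landing page is what makes the dataset visible to search engines, per the discoverability goal described above.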
The NSF DataNet Program aims to create exemplar data infrastructure organizations called DataNet Partners to provide researchers with access to data and advance research. SEAD is one such DataNet Partner that provides lightweight data services for sustainability science. It acts as an active content repository and curation service, and is developing tools for community exploration of data. The current focus is on an end-user workshop, conference demonstrations, and interface redesign to refine models for supporting the full lifecycle of research data objects.
Crossing the Analytics Chasm and Getting the Models You Developed Deployed - Robert Grossman
There are two cultures in data science and analytics: those that develop analytic models and those that deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including using analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
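A toy illustration of the develop/deploy split described above: the modeling side exports a model as a declarative JSON document, and a tiny "analytic engine" scores it without any training code. This mimics the role PMML and PFA play but is deliberately not either format, and the model and numbers are made up.

import json
import math

# Development side: export a (hypothetical) logistic-regression model.
model_doc = json.dumps({
    "type": "logistic_regression",
    "intercept": -1.2,
    "coefficients": {"age": 0.03, "biomarker": 0.9},
})

# Deployment side: a minimal scoring engine that only reads the document.
def score(doc, record):
    model = json.loads(doc)
    z = model["intercept"] + sum(
        w * record.get(name, 0.0) for name, w in model["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic link

print(score(model_doc, {"age": 50, "biomarker": 1.1}))  # ~0.78

The point is the contract: anything that can read the document can score the model, independently of the tooling that produced it.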
This document summarizes Philip Bourne's presentation on data science at NIH with an emphasis on health policy and management. It discusses how digitization, deception, disruption, and other trends are transforming biomedical research and healthcare. It outlines NIH's Precision Medicine Initiative to build a national research cohort of over 1 million volunteers. It also describes NIH's Office of Biomedical Data Science, whose mission is to accelerate biomedical research through an open digital ecosystem using data science. Key goals and programs discussed include the Big Data to Knowledge initiative and the Mobile Sensor Data-to-Knowledge Center of Excellence.
The Transformation of Systems Biology Into A Large Data Science - Robert Grossman
Systems biology is becoming a data-intensive science due to the exponential growth of genomic and biological data. Large projects now produce petabytes of data that require new computational infrastructure to store, manage, and analyze. Cloud computing provides elastic resources that can scale to support the increasing data needs of systems biology. Case studies show how clouds are used for large-scale data integration and analysis, running combinatorial analysis over genomic marks, and enabling reanalysis of biological data through elastic virtual machines. The Open Cloud Consortium is working to provide open cloud resources for biological and biomedical research through testbeds and proposed bioclouds.
Talk given at the "Cloud Computing for Systems Biology" workshop - Deepak Singh
This document discusses the role of cloud computing in biology and big data. It describes how cloud platforms like Amazon Web Services provide scalable, cost-effective and reliable infrastructure for storing and analyzing large biological datasets. The document outlines how researchers are using cloud computing to collaborate on projects, run computationally intensive analyses, and develop new tools and applications for processing life sciences data in the cloud.
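As a minimal sketch of this pattern, the snippet below lists objects in a public genomics bucket on S3 with anonymous access. The bucket name "1000genomes" refers to the AWS Open Data mirror of the 1000 Genomes Project and should be treated as an assumption to verify against the registry.

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) access works for public Open Data buckets.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="1000genomes", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

The same anonymous-access pattern applies to other registry-listed public datasets, which is part of what makes cloud-hosted reference data convenient for collaboration.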
Big Data, Computational Biology & the Future of Strategic Planning for Research - NBBJDesign
The advent of computational biology in the era of "big data" is triggering a dramatic change in the strategic capital planning process and in metrics for space allocation and utilization for translational science. In this presentation, Andy Snyder, Principal and NBBJ's Science & Education Practice leader, and Bruce Stevenson, VP of Research Operations at Nationwide Children's Hospital, chart new relationships between strategic planning, programming, facility planning, and scientific workplace features for biomedical research and translational medicine. The presentation sets out new best practices for navigating limited funding resources while preparing for new science directions and workforce needs, research space requirements, and advancements in scientific equipment, and the presenters identify new ways to leverage data, metrics, analytical processes, and tools for improved program/infrastructure alignment.
The Future of Research (Science and Technology) - Duncan Hull
This document summarizes the key trends in modern scientific research, including the rise of data-intensive science, collaborative and distributed research, and open science. It discusses how research is becoming more data-driven and dependent on large datasets. It also notes the growth of virtual and distributed collaboration between researchers. Finally, it outlines some of the implications for libraries and services to support reproducible, open, and data-driven scientific research.
The document discusses how information technology has profoundly impacted how we live and work over the past 20-30 years. Research has accelerated due to these changes with implications for both healthcare providers and patients. We are now entering an era where the world's knowledge is freely available at our fingertips online. This era of open data has the potential to decentralize and democratize information. It is predicted that institutions will become seamless digital enterprises and healthcare will become more personalized and focused on preventative care as big data and technologies like 3D printing continue to disrupt many industries.
The Philosophy of Big Data is the branch of philosophy concerned with the foundations, methods, and implications of big data: the definitions, meaning, conceptualization, knowledge possibilities, truth standards, and practices in situations involving very large data sets that are big in volume, velocity, variety, veracity, and variability.
Building a Business on Hadoop, HBase, and Open Source Distributed Computing - Bradford Stephens
This is a talk on a fundamental approach to thinking about scalability, and how Hadoop, HBase, and Lucene are enabling companies to process amazing amounts of data. It's also about how Social Media is making the traditional RDBMS irrelevant.
The Crypto Enlightenment: Social Theory of Blockchains - Melanie Swan
Text Write-up: http://futurememes.blogspot.com/2015/10/crypto-enlightenment-social-theory-of.html
Introduction
What is Bitcoin, blockchain, decentralization?
Stakes: Transition from labor economy to actualization economy
Crypto Enlightenment
Rethinking Authority (Self, Society)
Philosophy of Immanence (open-ended upside)
Theory of Crypto Flourishing
Scarcity as a social pathology
Abundance theory of Flourishing
Practicalities and extensive blockchain applications
The Future of Research - Data and the Rise of Digital Scholarship presents the trends that stand to have a significant impact on the changing face of academic publishing and scholarly research. As millions of connected devices come online and an unprecedented volume of information moves into digitized formats, it is estimated that less than 1 percent of this data has been analyzed. This report presents strategic insights for how researchers can get the most out of their data while keeping a human perspective at heart, and how to concisely and effectively present insights to an information-overloaded reader.
This document summarizes an update on the Big Data to Knowledge (BD2K) initiative at the National Institutes of Health (NIH). It discusses progress made in the first year of BD2K funding in three key areas: advancing data science research through centers and targeted awards; sharing data and software through the development of indexing tools and standards; and expanding training programs. It outlines funding amounts and recipient numbers for fiscal year 2015. Future plans are outlined through 2021 with the goals of further developing tools and applications, expanding the data sharing commons, and increasing training and sustainability efforts.
Presentation at the Department of Health and Human Services on October 17, 2014, to introduce agencies outside of NIH to the development of the Commons concept.
Data Harmonization for a Molecularly Driven Health System - Warren Kibbe
Seminar for Dr. Min Zhang's Purdue Bioinformatics Seminar Series. Touched on learning health systems, the Gen3 Data Commons, the NCI Genomic Data Commons, Data Harmonization, FAIR, and open science.
The document discusses the NIH's efforts to create a modernized, integrated, and FAIR biomedical data ecosystem. It outlines the NIH Office of Data Science Strategy's goals of optimizing data infrastructure and management, developing tools and the workforce, and ensuring stewardship and sustainability. It describes specific initiatives like STRIDES, the Generalist Repository Ecosystem Initiative, and AIM-AHEAD which aim to improve data sharing, train researchers, and address health disparities through AI. The overall goal is to make biomedical data more accessible, interoperable, and useful to advance biomedical research.
- The document discusses challenges related to biomedical data including that data is growing rapidly, stored across silos, and expensive to maintain while demands for sharing are increasing. It also notes a lack of data science skills.
- Solutions explored include developing the NIH Commons, which would integrate disparate cloud initiatives using BD2K standards to make data findable, accessible, interoperable and reusable. This could enable new insights from aggregate analysis across datasets.
- A 3-year BD2K-sponsored pilot of the Commons is underway to address questions around discoveries, productivity, reproducibility and cost-effectiveness compared to current approaches. The pilot involves moving model organism databases to the Commons as a test case.
Data Harmonization for a Molecularly Driven Health System - Warren Kibbe
Maximizing the value of data, computing, and data science in an academic medical center, or 'towards a molecularly informed Learning Health System'. Given in October at the University of Florida in Gainesville.
The document summarizes NIH's approach to data science and the ADDS mission. It discusses establishing a data ecosystem through community, policy, and infrastructure. The goals are to foster sustainability, efficiency, collaboration, reproducibility, and accessibility. NIH plans to seed the ecosystem through existing resources and funding. Example initiatives include establishing a data commons, standards, and training programs to develop a diverse data science workforce. The overall aim is to support a "digital enterprise" that enhances biomedical research and health outcomes.
NDS Relevant Update from the NIH Data Science (ADDS) Office - Philip Bourne
This document summarizes a presentation given by Dr. Phil Bourne on the National Data Service (NDS) initiative and the NIH Office of the Associate Director for Data Science (ADDS). The presentation discusses how NDS can succeed by defining clear problems, starting with pilots, and developing sustainable applications. It then outlines ADDS's mission to accelerate biomedical research through an open data ecosystem. ADDS's strategy focuses on discovery, workforce development, policy, leadership, and sustainability through developing a shared "Commons" of digital research objects in the cloud. Pilot projects are evaluating this Commons framework and populating it with datasets and tools.
Merritt’s micro-services-based architecture provides a number of options for easy integration with diverse external discovery services with specific disciplinary focus on scientific data sharing. By removing many of the barriers faced by researchers interested in data publication, the integrations of Merritt with DataShare and Research Hub exemplify a new service model for cooperative and distributed data sharing. The widespread adoption of such sharing is critical to open scientific inquiry and advancement.
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear... - dkNET
dkNET provides a single portal for discovering over 3,500 biomedical research resources and datasets. It aims to make these resources findable, accessible, interoperable, and reusable in accordance with the FAIR principles. The portal contains three main sections for browsing community resources, additional resources, and literature. It utilizes faceted searching and provides analytics and notifications to help users track changes to resources over time.
Building an Intelligent Biobank to Power Research Decision-Making - Denodo
This presentation belongs to the workshop: "Building an Intelligent Biobank to Power Research Decision-Making", from ISBER 2015 Annual Meeting by Lori A. Ball (Chief Operating Officer, President of Integrated Client Solutions at BioStorage Technologies, Inc), Brian Brunner (Senior Manager, Clinical Practice at LabAnswer) and Suresh Chandrasekaran (Senior Vice President at Denodo).
The workshop covers three topic areas:
- Research sample intelligence: the growing need for Global Data Integration (Biobank Sample and Data Stakeholders).
- Building a research data integration plan and cloud sourcing strategy (data integration).
- How data virtualization works and the value it delivers (a data virtualization introduction, solution portfolio and current customers in Life Sciences industry).
The biomedical R&D environment is increasingly dependent on data meta-analysis and bioinformatics to support research advancements. The integration of biorepository sample inventory data with biomarker and clinical research information has become a priority to R&D organizations. Therefore, a flexible IT system for managing sample collections, integrating sample data with clinical data and providing a data virtualization platform will enable the advancement of research studies. This workshop provides an overview of how sample data integration, virtualization and analytics can lead to more streamlined and unified sample intelligence to support global biobanking for future research.
This is an overview of the Data Biosphere Project, its goals, its architecture, and the three core projects that form its foundation. We also discuss data commons.
The NIH as a Digital Enterprise: Implications for PAG - Philip Bourne
The document discusses the NIH's vision of becoming a digital enterprise to enhance biomedical research. It outlines how research is becoming more digital and data-driven. The NIH aims to foster open sharing of data and tools through its Commons platform to facilitate collaboration and reproducibility. It also stresses the importance of training the next generation of data scientists to enable the digital enterprise. The end goal is to accelerate discovery and improve health outcomes through more integrated and data-driven research.
Philip Bourne presented on the NIH's Big Data to Knowledge (BD2K) initiative and the Associate Director for Data Science (ADDS) office. The goals of BD2K are to use data science to accelerate biomedical research and enhance health outcomes. BD2K supports various centers, projects, and training programs related to data discovery, standards, cloud computing, sustainability, and workforce development. The ADDS office oversees BD2K and aims to establish a sustainable data science ecosystem and well-trained workforce to enable major scientific discoveries through data-driven research.
NCI Cancer Research Data Commons - Overview - imgcommcall
The NCI Cancer Research Data Commons aims to enable sharing of diverse cancer research data across institutions by providing easy access to data stored in domain-specific repositories through a common authentication and authorization mechanism. It utilizes a framework of reusable components including data nodes, a cancer data aggregator, and cloud resources to integrate genomic, imaging, proteomic, and other data types while controlling access. The goals are to facilitate discovery and analysis tools as well as sustainably sharing data publicly to advance cancer research.
STI 2022 - Generating large-scale network analyses of scientific landscapes i... - Michele Pasin
The growth of large, programmatically accessible bibliometrics databases presents new opportunities for complex analyses of publication metadata. In addition to providing a wealth of information about authors and institutions, databases such as those provided by Dimensions also provide conceptual information and links to entities such as grants, funders, and patents. However, data is not the only challenge in evaluating patterns in scholarly work: these large datasets can be challenging to integrate, particularly for those unfamiliar with the complex schemas necessary for accommodating such heterogeneous information, and those most comfortable with data mining may not be as experienced in data visualisation. Here, we present an open-source Python library that streamlines the process of accessing and diagramming subsets of the Dimensions on Google BigQuery database and demonstrate its use on the freely available Dimensions COVID-19 dataset. We are optimistic that this tool will expand access to this valuable information by streamlining what would otherwise be multiple complex technical tasks, enabling more researchers to examine patterns in research focus and collaboration over time.
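For flavor, the kind of query such a library streamlines can also be issued directly with the google-cloud-bigquery client. The table path for the public Dimensions COVID-19 dataset is an assumption to check against current documentation, and running it requires GCP credentials and a billing project.

from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP project/credentials

sql = """
SELECT year, COUNT(*) AS n_publications
FROM `covid-19-dimensions-ai.data.publications`  -- assumed dataset path
GROUP BY year
ORDER BY year
"""

# Submit the query and iterate over result rows.
for row in client.query(sql).result():
    print(row.year, row.n_publications)

The library described above wraps this boilerplate, along with the subsequent network-diagramming steps.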
The document discusses recommendations from a workshop on peer review of research data. It focuses on three key areas:
1. Connecting data review with data management planning by requiring data sharing plans, ensuring adequate funding for data management, and refusing publication without clear data access.
2. Connecting scientific and technical review with data curation by linking articles and data with versioning, avoiding duplicate review efforts, and addressing issues found in data.
3. Connecting data review with article review by requiring methods/software information, providing review checklists, ensuring data access for reviewers, and permanent dataset identifiers from repositories.
1. George A. Komatsoulis, Ph.D.
National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health
U.S. Department of Health and Human Services
2. Mission: "To seek fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to enhance health, lengthen life, and reduce illness and disability."
Composed of 27 Institutes and Centers
Annual budget = $30.3B
80% of the NIH budget goes to about 50,000 grants
5. Data-scale comparisons (slide figures): a sensor stream of 500 EB/day, of which 69 TB/day is stored; a collection of 14 EB/day, of which 1 PB/day is stored; and a total dataset of 14 PB, i.e., storing an average of 3.3 TB/day for 10 years.
6.
7. Launched to support biomedical data science research
Support for multiple facets of data science:
BD2K Centers
Data and Software Discovery
Standards and Interoperability
Training and Workforce Development
The Commons
Led by Dr. Phil Bourne, NIH Associate Director for Data Science
8. Diagram of the current model: each university maintains its own local data, locally developed software, publicly available software, and local storage and compute resources, alongside public data repositories.
9. Is scalable and exploits new computing models
Is more cost effective given digital growth
Simplifies sharing digital research objects such as data, software, metadata and workflows
Makes digital research objects more FAIR: Findable, Accessible, Interoperable and Reusable
DOES NOT replace existing, well-curated databases (Phil Bourne, 2014)
14. The Commons: Business Model (diagram). The Commons is implemented as a federation of 'conformant' cloud providers (A, B, and C) and HPC environments, and is funded primarily by providing credits to investigators. NIH provides credits to the researcher; the researcher uses those credits with a provider, provides digital objects to the Commons, and retrieves/uses digital objects from it. A Discovery Index indexes the Commons so that researchers can find objects. As an option, NIH can also fund providers directly to support NIH-directed resources.
15. Cost effective - only pay for IT support used
Drives competition - better services at lower cost
Supports data sharing by driving science into the Commons
Facilitates public-private partnership
Scalable to most categories of data expected in the next 5 years
16. Novelty: never been tried, so we don't have data about the likelihood of success.
Cost models: predicated on stable or declining prices among providers. True for the last several years, but we can't guarantee that it will continue, particularly if there is significant consolidation in the industry.
Service providers: predicated on service providers willing to make the investment to become conformant. Market research suggests 3-5 providers within 2-3 months of program launch.
Persistence: the model is 'Pay As You Go', which means if you stop paying it stops going, giving investigators an unprecedented level of control over what lives (or dies) in the Commons.
17. Commons credit workflow (diagram, steps 1-7): the investigator requests credits; the request is reviewed and NIH approves the credit request; NIH directs a reseller of cloud services to distribute credits to the investigator; the reseller instructs the cloud provider to put the credits on the investigator's account; and the investigator, through their institution, uses the credits with conformant providers (A, B, or C) in the Commons.
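A toy walk-through of that credit flow, to make the moving parts explicit. All names, amounts, and the approval rule are hypothetical; this models only the bookkeeping implied by the diagram, not any real NIH system.

from dataclasses import dataclass

@dataclass
class Account:
    owner: str
    credits: float = 0.0

nih_pool = Account("NIH credit pool", credits=1_000_000)
investigator = Account("Investigator at University X")

# Steps 1-3: request and approval (modeled as a trivial budget check).
requested = 50_000
approved = requested <= nih_pool.credits

# Steps 4-6: NIH directs the reseller, which credits the investigator's
# account at a conformant cloud provider.
if approved:
    nih_pool.credits -= requested
    investigator.credits += requested

# Step 7: the investigator spends credits on storage and compute.
investigator.credits -= 12_500
print(nih_pool.credits, investigator.credits)  # 950000 37500.0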
18. Minimum set of requirements for:
Business relationships (reseller, investigators)
Interfaces (upload, download, manage, compute)
Capacity (storage, compute)
Networking and connectivity
Information assurance
Authentication and authorization
Still need to work out details of how to manage approval of conformance
A conformant cloud ≠ an IaaS provider
Draft specification out for comment among vendors
19. Phase 0: Build the plumbing
Phase 1: Pilot the model on a small number of investigators experienced with cloud computing, probably within the context of BD2K awards
Phase 2: Open the Commons credit process to grantees from a subset of NIH Institutes and Centers
Phase 3: Open the process to all NIH grantees
21. Diagram: three cloud environments, each providing secure computational capacity with pre-loaded data, serving the NCI Genomics Consortium and backed by the NCI genomic data repositories.
22. NIH Office of ADDS
Vivien Bonazzi, Ph.D.
Philip Bourne, Ph.D.
Michelle Dunn, Ph.D.
Mark Guyer, Ph.D.
Jennie Larkin, Ph.D.
Leigh Finnegan
Beth Russell
NCBI
Dennis Benson, Ph.D.
Alan Graeff
David Lipman, MD
Jim Ostell, Ph.D.
Don Preuss
Steve Sherry
Editor's Notes
1965 – Generation capacity < 100 aa's/year/person => Dayhoff creates the 1-letter amino acid code to simplify computing in the punch-card era
1977 – Sanger and Maxam-Gilbert sequencing invented. By the mid-1980s, a two-order-of-magnitude increase in production (maybe 10-20K bases total, 2-3K finished/year)
1986 – Development of dye-based sequencing; the ABI 370A reaches 2,000 bases/day/instrument by the mid-1990s
1996 – Development of DNA microarrays: 2-dye 100K chips => 200K/chip/day
2000s – Next-gen sequencing: 100M's/day
This has worked well for a long time, but:
Every investigator has their own copy of the data!
Every investigator needs the computational resources to do whatever calculation they want to do.
Making locally developed software work outside of the local institution is often a challenge. Everyone likes the Broad Firehose, but only Broad has made it work!
Consider the TCGA Data Set (2.5 PB)
Storage and Data Protection cost approximately $2,000,000 per year per copy
Constant network updates at universities
2.5 PB = 20,000,000 Gb = 23 days at 10 Gb/sec (see the sanity check after this note)
Redundant computing environments
Most HPC environments are either drastically over or under utilized
This is an issue with more ‘normal’ sized data sets as well
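A quick sanity check of the transfer arithmetic in the note above, assuming decimal units and a fully saturated, dedicated 10 Gb/s link (real-world transfers would be slower):

data_gigabits = 2.5 * 1_000_000 * 8   # 2.5 PB -> GB -> gigabits
link_gbps = 10

seconds = data_gigabits / link_gbps
days = seconds / 86_400               # seconds per day
print(f"{data_gigabits:,.0f} Gb ~ {days:.1f} days at {link_gbps} Gb/s")
# -> 20,000,000 Gb ~ 23.1 days at 10 Gb/s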
Minimum Requirements:
The business relationship is to allow distribution and billing of credits and to ensure that liability issues are resolved. The investigator that puts a digital object in the Commons is the one that retains the liability associated with its use.
Interfaces – would need to be open, but not necessarily open-source. Requires support for basic operations. In addition, the environment has to be open to all, so a private environment behind a university firewall won't work.
Identifiers and metadata: tied together, and together they enable researchers to search for and find resources.
Networking and connectivity: make sure that content is accessible; require connection to the commodity internet and Internet2, but the key element from the investigator's point of view is a free egress tier for academics.
The environment is secure.
A&A: must support InCommon because most NIH investigators have it. Minimizes the hassle of granting access to collaborators across multiple platforms.
Approval of clouds: self-certify vs. NIH-certify vs. third-party certify. In early test cases, we may simply say 'FedRAMPed'.
Cloud vs. IaaS: some IaaS providers (AWS comes to mind) may be uninterested in providing the 'conformant' layer but support other companies that provide these services using an AWS backend. There are already exemplars of this: Seven Bridges Genomics and the Cancer Genomics Cloud Pilots are all software layers over an IaaS provider.