This document discusses the NIH's Big Data to Knowledge (BD2K) initiative and its efforts to create an NIH Data Catalog. It describes how the Data Catalog will bring datasets into the research ecosystem by making them discoverable, citable, and linked to the scientific literature. It also covers common metadata elements, the mapping of these elements to existing standards such as DataCite and Dryad, and how datasets can be cited following ICMJE guidelines. Proposed next steps include determining how many NIH datasets currently reside in repositories, how many are unique and not housed in any repository, and how to manage those unhoused datasets.
3. Big Data 2 Knowledge
Frameworks and Standards
Data Catalog
Policies and Data Sharing
4. Data Sharing Repositories
All NIH-funded data sharing repositories that are open to receiving data submissions from any researcher internationally, whether they are funded by the NIH or not
5. Data Sharing Policies
All data sharing policies that exist within the NIH that assist researchers in developing a plan to share their research data
8. Datasets are discoverable
Each dataset will be identified via a Data Unique Identifier (DUID), both in the NIH Data Catalog and in the associated journal
Datasets specified in the catalog using MeSH (creation of a dataset Publication Type)
9. Datasets are citable
NIH Data Catalog produces citable data publications
Citability + proper credit = incentives to submit and publish data
10. Datasets are linked to the literature
Data citations in the NIH Data Catalog are linked to their associated scientific publications in PubMed/PubMed Central
11. Datasets become information in the research ecosystem
Analysis of trends, impact of data, effect on NIH research funding
12. Common Metadata Elements
How do current data repositories describe their data?
16. Building a Taxonomy of Metadata Descriptors
• Title information
o Name Title
o Collection Type
o Type of Deposit
o Service Name
o Image File Name
o File Name
o Data Collection Title
o Dataset Title
o Dataset Name and Accession
o Submission Title
o Lab Data Title
o Research Objective
• Authorship
o Attribution
o Authors
o Creator(s)
o Data Authors
o Data Owner
o Data Attribution
o Contributor(s)
o PI Name(s)
o Investigator(s)
o Sequence Authors
o Responsible Party
o Data Provider
o Submitter
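The taxonomy above can be read as a lookup from repository-specific field labels to a shared category. The following is a minimal sketch of that idea, not the working group's implementation: the names `TAXONOMY`, `build_index`, and `normalize` are hypothetical, only a few labels from each column are included, and the "unmapped" bucket is an assumption about how unrecognized fields might be handled.

```python
# Hypothetical subset of the slide's taxonomy: category -> labels seen in repositories.
TAXONOMY = {
    "Title information": [
        "Name Title", "Data Collection Title", "Dataset Title",
        "Submission Title", "Lab Data Title",
    ],
    "Authorship": [
        "Authors", "Creator(s)", "Data Owner", "Contributor(s)",
        "PI Name(s)", "Investigator(s)", "Submitter",
    ],
}


def build_index(taxonomy):
    """Invert the taxonomy into a case-insensitive label -> category lookup."""
    return {
        label.lower(): category
        for category, labels in taxonomy.items()
        for label in labels
    }


INDEX = build_index(TAXONOMY)


def normalize(record):
    """Group a repository record's fields under the shared categories.

    Fields whose label is not in the taxonomy are kept under "unmapped"
    (an assumption; the slides do not say how such fields are treated).
    """
    grouped = {}
    for label, value in record.items():
        category = INDEX.get(label.strip().lower(), "unmapped")
        grouped.setdefault(category, {})[label] = value
    return grouped
```

For example, `normalize({"Dataset Title": "X", "PI Name(s)": "Y"})` files the first field under "Title information" and the second under "Authorship".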
17. Mapping Metadata to Existing Standards
[Diagram: Common Metadata Elements]
18. Mapping to DataCite
• Common Metadata Elements
o Data Unique Identifier
o Authorship
o Data Title
o Data Location
o Data Completion/Release Date
o Data Descriptors (controlled vocabulary)
o Data Submitter/Affiliation
o Date Information
o Data File Types
o Related Resources
o Access Data Restrictions
o Data Description (narrative)
• DataCite Metadata Schema
o Identifier
o Creator
o Title
o Publisher
o PublicationYear
o Subject
o Contributor
o Date
o Resource Type
o RelatedIdentifier
o Rights
o Description
o Size, Format, Version
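One way to make this mapping concrete is a simple crosswalk table. The sketch below assumes that the two lists on the slide correspond item by item in the order shown (the slide does not state the pairs explicitly), which leaves DataCite's Size, Format, and Version without a counterpart. The names `COMMON_TO_DATACITE` and `to_datacite` are hypothetical, and the output is a plain dictionary rather than valid DataCite XML.

```python
# Assumed pairing: the n-th common element maps to the n-th DataCite property
# as listed on the slide. This ordering-based pairing is an assumption.
COMMON_TO_DATACITE = {
    "Data Unique Identifier": "Identifier",
    "Authorship": "Creator",
    "Data Title": "Title",
    "Data Location": "Publisher",
    "Data Completion/Release Date": "PublicationYear",
    "Data Descriptors (controlled vocabulary)": "Subject",
    "Data Submitter/Affiliation": "Contributor",
    "Date Information": "Date",
    "Data File Types": "Resource Type",
    "Related Resources": "RelatedIdentifier",
    "Access Data Restrictions": "Rights",
    "Data Description (narrative)": "Description",
}


def to_datacite(record):
    """Rename a record's common elements to DataCite property names.

    Returns (mapped, unmapped): a dict keyed by DataCite property, and the
    list of input elements that have no entry in the crosswalk.
    """
    mapped, unmapped = {}, []
    for element, value in record.items():
        prop = COMMON_TO_DATACITE.get(element)
        if prop is None:
            unmapped.append(element)
        else:
            mapped[prop] = value
    return mapped, unmapped
```

Returning the unmapped elements separately keeps gaps in the crosswalk visible instead of silently dropping fields.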
19. Mapping to Dryad
• Common Metadata Elements
o Data Unique Identifier
o Authorship
o Data Title
o Data Location
o Data Completion/Release Date
o Data Descriptors (controlled vocabulary)
o Data Submitter/Affiliation
o Date Information
o Data File Types
o Related Resources
o Access Data Restrictions
o Data Description (narrative)
• Dryad Metadata Schema
o dcterms:identifier/Data Package Identifier
o dcterms:creator/Author
o dcterms:title/Data Package Title
o dcterms:relation/Location of related content outside of Dryad
o dcterms:available/Date Available
o dcterms:description
o dcterms:subject/Keyword
o dwc:scientificName
o dcterms:references/Associated Dryad publication record ID
20. Mapping to MEDLINE
• Common Metadata Elements: Proposed Definition
o Data Unique Identifier: A unique ID string that identifies a dataset within the catalog
o Author: Individuals involved in producing or contributing to data
o Affiliation: Affiliation of each author associated with the appropriate author occurrence
o Data Title: Name or title by which the dataset is known
o Data Location: The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number
o Date: The year, month, and day when the data was made available
o Data Description (structured narrative): Structured narrative description for efficient indexing
o Data Descriptors: Metadata describing data contents using controlled labels (e.g. Organism, Disease, Perturbation, Gender, Cell type, etc.)
o PMIDs: Identifier that will link dataset to associated article(s)
o Availability/Accessibility of Data: Whether the data is available to use and how to access it
o Award Number: Grant/award numbers associated with the dataset
o Version: The version of the dataset (represented as a unique record)
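The proposed definitions translate fairly directly into a record type. The sketch below is one possible shape, under stated assumptions: the class and field names (`CatalogRecord`, `Author`, `empty_elements`) are hypothetical, the choice of which elements are required is a guess, and dates are kept as free text because the slides do not fix a format.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class Author:
    name: str
    affiliation: str = ""  # affiliation tied to this author occurrence


@dataclass
class CatalogRecord:
    data_unique_identifier: str          # unique ID string within the catalog
    authors: List[Author]
    data_title: str
    data_location: str                   # holding entity plus accession number
    date: str                            # when the data was made available
    data_description: str = ""           # structured narrative
    data_descriptors: Dict[str, str] = field(default_factory=dict)  # e.g. Organism
    pmids: List[str] = field(default_factory=list)
    availability: str = ""
    award_numbers: List[str] = field(default_factory=list)
    version: str = ""

    def empty_elements(self) -> List[str]:
        """Names of elements that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A helper such as `empty_elements` could support curation by showing which parts of a record still need to be completed before indexing.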
21. Data Citation - ICMJE
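Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. SI: dbGaP/pht002543.v2.p1
• Elements of the citation
o Author: Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D
o Data Title: Dental Caries: Whole Genome Association and Gene x Environment Studies
o Date data is submitted and paper is ready to publish: 2014 Jan
o NIH Data Catalog Volume (Issue): 1(1)
o Data Unique Identifier: DUID00001
o PMID Assigned to NIH Data Catalog Record: 22123456
o Secondary source ID (Link to actual dataset) / Data Location: dbGaP/pht002543.v2.p1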
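The citation pattern on this slide can be assembled mechanically from the metadata elements. The function below is a minimal sketch that reproduces the layout of the example citation only. The name `format_data_citation` and its parameters are hypothetical, and it does not attempt to cover the full ICMJE recommendations (for instance, truncating long author lists).

```python
def format_data_citation(authors, title, year, month, volume, issue,
                         duid, pmid=None, source_id=None):
    """Build a citation string in the layout of the slide's example.

    authors:   list of names already in "Surname Initials" form
    duid:      Data Unique Identifier, placed where page numbers usually go
    source_id: secondary source ID such as "database/accession" (optional)
    """
    parts = [
        ", ".join(authors) + ".",
        title.rstrip(".") + ".",
        "NIH Data Catalog.",
        f"{year} {month};{volume}({issue}):{duid}.",
    ]
    if pmid:
        parts.append(f"PubMed PMID: {pmid}.")
    if source_id:
        parts.append(f"SI: {source_id}")
    return " ".join(parts)


citation = format_data_citation(
    authors=["Marazita ML", "Weynat RJ", "Feingold E",
             "Weeks D", "Crout R", "McNeill D"],
    title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    year=2014, month="Jan", volume=1, issue=1,
    duid="DUID00001", pmid="22123456",
    source_id="dbGaP/pht002543.v2.p1",
)
print(citation)
```

Called with the values from the slide, this yields the same string as the example citation.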
30. Next Steps
• Find out how many datasets are currently in NIH data sharing repositories
o How many datasets do these repositories process per year?
• How many datasets are unique and NOT housed in a repository?
o Search PubMed and PubMed Central and assign categories
• MeSH
o PT: Electronic Supplementary material
o SH: Statistical and numerical data
o MeSH: Databases, Factual
o Statistical Analysis – exclude datasets that already have a location
• How do we manage these unique datasets?
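The PubMed search step could start from a query string built out of the three categories listed above. This is a rough sketch under several assumptions: the function name `build_query` is hypothetical, the categories are combined with OR (the slide does not say how they combine), and the terms are copied from the slide as written, so they may need adjusting to PubMed's exact vocabulary. It relies on the PubMed field tags `[pt]` (publication type), `[sh]` (subheading), and `[mh]` (MeSH terms).

```python
def build_query(pub_types=(), subheadings=(), mesh_terms=()):
    """Join quoted terms, each with its PubMed field tag, using OR."""
    clauses = (
        [f'"{term}"[pt]' for term in pub_types]
        + [f'"{term}"[sh]' for term in subheadings]
        + [f'"{term}"[mh]' for term in mesh_terms]
    )
    return " OR ".join(clauses)


# Terms taken verbatim from the slide; not verified against PubMed.
query = build_query(
    pub_types=["Electronic Supplementary material"],
    subheadings=["Statistical and numerical data"],
    mesh_terms=["Databases, Factual"],
)
print(query)
```

Excluding datasets that already have a location would be a separate filtering step applied to the results and is not shown here.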
Big Data to Knowledge (BD2K) is an initiative to address how best to manage and utilize the large amounts of biomedical data that new technologies can generate (http://www.nih.gov/news/health/dec2012/od-07.htm). This initiative resulted from a set of recommendations from the Data and Informatics Working Group to the Advisory Committee to the Director, NIH. As part of its response to the recommendations, NIH has established a working group to develop plans to implement new programs to increase training in this area, and this working group intends to convene a workshop to discuss training and education needs in how to manage and utilize large, complex data sets. Prior to the workshop, NIH wishes to collect information and relevant materials that will help inform the discussions of the workshop participants.