Contributing to the Big Data to Knowledge Initiative at the NIH

Data Sharing and the NIH Data Catalog

Big Data to Knowledge (BD2K)

• Frameworks and Standards
• Policies and Data Sharing
• Data Catalog

Data Sharing Repositories

All NIH-funded data sharing repositories that accept data submissions from any researcher, internationally, whether or not that researcher is funded by the NIH

Data Sharing Policies

All NIH data sharing policies that help researchers develop a plan to share their research data

NIH Data Catalog

Bringing Data Into the Research Ecosystem

• Datasets are discoverable
o Each dataset will be identified by a Data Unique Identifier [DUID] (in the NIH Data Catalog and in the associated journal)
o Datasets are specified in the catalog using MeSH (creation of a dataset Publication Type)

• Datasets are citable
o The NIH Data Catalog produces citable data publications
o Citability + proper credit = incentives to submit and publish data

• Datasets are linked to the literature
o Data citations in the NIH Data Catalog are linked to their associated scientific publications in PubMed/PubMed Central

• Datasets become information in the research ecosystem
o Analysis of trends, impact of data, effect on NIH research funding

Common Metadata Elements

How do current data repositories describe their data?

Identifying Metadata Commonalities

• Date Information
• Data Description
• Authorship

Common Metadata Elements

Building a Taxonomy of Metadata Descriptors

• Title information
o Name Title
o Collection Type
o Type of Deposit
o Service Name
o Image File Name
o File Name
o Data Collection Title
o Dataset Title
o Dataset Name and Accession
o Submission Title
o Lab Data Title
o Research Objective

• Authorship
o Attribution
o Authors
o Creator(s)
o Data Authors
o Data Owner
o Data Attribution
o Contributor(s)
o PI Name(s)
o Investigator(s)
o Sequence Authors
o Responsible Party
o Data Provider
o Submitter
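
The taxonomy above can be read as a lookup from repository-specific field names to a common element. The following is a minimal Python sketch of that idea, not the method used by the presenters. The variant names come from the slide; the dictionary layout, the case-insensitive matching, and the function name `common_element` are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Variant field names as listed on the slide, grouped under the common
# element they describe. The data structure itself is illustrative.
TAXONOMY: Dict[str, List[str]] = {
    "Title information": [
        "Name Title", "Collection Type", "Type of Deposit", "Service Name",
        "Image File Name", "File Name", "Data Collection Title",
        "Dataset Title", "Dataset Name and Accession", "Submission Title",
        "Lab Data Title", "Research Objective",
    ],
    "Authorship": [
        "Attribution", "Authors", "Creator(s)", "Data Authors", "Data Owner",
        "Data Attribution", "Contributor(s)", "PI Name(s)", "Investigator(s)",
        "Sequence Authors", "Responsible Party", "Data Provider", "Submitter",
    ],
}

# Reverse index: lower-cased variant name -> common element.
_VARIANT_TO_ELEMENT: Dict[str, str] = {
    variant.lower(): element
    for element, variants in TAXONOMY.items()
    for variant in variants
}


def common_element(field_name: str) -> Optional[str]:
    """Return the common element for a repository field name, or None if unknown."""
    return _VARIANT_TO_ELEMENT.get(field_name.strip().lower())
```

A call such as `common_element("PI Name(s)")` returns `"Authorship"`, while a field name that is not in the taxonomy returns `None` and can be flagged for manual review.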

Mapping Metadata to Existing Standards

Common Metadata Elements

Mapping to DataCite

• Common Metadata Elements
o Data Unique Identifier
o Authorship
o Data Title
o Data Location
o Data Completion/Release Date
o Data Descriptors (controlled vocabulary)
o Data Submitter/Affiliation
o Date Information
o Data File Types
o Related Resources
o Access Data Restrictions
o Data Description (narrative)

• DataCite Metadata Schema
o Identifier
o Creator
o Title
o Publisher
o PublicationYear
o Subject
o Contributor
o Date
o Resource Type
o RelatedIdentifier
o Rights
o Description
o Size, Format, Version
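
A mapping like this is often implemented as a crosswalk table that renames fields from one schema to another. The sketch below shows one way to do that in Python. The slide lists the two vocabularies side by side without stating which element maps to which property, so every pairing in `COMMON_TO_DATACITE` is an assumption made for illustration, as are the names `COMMON_TO_DATACITE` and `to_datacite`.

```python
from typing import Dict, List, Tuple

# Hypothetical crosswalk. The element and property names come from the slide;
# the pairings are guesses and would need confirmation against the
# DataCite Metadata Schema documentation.
COMMON_TO_DATACITE: Dict[str, str] = {
    "Data Unique Identifier": "Identifier",
    "Authorship": "Creator",
    "Data Title": "Title",
    "Data Location": "Publisher",
    "Data Completion/Release Date": "PublicationYear",
    "Data Descriptors": "Subject",
    "Data Submitter/Affiliation": "Contributor",
    "Date Information": "Date",
    "Data File Types": "Format",
    "Related Resources": "RelatedIdentifier",
    "Access Data Restrictions": "Rights",
    "Data Description": "Description",
}


def to_datacite(record: Dict[str, object]) -> Tuple[Dict[str, object], List[str]]:
    """Rename known keys to DataCite property names.

    Returns the renamed record and the list of keys with no crosswalk entry.
    """
    mapped: Dict[str, object] = {}
    unmapped: List[str] = []
    for key, value in record.items():
        target = COMMON_TO_DATACITE.get(key)
        if target is None:
            unmapped.append(key)
        else:
            mapped[target] = value
    return mapped, unmapped
```

Returning the unmapped keys separately keeps gaps in the crosswalk visible instead of silently dropping fields.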

Mapping to Dryad

• Common Metadata Elements
o Data Unique Identifier
o Authorship
o Data Title
o Data Location
o Data Completion/Release Date
o Data Descriptors (controlled vocabulary)
o Data Submitter/Affiliation
o Date Information
o Data File Types
o Related Resources
o Access Data Restrictions
o Data Description (narrative)

• Dryad Metadata Schema
o dcterms:identifier/Data Package Identifier
o dcterms:creator/Author
o dcterms:title/Data Package Title
o dcterms:relation/Location of related content outside of Dryad
o dcterms:available/Date Available
o dcterms:description
o dcterms:subject/Keyword
o dwc:scientificName
o dcterms:references/Associated Dryad publication record ID

Mapping to MEDLINE

• Common Metadata Elements and proposed definitions
o Data Unique Identifier: a unique ID string that identifies a dataset within the catalog
o Author: individuals involved in producing or contributing to the data
o Affiliation: affiliation of each author, associated with the appropriate author occurrence
o Data Title: name or title by which the dataset is known
o Data Location: the name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number
o Date: the year, month, and day when the data was made available
o Data Description (structured narrative): structured narrative description for efficient indexing
o Data Descriptors: metadata describing data contents using controlled labels (e.g. Organism, Disease, Perturbation, Gender, Cell type)
o PMIDs: identifiers that link the dataset to its associated article(s)
o Availability/Accessibility of Data: whether the data is available to use and how to access it
o Award Number: grant/award numbers associated with the dataset
o Version: the version of the dataset (represented as a unique record)
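
The proposed elements can be pictured as the fields of a single catalog record. The following Python sketch is one possible representation, assuming simple string and list types. The class name `CatalogRecord`, the field names, and the helper `empty_elements` are hypothetical and do not describe an actual NIH Data Catalog schema.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class CatalogRecord:
    """Hypothetical record with one field per proposed metadata element."""
    data_unique_identifier: str = ""
    authors: List[str] = field(default_factory=list)
    affiliations: List[str] = field(default_factory=list)
    data_title: str = ""
    data_location: str = ""
    date: str = ""
    data_description: str = ""
    data_descriptors: List[str] = field(default_factory=list)
    pmids: List[str] = field(default_factory=list)
    availability: str = ""
    award_numbers: List[str] = field(default_factory=list)
    version: str = ""


def empty_elements(record: CatalogRecord) -> List[str]:
    """Return the names of elements that have not been filled in."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

A check like `empty_elements(record)` gives a quick view of how complete a submission is before it is indexed.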

Data Citation - ICMJE

Marazita ML, Weyant RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. SI: dbGaP/pht002543.v2.p1

Labeled parts of the citation:
• Author
• Data Title
• Date data is submitted and paper is ready to publish
• NIH Data Catalog Volume (Issue)
• Data Unique Identifier
• PMID assigned to NIH Data Catalog record
• Secondary source ID (link to actual dataset)
• Data Location
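
The citation pattern above can be assembled mechanically from the record's fields. The following Python sketch reproduces the layout of the example citation. The function name `format_data_citation` and its parameters are hypothetical, and the punctuation rules are inferred from the single example on the slide, not from a published specification.

```python
from typing import List


def format_data_citation(authors: List[str], title: str, year: int, month: str,
                         volume: int, issue: int, duid: str, pmid: str,
                         secondary_id: str) -> str:
    """Build a data citation string in the layout shown on the slide."""
    author_part = ", ".join(authors)
    return (
        f"{author_part}. {title}. NIH Data Catalog. "
        f"{year} {month};{volume}({issue}):{duid}. "
        f"PubMed PMID: {pmid}. SI: {secondary_id}"
    )


citation = format_data_citation(
    authors=["Marazita ML", "Weyant RJ", "Feingold E",
             "Weeks D", "Crout R", "McNeill D"],
    title=("Dental Caries: Whole Genome Association and "
           "Gene x Environment Studies"),
    year=2014, month="Jan", volume=1, issue=1,
    duid="DUID00001", pmid="22123456",
    secondary_id="dbGaP/pht002543.v2.p1",
)
print(citation)
```

Keeping the Data Unique Identifier and the secondary source ID as separate arguments mirrors the slide, where one identifies the catalog record and the other links to the dataset itself.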

NIH Data Catalog Issues and Concerns

What are we missing?
• How many NIH datasets actually exist?
• How many unique NIH datasets are NOT represented in existing data repositories?
• Could these datasets be represented as a data publication instead of in a repository?
• If the datasets are already housed somewhere, do we need a one-stop shop?
• Is an NIH Data Catalog the best solution?

Next Steps
• Find out how many datasets are currently in NIH data sharing repositories
o How many datasets do these repositories process per year?

• How many datasets are unique and NOT housed in a repository?
o Search PubMed and PubMed Central and assign categories
• MeSH
o PT: Electronic Supplementary Materials
o SH: Statistics and numerical data
o MeSH: Databases, Factual
o Statistical analysis: exclude datasets that already have a location

• How do we manage these unique datasets?
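
The PubMed search step above could start from a query that combines the three categories. The sketch below builds such a query and the matching NCBI E-utilities `esearch` URL without sending any request. The slides do not say how the terms should be combined, so joining them with `OR` is an assumption, as are the function names and the use of the standard MeSH spellings of the three terms.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term() -> str:
    """Combine the publication type, subheading, and MeSH heading with OR.

    The OR combination is an illustrative choice, not taken from the slides.
    """
    clauses = [
        '"Electronic Supplementary Materials"[pt]',  # publication type
        '"statistics and numerical data"[sh]',       # MeSH subheading
        '"Databases, Factual"[mh]',                  # MeSH heading
    ]
    return " OR ".join(clauses)


def esearch_url(term: str, retmax: int = 0) -> str:
    """Return an esearch URL for PubMed; retmax=0 asks for the count only."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


print(esearch_url(build_term()))
```

Fetching the printed URL returns a JSON document whose result count gives a first estimate of how many records fall into these categories. Excluding datasets that already have a location would need a further filtering step that is not shown here.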
