Building an NIH Data Catalog: Bit by Bit

Kevin Read
NLM Associate Fellowship Presentation
July 24, 2013
1

NIH Big Data to Knowledge
Facilitating Broad Use of Biomedical Big Data
2

NIH Data Catalog
What is it designed to do?
3

NIH Data Catalog
Data sets are CITABLE
Data sets are DISCOVERABLE
Data sets are LINKED TO THE LITERATURE
Data sets are PART OF THE RESEARCH ECOSYSTEM
4

NIH Data Catalog
What do we need to know in order to build it?
Minimal Metadata Elements
How do current data repositories describe their data?
Orphaned Data sets
How many data sets are not currently represented in a data repository?
5

Finding Common Metadata Elements
Exploring how NIH Data Repositories describe their data
6

Categorizing Metadata Descriptors
Common Metadata Elements
Authorship
Data
Description
Title
Information
8

Identifying Metadata Variations
Date
Study Date
Date Processed
Release Date
Completion Date
Last Updated Date
Prepared on Date
Authorship
Authors
Creators
Data Provider
Principal Investigator(s)
Contributors
Data Authors
9
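
The variations above suggest a simple lookup from repository-specific labels to a common element. The following is a minimal sketch, assuming a hand-built table; the names `VARIANT_TO_COMMON` and `normalize_field` are hypothetical and are not part of the project described in these slides.

```python
from typing import Optional

# Hypothetical lookup table: repository field label -> common metadata element.
# The labels are the variations listed on the slide; the table itself is illustrative.
VARIANT_TO_COMMON = {
    "study date": "Date",
    "date processed": "Date",
    "release date": "Date",
    "completion date": "Date",
    "last updated date": "Date",
    "prepared on date": "Date",
    "authors": "Authorship",
    "creators": "Authorship",
    "data provider": "Authorship",
    "principal investigator(s)": "Authorship",
    "contributors": "Authorship",
    "data authors": "Authorship",
}


def normalize_field(label: str) -> Optional[str]:
    """Return the common element for a repository label, or None if unmapped."""
    # Lowercase and collapse whitespace so minor formatting differences still match.
    key = " ".join(label.lower().split())
    return VARIANT_TO_COMMON.get(key)


print(normalize_field("Principal Investigator(s)"))
print(normalize_field("  Release   Date "))
print(normalize_field("Funding"))
```

Labels that are not in the table return `None`, which makes unmapped repository fields easy to collect for manual review.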

Mapping Metadata Commonalities to Existing Standards
Common Metadata Elements
10

Mapping Metadata to MEDLINE
Common Metadata Elements: Proposed Definition
Data Unique Identifier: A unique ID string that identifies a data set within the catalog
Author: Individuals involved in producing or contributing to data
Affiliation: Affiliation of each author associated with the appropriate author occurrence
Data Title: Name or title by which the data set is known
Data Location: The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number
Date: The year, month and date when the data was made available
Data Description (structured narrative): Structured narrative description for efficient indexing
Data Descriptors: Metadata describing data contents using controlled labels (e.g. Organism, Disease, Perturbation, Gender, Cell type)
PMID: Identifier that will link dataset to associated article(s) AND be provided for the data catalog entry
Availability/Accessibility of Data: Indication of whether the data is available to use and how to access it
Award Number: Grant/award numbers associated with the data set
Related Data: Data that was used in the creation of the new data set
11
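
To make the proposed elements concrete, here is a minimal sketch of how one catalog entry might be held in code. The class and field names (`CatalogRecord`, `Author`, and so on) are assumptions chosen to mirror the element list; the slides do not define a schema, data types, or which elements are required. The example values reuse the sample citation from this deck, and the descriptor values are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Author:
    name: str
    affiliation: Optional[str] = None  # Affiliation is tied to each author occurrence


@dataclass
class CatalogRecord:
    data_unique_identifier: str            # Data Unique Identifier
    data_title: str                        # Data Title
    authors: List[Author]                  # Author (with Affiliation)
    data_location: str                     # Entity holding the data
    accession_number: str                  # Accession number at that entity
    date: str                              # Date the data was made available
    data_description: str = ""             # Structured narrative
    data_descriptors: Dict[str, List[str]] = field(default_factory=dict)
    pmid: Optional[str] = None             # Links to associated article(s)
    availability: str = ""                 # Availability/Accessibility of Data
    award_numbers: List[str] = field(default_factory=list)
    related_data: List[str] = field(default_factory=list)


# Example entry built from the sample citation in this deck.
record = CatalogRecord(
    data_unique_identifier="DUID00001",
    data_title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    authors=[Author("Marazita ML"), Author("Weynat RJ")],
    data_location="dbGaP",
    accession_number="pht002543.v2.p1",
    date="2014 Jan",
    pmid="22123456",
    data_descriptors={"Disease": ["Dental caries"]},  # illustrative value
)

print(record.data_unique_identifier, record.data_location, record.accession_number)
```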

Data Catalog Citation
Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456.
SI: dbGaP/pht002543.v2.p1
Author: Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D
Data Title: Dental Caries: Whole Genome Association and Gene x Environment Studies
Data Description Location: NIH Data Catalog
Date of NIH Data Catalog issue: 2014 Jan
NIH Data Catalog Volume (Issue): 1(1)
Data Unique Identifier: DUID00001
PMID Assigned to NIH Data Catalog Record: 22123456
Secondary source ID (Link to actual dataset): dbGaP/pht002543.v2.p1
12
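
The citation layout on this slide can be assembled mechanically from its parts. The sketch below is one possible way to do that; the function names `format_citation` and `format_si` are hypothetical, and the pattern is inferred only from the single sample citation shown here.

```python
from typing import List


def format_citation(authors: List[str], title: str, year: str, month: str,
                    volume: int, issue: int, duid: str, pmid: str) -> str:
    """Build a catalog citation in the layout of the sample on the slide."""
    return (
        f"{', '.join(authors)}. {title}. NIH Data Catalog. "
        f"{year} {month};{volume}({issue}):{duid}. PubMed PMID: {pmid}."
    )


def format_si(repository: str, accession: str) -> str:
    """Build the secondary source ID line that points to the actual data set."""
    return f"SI: {repository}/{accession}"


citation = format_citation(
    authors=["Marazita ML", "Weynat RJ", "Feingold E", "Weeks D", "Crout R", "McNeill D"],
    title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    year="2014",
    month="Jan",
    volume=1,
    issue=1,
    duid="DUID00001",
    pmid="22123456",
)
si_line = format_si("dbGaP", "pht002543.v2.p1")

print(citation)
print(si_line)
```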

Searching for NIH-funded ‘Orphaned’ data sets in PubMed and PubMed Central
13

NIH-funded articles for 2011: 113,089
Non-PMC Articles: 88,592
Non-research Articles: 78,901
Molecular Sequence Data MH: 75,441
SI Field: 71,913
PMC Acknowledgements: 71,680
XML: 69,857
Remaining articles with orphaned data sets
14

SI Field Exclusions
[Bar chart: Excluded Articles; axis from 0 to 1,600]
15

PMC Acknowledgement Exclusions
[Bar chart: Excluded keywords; axis from 0 to 800]
16

XML Keyword Exclusions
[Bar chart: Excluded keywords; axis from 0 to 600]
FlyBase:GeneNetwork:Mouse Genome Informatics:Neuroscience Information Framework:Rat Genome Database:WormBase:Zebrafish Model Organism Database
GenBank:PDB
17

Total # of articles collected for 2011 after exclusion: 69,657
Random sample with 95% confidence interval: 383
18
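
A sample of 383 from 69,657 articles is consistent with a standard sample-size calculation with a finite population correction. The sketch below assumes a 5% margin of error and maximum variability (p = 0.5); neither value is stated on the slide, so treat both as assumptions, and the function name `sample_size` as hypothetical.

```python
import math


def sample_size(population: int, z: float = 1.96,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample size with finite population correction.

    z = 1.96 corresponds to a 95% confidence level. The margin of error
    and p are assumed values, not figures taken from the slides.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # infinite-population estimate
    n = n0 / (1 + (n0 - 1) / population)       # finite population correction
    return math.ceil(n)


print(sample_size(69_657))  # 383
```

Under these assumptions the result matches the sample size reported on the slide.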

383
What category of data set was used for the research described in the article?
Were live human or animal subjects used in the collection of the data?
What were the subject(s) of study (from which or whom the data was collected)?
If new data set(s) were created, what type(s) of data were collected?
What existing data set(s) were used, if any?
How many data sets are there in each article?
19

Measuring blood pressure in mice
Measuring left hemisphere of brain for growth factor
Staining and imaging
Analysis of images using software
20

Preliminary Results
‘Orphaned’ Data
50 articles
21

Average number of data sets per article: 5.84
22

% of data sets that use live subjects: 51%
Human: 60%
Animal: 40%
23

% of data sets that were considered to be new: 74%
% of data sets that used existing data with modifications or added value: 12%
% of data sets that used existing data as is: 13%
% with no data: 1%
24

% of articles that collected only new data: 56%
% of articles that used only existing data: 32%
% of articles that used a combination of data: 8%
% of articles that used no data: 4%
25

Building an NIH Data Catalog
Questions to Consider
27

What do we consider to be a data set?
All of the data created within a paper?
Multiple data sets of different data types within a paper?
Every individual collection of data within a paper?
28

Where in the collection/processing pipeline should data be described?
29

Is there a convenient way to point to data sets within an article?
Abstract?
Labeled area?
Reference list?
30

How do we adequately describe data sets so that they are discoverable?
Develop a strategy to create appropriate data descriptors
31

How do we adequately describe data sets so that they are discoverable?
Is there a convenient way to point to data sets within an article?
Where in the data collection/processing pipeline should data be described?
What do we consider to be a data set?
32

Acknowledgements
Project Sponsors
Jerry Sheehan & Mike Huerta
Special Thanks
Lou Knecht & Jim Mork
Annotators
Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga
Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter
Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike
Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen
Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha
Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn
Sinnott
Support
Kathel Dunn & David Gillikin
Library Operations
Joyce Backus & Dianne Babski
NLM Leadership
Donald Lindberg & Betsy Humphreys
All images are CC
33
