OBJECTIVE The purpose of the project was to a) develop a set of core, minimal metadata elements that would be used to describe data sets, and b) carry out a study to identify data sets in NIH-funded articles from PubMed and PubMed Central (PMC) that do not provide an indication that their data is stored in a specific place like a repository or registry. These efforts will inform the BD2K initiative and a planned NIH Data Catalog.
METHODS An analysis of the metadata schemas for all NIH data repositories was undertaken. Commonalities from these data repositories were identified, mapped to existing data-specific metadata standards from DataCite and Dryad, and then integrated into MEDLINE XML metadata in an attempt to establish a sustainable and integrated metadata schema.
The second phase of this project identified data sets in articles from PubMed and PMC by searching specifically for NIH-funded articles from the year 2011. After excluding articles that mention data sets being deposited in existing repositories, thirty staff members from NLM and BD2K were recruited to analyze a random sample of the results to identify how many data sets, and what types of data sets, were created per article.
RESULTS A preliminary set of minimal metadata elements was developed that could sufficiently describe NIH-funded data sets and be integrated within MEDLINE’s schema, with minor additions.
At present, results of the second phase, the analysis of PubMed and PMC articles for data sets, are pending until all submissions from NLM staff are complete.
CONCLUSION The efforts to develop a minimal set of metadata elements and to identify the number and types of data sets produced in NIH-funded articles will inform the BD2K initiative's work to build an NIH Data Catalog going forward.
4. NIH Data Catalog
Data sets are CITABLE
Data sets are DISCOVERABLE
Data sets are LINKED TO THE LITERATURE
Data sets are PART OF THE RESEARCH ECOSYSTEM
5. NIH Data Catalog
What do we need to know in order to build it?
Minimal Metadata Elements: How do current data repositories describe their data?
Orphaned Data sets: How many data sets are not currently represented in a data repository?
11. Mapping Metadata to MEDLINE
Common Metadata Element: Proposed Definition
Data Unique Identifier: A unique ID string that identifies a data set within the catalog
Author: Individuals involved in producing or contributing to data
Affiliation: Affiliation of each author associated with the appropriate author occurrence
Data Title: Name or title by which the data set is known
Data Location: The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number
Date: The year, month and date when the data was made available
Data Description (structured narrative): Structured narrative description for efficient indexing
Data Descriptors: Metadata describing data contents using controlled labels (e.g. Organism, Disease, Perturbation, Gender, Cell type)
PMID: Identifier that will link dataset to associated article(s) AND be provided for the data catalog entry
Availability/Accessibility of Data: Indication of whether the data is available to use and how to access it
Award Number: Grant/award numbers associated with the data set
Related Data: Data that was used in the creation of the new data set
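The proposed elements can be pictured as a single record type. The following is only a minimal sketch in Python: the class names, field names, and the missing_elements helper are hypothetical illustrations and are not part of MEDLINE, DataCite, or Dryad.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class DataAuthor:
    # Hypothetical type: one affiliation is kept per author occurrence.
    name: str
    affiliation: str = ""


@dataclass
class DataCatalogRecord:
    # Hypothetical field names mirroring the proposed common elements.
    data_unique_identifier: str = ""
    authors: List[DataAuthor] = field(default_factory=list)
    data_title: str = ""
    data_location: str = ""          # holding entity plus accession number
    date: str = ""                   # when the data was made available
    data_description: str = ""       # structured narrative
    data_descriptors: Dict[str, str] = field(default_factory=dict)
    pmids: List[str] = field(default_factory=list)
    availability: str = ""
    award_numbers: List[str] = field(default_factory=list)
    related_data: List[str] = field(default_factory=list)

    def missing_elements(self) -> List[str]:
        """Return the names of elements that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


record = DataCatalogRecord(
    data_unique_identifier="DUID00001",
    data_title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
)
print(record.missing_elements())
```

A check like missing_elements could help a curator see which elements of a record still need to be supplied before it is added to a catalog.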
12. Data Catalog Citation
Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456.
SI: dbGaP/pht002543.v2.p1
Citation components, in order of appearance:
Author: Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D
Data Title: Dental Caries: Whole Genome Association and Gene x Environment Studies
Data Description Location: NIH Data Catalog
Date of NIH Data Catalog issue: 2014 Jan
NIH Data Catalog Volume (Issue): 1(1)
Data Unique Identifier: DUID00001
PMID Assigned to NIH Data Catalog Record: 22123456
Secondary source ID (Link to actual dataset): SI: dbGaP/pht002543.v2.p1
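As an illustration of how the citation components fit together, here is a minimal sketch that assembles the example citation from its parts. The function name format_citation and its parameters are hypothetical; the punctuation simply follows the example on the slide.

```python
from typing import List


def format_citation(
    authors: List[str],
    title: str,
    issue_date: str,
    volume: int,
    issue: int,
    duid: str,
    pmid: str,
    secondary_id: str,
    catalog: str = "NIH Data Catalog",
) -> str:
    """Assemble a data catalog citation string (hypothetical helper)."""
    author_part = ", ".join(authors)
    return (
        f"{author_part}. {title}. {catalog}. "
        f"{issue_date};{volume}({issue}):{duid}. "
        f"PubMed PMID: {pmid}. SI: {secondary_id}"
    )


citation = format_citation(
    authors=["Marazita ML", "Weynat RJ", "Feingold E", "Weeks D", "Crout R", "McNeill D"],
    title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    issue_date="2014 Jan",
    volume=1,
    issue=1,
    duid="DUID00001",
    pmid="22123456",
    secondary_id="dbGaP/pht002543.v2.p1",
)
print(citation)
```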
18. Total # of articles collected for 2011 after exclusion: 69,657
Random sample size at a 95% confidence level: 383
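The sample size of 383 is consistent with a standard sample size calculation. The sketch below uses Cochran's formula with a finite population correction; the 5% margin of error and the 0.5 expected proportion are assumptions, since the slide states only the 95% confidence figure, and this is not necessarily the method the project used.

```python
import math


def sample_size(population: int, z: float = 1.96, margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample size with finite population correction.

    z=1.96 corresponds to 95% confidence; margin and p are assumed values.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)          # finite population correction
    return math.ceil(n)


print(sample_size(69657))  # 383
```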
19. 383
What category of data set was used for the research described in the article?
Were live human or animal subjects used in the collection of the data?
What were the subject(s) of study (from which or whom the data was collected)?
If new data set(s) were created, what type(s) of data were collected?
What existing data set(s) were used, if any?
How many data sets are there in each article?
20. Measuring blood pressure in mice
Measuring left hemisphere of brain for growth factor
Staining and imaging
Analysis of images using software
23. % of data sets that use live subjects: 51%
Human: 60%
Animal: 40%
24. % of data sets that were considered to be new: 74%
% of data sets that used existing data with mods or added value: 12%
% of data sets that used existing data as is: 13%
% with no data: 1%
25. % of articles that collected only new data: 56%
% of articles that used only existing data: 32%
% of articles that used a combination of data: 8%
% of articles that used no data: 4%
28. What do we consider to be a data set?
All of the data created within a paper?
Multiple data sets of different data types within a paper?
Every individual collection of data within a paper?
30. Is there a convenient way to point to data sets within an article?
Abstract?
Labeled area?
Reference list?
31. How do we adequately describe data sets so that they are discoverable?
Develop a strategy to create appropriate data descriptors
32. How do we adequately describe data sets so that they are discoverable?
Is there a convenient way to point to data sets within an article?
Where in the data collection/processing pipeline should data be described?
What do we consider to be a data set?
33. Acknowledgements
Project Sponsors
Jerry Sheehan & Mike Huerta
Special Thanks
Lou Knecht & Jim Mork
Annotators
Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga
Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter
Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike
Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen
Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha
Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn
Sinnott
Support
Kathel Dunn & David Gillikin
Library Operations
Joyce Backus & Dianne Babski
NLM Leadership
Donald Lindberg & Betsy Humphreys
All images are CC
Editor's Notes
Build a Catalog of Research Datasets to Facilitate Data Discovery.
125 members across all the ICs are currently working on this, and they are holding a variety of workshops in August and the fall.
One initiative is to create an NIH Data Catalog; this is where I was able to contribute.
I have been talking about an NIH Data Catalog, but it is important that I describe what a data catalog is designed to do.
Data sets are discoverable: they can be searched for and retrieved for use or analysis.
Data sets are citable: when other researchers use a data set they can cite it, so that the authors of a data set receive credit for their work.
Data sets are linked to the literature: data sets that were created in an article would be automatically linked to the respective article(s).
Data sets are part of the research ecosystem: when researchers apply for NIH grants, data sets would have an impact on the decision-making process.
Keeping those goals in mind, it then came time to figure out what we actually need to know in order to build this thing.
For my project it came down to looking at how current data repositories describe their data sets, to come up with potential metadata elements that could be used for the data catalog.
The other part of my project focused on how we can deal with all of the data that is not currently represented in a data repository, and how to manage and describe it.
First we will look at the first component of my project, and how I went about exploring how NIH data repositories describe their data
This project actually brought me back to my fall project, where I was responsible for curating and collecting a list of all the NIH data sharing repositories. Each of these repositories has a submission requirement where they require researchers to provide specific information about their data before they deposit it. I took this submission metadata from each repository, and attempted to extract commonalities from each.
Once I started to extract the metadata commonalities from each repository, I developed broad categories that they fell into here. You can see on this slide that unique identifiers, data information, data description and authorship are all major categories that were represented in each.
From there it is important to point out that there were a number of variations for each major category for which I’ve included two examples here. Notice in the date graphic on the left that date is represented in a number of different ways that mean entirely different things. These variations are just one of many examples that illustrate the complexities of describing data, and how there is currently no standard way of describing data effectively.
Because there were so many variations in the commonalities I found, I felt it would be useful to map my common metadata elements to existing standards to gain a better sense of how data was being described at a broad level. In this case we chose to look at three: DataCite, a platform where researchers can register their data sets to receive a DOI; Dryad, an open data repository that links data to literature, is prominently used in the scientific community, and describes a wide range of data sets; and finally MEDLINE, in order to ascertain whether or not MEDLINE’s existing XML metadata schema could adequately describe a data set.
We chose DataCite because it maintains a relatively up-to-date version of its metadata schema, and also makes it open source so that the scientific community can provide feedback and make changes.
Dryad was chosen because it is a very popular data repository that deals with a wide variety of data, and we felt that it would provide a good indication of a baseline set of metadata.
For the purposes of this presentation I will not go into DataCite and Dryad in detail, and will instead show the findings from each platform in the context of MEDLINE.
After looking at the metadata standards from both DataCite and Dryad, we felt that this set of metadata could be integrated into MEDLINE with a few modifications and additions. You’ll see the common metadata that you would expect in a basic description of an object: a unique identifier, author, and title. But I would like to point out a few of the modifications and challenges we faced.
Affiliation: in PubMed it is only available for the first author, but because data can go through changes or modifications in specific labs, we felt it was important to include an affiliation for each data author.
Data descriptors: this is the biggest issue when it comes to describing data, and you’ll see the same theme emerge in the second half of my presentation. We felt that a data description should include a structured narrative, as you would see in a structured abstract, but also a list of data descriptors such as organism and disease to tag the data set. Neither DataCite nor Dryad included biomedical data descriptors, so this is an area that requires further study.
PMID: it was felt that a record in the data catalog could also receive a PMID so that it could be searched within PubMed as well as the data catalog. Similarly, all PMIDs associated with a data set would be included in the background metadata so that it could be easily linked to the literature.
Related data: finally, the related data field would indicate whether or not the data that was created used pre-existing data. For example, if a researcher used previously created questionnaire data to perform a new analysis, it would be important to account for that in the metadata record for the sake of provenance and transparency.
Finally, because it was felt that a data catalog record could potentially look like a data publication, and we would want data to be citable, I have provided an example here breaking down each component of a citation.
Keeping in mind some of the issues we faced with data description, and the lack of standards for developing metadata for biomedical data sets, I am now going to switch gears to talk about the second component of my project, which involved searching for data sets in PubMed and PMC that have not been deposited in a repository.
This exercise is useful for discovering how much work would be required to describe all the data sets that are created in an article, and for figuring out whether there is a sufficient way to describe the different types of data that are created.
This slide is meant to demonstrate the stages of exclusion taken to arrive at the final sample for analysis.
It is important to mention here that while the search found a large number of name variations for exclusion, once these overlapped with PubMed’s search only 230 articles in total were excluded.
XML keyword exclusions.
There is an issue in this case: you’ll notice the bar on the right is titled “Multiple Keywords”. This is unfortunate, because whenever more than one of the repositories was mentioned, the article was counted as a multiple rather than towards each repository itself.
What is interesting about these multiples, however, is that a number of articles mentioned several repositories.
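One way to avoid the “Multiple Keywords” problem would be to credit every repository mentioned in an article and track multi-repository articles separately. The sketch below is a hypothetical illustration, not the project's actual exclusion code: the keyword list is an assumed example (only dbGaP appears elsewhere in this presentation), and simple case-insensitive substring matching stands in for the real XML keyword search.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical keyword list; the real exclusion list is not given in the slides.
REPOSITORY_KEYWORDS: List[str] = ["dbGaP", "GenBank", "ClinicalTrials.gov"]


def count_repository_mentions(articles: Iterable[str]) -> Tuple[Counter, int]:
    """Count articles per repository, plus articles naming more than one."""
    per_repository: Counter = Counter()
    multiple = 0
    for text in articles:
        lowered = text.lower()
        hits = [kw for kw in REPOSITORY_KEYWORDS if kw.lower() in lowered]
        per_repository.update(hits)   # each repository gets its own credit
        if len(hits) > 1:
            multiple += 1             # tracked separately, not instead
    return per_repository, multiple


sample_articles = [
    "Data were deposited in dbGaP and GenBank.",
    "Sequences are available in GenBank.",
    "No repository is mentioned here.",
]
per_repository, multiple = count_repository_mentions(sample_articles)
print(per_repository, multiple)
```

Counting this way keeps the per-repository bars accurate while still showing how many articles mention several repositories.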
We had 30 NLM and BD2K staff look at the 383 articles: 25 articles each, with two people reviewing the same 25 for validation.
This slide is designed to illustrate the different measurements and data collection that occurs within an article and really exemplifies the complexities of data that we are working with.
If we go back to the issues we saw earlier with data types and data descriptors:
Do we tackle the entire scope of biomedical data?
Do we adopt existing standards for ways to describe data?
Who better to answer these questions than the National Library of Medicine?