A Tale of Two Data Catalogs

Table of Contents
I. NIH Data Discovery Index
• Methodology
• Findings
• Questions raised
II. Institutional Data Interviews
• Methodology
• Findings
III. Outcomes
• Benefits to the library
By: Charles Dickens

NIH Big Data to Knowledge (BD2K)
Facilitating Broad Use of Biomedical Big Data

NIH Data Discovery Index
• Datasets are CITABLE
• Datasets are DISCOVERABLE
• Datasets are LINKED TO THE LITERATURE
• Datasets are PART OF THE RESEARCH ECOSYSTEM

NIH Data Sharing Repositories
http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html

Searching for NIH-funded unidentified
datasets in PubMed and PMC

NIH-funded articles for 2011: 113,089
Remaining articles with unidentified datasets after each exclusion:
• Non-PMC Articles: 88,592
• Non-research Articles: 78,901
• Molecular Sequence Data MH: 75,441
• SI Field: 71,913
• PMC Acknowledgements: 71,680
• XML: 69,857
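
The exclusion steps on this slide can be checked with a little arithmetic. The sketch below is illustrative only: it assumes that each count on the slide is the number of articles remaining after the filter listed in the same position, and the names START, STEPS, and funnel are hypothetical, not part of the original study.

```python
# Counts copied from the slide. Pairing each count with a filter is an
# assumption based on the order in which the labels and numbers appear.
START = 113_089  # NIH-funded articles for 2011

STEPS = [
    ("Non-PMC Articles", 88_592),
    ("Non-research Articles", 78_901),
    ("Molecular Sequence Data MH", 75_441),
    ("SI Field", 71_913),
    ("PMC Acknowledgements", 71_680),
    ("XML", 69_857),
]


def funnel(start, steps):
    """Return (filter name, articles excluded, articles remaining) per step."""
    rows = []
    previous = start
    for name, remaining in steps:
        rows.append((name, previous - remaining, remaining))
        previous = remaining
    return rows


rows = funnel(START, STEPS)
for name, excluded, remaining in rows:
    print(f"{name:<28} excluded {excluded:>6,}  remaining {remaining:>7,}")

# Share of the starting set that still has unidentified datasets.
share_remaining = rows[-1][2] / START
print(f"Share remaining: {share_remaining:.1%}")
```

Listing the per-step exclusions makes it easy to see which filter removes the most articles under this reading of the slide.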

[Bar chart: SI Field, Excluded Articles]

[Bar chart: PMC Acknowledgements, Excluded keywords]

[Bar chart: XML Keyword, Excluded keywords]
• FlyBase, GeneNetwork, Mouse Genome Informatics, Neuroscience Information Framework, Rat Genome Database, WormBase, Zebrafish Model Organism Database
• GenBank, PDB

NIH-sponsored data repositories now
added to PubMed and PMC search indexes

383
• What category of dataset was used for the research described in the article?
• Were live human or animal subjects used in the collection of the data?
• What were the subject(s) of study (from which or whom the data was collected)?
• If new dataset(s) were created, what type(s) of data were collected?
• What existing dataset(s), if any, were used?
• How many datasets are there in each article?
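
The annotation questions above can be read as a record schema. The sketch below is one possible encoding, not the form the annotators actually used: the class names, field names, category values, and sample records are all hypothetical and exist only to show how a per-article dataset count could be derived from such records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetAnnotation:
    """One dataset described in an article (hypothetical field names)."""
    category: str                 # e.g. "new" or "existing" (assumed values)
    live_subjects: bool           # were live human or animal subjects used?
    subjects: str                 # from which or whom the data was collected
    data_types: List[str] = field(default_factory=list)
    existing_datasets: List[str] = field(default_factory=list)


@dataclass
class ArticleAnnotation:
    """All dataset annotations recorded for a single article."""
    article_id: str
    datasets: List[DatasetAnnotation] = field(default_factory=list)


def average_datasets_per_article(articles: List[ArticleAnnotation]) -> float:
    """Mean number of annotated datasets per article (0.0 for no articles)."""
    if not articles:
        return 0.0
    return sum(len(a.datasets) for a in articles) / len(articles)


# Illustrative records only; these are not data from the study.
sample = [
    ArticleAnnotation("article-1", [
        DatasetAnnotation("new", True, "mice", ["Physiological"]),
        DatasetAnnotation("new", False, "tissue images", ["Image"]),
    ]),
    ArticleAnnotation("article-2", [
        DatasetAnnotation("existing", False, "sequences",
                          ["Genetic or Genomic"], ["GenBank"]),
    ]),
]

average = average_datasets_per_article(sample)
print(f"Average datasets per article: {average:.2f}")
```

Keeping one record per dataset, rather than one per article, lets the same structure answer both the per-dataset questions and the per-article count.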

• Measuring blood pressure in mice
• Measuring left hemisphere of brain for growth factor
• Staining and imaging
• Analysis of images using software

Average number of
datasets per article:
2.92

% of datasets that use live subjects: 54%
• Human: 51%
• Animal: 49%

% of new data: 87%
% of data created using pre-existing datasets: 13%

It was the worst
of times…

Data Types
• Image
• Genetic or Genomic
• Chemical
• Biochemical
• Electrical (Electrophysiological)
• Optical – non-image
• Behavioral
• Computational Simulation or model
• Magnetic Resonance – non-image
• Structural
• Physiological
• Questionnaire/Survey
• Clinical Measures
• Geospatial

Inter-rater Reliability:
[Bar chart: Total number of datasets found per 25 articles, Total # of datasets (High) vs. Total # of datasets (Low)]
43%

How do we define a dataset?
Dataset

Datasets

Datasets

Where in the
collection/processing pipeline
should data be described?

Book of the Second
Understanding institutional data challenges
via interviews

Institutional Data Catalog
• Organize and describe
institutional research data
• Promote collaboration
within the institution
• Promote a culture of
sharing and transparency

Methodology
• Literature review
• Identified researchers/PIs using the active grant system
• Analyzed datasets in researcher papers before interviews
– Used NIH Data Discovery Index method

Understand your researchers
BASIC SCIENCE RESEARCHERS CLINICAL RESEARCHERS

Data Interviews
Challenges Organizing Data – Basic Science Researchers
• Postdoc or student leaves with data
• Lack of standards/procedures
• Size of data
• Messiness/Disconnect between datasets
• Too challenging

Data Interviews
Challenges Preserving Data – Basic Science Researchers
• Storage expense
• Changes in software
• Lack of IT resources
• Lack of preservation procedures (readme, plans, postdoc etc.)
• Data in multiple storage locations
• Storage space

Data Interviews
Challenges Organizing Data – Clinical Researchers
• Data quality
• Messiness/Disconnect between datasets
• Poor data output formats
• Can't search data
• Data loss
• Team miscommunication on who's using data

Data Interviews
Experience with Data Sharing (Basic Science vs. Clinical)
• Collaboration only
• Unknown parties
• Data repository
• General public
• Primary results only
• Do not share

Only the best of times…
How the library benefitted from this exercise

Identified group to pilot institutional
data catalog – Population Health

Acquired new opportunities for
teaching data management

Developing a lab tool for basic
scientists to manage metadata

Developed a better understanding
of researcher needs and challenges

Acknowledgements
BD2K Project
• Lou Knecht, Jim Mork, Kathel Dunn, Betsy
Humphreys, Jerry Sheehan, Mike Huerta, Dr. Donald
Lindberg
Annotators
• Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa
Yorks, Shari Mohary, Olga Printseva, Janice Ward, Oleg
Rodionov, Sally Davidson, Jennie Larkin, Peter Lyster, Matt
McAuliffe, Greg Farber, Betsy Humphreys, Jerry
Sheehan, Mike Huerta, Lou Knecht, Suzy Roy, Swapna
Abhyankar, Olivier Bodenreider, Karen Gutzman, Dina
Demner Fusman, Laritza Rodriguez, Sonya
Shooshan, Samantha Tate, Matthew Simpson, Tracy
Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn
Sinnott

References
1. Adamick J, Canavan M, McGinty S, Reznik-Zellen R, Schmidt M, Stevens R. Building as We Climb: The Data Working Group at the University of Massachusetts Amherst [Internet]. Univ.
Massachusetts New Engl. Area Libr. e-Science Symp. 2011. Available from: http://escholarship.umassmed.edu/escience_symposium/2011/posters/3
2. Bardyn TP, Resnick T, Camina SK. Translational Researchers’ Perceptions of Data Management Practices and Data Curation Needs: Findings from a Focus Group in an Academic Health
Sciences Library. J. Web Librariansh. [Internet]. 2012 Oct [cited 2013 Jan 30];6(4):274–87. Available from: http://www.tandfonline.com/doi/abs/10.1080/19322909.2012.730375
3. Carlson J, Fosmire M, Miller CC, Nelson MS. Determining Data Information Literacy Needs: A Study of Students and Research Faculty. portal Libr. Acad. 2011;11(2):629 – 657.
4. Delserone LM. At the watershed: Preparing for research data management and stewardship at the University of Minnesota Libraries. Libr. Trends [Internet]. Urbana-Champaign, Illinois: Johns
Hopkins University Press and the Graduate School of Library and Information Science; 2008 [cited 2013 Jan 11]. p. 202–10. Available from: https://www.ideals.illinois.edu/handle/2142/10670
5. Harrison A, Searle S. Not drowning, ingesting: dealing with the research data deluge at an institutional level. VALA2010 Proc. [Internet]. 2010. Available from:
http://www.vala.org.au/vala2010/papers2010/VALA2010_43_Harrison_Final.pdf
6. Hruby GW, McKiernan J, Bakken S, Weng C. A centralized research data repository enhances retrospective outcomes research capacity: a case report. J. Am. Med. Inform. Assoc. [Internet]. 2013
Jan 15 [cited 2013 Apr 11];1–5. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23322812
7. Johnson LM, Butler JT, Johnston LR. Developing E-Science and Research Services and Support at the University of Minnesota Health Sciences Libraries. J. Libr. Adm. [Internet]. Routledge;
2012 Nov [cited 2013 Jan 11];52(8):754–69. Available from: http://dx.doi.org/10.1080/01930826.2012.751291
8. Jones S, Ross S, Ruusalepp R. Data Audit Framework Methodology [Internet]. Glasgow; 2009 p. 1–70. Available from: http://www.data-audit.eu/DAF_Methodology.pdf
9. Lage K, Losoff B, Maness J. Receptivity to Library Involvement in Scientific Data Curation: A Case Study at the University of Colorado Boulder. portal Libr. Acad. [Internet]. 2011 [cited 2012
Nov 21];11(4):915–37. Available from: http://muse.jhu.edu/journals/portal_libraries_and_the_academy/v011/11.4.lage.html
10. Newton MP, Miller CC, Bracke MS. Librarian Roles in Institutional Repository Data Set Collecting: Outcomes of a Research Library Task Force. Collect. Manag. 2011;36(1):53–67.
11. Peters C, Dryden AR. Assessing the Academic Library’s Role in Campus-Wide Research Data Management: A First Step at the University of Houston. Sci. Technol. Libr. [Internet]. Routledge;
2011 Sep [cited 2013 Jan 11];30(4):387–403. Available from: http://dx.doi.org/10.1080/0194262X.2011.626340
12. Piwowar HA. Who shares? Who doesn’t? Factors associated with openly archiving raw research data. PLoS One [Internet]. 2011 Jan [cited 2013 Mar 10];6(7):e18657. Available from:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3135593&tool=pmcentrez&rendertype=abstract
13. Raboin R, Reznik-Zellen RC, Salo D. Forging New Service Paths: Institutional Approaches to Providing Research Data Management Services. J. eScience Librariansh. [Internet]. 2012;1(3).
Available from: http://escholarship.umassmed.edu/jeslib/vol1/iss3/2/
14. Reznik-Zellen R, Adamick J, McGinty S. Tiers of Research Data Support Services. J. eScience Librariansh. [Internet]. 2012 [cited 2012 Nov 10];1(1):27–35. Available from:
http://escholarship.umassmed.edu/jeslib/vol1/iss1/5/
15. Scaramozzino JM, Ramirez ML, McGaughey KJ. A Study of Faculty Data Curation Behaviors and Attitudes at a Teaching-Centered University. Coll. Res. Libr. [Internet]. Association of
College & Research Libraries; 2012 Jul 1 [cited 2013 Jan 11];73(4):349–65. Available from: http://crl.acrl.org/content/73/4/349.abstract
16. Soehner C, Steeves C, Ward J. E-Science and Data Support Services. 2010 [cited 2013 Jan 11];(August). Available from: http://www.arl.org/storage/documents/publications/escience-report-2010.pdf
17. Trinidad SB, Fullerton SM, Bares JM, Jarvik GP, Larson EB, Burke W. Genomic research and wide data sharing: views of prospective participants. Genet. Med. 2010 Aug;12(8):486–95.
18. Walters TO. Data curation program development in U.S. universities: The Georgia Institute of Technology example. Int. J. Digit. Curation [Internet]. 2009;4(3):83–92. Available from:
http://www.ijdc.net/index.php/ijdc/article/viewFile/136/153
19. Westra B. Data Services for the Sciences: A Needs Assessment. Ariadne [Internet]. 2010;(64). Available from: http://www.ariadne.ac.uk/issue64/westra
20. Williams SC. Using a Bibliographic Study to Identify Faculty Candidates for Data Services. Sci. Technol. Libr. [Internet]. Routledge; 2013 May 9 [cited 2013 May 14];1–8. Available from:
http://dx.doi.org/10.1080/0194262X.2013.774622
21. Xia J, Liu Y. Usage Patterns of Open Genomic Data. Coll. Res. Libr. [Internet]. Association of College & Research Libraries; 2013 Mar 1 [cited 2013 Mar 7];74(2):195–207. Available from:
http://crl.acrl.org/content/74/2/195.abstract

Images
Ponderings for All Things Blog. 2010. Available from:
http://ponderingsofallthings.blogspot.com/2010/05/tale-of-two-cities-charles-dickens.html
Reading Charles Dickens Blog. Manette in Bastille. 2012. Available from:
http://readingcharlesdickens.com/wp-content/uploads/2012/07/Manette-in-Bastille-253x300.jpg
Grandma’s Graphics. Old Scrooge sat busy in his counting-house. 2000. Available from:
http://www.grandmasgraphics.com/graphics/childrens/childrens379_2000.jpg
Sungardas Blog. Apple to Orange. 2010. Available from:
http://blog.sungardas.com/wp-content/uploads/Apple-to-Orange.jpg
Patel R. Questions?. Flickr. 2007. Available from:
https://www.flickr.com/photos/23679420@N00/545653437/
Biomedical Engineering Laboratory. Wikimedia. 2012. Available from:
http://upload.wikimedia.org/wikipedia/commons/a/a3/Biomedical_Engineering_Laboratory.jpg

kevin.read@nyumc.org
Questions?
