A Tale of Two Data Catalogs

This presentation will describe two studies undertaken to build two separate data catalogs: the first for NIH-funded datasets and the second for institutional datasets created within an academic medical center.

To inform the creation of an NIH data catalog, the first study aimed to a) develop a set of minimal metadata elements for describing datasets, and b) analyze NIH-funded research articles to identify datasets for which the article gives no indication that the data have been shared in a data repository. This study served as the foundation for developing an index of all NIH-funded datasets, and showed which repositories researchers use most often to share their data.

The second study was spurred on by the first, and involved interviewing institutional faculty members and researchers to learn more about how they collect data, what challenges they face when collecting data, whether they’ve thought about sharing data, and what they would find most useful from an institutional data catalog. The results of this study informed the workflows, metadata creation, and requirements for building a data catalog within the medical center. Additionally, interview responses were used to further inform the data services provided by the health sciences library, including education, research consultations and clinical quality improvement initiatives.

Both studies provide examples of how a librarian working in the health sciences can contribute to, and participate in, data-related services within their institution.

Slide notes
  • Keeping in mind some of the issues we faced with data description, and the lack of standards for developing metadata for biomedical data sets, I am now going to switch gears to talk about the second component of my project, which involved searching PubMed and PMC for data sets that have not been deposited in a repository. This exercise is useful for discovering how much work would be required to describe all the data sets created in an article, and for determining whether there is an adequate way to describe the different types of data that are created.
  • This slide demonstrates the stages of exclusion applied to arrive at the final sample for analysis.
  • It is important to mention here that although this search found a large number of name variations, after accounting for overlap with PubMed’s search only 230 articles in total were excluded.
  • XML keyword exclusions. There is an issue here: the bar on the right is titled “Multiple Keywords”. This is unfortunate, because whenever more than one repository was mentioned the article was counted under “Multiple Keywords” rather than toward each repository itself. What is interesting about these multiples, however, is that a number of articles mentioned several repositories.
  • Added phrases from the 45 repository names to the Compound Word Dictionary (aka Phrase List), if not already present in the indexes for a database. For PMC, that means any mention in the full text will now be a search access point (retrievable); BUT the re-indexing of PMC has not yet been done, so none of these phrases are searchable yet. I can’t get a definite date out of NCBI; I will search every Monday until I see one of the phrases post (this is my test search because I have a PMCID where it occurs in the acknowledgements: Neuroimaging Informatics Tools and Resources Clearinghouse). For PubMed, that means any mention in the citation (essentially the article title or abstract) will get retrieved. Kathi Canese told me it was implemented for PubMed this week (after last weekend’s full re-index of the database), but I can’t get my test search to retrieve in PubMed even though I know 5 citations have that phrase. So I’ll wait for another weekend re-indexing and test again next week before reporting this to Kathi. This has nothing to do with the SI (Secondary Source Identifier) field, aka DatabanksList, where LO picks up mention of only certain repositories from the full text article. We may be expanding the list of SI databanks for the coming indexing year, but no decision on that yet. Request for advice is in to Dennis Benson, NCBI since August 6.
  • Thirty NLM and BD2K staff members reviewed the 383 articles. Each reviewer was given 25 articles, and two people reviewed the same 25 for validation.
  • This slide illustrates the different measurements and data collection steps that occur within a single article, and exemplifies the complexity of the data we are working with.
  • It is hard to imagine how some data would be repurposed (e.g., virology, basic science). Access to data is necessary for validation and reproducibility, but it is VERY difficult to use without the accompanying article, which provides context, describes the methodology, and provides the logic for drawing conclusions from multiple data sets. Is the publication the best route for accessing such data? If so, what does that mean for a data catalog?
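The keyword-exclusion step described in the notes above can be sketched roughly as follows. This is only an illustrative sketch, not the method the project actually used: the function names, the sample repository list (a handful of names taken from the slides rather than the full set of 45), and the whole-word, case-insensitive matching rule are all assumptions. It tallies each repository separately, so an article that mentions several repositories counts toward each one instead of a single "Multiple Keywords" bucket.

```python
import re
from collections import Counter

# Assumption: a small sample of repository names drawn from the slides;
# the study used 45 names, which are not listed in this document.
REPOSITORY_NAMES = ["GenBank", "PDB", "FlyBase", "WormBase", "Mouse Genome Informatics"]


def find_repository_mentions(text, names=REPOSITORY_NAMES):
    """Return the set of repository names found in the text.

    Matching is case-insensitive and on whole words (an assumed rule).
    """
    found = set()
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(name)
    return found


def tally_exclusions(articles, names=REPOSITORY_NAMES):
    """Split articles into excluded and remaining, counting mentions per repository.

    `articles` maps a (hypothetical) article ID to its text.
    An article is excluded if it mentions at least one repository.
    """
    per_repository = Counter()
    excluded, remaining = [], []
    for article_id, text in articles.items():
        mentions = find_repository_mentions(text, names)
        if mentions:
            excluded.append(article_id)
            per_repository.update(mentions)  # one count per repository mentioned
        else:
            remaining.append(article_id)
    return excluded, remaining, per_repository


if __name__ == "__main__":
    sample = {
        "a1": "Sequences were deposited in GenBank and the PDB.",
        "a2": "Blood pressure was measured in mice.",
    }
    excluded, remaining, counts = tally_exclusions(sample)
    print(excluded, remaining, dict(counts))
```

Simple string matching of this kind will also flag articles that mention a repository without depositing data there, so the excluded set would still need manual review.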
  • Transcript

    • 1. Table of Contents: I. NIH Data Discovery Index (Methodology; Findings; Questions raised); II. Institutional Data Interviews (Methodology; Findings); III. Outcomes (Benefits to the library). By: Charles Dickens
    • 2. It was the best of times…
    • 3. NIH Big Data to Knowledge (BD2K): Facilitating Broad Use of Biomedical Big Data
    • 4. NIH Data Discovery Index: Datasets are CITABLE; Datasets are DISCOVERABLE; Datasets are LINKED TO THE LITERATURE; Datasets are PART OF THE RESEARCH ECOSYSTEM
    • 5. NIH Data Sharing Repositories
    • 6. Searching for NIH-funded unidentified datasets in PubMed and PMC
    • 7. [Flow diagram: stages of exclusion. Labels: NIH-funded articles for 2011; Non-PMC Articles; Non-research Articles; Molecular Sequence Data MH; SI Field; PMC Acknowledgements; XML; Remaining articles with unidentified datasets. Article counts shown: 113,089; 88,592; 78,901; 75,441; 71,913 (SI Field); 71,680; 69,857 (XML).]
    • 8. [Bar chart: SI Field. Axis: Excluded Articles.]
    • 9. [Bar chart: PMC Acknowledgements. Axis: Excluded keywords.]
    • 10. [Bar chart: XML Keyword. Axis: Excluded keywords. Labels shown: FlyBase:GeneNetwork:Mouse Genome Informatics:Neuroscience Information Framework:Rat Genome Database:WormBase:Zebrafish Model Organism Database; GenBank:PDB.]
    • 11. NIH-sponsored data repositories now added to PubMed and PMC search indexes
    • 12. 383 articles. What category of dataset was used for the research described in the article? Were live human or animal subjects used in the collection of the data? What were the subject(s) of study (from which or whom the data was collected)? If new dataset(s) were created, what type(s) of data were collected? What existing dataset(s), if any, were used? How many datasets are there in each article?
    • 13. Measuring blood pressure in mice; Measuring left hemisphere of brain for growth factor; Staining and imaging; Analysis of images using software
    • 14. Results
    • 15. Average number of datasets per article: 2.92
    • 16. % of datasets that use live subjects: 54%; Human: 51%; Animal: 49%
    • 17. % of new data: 87%; % of data created using pre-existing datasets: 13%
    • 18. It was the worst of times…
    • 19. Data Types: Image; Genetic or Genomic; Chemical; Biochemical; Electrical (Electrophysiological); Optical – non-image; Behavioral; Computational Simulation or model; Magnetic Resonance – non-image; Structural; Physiological; Questionnaire/Survey; Clinical Measures; Geospatial
    • 20. Inter-rater Reliability: 43%. [Bar chart: Total number of datasets found per 25 articles. Series: Total # of datasets (High); Total # of datasets (Low).]
    • 21. How do we define a data set? Dataset
    • 22. How do we define a data set? Datasets
    • 23. How do we define a data set? Datasets
    • 24. Where in the collection/processing pipeline should data be described?
    • 25. Book of the Second: Understanding institutional data challenges via interviews
    • 26. Institutional Data Catalog: Organize and describe institutional research data; Promote collaboration within the institution; Promote a culture of sharing and transparency
    • 27. Methodology: Literature review; ID researchers/PIs using active grant system; Analyzed datasets in researcher papers before interviews (used NIH Data Discovery Index method)
    • 29. Data Interviews. [Bar chart: Challenges Organizing Data – Basic Science Researchers. Categories: Postdocs or student leaves with data; Lack of standards/procedures; Size of data; Messiness/Disconnect between datasets; Too challenging.]
    • 30. Data Interviews. [Bar chart: Challenges Preserving Data – Basic Science Researchers. Categories: Storage expense; Changes in software; Lack of IT resources; Lack of preservation procedures (readme, plans, postdoc etc.); Data in multiple storage locations; Storage space.]
    • 31. Data Interviews. [Bar chart: Challenges Organizing Data – Clinical Researchers. Categories: Data quality; Messiness/Disconnect between datasets; Poor data output formats; Can't search data; Data loss; Team miscommunication on who's using data.]
    • 32. Data Interviews. [Bar chart: Experience with Data Sharing. Series: Basic Science; Clinical. Categories: Collaboration only; unknown parties; data repository; general public; primary results only; Do not share.]
    • 33. Only the best of times… How the library benefitted from this exercise
    • 34. Identified group to pilot institutional data catalog – Population Health
    • 35. Acquired new opportunities for teaching data management
    • 36. Developing a lab tool for basic scientists to manage metadata
    • 37. Developed a better understanding of researcher needs and challenges
    • 38. Acknowledgements. BD2K Project: Lou Knecht, Jim Mork, Kathel Dunn, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Dr. Donald Lindberg. Annotators: Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn Sinnott
    • 40. Images: Ponderings for All Things Blog. 2010. Available from: dickens.html Reading Charles Dickens Blog. Manette in Bastille. 2012. Available from: Bastille-253x300.jpg Grandma’s Graphics. Old Scrooge sat busy in his counting-house. 2000. Available from: Sungardas Blog. Apple to Orange. 2010. Available from: Patel R. Questions?. Flickr. 2007. Available from: Biomedical Engineering Laboratory. Wikimedia. 2012. Available from: g_Laboratory.jpg
    • 41. Questions?