Every institution creates and uses data for many reasons. Data needs to be collected, described, stored, organized, retrieved, and shared, all things that librarians can help with. But how do you get started when there are many types of data and a range of services that can be offered? I will cover how to leverage the skills librarians already have to work with data and suggest some areas of data and service to get you started.
Inroads into Data: Getting Involved in Data at Your Institution
1. Inroads into Data:
Getting Involved in Data
at Your Institution
Margaret Henderson
Director, Research Data Management
mehenderson@vcu.edu
@mehlibrarian
Beyond the SEA Webinar, November 18, 2015
2. “I believe that knowledge rather than the format
or container should drive our work.”
~ Lucretia McClure, 1997
http://www.mlanet.org/blog/mcclure,-lucretia-w.-(ahip,-fmla)
8. What is Data?
• Research results
• Admission records
• Student course marks
• Patient health records
• Financial statements
• Supply order information
• Inventories
• Surgery counts
• Surgery records
• Genetic sequences
• Computer software
• Study protocols
• Clinical case histories
• Samples
• Physical collections
• Cell lines
• Spectroscopic data
• Oral history interviews
• Surveys
• Laboratory Notebooks
9. “If it gives you pain, it is Big Data.”
~ Donald Brown, Director of Virginia Integrative Data Institute,
speaking at Research Data and Technology Fair presented by
Claude Moore Health Sciences Library, University of Virginia
Health System
Presentation link at http://guides.hsl.virginia.edu/research-fair
11. The Value of Reference Skills
https://commons.wikimedia.org/wiki/File:1930%27s_-_ca._-_Alma_Custead,_Librarian,_and_Staff.jpg
12. Environmental Scan
• PEST – political, economic, social, and
technological factors
• PESTEL – adds environmental and legal factors
• SWOT – strengths, weaknesses, opportunities,
and threats
• Six Forces Model – competition, new entrants,
end users, suppliers, substitutes, and
complementary products
13. Potential Departments
• Information Technology/Technology Services –
backups and security
• Office of Research – grants, research output
for assessment, patents
• Administration – people, financial, facilities
data
• Records – patient health records
• Statistics or Biostatistics department
16. Simplified Data Lifecycle
• Data Management Plan and Ownership
• Organizing: folder and file name suggestions
• Metadata or Readme files
• Clean data and statistics help
• IR, subject repository, or journal that includes supporting data
• Stable file formats, duration as per funder or other policy
18. Data Management Plans
Outlines how a researcher will:
• collect
• organize
• back up
• store
• share
the data for a project, and indicates who the
data steward will be.
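As a rough illustration of the elements above, a draft plan can be checked for gaps before it goes to a funder. This is only a sketch: the section names come from the slide's list, while the function name and the sample draft are hypothetical.

```python
# Section names taken from the slide; everything else here is illustrative.
REQUIRED_SECTIONS = ["collect", "organize", "back up", "store", "share", "data steward"]


def missing_sections(plan: dict) -> list:
    """Return the required sections that are absent or left blank in a draft plan."""
    return [s for s in REQUIRED_SECTIONS if not str(plan.get(s, "")).strip()]


# Hypothetical draft that has not yet covered storage or named a data steward.
draft = {
    "collect": "Microscope images exported as TIFF",
    "organize": "One folder per experiment, dated file names",
    "back up": "Nightly copy to departmental server",
    "share": "Deposit in institutional repository at publication",
}

print(missing_sections(draft))  # ['store', 'data steward']
```

A checklist like this does not judge the quality of a plan; it only flags sections a researcher may have skipped.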
21. NIH Policies
• Public Access: ...all investigators funded by the NIH submit or have
submitted for them to the National Library of Medicine’s PubMed Central
an electronic version of their final peer-reviewed manuscripts upon
acceptance for publication, to be made publicly available no later than 12
months after the official date of publication. https://publicaccess.nih.gov/
• Data Sharing: extension of NIH policy on sharing research resources, and
reaffirms NIH support for the concept of data sharing.
http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html
• Genomic Data Sharing: Applies to all NIH-funded research that generates
large-scale human or non-human genomic data, as well as the use of
those data for subsequent research. Requires a “Genomic Data Sharing
Plan”. Allows for expenses in project budget.
http://grants.nih.gov/grants/guide/notice-files/NOT-OD-07-088.html
22. NSF Policies
NSF Data Sharing Policy
Investigators are expected to share with other researchers, at no more than
incremental cost and within a reasonable time, the primary data, samples,
physical collections and other supporting materials created or gathered in
the course of work under NSF grants. Grantees are expected to encourage
and facilitate such sharing. See Award & Administration Guide (AAG) Chapter
VI.D.4. http://www.nsf.gov/bfa/dias/policy/dmp.jsp
NSF Data Management Plan Requirements
Proposals submitted or due on or after January 18, 2011, must include a
supplementary document of no more than two pages labeled “Data
Management Plan”. This supplementary document should describe how the
proposal will conform to NSF policy on the dissemination and sharing of
research results. See Grant Proposal Guide (GPG) Chapter II.C.2.j for full
policy implementation. https://www.nsf.gov/eng/general/dmp.jsp
Slide courtesy of Amanda Whitmire
24. OSTP Memorandum
Increasing Access to the Results of Federally Funded Scientific
Research, February 22, 2013
“ensuring that, … the direct results of federally funded scientific
research are made available to and useful for the public,
industry, and the scientific community. Such results include peer-
reviewed publications and digital data.”
“develop plans to make the results of federally-funded research
publically available free of charge within 12 months after
original publication.”
https://www.whitehouse.gov/blog/2013/02/22/expanding-public-access-results-federally-funded-research
25. Data Management Plans
• All agencies will require a data management
plan.
• “Not all data need to be shared or
preserved. The costs and benefits of doing
so should be considered in data
management planning.” DOE third principle
http://science.energy.gov/funding-opportunities/digital-data-management/
• DOE and NSF have indicated they will review
and evaluate DMPs
26. Data Sharing
• Digitally formatted data arising from unclassified, publicly
releasable research and programs.
• Decentralized approach to data storage.
• Allow for inclusion of costs for data management and access.
• Will establish a system to enable the identification, attribution,
(federated) storage, and access of digital data.
From NASA FAQ
• “First of all, be reassured that we are not going to force you to
reveal your precious proprietary data prior to publication. No
personal, proprietary or ITAR data is included.”
http://science.nasa.gov/researchers/sara/faqs/dmp-faq-roses/
28. Ownership
• Check institutional policy
• Consult with legal counsel for your institution
• Can’t copyright data so think about licensing
• How to License Research Data
http://www.dcc.ac.uk/resources/how-guides/license-research-data
• Patient Record Ownership by State
http://www.healthinfolaw.org/comparative-analysis/who-owns-medical-records-50-state-comparison
31. Organizing
Use whatever makes sense for the person or group:
• File type
• Date
• Type of analysis
• Project
MyDocuments\Research\Sample20.tiff
vs.
C:\NSFGrant2020\CellDynamics\Images\RatCell_141020.tiff
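The longer example path above nests a file under grant, project, and type of data. A minimal sketch of composing such a path in code follows; the way the path is split into four levels is an assumption, and `project_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def project_path(grant: str, project: str, data_type: str, filename: str) -> PurePosixPath:
    """Compose a grant/project/data-type/file hierarchy (levels are an assumed layout)."""
    return PurePosixPath(grant) / project / data_type / filename


p = project_path("NSFGrant2020", "CellDynamics", "Images", "RatCell_141020.tiff")
print(p)  # NSFGrant2020/CellDynamics/Images/RatCell_141020.tiff
```

Building paths from named parts keeps every file in a project at the same depth, which makes it easier for a group to agree on where things go.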
32. Naming
Use file naming conventions for related files
• Be consistent
• Short yet descriptive
• Avoid spaces and special characters
e.g. File2020.xls
vs.
Project_experiment_celltype_YYYYMMDD.xls
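The "avoid spaces and special characters" advice can be checked automatically. The sketch below assumes one possible rule set (letters, digits, underscores, and hyphens, followed by a single extension); a group may well choose a different one.

```python
import re

# Assumed rule: letters, digits, underscores, hyphens, then one file extension.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9]+$")


def is_safe_filename(name: str) -> bool:
    """True when the name has no spaces or special characters under the assumed rule."""
    return bool(SAFE_NAME.match(name))


print(is_safe_filename("Project_experiment_celltype_20151118.xls"))  # True
print(is_safe_filename("my file (final).xls"))                       # False
```

Note that `File2020.xls` also passes this check: a pattern can catch awkward characters, but not a name that is too short to be descriptive.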
33. Possible elements for file names
• Project/grant name and/or number.
• Date of creation: useful for version control, e.g. YYYYMMDD
• Name of creator/investigator: last name first followed by
(initials of) first name.
• Name of research team/department associated with the
data.
• Description of content/subject descriptor.
• Data collection method (instrument, site, etc.).
• Version number.
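Several of these elements can be assembled into a name by a small helper so that everyone in a group produces the same pattern. This is a sketch only: the element order, the `v02` version style, and the sample values are assumptions, not a standard.

```python
from datetime import date


def build_filename(project: str, description: str, creator: str,
                   created: date, version: int, ext: str) -> str:
    """Join file name elements with underscores, dropping spaces and punctuation."""
    parts = [project, description, creator,
             created.strftime("%Y%m%d"), f"v{version:02d}"]
    # Keep only letters and digits inside each element.
    cleaned = ["".join(ch for ch in part if ch.isalnum()) for part in parts]
    return "_".join(cleaned) + "." + ext


# Hypothetical project, creator, and date.
name = build_filename("CellDynamics", "rat cell", "SmithJ", date(2015, 11, 18), 2, "tiff")
print(name)  # CellDynamics_ratcell_SmithJ_20151118_v02.tiff
```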
36. Metadata
• Descriptive – describes object in question,
whole dataset and each element of the set
• Administrative – preservation, IP rights
• Structural – physical and logical structure of
digital object
• Metadata Standards Directory
http://rd-alliance.github.io/metadata-directory/
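To make the three kinds of metadata concrete, here is a sketch of a single record grouped the same way. The field names and values are hypothetical and do not follow any particular standard; a real project would pick a schema from a directory such as the one linked above.

```python
import json

# Hypothetical record; field names are illustrative, not from a published schema.
record = {
    "descriptive": {
        "title": "Rat cell images, experiment 14",
        "creator": "Smith, J.",
        "keywords": ["cell dynamics", "microscopy"],
    },
    "administrative": {
        "license": "CC0-1.0",
        "retention": "As required by funder policy",
    },
    "structural": {
        "files": ["RatCell_141020.tiff"],
        "format": "TIFF",
    },
}

# JSON keeps the record readable by both people and software.
print(json.dumps(record, indent=2))
```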
37. Readme Files
• Names + contact information for people associated with the
project
• List of files, including a description of their relationship to one
another
• Copyright + licensing information
• Limitations of the data
• Funding sources / institutional support
• Any information necessary for someone with no knowledge of
your research to understand and / or replicate your work.
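A blank Readme with these headings can be generated once and handed to researchers to fill in. The sketch below assumes plain-text output and paraphrased section titles; `readme_template` is a hypothetical helper.

```python
# Section titles paraphrase the slide's list.
README_SECTIONS = [
    "Contacts",
    "File list and relationships",
    "Copyright and licensing",
    "Limitations of the data",
    "Funding and institutional support",
    "Notes for reuse and replication",
]


def readme_template(project_title: str) -> str:
    """Return a plain-text Readme skeleton with one TODO placeholder per section."""
    lines = [project_title, "=" * len(project_title), ""]
    for section in README_SECTIONS:
        lines += [section, "-" * len(section), "TODO", ""]
    return "\n".join(lines)


text = readme_template("Cell Dynamics")
print(text)
```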
42. Data Dictionary
• Define terms used
• If measurements are made, give units and
explain exactly how they were measured or calculated
• State how each item is recorded, especially when there
are multiple options, e.g. date format
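A data dictionary can be as simple as a small table saved alongside the data. The sketch below writes one as CSV; the column names and the two sample variables are hypothetical.

```python
import csv
import io

# Assumed columns: one row per variable in the dataset.
FIELDS = ["variable", "definition", "units", "method", "format"]

entries = [
    {"variable": "collection_date", "definition": "Day the sample was collected",
     "units": "", "method": "Recorded at collection", "format": "YYYY-MM-DD"},
    {"variable": "cell_diameter", "definition": "Mean diameter of imaged cells",
     "units": "micrometers", "method": "Measured from microscope images",
     "format": "decimal number"},
]


def to_csv(rows: list) -> str:
    """Serialize data dictionary rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(entries))
```

Stating the date format explicitly, as in the first row, removes the ambiguity between day-first and month-first dates.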
46. You Can’t Do It All
https://twitter.com/kdnuggets/status/663427070677118976
47. Tools for Data Cleaning
• OpenRefine – cleans data and transforms it to
different formats http://openrefine.org/
• Trifacta Wrangler – free version of the program, so
some limitations
https://www.trifacta.com/trifacta-wrangler
• NLM-Scrubber – clinical text de-identification
https://scrubber.nlm.nih.gov/
• Johns Hopkins Data Science Specialization on Coursera
https://www.coursera.org/specializations/jhudatascience
48. Analysis and Visualization
• The R Project – language and environment for
statistical computing and graphics
https://www.r-project.org/
• Tableau Public – analytical tools and visualizations
without learning a programming language
https://public.tableau.com/s/
• Flowing Data – Nathan Yau has written a couple of
books on statistics and visualization; his website has
examples, tutorials and more
http://flowingdata.com/
50. Publish & Share
IR, subject repository, or journal that includes supporting data.
51. Sharing Data
• Helps to avoid duplication, thereby reducing costs and wasted
effort.
• Promotes scientific integrity and debate.
• Enables scrutiny of research findings and allows for validation of
results.
• Leads to new collaborations between data users and data creators.
• Improves research and leads to better science.
• Enables the exploration of topics not envisioned by the initial
investigators.
• Permits the creation of new datasets by combining data from
multiple sources.
• Increases citations.*
* A study by Piwowar, Day and Fridsma showed a 69% increase in citations:
http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0000308
52. Ways to Share Data
Upload to an open repository: general, subject, or
institutional.
• figshare http://figshare.com/
• Zenodo https://zenodo.org/
• Open Science Framework https://osf.io/
• Dataverse http://dataverse.org/
• Search Registry of Research Data Repositories
http://www.re3data.org/
53. Supplemental file with journal article or link to
the upload.
– Be sure to check the contract.
– Will the data be available to the public as per
OSTP if grant funded?
– Will the rights conflict with institutional ownership
of the data?
Tried and true methods? Send files upon
request. Upload to personal web site.
55. Controlled Access
• Researchers must request access to database,
explaining research and providing IRB
approval forms.
• Data must be anonymized in some way before
being made publicly available.
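One common first step toward anonymizing records is to drop direct identifiers and replace the ID with a keyed pseudonym. The sketch below shows only that step; the field names are hypothetical, and this alone is not sufficient de-identification for public release, which would still need review under institutional and IRB rules.

```python
import hashlib
import hmac


def pseudonym(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def strip_direct_identifiers(record: dict, secret_key: bytes,
                             id_field: str = "patient_id",
                             drop: tuple = ("name", "address")) -> dict:
    """Remove the listed fields and replace the ID field with a pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in drop}
    cleaned[id_field] = pseudonym(str(record[id_field]), secret_key)
    return cleaned


# Hypothetical record and key; a real key would be stored securely, apart from the data.
key = b"example-key-kept-separately"
original = {"patient_id": "12345", "name": "Jane Doe", "address": "1 Main St", "surgery_count": 2}
shared = strip_direct_identifiers(original, key)
print(sorted(shared))
```

Because the pseudonym is keyed, the same patient maps to the same value across files, so records can still be linked without exposing the original ID.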
58. Storage vs Backup
storage = working files
The files you access regularly and change frequently. In
general, losing your storage means losing current
versions of the data.
backup = regular process of copying data separate from
storage.
You don’t really need it until you lose data, but when
you need to restore a file it will be the most important
process you have in place.
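The copying step of a backup can be sketched in a few lines. This example copies one working file to a separate backup folder and confirms the copy matches by checksum; the function names and the temporary folders used for the demonstration are assumptions, and a real backup would run on a schedule against a separate device or service.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file's contents, used to confirm a copy is identical."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy a working file into the backup folder and verify the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps where possible
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source.name} does not match the original")
    return target


# Demonstration in a temporary folder standing in for storage and backup locations.
with tempfile.TemporaryDirectory() as tmp:
    working = Path(tmp) / "storage" / "results.csv"
    working.parent.mkdir()
    working.write_text("sample,value\nA,1\n")
    copy = back_up(working, Path(tmp) / "backup")
    copy_matches = copy.read_text() == working.read_text()

print(copy_matches)  # True
```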
59. Rule of 3
Keep THREE copies of your data –
TWO onsite –
ONE offsite
Example – One: Laptop – Two: External hard drive –
Three: Cloud storage
This ensures that your storage and backup are not all in
the same place – that’s too risky!
http://dataabinitio.com/?p=320
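The rule is simple enough to check in code. The sketch below treats it as a minimum (at least three copies, at least two onsite, at least one offsite), which is an interpretation of the slide; the function name and the location labels are hypothetical.

```python
def follows_rule_of_three(copies: list) -> bool:
    """copies is a list of (label, location) pairs; location is 'onsite' or 'offsite'."""
    onsite = sum(1 for _, location in copies if location == "onsite")
    offsite = sum(1 for _, location in copies if location == "offsite")
    return len(copies) >= 3 and onsite >= 2 and offsite >= 1


# The example from the slide.
example = [
    ("Laptop", "onsite"),
    ("External hard drive", "onsite"),
    ("Cloud storage", "offsite"),
]
print(follows_rule_of_three(example))  # True
```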
62. Appraisal of Data
1. Relevance to Mission
2. Scientific, Social, Cultural, Historical Value
3. Uniqueness
4. Potential for Redistribution
5. Non-Replicability
6. Economic Case
7. Full Documentation
from NECDMC, Module 7 activity, http://library.umassmed.edu/necdmc/modules
based on Whyte and Wilson http://www.dcc.ac.uk/resources/how-guides/appraise-select-data
63. Where to Preserve Data
• Dryad
• Figshare
• Subject Repository
• Institutional Repository
• Government Repository
64. Don’t Forget Print
• Set a schedule to scan lab notebooks and other print
materials (makes for a good backup and easier sharing of
data within the group).
• Print originals should have similar security to digital data (i.e.
good, secure storage and labelling of files).
66. Data Information Literacy
DIL http://www.datainfolit.org/
DataONE Education Modules
https://www.dataone.org/education-modules
The New England Collaborative Data
Management Curriculum (NECDMC)
http://library.umassmed.edu/necdmc/index
72. Librarians and Data
• Subject headings = Organization
• Cataloging = Metadata
• Reference = Data Reference and Interviewing
• Collections = Purchasing data sets, Deciding what
data to keep
• Archives = Preservation, Deciding what to keep
• Instruction = Instruction
• Policy = Funder Policies
• Scholarly Communication = Data Citation,
Licensing
78. References
• Bishop, D. 2015. Who’s Afraid of Open Data. Blog post on BishopBlog.
http://deevybee.blogspot.co.uk/2015/11/whos-afraid-of-open-data.html
• Carlson, Jake R. 2011. "Demystifying the Data Interview: Developing a Foundation for Reference
Librarians to Talk with Researchers about their Data." Reference Services Review 40 (1): 7-23.
• Choudhury, S. 2013. Open Access & Data Management Are Do-Able Through Partnerships. In:
ASERL; 2013 Summertime Summit: "Liaison Roles in Open Access & Data Management: Equal Parts
Inspiration & Perspiration," https://smartech.gatech.edu/handle/1853/48696
• Christensen-Dalsgaard, et al. 2012. Ten Recommendations for Libraries to Get Started with Research
Data Management: Final report of the LIBER working group on E-Science / Research Data
Management. Ligue des Bibliothèques Européennes de Recherche (LIBER)
http://libereurope.eu/wp-content/uploads/The%20research%20data%20group%202012%20v7%20final.pdf
• McClure, Lucretia W. 1997. "Knowledge and the Container." In Health Information Management.
What Strategies? Proceedings of the 5th European Conference of Medical and Health Libraries,
Coimbra, Portugal, September 18–21, 1996, edited by Suzanne Bakker, 258-260: Springer
Netherlands. doi:10.1007/978-94-015-8786-0_86
• Rinehart, Amanda K. September 2015. "Getting Emotional about Data: The Soft Side of Data
Management Services." C&RL News 76 (8): 437-440.
• Ross, Catherine Sheldrick, Kirsti Nilsen, and Marie L. Radford. 2009. Conducting the Reference
Interview: A how-to-do-it Manual for Librarians. 2nd ed. New York: Neal-Schuman Publishers.
79. Resources
• Educating Yourself on Research Data Management: Resources and
Opportunities (resource list) Greater Midwest Region webinar by
Abigail Goben and Rebecca Raszewski, Nov. 16, 2015
• Midwest Data Librarians Symposium - presentations and other
materials http://dc.uwm.edu/mdls/2015/
• Pinfield, Stephen, Andrew M. Cox, and Jen Smith. 2014. "Research
Data Management and Libraries: Relationships, Activities, Drivers
and Influences." PloS One 9, no. 12: e114734.
doi:10.1371/journal.pone.0114734
• Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with
Confidence: The Datatags System. Technology Science. 2015101601.
October 16, 2015. http://techscience.org/a/2015101601
• Table of NIH Data Sharing Policies and Repositories
https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_policies.html
Editor's Notes
Where to start? First, you have to figure out where you are.