This document summarizes a presentation on research data management for social and behavioral sciences and humanities. The presentation covered topics such as what data management is, why it is important to manage and share data, how to create data management plans, organize data files through naming conventions and folder structures, describe data through metadata and codebooks, issues around data ownership, and data storage, archiving and sharing options. The presentation was aimed at providing guidance to researchers at the University of Utah on best practices for managing and sharing their research data.
Research data management for social sciences and humanities
1. Research data management
and sharing for social and
behavioral sciences and
humanities
Rebekah Cummings, Research Data Management Librarian
J. Willard Marriott Library
September 15, 2015
2. In the next two hours…
— Introductions
— What is data management?
— Why manage and share data?
— Data management plans
— Data organization
— Describing data (metadata!)
— Data ownership
— Data storage and security
— Data archiving and
sharing
— Data services at the
University of Utah
— Wrap-up and Questions
4. Your Research Data Management Librarian
— Provide guidance on data management to the University
of Utah community
— Data management plan consultations
— Help you find a repository to store and share your data
— Help you locate data services on campus
— Provide data management training
— Pilot library data services
6. What is data management?
The process of controlling the
information (read: data)
generated during a research
project.
https://www.libraries.psu.edu/psul/pubcur/what_is_dm.html
7. What are data?
“The recorded factual material
commonly accepted in the research
community as necessary to validate
research findings.”
- U.S. OMB Circular A-110
http://www.whitehouse.gov/
omb/circulars/a110/a110.html
9. SOCIAL SCIENCE DATA
— Well-established data practices, archives, and standards
— Types of data: surveys, interviews, audio-video,
observations, census records, government records, opinion
polls
— Long history of capturing and organizing certain types of
data (e.g. Census data, CLOSER, ICPSR)
HUMANITIES DATA
— Emerging data practices, archives, and standards
— Types of data: text, photographs, newspapers, letters,
birth and death records, records of human history
— More recent history of capturing and organizing data
(e.g. HathiTrust, Chronicling America)
10. What do social science and
humanities data have in common?
— Everyone is working digitally now
— Quantitative and qualitative data
— Often work with data that was not created for
our purposes
— We all have more data than ever before
— New mandates for DMPs and sharing
— All of us could improve our data management
practices
12. Two bears data management
problems
1. Didn’t know where he stored the data
2. Saved one copy of the data on a USB
drive
3. Data was in a format that could only be
read by outdated, proprietary software
4. No codebook to explain the variable
names
5. Variable names were not descriptive
6. No contact information for the co-
author Sam Lee
13. Why manage data?
Your best collaborator is
yourself six months from now,
and your past self doesn’t
answer emails.
14. Why else manage data?
— Save time and increase efficiency
— Meet grant requirements
— Promote reproducible research
— Enable new discoveries from your data
— Make the results of publicly funded
research publicly available
15. Grant requirements and
federal mandates
— National Institutes of Health (2003) – Required data
management plans for grants over $500,000; All manuscripts
in PubMed within 12 months of publication.
— National Science Foundation (2011) – All NSF grants must
have a data management plan.
— The White House OSTP memo (2013) – Federal agencies
with over $100 million/year in R&D must develop a plan to
support public access to research.
16. As of 2015…
— NEH Office of Digital Humanities – Requires two-page
data management plan similar to NSF requirement.
— Bill and Melinda Gates Foundation – Data Access Plan
17. Hypothetical Scenario
You are working on a research project with a
small to medium sized research team. You
uncover something notable in your field and
write up the results of that research, which are
then accepted by a reputable journal. People
start citing your work! Three years later
someone accuses you of falsifying your work.
18. — Would you be able to prove that you did the
work as described in the article?
— What would you need to prove you hadn’t
falsified the data?
— What should you have done throughout your
research study to be able to prove you did the
work as described?
Questions from MANTRA training module
19. Data Management Plans
— What data are generated by your research?
— What is your plan for managing the data?
— How will your data be shared?
21. Elements of a DMP
— Types of data, including file formats
— Data description
— Data storage
— Data sharing, including
confidentiality or security
restrictions
— Data archiving and responsibility
— Data management costs
28. Managing Research Assets
— Identify: Make an audit of what you have and where
— Decide which of your assets you want to keep and
which you don’t need
— Organize your assets: Give them descriptive file
names, organize them into a logical file structure, and
write down your organizational scheme
— Make copies: Keep multiple copies of your research
assets. Back up your reference library frequently.
Every few years, check your copies to see if you need
to export them to a newer format.
From Miriam Posner’s “Managing Research Assets” http://bit.ly/manageresearch
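The audit and copy-checking steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `inventory` and the output columns are hypothetical, and SHA-256 checksums are one common way (not the only one) to confirm that a backup copy still matches the original.

```python
import hashlib
import tempfile
from pathlib import Path


def inventory(root: Path) -> list[dict]:
    """List every file under root with its size and SHA-256 checksum."""
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; fine for a sketch, but
            # very large files would be better hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            rows.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": digest,
            })
    return rows


if __name__ == "__main__":
    # Demo on a throwaway folder so the example is self-contained.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "notes.txt").write_text("hello", encoding="utf-8")
        for row in inventory(Path(tmp)):
            print(row["path"], row["bytes"], row["sha256"][:12])
```

Running the same inventory on the original folder and on a backup, then comparing the checksums, is one way to check that "multiple copies" are really identical.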
32. File naming best practices
1. Be descriptive
2. Don’t be generic
3. Appropriate length
4. Be consistent
5. Think critically about
your file names
33. File naming best practices
— Files should include only letters,
numbers, and underscores/dashes.
— No special characters
— No spaces; Use dashes, underscores,
or camel case (like-this or likeThis)
— Not all systems are case sensitive.
Assume this, THIS, and tHiS are the
same.
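The character and case rules above can be checked automatically. The sketch below is illustrative only: the helper names `is_safe_filename` and `find_case_collisions` are hypothetical, and allowing a single dot for the file extension is an assumption that the slide does not state.

```python
import re

# Letters, numbers, underscores, and dashes, plus an optional extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?")


def is_safe_filename(name: str) -> bool:
    """Return True if the name has no spaces or special characters."""
    return SAFE_NAME.fullmatch(name) is not None


def find_case_collisions(names: list[str]) -> list[list[str]]:
    """Group names that differ only by letter case (this, THIS, tHiS)."""
    groups: dict[str, list[str]] = {}
    for name in names:
        groups.setdefault(name.lower(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(is_safe_filename("20140724_NSF_SoilSamples_Cummings"))
    print(is_safe_filename("July 24 2014_SoilSamples%_v6"))
    print(find_case_collisions(["this.txt", "THIS.txt", "other.txt"]))
```

A check like this only catches mechanical problems. A name such as SoilSamples_FINAL passes it but is still too generic.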
34. Version Control - Numbering
With leading zeros (sorts in order): 001, 002, 003, 009, 010, 099
Without leading zeros (sorts out of order): 1, 10, 2, 3, 9, 99
Use leading zeros for scalability
Bonus Tip: Use ordinal numbers (v1, v2, v3) for major version
changes and decimals for minor changes (v1.1, v2.6)
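The effect of leading zeros on sort order can be shown with a short Python sketch. The `draft_` prefix and the helper name `padded` are hypothetical, chosen only for illustration.

```python
def padded(number: int, width: int = 3) -> str:
    """Format a sequence number with leading zeros, e.g. 9 -> '009'."""
    return f"{number:0{width}d}"


numbers = [1, 10, 2, 3, 9, 99]
unpadded_names = [f"draft_{n}.txt" for n in numbers]
padded_names = [f"draft_{padded(n)}.txt" for n in numbers]

# File browsers usually sort names as text, character by character.
print(sorted(unpadded_names))  # draft_10.txt lands before draft_2.txt
print(sorted(padded_names))    # numeric order is preserved
```

A width of three covers up to 999 files. Choosing the width up front is what makes the scheme scale.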
35. Version Control - Dates
If using dates use YYYYMMDD
June2015 = BAD!
06-18-2015 = BAD!
20150618 = GREAT!
2015-06-18 = This is fine too
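Both accepted date formats can be produced with Python's standard `datetime` module. This is a small sketch; the variable names and the sample dates in `stamps` are made up for illustration.

```python
from datetime import date

interview_day = date(2015, 6, 18)

compact = interview_day.strftime("%Y%m%d")  # YYYYMMDD
iso = interview_day.isoformat()             # YYYY-MM-DD

print(compact)
print(iso)

# Year-first dates sort chronologically even when sorted as plain text.
stamps = ["20150618", "20141231", "20150101"]
print(sorted(stamps))
```

Formats such as 06-18-2015 or June2015 lose this property, because the text order no longer matches the calendar order.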
36. Common elements in a file
name
— Project name
— Name of creator
— Description of content
— Name of research team/department
— Date of creation
— Version number
From a DMP: “Each file name, for all types of data, will
contain the project acronym PUCCUK; a reference to the
file content (survey, interview, media) and the date of an
event (such as the date of an interview)”
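A naming convention like the one quoted above is easier to follow when file names are generated rather than typed by hand. The sketch below is one possible arrangement, not the convention from that DMP: the function name `build_filename`, the order of the elements, and the underscore separator are all assumptions.

```python
from datetime import date
from typing import Optional


def build_filename(project: str, content: str, event_date: date,
                   version: Optional[int] = None, ext: str = "txt") -> str:
    """Join date, project acronym, content type, and optional version."""
    parts = [event_date.strftime("%Y%m%d"), project, content]
    if version is not None:
        parts.append(f"v{version:02d}")  # leading zero keeps versions sortable
    return "_".join(parts) + "." + ext


if __name__ == "__main__":
    print(build_filename("PUCCUK", "interview", date(2015, 6, 18)))
    print(build_filename("PUCCUK", "survey", date(2015, 6, 18), version=3, ext="csv"))
```

Putting the date first means the files for a project sort chronologically. Putting the project acronym first would group them by project instead.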
38. Who filed better?
1. July 24 2014_SoilSamples%_v6
2. 20140724_NSF_SoilSamples_Cummings
3. SoilSamples_FINAL
39. File organization best
practices
— Top level folder should include
project title and date.
— Sub-structure should have a clear
and consistent naming convention.
— Document your folder structure in a
README text file.
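These three practices can be set up in one step at the start of a project. The following is a minimal sketch: the subfolder names in `SUBFOLDERS`, the function name `create_project`, and the README layout are hypothetical choices, not a required structure.

```python
import tempfile
from pathlib import Path

# Hypothetical sub-structure; adapt the names to your own project.
SUBFOLDERS = ["data_raw", "data_processed", "documentation", "analysis"]


def create_project(root: Path, title: str, start_date: str) -> Path:
    """Create a top-level folder named title_date with a README inside."""
    project = root / f"{title}_{start_date}"
    project.mkdir(parents=True, exist_ok=True)
    lines = [f"Project: {title}", f"Start date: {start_date}", "", "Folders:"]
    for name in SUBFOLDERS:
        (project / name).mkdir(exist_ok=True)
        lines.append(f"  {name}/")
    # The README records the folder structure alongside the data.
    (project / "README.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return project


if __name__ == "__main__":
    # Demo in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project(Path(tmp), "SoilSamples", "20140724")
        print(sorted(p.name for p in project.iterdir()))
```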
41. File Organization Exercise
1. Is there a better way to organize these files?
2. Can you spot any problems with the way these files are
named?
3. What files might be missing from this folder?
44. Research Documentation
— Grant proposals and related reports
— Applications and approvals (e.g. IRB)
— Codebooks, data dictionaries
— Consent forms
— Surveys, questionnaires, interview protocols
— Transcripts, hard copies of audio and video files
— Any software or code you used (no matter how
insignificant or buggy)
45. Three levels of
documentation
— Project level – what the study set out to do,
research questions, methods, sampling
frames, instruments, protocols, members of
the research team
— File or database level – How all the files
relate to one another. A README file is a
classic way of capturing this information.
— Variable or item level – Full label explaining
the meaning of each variable.
http://datalib.edina.ac.uk/mantra/documentation_metadata_citation/
47. Codebooks
Codebooks provide information on the structure,
contents, and layout of a data file.
— Column locations and widths for each variable
— Response codes for each variable
— Codes used to indicate nonresponsive and missing data
— Questions and skip patterns used in a survey
— Data types
— Variable names
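A codebook does not need special software. A plain CSV file with one row per variable covers most of the items above. The sketch below assumes a made-up two-variable survey: the variable names, labels, codes, and the helper `codebook_to_csv` are all hypothetical.

```python
import csv
import io

# One entry per variable: name, full label, data type,
# response codes, and the code used for missing data.
CODEBOOK = [
    {"variable": "resp_id", "label": "Respondent identifier",
     "type": "integer", "codes": "", "missing": ""},
    {"variable": "q1_trust", "label": "Trust in local government",
     "type": "integer", "codes": "1=Low; 2=Medium; 3=High",
     "missing": "-9=No response"},
]


def codebook_to_csv(entries: list[dict]) -> str:
    """Serialize codebook entries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(entries[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


if __name__ == "__main__":
    print(codebook_to_csv(CODEBOOK))
```

Saving this text next to the data file, in a non-proprietary format, keeps the variable names interpretable even if the original analysis software is no longer available.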
49. Structured Data (Metadata)
Unstructured Data:
There was a study put out by Dr. Gary Bradshaw from the
University of Nebraska Medical Center in 1982 called
“Growth of Rodent Kidney Cells in Serum Media and the
Effect of Viral Transformation On Growth”. It concerns the
cytology of kidney cells.
Structured Data:
Title: Growth of rodent kidney cells in serum media and the
effect of viral transformations on growth.
Author: Gary Bradshaw
Date: 1982
Publisher: University of Nebraska Medical Center
Subject: Kidney -- Cytology
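The practical benefit of the structured version is that software can read the fields directly. As a small sketch, the same record is stored below as JSON using Python's standard library. The lowercase field names are an assumption; a real project would normally follow a metadata standard chosen by its repository.

```python
import json

record = {
    "title": "Growth of rodent kidney cells in serum media "
             "and the effect of viral transformations on growth",
    "author": "Gary Bradshaw",
    "date": "1982",
    "publisher": "University of Nebraska Medical Center",
    "subject": "Kidney -- Cytology",
}

# Serialize to text that can be saved next to the data file.
text = json.dumps(record, indent=2)
print(text)

# Reading it back gives direct access to each field, with no need
# to parse a sentence to find the author or the year.
loaded = json.loads(text)
print(loaded["author"], loaded["date"])
```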
50. Metadata Fields - Video
— Type/Format
— .mp4, .avi, .mov
— Run time
— Title
— Producer/author
— Date(s)
— Location(s)
— Place of production
— Content
— Annotations
— Systems Requirement for
access
— Windows, Quicktime,
RealPlayer
— Download requirement
— Size of file
— Software needed
— Contact Info
— Persistent Identifier
— Other documentation
http://www.slideshare.net/RebekahCummings/data-management-for-education-research
51. Type of Metadata - Audio
— Structural
— Relationship to other
audio files in the same
project
— Descriptive
— Title, creator, subject,
description of project,
date, content
— Administrative
— Rights, licensing,
contact person
— Technical
— Equipment used, file
format (MP3, WAV,
FLAC), software for
recording and editing
— Embedded
— Some files have
embedded metadata –
date, file format, etc.
Do not rely on this as
metadata
http://www.slideshare.net/RebekahCummings/data-management-for-education-research
52. Data Documentation
Initiative
— Most recognized standard for describing social
science data and is often recommended for
humanities data as well.
— Used by many data repositories
— Extremely mature, XML-based standard
— Hundreds of elements for data description
— AnalysisUnit = the entity being analyzed in the
study or variable
— DataType = specifies type of data being collected
55. Data citation
— Enables easy reuse and verification of your
data
— Allows the impact of your data to be tracked
— Creates a scholarly structure that rewards
data producers
— Increases citation rate for related publications
(Pienta, 2010)
63. Complication #3 – Data and IP
“The discoverer of a scientific fact as to the
nature of the physical world, an historical fact, a
contemporary news event, or any other ‘fact’
may not claim to be the ‘author’ of that fact. If
anyone may claim authorship of facts, it must
be the Supreme Author of us all. The
discoverer merely finds and records.”
- Melville Nimmer, 1963
64. University policy
“The University of Utah retains ownership
and stewardship of the scientific data and
records for projects conducted at the
University or that use University of
personnel or resources.”
- Research Handbook, Section 9.9
65. University policy (cont.)
“Except where precluded by the specific
terms of a sponsored agreement, tangible
research property, including the scientific
data and other records of research
conducted by the faculty or staff of the
University, belongs to the University.”
- Research Handbook, Section 9.9
66. But what about IP?
University IP includes “the tangible and
intangible results of research (including
for example data, lab notebooks, charts,
etc.)”
- Employee Intellectual Property
Assignment Agreement
67. Intellectual Property –
Copyright and Patents
— Faculty members retain copyright over
their “traditional scholarly products”
but that term is fairly narrowly defined
and would have to be evaluated on a
case-by-case basis
— If you plan on commercializing your
data, you must speak with TVC
(Technology & Venture Commercialization).
68. Recap - Who owns the data?
— The University
— The project sponsor if that was negotiated
in the contract
— Another institution with which you are
collaborating.
— IF you are a faculty member and IF your
data can be defined as a “traditional
scholarly work” you would retain copyright
of your data.
69. Data responsibility
“The P.I. is responsible for the collection, management,
maintenance, and retention of research data
accumulated under a research project. The University
must retain research data in sufficient detail and for an
adequate period of time to enable appropriate responses
to questions about accuracy, authenticity, privacy, and
compliance with laws and regulations governing the
conduct of research. It is the P.I.’s responsibility to
determine what records need to be retained to comply
with sponsor requirements.”
Research Handbook 9.9.2
70. Data responsibility (cont.)
“Research data must be archived for a
minimum of three years after the final
project closeout.”
“The P.I. should develop appropriate
procedures for proper archiving and
tracking of research data.”
Research Handbook 9.9.4
74. Language from a DMP
“All data files will be stored on the University server
that is backed up nightly. The University's computing
network is protected from viruses by a firewall and anti-
virus software. Digital recordings will be copied to the
server each day after interviews.
Signed consent forms will be stored in a locked cabinet
in the office. Interview recordings and transcripts, which
may contain personal information, will be password
protected at file-level and stored on the server.
Original versions of the files will always be kept on the
server. If copies of files are held on a laptop and edits
made, their file names will be changed.”
77. What kind of sensitive data?
— Human subject data
— Patient information
— Environmental data
— Potentially patentable data
78. Working with sensitive data
— If possible, collect the necessary data
without using direct identifiers
— Otherwise, remove all direct identifiers
upon collection or immediately afterwards
— Be careful with indirect identifiers
— Avoid storing or sharing unencrypted
personal data electronically
— Talk to IRB / Check HIPAA guidelines
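Removing direct identifiers can be sketched as follows. This is an illustration under stated assumptions, not a compliance tool: the column names in `DIRECT_IDENTIFIERS`, the `participant_id` format, and the function name `deidentify` are hypothetical, and using the name as the lookup key assumes that names are unique in the dataset. Indirect identifiers (for example age combined with a small town) are not handled here and still need review.

```python
# Hypothetical list of columns treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def deidentify(rows: list[dict]) -> tuple[list[dict], dict]:
    """Replace direct identifiers with a participant ID.

    Returns the cleaned rows and the name-to-ID key. The key links
    IDs back to people, so it must be stored separately and securely.
    """
    key: dict[str, str] = {}
    cleaned = []
    for row in rows:
        # Reuse the ID if this person was seen before, else assign the next one.
        pid = key.setdefault(row["name"], f"P{len(key) + 1:03d}")
        kept = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        cleaned.append({"participant_id": pid, **kept})
    return cleaned, key


if __name__ == "__main__":
    sample = [
        {"name": "A. Example", "email": "a@example.org", "age": 34},
        {"name": "B. Example", "email": "b@example.org", "age": 51},
    ]
    cleaned_rows, id_key = deidentify(sample)
    print(cleaned_rows)
```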
79. Sensitive data (cont.)
— Include information in your consent forms
about how the data will be shared and what
steps will be taken to prevent identity
disclosure.
— During the data collection phase, do not share
sensitive data beyond the research group
— If data will not remain usable with identifiers
removed, consider depositing data in an
archive with controlled access.
81. Tools for Working w/
Sensitive Data
— ICPSR Guide to Social Science Data Preparation and Archiving
(Chapters 5 & 6) -
http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf
— Managing and Sharing Data: UK Data Archive (Ethics and
Consent, pages 22-27)
http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
— Identity Finder, Simple Data Masking, Spider, SSN Scanning Tools
— QualAnon – Tool for anonymizing interview transcripts, typed
field notes, or other qualitative data. Changes identified names
into specified pseudonyms.
86. How to choose a data
repository
— Requirements of funding agency/journal
— Subject or discipline options
— Size of dataset
— File formats accepted
— Accessibility of data
— Budget
— Time
87. Recommended Repositories
— Re3data - index of data repositories at
http://www.re3data.org/browse
— PLOS’s guide -
http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories
— Princeton’s guide -
http://libguides.princeton.edu/c.php?g=84261&p=541339
— Scientific Data’s guide -
http://www.nature.com/sdata/data-policies/repositories
89. Rules for Sharing Your Data
— Publish your data online with a persistent
identifier (DOI or ARK)
— Publish your data in a reputable data
repository
— Convert your data to stable, non-
proprietary formats for long-term access
— Publish enough context to make your data
understandable (metadata, code,
workflows)
— Link your data to your publications as
often as possible
90. Rules for Sharing Data (cont.)
— State how you want to get credit for your
data
— Always cite the sources of data that you
use and include data citations with your
datasets
— Include datasets in your NSF Biosketch or
Faculty Profile
Content from “Ten Simple Rules for the Care and
Feeding of Scientific Data”
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003542
92. Office of the VP for research
Research Integrity and Compliance
— Institutional Review Board (IRB)
— Conflict of Interest
— Research Education Training
— Human Subjects
— Animal Subjects
— Lab Safety
93. Marriott & Eccles Libraries
Daureen Nesdill
Research Data
Management
Librarian,
Sciences
Darell Schmick
Research
Librarian,
Health Sciences
Rebekah Cummings
Research Data
Management
Librarian, Social
Sciences &
Humanities
94. Marriott Library - Software
— Quantitative and Qualitative Analysis
Software – Student Computing Services
— SPSS
— Stata
— NVivo – qualitative data analysis
— ATLAS.ti – qualitative data analysis
— MATLAB
— R
— SAS
95. Marriott Library – Subject Guides
Data subject guides:
http://campusguides.lib.utah.edu/dataanddatasets
96. Marriott Library – Digital
Humanities
— Explore tools with you and help connect you with other digital
humanists.
— Find available digitized source material
— Secure data for text and data mining
— Bookworm – word frequency; visualize trends in historical texts
(built off Google Ngram Viewer)
— MALLET – topic modeling
— APIs for programmatic access to large corpora
— HathiTrust Research Center
— JSTOR Data for Research
— Getty Research Institute
97. Writing a DMP
— DMPTool – https://dmp.cdlib.org/
— ICPSR website -
https://www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/index.html
— Once again… call a data librarian!
98. Data Creation/ Collection
— REDCap – a browser-based tool that allows
investigators to create and administer
surveys. Data is stored on HIPAA/FERPA
compliant servers.
— LabArchives – electronic lab notebooks
being implemented on campus in research
labs and classes. Daureen.nesdill@utah.edu
— Create audio/video recordings – Faculty
Center; Robert.nelson@utah.edu and
tony.same@utah.edu
99. Data ownership and
commercialization
Technology & Venture Commercialization
Office
http://www.tvc.utah.edu/
Dave Morrison, Patent Librarian
Marriott Library, Room 2110K
801-585-6802
Dave.morrison@utah.edu
100. Data Visualization
— SCI Institute – Scientific Computing and
Imaging Institute
https://www.sci.utah.edu/
— GIS assistance at Marriott Library –
justin.sorenson@utah.edu
— Creation of interactive mapping projects
— Locating and creating geospatial data
— How to work with various GIS platforms
(ArcGIS, Google Earth, etc.)
101. Data Storage & Archiving
— Ubox – HIPAA/FERPA compliant; easy to
create account
— 50 GB free - http://box.utah.edu/
— Uspace – Institutional repository
— 40 GB per data submission - http://uspace.utah.edu/
— Center for High Performance Computing
— HIPAA/FERPA compliant ($210/TB for 5 years, more for
quarterly backups) - https://www.chpc.utah.edu/
— ICPSR – U of U is an institutional member –
rebekah.cummings@utah.edu
102. Online Data Management
Training
— ICPSR -
https://www.icpsr.umich.edu/icpsrweb/landing.jsp
— UK Data Archive
http://www.data-archive.ac.uk/help/user-faq#3
— MANTRA Data Management Training -
http://datalib.edina.ac.uk/mantra/
— RDM Rose http://rdmrose.group.shef.ac.uk/
— Data Q http://researchdataq.org/
103. Major takeaways
— Data management starts at the beginning of a
project
— Document your data with a certain level of
reuse in mind
— Consider archiving and sharing options when
you are done with your project
— Don’t overlook campus resources!