Conquering Chaos in the Age of Networked Science: Research Data Management

Adaptation of the NECDMC First Module
Kathryn M. Houk, MLIS
Tufts University Hirsh Health Sciences Library
Wednesday June 4, 2014
Librarians: Your Partners in Research

Today’s Objectives
• Recognize what research data is and what data
management entails
• Recognize why managing data is important
• Identify common data management issues
• Learn best practices and resources for managing these
issues
• Learn about how the library can help you identify data
management resources, tools, and best practices

What is Data?
• “Research data, unlike other types of information, is
collected, observed, or created, for purposes of
analysis to produce original research results”
(University of Edinburgh).
• Observational
• Experimental
• Simulation data
• Derived or compiled data

Why Should I Manage it?
• Transparency & Integrity
• Compliance

Science & Personal Benefits
• Who uses your data now?
• Who COULD use your data?
• Shared/Open Data
• Scientific progress
• Impact on your career
• Citation counts

What if I Don’t Consider RDM?
Data Sharing and Management Snafu in 3 Short Acts:
A data management horror story by Karen Hanson,
Alisa Surkis and Karen Yacobucci.
http://www.youtube.com/watch?v=N2zK3sAtr-4

Data Management Planning vs. a DMP

Data Management Plans
• What types of data will be created?
• Who will own, have access to, and be responsible
for managing these data?
• What equipment and methods will be used to
capture and process data?
• Where will data be stored during and after?

Simplified Data Management Plan
1. Types of data
• What types of data will you be creating or capturing? (experimental measures, observational
or qualitative, model simulation, existing)
• How will you capture, create, and/or process the data? (Identify instruments, software,
imaging, etc. used)
2. Contextual Details (Metadata) Needed to Make Data Meaningful to Others
• What file formats and naming conventions will you be using?
3. Storage, Backup and Security
• Where and on what media will you store the data?
• What is your backup plan for the data?
• How will you manage data security?
4. Provisions for Protection/Privacy
• How are you addressing any ethical or privacy issues (IRB, anonymization of data)?
• Who will own any copyright or intellectual property rights to the data?
5. Policies for re-use
• What restrictions need to be placed on re-use of your data?
6. Policies for access and sharing
• What is the process for gaining access to your data?
7. Plan for archiving and preservation of access
• What is your long-term plan for preservation and maintenance of the data?

Creating a DMP & Considering
Long-Term DM Issues
• Read the case study provided
• Your group is assigned a set of questions (labeled Group 1-6)
to answer as best you can
• The first set of questions comes from one section of the simplified DMP
• The second set of questions highlights an issue that arises in day-to-day or
long-term management of research data (a more detailed level)
• Elect a group speaker
• Each group will discuss their answers
• We will go over the issue associated with your section, common
problems, and best practices

Group 1
• DMP Section 1: Types of Data
1. What types of data (e.g. images, lists of readings, text
documents) are being collected for this study?
2. What analytical methods and tools are being used in this
study?
3. What types of data will be generated from these analytical
tools and methods?
• Detailed Planning
1. What naming conventions are being used in the lab?
2. Is there a structure for saving files in the lab?
3. What kind of information would you include in a naming
convention for files?
4. What kinds of things would you avoid in naming/labeling files?

Issue: Records Management
• Does this sound familiar?
• Inconsistently labeled files
• in multiple versions…
• inside poorly structured folders…
• stored on multiple media…
• in multiple locations…
• and in various formats…

Issue: Records Management
• Best Practices:
• Avoid special characters in a file name.
• Use capitals or underscores instead of periods or spaces.
• Use 25 or fewer characters.
• Use documented & standardized descriptive information
about the project/experiment.
• Use the ISO 8601 date format: YYYYMMDD.
• Include a version number.
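
The naming rules above can be sketched in a few lines of Python. This is only an illustration: the function name `build_filename`, the example project and description labels, and the choice to apply the 25-character limit to the name without its extension are all assumptions, not part of the original guidance.

```python
import re
from datetime import date

MAX_LENGTH = 25  # limit from the best-practice list; assumed to exclude the extension


def build_filename(project: str, description: str, when: date,
                   version: int, extension: str) -> str:
    """Return a name such as 'Heart_ECG_20140604_v02.csv' (hypothetical labels)."""
    def clean(part: str) -> str:
        # Drop spaces, periods, and special characters; keep letters and digits.
        return re.sub(r"[^A-Za-z0-9]", "", part)

    stem = "_".join([
        clean(project),
        clean(description),
        when.strftime("%Y%m%d"),  # ISO 8601 basic date format
        f"v{version:02d}",        # zero-padded version number
    ])
    if len(stem) > MAX_LENGTH:
        raise ValueError(f"{stem!r} has {len(stem)} characters; limit is {MAX_LENGTH}")
    return f"{stem}.{extension.lstrip('.')}"
```

For example, `build_filename("Heart", "ECG", date(2014, 6, 4), 2, "csv")` gives `Heart_ECG_20140604_v02.csv`. Raising an error on long names, rather than truncating silently, keeps the descriptive parts of the name intact.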

Group 2
• DMP Section 2: Contextual Details (Metadata)
1. What contextual details would the researcher need to document
to make her data meaningful to others?
2. How would a lack of naming and labeling conventions impact later
data access by other researchers and possibly herself?
• Detailed Planning
1. What general information do you think is needed for scientific data
to make it discoverable? (e.g. think of a search screen with a
dropdown menu of fields you can search: Title, Author, Genre, etc.)
2. Are you aware of any metadata standards for the life or health
sciences?
3. Do you think all metadata has to be hand-entered or recorded?
4. How would you ensure lab members knew to collect and record
specific information in standard ways?

Issue: Metadata
• How will someone make sense of your data (e.g. the cells
and values of your spreadsheet)?
• What universal or disciplinary standards could be used to
label your data?
• How can you describe a data set to make it
discoverable?

Issue: Metadata
• Biology and health-specific metadata examples

Issue: Metadata
• Title
• Creator
• Identifier
• Subject
• Funders
• Rights
• Access information
• Language
• Dates
• Location
• Methodology
• Data processing
• Sources
• List of file names
• File Formats
• File structure
• Variable list
• Code lists
• Versions
• Checksums
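
Several of the fields above (list of file names, file formats, checksums) can be generated rather than typed by hand. The sketch below is one possible approach, not a metadata standard: the function names `sha256_of`, `describe_dataset`, and `to_json`, the record layout, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_dataset(folder: Path, title: str, creator: str) -> dict:
    """Build a minimal dataset-level record (hypothetical layout)."""
    files = sorted(p for p in folder.iterdir() if p.is_file())
    return {
        "title": title,
        "creator": creator,
        "files": [
            {"name": p.name, "format": p.suffix.lstrip("."), "sha256": sha256_of(p)}
            for p in files
        ],
    }


def to_json(record: dict) -> str:
    """Serialize the record as readable JSON text."""
    return json.dumps(record, indent=2)
```

The remaining fields (subject, funders, rights, methodology, and so on) still need to be written by someone who knows the project; the code only covers the parts a computer can fill in reliably.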

Issue: Metadata
• Best Practices
• Describe the contents of data files
• Define each parameter and its units
• Explain the formats for dates, time, geographic coordinates,
and other parameters
• Define any coded values
• Describe quality flags or qualifying values
• Define missing values
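
One way to put these practices to work is to keep a small data dictionary next to the data and check files against it. The sketch below assumes a CSV file; the column names, the units, the smoker codes, and the missing-value code `-999` are invented for illustration and do not come from the case study.

```python
import csv
import io

# Hypothetical data dictionary: one entry per column of the data file.
DATA_DICTIONARY = {
    "subject_id": {"description": "Anonymized participant code", "units": None},
    "visit_date": {"description": "Date of clinic visit", "units": "YYYYMMDD"},
    "sbp": {"description": "Systolic blood pressure", "units": "mmHg"},
    "smoker": {"description": "Smoking status", "units": None,
               "codes": {"0": "no", "1": "yes"}},
}
MISSING_VALUE = "-999"  # assumed code meaning "not recorded"


def decode_rows(text: str) -> list:
    """Read CSV text, turning missing-value codes into None and decoding coded values."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for column, value in raw.items():
            if column not in DATA_DICTIONARY:
                # An undocumented column is a sign the dictionary is out of date.
                raise KeyError(f"Column {column!r} is not documented")
            if value == MISSING_VALUE:
                row[column] = None
            else:
                codes = DATA_DICTIONARY[column].get("codes")
                row[column] = codes[value] if codes else value
        rows.append(row)
    return rows
```

Because the reader fails on any column or code that is not in the dictionary, gaps in the documentation show up while the data is still fresh in the lab's memory.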

Group 3
• DMP Section 3: Data Backup, Storage, and Security
1. Where and on what media will the data from each source be
stored?
2. How, how often, and where will the data be backed up?
3. Are there any security concerns for the data and have they
been addressed?
• Detailed Planning
1. How many copies of your data do you think you should have
and where should you keep them?
2. Is there any group on campus you think could help you with
backup and security/access concerns?
3. What are some good data storage and backup practices you
know about or practice?

Issue: Backup & Security
• How often should data be backed up?
• How many copies of data should you have?
• Where can you store your data?
• How much server space can I get?

Issue: Backup & Security
• Best Practices
• Make 3 copies (original + external/local + external/remote)
• Have them geographically distributed (local vs. remote)
• Use a hard drive (e.g. Vista backup, Mac Time Machine, UNIX
rsync) or a tape backup system
• Cloud storage: private-sector examples include Amazon S3,
Elephant Drive, Jungle Disk, Mozy, and Carbonite
• Unencrypted storage is ideal because it keeps your data most
easily readable by you and others in the future. If you must
encrypt your data (for example, because it involves human
subjects):
• Keep passwords and keys on paper (2 copies) and in a PGP
(Pretty Good Privacy) encrypted digital file
• Uncompressed storage is also ideal. If you need to compress
to conserve space, limit compression to your 3rd backup copy
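
A backup is only useful if the copies match the original. The sketch below copies one file to a local and a remote location and verifies each copy by checksum. It is a minimal illustration, assuming both locations are reachable as ordinary folders (for example a mounted external drive and a mounted network share); the function names and the use of SHA-256 are assumptions, and a real setup would more likely rely on the backup tools named above.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(original: Path, destinations: list) -> dict:
    """Copy `original` into each destination folder and verify every copy.

    Returns a mapping of copy path to True (checksum matches) or False.
    """
    expected = sha256_of(original)
    results = {}
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = folder / original.name
        shutil.copy2(original, copy)  # copy2 also preserves file timestamps
        results[str(copy)] = sha256_of(copy) == expected
    return results
```

Calling `back_up(Path("readings.csv"), [Path("/mnt/external"), Path("/mnt/remote")])` (hypothetical paths) would give the original plus two verified copies, matching the 3-copy rule.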

Group 4
• DMP Sections 4 (Data protection/privacy) and 5 (Policies for
reuse of data)
1. How is the lab addressing any privacy or ethical issues?
2. Who will own any copyright or intellectual property rights to
the data?
3. Are there any restrictions to the reuse of the data?
• Detailed Planning
1. Are there any reasons to not share or reuse data? Are these
ethical or cultural issues?
2. Will having public funding affect data sharing and reuse
differently than having private funding?
3. Who has the right to make decisions about reuse of your data?

Issue: Ownership & Retention
• Intellectual Property Policy
• IRB data retention policy
• Funders’ data retention policy
• Publishers’ data retention policy
• Federal and State laws

• How long is long enough?

• IRB OHRP Requirements: 45 CFR 46 requires research records to be retained
for at least 3 years after the completion of the research.
• HIPAA Requirements: Any research that involved collecting identifiable health
information is subject to HIPAA requirements. As a result records must be
retained for a minimum of 6 years after each subject signed an authorization.
• FDA Requirements (21 CFR 312.62(c)): Any research that involved drugs,
devices, or biologics being tested in humans must have records retained for a
period of 2 years following the date a marketing application is approved for the
drug for the indication for which it is being investigated; or, if no application is
to be filed or if the application is not approved for such indication, until 2 years
after the investigation is discontinued and FDA is notified.
• VA Requirements: At present records for any research that involves the VA
must be retained indefinitely per VA federal regulatory requirements.
• Intellectual Property Requirements: Any research data used to support a
patent must be retained for the life of the patent in accordance with the
Intellectual Property Policy.
• Check your funder's and publisher's requirements
• Questions of data validity: If there are questions or allegations about the validity
of the data or appropriate conduct of the research, you must retain all of the
original research data until such questions or allegations have been completely
resolved.
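
When several fixed-term requirements apply to the same study, the records must be kept until the latest of the resulting dates. The sketch below shows only that arithmetic for the 3-year, 6-year, and 2-year periods listed above. It is an illustration, not compliance advice: the function and parameter names are assumptions, and it deliberately ignores the open-ended cases (VA research, patent life, funder or publisher rules, and unresolved questions about the data).

```python
from datetime import date
from typing import Optional


def add_years(start: date, years: int) -> date:
    """Add whole years, moving 29 February to 28 February when needed."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def earliest_disposal_date(completed: date,
                           last_authorization: Optional[date] = None,
                           fda_event: Optional[date] = None) -> date:
    """Return the latest date produced by the fixed-term periods that apply.

    completed:          date the research was completed (3-year period)
    last_authorization: date the last subject signed an authorization (6-year period)
    fda_event:          marketing approval date, or the date the investigation
                        was discontinued and FDA notified (2-year period)
    """
    candidates = [add_years(completed, 3)]
    if last_authorization is not None:
        candidates.append(add_years(last_authorization, 6))
    if fda_event is not None:
        candidates.append(add_years(fda_event, 2))
    return max(candidates)
```

For a study completed on 4 June 2014 whose last authorization was signed on 15 January 2013, the 6-year period ends later than the 3-year period, so the function returns 15 January 2019.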

Group 5
• DMP Section 6: Policies for access and sharing
1. How will others be able to gain future access to the study
data?
2. How does the graduate student plan to link her datasets to her
published article?
• Detailed Planning
1. Could there be a use for the graduate student’s data that was
not used in the published article?
2. Are the data the student collected in open or proprietary
formats (will people need specialized software to access and
interpret the data)?
a) How would this affect future accessibility & reuse?

Group 6
• DMP Section 7: Plan for archiving and preservation of access
1. What is the long-term strategy for maintaining, curating and
archiving the data?
2. Where will the data be stored?
3. What contextual data (data that describes your data) or other
related data will be included in the archive?
• Detailed Planning
1. What data should be included in an archive?
2. Do you know of any data repositories that you could use for your
data?
3. How can you ensure that your data is discoverable and
interpretable?
4. How long should the data be maintained? What factors affect the
length of time you retain your data?

Issue: Long-Term Planning
• What will happen to my data after my project ends?
• How can I appraise the value of my data?
• What are my options for archiving and preserving my
data?
• What are my options for publishing and sharing data?

Data Formats
• Is the file format open (i.e. open source) or closed
(i.e. proprietary)?
• Is a particular software package required to read
and work with the data file? If so, the software
package, version, and operating system platform
should be cited in the metadata
• Do multiple files comprise the data file structure? If
so, that should be specified in the metadata
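
The software details asked for above can be captured at the moment a file is written, so they are not reconstructed from memory later. A minimal sketch, assuming the package name and version are supplied by the person running it; `software_environment` and the example package name are hypothetical, and the operating system details come from Python's standard `platform` module.

```python
import platform


def software_environment(package: str, version: str) -> dict:
    """Return the software details to record in a data file's metadata."""
    return {
        "software": package,                    # supplied by the user
        "software_version": version,            # supplied by the user
        "operating_system": platform.system(),  # e.g. 'Windows', 'Linux', 'Darwin'
        "os_release": platform.release(),
    }
```

The result of `software_environment("ExampleStats", "1.0")` can be stored alongside the data file, for instance in the same record that lists its file names and formats.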

Open vs. Proprietary Formats Used
in Research Labs

• Best Practices
• When choosing a file format, select a consistent
format that can be read well into the future and is
independent of changes in applications.
• Prefer non-proprietary formats: open, documented-standard,
unencrypted, uncompressed, ASCII-formatted files are the
most likely to remain readable in the future.

• Librarians can help:
• Identify file formats suitable for long-term preservation
• Interpret your funder or publisher’s repository
requirements
• Find and evaluate a suitable repository for your data
• Upload your data sets to a repository
• Make your data in a repository searchable and
discoverable
• Create a DOI and persistent ID
• Choose metadata standards for increased
discoverability

Issue: Data Stewardship
• Challenges
• Team Science
• Managing Laboratory Notebooks
• Rotating Lab Personnel

Issue: Data Stewardship
• Best Practices
• Define roles and assign responsibilities for data
management
• Identify skills needed to perform tasks outlined in
DMP and match to available staff
• Develop training plans for continuity
• Assign responsible parties and monitor results

How the Library Can Help:
• Teach you, your lab, or your classes about data management best practices
• Write a data management and/or sharing plan
• Comply with federal, funder, and publisher data sharing policies
• Find & submit your data to a repository
• Find standards to describe & label your data & data files
• Find a data set
• Cite others’ data
• Publish a data set
• Get a DOI for a data set
• Measure the citation impact of your data set
• Build a collection of research data that others can search & access
• Archive & preserve your data
• Learn about copyright & license issues surrounding your data

Find Help
• Ask your librarian if the library can help!
• Make it known you are interested in receiving
assistance from the library
• Ask your IT department for information on storage and
security available
• Let them help you make a backup and storage plan

Learn More
• Data Management Principles & Education:
• Research Data MANTRA
• DataONE: Best Practices
• UK Data Archive
• MIT Data Management and Publishing Guide
• Data Management Plans:
• Digital Curation Centre
• DMPTool2
• DataONE: Data Management Planning

Works Cited
Lamar Soutter Library, University of Massachusetts Medical School. 2014.
“New England Collaborative Data Management Curriculum: Module 1.”
http://library.umassmed.edu/necdmc.
DataONE. 2013. “Best Practices for Data Management.”
http://www.dataone.org/best-practices.
MIT Libraries. 2013. “Data Management and Publishing.” MIT.
http://libraries.mit.edu/guides/subjects/data-management/index.html.
Office of Research Integrity. 2013. “Data Management.” United States
Department of Health and Human Services. United States Federal
Government.
http://ori.hhs.gov/education/products/rcradmin/topics/data/open.shtml.
Special thanks to Jen Ferguson, Richard Moore and Glenn Gaudette for
permission to use their slides.
