1. Data, data everywhere!
Managing & organizing data
[Date]
[Name of Presenter]
[Library logo]
This work is licensed under CC BY 4.0 and is adapted from the
University of Michigan Library presentation at https://doi.org/10.7302/8684
2. Technology
Overview
● Live transcription is available
● Recording is on
● Remain muted during
presentation
● Use chat to share questions
and comments during the
presentation
● Evaluation at the end
● Email follow-up with slides &
recording
3. Learning Objectives
● Recognize why organizing your data is important for enabling
high-quality research.
● Understand recommended practices for keeping your data organized.
● Identify ways to describe your data.
● Be able to experiment with practical strategies to start organizing your
data.
4. Have you ever lost
some data or notes
because of lack of
organization?
Poll
5. ● Text
● Numerical
● Multimedia
● Models
● Software
What is data?
Raw Data
Analyzable
Data
Shareable
Data
6. Categories of data
● Observational: sensor readings,
telemetry, survey results, videos,
images
● Experimental: gene sequences,
chromatograms, magnetic field
readings, and spectroscopy
● Simulation: climate models,
economic models, and systems
engineering
● Compiled: text and data mining,
compiled database, systems
engineering, and 3D models
● Reference: gene sequence
databanks, census data, chemical
structures
Data Types & File Formats
8. 1 rat heart
100s
of slices
100s of
slides
1000s of
image files
TIF TIF TIF TIF TIF TIF
100s of huge
images
3 postdocs
5-7 experiments
a week…
9. Data management for yourself
● Efficiently find your files
● Track your methods for
reproducibility
● Better version control of data
● Quality control
● Avoid data loss
● Document your data for your
own recollection,
accountability, and re-use
● Gain credibility and
recognition for your science
efforts through data sharing!
10. Data management for science
● Data is a valuable asset. It is expensive and time-consuming to collect, so take good care of it!
● Well-managed data:
○ improves quality, accuracy, and integrity of your research
○ maximizes the effective use of data
○ ensures appropriate use of data and information
○ strengthens the reliability of the research - promotes transparency, encourages
accountability, reduces bias and errors
○ ensures sustainability and accessibility, allowing others to reproduce your findings
11. Research reproducibility
● Reproducibility is the ability of other researchers to reach similar
results when using the same methods and data.
○ Replicability is achieving similar results by conducting a new study with different
methods or approaches
● Accurate, comprehensive, and transparent reporting allows for
reproducibility.
12. Data management challenges
● Good data management takes time and planning
● Researchers may lack knowledge about best practices in handling data
● General lack of incentives for doing good data management
● There can be a cost to managing data (human, technology, etc.)
13. Poor data management affects everyone
“MEDICARE PAYMENT ERRORS NEAR $20B” (CNN) December 2004. Miscoding and billing errors from
doctors and hospitals totaled $20 billion in FY 2003 (9.3% error rate). The error rate measured claims
that were paid despite being medically unnecessary, inadequately documented, or improperly coded.
This error rate actually was an improvement over the previous fiscal year (9.8% error rate).
“SOCIAL SECURITY DATA CAN TURN PEOPLE INTO THE LIVING DEAD” (NPR) August 2016. In 2011, an
audit found that about 1,000 people a month in the U.S. were marked deceased when they were very
much alive. Rona Lawson, who works in the Office of the Inspector General at the Social Security
Administration, says that number has gone down. It's now around 500 people a month. Lawson says 90
percent of the time, the cascade of misinformation starts with an input error by Social Security staff — a
regular mistake on a regular office day that just happens to kill a person off, at least on paper.
14. The climate scientists at the centre of a media
storm over leaked emails were cleared of
accusations that they manipulated their
results and silenced critics, but a review found
they had failed to be open enough about their
work.
[Image removed for copyright purposes. Alt text:
Screenshot of article from The Guardian UK titled
"Climategate scientists cleared of manipulating
data on global warming”]
15. “had published duplicate
pictures in several cases
and had repeatedly failed
to exert due diligence in
organising her area of
study over a long period
of time.”
http://retractionwatch.com/2016/11/02/leading-diabetes-
researcher-acted-negligently-probe-concludes/
[Image removed for copyright purposes. Alt text:
Screenshot of Retraction Watch announcement
titled "Leading diabetes researcher acted
negligently, probe concludes”]
16. Don't end up here!
NEJM paper on sleep apnea retracted when original data can’t be found
● Multiple errors in
table
● Did not alter
conclusions in
article
● BUT, could not
locate primary data
[Image removed for copyright purposes. Alt text:
Screenshot of a Retraction Watch announcement
titled "NEJM paper on sleep apnea retracted when
original data can't be found”]
19. ● Communicate thoroughly
○ Provide training
○ Make documentation findable
○ Use a variety of media (email, Slack, lab manual)
○ Regularly check in with and/or remind all contributors
Keys to success
● Maintain good documentation
○ Readme files
○ Data management plan (DMP)
If you can, assign
one person to be
responsible
21. 1. Identify storage solution(s)
● May have multiple storage locations
● Investigate ITS data storage finder:
Data Storage Finder / U-M Information and Technology Services
● Set up appropriate access/permissions
Document!
22. 2. Establish directory/folder structures
● Organize directories hierarchically
● Group files of similar information together in a single directory
● Name directories after aspects of the project rather than individual
researchers
● Separate ongoing and completed work
● Once you have decided on a directory structure, follow it consistently
and audit it periodically
Document!
23. Organising
— UK Data
Service
Good data practices -
Dryad
[Image removed for copyright purposes. Alt text: A
screenshot showing two ways to organize folders
and files. The first is Organized by file type with one
folder for data, both processed and raw, and
another folder for results. The second way is
organized by analysis where there are two folders,
figure 1 and figure 2, and within each of those
folders are the corresponding data and results
folders.]
[Image removed for copyright
purposes. Alt text: A screenshot of a
series of folders showing how they
are organized. The top level includes
a Data folder and a Documentation
folder. Within the Data folder, there
are folders for Databases, Images,
Models, Sound, and Text. Within the
Documentation folder there are
folders for Consent Forms,
Information Sheets, and
Methodology.]
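The second folder layout described above can be created with a short script. This is only a minimal sketch in Python: the helper name create_structure is hypothetical, the folder names come from the screenshot description, and multi-word names are written with underscores as an assumption so they stay consistent with the file naming advice elsewhere in these slides.

```python
from pathlib import Path

# Folder layout taken from the screenshot description above.
# Assumption: multi-word folder names use underscores instead of spaces.
STRUCTURE = {
    "Data": ["Databases", "Images", "Models", "Sound", "Text"],
    "Documentation": ["Consent_Forms", "Information_Sheets", "Methodology"],
}


def create_structure(root):
    """Create the folder hierarchy under root and return the created paths."""
    root = Path(root)
    created = []
    for parent, children in STRUCTURE.items():
        for child in children:
            folder = root / parent / child
            # parents=True builds missing levels; exist_ok=True makes re-runs safe.
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created
```

Calling create_structure("my_project") (a hypothetical project name) builds all eight subfolders, and running it a second time changes nothing, so the same script can double as documentation of the agreed structure.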
24. 3. Develop file naming convention(s)
File names should:
● embody the content of the file
● have intuitive (non-cryptic) names where possible
● be extensible
● be unique, where possible and practical
● not use special characters – restrict file names to
numbers, letters, and underscores
● be named using consistent, documentable rules
Document!
May need to include:
● versioning
● multiple
conventions
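The "numbers, letters, and underscores" rule above is easy to check automatically. The following is a minimal sketch in Python, assuming that a valid name is a stem made of those characters followed by exactly one extension; the function name is_valid_filename is hypothetical, and your own convention may need a stricter or looser pattern.

```python
import re

# Assumption: a name is a stem of ASCII letters, digits, and underscores,
# followed by a single extension such as ".csv".
VALID_NAME = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+")


def is_valid_filename(name):
    """Return True if name matches the stem-plus-one-extension pattern."""
    return VALID_NAME.fullmatch(name) is not None
```

Under this sketch, a name with spaces, parentheses, or extra dots is rejected, which makes it a quick audit step to run over a folder listing.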
25. Implement version control (versioning)
Piled Higher and Deeper by Jorge Cham - PhDComics
[Comic removed for copyright purposes]
26. ISO 8601 - Formatting Dates
Standard way to format date and time - extremely helpful for file naming
conventions
YYYY-MM-DD
YYYYMMDD
ISO 8601 — Date and time format
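Both ISO 8601 forms shown above can be produced with Python's standard datetime module. This is a small illustrative sketch; the function names iso_extended and iso_basic are hypothetical.

```python
from datetime import date


def iso_extended(d):
    """Format a date as YYYY-MM-DD."""
    return d.isoformat()


def iso_basic(d):
    """Format a date as YYYYMMDD (no separators)."""
    return d.strftime("%Y%m%d")
```

One practical benefit is that file names starting with dates in either form sort chronologically when sorted alphabetically, because the year comes first and every part is zero-padded.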
27. Examples
AtherRat_012_056_mb_0423_raw.csv
AtherRat = experiment name
012 = experiment number
056 = sample number
mb = stain used, methylene blue
0423 = 2-digit coordinates of image (4
across, 23 down)
raw = data stage
File naming and organization of data
- University of Ottawa Library
[Image removed for copyright purposes. Alt text:
Chart showing a filename
Project_YYYYMMDD_ContentDescription_Version.ext.
The parts of the filename are defined as the
Project name, date, description of file content,
and version information. Underscores are used
to separate the different parts of the filename.]
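A consistent convention means file names can be read by a script as well as by a person. As a minimal sketch in Python, the AtherRat example above can be split back into its documented parts; the field labels and the function name parse_filename are hypothetical, chosen to mirror the definitions on the slide.

```python
from pathlib import Path

# Field labels are illustrative and follow the order in the AtherRat example.
FIELDS = [
    "experiment_name",
    "experiment_number",
    "sample_number",
    "stain",
    "image_coordinates",
    "data_stage",
]


def parse_filename(filename):
    """Split an underscore-delimited file name into labelled parts."""
    parts = Path(filename).stem.split("_")
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} parts in {filename!r}, found {len(parts)}"
        )
    return dict(zip(FIELDS, parts))
```

A file name that does not follow the convention raises an error instead of being silently misread, which helps catch naming drift early.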
28. 4. Decide on file formats
Whenever possible, select file formats that are:
● non-proprietary
● unencrypted
● uncompressed
● in common usage by the research community
● adherent to an open, documented standard
● interoperable among diverse platforms and applications
● royalty-free and without intellectual property restrictions
● developed and maintained by an established open standards organization
Consider
instrument/device
settings
Document!
29. Recommended file formats
Audio: WAVE, AIFF, MP3, MXF
Containers: TAR, GZIP, ZIP
Databases: XML, CSV
Statistics: ASCII, DTA, POR, SAS,
SAV
Still images: TIFF, JPEG 2000, PNG,
GIF
Tabular data: CSV
Text (documentation, scripts):
XML, PDF/A, Plain Text (ASCII,
UTF-8)
Video: MOV, MPEG, AVI, MXF
Recommended Formats Statement – table of contents | Resources (Preservation, Library of Congress)
31. Documentation
Good data documentation helps ensure accurate reporting of data and
methods:
● Project level - describes the procedures and methods used for data
collection and analysis (workflow, protocols, instruments, etc.)
● Metadata - data about your data
● Data dictionaries - describe the variables in a dataset
● README files - dataset description, guide to files and technology
32. Workflows
The meaning of data depends
on the context of how it was
collected.
[Decorative workflow image
removed for copyright purposes.]
33. Internal workflow
● How and when will the work be done?
● Will data be reviewed for quality?
● Who manages the entire process?
34. What is metadata?
It's the “data about data”: structured information (which may follow documentation
standards) that makes it easier to retrieve, use, or manage data. It can include
things like:
● Dates, times, locations
● Hardware/software information and parameters
● Methodology
● Creators
● Copyright
● Formats
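One simple way to keep metadata with the data is a small "sidecar" file saved next to each data file. The sketch below, in Python, writes the metadata as JSON; the sidecar naming pattern (_metadata.json) and the function name write_metadata are assumptions for illustration, not a documentation standard, and disciplines with an established metadata schema should follow that instead.

```python
import json
from pathlib import Path


def write_metadata(data_file, metadata):
    """Write metadata as JSON next to data_file and return the sidecar path."""
    data_file = Path(data_file)
    # Assumption: the sidecar shares the data file's stem, e.g.
    # sample.csv -> sample_metadata.json
    sidecar = data_file.with_name(data_file.stem + "_metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar
```

The metadata argument is an ordinary dictionary, so its keys can follow the list above (dates, locations, hardware/software, methodology, creators, copyright, formats) with whatever values apply to your project.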
35. Perkel, J. M. (2023). How to make your scientific data accessible,
discoverable and useful. Nature, 618(7967), 1098–1099.
https://doi.org/10.1038/d41586-023-01929-7
[Image removed for copyright purposes. Alt text:
Screenshot of a section of the article that says:
"Nature asked data scientists about their best
practices for publishing usable, high-quality data -
here's what they said. Craft Metadata. If there's
one thing scientists can add to maximize their
data's value, it's "metadata, metadata, metadata",
says environmental scientist Patricia Soranno at
Michigan State University in East Lansing.”]
37. Document your variable names
● Intuitive / meaningful variable names e.g. study_id
● What do variable names mean?
● What does each variable contain?
● Is there a limited set of possible values?
Name | Field Type | Description | Possible values | Units
study_id | text | Unique ID of study | 8-digit number |
date_enrolled | date | Initial subject enrollment date | Date in format YYYY-MM-DD; all dates later than 2011-09-01 |
weight | integer | Weight of subject | | lbs
38. Data Dictionaries
Data dictionaries provide detailed descriptions of the data in a dataset.
They help provide context about the structure and content of data. They
can also guide the process for collection and use of the data.
● List of all data objects
● Description of data elements (size, type, classification, etc.)
● Relationships to other data
● Variables and coding information
● Creation date
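A data dictionary can also be used to check data as it is collected. The following is a minimal sketch in Python based on the three variables documented earlier (study_id, date_enrolled, weight); the structure of DATA_DICTIONARY and the function name validate_record are hypothetical, and a real project would likely use its data collection tool's own validation features.

```python
import re
from datetime import date

# Entries mirror the documented variables: study_id, date_enrolled, weight.
DATA_DICTIONARY = {
    "study_id": {"type": str, "description": "Unique ID of study"},
    "date_enrolled": {"type": date, "description": "Initial subject enrollment date"},
    "weight": {"type": int, "description": "Weight of subject", "units": "lbs"},
}

# The documented rule is "all dates later than 2011-09-01".
EARLIEST_ENROLLMENT = date(2011, 9, 1)


def validate_record(record):
    """Return a list of problems found in record; an empty list means it passed."""
    problems = []
    for name, spec in DATA_DICTIONARY.items():
        if name not in record:
            problems.append(f"{name}: missing")
        elif not isinstance(record[name], spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")

    study_id = record.get("study_id")
    if isinstance(study_id, str) and not re.fullmatch(r"[0-9]{8}", study_id):
        problems.append("study_id: expected an 8-digit number")

    enrolled = record.get("date_enrolled")
    if isinstance(enrolled, date) and enrolled <= EARLIEST_ENROLLMENT:
        problems.append("date_enrolled: expected a date later than 2011-09-01")

    return problems
```

Returning a list of problems, instead of stopping at the first one, lets a reviewer see everything wrong with a record in a single pass.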
41. What do we need to know to use your data?
● Where to find it
● How to access it
● What can it be used for?
● Known problems, inconsistencies, limitations
● Collection methods, units of measure, variable names
● Data integrity
● Ethical/privacy restrictions
● Licensing
● Who to cite
README files
Source: Quick & Dirty Data Management
42. README: Recommended practices
● Create 1 readme file for each data file/dataset
● Name the readme so that it’s easily associated with the datafile(s) it
describes
● Write your readme document as a plain text file
● Format multiple readme files identically (Tip: use a template - create or
find one!)
● Follow the conventions for your discipline
Source: Quick & Dirty Data Management
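A template makes it easier to format every readme identically. Here is a minimal sketch in Python that builds a plain-text readme skeleton from the "What do we need to know to use your data?" list; the _README.txt naming pattern, the placeholder text, and the function names are assumptions for illustration, so adapt them to the conventions of your discipline.

```python
from pathlib import Path

# Section headings adapted from the "What do we need to know" list.
README_SECTIONS = [
    "Where to find the data",
    "How to access it",
    "What it can be used for",
    "Known problems, inconsistencies, limitations",
    "Collection methods, units of measure, variable names",
    "Data integrity",
    "Ethical/privacy restrictions",
    "Licensing",
    "Who to cite",
]


def readme_path_for(data_file):
    """Name the readme so it is easily associated with its data file."""
    data_file = Path(data_file)
    return data_file.with_name(data_file.stem + "_README.txt")


def build_readme(data_file):
    """Return the plain-text readme skeleton for data_file."""
    lines = [f"README for {Path(data_file).name}", ""]
    for section in README_SECTIONS:
        lines.append(section.upper())
        lines.append("[to be completed]")
        lines.append("")
    return "\n".join(lines)
```

Writing the result with readme_path_for(data_file).write_text(build_readme(data_file), encoding="utf-8") keeps the readme as a plain text file beside the data it describes.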
44. ● Pick 1 project, implement 1 new tactic
● Choose 1 aspect of a file naming convention and apply it to your files
● Plan out folder names and hierarchy before starting a new project
● Do a scan to figure out where all your data files are & document everything in a readme file
● Review current resources/tools for where you can gather metadata and documentation
without changing your workflow
● Weave in recommended practices where you can; practice leads to improvement!
● Get an organization accountability buddy & check in occasionally
Ways to get started
Consider your
situation and goals!
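The "do a scan" step can be partly automated. This is a minimal sketch in Python that lists every file under a folder with its size and counts files by extension; the function names inventory and count_by_extension are hypothetical, and the output is meant as raw material to paste into a readme, not a complete inventory tool.

```python
from collections import Counter
from pathlib import Path


def inventory(root):
    """Return (relative path, size in bytes) for every file under root, sorted by path."""
    root = Path(root)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


def count_by_extension(root):
    """Count files under root by lower-cased extension; '(none)' means no extension."""
    return Counter(
        p.suffix.lower() or "(none)"
        for p in Path(root).rglob("*")
        if p.is_file()
    )
```

Running both functions on each storage location gives a quick picture of where files live and which formats dominate, which is a useful starting point for choosing a folder structure and naming convention.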
45. Check yourself!
1. Can you easily locate & understand the raw data?
2. Can you connect different types of related data you collected?
3. If a file gets misplaced, can you put it back in the correct folder?
4. Are your naming conventions consistent with others on your team?
5. If another researcher were to ask you for a copy of your data 5 years
after the close of your project, would you be able to easily find it and
send it to them? Could all the members of your research team find it?
6. If a researcher were to receive a copy of your data, would they be able
to use it without asking you too many questions?
47. Resources
● U-M Library data-related guides:
○ Social Sciences
○ Engineering
○ Health Sciences
○ Research Data (General)
● U-M Library Data Services
● Reach out with questions or
request a consultation:
○ U-M Library Research Data Services:
researchdataservices@umich.edu
○ Taubman Health Sciences Library
Data Team:
THLResearchDataCore@umich.edu
49. Thanks!
Questions?
[Presenter name, email]
Parts of this presentation were adapted from presentations by:
● NYU Langone Health RDM Teaching Toolkit slides by
Kevin Read & Alisa Surkis.
● Marisa Conte, from when she was a THSL informationist.
● DataONE Education Module: Data Management.
Editor's Notes
Have you ever lost some data or notes because of lack of organization?
Yes
No
Maybe?