A basic course on Research data management, part 4: caring for your data, or making data reusable

PROOF course Information Literacy and
Research Data Management
TU/e, 24-01-2017
l.osinski@tue.nl, TU/e IEC/Library
Available under CC BY-SA license, which permits copying
and redistributing the material in any medium or format &
adapting the material for any purpose, provided the original
author and source are credited & you distribute the
adapted material under the same license as the original

Research data management: what was it again?
 Sharing your data, or making your data findable and accessible with good data practices
+ protecting your data: back up, access control; file naming, organizing data, versioning
+ sharing your data via collaboration platforms and archives
→ Caring for your data, or making your data reusable and interoperable with good data practices
+ metadata, tidy data, licenses
Before data can be reusable, it has first to be usable

What is the nature of the “unusual episode” to which this table refers?

Raw data:
https://www.amstat.org/publications/jse/datasets/titanic.dat.txt
Documentation accompanying the data:
https://www.amstat.org/publications/jse/datasets/titanic.txt
 Size (number of observations and variables)
 Description
 Provenance
 Variable descriptions
Based on:
The "Unusual Episode" Data Revisited / by Robert J. MacG. Dawson, in: Journal of Statistics Education vol. 3 (1995), issue 3

1. Morphological Measurements of Galapagos Finches
http://dx.doi.org/10.5061/dryad.152
 Use of standard names (taxonomy, species)
 Variable names clear enough? WingL must be wing length but what is N.Ubkl?
 Units of measurement?
Based on:
Looking after datasets / by Antony Unwin, 01-09-2015, http://blog.revolutionanalytics.com/2015/09/looking-after-datasets.html
2. Collaborative FAIR data sharing / by Henry Rzepa

The welfare consequences… / by Jonathan J. Cooper et al.,
http://dx.doi.org/10.1371/journal.pone.0102722
 Word.doc
These data are findable and accessible – but usable?

Lessons learned
table structure [ tidy data ]
To allow your data to be easily:
 imported by data management systems;
 analyzed by analysis software; and
 combined with other data (interoperability)
make sure that:
 each row represents a single observation (record)
 each column represents a single variable (parameter) or type of measurement
(field)
 every cell contains only one piece of information (no highlighting of cells)
 there is only one table for each type of information (no multiple worksheets)
Cross-tab structure (contingency table): different columns contain measurements of
the same variable. This is easier to read, but makes it difficult to add data (columns)
to the records (rows). Compare the Titanic table with the Titanic raw data.
“The problem is that people like to view data in a totally different way than a computer
likes to process it.” (Kien Leong)
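
As a minimal sketch of the difference between the two layouts, the Python snippet below (standard library only) reshapes a small cross-tab into tidy rows, one row per observation. The table contents and the names `class`, `survived_yes`, `survived_no`, `outcome` and `count` are invented for illustration; they are not the real Titanic figures.

```python
import csv
import io

# Hypothetical cross-tab: one row per class, one column per outcome.
# The counts are invented for illustration only.
CROSS_TAB = """class,survived_yes,survived_no
first,3,2
second,1,4
"""

def cross_tab_to_tidy(text, id_column, variable_name, value_name):
    """Turn each (row, outcome column) cell into one tidy record."""
    reader = csv.DictReader(io.StringIO(text))
    tidy_rows = []
    for row in reader:
        for column, cell in row.items():
            if column == id_column:
                continue
            tidy_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: int(cell),
            })
    return tidy_rows

tidy = cross_tab_to_tidy(CROSS_TAB, "class", "outcome", "count")
for record in tidy:
    print(record)
```

In the tidy form, adding a new variable means adding one column, and adding a new observation means adding one row; the cross-tab needs a new column for every new value of the outcome.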

Lessons learned
table metadata: variables (columns) and observations/records (rows)
 include a row at the top of each table that contains full column (variable)
names (no hard to understand abbreviations)
 columns: use clear, descriptive variable names, avoid special characters (can
cause problems with some software)
 rows: if possible, use standard names within cells (derived from a taxonomy
for example, standard species name, standard date formats, …)
 try to avoid coding categorical or ordinal data as numbers
 missing data / null values: best option: use a blank
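
The points above can be checked with a short script. The sketch below assumes one possible naming rule (lowercase letters, digits and underscores), reads blank cells as missing values, and expects dates in the standard ISO 8601 format. The sample data, the column names and the rule itself are assumptions made for illustration.

```python
import csv
import io
import re
from datetime import date

# Hypothetical sample; names and values are invented for illustration.
SAMPLE = """species,wing_length_mm,measured_on
Geospiza fortis,67.5,2016-03-01
Geospiza fortis,,2016-03-02
"""

# Assumed naming rule: lowercase letters, digits and underscores only.
SAFE_NAME = re.compile(r"^[a-z][a-z0-9_]*$")

def problem_columns(header):
    """Return the column names that break the assumed naming rule."""
    return [name for name in header if not SAFE_NAME.match(name)]

def read_records(text):
    """Read rows, mapping blank cells to None and parsing ISO 8601 dates."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        cleaned = {key: (value if value != "" else None) for key, value in row.items()}
        cleaned["measured_on"] = date.fromisoformat(cleaned["measured_on"])
        records.append(cleaned)
    return records

records = read_records(SAMPLE)
print(problem_columns(["species", "WingL", "N.Ubkl"]))
print(records[1]["wing_length_mm"])
```

A name such as `wing_length_mm` also carries the unit of measurement, which answers the "units of measurement?" question directly in the table.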

Lessons learned
data set metadata (documentation), discovery metadata, licenses
 size of the data set: number of observations and variables
 explanation of the variables, how each was measured and its measurement
units (code book)
 provenance (origin) of the data, how you collected the data, data
manipulation steps (study design)
 description of the data set: what’s included and excluded, known problems
or inconsistencies in the data, why data are missing
 add license-information: what are others allowed to do with your data?
a simple readme file can be enough (see documentation Titanic dataset) but
not always
“Research outputs that are poorly documented are like canned goods with the label
removed (…)” (Carly Strasser)
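
Part of such a readme can be generated from the data itself. The following sketch counts observations and variables and combines them with a hand-written code book, provenance note and license. The data, the descriptions and the function name `build_readme` are hypothetical; the layout of the readme is just one possible choice.

```python
import csv
import io

# Hypothetical data and descriptions, invented for illustration.
DATA = """species,wing_length_mm
Geospiza fortis,67.5
Geospiza fortis,70.1
"""
CODE_BOOK = {
    "species": "Standard species name",
    "wing_length_mm": "Wing length, in millimetres",
}

def build_readme(text, code_book, provenance, license_name):
    """Assemble a plain-text readme from the data and its descriptions."""
    rows = list(csv.reader(io.StringIO(text)))
    header, observations = rows[0], rows[1:]
    lines = [
        f"Size: {len(observations)} observations, {len(header)} variables",
        f"Provenance: {provenance}",
        f"License: {license_name}",
        "Variables:",
    ]
    for name in header:
        # A KeyError here means a variable lacks a description.
        lines.append(f"  {name}: {code_book[name]}")
    return "\n".join(lines)

readme = build_readme(DATA, CODE_BOOK, "Field measurements (hypothetical)", "CC BY-SA")
print(readme)
```

Because the code book is looked up per column, the script fails as soon as a variable is added to the data without an explanation.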

Lessons learned
long term availability
 if possible, use a non-proprietary (open) file format, such as csv for tabular data;
open formats are easier to use in a variety of software
 if possible, take the preferred formats of a data archive into account.
See for example 4TU.ResearchData overview of file formats and types of support:
http://researchdata.4tu.nl/en/publishing-research/data-description-and-formats/

Tools
for working with messy data
Excel vs scripting based software tools
 Excel: processing data through a graphical user interface is bad for data provenance
and documentation, because it doesn’t leave a record of the steps taken
 use a scripted language (R (free), Matlab, SAS…) to process data, run the analysis
and to produce final outputs
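
To illustrate why a script leaves a record, here is a minimal sketch in Python (the slides name R, Matlab and SAS; Python is used here only as an example). Every processing step is written down in the code and logged, and the raw data are never changed. The data, the column name `wing_length_mm` and the function name `process` are hypothetical.

```python
import csv
import io

# Hypothetical raw data, invented for illustration.
RAW = """species,wing_length_mm
Geospiza fortis,67.5
Geospiza fortis,
Geospiza fortis,70.1
"""

def process(raw_text):
    """Apply the cleaning steps in order and log each one."""
    log = []
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    log.append(f"read {len(rows)} rows")

    # Step 1: drop rows without a measurement.
    kept = [row for row in rows if row["wing_length_mm"] != ""]
    log.append(f"dropped {len(rows) - len(kept)} rows with missing wing_length_mm")

    # Step 2: compute the mean of the remaining measurements.
    values = [float(row["wing_length_mm"]) for row in kept]
    mean = sum(values) / len(values)
    log.append(f"computed mean over {len(values)} rows")
    return mean, log

mean_wing_length, steps = process(RAW)
print(round(mean_wing_length, 1))
for step in steps:
    print(step)
```

Rerunning the script on the same raw data reproduces the result, and the script itself documents what was done.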
OpenRefine
 runs on your computer (not in the cloud), inside the Firefox browser (not in IE), no
web connection is needed
 working with OpenRefine: http://www.datacarpentry.org/OpenRefine-ecology/01-working-with-openrefine.html
 captures all steps done to your raw data; original dataset is not modified; steps are
easily reversed
Tabula
 “… tool for liberating data tables locked inside PDF files.”
A reproducible workflow (bartomeuslab)

Support
Data Coach [ website ]
TU/e data librarians (datacoach@tue.nl)
Leon Osinski, Sjef Öllers
Recommended reading
Van den Eynden, Veerle et al. (2011), Managing and sharing data: best practice for researchers, UK Data Archive
Strasser, Carly (2015), Research data management, NISO
Recommended online course
Essentials 4 data support [English & Dutch]

URLs of mentioned webpages in order of appearance #1
1. Overview research data storage services: http://dataservices.silk.co/
2. Raw Titanic data: https://www.amstat.org/publications/jse/datasets/titanic.dat.txt
3. Documentation to Titanic data: https://www.amstat.org/publications/jse/datasets/titanic.txt
4. The “Unusual Episode” Data Revisited: https://www.amstat.org/publications/jse/v3n3/datasets.dawson.html
5. Morphological Measurements of Galapagos Finches: http://dx.doi.org/10.5061/dryad.152
6. Looking after datasets: http://blog.revolutionanalytics.com/2015/09/looking-after-datasets.html
7. Collaborative FAIR data sharing: http://www.ch.imperial.ac.uk/rzepa/blog/?p=16292
8. The welfare consequences… : http://dx.doi.org/10.1371/journal.pone.0102722
9. Tidy data: http://vita.had.co.nz/papers/tidy-data.pdf
10. Data guide example: http://dx.doi.org/10.17026/dans-xk5-y7vc
11. Preferred data formats of 4TU.ResearchData: http://researchdata.4tu.nl/en/publishing-research/data-description-and-formats/
12. Excel: http://production-scheduling.com/seven-deadly-spreadsheet-sins/
13. R: https://www.r-project.org/

URLs of mentioned webpages in order of appearance #2
14. Bartomeuslab, A reproducible workflow: https://youtu.be/s3JldKoA0zw
15. OpenRefine: http://openrefine.org/
16. Working with OpenRefine: http://www.datacarpentry.org/OpenRefine-ecology/01-working-with-openrefine.html
17. TU/e Data Coach: http://www.tue.nl/datacoach
18. Carly Strasser, Research data management: http://www.niso.org/apps/group_public/download.php/15375/PrimerRDM-2015-0727.pdf
19. Online course ‘Essentials for data support’: http://datasupport.researchdata.nl/en/
