A basic course on Research data management: part 1 - part 4

A basic course on Research data management
part 1: what and why
PROOF course Information Literacy and
Research Data Management
TU/e, 19-09-2017
l.osinski@tue.nl, TU/e IEC/Library
Available under CC BY-SA license, which permits copying
and redistributing the material in any medium or format &
adapting the material for any purpose, provided the original
author and source are credited & you distribute the
adapted material under the same license as the original

Research data management [RDM]
what #1
Essence of RDM: “… tracking back to what you did 7 years ago and recovering it (...)
immediately in a re-usable manner.” (Henry Rzepa)

Research data management [RDM]
what #2
RDM: caring for your data with the purpose to:
1. protect their mere existence: data loss, data authenticity (RDM basics)
2. share them with others
a. for reasons of reuse: in the same context or in a different context; during
research and after research
b. for reasons of reproducibility checks → scientific integrity; data quality
RDM = good data practices [1-6] that make your data understandable, easy
to work with, and available to other scientists
1. Dynamic ecology (2016), Ten commandments for good data management.
https://dynamicecology.wordpress.com/2016/08/22/ten-commandments-for-good-data-management/
2. Borer, E.T., Seabloom, E.W., Jones, M.B., et al. (2009) Some simple guidelines for effective data management,
Bulletin of the Ecological Society of America, 90(2), p. 205-214. doi: 10.1890/0012-9623-90.2.205
3. Hook, L.A., Santhana Vannan, S.K., Beaty, T.W. et al. Best practices for preparing environmental data sets to share
and archive. Available online: http://daac.ornl.gov/PI/BestPractices-2010.pdf . doi: 10.3334/ORNLDAAC/BestPractices-2010
4. White, E.P., Baldridge, E., Brym, T. et al. (2013) Nine simple ways to make it easier to (re)use your data,
Ideas in Ecology and Evolution, 6(2), p. 1-10. doi: 10.4033/iee.2013.6b.6.f
5. Goodman, A., Pepe, A., Blocker, A.W., et al. (2014) Ten simple rules for the care and feeding of scientific data,
PLOS Computational Biology, 10(4), e1003542. doi: 10.1371/journal.pcbi.1003542
6. Sandve, G.K., et al. (2013) Ten simple rules for reproducible computational research,
PLOS Computational Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285

Source: Research Data Netherlands / Marina Noordegraaf
Outline
1. Research data management [RDM]: what and why
a. data management plan
b. discussion
2. Sharing your data, or making your data findable and accessible
a. data protection: back up, file naming, organizing data
b. data sharing: via collaboration platforms, data archives
3. Caring for your data, or making your data usable and interoperable
a. tidy data
b. metadata/documentation
c. licenses
d. open data formats

Why share research data? #1
• Because you work together with other researchers → collaborative science
• Because of re-using results: data-driven science → open science
• Because of scientific integrity: validating data analysis by reproducibility checks
requires the data and the code that is used to clean, process and analyze the data and
to produce the final outputs
Additional reasons
• Because your data are unique / not easily repeatable
(long-term observational data)
• Because you benefit from it: it increases your visibility and
enhances the trustworthiness / credibility of your
research

Why share research data? #2
because you have to…
• Data sharing is increasingly required by:
+ Journals [here, here, here, here]
+ Professional organizations [VSNU, KNAW]
+ Universities, including TU/e
+ Research funders [NWO, ZonMW, EC]
data management plan

EC: Horizon 2020 #1
Open research data (ORD) pilot: why?
• “The ORD pilot aims to improve and maximise access to and re-use of
research data generated by Horizon 2020…”
• “The ORD pilot applies primarily to the data needed to validate the results
presented in scientific publications. Other data can also be provided…”
• “A data management plan (DMP) is required for all projects participating in
the extended ORD pilot…”
“Participating in the ORD pilot does not necessarily mean opening up all your
research data. Rather, the ORD Pilot follows the principle “as open as possible,
as closed as necessary” and focuses on encouraging sound data management
as an essential part of research best practice.” (my underlining)

EC: Horizon 2020 #2
how? sound research data management
Sound research data management is data management following
the FAIR principles. All research data should be:
Findable: easy to find by both humans and computer systems;
Accessible: stored for long term with well-defined license and access
conditions (open access when possible);
Interoperable: ready to be combined with other datasets by humans as well as
computer systems;
Reusable: ready to be used for future research and to be processed further
using computational methods.

Source: Research Data Netherlands / Marina Noordegraaf
EC: Horizon 2020 #3
requirements
The conditions set by Horizon 2020 with regard to research data
management come down to two requirements:
1. Formulate a data management plan, and
2. Deposit research data in a data repository

EC: Horizon 2020 #4
data management plan
The DMP is a set of questions along the FAIR principles about:
1. What research data sets the project will collect, process and/or generate
2. The handling of these data sets during and after the project
3. Whether and how data sets will be findable/discoverable, re-usable and
shared/made open access
4. How data will be curated and preserved
5. What measures are taken to safeguard and protect (sensitive) data
• DMP template Horizon 2020 (via DMPonline): recommended but voluntary
• ZonMw template (via DMPonline)
• DMP template by 4TU.Centre for Research Data
• Examples of H2020 DMPs:
http://www.dcc.ac.uk/resources/data-management-plans/guidance-examples

Research data management
discussion topics and questions
Storage and back-up
• What sort of data do you use? Are you creating new data or are you working with pre-existing
data?
• Where do you store your research data? Is there a back-up? Where?
• Are data selections made? Not everything is to be stored but…?
Metadata and documentation (information to let you find, use and understand the data)
• Do you describe your research data? Who measured or collected what, when, how? Other
context information?
• Are you content with the way you document or describe your research data? Do you succeed
in finding the right (version of your) research data?
• Can other researchers understand and (re-)use your research data (during and after
research)? Should they be able to?
Access and re-use
• Who can access your research data?
• What will happen to your research data when you leave TU/e?
• Would you consider publishing your research data, i.e. making them publicly available?

Which of these statements are true?
Storage and back-up
1. My research data are stored safely and securely, including regular back-ups
Metadata and documentation
2. I keep metadata with my data: who measured/collected what, when, how
Access and re-use
3. My colleagues are able to access and use my data
4. Other researchers are able to access and use my data
5. My nearest colleagues and I are the only ones who can understand my
data
6. Anyone should be able to use my data when I have finished with it

Reasons not to share your data
• Preparing my data for sharing takes time and effort
But research data management also increases your research efficiency
• My data are confidential
But you can anonymize or pseudonymize your data
• My data still need to yield publications
But you can publish your data under an embargo, and by publishing your data you
establish priority and can get credit for it
• My data can be misused or misinterpreted
But the best defense against malicious use is to refer to an archival copy of your
data, which is guaranteed to be exactly as you mean it to be
• My data are only interesting to me
But sharing your data may be required by a funder or journal, or your data may be
requested to validate your results

URLs of mentioned webpages
in order of appearance #1
1. Website IEC/Library [TU/e]: https://www.tue.nl/en/university/library/
2. Figshare support, The importance of data management for research: https://youtu.be/Ae205CNrk6w
3. Henry Rzepa, Collaborative FAIR data sharing: http://www.ch.imperial.ac.uk/rzepa/blog/?p=16292
4. Dynamic ecology (2016), Ten commandments for good data management.
https://dynamicecology.wordpress.com/2016/08/22/ten-commandments-for-good-data-management/
5. Borer, E.T., Seabloom, E.W., Jones, M.B., et al. (2009) Some simple guidelines for effective data
management, Bulletin of the Ecological Society of America, 90(2), p. 205-214. doi: 10.1890/0012-9623-90.2.205
6. Hook, L.A., Santhana Vannan, S.K., Beaty, T.W. et al. Best practices for preparing environmental data sets
to share and archive. doi: 10.3334/ORNLDAAC/BestPractices-2010
7. White, E.P., Baldridge, E., Brym, T. et al. (2013) Nine simple ways to make it easier to (re)use your data,
Ideas in Ecology and Evolution, 6(2), p. 1-10. doi: 10.4033/iee.2013.6b.6.f
8. Goodman, A., Pepe, A., Blocker, A.W., et al. (2014) Ten simple rules for the care and feeding of scientific
data, PLOS Computational Biology, 10(4), e1003542. doi: 10.1371/journal.pcbi.1003542
9. Sandve, G.K., et al. (2013) Ten simple rules for reproducible computational research, PLOS Computational
Biology, 9(10), e1003285. doi: 10.1371/journal.pcbi.1003285
10. Data sharing increases visibility: http://dx.doi.org/10.7717/peerj.175
11. Data sharing enhances trustworthiness: http://dx.doi.org/10.1371/journal.pone.0026828

12. Data availability policy journals: http://www.nap.edu/openbook.php?record_id=10613&page=33
13. Data availability policy American Economic Review: https://www.aeaweb.org/aer/data.php
14. Data availability policy PLoS: http://journals.plos.org/plosone/s/data-availability
15. Data availability policy Nature: http://www.nature.com/authors/policies/availability.html
16. VSNU Code of Scientific Conduct (Dutch, revision 2014):
http://www.vsnu.nl/files/documenten/Domeinen/Onderzoek/Code_wetenschapsbeoefening_2004_(2014).pdf
17. KNAW responsible research data management:
https://www.knaw.nl/en/news/publications/responsible-research-data-management-and-the-prevention-of-scientific-misconduct?set_language=en
18. Radboud University research data policy:
http://www.ru.nl/research-information-services/institutional-policy/policy-research-data-management/
19. TU/e Code of Scientific Conduct:
http://www.tue.nl/en/university/about-the-university/integrity/scientific-integrity/
20. NWO and research data: http://www.nwo.nl/en/policies/open+science/data+management
21. ZonMW Access to data: https://www.zonmw.nl/en/research-and-results/access-to-data/
22. Horizon 2020 Guidelines on data management:
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf

23. About FAIR: Mons, B. et al., Cloudy, increasingly FAIR: revisiting the FAIR Data guiding principles for the
European Open Science Cloud: http://dx.doi.org/10.3233/ISU-170824
24. Template data management plan Horizon 2020: https://dmponline.dcc.ac.uk/
25. ZonMW data management plan template:
https://www.zonmw.nl/en/research-and-results/access-to-data/format-data-management-plan/
26. Data management plan template (4TU.ResearchData):
http://researchdata.4tu.nl/en/planning-research/data-management-plan/
27. Examples of Horizon 2020 data management plans:
http://www.dcc.ac.uk/resources/data-management-plans/guidance-examples
28. Emilio M. Bruna (04-09-2014), The opportunity cost of my #OpenScience was 36 hours + $690 (UPDATED).
http://brunalab.org/blog/2014/09/04/the-opportunity-cost-of-my-openscience-was-35-hours-690/
29. Rouder, Jeffrey N., The what, why, and how of born-open data, Behavior Research Methods, vol. 48(2016),
p. 1062-1069. http://dx.doi.org/10.3758/s13428-015-0630-z (see p. 1063: “It was a pain to document the
data; it was a pain to format the data”)

part 2: protecting and organizing
your data
TU/e, 07-03-2017

what was it again
• Sharing your data, or making your data findable and accessible
with good data practices
→ protecting your data: back up, access control; file naming, organizing
data, versioning
+ sharing your data via collaboration platforms and archives
• Caring for your data, or making your data usable and
interoperable with good data practices
+ tidy data
+ metadata/documentation
+ licenses
+ open data formats

Protecting your data
good data practices during your research
Be safe
+ storage, backup → data safety, protecting against loss: use local
ICT infrastructure (departmental servers, including SURFdrive) as
much as possible
+ access control → data security, protecting against unauthorized
use: with DataverseNL for example
Be organized, or: you (and others) should be able to tell what’s in
a file without opening it
+ file naming, organizing data in folders, versioning
“…we can copy everything and do not manage it well.” (Indra Sihar)

File naming #1
be consistent and aim for concise but informative names
How you organize and name your files has a big impact on your
ability to find those files later and to understand what they contain.
Good file names are consistent (use file-naming conventions), unique
(distinguish a file from files with similar subjects as well as from different
versions of the file) and meaningful (use descriptive names).
File-naming conventions help you find your data, help others to find
your data and help track which version of a file is most current.
• Avoid using special characters in a file name: / : * ? < > | [ ] & $
• Use hyphens or underscores instead of periods or spaces to
separate logical elements in a file name
• Avoid very long names: 25 characters is usually sufficient
• Names should include all necessary descriptive information:
initials of the researcher, project number, procedure/method…
• Names should be independent of where the file is stored (do not use the same
name in different folders)
• Include a date (format YYYYMMDD) and a version number in file names
• Add a readme.txt to each folder that explains the file naming and its
meaning
Source: Best practices for file naming (Stanford University Libraries)
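A minimal sketch of how such a convention could be applied in a script. The function name `build_filename`, the element order (date, description, initials, version) and the example values are assumptions for illustration, not part of the Stanford guidance; the 25-character advice is not enforced here.

```python
import re
from datetime import date

# Characters the slide advises against, plus spaces and periods, which should
# not be used to separate the logical elements of a file name.
FORBIDDEN = re.compile(r"[/:*?<>|\[\]&$\s.]")


def build_filename(when, description, initials, version, extension):
    """Return a name of the form YYYYMMDD_description_initials_vNN.ext."""
    parts = [when.strftime("%Y%m%d"), description, initials, f"v{version:02d}"]
    for part in parts:
        if FORBIDDEN.search(part):
            raise ValueError(f"avoid special characters, spaces and periods: {part!r}")
    # Underscores separate the elements; hyphens may be used inside an element.
    return "_".join(parts) + "." + extension


# Hypothetical example values.
name = build_filename(date(2017, 3, 7), "interview-transcript", "THD", 2, "docx")
print(name)  # 20170307_interview-transcript_THD_v02.docx
```

Generating names from one function keeps them consistent across a project; the convention itself still belongs in the readme.txt of each folder.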

File naming #2
think about the ordering of elements within a filename
• Order by date:
2012-12-15_interview-recording_MBD.mp3
2012-12-15_interview-transcript_MBD.docx
2013-04-12_interview-recording_THD.mp3
2013-04-12_interview-transcript_THD.docx
• Order by subject:
MBD_interview-recording_2012-12-15.mp3
MBD_interview-transcript_2012-12-15.docx
THD_interview-recording_2013-04-12.mp3
THD_interview-transcript_2013-04-12.docx
• Order by type:
Interview-recording_MBD_2012-12-15.mp3
Interview-recording_THD_2013-04-12.mp3
Interview-transcript_MBD_2012-12-15.docx
Interview-transcript_THD_2013-04-12.docx
• Forced order with numbering:
01_THD_interview-recording_2013-04-12.mp3
02_THD_interview-transcript_2013-04-12.docx
03_MBD_interview-recording_2012-12-15.mp3
04_MBD_interview-transcript_2012-12-15.docx
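The effect of element order can be checked with a plain alphabetical sort, which is roughly what a file browser applies. This is only an illustrative sketch using the file names from the slide; the variable names are made up.

```python
# The four example files, split into their logical elements.
files = [
    ("2013-04-12", "interview-recording", "THD", "mp3"),
    ("2013-04-12", "interview-transcript", "THD", "docx"),
    ("2012-12-15", "interview-recording", "MBD", "mp3"),
    ("2012-12-15", "interview-transcript", "MBD", "docx"),
]

# Date first: an alphabetical sort is also a chronological sort,
# because the date is written year-month-day.
by_date = sorted(f"{d}_{kind}_{who}.{ext}" for d, kind, who, ext in files)

# Subject first: files of the same interviewee end up next to each other.
by_subject = sorted(f"{who}_{kind}_{d}.{ext}" for d, kind, who, ext in files)

for name in by_date + by_subject:
    print(name)
```

The element you put first decides how the files group, so pick the one you search on most often.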

File organization
Beatriz Ramirez, Data management plan for the PhD project:
development and application of a monitoring system to assess the
impacts of climate and land cover changes on eco-hydrological
processes in an eastern Andes catchment area
Source: Haselager, dr. G.J.T. (Radboud University Nijmegen); Aken, prof. dr. M.A.G. van
(Utrecht University) (2000): Personality and Family Relationships. DANS.
http://dx.doi.org/10.17026/dans-xk5-y7vc

Organizing your data in folders #1
based on the TIER documentation protocol (http://www.projecttier.org/)
Guiding principles of TIER documentation protocol
1. keep your raw or original data raw
+ save your raw data read-only in its original format in a separate folder
+ make a working copy of your raw data (input data, used for
processing and analysis)
2. keep the command files (files containing code written in the syntax of the
(statistical) software you use for the study) apart from the data
3. keep the analysis files (the fully cleaned and processed data files that you
use to generate the results reported in your paper) in a separate folder
4. store the metadata (codebook, description of variables, etc.) in a separate
folder, apart from the data itself

based on the TIER documentation protocol (http://www.projecttier.org/)
1. Main project folder (name of your research project/working title of your
paper)
1.1. Original data and metadata
1.1.1. Original data
1.1.2. Metadata
1.1.2.1. Supplements
1.2. Processing and analysis files
1.2.1. Importable data files
1.2.2. Command files
1.2.3. Analysis files
1.3. Documents
1.4. Literature
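The hierarchy above can be created with a few lines of code, so that every project starts from the same layout. A minimal sketch: the lower-case, hyphenated folder names and the project name "my-project" are assumptions, not prescribed by the TIER protocol.

```python
import tempfile
from pathlib import Path

# Folder names follow the TIER-based hierarchy listed above (spelling is our own).
FOLDERS = [
    "original-data-and-metadata/original-data",
    "original-data-and-metadata/metadata/supplements",
    "processing-and-analysis-files/importable-data-files",
    "processing-and-analysis-files/command-files",
    "processing-and-analysis-files/analysis-files",
    "documents",
    "literature",
]


def create_project(root):
    """Create the folder hierarchy under root and return the created paths."""
    root = Path(root)
    created = []
    for folder in FOLDERS:
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)  # safe to run more than once
        created.append(path)
    return created


# Demonstration in a temporary directory; "my-project" is a placeholder name.
with tempfile.TemporaryDirectory() as tmp:
    paths = create_project(Path(tmp) / "my-project")
    relative = sorted(p.relative_to(tmp).as_posix() for p in paths)
    print("\n".join(relative))
```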

1. Main project folder (name of your research project/working title of your
paper)
1.1.1. Original data (raw data, obtained/gathered data)
Any data that were necessary for any part of the processing and/or
analysis you reported in your paper.
Copies of all your original data files, saved in exactly the format they were in
when you first obtained them. The name of an original data file may be
changed.
Keep these data read only!
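One way to keep the original data read only is to remove write permission right after copying a file into the original data folder. A sketch under assumptions: the function name `store_raw_copy` and the file `survey.csv` are hypothetical, and file permissions behave differently per operating system, so treat this as an extra safeguard rather than a guarantee.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def store_raw_copy(source, raw_folder):
    """Copy source into raw_folder and remove write permission from the copy."""
    raw_folder = Path(raw_folder)
    raw_folder.mkdir(parents=True, exist_ok=True)
    target = raw_folder / Path(source).name
    shutil.copy2(source, target)  # copy2 also keeps the original timestamps
    # Read permission only, for owner, group and others.
    os.chmod(target, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target


with tempfile.TemporaryDirectory() as tmp:
    download = Path(tmp) / "survey.csv"  # hypothetical obtained data file
    download.write_text("id,score\n1,7\n")
    raw = store_raw_copy(download, Path(tmp) / "original-data")
    writable = bool(raw.stat().st_mode & stat.S_IWUSR)
    print(raw.name, "writable:", writable)
```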
1.1.2. Metadata
based on the TIER documentation protocol

1. Main project folder (name of your research project/working title of your paper)
1.1.2. Metadata
The Metadata Guide: a document that provides information about each of your
original data files. Applies especially to obtained data files
• A bibliographic citation of the original data files, including the date you
downloaded or obtained the original data files and unique identifiers that
have been assigned to the original data files.
• Information about how to obtain a copy of the original data file
• Any additional information needed to understand and use the data in the
original data file
Additional information about an original data file that is not written by
yourself but is found in existing supplementary documents, such as
users’ guides and code books that accompany the original data file

1.2.1. Importable data files (the data you work with, input data, suitable for
processing and analysis)
A corresponding version of each of the original data files. This version can be identical
to the original version, or in some cases it will be a modified version,
for example with modifications required to allow your software to read the file (converting
the file to another format, removing unusable data or explanatory notes from a table)
• The original and importable versions of a data file should be given different names
• The importable data file should be as nearly identical as possible to the original
• The changes you make to your original data files to create the corresponding
importable data files should be described in a Readme file

1.2.2. Command files
One or more files containing code written in the syntax of the (statistical) software you use
for the study
• Importing phase: commands to import or read the files and save them in a format that
suits your software
• Processing phase: commands that execute all the processing required to transform the
importable version of your files into the final data files that you will use in your analysis
(i.e. cleaning, recoding, joining two or more data files, dropping variables or cases,
generating new variables)
• Generating the results: commands that open the analysis data file(s), and then
generate the results reported in your paper.
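The three phases can be kept visibly separate inside a command file. The sketch below uses Python instead of statistical software, and the inline data, the function names and the NA value are hypothetical; a real command file would read from the importable data folder and write to the analysis folder.

```python
import csv
import io
import statistics

# Hypothetical importable data, inlined so that the sketch runs on its own.
IMPORTABLE = """patient_id,drug,heart_rate
1,a,67
2,a,80
3,a,NA
1,b,56
2,b,90
3,b,50
"""


def import_data(text):
    """Importing phase: read the importable file into a list of records."""
    return list(csv.DictReader(io.StringIO(text)))


def process(records):
    """Processing phase: drop cases with missing values and convert types."""
    cleaned = []
    for row in records:
        if row["heart_rate"] == "NA":
            continue
        cleaned.append({**row, "heart_rate": int(row["heart_rate"])})
    return cleaned


def results(analysis_data):
    """Generating the results: mean heart rate per drug."""
    groups = {}
    for row in analysis_data:
        groups.setdefault(row["drug"], []).append(row["heart_rate"])
    return {drug: statistics.mean(values) for drug, values in groups.items()}


analysis_data = process(import_data(IMPORTABLE))
summary = results(analysis_data)
print(summary)
```

Because every step from importable data to reported result is code, someone else can rerun it and check the outcome.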

1.2.3. Analysis files
• The fully cleaned and processed data files that you use to generate the
results reported in your paper
• The Data Appendix: a codebook for your analysis data files, with a brief description
of the analysis data file(s), a complete definition of each variable (including
coding and/or units of measurement), the name of the original data file
from which the variable was extracted, the number of valid observations for
the variable, and the number of cases with missing values
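The counts for the Data Appendix (valid observations and missing values per variable) can be produced by a script instead of by hand. A minimal sketch, assuming missing values are coded as NA; the data and the function name `count_observations` are hypothetical.

```python
import csv
import io

# Hypothetical analysis data; "NA" marks a missing value.
ANALYSIS_DATA = """patient_id,drug,heart_rate
1,a,67
2,a,NA
3,b,50
4,b,NA
"""


def count_observations(text, missing="NA"):
    """Return {variable: (valid, missing)} for every column in the data."""
    rows = list(csv.DictReader(io.StringIO(text)))
    counts = {}
    for variable in rows[0]:
        n_missing = sum(1 for row in rows if row[variable] == missing)
        counts[variable] = (len(rows) - n_missing, n_missing)
    return counts


counts = count_observations(ANALYSIS_DATA)
for variable, (valid, n_missing) in counts.items():
    print(f"{variable}: {valid} valid, {n_missing} missing")
```

The variable definitions, coding and units still have to be written by you; only the counting is automated here.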

1.3. Documents
• An electronic copy of your complete final paper
• The Readme file for your replication documentation, which:
+ states what statistical software or other computer programs are needed to run the
command files
+ explains the structure of the hierarchy of folders in which the documentation is
stored
+ describes precisely any changes you made to your original data files to create
the corresponding importable data files
+ gives step-by-step instructions for using your documentation to replicate the
statistical results reported in your paper
1.4. Literature
• Retrieved relevant literature

URLs of mentioned webpages
in order of appearance
1. Storage, back up of data: http://www.data-archive.ac.uk/create-manage/storage
2. Local ICT infrastructure:
https://intranet.tue.nl/en/university/services/ict-services/ict-service-catalog/management-services/data-management-storage/ (TU/e intranet)
3. SURFdrive (at TU/e):
https://intranet.tue.nl/en/university/services/ict-services/ict-service-catalog/management-services/data-management-surfdrive
4. DataverseNL: https://dataverse.nl/dvn/
5. Version control: http://www.data-archive.ac.uk/create-manage/format/versions
6. Best practices for file naming:
http://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming
7. File organization: Haselager, dr. G.J.T., Aken, prof. dr. M.A.G. van (2000): Personality and Family
Relationships. DANS. http://dx.doi.org/10.17026/dans-xk5-y7vc (Data guide, p. 24-26)
8. Best practices: file names and folder structures (Leiden example):
http://blogs.library.leiden.edu/researchdata/2016/06/03/best-practices-file-names-and-folder-structures/#more-284
9. Beatriz Ramirez, Data management plan for the PhD project: development and application of a monitoring
system to assess the impacts of climate and land cover changes on eco-hydrological processes in an
eastern Andes catchment area:
http://www.wageningenur.nl/web/file?uuid=3f974938-79a0-421f-b1ad-95eef49d777c&owner=c057b578-4a6a-4449-881b-17fff17e2f1a (see Figure 1 for folder structure)
10. TIER documentation protocol: http://www.projecttier.org/

part 3: sharing your data
TU/e, 07-03-2017

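what was it again
• Sharing your data, or making your data findable and accessible
with good data practices
+ protecting your data: back up, access control; file naming, organizing
data, versioning
→ sharing your data via collaboration platforms and archives
• Caring for your data, or making your data usable and
interoperable with good data practices
+ tidy data
+ metadata/documentation
+ licenses
+ open data formats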

Overview research data sharing and storage services
[Figure: overview of data sharing and storage services, during research and after research,
at institutional and disciplinary level, including local ICT services]
Data sharing per se is pretty straightforward

Sharing your data
collaboration or sharing platforms (during your research)
General data sharing platforms:
• SURFdrive [TU/e only]: Dutch academic Dropbox, 100 GB, maximum data transfer 16 GB;
every TU/e employee can use SURFdrive
• Google Drive, Dropbox, Beehub…
DataverseNL [TU/e only]: data sharing platform for active research data [based on Harvard’s
Dataverse Project] where you may:
• store your data in an organized and safe way
• clearly describe your data
• keep versions of your data
• arrange access to your data
• get recognition for your data
• [collaborate on your data]
Various disciplinary initiatives: Open Science Framework, OpenML, RodRep, CRCNS…
SURF Filesender [secure data transfer up to 500 GB!; WeTransfer up to 2 GB]
Storage and backup of data through DANS [Data Archiving and Networked Services]
Data transfer: up to 2 GB per dataset
Dataverse via 4TU.ResearchData: up to 50 GB free

Sharing your data
DataverseNL
How to create an account:
• Go to: https://dataverse.nl/
• Click ‘Log in’ (at the top right); under Institutional account click SURFconext
• Select Eindhoven University of Technology and log on with your TU/e username and
password
• When asked, give permission to share your data by answering Yes or clicking this
tab
• When asked to create an account, answer Yes or click this tab
• Once you have created an account, your username is the prefix of your
email address
You now have a user account with DataverseNL.
If you click 4TU dataverse → Eindhoven dataverse → Add data you can create and
publish data sets, upload files and assign access rights to data sets or files.
However, before you proceed, contact me (for more options) or first use the demo
version: https://act.dataverse.nl
If you are interested in using DataverseNL, please contact me (Leon Osinski)

Sharing your data
after your research has ended
On request
“I'd like to thank E.J. Masicampo and Daniel LaLande for sharing and allowing me to share
their data…”
Daniël Lakens (2014), What p-hacking really looks like: A comment on Masicampo & LaLande (2012)
On a (personal) website
“Let me start by saying that the reason why I put all excel files online, including all the
detailed excel formulas about data constructions and adjustments, is precisely because I
want to promote an open and transparent debate about these important and sensitive
measurement issues.”
Thomas Piketty, My response to the Financial Times, HuffPost The Blog, 29-05-2014;
originally published as Addendum: Response to FT, 28-05-2014
In a data journal
Journal of Open Psychology Data, Geoscience Data Journal,
Data in Brief, Scientific Data, Data Reports
Source: www.aukeherrema.nl

Sharing your data
after your research has ended, by publishing and archiving them in an established
repository
Choose a repository where other researchers in your discipline are sharing their data, for example
LXcat (for plasma data), TurBase (for turbulence data) or GenBank (for genetic sequence data)
Overview of research data repositories: Re3data.org
Use a repository that at least assigns a persistent identifier (DOI) to your data and requires that
you provide adequate metadata
• General or multidisciplinary repositories: Zenodo, Figshare, DANS, Dryad, B2SHARE
• 4TU.ResearchData
+ small to medium-sized data sets, long tail data
+ static data, ‘frozen’ data sets, ‘milestone’ data sets
+ preferably nonproprietary data formats suitable for long-term preservation
+ DOIs [persistent identifiers for citability and retrievability]
+ open access
+ long-term availability, Data Seal of Approval
+ Data Citation Index (Thomson Reuters)
+ self-upload (single data sets < 3 GB)
+ special collections of related data sets

Sharing your data
link your data to your publication

1. Overview research data storage and sharing services: http://dataservices.silk.co/
2. DataverseNL: https://www.dataverse.nl/dvn/
3. Harvard’s Dataverse Project: http://dataverse.org/
4. Open Science Framework: https://cos.io/osf/
5. OpenML: http://www.openml.org
6. RodRep: http://www.rodrep.com/
7. CRCNS: http://crcns.org/
8. SURFdrive: https://www.surfdrive.nl/
9. Google Drive: https://www.google.com/drive/
10. Dropbox: https://www.dropbox.com/
11. Beehub: https://beehub.nl/system/
12. SURF filesender: https://filesender.surfnet.nl/
13. Data on request (blog post Daniel Lakens):
http://daniellakens.blogspot.nl/2014/09/what-p-hacking-really-looks-like.html
14. Data on personal website (Thomas Piketty): http://piketty.pse.ens.fr/en/capital21c2
15. Overview of (better known) data journals: http://proj.badc.rl.ac.uk/preparde/blog/DataJournalsList

16. Data journal: Journal of Open Psychology Data: http://openpsychologydata.metajnl.com/
17. Data journal: Geoscience Data Journal: http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)2049-6060
18. Data journal: Data in brief: http://www.journals.elsevier.com/data-in-brief
19. Data journal: Scientific data: http://www.nature.com/sdata/
20. Data journal: Data reports:
http://www.frontiersin.org/news/Data_Reports_a_new_type_of_peer-reviewed_article_in_Frontiers_journals/1051?utm_source=FRN&utm_medium=ECOM&utm_campaign=TWT_FRN_1502_datareport
21. Research data catalogue: Re3data.org: http://service.re3data.org/search/results?term=
22. Publishing data: Zenodo: http://www.zenodo.org/
23. Publishing data: Figshare: http://www.figshare.com
24. Publishing data: DANS: http://www.dans.knaw.nl/en
25. Publishing data: Dryad: http://datadryad.org/
26. Publishing data: B2SHARE: https://b2share.eudat.eu/
27. Publishing data: 4TU.ResearchData: https://data.4tu.nl/
28. Long tail research data: http://www.nature.com/neuro/journal/v17/n11/fig_tab/nn.3838_F1.html

29. Preferred data formats 4TU.ResearchData:
http://researchdata.4tu.nl/en/publishing-research/data-description-and-formats/
30. Data Seal of Approval: http://www.datasealofapproval.org
31. Data Citation Index (Thomson Reuters): http://wokinfo.com/products_tools/multidisciplinary/dci/
32. Self upload 4TU.ResearchData: https://data.4tu.nl/account/login/?next=/upload/
33. Data sets underlying PhD thesis Joos Buijs:
http://dx.doi.org/10.4121/uuid:26aba40d-8b2d-435b-b5af-6d4bfbd7a270
34. PhD thesis Joos Buijs: http://dx.doi.org/10.6100/IR780920

part 4: caring for your data, or
making data usable
TU/e, 07-03-2017

what was it again
• Sharing your data, or making your data findable and accessible
with good data practices
+ protecting your data: back up, access control; file naming, organizing
data, versioning
+ sharing your data via collaboration platforms and archives
→ Caring for your data, or making your data usable and
interoperable with good data practices
+ tidy data
+ metadata/documentation
+ licenses
+ open data formats
Before data can be reusable, they first have to be usable

Tidy data
making your data easy to handle for computers
Tidy data is about the structure of a table / data set.
Tidy data ≠ clean data. It’s a step towards clean data
+ Each variable you measure is in one column
+ Column headers are variable names
+ Each observation is in a different row
+ Every cell contains only one piece of information
Tidy data
why
Tidy data allow your data to be easily:
+ imported by data management systems
+ analyzed by analysis software
+ visualized, modelled, transformed
+ combined with other data (interoperability)

Tidy data versus messy data
Tidy data
1. Each variable you measure is in one column
2. Column headers are variable names
3. Each observation is in a different row
4. Every cell contains only one piece of information
Messy data
1. More than one variable in a single column (‘clumped data’)
2. Column headers are values, or: one variable over many columns (‘wide data’)
3. Variables are in rows and columns
4. More pieces of information in one cell (cells are highlighted or colored; values and
measurement units in one cell)

Tidy data versus messy data
example
‘Wide’ data: one variable over many columns
patient_id  drug_a  drug_b
1           67      56
2           80      90
3           64      50
4           85      75
Tidy data
patient_id  drug  heart_rate
1           a     67
2           a     80
3           a     64
4           a     85
1           b     56
2           b     90
3           b     50
4           b     75
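The step from the wide table to the tidy table can be written out in a few lines. This sketch uses plain Python to show the idea; the function name `to_tidy` and its parameters are made up for illustration, and packages such as tidyr (R) or pandas (Python) offer the same reshaping as a ready-made operation.

```python
# The 'wide' table from the example: one column per drug.
wide = [
    {"patient_id": 1, "drug_a": 67, "drug_b": 56},
    {"patient_id": 2, "drug_a": 80, "drug_b": 90},
    {"patient_id": 3, "drug_a": 64, "drug_b": 50},
    {"patient_id": 4, "drug_a": 85, "drug_b": 75},
]


def to_tidy(rows, id_column, prefix, name_column, value_column):
    """Turn every prefixed column into rows: one observation per row."""
    tidy = []
    columns = [c for c in rows[0] if c.startswith(prefix)]
    for column in columns:
        for row in rows:
            tidy.append({
                id_column: row[id_column],
                name_column: column[len(prefix):],  # "drug_a" becomes "a"
                value_column: row[column],
            })
    return tidy


tidy = to_tidy(wide, "patient_id", "drug_", "drug", "heart_rate")
for row in tidy:
    print(row["patient_id"], row["drug"], row["heart_rate"])
```

In the tidy form a third drug adds rows, not a new column, so code written for the table keeps working.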

What is the nature of the “unusual episode” to which this table refers?
Different columns contain measurements of the same variable:
easier to read and interpret but difficult to add data (columns) to the
records (rows)

The same data in a tidy structure (variables in columns and observations in rows)
    Class  Sex     Age    Survived  Freq
 1  1st    Male    Child  No           0
 2  2nd    Male    Child  No           0
 3  3rd    Male    Child  No          35
 4  Crew   Male    Child  No           0
 5  1st    Female  Child  No           0
 6  2nd    Female  Child  No           0
 7  3rd    Female  Child  No          17
 8  Crew   Female  Child  No           0
 9  1st    Male    Adult  No         118
10  2nd    Male    Adult  No         154
11  3rd    Male    Adult  No         387
12  Crew   Male    Adult  No         670
13  1st    Female  Adult  No           4
14  2nd    Female  Adult  No          13
15  3rd    Female  Adult  No          89
16  Crew   Female  Adult  No           3
17  1st    Male    Child  Yes          5
18  2nd    Male    Child  Yes         11
19  3rd    Male    Child  Yes         13
20  Crew   Male    Child  Yes          0
21  1st    Female  Child  Yes          1
22  2nd    Female  Child  Yes         13
“The problem is that people like to view data in a totally different way than
a computer likes to process it.” (Kien Leong)

Tools for tidying data
OpenRefine
+ download OpenRefine: http://openrefine.org/download.html
+ runs on your computer (not in the cloud), inside the Firefox browser (not in
IE); no web connection is needed
+ captures all steps done to your raw data; the original dataset is not
modified; steps are easily reversed
R, tidyr package
+ a scripted language (R (free), Matlab, SAS, …) lets you process data
(tidying, cleaning, etc.), run the analysis, and produce the final outputs
versus
+ Excel: processing data through a graphical user interface is bad for data
provenance and documentation because it does not leave a record of the steps
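What "leaving a record" means in a scripted workflow can be shown with a small sketch. Python is used here purely for illustration (the slide recommends R; the idea is the same). The script splits cells that mix a value and a measurement unit, keeps the raw data untouched, and notes each step in a log. The cell contents, the unit "bpm", and the function names are hypothetical.

```python
import re

# Hypothetical messy cells: value and measurement unit in one cell.
raw_cells = ["67 bpm", "80 bpm", "64 bpm"]

# A number, optionally followed by a unit that does not start with a digit.
CELL_PATTERN = re.compile(r"\s*([-+]?\d+(?:\.\d+)?)\s*([^\d\s].*?)?\s*")


def split_value_and_unit(cell):
    """Split a cell such as '67 bpm' into (67.0, 'bpm').

    Returns (None, None) when the cell does not start with a number.
    """
    match = CELL_PATTERN.fullmatch(cell)
    if match is None:
        return None, None
    value, unit = match.groups()
    return float(value), unit


def clean(cells, log):
    """Return cleaned copies of the cells and record the step in `log`."""
    log.append("split each cell into value and unit")
    return [split_value_and_unit(cell) for cell in cells]


steps = []                      # the record that a GUI session would not leave
cleaned = clean(raw_cells, steps)

print(cleaned)    # cleaned copies
print(raw_cells)  # the raw data are unchanged
print(steps)      # what was done, in order
```

The script itself is the documentation of the processing: rerunning it on the raw data reproduces the cleaned data.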

Documentation / metadata
making your data understandable for humans #1
The table or data set itself
+ columns: use clear, descriptive variable names (no hard-to-understand
abbreviations); avoid special characters (they can cause problems with
some software)
+ rows: if possible, use standard names within cells (derived
from a taxonomy, for example: standard species names, CAS
registry numbers for chemical substances, standard date formats, …)
+ try to avoid coding categorical or ordinal data as numbers
+ missing data: use NA
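Two of these recommendations can be checked mechanically. The sketch below flags column names that contain special characters and rewrites common "empty" markers to NA. The list of markers and the naming rule (letters, digits, and underscores only) are assumptions of this example; whether a name is descriptive still has to be judged by a person.

```python
import re

# Assumed spellings of "missing"; extend this set for your own data.
MISSING_MARKERS = {"", "-", "n/a", "na", "null", "none"}


def has_safe_name(column_name):
    """True if the name uses only letters, digits, and underscores."""
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", column_name) is not None


def normalise_missing(value):
    """Replace any assumed missing-value marker with the string 'NA'."""
    return "NA" if value.strip().lower() in MISSING_MARKERS else value


header = ["patient_id", "heart rate (bpm)", "drug"]
print([name for name in header if not has_safe_name(name)])

row = ["3", " n/a ", "a"]
print([normalise_missing(cell) for cell in row])
```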

making your data understandable for humans #2
The table or data set as a whole
A description (documentation) that at least mentions:
+ size of the data set: number of observations and variables
+ information about the variables and their measurement units
(code book)
+ what’s included in and excluded from the data set, and why data are
missing
+ a description of how you collected the data (study design) and of the data
manipulation steps (provenance)
+ when your data consist of multiple files organized in a folder
structure, an explanation of the structure and naming of the
files
“Research outputs that are poorly documented are like canned goods with the label
removed (…)” (Carly Strasser)
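Part of such a description can be generated from the data themselves, which keeps the size and the variable list in step with the file. The sketch below builds the text of a minimal readme for a small table; the function name describe_dataset, the layout of the text, and the example descriptions are assumptions of this illustration, and the parts that need human knowledge (code book, provenance) are passed in by hand.

```python
def describe_dataset(title, rows, codebook, provenance):
    """Build the text of a minimal readme for a table given as a list of dicts."""
    variables = list(rows[0].keys()) if rows else []
    lines = [
        f"Title: {title}",
        f"Size: {len(rows)} observations, {len(variables)} variables",
        "Variables:",
    ]
    for name in variables:
        # Flag variables that nobody has described yet.
        lines.append(f"  {name}: {codebook.get(name, 'NOT DOCUMENTED')}")
    lines.append(f"Provenance: {provenance}")
    return "\n".join(lines)


rows = [
    {"patient_id": 1, "drug": "a", "heart_rate": 67},
    {"patient_id": 1, "drug": "b", "heart_rate": 56},
]
codebook = {  # hypothetical descriptions
    "patient_id": "identifier of the patient",
    "heart_rate": "heart rate in beats per minute",
}

readme = describe_dataset(
    "Heart rate under two drugs (example)",
    rows,
    codebook,
    "typed over from the example table; no further processing",
)
print(readme)
```

The line marked NOT DOCUMENTED shows at a glance which variables are still missing from the code book.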

metadata standards
Sometimes there are metadata standards for the
documentation of your data set but where no standard
exists, a simple readme file can be good enough

Raw data:
https://www.amstat.org/publications/jse/datasets/titanic.dat.txt
Documentation accompanying the data:
https://www.amstat.org/publications/jse/datasets/titanic.txt
+ Size (number of observations and variables)
+ Description
+ Provenance
+ Variable descriptions
Based on:
The "Unusual Episode" Data Revisited / by Robert J. MacG. Dawson, in: Journal
of Statistics Education vol. 3 (1995), issue 3

1. Morphological Measurements of Galapagos Finches
http://dx.doi.org/10.5061/dryad.152
+ Use of standard names (taxonomy, species)
+ Variable names clear enough? WingL must be wing length but what is
N.Ubkl?
+ Units of measurement?
Based on:
Looking after datasets / by Antony Unwin, 01-09-2015,
http://blog.revolutionanalytics.com/2015/09/looking-after-datasets.html

making your data findable for humans and search engines
Descriptive metadata serve mainly for the discovery and identification of
your data
+ creator
+ title
+ short description + key words
+ date(s) of data collection
+ publication year
+ related publications
+ DOI (assigned by data archive)
+ etc.
When uploading your data to a data archive like 4TU.ResearchData, you will be
asked to enter these metadata
A DOI is assigned by the data archive
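Before uploading, it can help to collect these metadata in one small, machine-readable record and to check that nothing is left empty. The sketch below is only an illustration: the field names and the example values are hypothetical and do not follow the schema of 4TU.ResearchData or any other archive, and the DOI is left out because the archive assigns it.

```python
import json

# Hypothetical field names; a real archive defines its own schema.
REQUIRED_FIELDS = (
    "creator",
    "title",
    "description",
    "keywords",
    "collection_dates",
    "publication_year",
)


def missing_fields(record):
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


record = {  # placeholder values, not a real data set
    "creator": "Example, A.",
    "title": "Heart rate under two drugs",
    "description": "Heart rate of four patients measured under drug a and drug b.",
    "keywords": [],
    "collection_dates": "2017-01/2017-03",
    "publication_year": 2017,
    "related_publications": [],
}

print(json.dumps(record, indent=2))
print("Still to fill in:", missing_fields(record))
```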

User license
making clear that other people are allowed to use your data
Let other people know in advance what they are
allowed to do with your data by attaching a user license
to it
+ Creative Commons license for data sets
+ GNU General Public License (GPL) for software
+ License selector

Open data formats
ensuring the ‘longevity’ of your data
+ open (non-proprietary) data formats give the best assurance that the
data will remain usable and ‘legible’ for computers in the future
+ they are easy to use in a variety of software, like .csv for
tabular data
+ check the data formats that are supported by a data
archive like 4TU.ResearchData
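As an illustration of how little is needed to work with an open format, the sketch below writes a small tidy table as CSV text and reads it back using only the Python standard library. An in-memory buffer stands in for a file so the example has no side effects; the helper name to_csv_text is an assumption of this example.

```python
import csv
import io

tidy_rows = [
    {"patient_id": 1, "drug": "a", "heart_rate": 67},
    {"patient_id": 1, "drug": "b", "heart_rate": 56},
]


def to_csv_text(rows):
    """Serialise a list of dicts as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(tidy_rows)
print(csv_text)

# Any CSV-aware tool can read the text back; here the same module is used.
read_back = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(read_back[0])
```

Note that CSV stores text only: after reading back, heart_rate is the string "67", not a number. Types and units therefore belong in the documentation that accompanies the file.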

Usable data
recommended reading
These 3 papers give a good summary of this module
+ Eugene Barsky (2017), Good enough research data
management: a very brief guide
+ Shannon E. Ellis, Jeffrey T. Leek (2017), How to share
data for collaboration
+ Greg Wilson, et al. (2017), Good enough practices in
scientific computing

Support
Data Coach [website]
TU/e data librarians (rdmsupport@tue.nl)
Leon Osinski, Sjef Öllers
Recommended reading
Van den Eynden, Veerle et al. (2011), Managing and sharing data: best
practice for researchers, UK Data Archive
Strasser, Carly (2015), Research data management, NISO
Recommended online course
Essentials 4 data support [English & Dutch]

1. Tidy data: https://www.jstatsoft.org/article/view/v059i10
2. The “Unusual Episode” Data revisited:
https://www.amstat.org/publications/jse/v3n3/datasets.dawson.html
3. OpenRefine: http://openrefine.org
4. tidyr: http://tidyr.tidyverse.org/
5. R: https://www.r-project.org/
6. Metadata standards: http://rd-alliance.github.io/metadata-directory/
7. Raw Titanic data: https://www.amstat.org/publications/jse/datasets/titanic.dat.txt
8. Documentation to Titanic data: https://www.amstat.org/publications/jse/datasets/titanic.txt
9. Morphological Measurements of Galapagos Finches: http://dx.doi.org/10.5061/dryad.152
10. Looking after data sets: http://blog.revolutionanalytics.com/2015/09/looking-after-datasets.html
11. Descriptive metadata 4TU.ResearchData:
http://researchdata.4tu.nl/en/publishing-research/uploading-data/
12. Creative Commons licenses: https://creativecommons.org/
13. GNU General Public License: https://www.gnu.org/licenses/gpl-3.0.en.html
14. License selector: https://ufal.github.io/public-license-selector/
15. Preferred data formats of 4TU.ResearchData:
http://researchdata.4tu.nl/en/publishing-research/data-description-and-formats/
16. Eugene Barsky (2017), Good enough research data management: a very brief guide
17. Shannon E. Ellis, Jeffrey T. Leek (2017), How to share data for collaboration
18. Greg Wilson, et al. (2017), Good enough practices in scientific computing
19. TU/e Data Coach: http://www.tue.nl/datacoach
20. Van den Eynden, Veerle et al. (2011), Managing and sharing data: best practice for researchers, UK Data
Archive
21. Carly Strasser, Research data management:
http://www.niso.org/apps/group_public/download.php/15375/PrimerRDM-2015-0727.pdf
22. Online course ‘Essentials for data support’: http://datasupport.researchdata.nl/en/
