Research data management : [part of] PROOF course Finding and controlling scientific literature and data, Eindhoven University of Technology, 2015 / Leon Osinski
1. Research data management
PROOF course Finding and controlling
scientific literature and data
TU/e, 2015
l.osinski@tue.nl, TU/e IEC/Library
Available under CC BY-SA license, which permits copying
and redistributing the material in any medium or format &
adapting the material for any purpose, provided the original
author and source are credited & you distribute the
adapted material under the same license as the original
2. Agenda
1. Research data management [RDM]: what and why
2. RDM before your research: data management plan
[discussion]
3. RDM during your research: protecting and sharing your data
via a data lab
4. RDM after your research: publishing and archiving your data
via a data archive
Source: Research Data Netherlands /
Marina Noordegraaf
3. Research data management [RDM]
RDM: caring* for your data with the purpose of
1. protecting their mere existence, and;
2. making them available to others - during and after your
research project
Data sharing implies RDM, or: RDM prepares the way for sharing
your data during and after the project
*Goodman A, et al. (2014) Ten simple rules for the care and feeding of scientific data. PLoS Comput Biol 10(4):
e1003542. doi:10.1371/journal.pcbi.1003542
“Rule 3. Conduct science with a particular level of reuse in mind”
4. During your research
Because you work together with other researchers
After your research
Because of scientific integrity: validating results by replication
requires data
Because of re-using results: data-driven science
Because your data are unique / not easily repeatable (long
term observational data)
Because you benefit from it: increases your visibility and
enhances the trustworthiness of your research
Why share research data? #1
5. Because it’s expected by
+ Journals [here, here, here, here]
+ Professional organizations [VSNU, KNAW]
+ Research evaluators
+ Universities, including TU/e
+ Research funders [NWO, ZonMW, EC] data
management plan
Why share research data? #2
6. EC: Horizon 2020 #1
Open research data pilot
“… aims to improve and maximise access to and re-use of research data
generated by projects for the benefit of society and the economy.”
“Regarding the digital research data (…), the beneficiaries must: deposit in a
research data repository and take measures to make it possible (…) to
access, mine, exploit, reproduce, and disseminate – free of charge for any
user (…) the data …”
“Participating projects will be required to develop a Data Management Plan
(DMP), in which they will specify what data will be open.” [ italics mine ]
7. The DMP should address:
1. Data set reference and
name
2. Data set description
3. Standards and metadata
4. Data sharing
5. Archiving and preservation
EC: Horizon 2020 #2
Open research data pilot: data management plan [DMP]
Research data should be:
1. Discoverable
2. Accessible
3. Assessable and intelligible
4. Useable beyond the original
purpose
5. Interoperable
DMP template by 3TU.Datacentrum
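The five DMP headings listed on this slide lend themselves to a simple completeness check of a draft plan. A minimal sketch, not part of the Horizon 2020 guidelines: the dictionary layout, the function `missing_sections` and the example values are assumptions made for illustration.

```python
# Section titles are taken from the slide; storing a DMP draft as a
# dictionary keyed by these titles is an assumption of this sketch.
DMP_SECTIONS = [
    "Data set reference and name",
    "Data set description",
    "Standards and metadata",
    "Data sharing",
    "Archiving and preservation",
]


def missing_sections(dmp):
    """Return the section titles that are absent or empty in a DMP draft."""
    return [s for s in DMP_SECTIONS if not str(dmp.get(s, "")).strip()]


# Hypothetical draft: two sections filled in, one left empty, two missing.
draft = {
    "Data set reference and name": "flow_measurements_2015",
    "Data set description": "Raw and processed sensor readings.",
    "Data sharing": "",
}

print(missing_sections(draft))
```

An empty section is reported as missing on purpose: a heading without text does not yet answer the question behind it.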
8. NWO
pilot data management: scope
“The pilot applies to the following seven funding rounds:
Vici
Research talent (Social sciences)
Innovative public private partnership in ICT (Physical sciences)
Fund new chemical innovations (Chemical sciences)
HTM call (Hightech materials) (Technology foundation STW)
Urbanising deltas of the world of security and the rule of law
(WOTRO)
Open programme (Earth and life sciences).”
9. NWO
pilot data management: additional information #1
“Researchers are expected to answer four questions about data
management in the research proposal (data management section).”
“After a proposal has been awarded funding, the researcher should
elaborate the section into a data management plan. Within four
months of the research project being awarded funding, the
researcher must have submitted the first version of the data
management plan to NWO.”
“For this data management plan, NWO has chosen a template that
matches the guidelines for data management from Horizon 2020 as
closely as possible.” [italics mine]
10. “During the pilot, the data management section will not be included
in the decision about the awarding of funding.”
“NWO understands ‘data’ to be both collected, unprocessed data as
well as analysed, generated data. (…). NWO only requests storage of
data that are relevant for reuse.” [italics mine]
NWO
pilot data management: additional information #2
11. Research data management
discussion topics and questions
Storage and back-up
Where do you keep your research data?
Is there a back-up? Where?
Are data selections made? Not everything is to be stored but…?
Metadata and documentation
Do you describe your research data? Who measured or collected what, when, how? Other
context information?
Are you content with the way you document or describe your research data? Do you succeed
in finding the right (version of your) research data?
Can other researchers understand and (re-)use your research data (during and after
research)? Should they be able to?
Access and re-use
Who can access your research data?
What will happen to your research data when you leave TU/e?
Would you consider publishing your research data, i.e. making them publicly available?
12. Data management plan assignment [ N=5 ]
Collection: Observation during measurements (lab journal), measurement data (from apparatus, tiff files), simulation data, Matlab, Excel, PDFs, Origin (creation of graphs), .csv, .ascii, questionnaire, SPSS, GIS
Storage, backup: Own laptop, network drive, portable/external hard drive, cloud storage (secondary backup), measurement PC, user PC
Documentation: Aimed at understanding and re-use: lab journal, accompanying Excel/Word files, naming, organizing data in folders + READMEs, organized by date of acquisition and method of measurement
Access: During your research: all users of the apparatus, access policy of network drive, SVN (version control + access control), under confidentiality, openly after publication, open
Sharing: When your research is done: with colleagues, conferences, through university file servers, published as part of thesis (open), unknown
Preservation: When your research is done and in the long run: DVDs (raw and processed data), no archiving, data can be produced by running the models at any time, unknown
13. Source: Research Data Netherlands /
Marina Noordegraaf
Protection against physical loss and destruction
storage, backup
data classification and retention; different treatment of different data
Protection against intellectual loss and unretrievability - using the correct data
Metadata, data documentation
+ catalogue metadata, for discovery: creator & title data set, abstract …
+ study metadata: more or less similar to the Methodology section of a paper: info on
provenance of data, workflow of data collection, instruments used, data validation
+ data-level metadata, for re-use by humans and machines, often embedded in software
packages: variable and code descriptions in tables or databases, codebook
+ license-information: what are others allowed to do with your data?
file-naming, organizing data in folders, versioning,
using a relational database [ instead of Excel ]
Protection against unauthorized use
access control
RDM during your research
protecting and sharing your data
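The four kinds of metadata on this slide can be kept together in one small, machine-readable file stored next to the data. A minimal sketch, assuming a JSON sidecar file; the field names, the function `build_metadata` and all example values are hypothetical and do not follow any particular metadata standard.

```python
import json


def build_metadata(creator, title, abstract, method, variables, license_name):
    """Group catalogue, study, data-level and license metadata in one record."""
    return {
        "catalogue": {"creator": creator, "title": title, "abstract": abstract},
        "study": {"method": method},
        "data_level": {"variables": variables},  # a small codebook
        "license": license_name,
    }


# All values below are invented for illustration.
record = build_metadata(
    creator="A. Researcher",
    title="Flow measurements, test catchment",
    abstract="Daily discharge readings collected during a field campaign.",
    method="Pressure sensor, readings validated against manual gauging.",
    variables={
        "date": "measurement date, YYYY-MM-DD",
        "discharge": "discharge in litres per second",
    },
    license_name="CC BY-SA",
)

# Serialise to text; in practice this would be written to a file
# such as metadata.json in the same folder as the data.
print(json.dumps(record, indent=2))
```

Keeping the codebook in the same record as the catalogue information means a re-user needs only one file to understand what each variable means and what they are allowed to do with the data.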
14. File-naming
File-naming conventions help
you find your data, help others
to find your data and help track
which version of a file is most
current
A good file name distinguishes a
file from files with similar
subjects as well as different
versions of the file
Avoid using special characters in a file name:
/ : * ? < > | [ ] & $ , .
Use underscores instead of periods or spaces
to separate logical elements in a file name
Avoid very long names: usually 25 characters
is sufficient length
Use descriptive names, indicative of the
content
Names should include all necessary
descriptive information independent of
where it is stored
Include dates
Include a version number on files
Be consistent
Add a readme.txt to each folder in which the
file naming and its meaning is explained
Source: File naming conventions
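The file-naming rules above can be checked automatically. A minimal sketch, assuming dates are written as YYYYMMDD and versions as _vNN (the slide prescribes neither format). `check_file_name` is a hypothetical helper; the 25-character limit is applied to the name without its extension, and the single period before the extension is allowed.

```python
import re

# Characters the slide says to avoid in a file name.
FORBIDDEN = set("/:*?<>|[]&$,.")


def check_file_name(name, max_len=25):
    """Return a list of rule violations for one file name (empty list = OK)."""
    stem, dot, _extension = name.rpartition(".")
    if not dot:  # no extension present: the whole name is the stem
        stem = name

    problems = []
    bad = sorted(ch for ch in set(stem) if ch in FORBIDDEN)
    if bad:
        problems.append("special characters: " + " ".join(bad))
    if " " in stem:
        problems.append("contains spaces; use underscores")
    if len(stem) > max_len:
        problems.append(f"name longer than {max_len} characters")
    if not re.search(r"(?<!\d)\d{8}(?!\d)", stem):
        problems.append("no date found (YYYYMMDD assumed)")
    if not re.search(r"_v\d+", stem):
        problems.append("no version number found (_vNN assumed)")
    return problems


print(check_file_name("20150306_flow_v02.csv"))  # prints []
for problem in check_file_name("results final.v2.xlsx"):
    print(problem)
```

Whatever date and version formats are chosen, they belong in the readme.txt of the folder, so that others can interpret the names.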
15. File organization
Page 15, 6-3-2015
Source: Beatriz Ramirez, Data management plan for the PhD project:
development and application of a monitoring system to assess the
impacts of climate and land cover changes on eco-hydrological
processes in an eastern Andes catchment area
16. Dataverse Network: data lab for active research data where you may
store your data in an organized and safe way
clearly describe your data
version control of your data
arrange access to your data
get recognition for your data
[collaborate on your data]
Data lab surrogates: Google Drive, Dropbox, [SURFdrive], Beehub…
SURF Filesender [data transfer up to 100 GB]
RDM during your research
data labs
Storage and backup of data through DANS [Dutch
Archiving and Networking Services]
Data transfer: up to 2 GB per dataset
Dataverse 3TU.Datacentrum: up to 50 GB free
17. Workshop on Dataverse Network, by Leon Osinski
Workshop on Mendeley, by Rikie Deurenberg
We will contact you to ask if you’re interested!
RDM during your research
Dataverse Network and Mendeley workshop
18. On request (informal, peer to peer sharing)
“Reinhart and Rogoff kindly provided us with the working spreadsheet from the RR analysis. With
the working spreadsheet, we were able to approximate closely the published RR results. While
using RR's working spreadsheet, we identified coding errors, selective exclusion of available data,
and unconventional weighting of summary statistics.”
Herndon, T., Ash, M., Pollin, R. (2013), Does high public debt consistently stifle economic growth? : a critique of Reinhart and Rogoff
“I'd like to thank E.J. Masicampo and Daniel LaLande for sharing and allowing me to share their
data…”
Daniël Lakens (2014), What p-hacking really looks like: A comment on Masicampo & LaLande (2012)
On a (personal) website
“Let me start by saying that the reason why I put all excel files online, including all the detailed
excel formulas about data constructions and adjustments, is precisely because I want to promote
an open and transparent debate about these important and sensitive measurement issues.”
Thomas Piketty, My response to the Financial Times, HuffPost The Blog, 29-05-2014 ; originally published as Addendum: Response to FT, 28-05-2014
RDM after your research
sharing data after your research #1
19. Source: www.aukeherrema.nl
A data journal
Journal of open psychology data, Geoscience data journal, Data in brief , Scientific data,
Frontiers data reports
A data archive or repository
Catalogues of research data repositories: Databib, Re3data.org
Zenodo, Figshare, DANS, Dryad, B2SHARE
3TU.Datacentrum
+ small to medium-sized data sets, long tail data
+ static data, ‘frozen’ data sets
+ preferably nonproprietary software formats suitable for long term
preservation
+ DOI’s [ persistent identifier for citability and retrievability ]
+ open access
+ long-term availability, Data Seal of Approval
+ Data Citation Index (Thomson Reuters)
+ self-upload (single data sets < 4 GB)
+ special collections of related data sets
RDM after your research
sharing data after your research #2
20. Attach your data to your publication
“What research data and waste have in common is that’s worthwhile to reuse them.”
Lilliana Abarca-Guerrero (2014), A construction waste generation model for developing countries, PhD thesis
TU/e, proposition 9
“Psychology journals should require, as a condition for publication, that
data supporting the results in the paper are accessible in an
appropriate public archive”
Daniël Lakens (2014), Psychology journals should make data sharing a
requirement for publication
RDM after your research
sharing the data of your PhD thesis
21. RDM
time consuming and laborious but also…
“Oh yes, there are certainly benefits from this. Doing
this once means it will be easier in the future (increased
efficiency), so one benefit is reduced future opportunity
costs. Other benefits include personal satisfaction and
the indirect benefits that come from archiving and
publishing in OA journals – I can now list the datasets
and code on NSF Biosketches as a “product” resulting
from previous funding. As I say in the post, I also expect
future publications to be much easier to produce
because the data and code are well organized and
annotated. I will be doing the same calculations for the
next paper using these data/code and writing a follow-
up post.” [ italics mine ]
Emilio M. Bruna
22. Data Coach [ website ]
Data librarian
Leon Osinski, Merle Rodenburg
Recommended reading
Van den Eynden, Veerle et al. (2011), Managing and sharing data: best
practice for researchers, UK Data Archive
Van den Eynden, Veerle et al. (2014), Managing and sharing research data: a
guide to good practice, London: Sage [available via TU/e Library]
Recommended online course
Essentials 4 data support [English & Dutch]
Support
23. Be prepared to share your data after your research because it’s
required and because you benefit from it
Preparation = careful and responsible data management during
your research
[You’ll receive an evaluation form after the course by e-mail. Don’t forget to fill it in.]
Source: Research Data
Netherlands / Marina Noordegraaf
Wrap up
24. 1. Website IEC/Library [TU/e]: http://w3.tue.nl/en/services/library/
2. Data sharing increases visibility: http://dx.doi.org/10.7717/peerj.175
3. Data sharing enhances trustworthiness: http://dx.doi.org/10.1371/journal.pone.0026828
4. Data availability policy journals: http://www.nap.edu/openbook.php?record_id=10613&page=33
5. Data availability policy American Economic Review: https://www.aeaweb.org/aer/data.php
6. Data availability policy PLoS: http://www.plos.org/plos-data-policy-faq/
7. Data availability policy Nature: http://www.nature.com/authors/policies/availability.html
8. VSNU Code of Scientific Conduct (Dutch, revision 2014):
http://www.vsnu.nl/files/documenten/Domeinen/Onderzoek/Code_wetenschapsbeoefening_2004_(2014)
.pdf
9. KNAW responsible research data management: https://www.knaw.nl/en/news/publications/responsible-
research-data-management-and-the-prevention-of-scientific-misconduct?set_language=en
10. Research evaluators (Standard evaluation protocol 2015-2021): http://www.vsnu.nl/SEP
11. Radboud University research data policy: http://www.ru.nl/library/services-0/research/expert-
centre/vm/policy-radboud/
12. TU/e Code of Scientific Conduct: http://www.tue.nl/en/university/about-the-university/integrity/scientific-
integrity/
13. NWO and research data: http://www.nwo.nl/en/news-and-events/dossiers/datamanagement
URL’s of mentioned webpages
in order of appearance #1
25. 14. ZonMW Toegang tot data: http://www.zonmw.nl/nl/programmas/programma-detail/toegang-tot-data-
ttdata/algemeen/
15. Horizon 2020 Guidelines on data management:
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-
mgt_en.pdf
16. Data management plan template (3TU.Datacentrum): http://datacentrum.3tu.nl/en/what-we-offer/data-
management-plan/
17. Loss of data: http://www.cursor.tue.nl/en/news-article/artikel/doctorate-ends-in-drama-after-car-
burglary-1/
18. Storage, back up of data: http://www.data-archive.ac.uk/create-manage/storage
19. Catalogue metadata: http://www.data-archive.ac.uk/create-manage/document/metadata
20. Study metadata: http://www.data-archive.ac.uk/create-manage/document/study-level
21. Data-level metadata: http://www.data-archive.ac.uk/create-manage/document/data-level
22. File naming: http://www.ncdcr.gov/portals/26/pdf/guidelines/filenaming.pdf
23. Organizing data: http://www.wageningenur.nl/en/Expertise-Services/Facilities/Library/Expertise/Write-
cite/Research-data-1/Data-management-plans.htm [example 2]
24. Version control: http://www.data-archive.ac.uk/create-manage/format/versions
25. Using a relational database: http://geekgirls.com/category/office/databases/ , see also
http://www.datacarpentry.org and http://dx.doi.org/10.1890/0012-9623-90.2.205
URL’s of mentioned webpages
in order of appearance #2
26. 26. Kien Leong (2010), The seven deadly spreadsheet sins: http://production-scheduling.com/seven-deadly-
spreadsheet-sins/
27. Dataverse Network: http://www.dataverse.nl
28. Google Drive: https://www.google.com/drive/
29. Dropbox: http://www.dropbox.com
30. SURFdrive: https://surfdrive.surf.nl
31. Beehub: https://beehub.nl/system/
32. Data on request (Reinhart-Rogoff paper): http://dx.doi.org/10.1257/aer.100.2.573
33. Data on request (blog post Daniel Lakens): http://daniellakens.blogspot.nl/2014/09/what-p-hacking-really-
looks-like.html
34. Data on personal website (Thomas Piketty): http://piketty.pse.ens.fr/en/capital21c2
35. Data journal: Journal of Open Psychology Data: http://openpsychologydata.metajnl.com/
36. Data journal: Geoscience Data Journal: http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)2049-6060
37. Data journal: Data in brief: http://www.journals.elsevier.com/data-in-brief
38. Data journal: Scientific data: http://www.nature.com/sdata/
URL’s of mentioned webpages
in order of appearance #3
27. 39. Data journal: Frontiers data reports:
http://www.frontiersin.org/news/Data_Reports_a_new_type_of_peer-
reviewed_article_in_Frontiers_journals/1051?utm_source=FRN&utm_medium=ECOM&utm_campaign=T
WT_FRN_1502_datareport
40. Research data catalogue: Databib: http://databib.org/
41. Research data catalogue: Re3data.org: http://service.re3data.org/search/results?term=
42. Publishing data: Zenodo: http://www.zenodo.org/
43. Publishing data: Figshare: http://www.figshare.com
44. Publishing data: DANS: http://www.dans.knaw.nl/en
45. Publishing data: Dryad: http://datadryad.org/
46. Publishing data: B2SHARE: https://b2share.eudat.eu/
47. Publishing data: 3TU.Datacentrum: http://data.3tu.nl/
48. Long tail research data: http://www.nature.com/neuro/journal/v17/n11/fig_tab/nn.3838_F1.html
49. Nonproprietary software formats:
http://datacentrum.3tu.nl/fileadmin/editor_upload/File_formats/Digital_Preservation_Support_levels.pdf
50. Data Seal of Approval: http://www.datasealofapproval.org
URL’s of mentioned webpages
in order of appearance #4
28. 51. Data Citation Index (Thomson Reuters): http://wokinfo.com/products_tools/multidisciplinary/dci/
52. Self upload 3TU.Datacentrum: https://data.3tu.nl/account/signin/?next=/upload/
53. Data set underlying PhD thesis Lilliana Abarca-Guerrero: http://dx.doi.org/10.4121/uuid:31d9e6b3-77e4-
4a4c-835e-5c3b211edcfc
54. PhD thesis Lilliana Abarca-Guerrero: http://repository.tue.nl/770952
55. Blogpost Daniël Lakens: http://daniellakens.blogspot.nl/2014/12/psychology-journals-should-require-
data.html
56. Emilio M. Bruna, The opportunity cost of my #OpenScience… : http://brunalab.org/blog/2014/09/04/the-
opportunity-cost-of-my-openscience-was-35-hours-690/
57. Data Coach: http://w3.tue.nl/en/services/library/about/services/datacoach/
58. Van den Eynden, V. et al. Managing and sharing data: best practice for researchers:
http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
59. Essentials 4 data support: http://datasupport.researchdata.nl/
URL’s of mentioned webpages
in order of appearance #5
Editor's Notes
Introducing myself and IEC/Library
This course is about data sharing but data sharing requires research data management!
RDM is about data sharing, not only after your research but also during your research. Your promotor wants to take quick look at your data, your colleague needs some of your data, etc.
Sharing your data doesn’t necessarily mean open access!
During your research: sharing data, supported by RDM, allows collaboration
After your research:
Because others may ask for the data underlying a published paper in order to verify or replicate your results (scientific integrity). Validating results by replication requires data
Because journals, funders or codes of conduct demand that data be accessible
Because data are unique and / or valuable (non-repeatable observations)
Because data are an asset, worth sharing in order to be reused or built on by others
UPSIDE: Uniform Principle of Sharing Integral Data and Materials Expeditiously
If research funders set conditions with regard to data management, this often comes down to the requirement of a data management plan.
Reproducibility = being able to go from data to figures/results! This supports the credibility of science
‘Take measures’ = a best-efforts obligation, not an obligation to achieve a result
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf [Guidelines on data management in Horizon 2020 ]
Open research data pilot: also covers re-use of data; mainly implemented through a DMP [ DMP as an early deliverable within the first six months of the project ]
Scope: 7 areas of Horizon 2020 ; €3 billion [ 20% of the overall Horizon 2020 budget 2014-2015 ]
Future and emerging technologies
Research infrastructures – part e-infrastructures
Leadership in enabling and industrial technologies – Information and communication technologies
Societal challenge: ‘Secure, clean and efficient energy’ – part Smart cities and communities
Societal challenge: ‘Climate action, environment, resource efficiency and raw materials’ – except raw materials
Societal challenge: ‘Europe in a changing world – inclusive, innovative and reflective societies’
Science with and for society
At the proposal submission stage, the information provided is not part of the evaluation.
Costs relating to the implementation of the pilot will be eligible
3054 proposals: opt out core areas = 24% ; opt in in other areas = 27%
Guidelines on open access to scientific publications and research data in Horizon 2020 (version 1.0, 11 December 2013)
Guidelines on data management in Horizon 2020 (version 1.0, 11 December 2013): open research data pilot
Open research data pilot / Data management plan [ DMP ]
What types of data will the project generate/collect?
What standards will be used?
How will this data be exploited and/or shared/made accessible for verification and re-use? If data cannot be made available explain why
How will this data be curated and preserved?
Data management section = data management paragraph (Dutch: paragraaf)
1. These are parts of RDM 2. ‘during your research’ but aimed at sharing data after [and during] your research!
Maintaining the integrity of data: this implies protecting the mere existence of data, maintaining quality of data and ensuring that data are accessed only by those authorized to do so.
RDM consists of these parts.
minimize the risk of data loss or deletion ;
protect your data from unauthorized use ;
use the correct data. Especially when you edit your data often or collect data through various experiments or tests, identifying the correct data may pose a problem ;
RDM enhances the efficiency of your research.
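One way to check that a backup still holds the correct data is to compare checksums of the original files with those of their copies. A minimal sketch using only the Python standard library; the folder layout, file names and function names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file path (relative to folder) to its checksum."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(original, backup):
    """List files that are missing from the backup or differ from the original."""
    source, copy = build_manifest(original), build_manifest(backup)
    return sorted(name for name in source if copy.get(name) != source[name])


# Demonstration with throwaway folders: run02.csv was never backed up.
with tempfile.TemporaryDirectory() as tmp:
    original, backup = Path(tmp, "original"), Path(tmp, "backup")
    original.mkdir()
    backup.mkdir()
    (original / "run01.csv").write_text("t,value\n0,1.5\n")
    (backup / "run01.csv").write_text("t,value\n0,1.5\n")
    (original / "run02.csv").write_text("t,value\n0,2.5\n")
    demo_result = changed_files(original, backup)

print(demo_result)  # prints ['run02.csv']
```

The same manifest, saved alongside the data, also helps later on to confirm that a file someone is about to analyse is the version that was intended.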
Meta data to support re-use of data sets:
Configurations of equipment, measurement settings, annotation, etc. Often discipline-specific: provide some discipline-specific use cases.
Objectives:
Reproducibility of scientific results (including academic integrity)
Common science: building on top of previous results
Use data classification and retention
If not used, data volumes and their costs will grow unchecked and get out of control
Use filename conventions : Reduce complexity when contents variety grows
Add Meta data and Annotation : Data gets worthless rapidly if meta data is missing
Automate adding of Meta data : If not automated, it will not happen
Put all data in a database: Avoid complexity explosion when data volume and variety grows
Supply application stubs : Transparency for application users
Use XML based contents and interfaces : Ability to easily interface with any tool or system
Handle access control and tool (flow) integration in a platform : Avoid complexity explosion when functionality grows
Have data privacy handling in place : Strict legal requirement and large risk when non-compliant
Standardise with application field specific communities : Local (TU/e) or out-of-context standardisation is not effective and adds complexity
Descriptive file names: unique names that say something about the content
Relational database: scalability (large and complex data sets)! More options for querying and sorting, fewer data-entry errors (FileMaker, MySQL)
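A minimal sketch of why a relational database leads to fewer data-entry errors than a spreadsheet: the database itself refuses rows that break the rules. It uses SQLite through Python's standard `sqlite3` module as a stand-in for FileMaker or MySQL; the table names, column names and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE instrument (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE measurement (
    id            INTEGER PRIMARY KEY,
    instrument_id INTEGER NOT NULL REFERENCES instrument(id),
    measured_on   TEXT NOT NULL
        CHECK (measured_on GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'),
    value         REAL NOT NULL
);
""")

INSERT_SQL = (
    "INSERT INTO measurement (instrument_id, measured_on, value) VALUES (?, ?, ?)"
)
conn.execute("INSERT INTO instrument (id, name) VALUES (1, 'flow_meter_A')")
conn.executemany(INSERT_SQL, [(1, "2015-03-05", 2.0), (1, "2015-03-06", 4.0)])


def try_insert(row):
    """Insert one measurement; return False if the database rejects it."""
    try:
        conn.execute(INSERT_SQL, row)
        return True
    except sqlite3.IntegrityError:
        return False


# A free-text date and an unknown instrument are both refused.
rejected_date = not try_insert((1, "6 March", 3.0))
rejected_instrument = not try_insert((99, "2015-03-07", 3.0))

# Querying across tables replaces manual filtering and lookups in a spreadsheet.
mean_value = conn.execute(
    """
    SELECT AVG(m.value)
    FROM measurement AS m
    JOIN instrument AS i ON i.id = m.instrument_id
    WHERE i.name = 'flow_meter_A'
    """
).fetchone()[0]

print(rejected_date, rejected_instrument, mean_value)
```

In a spreadsheet both faulty rows would have been accepted silently; here only the two valid measurements contribute to the average.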
Dataverse Network: 2 GB
Informal peer-to-peer sharing makes it difficult to know which data can be obtained where, requires the right contact, makes managing data access a burden and does not ensure availability of the data in the long-term.
Project websites can offer easy immediate storage and dissemination, but will offer less sustainability and it is difficult to control who uses your data and how they use it unless administrative procedures are in place.
Figshare: free up to 1 GB
DANS: Dutch, social sciences and humanities
Dryad: not free (90 euro for 10 GB), only data underlying publications
Who knows DOI’s?