Facilitate Open Science Training for European Research
Open Data: Strategies for Research Data
Management (and Planning)
Martin Donnelly
Digital Curation Centre
University of Edinburgh
Webinar
3 May 2018
The Digital Curation Centre (DCC)
• UK national centre of expertise in digital preservation
and data management, est. 2004
• Principal audience is the UK higher education sector, but
we increasingly work further afield (continental Europe,
North America, South Africa, Asia…)
• Provide guidance, training, tools (e.g. DMPonline) and
other services on all aspects of research data
management and Open Science
• Tailored consultancy/training
• Organise national and international events and webinars
(e.g. International Digital Curation Conference, next one
in Melbourne, February 2019)
• Phase 1 (2014-2016): Spread
the Seeds of Open Science and
Open Access
• Creation of Open Science
Taxonomy
• 2000+ training materials,
categorized in the FOSTER
Portal
• More than 100 f2f training
events in 28 countries and 25
online courses, totalling more
than 6300 participants
FacilitateOpenScienceTrainingforEuropeanResearch
The project
http://fosteropenscience.eu
• Phase 2 (2017-2019): Let the Flowers of Open Science Bloom
• Focus on:
• Training for the practical implementation of Open Science (face to face
and online) including RDM and Open Data
• Developing intermediate/advanced level/discipline-specific training
resources in collaboration with three disciplinary communities (and
related RIs): Life Sciences (ELIXIR), Social Sciences (CESSDA) and
Humanities (DARIAH)
• Update the FOSTER Portal to support moderated learning, badges and
gamification
• In concrete terms:
• 150 new training resources
• Over 50 training events (outcome-oriented, providing participants with
tangible skills) and 20 e-learning courses
• Multi-module Open Science Toolkit
• Trainers Network, Open Science Bootcamp, Open Science Training
Handbook, and more…
FacilitateOpenScienceTrainingforEuropeanResearch
The project
http://fosteropenscience.eu
PRESENTATION OVERVIEW
1. Recap: what does RDM involve?
2. FAIR data
3. Data Management Planning (DMP) –
benefits and requirements
4. Producing a strong DMP
5. Contact details / opportunity for Q&A
RDM, what and why?
RDM is “the active
management and appraisal
of data over the lifecycle of
scholarly and scientific
interest”
Core activities include:
- Planning and describing data-
related work before it takes place
- Documenting your data (and
processing/workflows) so that
others can find and understand it
- Choosing open (or at least
standardised) file formats where
possible
- Storing data safely during a project
- Depositing it in a trusted archive
at the end of the research
- Linking publications to the
datasets that underpin them… and
increasingly code/scripts too
Benefits of Managed/Open Research Outputs
• TRANSPARENCY and QUALITY: The evidence that underpins
research can be made open for anyone to scrutinise, and attempt
to replicate findings. This leads to a more robust scholarly record,
and reduces academic fraud for example
• PROTECTION: Sensitive data is less likely to be exposed if it is
actively and mindfully managed
• SPEED: The research process becomes faster
• IMPACT and LONGEVITY: Open data (and publications) receive
more citations, over longer periods
• EFFICIENCY: Data collection can be funded once, and used many
times for a variety of purposes
• ACCESSIBILITY: Interested third parties can (where appropriate)
access and build upon publicly-funded research outputs with
minimal barriers to access
• DURABILITY: Simply put, fewer important datasets will be lost
Risks of not doing this, or getting this wrong
• LEGAL – sensitive data is protected by law (and
contracts) and needs to be protected
• QUALITY – the scholarly record becomes less robust
• SCIENTIFIC – potential discoveries may be hidden away
in drawers, on USB
• OPPORTUNITY COST – reduced visibility for research >
lost opportunities for collaboration
• FINANCIAL – non-compliance with funder policies can
lead to reduced access to income streams
• REPUTATIONAL – responsible data management is
increasingly considered a core element of good
scholarly practice in the 21st century
The European Commission and FAIR data
• The EC has adopted FORCE11’s ‘FAIR’ approach to
research data management.
• These principles state that “One of the grand challenges
of data-intensive science is to facilitate knowledge
discovery by assisting humans and machines in their
discovery of, access to, integration and analysis of, task-
appropriate scientific data and their associated
algorithms and workflows.”
• To help achieve this, (meta)data should be…
• Findable
• Accessible
• Interoperable
• Reusable
The FAIR Data Principles (1/4)
To be Findable:
F1. (meta)data are assigned a globally unique
and eternally persistent identifier.
F2. data are described with rich metadata.
F3. (meta)data are registered or indexed in a
searchable resource.
F4. metadata specify the data identifier.
The FAIR Data Principles (2/4)
To be Accessible:
A1. (meta)data are retrievable by their
identifier using a standardized communications
protocol.
A1.1. the protocol is open, free, and universally
implementable.
A1.2. the protocol allows for an authentication and
authorization procedure, where necessary.
A2. metadata are accessible, even when the
data are no longer available.
The FAIR Data Principles (3/4)
To be Interoperable:
I1. (meta)data use a formal, accessible, shared,
and broadly applicable language for knowledge
representation.
I2. (meta)data use vocabularies that follow FAIR
principles.
I3. (meta)data include qualified references to
other (meta)data.
The FAIR Data Principles (4/4)
To be Re-usable:
R1. meta(data) have a plurality of accurate and
relevant attributes.
R1.1. (meta)data are released with a clear and
accessible data usage license.
R1.2. (meta)data are associated with
their provenance.
R1.3. (meta)data meet domain-relevant community
standards.
What is DMP and what are its benefits?
• Data Management Planning (DMP) underpins and pulls together
different strands of data management activities.
• A DMP is usually a document, varying in length but ideally (in my
view!) around 10 pages or so, which gives:
 A short description of the research project
 Short descriptions of the datasets to be created or re-used, including
detail about formats etc
 More detailed info about how the data will be documented /described
 How sensitive it is, and who should be able to access it
 Where it will be stored for the short term
 Whether (and how) it will be shared and preserved for the long-term
• Data Management Plans are a means of communication, with
contemporaries and future re-users alike
Benefits of planning (i)
• Firstly, it is intuitive that planned activities stand a better
chance of meeting their goals than unplanned ones.
• The process of planning is also a process of communication,
which is increasingly important in interdisciplinary/multi-
partner research. Collaboration will be more harmonious if
project partners (in industry, other universities, other
countries…) are on the same page.
• There are benefits for research institutions too: RDM obliges
interaction between services/stakeholder groups, encouraging
joined-up approaches and synergies.
• It’s not a huge stretch to suggest that DMP holds benefits for
publishers as well: it supports a more sturdy scholarly record,
and leads to a higher quality product.
Benefits of planning (ii)
• In terms of data security and handling access requests,
if there are good reasons not to publish/share data,
whether in whole or in part, you will be on more solid
ground if you flag these up early in the process
• From a preservation and best practice point of view,
DMP provides an ideal opportunity to engender good
habits with regard to (e.g.) file formats, metadata
standards, storage and risk management practices,
leading to greater longevity of data, and improved
quality standards…
Limits of planning
• What can a plan not do? It can’t do the
work for you…
• “The map is not the territory”
(Korzybski)
or
• “Chalk’s no shears” (Scottish saying)
• It is important to remember that the
human challenges in data management
are often harder to meet than the
technological ones
• Communication is vital
The H2020 DMP policy (i)
The H2020 DMP policy (ii)
• DMPs are now expected from all H2020 research projects,
unless they have reasons for opting out
• Plans are submitted as deliverables – first iteration due in M6
• The DMP should include information on:
• the handling of research data during and after the end of the project
• what data will be collected, processed and/or generated
• which methodology and standards will be applied
• whether data will be shared/made open, and
• how data will be curated and preserved (including after the end of the
project)
• Template and guidance is given in the Guidelines doc
(http://ec.europa.eu/research/participants/data/ref/h2020/g
rants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf)
Reflections on assessing H2020 DMPs
• It would be better if everyone followed the same template –
the EC does provide one, but its use isn’t (yet) mandatory
• A DMP doesn’t need to tell everything there is to know about
a project: brevity is a plus!
• Particularly in early (6 month) DMPs, reviewers are looking for
a sense that data-related issues are being considered
thoughtfully and appropriately, not for a huge degree of detail
• Areas of frequent weakness: security (access and storage),
ethical restrictions for data sharing, appraisal of long-term
value/interest, quality assurance processes, costs
• Advice:
• Be clear about the different between in-project and post-project
data storage and archiving
• Don’t regurgitate the H2020 guidelines – reviewers pick up on that
really quickly
• Try not to confuse publications and data (I have seen projects
describe archived data as ‘gold Open Access’ which doesn’t make
much sense)
Strategies for success, a three step guide
• Step 1. Be clear about who is involved
• Step 2. Write things down & justify decisions
• Step 3. Remember a few rules of thumb
Step 1. Be clear about who is involved
• Remember that RDM is a hybrid activity, involving
multiple stakeholder groups…
• The researchers themselves
• Research support professionals
• Partners based in other institutions, commercial partners,
data centres, funders, etc
• No single person does everything, and it makes no
sense to duplicate effort or reinvent wheels
• Play to your individual strengths – use specialist
knowledge to complete specific sections
Step 2. Write things down & justify decisions
• In your data management plan / record
• In metadata to describe the data and help others to
understand it
• In workflows and README files
• In version management
• In justifying decisions re. access, embargo, selection
and appraisal… the list can get very long…
Communication is crucial…
…and plans can and do change!
Step 3. Remember a few rules of thumb
• Without intervention, data + time = no data
• See (e.g.) Vines, cited in
https://www.theatlantic.com/national/archive/2013/12/scientific-data-lost-
forever/356422/
• Storage is not the same as management
• Think of data as plants and the storage/servers as a greenhouse
• The plants still need to be fed, watered, pruned, etc… and sometimes disposed
of
• Management is not the same as sharing
• Not all data should be shared
• EC approach: “As open as possible, as closed as necessary”
• Prioritise: could anyone die or go to jail?
• Legal issues (e.g. protecting vulnerable subjects) are the most important
• Use standards – technical and community-driven
• Remember that plans are just that – they are not contracts!
Thank you!
• For more information about the
FOSTER project:
• Website: www.fosteropenscience.eu
• Principal investigator: Eloy Rodrigues
(eloy@sdum.uminho.pt)
• General enquiries: Gwen Franck
(gwen.franck@eifl.net)
• Twitter: @fosterscience
• My contact details:
• Email: martin.donnelly@ed.ac.uk
• Twitter: @mkdDCC
• Slideshare:
http://www.slideshare.net/martindo
nnelly
This work is licensed under the
Creative Commons Attribution
2.5 UK: Scotland License.

Open Data: Strategies for Research Data Management (and Planning)

  • 1.
    Facilitate Open ScienceTraining for European Research Open Data: Strategies for Research Data Management (and Planning) Martin Donnelly Digital Curation Centre University of Edinburgh Webinar 3 May 2018
  • 2.
    The Digital CurationCentre (DCC) • UK national centre of expertise in digital preservation and data management, est. 2004 • Principal audience is the UK higher education sector, but we increasingly work further afield (continental Europe, North America, South Africa, Asia…) • Provide guidance, training, tools (e.g. DMPonline) and other services on all aspects of research data management and Open Science • Tailored consultancy/training • Organise national and international events and webinars (e.g. International Digital Curation Conference, next one in Melbourne, February 2019)
  • 3.
    • Phase 1(2014-2016): Spread the Seeds of Open Science and Open Access • Creation of Open Science Taxonomy • 2000+ training materials, categorized in the FOSTER Portal • More than 100 f2f training events in 28 countries and 25 online courses, totalling more than 6300 participants FacilitateOpenScienceTrainingforEuropeanResearch The project http://fosteropenscience.eu
  • 4.
    • Phase 2(2017-2019): Let the Flowers of Open Science Bloom • Focus on: • Training for the practical implementation of Open Science (face to face and online) including RDM and Open Data • Developing intermediate/advanced level/discipline-specific training resources in collaboration with three disciplinary communities (and related RIs): Life Sciences (ELIXIR), Social Sciences (CESSDA) and Humanities (DARIAH) • Update the FOSTER Portal to support moderated learning, badges and gamification • In concrete terms: • 150 new training resources • Over 50 training events (outcome-oriented, providing participants with tangible skills) and 20 e-learning courses • Multi-module Open Science Toolkit • Trainers Network, Open Science Bootcamp, Open Science Training Handbook, and more… FacilitateOpenScienceTrainingforEuropeanResearch The project http://fosteropenscience.eu
  • 5.
    PRESENTATION OVERVIEW 1. Recap:what does RDM involve? 2. FAIR data 3. Data Management Planning (DMP) – benefits and requirements 4. Producing a strong DMP 5. Contact details / opportunity for Q&A
  • 6.
    RDM, what andwhy? RDM is “the active management and appraisal of data over the lifecycle of scholarly and scientific interest” Core activities include: - Planning and describing data- related work before it takes place - Documenting your data (and processing/workflows) so that others can find and understand it - Choosing open (or at least standardised) file formats where possible - Storing data safely during a project - Depositing it in a trusted archive at the end of the research - Linking publications to the datasets that underpin them… and increasingly code/scripts too
  • 7.
    Benefits of Managed/OpenResearch Outputs • TRANSPARENCY and QUALITY: The evidence that underpins research can be made open for anyone to scrutinise, and attempt to replicate findings. This leads to a more robust scholarly record, and reduces academic fraud for example • PROTECTION: Sensitive data is less likely to be exposed if it is actively and mindfully managed • SPEED: The research process becomes faster • IMPACT and LONGEVITY: Open data (and publications) receive more citations, over longer periods • EFFICIENCY: Data collection can be funded once, and used many times for a variety of purposes • ACCESSIBILITY: Interested third parties can (where appropriate) access and build upon publicly-funded research outputs with minimal barriers to access • DURABILITY: Simply put, fewer important datasets will be lost
  • 8.
    Risks of notdoing this, or getting this wrong • LEGAL – sensitive data is protected by law (and contracts) and needs to be protected • QUALITY – the scholarly record becomes less robust • SCIENTIFIC – potential discoveries may be hidden away in drawers, on USB • OPPORTUNITY COST – reduced visibility for research > lost opportunities for collaboration • FINANCIAL – non-compliance with funder policies can lead to reduced access to income streams • REPUTATIONAL – responsible data management is increasingly considered a core element of good scholarly practice in the 21st century
  • 9.
    The European Commissionand FAIR data • The EC has adopted FORCE11’s ‘FAIR’ approach to research data management. • These principles state that “One of the grand challenges of data-intensive science is to facilitate knowledge discovery by assisting humans and machines in their discovery of, access to, integration and analysis of, task- appropriate scientific data and their associated algorithms and workflows.” • To help achieve this, (meta)data should be… • Findable • Accessible • Interoperable • Reusable
  • 10.
    The FAIR DataPrinciples (1/4) To be Findable: F1. (meta)data are assigned a globally unique and eternally persistent identifier. F2. data are described with rich metadata. F3. (meta)data are registered or indexed in a searchable resource. F4. metadata specify the data identifier.
  • 11.
    The FAIR DataPrinciples (2/4) To be Accessible: A1. (meta)data are retrievable by their identifier using a standardized communications protocol. A1.1. the protocol is open, free, and universally implementable. A1.2. the protocol allows for an authentication and authorization procedure, where necessary. A2. metadata are accessible, even when the data are no longer available.
  • 12.
    The FAIR DataPrinciples (3/4) To be Interoperable: I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. I2. (meta)data use vocabularies that follow FAIR principles. I3. (meta)data include qualified references to other (meta)data.
  • 13.
    The FAIR DataPrinciples (4/4) To be Re-usable: R1. meta(data) have a plurality of accurate and relevant attributes. R1.1. (meta)data are released with a clear and accessible data usage license. R1.2. (meta)data are associated with their provenance. R1.3. (meta)data meet domain-relevant community standards.
  • 14.
    What is DMPand what are its benefits? • Data Management Planning (DMP) underpins and pulls together different strands of data management activities. • A DMP is usually a document, varying in length but ideally (in my view!) around 10 pages or so, which gives:  A short description of the research project  Short descriptions of the datasets to be created or re-used, including detail about formats etc  More detailed info about how the data will be documented /described  How sensitive it is, and who should be able to access it  Where it will be stored for the short term  Whether (and how) it will be shared and preserved for the long-term • Data Management Plans are a means of communication, with contemporaries and future re-users alike
  • 15.
    Benefits of planning(i) • Firstly, it is intuitive that planned activities stand a better chance of meeting their goals than unplanned ones. • The process of planning is also a process of communication, which is increasingly important in interdisciplinary/multi- partner research. Collaboration will be more harmonious if project partners (in industry, other universities, other countries…) are on the same page. • There are benefits for research institutions too: RDM obliges interaction between services/stakeholder groups, encouraging joined-up approaches and synergies. • It’s not a huge stretch to suggest that DMP holds benefits for publishers as well: it supports a more sturdy scholarly record, and leads to a higher quality product.
  • 16.
    Benefits of planning(ii) • In terms of data security and handling access requests, if there are good reasons not to publish/share data, whether in whole or in part, you will be on more solid ground if you flag these up early in the process • From a preservation and best practice point of view, DMP provides an ideal opportunity to engender good habits with regard to (e.g.) file formats, metadata standards, storage and risk management practices, leading to greater longevity of data, and improved quality standards…
  • 17.
    Limits of planning •What can a plan not do? It can’t do the work for you… • “The map is not the territory” (Korzybski) or • “Chalk’s no shears” (Scottish saying) • It is important to remember that the human challenges in data management are often harder to meet than the technological ones • Communication is vital
  • 18.
    The H2020 DMPpolicy (i)
  • 19.
    The H2020 DMPpolicy (ii) • DMPs are now expected from all H2020 research projects, unless they have reasons for opting out • Plans are submitted as deliverables – first iteration due in M6 • The DMP should include information on: • the handling of research data during and after the end of the project • what data will be collected, processed and/or generated • which methodology and standards will be applied • whether data will be shared/made open, and • how data will be curated and preserved (including after the end of the project) • Template and guidance is given in the Guidelines doc (http://ec.europa.eu/research/participants/data/ref/h2020/g rants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf)
  • 20.
    Reflections on assessingH2020 DMPs • It would be better if everyone followed the same template – the EC does provide one, but its use isn’t (yet) mandatory • A DMP doesn’t need to tell everything there is to know about a project: brevity is a plus! • Particularly in early (6 month) DMPs, reviewers are looking for a sense that data-related issues are being considered thoughtfully and appropriately, not for a huge degree of detail • Areas of frequent weakness: security (access and storage), ethical restrictions for data sharing, appraisal of long-term value/interest, quality assurance processes, costs • Advice: • Be clear about the different between in-project and post-project data storage and archiving • Don’t regurgitate the H2020 guidelines – reviewers pick up on that really quickly • Try not to confuse publications and data (I have seen projects describe archived data as ‘gold Open Access’ which doesn’t make much sense)
  • 21.
    Strategies for success,a three step guide • Step 1. Be clear about who is involved • Step 2. Write things down & justify decisions • Step 3. Remember a few rules of thumb
  • 22.
    Step 1. Beclear about who is involved • Remember that RDM is a hybrid activity, involving multiple stakeholder groups… • The researchers themselves • Research support professionals • Partners based in other institutions, commercial partners, data centres, funders, etc • No single person does everything, and it makes no sense to duplicate effort or reinvent wheels • Play to your individual strengths – use specialist knowledge to complete specific sections
  • 23.
    Step 2. Writethings down & justify decisions • In your data management plan / record • In metadata to describe the data and help others to understand it • In workflows and README files • In version management • In justifying decisions re. access, embargo, selection and appraisal… the list can get very long… Communication is crucial… …and plans can and do change!
  • 24.
    Step 3. Remembera few rules of thumb • Without intervention, data + time = no data • See (e.g.) Vines, cited in https://www.theatlantic.com/national/archive/2013/12/scientific-data-lost- forever/356422/ • Storage is not the same as management • Think of data as plants and the storage/servers as a greenhouse • The plants still need to be fed, watered, pruned, etc… and sometimes disposed of • Management is not the same as sharing • Not all data should be shared • EC approach: “As open as possible, as closed as necessary” • Prioritise: could anyone die or go to jail? • Legal issues (e.g. protecting vulnerable subjects) are the most important • Use standards – technical and community-driven • Remember that plans are just that – they are not contracts!
  • 25.
    Thank you! • Formore information about the FOSTER project: • Website: www.fosteropenscience.eu • Principal investigator: Eloy Rodrigues (eloy@sdum.uminho.pt) • General enquiries: Gwen Franck (gwen.franck@eifl.net) • Twitter: @fosterscience • My contact details: • Email: martin.donnelly@ed.ac.uk • Twitter: @mkdDCC • Slideshare: http://www.slideshare.net/martindo nnelly This work is licensed under the Creative Commons Attribution 2.5 UK: Scotland License.

Editor's Notes

  • #4 The FOSTER project – what and how. FOSTER’s training strategy uses a combination of methods and activities, from face-to-face training, to the use of e-learning, blended and self-learning, as well as the dissemination of training materials/contents/curricula via a dedicated training portal, plus a helpdesk. Face-to-face trainings targets graduate schools in European universities and in particular will train trainers/teachers/multipliers that can conduct further training and dissemination activities in their institution, country and disciplinary community. FOSTER combines experiences and materials to showcase best practices, setting the scene for an active learning and teaching community for open access practices across Europe. The main outcomes of the project are: The FOSTER portal to host training courses and curricula; Facilitate the organisation of FOSTER training events and the creation of training content across Europe Identification of existing contents that can be reused in the context of the training activities and develop/create/ enhance contents if/where they are needed;
  • #5 Partners: University of Minho, University of Göttingen, Open University, Stichting eIFL.net, Digital Curation Centre University of Edinburgh and University of Glasgow, Danmarks Tekniske Universitet, Stichting LIBER, Spanish National Research Council, GESIS – Leibniz Institute for the Social Sciences, Centre for Genomic Regulation Associated partners: DARIAH EU, TIB Hannover
  • #6 Context: Open Knowledge Foundation, Creative Commons, G8 statement Open Science principles are an essential part of knowledge creation and sharing and innovation. They directly support researchers’ need for greater impact, optimum dissemination of research, while also enabling the engagement of citizen scientists and society at large on societal challenges. FOSTER aims to set in place sustainable mechanisms for EU researchers to integrate Open Science in their daily workflow, supporting researchers to optimizing their research visibility and impact and to facilitate the adoption of EU open access policies.
  • #8 Note that even if data is not suitable for sharing/publication, it still needs active management!
  • #14 Note that the European Commission has established an Expert Group on Turning FAIR Data into Reality (E03464) which will run until Spring 2018.