2. Landscape with the Fall of Icarus, Royal Museums of Fine Arts of Belgium, now seen as a good early copy of Bruegel's original - ca. 1558 -
By Pieter Brueghel the Elder (1526/1530–1569) - 1., Public Domain, https://commons.wikimedia.org/w/index.php?curid=11974918
5. Data decay
Vines et al. (2014) The Availability of Research Data Declines Rapidly with Article Age. Current Biology 24, 1-4.
6. “Data often have a longer lifespan than the research project
that creates them. Researchers may continue to work on
data after funding has ceased, follow-up projects may
analyse or add to the data, and data may be re-used by
other researchers.”
“Well organised, well documented, preserved and shared
data are invaluable to advance scientific inquiry and to
increase opportunities for learning and innovation.”
Source: http://www.data-archive.ac.uk/create-manage/life-cycle
7. Data management plan: definition
A data management plan or DMP is a formal document that outlines
how you will handle your data both during your research, and after the
project is completed.
The goal of a data management plan is to consider the many aspects of
data management, metadata generation, data preservation, and
analysis before the project begins; this ensures that data are well-
managed in the present, and prepared for preservation in the future.
9. What should motivate you (1/2)
• Planning facilitates later work, esp. with regard to:
• data storage
• publication
• archival & retrieval
• Opportunity to consider how you handle data, incl.:
• recording new data
• efficient organisation of data
• workflows to ensure its quality
10. What should motivate you (2/2)
• Ensuring data is understandable and re-usable, thus:
• maximising the visibility & impact
• consider data publication, re-use and dissemination of data and
results at an early stage
• Opportunity to “talk data” with project partners, covering
e.g.:
• data workflows
• data exchange
• common data management policies and practices
11. Approach for constructing DMP depends on
• Purpose
–Project proposal
–Improving internal data management
–Overall open data policy
• Addressee
–Funding agency: NSF, EU,…
–Internal / Website
• Type(s) of data and volume
–Big data such as next generation sequencing output or automatic sensor data
–Remote sensing imagery
–Biodiversity monitoring campaign
13. Typical components of a DMP
What data and analyses are expected
Info about data (metadata)
Standards for data storage and exchange
Sharing, publication, public archiving
Storage preservation and access to data
14. DMP components: Simple example
Data collected during runs: distance, duration, speed, cadence, heartrate
Personal info on runner entered when configuring device/account
Metadata for run automatically recorded (date, time) or associated during data upload (weather)
Additional metadata for run (shoes, remarks) added through web interface
Raw data in .fit format
Uploaded data available through web services (json format) for exchange with other services
Summary data downloadable as csv-data
Data on device (as long as storage permits) and pending uploads on computer
Uploaded data accessible through on-line service
Ad-hoc download of summary data on computer (end of year)
Data accessible to authorised “connections”
Privacy settings can be set for individual activities
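The five components can also be kept as a small structured record, which makes it easy to spot a component that has not been filled in yet. The following is a minimal sketch in Python using the running example above. The class name `DataManagementPlan`, the helper `missing_components` and the assignment of entries to components are assumptions made for illustration; they are not part of any DMP standard or funder template. The "sharing" component is left empty on purpose to show the check.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DataManagementPlan:
    """Hypothetical container for the five typical DMP components."""
    what: List[str] = field(default_factory=list)
    info: List[str] = field(default_factory=list)
    standards: List[str] = field(default_factory=list)
    sharing: List[str] = field(default_factory=list)
    storage: List[str] = field(default_factory=list)


def missing_components(plan: DataManagementPlan) -> List[str]:
    """Return the names of components that have no entries yet."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name)]


# Entries paraphrased from the running example; "sharing" is left out on purpose.
running_plan = DataManagementPlan(
    what=["distance, duration, speed, cadence, heartrate per run"],
    info=["runner profile", "date and time of run", "weather", "shoes, remarks"],
    standards=[".fit raw data", "json via web services", "csv summary"],
    storage=["device", "on-line service", "end-of-year csv download"],
)

print(missing_components(running_plan))  # -> ['sharing']
```

Keeping the plan in a structured form like this is only one option; a plain document with the same five headings serves the same purpose.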
16. DMP components: Sequence data
Data from 4 454-sequencing runs
Expected maximum file size: 4 Gb raw data/run
Output format: Standard Flowgram Format (SFF)
Information in ReadMe-file:
• Project details & background
• Information on sample origin & selection
• Sequencing methodology
Raw data in: Standard Flowgram Format (SFF)
FASTA format for analysis and storage
Metadata included in FASTA format
• Local storage on removable HD
• Back-up through cloud storage
• Raw data in NCBI Sequence Read Archive
• Annotated FASTA data in EMBL/GenBank
• Data associated with publications in Datadryad.org
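One way to read "metadata included in FASTA format" is as key=value pairs on the header line of each record. The sketch below illustrates that idea with the Python standard library only. The helper `format_fasta_record`, the metadata keys (`sample`, `run`) and the sequence are made up for illustration, and an actual conversion from SFF would rely on a dedicated sequencing tool rather than on this code. The last lines work out the upper bound on raw data volume implied by the plan.

```python
import textwrap


def format_fasta_record(read_id, sequence, metadata, width=60):
    """Build one FASTA record with key=value metadata on the header line."""
    description = " ".join(f"{key}={value}" for key, value in metadata.items())
    header = f">{read_id} {description}".rstrip()
    # Wrap the sequence to a fixed line width, as is customary in FASTA files.
    body = "\n".join(textwrap.wrap(sequence, width))
    return f"{header}\n{body}\n"


# Hypothetical read and metadata, for illustration only.
record = format_fasta_record(
    "read0001",
    "ACGT" * 20,
    {"sample": "lake_A", "run": "1"},
)
print(record)

# Upper bound on raw data volume from the plan: 4 runs at 4 Gb each.
runs, max_size_per_run = 4, 4
max_raw_volume = runs * max_size_per_run
print(max_raw_volume)
```

Metadata values containing spaces would make such a header ambiguous, so a real plan would also state how values are encoded.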
17. DMP components: Monitoring data
Data from monthly water quality monitoring at 100 stations, 15 parameters recorded in spreadsheet (data volume 194 kb in xlsx, 62 kb in csv)
Details of individual sampling events recorded in spreadsheet
Sampling & analysis protocol recorded in field manual
Spreadsheet data imported into SQL database with export queries for:
• Occurrence data in csv using fields from Darwin Core standard
• Metadata in Ecological Metadata Language (EML)
• Local storage: Raw spreadsheet data on field operators’ device
• Database on institute servers + local back-ups
• Occurrence data through GBIF on national IPT node
• Data associated with publications in Datadryad.org
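The step from database to Darwin Core occurrence data can be sketched as an export query that renames local columns to Darwin Core terms. The example below uses an in-memory SQLite database as a stand-in for the institute's SQL database. The table `observation`, its columns and the two rows are assumptions made for illustration; only the output column names (`occurrenceID`, `basisOfRecord`, `eventDate`, `locationID`, `scientificName`, `individualCount`) are Darwin Core terms.

```python
import csv
import io
import sqlite3

# In-memory database standing in for the institute's SQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observation ("
    "id INTEGER PRIMARY KEY, station TEXT, sampled_on TEXT, "
    "taxon TEXT, n_counted INTEGER)"
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO observation (station, sampled_on, taxon, n_counted) "
    "VALUES (?, ?, ?, ?)",
    [
        ("ST001", "2015-03-02", "Gammarus pulex", 12),
        ("ST002", "2015-03-02", "Asellus aquaticus", 5),
    ],
)

# Export query: rename the local columns to Darwin Core terms.
EXPORT_QUERY = """
    SELECT 'obs-' || id       AS occurrenceID,
           'HumanObservation' AS basisOfRecord,
           sampled_on         AS eventDate,
           station            AS locationID,
           taxon              AS scientificName,
           n_counted          AS individualCount
    FROM observation
    ORDER BY id
"""

cursor = conn.execute(EXPORT_QUERY)
header = [column[0] for column in cursor.description]

# Write the result set as csv, header row first.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(cursor)
occurrence_csv = buffer.getvalue()
print(occurrence_csv)
```

Keeping the mapping in one query means the field-to-term decisions are documented in a single place, which is itself useful material for the DMP.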
19. Typical components of a DMP for NSF-Bio
What
Info
Standards
Sharing
• media and methods
• policies and public access
Storage
& Who
1. Describe data, metadata, formats, standards
2. Physical/cyber resources and facilities
3. Media and dissemination methods
4. Policies for data sharing and public access
5. Roles and responsibilities
20. H2020 – Pilot action
Pilot action on open access to research data –
Research Data generated by the project:
• deposit in a research data repository and take measures to
make it possible for third parties to access, mine, exploit,
reproduce and disseminate — free of charge for any user — the
following:
• the data, including associated metadata, needed to validate the results
presented in scientific publications as soon as possible;
• other data, including associated metadata, as specified and within the
deadlines laid down in the data management plan (see Annex I);
21. Typical components of a DMP for H2020
What
Info
Standards
Sharing
Storage
1. What types of data will the project generate/collect?
2. What standards will be used?
3. How will this data be exploited and/or shared/made accessible for verification and re-use? If data cannot be made available, explain why.
4. How will this data be curated and preserved?
22. H2020 DMP example: AQUACROSS
Knowledge, Assessment, and Management for AQUAtic Biodiversity and
Ecosystem Services aCROSS EU policies - aquacross.eu:
• 16 partners
• 8 case studies (over 4 WPs)
• dataset survey
• … case studies to be worked out, datasets to be used only partially known
Organisation of Data Management Plan
• Overall DMP distinguishes categories of data (but has to remain general)
• Allows for more specific DMP at case study level
• Living document, updating at regular intervals
23. Open data institutes in biodiversity monitoring?
See https://www.inbo.be/en/norms-for-data-use
24. Open data institutes in biodiversity monitoring?
Key actions:
- All data owned by the institute are covered by open data policy
- Data are released after an embargo period of 12 months
- Focus on raw data (biodiversity, sequence and map data)
- Data associated with scientific papers are opened up
- Data are released under the CC0 license waiver
- Data norms apply
- Data are sufficiently documented through metadata
- All research projects are required to prepare a DMP
- Researchers are required to apply these policies
- Researchers are supported by the institute's data unit
25. Features specific for Alien Invasive Species data?
• Need to integrate initiatives in regional and national networks (Crall
et al. 2010 doi:10.1007/s10530-010-9740-9)
• Need for faster sharing of information (for early warning and rapid
response)
• More thorough data validation measures (before taking management
action)
• Link observation data with data on management response (common
standard for recording this?)
• Concerns about privacy
26. The Tower of Babel (1563), Kunsthistorisches Museum, Vienna, oil on board - By Pieter Brueghel the Elder (1526/1530–1569) - 1., Public Domain, https://commons.wikimedia.org/w/index.php?curid=11974918
28. Background material
Digital Curation Centre
• http://www.dcc.ac.uk/resources/how-guides/develop-data-plan
Library University of Michigan
• http://www.lib.umich.edu/research-data-services/data-management-planning (and http://www.lib.umich.edu/research-data-services/nsf-data-management-plans#examples_proposals for sample plans)
DataONE
• https://www.dataone.org/best-practices
30. What data and analyses are expected?
Describe the data to be collected (actual observations) during your research including amount (if known).
Name the type of data, the instrument or collection approach, and how the data will be sampled. If actual
data are interpreted, note the interpretation. Describe any quality control measures. Also describe the final
derivative products (datasets and software or computer code) and the analysis used including analytical
software packages that are required for replication, etc. Describe both data (digital and analog) and physical
materials (samples and collections) gathered or generated during the time of the award.
Consider these questions:
• What data will be generated in the research?
• What data types will you be creating or capturing? (e.g. experimental measures, qualitative, raw, processed)
• How will you capture or create the data? (This should cover content selection, instrumentation, technologies
and approaches chosen, methods for naming, versioning, meeting user needs, etc., and should be sensitive
to the location in which data capture is taking place.)
• If you will be using existing data, state that fact and include where you got it. What is the relationship
between the data you are collecting and the existing data?
Source: https://dmptool.org
31. Info/Standards
– Standards, Formats and Metadata
Describe the format of your data; think about what details (metadata) someone else would need to be able to
use these files. Describe the structural standards that you will apply in making data and metadata available. For
example, for most ecological data, documentation should be structured in Ecological Metadata Language
(EML). An example of metadata could also be as simple as a "readme file" to explain variables, structure of the
files, etc.
Consider these questions:
• Which file formats will you use for your data and why?
• What form will the metadata describing/documenting your data take?
• How will you create or capture these details?
• Which metadata standards will you use and why have you chosen them? (e.g. accepted domain-local
standards, widespread usage)
• What contextual details (metadata) are needed to make the data you capture or collect meaningful?
Source: https://dmptool.org
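A "readme file" that explains variables can be generated from a small data dictionary, so that the documentation stays next to the data it describes. The sketch below is one possible layout, not a metadata standard such as EML. The helper `build_readme`, the dataset title and the three variables are hypothetical and serve only as illustration.

```python
def build_readme(title, variables):
    """Render a plain-text readme listing each variable, its unit and meaning."""
    lines = [title, "=" * len(title), "", "Variables:"]
    for name, (unit, description) in variables.items():
        lines.append(f"- {name} [{unit}]: {description}")
    return "\n".join(lines) + "\n"


# Hypothetical data dictionary, for illustration only.
readme_text = build_readme(
    "Water quality monitoring",
    {
        "station": ("-", "station code"),
        "sampled_on": ("ISO 8601 date", "day the sample was taken"),
        "chlorophyll_a": ("ug/L", "chlorophyll a concentration"),
    },
)
print(readme_text)
```

The same dictionary could later be used to fill in the attribute section of a formal metadata record.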
32. Who – Roles and responsibilities
Explain how the responsibilities regarding the management of your data will be delegated. This should include
time allocations, project management of technical aspects, training requirements, and contributions of non-
project staff - individuals should be named where possible. Remember that those responsible for long-term
decisions about your data will likely be the custodians of the repository/archive you choose to store your data.
While the costs associated with your research (and the results of your research) must be specified in the
Budget Justification portion of the proposal, you may want to reiterate who will be responsible for funding the
management of your data.
Consider the following:
• Outline the staff/organizational roles and responsibilities for implementing this data management plan.
• Who will be responsible for data management and for monitoring the data management plan?
• How will adherence to this data management plan be checked or demonstrated?
• What process is in place for transferring responsibility for the data?
• Who will have responsibility over time for decisions about the data once the original personnel are no
longer available?
Source: https://dmptool.org
33. Sharing – Dissemination methods
Describe how and where you will make these data and metadata available to the community. Remember BIO
is committed to timely and rapid data distribution; make sure you address how soon your data will be
available. Indicate what data will be made available and preserved. Will data be accessible on a web page, by
email request, via open-access repository, etc.?
Consider these questions:
• What data will be made available from the study and preserved for the long-term?
• How and when will you make the data available? (Include resources needed to make the data available: equipment, systems,
expertise, etc.)
• What transformations will be necessary to prepare data for preservation / data sharing?
• What metadata / documentation will be submitted alongside the data or created on deposit/ transformation in order to make the
data reusable?
• What related information will be deposited?
• What is the process for gaining access to the data?
• How long will the original data collector/creator/principal investigator retain the right to use the data before opening it up to
wider use?
• Explain details of any embargo periods for political/commercial/patent or publisher reasons.
Source: https://dmptool.org
34. Sharing – Policies for Data Sharing and
Public Access
Describe the policies under which these data will be made available. It is very important, the reason a DMP is
required, that you specify how you will share your data with non-group members after the project is
completed. If the data is of a sensitive nature—privacy or ecological endangerment concerns, for instance—
and public access is inappropriate, address here the means by which granular control and access will be
achieved (e.g. formal consent agreements, anonymized data, only available within a secure network, etc.).
Consider these questions:
• Will any permission restrictions need to be placed on the data?
• Are there ethical and privacy issues? If so, how will these be resolved?
• What have you done to comply with your obligations in your IRB Protocol?
• Who will hold the intellectual property rights to the data and how might this affect data access?
• What and who are the intended or foreseeable uses/users of the data?
• Do you plan on publishing findings which rely on the data? If so, do your prospective publishers place any
restrictions on other avenues of publication?
Source: https://dmptool.org
35. Storage – Archiving, Storage and
Preservation
Consider which data (or research products) will be deposited for long-term access and where. (What physical
and/or cyber resources and facilities (including third party resources) will be used to store and preserve the
data after the grant ends?) Describe your long-term strategy for storing, archiving and preserving the data you
will generate or use.
Consider the following:
• What is the long-term strategy for maintaining, curating and archiving the data?
• Which archive/repository/database have you identified as a place to deposit data?
• What procedures does your intended long-term data storage facility have in place for preservation and backup?
• How long will/should data be kept beyond the life of the project?
• What data will be preserved for the long-term?
• On what basis will data be selected for long-term preservation?
• What metadata/documentation will be submitted alongside the data or created on deposit/transformation in order to make the
data reusable?
• What related information will be deposited?
Source: https://dmptool.org
Editor's Notes
Aaike De Wever - Royal Belgian Institute of Natural Sciences - involved in a wide range of initiatives around freshwater biodiversity data (management)
Landscape with the Fall of Icarus (originally by Pieter Brueghel the Elder)
Data is the "soil" of a lot of the work we do nowadays.
... and we should be tilling it, "managing it"
... and because it can be labour intensive, we tend to resist doing it
... moreover, we need to do this at the right time! This is where planning comes in.
Data management is an essential component of the research life cycle. In this image, the data life cycle is considered part of the “research process”. There are obviously a lot of ways to depict the data life cycle, but this one from the UK Data Archive provides a very good overview.
Most of us cover at least a few of these steps:
“Create”, “Process”, “Analyse”
But are maybe less active or less concerned about
“Preserving data”, “Giving access” and “Re-using it”
But, if we fail to plan what we will do with our data, most of it will be lost over time.
... and that is a pity. Because often data can be (re)used in many more ways than we imagine.
Which is nicely illustrated by this quote:
“Data often have a longer lifespan than the research project that creates them. Researchers may continue to work on data after funding has ceased, follow-up projects may analyse or add to the data, and data may be re-used by other researchers.
Well organised, well documented, preserved and shared data are invaluable to advance scientific inquiry and to increase opportunities for learning and innovation.”
This is where Data Management Plans come in. Getting a plan on paper helps to carry it out, stick to it, refer to it.
So, what are Data Management Plans?
A Data Management Plan is a formal document that outlines how you will handle your data [both during and after your project].
The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation, and analysis before the project begins; and ensure good management both in the present and future.
The motivation for compiling a data management plan can be manifold, but very roughly you could either do so because you see its value or because you are required to do so by your institute or funder.
Nevertheless, I believe most researchers could benefit from engaging in data management planning, because:
1) (as with the tilling) planning makes things easier in the future [for yourself]
2) it is an opportunity to reflect on the way you handle and organise data,
[towards the outside world]
3) and how you make it visible, and available for re-use
4) It is also an excellent opportunity to "talk data" with your project partners, clarify workflows, exchanges and so on.
Regardless of the approach you take to Data Management Planning, there are a number of typical components to consider;
I’d like to call this the Wisdom –with triple “s”– of data management planning
- What
- Info
- Standards
- Sharing and
- Storage
Under “What” I consider expected data and planned data analyses
By “Info” I cover the information about data –also known as metadata– and documentation that is necessary for others to discover and interpret the data correctly
Under “Standards” I refer to data formats and standards that allow data to be exchanged within the community
By “Sharing” I mean the act of making data available for re-use through on-line data publication or public archiving
… and ”Storage” covers both the physical storage, back-up plans and long term preservation.
Next Generation Sequencing study: as we needed to get a small description on paper which covered a few of these topics, it was somewhat of a Data Management Plan “avant la lettre”.
So, for “what”, we were anticipating a number of 454-runs, with a maximum filesize of 4 Gb per run in the SFF-format.
Metadata would be recorded in a ReadMe file.
Data would be exported in FASTA-files including metadata.
Raw data could be stored in the NCBI Sequence Read Archive, annotated sequences would go into (one of) the databases of the International Nucleotide Sequence Database Collaboration (http://www.insdc.org). Selected data used in publications could potentially be archived in Dryad.
In terms of storage, a local (removable) HD sufficed and back-ups could still be done through a kind of cloud storage.
I have also included a very much simplified example for monitoring data:
Say we consider a monthly monitoring campaign at 100 stations during which we record 15 parameters. The data volume is very manageable on local storage.
Metadata for the sampling event are recorded in the excel file and sampling details are recorded in field manuals.
Spreadsheet data are imported into the institute’s SQL database, with export queries for occurrences in Darwin Core and metadata in Ecological Metadata Language (EML) format.
Occurrence data can be published through the GBIF network and data associated with papers can (additionally) be archived through Dryad
Local storage typically takes place on personal computers and field operators’ devices, while the database is stored on the institute’s servers.
Obviously, in reality each of the 15 parameters (such as chlorophyll analysis, macro-invertebrate count,…) could represent a separate analysis step, which later requires integration in a central database. Documenting these workflows would be part of the DMP.
But as I mentioned earlier, funders have also discovered Data Management Plans, which is probably one of the reasons why this is becoming a hot topic.
If I am not mistaken, NSF in the US has required DMPs since 2011.
The EU currently has a pilot phase running which requires selected projects to produce a Data Management Plan and make their data publicly available...
... and other funders are following their example.
But each of these funders has slightly different requirements for the DMPs they request.
The NSF for instance requires it during project application.
The biggest difference with what we discussed earlier is that they also request info on “Who” is doing the data management and distinguish between media and methods for data sharing and policies for doing so and public access.
This results in these 5 categories, which overlap with one or more of the topics we discussed earlier.
Since DMPs will be considered during the merit review process, to help reviewers, and as appropriate, please organize the DMP as follows:
Describe the data that will be collected, and the data and metadata formats and standards used.
Describe what physical and/or cyber resources and facilities (including third party resources) will be used to store and preserve the data after the grant ends.
Describe what media and dissemination methods will be used to make the data and metadata available to others after the grant ends.
Describe the policies for data sharing and public access (including provisions for protection of privacy, confidentiality, security, intellectual property rights and other rights as appropriate).
Describe the roles and responsibilities of all parties with respect to the management of the data (including contingency plans for the departure of key personnel from the project) after the grant ends.
The EU is currently running a pilot on open access data under its Horizon 2020 funding program. This pilot requires selected projects to deposit their data in repositories to make it available for re-use.
They explicitly refer to the need to archive data associated with scientific papers and allow for a grace period for any “other data”.
As for the DMP, they treat it as a living document, of which successive versions are produced during the project’s lifetime.
The requirements for DMPs are described as follows:
What types of data will the project generate/collect?
What standards will be used?
How will this data be exploited and/or shared/made accessible for verification and re-use? If data cannot be made available, explain why.
How will this data be curated and preserved?