Introduction to Data Management Planning at Alien Challenge COST workshop

Introduction to
Data Management Planning
Aaike De Wever
@aaik

Landscape with the Fall of Icarus, Royal Museums of Fine Arts of Belgium, now seen as a good early copy of Bruegel's original - ca. 1558 -
By Pieter Brueghel the Elder (1526/1530–1569) - 1., Public Domain, https://commons.wikimedia.org/w/index.php?curid=11974918

Source: JISC
http://webarchive.nationalarchives.gov.uk/20140702233839/http:/www.jisc.ac.uk/whatwedo/campaigns/res3/jischelp.aspx
Research Life Cycle and the Data Life Cycle

Source: http://www.data-archive.ac.uk/create-manage/life-cycle

Data decay
Vines et al. (2014) The Availability of Research Data Declines Rapidly with Article Age. Current Biology 24, 1-4.

“Data often have a longer lifespan than the research project
that creates them. Researchers may continue to work on
data after funding has ceased, follow-up projects may
analyse or add to the data, and data may be re-used by
other researchers.”
“Well organised, well documented, preserved and shared
data are invaluable to advance scientific inquiry and to
increase opportunities for learning and innovation.”
Source: http://www.data-archive.ac.uk/create-manage/life-cycle

Data management plan: definition
A data management plan or DMP is a formal document that outlines
how you will handle your data both during your research, and after the
project is completed.
The goal of a data management plan is to consider the many aspects of
data management, metadata generation, data preservation, and
analysis before the project begins; this ensures that data are
well-managed in the present, and prepared for preservation in the future.

Motivation to construct a DMP
• Intrinsic motivation
• Obligation from:
• institute
• funder

What should motivate you (1/2)
• Planning facilitates later work, especially with regard to:
• data storage
• publication
• archival & retrieval
• Opportunity to consider how you handle data, incl.:
• recording new data
• efficient organisation of data
• workflows to ensure its quality

What should motivate you (2/2)
• Ensuring data is understandable and re-usable, thus:
• maximising its visibility & impact
• considering data publication, re-use and dissemination of data and
results at an early stage
• Opportunity to “talk data” with project partners, covering
e.g.:
• data workflows
• data exchange
• common data management policies and practices

Approach for constructing DMP depends on
• Purpose
–Project proposal
–Improving internal data management
–Overall open data policy
• Addressee
–Funding agency: NSF, EU,…
–Internal / Website
• Type(s) of data and volume
–Big data such as next generation sequencing output or automatic sensor data
–Remote sensing imagery
–Biodiversity monitoring campaign

Typical components of a DMP
• What data and analyses are expected
• Info about data (metadata)
• Standards for data storage and exchange
• Sharing, publication, public archiving
• Storage, preservation and access to data

DMP components: Simple example
What:
Data collected during runs: distance, duration, speed, cadence, heart rate
Info:
Personal info on runner entered when configuring device/account
Metadata for run automatically recorded (date, time) or associated during data upload (weather)
Additional metadata for run (shoes, remarks) added through web interface
Standards:
Raw data in .fit format
Uploaded data available through web services (JSON format) for exchange with other services
Summary data downloadable as CSV
Storage:
Data on device (as long as storage permits) and pending uploads on computer
Uploaded data accessible through on-line service
Ad-hoc download of summary data on computer (end of year)
Sharing:
Data accessible to authorised “connections”
Privacy settings can be set for individual activities

DMP components: Sequence data
What:
Data from four 454-sequencing runs
Expected maximum file size: 4 GB raw data/run
Output format: Standard Flowgram Format (SFF)
Info:
Information in ReadMe-file:
• Project details & background
• Information on sample origin & selection
• Sequencing methodology
Standards:
Raw data in: Standard Flowgram Format (SFF)
FASTA format for analysis and storage
Metadata included in FASTA format
Storage:
• Local storage on removable HD
• Back-up through cloud storage
Sharing:
• Raw data in NCBI Sequence Read Archive
• Annotated FASTA data in EMBL/GenBank
• Data associated with publications in Datadryad.org
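
One way to read "Metadata included in FASTA format" is to carry a few key fields in each record's header line. The sketch below is only an illustration under stated assumptions: the `key=value` header convention, the function names, the identifiers and the sequence are all invented here and are not taken from the project or from any repository's submission rules.

```python
def format_fasta_record(seq_id, sequence, metadata, width=60):
    """Return one FASTA record with metadata as key=value pairs in the header."""
    # Assumed convention: '>id key=value key=value'; values must not contain spaces.
    description = " ".join(f"{key}={value}" for key, value in metadata.items())
    header = f">{seq_id} {description}".rstrip()
    # Wrap the sequence at a fixed width, as is customary in FASTA files.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


def parse_fasta_header(header):
    """Split a header written by format_fasta_record back into id and metadata."""
    seq_id, _, description = header.lstrip(">").partition(" ")
    metadata = dict(item.split("=", 1) for item in description.split())
    return seq_id, metadata


record = format_fasta_record(
    "read_0001",                 # invented identifier
    "ACGT" * 20,                 # invented 80-base sequence
    {"sample": "lake_A", "run": "1", "method": "454"},
)
print(record)
```

Keeping the metadata in the header means it travels with the sequence file, but a ReadMe-file is still needed to explain what each key means.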

DMP components: Monitoring data
What:
Data from monthly water quality monitoring at 100 stations, 15 parameters recorded in spreadsheet
(data volume 194 kB in xlsx, 62 kB in csv)
Info:
Details of individual sampling events recorded in spreadsheet
Sampling & analysis protocol recorded in field manual
Standards:
Spreadsheet data imported into SQL database with export queries for:
• Occurrence data in csv using fields from Darwin Core standard
• Metadata in Ecological Metadata Language (EML)
Storage:
• Local storage: Raw spreadsheet data on field operators’ device
• Database on institute servers + local back-ups
Sharing:
• Occurrence data through GBIF on national IPT node
• Data associated with publications in Datadryad.org
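
The "SQL database with export queries" step can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite database: the table `observation`, its columns, the `demo:obs:` identifier prefix and the two records are invented for the example. Only the output column names (`occurrenceID`, `basisOfRecord`, `scientificName`, `eventDate`, `locationID`, `decimalLatitude`, `decimalLongitude`) are Darwin Core terms.

```python
import csv
import io
import sqlite3

# Hypothetical internal table; the column names are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observation ("
    "obs_id INTEGER PRIMARY KEY, station TEXT, sampled_on TEXT, "
    "taxon TEXT, lat REAL, lon REAL)"
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "ST001", "2015-03-12", "Dreissena polymorpha", 50.85, 4.35),
        (2, "ST002", "2015-03-12", "Elodea nuttallii", 51.05, 3.72),
    ],
)

# The export query renames internal columns to Darwin Core terms.
EXPORT_QUERY = """
SELECT 'demo:obs:' || obs_id AS occurrenceID,
       'HumanObservation'    AS basisOfRecord,
       taxon                 AS scientificName,
       sampled_on            AS eventDate,
       station               AS locationID,
       lat                   AS decimalLatitude,
       lon                   AS decimalLongitude
FROM observation
ORDER BY obs_id
"""


def export_occurrences(connection):
    """Run the export query and return the result as CSV text."""
    cursor = connection.execute(EXPORT_QUERY)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The first row holds the Darwin Core column names from the query aliases.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


occurrence_csv = export_occurrences(conn)
print(occurrence_csv)
```

Keeping the mapping in a single query means the internal database can keep its own column names while every export uses the shared standard.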

DMP components for NSF-Bio
• What
• Info
• Standards
• Sharing
–media and methods
–policies and public access
• Storage
• Who
1. Describe data, metadata, formats, standards
2. Physical/cyber resources and facilities
3. Media and dissemination methods
4. Policies for data sharing and public access
5. Roles and responsibilities

H2020 – Pilot action
Pilot action on open access to research data –
Research Data generated by the project:
• deposit in a research data repository and take measures to
make it possible for third parties to access, mine, exploit,
reproduce and disseminate — free of charge for any user — the
following:
• the data, including associated metadata, needed to validate the results
presented in scientific publications as soon as possible;
• other data, including associated metadata, as specified and within the
deadlines laid down in the data management plan (see Annex I);

DMP components for H2020
• What
• Info
• Standards
• Sharing
• Storage
1. What types of data will the project generate/collect?
2. What standards will be used?
3. How will this data be exploited and/or shared/made accessible for verification
and re-use? If data cannot be made available, explain why.
4. How will this data be curated and preserved?

H2020 DMP example: AQUACROSS
Knowledge, Assessment, and Management for AQUAtic Biodiversity and
Ecosystem Services aCROSS EU policies - aquacross.eu:
• 16 partners
• 8 case studies (over 4 WPs)
• dataset survey
• … case studies to be worked out, datasets to be used only partially known
Organisation of Data Management Plan
• Overall DMP distinguishes categories of data (but has to remain general)
• Allows for more specific DMP at case study level
• Living document, updating at regular intervals

Open data institutes in biodiversity monitoring?
See https://www.inbo.be/en/norms-for-data-use

Open data institutes in biodiversity monitoring?
Key actions:
- All data owned by the institute are covered by open data policy
- Data are released after an embargo period of 12 months
- Focus on raw data (biodiversity, sequence and map data)
- Data associated with scientific papers are opened up
- Data are released under the CC0 license waiver
- Data norms apply
- Data are sufficiently documented through metadata
- All research projects are required to prepare a DMP
- Researchers are required to apply these policies
- Researchers are supported by the institute's data unit

Features specific to Alien Invasive Species data?
• Need to integrate initiatives in regional and national networks (Crall
et al. 2010 doi:10.1007/s10530-010-9740-9)
• Need for faster sharing of information (for early warning and rapid
response)
• More thorough data validation measures (before taking management
action)
• Link observation data with data on management response (common
standard for recording this?)
• Concerns about privacy

The Tower of Babel (1563), Kunsthistorisches Museum, Vienna, oil on board - By Pieter Brueghel the Elder (1526/1530–1569) - 1., Public
Domain, https://commons.wikimedia.org/w/index.php?curid=11974918

Background material
Digital Curation Centre
• http://www.dcc.ac.uk/resources/how-guides/develop-data-plan
University of Michigan Library
• http://www.lib.umich.edu/research-data-services/data-management-planning (and http://www.lib.umich.edu/research-data-services/nsf-data-management-plans#examples_proposals for sample plans)
DataONE
• https://www.dataone.org/best-practices

Tools
DMP Online - https://dmponline.dcc.ac.uk
DMPTool - https://dmptool.org
Others:
• DMP builder - https://dmp.library.ualberta.ca
• DMP editor - http://www.openmetadata.org/site/?page_id=373
• IEDA Data Management Plan tool - http://www.iedadata.org/compliance/plan

What data and analyses are expected?
Describe the data to be collected (actual observations) during your research including amount (if known).
Name the type of data, the instrument or collection approach, and how the data will be sampled. If actual
data are interpreted, note the interpretation. Describe any quality control measures. Also describe the final
derivative products (datasets and software or computer code) and the analysis used including analytical
software packages that are required for replication, etc. Describe both data (digital and analog) and physical
materials (samples and collections) gathered or generated during the time of the award.
Consider these questions:
• What data will be generated in the research?
• What data types will you be creating or capturing? (e.g. experimental measures, qualitative, raw, processed)
• How will you capture or create the data? (This should cover content selection, instrumentation, technologies
and approaches chosen, methods for naming, versioning, meeting user needs, etc, and should be sensitive
to the location in which data capture is taking place.)
• If you will be using existing data, state that fact and include where you got it. What is the relationship
between the data you are collecting and the existing data?
Source: https://dmptool.org
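
The "methods for naming, versioning" mentioned above can be made concrete with a small helper. The sketch assumes a hypothetical convention, `<project>_<datatype>_<YYYYMMDD>_v<NN>.<ext>`; the pattern, the function names and the example values are invented for illustration and are not prescribed by DMPTool or any funder.

```python
import re
from datetime import date, datetime

# Hypothetical convention: <project>_<datatype>_<YYYYMMDD>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<datatype>[a-z0-9]+)_"
    r"(?P<date>\d{8})_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)


def build_name(project, datatype, day, version, ext):
    """Compose a file name following the convention above."""
    name = f"{project}_{datatype}_{day:%Y%m%d}_v{version:02d}.{ext}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"name does not follow the convention: {name}")
    return name


def next_version(name):
    """Return the same file name with the version number increased by one."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name}")
    parts = match.groupdict()
    day = datetime.strptime(parts["date"], "%Y%m%d").date()
    return build_name(
        parts["project"], parts["datatype"], day,
        int(parts["version"]) + 1, parts["ext"],
    )


first = build_name("demo", "waterquality", date(2016, 1, 15), 1, "csv")
print(first)
print(next_version(first))
```

Validating names against one pattern makes it easy to check a whole folder for files that drifted from the agreed convention.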

Info/Standards – Standards, Formats and Metadata
Describe the format of your data; think about what details (metadata) someone else would need to be able to
use these files. Describe the structural standards that you will apply in making data and metadata available. For
example, for most ecological data, documentation should be structured in Ecological Metadata Language
(EML). An example of metadata could also be as simple as a "readme file" to explain variables, structure of the
files, etc.
• Which file formats will you use for your data and why?
• What form will the metadata describing/documenting your data take?
• How will you create or capture these details?
• Which metadata standards will you use and why have you chosen them? (e.g. accepted domain-local
standards, widespread usage)
• What contextual details (metadata) are needed to make the data you capture or collect meaningful?
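
The "readme file" option can be sketched as a small script that lists every column of a data file with its description, so that undocumented variables stand out. The function name, the example columns and the descriptions are assumptions made for this illustration only.

```python
import csv
import io


def build_readme(title, csv_text, descriptions):
    """Return README text listing every column of a CSV file with its description."""
    header = next(csv.reader(io.StringIO(csv_text)))
    lines = [title, "", "Variables:"]
    for column in header:
        # Flag undocumented columns instead of silently skipping them.
        lines.append(f"- {column}: {descriptions.get(column, 'TODO: describe')}")
    return "\n".join(lines) + "\n"


# Invented example data and descriptions, for illustration only.
sample_csv = "station,date,temperature\nST001,2015-03-12,8.4\n"
readme = build_readme(
    "Water quality monitoring (demo)",
    sample_csv,
    {"station": "station code", "date": "sampling date (ISO 8601, YYYY-MM-DD)"},
)
print(readme)
```

A plain-text readme like this is a starting point; a structured standard such as EML remains the better choice where a repository expects it.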

Who – Roles and responsibilities
Explain how the responsibilities regarding the management of your data will be delegated. This should include
time allocations, project management of technical aspects, training requirements, and contributions of
non-project staff - individuals should be named where possible. Remember that those responsible for long-term
decisions about your data will likely be the custodians of the repository/archive you choose to store your data.
While the costs associated with your research (and the results of your research) must be specified in the
Budget Justification portion of the proposal, you may want to reiterate who will be responsible for funding the
management of your data.
Consider the following:
• Outline the staff/organizational roles and responsibilities for implementing this data management plan.
• Who will be responsible for data management and for monitoring the data management plan?
• How will adherence to this data management plan be checked or demonstrated?
• What process is in place for transferring responsibility for the data?
• Who will have responsibility over time for decisions about the data once the original personnel are no
longer available?
Source: https://dmptool.org

Sharing – Dissemination methods
Describe how and where you will make these data and metadata available to the community. Remember BIO
is committed to timely and rapid data distribution; make sure you address how soon your data will be
available. Indicate what data will be made available and preserved. Will data be accessible on a web page, by
email request, via open-access repository, etc.?
• What data will be made available from the study and preserved for the long-term?
• How and when will you make the data available? (Include resources needed to make the data available: equipment, systems,
expertise, etc.)
• What transformations will be necessary to prepare data for preservation / data sharing?
• What metadata / documentation will be submitted alongside the data or created on deposit/ transformation in order to make the
data reusable?
• What related information will be deposited?
• What is the process for gaining access to the data?
• How long will the original data collector/creator/principal investigator retain the right to use the data before opening it up to
wider use?
• Explain details of any embargo periods for political/commercial/patent or publisher reasons.
Source: https://dmptool.org

Sharing – Policies for Data Sharing and Public Access
Describe the policies under which these data will be made available. Because it is the reason a DMP is
required, it is very important that you specify how you will share your data with non-group members after the project is
completed. If the data is of a sensitive nature—privacy or ecological endangerment concerns, for instance—
and public access is inappropriate, address here the means by which granular control and access will be
achieved (e.g. formal consent agreements, anonymized data, only available within a secure network, etc.).
• Will any permission restrictions need to be placed on the data?
• Are there ethical and privacy issues? If so, how will these be resolved?
• What have you done to comply with your obligations in your IRB Protocol?
• Who will hold the intellectual property rights to the data and how might this affect data access?
• What and who are the intended or foreseeable uses/users of the data?
• Do you plan on publishing findings which rely on the data? If so, do your prospective publishers place any
restrictions on other avenues of publication?

Storage – Archiving, Storage and Preservation
Consider which data (or research products) will be deposited for long-term access and where. (What physical
and/or cyber resources and facilities (including third party resources) will be used to store and preserve the
data after the grant ends?) Describe your long-term strategy for storing, archiving and preserving the data you
will generate or use.
Consider the following:
• What is the long-term strategy for maintaining, curating and archiving the data?
• Which archive/repository/database have you identified as a place to deposit data?
• What procedures does your intended long-term data storage facility have in place for preservation and backup?
• How long will/should data be kept beyond the life of the project?
• What data will be preserved for the long-term?
• On what basis will data be selected for long-term preservation?
• What metadata/documentation will be submitted alongside the data or created on deposit/transformation in order to make the
data reusable?
• What related information will be deposited?
