Research Data Management

Open Science
Daniel Jacob
INRA UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility

Daniel Jacob – INRA UMR 1332 BFP – Oct 2019
At the end of the course you should understand...
• Links between Research Data and Open Science
• How the management and preservation of Research Data
can facilitate the work of researchers
• How to address concerns about Data Sharing
• The research data life cycle

The Reproducibility Crisis
In recent years, evidence has emerged from disciplines ranging from biology to
economics that many scientific studies are not reproducible.
This evidence has led to declarations in both the scientific and lay press that
science is experiencing a “reproducibility crisis” and that this crisis has
significant impacts on both science and society, including misdirected effort,
funding, and policy implemented on the basis of irreproducible research.
Franklin Sayre, Amy Riegelman (2018) C&RL 79(1) https://doi.org/10.5860/crl.79.1.2

p-hacking
p-hacking (also data dredging, data fishing, data snooping, …) is the misuse of
data analysis to find patterns in data that can be presented as statistically
significant when in fact there is no real underlying effect.
This is done by performing many statistical tests on the data and only paying
attention to those that come back with significant results, instead of stating a
single hypothesis about an underlying effect before the analysis and then
conducting a single test for it.
https://en.wikipedia.org/wiki/Data_dredging
This phenomenon appears, for example, in medicine, more precisely in
epidemiology, where, based on a large amount of data (weight, age at the first
cigarette, etc.) and a large number of possible outcomes (breast cancer, lung
cancer, car accident, etc.), spurious associations are found a posteriori and
statistically "validated".
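The effect of running many tests and keeping only the significant ones can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not a method from the course: it compares two groups drawn from the same distribution (so there is no real effect), uses a simple z-test with known standard deviation, and the function names are hypothetical. Roughly 5% of the tests should still come back with p < 0.05.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(group_a, group_b, sigma=1.0):
    """Two-sided p-value of a z-test on two group means (sigma assumed known)."""
    standard_error = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_positives(n_tests, n_per_group=50, alpha=0.05, seed=2019):
    """Run n_tests comparisons on pure noise and count the 'significant' ones."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_tests):
        # Both groups come from the SAME distribution: no real effect exists.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if z_test_p_value(group_a, group_b) < alpha:
            significant += 1
    return significant


if __name__ == "__main__":
    hits = count_false_positives(n_tests=1000)
    print(f"{hits} of 1000 tests on pure noise have p < 0.05")
```

Reporting only those "hits" would present noise as findings, which is why the hypothesis and the test should be fixed before looking at the data.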

Cholesterol controversy
Cholesterol and Controversy: Past, Present and Future
By Jeanne Garbarino on November 15, 2011
Scientific American - Blog
https://blogs.scientificamerican.com/guest-blog/cholesterol-confusion-and-why-we-should-rethink-our-approach-to-statin-therapy/
The French paradox: lessons for other countries
Jean Ferrières
Heart. 2004 Jan; 90(1): 107–111.
doi: 10.1136/heart.90.1.107
[Figure: plot of death rate from coronary heart disease (1977)
against daily dietary intake (from 1976 to 1978) of
cholesterol and saturated fat, expressed as the
cholesterol-saturated fat index (CSI) per 1000 kcal]
Correlation does not imply causation!
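One common way a correlation arises without a causal link is a hidden common cause. The sketch below is purely illustrative: the variables are synthetic, the names are hypothetical, and nothing here is taken from the cholesterol studies cited above. Two variables x and y never influence each other, yet they correlate because both depend on the same hidden factor; once that factor is subtracted, the correlation vanishes.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_common_cause(n=5000, seed=1):
    """Return (hidden, x, y) where x and y depend only on the hidden factor."""
    rng = random.Random(seed)
    hidden = [rng.gauss(0, 1) for _ in range(n)]
    x = [h + rng.gauss(0, 1) for h in hidden]  # x does not depend on y
    y = [h + rng.gauss(0, 1) for h in hidden]  # y does not depend on x
    return hidden, x, y


if __name__ == "__main__":
    hidden, x, y = simulate_common_cause()
    print("correlation of x and y:", round(pearson(x, y), 2))
    # Remove the hidden factor from both variables and correlate what is left.
    x_rest = [a - h for a, h in zip(x, hidden)]
    y_rest = [b - h for b, h in zip(y, hidden)]
    print("after removing the hidden factor:", round(pearson(x_rest, y_rest), 2))
```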

Open Science

During a research project
[Diagram labels: Research Project; Input; Output; Know-how; Knowledge; DATA; Studies]

After the project is completed
What becomes of the data?
Among the possible scenarios, two are extreme:
• Nothing! The data sit on a disk (until the disk dies!)
• Creation of a comprehensive database managing all
data and metadata in their entirety, associated with a
visualization and querying interface.
[Diagram labels: Research Project; DATA; Studies; Expected objectives]

Expected objectives
[Diagram labels: Research Project; DATA; Studies; Expected links; Scientific Data Repositories; Enrichment; Publishing policies; …]
https://ec.europa.eu/research/participants/docs/h2020-funding-guide/cross-cutting-issues/open-access-dissemination_en.htm

NATIONAL PLAN FOR OPEN SCIENCE
Announced by Frédérique Vidal on 4 July 2018, the plan
makes open access mandatory for publications and project-funded research data.
• Open science is the practice of making research publications and
data freely available (transparency)
• Open science seeks to create an ecosystem in which scientific
research is more cumulative (interdisciplinary)
• Open science makes knowledge accessible to all (civic aspect)
• Open science also drives scientific progress (reactivity)
• Finally, open science fosters scientific integrity and people’s trust
in science (ethics)
http://cache.media.enseignementsup-recherche.gouv.fr/file/Recherche/50/1/SO_A4_2018_EN_01_leger_982501.pdf

Interdisciplinary
Open Science is a new research paradigm facing many challenges, mainly:
• the requirement of many skills
• ingrained research habits
[Diagram labels: Data Science; Scientific Field; IT Skills; Statistics; Data Management; Data Interpretation; Data Analysis; Software; Data]

Science today - context
Research Paradigms
Knowledge creation
• Experimental science
• Theoretical science
• Data-intensive science / Data-driven science
Requires three skills:
• Scientific field
• Information management
• Data processing
New Paradigm
Not only induction and deduction,
but above all abduction >> data science
What are the consequences for the data?
Publications + Data

Abduction
Abduction is a type of reasoning that infers the probable causes of
an observed fact.
In other words, it consists of establishing the most probable cause of an
observed fact and stating, as a hypothesis, that the fact in question
probably results from that cause.
Data Science
Data-driven science

What is Research Data?
Data from observation, experimentation or derived from existing sources
that are analyzed in order to produce or validate original research results.
Examples: digital data tables, text files, sound recordings, completed
survey questionnaires, image or video databases, derived or
compiled data
“Data, or units of information, related to research activities, whether funded or
not, are often organized or formatted in such a way that they can be
communicated, interpreted and processed. Research Data are all the information
you use as part of your research”, according to the University of Bristol

Why do we have to "manage" Research Data
under the Open Science paradigm?
“Data management should be woven into every course in science.”
Data's shameful neglect
Nature 461, 2009 (Editorial)
https://www.nature.com/articles/461145a
RDM benefits
• orchestrates data for efficient and reliable use
• increases the impact of research
• improves the visibility of research
• allows data to be shared securely
• makes it easy to find the data
• reduces the risk of data loss
• increases citation rates
• is a requirement of most funders and publishers
Data Management Facilitates
Sharing and Re-use …

Manage?... but manage what?
• Primary/secondary
• Experimental, observational, simulation, derived, compiled, canonical
• Raw, processed, aggregated, enriched, annotated, formatted, standardized,
published
• Structured/unstructured, homogeneous/heterogeneous
• Free / protected

The data life cycle
Integrate scientific data management into research activities
• Data creation: collection (experiments, measurements,
observations, simulations); creation of metadata
• Data processing: enter, format, clean, organize, verify, validate,
describe, store
• Data analysis: interpretation, visualization, formatting, publication
• Data preservation: migration, reformatting, back-up, permanent storage;
metadata, documentation, certification
• Data dissemination: distribution, referencing, reporting,
rights management; data journals
• Data re-use: teaching, new research, evaluation
Curation of data

Support - skills and professions
• Data Creator: people who produce digital data
• Data Manager: expert on the management, reporting,
storage and dissemination of research data
• Data Scientist: data analysis
• IT Manager / System Administrator: «skilled partner» in data archiving and
preservation
A wide variety of fields
Rapid developments - Continuing training required
New jobs require more and more IT skills
The data life cycle
The scientific data life cycle is the set of
stages of management, conservation,
dissemination and reuse of scientific
data related to research activities.
At each stage, services can be developed:
- development of a Data Management Plan (DMP)
- identification of metadata describing the data
- selection of repositories to store data
- data retention infrastructures
- data discovery and mining tools
- data reuse framework
https://ec.europa.eu/research/openscience/pdf/os_skills_wgreport_final.pdf

Data Management Plan (DMP)
Main step of data management
Tool to be used as soon as projects are set up
A data management plan or DMP is a formal document that outlines
how data will be obtained, processed, organized, stored, secured, preserved and shared,
both during a research project and after the project is completed.
The goal of a data management plan is to consider
the many aspects of data management, metadata generation, data preservation, and analysis
before the project begins;
this ensures that data are well managed
in the present, and prepared for preservation in the future.
Optimization of Data Sharing and
Interoperability of Research
https://dmp.opidor.fr/
https://www6.inra.fr/datapartage/

Data Management Plan (DMP): Operational Details
Responsibilities in the project
• What does the project consist of?
• Who are the partners?
• What is the policy on data management?
• Who is responsible for the management of data?
Resources
• How is the management of data funded, especially in the long term?
Data collection
• What data will be produced/used during the course of the project
(type, format, volume and increase...)?
• How will they be produced? processed?
Data backup
• How, where and by whom will the data be stored, backed up and secured?
Data Access and Data sharing
• Who will be able to access the data? Will the data be shared?
published? With whom? How? For how long? Under which license?
Intellectual Property
• Who will own the data produced?
• Will external data be used?
Data Documentation
• How will the data be identified and described? What metadata
standards will be used?
• How will the metadata be generated?
Data Archiving
• What is the plan for long-term archiving and preservation?

RDM based on Open Science: THE FAIR DATA PRINCIPLES
The FAIR Data Principles are a set of guiding principles to make data findable, accessible, interoperable and
reusable (Wilkinson et al., 2016, Scientific Data - https://www.nature.com/articles/sdata201618).
https://www.force11.org/group/fairgroup/fairprinciples
• Findable: describe your data in a data repository; apply a persistent identifier
• Accessible: consider what will be shared; obtain participant consent
• Interoperable: use open formats; consistent vocabulary; common metadata standards
• Reusable: consider permitted use; apply appropriate license
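The practical advice attached to each FAIR principle can be turned into a simple self-check on a metadata record. The sketch below is only an illustration under stated assumptions: the field names in REQUIRED_FIELDS and the example record are hypothetical, and this is not an official FAIR metric or any repository's schema.

```python
# Hypothetical mapping from each FAIR principle to metadata fields that a
# record should fill in; the field names are illustrative assumptions.
REQUIRED_FIELDS = {
    "Findable": ["identifier", "title", "description"],
    "Accessible": ["access_url", "access_conditions"],
    "Interoperable": ["format", "vocabulary"],
    "Reusable": ["license", "provenance"],
}


def missing_fair_fields(record):
    """Return {principle: [missing field names]} for a metadata record (dict)."""
    gaps = {}
    for principle, fields in REQUIRED_FIELDS.items():
        missing = []
        for field in fields:
            value = record.get(field)
            if value is None or not str(value).strip():
                missing.append(field)
        if missing:
            gaps[principle] = missing
    return gaps


if __name__ == "__main__":
    # Placeholder values only; not a real dataset.
    draft = {
        "title": "Example NMR data set",
        "description": "Placeholder description",
        "access_url": "https://example.org/dataset",
        "access_conditions": "open",
        "format": "CSV",
        "vocabulary": "placeholder ontology",
        "provenance": "placeholder protocol reference",
    }
    print(missing_fair_fields(draft))
```

A check like this only tells you which descriptions are absent; whether the data are actually reusable still depends on the quality of what is written in each field.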

THE FAIR DATA PRINCIPLES
A1.2 => Open as much as possible, Close as much as necessary

5 ★ OPEN DATA

It is above all an approach to measure
the maturity of your data in relation to
Open DATA
https://www.go-fair.org/
From Principles towards Implementations
The Internet of FAIR Data & Services

DMP model H2020 based on FAIR principles
https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
Guidelines on FAIR Data Management in Horizon 2020
1. Data Summary
2. FAIR data
2.1. Making data findable, including provisions for metadata
2.2. Making data openly accessible
2.3. Making data interoperable
2.4. Increase data re-use (through clarifying licences)
3. Allocation of resources
4. Data security
5. Ethical aspects
6. Other issues
7. Further support in developing your DMP

5 ★ OPEN DATA
Publish data "5 Gold stars"
Tim Berners-Lee, the inventor of the Web and initiator of Linked Data,
suggested a 5-star deployment scheme for Open Data:
★ Data on the web, under an open license
★★ … in a structured format
★★★ … and in a non-proprietary format
★★★★ … identified by URIs
★★★★★ … and linked to other data
See also
K. Janowicz et al (2014) Five Stars of Linked Data Vocabulary Use
Semantic Web 0 (2014) 1–0
https://geog.ucsb.edu/~jano/swj653.pdf
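Because each star of the 5-star scheme builds on the previous one, the rating of a dataset can be computed by walking through the criteria in order and stopping at the first one that is not met. The sketch below assumes a hypothetical Dataset description with one boolean per criterion; it illustrates the cumulative logic and is not an official rating tool.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a dataset, one flag per star criterion."""
    on_web_open_license: bool
    structured: bool
    non_proprietary_format: bool
    uses_uris: bool
    linked_to_other_data: bool


def open_data_stars(dataset):
    """Count stars cumulatively: a star only counts if all earlier ones do."""
    criteria = [
        dataset.on_web_open_license,
        dataset.structured,
        dataset.non_proprietary_format,
        dataset.uses_uris,
        dataset.linked_to_other_data,
    ]
    stars = 0
    for criterion_met in criteria:
        if not criterion_met:
            break
        stars += 1
    return stars


if __name__ == "__main__":
    scanned_pdf = Dataset(True, False, False, False, False)
    spreadsheet = Dataset(True, True, False, False, False)
    csv_file = Dataset(True, True, True, False, False)
    for name, ds in [("PDF", scanned_pdf), ("Excel", spreadsheet), ("CSV", csv_file)]:
        print(name, open_data_stars(ds))
```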

SERVICE DESCRIPTION
re3data is a global registry of research data repositories from a diverse range of academic disciplines.
It provides information on repositories for the permanent storage and access of data sets to
researchers, funding bodies, publishers and scholarly institutions.
Research Data Repositories are based on
web applications to preserve, share, cite, search and analyse research data.
…
https://data.inra.fr/
Science Europe’s Framework for Discipline-specific
https://www.nature.com/sdata/policies/repositories
Recommended Data Repositories
https://fairsharing.org/databases/

https://data.inra.fr/

…
2,406 Data Repositories (Oct 10, 2019)
https://www.re3data.org/metrics
Not FAIR !!
FAIR ?

Reproducible Research
in the context of Open Science

Calls for Open Science & Reproducible Research
Typical examples of where problems can arise
• Issues often arise when users jump straight into software implementations of
methods (e.g. in R) that may lack the documentation on biases and assumptions
given in the original papers.
• A major, and often overlooked, cause of lack of repeatability is the wide
sample-to-sample variability of the P value. Because the P value is fickle, the
interpretation of analyses should not be based predominantly on this statistic.
Halsey et al (2015) The fickle P value generates irreproducible results, Nature Methods 12, 179–185
• Overfitting a model is a condition where a statistical model begins to describe the
random error in the data rather than the relationships between variables. This
problem occurs when the model is too complex. In regression analysis, overfitting
can produce misleading R-squared values, regression coefficients, and p-values.
https://statisticsbyjim.com/regression/overfitting-regression-models/
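The sample-to-sample variability of the P value mentioned above can be made visible by repeating the same experiment many times on simulated data. The sketch below is an illustration under assumptions chosen for simplicity (a true effect of 0.5 standard deviations, 30 samples per group, a z-test with known standard deviation); it does not reproduce the analysis of Halsey et al.

```python
import random
from statistics import NormalDist, mean


def replicate_p_values(n_replicates=20, n_per_group=30, effect=0.5, seed=42):
    """Repeat one two-group experiment and return the p-value of each run."""
    rng = random.Random(seed)
    standard_error = (2 / n_per_group) ** 0.5  # sigma = 1 assumed known
    p_values = []
    for _ in range(n_replicates):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        z = (mean(treated) - mean(control)) / standard_error
        p_values.append(2 * (1 - NormalDist().cdf(abs(z))))
    return p_values


if __name__ == "__main__":
    p_values = replicate_p_values()
    significant = sum(p < 0.05 for p in p_values)
    print(f"smallest p = {min(p_values):.4f}, largest p = {max(p_values):.4f}")
    print(f"{significant} of {len(p_values)} identical experiments reach p < 0.05")
```

The effect is real in every run, yet identical experiments land on both sides of the 0.05 threshold, which is why a single P value is a weak basis for a conclusion.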

Miscellaneous
Other issues
• Loss of data and/or information:
- Not regularly backing up your data is considered professional negligence
• Lack of knowledge, lack of technical skills, more or less hazardous practices:
- Training is a right, but also a duty for anyone who claims to fully assume a function / mission
• Continuous evolution of software libraries and their dependencies
• Problems related to numerical accuracy from one computer to another
• Versioning
• …
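One source of numerical differences between computers is that floating-point addition is not associative: the same numbers summed in a different order (for example by a different library version or a parallel run) can give a slightly different result. A minimal sketch, assuming standard IEEE 754 double precision as used by Python floats:

```python
import math

# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Exact comparison fails because rounding happens at each addition.
print(left_to_right == right_to_left)

# Comparing with a tolerance is the robust way to check numerical results.
print(math.isclose(left_to_right, right_to_left, rel_tol=1e-9))
```

When results are compared across machines, a stated tolerance (and the software versions used) should therefore be part of the documentation.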

What Science Requires
“Citations to unpublished data and personal communications
cannot be used to support claims in a published paper.”
“All data necessary to understand, assess, and extend the
conclusions of the manuscript must be available to any reader
of Science.”

Research is defined as reproducible when the published results
can be replicated using the documented data, code, and methods
employed by the author or provider, without the need for any
additional information or for communication with the author
or provider
https://nnlm.gov/data/thesaurus/reproducible-research
Reproducible research
• is not a guarantee of research quality, but a guarantee of transparency
• contributes to quality but does not replace it

Reproducibility has the potential to serve as a minimum standard for judging scientific
claims when full independent replication of a study is not possible

Good practices
• Data Collection and Management:
- Write an information collection protocol: this protocol should be part of the published article
- Maintain a laboratory notebook
- Collect data repeatedly AND reproducibly
• Research Compendium:
- facilitates reproducible research by bringing together in a single
virtual "place" the data, code, protocols and documentation
related to a research project
- contains the full computational environment used to produce the results in the
paper (code, data, etc.), which can be used to reproduce
the results and create new work based on the research.
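Part of documenting the computational environment of a compendium is recording which interpreter and package versions produced the results. The sketch below shows one possible way to do this with the Python standard library only; the function name and the list of package names are hypothetical, and a complete solution (containers, lock files) goes beyond this illustration.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages=()):
    """Return a dict describing the interpreter, OS and package versions."""
    info = {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            info["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info["packages"][name] = None  # package is not installed
    return info


if __name__ == "__main__":
    # The package names are examples; list the ones your analysis imports.
    environment = describe_environment(["numpy", "pandas"])
    print(json.dumps(environment, indent=2))
```

Saving this output next to the results gives a reader the minimum needed to rebuild a comparable environment.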

Good practices
Manage what? What kind of data/information?
The minimal but mandatory set of files
From RAW DATA to final results
Including
• Standard Operating Procedures (SOP)
• Data reporting
[Diagram labels: Raw Data; Processed data; Checking; Validation; Tracing]

Good practices
Manage what? What kind of data/information?
The minimal but mandatory set of files
Example: 1H-NMR Analytical Technique
• The raw NMR spectra (ZIP file)
• An image of an annotated NMR spectrum
• The compound attribution zones
• The calibration file (calibration curves based on standard compounds)
• The Excel worksheet(s) used to calculate the quantification
• The final quantification results file
• Protocol documents that describe each step of the process (Quality Assurance):
I. Analytical sample preparation
II. Analytical processing
III. Data processing
IV. Quantification
Checking, Validation, Tracing
Example of full 1H-NMR data set
http://nmrprocflow.org/ex1
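To show why the calibration file belongs to the mandatory set, here is a sketch of how a calibration curve links a measured signal to a concentration. It assumes a simple linear calibration fitted by least squares; the numbers and units are invented for illustration, and the actual quantification procedure used for the example data set may differ.

```python
def fit_calibration(concentrations, signals):
    """Least-squares line: signal = slope * concentration + intercept."""
    n = len(concentrations)
    mean_c = sum(concentrations) / n
    mean_s = sum(signals) / n
    s_cs = sum((c - mean_c) * (s - mean_s) for c, s in zip(concentrations, signals))
    s_cc = sum((c - mean_c) ** 2 for c in concentrations)
    slope = s_cs / s_cc
    intercept = mean_s - slope * mean_c
    return slope, intercept


def quantify(signal, slope, intercept):
    """Invert the calibration line to estimate a concentration."""
    return (signal - intercept) / slope


if __name__ == "__main__":
    # Invented standards: known concentrations and their measured peak areas.
    standard_concentrations = [0.0, 1.0, 2.0, 4.0]
    standard_areas = [0.1, 2.1, 4.1, 8.1]
    slope, intercept = fit_calibration(standard_concentrations, standard_areas)
    print("estimated concentration:", quantify(5.1, slope, intercept))
```

Without the calibration data, a reader holding only the final results file cannot check or redo this step.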

Good practices
• Backups:
- Not regularly backing up your data is considered professional negligence
• Versions and Archives:
- Safeguarding the successive stages of document development (texts, data, code, etc.) is one of
the fundamental building blocks of reproducible research
- Implementation of a version management strategy
- Git + local or institutional Forge (e.g. Forgemia), GitHub (e.g. github/INRA)
- Research data repositories (re3data.org)
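Backups and archives are only useful if you can verify that the files in them are unchanged. One common approach, sketched below with the standard library, is to store a manifest of SHA-256 checksums alongside the data and compare it later. The function names are hypothetical, and this complements rather than replaces Git or a repository's own integrity checks.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's relative path to its checksum."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(directory, manifest):
    """List files that were added, removed or modified since the manifest."""
    current = build_manifest(directory)
    names = set(current) | set(manifest)
    return sorted(name for name in names if current.get(name) != manifest.get(name))


if __name__ == "__main__":
    manifest = build_manifest(".")
    print(f"{len(manifest)} files recorded")
    print("changed since manifest:", changed_files(".", manifest))
```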

Good advice
• Data exploration
- Use tools that you know well or that make you more efficient.
But
• Learn to program:
- Limit the use of graphical user interfaces (GUI) for subtle or repetitive tasks
- Be able to express in a clear, documented and unambiguous way what you want the software to do
- A program can often be expressed in just a few lines. The higher-level the language, the less
there is to write.
- Typical examples of reproducible research comprise compendia of data, code and text files, often
organised around an R Markdown source document or a Jupyter notebook.
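As an illustration of how little code a documented, repeatable task can take in a high-level language, the sketch below computes the mean of a measurement per group from CSV text. The column names and values are invented for the example; the point is that the script states unambiguously what was done, unlike a sequence of clicks in a GUI.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_by_group(csv_text, group_column, value_column):
    """Return {group: mean value} for the given columns of a CSV table."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[group_column]].append(float(row[value_column]))
    return {group: mean(values) for group, values in groups.items()}


if __name__ == "__main__":
    # Invented example table.
    table = "sample,treatment,value\ns1,control,1.0\ns2,control,2.0\ns3,stress,2.5\ns4,stress,3.5\n"
    print(mean_by_group(table, "treatment", "value"))
```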

Open Data for Access and Mining
ODAM Framework
Example of a Data Management System in the context of Open Science
http://pmb-bordeaux.fr/dataexplorer/
http://pmb-bordeaux.fr/odam/FAIR_and_DataLife_DJ_Oct2019.pdf
https://nbviewer.jupyter.org/github/djacob65/binder_odam/blob/master/PyODAM_api_PCA.ipynb

Some useful links related to Open Science / Data Management
• Research Data - Digital Learning: https://doranum.fr/
• CoopIST – Cooperate in Scientific and Technical Information: https://coop-ist.cirad.fr/gerer-des-donnees
• INRA services and resources: https://www6.inra.fr/datapartage
• The future of science is Open: https://www.fosteropenscience.eu/
• Building the social and technical bridges to enable open sharing and re-use of data: https://www.rd-alliance.org/ (23 Things: Libraries for Research Data)

Books online related to Reproducible Research
• Vers une recherche reproductible : Faire évoluer ses pratiques (Towards reproducible research: changing one's practices)
https://hal.archives-ouvertes.fr/hal-02144142v1
• Reproducible Research with R and RStudio, Second Edition
https://englianhu.files.wordpress.com/2016/01/reproducible-research-with-r-and-studio-2nd-edition.pdf
• Reproducibility and Replicability in Science
https://www.nap.edu/catalog/25303/reproducibility-and-replicability-in-science
