Research Data Management
Open Science
Daniel Jacob
INRA UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility
Daniel Jacob – INRA UMR 1332 BFP – Oct 2019 2
At the end of the course you should understand:
• Links between Research Data and Open Science
• How the management and preservation of Research Data can facilitate the work of researchers
• How to address concerns about Data Sharing
• The research Data life cycle
The Reproducibility Crisis
In recent years, evidence has emerged from disciplines ranging from biology to
economics that many scientific studies are not reproducible.
This evidence has led to declarations in both the scientific and lay press that
science is experiencing a “reproducibility crisis” and that this crisis has
significant impacts on both science and society, including misdirected effort,
funding, and policy implemented on the basis of irreproducible research.
Franklin Sayre, Amy Riegelman (2018) C&RL 79(1) https://doi.org/10.5860/crl.79.1.2

p-hacking
p-hacking (also called data dredging, data fishing, data snooping, …) is the misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect.
This is done by performing many statistical tests on the data and only paying attention to those that come back with significant results, instead of stating a single hypothesis about an underlying effect before the analysis and then conducting a single test for it.
https://en.wikipedia.org/wiki/Data_dredging
This phenomenon appears, for example, in medicine, and more precisely in epidemiology: given a large number of variables (weight, age at first cigarette, etc.) and a large number of possible outcomes (breast cancer, lung cancer, car accident, etc.), spurious associations are found a posteriori and statistically "validated".
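The cost of running many tests can be made concrete. The sketch below is an illustration added to these notes, not material from the original slides. It assumes independent tests on data that contain no real effect, and the function names are hypothetical. With 20 such tests at the 5% level, the chance of at least one "significant" result is about 64%.

```python
import math
import random


def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def z_test_p_value(sample: list[float]) -> float:
    """Two-sided p-value for H0: mean = 0, assuming unit variance (z-test)."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2.0))


def count_false_positives(n_tests: int, n_obs: int, alpha: float, seed: int) -> int:
    """Run n_tests z-tests on pure noise and count the 'significant' ones."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        if z_test_p_value(noise) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    # Chance of at least one false positive with 20 tests at alpha = 0.05
    print(f"{family_wise_error_rate(0.05, 20):.2f}")  # prints 0.64
    # Number of 'significant' results among 1000 tests on pure noise
    print(count_false_positives(n_tests=1000, n_obs=30, alpha=0.05, seed=1))
```

Roughly 5% of the tests on pure noise come back "significant", which is why reporting only those tests is misleading.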

Cholesterol controversy
Cholesterol and Controversy: Past, Present and Future
By Jeanne Garbarino on November 15, 2011, Scientific American - Blog
https://blogs.scientificamerican.com/guest-blog/cholesterol-confusion-and-why-we-should-rethink-our-approach-to-statin-therapy/
The French paradox: lessons for other countries
Jean Ferrières, Heart. 2004 Jan; 90(1): 107–111. doi: 10.1136/heart.90.1.107
[Figure: plot of death rate from coronary heart disease (1977) correlated with daily dietary intake (from 1976 to 1978) of cholesterol and saturated fat, as expressed by the cholesterol-saturated fat index (CSI) per 1000 kcal]
Correlation does not imply causation!

Open Science

During a research project
[Diagram: a Research Project, with the labels DATA, Studies, Know-how, Knowledge, Input and Output]

After the project is completed
What becomes of the data? Among the possible scenarios, two are extreme:
• Nothing! The data sit on a disk (until the disk dies!)
• Creation of a comprehensive database managing all data and metadata, associated with a visualization and querying interface.
[Diagram: Research Project, DATA, Studies, Expected objectives]

Expected objectives
[Diagram: Research Project, DATA and Studies, with expected links to Scientific Data Repositories, Enrichment, Publishing policies, …]
https://ec.europa.eu/research/participants/docs/h2020-funding-guide/cross-cutting-issues/open-access-dissemination_en.htm

NATIONAL PLAN FOR OPEN SCIENCE
The plan, announced by Frédérique Vidal on 4 July 2018, makes open access mandatory for publications and data resulting from project-funded research.
• Open science is the practice of making research publications and data freely available (transparency)
• Open science seeks to create an ecosystem in which scientific research is more cumulative (interdisciplinary)
• Open science makes knowledge accessible to all (civic aspect)
• Open science also drives scientific progress (reactivity)
• Finally, open science fosters scientific integrity and people’s trust in science (ethics)
http://cache.media.enseignementsup-recherche.gouv.fr/file/Recherche/50/1/SO_A4_2018_EN_01_leger_982501.pdf

Open Science is a new research paradigm facing many challenges, mainly:
• The requirement of many skills
• Ingrained research habits
[Diagram labels: Interdisciplinary, Data Science, Scientific Field, IT Skills, Statistics, Data Management, Data Analysis, Data Interpretation, Software, Data]

Science today - context
Research Paradigms (knowledge creation):
• Experimental science
• Theoretical science
• Data-intensive science / Data-driven science (New Paradigm)
Requires three skills:
• Scientific field
• Information management
• Data processing
What are the consequences on the data?
• Publications + Data
• Not only induction and deduction, but above all abduction >> data science

Abduction
Abduction is a type of reasoning that consists of inferring the probable causes of an observed fact.
In other words, it means establishing the most probable cause of an observed fact, and then stating, as a hypothesis, that the fact in question probably results from that cause.
Data Science, Data-driven science

What is Research Data?
Data from observation or experimentation, or derived from existing sources, that are analyzed in order to produce or validate original research results.
Examples: digital data tables, text files, sound recordings, completed survey questionnaires, image or video databases, derived or compiled data.
“Data, or units of information, related to research activities, whether funded or not, are often organized or formatted in such a way that they can be communicated, interpreted and processed. Research Data are all the information you use as part of your research”, according to the University of Bristol

Why do we have to "manage" Research Data based on the Open Science paradigm?
“Data management should be woven into every course in science.”
Data's shameful neglect, Nature 461, 2009 (Editorial)
https://www.nature.com/articles/461145a
RDM benefits
Research Data Management:
• orchestrates data for efficient and reliable use
• increases the impact of research
• improves the visibility of research
• allows data to be shared securely
• makes it easy to find the data
• reduces the risk of data loss
• increases citation rates
• is a requirement of most funders and publishers
Data Management Facilitates Sharing and Re-use …

Manage?... but manage what?
• Primary / secondary
• Experimental, observational, simulation, derived, compiled, canonical
• Raw, processed, aggregated, enriched, annotated, formatted, standardized, published
• Structured / unstructured, homogeneous / heterogeneous
• Free / protected

The data life cycle
Integrate scientific data management into research activities
• Data creation: collection (experiments, measurements, observations, simulations), creation of metadata
• Data processing: enter, format, clean, organize, verify, validate, describe, store
• Data analysis: interpretation, visualization, formatting, publication
• Data preservation: migration, reformatting, back-up, permanent storage, metadata, documentation, certification
• Data dissemination: distribution, referencing, reporting, rights management, data journals
• Re-use: teaching, new research, evaluation
Curation of data

Research Data Management
Support - skills and professions
• IT Manager / System Administrator: «skilled partner» in data archiving and preservation
• Data Creator: people who produce digital data
• Data Manager: expert on the management, reporting, storage and dissemination of research data
• Data Scientist: data analysis
A wide variety of fields
Rapid developments - Continuing training required
New jobs require more and more IT skills
https://ec.europa.eu/research/openscience/pdf/os_skills_wgreport_final.pdf
The data life cycle
The scientific data life cycle is the set of stages of management, conservation, dissemination and reuse of scientific data related to research activities.
At each stage, services can be developed:
- development of a Data Management Plan (DMP)
- identification of metadata describing the data
- selection of repositories to store data
- data retention infrastructures
- data discovery and mining tools
- data reuse framework

Data Management Plan (DMP)
Main step of data management
Tool to be used as soon as projects are set up
A data management plan (DMP) is a formal document that outlines how data will be obtained, processed, organized, stored, secured, preserved and shared, both during a research project and after the project is completed.
The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation and analysis before the project begins; this ensures that data are well managed in the present and prepared for preservation in the future.
Optimization of Data Sharing and Interoperability of Research: https://dmp.opidor.fr/
https://www6.inra.fr/datapartage/

Data Management Plan (DMP)
Operational Details
• Responsibilities in the project: What does the project consist of? Who are the partners? What is the policy on data management? Who is responsible for the management of data?
• Resources: How is the management of data funded, especially in the long term?
• Data collection: What data will be produced or used during the course of the project (type, format, volume and growth...)? How will they be produced and processed?
• Data documentation: How will the data be identified and described? What metadata standards will be used? How will the metadata be generated?
• Data backup: How, where and by whom will the data be stored, backed up and secured?
• Data access and data sharing: Who will be able to access the data? Will the data be shared or published? With whom? How? For how long? Under which license?
• Intellectual property: Who will own the data produced? Will external data be used?
• Data archiving: What is the plan for long-term archiving and preservation?
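The DMP questions above lend themselves to a simple checklist that shows which sections still need an answer. The sketch below is only an illustration: the class and function names are hypothetical and do not correspond to any DMP tool or template mentioned in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DmpSection:
    """One section of a data management plan and its free-text answers."""
    title: str
    questions: list[str]
    answers: dict[str, str] = field(default_factory=dict)

    def unanswered(self) -> list[str]:
        """Return the questions that have no non-empty answer yet."""
        return [q for q in self.questions if not self.answers.get(q, "").strip()]


def open_questions(sections: list[DmpSection]) -> dict[str, list[str]]:
    """Map each section title to its unanswered questions (complete sections omitted)."""
    return {s.title: s.unanswered() for s in sections if s.unanswered()}


if __name__ == "__main__":
    backup = DmpSection(
        title="Data backup",
        questions=["Where will the data be stored?", "Who is responsible for backups?"],
    )
    # Hypothetical answer, for illustration only
    backup.answers["Where will the data be stored?"] = "Institutional storage"
    archiving = DmpSection(
        title="Data archiving",
        questions=["What is the plan for long-term archiving and preservation?"],
    )
    print(open_questions([backup, archiving]))
```

Keeping the plan in a structured form like this makes it easier to revisit it during the project rather than only at submission time.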

RDM based on Open Science: THE FAIR DATA PRINCIPLES
The FAIR Data Principles are a set of guiding principles to make data findable, accessible, interoperable and reusable (Wilkinson et al., 2016, Scientific Data - https://www.nature.com/articles/sdata201618).
https://www.force11.org/group/fairgroup/fairprinciples
• Findable: describe your data in a data repository; apply a persistent identifier
• Accessible: consider what will be shared; obtain participant consent
• Interoperable: use open formats; consistent vocabulary; common metadata standards
• Reusable: consider permitted use; apply appropriate license
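As a rough illustration of how the FAIR practices translate into a dataset description, the sketch below checks a metadata record for a few fields per FAIR facet. The field names and the grouping are assumptions made for this example. They are not an official FAIR assessment scheme, and the identifier shown is a made-up placeholder.

```python
# Assumed, illustrative mapping of FAIR facets to metadata fields (not a standard).
REQUIRED_FIELDS = {
    "findable": ["identifier", "title", "description"],
    "accessible": ["access_rights"],
    "interoperable": ["file_format", "metadata_standard"],
    "reusable": ["license"],
}


def missing_fair_fields(record: dict[str, str]) -> dict[str, list[str]]:
    """Return, per FAIR facet, the expected fields that are absent or empty."""
    gaps = {}
    for facet, fields in REQUIRED_FIELDS.items():
        absent = [name for name in fields if not record.get(name)]
        if absent:
            gaps[facet] = absent
    return gaps


if __name__ == "__main__":
    record = {
        "identifier": "doi:10.1234/placeholder",  # hypothetical persistent identifier
        "title": "1H-NMR quantification of fruit extracts",
        "description": "Raw spectra and quantification tables",
        "file_format": "CSV",
        "license": "",  # not chosen yet
    }
    print(missing_fair_fields(record))
```

For the record above, the check reports the missing access rights, metadata standard and license, which are the items still to be decided before deposit.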

THE FAIR DATA PRINCIPLES
A1.2 => As open as possible, as closed as necessary
5 ★ OPEN DATA
The FAIR principles are above all an approach to measure the maturity of your data in relation to Open Data
From Principles towards Implementations: The Internet of FAIR Data & Services
https://www.go-fair.org/

DMP model H2020 based on FAIR principles
Guidelines on FAIR Data Management in Horizon 2020
https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
1. Data Summary
2. FAIR data
2.1. Making data findable, including provisions for metadata
2.2. Making data openly accessible
2.3. Making data interoperable
2.4. Increase data re-use (through clarifying licences)
3. Allocation of resources
4. Data security
5. Ethical aspects
6. Other issues
7. Further support in developing your DMP

5 ★ OPEN DATA
Publish data "5 Gold stars"
Tim Berners-Lee, the inventor of the Web and initiator of Linked Data, suggested a 5-star deployment scheme for Open Data:
★ Data on the web, under an open license
★★ … in a structured format
★★★ … and in a non-proprietary format
★★★★ … identified by URIs
★★★★★ … and linked to other data
See also: K. Janowicz et al. (2014) Five Stars of Linked Data Vocabulary Use, Semantic Web 0 (2014) 1–0
https://geog.ucsb.edu/~jano/swj653.pdf
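The 5-star scheme is cumulative: each star presupposes the previous ones. A minimal sketch of that rule follows. The class and attribute names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Yes/no properties of a published dataset, one per star level."""
    on_web_open_license: bool
    structured: bool
    non_proprietary_format: bool
    uses_uris: bool
    linked_to_other_data: bool


def open_data_stars(ds: Dataset) -> int:
    """Count stars cumulatively: a level only counts if all previous levels hold."""
    levels = [
        ds.on_web_open_license,
        ds.structured,
        ds.non_proprietary_format,
        ds.uses_uris,
        ds.linked_to_other_data,
    ]
    stars = 0
    for satisfied in levels:
        if not satisfied:
            break
        stars += 1
    return stars


if __name__ == "__main__":
    # A spreadsheet in a proprietary format, published under an open license
    spreadsheet = Dataset(True, True, False, False, False)
    # The same table exported as CSV
    csv_export = Dataset(True, True, True, False, False)
    print(open_data_stars(spreadsheet), open_data_stars(csv_export))  # prints 2 3
```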

SERVICE DESCRIPTION
re3data is a global registry of research data repositories from a diverse range of academic disciplines. It provides information on repositories for the permanent storage and access of data sets to researchers, funding bodies, publishers and scholarly institutions.
Research Data Repositories are based on web applications to preserve, share, cite, search and analyse research data.
…
https://data.inra.fr/
Science Europe’s Framework for Discipline-specific Research Data Management
Recommended Data Repositories: https://www.nature.com/sdata/policies/repositories
https://fairsharing.org/databases/
https://data.inra.fr/
…
2,406 Data Repositories (Oct 10, 2019)
https://www.re3data.org/metrics
Not FAIR !!
FAIR ?
Reproducible Research
in the context of Open Science

Calls for Open Science & Reproducible Research
Typical examples of where problems can arise
• Some issues often arise when users jump straight into software implementations of methods (e.g. in R) that may lack documentation on the biases and assumptions mentioned in the original papers.
• A major cause of lack of repeatability (often not considered) is the wide sample-to-sample variability in the P value. Because the P value is fickle, the interpretation of analyses should not be based predominantly on this statistic.
Halsey et al. (2015) The fickle P value generates irreproducible results, Nature Methods 12, 179–185
• Overfitting a model is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. In regression analysis, overfitting can produce misleading R-squared values, regression coefficients, and p-values.
https://statisticsbyjim.com/regression/overfitting-regression-models/
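Overfitting can be shown with an extreme case. The sketch below is an illustration added to these notes, with hypothetical function names and simulated data. It fits a polynomial with as many coefficients as data points to pure noise. The fit is perfect on the data it was built from (R-squared of 1) but has no value on a second noise sample taken at the same x positions.

```python
import random


def interpolate(xs, ys, x):
    """Evaluate at x the Lagrange polynomial passing through all points (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def r_squared(observed, predicted):
    """Coefficient of determination of the predictions against the observations."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def overfit_demo(n_points=8, seed=0):
    """Return (R-squared on the fitted sample, R-squared on a fresh sample)."""
    rng = random.Random(seed)
    xs = [float(i) for i in range(n_points)]
    train = [rng.gauss(0.0, 1.0) for _ in xs]  # pure noise, unrelated to x
    fresh = [rng.gauss(0.0, 1.0) for _ in xs]  # second noise sample at the same x
    fitted = [interpolate(xs, train, x) for x in xs]
    return r_squared(train, fitted), r_squared(fresh, fitted)


if __name__ == "__main__":
    train_r2, fresh_r2 = overfit_demo()
    print(train_r2, fresh_r2)
```

The high R-squared on the fitted sample says nothing about any relationship between the variables, since none exists in the simulated data.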

Calls for Open Science & Reproducible Research
Other issues
• Loss of data and/or information:
  - Not regularly backing up your data is considered professional negligence
• Lack of knowledge, lack of technical skills, more or less risky practices:
  - Training is a right, but also a duty for anyone claiming to fully carry out a function / mission
• Miscellaneous:
  - Continuous evolution of software libraries & their dependencies
  - Problems related to numerical precision from one computer to another
  - Versioning
  - …
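The numerical accuracy issue listed above can be reproduced in a few lines. This sketch is an added illustration that uses only the Python standard library. Floating-point addition is not associative, so the order of operations (which may differ between libraries, compilers or machines) can change the last digits of a result.

```python
import math


def accumulate(values):
    """Plain left-to-right summation, which accumulates rounding errors.

    The built-in sum() is avoided here because its algorithm for floats
    differs between Python versions.
    """
    total = 0.0
    for value in values:
        total += value
    return total


if __name__ == "__main__":
    tenths = [0.1] * 10
    print(accumulate(tenths) == 1.0)   # False: rounding errors add up
    print(math.fsum(tenths) == 1.0)    # True: fsum tracks the lost digits

    left = (0.1 + 0.2) + 0.3
    right = 0.1 + (0.2 + 0.3)
    print(left == right)               # False: same numbers, different order
    print(math.isclose(left, right))   # True: compare with a tolerance instead
```

For reproducible analyses, results should therefore be compared with a tolerance rather than with exact equality.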

Calls for Open Science & Reproducible Research
What Science Requires
“Citations to unpublished data and personal communications cannot be used to support claims in a published paper”
“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”

Reproducible Research
Research is defined as reproducible when the published results can be replicated using the documented data, code, and methods employed by the author or provider, without the need for any additional information or for communication with the author or provider.
https://nnlm.gov/data/thesaurus/reproducible-research
Reproducible research:
• is not a guarantee of research quality, but a guarantee of transparency
• contributes to quality but does not replace it

Reproducible Research
Reproducibility has the potential to serve as a minimum standard for judging scientific claims when full independent replication of a study is not possible

Reproducible Research
Good practices
• Data Collection and Management:
  - Write an information collection protocol: this protocol should be part of the published article
  - Maintain a laboratory notebook
  - Collect data repeatedly AND reproducibly
• Research Compendium:
  - Facilitates reproducible research by bringing together in a single virtual "place" the data, code, protocols and documentation related to a research project
  - Contains the full computational environment used to produce the results in the paper (code, data, etc.), which can be used to reproduce the results and create new work based on the research
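One small piece of a research compendium is a record of the computational environment. The sketch below is one possible way to capture it with the Python standard library. The function name and the package list are assumptions made for this example, and a real project would typically also keep a lock file or a container image.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages: list[str]) -> dict:
    """Collect interpreter, operating system and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # The package names are only examples; list the ones your analysis imports.
    snapshot = environment_snapshot(["numpy", "pandas"])
    print(json.dumps(snapshot, indent=2))
```

Saving this output next to the results makes it possible to tell later which library versions produced them.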

Reproducible Research
Good practices
Manage what? What kind of data/information?
The minimal but mandatory set of files, from RAW DATA to final results, including:
• Standard Operating Procedures (SOP)
• Data reporting
[Diagram: Raw Data, Processed data; Checking, Validation, Tracing]

Reproducible Research
Good practices
Manage what? What kind of data/information?
The minimal but mandatory set of files
Example: 1H-NMR Analytical Technique
Example of full 1H-NMR data set: http://nmrprocflow.org/ex1
• The raw NMR spectra (ZIP file)
• An image of an annotated NMR spectrum
• The compound attribution zones
• The calibration file (calibration curves based on standard compounds)
• The Excel worksheet(s) used to calculate the quantification
• The final quantification results file
• Protocol documents that describe each step of the process (Quality Assurance):
  I. Analytical sample preparation
  II. Analytical processing
  III. Data processing
  IV. Quantification
Checking, Validation, Tracing

Reproducible Research
Good practices
• Backups:
  - Not regularly backing up your data is considered professional negligence
• Versions and Archives:
  - Safeguarding the successive stages of document development (texts, data, code, etc.) is one of the fundamental building blocks of reproducible research
  - Implementation of a version management strategy
  - Git + local or institutional forge (e.g. Forgemia), GitHub (e.g. github/INRA)
  - Research data repositories (re3data.org)
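Backups and archives are only useful if the copies can be verified. The sketch below shows one common approach, a checksum manifest, using only the Python standard library. The function names are hypothetical, and this is an added illustration rather than a tool referenced in the slides.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (relative POSIX path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose content no longer matches the manifest."""
    current = build_manifest(root)
    return sorted(
        name for name, checksum in manifest.items() if current.get(name) != checksum
    )
```

A typical use would be to build the manifest when the raw data are archived, store it with the data, and run `changed_files` on every backup copy to detect silent corruption or accidental edits.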

Reproducible Research
Good advice
• Data exploration:
  - Use tools that you know well or that allow you to gain in efficiency.
But
• Learn to program:
  - Limit the use of graphical interfaces (GUI) for subtle or repetitive tasks
  - Be able to express in a clear, documented and unambiguous way what you want the software to do
  - A program can often be expressed in just a few lines. The higher the level of the language used, the less there is to write.
  - Typical examples of reproducible research comprise compendia of data, code and text files, often organised around an R Markdown source document or a Jupyter notebook.
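To illustrate how little code a repetitive task can require, the sketch below computes a mean per group from a small table. The data are invented for the example. In practice the table would be a CSV file kept with the project, and the same few lines could be rerun unchanged whenever the data are updated.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical measurements, embedded as text so the example is self-contained.
RAW = """sample,group,value
s1,control,1.0
s2,control,3.0
s3,treated,4.0
s4,treated,6.0
"""


def group_means(csv_text: str) -> dict[str, float]:
    """Return the mean of the 'value' column for each 'group'."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["group"]].append(float(row["value"]))
    return {name: mean(values) for name, values in groups.items()}


if __name__ == "__main__":
    print(group_means(RAW))  # prints {'control': 2.0, 'treated': 5.0}
```

Unlike a sequence of clicks in a graphical interface, the script documents exactly what was done and can be versioned alongside the data.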

ODAM Framework: Open Data for Access and Mining
Example of a Data Management System in the context of Open Science
http://pmb-bordeaux.fr/dataexplorer/
http://pmb-bordeaux.fr/odam/FAIR_and_DataLife_DJ_Oct2019.pdf
https://nbviewer.jupyter.org/github/djacob65/binder_odam/blob/master/PyODAM_api_PCA.ipynb

Some useful links related to Open Science / Data Management
• Research Data - Digital Learning: https://doranum.fr/
• CoopIST – Cooperate in Scientific and Technical Information: https://coop-ist.cirad.fr/gerer-des-donnees
• INRA services and resources: https://www6.inra.fr/datapartage
• The future of science is Open: https://www.fosteropenscience.eu/
• Building the social and technical bridges to enable open sharing and re-use of data: https://www.rd-alliance.org/ (23 Things: Libraries for Research Data)

Books online related to Reproducible Research
• Vers une recherche reproductible : Faire évoluer ses pratiques (Towards reproducible research: changing one's practices): https://hal.archives-ouvertes.fr/hal-02144142v1
• Reproducible Research with R and RStudio, Second Edition: https://englianhu.files.wordpress.com/2016/01/reproducible-research-with-r-and-studio-2nd-edition.pdf
• Reproducibility and Replicability in Science: https://www.nap.edu/catalog/25303/reproducibility-and-replicability-in-science

Research Data Management

  • 1.
    1 Research Data Management OpenScience Daniel Jacob INRA UMR 1332 BFP – Metabolism Group Bordeaux Metabolomics Facility
  • 2.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 2 • Links between Research Data and Open Science • How the management and preservation of Research Data can facilitate the work of researchers • How to address concerns about Data Sharing • The research Data life cycle At the end of the course you should understand...
  • 3.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 3 The Reproducibility Crisis In recent years, evidence has emerged from disciplines ranging from biology to economics that many scientific studies are not reproducible. This evidence has led to declarations in both the scientific and lay press that science is experiencing a “reproducibility crisis” and that this crisis has significant impacts on both science and society, including misdirected effort, funding, and policy implemented on the basis of irreproducible research. Franklin Sayre, Amy Riegelman (2018) C&RL 79(1) https://doi.org/10.5860/crl.79.1.2
  • 4.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 4 This phenomenon appears, for example, in medicine, more precisely in epidemiology, where, based on a large number of data (weight, age of the first cigarette, etc.) and a large number of possible outcomes (breast cancer, lung cancer, car accident, etc.), hazardous associations are made (a posteriori) and statistically "validated". p-hacking p-hacking (also data dredging data fishing, data snooping, … ) is the misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect. This is done by performing many statistical tests on the data and only paying attention to those that come back with significant results, instead of stating a single hypothesis about an underlying effect before the analysis and then conducting a single test for it https://en.wikipedia.org/wiki/Data_dredging
  • 5.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 5 Cholesterol and Controversy: Past, present and Future By Jeanne Garbarino on November 15, 2011 Scientific American - Blog https://blogs.scientificamerican.com/guest-blog/cholesterol- confusion-and-why-we-should-rethink-our-approach-to-statin- therapy/ Cholesterol controversy The French paradox: lessons for other countries Heart. 2004 Jan; 90(1): 107–111. doi: 10.1136/heart.90.1.107 Jean Ferrières Plot of death rate from coronary heart disease (1977) correlated with daily dietary intake (from 1976 to 1978) of cholesterol and saturated fat as expressed by the cholesterol fat index (CSI) per 1000 kcal Correlation does not mean causal relationship !
  • 6.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 6 Open Science
  • 7.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 DATA Studies Research Project During a research project Know-how knowledge Input Output 7
  • 8.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 What do they become? • Nothing ! They rest on a disk space (up to its death!) Among the possible scenarios, two of them are extreme • Creation of a comprehensive database managing all data and metadata in its entirety, associated with a visualization and querying interface. Expected objectives After the project is completed DATA Studies 8 Research Project
  • 9.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 Expected objectives Scientific Data Repositories Enrichment Expected links DATA Studies Publishing policies … 9 https://ec.europa.eu/research/participants/docs/h2020-funding-guide/cross-cutting-issues/open-access-dissemination_en.htm Research Project
  • 10.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 NATIONAL PLAN FOR OPEN SCIENCE Open science is the practice of making research publications and data freely available (transparency) Open science seeks to create an ecosystem in which scientific research is more cumulative (interdisciplinary) Open science makes knowledge accessible to all (civic aspect) Open science also drives scientific progress (reactivity) Finally, open science fosters scientific integrity and people’s trust in science (ethics) http://cache.media.enseignementsup-recherche.gouv.fr/file/Recherche/50/1/SO_A4_2018_EN_01_leger_982501.pdf announced by Frédérique Vidal on 4 July 2018 makes open access mandatory for publications and project-funded research data. 10
  • 11.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 11
  • 12.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 12 Interdisciplinary Data Science Scientific Field IT Skills Data Management Data InterpretationData Analysis Open Science is a new research paradigm facing many challenges, mainly :  Requirement of many skills  the ingrained research habits Statistics Software Data
  • 13.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 Science today - context Knowledge creation  Experimental science  Theoretical science  Data-intensive science / Data-driven science Requires three skills:  Scientific field  Information management  Data processing Research Paradigms What are the consequences on the data? Publications + Data Not only induction, deduction but above all abduction >> data science New Paradigm 13
  • 14.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 14 Abduction Abduction is a type of reasoning consisting in inferring probable causes to an observed fact. In other words, it is a question of establishing a most probable cause of a fact found … … and stating, as a hypothesis, that the fact in question probably results from that cause. Data Science Data-driven science
  • 15.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 Data from observation, experimentation or derived from existing sources that are analyzed in order to produce or validate research results original What is the Research Data ? Digital Data Tables, Text Files, Sound Recordings, Completed Survey Questionnaires, Image or Video Database, Derived data or compiled “Data, or units of information, related to research activities, whether funded or not, are often organized or formatted in such a way that they can be communicated, interpreted and processed. Research Data are all the information you use as part of your research “ according to the University of Bristol 15
  • 16.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 16 “Data management should be woven into every course in science.” Data's shameful neglect Nature 461, 2009 (Editorial)  orchestrates data for efficient and reliable use  increases the impact of research,  improves the visibility of research  allows data to be shared securely  makes it easy to find the data  reduces the risk of data loss  increases citation rates  requirement of most funders and publishers RDM benefits Data Management Facilitates Sharing and Re-use … Why do we have to "manage" the Research Data based on the Open Science paradigm ? https://www.nature.com/articles/461145a
  • 17.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 • Primary/secondary • Experimental, observational, simulation, derived, compiled, canonical • Raw, processed, aggregated, enriched, annotated, formatted, standardized, processed, published • Structured/unstructured, homogenous/heterogeneous • Free / protected Manage?... but manage what? 17
  • 18.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 18 Data Creation Data processing Data Analysis Data preservation Data dissemination Re-Use Data Collection: experiments, measurements, observations, simulations Creation of metadata Enter, format, clean, organize, verify, validate, describe, store Interpretation, visualization, formatting, publication Migration, reformatting, back-up, permanent storage, Metadata, documentation, certification Distribution, referencing, Reporting, rights management Data journals Teaching, new research, evaluation Curation of data The data life cycle Integrate scientific data management into research activities
  • 19.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 IT Manager / System Administrator «skilled partner» in data archiving and preservation Data Creator people who produce digital data Data Manager expert on the management, reporting, storage and dissemination of research data Data Scientist data analysis A wide variety of fields Rapid developments - Continuing training required New jobs require more and more IT skills Research Data Management Support - skills and professions The data life cycle at each stage, services can be developed: - development of Data Management Plan (DMP) - identification of metadata describing the data - selection of warehouses to store data - data retention infrastructures - data discovery and mining tools - data reuse framework The scientific data life cycle is the set of stages of management, conservation, dissemination and reuse of scientific data related to research activities. 19 https://ec.europa.eu/research/openscience/pdf/os_skills_wgreport_final.pdf
  • 20.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 https://www6.inra.fr/datapartage/ A data management plan or DMP is a formal document that outlines how data will be obtained, processed, organized, stored, secured, preserved, shared both during a research project, and after the project is completed. The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation, and analysis before the project begins this ensures that data are well-managed in the present, and prepared for preservation in the future. Optimization of Data Sharing and Interoperability of Research https://dmp.opidor.fr/ Main step of data management Tool to be used as soon as projects are set up Data Management Plan (DMP) 20
  • 21.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 21 Operational DetailsData Management Plan (DMP)
  • 22.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 22 How does the management of data is it funded, especially in the long term? Resources What does the project consist of? Who are the partners? What policy on data management? Who is responsible for the management of data? Responsibilities in the project What data will be produced/used during the course of the project (type, format, volume and increase...) ? How will they be produced? processed? Data collection How, where, where, by whom, will be stored, backed up and secured the data? Data backup Data Management Plan (DMP) Who will be able to access the data? The data will they be shared? published? With whom? How? How long does it take? Under which license? Data Access and Data sharing Who will own it? of the data produced External data will they be used? Intellectual Property What is the plan for long-term archiving and preservation? Data Archiving How will the data be identified, described? What metadata standards will be used? How will the metadata be generated? Data Documentation
  • 23.
    Daniel Jacob –INRA UMR 1332 BFP – Oct 2019 Findable Accessible Interoperable Reusable Describe your data in a data repository Apply a persistent identifier Consider what will be shared Obtain participant consent Use open formats Consistent vocabulary Common metadata standards Consider permitted use Apply appropriate license 23 The FAIR Data Principles are a set of guiding principles to make data accessible, interoperable and reusable (Wilkinson et al.,2016 Scientific Data - https://www.nature.com/articles/sdata201618). https://www.force11.org/group/fairgroup/fairprinciples RDM based on the Open Science : THE FAIR DATA PRINCIPLES
  • 24.
    THE FAIR DATA PRINCIPLES: A1.2 => As open as possible, as closed as necessary
  • 25.
    THE FAIR DATA PRINCIPLES: 5 ★ OPEN DATA
  • 26.
    THE FAIR DATA PRINCIPLES
    FAIR is above all an approach for measuring the maturity of your data in relation to Open Data.
    From Principles towards Implementations: The Internet of FAIR Data & Services https://www.go-fair.org/
  • 27.
    DMP model H2020 based on FAIR principles
    Guidelines on FAIR Data Management in Horizon 2020: https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
    1. Data Summary
    2. FAIR data
       2.1. Making data findable, including provisions for metadata
       2.2. Making data openly accessible
       2.3. Making data interoperable
       2.4. Increase data re-use (through clarifying licences)
    3. Allocation of resources
    4. Data security
    5. Ethical aspects
    6. Other issues
    7. Further support in developing your DMP
  • 28.
    5 ★ OPEN DATA: Publish data "5 Gold stars"
    Tim Berners-Lee, the inventor of the Web and Linked Data initiator, suggested a 5-star deployment scheme for Open Data:
    ★ Data on the web, under an open license
    ★★ … in a structured format
    ★★★ … and in a non-proprietary format
    ★★★★ … identified by URIs
    ★★★★★ … and related to other data
    See also: K. Janowicz et al. (2014) Five Stars of Linked Data Vocabulary Use, Semantic Web 0 (2014) 1–0, https://geog.ucsb.edu/~jano/swj653.pdf
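The cumulative nature of the 5-star scheme can be sketched in a few lines of Python. This is only an illustration: the `Dataset` class, its field names and the `open_data_stars` function are hypothetical and do not come from the slides or from any official tool. The one rule it encodes is that a star only counts if all previous levels are also met.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset (illustrative fields)."""
    on_web_open_license: bool
    structured: bool
    non_proprietary: bool
    uses_uris: bool
    linked_to_other_data: bool


def open_data_stars(ds: Dataset) -> int:
    """Count stars cumulatively: a level counts only if all earlier ones hold."""
    criteria = [
        ds.on_web_open_license,
        ds.structured,
        ds.non_proprietary,
        ds.uses_uris,
        ds.linked_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# An Excel file under an open license: structured, but a proprietary format.
excel_sheet = Dataset(True, True, False, False, False)
# A CSV file under an open license: structured and non-proprietary.
csv_file = Dataset(True, True, True, False, False)

print(open_data_stars(excel_sheet))  # 2
print(open_data_stars(csv_file))     # 3
```

Such a score is a maturity indicator, not a quality judgment: a dataset with URIs but no open license still scores zero under this reading.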
  • 29.
    Research data repositories
    re3data service description: re3data is a global registry of research data repositories from a diverse range of academic disciplines. It provides information on repositories for the permanent storage and access of data sets to researchers, funding bodies, publishers and scholarly institutions. Research Data Repositories are based on web applications to preserve, share, cite, search and analyse research data. …
    https://data.inra.fr/
    Science Europe’s Framework for Discipline-specific Research Data Management
    Recommended Data Repositories: https://www.nature.com/sdata/policies/repositories
    https://fairsharing.org/databases/
  • 30.
    https://data.inra.fr/
  • 31.
    … 2,406 Data Repositories (Oct 10, 2019) https://www.re3data.org/metrics Not FAIR !! FAIR ?
  • 32.
    Reproducible Research in the context of Open Science
  • 33.
    Calls for Open Science & Reproducible Research: Typical examples of where problems can arise
    - Issues often arise when users jump straight into software implementations of methods (e.g. in R) that may lack the documentation on biases and assumptions given in the original papers.
    - A major and often overlooked cause of lack of repeatability is the wide sample-to-sample variability of the P value. Because the P value is fickle, the interpretation of analyses should not be based predominantly on this statistic. Halsey et al. (2015) The fickle P value generates irreproducible results, Nature Methods 12, 179–185
    - Overfitting a model is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. In regression analysis, overfitting can produce misleading R-squared values, regression coefficients, and p-values. https://statisticsbyjim.com/regression/overfitting-regression-models/
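The sample-to-sample variability of the P value can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not the analysis from Halsey et al.: the same experiment (two groups of 30, a true effect of 0.5 standard deviations) is repeated many times, and the p-value is computed with a normal approximation to the Welch statistic rather than an exact t distribution. All function names and parameter values are chosen for this example only.

```python
import math
import random
import statistics


def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means.

    Uses a normal approximation to the Welch statistic, which is a
    simplification; a t distribution would be more exact for small samples.
    """
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_p_values(n_experiments=1000, n=30, effect=0.5, seed=1):
    """Repeat the same experiment with fresh random samples each time."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treated = [rng.gauss(effect, 1.0) for _ in range(n)]
        p_values.append(two_sample_p_value(control, treated))
    return p_values


p_values = simulate_p_values()
significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"smallest p: {min(p_values):.2g}, largest p: {max(p_values):.2g}")
print(f"fraction of experiments with p < 0.05: {significant:.2f}")
```

Although the underlying effect never changes, the p-values from identical experiments spread over several orders of magnitude, which is why a single p-value is a weak basis for a conclusion.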
  • 34.
    Calls for Open Science & Reproducible Research: Other issues
    - Loss of data and/or information:
      - Not regularly backing up your data is considered professional negligence
    - Lack of knowledge, lack of technical skills, more or less hazardous practices:
      - Training is a right, but also a duty for anyone who claims to fully assume a function or mission
    - Miscellaneous:
      - Continuous evolution of software libraries and their dependencies
      - Problems related to numerical accuracy from one computer to another
      - Versioning
      - …
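One source of the accuracy problems listed above is that floating-point addition is not associative, so the order in which numbers are accumulated (which can differ between machines, libraries or parallel runs) changes the result. The following minimal sketch uses made-up values and a hand-written loop; the built-in `sum` is avoided on purpose because some Python versions apply compensated summation internally.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same four numbers, accumulated in two different orders.
order_a = [1e16, 1.0, -1e16, 1.0]
order_b = [1e16, -1e16, 1.0, 1.0]

print(naive_sum(order_a))   # 1.0  (the first 1.0 is absorbed by 1e16 and lost)
print(naive_sum(order_b))   # 2.0
print(math.fsum(order_a))   # 2.0  (fsum tracks exact partial sums)
print(0.1 + 0.2 == 0.3)     # False
```

For reproducible numerical results it therefore helps to fix the order of operations, or to use an order-independent routine such as `math.fsum` where accuracy matters.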
  • 35.
    Calls for Open Science & Reproducible Research: What Science Requires
    “Citations to unpublished data and personal communications cannot be used to support claims in a published paper.”
    “All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of science.”
  • 36.
    Reproducible Research
    Research is defined as reproducible when the published results can be replicated using the documented data, code, and methods employed by the author or provider, without the need for any additional information or for communication with the author or provider. https://nnlm.gov/data/thesaurus/reproducible-research
    Reproducible research is not a guarantee of research quality, but a guarantee of transparency. It contributes to quality but does not replace it.
  • 37.
    Reproducible Research
    Reproducibility has the potential to serve as a minimum standard for judging scientific claims when full independent replication of a study is not possible.
  • 38.
    Reproducible Research: Good practices
    - Data Collection and Management:
      - Write an information collection protocol: this protocol should be part of the published article
      - Maintain a laboratory notebook
      - Collect data repeatedly AND reproducibly
    - Research Compendium:
      - Facilitates reproducible research by bringing together in a single virtual "place" the data, code, protocols and documentation related to a research project
      - Contains the full computational environment used to produce the results in the paper (code, data, etc.), which can be used to reproduce the results and create new work based on the research
  • 39.
    Reproducible Research: Good practices
    Manage what? What kind of data/information? The minimal but mandatory set of files, from raw data to final results (raw data, processed data), including:
    - Standard Operating Procedures (SOP)
    - Data reporting
    Checking, Validation, Tracing
  • 40.
    Reproducible Research: Good practices
    Manage what? What kind of data/information? The minimal but mandatory set of files (Checking, Validation, Tracing)
    Example: 1H-NMR Analytical Technique
    - The final quantification results file
    - The calibration file (calibration curves based on standard compounds)
    - The Excel worksheet(s) used to calculate the quantification
    - The compound attribution zones
    - An image of an annotated NMR spectrum
    - Protocol documents that describe each step of the process (Quality Assurance): I. Analytical sample preparation, II. Analytical processing, III. Data processing, IV. Quantification
    - The raw NMR spectra (ZIP file)
    Example of a full 1H-NMR data set: http://nmrprocflow.org/ex1
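A "minimal but mandatory set of files" lends itself to an automated completeness check before a data set is archived or shared. The sketch below is one possible way to do it; the file names in `REQUIRED_FILES` are hypothetical placeholders for the items listed above, and a real project would take its own list from its protocol documents.

```python
import tempfile
from pathlib import Path

# Hypothetical file names, for illustration only.
REQUIRED_FILES = [
    "final_quantification.csv",
    "calibration.xlsx",
    "quantification_worksheet.xlsx",
    "attribution_zones.txt",
    "annotated_spectrum.png",
    "protocols.pdf",
    "raw_spectra.zip",
]


def missing_files(dataset_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in required if not (root / name).is_file()]


# Demo on a temporary directory that contains only the raw spectra.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "raw_spectra.zip").write_bytes(b"")
    for name in missing_files(tmp):
        print("missing:", name)
```

An empty result from `missing_files` means the set is complete in the sense of this list; it says nothing about whether the contents of the files are valid.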
  • 41.
    Reproducible Research: Good practices
    - Backups:
      - Not regularly backing up your data is considered professional negligence
    - Versions and Archives:
      - Safeguarding the successive stages of document development (texts, data, code, etc.) is one of the fundamental building blocks of reproducible research
      - Implement a version management strategy
      - Git + a local or institutional forge (e.g. Forgemia), GitHub (e.g. github/INRA)
      - Research data repositories (re3data.org)
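A backup is only useful if it actually matches the original. One common way to verify this, sketched below as an assumption rather than a practice stated in the slides, is to compute a checksum for every file and compare the resulting manifests. The function names and the demo files are invented for this example; only standard-library calls are used.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its checksum."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(original, backup):
    """Names that are missing from the backup or whose checksum differs."""
    return sorted(name for name, digest in original.items()
                  if backup.get(name) != digest)


# Demo: one file is backed up correctly, one has not been backed up yet.
with tempfile.TemporaryDirectory() as tmp:
    data, backup = Path(tmp, "data"), Path(tmp, "backup")
    data.mkdir()
    backup.mkdir()
    (data / "results.csv").write_text("a,b\n1,2\n")
    (backup / "results.csv").write_text("a,b\n1,2\n")
    (data / "notes.txt").write_text("run 1\n")
    print(changed_files(build_manifest(data), build_manifest(backup)))
    # ['notes.txt']
```

Storing the manifest alongside the data also gives a simple trace of what a given version of the data set contained.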
  • 42.
    Reproducible Research: Good advice
    - Data exploration:
      - Use tools that you know well or that make you more efficient, but also learn to program
    - Learn to program:
      - Limit the use of graphical interfaces (GUI) for subtle or repetitive tasks
      - Be able to express in a clear, documented and unambiguous way what you want the software to do
      - A program can often be expressed in just a few lines. The higher the level of the language used, the less there is to write.
    - Typical examples of reproducible research comprise compendia of data, code and text files, often organised around an R Markdown source document or a Jupyter notebook.
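As an illustration of how little code a repetitive task can require in a high-level language, the sketch below computes a per-group mean from a small table, a task often done by hand in a spreadsheet. The data, column names and function name are made up for this example; the table is inlined so the script is self-contained.

```python
import csv
import io
import statistics
from collections import defaultdict

# Made-up measurements, inlined so the example runs on its own.
RAW = """sample,group,concentration
s1,control,1.2
s2,control,1.4
s3,treated,2.1
s4,treated,2.5
"""


def mean_by_group(csv_text):
    """Return the mean concentration for each group in the CSV text."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["group"]].append(float(row["concentration"]))
    return {name: statistics.mean(values) for name, values in groups.items()}


for name, mean in mean_by_group(RAW).items():
    print(f"{name}: {mean:.2f}")
```

Unlike a sequence of clicks in a graphical interface, the script states unambiguously what was done and can be rerun unchanged when new samples are added.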
  • 43.
    ODAM Framework: Open Data for Access and Mining
    Example of a Data Management System in the context of Open Science
    http://pmb-bordeaux.fr/dataexplorer/
    http://pmb-bordeaux.fr/odam/FAIR_and_DataLife_DJ_Oct2019.pdf
    https://nbviewer.jupyter.org/github/djacob65/binder_odam/blob/master/PyODAM_api_PCA.ipynb
  • 44.
    Some useful links related to Open Science / Data Management
    - Research Data - Digital Learning: https://doranum.fr/
    - CoopIST, Cooperate in Scientific and Technical Information: https://coop-ist.cirad.fr/gerer-des-donnees
    - INRA services and resources: https://www6.inra.fr/datapartage
    - The future of science is Open: https://www.fosteropenscience.eu/
    - Building the social and technical bridges to enable open sharing and re-use of data: https://www.rd-alliance.org/
    - 23 Things: Libraries for Research Data
  • 45.
    Books online related to Reproducible Research
    - Vers une recherche reproductible : Faire évoluer ses pratiques (Towards reproducible research: evolving one's practices): https://hal.archives-ouvertes.fr/hal-02144142v1
    - Reproducible Research with R and RStudio, Second Edition: https://englianhu.files.wordpress.com/2016/01/reproducible-research-with-r-and-studio-2nd-edition.pdf
    - Reproducibility and Replicability in Science: https://www.nap.edu/catalog/25303/reproducibility-and-replicability-in-science