Make your data great now

Daniel Jacob – INRA - 2018
How to best manage your data
to make the most of it for your research
Give open access to your data
and make it ready to be mined
Open Data for Access and Mining
ODAM Framework
Daniel Jacob
INRA UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility
Oct 2018

DATA Studies
Project
During a research project
Know-how Knowledge
Input Output

After the project is completed
What becomes of the data?
Among the possible scenarios, two are extreme:
• Nothing! The data sit on a disk (until it dies!).
• A comprehensive database is created that manages all the data and metadata in their entirety, together with a visualization and querying interface.
Expected objectives
DATA Studies
Project

Expected objectives
Scientific Data Repositories
Enrichment
Expected links
DATA Studies
Project
Publishing policies
http://www.omicsdi.org/ …

Raw data → Processed data → Analyzed data → Published data
Processed data are raw data that have been processed to highlight certain features, i.e. certain types of variables linked to the focus of the study.
Raw data
Information
(often partial)
Open access
Specialized Data Repositories
Data flow Concerned data
Experiment
Data Tables
Know-how Knowledge
Annotation, Curation, Validation: partly not automatically reproducible, because they require human expertise

PUBLISH DATA "5 STARS" / THE FAIR DATA PRINCIPLES
Open Data: accessible data, including documents that programs cannot understand or exploit
→ Metadata
Open API (Application Programming Interface): data that can be queried programmatically, according to the scheme imposed by the API

Data capture Using data
As simple as possible, and in a normalized way
As far as possible, the most appropriate choice seems to be to keep the old way of using the scientist's worksheets, while allowing other, more efficient approaches throughout the data flow
Producers
are scientists
Consumers
are scientists

Before the project begins
Future
Data flow
DATA
A data management plan (DMP) is a formal document that outlines how data are to be handled both during a research project and after the project is completed.
The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation and analysis before the project begins; this ensures that data are well managed in the present and prepared for preservation in the future.
→ Description by metadata (ontology / controlled vocabulary)
→ Data capture, data formatting, data linking
→ Implying data archaeology after several months / years
https://dmp.opidor.fr/
https://www6.inra.fr/datapartage/
Data management
Publishing policies

Know-how
Knowledge
Project
Data flow
DATA
Data / Metadata
Data Mining Modeling
Find out
“biomarkers”
Explain data
Data/Metadata Exploration
Data Visualization
Data mining / Modeling
Data exploration → Descriptive statistics
A first glimpse of the data that can reveal trends. It allows the data to be well characterized, which is necessary before choosing how to analyze them.
→ Repetition of multiple scenarios on different subsets of data
→ Selection of subsets of data
→ Implying a lot of data manipulation: data capture, data formatting, data linking, data import / data export
→ Linking both metadata and data for data mining
Data processing
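As a minimal sketch of what "repeating a scenario on different subsets" can look like in practice, the Python snippet below computes the same descriptive statistics for each level of a factor. The column names are borrowed from the use case in this deck; the table contents, the "Stress" level and the helper name describe_by_factor are invented for illustration.

```python
import csv
import io
import statistics

# Hypothetical data table: column names follow the use case in this deck,
# the values (and the "Stress" level) are invented for illustration only.
DATA_TSV = (
    "SampleID\tTreatment\tFruitDiameter\n"
    "S1\tControl\t10.0\n"
    "S2\tControl\t12.0\n"
    "S3\tStress\t8.0\n"
    "S4\tStress\t9.0\n"
)


def describe_by_factor(tsv_text, factor, variable):
    """Return {factor level: (n, mean, stdev)} for one quantitative variable."""
    groups = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        groups.setdefault(row[factor], []).append(float(row[variable]))
    return {
        level: (len(values), statistics.mean(values), statistics.stdev(values))
        for level, values in groups.items()
    }


# The same "scenario" is applied to every subset defined by the factor.
SUMMARY = describe_by_factor(DATA_TSV, "Treatment", "FruitDiameter")
for level, (n, mean, sd) in SUMMARY.items():
    print(f"{level}: n={n} mean={mean:.2f} sd={sd:.2f}")
```

Swapping the factor or the variable name reruns the same scenario on another slice of the table, which is the kind of repetition that becomes tedious when it is done by hand in worksheets.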

Before the project begins
Future
Know-how
Knowledge
Project
Data flow
Data management: time is clearly explicit
Data processing: time is often implicit
Motivations:
→ Reduce data manipulation
→ Data sharing & data availability
→ Facilitate the subsequent data mining
→ Facilitate the data dissemination
How? Make the two axes consistent
DATA
The "data life" must therefore be integrated
into the scientific research process

seeding → harvesting → sample preparation → sample analysis
Sample
identifiers
Experiment
Data Tables
Experiment Design
Make both metadata and data
available for data mining
Several operators, techniques, data types, SOPs, …
Several partners
Each time we plan to share data coming from a common experimental design, the classic challenges for every partner to use the data quickly are data storage and data access.
Use-Case “Metabolism”
Research question → Project → Experiment → Experimental set-up
Plant Metabolism
• Systems Biology
• Biomarkers
associated with plant
performance

Whatever the kind of experiment, it assumes a design of experiment (DoE) involving individuals, samples or other entities as the main objects of study (e.g. plants, tissues, bacteria, …).
It also assumes the observation of dependent variables resulting from the effects of some controlled experimental factors.
Moreover, each object of study usually has an identifier, and the variables can be quantitative or qualitative.
Promote good practices
Data capture
The experimental context: needs / wishes
Use-Case “Metabolism”
Experiment Design (DoE): seeding → harvesting → sample preparation → sample analysis
Sample identifiers
samples: sample features
Data: identifier, factors, quantitative variables, qualitative variables
Promote non-proprietary formats like CSV or TSV

Metadata: description of the different columns within data files

Shortname      Description                                   Unit                       Category
SampleID       Pool of several harvests                      -                          Identifier
Treatment      Treatment applied on plants                   -                          Factor
DevStage       Fruit development stage                       -                          Factor
FruitAge       Fruit age                                     Days post-anthesis (dpa)   Factor
FruitDiameter  Fruit diameter                                mm                         Variable
FruitHeight    Fruit height                                  mm                         Variable
FruitFW        Fruit fresh weight                            g                          Variable
Rank           Row of the individual plant on the table      -                          Feature
Truss          Position of the truss on the stem             -                          Feature
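One practical use of such a column description is to check that every column of a data file is actually documented. The Python sketch below illustrates the idea under stated assumptions: the header names (Shortname, Description, Unit, Category), the abridged rows and the undocumented "FruitColor" column are hypothetical and do not come from an actual ODAM file.

```python
import csv
import io

# Hypothetical column-description file, abridged from the table on this slide.
# The header names are assumptions made for this example.
DESCRIPTION_TSV = (
    "Shortname\tDescription\tUnit\tCategory\n"
    "SampleID\tPool of several harvests\t\tIdentifier\n"
    "Treatment\tTreatment applied on plants\t\tFactor\n"
    "FruitDiameter\tFruit diameter\tmm\tVariable\n"
)

# Hypothetical header of a data file; "FruitColor" is deliberately undocumented.
DATA_HEADER = ["SampleID", "Treatment", "FruitDiameter", "FruitColor"]


def undocumented_columns(description_tsv, data_header):
    """Return the data columns that have no entry in the description file."""
    reader = csv.DictReader(io.StringIO(description_tsv), delimiter="\t")
    described = {row["Shortname"] for row in reader}
    return [name for name in data_header if name not in described]


MISSING = undocumented_columns(DESCRIPTION_TSV, DATA_HEADER)
print(MISSING)
```

Running such a check when files are dropped into the storage space is one way to catch missing descriptions early, before anyone has to guess what a column means.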

The "core idea" (see Good Practices)
Data capture: minimal effort (PUT)
Experiment Data Tables → drag & drop → data storage
Merely drop the data files into a data storage (e.g. a local NAS or a remote storage space).
No database schema, no programming code and no additional configuration on the server side.
Using data: maximum efficiency (GET)
Data center (mount) → http://myhost.org/
Data can be downloaded, explored and mined (data analysis / mining).
As simple as possible, and in a normalized way

Implementation: Experiment Data Tables Management System (EDTMS)
Data capture: minimal effort (PUT)
Experiment Data Tables + 2 metadata files → drag & drop → data storage
Merely drop the data files into a data storage (e.g. a local NAS or a remote storage space).
No database schema, no programming code and no additional configuration on the server side.
Using data: maximum efficiency (GET)
Data center (mount) → Web API → http://myhost.org/
Data can be downloaded, explored and mined (data analysis / mining).
F A INTEROPERABLE R

EDTMS: metadata files
In order to allow data to be explored and mined, we have to add some minimal but relevant metadata:
s_subsets.tsv: this metadata file associates a key concept with each data subset file.
There must be a relationship between object types, which we assume to be of the "obtainedFrom" type.
Linking two tables together requires a common attribute, i.e. an identifier in most cases.
Optional (*)
(*) in fact, rather deferred
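The role of the common identifier can be illustrated with a small Python sketch that attaches to each row of a child table the columns of the parent row it was obtained from. The two tables, their column names and values, and the helper names read_tsv and join_on are hypothetical; only the idea of linking tables through a shared identifier comes from the slide.

```python
import csv
import io

# Hypothetical subsets: samples are "obtainedFrom" plants and carry the
# plant identifier as the common attribute. All values are invented.
PLANTS_TSV = "PlantID\tTreatment\nP1\tControl\nP2\tStress\n"
SAMPLES_TSV = (
    "SampleID\tPlantID\tFruitFW\n"
    "S1\tP1\t3.2\n"
    "S2\tP1\t3.5\n"
    "S3\tP2\t2.1\n"
)


def read_tsv(text):
    """Parse TSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_on(child_rows, parent_rows, key):
    """Attach to each child row the columns of the parent row sharing `key`."""
    parents = {row[key]: row for row in parent_rows}
    return [{**parents[row[key]], **row} for row in child_rows]


MERGED = join_on(read_tsv(SAMPLES_TSV), read_tsv(PLANTS_TSV), "PlantID")
for row in MERGED:
    print(row)
```

Each merged row now carries both the sample measurements and the treatment of the plant it came from, which is what makes the merged table directly usable for data mining.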

EDTMS: metadata files
In order to allow data to be explored and mined, we have to add some minimal but relevant metadata:
a_attributes.tsv: this metadata file allows each attribute (variable) to be annotated with some minimal but relevant metadata.
Categories: identifier, factor, quantitative, qualitative
Subsets: Plants, Harvests, Samples, Compounds, …
Good Practices: description of the different columns within data files
Optional (*)
(*) in fact, rather deferred
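Once each attribute carries a category, a tool can pick factors and quantitative variables without any manual column selection. The Python sketch below groups attributes by category; the two-column layout ("attribute", "category") is an assumption made for this example and is not the actual layout of a_attributes.tsv.

```python
import csv
import io

# Hypothetical attribute annotations. Attribute names come from the use case
# in this deck; the file layout is an assumption made for illustration.
ATTRIBUTES_TSV = (
    "attribute\tcategory\n"
    "SampleID\tidentifier\n"
    "Treatment\tfactor\n"
    "DevStage\tfactor\n"
    "FruitDiameter\tquantitative\n"
    "FruitFW\tquantitative\n"
)


def attributes_by_category(tsv_text):
    """Group attribute names by their annotated category, keeping file order."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        grouped.setdefault(row["category"], []).append(row["attribute"])
    return grouped


BY_CATEGORY = attributes_by_category(ATTRIBUTES_TSV)
print("factors:", BY_CATEGORY["factor"])
print("quantitative variables:", BY_CATEGORY["quantitative"])
```

A plotting or modeling tool could then offer the factors as grouping choices and the quantitative attributes as response variables, whatever the dataset.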

s_subsets.tsv
a_attributes.tsv
…
…
EDTMS
Time
Data
Make both data flows consistent
Additional subsets can be added step by step, as soon as data are produced.
Metadata files: some minimal but relevant metadata

Using Data
Dataset example: FRIM1 (Fruit Integrative Modelling)
http://www.erasysbio.net/index.php?index=266
Metadata files: in order to allow data to be explored and mined, we have to add some minimal but relevant metadata.

Using Data
EDTMS → Application Programming Interface → http://myhost.org/
F A INTEROPERABLE R
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
REST Services: hierarchical tree of resource naming (URL)
Metadata files
http://pmb-bordeaux.fr/odamsw/
Data Emancipation
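Because resources are named by a hierarchical path, a query URL can be assembled from its segments. The Python sketch below does only that, with no network access; the helper name getdata_url is hypothetical, and the segments after the dataset name simply mirror the FRIM1 query shown in this deck rather than a documented scheme.

```python
from urllib.parse import quote


def getdata_url(base, data_format, dataset, *segments):
    """Join path segments into a getdata-style URL.

    Unsafe characters are percent-encoded; parentheses and commas are kept
    readable, as in the FRIM1 example of this deck.
    """
    parts = ["getdata", data_format, dataset, *segments]
    path = "/".join(quote(part, safe="(),") for part in parts)
    return f"{base.rstrip('/')}/{path}"


# Example values taken from the FRIM1 query shown in this deck.
EXAMPLE_URL = getdata_url(
    "https://pmb-bordeaux.fr",
    "xml",
    "frim1",
    "(activome,qNMR_metabo)",
    "treatment",
    "Control",
)
print(EXAMPLE_URL)
```

The resulting string can be pasted into a browser or passed to any HTTP client, which is the point of a plain GET interface: no dedicated client library is required.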

FRIM1: EDTMS Application Programming Interface
Subsets: plants, samples, activome, qNMR_metabo (linked by identifiers)
https://pmb-bordeaux.fr/getdata/xml/frim1/(activome,qNMR_metabo)/treatment/Control
Data subsets are retrieved by merging all the subsets of lower rank than the specified subsets, following the pathway defined by the "obtainedFrom" links:
(activome,qNMR_metabo) → plants + samples + (aliquots + activome) + (pools + qNMR_metabo)
→ Avoids a lot of data manipulation
→ Facilitates linking both metadata and data for data mining
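The merge rule can be sketched as a walk up the "obtainedFrom" links. In the Python snippet below, the link table is an assumption inferred from the FRIM1 example on this slide (samples from plants, aliquots and pools from samples, activome from aliquots, qNMR_metabo from pools); the function names are hypothetical and this is not the actual EDTMS implementation.

```python
# Assumed "obtainedFrom" links, inferred from the FRIM1 example on this slide.
# Each subset maps to the subset it was obtained from (None for the root).
OBTAINED_FROM = {
    "plants": None,
    "samples": "plants",
    "aliquots": "samples",
    "activome": "aliquots",
    "pools": "samples",
    "qNMR_metabo": "pools",
}


def lineage(subset, links):
    """Return the chain of subsets from the root down to `subset`."""
    chain = []
    current = subset
    while current is not None:
        chain.append(current)
        current = links[current]
    chain.reverse()
    return chain


def subsets_to_merge(requested, links):
    """List every subset needed to serve the request, each one only once."""
    ordered = []
    for subset in requested:
        for name in lineage(subset, links):
            if name not in ordered:
                ordered.append(name)
    return ordered


MERGE_ORDER = subsets_to_merge(["activome", "qNMR_metabo"], OBTAINED_FROM)
print(" + ".join(MERGE_ORDER))
```

Under these assumed links, the request for (activome,qNMR_metabo) pulls in plants, samples, aliquots and pools as well, so the user never has to join those tables by hand.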

Using Data
Data + metadata files (EDTMS) → Application Programming Interface (http://myhost.org/) → Tools
Tools: develop lightweight tools if needed, e.g. R scripts (Galaxy) or a lightweight GUI (R Shiny)
→ Data emancipation with respect to tools
F A INTEROPERABLE R

https://pmb-bordeaux.fr/dataexplorer/?ds=frim1
FRIM1 Example online
R shiny
Visual data exploration
a first key step for deeper analyses
https://pmb-bordeaux.fr/dataexplorer/
Using Data
EDTMS

FRIM1: Using Data
Explore several possibilities by interacting with the graph.
Explore your data in several ways according to your concerns, always by interacting with the graphs.
Save as …
Export as …
As far as possible,
keep the old way of
using the scientist's
worksheets …

The R package
Rodam
Copy-Paste
The Comprehensive R
Archive Network
https://cran.r-project.org
… while allowing a more efficient way of working ...

The R package Rodam
The Comprehensive R Archive Network: https://cran.r-project.org/web/packages/Rodam/index.html
Data mining / Modeling
→ Selection of subsets of data
→ Repetition of multiple scenarios on different subsets of data

R markdown
knitr
Reproducible Research … with R and RStudio
R code
https://rmarkdown.rstudio.com/authoring_quick_tour.html
The R package
Rodam
EDTMS
ODAM Framework

Reproducible Research … with R and RStudio
Christopher Gandrud (2015)
https://github.com/christophergandrud/Rep-Res-Book
https://englianhu.files.wordpress.com/2016/01/reproducible-research-with-r-and-studio-2nd-edition.pdf
Chap II Data Gathering and Storage (70 pages out of 300)
II. 6 - Gathering Data with R
“How you gather your data directly impacts how reproducible your research will be.
If all of your data gathering steps are tied together by your source code, then independent
researchers (and you) can more easily regather the data”
II. 7 - Preparing Data for Analysis
“Once we have gathered the raw data that we want to include in our statistical analyses
we generally need to clean it up so that it can be merged into a single data file.”
This is exactly the need that the ODAM framework aims to address, in a normalized way and as easily and quickly as possible.

https://pmb-bordeaux.fr/dataexplorer/?ds=frim1
doi:10.15454/95JUTK
https://data.inra.fr/
FINDABLE
A
I
R
Data Dissemination
R scripts (Rmd) If applicable

Data Paper: data as the subject of a paper
The Data Paper describes the data. It includes the associated descriptive elements (metadata) and all the technical information (methods, formulas, software applications ...) useful for understanding how the data were obtained and how other scientists can reuse them.
https://www6.inra.fr/datapartage/Partager-Publier/Valoriser-ses-donnees/Publier-un-Data-Paper
A draft data paper
This tool generates a draft data paper (a scientific publication describing a dataset) from the DOI of a dataset deposited in the data.inra.fr portal:
https://data.inra.fr/datapartage-datapapers-web/

Make your data great now
All the actors in the data acquisition chain must be convinced that the data repository can bring added value, all the more so when producers deposit their data as early as possible, i.e.:
• Reduce data manipulation
• Data sharing & data availability
• Facilitate the subsequent data mining
• Facilitate reproducible research
• Facilitate data dissemination
• Assistance with decision-making in the selection of samples and in the choice of additional analyses
• Assistance with annotation by cross-referencing and by input of a priori knowledge
• etc.
→ Make the two axes consistent:
   → Data processing
   → Data management
This implies that (bio)computer scientists propose useful and/or innovative tools, to motivate and convince researchers to submit their data as soon as possible.
The data management system becomes completely independent of data usage:
One dataset → Several applications
&
One application → Several datasets
The "data life" must therefore be integrated into the scientific research process

Take-home message
→ Need to take care of data
Thank you for your attention
Have fun!
Open Data for Access and Mining
https://hub.docker.com/r/odam/getdata/
http://pmb-bordeaux.fr/dataexplorer/
https://github.com/INRA/ODAM
https://cran.r-project.org/package=Rodam
