Daniel Jacob – INRA - 2018
How to best manage your data
to make the most of it for your research
Make your data great now
Give open access to your data
and make them ready to be mined
Open Data for Access and Mining
ODAM Framework
Daniel Jacob
INRA UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility
Oct 2018
DATA Studies
Project
During a research project
Know-how Knowledge
Input Output
What becomes of the data?
Among the possible scenarios, two are extreme:
• Nothing! The data sit on a disk until it dies.
• A comprehensive database is created that manages all
the data and metadata, together with a
visualization and querying interface.
Expected objectives
After the project is completed
DATA Studies
Project
Expected objectives
Scientific Data Repositories
Enrichment
Expected links
DATA Studies
Project
Publishing policies
http://www.omicsdi.org/ …
Raw Data
Processed
data
Analyzed
data
Published
data
Processed data are raw data
transformed so as to
highlight some features,
i.e. the types of
variables linked to
the focus of the study.
Raw data
Information
(often partial)
Open access
Specialized Data Repositories
Data flow Concerned data
Experiment
Data Tables
Know-how Knowledge
Annotation, Curation, Validation
Not fully reproducible
automatically, because it
requires human expertise
Open Data
Accessible data
including documents that programs
cannot interpret or exploit
Open API (Application Programming Interface)
Queryable data
according to the imposed API schema
→ Metadata
PUBLISH DATA "5 STARS" THE FAIR DATA PRINCIPLES
Data capture Using data
As simple as possible, and in a normalized way
As far as possible, the most appropriate choice seems to be to keep the scientists'
familiar way of working with worksheets,
while allowing other, more efficient approaches throughout the data flow
Producers
are scientists
Consumers
are scientists
Before the project begins
After the project is completed
Future
Data flow
DATA
A data management plan or DMP is a formal
document that outlines how data are to be handled
both during a research project, and after the project is
completed.
The goal of a data management plan is to consider the
many aspects of data management, metadata
generation, data preservation, and analysis before the
project begins;
this ensures that data are well managed in the
present, and prepared for preservation in the future.
→ Description by metadata (ontology / controlled vocabulary)
→ Data capture, data formatting, data linking
→ Implies data archaeology after several months / years
https://dmp.opidor.fr/
https://www6.inra.fr/datapartage/
Data management
Publishing policies
During a research project
Know-how
Knowledge
Project
Data flow
DATA
Data / Metadata
Data Mining Modeling
Find out
“biomarkers”
Explain data
Data/Metadata Exploration
Data Visualization
Data mining / Modeling
Data exploration → Descriptive statistics
A first glimpse of the data that can reveal trends.
It characterizes the data well, which is
necessary before choosing how to analyze them.
→ Repeating multiple scenarios on
different subsets of data
→ Selecting subsets of data
→ This implies a lot of data manipulation:
→ data capture, data formatting, data linking,
data import / data export
→ Linking metadata and data for data
mining
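As an illustration of the "select a subset, then describe it" loop above, here is a minimal Python sketch using only the standard library. The rows, the "Stress" level and the numeric values are invented for the example; only the column names Treatment and FruitDiameter are borrowed from the use case presented later in the deck.

```python
from statistics import mean

# Hypothetical rows: one dict per sample, values invented for illustration.
rows = [
    {"Treatment": "Control", "FruitDiameter": 30.0},
    {"Treatment": "Control", "FruitDiameter": 34.0},
    {"Treatment": "Stress", "FruitDiameter": 26.0},
    {"Treatment": "Stress", "FruitDiameter": 28.0},
]

def mean_by_factor(rows, factor, variable):
    """Group rows by the levels of a factor and average one variable per level."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor], []).append(row[variable])
    return {level: mean(values) for level, values in groups.items()}

summary = mean_by_factor(rows, "Treatment", "FruitDiameter")
print(summary)
```

Repeating a scenario on another subset then only means calling the same function with a different factor or variable, which is the kind of manipulation the framework aims to reduce.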
Data processing
Before the project begins
After the project is completed
Future
During a research project
Know-how
Knowledge
Project
Data flow
Data management
Time is clearly explicit
Data processing
Time is often implicit
→ Reduce data manipulation
→ Data sharing & data availability
→ Facilitate subsequent data mining
→ Facilitate data dissemination
Make the two axes consistent:
Motivations
How ?
DATA
The "data life" must therefore be integrated
into the scientific research process
seeding harvesting
samples
preparation
samples analysis
Sample
identifiers
Experiment
Data Tables
Experiment Design
Make both metadata and data
available for data mining
Several operators,
techniques, data
types, SOPs, …
Whenever we plan to share data from a common experimental
design, the classic challenges in letting every partner use the data quickly are data
storage and data access
Several partners
Use-Case
“Metabolism”
Research question → Project → Experiment → Experimental set-up
During a research project
Plant Metabolism
• Systems Biology
• Biomarkers
associated with plant
performance
Whatever the kind of experiment, it assumes
a design of experiment (DoE) involving
individuals, samples or other entities as the
main objects of study (e.g. plants, tissues,
bacteria, …).
It also assumes the observation of dependent
variables resulting from the effects of some controlled
experimental factors.
Moreover, each object of study usually has an
identifier, and the variables
can be quantitative or qualitative.
Promote good practices
samples : Sample features
Data capture
The experimental context: needs / wishes
seeding harvesting
samples
preparation
Sample
identifiers
Experiment Design (DoE)
samples analysis
Use-Case “Metabolism”
identifier factors Quantitative Qualitative
Data
Promote non-proprietary
format like CSV or TSV
Shortname      Description                                Unit                       Type
SampleID       Pool of several harvests                   -                          Identifier
Treatment      Treatment applied on plants                -                          Factor
DevStage       Fruit development stage                    -                          Factor
FruitAge       Fruit age                                  days post-anthesis (dpa)   Factor
FruitDiameter  Fruit diameter                             mm                         Variable
FruitHeight    Fruit height                               mm                         Variable
FruitFW        Fruit fresh weight                         g                          Variable
Rank           Row of the individual plant on the table   -                          Feature
Truss          Position of the truss on the stem          -                          Feature
Description of the different
columns within data files
Metadata
Data
Promote non-proprietary
format like CSV or TSV
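To show why a plain TSV description of the columns is already machine-usable, here is a minimal Python sketch that reads such a description and groups the columns by their role. The TSV text is transcribed from a few of the column descriptions above; the header name "Type" for the last column and the abbreviated unit "dpa" are assumptions made for this example, not something the slides specify.

```python
import csv
import io

# Hypothetical data dictionary in TSV form (a subset of the columns above).
# "Type" is an assumed header for the identifier/factor/variable column.
DICTIONARY_TSV = (
    "Shortname\tDescription\tUnit\tType\n"
    "SampleID\tPool of several harvests\t\tIdentifier\n"
    "Treatment\tTreatment applied on plants\t\tFactor\n"
    "FruitAge\tFruit age\tdpa\tFactor\n"
    "FruitDiameter\tFruit diameter\tmm\tVariable\n"
    "FruitFW\tFruit fresh weight\tg\tVariable\n"
)

def columns_by_type(tsv_text):
    """Return a dict mapping each declared type to the list of column short names."""
    groups = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        groups.setdefault(row["Type"], []).append(row["Shortname"])
    return groups

groups = columns_by_type(DICTIONARY_TSV)
print(groups)
```

With this information a tool can decide on its own which columns to use for grouping (factors) and which to summarize or plot (variables).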
Experiment
Data Tables
Data storage
drag & drop
No database schema, no programming code and no additional configuration on the server side.
Data capture Minimal effort (PUT)
Merely dropping data files in a data
storage (e.g. a local NAS or distant
storage space)
PUT
Data capture Using Data
The "core idea"
(See Good Practices)
Data center
mount
Data can be downloaded,
explored and mined
Data analysis / mining
Maximum efficiency (GET)
http://myhost.org/
GET
As simple as possible, and in a normalized way
Experiment
Data Tables
Data storage
+2 metadata files
drag & drop
No database schema, no programming code and no additional configuration on the server side.
Data capture Minimal effort (PUT)
Merely dropping data files in a data
storage (e.g. a local NAS or distant
storage space)
Web API
Data center
mount
Data can be downloaded,
explored and mined
Data analysis / mining
Maximum efficiency (GET)
http://myhost.org/
GET
PUT
Data capture Using Data
EDTMS
Experiment Data Tables Management System
(EDTMS)
Implementation
FAIR: Interoperable
s_subsets.tsv: this metadata file associates a key concept with each data subset file
EDTMS
Metadata files
To allow data to be explored and mined, we have to attach some minimal but relevant metadata:
There must be a relationship between object types, which we assume to be of the "obtainedFrom" type.
Linking two tables together requires a common attribute, which in most cases is an identifier.
Optional(*)
(*) in fact, rather deferred
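The "obtainedFrom" relationship can be pictured as a simple parent link between subsets. The Python sketch below is an illustration only: the subset names are loosely modeled on the FRIM1 example shown later in the deck, and the dictionary layout is hypothetical, not the actual column layout of s_subsets.tsv.

```python
# Hypothetical subset hierarchy: each subset names the subset it was
# obtained from (None for the root). Not the real s_subsets.tsv layout.
OBTAINED_FROM = {
    "plants": None,
    "samples": "plants",
    "aliquots": "samples",
    "activome": "aliquots",
    "pools": "samples",
    "qNMR_metabo": "pools",
}

def lineage(subset, links=OBTAINED_FROM):
    """Return the chain of subsets from the root down to the given subset."""
    chain = []
    current = subset
    while current is not None:
        chain.append(current)
        current = links[current]  # follow the obtainedFrom link upwards
    return list(reversed(chain))

print(lineage("activome"))
print(lineage("qNMR_metabo"))
```

Because every subset declares a single parent, the path back to the root is unambiguous, which is what makes automatic merging of subsets possible.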
a_attributes.tsv: this metadata file annotates each attribute (variable) with
some minimal but relevant metadata
factor
quantitative
qualitative
identifier
categories
EDTMS
Plants
Harvests
Samples
Compounds
…
…
Optional (*)
Good Practices
Description of the different columns within data files
(*) in fact, rather deferred
s_subsets.tsv
a_attributes.tsv
…
…
EDTMS
Time
Data
Make both data flows consistent
Additional subsets can be added step by step, as soon as data are produced.
Metadata files: some minimal but relevant metadata
Using Data
FRIM1: Fruit Integrative Modelling http://www.erasysbio.net/index.php?index=266
Dataset example
Using Data
http://myhost.org/
Application
Programming
Interface
FAIR: Interoperable
EDTMS
GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >
REST Services: hierarchical tree of resource naming (URL)
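As a sketch of what "hierarchical resource naming" means in practice, the Python function below assembles a URL following the pattern shown above. The function name and its parameters are hypothetical; the trailing path segments are elided in the slide, so they are treated here as opaque strings. No request is sent.

```python
from urllib.parse import quote

def getdata_url(host, data_format, dataset, subsets=(), *selectors):
    """Build a getdata URL: /getdata/<format>/<dataset>/(<subsets>)/<selectors...>."""
    parts = ["getdata", data_format, dataset]
    if subsets:
        # Several subsets are written as a parenthesized, comma-separated list.
        parts.append("(" + ",".join(subsets) + ")")
    parts.extend(selectors)
    # Percent-encode each segment, keeping parentheses and commas readable.
    path = "/".join(quote(part, safe="(),") for part in parts)
    return f"{host.rstrip('/')}/{path}"

url = getdata_url(
    "https://pmb-bordeaux.fr", "xml", "frim1",
    ("activome", "qNMR_metabo"), "treatment", "Control",
)
print(url)
```

Called with these arguments, the function reproduces the FRIM1 example URL given in the deck, which suggests the pattern is regular enough to be generated by any client.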
http://pmb-bordeaux.fr/odamsw/
Data Emancipation
plants samples activome qNMR_metabo
Identifiers
https://pmb-bordeaux.fr/getdata/xml/frim1/(activome,qNMR_metabo)/treatment/Control
Data subsets are retrieved by merging all the subsets of lower rank than the specified
subsets, following the pathway defined by the "obtainedFrom" links.
→ Avoids a lot of data
manipulation
→ Facilitates linking
metadata and data for
data mining
(activome,qNMR_metabo) → plants + samples
+ (aliquots+activome)
+ (pools+qNMR_metabo)
FRIM1
Application
Programming
Interface
EDTMS
http://pmb-bordeaux.fr/odamsw/
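The merge described on this slide amounts to successive joins on shared identifiers. Below is a minimal Python sketch of that idea, not the framework's implementation: the tables, the identifier names PlantID and SampleID, the variable PGM and all values are invented for illustration, and intermediate subsets such as aliquots are left out for brevity.

```python
def merge_on(left, right, key):
    """Inner-join two lists of row dicts on a shared identifier column.

    Assumes the key is unique in `left` (the parent subset).
    """
    index = {row[key]: row for row in left}
    return [{**index[row[key]], **row} for row in right if row[key] in index]

# Hypothetical subsets, one list of rows per data table.
plants = [{"PlantID": 1, "Treatment": "Control"},
          {"PlantID": 2, "Treatment": "Stress"}]
samples = [{"SampleID": 10, "PlantID": 1},
           {"SampleID": 11, "PlantID": 2}]
activome = [{"SampleID": 10, "PGM": 0.42},
            {"SampleID": 11, "PGM": 0.57}]

# Follow the links from the root subset down to the requested one.
merged = merge_on(merge_on(plants, samples, "PlantID"), activome, "SampleID")

# Then filter on a factor level, as the example URL does with treatment/Control.
control = [row for row in merged if row["Treatment"] == "Control"]
print(control)
```

The point of the service is that the user never writes these joins: naming the subset and the factor level in the URL is enough.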
Using Data
FAIR: Interoperable
EDTMS
Develop lightweight tools if needed:
- R scripts (Galaxy), lightweight GUI
(R Shiny)
Tools
→ Data emancipation
with respect to tools
Data → API → Tools
Data
http://myhost.org/
Application
Programming
Interface
Data Emancipation
https://pmb-bordeaux.fr/dataexplorer/?ds=frim1
FRIM1 Example online
R shiny
Visual data exploration
a first key step for deeper analyses
https://pmb-bordeaux.fr/dataexplorer/
Using Data
EDTMS
FRIM1
Using Data
EDTMS
FRIM1 Using Data
Explore several
possibilities by
interacting with the
graph
FRIM1 Using Data
FRIM1 Using Data
Explore your data in several
ways according to your
concerns and always by
interacting with the graphs
Save as …
FRIM1 Using Data
FRIM1
Export as …
Using Data
As far as possible,
keep the old way of
using the scientist's
worksheets …
The R package
Rodam
Copy-Paste
The Comprehensive R
Archive Network
https://cran.r-project.org
FRIM1 Using Data
… while allowing
a way to be more
efficient ...
The R package
Rodam
→ Selecting subsets of data
→ Repeating multiple scenarios on
different subsets of data
Data mining / Modeling https://cran.r-project.org/web/packages/Rodam/index.html
The Comprehensive R Archive Network Using Data
R markdown
knitr
Reproducible Research … with R and RStudio
R code
https://rmarkdown.rstudio.com/authoring_quick_tour.html
The R package
Rodam
EDTMS
ODAM Framework
Reproducible Research … with R and RStudio
“How you gather your data directly impacts how reproducible your research will be.
If all of your data gathering steps are tied together by your source code, then independent
researchers (and you) can more easily regather the data.”
II. 6 - Gathering Data with R
II. 7 - Preparing Data for Analysis
“Once we have gathered the raw data that we want to include in our statistical analyses
we generally need to clean it up so that it can be merged into a single data file.”
https://englianhu.files.wordpress.com/2016/01/reproducible-research-with-r-and-studio-2nd-edition.pdf
This is exactly what the ODAM framework aims to address,
in a normalized way and as easily and quickly as possible
Chap II Data Gathering and Storage (70 pages out of 300)
Christopher Gandrud (2015)
https://github.com/christophergandrud/Rep-Res-Book
https://pmb-bordeaux.fr/dataexplorer/?ds=frim1
doi:10.15454/95JUTK
https://data.inra.fr/
FAIR: Findable
Data Dissemination
R scripts (Rmd) If applicable
Data as the subject of a paper
Data Paper
The Data Paper describes the data
It includes the associated descriptive elements (metadata)
and all the technical information (methods, formulas, software
applications ...) needed to understand how the data were
obtained and how other scientists can reuse them
https://www6.inra.fr/datapartage/Partager-Publier/Valoriser-ses-donnees/Publier-un-Data-Paper
This tool allows you to generate a draft data paper (scientific publication describing
a dataset) from the DOI of a dataset deposited in the data.inra.fr portal
A draft data paper
https://data.inra.fr/datapartage-datapapers-web/
Make your data great now
All the actors in the data acquisition chain must be convinced that
the data repository brings added value, all the more so if
producers deposit their data as early as possible, i.e.:
• Reduce data manipulation
• Data sharing & data availability
• Facilitate subsequent data mining
• Facilitate reproducible research
• Facilitate data dissemination
• Assistance with decision-making in the selection of samples and in
the choice of additional analyses,
• Assistance with annotation by cross-referencing and a priori
knowledge input,
• etc.
→ Make the two axes consistent:
• Data processing
• Data management
This implies that (bio)computer scientists
• propose useful and/or innovative tools
• that motivate and convince researchers to submit
their data as early as possible.
The data management system becomes
completely independent of data usage.
One dataset → Several applications
&
One application → Several datasets
The "data life" must therefore be integrated into
the scientific research process
Take-home message
→ Take care of your data
Thank you for your attention
Have fun!
Open Data for Access and Mining
https://hub.docker.com/r/odam/getdata/
http://pmb-bordeaux.fr/dataexplorer/
https://github.com/INRA/ODAM
https://cran.r-project.org/package=Rodam
http://pmb-bordeaux.fr/odamsw/

More Related Content

What's hot

How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
Robert Grossman
 
Basics of Research Data Management
Basics of Research Data ManagementBasics of Research Data Management
Basics of Research Data Management
OpenAIRE
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
Duncan Hull
 
Best practices data management
Best practices data managementBest practices data management
Best practices data management
Sherry Lake
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
hktripathy
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Greg Landrum
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Robert Grossman
 
Digital data
Digital dataDigital data
Digital data
ShivanandaVSeeri
 
David Shotton - Research Integrity: Integrity of the published record
David Shotton - Research Integrity: Integrity of the published recordDavid Shotton - Research Integrity: Integrity of the published record
David Shotton - Research Integrity: Integrity of the published record
Jisc
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?
Robert Grossman
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
Robert Grossman
 
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
Databricks
 
DataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data SharingDataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data Sharing
DataONE
 
Big Data Analytics Using Hadoop
Big Data Analytics Using HadoopBig Data Analytics Using Hadoop
Big Data Analytics Using Hadoop
Srikanth VNV
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical Research
Robert Grossman
 
Data management
Data management Data management
Data management
Graça Gabriel
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
Robert Grossman
 
FAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practiceFAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practice
Carole Goble
 
Research Data Management for SOE
Research Data Management for SOEResearch Data Management for SOE
Research Data Management for SOE
Lynda Kellam
 
Datamining
DataminingDatamining
Datamining
sumit621
 

What's hot (20)

How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
Basics of Research Data Management
Basics of Research Data ManagementBasics of Research Data Management
Basics of Research Data Management
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Best practices data management
Best practices data managementBest practices data management
Best practices data management
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
Digital data
Digital dataDigital data
Digital data
 
David Shotton - Research Integrity: Integrity of the published record
David Shotton - Research Integrity: Integrity of the published recordDavid Shotton - Research Integrity: Integrity of the published record
David Shotton - Research Integrity: Integrity of the published record
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data...
 
DataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data SharingDataONE Education Module 02: Data Sharing
DataONE Education Module 02: Data Sharing
 
Big Data Analytics Using Hadoop
Big Data Analytics Using HadoopBig Data Analytics Using Hadoop
Big Data Analytics Using Hadoop
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical Research
 
Data management
Data management Data management
Data management
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
FAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practiceFAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practice
 
Research Data Management for SOE
Research Data Management for SOEResearch Data Management for SOE
Research Data Management for SOE
 
Datamining
DataminingDatamining
Datamining
 

Similar to Make your data great now

INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
Attila Barta
 
UK Digital Curation Centre: enabling research data management at the coalface
UK Digital Curation Centre: enabling research data management at the coalfaceUK Digital Curation Centre: enabling research data management at the coalface
UK Digital Curation Centre: enabling research data management at the coalface
LizLyon
 
RDM for trainee physicians
RDM for trainee physiciansRDM for trainee physicians
RDM for trainee physicians
Historic Environment Scotland
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
Prof.Balakrishnan S
 
GBIF and reuse of research data, Bergen (2016-12-14)
GBIF and reuse of research data, Bergen (2016-12-14)GBIF and reuse of research data, Bergen (2016-12-14)
GBIF and reuse of research data, Bergen (2016-12-14)
Dag Endresen
 
Meeting the NSF DMP Requirement June 13, 2012
Meeting the NSF DMP Requirement June 13, 2012Meeting the NSF DMP Requirement June 13, 2012
Meeting the NSF DMP Requirement June 13, 2012
IUPUI
 
Good Practice in Research Data Management
Good Practice in Research Data ManagementGood Practice in Research Data Management
Good Practice in Research Data Management
Historic Environment Scotland
 
Behind the scenes of data science
Behind the scenes of data scienceBehind the scenes of data science
Behind the scenes of data science
Loïc Lejoly
 
Research data management : Open Research Data pilot, data management (plans),...
Research data management : Open Research Data pilot, data management (plans),...Research data management : Open Research Data pilot, data management (plans),...
Research data management : Open Research Data pilot, data management (plans),...
Leon Osinski
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET Journal
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET Journal
 
SEEKing our way to better presentation of data and models from scientific inv...
SEEKing our way to better presentation of data and models from scientific inv...SEEKing our way to better presentation of data and models from scientific inv...
SEEKing our way to better presentation of data and models from scientific inv...
Natalie Stanford
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...
Sarah Anna Stewart
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Carole Goble
 
Research Data Management and Sharing for the Social Sciences and Humanities
Research Data Management and Sharing for the Social Sciences and HumanitiesResearch Data Management and Sharing for the Social Sciences and Humanities
Research Data Management and Sharing for the Social Sciences and Humanities
Rebekah Cummings
 
CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217
lyarmey
 
NISO Training Thursday Crafting a Scientific Data Management Plan
NISO Training Thursday Crafting a Scientific Data Management PlanNISO Training Thursday Crafting a Scientific Data Management Plan
NISO Training Thursday Crafting a Scientific Data Management Plan
National Information Standards Organization (NISO)
 
Introduction to RDM for Geoscience PhD Students
Introduction to RDM for Geoscience PhD StudentsIntroduction to RDM for Geoscience PhD Students
Introduction to RDM for Geoscience PhD Students
EDINA, University of Edinburgh
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introduction
Basma Gamal
 
Data Management Lab: Session 2 slides
Data Management Lab: Session 2 slidesData Management Lab: Session 2 slides
Data Management Lab: Session 2 slides
IUPUI
 

Similar to Make your data great now (20)

INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
UK Digital Curation Centre: enabling research data management at the coalface
UK Digital Curation Centre: enabling research data management at the coalfaceUK Digital Curation Centre: enabling research data management at the coalface
UK Digital Curation Centre: enabling research data management at the coalface
 
RDM for trainee physicians
RDM for trainee physiciansRDM for trainee physicians
RDM for trainee physicians
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
 
GBIF and reuse of research data, Bergen (2016-12-14)
GBIF and reuse of research data, Bergen (2016-12-14)GBIF and reuse of research data, Bergen (2016-12-14)
GBIF and reuse of research data, Bergen (2016-12-14)
 
Meeting the NSF DMP Requirement June 13, 2012
Meeting the NSF DMP Requirement June 13, 2012Meeting the NSF DMP Requirement June 13, 2012
Meeting the NSF DMP Requirement June 13, 2012
 
Good Practice in Research Data Management
Good Practice in Research Data ManagementGood Practice in Research Data Management
Good Practice in Research Data Management
 
Behind the scenes of data science
Behind the scenes of data scienceBehind the scenes of data science
Behind the scenes of data science
 
Research data management : Open Research Data pilot, data management (plans),...
Research data management : Open Research Data pilot, data management (plans),...Research data management : Open Research Data pilot, data management (plans),...
Make your data great now

  • 1. Daniel Jacob – INRA - 2018. Make your data great now: how to best manage your data to make the most of it for your research. Give open access to your data and make it ready to be mined. ODAM Framework (Open Data for Access and Mining). Daniel Jacob, INRA UMR 1332 BFP – Metabolism Group, Bordeaux Metabolomics Facility, Oct 2018.
  • 2. During a research project. [Diagram labels: DATA, Studies, Project, Know-how, Knowledge, Input, Output]
  • 3. After the project is completed: what do the data become? Among the possible scenarios, two are extreme: (1) nothing, the data rest on a disk (until the disk dies); (2) creation of a comprehensive database managing all data and metadata in their entirety, associated with a visualization and querying interface. [Diagram labels: Expected objectives, DATA, Studies, Project]
  • 4. Expected objectives: scientific data repositories, enrichment, expected links, publishing policies. http://www.omicsdi.org/ … [Diagram labels: DATA, Studies, Project]
  • 5. Data flow and concerned data: raw data, processed data, analyzed data, published data. Processed data are raw data processed so as to highlight some features, i.e. some types of variables linked to the focus of the study. Annotation, curation and validation are partly not automatically reproducible because they require human expertise. [Diagram labels: Experiment Data Tables; Raw data; Information (often partial); Open access; Specialized Data Repositories; Know-how; Knowledge]
  • 6. Open Data: accessible data, including documents that are incomprehensible to and unexploitable by automatons (i.e. programmatically). Open API (Application Program Interface): queryable data, according to the imposed API scheme → metadata. Publish data "5 stars"; the FAIR data principles.
  • 7. Open Data, Open API, "5 stars" data and the FAIR data principles, as on the previous slide. Data capture and using data: as simple as possible and in a normalized way. As far as possible, the most appropriate choice seems to be to keep the scientists' familiar worksheets while allowing other, more efficient approaches, throughout the data flow. Producers are scientists; consumers are scientists.
  • 8. Data management along the data flow (before the project begins, after the project is completed, future). A data management plan (DMP) is a formal document that outlines how data are to be handled both during a research project and after the project is completed. The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation, and analysis before the project begins; this ensures that data are well managed in the present and prepared for preservation in the future. Key points: description by metadata (ontology / controlled vocabulary); data capturing, data formatting, data linking; implying data archaeology after several months / years. Publishing policies. https://dmp.opidor.fr/ https://www6.inra.fr/datapartage/
  • 9. Data processing during a research project. Data exploration (data / metadata exploration, data visualization): descriptive statistics give a first glimpse of the data that can show trends, and allow the data to be well characterized, which is necessary to then choose how to analyze them. Data mining / modeling (find "biomarkers", explain data): repetition of multiple scenarios on different subsets of data; selection of subsets of data. This implies a lot of data manipulation (data capturing, data formatting, data linking, data import / export) and linking both metadata and data for data mining. [Diagram labels: Project, Data flow, DATA, Know-how, Knowledge]
  • 10. Make the two axes consistent. Data management (before the project begins, after the project is completed, future): time is clearly explicit. Data processing (during a research project): time is often implicit. Motivations: reduce data manipulation; data sharing and data availability; facilitate the subsequent data mining; facilitate data dissemination. How? The "data life" must therefore be integrated into the scientific research process.
  • 11. Use case "Metabolism", during a research project. Research question → Project → Experiment → Experimental set-up. Plant metabolism: systems biology; biomarkers associated with plant performance. Experiment design: seeding, harvesting, sample preparation, sample analysis, with sample identifiers and Experiment Data Tables. Several partners, operators, techniques, data types, SOPs, … Each time we plan to share data coming from a common experimental design, the classical challenges for every partner to use the data quickly are data storage and data access. Make both metadata and data available for data mining.
  • 12. Use case "Metabolism": data capture and the experimental context (needs / wishes). Experiment design (DoE): seeding, harvesting, sample preparation, sample analysis, with sample identifiers. Whatever the kind of experiment, this assumes a design of experiment (DoE) involving individuals, samples or other entities as the main objects of study (e.g. plants, tissues, bacteria, …). It also assumes the observation of dependent variables resulting from the effects of some controlled experimental factors. Moreover, each object of study usually has an identifier, and the variables can be quantitative or qualitative. Promote good practices: describe sample features as identifier, factors, and quantitative or qualitative data. Promote non-proprietary formats like CSV or TSV.
  • 13. Use case "Metabolism": data capture and the experimental context (needs / wishes). Promote good practices: metadata describing the different columns within data files. Promote non-proprietary formats like CSV or TSV.
       Shortname | Description | Unit | Category
       SampleID | Pool of several harvests | | Identifier
       Treatment | Treatment applied on plants | | Factor
       DevStage | Fruit development stage | | Factor
       FruitAge | Fruit age | Days post-anthesis (dpa) | Factor
       FruitDiameter | Fruit diameter | mm | Variable
       FruitHeight | Fruit height | mm | Variable
       FruitFW | Fruit fresh weight | g | Variable
       Rank | Row of the individual plant on the table | | Feature
       Truss | Position of the truss on the stem | | Feature
  • 14. The "core idea": as simple as possible and in a normalized way (see Good Practices). Data capture, minimal effort (PUT): merely drop the data files (Experiment Data Tables) by drag & drop into a data storage (e.g. a local NAS or a remote storage space). No database schema, no programming code and no additional configuration on the server side. Using data, maximum efficiency (GET): data can be downloaded, explored and mined for data analysis / mining. http://myhost.org/ [Diagram labels: Data storage, Data center, mount]
  • 15. Implementation: the Experiment Data Tables Management System (EDTMS). Data capture, minimal effort (PUT): merely drop the data files (Experiment Data Tables) plus 2 metadata files by drag & drop into a data storage (e.g. a local NAS or a remote storage space). No database schema, no programming code and no additional configuration on the server side. Using data, maximum efficiency (GET): through a Web API, data can be downloaded, explored and mined for data analysis / mining. http://myhost.org/ FAIR principle highlighted: Interoperable. [Diagram labels: Data storage, Data center, mount]
  • 16. EDTMS metadata files: s_subsets.tsv. In order to allow data to be explored and mined, we have to add some minimal but relevant metadata. This metadata file associates a key concept with each data subset file. A relationship must exist between object types, which we assume to be of the "obtainedFrom" type. Linking two tables together requires a common attribute, i.e. an identifier in most cases. Optional (*). (*) In fact, rather deferred.
  • 17. EDTMS metadata files: a_attributes.tsv. In order to allow data to be explored and mined, we have to add some minimal but relevant metadata. This metadata file allows each attribute (variable) to be annotated with some minimal but relevant metadata (good practices: description of the different columns within data files). Categories: identifier, factor, quantitative, qualitative. Object types: Plants, Harvests, Samples, Compounds, … Optional (*). (*) In fact, rather deferred.
  • 18. EDTMS: make both data flows consistent over time. Additional subsets can be added step by step, as soon as data are produced. Metadata files (s_subsets.tsv, a_attributes.tsv, …): some minimal but relevant metadata.
  • 19. Using data. Dataset example: FRIM1 (Fruit Integrative Modelling), http://www.erasysbio.net/index.php?index=266
  • 20. Using data: FRIM1 (Fruit Integrative Modelling).
  • 21. Using data: FRIM1 (Fruit Integrative Modelling).
  • 22. Using data: FRIM1 (Fruit Integrative Modelling). Metadata files: in order to allow data to be explored and mined, we have to add some minimal but relevant metadata.
  • 23. Using data: the EDTMS Application Programming Interface (http://myhost.org/). REST services, with a hierarchical tree of resource naming (URL): GET http://myhost.org/getdata/<data format>/<dataset name>/< … >/< … >. Metadata files: in order to allow data to be explored and mined, we have to add some minimal but relevant metadata. Data emancipation. FAIR principle highlighted: Interoperable. http://pmb-bordeaux.fr/odamsw/
  • 24. FRIM1: the EDTMS Application Programming Interface (http://pmb-bordeaux.fr/odamsw/). Subsets linked by identifiers: plants, samples, activome, qNMR_metabo. Example: https://pmb-bordeaux.fr/getdata/xml/frim1/(activome,qNMR_metabo)/treatment/Control. Data subsets are retrieved by merging all the subsets of lower rank than the specified subsets, following the pathway defined by the "obtainedFrom" links: (activome,qNMR_metabo) → plants + samples + (aliquots+activome) + (pools+qNMR_metabo). This avoids a lot of data manipulation and facilitates linking both metadata and data for data mining.
  • 25. Using data: the EDTMS Application Programming Interface (http://myhost.org/). Data emancipation with regard to tools (diagram: Data, API, Tools). Develop lightweight tools if needed: R scripts (Galaxy), lightweight GUI (R Shiny). Metadata files: in order to allow data to be explored and mined, we have to add some minimal but relevant metadata. FAIR principle highlighted: Interoperable.
  • 26. Using data (EDTMS): visual data exploration, a first key step for deeper analyses. Online example with R Shiny: https://pmb-bordeaux.fr/dataexplorer/ (FRIM1: https://pmb-bordeaux.fr/dataexplorer/?ds=frim1).
  • 27. Using data (EDTMS): FRIM1. Metadata files: in order to allow data to be explored and mined, we have to add some minimal but relevant metadata.
  • 28. Using data: FRIM1.
  • 29. Using data: FRIM1. Explore several possibilities by interacting with the graph.
  • 30. Using data: FRIM1. Explore your data in several ways according to your concerns, always by interacting with the graphs.
  • 31. Using data: FRIM1. Save as …
  • 32. Using data: FRIM1. Export as … As far as possible, keep the old way of using the scientist's worksheets …
  • 33. Using data: FRIM1. … while allowing a more efficient way: the R package Rodam (copy-paste), from The Comprehensive R Archive Network, https://cran.r-project.org
  • 34. Using data: the R package Rodam (The Comprehensive R Archive Network, https://cran.r-project.org/web/packages/Rodam/index.html). Data mining / modeling: selection of subsets of data; repetition of multiple scenarios on different subsets of data.
  • 35. Reproducible research with R and RStudio: R Markdown, knitr, R code (https://rmarkdown.rstudio.com/authoring_quick_tour.html). ODAM Framework: EDTMS and the R package Rodam.
  • 36. Reproducible research with R and RStudio, Christopher Gandrud (2015), Chap. II, Data Gathering and Storage (70 pages out of 300). II.6, Gathering Data with R: "How you gather your data directly impacts how reproducible your research will be. If all of your data gathering steps are tied together by your source code, then independent researchers (and you) can more easily regather the data." II.7, Preparing Data for Analysis: "Once we have gathered the raw data that we want to include in our statistical analyses we generally need to clean it up so that it can be merged into a single data file." This is exactly what the ODAM framework aims to address, in a normalized way, as easily and quickly as possible. https://englianhu.files.wordpress.com/2016/01/reproducible-research-with-r-and-studio-2nd-edition.pdf https://github.com/christophergandrud/Rep-Res-Book
  • 37. Data dissemination (FAIR principle highlighted: Findable): https://data.inra.fr/, doi:10.15454/95JUTK; https://pmb-bordeaux.fr/dataexplorer/?ds=frim1; R scripts (Rmd), if applicable.
  • 38. Data Paper: data as the subject of a paper. The data paper describes the data. It includes the associated descriptive elements (metadata) and all the technical information (methods, formulas, software applications, ...) useful for understanding how the data were obtained and how other scientists can reuse them. https://www6.inra.fr/datapartage/Partager-Publier/Valoriser-ses-donnees/Publier-un-Data-Paper A draft data paper: this tool generates a draft data paper (a scientific publication describing a dataset) from the DOI of a dataset deposited in the data.inra.fr portal. https://data.inra.fr/datapartage-datapapers-web/
  • 39. Make your data great now. All the actors in the data acquisition chain must be convinced that the data repository can bring added value, all the more so when producers deposit their data as early as possible, i.e.: reduce data manipulation; data sharing and data availability; facilitate the subsequent data mining; facilitate reproducible research; facilitate data dissemination; assist decision-making in the selection of samples and in the choice of additional analyses; assist annotation by cross-referencing and a priori knowledge input; etc. Make the two axes consistent: data processing and data management. This implies that (bio)computer scientists propose useful and/or innovative tools, to motivate and convince researchers to submit their data as soon as possible. The data management system becomes completely independent of data usage: one dataset → several applications, and one application → several datasets. The "data life" must therefore be integrated into the scientific research process.
  • 40. Take-home message: data need to be taken care of. Thank you for your attention. Have fun! Open Data for Access and Mining: https://hub.docker.com/r/odam/getdata/ http://pmb-bordeaux.fr/dataexplorer/ https://github.com/INRA/ODAM https://cran.r-project.org/package=Rodam http://pmb-bordeaux.fr/odamsw/
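
Slide 24 describes how a request for some subsets is expanded by following the "obtainedFrom" links back to the root subset. The slides do not show the real s_subsets.tsv layout or the server code, so the following is only a minimal Python sketch under assumptions: the link table `OBTAINED_FROM` and the helper names are hypothetical, and the parent of each subset is inferred from the single example on the slide.

```python
from typing import Dict, List, Optional

# Hypothetical "obtainedFrom" links, inferred from the slide 24 example.
# Each subset maps to the subset it was obtained from (None for the root).
OBTAINED_FROM: Dict[str, Optional[str]] = {
    "plants": None,
    "samples": "plants",
    "aliquots": "samples",
    "activome": "aliquots",
    "pools": "samples",
    "qNMR_metabo": "pools",
}


def lineage(subset: str, links: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of subsets from the root down to `subset`."""
    chain: List[str] = []
    current: Optional[str] = subset
    while current is not None:
        if current in chain:
            raise ValueError(f"cycle detected at subset {current!r}")
        chain.append(current)
        current = links[current]  # KeyError if the subset is unknown
    return chain[::-1]


def subsets_to_merge(requested: List[str],
                     links: Dict[str, Optional[str]]) -> List[str]:
    """List every subset needed to serve `requested`, without duplicates."""
    ordered: List[str] = []
    for name in requested:
        for step in lineage(name, links):
            if step not in ordered:
                ordered.append(step)
    return ordered


print(subsets_to_merge(["activome", "qNMR_metabo"], OBTAINED_FROM))
```

With these assumed links, the request (activome, qNMR_metabo) expands to plants, samples, aliquots, activome, pools and qNMR_metabo, which is the set of subsets the slide lists for that request.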

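Slides 23 and 24 present the web API as a hierarchical tree of URL path segments. As a rough illustration, the Python sketch below composes such a URL from its parts and reproduces the FRIM1 example of slide 24. The function names `build_getdata_url` and `subset_selector` are hypothetical, and the trailing segments are treated as opaque strings because the slides elide their meaning; no request is sent.

```python
from urllib.parse import quote


def subset_selector(*subsets: str) -> str:
    """Format several subset names as one path segment, e.g. (a,b)."""
    return "(" + ",".join(subsets) + ")"


def build_getdata_url(base_url: str, data_format: str, dataset: str,
                      *segments: str) -> str:
    """Join the path segments under <base_url>/getdata/.

    Parentheses and commas are left unescaped so that a subset selector
    stays readable; any other reserved character is percent-encoded.
    """
    parts = [data_format, dataset, *segments]
    encoded = [quote(str(part), safe="(),") for part in parts]
    return base_url.rstrip("/") + "/getdata/" + "/".join(encoded)


url = build_getdata_url(
    "https://pmb-bordeaux.fr",
    "xml",
    "frim1",
    subset_selector("activome", "qNMR_metabo"),
    "treatment",
    "Control",
)
print(url)
```

Keeping URL construction in one small function means a script can switch data format or dataset by changing a single argument, which fits the "one application, several datasets" idea of slide 39.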
Editor's Notes

  1. Presentation of both a piece of work and a reflection on data management, namely how to make the most of data at every stage of their life cycle.
  2. Schematic view of a project from the data standpoint, in terms of inputs and outputs.
  3. Data processing: time is more often implicit. Data management: time is clearly explicit.
  4. Data processing: time is more often implicit. Data management: time is clearly explicit.
  5. As far as possible, keep the old way of using the scientist's worksheets.
  6. https://www.fun-mooc.fr/c4x/UPSUD/42001S02/asset/RMarkdown.pdf https://rmarkdown.rstudio.com/authoring_quick_tour.html
  7. https://englianhu.files.wordpress.com/2016/01/reproducible-research-with-r-and-studio-2nd-edition.pdf Chap 6 - Gathering Data with R (p109, PDF p138) How you gather your data directly impacts how reproducible your research will be. You should try your best to document every step of your data gathering process. Reproduction will be easier if your documentation (especially variable descriptions and source code) makes it easy for you and others to understand what you have done. If all of your data gathering steps are tied together by your source code, then independent researchers (and you) can more easily regather the data. Regathering data will be easiest if running your code allows you to get all the way back to the raw data files –the rawer the better. Chap 7 - Preparing Data for Analysis (p129, PDF p158) Once we have gathered the raw data that we want to include in our statistical analyses we generally need to clean it up so that it can be merged into a single data file.
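
Note 7 ends with merging cleaned data into a single data file, and slide 16 states that linking two tables requires a common attribute, usually an identifier. The Python sketch below illustrates such a join on two small TSV tables. It is an assumption-laden example, not the ODAM implementation: the `PlantID` column, the "Stress" treatment and all values are invented, while `SampleID`, `Treatment` and `FruitFW` borrow their names from slide 13.

```python
import csv
import io
from typing import Dict, List

# Hypothetical TSV content; the values are invented for illustration.
PLANTS_TSV = "PlantID\tTreatment\nP1\tControl\nP2\tStress\n"
SAMPLES_TSV = (
    "SampleID\tPlantID\tFruitFW\n"
    "S1\tP1\t12.5\n"
    "S2\tP1\t14.0\n"
    "S3\tP2\t9.8\n"
)


def read_tsv(text: str) -> List[Dict[str, str]]:
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_on(parent_rows: List[Dict[str, str]],
             child_rows: List[Dict[str, str]],
             key: str) -> List[Dict[str, str]]:
    """Attach the parent columns to each child row sharing the same key."""
    parents = {row[key]: row for row in parent_rows}
    merged = []
    for child in child_rows:
        parent = parents.get(child[key])
        if parent is None:
            continue  # a child without a matching parent is skipped
        merged.append({**parent, **child})
    return merged


merged = merge_on(read_tsv(PLANTS_TSV), read_tsv(SAMPLES_TSV), "PlantID")
for row in merged:
    print(row)
```

Each sample row ends up carrying the treatment of the plant it was obtained from, so factors and measurements sit in one table ready for analysis.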