Metadata management for data
storage spaces
Contributors:
François Ehrenmann (UMR BioGECO)
Philippe Chaumeil (UMR BioGECO)
Daniel Jacob (UMR BFP)
INRAE - Indexator – October 2022
• Implementing a Data Management Plan (DMP) comes with some prerequisites, such as storing the data to be preserved
outside the users' own disk space.
• This concerns not only published data but all data produced during
the course of a project.
• This is even more necessary when temporary staff (doctoral
students, post-docs, trainees, fixed-term contracts) are involved in
the production of data.
Data Management Plan
How to encourage the structures (Units, Platforms, ...) to better manage their data?
Data storage
• The central idea is that the storage space becomes the data repository, so the
metadata should go to the data and not the other way around.
Metadata
Your data repository
• There is a concern about how these storage spaces are organised.
• Should they be harmonised, i.e. should good practices be imposed, such as i) folder and file naming, ii) folder structure (docs, data, scripts, etc.), iii) the use of README files, iv) etc.?
• At the very least, using a README file seems the simplest and least restrictive option. But what should be put in it?
• How can these files be used effectively when looking for information? With what vocabulary?
Project data storage space: put a metadata file (JSON format) describing the project data within each subdirectory.
JSON was chosen as the format because it is well suited to describing metadata and is readable by both humans and machines.
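As an illustration of this idea, here is a minimal Python sketch that drops a JSON metadata file next to the data it describes and reads it back. The field names (`title`, `description`, `keywords`), the file name `metadata.json` and the helpers `write_metadata` / `read_metadata` are assumptions made for the example, not the tool's actual vocabulary; in practice each collective defines its own terms.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(project_dir, metadata, filename="metadata.json"):
    """Write `metadata` as an indented JSON file inside `project_dir`."""
    path = Path(project_dir) / filename
    # indent=2 keeps the file readable by humans; JSON keeps it machine-readable
    path.write_text(json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


def read_metadata(path):
    """Load a metadata file back into a Python dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical metadata for one project subdirectory
example = {
    "title": "Atacama",
    "description": "Example project description",
    "keywords": ["experiment", "omics"],
}

with tempfile.TemporaryDirectory() as storage:
    project = Path(storage) / "Atacama"
    project.mkdir()
    written = write_metadata(project, example)
    roundtrip = read_metadata(written)
```

Because the file lives in the same directory as the data, moving or archiving the directory keeps the description attached to it.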
Web interface: generate the metadata file (JSON), then deposit it in the data storage.
Since producing JSON files by hand is tricky for users, a web interface is provided to create them.
Web interface: search datasets based on some metadata, then view the metadata.
Once the metadata files have been deposited, the data storage is scanned. Then, find projects and/or data corresponding to your criteria.
Questions immediately raised:
What metadata?
How to specify it?
From which vocabulary?
How to generate a JSON file?
• Given the diversity of domains, the approach chosen is to be both as flexible and as
pragmatic as possible by allowing each collective to choose its own (controlled) vocabulary
corresponding to the reality of its field and activities.
• The main idea is to be able to "capture" the user's metadata as easily as possible using their
vocabulary.
The web interface
must therefore correspond to the scientific and experimental context
of the collective (research unit, project, platform, ...)
Web interface for metadata entry
Generate the metadata file (JSON)
[Screenshot of the entry form, annotated with: Sections, Fields, Type (textbox, dropbox, checkbox), Features (e.g. width=350px, width=500px, open), Predefined terms]
Definition of metadata
config_terms.txt
• Terminology definition file in Tab-Separated Values (TSV) format
• Based on a (controlled) vocabulary specified by the data manager of a collective (research unit, platform, ...),
all the metadata to be entered can be fully configured using only one configuration file (TSV format).
It is possible to define the whole terminology using a spreadsheet.
[Screenshot of config_terms.txt, annotated with: Fields, Sections, Type, Features, Predefined terms]
Structure of the terminology definition file (config_terms.txt)
• column 1 - Field: short name of the field
• column 2 - Section: short name of the section
• column 3 - Search: indicates whether the field can be used as a search criterion ('Y') or not ('N')
• column 4 - Shortview: indicates, with ordered numbers, whether the field appears in the overview table shown after a search (empty by default)
• column 5 - Type: indicates how the field is entered via the web interface (possible values are: textbox, dropbox, checkbox and areabox)
• column 6 - Features: depending on the Type value, some specific features can be specified. Several features must be separated by a comma.
   • for checkbox: open=0 or open=1 indicates whether the selection is open or not
   • for textbox & checkbox: autocomplete=item. The items.js file must be present under web/js/autocomplete
   • for textbox & dropbox: width=NNNpx specifies the width of the box. Useful if you want to put several fields on the same line
   • for areabox: row=NN and cols=NN specify the row and column size of the textarea
• column 7 - Label: label of the field as it will appear in the web interface
• column 8 - Predefined terms: for fields of type 'checkbox' or 'dropbox', a list of terms separated by a comma
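To make the eight-column layout concrete, the sketch below reads such a TSV file with Python's standard `csv` module. It is only an illustration: the sample rows, the dictionary keys in `COLUMNS` and the helpers `parse_features` / `parse_terms_file` are hypothetical, and the sketch assumes the file has no header line.

```python
import csv
import io

# Keys chosen for this example, one per column described above
COLUMNS = ["field", "section", "search", "shortview", "type", "features", "label", "terms"]


def parse_features(text):
    """Turn 'width=350px,open=1' into {'width': '350px', 'open': '1'}."""
    features = {}
    for item in filter(None, (part.strip() for part in text.split(","))):
        key, _, value = item.partition("=")
        features[key.strip()] = value.strip()
    return features


def parse_terms_file(stream):
    """Read the 8-column TSV and return one dictionary per field."""
    fields = []
    for row in csv.reader(stream, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        # Pad short rows so that missing trailing columns become empty strings
        row = (row + [""] * len(COLUMNS))[: len(COLUMNS)]
        entry = dict(zip(COLUMNS, (cell.strip() for cell in row)))
        entry["search"] = entry["search"].upper() == "Y"
        entry["features"] = parse_features(entry["features"])
        entry["terms"] = [t.strip() for t in entry["terms"].split(",") if t.strip()]
        fields.append(entry)
    return fields


# Hypothetical two-line configuration
sample = (
    "title\tdefinition\tY\t1\ttextbox\twidth=350px\tShort name\t\n"
    "status\tdefinition\tN\t\tdropbox\twidth=350px\tStatus\tIn progress,Completed\n"
)
fields = parse_terms_file(io.StringIO(sample))
```

Since the file is plain TSV, the same content can be edited in a spreadsheet and exported without any other tooling.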
Architecture diagram
[Diagram: Docker containers (web server with its web interface, MongoDB), input/output files (under web/json) and the data storage]
Configuration / Initialization steps (http://mysite.org/pgd-mmdt/config)
• Terminology definition file config_terms.txt (Tab-Separated Values). Important: it must be defined in the first step and then no longer changed.
• From config_terms.txt, the web interface (config) generates config_terms.json and pgd-mmdt-schema.json.
• initdb initialises MongoDB.
Normal operating mode
• Metadata entry: the web interface, with its options taken from config_terms.json, creates PGD_XXXXX.json (linked to pgd-mmdt-schema.json), which is deposited in the data storage.
• Project search: a scan of the data storage (cron) inserts the metadata into MongoDB; the web interface is then used to search and to view the metadata.
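The scan step of the normal operating mode can be pictured with the sketch below. It walks a storage tree, loads every file whose name matches `PGD_*.json`, and filters the collected metadata on one criterion. The real tool inserts the metadata into MongoDB; here a plain Python list stands in for the database, and the `title` / `keywords` keys as well as the helpers `scan_storage` / `search` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def scan_storage(root):
    """Return (directory, metadata) pairs for every PGD_*.json file under `root`."""
    found = []
    for path in sorted(Path(root).rglob("PGD_*.json")):
        try:
            metadata = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or invalid files are skipped, not fatal
        found.append((path.parent, metadata))
    return found


def search(index, key, value):
    """Keep the entries whose metadata field `key` equals or contains `value`."""
    hits = []
    for directory, metadata in index:
        field = metadata.get(key)
        if field == value or (isinstance(field, list) and value in field):
            hits.append(directory.name)
    return hits


# Build a throwaway storage tree with two project directories
with tempfile.TemporaryDirectory() as storage:
    for name, keywords in [("Atacama", ["omics"]), ("Other", ["image"])]:
        project = Path(storage) / name
        project.mkdir()
        (project / f"PGD_{name}.json").write_text(
            json.dumps({"title": name, "keywords": keywords}), encoding="utf-8"
        )
    index = scan_storage(storage)
    omics_projects = search(index, "keywords", "omics")
```

Running such a scan periodically (the slide mentions cron) is what keeps the index in step with the storage without asking users to register anything twice.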
Web interface for search
http://mysite.org/pgd-mmdt/search
Web interface for search: Short View of the results
http://mysite.org/pgd-mmdt/search#results
Web interface for metadata
http://mysite.org/pgd-mmdt/metadata/Atacama
Web interface: adding new predefined terms
• The first time a new term is needed, it is entered via the web interface and saved in the PGD_XXXXX.json file deposited in the data storage.
• Once the data storage has been scanned (cron), this new term is available for other users / datasets.
Web interface: autocompletion
Example with the API « Découpage administratif » (Administrative division)
web/js/autocomplete/cities.js

    // Fill the autocompletion list with the names of the communes
    var cities = [];
    $.getJSON("https://geo.api.gouv.fr/communes", function (data) {
        $.each(data, function (index, value) { cities.push(value['nom']); });
    });
Web interface: autocompletion
Example with the EDAM ontology in BioPortal
https://bioportal.bioontology.org/ontologies/EDAM/?p=classes
web/js/autocomplete/edam_data.js

    // Get all descendant classes of the 'Data' class
    edam_data = [];
    get_terms_from_bioportal('EDAM', 'http://edamontology.org/data_0006', 'edam_data');

For information about the BioPortal API: https://data.bioontology.org/documentation
web/json/config_terms.json

    "datatype": {
        "titre": "Data type",
        "autocomplete": "edam_data",
        "width": "350px"
    }

With autocompletion, the user chooses from 947 terms.
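How the web interface derives config_terms.json from the terminology definition file is not detailed on the slide. The sketch below works under the assumption that the Label column becomes the `titre` key and that each feature becomes a key of its own, which reproduces the `datatype` fragment shown on this slide. `to_config_entry` is a hypothetical helper, not a function of the tool.

```python
import json


def to_config_entry(label, features):
    """Build one config entry from a label and a 'key=value,...' features string."""
    entry = {"titre": label}
    for item in features.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        entry[key.strip()] = value.strip()
    return entry


# Values taken from the slide: field 'datatype' with EDAM autocompletion
config = {"datatype": to_config_entry("Data type", "autocomplete=edam_data,width=350px")}
print(json.dumps(config, indent=3))
```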
Web interface: autocompletion
Example with the Thesaurus INRAE
https://vocabulaires-ouverts.inrae.fr/a-propos-du-thesaurus-inrae/
https://consultation.vocabulaires-ouverts.inrae.fr/api/
web/js/autocomplete/VOvocab.js

    keywords = [
        'data', 'report', 'simulation', 'model', 'image', 'script',
        'omics', 'statistic', 'scientific', 'research', 'document',
        'experiment', 'video', 'spatial', 'instrument'
    ]
    VOvocab = [];
    get_terms_from_voinrae(keywords, 'VOvocab')

With autocompletion, the user chooses from 405 terms.
Web interface: Resources
The "description" field should make it possible to better annotate the data,
while the "location" field should make it possible to:
1) extend the perimeter of the data beyond the local space,
2) eventually break free from the local space when one wishes to
disseminate the metadata alone.
A location can be anything: a text, an absolute path in a tree, a URL, ...
A link to a publication can thus be given: Type=article, link=DOI
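Because a location is free-form, anything that consumes the metadata has to guess what kind of location it is given. The sketch below shows one possible heuristic. The resource entries, the DOI `10.1234/example` and the helper `classify_location` are hypothetical and only illustrate the three notions named on the slide (type, description, location).

```python
import re


def classify_location(location):
    """Guess the kind of a free-form location: 'doi', 'url', 'path' or 'text'."""
    value = location.strip()
    if re.match(r"^(doi:|https?://(dx\.)?doi\.org/)", value, re.IGNORECASE):
        return "doi"
    if re.match(r"^https?://", value, re.IGNORECASE):
        return "url"
    # Absolute POSIX path, or Windows path starting with a drive letter
    if value.startswith("/") or re.match(r"^[A-Za-z]:\\", value):
        return "path"
    return "text"


# Hypothetical resource entries
resources = [
    {"type": "article", "description": "Related publication", "location": "https://doi.org/10.1234/example"},
    {"type": "data", "description": "Raw files", "location": "/mnt/nas/project/raw"},
    {"type": "note", "description": "Samples", "location": "Freezer 3, room 12"},
]
kinds = [classify_location(r["location"]) for r in resources]
```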
Resource example 1: Atacama
[Screenshots: creation of the JSON metadata file, and the metadata viewer]
Resource example 2: Link to NextCloud
Put a NextCloud link pointing to the data repository.
Access is thus limited to those who have rights!
Resource example 3: Indicate the path on an external storage
If providing a URL is not possible, nevertheless
give clear indications of the location of the data.
Installation: Local, Remote or Mixed
[Diagram: a VM hosting the web server, connected to the data storage]
• VM: local VM or remote VM (datacenter), with 2 CPUs, 2 GB RAM and 10 GB of disk
• Data storage: storage located on the VM, or local storage mounted on the VM (NAS server, Google Drive)
• Access: VPN (GlobalProtect), WinSCP
• Successful testing
Scan of a NextCloud storage space mounted with rclone
rclone configuration:

    [ncloud]
    type = webdav
    url = https://nextcloud.inrae.fr/remote.php/webdav/
    vendor = nextcloud
    user = XXXXX
    pass = XXXXX

Mount command:

    rclone mount ncloud:MTH2-PF-Bordeaux/DATA/ /mnt/ncloud/ \
        --allow-other --vfs-cache-mode minimal \
        --read-only --no-checksum --no-modtime \
        --daemon --daemon-wait 15s

NextCloud folder: https://nextcloud.inrae.fr/apps/files/?dir=/MTH2-PF-Bordeaux/DATA
Search interface: https://pmb-bordeaux.fr/ncloud/search
Pre-fill a dataset in the INRAE DATA dataverse (via API)
[Diagram: web interface, creation of the JSON file (metadata JSON file + JSON Schema pgd-mmdt-schema.json), mapping of the JSON file sections/terms onto the metadata structure of DATA INRAE, metadata JSON-LD file, push]
• A good approach is to use only a controlled vocabulary, i.e. a relevant and sufficient
vocabulary used as a reference in the field concerned, so that users can describe a project and
its context without having to add additional terms.
• A mapping of terms based on a controlled vocabulary can thus be done more easily to
generate formats corresponding to different standards (MIAPPE, JSON-LD, ...)
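To make the mapping idea concrete, here is a minimal sketch that turns a local metadata record into a JSON-LD document by attaching an IRI to each local term through `@context`. The term-to-IRI table is purely illustrative (Dublin Core IRIs are used as stand-ins) and is not the mapping used for DATA INRAE; the record keys and the helper `to_jsonld` are hypothetical as well.

```python
import json

# Hypothetical mapping from local field names to vocabulary IRIs
TERM_IRIS = {
    "title": "http://purl.org/dc/terms/title",
    "description": "http://purl.org/dc/terms/description",
    "keywords": "http://purl.org/dc/terms/subject",
}


def to_jsonld(record, term_iris=TERM_IRIS):
    """Return (JSON-LD document, sorted list of field names without a mapping)."""
    mapped = {key: value for key, value in record.items() if key in term_iris}
    unmapped = sorted(set(record) - set(term_iris))
    # The context only declares the terms actually present in the document
    context = {key: term_iris[key] for key in mapped}
    return {"@context": context, **mapped}, unmapped


record = {"title": "Atacama", "keywords": ["omics"], "local_code": "X42"}
document, unmapped = to_jsonld(record)
print(json.dumps(document, indent=2))
```

Fields without an entry in the table are reported rather than silently dropped. This is where a controlled vocabulary helps: the fewer free terms there are, the fewer fields remain unmapped.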
Pre-fill a dataset in the INRAE DATA dataverse (via API)
Example of mapping from a controlled vocabulary based on an ontology in BioPortal
• Autocompletion: get terms from the BioPortal ontology API / EDAM (http://edamontology.org/data_0006)
• Mapping: search for the chosen term with the BioPortal Search API and get the result:
https://data.bioontology.org/search?q=Gene%20expression%20profile&ontology=EDAM&subtree_root_id=http%3A%2F%2Fedamontology.org%2Fdata_0006&apikey=….
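The search URL can be assembled programmatically. The sketch below only builds the URL with the four query parameters visible on the slide (`q`, `ontology`, `subtree_root_id`, `apikey`) and does not perform the request. `bioportal_search_url` is a hypothetical helper and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import quote, urlencode

BIOPORTAL_SEARCH = "https://data.bioontology.org/search"


def bioportal_search_url(term, ontology, subtree_root_id, apikey):
    """Build the search URL restricted to one ontology and one subtree."""
    params = {
        "q": term,
        "ontology": ontology,
        "subtree_root_id": subtree_root_id,
        "apikey": apikey,
    }
    # quote_via=quote encodes spaces as %20, as in the URL shown on the slide
    return BIOPORTAL_SEARCH + "?" + urlencode(params, quote_via=quote)


url = bioportal_search_url(
    "Gene expression profile",
    "EDAM",
    "http://edamontology.org/data_0006",
    "YOUR_API_KEY",  # placeholder: a personal BioPortal API key is required
)
print(url)
```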
Pre-fill a dataset in the INRAE DATA dataverse (via API)
Example of mapping from a controlled vocabulary based on the Thesaurus INRAE
• Autocompletion: get terms from the Thesaurus INRAE API (https://consultation.vocabulaires-ouverts.inrae.fr/api/)
• Mapping: search for the chosen term with the Thesaurus INRAE API and get the result:
https://consultation.vocabulaires-ouverts.inrae.fr/rest/v1/search?vocab=thesaurus-inrae&lang=en&type=skos%3AConcept&query=metabolomics&offset=0
"Machine-Actionable Metadata"
[Diagram of the data life cycle: create the project, create the data, data analysis, preserving data, access to data, reuse of data]
• Create the project: descriptive metadata (Project) entered with the web-based metadata entry tool (TSV, PGD_XXX.json: JSON with a Schema)
• Create the data: observations, samples, experimentation, instrumentation
• Storage space for the project (NAS) associated with the metadata file
• Along the way: adding new metadata, adding resources, saving data with their metadata, converting to a suitable format (JSON-LD)
• Access to and reuse of data: metadata query (web interface and/or API)
• Mapping and push (JSON-LD) to national and international data repositories, e.g. to pre-fill a dataset in the INRAE DATA dataverse (via API)
Conclusion
The "INDEXATOR" tool allows a collective to:
• Have visibility of what is produced within the collective: data sets, software, databases, images, sounds, videos, analyses, codes, ...
• Use a controlled vocabulary specific to the domain of the collective, with mapping to other formats
embedding ontologies to be done downstream as required,
• Offer an alternative/complement to external data repositories or other thematic warehouses, so as to
know about and have access to ALL data, not only those that are published,
• Favour FAIR (at least the Findable & Accessible criteria) within the collective,
• Make newcomers and students aware of the need to better describe what they produce.
Thank you for your attention
Metadata Management for Storage Spaces: metadata aggregation & indexation
Source code: https://github.com/inrae/pgd-mmdt

Hierarchical Digital Twin of a Naval Power System
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 

Indexator_oct2022.pdf

  • 1. Metadata management for data storage spaces Contributors: François Ehrenmann (UMR BioGECO) Philippe Chaumeil (UMR BioGECO) Daniel Jacob (UMR BFP)
  • 2. INRAE - Indexator – October 2022. Data Management Plan: how to encourage the structures (units, platforms, ...) to better manage their data? • Implementing a Data Management Plan (DMP) has some prerequisites, such as storing the data to be preserved outside the users' own disk space. • This concerns not only published data but all data produced during the course of a project. • It is even more necessary when temporary staff (doctoral students, post-docs, trainees, fixed-term contracts) are involved in producing the data.
  • 3. Data storage: your data repository. • The central idea is that the storage space becomes the data repository, so the metadata should go to the data and not the other way around. • This raises concerns about how these storage spaces are organised. • Should they be harmonised by imposing good practices such as i) folder and file naming, ii) folder structure (docs, data, scripts, etc.), iii) the use of README files, etc.? • At the very least, using a README file seems the simplest and least restrictive option, but what should be put in it? • How can these files be used effectively to find information, and with what vocabulary?
  • 4. Project data storage space: put a metadata file (JSON format) describing the project data within each subdirectory. JSON was chosen because it is well suited to describing metadata and is readable by both humans and machines.
  • 5. Generate the metadata file (JSON) and deposit it in the data storage. Because producing JSON files by hand is tricky for users, a web interface is provided to create them.
  • 6. Scan the data storage, view the metadata and search datasets based on some metadata: find the projects and/or data corresponding to your criteria.
  • 7. Questions immediately raised: What metadata? How to specify it? From which vocabulary? How to generate a JSON file?
  • 8. • Given the diversity of domains, the chosen approach is to be as flexible and as pragmatic as possible, allowing each collective to choose its own (controlled) vocabulary matching the reality of its field and activities. • The main idea is to "capture" the users' metadata as easily as possible, using their own vocabulary.
  • 9. The web interface must therefore match the scientific and experimental context of the collective (research unit, project, platform, ...).
  • 10. Web interface for metadata entry (screenshot), used to generate the metadata file (JSON).
  • 11. Same screenshot, highlighting the sections.
  • 12. Same screenshot, highlighting the sections and the fields.
  • 13. Same screenshot, highlighting the type of each field (textbox, dropbox, checkbox).
  • 14. Same screenshot, highlighting the predefined terms.
  • 15. Same screenshot, highlighting the features of each field (width=350px, width=500px, open).
  • 16. Definition of metadata (config_terms.txt). • The terminology definition file is in Tab-Separated Values (TSV) format and defines the sections, fields, types, features and predefined terms. • Based on the (controlled) vocabulary specified by the data manager of a collective (research unit, platform, ...), all the metadata to be entered can be fully configured using this single configuration file. The whole terminology can be defined with a spreadsheet.
  • 17. Structure of the terminology definition file (config_terms.txt). All the metadata to be entered can be fully configured using only this one configuration file (TSV format).
      • column 1 - Field: short name of the field
      • column 2 - Section: short name of the section
      • column 3 - Search: indicates whether the field can be used as a search criterion ('Y') or not ('N')
      • column 4 - Shortview: ordered numbers indicating whether the field appears in the overview table shown after a search (empty by default)
      • column 5 - Type: how the field is entered via the web interface (possible values: textbox, dropbox, checkbox and areabox)
      • column 6 - Features: depending on the Type value, some specific features can be specified; several features must be separated by commas
          • for checkbox: open=0 or open=1 indicates whether the selection is open or not
          • for textbox & checkbox: autocomplete=item; the items.js file must be present under web/js/autocomplete
          • for textbox & dropbox: width=NNNpx specifies the width of the box, which is useful to put several fields on the same line
          • for areabox: row=NN and cols=NN specify the row and column size of the textarea
      • column 7 - Label: label of the field as it appears in the web interface
      • column 8 - Predefined terms: for fields of type 'checkbox' or 'dropbox', a list of terms separated by commas
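As a rough illustration of the eight-column layout described above, the sketch below parses such a file into one object per field. The sample rows, the presence of a header line, and the helper names `parseFeatures` and `parseTerms` are assumptions made for this example; they are not taken from the pgd-mmdt source code.

```javascript
// Hypothetical sample following the eight-column layout described above.
// Assumption: the first line is a header row; adjust slice(1) if it is not.
const sampleTsv = [
  "Field\tSection\tSearch\tShortview\tType\tFeatures\tLabel\tPredefined terms",
  "title\tdefinition\tY\t1\ttextbox\twidth=350px\tTitle\t",
  "status\tdefinition\tY\t2\tdropbox\twidth=350px\tStatus\tIn progress,Completed",
  "datatype\tresources\tN\t\tcheckbox\topen=1\tData type\tomics,image,script",
].join("\n");

// Turn "width=350px,open=1" into { width: "350px", open: "1" }.
function parseFeatures(text) {
  const features = {};
  for (const item of text.split(",")) {
    const [key, value] = item.split("=").map((s) => s.trim());
    if (key) features[key] = value === undefined ? "" : value;
  }
  return features;
}

// Parse the whole file into one plain object per field definition.
function parseTerms(tsv) {
  const lines = tsv.split(/\r?\n/).filter((line) => line.trim() !== "");
  return lines.slice(1).map((line) => {
    const [field, section, search, shortview, type, features, label, terms] =
      line.split("\t").map((column) => column.trim());
    return {
      field,
      section,
      search: search === "Y",
      shortview: shortview ? Number(shortview) : null,
      type,
      features: parseFeatures(features || ""),
      label,
      terms: (terms || "").split(",").map((t) => t.trim()).filter((t) => t !== ""),
    };
  });
}

const definitions = parseTerms(sampleTsv);
console.log(definitions);
```

Keeping the parsed result as plain objects makes it straightforward to serialise with `JSON.stringify`, which is in the spirit of the JSON configuration the tool derives from this file.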
  • 18. Architecture diagram (figure). Elements shown: Docker containers (web server, MongoDB), the data storage, and the input/output files config_terms.txt, config_terms.json, pgd-mmdt-schema.json and PGD_XXXXX.json. Configuration / initialization steps: web interface (config), generate, initdb. Normal operating mode: create, deposit, scan (cron), insert, search, view metadata. Important: the terminology definition file (Tab-Separated Values) must be defined in the first step and then no longer changed.
  • 19. Architecture diagram, configuration / initialization steps. Elements shown: terminology definition file config_terms.txt, web interface (config) at http://mysite.org/pgd-mmdt/config, generate, config_terms.json and pgd-mmdt-schema.json (web/json), initdb, MongoDB.
  • 20. Architecture diagram, metadata entry. Elements shown: web interface, options (config_terms.json, web/json), create PGD_XXXXX.json (linked to pgd-mmdt-schema.json), deposit in the data storage.
  • 21. Architecture diagram, project search. Elements shown: data storage, scan (cron), insert into MongoDB, web interface (search, view metadata), options (config_terms.json, web/json).
  • 22. Web interface for search (screenshot): http://mysite.org/pgd-mmdt/search
  • 23. Web interface for search, short view of the results (screenshot): http://mysite.org/pgd-mmdt/search#results
  • 24. Web interface for metadata (screenshot): http://mysite.org/pgd-mmdt/metadata/Atacama
  • 25. Web interface: adding new predefined terms (options come from the terminology definition file). The first time a new term is needed, it is entered in the web interface and saved in PGD_XXXXX.json; once the file has been deposited in the data storage and scanned (cron), the new term is available to other users and datasets.
  • 26. Web interface: autocompletion. Example with the API « Découpage administratif » (Administrative division); the field is declared in the terminology definition file. web/js/autocomplete/cities.js: var cities=[]; $.getJSON("https://geo.api.gouv.fr/communes", function (data) { $.each(data, function (index, value) { cities.push(value['nom']); }); });
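The snippet on this slide relies on jQuery. A minimal sketch of the same idea with the standard `fetch` API is given below. `extractNames` and `loadCities` are hypothetical helper names, and the only assumption about the API response is the one the slide already makes, namely that it is an array of objects with a `nom` property. Unlike the original snippet, this version also drops duplicate names, which is a design choice of the sketch.

```javascript
// Keep the 'nom' property of each record, dropping duplicates and empty values.
function extractNames(communes) {
  const names = new Set();
  for (const commune of communes) {
    if (commune && typeof commune.nom === "string" && commune.nom !== "") {
      names.add(commune.nom);
    }
  }
  return Array.from(names);
}

// Fetch the list of communes and return their names for the autocompletion list.
// Not called here, so the example runs without network access.
async function loadCities(url = "https://geo.api.gouv.fr/communes") {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  return extractNames(await response.json());
}

// Offline check with hypothetical records shaped like the API response.
const demoCities = extractNames([
  { nom: "Bordeaux" },
  { nom: "Cestas" },
  { nom: "Bordeaux" },
]);
console.log(demoCities);
```

In a browser, `loadCities().then((cities) => { /* feed the autocompletion widget */ })` would play the role of the `cities` array filled by the jQuery version.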
  • 27. Web interface: autocompletion. Example with https://bioportal.bioontology.org/ontologies/EDAM/?p=classes (choose from 947 terms). web/js/autocomplete/edam_data.js gets all descendant classes of the 'Data' class: edam_data=[]; get_terms_from_bioportal('EDAM', 'http://edamontology.org/data_0006', 'edam_data'); web/json/config_terms.json: "datatype":{ "titre":"Data type", "autocomplete":"edam_data", "width":"350px" } For information about the BioPortal API, see https://data.bioontology.org/documentation
  • 28. Web interface: autocompletion. Example with the Thesaurus INRAE: https://vocabulaires-ouverts.inrae.fr/a-propos-du-thesaurus-inrae/
  • 29. Web interface: autocompletion. Example with https://consultation.vocabulaires-ouverts.inrae.fr/api/ (choose from 405 terms); the field is declared in the terminology definition file. web/js/autocomplete/VOvocab.js: keywords = [ 'data', 'report', 'simulation', 'model', 'image', 'script', 'omics', 'statistic', 'scientific', 'research', 'document', 'experiment', 'video', 'spatial', 'instrument' ]; VOvocab=[]; get_terms_from_voinrae(keywords,'VOvocab');
  • 30. Web interface: resources (declared in the terminology definition file). The "description" field should make it possible to annotate the data better, while the "location" field should make it possible 1) to extend the perimeter of the data beyond the local space and 2) eventually to become independent of the local space when one wishes to disseminate the metadata alone. A location can be anything: a text, an absolute path in a tree, a URL, ... A link to a publication can thus be given: Type=article, link=DOI.
  • 31. Resource example 1: Atacama (screenshots: creation of the JSON metadata file, metadata viewer).
  • 32. Resource example 2: link to NextCloud. Put a NextCloud link pointing to the data repository. Access is thus limited to those who have the rights.
  • 33. Resource example 3: indicate the path on an external storage. If providing a URL is not possible, nevertheless give clear indications of where the data are located.
  • 34. Installation: local, remote or mixed (diagram). The VM (2 CPUs, 2 GB RAM, 10 GB disk) hosts the web server and can be a local VM or a remote VM (datacenter). The data storage can be located on the VM or be local storage mounted on the VM (NAS server). VPN GlobalProtect, WinSCP: successful testing.
  • 35. Same diagram, with Google Drive added as a data storage.
  • 36. Scan of a NextCloud space mounted with rclone. NextCloud folder: https://nextcloud.inrae.fr/apps/files/?dir=/MTH2-PF-Bordeaux/DATA Search interface: https://pmb-bordeaux.fr/ncloud/search
      rclone configuration:
        [ncloud]
        type = webdav
        url = https://nextcloud.inrae.fr/remote.php/webdav/
        vendor = nextcloud
        user = XXXXX
        pass = XXXXX
      Mount command:
        rclone mount ncloud:MTH2-PF-Bordeaux/DATA/ /mnt/ncloud/ --allow-other --vfs-cache-mode minimal --read-only --no-checksum --no-modtime --daemon --daemon-wait 15s
  • 37. From the web interface to other formats (diagram). Elements shown: web interface, creation of the JSON file; metadata JSON file + pgd-mmdt-schema.json (JSON Schema); metadata JSON-LD file; mapping of the JSON file sections/terms to the metadata structure in DATA INRAE; push to pre-fill a dataset in the INRAE DATA dataverse (via API). • A good approach is to use only controlled vocabulary, i.e. a relevant and sufficient vocabulary used as a reference in the field concerned, so that users can describe a project and its context without having to add terms. • Terms based on a controlled vocabulary can then be mapped more easily to generate formats corresponding to different standards (MIAPPE, JSON-LD, ...).
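To make the idea of mapping controlled terms to another format more concrete, here is a minimal sketch that turns a flat metadata object into a JSON-LD document. The local field names, the mapping table `fieldToProperty` and the function `toJsonLd` are all hypothetical; the choice of schema.org `Dataset` properties is an assumption of this example and is not the mapping used by the tool.

```javascript
// Hypothetical mapping from local field names to schema.org Dataset properties.
const fieldToProperty = {
  title: "name",
  description: "description",
  keywords: "keywords",
};

// Build a JSON-LD document, keeping only the fields that have a mapping.
function toJsonLd(metadata, mapping) {
  const doc = { "@context": "https://schema.org", "@type": "Dataset" };
  for (const [field, value] of Object.entries(metadata)) {
    const property = mapping[field];
    if (property !== undefined) {
      doc[property] = value; // unmapped fields are deliberately left out
    }
  }
  return doc;
}

// Hypothetical metadata record, as it might be read from a metadata JSON file.
const exampleJsonLd = toJsonLd(
  { title: "Atacama", keywords: ["omics", "experiment"], internal_id: 7 },
  fieldToProperty
);
console.log(JSON.stringify(exampleJsonLd, null, 2));
```

Because the mapping is a plain lookup table, a collective that sticks to a controlled vocabulary only has to maintain one such table per target standard.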
  • 38. Example of mapping from a controlled vocabulary based on an ontology in BioPortal (EDAM). Autocompletion gets its terms from the BioPortal API (http://edamontology.org/data_0006); the mapping serves to pre-fill a dataset in the INRAE DATA dataverse (via API).
  • 39. Same example, adding the BioPortal search API call used to get the mapping: https://data.bioontology.org/search?q=Gene%20expression%20profile&ontology=EDAM&subtree_root_id=http%3A%2F%2Fedamontology.org%2Fdata_0006&apikey=….
  • 40. Example of mapping from a controlled vocabulary based on the Thesaurus INRAE. Autocompletion gets its terms from the Thesaurus INRAE API (https://consultation.vocabulaires-ouverts.inrae.fr/api/); the mapping serves to pre-fill a dataset in the INRAE DATA dataverse (via API).
  • 41. Same example, adding the Thesaurus INRAE search API call used to get the mapping: https://consultation.vocabulaires-ouverts.inrae.fr/rest/v1/search?vocab=thesaurus-inrae&lang=en&type=skos%3AConcept&query=metabolomics&offset=0
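The Thesaurus INRAE search URL shown above can be assembled programmatically rather than by hand. The sketch below uses the standard `URL` and `URLSearchParams` objects; the endpoint and parameter names are copied from the slide, while the function name `buildThesaurusSearchUrl` is hypothetical.

```javascript
// Build the Thesaurus INRAE search URL for a given query term.
// Endpoint and parameters are those shown on the slide.
function buildThesaurusSearchUrl(query, lang = "en") {
  const url = new URL(
    "https://consultation.vocabulaires-ouverts.inrae.fr/rest/v1/search"
  );
  url.searchParams.set("vocab", "thesaurus-inrae");
  url.searchParams.set("lang", lang);
  url.searchParams.set("type", "skos:Concept"); // serialised as skos%3AConcept
  url.searchParams.set("query", query);
  url.searchParams.set("offset", "0");
  return url.toString();
}

const searchUrl = buildThesaurusSearchUrl("metabolomics");
console.log(searchUrl);
```

Letting `URLSearchParams` handle the escaping avoids encoding mistakes when the query term contains spaces or punctuation. Note that it writes spaces as `+` rather than `%20`, which servers normally treat as equivalent in a query string.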
  • 42. Data life cycle with "machine-actionable metadata" (diagram). Create the project: descriptive metadata (project), web-based metadata entry tool, JSON with a schema. Create the data: observations, samples, experimentation, instrumentation. Data analysis: • adding new metadata • adding resources. Preserving data: storage space for the project (NAS) associated with the metadata file (PGD_XXX.json, TSV, …) • saving data with their metadata • converting to a suitable format (JSON-LD). Access to data and reuse of data: metadata query (web interface and/or API); mapping and push to national and international data repositories, e.g. pre-filling a dataset in the INRAE DATA dataverse (via API).
  • 43. Conclusion. The "INDEXATOR" tool allows a collective to: • gain visibility of what is produced within the collective (data sets, software, databases, images, sounds, videos, analyses, code, ...), • use a controlled vocabulary specific to the domain of the collective, with mapping to other formats embedding ontologies done downstream as required, • offer an alternative or complement to external data repositories and other thematic warehouses, so as to know about and access ALL data, not only published data, • favour FAIR (at least the Findable and Accessible criteria) within the collective, • make newcomers and students aware of the need to describe what they produce better.
  • 44. Thank you for your attention. Metadata Management for Storage Spaces: metadata aggregation & indexation. Source code: https://github.com/inrae/pgd-mmdt