Metadata management for data storage spaces
INDEXATOR is a metadata management tool that addresses the problems of organising, documenting, storing and sharing data in a research unit or infrastructure, and it fits naturally into a collective's data management plan.
The central idea is that the storage space becomes the data repository, so the metadata should go to the data and not the other way around.
Given the diversity of domains, the approach chosen is to be as flexible and pragmatic as possible by allowing each collective to choose its own (controlled) vocabulary corresponding to the reality of its field and activities. The main idea is to "capture" the user's metadata as easily as possible, using their own vocabulary. The whole terminology can be defined in a spreadsheet.
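As a sketch of how a spreadsheet-defined terminology could be ingested (the column names and file layout below are illustrative assumptions, not INDEXATOR's actual format), a CSV export of the vocabulary sheet might be parsed like this:

```python
import csv
import io

# Hypothetical terminology sheet: each row defines one metadata field
# (field name, type, allowed values). The columns are assumptions made
# for this example, not INDEXATOR's real schema.
VOCAB_CSV = """field,type,allowed_values
species,text,
organ,choice,leaf;root;fruit
platform,choice,NMR;LC-MS
"""

def load_vocabulary(text):
    """Parse the spreadsheet export into {field: {"type": ..., "allowed": [...]}}."""
    vocab = {}
    for row in csv.DictReader(io.StringIO(text)):
        allowed = row["allowed_values"].split(";") if row["allowed_values"] else []
        vocab[row["field"]] = {"type": row["type"], "allowed": allowed}
    return vocab

vocab = load_vocabulary(VOCAB_CSV)
print(vocab["organ"]["allowed"])  # ['leaf', 'root', 'fruit']
```

A free-text field simply has an empty "allowed" list, while choice fields carry the controlled terms each collective decided on.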
JSON was chosen as the metadata format: it is well suited to describing metadata and is readable by both humans and machines.
This tool is built around a web interface coupled with a MongoDB database. The web interface allows you to i) describe a dataset using metadata of various types (Description), and ii) search datasets by their metadata (Accessibility).
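To make the human- and machine-readable claim concrete, here is a minimal sketch of what one dataset's JSON metadata document might look like; the field names are illustrative assumptions, not INDEXATOR's schema:

```python
import json

# A sketch of a dataset description. Field names here are assumptions
# chosen for illustration; a real deployment would use the collective's
# own controlled vocabulary.
metadata = {
    "project": "FruitQuality2022",
    "description": "NMR profiling of tomato fruit extracts",
    "contact": "someone@example.org",
    "keywords": ["metabolomics", "tomato", "NMR"],
    "created": "2022-10-01",
}

# Serialised form is readable by humans; parsing it back shows it is
# equally readable by machines (and storable as-is in MongoDB).
text = json.dumps(metadata, indent=2, ensure_ascii=False)
restored = json.loads(text)
```

The same document can be inserted unchanged into a MongoDB collection, which is what makes the metadata directly searchable.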
1. Metadata management for data storage spaces
Contributors:
François Ehrenmann (UMR BioGECO)
Philippe Chaumeil (UMR BioGECO)
Daniel Jacob (UMR BFP)
2. INRAE - Indexator – October 2022
Data Management Plan
• The implementation of a Data Management Plan (DMP) involves some prerequisites, such as outsourcing the data to be preserved outside the users' disk space.
• This concerns not only published data but all data produced during the course of a project.
• This is all the more necessary when temporary staff (doctoral students, post-docs, trainees, fixed-term contracts) are involved in the production of data.
How to encourage the structures (Units, Platforms, ...) to better manage their data?
3. Data storage
• The central idea is that the storage space becomes the data repository, so the metadata should go to the data and not the other way around.
• This raises concerns about the organisation of these storage spaces. Should they be harmonised, i.e. should good practices be imposed, such as i) folder and file naming, ii) folder structure (docs, data, scripts, etc.), iii) the use of README files?
• At least the use of a README file seems the simplest and least restrictive, but what should go in it?
• And how can these files be used effectively when you want to find information? With what vocabulary?
4. Data storage – project data storage space: put a metadata file (JSON format) describing the project data within each subdirectory.
• JSON was chosen because it is well suited to describing metadata and is readable by both humans and machines.
5. INRAE - Indexator – October 2022
Generate the
metadata file (JSON)
Data storage
Web interface
Project data storage space :
Put a metadata file (JSON format)
describing the project data within each
subdirectory
• The central idea is that the storage space becomes the data repository, so the
metadata should go to the data and not the other way around.
The choice was made for the JSON format,
which is very appropriate for describing
metadata, readable by both humans and
machines
Since producing files in JSON format by hand is tricky for users, a web interface makes it possible to create them.
How to encourage the structures (Units, Platforms,...) to better manage their data
deposit
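As an illustration of what the web interface generates, a metadata file for a project might look like the following minimal sketch (the section and field names shown here are hypothetical; the real ones are whatever the collective's terminology defines, and "Atacama" is the example project used later in the deck):

```json
{
  "project": {
    "name": "Atacama",
    "manager": "J. Doe",
    "date": "2022-10"
  },
  "dataset": {
    "datatype": "Gene expression profile",
    "keywords": ["omics", "experiment"],
    "description": "Measurement data collected for the Atacama project",
    "location": "https://nextcloud.example.org/..."
  }
}
```

The file is deposited in the project's subdirectory, next to the data it describes.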
6. INRAE - Indexator – October 2022
View
Metadata
Generate the
metadata file (JSON)
Search datasets based
on some metadata
deposit
scan
Data storage
Web interface
Project data storage space :
Put a metadata file (JSON format)
describing the project data within each
subdirectory
Then, find projects and/or data
corresponding to your criteria
• The central idea is that the storage space becomes the data repository, so the
metadata should go to the data and not the other way around.
How to encourage the structures (Units, Platforms,...) to better manage their data
7. INRAE - Indexator – October 2022
How to encourage the structures (Units, Platforms,...) to better manage their data
What metadata?
How to specify it?
From which vocabulary?
How to generate a JSON file?
Questions immediately raised
8. INRAE - Indexator – October 2022
• Given the diversity of domains, the approach chosen is to be both as flexible and as
pragmatic as possible by allowing each collective to choose its own (controlled) vocabulary
corresponding to the reality of its field and activities.
• The main idea is to be able to "capture" the user's metadata as easily as possible using their
vocabulary.
How to encourage the structures (Units, Platforms,...) to better manage their data
What metadata?
How to specify it?
From which vocabulary?
How to generate a JSON file?
Questions immediately raised
9. INRAE - Indexator – October 2022
• The main idea is to be able to "capture" the user's metadata as easily as possible using their
vocabulary.
How to encourage the structures (Units, Platforms,...) to better manage their data
The web interface
must therefore correspond to the scientific and experimental context
of the collective (research unit, project, platform, ...)
What metadata?
How to specify it?
From which vocabulary?
How to generate a JSON file?
Questions immediately raised
• Given the diversity of domains, the approach chosen is to be both as flexible and as
pragmatic as possible by allowing each collective to choose its own (controlled) vocabulary
corresponding to the reality of its field and activities.
10. INRAE - Indexator – October 2022
…
Web interface for metadata entry
Generate the metadata file (JSON)
11. INRAE - Indexator – October 2022
Sections
…
Web interface for metadata entry
Generate the metadata file (JSON)
12. INRAE - Indexator – October 2022
…
Web interface for metadata entry
Generate the metadata file (JSON)
Sections
Fields
13. INRAE - Indexator – October 2022
…
Web interface for metadata entry
textbox
dropbox textbox
checkbox
dropbox
textbox textbox
checkbox
Generate the metadata file (JSON)
Type
Sections
Fields
14. INRAE - Indexator – October 2022
…
Web interface for metadata entry
textbox
dropbox textbox
checkbox
dropbox
textbox textbox
checkbox
Generate the metadata file (JSON)
Predefined terms
Sections
Fields
Type
15. INRAE - Indexator – October 2022
Sections
Predefined terms
…
Web interface for metadata entry
Fields
width=350px width=350px
width=350px width=500px
open
textbox
dropbox textbox
checkbox
dropbox
textbox textbox
checkbox
Generate the metadata file (JSON)
Features
Type
16. INRAE - Indexator – October 2022
…
Fields Sections Type Features Predefined terms
config_terms.txt
Definition of metadata
• Terminology definition file in Tab-Separated Values (TSV)
• Based on a (controlled) vocabulary specified by the data manager of a collective (research unit, platform, ...)
All the metadata to be entered can be fully configured using a single configuration file (TSV format).
It is possible to define the whole terminology using a spreadsheet.
17. INRAE - Indexator – October 2022
• column 1 - Field : short name of the field
• column 2 - Section : short name of the section
• column 3 - Search : indicates whether the field can be used as a search criterion ('Y') or not ('N')
• column 4 - Shortview : ordered numbers indicating whether (and where) the field appears in the overview table shown after a search (empty by default)
• column 5 - Type : indicates how the field is entered via the web interface (possible values: textbox, dropbox, checkbox and areabox)
• column 6 - Features : depending on the Type value, specific features can be set; several features must be separated by commas
• for checkbox: open=0 or open=1 indicates whether the selection is open or not
• for textbox & checkbox: autocomplete=item ; the items.js file must be present under web/js/autocomplete
• for textbox & dropbox: width=NNNpx sets the width of the box; useful if you want to put several fields on the same line
• for areabox: row=NN and cols=NN set the row and column sizes of the textarea
• column 7 - Label : the label of the field as it will appear in the web interface
• column 8 - Predefined terms : for fields of type 'checkbox' or 'dropbox', a comma-separated list of terms can be given
Structure of the Terminology definition file
Definition of metadata
config_terms.txt
All the metadata to be entered can be fully configured using a single configuration file (TSV format).
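To make the column layout concrete, here are two hypothetical lines of config_terms.txt (tab-separated; the field names, labels and predefined terms are illustrative, except datatype, whose features appear later in the deck):

```tsv
datatype	dataset	Y	2	textbox	autocomplete=edam_data,width=350px	Data type	
species	dataset	Y	3	dropbox	width=350px	Species	Arabidopsis,Populus,Quercus
```

Columns, in order: Field, Section, Search, Shortview, Type, Features, Label, Predefined terms.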
18. INRAE - Indexator – October 2022
Architecture diagram
[Diagram] Two phases:
• Configuration / Initialization steps: the Terminology definition file config_terms.txt (Tab-Separated Values) is used to generate config_terms.json and pgd-mmdt-schema.json, which configure the web interface and initialize the MongoDB database (initdb). Important: the terminology must be defined in the first step and then no longer changed.
• Normal operating mode: the web interface creates PGD_XXXXX.json metadata files, linked to pgd-mmdt-schema.json (options), which are deposited in the data storage space; a scan (cron) inserts them into MongoDB; the web interface then provides search and metadata viewing.
Legend: Docker Containers (web server, MongoDB) / Input / Output files / Data storage.
19. INRAE - Indexator – October 2022
Architecture diagram (zoom: Configuration / Initialization steps)
[Diagram] The Terminology definition file config_terms.txt (Tab-Separated Values) is used to generate config_terms.json and pgd-mmdt-schema.json under web/json and to initialize the MongoDB database (initdb); configuration is done via the web interface at http://mysite.org/pgd-mmdt/config. Important: the terminology must be defined in the first step and then no longer changed.
Legend: Docker Containers / Input / Output files.
20. INRAE - Indexator – October 2022
Architecture diagram (zoom: Metadata entry)
[Diagram] The web interface (configured from config_terms.json under web/json) creates a PGD_XXXXX.json file, linked to pgd-mmdt-schema.json (options), which is deposited in the data storage space.
Legend: Docker Containers / Input / Output files.
22. INRAE - Indexator – October 2022
…
http://mysite.org/pgd-mmdt/search
Web interface for search
23. INRAE - Indexator – October 2022
http://mysite.org/pgd-mmdt/search#results
Web interface for search
Short View
24. INRAE - Indexator – October 2022
http://mysite.org/pgd-mmdt/metadata/Atacama
Web interface for metadata
…
25. INRAE - Indexator – October 2022
Web interface : Add new predefined terms
[Diagram] The first time a new term is needed, it is entered via the web interface and deposited with the PGD_XXXXX.json file in the data storage; the scan (cron, with options) picks it up, and the new term is then available for other users / datasets alongside the Terminology definition file.
26. INRAE - Indexator – October 2022
web/js/autocomplete/cities.js
Web interface
Example with
Web interface : autocompletion
.
API « Découpage administratif » (Administrative division)
// Collect all French commune names from the geo.api.gouv.fr API for autocompletion
var cities = [];
$.getJSON("https://geo.api.gouv.fr/communes", function (data) {
  $.each(data, function (index, value) { cities.push(value['nom']); });
});
Terminology definition file
27. INRAE - Indexator – October 2022
// Get all descendant classes of the 'Data' class
edam_data = [];
get_terms_from_bioportal('EDAM', 'http://edamontology.org/data_0006', 'edam_data');
web/js/autocomplete/edam_data.js
To get information about the BioPortal API : https://data.bioontology.org/documentation
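The helper get_terms_from_bioportal is not shown in the deck; the sketch below is one hypothetical way it could be implemented against the documented BioPortal REST API. The /descendants endpoint, the paged response fields (pageCount, collection, prefLabel) and the apikey parameter are assumptions based on that documentation, and all names are illustrative (here the target is passed as an array rather than as a name string, as the slide does):

```javascript
// Hypothetical sketch of get_terms_from_bioportal (not the deck's actual code).

// Build the paged "descendants" URL for a class IRI; pure and easy to test.
function bioportalDescendantsUrl(ontology, classIri, apikey, page) {
  return "https://data.bioontology.org/ontologies/" + ontology +
         "/classes/" + encodeURIComponent(classIri) +
         "/descendants?page=" + page + "&apikey=" + apikey;
}

// Collect the prefLabel of every descendant class into the target array.
// Assumes a paged BioPortal response of the form {page, pageCount, collection}.
async function get_terms_from_bioportal(ontology, classIri, target, apikey) {
  let page = 1, pageCount = 1;
  while (page <= pageCount) {
    const res = await fetch(bioportalDescendantsUrl(ontology, classIri, apikey, page));
    const data = await res.json();
    pageCount = data.pageCount;
    data.collection.forEach(function (cls) { target.push(cls.prefLabel); });
    page += 1;
  }
}

// Usage, mirroring the slide (API_KEY is a placeholder):
// const edam_data = [];
// get_terms_from_bioportal('EDAM', 'http://edamontology.org/data_0006', edam_data, API_KEY);
```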
Web interface : autocompletion Example with
https://bioportal.bioontology.org/ontologies/EDAM/?p=classes
"datatype": {
  "titre": "Data type",
  "autocomplete": "edam_data",
  "width": "350px"
}
web/json/config_terms.json
Web interface
Choose from 947 terms
autocompletion
28. INRAE - Indexator – October 2022
Web interface : autocompletion
https://vocabulaires-ouverts.inrae.fr/a-propos-du-thesaurus-inrae/
Example with
29. INRAE - Indexator – October 2022
Web interface : autocompletion Example with
https://consultation.vocabulaires-ouverts.inrae.fr/api/
web/js/autocomplete/VOvocab.js
Terminology definition file
keywords = [
  'data', 'report', 'simulation', 'model', 'image', 'script',
  'omics', 'statistic', 'scientific', 'research', 'document',
  'experiment', 'video', 'spatial', 'instrument'
]
VOvocab = [];
get_terms_from_voinrae(keywords, 'VOvocab')
Choose from 405 terms
autocompletion
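Similarly, get_terms_from_voinrae is not shown in the deck; the sketch below is a hypothetical implementation against the Skosmos-style search endpoint of the INRAE open vocabularies (the /rest/v1/search URL mirrors the one shown on a later slide; the results/prefLabel response fields are assumptions, and names are illustrative, with the target passed as an array rather than as a name string):

```javascript
// Hypothetical sketch of get_terms_from_voinrae (not the deck's actual code).

// Build the thesaurus-inrae search URL for one keyword; pure and testable.
function voSearchUrl(query) {
  return "https://consultation.vocabulaires-ouverts.inrae.fr/rest/v1/search" +
         "?vocab=thesaurus-inrae&lang=en&type=" + encodeURIComponent("skos:Concept") +
         "&query=" + encodeURIComponent(query) + "&offset=0";
}

// Search every keyword and collect the prefLabel of each matching concept.
// Assumes a Skosmos-style response of the form {results: [{prefLabel, uri}, ...]}.
async function get_terms_from_voinrae(keywords, target) {
  for (const kw of keywords) {
    const res = await fetch(voSearchUrl(kw));
    const data = await res.json();
    data.results.forEach(function (c) { target.push(c.prefLabel); });
  }
}
```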
30. INRAE - Indexator – October 2022
Web interface : Resources
Terminology definition file
The "description" field should make it possible to annotate the data better,
while the "location" field should make it possible to
1) extend the perimeter of the data beyond the local space, and
2) possibly become independent of the local space when one wishes to
disseminate the metadata alone.
A location can be anything: free text, an absolute path in a tree, a URL, ...
One can thus put a link to a publication: Type=article, link=DOI
31. INRAE - Indexator – October 2022
Creation
JSON metadata file
metadata viewer
Resource example 1: Atacama
32. INRAE - Indexator – October 2022
Resource example 2: Link to nextcloud
Put a NextCloud link pointing to the data repository.
Access is thus limited to those who have rights !
33. INRAE - Indexator – October 2022
Resource example 2: Link to nextcloud
Put a NextCloud link pointing to the data repository.
Access is thus limited to those who have rights !
Resource example 3: Indicate the path on an external storage
When providing a URL is not possible, nevertheless
give clear indications of where the data are located.
34. INRAE - Indexator – October 2022
VM
Data storage
Web server
Storage located on the VM
Installation : Local, Remote or Mixed
Local storage mounted on the VM
NAS Server
VPN
GlobalProtect
WinSCP
Successful
testing
Local VM
Remote VM (Datacenter)
2 CPUs, 2 GB RAM, 10 GB HD
35. INRAE - Indexator – October 2022
VM
Data storage
Web server
Local VM
Remote VM (Datacenter)
Storage located on the VM
Google Drive
2 CPUs, 2 GB RAM, 10 GB HD
Installation : Local, Remote or Mixed
Local storage mounted on the VM
NAS Server
VPN
GlobalProtect
WinSCP
Successful
testing
36. INRAE - Indexator – October 2022
scan
[ncloud]
type = webdav
url = https://nextcloud.inrae.fr/remote.php/webdav/
vendor = nextcloud
user = XXXXX
pass = XXXXX
rclone mount ncloud:MTH2-PF-Bordeaux/DATA/ /mnt/ncloud/ \
  --allow-other --vfs-cache-mode minimal \
  --read-only --no-checksum --no-modtime \
  --daemon --daemon-wait 15s
https://pmb-bordeaux.fr/ncloud/search
https://nextcloud.inrae.fr/apps/files/?dir=/MTH2-PF-Bordeaux/DATA
37. INRAE - Indexator – October 2022
Web Interface
Creation of the
JSON file
Mapping of JSON
file sections/terms
with the metadata
structure in
DATA INRAE
Pre-fill a dataset in the INRAE DATA dataverse (via API)
JSON Schema
Metadata JSON file
+
pgd-mmdt-schema.json
JSON-LD
Metadata JSON-LD file
• A good approach is to use only a controlled vocabulary, i.e. a relevant and sufficient
vocabulary used as a reference in the field concerned, so that users can describe a project and
its context without having to add extra terms.
• A mapping of terms based on a controlled vocabulary can then be done more easily to
generate formats corresponding to different standards (MIAPPE, JSON-LD, ...)
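As a sketch of what such a mapping can produce, a metadata entry whose "datatype" field is bound to the EDAM ontology might be exported as JSON-LD along these lines (a hand-written illustration, not the tool's actual output; the Dublin Core title property is used here for the project name):

```json
{
  "@context": {
    "title": "http://purl.org/dc/terms/title",
    "datatype": "http://edamontology.org/data_0006"
  },
  "title": "Atacama",
  "datatype": "Gene expression profile"
}
```

Because every value comes from a controlled vocabulary, each field can be bound once to a stable IRI in the @context.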
Push
38. INRAE - Indexator – October 2022
Example of mapping from a controlled vocabulary based on an ontology in BioPortal
autocompletion
http://edamontology.org/data_0006
API BioPortal ontology / EDAM
get terms
Pre-fill a dataset in the INRAE DATA dataverse (via API)
39. INRAE - Indexator – October 2022
API BioPortal Search
https://data.bioontology.org/search
?q=Gene%20expression%20profile&ontology=EDAM&subtree_root_id=http%3A%2F%2Fedamontology.org%2Fdata_0006&apikey=….
Example of mapping from a controlled vocabulary based on an ontology in BioPortal
autocompletion
http://edamontology.org/data_0006
API BioPortal ontology / EDAM
get terms
search
Pre-fill a dataset in the INRAE DATA dataverse (via API)
Mapping
get
40. INRAE - Indexator – October 2022
Example of mapping from a controlled vocabulary based on the Thesaurus INRAE
https://consultation.vocabulaires-ouverts.inrae.fr/api/
API Thesaurus INRAE
get terms
Pre-fill a dataset in the INRAE DATA dataverse (via API)
autocompletion
41. INRAE - Indexator – October 2022
Example of mapping from a controlled vocabulary based on the Thesaurus INRAE
https://consultation.vocabulaires-ouverts.inrae.fr/api/
API Thesaurus INRAE
get terms
Pre-fill a dataset in the INRAE DATA dataverse (via API)
autocompletion
https://consultation.vocabulaires-ouverts.inrae.fr/rest/v1/search
?vocab=thesaurus-inrae&lang=en&type=skos%3AConcept
&query=metabolomics
&offset=0
API Thesaurus INRAE
search
get
Mapping
42. INRAE - Indexator – October 2022
Pre-fill a dataset in the INRAE DATA dataverse (via API)
[Diagram: data lifecycle around the web-based metadata entry tool ("Machine-Actionable Metadata")]
• Create the project: descriptive metadata (Project) entered via the web-based metadata entry tool (TSV terminology, PGD_XXX.json, JSON with a Schema).
• Create the data: observations, samples, experimentation, instrumentation; adding resources.
• Data analysis: adding new metadata, saving data with their metadata, converting to a suitable format (JSON-LD).
• Preserving data: storage space for the project (NAS) associated with the metadata file; push (mapping, JSON-LD) to national and international data repositories.
• Access to data / reuse of data: metadata query (web interface and/or API).
43. INRAE - Indexator – October 2022
• Gain visibility into what is produced within the collective:
• data sets, software, databases, images, sounds, videos, analyses, code, ...
• Use a controlled vocabulary specific to the domain of the collective, with mapping to other formats
embedding ontologies done downstream as required,
• Offer an alternative/complement to external data repositories or other thematic warehouses, in order to have
knowledge of and access to ALL data, not only the published data,
• Promote FAIR principles (at least the Findable & Accessible criteria) within the collective,
• Raise awareness among newcomers and students of the need to better describe what they produce.
Conclusion
The "INDEXATOR" tool allows a collective to:
44. INRAE - Indexator – October 2022
https://github.com/inrae/pgd-mmdt
Thank you for your attention
Metadata Management for Storage Spaces
Metadata aggregation & indexation
Source code