Metadata management for data storage spaces
INDEXATOR is a metadata management tool that addresses the problems of organising, documenting, storing and sharing data in a research unit or infrastructure, and it fits naturally into a collective's data management plan.
The central idea is that the storage space becomes the data repository, so the metadata should go to the data and not the other way around.
Given the diversity of domains, the approach chosen is to be as flexible and pragmatic as possible by allowing each collective to choose its own (controlled) vocabulary corresponding to the reality of its field and activities. The main idea is to "capture" the user's metadata as easily as possible, using their own vocabulary. The whole terminology can be defined in a spreadsheet.
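As a sketch of how a spreadsheet-defined terminology could be ingested (the column names and file layout below are illustrative assumptions, not INDEXATOR's actual format), a CSV export of the vocabulary sheet might be parsed like this:

```python
import csv
import io

# Hypothetical terminology sheet: each row defines one metadata field
# (field name, type, allowed values). The columns are assumptions made
# for this example, not INDEXATOR's real schema.
VOCAB_CSV = """field,type,allowed_values
species,text,
organ,choice,leaf;root;fruit
platform,choice,NMR;LC-MS
"""

def load_vocabulary(text):
    """Parse the spreadsheet export into {field: {"type": ..., "allowed": [...]}}."""
    vocab = {}
    for row in csv.DictReader(io.StringIO(text)):
        allowed = row["allowed_values"].split(";") if row["allowed_values"] else []
        vocab[row["field"]] = {"type": row["type"], "allowed": allowed}
    return vocab

vocab = load_vocabulary(VOCAB_CSV)
print(vocab["organ"]["allowed"])  # ['leaf', 'root', 'fruit']
```

A free-text field simply has an empty "allowed" list, while choice fields carry the controlled terms each collective decided on.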
JSON was chosen as the metadata format: it is well suited to describing metadata and is readable by both humans and machines.
This tool is built around a web interface coupled with a MongoDB database. The web interface allows you to i) describe a dataset using metadata of various types (Description), and ii) search datasets by their metadata (Accessibility).
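To make the human- and machine-readable claim concrete, here is a minimal sketch of what one dataset's JSON metadata document might look like; the field names are illustrative assumptions, not INDEXATOR's schema:

```python
import json

# A sketch of a dataset description. Field names here are assumptions
# chosen for illustration; a real deployment would use the collective's
# own controlled vocabulary.
metadata = {
    "project": "FruitQuality2022",
    "description": "NMR profiling of tomato fruit extracts",
    "contact": "someone@example.org",
    "keywords": ["metabolomics", "tomato", "NMR"],
    "created": "2022-10-01",
}

# Serialised form is readable by humans; parsing it back shows it is
# equally readable by machines (and storable as-is in MongoDB).
text = json.dumps(metadata, indent=2, ensure_ascii=False)
restored = json.loads(text)
```

The same document can be inserted unchanged into a MongoDB collection, which is what makes the metadata directly searchable.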
1. Metadata management for data storage spaces
Contributors:
François Ehrenmann (UMR BioGECO)
Philippe Chaumeil (UMR BioGECO)
Daniel Jacob (UMR BFP)
2. INRAE - Indexator – October 2022
Data Management Plan
• The implementation of a Data Management Plan (DMP) involves some prerequisites, such as outsourcing the data to be preserved outside the users' disk space.
• This concerns not only published data but all data produced during the course of a project.
• This is all the more necessary when temporary staff (doctoral students, post-docs, trainees, fixed-term contracts) are involved in the production of data.
How to encourage the structures (Units, Platforms, ...) to better manage their data?
3. Data storage
• The central idea is that the storage space becomes the data repository, so the metadata should go to the data and not the other way around.
• This raises concerns about the organisation of these storage spaces. Should they be harmonised, i.e. should good practices be imposed, such as i) folder and file naming, ii) folder structure (docs, data, scripts, etc.), iii) the use of README files?
• At least the use of a README file seems the simplest and least restrictive, but what should go in it?
• And how can these files be used effectively when you want to find information? With what vocabulary?
4. Data storage – project data storage space: put a metadata file (JSON format) describing the project data within each subdirectory.
• JSON was chosen because it is well suited to describing metadata and is readable by both humans and machines.
5. INRAE - Indexator – October 2022
Generate the
metadata file (JSON)
Data storage
Web interface
Project data storage space :
Put a metadata file (JSON format)
describing the project data within each
subdirectory
• The central idea is that the storage space becomes the data repository, so the
metadata should go to the data and not the other way around.
The choice was made for the JSON format,
which is very appropriate for describing
metadata, readable by both humans and
machines
Since producing files in JSON format by hand is tricky for users, a web interface makes it possible to create them.
How to encourage the structures (Units, Platforms,...) to better manage their data
deposit
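As an illustration of what the web interface generates, a metadata file for a project might look like the following minimal sketch (the section and field names shown here are hypothetical; the real ones are whatever the collective's terminology defines, and "Atacama" is the example project used later in the deck):

```json
{
  "project": {
    "name": "Atacama",
    "manager": "J. Doe",
    "date": "2022-10"
  },
  "dataset": {
    "datatype": "Gene expression profile",
    "keywords": ["omics", "experiment"],
    "description": "Measurement data collected for the Atacama project",
    "location": "https://nextcloud.example.org/..."
  }
}
```

The file is deposited in the project's subdirectory, next to the data it describes.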
6. INRAE - Indexator – October 2022
View
Metadata
Generate the
metadata file (JSON)
Search datasets based
on some metadata
deposit
scan
Data storage
Web interface
Project data storage space :
Put a metadata file (JSON format)
describing the project data within each
subdirectory
Then, find projects and/or data
corresponding to your criteria
• The central idea is that the storage space becomes the data repository, so the
metadata should go to the data and not the other way around.
How to encourage the structures (Units, Platforms,...) to better manage their data
7. INRAE - Indexator – October 2022
How to encourage the structures (Units, Platforms,...) to better manage their data
What metadata?
How to specify it?
From which vocabulary?
How to generate a JSON file?
Questions immediately raised
8. INRAE - Indexator – October 2022
• Given the diversity of domains, the approach chosen is to be both as flexible and as
pragmatic as possible by allowing each collective to choose its own (controlled) vocabulary
corresponding to the reality of its field and activities.
• The main idea is to be able to "capture" the user's metadata as easily as possible using their
vocabulary.
How to encourage the structures (Units, Platforms,...) to better manage their data
What metadata?
How to specify it?
From which vocabulary?
How to generate a JSON file?
Questions immediately raised
9. INRAE - Indexator – October 2022
• The main idea is to be able to "capture" the user's metadata as easily as possible using their
vocabulary.
How to encourage the structures (Units, Platforms,...) to better manage their data
The web interface
must therefore correspond to the scientific and experimental context
of the collective (research unit, project, platform, ...)
What metadata?
How to specify it?
From which vocabulary?
How to generate a JSON file?
Questions immediately raised
• Given the diversity of domains, the approach chosen is to be both as flexible and as
pragmatic as possible by allowing each collective to choose its own (controlled) vocabulary
corresponding to the reality of its field and activities.
10. INRAE - Indexator – October 2022
…
Web interface for metadata entry
Generate the metadata file (JSON)
11. INRAE - Indexator – October 2022
Sections
…
Web interface for metadata entry
Generate the metadata file (JSON)
12. INRAE - Indexator – October 2022
…
Web interface for metadata entry
Generate the metadata file (JSON)
Sections
Fields
13. INRAE - Indexator – October 2022
…
Web interface for metadata entry
textbox
dropbox textbox
checkbox
dropbox
textbox textbox
checkbox
Generate the metadata file (JSON)
Type
Sections
Fields
14. INRAE - Indexator – October 2022
…
Web interface for metadata entry
textbox
dropbox textbox
checkbox
dropbox
textbox textbox
checkbox
Generate the metadata file (JSON)
Predefined terms
Sections
Fields
Type
15. INRAE - Indexator – October 2022
Sections
Predefined terms
…
Web interface for metadata entry
Fields
width=350px width=350px
width=350px width=500px
open
textbox
dropbox textbox
checkbox
dropbox
textbox textbox
checkbox
Generate the metadata file (JSON)
Features
Type
16. INRAE - Indexator – October 2022
…
Fields Sections Type Features Predefined terms
config_terms.txt
Definition of metadata
• Terminology definition file in Tab-Separated Values (TSV)
• Based on a (controlled) vocabulary specified by the data manager of a collective (research unit, platform, ...)
All the metadata to be entered can be fully configured using a single configuration file (TSV format).
It is possible to define the whole terminology using a spreadsheet.
17. INRAE - Indexator – October 2022
• column 1 - Field : short name of the field
• column 2 - Section : short name of the section
• column 3 - Search : indicates whether the field can be used as a search criterion ('Y') or not ('N')
• column 4 - Shortview : ordered numbers indicating whether (and where) the field appears in the overview table shown after a search (empty by default)
• column 5 - Type : indicates how the field is entered via the web interface (possible values: textbox, dropbox, checkbox and areabox)
• column 6 - Features : depending on the Type value, specific features can be set; several features must be separated by commas
• for checkbox: open=0 or open=1 indicates whether the selection is open or not
• for textbox & checkbox: autocomplete=item ; the items.js file must be present under web/js/autocomplete
• for textbox & dropbox: width=NNNpx sets the width of the box; useful if you want to put several fields on the same line
• for areabox: row=NN and cols=NN set the row and column sizes of the textarea
• column 7 - Label : the label of the field as it will appear in the web interface
• column 8 - Predefined terms : for fields of type 'checkbox' or 'dropbox', a comma-separated list of terms can be given
Structure of the Terminology definition file
Definition of metadata
config_terms.txt
All the metadata to be entered can be fully configured using a single configuration file (TSV format).
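To make the column layout concrete, here are two hypothetical lines of config_terms.txt (tab-separated; the field names, labels and predefined terms are illustrative, except datatype, whose features appear later in the deck):

```tsv
datatype	dataset	Y	2	textbox	autocomplete=edam_data,width=350px	Data type	
species	dataset	Y	3	dropbox	width=350px	Species	Arabidopsis,Populus,Quercus
```

Columns, in order: Field, Section, Search, Shortview, Type, Features, Label, Predefined terms.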
18. INRAE - Indexator – October 2022
Architecture diagram
[Diagram] Two phases:
• Configuration / Initialization steps: the Terminology definition file config_terms.txt (Tab-Separated Values) is used to generate config_terms.json and pgd-mmdt-schema.json, which configure the web interface and initialize the MongoDB database (initdb). Important: the terminology must be defined in the first step and then no longer changed.
• Normal operating mode: the web interface creates PGD_XXXXX.json metadata files, linked to pgd-mmdt-schema.json (options), which are deposited in the data storage space; a scan (cron) inserts them into MongoDB; the web interface then provides search and metadata viewing.
Legend: Docker Containers (web server, MongoDB) / Input / Output files / Data storage.
19. INRAE - Indexator – October 2022
Architecture diagram (zoom: Configuration / Initialization steps)
[Diagram] The Terminology definition file config_terms.txt (Tab-Separated Values) is used to generate config_terms.json and pgd-mmdt-schema.json under web/json and to initialize the MongoDB database (initdb); configuration is done via the web interface at http://mysite.org/pgd-mmdt/config. Important: the terminology must be defined in the first step and then no longer changed.
Legend: Docker Containers / Input / Output files.
20. INRAE - Indexator – October 2022
Architecture diagram (zoom: Metadata entry)
[Diagram] The web interface (configured from config_terms.json under web/json) creates a PGD_XXXXX.json file, linked to pgd-mmdt-schema.json (options), which is deposited in the data storage space.
Legend: Docker Containers / Input / Output files.
22. INRAE - Indexator – October 2022
…
http://mysite.org/pgd-mmdt/search
Web interface for search
23. INRAE - Indexator – October 2022
http://mysite.org/pgd-mmdt/search#results
Web interface for search
Short View
24. INRAE - Indexator – October 2022
http://mysite.org/pgd-mmdt/metadata/Atacama
Web interface for metadata
…
25. INRAE - Indexator – October 2022
Web interface : Add new predefined terms
[Diagram] The first time a new term is needed, it is entered via the web interface and deposited with the PGD_XXXXX.json file in the data storage; the scan (cron, with options) picks it up, and the new term is then available for other users / datasets alongside the Terminology definition file.
26. INRAE - Indexator – October 2022
web/js/autocomplete/cities.js
Web interface
Example with
Web interface : autocompletion
.
API « Découpage administratif » (Administrative division)
// Collect all French commune names from the geo.api.gouv.fr API for autocompletion
var cities = [];
$.getJSON("https://geo.api.gouv.fr/communes", function (data) {
  $.each(data, function (index, value) { cities.push(value['nom']); });
});
Terminology definition file
27. INRAE - Indexator – October 2022
// Get all descendant classes of the 'Data' class
edam_data = [];
get_terms_from_bioportal('EDAM', 'http://edamontology.org/data_0006', 'edam_data');
web/js/autocomplete/edam_data.js
To get information about the BioPortal API : https://data.bioontology.org/documentation
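The helper get_terms_from_bioportal is not shown in the deck; the sketch below is one hypothetical way it could be implemented against the documented BioPortal REST API. The /descendants endpoint, the paged response fields (pageCount, collection, prefLabel) and the apikey parameter are assumptions based on that documentation, and all names are illustrative (here the target is passed as an array rather than as a name string, as the slide does):

```javascript
// Hypothetical sketch of get_terms_from_bioportal (not the deck's actual code).

// Build the paged "descendants" URL for a class IRI; pure and easy to test.
function bioportalDescendantsUrl(ontology, classIri, apikey, page) {
  return "https://data.bioontology.org/ontologies/" + ontology +
         "/classes/" + encodeURIComponent(classIri) +
         "/descendants?page=" + page + "&apikey=" + apikey;
}

// Collect the prefLabel of every descendant class into the target array.
// Assumes a paged BioPortal response of the form {page, pageCount, collection}.
async function get_terms_from_bioportal(ontology, classIri, target, apikey) {
  let page = 1, pageCount = 1;
  while (page <= pageCount) {
    const res = await fetch(bioportalDescendantsUrl(ontology, classIri, apikey, page));
    const data = await res.json();
    pageCount = data.pageCount;
    data.collection.forEach(function (cls) { target.push(cls.prefLabel); });
    page += 1;
  }
}

// Usage, mirroring the slide (API_KEY is a placeholder):
// const edam_data = [];
// get_terms_from_bioportal('EDAM', 'http://edamontology.org/data_0006', edam_data, API_KEY);
```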
Web interface : autocompletion Example with
https://bioportal.bioontology.org/ontologies/EDAM/?p=classes
"datatype": {
  "titre": "Data type",
  "autocomplete": "edam_data",
  "width": "350px"
}
web/json/config_terms.json
Web interface
Choose from 947 terms
autocompletion
28. INRAE - Indexator – October 2022
Web interface : autocompletion
https://vocabulaires-ouverts.inrae.fr/a-propos-du-thesaurus-inrae/
Example with
29. INRAE - Indexator – October 2022
Web interface : autocompletion Example with
https://consultation.vocabulaires-ouverts.inrae.fr/api/
web/js/autocomplete/VOvocab.js
Terminology definition file
keywords = [
  'data', 'report', 'simulation', 'model', 'image', 'script',
  'omics', 'statistic', 'scientific', 'research', 'document',
  'experiment', 'video', 'spatial', 'instrument'
]
VOvocab = [];
get_terms_from_voinrae(keywords, 'VOvocab')
Choose from 405 terms
autocompletion
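Similarly, get_terms_from_voinrae is not shown in the deck; the sketch below is a hypothetical implementation against the Skosmos-style search endpoint of the INRAE open vocabularies (the /rest/v1/search URL mirrors the one shown on a later slide; the results/prefLabel response fields are assumptions, and names are illustrative, with the target passed as an array rather than as a name string):

```javascript
// Hypothetical sketch of get_terms_from_voinrae (not the deck's actual code).

// Build the thesaurus-inrae search URL for one keyword; pure and testable.
function voSearchUrl(query) {
  return "https://consultation.vocabulaires-ouverts.inrae.fr/rest/v1/search" +
         "?vocab=thesaurus-inrae&lang=en&type=" + encodeURIComponent("skos:Concept") +
         "&query=" + encodeURIComponent(query) + "&offset=0";
}

// Search every keyword and collect the prefLabel of each matching concept.
// Assumes a Skosmos-style response of the form {results: [{prefLabel, uri}, ...]}.
async function get_terms_from_voinrae(keywords, target) {
  for (const kw of keywords) {
    const res = await fetch(voSearchUrl(kw));
    const data = await res.json();
    data.results.forEach(function (c) { target.push(c.prefLabel); });
  }
}
```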
30. INRAE - Indexator – October 2022
Web interface : Resources
Terminology definition file
The "description" field should make it possible to annotate the data better,
while the "location" field should make it possible to
1) extend the perimeter of the data beyond the local space, and
2) possibly become independent of the local space when one wishes to
disseminate the metadata alone.
A location can be anything: free text, an absolute path in a tree, a URL, ...
One can thus put a link to a publication: Type=article, link=DOI
31. INRAE - Indexator – October 2022
Creation
JSON metadata file
metadata viewer
Resource example 1: Atacama
32. INRAE - Indexator – October 2022
Resource example 2: Link to nextcloud
Put a NextCloud link pointing to the data repository.
Access is thus limited to those who have rights !
33. INRAE - Indexator – October 2022
Resource example 2: Link to nextcloud
Put a NextCloud link pointing to the data repository.
Access is thus limited to those who have rights !
Resource example 3: Indicate the path on an external storage
When providing a URL is not possible, nevertheless
give clear indications of where the data are located.
34. INRAE - Indexator – October 2022
VM
Data storage
Web server
Storage located on the VM
Installation : Local, Remote or Mixed
Local storage mounted on the VM
NAS Server
VPN
GlobalProtect
WinSCP
Successful
testing
Local VM
Remote VM (Datacenter)
2 CPUs, 2 GB RAM, 10 GB HD
35. INRAE - Indexator – October 2022
VM
Data storage
Web server
Local VM
Remote VM (Datacenter)
Storage located on the VM
Google Drive
2 CPUs, 2 GB RAM, 10 GB HD
Installation : Local, Remote or Mixed
Local storage mounted on the VM
NAS Server
VPN
GlobalProtect
WinSCP
Successful
testing
36. INRAE - Indexator – October 2022
scan
[ncloud]
type = webdav
url = https://nextcloud.inrae.fr/remote.php/webdav/
vendor = nextcloud
user = XXXXX
pass = XXXXX
rclone mount ncloud:MTH2-PF-Bordeaux/DATA/ /mnt/ncloud/ \
  --allow-other --vfs-cache-mode minimal \
  --read-only --no-checksum --no-modtime \
  --daemon --daemon-wait 15s
https://pmb-bordeaux.fr/ncloud/search
https://nextcloud.inrae.fr/apps/files/?dir=/MTH2-PF-Bordeaux/DATA
37. INRAE - Indexator – October 2022
Web Interface
Creation of the
JSON file
Mapping of JSON
file sections/terms
with the metadata
structure in
DATA INRAE
Pre-fill a dataset in the INRAE DATA dataverse (via API)
JSON Schema
Metadata JSON file
+
pgd-mmdt-schema.json
JSON-LD
Metadata JSON-LD file
• A good approach is to use only a controlled vocabulary, i.e. a relevant and sufficient
vocabulary used as a reference in the field concerned, so that users can describe a project and
its context without having to add extra terms.
• A mapping of terms based on a controlled vocabulary can then be done more easily to
generate formats corresponding to different standards (MIAPPE, JSON-LD, ...)
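As a sketch of what such a mapping can produce, a metadata entry whose "datatype" field is bound to the EDAM ontology might be exported as JSON-LD along these lines (a hand-written illustration, not the tool's actual output; the Dublin Core title property is used here for the project name):

```json
{
  "@context": {
    "title": "http://purl.org/dc/terms/title",
    "datatype": "http://edamontology.org/data_0006"
  },
  "title": "Atacama",
  "datatype": "Gene expression profile"
}
```

Because every value comes from a controlled vocabulary, each field can be bound once to a stable IRI in the @context.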
Push
38. INRAE - Indexator – October 2022
Example of mapping from a controlled vocabulary based on an ontology in BioPortal
autocompletion
http://edamontology.org/data_0006
API BioPortal ontology / EDAM
get terms
Pre-fill a dataset in the INRAE DATA dataverse (via API)
39. INRAE - Indexator – October 2022
API BioPortal Search
https://data.bioontology.org/search
?q=Gene%20expression%20profile&ontology=EDAM&subtree_root_id=http%3A%2F%2Fedamontology.org%2Fdata_0006&apikey=….
Example of mapping from a controlled vocabulary based on an ontology in BioPortal
autocompletion
http://edamontology.org/data_0006
API BioPortal ontology / EDAM
get terms
search
Pre-fill a dataset in the INRAE DATA dataverse (via API)
Mapping
get
40. INRAE - Indexator – October 2022
Example of mapping from a controlled vocabulary based on the Thesaurus INRAE
https://consultation.vocabulaires-ouverts.inrae.fr/api/
API Thesaurus INRAE
get terms
Pre-fill a dataset in the INRAE DATA dataverse (via API)
autocompletion
41. INRAE - Indexator – October 2022
Example of mapping from a controlled vocabulary based on the Thesaurus INRAE
https://consultation.vocabulaires-ouverts.inrae.fr/api/
API Thesaurus INRAE
get terms
Pre-fill a dataset in the INRAE DATA dataverse (via API)
autocompletion
https://consultation.vocabulaires-ouverts.inrae.fr/rest/v1/search
?vocab=thesaurus-inrae&lang=en&type=skos%3AConcept
&query=metabolomics
&offset=0
API Thesaurus INRAE
search
get
Mapping
42. INRAE - Indexator – October 2022
Pre-fill a dataset in the INRAE DATA dataverse (via API)
[Diagram: data lifecycle around the web-based metadata entry tool ("Machine-Actionable Metadata")]
• Create the project: descriptive metadata (Project) entered via the web-based metadata entry tool (TSV terminology, PGD_XXX.json, JSON with a Schema).
• Create the data: observations, samples, experimentation, instrumentation; adding resources.
• Data analysis: adding new metadata, saving data with their metadata, converting to a suitable format (JSON-LD).
• Preserving data: storage space for the project (NAS) associated with the metadata file; push (mapping, JSON-LD) to national and international data repositories.
• Access to data / reuse of data: metadata query (web interface and/or API).
43. INRAE - Indexator – October 2022
• Gain visibility into what is produced within the collective:
• data sets, software, databases, images, sounds, videos, analyses, code, ...
• Use a controlled vocabulary specific to the domain of the collective, with mapping to other formats
embedding ontologies done downstream as required,
• Offer an alternative/complement to external data repositories or other thematic warehouses, in order to have
knowledge of and access to ALL data, not only the published data,
• Promote FAIR principles (at least the Findable & Accessible criteria) within the collective,
• Raise awareness among newcomers and students of the need to better describe what they produce.
Conclusion
The "INDEXATOR" tool allows a collective to:
44. INRAE - Indexator – October 2022
https://github.com/inrae/pgd-mmdt
Thank you for your attention
Metadata Management for Storage Spaces
Metadata aggregation & indexation
Source code