Ontologies, controlled vocabularies and Dataverse

Slava Tykhonov
Senior Information Scientist,
Research & Innovation (DANS-KNAW)
Dataverse community call, Harvard University, 03.12.2020

Overall goals for DANS-KNAW
● DANS-KNAW runs the EASY Trusted Digital Repository as a service; it is now
time to get the data back from the archive, convert it and put it in Dataverse, ready for
curation
● DANS-KNAW wants to run Data Stations with metadata created by and
maintained by different research communities
● the long-term goal of DANS is to make all datasets harvestable and
approachable, and create an interoperability layer with external controlled
vocabularies (FAIR Data Point)

DANS Data Stations - Future Data Services

The importance of standards and ontologies
Generic controlled vocabularies to link metadata in the bibliographic collections are well
known: ORCID, GRID, GeoNames, Getty.
Medical knowledge graphs powered by:
● Biological Expression Language (BEL)
● Medical Subject Headings (MeSH®) by U.S. National Library of Medicine (NIH)
● Wikidata (Open ontology) - Wikipedia
Integration based on metadata standards:
● MARC21, Dublin Core (DC), Data Documentation Initiative (DDI)
Most of the prominent ontologies are already available as Web services with API endpoints.

FAIR Dataverse
Source: Mercè Crosas, “FAIR principles and beyond: implementation in Dataverse”

Interoperability in EOSC
● Technical interoperability defined as the “ability of different information technology systems and
software applications to communicate and exchange data”. It should allow “to accept data from each
other and perform a given task in an appropriate and satisfactory manner without the need for extra
operator intervention”.
● Semantic interoperability is “the ability of computer systems to transmit data with unambiguous,
shared meaning. Semantic interoperability is a requirement to enable machine computable logic,
inferencing, knowledge discovery, and data”.
● Organisational interoperability refers to the “way in which organisations align their business
processes, responsibilities and expectations to achieve commonly agreed and mutually beneficial
goals. Focus on the requirements of the user community by making services available, easily
identifiable, accessible and user-focused”.
● Legal interoperability covers “the broader environment of laws, policies, procedures and
cooperation agreements”
Source: EOSC Interoperability Framework v1.0

Our goals to increase Dataverse interoperability
Provide a custom FAIR metadata schema for European research communities:
● CESSDA metadata (Consortium of European Social Science Data Archives)
● Component MetaData Infrastructure (CMDI) metadata from CLARIN
linguistics community
Connect metadata to ontologies and CVs:
● link metadata fields to common ontologies (Dublin Core, DCAT)
● define semantic relationships between (new) metadata fields (SKOS)
● select available external controlled vocabularies for the specific fields
● provide multilingual access to controlled vocabularies

Introduction of Data Catalog Vocabulary (DCAT)
Source: W3C DCAT recommendation
DCAT defines three main classes:
● dcat:Catalog represents the catalog
● dcat:Dataset represents a dataset in a catalog
● dcat:Distribution represents an accessible form of a dataset
DCAT makes extensive use of terms from RDF, Dublin Core, SKOS and other vocabularies!
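As a minimal sketch of how the three classes nest, the Python snippet below builds a JSON-LD description of one catalog holding one dataset with one distribution. The example.org URLs and the title are placeholders, not real Dataverse records; the property names (dcat:dataset, dcat:distribution, dcat:downloadURL, dct:title) come from the DCAT and Dublin Core vocabularies.

```python
import json

# Namespace prefixes for DCAT and Dublin Core terms.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def build_catalog(catalog_uri, dataset_uri, file_url, title):
    """Return a JSON-LD dict: Catalog -> Dataset -> Distribution."""
    distribution = {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": {"@id": file_url},
    }
    dataset = {
        "@id": dataset_uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dcat:distribution": [distribution],
    }
    return {
        "@context": CONTEXT,
        "@id": catalog_uri,
        "@type": "dcat:Catalog",
        "dcat:dataset": [dataset],
    }


# Placeholder URIs, for illustration only.
catalog = build_catalog(
    "https://example.org/catalog",
    "https://example.org/dataset/1",
    "https://example.org/files/1.csv",
    "Example dataset",
)
print(json.dumps(catalog, indent=2))
```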

Simple Knowledge Organization System (SKOS)
SKOS models thesaurus-like resources:
- skos:Concept instances carry preferred labels and alternative labels (synonyms)
(skos:prefLabel, skos:altLabel).
- skos:Concept instances can be related through the skos:broader, skos:narrower and skos:related properties.
- terms and concepts can have more than one broader term or concept.
SKOS makes it possible to create a semantic layer on top of objects: a network of statements and relationships.
A major difference between SKOS and formal ontologies lies in logical “is-a” hierarchies: in thesauri, the hierarchical relation can represent
anything from “is-a” to “part-of”.
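The SKOS properties listed here can be illustrated with a small sketch in plain Python, where each statement is a (subject, predicate, object) triple. The "ex:" concepts are invented for illustration, and no RDF library is assumed. Following skos:broader links repeatedly, as all_broader does, corresponds to what SKOS calls skos:broaderTransitive.

```python
# Each statement is a (subject, predicate, object) triple, as in RDF.
# The "ex:" concepts are made up for illustration.
TRIPLES = [
    ("ex:cats", "skos:prefLabel", "cats"),
    ("ex:cats", "skos:altLabel", "felines"),
    ("ex:cats", "skos:broader", "ex:mammals"),
    ("ex:cats", "skos:broader", "ex:pets"),  # more than one broader concept
    ("ex:mammals", "skos:broader", "ex:animals"),
    ("ex:pets", "skos:related", "ex:pet_food"),
]


def objects(triples, subject, predicate):
    """Return all objects for a given subject and predicate, in order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def all_broader(triples, concept):
    """Collect every concept reachable by following skos:broader links."""
    found = []
    queue = objects(triples, concept, "skos:broader")
    while queue:
        current = queue.pop(0)
        if current not in found:
            found.append(current)
            queue.extend(objects(triples, current, "skos:broader"))
    return found


labels = objects(TRIPLES, "ex:cats", "skos:altLabel")
ancestors = all_broader(TRIPLES, "ex:cats")
print(labels)
print(ancestors)
```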

RDF graph using the SKOS Core Vocabulary
Source: SKOS Core Guide

Global Research Identifier Database (GRID) in SKOS
Can we provide people with a
convenient web interface to
create links to data points?
Can we use Machine Learning
algorithms to predict links
and convert data to
SKOS automatically?

Linked Data integration challenges
● datasets are very heterogeneous and multilingual
● data usually lacks sufficient quality control
● data providers use different modeling schemas and styles
● linked data cleansing and versioning are very difficult to track and maintain
properly, and web resources aren’t persistent
● even modern data repositories provide only metadata records describing
data, without giving access to the individual data items stored in files
● it is difficult to assign entity relationships in a knowledge graph and to keep
them up to date manually
We need semantic relationships among metadata fields and their values!

What is semantics?
Semantics (from Ancient Greek: σημαντικός sēmantikós, "significant") is the study of meaning. The term can be used to
refer to subfields of several distinct disciplines including linguistics, philosophy, and computer science.
Linguistics
In linguistics, semantics is the subfield that studies meaning. Semantics can address meaning at the levels of words,
phrases, sentences, or larger units of discourse. One of the crucial questions which unites different approaches to linguistic
semantics is that of the relationship between form and meaning.
Computer science
In computer science, the term semantics refers to the meaning of language constructs, as opposed to their form (syntax).
According to Euzenat, semantics "provides the rules for interpreting the syntax which do not provide the meaning directly
but constrains the possible interpretations of what is declared."
(from Wikipedia)

Semantics in Dataverse metadata schema

Dataverse datasetfield API
curl http://localhost:8080/api/admin/datasetfield/title
To do list for Dataverse core:
● add TermURI for metadata fields (DC)
● show external controlled vocabularies available for the specific field
● add multilingual support with ‘lang’ parameter
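The same request can be sketched in Python with only the standard library. This assumes a local Dataverse installation at the address used in the curl example. The helper names are illustrative, and the shape of the JSON response is not shown here, so the sketch only decodes it without interpreting its fields.

```python
import json
from urllib.request import urlopen

# Assumption: a local Dataverse installation, as in the curl example.
BASE_URL = "http://localhost:8080"


def datasetfield_url(field_name, base_url=BASE_URL):
    """Build the admin API URL that describes one metadata field."""
    return f"{base_url}/api/admin/datasetfield/{field_name}"


def fetch_datasetfield(field_name, base_url=BASE_URL):
    """Fetch and decode the JSON description of a metadata field."""
    with urlopen(datasetfield_url(field_name, base_url)) as response:
        return json.load(response)


url = datasetfield_url("title")
print(url)
```

Calling fetch_datasetfield("title") needs a running instance, so it is left out of the example run.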

Semantic Gateway as plugin application
Source: Dataverse gateway

Semantic Gateway configuration

Dataverse deposit form with connection to
ontologies
Every field can be linked to the appropriate controlled vocabularies in a FAIR way!

One metadata field can be linked to many ontologies
Language switch in Dataverse will change the language of suggested terms!

The flexibility of Semantic Gateway
Source: Semantic Gateway API

Semantic Gateway lookup API
Scenario: when the user selects a vocabulary and searches for a term, the API receives the entered
values and returns the list of matching concepts in a standardized format:
GET /?lang=language&vocab=vocabulary&query=keyword
examples:
GET /?lang=en&vocab=unesco&query=fam
GET /?vocab=mesh&query=sars
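A client could assemble such lookup requests as in the sketch below. The base URL is hypothetical, since the gateway address is not given here, and the search parameter is named query as in the two examples. The lang parameter is treated as optional because the second example omits it.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real gateway address is not given here.
GATEWAY_URL = "https://gateway.example.org/"


def lookup_url(vocab, query, lang=None, base_url=GATEWAY_URL):
    """Build a lookup URL; 'lang' is optional, as in the second example."""
    params = {}
    if lang:
        params["lang"] = lang
    params["vocab"] = vocab
    params["query"] = query
    return base_url + "?" + urlencode(params)


url_en = lookup_url("unesco", "fam", lang="en")
url_mesh = lookup_url("mesh", "sars")
print(url_en)
print(url_mesh)
```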

Use case: CMDI, hierarchical metadata schema
Some conclusions:
● Top-level concepts (CMDI components) can share the same concepts
● Relations between concepts define metadata schema
● Disambiguation of concepts is complicated
● Multilingual components have language indication (for example, keywords in Dutch)
● Hierarchy defined by semantics

Use case: CMDI data model and namespaces
A default namespace is added in Semantic Gateway for the CMDI schema to keep all relationships
between top-level concepts (metadata fields) in the knowledge graph:
ns.dataverse.org/cmdi_component/cmdi_term
However, a component or element in CMDI has a unique name among its siblings, so:
Source: M. Windhouwer, E. Indarto, D. Broeder. CMD2RDF: Building a Bridge from CLARIN to Linked Open Data

Adding component-specific URIs in SKOS
CMDI Component Registry was created for registered Components/Profiles
Example path in CMDI:
/CMD/Components/corpusProfile/resourceCommonInfo/metadataInfo/metadataCreator/actorInfo/actorType
ns.dataverse.org/cmdi1/metadataCreator skos:broader ns.dataverse.org/cmdi1/actorInfo
or simply: cmdi1:metadataCreator skos:related cmdi1:corpusProfile
CMDI concepts can be linked to other SKOS concepts in the next step.
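The "or simply" form can be sketched as a small Python helper that splits a CMDI path and links every component to its profile with skos:related. This is an assumption about how such statements might be generated, not the Semantic Gateway implementation; the cmdi1 prefix is taken to stand for ns.dataverse.org/cmdi1/, and the function name is illustrative.

```python
# Assumption: the 'cmdi1' prefix stands for ns.dataverse.org/cmdi1/.
PREFIX = "cmdi1"
PATH_ROOT = "/CMD/Components/"


def path_to_triples(path):
    """Link every component on a CMDI path to its profile with skos:related."""
    if not path.startswith(PATH_ROOT):
        raise ValueError("expected a path starting with " + PATH_ROOT)
    # The first segment after the root is the profile, the rest are components.
    profile, *components = path[len(PATH_ROOT):].split("/")
    return [
        (f"{PREFIX}:{name}", "skos:related", f"{PREFIX}:{profile}")
        for name in components
    ]


path = ("/CMD/Components/corpusProfile/resourceCommonInfo/"
        "metadataInfo/metadataCreator/actorInfo/actorType")
triples = path_to_triples(path)
for triple in triples:
    print(" ".join(triple))
```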

How can we link CMDI components in SKOS?
Source: CMDI Component Registry

Export from Dataverse metadata back to CMDI
Basic requirements:
Dataverse metadata schema should have CMDI metadata that can be extended
by custom components used by CLARIN centers in the different countries.
Original relationships between fields and concepts should be kept, custom
components should be added to SKOS schema.
Users should be able to download metadata in the original CMDI format without
losing quality.

The FAIR Signposting Profile
Herbert Van de Sompel,
DANS Chief Innovation Officer
https://hvdsomp.info
Two levels of access to Web resources:
● level one provides a concise, minimal set of links by value in the HTTP
header
● level two delivers a complete, comprehensive set of links by reference,
meaning in a standalone document (link set)
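The two levels can be illustrated by formatting an HTTP Link header in Python. The URLs are placeholders and the helper is only a sketch; the relation types cite-as, describedby, item and linkset are ones used in Signposting. Level one lists the links directly, while level two points to a separate link set document.

```python
def link_header(links):
    """Format (target URL, relation type) pairs as an HTTP Link header value."""
    return ", ".join(f'<{url}>; rel="{rel}"' for url, rel in links)


# Placeholder URLs, for illustration only.
LEVEL_ONE_LINKS = [
    ("https://example.org/pid/1", "cite-as"),
    ("https://example.org/dataset/1/metadata.xml", "describedby"),
    ("https://example.org/dataset/1/file.csv", "item"),
]
# Level two: a single link by reference to a standalone link set.
LEVEL_TWO_LINKS = [
    ("https://example.org/dataset/1/linkset.json", "linkset"),
]

level_one = link_header(LEVEL_ONE_LINKS)
level_two = link_header(LEVEL_TWO_LINKS)
print("Link: " + level_one)
print("Link: " + level_two)
```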

Dataverse meta(data) in FAIR Data Point (FDP)
● RESTful web service that enables data owners to expose their data sets using rich machine-readable metadata
● Provides standardized descriptions (RDF-based metadata) using controlled vocabularies and ontologies
● FDP spec is public
Source: FDP
The goal is to run FDP on the Dataverse side (DCAT, CVs) and provide metadata export in RDF!

Questions?
Slava Tykhonov,
Senior Information Scientist
vyacheslav.tykhonov@dans.knaw.nl
