Ontologies, controlled vocabularies and Dataverse
Slava Tykhonov
Senior Information Scientist,
Research & Innovation (DANS-KNAW)
Dataverse community call, Harvard University, 03.12.2020
Overall goals for DANS-KNAW
● DANS-KNAW runs the EASY Trusted Digital Repository as a service; it is
time to get data back from the archive, convert it and put it in Dataverse, ready for
curation
● DANS-KNAW wants to run Data Stations with metadata created and
maintained by different research communities
● the long-term goal of DANS is to make all datasets harvestable and
approachable, and to create an interoperability layer with external controlled
vocabularies (FAIR Data Point)
DANS Data Stations - Future Data Services
The importance of standards and ontologies
Generic controlled vocabularies to link metadata in the bibliographic collections are well
known: ORCID, GRID, GeoNames, Getty.
Medical knowledge graphs powered by:
● Biological Expression Language (BEL)
● Medical Subject Headings (MeSH®) by U.S. National Library of Medicine (NIH)
● Wikidata (Open ontology) - Wikipedia
Integration based on metadata standards:
● MARC21, Dublin Core (DC), Data Documentation Initiative (DDI)
Most prominent ontologies are already available as Web Services with API endpoints.
FAIR Dataverse
Source: Mercè Crosas, “FAIR principles and beyond: implementation in Dataverse”
Interoperability in EOSC
● Technical interoperability is defined as the “ability of different information technology systems and
software applications to communicate and exchange data”. It should allow “to accept data from each
other and perform a given task in an appropriate and satisfactory manner without the need for extra
operator intervention”.
● Semantic interoperability is “the ability of computer systems to transmit data with unambiguous,
shared meaning. Semantic interoperability is a requirement to enable machine computable logic,
inferencing, knowledge discovery, and data”.
● Organisational interoperability refers to the “way in which organisations align their business
processes, responsibilities and expectations to achieve commonly agreed and mutually beneficial
goals. Focus on the requirements of the user community by making services available, easily
identifiable, accessible and user-focused”.
● Legal interoperability covers “the broader environment of laws, policies, procedures and
cooperation agreements”
Source: EOSC Interoperability Framework v1.0
Our goals to increase Dataverse interoperability
Provide a custom FAIR metadata schema for European research communities:
● CESSDA metadata (Consortium of European Social Science Data Archives)
● Component MetaData Infrastructure (CMDI) metadata from CLARIN
linguistics community
Connect metadata to ontologies and CVs:
● link metadata fields to common ontologies (Dublin Core, DCAT)
● define semantic relationships between (new) metadata fields (SKOS)
● select available external controlled vocabularies for the specific fields
● provide multilingual access to controlled vocabularies
Introduction of Data Catalog Vocabulary (DCAT)
Source: W3C DCAT recommendation
DCAT defines three main classes:
● dcat:Catalog represents the catalog
● dcat:Dataset represents a dataset in a catalog
● dcat:Distribution represents an accessible form of a dataset
DCAT makes extensive use of terms from RDF, Dublin Core, SKOS, and other vocabularies!
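How the three classes fit together can be sketched in a few lines of Python. This is only an illustration: the example.org URIs and the title are invented, and the triples are plain tuples serialized as N-Triples rather than the output of any Dataverse exporter. The property names (dcat:dataset, dcat:distribution, dcat:downloadURL, dcterms:title) come from the DCAT and Dublin Core vocabularies.

```python
# Namespaces of the vocabularies DCAT builds on.
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def describe_catalog(catalog, dataset, distribution, title, download_url):
    """Return triples linking one catalog, one dataset and one distribution."""
    return [
        (uri(catalog), uri(RDF_TYPE), uri(DCAT + "Catalog")),
        (uri(catalog), uri(DCAT + "dataset"), uri(dataset)),
        (uri(dataset), uri(RDF_TYPE), uri(DCAT + "Dataset")),
        (uri(dataset), uri(DCT + "title"), literal(title)),
        (uri(dataset), uri(DCAT + "distribution"), uri(distribution)),
        (uri(distribution), uri(RDF_TYPE), uri(DCAT + "Distribution")),
        (uri(distribution), uri(DCAT + "downloadURL"), uri(download_url)),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Invented identifiers, for illustration only.
EXAMPLE = describe_catalog(
    "https://example.org/catalog",
    "https://example.org/dataset/1",
    "https://example.org/dataset/1/csv",
    "Example dataset",
    "https://example.org/files/1.csv",
)
print(to_ntriples(EXAMPLE))
```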
Simple Knowledge Organization System (SKOS)
SKOS models thesaurus-like resources:
- skos:Concept instances with preferred labels and alternative labels (synonyms) attached to them
(skos:prefLabel, skos:altLabel).
- a skos:Concept can be related to other concepts with the skos:broader, skos:narrower and skos:related properties.
- terms and concepts can have more than one broader term or concept.
SKOS makes it possible to create a semantic layer on top of objects: a network of statements and relationships.
A major difference between SKOS and formal ontologies is the logical “is-a” hierarchy: in thesauri the hierarchical relation can represent
anything from “is-a” to “part-of”.
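A minimal in-memory sketch of these ideas, assuming invented "ex:" concept identifiers and labels (they are not taken from any real thesaurus). It shows multilingual preferred labels, synonyms, a concept with two broader concepts, and a walk up the hierarchy that corresponds to skos:broaderTransitive.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    uri: str
    pref_label: dict                                  # language code -> skos:prefLabel
    alt_labels: dict = field(default_factory=dict)    # language code -> list of skos:altLabel
    broader: list = field(default_factory=list)       # URIs of directly broader concepts


def all_broader(concepts, uri):
    """Collect every concept reachable by following skos:broader links.

    skos:broader itself is not transitive; this closure is what
    skos:broaderTransitive expresses.
    """
    seen = []
    stack = list(concepts[uri].broader)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(concepts[current].broader)
    return sorted(seen)


# Invented example data: "family" has two broader concepts.
CONCEPTS = {
    "ex:society": Concept("ex:society", {"en": "Society"}),
    "ex:social-groups": Concept(
        "ex:social-groups", {"en": "Social groups"}, broader=["ex:society"]
    ),
    "ex:kinship": Concept("ex:kinship", {"en": "Kinship"}, broader=["ex:society"]),
    "ex:family": Concept(
        "ex:family",
        {"en": "Family", "nl": "Gezin"},
        {"en": ["Household"]},
        ["ex:social-groups", "ex:kinship"],
    ),
}

print(all_broader(CONCEPTS, "ex:family"))
```

Because the hierarchy is only "broader/narrower", nothing in this structure says whether a link means "is-a" or "part-of"; that interpretation stays with the vocabulary maintainers.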
RDF graph using the SKOS Core Vocabulary
Source: SKOS Core Guide
Global Research Identifier Database (GRID) in SKOS
Can we provide humans with a convenient web interface to create links to data points?
Can we use Machine Learning algorithms to predict links and convert data to SKOS automatically?
Linked Data integration challenges
● datasets are very heterogeneous and multilingual
● data usually lacks sufficient data quality control
● data providers use different modeling schemas and styles
● linked data cleansing and versioning are very difficult to track and maintain
properly, and web resources aren’t persistent
● even modern data repositories provide only metadata records describing
data, without giving access to individual data items stored in files
● it is difficult to assign entity relationships in a knowledge graph and to keep
them up to date manually
We need semantic relationships among metadata fields and their values!
What is semantics?
Semantics (from Ancient Greek: σημαντικός sēmantikós, "significant")[a][1] is the study of meaning. The term can be used to
refer to subfields of several distinct disciplines including linguistics, philosophy, and computer science.
Linguistics
In linguistics, semantics is the subfield that studies meaning. Semantics can address meaning at the levels of words,
phrases, sentences, or larger units of discourse. One of the crucial questions which unites different approaches to linguistic
semantics is that of the relationship between form and meaning.[2]
Computer science
In computer science, the term semantics refers to the meaning of language constructs, as opposed to their form (syntax).
According to Euzenat, semantics "provides the rules for interpreting the syntax which do not provide the meaning directly
but constrains the possible interpretations of what is declared."[14]
(from Wikipedia)
Semantics in Dataverse metadata schema
Dataverse datasetfield API
curl http://localhost:8080/api/admin/datasetfield/title
To-do list for Dataverse core:
● add TermURI for metadata fields (DC)
● show external controlled vocabularies available for the specific field
● add multilingual support with ‘lang’ parameter
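The first to-do item can be pictured as a lookup from a Dataverse field name to a Dublin Core term URI. The sketch below is an assumption-heavy illustration: the mapping table is hypothetical, the record shape ({"name": ...}) is invented and is not the actual response of the datasetfield API, and "termURI" is simply the property name suggested on this slide.

```python
# Dublin Core terms namespace.
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical mapping from Dataverse field names to Dublin Core terms.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "dsDescription": "description",
    "keyword": "subject",
}


def term_uri(field_name):
    """Return the Dublin Core URI for a field, or None when no mapping is known."""
    local_name = FIELD_TO_DC.get(field_name)
    return DCTERMS + local_name if local_name else None


def annotate(field_record):
    """Return a copy of a field record with a termURI added when one is known.

    The record shape is an assumption made for this sketch.
    """
    annotated = dict(field_record)
    uri = term_uri(field_record["name"])
    if uri is not None:
        annotated["termURI"] = uri
    return annotated


print(annotate({"name": "title"}))
```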
Semantic Gateway as plugin application
Source: Dataverse gateway
Semantic Gateway configuration
Dataverse deposit form with connection to
ontologies
Every field can be linked to the appropriate controlled vocabularies in a FAIR way!
One metadata field can be linked to many ontologies
Language switch in Dataverse will change the language of suggested terms!
The flexibility of Semantic Gateway
Source: Semantic Gateway API
Semantic Gateway lookup API
Scenario: when a user selects a vocabulary and searches for a term, the API receives the filled-in
values and returns the list of concepts in a standardized format:
GET /?lang=language&vocab=vocabulary&term=keyword
examples:
GET /?lang=en&vocab=unesco&query=fam
GET /?vocab=mesh&query=sars
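A client could assemble these requests as sketched below. The base URL is a placeholder, not the address of a real Semantic Gateway deployment. The pattern on this slide names the keyword parameter "term" while the examples use "query", so the sketch keeps the parameter name configurable and defaults to the one used in the examples.

```python
from urllib.parse import urlencode


def build_lookup_url(base_url, vocab, keyword, lang=None, keyword_param="query"):
    """Build a lookup URL of the form /?lang=...&vocab=...&query=...

    base_url is a placeholder for wherever the gateway is deployed.
    keyword_param can be switched to "term" if the service expects that name.
    """
    params = {}
    if lang:
        params["lang"] = lang          # optional: language of the suggested terms
    params["vocab"] = vocab            # vocabulary selected by the user
    params[keyword_param] = keyword    # text typed so far
    return base_url.rstrip("/") + "/?" + urlencode(params)


# Placeholder host; reproduces the two example requests from the slide.
print(build_lookup_url("http://localhost:8000", "unesco", "fam", lang="en"))
print(build_lookup_url("http://localhost:8000", "mesh", "sars"))
```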
Semantic Gateway interface
Use case: CMDI, hierarchical metadata schema
Some conclusions:
● Top-level concepts (CMDI components) can share the same concepts
● Relations between concepts define metadata schema
● Disambiguation of concepts is complicated
● Multilingual components have language indication (for example, keywords in Dutch)
● Hierarchy defined by semantics
Use case: CMDI data model and namespaces
A default namespace was added to Semantic Gateway for the CMDI schema to keep all relationships
between top-level concepts (metadata fields) in the knowledge graph:
ns.dataverse.org/cmdi_component/cmdi_term
However, a component or element in CMDI has a unique name among its siblings, so:
Source: M. Windhouwer, E. Indarto, D. Broeder. CMD2RDF: Building a Bridge from CLARIN to Linked Open Data
Adding component-specific URIs in SKOS
CMDI Component Registry was created for registered Components/Profiles
Example path in CMDI:
/CMD/Components/corpusProfile/resourceCommonInfo/metadataInfo/metadataCreator/actorInfo/actorType
ns.dataverse.org/cmdi1/metadataCreator skos:broader ns.dataverse.org/cmdi1/actorInfo
or simply: cmdi1:metadataCreator skos:related cmdi1:corpusProfile
In the next step, CMDI concepts could be linked to other SKOS concepts.
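Deriving such statements from a CMDI path can be sketched as follows. The "cmdi1" prefix is the one used on this slide; dropping the leading CMD and Components segments is an assumption. The sketch emits "child skos:broader parent", reading skos:broader as "has broader concept" as SKOS defines it; whether Semantic Gateway stores the relation in this direction is also an assumption.

```python
def cmdi_path_to_skos(path, prefix="cmdi1", skip=("CMD", "Components")):
    """Turn a CMDI component path into a chain of skos:broader statements.

    Each nested component is treated as narrower than the component
    that contains it (an assumption of this sketch).
    """
    parts = [segment for segment in path.split("/") if segment and segment not in skip]
    return [
        f"{prefix}:{child} skos:broader {prefix}:{parent}"
        for parent, child in zip(parts, parts[1:])
    ]


PATH = (
    "/CMD/Components/corpusProfile/resourceCommonInfo/metadataInfo/"
    "metadataCreator/actorInfo/actorType"
)

for statement in cmdi_path_to_skos(PATH):
    print(statement)
```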
How can we link CMDI components in SKOS?
Source: CMDI Component Registry
Export from Dataverse metadata back to CMDI
Basic requirements:
The Dataverse metadata schema should include CMDI metadata that can be extended
by custom components used by CLARIN centers in different countries.
Original relationships between fields and concepts should be kept, and custom
components should be added to the SKOS schema.
Users should be able to download metadata in the original CMDI format without
losing quality.
The FAIR Signposting Profile
Herbert Van de Sompel,
DANS Chief Innovation Officer
https://hvdsomp.info
Two levels of access to Web resources:
● level one provides a concise, minimal set of links by value in the HTTP
header
● level two delivers a complete, comprehensive set of links by reference, that is, in a
standalone document (link set)
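The two levels can be sketched from one list of typed links. Everything here is illustrative: the example.org URLs are placeholders, the relation types (cite-as, describedby, item) are ones commonly used in Signposting, and the link set layout follows the JSON link set format as I understand it rather than any Dataverse implementation.

```python
import json


def link_header(links):
    """Level one: serialize links by value into an HTTP Link header."""
    parts = []
    for link in links:
        value = f'<{link["href"]}>; rel="{link["rel"]}"'
        if "type" in link:
            value += f'; type="{link["type"]}"'
        parts.append(value)
    return ", ".join(parts)


def link_set(anchor, links):
    """Level two: put the same links in a standalone link set document."""
    context = {"anchor": anchor}
    for link in links:
        target = {"href": link["href"]}
        if "type" in link:
            target["type"] = link["type"]
        # Group targets under their relation type.
        context.setdefault(link["rel"], []).append(target)
    return json.dumps({"linkset": [context]}, indent=2)


# Placeholder URLs for a dataset landing page and its related resources.
LANDING_PAGE = "https://example.org/dataset/1"
LINKS = [
    {"href": "https://example.org/pid/1", "rel": "cite-as"},
    {
        "href": "https://example.org/dataset/1/metadata.jsonld",
        "rel": "describedby",
        "type": "application/ld+json",
    },
    {"href": "https://example.org/files/1.csv", "rel": "item", "type": "text/csv"},
]

print("Link: " + link_header(LINKS))
print(link_set(LANDING_PAGE, LINKS))
```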
Dataverse meta(data) in FAIR Data Point (FDP)
● RESTful web service that enables data
owners to expose their data sets using
rich machine-readable metadata
● Provides standardized descriptions
(RDF-based metadata) using
controlled vocabularies and ontologies
● FDP spec is public
Source: FDP
The goal is to run FDP on the Dataverse side (DCAT, CVs) and
provide metadata export in RDF!
Questions?
Slava Tykhonov,
Senior Information Scientist
vyacheslav.tykhonov@dans.knaw.nl
