This presentation covers support for external controlled vocabularies (CVs) in Dataverse, an open source data repository. Data Archiving and Networked Services (DANS-KNAW) has chosen Dataverse as the core technology for building Data Stations and providing FAIR data services to various Dutch research communities.
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse
1. Controlled Vocabularies support
and ontologies in Dataverse
Slava Tykhonov
Senior Information Scientist (DANS-KNAW)
CLARIAH Tech Day, 25.02.2021
Creative Commons Attribution 4.0 International (CC BY 4.0)
2. Overall goals for DANS
● DANS-KNAW runs the EASY Trusted Digital Repository as a
service; it is time to get the data back from the archive, convert it and put
it in Dataverse, ready for curation
● DANS-KNAW wants to run Data Stations with metadata created
and maintained by different research communities
● the long-term goal of DANS is to make all datasets harvestable
and approachable, and to create an interoperability layer with
external controlled vocabularies (FAIR Data Point)
3. DANS Data Stations - Future Data Services
Dataverse is an API-based data platform and a key framework for Open Innovation!
4. Dataverse as a service for Data Stations
● Open source project developed by IQSS at Harvard University
● Great product with a very long history (since 2006), created by an experienced
Agile development team
● Clear vision and understanding of research communities' requirements,
public roadmap
● Well-developed architecture with rich APIs that allow application layers to be built
around Dataverse
● The strong community behind Dataverse helps to improve the basic
functionality and develop it further
● DANS-KNAW is leading the SSHOC WP5.2 task to deliver a production-ready
Dataverse repository for the CESSDA, CLARIN and DARIAH communities
5. Services in European Open Science Cloud (EOSC)
● EOSC requires maturity level 8
(at least)
● we need the highest quality of software
to be accepted as a service
● clear and transparent evaluation of
services is essential
● evidence of technical maturity is the
key to success
● a limited warranty makes it possible to stop
out-of-warranty services
6. Dataverse App Store
Data preview: DDI Explorer, Spreadsheet/CSV, PDF, Text files, HTML, Images, video
render, audio, JSON, GeoJSON/Shapefiles/Map, XML
CLARIN tools: VRE integration
Interoperability: external controlled vocabularies support
Data processing: NESSTAR DDI migration tool (DDI -> Dataverse)
Linked Data: RDF compliance (FAIR Data Point)
Federated login: eduGAIN, PIONIER ID (EGI Check-in)
Visualization tools: Apache Superset
7. Applications maturity level
Every software package should follow the same CESSDA Maturity Model to
be accepted as a service running in EOSC.
Must have: Kubernetes infrastructure with upstream Docker images,
warranty statement, documentation, unit tests, Selenium tests, Jenkins
pipeline.
It should be possible to connect external services to your own Dataverse.
9. CMDI core metadata task
The goal mentioned in the CMDI strategy 2019-2020: "Ready-made,
good quality profiles & components suitable for common use cases
and resource types".
DataCite has three types of metadata elements: mandatory,
recommended and optional. How should CMDI core
components be distinguished for different CLARIN centers?
We are part of the specific CMDI task for the design and
implementation of CLARIN core metadata components and
profiles, and the use of FAIR vocabularies within CLARIN metadata.
10. CMDI implementation in Dataverse
Source code: https://github.com/IQSS/dataverse-docker/tree/clariah
11. CMDI metadata model in Dataverse
External FAIR controlled vocabularies are the key to interoperability!
Is it all about relationships between fields?
13. Out of the box CV support in Dataverse (1)
Source: Dataverse Metadata Schema
14. Out of the box CV support in Dataverse (2)
Internal vocabularies are stored in Dataverse; we need more CVs!
15. The importance of standards and ontologies
Generic controlled vocabularies for linking metadata in bibliographic collections are well known:
ORCID, GRID, GeoNames, Getty.
Medical knowledge graphs powered by:
● Biological Expression Language (BEL)
● Medical Subject Headings (MeSH®) by U.S. National Library of Medicine (NIH)
● Wikidata (Open ontology) - Wikipedia
Integration based on metadata standards:
● MARC21, Dublin Core (DC), Data Documentation Initiative (DDI)
Most of the prominent ontologies are already available as web services with API endpoints.
16. Simple Knowledge Organization System (SKOS)
SKOS models thesaurus-like resources:
- skos:Concepts with preferred labels and alternative labels (synonyms) attached to them (skos:prefLabel,
skos:altLabel).
- skos:Concepts can be related to each other with the skos:broader, skos:narrower and skos:related properties.
- a term or concept can have more than one broader term or concept.
SKOS makes it possible to create a semantic layer on top of objects: a network of statements and relationships.
A major difference between SKOS and formal ontologies is the logical “is-a” hierarchy: in thesauri the hierarchical relation can represent anything
from “is-a” to “part-of”.
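The SKOS structure described above can be sketched in a few lines of Python. This is only an illustration, not an RDF implementation: the `Concept` class, the `ancestors` helper and the example.org URIs are all hypothetical. It shows labels per language, synonyms, and a concept with more than one broader concept.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Concept:
    """Simplified stand-in for a skos:Concept (hypothetical, not an RDF model)."""
    uri: str
    pref_label: Dict[str, str]                                       # language -> skos:prefLabel
    alt_labels: Dict[str, List[str]] = field(default_factory=dict)   # language -> skos:altLabel values
    broader: List["Concept"] = field(default_factory=list)           # skos:broader links


def ancestors(concept: Concept) -> Set[str]:
    """Collect the URIs of every concept reachable through skos:broader."""
    seen: Set[str] = set()
    stack = list(concept.broader)
    while stack:
        current = stack.pop()
        if current.uri not in seen:
            seen.add(current.uri)
            stack.extend(current.broader)
    return seen


# Hypothetical concepts; the URIs are placeholders.
society = Concept("http://example.org/c/society", {"en": "Society"})
law = Concept("http://example.org/c/law", {"en": "Law"})
family_law = Concept(
    "http://example.org/c/family-law",
    {"en": "Family law"},
    alt_labels={"en": ["Domestic relations law"]},
    broader=[law, society],  # more than one broader concept is allowed
)

print(sorted(ancestors(family_law)))
```

Because skos:broader carries no strict "is-a" semantics, the traversal only reports which concepts sit higher in the hierarchy; it does not say what kind of relation each link represents.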
17. Global Research Identifier Database (GRID) in SKOS
We already have a lot of data in
the global Dataverse network.
Can we provide depositors with a
convenient web interface to link
their metadata to external
controlled vocabularies?
Is it possible to disambiguate
concepts and create links
automatically?
18. SKOSMOS framework to discover ontologies
● SKOSMOS is developed in
Europe by the National Library
of Finland (NLF)
● active global user community
● search and browsing interface
for SKOS concepts
● multilingual vocabularies
support
● used for different use cases
(publish vocabularies, build
discovery systems, vocabulary
visualization)
21. Use case: COVID-19 expert questions
Source: Epidemic Questions Answering
“In response to the COVID-19 pandemic, the Epidemic Question Answering (EPIC-QA) track challenges teams to develop
systems capable of automatically answering ad-hoc questions about the disease COVID-19, its causal virus SARS-CoV-2,
related corona viruses, and the recommended response to the pandemic. While COVID-19 has been an impetus for a
large body of emergent scientific research and inquiry, the response to COVID-19 raises questions for consumers.”
23. COVID-19 questions in Dataverse metadata
Source: COVID-19 European data hub in Harvard Dataverse
● COVID-19 ontologies can be hosted by
the SKOSMOS framework
● Researchers can enrich metadata by
adding standardized questions provided
by SKOSMOS ontologies
● rich metadata is exported back to the Linked
Open Data Cloud to increase its chance
of being found
● enriched metadata can be used for
further training of ML models
25. Dataverse deposit form with ontologies
Every field can be linked to the appropriate controlled vocabularies in a FAIR way!
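As an illustration of what a linked field could look like, the sketch below builds a keyword entry in the shape used by the Dataverse dataset JSON for the citation block (`keywordValue`, `keywordVocabulary`, `keywordVocabularyURI`). The helper names and the example.org vocabulary URI are assumptions made for this example; the slides do not show the actual implementation.

```python
import json


def primitive(type_name, value):
    """Wrap a single value in the primitive-field structure of Dataverse dataset JSON."""
    return {
        "typeName": type_name,
        "multiple": False,
        "typeClass": "primitive",
        "value": value,
    }


def keyword_field(terms):
    """Build a compound 'keyword' field from (label, vocabulary name, vocabulary URI) tuples."""
    return {
        "typeName": "keyword",
        "multiple": True,
        "typeClass": "compound",
        "value": [
            {
                "keywordValue": primitive("keywordValue", label),
                "keywordVocabulary": primitive("keywordVocabulary", vocab_name),
                "keywordVocabularyURI": primitive("keywordVocabularyURI", vocab_uri),
            }
            for label, vocab_name, vocab_uri in terms
        ],
    }


# Hypothetical term picked by a depositor; the vocabulary URI is a placeholder.
keyword = keyword_field([("Family", "UNESCO Thesaurus", "http://example.org/vocab/unesco")])
print(json.dumps(keyword, indent=2))
```

Keeping the vocabulary name and URI next to the term is what lets the metadata stay traceable to its source vocabulary after export.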
26. One metadata field linked to many ontologies
The language switch in Dataverse changes the language of the suggested terms!
27. Semantic Gateway lookup API
Scenario: when a user selects a vocabulary and searches for a term,
the API takes the filled-in values and returns a list of concepts in the
Skosmos format:
GET /?lang=$language&vocab=$vocabulary&query=$keyword
examples:
GET /?lang=en&vocab=unesco&query=fam
Dataverse can be connected to any service that supports the Skosmos protocol!
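A minimal sketch of a client for the lookup call above, using only the Python standard library. The gateway host `gateway.example.org` and the sample response are made up for illustration; the response shape (a `results` list with `uri` and `prefLabel`) follows the Skosmos search format, and a real gateway may return additional fields.

```python
from urllib.parse import urlencode


def build_lookup_url(base_url, lang, vocab, query):
    """Build GET /?lang=...&vocab=...&query=... for a gateway hosted at base_url."""
    params = urlencode({"lang": lang, "vocab": vocab, "query": query})
    return f"{base_url.rstrip('/')}/?{params}"


def extract_concepts(payload):
    """Return (label, uri) pairs from a Skosmos-style search response."""
    return [(hit.get("prefLabel"), hit.get("uri")) for hit in payload.get("results", [])]


# Hypothetical gateway host and hand-written sample response (not real data).
lookup_url = build_lookup_url("https://gateway.example.org", "en", "unesco", "fam")
sample_response = {
    "results": [
        {
            "uri": "http://example.org/concept/family",
            "prefLabel": "Family",
            "lang": "en",
            "vocab": "unesco",
        }
    ]
}
suggestions = extract_concepts(sample_response)

print(lookup_url)
print(suggestions)
```

In a deposit form, the (label, uri) pairs would feed the autocomplete list, with the URI stored alongside the label chosen by the depositor.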
28. SKOSMOS python module (SKOSMOS-Client)
from skosmos_client import SkosmosClient
# create a client for a specific Skosmos REST API (here: Finto)
skosmos = SkosmosClient(api_base='http://api.finto.fi/rest/v1/')
Finding the available vocabularies:
Vocabulary id: afo title: AFO - Natural resource and environment ontology
Vocabulary id: allars title: Allärs - General thesaurus in Swedish
Vocabulary id: cn title: Finnish Corporate Names
Vocabulary id: ic title: Iconclass
...
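The listing above can be reproduced from a vocabulary list in JSON form. The sketch below assumes a payload with a `vocabularies` array whose entries carry `id` and `title`, which is how the Skosmos REST API is understood to answer a vocabularies request; the sample payload is hand-written from the entries shown above, and `format_vocabularies` is a hypothetical helper, not part of SKOSMOS-Client.

```python
import json

# Hand-written sample payload built from the listing above (assumed response shape).
SAMPLE_PAYLOAD = json.loads("""
{
  "vocabularies": [
    {"id": "afo", "title": "AFO - Natural resource and environment ontology"},
    {"id": "cn", "title": "Finnish Corporate Names"},
    {"id": "ic", "title": "Iconclass"}
  ]
}
""")


def format_vocabularies(payload):
    """Format each vocabulary entry as 'Vocabulary id: <id> title: <title>'."""
    return [
        f"Vocabulary id: {vocab['id']} title: {vocab['title']}"
        for vocab in payload.get("vocabularies", [])
    ]


lines = format_vocabularies(SAMPLE_PAYLOAD)
for line in lines:
    print(line)
```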
30. Other SKOSMOS supported services
● Finto (Finnish thesaurus and ontology service)
● CESSDA CV Service has implemented SKOSMOS interface
● CESSDA ELSST (European Language Social Science Thesaurus)
● ACDH Vocabularies (Austrian Academy of Sciences)
● Thesaurus INRAE (Paris, France)
● AGROVOC Multilingual Thesaurus (United Nations)
● UNESCO Thesaurus
● European Space Agency ESA
NDE (Netwerk Digitaal Erfgoed) is working with DANS on (partial)
support of the SKOSMOS protocol to get a proper external CV connection to
DANS Data Stations.
31. Dataverse meta(data) in FAIR Data Point (FDP)
• FDP is a technology developed in the FAIRsFAIR
project led by DANS
• RESTful web service that enables data
owners to expose their data sets using rich
machine-readable metadata
• Provides standardized descriptions (RDF-
based metadata) using controlled
vocabularies and ontologies
• FDP spec is public
Source: FDP
The goal is to run FDP on the
Dataverse side (DCAT, CVs) and
provide metadata export in RDF!
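To make the RDF export goal more concrete, here is a hedged sketch of a dataset description in JSON-LD using the DCAT and Dublin Core Terms namespaces. The `dataset_to_dcat` function, the DOI and the concept URI are hypothetical; an actual FDP export would follow the FDP specification and contain many more properties.

```python
import json


def dataset_to_dcat(doi, title, concept_uris):
    """Describe a dataset as a dcat:Dataset whose themes point to external CV concepts."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": f"https://doi.org/{doi}",
        "@type": "dcat:Dataset",
        "dct:title": title,
        # dcat:theme links the dataset to concepts from controlled vocabularies
        "dcat:theme": [{"@id": uri} for uri in concept_uris],
    }


# Placeholder identifiers for illustration only.
record = dataset_to_dcat(
    "10.5072/FK2/EXAMPLE",
    "Example dataset",
    ["http://example.org/concept/family"],
)
print(json.dumps(record, indent=2))
```

Because the themes are URIs rather than free-text keywords, a harvester can follow them back to the vocabulary service for labels in other languages.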
32. Linking data (files) to external CVs
Source: Scholars Portal's Data Curation Tool (Canada)