Data standardization process for social sciences and humanities
1. dans.knaw.nl
DANS is an institute of KNAW and NWO
Data standardization process
for arts and humanities
Vyacheslav Tykhonov
Senior Information Scientist
(DANS-KNAW, Netherlands)
Developing the SSHOC Reference Ontology workshop
ICS-FORTH , Heraklion, Crete
21-22 May, 2019
3. Outline
• Standardization process during data deposit and archiving
(metadata level created by users)
• Research data management and harmonization of deposited
datasets (file level)
• Standardization and enrichment of harvested content (metadata
level provided by different data providers)
• Tracking provenance information for data and tools, moving to FAIR
Big problem: researchers and librarians are not talking to each other
and there is no common Reference model!
4. Metadata schemas
• EASY TDR has its own metadata schema, developed for the Dutch
scientific landscape, but allows Dublin Core export from its
OAI-PMH endpoint
• NARCIS is an aggregator that harvests metadata from
various repositories, without a standardization pipeline
• Metadata from Dataverse can be exported as:
5. Controlled vocabulary and thesaurus
• Linked data is one step forward (or actually backward in the right
direction) toward solving some standardization problems.
• When shared controlled vocabularies (CVs) are created and
maintained by experts in various domains, digital items can
be annotated with them and easily retrieved by other experts
from the same domain without being a librarian. This gives a clear
indication of which vocabulary is good enough and shared by a
critical mass.
• A thesaurus is a semantic network of unique concepts, including
relationships between synonyms, broader and narrower
(parent/child) contexts, and other related concepts. A thesaurus
provides the hierarchy for controlled vocabularies.
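To make the thesaurus idea concrete, here is a minimal sketch of a semantic network of concepts with synonyms and broader/narrower links. The `Concept` class, the concept ids and the sample labels are all hypothetical and invented for illustration; they are not taken from any CESSDA vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One thesaurus entry (hypothetical structure, loosely modelled on SKOS)."""
    pref_label: str
    alt_labels: set = field(default_factory=set)  # synonyms
    broader: set = field(default_factory=set)     # ids of parent concepts
    related: set = field(default_factory=set)     # ids of related concepts


# Illustrative sample data only.
thesaurus = {
    "social-sciences": Concept("Social sciences"),
    "sociology": Concept("Sociology", broader={"social-sciences"}),
    "demography": Concept(
        "Demography",
        alt_labels={"Population studies"},
        broader={"social-sciences"},
        related={"sociology"},
    ),
}


def narrower(thesaurus, concept_id):
    """Return the ids of all concepts that list concept_id as a parent."""
    return sorted(cid for cid, c in thesaurus.items() if concept_id in c.broader)


def find_by_label(thesaurus, label):
    """Resolve a preferred label or synonym to a concept id, ignoring case."""
    wanted = label.casefold()
    for cid, concept in thesaurus.items():
        labels = {concept.pref_label, *concept.alt_labels}
        if any(candidate.casefold() == wanted for candidate in labels):
            return cid
    return None
```

Storing only the broader links and deriving the narrower ones keeps the parent/child relation consistent in both directions.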
6. SSHOC data repository
DANS-KNAW is leading the development of the SSHOC DataverseEU project.
We are developing a multilingual web interface, localizing metadata fields, and have developed a data
standardization technique based on APIs for the CESSDA CVs, Topic Classification and CESSDA CV Manager
services.
SSHOC/CESSDA DataverseEU:
• Hungary (TARKI)
• Sweden (SND)
• Slovenia (ADP)
• Germany (GESIS)
• France (Sciences Po)
• Austria (AUSSDA)
• United Kingdom (UKDA)
• Italy (CNR, UniData)
• Belgium (SODA)
• Latvia (LSZDA)
• Poland (PSNC)
• Norway (DataverseNO)
• Netherlands (DANS-KNAW)
7. SKOS RDF Vocabularies (CESSDA)
We import thesauri delivered as SKOS RDF, for example:
A REST API endpoint returns JSON suitable for web applications.
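As a rough illustration of this step, the sketch below parses a small SKOS RDF/XML fragment and turns it into JSON of the kind a web application could consume. The sample concept, the `example.org` URIs and the `skos_to_records` helper are assumptions made up for this example, not the actual CESSDA vocabulary or API. The parser handles only the simple case where concepts appear as `skos:Concept` elements; a full RDF library would be needed for other serializations.

```python
import json
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample; the URIs are placeholders.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/cv/topic/Demography">
    <skos:prefLabel xml:lang="en">Demography</skos:prefLabel>
    <skos:prefLabel xml:lang="nl">Demografie</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/cv/topic/SocialSciences"/>
  </skos:Concept>
</rdf:RDF>"""


def skos_to_records(rdf_xml):
    """Extract URI, per-language preferred labels and broader links."""
    root = ET.fromstring(rdf_xml)
    records = []
    for concept in root.findall(f"{{{SKOS}}}Concept"):
        labels = {
            label.get(XML_LANG, "und"): label.text
            for label in concept.findall(f"{{{SKOS}}}prefLabel")
        }
        broader = [
            link.get(f"{{{RDF}}}resource")
            for link in concept.findall(f"{{{SKOS}}}broader")
        ]
        records.append({
            "uri": concept.get(f"{{{RDF}}}about"),
            "prefLabel": labels,
            "broader": broader,
        })
    return records


records = skos_to_records(SAMPLE)
# ensure_ascii=False keeps non-English labels readable in the JSON output.
payload = json.dumps(records, ensure_ascii=False)
```

Keying the labels by language code is one way to support a multilingual interface: the client picks the label for the current locale and falls back to another language when it is missing.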
10. Standardized metadata in RDF
All relations are exported to the Knowledge Graph
and are ready for further querying and exploration:
11. Research data management
The data standardization process plays a key role in the data
management plan of any organization, but the current situation in
research data management is very complex:
• too much data chaos in datasets
• no data transparency
• sometimes no standards available
• no provenance information attached to data
• homonyms, synonyms, generalizations, specializations,
spelling variations and mistakes, and language versions all
complicate keyword-based search and retrieval of
information
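The keyword problem in the last bullet can be sketched as a small normalization step that maps free-text keywords onto controlled vocabulary terms. This is a minimal sketch under stated assumptions: the `VOCABULARY` entries and the `standardize` function are hypothetical, and fuzzy matching with the standard library's `difflib` stands in for whatever matching a real pipeline would use. It covers synonyms, language versions and spelling mistakes, but not homonyms, which need context to resolve.

```python
import difflib

# Hypothetical controlled terms with known variants (synonyms, Dutch versions).
VOCABULARY = {
    "migration": {"migration", "migratie", "human migration"},
    "unemployment": {"unemployment", "joblessness", "werkloosheid"},
}

# Reverse index: normalized variant -> controlled term.
LOOKUP = {
    variant.casefold(): term
    for term, variants in VOCABULARY.items()
    for variant in variants
}


def standardize(keyword, cutoff=0.8):
    """Map a free-text keyword to a controlled term, or None if nothing fits."""
    key = " ".join(keyword.casefold().split())  # normalize case and whitespace
    if key in LOOKUP:
        return LOOKUP[key]
    # Fall back to fuzzy matching to catch spelling variations and mistakes.
    close = difflib.get_close_matches(key, list(LOOKUP), n=1, cutoff=cutoff)
    return LOOKUP[close[0]] if close else None
```

Returning `None` rather than a best guess leaves unmatched keywords visible, so they can be reviewed by a person or proposed as new vocabulary entries.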
17. Time Machine association
• large-scale project with 300+ partners
• development and support of sustainable networked services
• trend watching and tracking of software maturity levels
• reliable governance model
• the foundation for further innovation!
18. Conclusion
• development of large-scale networked services out of research
pipelines
• every service should be mature enough, maintainable and follow a
continuous integration pipeline
• tracking provenance information for every tool and dataset is the
highest priority
• creation and governance of standardization pipelines based on
services providing access to domain specific controlled vocabularies
and ontologies
• providing access to data, metadata and provenance (processes) in the
Knowledge Graph
• further integration of services maintained by different partners and
deployed in the Cloud