dans.knaw.nl
DANS is an institute of KNAW and NWO
Building an electronic repository and archives on Dataverse
in the European Open Science Cloud
Vyacheslav Tykhonov
Senior Information Scientist
Data Archiving and Networked Services
(DANS-KNAW, Netherlands)
XVIII International Scientific and Practical conference
"BUILDING OF INFORMATION SOCIETY: RESOURCES AND TECHNOLOGIES"
September 19, 2019 in Kyiv
About me
• Born in Kyiv in 1979
• Studied at the National Technical University of
Ukraine – Kyiv Polytechnic Institute (MSc,
2002)
• Worked for international search engine
companies and media monitoring agencies
(1999-2010)
• Joined the Royal Netherlands
Academy of Arts and Sciences (KNAW) in 2011
• Senior Data Scientist at DANS-KNAW since 2016
• Currently leading the technical development of
DataverseEU cloud efforts in SSHOC Dataverse
and other projects
DANS-KNAW core services
Why Dataverse?
• Open source project developed by IQSS at Harvard University
and published on GitHub
• Mature product with a long history (since 2006)
• Dynamic and experienced development team working in an
Agile environment (community call every two weeks)
• Clear vision and understanding of research communities'
requirements, public roadmap
• A strong community behind Dataverse helps to improve the
basic functionality and develop it further
• Dataverse has been selected as data repository infrastructure
by countries on all continents
• Well-developed architecture with rich API endpoints for building
application layers around Dataverse
Dataverse and API economy
Dataverse is a data repository platform with four APIs:
- Native API
- SWORD API
- Search API
- Data Access API
An API token is the key to connecting Dataverse with an unlimited
number of tools developed by different research communities
and to integrating it with other repositories.
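As an illustration of the token-based access described above, the following minimal sketch builds an authenticated Search API request. It assumes the `/api/search` endpoint and the `X-Dataverse-key` header as documented for Dataverse. The function names, the base URL and the token value are placeholders chosen for this example and do not come from the slides.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(base_url, query, api_token, per_page=10):
    """Build (but do not send) a Dataverse Search API request.

    base_url and api_token are placeholders supplied by the caller.
    """
    params = urllib.parse.urlencode(
        {"q": query, "type": "dataset", "per_page": per_page}
    )
    url = f"{base_url.rstrip('/')}/api/search?{params}"
    request = urllib.request.Request(url)
    # The API token travels in a request header.
    request.add_header("X-Dataverse-key", api_token)
    return request


def search_datasets(base_url, query, api_token):
    """Send the request and return the matching items (requires network access)."""
    request = build_search_request(base_url, query, api_token)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Search results are expected under data -> items in the JSON response.
    return payload["data"]["items"]
```

A tool built around Dataverse could call `search_datasets("https://dataverse.nl", "survey", token)` with a token taken from the user's account page. Keeping request construction separate from sending makes the sketch easy to check without a live server.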
DataverseNL as a shared service
Datasets container for Leiden University
DataverseNL as a collaboration platform
• DataverseNL is a shared service provided by the participating institutions
and DANS. DANS performs back office tasks, including server and software
maintenance and administrative support.
• The participating institutions are responsible for managing the deposited
data and the content. Every institution has its own data manager.
• User-friendly: users at participating institutions simply log in and
DataverseNL is ready for use.
• Reliable and safe: in cooperation with the participating institutions and
universities, standard procedures have been established which ensure
sound data management. Data are stored in the Netherlands.
• Accessible: the service can be accessed online, from anywhere and at any
time. Just open dataverse.nl!
Dataset submission form
Published dataset in Ukrainian
SSHOC DataverseEU project
SSHOC stands for Social Sciences and Humanities Open Cloud.
The goal of the SSHOC Dataverse project (CESSDA, DARIAH and CLARIN) is to create a reliable,
production-ready open source data infrastructure that anyone can install and reuse for their own needs and
requirements.
We are developing a multilingual web interface, localizing metadata fields, and have developed a data
standardization technique based on APIs for the CESSDA CVs, Topic Classification and CESSDA CV Manager
services.
DataverseEU countries:
• Hungary (TARKI)
• Sweden (SND)
• Slovenia (ADP)
• Germany (GESIS)
• France (Sciences Po)
• Austria (AUSSDA)
• United Kingdom (UKDA)
• Italy (UniData)
• Belgium (SODA)
• Latvia (LSZDA)
• Netherlands (DANS-KNAW)
The SSHOC Dataverse project has two parallel development tracks:
• The core development team is working on modifying and extending
the Dataverse core functionality.
• The application development team will create new tools or integrate
existing ones, which will be published on the Dataverse App Store website.
Our goal is to build a distributed and mature data infrastructure based on
sustainable microservices.
Development process
Maturity evaluation of DataverseEU services
• The testing process should be compliant with the CESSDA services maturity
model https://zenodo.org/record/2591055#.XKR6ny2B2u5
• Every change to Dataverse functionality should come with a unit test;
changes to external functionality should get Selenium scenarios.
• The service should score as high as possible on the CESSDA
maturity model
Services in European Open Science Cloud (EOSC)
• EOSC requires at least maturity level 8
• We need the highest quality of software to be accepted as a service
• Clear and transparent evaluation of services is essential
• Evidence of technical maturity is the key to success
• A limited warranty will make it possible to stop out-of-warranty services
Research data management
Data standardization plays a key role in the data
management plan of any organization, but the current situation in
research data management is very complex:
• too much data chaos in datasets
• no data transparency
• sometimes no standards available
• no provenance information attached to data
• homonyms, synonyms, generalizations, specializations,
spelling variations and mistakes, and language versions all
complicate keyword-based search and retrieval of
information
Controlled vocabulary and thesaurus
• Linked data is one step forward (or actually backward in the right
direction) towards solving some standardization problems.
• With shared controlled vocabularies (CVs) created and
maintained by experts in various domains, digital items can
be annotated with them and easily retrieved by other experts
from the same domain without being a librarian. It is a clear
indication of which vocabulary is good enough and shared by a
critical mass.
• A thesaurus is a semantic network of unique concepts, including
relationships between synonyms, broader and narrower
(parent/child) contexts, and other related concepts. A thesaurus
provides a hierarchy for controlled vocabularies.
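The thesaurus idea above can be sketched as a small data structure: each concept has one preferred label, a set of synonyms, and an optional broader (parent) concept. This is only a toy model under stated assumptions. The class names, the example terms and the case-insensitive matching rule are hypothetical and are not taken from the CESSDA vocabularies or from Dataverse.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Concept:
    """One unique concept in a toy thesaurus."""
    preferred_label: str
    synonyms: frozenset = frozenset()
    broader: Optional[str] = None  # preferred label of the parent concept


class Thesaurus:
    def __init__(self, concepts):
        concepts = list(concepts)
        self._concepts = {c.preferred_label: c for c in concepts}
        # Index every label and synonym, ignoring case, to its preferred label.
        self._index = {}
        for concept in concepts:
            for label in {concept.preferred_label, *concept.synonyms}:
                self._index[label.casefold()] = concept.preferred_label

    def normalize(self, keyword):
        """Return the preferred label for a keyword, or None if unknown."""
        return self._index.get(keyword.strip().casefold())

    def broader_chain(self, keyword):
        """List the concept and its ancestors, narrowest first.

        Assumes the broader links contain no cycles.
        """
        label = self.normalize(keyword)
        chain = []
        while label is not None:
            chain.append(label)
            label = self._concepts[label].broader
        return chain


# Hypothetical example terms, for illustration only.
thesaurus = Thesaurus([
    Concept("Social sciences"),
    Concept("Economics", frozenset({"economy"}), broader="Social sciences"),
    Concept(
        "Labour economics",
        frozenset({"labor economics", "labour market studies"}),
        broader="Economics",
    ),
])
```

With this structure, spelling variants such as "labor economics" and "Labour economics" resolve to the same concept, and a search for a broad term can be expanded to its narrower concepts by following the broader links in reverse.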
CESSDA CV Service
External controlled vocabularies in Dataverse
Standardized metadata in Dataverse
Weblate as a multilingual support service
Managing translations with Weblate
Questions?
Contact me:
Slava Tykhonov
vyacheslav.tykhonov@dans.knaw.nl
https://www.linkedin.com/in/vyacheslavtikhonov/
https://twitter.com/4tykhonov
Watch the SSHOC Dataverse presentation at Harvard University:
https://www.youtube.com/watch?v=vAPpKuDQUDY
Try now!
https://dataverse.harvard.edu and https://dataverse.nl
http://dataverse.org.ua (Ukrainian portal)
http://github.com/IQSS/dataverse (application source code)
http://github.com/IQSS/dataverse-docker (Cloud release for Kubernetes)
