5 years of Dataverse evolution
Slava Tykhonov
Senior Information Scientist,
Research & Innovation meeting (DANS-KNAW)
26.01.2021
Dataverse-based Clio Infra collaboration platform (2015)
Clio Infra functionality based on the Dataverse solution:
- teams curate, share and analyze research datasets collaboratively
- team members can share the responsibility for collecting data on specific variables (for example, countries) and inform each other about changes and additions
- the dataset version control system tracks changes in datasets
- other researchers can download their own copy of the data once a dataset is published as Open Data
Dataverse acts as a flexible metadata store, connected to the research dataset storage by a data processing engine.
Interactive Clio Infra Dashboard with data in Dataverse (2015)
DANS Dataverse 3.x migration (2016)
Basic DataverseNL services:
• Federated login for Dutch institutions
• Persistent identifier services (DOI and Handle)
• Integration with archival systems
Applications:
• Modern and historical world map visualisations
• Data API and Geo API services for projects with data
• Panel dataset constructor
• Time series plots
• Treemaps
• Pie and other chart visualizations
• Descriptive statistics tools
Major challenges in providing services for researchers
● Maintenance concerns - who will be in charge after the project is finished?
● Infrastructure problems - how to install and run tools for researchers?
● Various interoperability issues - how to facilitate data exchange between different systems and services?
Plus software updates and bug fixing, licences, technical staff training, legal aspects and so on...
The influence of API standards on innovation
Source: V. Tykhonov “API Economy”
Interoperability in EOSC
● Technical interoperability is defined as the “ability of different information technology systems
and software applications to communicate and exchange data”. It should allow “to accept
data from each other and perform a given task in an appropriate and satisfactory manner
without the need for extra operator intervention”.
● Semantic interoperability is “the ability of computer systems to transmit data with unambiguous, shared meaning. Semantic interoperability is a requirement to enable machine computable logic, inferencing, knowledge discovery, and data federation between information systems”.
● Organisational interoperability refers to the “way in which organisations align their
business processes, responsibilities and expectations to achieve commonly agreed and
mutually beneficial goals. Focus on the requirements of the user community by making
services available, easily identifiable, accessible and user-focused”.
● Legal interoperability covers “the broader environment of laws, policies, procedures and
cooperation agreements”
Source: EOSC Interoperability Framework v1.0
Open vs Closed Innovation
DANS Data Stations - Future Data Services
Dataverse is an API-based data platform and a key framework for Open Innovation!
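To make the API-first claim concrete, here is a minimal sketch that queries the Dataverse Search API with Python; the installation URL and search term are placeholders, not a specific DANS service.

```python
# Query the public Dataverse Search API of an (assumed) installation.
import requests

BASE = "https://demo.dataverse.org"  # placeholder installation
resp = requests.get(f"{BASE}/api/search", params={"q": "climate", "type": "dataset"})
resp.raise_for_status()
for item in resp.json()["data"]["items"]:
    print(item["name"], item.get("global_id"))
```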
Dataverse architecture in a nutshell
Basic components: database (PostgreSQL), search index (Solr) and web application (Glassfish/Payara)
Simple but powerful! How about maintenance?
Dataverse Docker module (CESSDA Dataverse, 2018)
Source: https://github.com/IQSS/dataverse-docker
The Cathedral and the Bazaar
“The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary
(abbreviated CatB) is an essay, and later a book, by Eric S. Raymond on software engineering methods,
based on his observations of the Linux kernel development process and his experiences managing an
open source project, fetchmail. It examines the struggle between top-down and bottom-up design.”
Wikipedia
Some important points:
● Smart data structures and dumb code works a lot better than the other way
around
● When writing gateway software of any kind, take pains to disturb the data
stream as little as possible—and never throw away information unless the
recipient forces you to!
● Any tool should be useful in the expected way, but a truly great tool lends itself
to uses you never expected
Principle of good enough
The principle of good enough or "good enough" principle is a rule in software and systems design. It
indicates that consumers will use products that are good enough for their requirements, despite the
availability of more advanced technology.
Wikipedia
The KISS principle ("Keep It Simple, Stupid") comes with a series of design rules, among them:
● Separate mechanisms from policy
● Write simple programs
● Write transparent programs
● Value developer time over machine time
● Make data complicated when required, not the program
● Build on potential users' expected knowledge
● Write programs which fail in a way that is easy to diagnose
● Prototype software before polishing it
● Make the program and protocols extensible
What should be simplified to make Dataverse “good enough”?
“One-liner” installation requirements include:
● even users without any technical knowledge should be able to install it
● a simple, clear and transparent infrastructure ready for integration (Docker based)
● a reverse proxy and load balancer should be set up both locally and on a remote host to run the Dataverse website (Nginx/Traefik); a minimal bootstrap is sketched below
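As a rough illustration of what such a one-liner could hide underneath, the sketch below clones the dataverse-docker repository and brings the stack up; it assumes Docker and docker-compose are already present, and the compose file layout may differ between releases.

```python
# Illustrative bootstrap of the dataverse-docker stack (assumes Docker and
# docker-compose are installed; repository layout may vary between releases).
import subprocess

subprocess.run(["git", "clone", "https://github.com/IQSS/dataverse-docker"], check=True)
subprocess.run(["docker-compose", "up", "-d"], cwd="dataverse-docker", check=True)
```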
Q: How do we cross the chasm?
A: Let’s try to capture the mainstream!
Using Dataverse to fight against COVID-19
1300+ people registered in the organization
Jupyter integration: dataset conversion to a pandas dataframe
Can AI researchers read and reuse data directly from Dataverse in a collaborative way?
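One answer, sketched under assumptions (a public installation and a known datafile id): Dataverse serves ingested tabular files as tab-separated text through its Data Access API, which pandas can read directly.

```python
# Load a Dataverse tabular datafile into a pandas dataframe.
import io

import pandas as pd
import requests

BASE = "https://demo.dataverse.org"  # assumed installation
FILE_ID = 42                         # hypothetical datafile id

resp = requests.get(f"{BASE}/api/access/datafile/{FILE_ID}")  # Data Access API
resp.raise_for_status()
df = pd.read_csv(io.StringIO(resp.text), sep="\t")  # ingested files are served as TSV
print(df.head())
```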
Crossing the chasm...
Technology adoption requires further automation of all processes.
Our goal is to deliver a production-ready Dataverse for the European Open Science Cloud (EOSC):
● SSHOC project: Docker/Kubernetes, a common CI/CD pipeline, integration tests, previewers, language localization, external tools
● EOSC Synergy: Software Quality Assurance as a Service (SQAaaS) pipeline integration
● CLARIAH: aligning the metadata schema with the CLARIN community, CLARIN tools integration, developing common pipelines
● FAIRsFAIR: enabling FAIR Data Points (FDP) in Dataverse
● ODISSEI: using Dataverse as a data registry
Services in European Open Science Cloud (EOSC)
● EOSC requires at least maturity level 8 (Technology Readiness Level)
● we need the highest quality of software to be accepted as a service
● clear and transparent evaluation of services is essential
● evidence of technical maturity is the key to success
● a limited warranty will make it possible to retire out-of-warranty services
Running Dataverse in production on Cloud
[Architecture diagram: users reach the Dataverse service through an HTTP(S) load balancer in front of a Kubernetes Engine cluster. A cluster node runs Dataverse, PostgreSQL, Solr, a Certbot cronjob and an email relay as separate deployments, each exposed as a Kubernetes service; container images are pulled from a container registry.]
Dataverse Kubernetes
Project maintained by Oliver Bertuch (FZ Jülich) and available in the Global Dataverse Community Consortium (GDCC) GitHub
Google Cloud, Amazon AWS and Microsoft Azure platforms are supported
Open Source; community pull requests are welcome
http://github.com/IQSS/dataverse-kubernetes
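A hedged deployment sketch wrapping kubectl from Python; the manifest path is a hypothetical example of the repository layout, not a documented one.

```python
# Illustrative deployment of the dataverse-kubernetes manifests via kubectl;
# "k8s/dataverse" is a hypothetical path, check the repository for the real one.
import subprocess

subprocess.run(["git", "clone", "https://github.com/IQSS/dataverse-kubernetes"], check=True)
subprocess.run(["kubectl", "apply", "-k", "dataverse-kubernetes/k8s/dataverse"], check=True)
```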
SQA process with Selenium tests for Dataverse
Selenium IDE lets you create and replay all UI tests in your browser
Shared tests can be reused by the community to increase reproducibility
SQA for service maturity = unit tests + integration tests
Source: SSHOC project, data repositories task WP5.2
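Recorded Selenium IDE tests can also be scripted with the Selenium WebDriver Python bindings; a minimal smoke-test sketch, with the target URL as an assumption (requires a local geckodriver):

```python
# Minimal UI smoke test with Selenium WebDriver.
from selenium import webdriver

driver = webdriver.Firefox()
try:
    driver.get("https://demo.dataverse.org")  # assumed target installation
    assert "Dataverse" in driver.title        # landing page rendered
finally:
    driver.quit()
```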
CI/CD pipeline with SQAaaS (S)
[Pipeline diagram: a git push to GitHub fires a webhook to Jenkins, which clones the workspace, runs the SQAaaS checks, builds a Docker image, pushes it to the GCP container registry and updates the Kubernetes deployment. The numbered steps are listed below; steps marked (S) are SQAaaS stages.]
1. Developer pushes code to GitHub
2. Jenkins receives a notification - the build trigger
3. Jenkins clones the workspace
4. (S) Runs SQA tests and performs a FAIRness check
5. (S) Issues a digital badge according to the results
6. (S) The SQAaaS API triggers the appropriate workflow
7. Creates a Docker image on success
8. Pushes the new Docker image to the container registry
9. Updates the Kubernetes deployment
Source: EOSC Synergy project
Data Commons is essential for integrations
Mercè Crosas, “Harvard Data Commons”
FAIR Dataverse
Source: Mercè Crosas, “FAIR principles and beyond: implementation in Dataverse”
Our goals to increase Dataverse interoperability
Provide a custom FAIR metadata schema for European research communities:
● CESSDA metadata (Consortium of European Social Science Data Archives)
● Component MetaData Infrastructure (CMDI) metadata from the CLARIN linguistics community
Connect metadata to ontologies and controlled vocabularies (CVs):
● link metadata fields to common ontologies (Dublin Core, DCAT)
● define semantic relationships between (new) metadata fields (SKOS)
● select available external controlled vocabularies for specific fields
● provide multilingual access to controlled vocabularies (see the sketch below)
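How multilingual access could work in practice, as a hedged sketch: ask a SKOS vocabulary's SPARQL endpoint for the preferred labels of one concept in all available languages. The endpoint and concept URIs are placeholders.

```python
# Fetch multilingual skos:prefLabel values for a concept; the endpoint and
# concept URIs are hypothetical placeholders.
import requests

ENDPOINT = "https://vocabularies.example.org/sparql"
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?label WHERE {
  <https://vocabularies.example.org/concept/123> skos:prefLabel ?label .
}
"""

resp = requests.get(ENDPOINT, params={"query": QUERY},
                    headers={"Accept": "application/sparql-results+json"})
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["label"].get("xml:lang", "?"), "->", row["label"]["value"])
```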
One metadata field can be linked to many ontologies
The language switch in Dataverse will change the language of the suggested terms!
The FAIR Signposting Profile
Herbert Van de Sompel
https://hvdsomp.info
Two levels of access to Web resources:
● level one provides a concise, minimal set of links by value in the HTTP Link header
● level two delivers a complete, comprehensive set of links by reference in a standalone document (a link set)
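A level-one lookup is easy to script: issue a HEAD request and parse the typed links out of the HTTP Link header. The landing-page URL is a placeholder, and whether a given installation emits Signposting links is an assumption.

```python
# Read level-one Signposting links (typed links by value in the Link header).
import requests

url = "https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/EXAMPLE"
resp = requests.head(url, allow_redirects=True)
for link in requests.utils.parse_header_links(resp.headers.get("Link", "")):
    print(link.get("rel"), "->", link.get("url"))  # e.g. cite-as, describedby, item
```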
Dataverse meta(data) in FAIR Data Point (FDP)
● a RESTful web service that enables data owners to expose their datasets using rich machine-readable metadata
● provides standardized descriptions (RDF-based metadata) using controlled vocabularies and ontologies
● the FDP specification is public
Source: FDP
The goal is to run an FDP on the Dataverse side (DCAT, CVs) and provide metadata export in RDF!
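Dataverse already exposes machine-readable metadata through its export API; a minimal sketch (installation and DOI are placeholders) fetching the Schema.org JSON-LD export as one possible starting point for FDP-style RDF publishing:

```python
# Fetch a machine-readable metadata export for a published dataset.
import requests

BASE = "https://demo.dataverse.org"  # assumed installation
DOI = "doi:10.5072/FK2/EXAMPLE"      # hypothetical persistent identifier

resp = requests.get(f"{BASE}/api/datasets/export",
                    params={"exporter": "schema.org", "persistentId": DOI})
resp.raise_for_status()
print(resp.json())  # JSON-LD document describing the dataset
```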
F-UJI Automated FAIR Data Assessment Tool
Dataverse localization with Weblate
● a service to connect files to Weblate in order to translate them in a structured way
● several options for project visibility: accept translations from the crowd, or only give access to a select group of translators
● Weblate indicates untranslated strings, strings with failing checks, and strings that need approval (see the sketch below)
● when new strings are added with an upgrade of Dataverse, Weblate can indicate which strings are new and untranslated
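In spirit, the untranslated-strings check boils down to diffing resource bundles; a simplified sketch assuming Dataverse's Java-style Bundle.properties naming (file names illustrative; the parser ignores multi-line values):

```python
# List keys present in the source bundle but missing from a translation.
def load_properties(path):
    props = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                props[key.strip()] = value.strip()
    return props

english = load_properties("Bundle.properties")    # source strings
dutch = load_properties("Bundle_nl.properties")   # translation, file name assumed
missing = [key for key in english if key not in dutch]
print(f"{len(missing)} untranslated strings")
```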
GUI translation with Weblate as a service
Source: SSHOC Weblate
Dataverse App Store
Data preview: DDI Explorer, Spreadsheet/CSV, PDF, text files, HTML, images, video and audio rendering, JSON, GeoJSON/Shapefiles/Map, XML
Interoperability: external controlled vocabularies (CESSDA CV Manager)
Data processing: NESSTAR DDI migration tool
Linked Data: RDF compliance including SPARQL endpoint (FDP)
Federated login: eduGAIN, PIONIER ID
CLARIN Switchboard integration: Natural Language Processing tools
Visualization tools (maps, charts, timelines)
Dataverse and CLARIN tools integration
Make Data Count
Make Data Count is part of a broader Research Data Alliance (RDA) Data Usage Metrics Working Group
which helped to produce a specification called the COUNTER Code of Practice for Research Data.
The following metrics can be downloaded directly from the DataCite hub for datasets hosted by Dataverse
installations:
● Total Views for a Dataset
● Unique Views for a Dataset
● Total Downloads for a Dataset
● Unique Downloads for a Dataset
● Citations for a Dataset (via Crossref)
The Dataverse Metrics API is a powerful source for BI tools used for Data Landscape monitoring (a sketch follows below).
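A hedged sketch of pulling headline counts from the Metrics API for a BI dashboard; the installation URL is a placeholder, and the assumption is that these endpoints are left public, as on a standard installation:

```python
# Pull headline counts from the public Dataverse Metrics API.
import requests

BASE = "https://demo.dataverse.org"  # placeholder installation
for metric in ("dataverses", "datasets", "files", "downloads"):
    resp = requests.get(f"{BASE}/api/info/metrics/{metric}")
    resp.raise_for_status()
    print(metric, "->", resp.json()["data"]["count"])
```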
Metrics for BI and integration with Apache Superset
Source: Apache Superset (Open Source)
Apache Superset visualizations
Apache Airflow for Dataverse pipelines
● intended for acyclic processes: workflows that process data towards a point of "completion"
● a DAG (Directed Acyclic Graph) is a collection of all the tasks, organized in a way that reflects their relationships and dependencies
● an absolutely essential component for harvesting and depositing data
● the Airflow dashboard gives a clear overview of the status of all running processes
On the roadmap of the ODISSEI project!
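A minimal sketch of such a pipeline as an Airflow 2.x DAG; the task bodies and schedule are illustrative, with harvesting and depositing standing in for OAI-PMH pulls and Dataverse native API calls.

```python
# Sketch of a two-step harvest-then-deposit DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def harvest_metadata():
    print("harvest records, e.g. from an OAI-PMH endpoint")  # placeholder task

def deposit_to_dataverse():
    print("deposit records via the Dataverse native API")    # placeholder task

with DAG(
    dag_id="dataverse_harvest",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    harvest = PythonOperator(task_id="harvest", python_callable=harvest_metadata)
    deposit = PythonOperator(task_id="deposit", python_callable=deposit_to_dataverse)
    harvest >> deposit  # deposit runs only after a successful harvest
```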
Conclusion
Thanks to its open architecture and use of open standards, the Dataverse team has managed to attract the best people, build a strong community, and ultimately create a product completely aligned with the principles of Open Innovation.
Future-proof and community-driven, it has every chance to “cross the chasm” and become a prominent FAIR data repository on all continents.
Dataverse already has a very rich ecosystem for technological innovation, one that will make it possible to integrate tools that don't exist yet.
“Any tool should be useful in the expected way, but a truly great tool
lends itself to uses you never expected”...
Questions?
Slava Tykhonov,
Senior Information Scientist
vyacheslav.tykhonov@dans.knaw.nl