Automated CI/CD testing, installation and deployment of Dataverse infrastructure on a Cloud

This project is funded from the EU Horizon 2020 Research and Innovation Programme (2014-2020) under Grant Agreement No. 823782
SSHOC
Dataverse in the
European Open Science Cloud
Automated CI/CD Testing, Installation and Deployment
Slava Tykhonov (DANS-KNAW)
lead software engineer, DANS R&D
Dataverse Community Meeting 2021
14 June 2021
Harvard University

Overview
● SSHOC in the European Open Science Cloud (EOSC)
● CESSDA Maturity Model
● Docker
● Kubernetes
● Serverless
● Community-driven QA plan
● CI/CD and test automation with pyDataverse
● Quality Assurance as a Service, EOSC Synergy
● Q&A

Duration: 40 months
(January 2019 – 30 April 2022)
Partners: 47
(20 beneficiaries + 27 LTPs)
SSH ESFRI Landmarks and Projects
& international SSH data infrastructures
Project budget:
€ 14,455,594.08
Type of action & funding:
Research and Innovation action
(INFRAEOSC-04-2018)
Project website:
www.SSHOpenCloud.eu
Objectives:
• creating the social sciences and humanities (SSH) part of European Open Science Cloud (EOSC)
• maximising re-use through Open Science and FAIR principles (standards, common catalogue, access control, semantic techniques, training)
• interconnecting existing and new infrastructures (clustered cloud infrastructure)
• establishing appropriate governance model for SSH-EOSC
Project:

EOSC Synergy

DANS Data Stations - Future Data Services

Interoperability in EOSC
● Technical interoperability defined as the “ability of different information technology
systems and software applications to communicate and exchange data”. It should allow
“to accept data from each other and perform a given task in an appropriate and
satisfactory manner without the need for extra operator intervention”.
● Semantic interoperability is “the ability of computer systems to transmit data with
unambiguous, shared meaning. Semantic interoperability is a requirement to enable
machine computable logic, inferencing, knowledge discovery, and data”.
● Organisational interoperability refers to the “way in which organisations align their
business processes, responsibilities and expectations to achieve commonly agreed and
mutually beneficial goals. Focus on the requirements of the user community by making
services available, easily identifiable, accessible and user-focused”.
● Legal interoperability covers “the broader environment of laws, policies, procedures
and cooperation agreements”
Source: EOSC Interoperability Framework v1.0

SSHOC task 5.2 Hosting and sharing data repositories
Objective
Development of a research data repository service on EOSC, for
SSH institutions currently without such a facility for their
designated communities
Deliverables
After 38 months: Data repository service running on EOSC
After 40 months: Report on principles of governance and
sustainability of the data repository service

Services in European Open Science Cloud (EOSC)
● EOSC requires the level 8 of maturity (at
least)
● we need the highest quality of software to
be accepted as a service
● clear and transparent evaluation of
services is essential
● the evidence of technical maturity is the
key to success
● the limited warranty will allow to stop out-
of-warranty services

Dataverse deployment in CESSDA Cloud
Docker Compose for the local development and testing
Kubernetes (K8s) for the Cloud deployment of services
(CI/CD pipeline with Jenkins and Helm)

Dataverse Docker module (CESSDA Dataverse, 2018)
Source: https://github.com/IQSS/dataverse-docker

Docker introduction
• Extremely powerful configuration tool
• Allows to install software on any platform (Linux, Mac, Windows)
• Any software can be installed from Docker as standalone container or container
delivering Microservices (database, search engine, core service)
• Docker allows to host unlimited amount of the same software tools on different
ports
• Docker can be used to organise multilingual interfaces, for example
• Deployments with orchestrator tools: Docker Swarm, Kubernetes
• Faster development and deployments
• Isolation of running containers allows to scale up apps
• Portability saves time to run the same image on the local computer or in the cloud
• Snapshotting allows to archive Docker images state
• Resource limitation can be adjusted

Dataverse Kubernetes
Project maintained by Oliver Bertuch (FZ Julich) and available in Global Dataverse
Community Consortium github (GDCC)
Google Cloud, Amazon AWS, Microsoft Azure platforms supported
Open Source, community pull requests are welcome
http://github.com/IQSS/dataverse-kubernetes

How to deploy Dataverse on Google Cloud Platform
Resources needed to deploy Dataverse:
» GCP VM-instance(s)
» Kubernetes (K8S) cluster
» Docker images
» K8S Services
» K8S Deployments (workloads)
» Load Balancer or Ingress (SSL certificates)
» Mail relay (default port 25 is blocked)
» Persistent volumes (storage) for the Solr, Postgres, Dataverse and SSL services

Integration with Google Cloud
Google Cloud Platform
» GCloud
a command-line (CLI) tool for running commands against Google Cloud Platform.
GCloud is part of the Google Cloud SDK
Kubernetes cluster
» Kubectl
A command line interface (CLI) for running commands against Kubernetes clusters
» Use of .yaml configuration files to describe the desired resources
Note: Kubernetes software is used by all major Cloud providers, the different is the storage
specification!

Dataverse Cloud Architecture
Ingress
HTTP(S) Load Balancer
Kubernetes Engine
Dataverse Service
Kubernetes Cluster
K8S Cluster Node
Dataverse Deployment Dataverse Service
Solr Deployment Solr Service
PostgreSQL
Service
PostgreSQL Deployment
Users

How to scale up Kubernetes horizontally
Kubernetes Engine
Compute Engine
Dataverse Service
Kubernetes Cluster
K8S Cluster Node1
Users
K8S Cluster Node2
Docker Hub
Container Registry

The importance of Persistent Storage
Docker containers write files to disk (I/O) for state or storage, both in /data and
/docroot folders. If a Docker container is restarted for some reason, all data will be
lost.
Solution: mount Persistent storage into the container on external disk hosted in the
Cloud.

SQA process with Selenium tests for Dataverse
Selenium IDE allows
to create and replay
all UI tests in your
browser
Shared tests can be
reused by community
to increase
reproducibility
SQA for the service maturity = unit tests + integration tests
http://github.com/GlobalDataverseCommunityConsortium/dataverse-test-suite
18

CI/CD pipeline with SQAaaS (S)
1
2
3
git
push
Push GCP
container
registry
webhook
Create
docker
image
Kubernetes
Deployment
git clone
Jenkins pipeline (Jenkinsfile)
9
7
Run SQA
S 8
1. Developer pushes code to GitHub
2. Jenkins receives notification - build trigger
3. Jenkins clones the workspace
4. (S) Runs SQA tests and does FAIRness check
5. (S) Issuing digital badge according to the results
6. (S) SQAaaS API triggers appropriate workflow
7. Creates docker image if success
8. Pushes new docker image to container registry
9. Updates the kubernetes deployment
19
Source: EOSC Synergy project

Dataverse App Store
Data preview: DDI Explorer, Spreadsheet/CSV, PDF, Text files, HTML, Images, video render, audio,
JSON, GeoJSON/Shapefiles/Map, XML
Interoperability: external controlled vocabularies (SKOSMOS, CESSDA CV Manager)
Data processing: NESSTAR DDI migration tool
Linked Data: RDF compliance including SPARQL endpoint (FDP)
Multilingual support: Weblate as a Service http://translation.cessda.eu
Federated login: eduGAIN, EGI Check-in, PIONIER ID
CLARIN Switchboard integration: Natural Language Processing tools
Visualization tools (maps, charts, timelines), Apache Superset integration

GUI translation with Weblate as a service
Source: SSHOC Weblate

Dataverse and CLARIN tools integration

Dataverse Apps in Serverless - Function as a Service
Infrastructure maintenance = human resources and money.
Are you sure you need a server to provide all of your services?
The Serverless application framework allows to upload your code to the Cloud
space and it will provision and run it without any extra efforts! Scaling is done
automatically based on the workfload.
Costs - only pay for the time when your application or computing process is running.
AWS Lambda can be used to spin up application based on triggers/events.

Serverless support by leading Cloud Providers
Function as a Service (FaaS) model:
Main costs:
1. Number of requests: typically billed for every million requests per month
2. Compute time: this is a measurement of the duration of the life of the function and the
amount of memory provisioned, measured in GB-seconds (higher-memory execution will
cost more than lower-memory execution of the same duration)
Source: Serverless Showdown

Serverless Use Cases for Dataverse
Remember: inactive Serverless application will require some additional time to
respond to a request. This initial delay encountered is known as a “cold start”.
● Can we run Dataverse instance as Serverless app?
Yes, but just for demo, Kubernetes setup is for Dataverse as a service
● Data Previewers in Serverless?
Ideal case, cheap and efficient!
● Data Computing tools?
Yes, will fit for services running on demand
● GIS tools integration with Dataverse?
Function as a Service (FaaS) is the best

Practical example - Caching function for CVV
Linked Data Serverless Resolver implemented as Lambda Function and deployed
on Harvard AWS cloud
https://github.com/Dans-labs/ld-serverless-resolver

Microsoft Visual Studio Code
Features:
● source-code editor made by Microsoft
for Windows, Linux and macOS
● The Integrated CLI (Command Line
Interface)
● support for debugging, syntax
highlighting, intelligent code
completion, snippets, code refactoring,
and embedded Git
● cloud-powered development
environments for any activity
Try it here: https://code.visualstudio.com

Maturity of Dataverse core and apps in EOSC
Testing process follows the CESSDA maturity model
https://zenodo.org/record/2591055#.XKR6ny2B2u5
Important: every change of Dataverse functionality should be supplied with
unit tests, changes of external functionality should get Selenium scenarios.
Goal: automate all tests, workflows and integrations, follow the Software
Quality as a Service (SQaaS) framework and score as high as possible
according to the CESSDA maturity model.

Our speakers
Stefan Kasberger is DevOp and Python Developer at AUSSDA (Austria) and involved
in the Dataverse community during last 3 years. He contributes Open Source tools in
Python to support Dataverse. He has done several data migrations into Dataverse.
Lately, he started to focus on creating standardized tests and test data for Dataverse
installations, to support automation/cloud/devops activities in the international
community. Stefan is creator of pyDataverse module.
Samuel Bernardo is Software Engineer at LIP in Portugal. Graduated in Lisbon
University with a degree in Computer Science and Engineering and finishing the
master degree dissertation. Worked on several projects around distributed systems,
data science and computing, Currently working in BigHPC and SQAaaS platforms
development in the EOSC Synergy and other European projects, such as Opencoasts
and WORSICA. Also a open source contributor for several years supporting the
community and free software.

Questions?
vyacheslav.tykhonov@dans.knaw.nl
https://www.sshopencloud.eu
info@sshopencloud.eu
@SSHOpenCloud
/in/sshopencloud
Join our community

Automated CI/CD testing, installation and deployment of Dataverse infrastructure on a Cloud

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Automated CI/CD testing, installation and deployment of Dataverse infrastructure on a Cloud

Similar to Automated CI/CD testing, installation and deployment of Dataverse infrastructure on a Cloud (20)

More from vty

More from vty (6)

Recently uploaded

Recently uploaded (20)

Automated CI/CD testing, installation and deployment of Dataverse infrastructure on a Cloud