The main focus of the Science Demonstrator sessions is to give the EOSC community feedback on the Science Demonstrators' first experience of using the emerging EOSC ecosystem in practice.
In each panel, a representative of a Science Demonstrator will provide an overview of their experience in using emerging EOSC services.
These sessions will help members of the scientific communities understand the current maturity of the EOSC ecosystem and what is achievable in a given field of scientific research. They are also valuable to prospective Service Providers who wish to discover the challenges and opportunities that user communities might face as a result of adopting their services.
This session will focus on life science and materials science.
2. PanCancer Science Demonstrator - Sergei Yakneen, EMBL
www.eoscpilot.eu
The European Open Science Cloud for Research pilot project is funded by the European Commission, DG Research & Innovation under contract no. 739563
3. The Science Challenge
- Collect Next Generation Sequencing Data from several
cohorts of cancer patients generated at multiple
sequencing centres and across multiple cancer types.
- Reanalyze the data using a uniform and consistent data
processing pipeline utilizing established best practices
from the International Cancer Genomics Consortium.
- Analyze the integrated data set to identify patterns of
germline and somatic mutation that act across cancer
types in a PanCancer fashion.
4. The Science Demonstrator
- Utilize Butler, a cloud-based large-scale scientific workflow
framework developed in the context of ICGC’s Pancancer Analysis
of Whole Genomes project to perform a coordinated data
analysis across multiple clouds.
  - Code - https://github.com/llevar/butler
  - Paper - https://doi.org/10.1101/185736
- Perform automated repeatable deployments and configuration of
the entire processing infrastructure at three academic cloud
computing environments.
  - EMBL-EBI Embassy Cloud
  - ComputeCanada West Cloud
  - Cyfronet
- Deliver a large dataset (>50 TB) to each cloud computing centre.
- Use Butler to run PanCancer pipelines and monitor progress.
5. Successes
         | EMBL/EBI Embassy                            | Compute Canada                                     | Cyfronet
vCPU     | 1000                                        | 1000                                               | 700
RAM      | 4 TB                                        | 4 TB                                               | 2.6 TB
Disk     | 1 PB                                        | 150 TB                                             | 200 TB
Data     | 448 samples from 224 prostate cancer donors | 422 samples from 211 pediatric brain tumour donors | 2081 samples from 1000 Genomes Project
Raw data | 71 TB                                       | 62 TB                                              | 50 TB
Status   | Alignment and variant calling completed     | Alignment and variant calling completed            | Alignment completed
- Developed configurations for each cloud - https://github.com/llevar/eosc_pilot
- Developed extensive documentation and examples - https://butler.readthedocs.io/en/latest/
- Developed Butler self-healing capabilities.
- Performed data staging via Cyfronet Onedata.
6. Issues
- The biggest issue encountered by the SD was the initial shortage of resources for operating at “cloud scale”.
  - Used 20% of the data set that was utilized for PCAWG.
  - < 0.5% of the data set for the 100k Genomes Project.
- Repeatable provisioning of large clusters of VMs.
  - >10% of provisioning jobs experience failures.
- Data movement and staging.
  - A 50 TB data set takes up to two weeks to move between locations.
  - Genomics data requires encryption and network security measures.
- Shared access to network-accessible storage creates processing bottlenecks.
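The two-week figure for moving a 50 TB data set implies a fairly modest sustained transfer rate. A quick back-of-the-envelope check, assuming decimal units (1 TB = 10^12 bytes) and 14 full days of continuous transfer; the function name is illustrative only:

```python
def effective_throughput(size_tb: float, days: float) -> tuple[float, float]:
    """Return (MB/s, Mbit/s) for moving size_tb terabytes in the given number of days."""
    size_bytes = size_tb * 1e12          # assumption: decimal terabytes
    seconds = days * 24 * 60 * 60
    bytes_per_second = size_bytes / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


mb_per_s, mbit_per_s = effective_throughput(50, 14)
print(f"{mb_per_s:.1f} MB/s, {mbit_per_s:.0f} Mbit/s")  # 41.3 MB/s, 331 Mbit/s
```

Under these assumptions the sustained rate is about 41 MB/s, roughly a third of a 1 Gbit/s link.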
7. Lessons Learned
- Effectively supporting life sciences use cases like cancer
genomics will require A LOT of resources.
- Diverse data-sets have diverse data handling requirements, thus
it is better to provide a variety of tools to make solutions with
rather than a single “solution”.
- Automated detection and resolution of issues with
infrastructure (a la Butler self-healing) are imperative for
effective operation at cloud-scale.
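The idea of automated detection and resolution can be illustrated with a generic retry loop around a provisioning step. This is only a sketch of the general pattern, not Butler's actual self-healing mechanism; `provision_with_retries`, `ProvisioningError`, and the fake provisioner are hypothetical names introduced for illustration.

```python
import time
from typing import Callable


class ProvisioningError(Exception):
    """Hypothetical error raised when a VM fails to come up."""


def provision_with_retries(provision: Callable[[], str],
                           max_attempts: int = 3,
                           delay_seconds: float = 0.0) -> tuple[str, int]:
    """Call provision() until it succeeds or max_attempts is reached.

    Returns the resource identifier and the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return provision(), attempt
        except ProvisioningError:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the operator
            time.sleep(delay_seconds)  # wait before retrying
    raise ValueError("max_attempts must be at least 1")


def make_flaky_provisioner(failures: int) -> Callable[[], str]:
    """Build a fake provisioner that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures}

    def provision() -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ProvisioningError("VM did not become reachable")
        return "vm-001"

    return provision


vm_id, attempts = provision_with_retries(make_flaky_provisioner(2))
print(vm_id, attempts)  # vm-001 3
```

A real system would also need to detect unhealthy machines after provisioning, which this sketch does not attempt.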
8. EGA – FAIR Genomic Datasets
Tony Wildish, on behalf of Nino Spataro and the EGA-CRG team
9. The Science Challenge
The principal objectives of our SD are:
i. Test the feasibility of data reproducibility in genomics
ii. Demonstrate that genomic datasets can be remastered
iii. Render genomic datasets more FAIR
10. The Science Demonstrator
How we did it:
- Implementing portable containerized genomic pipelines
- Using a language that enables scalable and reproducible scientific workflows (Nextflow, available at https://www.nextflow.io/)
- Storing the pipelines in a public repository together with metadata describing each pipeline step and the tools and versions used
11. Successes
Genomic pipeline portability
- Pipelines were successfully implemented and executed on a third-party infrastructure.
Genomic pipeline FAIRification
- Pipelines were deposited together with metadata describing the variables relevant for pipeline description and re-use.
- Pipelines are available at: https://dockstore.org/workflows/github.com/CRG-CNAG/EOSC-Pilot
Feasibility of reproducibility and remastering in genomics
- Overall, 97.38% of the obtained variants were shared and 99.66% of the called genotypes agreed perfectly.
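Figures like the ones above come from comparing two variant call sets. The sketch below shows one way such numbers could be computed; it is not the EGA-CRG team's method. It assumes that "shared" means present in both call sets relative to their union, and that genotype agreement is measured only on shared variants. The variant keys and genotypes are made-up toy data.

```python
Variant = tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def compare_call_sets(original: dict[Variant, str],
                      replicated: dict[Variant, str]) -> tuple[float, float]:
    """Return (% of variants shared, % of shared variants with identical genotype)."""
    shared = original.keys() & replicated.keys()
    union = original.keys() | replicated.keys()
    if not union or not shared:
        raise ValueError("call sets must overlap")
    same_genotype = sum(1 for v in shared if original[v] == replicated[v])
    return 100 * len(shared) / len(union), 100 * same_genotype / len(shared)


# Toy data for illustration only.
original = {
    ("chr1", 100, "A", "G"): "0/1",
    ("chr1", 200, "C", "T"): "1/1",
    ("chr2", 300, "G", "A"): "0/1",
    ("chr2", 400, "T", "C"): "0/1",
}
replicated = {
    ("chr1", 100, "A", "G"): "0/1",
    ("chr1", 200, "C", "T"): "0/1",   # genotype differs
    ("chr2", 300, "G", "A"): "0/1",
    ("chr3", 500, "A", "T"): "1/1",   # variant only in the replication
}

shared_pct, genotype_pct = compare_call_sets(original, replicated)
print(f"{shared_pct:.2f}% shared, {genotype_pct:.2f}% genotype agreement")
# 60.00% shared, 66.67% genotype agreement
```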
12. Issues
Original versions of some software were unavailable
- Solved by using the closest available version
Size of the dataset selected for replication
- Solved by limiting the replication to a subset of the original data
Time-consuming understanding of the original pipelines
- The absence of consolidated standards for storing and describing the original pipelines slowed down the pipeline implementation process
13. Lessons Learned
- Reproducibility is a time-consuming task on both the implementation and the computational side.
- Universal methods for describing pipelines are required, along with long-term repositories, to keep the whole experiment reproducible.
- A FAIR-compliant semantic repository in which to represent objects and their relationships is missing from the EOSC ecosystem.
- Open science is still not perceived as a scientific obligation by scientific stakeholders. Continuous training and education are required to form a new generation of scientists.
15. The Science Challenge
The CryoEM Science Demonstrator aims to improve the reproducibility of image processing work by producing a Scipion workflow file that describes the image processing steps. This allows the same results to be fully reproduced when the data is reprocessed outside the microscope facility. The description can also be uploaded to public databases, so that other users can understand the process followed to obtain a given structure.
16. The Science Demonstrator
• Adapt Scipion (an image processing workflow engine) to report all inputs, outputs, and parameters used in a JSON file, so that the same processing can be reproduced.
• Adapt Scipion to re-run an existing workflow and produce exactly the same results as in the first run.
• Connect Scipion to a public database (Electron Microscopy Data Bank) so that users can submit their results automatically.
• Allow users to visualize the workflows performed by other scientists.
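The first two goals amount to recording each step's inputs, outputs, and parameters in JSON and being able to read them back unchanged. The sketch below illustrates the idea with a made-up schema; the field names, step names, and file names are hypothetical and do not reflect Scipion's actual workflow file format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Step:
    """One processing step in a hypothetical workflow record."""
    name: str
    inputs: list[str]
    outputs: list[str]
    parameters: dict[str, float]


def dump_workflow(steps: list[Step]) -> str:
    """Serialize steps to JSON with stable key order so files are comparable."""
    return json.dumps([asdict(s) for s in steps], indent=2, sort_keys=True)


def load_workflow(text: str) -> list[Step]:
    """Rebuild the step list from its JSON description."""
    return [Step(**record) for record in json.loads(text)]


workflow = [
    Step("motion_correction", ["movies.mrcs"], ["micrographs.mrc"], {"dose_per_frame": 1.2}),
    Step("ctf_estimation", ["micrographs.mrc"], ["ctf.star"], {"defocus_min": 0.5}),
]

text = dump_workflow(workflow)
assert load_workflow(text) == workflow  # the description round-trips losslessly
print(text)
```

Recording the description is only part of reproducibility: getting exactly the same results also depends on the software versions and any random seeds, which this sketch does not capture.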
18. Issues
1. Create a public repository of acquisition metadata and image processing workflows for new acquisitions, serving as a temporary repository until the data is finally analyzed and deposited in the standard public databases (EMDB and EMPIAR).
2. Create an authentication policy so that biologists leaving an EM facility can continue image processing on some of the EOSC cloud machines.
19. Lessons Learned
• There is a big gap between technological advances and their adoption by EU facilities and scientists, much of it due to funding:
  • Local resources for stream processing
  • Existence of temporary repositories
  • Access to high-end computer clusters
• There is a gap between the promotion of open science and the obligation of facilities to keep and disclose publicly funded data.
20. Bioimaging
Beatriz Serrano-Solano
Jean-Karim Hériché
21. The Science Challenge
▸ Biological images contain more information than described in their
original publications.
▸ Re-analyzing the images with machine learning algorithms can extract
new knowledge from these unexploited resources.
22. The Science Demonstrator
23. Successes
24. Issues
25. Lessons Learned
▸ EOSC Ecosystem
  ▸ Technical
    ▸ Lack of high-performance file system
    ▸ Lack of big memory machines (1 TB of RAM)
  ▸ Services
    ▸ User-unfriendly deployment and set-up (e.g. ElastiCluster)
    ▸ Inadequate training
It would have been more efficient to use the local HPC.
26. Photon and Neutron
Michael Schuh, DESY
27. The Science Challenge
Data
● Volume of hundreds of PBs
● Fast data ingest, tens of GB/s per detector
● File creation at kHz rates
Computing
● Fast resources for immediate online
analysis, monitoring running experiments
● Highly specialized offline analysis
frameworks used in physics, chemistry,
materials science, biology, nanotechnology
Policy
● Data Management Plans
● Sharing of FAIR data, methods, results
between users, sites and communities
● Control access during data embargos
● Persistence, long term archival
Images: desy.de/~twhite/crystfel, cid.cfel.de/research/femtosecond_crystallography
28. The Science Demonstrator
Motivation:
Data sets too large to take home
○ Execute codes on cloud
resources close to the data,
avoid downloading large
amounts of data to user systems
Solution:
IaaS and PaaS
○ No stack implementation
by the user
○ Efficient resource management
○ Prepare federation of DESY
OpenStack as EOSC resource
CaaS
○ Libraries for containerized
software, tools and functions
○ Run user defined software stacks
○ Container orchestration
FaaS
○ Containers as cloud functions
Service oriented architecture with cloud computing technologies
29. Successes
Automated data processing
● Data comes in, FaaS
automatically triggered
○ Create derived data
○ Extract metadata
Interactive data analysis
● Share and re-use complete workflows
● Jupyter Notebooks as graphical frontend,
run anywhere from EOSC to small remote
system
● Notebooks and functions published and
continuously integrated via GitLab/Docker
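The automated processing path can be pictured as a small function that runs whenever a new file arrives and returns metadata about it. The sketch below is a plain Python function, not tied to any particular FaaS platform or to the demonstrator's implementation; the event shape (a dict with a `path` key) and the metadata fields are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def handle_new_file(event: dict) -> dict:
    """Extract basic metadata from the file named in a hypothetical arrival event."""
    path = event["path"]
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large detector files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": digest.hexdigest(),
    }


# Simulate a file arriving, then invoke the handler as a trigger would.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "frame_0001.dat")
    with open(sample, "wb") as out:
        out.write(b"\x00" * 4096)
    metadata = handle_new_file({"path": sample})

print(metadata["file_name"], metadata["size_bytes"])  # frame_0001.dat 4096
```

Keeping the handler stateless, with everything it needs passed in the event, is what lets the same function be packaged in a container and run as a cloud function.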
30. Issues
● Fully integrated template solutions (Magnum/Heat, TOSCA) for scaling COE clusters (Docker Swarm, Kubernetes, Mesos) are still cumbersome.
○ EOSC can do a great job in facilitating this with good cluster on demand
service as open science solution
● Cloud Functions (FaaS) have proven to be a good solution for short running
functions, micro-services. Integration with present HPC and HTC systems still
undefined, request routing based on job profile needs research.
○ Submitting into present HPC clusters
○ Virtualizing HPC clusters in the EOSC on demand
● Many software licenses do not account for new container distribution channels or for deployment as cloud functions or as a service.
● An integrated AAI solution is needed, both technically and policy-wise.
● Will EOSC provide cloud application building blocks?
○ Container registries
○ Message hubs
○ GitLab
○ JupyterHub
31. Lessons Learned
● Scaling highly specialized scientific applications means effort,
splitting into micro-services, containerizing, cloud deployments.
○ Strengthen co-development between cloud, infrastructure, platform
DevOps and software developers as well as data analysts.
● User interaction feels different with graphical applications; window forwarding from cloud resources is often low-performing.
○ Clearly define where batch, headless, API ready and GUI applications
are in focus.
● Fully templated virtualized HPC cluster solutions still to emerge,
same for native deployments and for container clusters
○ EOSC to provide collaborative templates as know-how
as well as cluster on demand solutions.
○ EOSC to provide sufficient resources
for large-scale deployments suitable for big data.