Science Demonstrator Session: Life and Materials Sciences

The main focus of Science Demonstrator sessions is to provide feedback to the EOSC community on the first experience of science demonstrators in the practical use of the emerging EOSC ecosystem.

Each panel will include a representative of a Science Demonstrator, who will provide an overview of their experiences in the use of emerging EOSC services.

These sessions will help members of the scientific communities understand the current state of maturity of the EOSC ecosystem and what is obtainable in a field of scientific research. They are also valuable to prospective Service Providers who wish to discover the challenges and opportunities that user communities might face as a result of adopting their services.

This session will focus on life science and materials science.


  1. Science Demonstrator Panel Session 1 on Life Sciences
  2. PanCancer Science Demonstrator - Sergei Yakneen, EMBL
     www.eoscpilot.eu. The European Open Science Cloud for Research pilot project is funded by the European Commission, DG Research & Innovation under contract no. 739563.
  3. The Science Challenge
     - Collect Next Generation Sequencing data from several cohorts of cancer patients, generated at multiple sequencing centres and across multiple cancer types.
     - Reanalyze the data using a uniform and consistent data processing pipeline that follows established best practices from the International Cancer Genome Consortium (ICGC).
     - Analyze the integrated data set to identify patterns of germline and somatic mutation that act across cancer types in a PanCancer fashion.
  4. The Science Demonstrator
     - Utilize Butler, a cloud-based large-scale scientific workflow framework developed in the context of ICGC’s PanCancer Analysis of Whole Genomes (PCAWG) project, to perform a coordinated data analysis across multiple clouds.
       - Code: https://github.com/llevar/butler
       - Paper: https://doi.org/10.1101/185736
     - Perform automated, repeatable deployment and configuration of the entire processing infrastructure at three academic cloud computing environments.
       - EMBL-EBI Embassy Cloud
       - ComputeCanada West Cloud
       - Cyfronet
     - Deliver a large dataset (>50 TB) to each cloud computing centre.
     - Use Butler to run PanCancer pipelines and monitor progress.
  5. Successes

     Resource | EMBL/EBI Embassy                            | Compute Canada                                     | Cyfronet
     vCPU     | 1000                                        | 1000                                               | 700
     RAM      | 4 TB                                        | 4 TB                                               | 2.6 TB
     Disk     | 1 PB                                        | 150 TB                                             | 200 TB
     Data     | 448 samples from 224 prostate cancer donors | 422 samples from 211 pediatric brain tumour donors | 2081 samples from 1000 Genomes Project
     Raw data | 71 TB                                       | 62 TB                                              | 50 TB
     Status   | Alignment and variant calling completed     | Alignment and variant calling completed            | Alignment completed

     - Developed configurations for each cloud: https://github.com/llevar/eosc_pilot
     - Developed extensive documentation and examples: https://butler.readthedocs.io/en/latest/
     - Developed Butler self-healing capabilities.
     - Performed data staging via Cyfronet Onedata.
  6. Issues
     - The biggest issue encountered by the Science Demonstrator (SD) was the initial shortage of resources for operating at “cloud scale”.
       - Used 20% of the data set that was utilized for PCAWG.
       - < 0.5% of the data set for the 100k Genomes Project.
     - Repeatable provisioning of large clusters of VMs.
       - >10% of provisioning jobs experience failures.
     - Data movement and staging.
       - A 50 TB data set takes up to two weeks to move between locations.
       - Genomics data requires encryption and network security measures.
       - Shared access to network-accessible storage creates processing bottlenecks.
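To put the data movement figure in perspective, the sustained throughput implied by moving 50 TB in two weeks can be worked out directly. The sketch below assumes decimal terabytes (10^12 bytes) and a transfer that runs continuously for the full 14 days; both are assumptions, since the slide only says "up to two weeks".

```python
# Back-of-the-envelope throughput for the transfer quoted on the slide.
# Assumptions (not from the slides): 1 TB = 10**12 bytes, and the transfer
# runs continuously for the full 14 days.

def sustained_throughput(total_bytes, seconds):
    """Return (megabytes per second, megabits per second)."""
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6

DATASET_BYTES = 50 * 10**12
TWO_WEEKS_SECONDS = 14 * 24 * 60 * 60

mb_per_s, mbit_per_s = sustained_throughput(DATASET_BYTES, TWO_WEEKS_SECONDS)
print(f"{mb_per_s:.1f} MB/s, {mbit_per_s:.0f} Mbit/s")
```

Under these assumptions the effective rate is roughly 41 MB/s, or about 330 Mbit/s.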
  7. Lessons Learned
     - Effectively supporting life sciences use cases like cancer genomics will require A LOT of resources.
     - Diverse data sets have diverse data handling requirements, so it is better to provide a variety of tools to build solutions with rather than a single “solution”.
     - Automated detection and resolution of infrastructure issues (à la Butler self-healing) are imperative for effective operation at cloud scale.
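The value of automated recovery can be illustrated with a small retry wrapper. This is only a sketch of the general idea, not Butler's actual self-healing mechanism; `provision_vm`, `ProvisioningError`, and the attempt limit are hypothetical names and values chosen for the example.

```python
import time

class ProvisioningError(Exception):
    """Hypothetical error raised when a provisioning job fails."""

def run_with_retries(job, max_attempts=3, base_delay=0.0):
    """Run job() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts.
    Returns (result, attempts_used); re-raises the last error on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except ProvisioningError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical job that fails twice before succeeding.
outcomes = iter([False, False, True])

def provision_vm():
    if not next(outcomes):
        raise ProvisioningError("VM did not come up")
    return "vm-ready"

result, attempts = run_with_retries(provision_vm)
print(result, attempts)
```

If failures were independent with a rate of exactly 10% per attempt (an assumption; the slides only report that more than 10% of provisioning jobs fail), three attempts would all fail about 0.1% of the time.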
  8. EGA – FAIR Genomic Datasets
     Tony Wildish, on behalf of Nino Spataro and the EGA-CRG team
  9. The Science Challenge
     The principal objectives of our SD are:
     i. Test the feasibility of data reproducibility in genomics
     ii. Show that genomic datasets can be remastered
     iii. Render genomic datasets more FAIR
  10. The Science Demonstrator
      How we did it:
      - Implementing portable, containerized genomic pipelines
      - Using a language that enables scalable and reproducible scientific workflows (Nextflow, available at https://www.nextflow.io/)
      - Storing the pipelines in a public repository together with metadata describing each pipeline step and the tools and versions used
  11. Successes
      - Genomic pipeline portability
        Pipelines were successfully implemented and executed in a third-party infrastructure.
      - Genomic pipeline FAIRification
        Pipelines were deposited jointly with metadata describing the variables relevant for pipeline description and re-use. Pipelines are available at: https://dockstore.org/workflows/github.com/CRG-CNAG/EOSC-Pilot
      - Feasibility of reproducibility and remastering in genomics
        Overall, 97.38% of the obtained variants were shared and 99.66% of the called genotypes agreed perfectly.
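The slide does not say how the two percentages were computed. One common way to compare an original and a re-run call set is sketched below, assuming each call set maps a variant site (chromosome, position, ref, alt) to a genotype string. The toy data, the function name and the exact definitions (in particular, measuring shared sites against the union of both sets) are illustrative assumptions and may differ from those used by the EGA-CRG team.

```python
def compare_call_sets(original, rerun):
    """Compare two call sets given as {site: genotype} dictionaries.

    Returns (shared_fraction, genotype_agreement):
    - shared_fraction: sites present in both sets / sites present in either
    - genotype_agreement: shared sites with identical genotype / shared sites
    """
    shared = original.keys() & rerun.keys()
    union = original.keys() | rerun.keys()
    agreeing = sum(1 for site in shared if original[site] == rerun[site])
    shared_fraction = len(shared) / len(union) if union else 0.0
    genotype_agreement = agreeing / len(shared) if shared else 0.0
    return shared_fraction, genotype_agreement

# Toy data, not taken from the demonstrator.
original = {
    ("chr1", 1000, "A", "G"): "0/1",
    ("chr1", 2000, "C", "T"): "1/1",
    ("chr2", 500, "G", "A"): "0/1",
    ("chr2", 900, "T", "C"): "0/1",
}
rerun = {
    ("chr1", 1000, "A", "G"): "0/1",
    ("chr1", 2000, "C", "T"): "0/1",  # same site, different genotype
    ("chr2", 500, "G", "A"): "0/1",
    ("chr3", 42, "A", "T"): "1/1",    # site only in the re-run
}

shared_fraction, genotype_agreement = compare_call_sets(original, rerun)
print(f"shared: {shared_fraction:.2%}, genotypes agreeing: {genotype_agreement:.2%}")
```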
  12. Issues
      - Original versions of some software were unavailable
        Solved by using the closest available version.
      - Size of the dataset selected for replication
        Solved by limiting the replication to a subset of the original data.
      - Time-consuming understanding of the original pipelines
        The absence of consolidated standards for storing and describing the original pipelines slowed down the pipeline implementation process.
  13. Lessons Learned
      - Reproducibility is a time-consuming task on both the implementation and the computational side.
      - Universal methods to describe pipelines are required, along with long-term repositories to keep the whole experiment reproducible.
      - A FAIR-compliant semantic repository in which to represent objects and their relationships is missing in the EOSC ecosystem.
      - Open science is still not perceived as a scientific obligation by scientific stakeholders. Continuous training and education are required to form a new generation of scientists.
  14. CryoEM
      Carlos Oscar Sorzano (CSIC)
  15. The Science Challenge
      The CryoEM demonstrator aims to improve the reproducibility of image processing work by producing a Scipion workflow file that describes the image processing steps. This allows full reproduction of the same results when the data is reprocessed outside the microscope facility. The description can also be uploaded to public databases, so that other users can understand the process followed to achieve a given structure.
  16. The Science Demonstrator
      • Adapt Scipion (an image processing workflow engine) to thoroughly report all inputs, outputs, and parameters used in a JSON file, so that the same processing can be reproduced.
      • Adapt Scipion to reproduce an existing workflow, producing exactly the same results as in the first run.
      • Connect Scipion to a public database (Electron Microscopy Data Bank) so that users can automatically submit their results.
      • Allow users to visualize the workflows performed by other scientists.
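The kind of record described in the first two points can be pictured as a JSON document listing each step with its inputs, outputs and parameters. The structure below is an invented illustration, not Scipion's actual export format; all step names, field names and values are hypothetical.

```python
import json

# Hypothetical workflow description: each step records what it consumed,
# what it produced, and the parameters it ran with.
workflow = [
    {
        "step": "import_movies",
        "inputs": [],
        "outputs": ["movies"],
        "parameters": {"sampling_rate": 1.0},
    },
    {
        "step": "align_movies",
        "inputs": ["movies"],
        "outputs": ["micrographs"],
        "parameters": {"bin_factor": 2},
    },
]

def check_inputs_are_produced(steps):
    """Verify that every input of a step is an output of an earlier step."""
    available = set()
    for step in steps:
        missing = [name for name in step["inputs"] if name not in available]
        if missing:
            raise ValueError(f"{step['step']} is missing inputs: {missing}")
        available.update(step["outputs"])
    return True

serialized = json.dumps(workflow, indent=2, sort_keys=True)
restored = json.loads(serialized)

# A faithful round trip is a precondition for re-running the same steps.
print(restored == workflow, check_inputs_are_produced(restored))
```

Recording parameters alongside inputs and outputs is what lets a second site replay the steps in the same order with the same settings.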
  17. Successes
  18. Issues
      1. Create a public repository of acquisition metadata and image processing workflows for new acquisitions, as a temporary repository until the data is finally analyzed and deposited in the standard public databases (EMDB and EMPIAR).
      2. Create an authentication policy such that biologists coming out of an EM facility can continue the image processing on some of the EOSC cloud machines.
  19. Lessons Learned
      • There is a big gap between technological advances and their adoption by EU facilities and scientists. Much of it is due to funding:
        • Local resources for stream processing
        • Existence of temporary repositories
        • Access to high-end computer clusters
      • There is a gap between open science promotion and the obligation of facilities to keep and disclose publicly funded data.
  20. Bioimaging
      Beatriz Serrano-Solano, Jean-Karim Hériché
  21. The Science Challenge
      ▸ Biological images contain more information than described in their original publications.
      ▸ Re-analyzing the images with machine learning algorithms can extract new knowledge from these unexploited resources.
  22. The Science Demonstrator
  23. Successes
  24. Issues
  25. Lessons Learned
      ▸ EOSC Ecosystem
        ▸ Technical
          ▸ Lack of high-performance file system
          ▸ Lack of big memory machines (1 TB of RAM)
        ▸ Services
          ▸ User-unfriendly deployment and set-up (e.g. ElastiCluster)
          ▸ Inadequate training
      It would have been more efficient to use the local HPC.
  26. Photon and Neutron
      Michael Schuh, DESY
  27. The Science Challenge
      Data
      ● Volume of hundreds of PBs
      ● Fast data ingest, tens of GB/s per detector
      ● File creation at kHz rates
      Computing
      ● Fast resources for immediate online analysis, monitoring running experiments
      ● Highly specialized offline analysis frameworks used in physics, chemistry, materials science, biology, nanotechnology
      Policy
      ● Data Management Plans
      ● Sharing of FAIR data, methods, results between users, sites and communities
      ● Control access during data embargos
      ● Persistence, long term archival
      Images: desy.de/~twhite/crystfel, cid.cfel.de/research/femtosecond_crystallography
  28. The Science Demonstrator
      Motivation: Data sets too large to take home
        ○ Execute codes on cloud resources close to the data, avoid downloading large amounts of data to user systems
      Solution:
      IaaS and PaaS
        ○ No stack implementation by the user
        ○ Efficient resource management
        ○ Prepare federation of DESY OpenStack as EOSC resource
      CaaS
        ○ Libraries for containerized software, tools and functions
        ○ Run user defined software stacks
        ○ Container orchestration
      FaaS
        ○ Containers as cloud functions
      Service oriented architecture with cloud computing technologies
  29. Successes
      Automated data processing
      ● Data comes in, FaaS automatically triggered
        ○ Create derived data
        ○ Extract metadata
      Interactive data analysis
      ● Share and re-use complete workflows
      ● Jupyter Notebooks as graphical frontend, run anywhere from EOSC to small remote system
      ● Notebooks and functions published and continuously integrated via GitLab/Docker
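The automated path above (data arrives, a function is triggered, metadata is extracted) can be sketched as a plain Python handler. The event shape, the `handle_new_file` name and the chosen metadata fields are assumptions for illustration; they do not correspond to DESY's deployment or to any particular FaaS framework.

```python
import hashlib
import json
import os
import tempfile

def handle_new_file(event):
    """Hypothetical cloud-function handler for a 'file created' event.

    Expects event = {"path": "<file path>"} and returns a metadata record.
    """
    path = event["path"]
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large detector files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "name": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": digest.hexdigest(),
    }

# Simulate an incoming file and the event a storage system might emit.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "frame_0001.dat")
    with open(sample_path, "wb") as sample:
        sample.write(b"\x00" * 1024)
    metadata = handle_new_file({"path": sample_path})

print(json.dumps(metadata, indent=2))
```

Keeping the handler a pure function of the event makes it easy to run the same code in a notebook, in a container, or behind a trigger.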
  30. Issues
      ● Fully integrated template solutions (Magnum/Heat, TOSCA) for scaling COE clusters (Docker Swarm, Kubernetes, Mesos) are still cumbersome.
        ○ EOSC can do a great job in facilitating this with a good cluster-on-demand service as an open science solution
      ● Cloud Functions (FaaS) have proven to be a good solution for short-running functions and micro-services. Integration with present HPC and HTC systems is still undefined; request routing based on job profile needs research.
        ○ Submitting into present HPC clusters
        ○ Virtualizing HPC clusters in the EOSC on demand
      ● Many licenses do not account for new container distribution channels or for deployment as cloud functions, as a service.
      ● An integrated AAI solution is needed, both technically and policy-wise
      ● Will EOSC provide cloud application building blocks?
        ○ Container registries
        ○ Message hubs
        ○ GitLab
        ○ JupyterHub
  31. Lessons Learned
      ● Scaling highly specialized scientific applications takes effort: splitting into micro-services, containerizing, cloud deployments.
        ○ Strengthen co-development between cloud, infrastructure, platform DevOps and software developers as well as data analysts.
      ● User interaction feels different with graphical applications; window forwarding from cloud resources is often low-performing.
        ○ Clearly define where batch, headless, API-ready and GUI applications are in focus.
      ● Fully templated virtualized HPC cluster solutions are still to emerge; the same holds for native deployments and for container clusters.
        ○ EOSC to provide collaborative templates as know-how as well as cluster-on-demand solutions.
        ○ EOSC to provide sufficient resources for large-scale deployments suitable for big data.
