Description and scope of the Project
PHIDIAS HPC aims to develop a consolidated, shared HPC and data service by building on pre-existing and emerging infrastructure, creating a federation of "user to infrastructure" services.
To achieve this purpose and gain a comprehensive picture of the European infrastructure landscape, three data test areas will develop and provide new services to discover, manage and process spatial and environmental data produced by research communities tackling scientific challenges in atmospheric, marine and Earth observation research.
Webinar: How to improve the cloud services for marine data
Observing the ocean is challenging: missions at sea are costly, processes at different scales interact, and conditions are constantly changing, which is why scientists say that "a measurement not made today is lost forever". For these reasons, it is fundamental to properly store both the data and metadata, so that access can be guaranteed for the widest community, in line with the FAIR principles: Findable, Accessible, Interoperable and Reusable.
PHIDIAS HPC has organised a webinar entitled "PHIDIAS: Boosting the use of cloud services for marine management, services and processing" to be held on 4th June 2020 at 11 AM CEST. The webinar aims to introduce the Phidias HPC initiative, in collaboration with the Blue-Cloud project, to the European HPC and Research community, specifically in the Blue economy, to improve the use of (1) cloud services for marine data management, (2) data services to the user in a FAIR perspective, and (3) data processing on demand.
These objectives will be pursued in coherence with the development of the European Open Science Cloud (EOSC) and the Copernicus Data and Information Access Services (DIAS).
Phidias: Steps forward in detection and identification of anomalous atmospheric events – Phidias
PHIDIAS organised a webinar entitled "Steps forward in detection and identification of anomalous atmospheric events", held on 13 October 2020 at 15:00 CEST in collaboration with the ESCAPE project. The webinar showcased how PHIDIAS is going to improve the use of HPC and high-performance data management services to develop intelligent screening approaches for exploiting large amounts of satellite atmospheric data in an operational context.
Experience in managing service portfolio by Pasquale Pagano – BlueBRIDGE
Pasquale Pagano discusses managing a service portfolio, including its current and future challenges.
This work is licensed under the Creative Commons CC-BY 4.0 licence.
Cloud for Research and Innovation – UK/USA HPC workshop, Oxford, July 2015 – Martin Hamilton
How can public cloud and technologies like Docker and OpenStack help to deliver next generation scientific computing infrastructure? My talk for the UK/USA HPC workshop in July 2015, organized by HPC-SIG (UK) and CASC (USA).
Enabling efficient movement of data into & out of a high-performance analysis... – Jisc
From Jisc's campus network engineering for data-intensive science workshop on 19 October 2016.
https://www.jisc.ac.uk/events/campus-network-engineering-for-data-intensive-science-workshop-19-oct-2016
Science Demonstrator Session: Physics and Astrophysics – EOSCpilot.eu
The main focus of Science Demonstrator sessions is to provide feedback to the EOSC community on the first experience of science demonstrators in the practical use of the emerging EOSC ecosystem.
Each panel will consist of a representative of a Science Demonstrator that will provide an overview of their experiences in the use of emerging EOSC services.
These sessions will help members of the scientific communities understand the current state of maturity of the EOSC ecosystem and what can be obtained in a given field of scientific research. They are also valuable to prospective service providers who wish to discover the challenges and opportunities that user communities may face as a result of adopting their services.
This session will focus on Physics and Astrophysics.
Gergely Sipos (EGI): Exploiting scientific data in the international context ... – Gergely Sipos
Keynote presentation given at "The Emerging Technology Forum – Data Creates Universe - Scientific Data Innovation Conference" of the "Pujiang Innovation Forum 2021" event.
Wielkopolska activities with potential for cluster-to-cluster collaboration EU... – Raul Palma
We introduce the experiences and lessons learned in developing a smart agriculture infrastructure in the Wielkopolska region, and comment on potential gaps and opportunities for clustering collaborations.
PaNOSC Overview – ExPaNDS kick-off meeting – September 2019 – PaNOSC
This presentation gives an overview of the H2020 INFRAEOSC PaNOSC project, showcasing its activities and expected results, as well as its vision: to create a PaN scientific commons.
On 29 January 2020 ARCHIVER launched its Request for Tender with the purpose to award several Framework Agreements and work orders for the provision of R&D for hybrid end-to-end archival and preservation services that meet the innovation challenges of European Research communities, in the context of the European Open Science Cloud.
The tender was closed on 28 April 2020 and 15 R&D bids were submitted, with consortia that included 43 companies and organisations. The best bids have been selected and will start the first phase of the ARCHIVER R&D (Solution Design) in June 2020.
On Monday 8 June the selected consortia for the ARCHIVER design phase were announced during a Public Award Ceremony starting at 14:00 CEST.
In light of the COVID-19 outbreak and the consequent movement restrictions imposed in several countries, the event was organised as a webinar, virtually hosted by Port d’Informació Científica (PIC), a member of the ARCHIVER consortium's Buyers Group.
The Kick-off marks the beginning of the Solution Design Phase.
Big Data Europe at eHealth Week 2017: Linking Big Data in Health – BigData_Europe
Of the four V's of big data – Volume, Velocity, Variety and Veracity – the most challenging for the health sector is Variety. Health data comes from many sources, formats and standards – how can we bring these together to reap the benefits of big data technologies?
Big Data Europe is tackling this challenge head-on, building a big data infrastructure flexible enough to tackle all seven Societal Challenges identified by Horizon 2020. Here we demonstrate our pilot implementation of Open PHACTS, which integrates life science data for drug discovery.
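The Variety problem described above comes down to harmonising records that describe the same thing under different schemas. A minimal illustrative sketch (toy data, invented field names, not the Open PHACTS pipeline) might look like this:

```python
# Illustrative only: harmonising two toy health-data sources with different
# schemas into one record set. All field names here are invented.

source_a = [{"patient_id": "p1", "hb_g_dl": 13.5}]     # e.g. a lab system
source_b = [{"id": "p1", "haemoglobin": "13.5 g/dL"}]  # e.g. a registry export

def normalise_a(rec):
    """Map lab-system fields onto a common schema."""
    return {"patient": rec["patient_id"], "haemoglobin_g_dl": rec["hb_g_dl"]}

def normalise_b(rec):
    """Map registry fields onto the same schema, stripping the unit string."""
    value = float(rec["haemoglobin"].split()[0])
    return {"patient": rec["id"], "haemoglobin_g_dl": value}

merged = [normalise_a(r) for r in source_a] + [normalise_b(r) for r in source_b]
print(merged)  # both records now share one schema and comparable units
```

Real integrations add ontology mappings and provenance on top of this kind of schema alignment, but the core step is the same: agree on one target schema and normalise every source into it.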
12 May 2017
Towards an e-infrastructure in agriculture? – BlueBRIDGE
Donatella Castelli, CNR-ISTI and BlueBRIDGE coordinator, gave an introductory talk in the "Towards an e-infrastructure in agriculture?" session at the Euragri workshop at INRA, Paris. She discussed leading an e-infrastructure project in marine research, where an e-infrastructure refers to a combination of digital technologies (hardware and software), resources (data, services, digital libraries), communications (protocols, access rights and networks), and the people and organisational structures needed to manage them.
Science Demonstrator Session: Social and Earth Sciences – EOSCpilot.eu
The main focus of Science Demonstrator sessions is to provide feedback to the EOSC community on the first experience of science demonstrators in the practical use of the emerging EOSC ecosystem.
Each panel will consist of a representative of a Science Demonstrator that will provide an overview of their experiences in the use of emerging EOSC services.
These sessions will help members of the scientific communities understand the current state of maturity of the EOSC ecosystem and what can be obtained in a given field of scientific research. They are also valuable to prospective service providers who wish to discover the challenges and opportunities that user communities may face as a result of adopting their services.
This session will focus on Social and Earth Sciences.
DANS Data Trail: Data Management Tools for Archaeologists – ariadnenetwork
With the arrival of ARIADNEplus there is a searchable catalogue of datasets that helps archaeological researchers navigate the “maze” of data and archives. A set of tools has now been developed, aimed especially at archaeological researchers, support staff and data managers, to help in making a data management plan. Hella Hollander, Peter Doorn and Paola Ronzino introduced the tools to the participants during the workshop.
The ARIADNEplus online toolset for data management consists of three parts:
a protocol for archaeological data management,
a template for researchers to create a data management plan with archaeological data,
a manual containing all guidelines, recommendations and practical examples of data management.
In just six steps, the protocol takes you through the entire process of making a Data Management Plan (DMP) for archaeological research. By using the templates and the accompanying manual, with its clear set of guidelines and advice, it becomes much easier to meet the requirements of organisations that fund research. The DMP is then also in line with standards in the archaeological domain, which ultimately makes the data more findable, accessible, interoperable and reusable (FAIR).
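The step-by-step structure above lends itself to a simple checklist model. As a minimal sketch (the section names are hypothetical placeholders, not the actual ARIADNEplus protocol steps), a DMP record and a completeness check could look like this:

```python
# Illustrative sketch of a DMP as a checklist of sections, one per protocol
# step. The section names below are hypothetical, not the official template.

DMP_SECTIONS = [
    "data_collection",
    "documentation_and_metadata",
    "storage_and_backup",
    "legal_and_ethical_issues",
    "data_sharing",
    "preservation",
]

def new_dmp(project_title: str) -> dict:
    """Create an empty DMP with one entry per protocol step."""
    return {"project": project_title,
            "sections": {name: "" for name in DMP_SECTIONS}}

def missing_sections(dmp: dict) -> list:
    """List the sections a researcher still needs to fill in."""
    return [name for name, text in dmp["sections"].items() if not text.strip()]

plan = new_dmp("Bronze Age settlement survey")
plan["sections"]["data_collection"] = "GIS survey data, ceramics database"
print(missing_sections(plan))  # the five sections still to be completed
```

Tools like the ones described in the workshop essentially automate this bookkeeping, while adding domain guidance for each step.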
Building earth observation applications with NextGEOSS – webinar – terradue
Training taster for the NextGEOSS Workshop to be held in Geneva on September 11th, 2018.
A review of the NextGEOSS components and services available to partners for the integration of their applications on the NextGEOSS Platform.
Announcement: https://nextgeoss.eu/second-nextgeoss-training/
Publishing your research: Research Data Management (Introduction) – Jamie Bisset
Publishing your research: Research Data Management (Introduction) (November 2013) slides. Delivered as part of the Durham University Researcher Development Programme. Further Training available at https://www.dur.ac.uk/library/research/training/
Introduction: The Big Data Europe Project at the: CMG-AE Event: Big Data: Strategien, Technologien und Nutzen
19th of May 2015, Expat Center der Wirtschaftsagentur, Vienna, Austria
See: http://www.big-data-europe.eu
A Linked Data Dataset for Madrid Transport Authority's Datasets – Oscar Corcho
Presentation given at the CIT2014 conference in Santander, describing the initial work towards providing a Linked Data dataset for the Consorcio Regional de Transportes de Madrid.
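To make the Linked Data idea concrete, here is a minimal stdlib-only sketch: a few RDF triples for a transport stop, serialised as N-Triples. The namespace and identifiers are placeholders, not the project's actual URIs.

```python
# Toy Linked Data example: (subject, predicate, object) triples for a
# hypothetical transport stop, serialised as N-Triples. All URIs are
# placeholders, not the real CRTM dataset identifiers.

EX = "http://example.org/transport/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

triples = [
    (EX + "stop/sol", RDF_TYPE, EX + "BusStop"),
    (EX + "stop/sol", RDFS_LABEL, '"Puerta del Sol"@es'),
    (EX + "stop/sol", EX + "servedByLine", EX + "line/3"),
]

def to_ntriples(triples):
    """Serialise (s, p, o) tuples as N-Triples; quoted literals pass through."""
    lines = []
    for s, p, o in triples:
        obj = o if o.startswith('"') else f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

print(to_ntriples(triples))
```

Publishing data this way, with stable URIs for stops and lines, is what lets third parties link their own datasets to the transport authority's resources.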
Coupling HPC and Data Resources and services together – EUDAT Workshop at exd... – EUDAT
Giuseppe Fiameni (CINECA)
The goal of this EUDAT workshop is to present the EUDAT services and the results of the collaboration activity achieved so far, and to deliver a hands-on session on how to write a Data Management Plan (DMP). The DMP is a useful instrument for researchers to reflect on and communicate how they will deal with their data, as it prompts them to think about how they will generate, analyse and share data during their research project and afterwards.
PHIDIAS HPC – Building a prototype for Earth Science Data and HPC Services – Phidias
High-Performance Computing (HPC) technology is becoming increasingly important as a key driver of European economic growth and scientific research. It is a comprehensive tool that can support the development of a wide array of scientific domains (such as big data, Earth observation and ocean study) and address societal challenges as well.
The Webinar aims at introducing the Phidias HPC initiative to the European HPC and Research community, including main features, expected impact and advantages for Research & HPC ecosphere. The project is paving the way to increase the HPC and Data capacities of the European Data Infrastructure by pursuing the following objectives:
- Building a prototype for earth scientific data
- Enabling Open Access to HPC Services
- Strengthening FAIRisation
- Creating a framework combining computing, dissemination and archiving resources.
Bridging the gap to facilitate selection and image analysis activities for la... – Phidias
PHIDIAS organised its third and final webinar of the series, dedicated to Use Case 2: Big Data Earth Observations (EO), on 18 February 2021 at 15:00 CET, showcasing how PHIDIAS is taking advantage of HPC architecture to facilitate selection and image analysis activities for land surface monitoring.
The EGI Federation of clusters and research clouds are components of the European Open Science Cloud, and they offer technical solutions and an infrastructure to support the EuroGEOSS pilots, GEOSS and EO data exploitation platforms.
Learn how, by looking at the collaboration of EGI with NextGEOSS, the production support of the Geohazards TEP of Terradue and the EOSC-hub collaboration with GEOSS.
Data management plans – EUDAT best practices and case study | www.eudat.eu – EUDAT
Presentation given by Stéphane Coutin during the PRACE 2017 Spring School, a joint training event with the EU H2020 VI-SEEM project (https://vi-seem.eu/) organised by CaSToRC at The Cyprus Institute. Science, and more specifically projects using HPC, is facing a digital data explosion. Instruments and simulations are producing ever larger volumes; data can be shared, mined, cited, preserved… Data are a great asset, but they face risks: storage can run out, data can be lost or misused. To start this session, we review why it is important to manage research data and how to do so by maintaining a Data Management Plan, based on best practices from the EUDAT H2020 project and European Commission recommendations. During the second part we interactively draft a DMP for a given use case.
Linking EUDAT services to the EGI Fed-Cloud – EUDAT Summer School (Hans van P... – EUDAT
The main goal of the EGI-EUDAT collaboration is to harmonise the two e-infrastructures, covering technical interoperability, authentication, authorisation and identity management, policy, and operations. The main objective of this work is to provide end-users with seamless access to an integrated infrastructure offering both EGI and EUDAT services, pairing data and high-throughput computing resources together. Selected user communities are able to bring requirements and help assign the right priorities to each of them; in this way, the integration activity has been driven by the end users from the start. The use case permits a user of either e-infrastructure to instantiate a VM on the EGI Cloud Federation to execute a computational job consuming data preserved on EUDAT resources. The results of such an analysis can be staged back to EUDAT storage and, if needed, assigned Persistent Identifiers (PIDs) for future use. To implement all the steps of this use case, the following integration activities between the two infrastructures had to be fulfilled: (1) harmonisation of the authentication and authorisation models, and (2) definition and implementation of the interfaces between the involved EGI and EUDAT services.
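The use case above is a pipeline of discrete steps. As a purely schematic sketch, in which every function is a hypothetical placeholder for the real service calls (authentication, staging, VM execution, PID minting), the flow could be expressed as:

```python
# Schematic walk-through of the EGI-EUDAT use case. Every function below is a
# hypothetical placeholder, NOT a real EGI or EUDAT API call.

def authenticate(user):
    """Harmonised AAI step shared by both e-infrastructures (placeholder)."""
    return {"user": user, "token": "demo-token"}

def stage_in(session, dataset):
    """Pull input data preserved on EUDAT resources to local scratch."""
    return f"/scratch/{dataset}"

def run_on_egi_vm(session, data_path):
    """Execute the computational job on an EGI Cloud Federation VM."""
    return data_path + ".results"

def stage_out(session, results):
    """Push results back to EUDAT storage; returns an archive location."""
    return {"location": "eudat://archive" + results}

def allocate_pid(record):
    """Mint a persistent identifier for future reuse (placeholder handle)."""
    record["pid"] = "hdl:example/0001"
    return record

session = authenticate("alice")
staged = stage_in(session, "ocean-obs")
record = allocate_pid(stage_out(session, run_on_egi_vm(session, staged)))
print(record)  # archived results plus their PID
```

The two integration activities named in the text map directly onto this sketch: (1) makes the single `authenticate` step possible across both infrastructures, and (2) defines the interfaces behind `stage_in`/`stage_out`.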
Visit: https://www.eudat.eu/eudat-summer-school
Cloud Computing Needs for Earth Observation Data Analysis: EGI and EOSC-hub – Björn Backeberg
This presentation was given during the Japan Geosciences Union 2019. Session details can be found at http://www.jpgu.org/meeting_e2019/SessionList_en/detail/M-GI31.htm
Similar to PHIDIAS – Boosting the use of cloud services for marine data management, services and processing
Toxic effects of heavy metals: Lead and Arsenic – sanjana502982
Heavy metals are naturally occurring metallic chemical elements that have relatively high density and are toxic even at low concentrations. All toxic metals are termed heavy metals irrespective of their atomic mass and density, e.g. arsenic, lead, mercury, cadmium, thallium and chromium.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... – Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations; further, they do not adapt to individual characteristics. In this talk, I will give an account of deep behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects of interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their capacity to enable complex behavior composed of discrete entities. I will also discuss how deep behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization.
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... – Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high-resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ... – Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4–0.9 µm) and novel JWST images with 14 filters spanning 0.8–5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at >2.3 µm to construct an ultradeep image, reaching as deep as ≈31.4 AB mag in the stack and 30.3–31.0 AB mag (5σ, r = 0.1″ circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5–15. These objects show compact half-light radii of R1/2 ∼ 50–200 pc, stellar masses of M⋆ ∼ 10⁷–10⁸ M⊙, and star-formation rates of SFR ∼ 0.1–1 M⊙ yr⁻¹. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to infer the properties of the evolving luminosity function, without binning in redshift or luminosity, that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for the evolution of the dark matter halo mass function.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
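One of the sampling strategies named above, uniform random sampling of a configuration space, can be sketched in a few lines. This toy version assumes an unconstrained space of boolean options (real feature models add constraints that rule out some combinations); the option names are invented:

```python
# Toy sketch of uniform random sampling over a boolean configuration space.
# Unconstrained here, so all 2^|OPTIONS| combinations are valid; the option
# names are invented for the example.
import random

OPTIONS = ["optimise", "debug_symbols", "static_link", "lto"]

def sample_configurations(n, seed=0):
    """Draw n configurations uniformly at random; seeded for reproducibility."""
    rng = random.Random(seed)
    return [{opt: rng.random() < 0.5 for opt in OPTIONS} for _ in range(n)]

for config in sample_configurations(3):
    print(config)
```

Note the explicit seed: in the frictionless-reproducibility framing of the talk, even the exploration of variability spaces should itself be reproducible.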
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Invited talk at the Journées Nationales du GDR GPL 2024.
The thematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. It helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results can provide valuable insights into an individual's cognitive abilities, creativity and critical thinking skills.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, particulates.
ESR spectroscopy in liquid food and beverages.pptxPRIYANKA PATEL
With increasing population, people need to rely on packaged food stuffs. Packaging of food materials requires the preservation of food. There are various methods for the treatment of food to preserve them and irradiation treatment of food is one of them. It is the most common and the most harmless method for the food preservation as it does not alter the necessary micronutrients of food materials. Although irradiated food doesn’t cause any harm to the human health but still the quality assessment of food is required to provide consumers with necessary information about the food. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during the processing of the food. ESR spin trapping technique is useful for the detection of highly unstable radicals in the food. The antioxidant capability of liquid food and beverages in mainly performed by spin trapping technique.
Phenomics assisted breeding in crop improvementIshaGoswami9
As the population is increasing and will reach about 9 billion upto 2050. Also due to climate change, it is difficult to meet the food requirement of such a large population. Facing the challenges presented by resource shortages, climate
change, and increasing global population, crop yield and quality need to be improved in a sustainable way over the coming decades. Genetic improvement by breeding is the best way to increase crop productivity. With the rapid progression of functional
genomics, an increasing number of crop genomes have been sequenced and dozens of genes influencing key agronomic traits have been identified. However, current genome sequence information has not been adequately exploited for understanding
the complex characteristics of multiple gene, owing to a lack of crop phenotypic data. Efficient, automatic, and accurate technologies and platforms that can capture phenotypic data that can
be linked to genomics information for crop improvement at all growth stages have become as important as genotyping. Thus,
high-throughput phenotyping has become the major bottleneck restricting crop breeding. Plant phenomics has been defined as the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes
during crop growing stages at the organism level, including the cell, tissue, organ, individual plant, plot, and field levels. With the rapid development of novel sensors, imaging technology,
and analysis methods, numerous infrastructure platforms have been developed for phenotyping.
PHIDIAS - Boosting the use of cloud services for marine data management, services and processing
1. The PHIDIAS project has received funding from the European Union's Connecting Europe Facility under grant agreement n° INEA/CEF/ICT/A2018/1810854.
PHIDIAS: Boosting the use of cloud services for marine data management, services and processing
Webinar | June 4, 2020, 11:00 AM CEST
2. PHIDIAS Ocean Use Case
04.06.2020 PHIDIAS Webinar | https://www.phidias-hpc.eu/ | @PhidiasHpc
3. Webinar Agenda
11:00 - 11:05 - Introduction of the PHIDIAS project - Francesco Osimanti, Trust-IT Services, PHIDIAS WP7 Leader
11:05 - 11:15 - PHIDIAS Ocean use case and contribution of HPC to marine studies - Cécile Nys, IFREMER
11:15 - 11:25 - Exploring advanced cloud services for marine and oceanographic data access and data management - Gilbert Maudire, IFREMER
11:25 - 11:30 - Q&A Session
11:30 - 11:40 - Passport photos for plankton: new era for marine biology research - Jukka Seppälä, SYKE
11:40 - 11:50 - Analyzing ocean observations in an HPC infrastructure with DIVAnd - Alexander Barth, University of Liège
11:50 - 12:00 - Blue-Cloud Platform: marine-thematic EOSC services for Marine Research and the Blue Economy - Pasquale Pagano, CNR-ISTI & Blue-Cloud Project
12:00 - 12:05 - Q&A Session
12:05 - 12:10 - Closing remarks
5.
PHIDIAS Ocean use case and contribution of HPC to marine studies
Cécile Nys, IFREMER
Assistant Manager Ocean Data Cluster – ODATIS
PHIDIAS WP6 member
Webinar | June 4, 2020
6. WP6 “Use-case 3 – Ocean” overview
- Combine, collocate and process data from several data sources (in situ & satellite)
- Enhance data archiving (most observations cannot be reproduced) to facilitate data reuse
- Facilitate and speed up co-localisation and processing of data from different sources
7. WP6 “Use-case 3 – Ocean” overview
- Combine and collocate data from several data sources (in situ & satellite)
- Adopt new data structures (based on big-data technologies):
  - DataCubes
  - NoSQL databases (numerical data): Cassandra, MongoDB, etc.
  - Semantic Web (text data)
- Provide on-demand data browsing and processing facilities
8. Case studies
Surface Salinity in North Atlantic:
- CTD (SeaDataNet)
- Argo floats (CMEMS)
- SMOS satellite
Chlorophyll in North-East Atlantic and Baltic Sea:
- CTD and bottles (SeaDataNet)
- BGC Argo floats (ARGO GDAC)
- Ferrybox
- Sentinel-2 images (DIAS WEkEO)
9. Data flow
[Diagram: Data Infrastructure Harmonisation → Collections → Data lake → Processing]
10.
Peter THIJSSE (presented by Gilbert MAUDIRE)
Exploring advanced cloud services for marine and
oceanographic data access and data management
11.
Jukka SEPPÄLÄ
Passport photos for plankton: new era
for marine biology research
12.
Alexander BARTH
Analyzing ocean observations in an HPC
infrastructure with DIVAnd
14.
Cloud services for marine and oceanographic data access and data management
Gilbert Maudire (Ifremer) / Peter Thijsse (MARIS)
June 4, 2020, 11:25 AM CEST
15. Outline
Introduction
Data resources in scope
Discovery service
Prototype Data Lake for processing
16. Main objective recap
To improve the use of cloud services for marine data management, data services to users in a FAIR perspective, and data processing on demand, taking into account the European Open Science Cloud (EOSC) challenge and the Copernicus Data and Information Access Services (DIAS).
17. Marine data resources in scope
- SeaDataNet in-situ
- Euro-Argo in-situ
- CMEMS in-situ
- SMOS and Sentinel-3 remote sensing
18. Discovery service
- Build metadata indexes of the available datasets
- Metadata checks during import (completeness, readability, correct vocabularies)
- Include the DOIs/PIDs of the original datasets
- New DOIs will be assigned to newly processed datasets (SEANOE)
- Use Elasticsearch to support fast responses to searches
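The indexing-and-checking step above can be sketched with a small in-memory index. This is a stdlib stand-in for the Elasticsearch service, not the project's actual code; the record fields and the example identifiers are illustrative.

```python
# Minimal in-memory metadata index: checks required fields on import and
# builds a term index over titles. A toy stand-in for Elasticsearch.
from collections import defaultdict

REQUIRED_FIELDS = {"title", "doi", "vocabulary"}  # illustrative field set

class MetadataIndex:
    def __init__(self):
        self.records = {}
        self.terms = defaultdict(set)  # term -> set of record ids

    def add(self, record_id, record):
        # Metadata check during import: completeness of required fields.
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"incomplete metadata, missing: {sorted(missing)}")
        self.records[record_id] = record
        for word in record["title"].lower().split():
            self.terms[word].add(record_id)

    def search(self, word):
        return sorted(self.terms.get(word.lower(), set()))

idx = MetadataIndex()
idx.add("cdi-1", {"title": "Surface Salinity North Atlantic",
                  "doi": "doi:example/0001", "vocabulary": "P02"})
print(idx.search("salinity"))  # -> ['cdi-1']
```

A record missing a required field is rejected at import time, which is the "metadata checks during import" idea from the slide.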
19. Metadata is important
The PHIDIAS catalogue metadata model will be based on Dublin Core elements, extended with ISO 19115 where necessary: records are compliant with the Dublin Core standard and, where relevant (e.g. for geo-referenced data), made compatible with the ISO 19115 standard (e.g. by adding the geographical extent). The main managed information is:
General metadata (Dublin Core):
Title | Author(s) and affiliations (link with ORCID) | Publication date | Abstract | References | Use conditions (possible limitations…) | Reference to data user's manual (if any)
Access conditions:
Data license (Creative Commons license, ...) | Provided data citation in DataCite format | Access service(s) | Data format and size
Keywords (code lists provided):
Variables (link with the Essential Ocean Variables code list) | Method(s) | Instrument(s) | Project(s)
Geographical extent:
Min and max latitudes and longitudes | Location map
Temporal extent
Data preview(s)
List of citing publications
…
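As a sketch, the record structure above could be modelled like this. The field names follow the slide but are illustrative, not the project's actual schema; the example values are invented.

```python
# Sketch of a PHIDIAS-style catalogue record: Dublin Core-style core
# fields, optionally extended with an ISO 19115-like geographic extent
# for geo-referenced data. Names and values are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GeoExtent:          # ISO 19115-style extension
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float

@dataclass
class CatalogueRecord:    # Dublin Core-style core elements
    title: str
    creators: list
    date: str
    description: str = ""
    license: str = ""
    keywords: list = field(default_factory=list)
    geo: Optional[GeoExtent] = None  # only set for geo-referenced data

rec = CatalogueRecord(
    title="Surface salinity, North Atlantic",
    creators=["A. Author (ORCID 0000-0000-0000-0000)"],
    date="2020-06-04",
    license="CC-BY 4.0",
    keywords=["salinity"],
    geo=GeoExtent(30.0, 65.0, -60.0, 0.0),
)
print(rec.geo is not None)  # -> True: a geo-referenced record
```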
20. Prototype Data Lake for processing
Two data types:
In-situ datasets:
- not extremely large, but split across many small files;
- heterogeneous data types: vertical profiles, time series, underway data...
Satellite datasets:
- may be very large (several tens of petabytes in total), which makes them difficult to transfer over networks.
The "Data Lake" will be synchronized periodically (e.g. daily).
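The periodic synchronization could look like this minimal sketch: copy into the lake only the files that are new or changed since the last run. The paths and the mtime-based change test are assumptions for illustration, not the project's actual procedure.

```python
# Incremental sync sketch: copy new/updated files from a source tree into
# the data lake, comparing modification times. Stdlib only; illustrative.
import shutil
from pathlib import Path

def sync(source: Path, lake: Path) -> list:
    """Copy new or updated files from `source` into `lake`; return the copies made."""
    lake.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in source.rglob("*"):
        if not src.is_file():
            continue
        dst = lake / src.relative_to(source)
        # Copy if the file is new, or newer than the lake's copy.
        if not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 preserves the modification time
            copied.append(dst)
    return copied
```

Running this daily (e.g. from a scheduler) gives the "periodically synchronized" behaviour from the slide: an unchanged tree produces no copies on the second run.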
21. Different use cases, different storage (1)
For in-situ datasets: online selection and visualization of data using a two-step discovery service via a common catalogue:
1) selection of data collections/datasets, then
2) selection of the subset of data of interest.
Example: exploring SeaDataNet (Common Data Index) and Copernicus Marine Services data collections, including fast detection of co-localized data.
Access to data will have to be optimized to select and retrieve a small amount of data among a large number of metadata records, using different selection criteria: geographical, temporal...
Prototype: Elasticsearch on top of a (No)SQL database, to allow faceting in the web selection portal with optimized response times.
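The two-step selection can be illustrated with a toy catalogue. The collection names follow the slide, but the records, parameter codes and dates are invented for the example.

```python
# Two-step selection sketch: step 1 picks a data collection, step 2
# subsets it by geographical bounding box and time window.
from datetime import date

CATALOGUE = {  # illustrative contents
    "SeaDataNet-CDI": [
        {"lat": 55.2, "lon": 19.1, "date": date(2019, 6, 1), "var": "PSAL"},
        {"lat": 40.0, "lon": -30.5, "date": date(2018, 2, 3), "var": "PSAL"},
    ],
    "CMEMS-insitu": [
        {"lat": 59.9, "lon": 24.9, "date": date(2019, 7, 12), "var": "CPHL"},
    ],
}

def select(collection, bbox, t0, t1):
    """Step 2: subset one collection by bounding box and time window."""
    lat0, lat1, lon0, lon1 = bbox
    return [r for r in CATALOGUE[collection]
            if lat0 <= r["lat"] <= lat1 and lon0 <= r["lon"] <= lon1
            and t0 <= r["date"] <= t1]

# Step 1 is choosing the collection; step 2 is the subset query:
baltic = select("SeaDataNet-CDI", (53, 66, 9, 30),
                date(2019, 1, 1), date(2019, 12, 31))
print(len(baltic))  # -> 1 record in the Baltic box for 2019
```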
22. Different use cases, different storage (2)
Facilitate and improve access to data (especially in-situ data) for fast, interoperable visualization and subsetting via a web portal: "access few data among many data".
Output: small extracted data subsets and web-based maps and diagrams (representation of time series and vertical profiles).
Prototype: set up the Data Lake by implementing a NoSQL database (e.g. Cassandra). This includes the synchronization procedures from distributed data sources into the data structure adopted within the Data Lake.
23. Different use cases, different storage (3)
Support on-demand processing of large data subsets using DIVA or Pangeo.
Requires high-performance browsing and processing of large amounts of data (e.g. salinity and chlorophyll), preferably in parallel: "access many data among many data".
Output: gridded fields of salinity and chlorophyll.
Data Lake prototype: "Data Cubes" accessed through the Pangeo software component suite, e.g. Zarr format, Xarray, Parquet, Arrow.
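Why a chunked data-cube layout (Zarr-style, as named above) helps "access many data among many data" can be shown with a toy chunk-index calculation: the grid is stored as fixed-size chunks, and a spatial query only needs to read the chunks that overlap it, which also parallelises naturally. Stdlib-only sketch; real implementations would use Zarr/Xarray.

```python
# Chunked-grid access sketch: for a query window on a gridded field,
# compute which storage chunks must be read. Chunk size is illustrative.
CHUNK = 100  # grid cells per chunk side

def chunks_for_query(i0, i1, j0, j1):
    """Chunk indices overlapping the half-open query window [i0,i1) x [j0,j1)."""
    return [(ci, cj)
            for ci in range(i0 // CHUNK, (i1 - 1) // CHUNK + 1)
            for cj in range(j0 // CHUNK, (j1 - 1) // CHUNK + 1)]

# A 1000x1000 grid has 100 chunks; a 150x150 window touches only 4 of them,
# so ~96% of the stored data never has to be read or transferred:
print(len(chunks_for_query(25, 175, 25, 175)))  # -> 4
```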
24. Thank you
Gilbert Maudire (Ifremer), PHIDIAS WP6 Leader
Peter Thijsse (peter@maris.nl) and the PHIDIAS WP6 group
25.
PHIDIAS: Boosting the use of cloud services for marine data management, services and processing
Passport photos for plankton: new era for marine biology research
Jukka Seppälä, Seppo Kaitala, Kaisa Kraft, Otso Velhonoja (SYKE)
Webinar | June 4, 2020, 11:00 AM CEST
26. Phytoplankton abundance is typically estimated using ocean colour, in situ sensors or lab analysis
Phytoplankton contribute 50% of global photosynthesis: CO2 fixation and O2 production.
Due to measurement uncertainties and undersampling, the role of the oceans – and phytoplankton – is one of the key unknowns in the global carbon budget.
We can observe the abundance of phytoplankton using Chlorophyll a as a proxy.
[Figures: long-term average chlorophyll concentration at the ocean's surface in milligrams per cubic meter of water, data provided by the Joint Research Centre (JRC), source EMODnet; seasonal chlorophyll concentration in the Baltic Sea between Helsinki (FI) and Travemünde (DE), measured with the Ferrybox, source Alg@line project, SYKE.]
27. Species/group-specific information is crucial to understand biogeochemical fluxes
- Bulk biomass estimates by Chlorophyll a do not reflect the diversity of phytoplankton.
- Phytoplankton community composition is largely affected by environmental and anthropogenic forcing (light, nutrients, temperature).
- Phytoplankton community composition responds very quickly to the chaotic rhythms of aquatic environments.
- Phytoplankton community composition (and functional types) largely affects aquatic elemental fluxes (carbon and nutrients) and the structure of the food web (up to fish).
Photos of phytoplankton, taken by the Imaging FlowCytobot at Utö station, Gulf of Finland.
28. Why plankton imaging
- Traditional microscopy is slow and costly (though an accurate and important reference method!)
- New technologies based on optics, fluidics and imaging offer rapid, automated, unattended, quantitative, and cost-efficient analysis of individual cells and colonies of plankton organisms.
- The digital raw data gathered can be stored permanently, which allows re-analyses and the creation of open data archives within the international scientific community.
Cyanobacterial bloom in the Baltic 2018, with 3 main species recorded at 20-min intervals (Kraft et al., in prep.).
29. Plankton imaging – state of the art
- Various technologies are available, many in the beta/demonstration phase. Some forerunner technologies (e.g. CytoSense) have well-established user communities and common vocabularies for metadata.
- Machine-learning algorithms are available, but optimisation and development are ongoing.
- Central data storage is not available; there is no agreed way to connect to data aggregators.
- The EcoTaxa web application is a European forerunner for visual exploration and taxonomic annotation of images. Initiated by the Laboratoire d'Océanographie de Villefranche (LOV): https://ecotaxa.obs-vlfr.fr/
30. Imaging technology
Imaging FlowCytobot at SYKE:
- Images of phytoplankton cells (range 10-150 µm)
- Operates remotely on the Utö island flow-through system
- Samples of 5 ml at approx. 20-min intervals
- Camera triggered by chlorophyll-a fluorescence
- Up to 30,000 high-resolution images per hour
- Random Forest algorithm for image recognition, moving towards Convolutional Neural Networks
31. Plankton imaging – PHIDIAS
Demonstration: from image to information
[Data flow: Imaging FlowCytobot (Finnish Environment Institute, Utö) → Finnish Meteorological Institute's server → CSC (Center for Scientific Computing, FI) Allas object storage (data storage and sharing during the project's duration) → data aggregators / other users (EcoTaxa; long-term data storage).
cPouta (cloud computing): development of CNN models; GPU flavor is needed.
Puhti (high-performance computing): CNN in production mode (classification of new images); GPU or CPU flavor; potential real-time usage.]
32. PHIDIAS, at the focal point for multiplatform detection of phytoplankton: EO algorithms – sensor validation – ML, CNN – DIVA
Picture: Lauri Laakso, FMI
33. Thank you, stay tuned, and see you again!
Jukka Seppälä, SYKE
jukka.seppala@ymparisto.fi
Special thanks to the SYKE, FMI, LUT and CSC staff supporting the various steps of plankton imaging!
34.
Analyzing ocean observations in an HPC infrastructure with DIVAnd
Alexander Barth, Charles Troupin, University of Liège
35. The ocean is complex...
- Many ocean processes are present simultaneously
- Non-linear interaction between them
- Wide time/space spectrum of scales
- → High diversity of ocean observations
Image creation: Center for Environmental Visualization, University of Washington
36. … and is complex to observe
- The types of observations are quite diverse
- Ocean observations are sparse (because they are expensive)
- Yet scientifically very valuable (a measurement not taken is lost forever; the state of the climate, and of the ocean in particular, changes)
Image credits: ICTS SOCIB
37. Challenges in ocean data analysis
- Fast access to data; a multitude of formats, with a general trend towards netCDF
- Different programming environments/languages used by scientists:
  - Fortran (still used in numerical models)
  - Matlab (very widespread ~10 years ago, but less used today)
  - Python
  - R
  - but also Julia, C, C++, shell scripts, ...
38. Switching to the Julia language
- At GHER, ULiège: started using Julia in 2017
- Julia version 1.0 was released on 8 August 2018
39. DIVAnd
- DIVA: Data Interpolating Variational Analysis
- Objective: derive a gridded climatology from in situ observations
- The variational inverse method aims to derive a continuous field which is:
  - close to the observations (it should not necessarily pass through all observations, because observations have errors)
  - "smooth"
- Spline interpolation
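A one-dimensional toy version of the variational idea above (a field close to the observations plus a smoothness penalty) can be written down directly. This illustrates the cost function only; it is not the DIVAnd algorithm, and the grid size, weights and solver are chosen purely for the example.

```python
# 1-D toy of the variational analysis idea: minimise
#   J(x) = sum_k (x[k] - obs[k])^2 + lam * sum_i (x[i+1] - x[i])^2
# over a gridded field x, by solving the normal equations A x = b.
def analyse(n, obs, lam=1.0):
    """obs: {grid_index: observed value}; lam: smoothness weight."""
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for i in range(n - 1):            # smoothness term
        A[i][i] += lam; A[i + 1][i + 1] += lam
        A[i][i + 1] -= lam; A[i + 1][i] -= lam
    for k, y in obs.items():          # data (closeness) term
        A[k][k] += 1.0; b[k] += y
    # Tiny Gaussian elimination with partial pivoting (fine for a toy).
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]; b[c], b[p] = b[p], b[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for cc in range(c, n):
                A[r][cc] -= f * A[c][cc]
            b[r] -= f * b[c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

field = analyse(5, {0: 1.0, 4: 3.0})
# The field rises smoothly between the two observations; it does not pass
# exactly through them, because the smoothness term pulls it inwards.
```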
40. DIVAnd
- Workshops
- Virtual Research Environment (VRE) in SeaDataCloud
- Jupyter Notebooks
- CI (Continuous Integration) testing (Linux, macOS, Windows)
- Docker and Singularity images with preconfigured software
41. DIVAnd in a virtual research environment
https://vre.seadatanet.org/
42. BlueCloud VRE
The Blue-Cloud VRE will also include DIVAnd
43. Computing resources
- DIVAnd needs to solve a large matrix system
- The solvers:
  - the direct solver (SuiteSparse, CHOLMOD) requires a significant amount of memory but is very fast
  - iterative solvers (preconditioned conjugate gradient) are more memory-efficient but slower
- In practice, the direct solver is preferred as long as the problem fits into the available memory
- But access to computing resources with sufficient memory has been a problem for our users (SeaDataCloud, EMODnet Chemistry)
- Code portability via Singularity containers
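The memory-efficient iterative alternative named above can be sketched with a plain (unpreconditioned) conjugate-gradient solver: it only needs matrix-vector products, so the matrix is never factorised, unlike the direct Cholesky route. Toy dense-matrix version for illustration; real solvers work on large sparse systems.

```python
# Conjugate gradient for a symmetric positive-definite system A x = b.
# Only matrix-vector products are needed, hence the low memory footprint.
def cg(A, b, tol=1e-10, maxiter=1000):
    n = len(b)
    x = [0.0] * n
    r = b[:]                 # residual b - A x, with x = 0 initially
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(maxiter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

# Small symmetric positive-definite example system:
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg(A, b)  # ~ [1/11, 7/11]
```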
44. DINCAE
- Paper: Data INterpolating Convolutional Auto-Encoder
- Neural network to reconstruct missing data in satellite images (in particular clouds in remotely sensed Sea Surface Temperature)
- Originally written in Python using TensorFlow 1
- Many changes in TensorFlow 2 → better alternatives?
- Rewritten in Julia with the Knet library
- Training time of the network was reduced from 3.5 hours to 1.9 hours (on an NVIDIA 1080 GPU)
- We use "data augmentation" (in particular perturbing the input data and adding extra clouds) with vectorized NumPy code, but it could be made significantly faster using Julia instead.
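The cloud-augmentation step described above can be sketched as random masking of an input grid, so the network learns to reconstruct values it cannot see. Stdlib toy on nested lists; DINCAE itself operates on satellite SST arrays with vectorized code.

```python
# Data-augmentation sketch: mask a fraction of pixels ("extra clouds")
# in a copy of the input field. Toy version; real code is vectorized.
import random

def add_clouds(image, fraction, missing=None, seed=0):
    """Return a copy of `image` with `fraction` of its pixels masked out."""
    rng = random.Random(seed)  # seeded for reproducibility
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    n_mask = int(fraction * h * w)
    cells = [(i, j) for i in range(h) for j in range(w)]
    for i, j in rng.sample(cells, n_mask):
        out[i][j] = missing
    return out

# A 4x4 toy SST field; mask 25% of it:
sst = [[20.0 + 0.1 * j for j in range(4)] for _ in range(4)]
cloudy = add_clouds(sst, fraction=0.25)
print(sum(v is None for row in cloudy for v in row))  # -> 4 masked pixels
```

The original field is left untouched, so the masked copy can serve as the network input while the full field remains the reconstruction target.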
45. Some results with DINCAE
- Sea Surface Temperature (SST) reconstruction with DINCAE
- Some data is withheld during the reconstruction (i.e. additional clouds)
- SST is reconstructed and the expected error standard deviation is computed
DINCAE reconstruction using MODIS sea surface temperature in the Adriatic
46. Conclusions
- The types of available ocean data are quite diverse
- Fortran is still widely used in the oceanographic HPC community
  - but there are significant challenges in supporting users outside a typical HPC environment
  - Julia has been a good fit for us for data analysis
- The original Fortran tool DIVA has been rewritten in Julia (DIVAnd)
- Jupyter notebooks provide users a convenient interface that can also be used in a Virtual Research Environment (especially for data exploration)
- In future: adapt existing tools or adopt new algorithms able to leverage GPUs (or other accelerators)
49. The mission
Blue-Cloud aims to pilot a cyber platform bringing together and providing access to:
1. multidisciplinary data from observations and models
2. analytical tools
3. computing facilities
to support research to better understand and manage the many aspects of ocean sustainability.
4 June 2020 | Boosting the use of cloud services for marine data management, services and processing
50. The Leading Concepts
- Developing and deploying a cloud platform with a Virtual Research Environment (VRE) with an array of services for configuring Virtual Labs for specific analytical workflows, use cases and demonstrators
- Applying common standards and interoperability solutions for providing harmonized data and metadata
- Developing and deploying harmonised discovery of, and access to, a series of established European marine data management and processing infrastructures that deal with major marine and ocean data collections, related data centres, and their data providers
[Diagram: discovery and access to datasets from many sources → upstream services → VRE – cloud platform → downstream services → added-value services and applications; standards: OGC, ISO, W3C & vocabularies]
51. The Technical Framework
- A component to serve federated discovery and access
  - bridging blue data infrastructures and their multi-disciplinary data from observations (in-situ and remote sensing), data products and outputs of numerical models
- A component to serve as the Blue-Cloud Virtual Research Environment (VRE)
  - federating computing platforms and analytical services; this will include Virtual Labs for each of the use-case Demonstrators
52. Blue-Cloud federation of major infrastructures
Blue Data infrastructures | E-infrastructures
53. Blue-Cloud Virtual Research Environment
- Exploits the Blue-Cloud data discovery and access service
- Federates computing platforms and algorithms
- Interacts with external systems
- Exposes all repositories, algorithms, and computing platforms as a common unified space of resources
- Serves diverse communities of researchers
54. Blue-Cloud Framework satisfies Open Science Requirements
- Support collaborative research and experimentation
- Implement Reproducibility-Repeatability-Reusability of Science
- Allow sharing of data, processes and findings
- Grant open access to the produced scientific knowledge
- Tackle Big Data challenges
- Manage heterogeneous data/process access policies
- Sustainability: low operational costs, low maintenance prices
55. Tuning, testing and promoting with five demonstrators
- Zoo- and Phytoplankton EOV products
- Plankton Genomics
- Marine Environmental Indicators
- Fish, a matter of scales
- Aquaculture Monitor
Domains: Biodiversity, Environment, Fishery, Aquaculture, Genomics
56. Function of Demonstrators
Demonstrate how the services developed contribute to unlocking innovation potential:
- to derive requirements and specifications for the pilot Blue-Cloud platform development
- to demonstrate the potential of cloud-based open science in the marine community
- to serve as a catalyst for wider community engagement, identifying longer-term challenges, and planning future developments from pilot to a full-scale Blue-Cloud infrastructure.
Identify the scientific communities' requirements:
- storage (repositories, warehouses, …)
- multidisciplinary data access and harmonisation
- analytical processes
- computing requirements
57. Piloting an EOSC "thematic cloud"
58. Blue-Cloud project
- Funding: H2020 'Future of Seas and Oceans Flagship Initiative' (BG-07-2019-2020), topic [A] 2019 – Blue Cloud services
- Timing: 36 months (start October 2019)
- Budget: 5.9 million euro
- Partnership: 20 partners
59. Any questions?
https://blue-cloud.org
Speaker notes:
- Marine data come from different sources.
- The diversity of data requires good descriptions of them: metadata, catalogues, common vocabularies, ... in a FAIR-principles perspective (introduction to Peter's presentation).
- Some datasets are quite large or include numerous observations (such as plankton images); in addition, having different data collections stored in different locations (satellite ocean colour, plankton, ...) makes it necessary to improve data access for better processing performance (introduction to Jukka's presentation).
- Processing data then requires data-analysis software and powerful IT infrastructures (HPC, HPDA) available to users (introduction to Charles's presentation).
We focus in this presentation on the data access and storage to support processing.