Giuseppe Fiameni (CINECA)
The goal of this EUDAT workshop is to present the EUDAT services and the results of the collaboration activities achieved so far, and to deliver a hands-on session on how to write a Data Management Plan (DMP). The DMP is a useful instrument for researchers to reflect on and communicate how they will deal with their data, as it prompts them to think about how they will generate, analyse and share data during their research project and afterwards.
Coupling HPC and Data Resources and services together - EUDAT Workshop at EXDCI 2017
1. Coupling HPC and Data Resources and services together - EUDAT workshop
Giuseppe Fiameni (CINECA)
Barcelona, May 18th
www.eudat.eu
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065
2. Agenda
14.30 - Introduction to EUDAT Services and EUDAT-PRACE Pilots - Giuseppe Fiameni, CINECA
15.15 - How to write a Data Management Plan - Stéphane Coutin, CINES
16.00 - DMP hands-on with a concrete use case - Stéphane Coutin, CINES
16.30 - Coffee break
17.00 - EUDAT HTTP API - A programmable way to deposit data onto EUDAT resources - Giuseppe Fiameni, CINECA
17.30 - Round Table - How to move the collaboration forward?
3. EUDAT: The pan-European Data Infrastructure
Giuseppe Fiameni, CINECA
EUDAT Workshop
Barcelona, May 18th
4. A pan-European e-Infrastructure solution for pan-European RI data challenges
All RIs are facing data challenges
Where to store the growing amount of data?
How to find it?
How to make the most of it?
Solutions are needed at pan-European level
We need to promote synergies
Some services are common to many communities
Costs and investments can be optimised
Better integration of e-infrastructures and research infrastructures can be achieved
5. Collaborative Data Infrastructure
Data Generators and Users: user functionalities, data capture & transfer, virtual research environments
Community Support Services: data discovery & navigation, workflow generation, annotation, interpretability
Common Data Services: persistent storage, identification, authenticity, workflow execution, mining
Trust, Data Curation
7. Community-Driven Solutions
EUDAT services are designed, built and implemented based on user community requirements.
Physical Sciences & Engineering
Materials & Analytical Facilities
MAPPER
Biomedical & Medical Sciences
8. The CDI – A Service Infrastructure
10. CDI Data Domain
EUDAT Data Domain modeled on the ANDS Data Curation Continuum
ANDS: Australian National Data Service organization – www.ands.org.au
12. Registered Data?
Data is registered using a Persistent Identifier (PID) so that it can be “recognised” in the long term
Persistent Identifiers:
are pointers to data resources of different forms: data files, metadata files, documents, multimedia, etc.
are globally unique
are meant to exist indefinitely
are used to identify and retrieve resources
can be resolved to the physical resource
Examples: ISBNs, DOIs, PURLs, Handles …
13. Persistent over time
[Figure: timeline from today (2016) to 2030. The PID 11839/abc123 stays the same while the location of the data moves from http://www.example.com/ to http://www.moved.com/.]
Supports access to a resource as it moves from one location to another, by design.
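The indirection shown in the timeline can be sketched in a few lines of Python. This is only a toy in-memory model, assuming a hypothetical `PidRegistry` class; it is not the Handle System or any EUDAT API, and it reuses the PID and URLs from the slide purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PidRegistry:
    """Toy in-memory stand-in for a PID resolver (hypothetical, not a real API)."""
    records: dict = field(default_factory=dict)

    def register(self, pid: str, url: str) -> None:
        # A PID is globally unique, so registering it twice is an error.
        if pid in self.records:
            raise ValueError(f"PID already registered: {pid}")
        self.records[pid] = url

    def move(self, pid: str, new_url: str) -> None:
        # Only the location changes; the identifier itself never does.
        if pid not in self.records:
            raise KeyError(pid)
        self.records[pid] = new_url

    def resolve(self, pid: str) -> str:
        return self.records[pid]


registry = PidRegistry()
registry.register("11839/abc123", "http://www.example.com/")
print(registry.resolve("11839/abc123"))  # location today
registry.move("11839/abc123", "http://www.moved.com/")
print(registry.resolve("11839/abc123"))  # same PID, new location
```

Anything that cites the PID rather than the URL keeps working after the move, which is the property the slide calls persistence by design.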
18. EUDAT2020
Further integration with EUDAT CDI (e.g. B2SHARE)
Integration with B2ACCESS to enable access by many different Identity Providers
Cloud Storage Federation, collaboration with GEANT in OpenCloudMesh
Assess B2DROP as workspace area to computing facilities
Who
Citizen scientists and small teams
What
Store and exchange data
Synchronize multiple versions
Ensure automatic desktop synchronization
Why
Ease of use
Trusted European service
20. EUDAT2020
Further integration with EUDAT CDI (e.g. B2DROP, B2SAFE)
Integration with B2ACCESS (incl. eduGAIN), focus on authorization
Embargo period
Editing of metadata
Data versioning and annotation
Extended RESTful HTTP API
Easily installable software package
Who
Small to medium teams
What
Store data (incl. software) and add domain metadata
Share registered research data worldwide
Preserve (small-scale) research data for the long term
Why
Register data for publications
Make data known to the wider community
21. Collection of official RDA documents
22. EUDAT2020
Support iRODS v4
Support metadata
Optimize and extend policies to support data curation and provenance
Further integration with B2ACCESS
Support authorization on the basis of community access rules
Assess B2SAFE as workspace area to computing facilities
Who
Community data managers
‘Sophisticated’ organisations
What
Provide an abstraction layer which virtualizes large-scale data resources
Guard against data loss in long-term archiving and preservation
Optimize access for users from different regions
Bring data closer to powerful computers
Why
Performance
Replication between trusted sites
Data preservation
23. Data Policy Manager
Data policies are centrally managed
Policy rules are implemented and enforced by site-local rule engines
Policies are described in an abstract language
Community data managers must authenticate to provide trust
Support for data replication and integrity checking policies
Central logging for auditable data policies to monitor execution
Active collaboration with the RDA Practical Policy WG
EUDAT2020
Handover to operations
Extend the number of policies supported
Focus on data curation and provenance policies
Integrate with B2ACCESS
24. EUDAT2020
Further develop the HTTP interface to maturity and extend its functionality to metadata
Native support for PIDs within GridFTP transfers
Extend the EUDAT client API library to other B2 services (e.g. B2SHARE, B2FIND, PID)
Further integration with B2ACCESS
Who
Users and communities with significant computational needs
What
Transfer large data collections from EUDAT storage to external HPC facilities for processing
Copy large data sets, ingesting them onto EUDAT storage resources
Why
Integration/collaboration with PRACE
Simplify data transfer
25. EUDAT2020
Harvesting of metadata stored in B2SAFE
Community customizations
Annotation of datasets
Further assess RDF and Linked Data
Further assess scalability and performance
Who
Anyone
What
Find collections of scientific data quickly and easily, irrespective of their origin, discipline or community
Get quick overviews of available data
Browse through collections using standardized facets
Why
Unique collection
Ease of searching
27. An annotation is “a note added to a text, book, drawing, etc., as a comment or an explanation” (from Merriam-Webster)
Provide a service to add annotations to digital assets
Manual annotations via a web user interface (WUI), or programmatic annotations via a REST API
PoC integration with B2FIND and B2SHARE
Prototype available at http://b2note-dev.bsc.es
28. Use cases from ELIXIR, EuroArgo and SeaDataNet
Automatic (re-)distribution of updatable data to data storage providers and users
Data storage providers are inside and outside the EUDAT CDI domain
Data owners must be able to mark data as subscribable
Data storage providers and individual users must be able to subscribe to data
Data transfers and notifications are triggered by updates
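The requirements above describe a publish/subscribe flow. A minimal sketch of that flow follows, assuming a hypothetical in-memory `SubscriptionHub` class and an invented dataset name; it does not represent any existing EUDAT interface and leaves out the actual data transfer.

```python
from collections import defaultdict


class SubscriptionHub:
    """Toy in-memory model of the subscription flow (hypothetical, not a EUDAT API)."""

    def __init__(self):
        self.subscribable = set()             # datasets opened up by their owners
        self.subscribers = defaultdict(list)  # dataset id -> notification callbacks

    def mark_subscribable(self, dataset_id):
        # Done by the data owner.
        self.subscribable.add(dataset_id)

    def subscribe(self, dataset_id, callback):
        # Done by a storage provider or an individual user.
        if dataset_id not in self.subscribable:
            raise PermissionError(f"{dataset_id} is not open for subscription")
        self.subscribers[dataset_id].append(callback)

    def publish_update(self, dataset_id, version):
        # An update triggers one notification per subscriber.
        for callback in self.subscribers[dataset_id]:
            callback(dataset_id, version)
        return len(self.subscribers[dataset_id])


hub = SubscriptionHub()
received = []
hub.mark_subscribable("argo/profiles")  # invented dataset name
hub.subscribe("argo/profiles", lambda ds, v: received.append((ds, v)))
notified = hub.publish_update("argo/profiles", 2)
print(notified, received)
```

In a real deployment the callback would start a transfer to the subscriber's storage rather than append to a list.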
29. How to collaborate?
- Data Pilot call Highlights -
24 pilots, of which
7 = Earth sciences, energy and environment,
6 = Biomedical and life sciences,
6 = Social Sciences and Humanities
5 = Physical Sciences and Engineering
potential user audience = 40,000 users
cumulative storage resource request of up to 4.3PB
31. Coupling HPC & Data Resources
PRACE: coupling PRACE HPC resources to EUDAT storage resources.
PRACE & EUDAT infrastructures are now fully interoperable from a technical point of view.
EUDAT has contributed to three PRACE Calls for proposals (Tier0 Call 11, Tier1 Call 13, Tier1 Call 14) by offering data services and storage resources to interested PRACE users.
Users can now apply for EUDAT services to preserve their data throughout and beyond the lifecycle of their PRACE computational grant!
32. PRACE projects using EUDAT data services
| Acronym | Title | Field | Country | Data requirements during the PRACE project | Data requirements after the PRACE project | Duration of access to this data service after the PRACE allocation | PRACE Site | EUDAT Site |
|---|---|---|---|---|---|---|---|---|
| HybTurb3D | Hybrid 3D simulations of turbulence and kinetic instabilities at ion scales in the expanding solar wind | Astro Sciences | IT | 140 TB | 140 TB | 24 months | SurfSARA | CINECA |
| MULTINANO | Multiscale simulations of nanoparticle suspensions | Engineering | IT | 30 TB | 30 TB | 24 months | MPCD | CINECA |
| HiResClimate | High Resolution EC-Earth Simulations | Earth Sciences | IE | 150 TB | 150 TB | 12 months | KTH | EPCC |
| AFiD | Effect of rotation and surface roughness on heat transport in turbulent flow | Engineering | NL | 11 TB | 10 TB for 24 months; 1 TB for long-term storage and publication | 10 TB for 24 months; 1 TB for long-term storage and publication | EPCC | SurfSARA |
| CHARTERED | Charge transfer dynamics by time dependent density functional theory | Materials Science | SE | 30 TB | 30 TB | 24 months | IT4I | KTH/PDC |
Total 448TB
33. User credential synchronization
1. PRACE LDAP -> B2ACCESS synchronisation
2. B2ACCESS -> B2STAGE synchronisation
3. Access to web-based EUDAT services
[Diagram components: PRACE LDAP, B2ACCESS (Unity IDM), B2SAFE/B2STAGE (iRODS, GridFTP), IGTF]
35. 10 year initial commitment
36. EUDAT CDI Agreement
Objective: Sustain the EUDAT Collaborative Data Infrastructure (CDI) beyond project-based agreements
How? By formalising a partnership between service providers which:
Ensures common service management guidelines and clear commitments from partners (currently signed for 10 years)
Establishes a coordinating Secretariat paid through membership fees
Who? Generic (cross-disciplinary) & Thematic (discipline-oriented) service providers
Two levels of engagement: Interoperable (level 2) vs Integrated (level 1) nodes
38. HTTP API - A programmable way to deposit data onto EUDAT resources
Roberto Mucci – r.mucci@cineca.it
https://github.com/EUDAT-B2STAGE/http-api
39. Why an API?
Define a set of rules and specifications for interacting with EUDAT services that communities can rely on
Handle changes in the underlying technology by reducing integration complexity
Propose a basic but extensible data model
Existing standards, e.g. S3, are static: it is impossible to decouple a data object into data payload and metadata
Provide support for registered data
40. B2STAGE HTTP-API: Principles
As REST-compliant as possible:
Resource-based,
Self-descriptive return messages,
Hypermedia-driven (HATEOAS)
Follow the EUDAT Data Domain Model specifications:
Registered (B2SAFE)
Workspace (temporary working copy)
Adopt the Swagger specifications:
Specification-driven development
Swagger UI for testing and interactive documentation
41. CRUD operations on the registered domain (B2SAFE):
GET: obtain file metadata, download a file, list a directory
PUT: upload an entity (and trigger B2SAFE registration)
POST: create a directory
DELETE: delete a file, delete an empty directory
PATCH: rename a file, rename a directory
Resources are identified by their full path in the directory namespace:
curl -H "Authorization: Bearer <auth_token>" "<http_server:port>/api/registered/path/to/directory/filename.txt"
Available endpoints: /api/registered
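The mapping between operations and HTTP methods can be sketched in Python with the standard library. This is a minimal sketch, not the project's client: the operation names, the `build_request` helper and the host `b2stage.example.org` are assumptions made for illustration, and the request is only built, never sent. How upload, create and rename payloads are encoded is not shown on the slide, so the body is left to the caller.

```python
import urllib.request

# Operations listed on the slide, mapped to their HTTP methods.
# The keys are hypothetical labels chosen for this sketch.
OPERATIONS = {
    "read": "GET",       # metadata, download, or directory listing
    "upload": "PUT",     # upload an entity (triggers B2SAFE registration)
    "mkdir": "POST",     # create a directory
    "delete": "DELETE",  # delete a file or an empty directory
    "rename": "PATCH",   # rename a file or a directory
}


def build_request(base_url, token, operation, path, body=None):
    """Build (but do not send) a request for a path in the registered domain."""
    url = f"{base_url.rstrip('/')}/api/registered/{path.lstrip('/')}"
    return urllib.request.Request(
        url,
        data=body,
        method=OPERATIONS[operation],
        headers={"Authorization": f"Bearer {token}"},
    )


# Placeholder host and token, not real values.
req = build_request(
    "https://b2stage.example.org", "TOKEN", "read",
    "path/to/directory/filename.txt",
)
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is omitted here because it needs a live server and a valid token.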
42. PID resolution (B2HANDLE)
GET: resolve a PID and get its URL and EUDAT/CHECKSUM
GET ?download=true: resolve a PID and download the object
Example:
curl -H "Authorization: Bearer <auth_token>" "<http_server:port>/api/pids/<PID>"
[...]
"Response": {
    "data": {
        "EUDAT/CHECKSUM": "123456789",
        "URL": "<http_server:port>/api/registered/tempZone/home/guest/test.txt"
    },
    "errors": null
}
Available endpoints: /api/pids
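A client would typically pull the URL and checksum out of that response. The sketch below assumes the body is a JSON object whose top-level `Response` key holds `data` and `errors`, as the excerpt suggests; the part elided as `[...]` is unknown and simply left out of the sample, and `parse_pid_response` is a hypothetical helper name.

```python
import json

# Sample body modeled on the slide's excerpt; the elided "[...]" part is omitted.
SAMPLE = """
{
  "Response": {
    "data": {
      "EUDAT/CHECKSUM": "123456789",
      "URL": "<http_server:port>/api/registered/tempZone/home/guest/test.txt"
    },
    "errors": null
  }
}
"""


def parse_pid_response(text):
    """Return (url, checksum), or raise if the API reported errors."""
    response = json.loads(text)["Response"]
    if response.get("errors"):
        raise RuntimeError(f"PID resolution failed: {response['errors']}")
    data = response["data"]
    return data["URL"], data["EUDAT/CHECKSUM"]


url, checksum = parse_pid_response(SAMPLE)
print(url)
print(checksum)
```

The checksum can then be compared against one computed over the downloaded object, provided the algorithm used by the service is known.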
43. Typical workflows (1)
Upload a file and get the PID*
1. Client -> B2STAGE HTTP-API: PUT api/registered/<iRODS_path>
2. B2STAGE HTTP-API -> B2SAFE (iRODS): PRC write
3. B2SAFE (iRODS) -> EPIC: object registration, PID
Get object metadata (with PID)
1. Client -> B2STAGE HTTP-API: GET api/registered/<iRODS_Path>
2. B2STAGE HTTP-API -> B2SAFE (iRODS): PRC imeta
3. Returned to the client: metadata and PID
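44. Typical workflows (2)
PID resolution
1. Client -> B2STAGE HTTP-API: GET api/pids/<PID>
2. B2STAGE HTTP-API -> B2HANDLE: B2HANDLE python client
3. Returned to the client: URL, EUDAT/CHECKSUM
PID resolution and download
1. Client -> B2STAGE HTTP-API: GET api/pids/<PID>?download=true
2. B2STAGE HTTP-API -> B2HANDLE: PID resolution
3. B2STAGE HTTP-API -> B2SAFE (iRODS): data object retrieval
4. Returned to the client: data object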
45. Examples with cURL: GET
Obtain entity metadata:
curl -H "Authorization: Bearer <auth_token>" "<http_server:port>/api/registered/path/to/directory/filename.txt"
Download an entity:
curl -H "Authorization: Bearer <auth_token>" "<http_server:port>/api/registered/path/to/directory/filename.txt?download=true"
Get list of entities in a directory:
curl -H "Authorization: Bearer <auth_token>" "<http_server:port>/api/registered/path/to/directory"
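The three GET variants differ only in the path and the optional `download=true` query parameter. A small sketch of building those URLs in Python follows; the helper name `registered_url` and the host `b2stage.example.org` are assumptions for illustration, and percent-encoding the path is a precaution added here rather than something the slides specify.

```python
from urllib.parse import quote, urlencode


def registered_url(base_url, path, download=False):
    """URL for an entity or directory in the registered domain."""
    # quote() keeps "/" as-is and percent-encodes characters such as spaces.
    url = f"{base_url.rstrip('/')}/api/registered/{quote(path.strip('/'))}"
    if download:
        url += "?" + urlencode({"download": "true"})
    return url


base = "https://b2stage.example.org"  # placeholder host
print(registered_url(base, "path/to/directory/filename.txt"))                 # metadata
print(registered_url(base, "path/to/directory/filename.txt", download=True))  # download
print(registered_url(base, "path/to/directory"))                              # listing
```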
46. EPOS Uptake Plan: Where we are
Munich EUDAT F2F meeting WP6.3 28-30 March 2017
Massimo Fares (EPOS) WP6.3
47. EPOS Uptake Plan in a nutshell
Make seismic data discoverable & searchable through:
querying the EPOS metadata catalogue
retrieving data from EUDAT resources using standard transfer protocols and logical handles, e.g. PIDs
controlling user access (AuthN/AuthZ)
EPOS is an organization aiming at creating a pan-European infrastructure for solid Earth science to support a safe and sustainable society.
48. EPOS HTTP-API adoption plan
[Diagram components: sensor network, RT data collection, metadata extractor, Mongo catalog, INGV repository (B2SAFE), HTTP API, B2FIND (search, metadata harvesting via OAI-PMH), B2ACCESS (auth), user (search metadata, search data, access data)]