This presentation, given to the audience gathered at INFN in Rome, Italy, for the Open Data Ecosystem and CS3 conference, 27-29 January 2020, explored the use cases driving ARCHIVER.
2. Project Objective
Focus: archiving and data preservation services using commercial cloud services, to be made available via the European Open Science Cloud (EOSC)
Procurement R&D budget: 3.4M euro
Starting Date: 1st of January 2019
Duration: 36 Months
Coordinator: CERN (Lead Procurer)
3. Consortium
Includes Buyers and Experts involved in the preparation, execution and promotion of the procurement of R&D
Procurers – public organisations committing funds to contribute to a joint R&D procurement, research data use cases and the R&D testing effort
Experts – partner organisations bringing expertise in requirement assessment and promotion activities; not part of the Buyers Group
4. Preferred Partners (Early Adopters)
• Confirmed subscriptions received from 11 organisations: high level of interest from the community
• Participants:
• Demand-side public sector organisations Information Webinar (4 September): 47 participants
• Key advantages
• Assess if resulting services address archiving and preservation needs
• Contribute to and shape the R&D carried out in the project, contribute use cases, and have the option to purchase pilot-scale services by the end of the project
5. Challenge
20/02/2020
• Demonstrate services for long-term preservation and archiving of scientific data in the PB range
• FAIR archiving services following best practices and standards
• Expand resulting services to several scientific domains
• Transparent business models; make resulting services available through the EOSC catalogue
Current Status of Scientific Data Repositories
• Basic bit preservation and archiving capabilities
• Data volumes and communities growing
• Longstanding archiving and preservation activities, but most data not yet published
• Fragmentation across scientific disciplines, with costs underestimated at the planning phase
6. R&D Scoping Activities and Dialogue with the Private Sector
Continuous updates to the FAQ; Consortium Matchmaking; Early Adopters Engagement
Open Market Consultation process:
• 8 February – 1st Preparation Workshop
• 20 February – 2nd Preparation Workshop
• 8 April – OMC Kick-off, CERN
• 7 May – OMC Event, Barcelona
• 23 May – OMC Event, Stansted
• 5 June – OMC Consolidation, CERN
Feedback integration after each step, including feedback on the Draft PCP Contract Notice
7. Geographical Distribution of Companies
Wide geographical distribution: 42 companies, the majority from 12 European countries
8. Outcome: R&D Challenge
Cost-effective Business Model, taking into account:
- Scale
- Ingest rates
- Archive lifetime
- Number of copies
- Exit strategies
- Portability
- SLAs
Regulation & Legislation:
- Auditing
- Self-assessment
- Data retention
- GDPR
Layer 1 – Storage/Basic Archiving/Secure Backup: data integrity/security; cloud/hybrid deployment; data volume in the PB range; high, sustained ingest data rates; ISO 27000, 27040, 19086 and related standards; archives connected to the GÉANT network.
Layer 2 – Preservation: OAIS-conformant services: data readability formats, normalization, obsolescence monitoring, file fixity, authenticity checks, etc.; ISO 14721/16363, 26324 and related standards.
Layer 3 – Baseline User Services: search, discover, share, indexing, data removal, etc.; access under federated AAI.
Layer 4 – Advanced Services: high-level services: visual representation of data (domain-specific), reproducibility of scientific analyses using machine learning algorithms, etc.
Core R&D: Layers 1–3; Bonus R&D: Layer 4.
Scientific use cases deployments documented at: https://www.archiver-project.eu/deployment-scenarios
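The file fixity and authenticity checks expected of the Preservation layer reduce, at their core, to comparing stored checksums against freshly computed ones. A minimal sketch in Python (the manifest format and file layout are illustrative assumptions, not part of the ARCHIVER specification):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large archive files are never
    loaded fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def verify_fixity(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths whose current checksum no longer matches
    the expected digest recorded in the manifest."""
    return [rel for rel, expected in manifest.items()
            if sha256_of(root / rel) != expected]
```

A preservation service would run such a sweep periodically and flag (or repair from a second copy) anything the function returns.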
9. High Energy Physics
The BaBar Experiment
During 2020 the BaBar Experiment infrastructure at SLAC will be decommissioned, and 2 PB of BaBar data can no longer be stored at the host laboratory. Currently a copy of the data is held by CERN IT-ST.
Goal: To ensure that a complete second copy of BaBar data is retained for possible comparisons with data from other experiments and can be shared through the CERN Open Data Portal.
CERN Open Data Portal
The CERN Open Data portal disseminates close to 2 PB of primary and derived datasets from particle physics as they are released by the LHC collaborations, and is used for both education and research purposes.
Goal: Achieve total reproducibility of research by being able to completely instantiate data, associated software and services off-premise. Offer research reproducibility services to individual researchers running open data analyses completely independently of the original on-premise infrastructure.
CERN Digital Memory
Deployment consisting of a requirement to archive approximately 1.5 PB of digital memory, comprising analogue documents produced by the Organization in the 20th century as well as the digital production of the 21st century (web sites, social media, emails, etc.).
Goal: Produce a dark archive in the cloud following standard OAIS practices.
10. Life Sciences
EMBL on FIRE
EMBL-EBI provides data archiving services to the global molecular biology community. These data archives are currently based on an internal service (FIRE: FIle REplication). FIRE currently holds 20 PB of data and is growing at 40% per year.
Goal: Cost-effective scaling via cloud-based storage solutions; distribute data effectively in the cloud, covering the increasing need for cloud-hosted analysis.
EMBL Cloud Data Caching
Life sciences research communities access more and more internal data from public cloud
services for their data analysis.
Goal: To progressively cache data in the cloud, with the on-premises data being replicated and discarded as required. Which data should be cached, how much, and for how long will be a trade-off between the cost of cloud storage and the network capacity/latency needed to download the data multiple times.
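That trade-off can be made concrete with a back-of-the-envelope comparison; the prices in the example below are purely illustrative, not actual provider rates:

```python
def cheaper_to_cache(size_gb: float,
                     downloads_per_month: float,
                     storage_price_gb_month: float,
                     egress_price_gb: float) -> bool:
    """True when keeping a cloud-side cached copy costs less per month than
    re-downloading the dataset from the archive for every analysis run."""
    cache_cost = size_gb * storage_price_gb_month
    transfer_cost = size_gb * egress_price_gb * downloads_per_month
    return cache_cost < transfer_cost

# Hypothetical prices: 0.02 EUR/GB-month storage, 0.01 EUR/GB egress.
# A 1 TB dataset read three times a month is cheaper to keep cached ...
assert cheaper_to_cache(1000, 3, 0.02, 0.01)
# ... but not one read only once a month.
assert not cheaper_to_cache(1000, 1, 0.02, 0.01)
```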
11. Astronomy
The MAGIC Cherenkov gamma-ray telescopes and the PAUcam camera for the William Herschel Telescope are located at the Observatorio del Roque de los Muchachos in the Canary Islands, Spain, as is the first Large-Sized Telescope of the next-generation Cherenkov Telescope Array (CTA). They produce about 0.3 PB of raw data per year, which is automatically sent to PIC in Barcelona.
PIC Large File Storage
Goal: To replace the current in-house tape library storage. Each instance of the service to
be purchased is the 5-year safe-keeping of a yearly dataset from a single source.
PIC Mixed File Remote Storage
Goal: To archive the derived datasets from at most two sources, becoming part of the yearly dataset. In addition, allow update/upload of derived datasets for a period of 4 years following the creation of the data.
PIC Data Distribution
Goal: To replace the Hierarchical Storage Manager, disk storage and data distribution
service. Each instance of the service to be purchased is the 5-year safe-keeping and data
distribution of a yearly dataset and its derived datasets.
12. Photon Sciences
PETRA III is the world's most brilliant storage-ring-based X-ray source for high-energy photons, with 22 beamlines distributed over three experimental halls. The European XFEL is the world's largest X-ray laser, generating 27,000 ultrashort X-ray pulses per second with a brilliance a billion times higher than that of the best conventional X-ray radiation sources. Together, the two facilities produce several tens of PB of raw data per year, a volume expected to double every year.
Goal: Develop a hybrid model that combines current on-premise archiving services with the resulting services of ARCHIVER. Move a predefined set of datasets to public clouds and make them open for public access.
13. Definition of the R&D Scope
Deployments derived from 4 ESFRI landmarks (CTA, ELIXIR, EuXFEL and HL-LHC), mapped onto the four service layers:
Layer 1 – Storage/Basic Archiving/Secure Backup: data integrity/security; cloud/hybrid deployment; data volume in the PB range; high, sustained ingest data rates; ISO 27000, 27040, 19086 and related standards; archives connected to the GÉANT network.
Layer 2 – Preservation: OAIS-conformant services: data readability formats, normalization, obsolescence monitoring, file fixity, authenticity checks, etc.; ISO 14721/16363, 26324 and related standards.
Layer 3 – Baseline User Services: search, discover, share, indexing, data removal, etc.; access under federated IAM.
Layer 4 – Advanced Services: high-level services: visual representation of data (domain-specific), reproducibility of analyses using machine learning algorithms, etc.
Use case deployments: CERN1 – The BaBar Experiment; CERN2 – CERN Digital Memory; CERN3 – CERN Open Data; DESY1 – PETRA III/EuXFEL; EMBL1 – FIRE; EMBL2 – Cloud Caching; PIC1 – Large File Storage; PIC2 – Mixed File Remote Storage; PIC3 – Data Distribution.
16. Role of the EOSC:
To ensure that 1.7 million European researchers and 70 million professionals in
science and technology reap the full benefits of data-driven science
- Federated virtual environment, free at the point of use for the end researcher
- Open services for storage, analysis and re-use of research data
- Promote an approach across national borders & scientific disciplines
- Promote choice of the deployment model: on-prem, hybrid, off-prem
EOSC Phase 1 investment of EUR 300 Million on core services
EOSC legal entity expected to be created by the end of 2020
European Open Science Cloud - The Vision
“We are creating a European Open Science Cloud now. It is a trusted space for researchers to store their data and to
access data from researchers from all other disciplines. We will create a pool of interlinked information, a ‘web of research
data’. (…) The idea is that once we have the rules of the game ready, then we will open this up to the broader public
sector and to business as well. So that companies can come in, store the data and use the data.”
Special Address by Ursula von der Leyen, President of the European Commission, 22nd January, WEF DAVOS 2020
https://www.youtube.com/watch?v=QN476nVbFVs&feature=youtu.be&t=682
17. EOSC should provide a level playing field –
same requirements for commercial and not-for-profit
providers
Accept commercial services in Data Management Plans
Stay mainstream and interoperable by adopting widely
used and internationally recognized standards
Promote choice, an ecosystem for innovation, fostering
data self-determination and digital sovereignty in Europe
EOSC – Engagement of commercial providers
18. ARCHIVER key contributions for the EOSC
Long-Term Archiving and Preservation of Research Data at the core of the EOSC strategy
ARCHIVER is key in defining EOSC Rules of Participation for the private sector, both as service providers and as R&D partners for "close to market" solutions
List of 40+ EOSC projects available at: https://www.eosc-portal.eu/about/eosc-projects
19. ARCHIVER services in the European Open Science Cloud
ARCHIVER Services available in the EOSC (timeline: 2019–2024)
Objective: to make resulting services available in the European Open Science Cloud catalogue
What does this really mean?
Atomic Use Case: "As a researcher, I want to have access to the full set of ARCHIVER services, so that I'm able to evaluate their functionality for my specific research field, able to purchase them with a clear cost model, and able to implement an exit strategy to repatriate or move my research data seamlessly to another location by the end of the contract and usage period."
20. ARCHIVER Data Management Strategy
• DMPs both for research use cases & for the project
FAIR guiding principles
post-GDPR era: strong focus on technical and organisational measures for
Data Privacy & Protection as the path to digital sovereignty
Modern guidelines provided by Science Europe for pan-European
Research Data Management (RDM)
Establishing core requirements for Research DMPs & a set of
criteria to assess trustworthy repositories:
https://www.scienceeurope.org/our-resources/practical-guide-to-the-international-alignment-of-research-data-management/
21. ARCHIVER EOSC Services - Technical Validation (I)
Resource provisioning using Terraform API
OSS container orchestration systems
Automated deployment based on Kubernetes and Docker
Results stored back at CERN S3 cloud storage service
Available on GitHub under an OSS license:
https://ocre-testsuite.readthedocs.io/en/latest/
Tests provided by the research community
https://fairsharing.github.io/FAIR-Evaluator-FrontEnd/
Started in HNSciCloud; already in use in OCRE; reused and expanded in ARCHIVER
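The automated, Kubernetes-based deployment step can be sketched as programmatic generation of a Job manifest (the Kubernetes API accepts JSON as well as YAML). The image name and command below are placeholders, not the actual OCRE test suite internals:

```python
import json

def make_test_job(name: str, image: str, command: list[str]) -> dict:
    """Build a minimal Kubernetes Job manifest that runs one containerised
    test to completion, with no retries on failure."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": command,
                    }],
                    "restartPolicy": "Never",
                }
            },
        },
    }

# Placeholder image and command for illustration only.
job = make_test_job("storage-benchmark",
                    "example.org/archiver-tests:latest",
                    ["python", "run_benchmark.py"])
print(json.dumps(job, indent=2))
```

Such a manifest would be submitted to a provider's cluster after Terraform has provisioned the underlying resources.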
22. ARCHIVER EOSC Services – Legal & Organisational (II)
Data Privacy and Protection (GDPR) – An EC pillar
Personal data is embedded in several components (federated AAI, the research data itself)
Technical & Organisational measures: “Privacy by Design” approach for GDPR
conformance
Best Practices: Standards
Infrastructure: ISO 27001 series, European Cybersecurity Act
Long Term Data Preservation: OAIS and CoreTrustSeal
Self-Assessment: Definition of responsibilities across data stewards and service
providers
Exit Strategies
Stimulate the use of Open Source and Open APIs as measures to prevent vendor lock-in
Definition and field testing of viable exit plans (provider/on-prem & provider/provider)
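A viable exit plan is ultimately something that can be field-tested: export every object from the source provider, import it into the target, and verify completeness and integrity. A toy sketch with providers modelled as in-memory dictionaries (a real test would drive the providers' storage APIs):

```python
import hashlib

def execute_exit_plan(source: dict[str, bytes],
                      target: dict[str, bytes]) -> bool:
    """Copy every object from source to target, then verify that the target
    holds all source objects with matching SHA-256 digests."""
    for key, blob in source.items():
        target[key] = blob  # stand-in for a provider-to-provider transfer
    return all(
        key in target
        and hashlib.sha256(target[key]).hexdigest()
            == hashlib.sha256(blob).hexdigest()
        for key, blob in source.items()
    )
```

The same shape of test applies to the provider/on-prem direction, swapping the target for on-premise storage.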
23. EOSC Services – Financial & Business Model (III)
Establish a range of sustainable, cost-effective purchasing options
Must support organisations or individual researchers in storing their data beyond the end of a procurement cycle or research grant
Requirement for service providers to establish a "Total Cost of Services" study
From the architecture phase (Design) to Prototype and Pilot
“Total” TCS: must include all factors that a research organisation or individual
researcher will bear when running the ARCHIVER resulting services over a
defined period
Strongly connected to exit strategies
Escrow services concept: public research organisations as data stewards
A protection factor against scenarios such as vendor lock-in or supplier bankruptcy
24. Summary
ARCHIVER aims to develop a set of commercial, FAIR, archiving and preservation services for
research data
Petabyte research data (tens of petabytes and beyond) in multiple scientific domains
Open, Trustworthy, aligned with best practices (ISO, OAIS, CoreTrustSeal)
Strong preference for Open Source Software and Open Standards as measures to prevent vendor lock-in
Set of “derived rules” for commercial services onboarding in the EOSC
Technical: extensive field testing, “research data ready” archiving and preservation services
Legal: GDPR as an opportunity for high quality digital services guaranteeing digital sovereignty
Financial: models adapted to research considering public procurement cycles and research grants periods,
allowing effective cost planning for LTDP
ARCHIVER R&D activity will start end of Q1 2020
Tender for R&D services opens in less than 48 hours: submission of R&D bids from the 31st of January (submission period of 2 months)
Info Session Webinars: February 7th and March 18th
Selected companies will be providing R&D services, 3 phases: Design (2020), Prototype (2020/2021) and Pilot
(2021)
27. Synergies with CS3Mesh4EOSC (CS3Mesh Kick-off)
• Test Suite
• CS3Mesh may be able to profit from the ARCHIVER test suite, available on GitHub
• https://github.com/cern-it-efp/OCRE-Testsuite
• A simple framework using Terraform for resource provisioning and Kubernetes for resource orchestration, as a standard way to onboard scientific use case tests
• APIs
• Would it make sense to define and test together the APIs implementation that
will be available in ARCHIVER resulting repository services?
• Preference is given to open, general-purpose APIs
• Testing data workflows from CS3 -> ARCHIVER based on “data temperature”?
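The "data temperature" idea could be prototyped as a policy that classifies datasets by time since last access, moving cold data from the sync-and-share layer towards the archive. The thresholds below are arbitrary illustrations, not project-defined values:

```python
from datetime import datetime, timedelta

def temperature(last_access: datetime, now: datetime,
                warm_after: timedelta = timedelta(days=30),
                cold_after: timedelta = timedelta(days=365)) -> str:
    """Map time-since-last-access to a tier: 'hot' stays on fast user-facing
    storage, 'warm' can move to a cheaper class, 'cold' is a candidate for
    transfer to the archiving service."""
    age = now - last_access
    if age >= cold_after:
        return "cold"
    if age >= warm_after:
        return "warm"
    return "hot"
```

A CS3-to-ARCHIVER workflow would periodically evaluate this policy and trigger archival transfers for datasets that come back "cold".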
28. ARCHIVER Test Catalogue
The initial test catalogue includes tests on network, storage and compute:
https://github.com/cern-it-efp/OCRE-Testsuite/
More tests being included:
FAIRsFAIR project evaluation of "FAIRness": https://fairsharing.github.io/FAIR-Evaluator-FrontEnd/#!/#%2F!
Tests provided by the research community, using the Science Europe assessment criteria guidelines
Data ingestion: testing of the ingest process, ingests with incremental changes for high volumes of data, and the lifecycle of archive packages (data creation, combination and/or aggregation) before final archival for long-term preservation
Open APIs: prevent vendor lock-ins and create innovative workflows (CS3MESH?)
Extensive testing of exit plans during R&D execution: provider-to-on-prem and provider-to-provider
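The incremental-ingest scenario can be sketched as building an archive-package manifest in steps, where each batch adds new files and updates changed ones; the manifest shape is an assumption for illustration:

```python
import hashlib

def incremental_ingest(manifest: dict[str, str],
                       batch: dict[str, bytes]) -> dict[str, str]:
    """Return a new manifest with the batch folded in: checksums are added
    for new files and recomputed for files whose content changed."""
    updated = dict(manifest)
    for name, content in batch.items():
        updated[name] = hashlib.sha256(content).hexdigest()
    return updated

# Two increments aggregated into one package manifest before final archival.
m1 = incremental_ingest({}, {"raw/run1.dat": b"run1"})
m2 = incremental_ingest(m1, {"raw/run2.dat": b"run2"})
```

An ingest test would replay such increments at scale and verify that the final manifest matches what was actually archived.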