Summary of the Deployment Scenarios and Functional Requirements
Evangelos Motesnitsalis
Technical Coordinator
ARCHIVER Consolidation Event
5 June 2019
Contents
Recap
Common Characteristics
Service Layers Mapping
Testing plans
Summary and Next Steps
High Energy Physics Deployment Scenarios
The BaBar Experiment
In 2020 the BaBar Experiment infrastructure at SLAC will be decommissioned. As a result, the 2 PB
of BaBar data can no longer be stored at the host laboratory and alternative solutions need to be
found. A copy of the data is currently held by CERN IT. We want to ensure that a complete
copy of the BaBar data is retained for possible comparisons with data from other experiments
and for sharing through the CERN Open Data Portal.
CERN Open Data Portal
The CERN Open Data Portal disseminates close to 2 PB of primary and derived datasets from
particle physics as released by the LHC collaborations, and is used for both education
and research purposes. The CERN Open Data Service Managers seek easy-to-use, easy-to-achieve
independent archiving and backup for the portal's holdings, based on SIPs [Submission Information
Packages], with intelligent and reliable disaster recovery mechanisms.
CERN Digital Memory
We want to archive the ~1.5 PB of CERN Digital Memory, containing digitized analog documents
produced by the institution in the 20th century as well as the digital production of the 21st
century, including new types such as web sites, social media, emails, etc.
Life Sciences Deployment Scenarios
EMBL on FIRE
EMBL-EBI provides data archiving services to the global molecular biology community. These
data archives are currently based on an internal service (FIRE: FIle REplication) that stores the
files in two different systems: a distributed object store and tape.
FIRE currently holds 20 PB of data and is growing at 40% per year. We want to ensure that:
• FIRE can achieve cost-effective scaling via cloud-based storage solutions
• Data can effectively be distributed on cloud infrastructure, covering the increasing needs for cloud-hosted analysis
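The growth figure above can be turned into a rough capacity projection. The sketch below is illustrative only: it assumes the 40% annual growth rate stays constant, which the slide does not claim, and the function name is hypothetical.

```python
def projected_size_pb(current_pb: float, annual_growth: float, years: int) -> float:
    """Compound growth: archive size after `years` years at a constant annual rate."""
    return current_pb * (1.0 + annual_growth) ** years


# Figures from the slide: 20 PB today, growing at 40% per year.
# Assumption (not from the slide): the growth rate does not change over time.
for year in range(6):
    print(f"year {year}: {projected_size_pb(20.0, 0.40, year):.1f} PB")
```

Under this assumption the archive would exceed 100 PB after five years, which is the kind of estimate that motivates the cost-effective scaling requirement.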
EMBL Cloud Data Caching
As research communities access more and more of the internal data from cloud services for their
data analysis, it makes sense to progressively cache data in the cloud, with the on-premises
data being replicated and the cached copies discarded as required. Which data should be cached,
how much and for how long will be a trade-off between the cost of cloud storage and the cost of
the network capacity and latency needed to download the data multiple times.
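The caching trade-off can be sketched as a simple cost comparison. This is a minimal model under stated assumptions, not the EMBL-EBI policy: all prices are hypothetical cost units, the function names are invented for illustration, and latency is ignored. Only storage and transfer costs are compared.

```python
def cost_if_cached(size_tb: float, months: float,
                   storage_price: float, transfer_price: float) -> float:
    """One transfer into the cloud cache plus storage for the caching period.

    storage_price: hypothetical cost units per TB per month.
    transfer_price: hypothetical cost units per TB transferred.
    """
    return size_tb * transfer_price + size_tb * months * storage_price


def cost_if_refetched(size_tb: float, accesses: int, transfer_price: float) -> float:
    """One transfer from the on-premises archive for every analysis run."""
    return size_tb * accesses * transfer_price


def should_cache(size_tb: float, months: float, accesses: int,
                 storage_price: float, transfer_price: float) -> bool:
    """Cache only when it is strictly cheaper than repeated downloads."""
    return (cost_if_cached(size_tb, months, storage_price, transfer_price)
            < cost_if_refetched(size_tb, accesses, transfer_price))


# Hypothetical example: a 100 TB dataset kept for 6 months.
print(should_cache(100, 6, accesses=10, storage_price=20, transfer_price=15))
print(should_cache(100, 6, accesses=2, storage_price=20, transfer_price=15))
```

In this model caching pays off once the number of accesses exceeds 1 + months × storage_price / transfer_price, so frequently analysed datasets are the natural candidates for the cache.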
Astronomy Deployment Scenarios
The MAGIC Cherenkov gamma-ray telescopes and the PAUcam camera for the William
Herschel Telescope are located at the Observatorio del Roque de los Muchachos in the Canary
Islands, Spain. The first Large-Sized Telescope of the next-generation Cherenkov Telescope
Array (CTA) is also there. They produce about 0.3 PB of raw data per year, which is
automatically sent to PIC in Barcelona.
PIC Large File Storage
We want to replace the current in-house tape library storage. Each instance of the
service to be purchased is the 5-year safe-keeping of a yearly dataset from a single source.
PIC Mixed File Remote Storage
We also want to be able to archive the derived datasets from at most two sources,
which become part of the yearly dataset. In addition, at any time during the 4 years following the
creation of the data, additional versions of derived datasets may need to be uploaded.
PIC Data Distribution
We also want to replace the Hierarchical Storage Manager, disk storage and data
distribution service. Each instance of the service to be purchased is the 5-year safe-keeping
and data distribution of a yearly dataset and its derived datasets.
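The PIC figures imply a modest average ingest rate and a bounded retained volume. The sketch below is a back-of-the-envelope calculation under assumptions that are not in the slides: 0.3 PB per year is taken as the combined raw volume, PB are decimal (10^15 bytes), data arrives uniformly over the year, and derived datasets are not counted because their size is not given.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year


def average_ingest_rate_mb_s(petabytes_per_year: float) -> float:
    """Average rate in MB/s needed to move one year of data in one year."""
    bytes_per_year = petabytes_per_year * 1e15  # decimal petabytes (assumption)
    return bytes_per_year / SECONDS_PER_YEAR / 1e6


def retained_volume_pb(petabytes_per_year: float, retention_years: int) -> float:
    """Steady-state volume when each yearly dataset is kept for `retention_years`."""
    return petabytes_per_year * retention_years


rate = average_ingest_rate_mb_s(0.3)
print(f"average ingest rate: {rate:.2f} MB/s ({rate * 8:.0f} Mbit/s)")
print(f"raw data retained over 5 years: {retained_volume_pb(0.3, 5):.1f} PB")
```

This gives roughly 9.5 MB/s (about 76 Mbit/s) averaged over the year and 1.5 PB of raw data under 5-year safe-keeping. Real transfers are bursty, so peak rates would need to be considerably higher than the average.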
Photon Science Deployment Scenarios
PETRA III is the world's most brilliant storage-ring-based X-ray source for high-energy photons, with 22
beamlines distributed over three experimental halls and concurrently available to users. The European
XFEL is the world's largest X-ray laser, generating 27 000 ultrashort X-ray flashes per second with a brilliance
a billion times higher than that of the best conventional X-ray radiation sources.
PETRA III / EuXFEL – Individual Scientist
Individual scientists at DESY need a service to create archives for their experiment data as well as their
publications, with specific capabilities such as data ingestion via browser or third-party copies.
PETRA III / EuXFEL – Manual Data Archiving
Experiment managers want to be able to create, manage and delete archives via APIs/CLIs based on accepted
data policies, supporting a wide range of options for cloud and on-prem storage, while being able to use
existing user credentials, authentication techniques and identification mechanisms.
PETRA III / EuXFEL – Integrated Data Archiving
Long-lived collaborations have a growing need to plan and execute archiving operations in a fully
automated, policy-based, certified, and documented way, based on APIs.
FAIR Principles
Findable
• Global, unique identifiers
• Rich metadata, indexes, search capabilities
Accessible
• Retrievable with free protocols
• Accessible metadata even after deletion
Interoperable
• Qualified reference to other data
• Formal, shared and broadly applicable knowledge representation standards
Re-Usable
• Accurate and relevant description
• Data usage license and detailed provenance
https://www.go-fair.org/
OAIS Reference Model
Common Characteristics
Scientific Data Storage in the PB Range
Strong Need for Federated AAI Services
Sustained Data Ingest Rates
Access to GEANT Network
Development under the OAIS Reference Model and FAIR Principles
Data Privacy and Compliance
Significant Monitoring Requirements
Sustainable Business Models and Costs
Service Layers and Deployment Scenarios Mapping
Layer 1: Storage/Basic Archiving/Secure backup
• Data integrity/security; cloud/hybrid deployment
• Data volume in the PB range; high, sustained ingest data rates
• ISO certification: 27000, 27040, 19086 and related standards
• Archives connected to the GEANT network
Layer 2: Preservation
• OAIS conformant services: data readability formats, normalization, obsolescence monitoring, file fixity, authenticity checks, etc.
• ISO 14721/16393, 26324 and related standards
Layer 3: Baseline user services
• User services: search, discover, share, indexing, data removal, etc.
• Access under Federated IAM
Layer 4: Advanced services
• High level services: visual representation of data (domain specific), reproducibility of scientific analyses, etc.
EMBL1 – FIRE
PIC2 – Mixed File Remote Storage
DESY1 – PETRA III / EuXFEL
CERN3 – CERN Open Data
CERN2 – CERN Digital Memory
CERN1 – The BaBar Experiment
PIC3 – Data Distribution
EMBL2 – Cloud Caching
PIC1 – Large File Storage
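The preservation layer lists file fixity among the OAIS conformant services. A fixity check is commonly done by comparing a freshly computed checksum with the one recorded at ingest. The sketch below is a minimal illustration under assumptions: the slides do not name a checksum algorithm, so SHA-256 is chosen here, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks
    so that large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, recorded_digest: str) -> bool:
    """Return True when the file still matches the digest recorded at ingest.

    Assumption: the archive stores a hex-encoded SHA-256 digest per file.
    """
    return sha256_of(path) == recorded_digest.lower()
```

A periodic job could call `fixity_ok` for every archived file and report mismatches to the monitoring layer. How often such checks run and which algorithm is used would be set by the archive's policy.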
Testing Plans
The Buyers Group will request demo access to the current product offerings during
the Design Phase.
Testing will focus on Functionality for the Prototype Phase and Performance,
Scalability, and Reliability for the Pilot Phase.
The Buyers Group will provide a set of tests derived from the Buyers Group
deployment scenarios and the Functional Specifications.
The tests will have clear assessment criteria for pass/fail.
The Buyers Group expects to deploy tests only after a clear indication from the
contractor that the contractor has already run the tests successfully.
We plan to present the initial set of tests by the Design Phase Kick-off.
Assessment of the test results will have implications for the assessment of the
respective phase results and for the payments to be executed.
Basic Functionality Testing Examples
Ingestion:
ability to submit a particular dataset of X size to the Archiving Service within time Y
Access:
ability to recall a particular part of a file, file or dataset within time Y
Monitoring and Dashboard:
ability to access the displayed information via web browser and trigger basic management functions,
e.g. data deletion, fixity checks, etc.
Audit and Log:
ability to access detailed access logs for a particular file/dataset
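The ingestion and access examples above share one shape: an operation must complete within time Y. A pass/fail check of that shape could be sketched as follows. This is illustrative only: the class and function names are hypothetical, and the operation passed in stands for whatever client call a given archiving service actually provides.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TimedTestResult:
    name: str
    elapsed_seconds: float
    limit_seconds: float  # the "time Y" of the test definition

    @property
    def passed(self) -> bool:
        """Pass when the operation finished within the agreed limit."""
        return self.elapsed_seconds <= self.limit_seconds


def run_timed_test(name: str, operation: Callable[[], None],
                   limit_seconds: float) -> TimedTestResult:
    """Run `operation` once and record how long it took.

    `operation` is a placeholder for a real ingest or recall call;
    an exception raised by it propagates to the caller.
    """
    start = time.monotonic()
    operation()
    elapsed = time.monotonic() - start
    return TimedTestResult(name, elapsed, limit_seconds)


# Stand-in operation: a short sleep instead of a real dataset submission.
result = run_timed_test("ingest sample dataset", lambda: time.sleep(0.01), 1.0)
print(result.name, "PASS" if result.passed else "FAIL")
```

Using a monotonic clock avoids skew from system time adjustments during long transfers. A real test would also fix the dataset size X and verify that the data arrived intact.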
Overview
C3 – CERN Open Data
C1 – The BaBar Experiment
C2 – CERN Digital Memory
P1 – Large File Remote Storage
P3 – Data Distribution
P2 – Mixed File Remote Storage
E1 – FIRE
E2 – Cloud Caching
D1 – PETRA III / EUXFEL
Summary and Next Steps
The primary goal for all the Deployment Scenarios is the preservation and long-term archiving of data in the PB
range with high sustained ingest rates for complex data types.
If this can be achieved easily, all the scenarios would benefit greatly from an added Software Reproducibility and Open
Data Distribution Layer on top of the archiving solution.
These deployment scenarios exhibit many similarities such as the scientific complex data types, the need for
federated AAI services, the significant monitoring requirements, and the development under OAIS and FAIR.
We welcome your feedback on the draft of the “Functional Specifications” documents until 14 June.
The Buyers Group will co-design and co-develop a test plan with you:
• The plan will be based on the outcome of the Design Phase, the Functional Specifications document, and the needs of the
Deployment Scenarios.
• The test assessment will be a deciding factor in qualifying solutions for the subsequent phases.
• The tests will focus on basic functionality during the Prototype Phase and on performance, efficiency, and
scalability during the Pilot Phase.