Helix Nebula Science Cloud Pilot Phase
14th June 2018, CERN, Geneva
Interactive Data Analysis for End Users
on HN Science Cloud
Status Report on Totem Test
Enrico Bocchi
CERN IT, Storage
Contributors
 TOTEM Experiment // AGH Krakow
V. Avati, M. Blaszkiewicz, L. Grzanka, M. Malawski, A. Mnich
 EP, Software Development for Experiments
D. Castro, D. Piparo, E. Tejedor
 IT, Database Services
L. Canali, P. Kothuri
 IT, Storage
E. Bocchi, J.T. Moscicki
What is the Totem Test about
 Goal: Enable end users to run data analysis interactively
 Analysis scenario: Totem experiment
 End users: Totem collaborators and scientists
 How: Deployment of “Science Box” on Helix Nebula Cloud
 EOS, CERNBox, SWAN + SPARK
 Data transfer to import relevant datasets
 Measure of success:
Quantification of analysis done with SWAN
 Real physics analysis with ROOT (RDataFrame) + SPARK
 Functional tests + synthetic benchmarks via specialized tools
First Tests (2017)
 “Science Box” on a single VM
 Streamlined installation – No configuration required
 Automated functional tests for validation
[Figure: Science Box components. EOS: open-source disk-based storage. CERNBox: synchronization and sharing. SWAN: interactive notebooks in a web browser. Provider logos: IBM, RHEA, T-Systems.]
Current Tests: Science Box
 Scalable deployment with Kubernetes
[Diagram: Science Box on Kubernetes. SWAN: JupyterHub and single-user servers, each with a CVMFS client and an EOS FUSE mount (scale-out capacity with user population). EOS: management node and file storage servers (scale-out storage). CERNBox: gateway and user sync client.]
Current Tests: Science Box
 Scalable deployment with Kubernetes
 SPARK as a computing engine
[Diagram: Science Box with Spark. SWAN: JupyterHub and single-user servers, each with a CVMFS client and an EOS FUSE mount (scale-out capacity with user population). Spark: AppMaster and Spark workers running Python tasks. EOS: management node and file storage servers (scale-out storage).]
Current Tests: Spark Cluster
 Hadoop and Spark installed via yum on plain VMs
 Configuration via Cloudera Manager
 Validation with industry standard TPC-DS benchmark + user workloads
Current Installation
 6 nodes
 ECS Type: s2.xlarge.8
 Hadoop: CDH_5.14
 Spark: 2.2.1
 JAVA: 1.8.0_161-b12
Deployment Status in HN Cloud
 Deployment OK on OTC
 All resources managed as
sub-tenant “TOTEM_Test”
 Small scale for now
 22 VMs, 41 Block Devices
 128 CPUs, 488GB Memory
 22 System Volumes (950GB)
 19 Data Volumes (~12TB)
 Functional tests successful
 Storage, Sync, Analysis
 SWAN to Spark connection
 Import small dataset
Deployment Status in HN Cloud
Demo at Lunchtime
CERN, Restaurant 2
Experience with OTC as a User
 Incidents:
 May 7th Outage:
3 containers lost due to VM reboots
Manual intervention required
 May 14th: VM reported IO errors on system disk – VM was rebuilt
 Early June: OTC web console unavailable for short periods
 Requests:
 Streamline VM name <==> hostname <==> IP address resolution
E.g., Internal DNS to resolve VM name to its private IP address
Worked around with entries in /etc/hosts + dnsmasq
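The /etc/hosts workaround can be sketched as a small generator script. This is only an illustration: the VM names, private IP addresses, and the domain suffix below are hypothetical, not the ones used in the TOTEM_Test tenant. By default dnsmasq serves the names it finds in /etc/hosts, so a file like this on one node could answer lookups for the others.

```python
# Hypothetical inventory: VM names and private IPs are made up for illustration.
vms = {
    "eos-mgm": "192.168.0.10",
    "eos-fst1": "192.168.0.11",
    "swan-hub": "192.168.0.20",
}


def hosts_entries(inventory, domain="totem.internal"):
    """Render /etc/hosts-style lines: IP, fully qualified name, short name."""
    lines = []
    for name, ip in sorted(inventory.items()):
        lines.append(f"{ip}\t{name}.{domain}\t{name}")
    return "\n".join(lines)


# Print the block that would be appended to /etc/hosts.
print(hosts_entries(vms))
```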
Next Steps
 Application layer optimization for TOTEM use-case
 Performance tuning, caching of frequently-used software packages, …
 Focus on relevant metrics for users, e.g., total execution time
 Scale-out storage and computing resources
 Leverage the scalable architecture design (EOS, SWAN, Spark)
 Storage:
o Import larger dataset (10-50TB) -> Additional file servers and block devices
o Investigate caching layer for Spark for improved performance
 Computing:
o Experiment and identify best VM flavor for Spark computing nodes
o Additional spark workers to crunch bigger datasets
Backup Slides
Project Description
 This is a deployment test of the ScienceBox IT services (EOS, SWAN, CERNBox,
and Spark) in a particular analysis scenario for the TOTEM experiment (conducted
jointly with the SFT group and interested TOTEM collaborators).
 Phase 1 (April-September) is a feasibility study: (a) verification of the deployment of
the services; (b) performance tuning and testing of a TOTEM data analysis
example. If feasibility tests are successful then Phase 2 (September-November) is
to use available resources opportunistically to understand the scaling limitations of
the system and, if possible, to actually perform analysis for use-cases defined by
the TOTEM collaborators.
 The amount of data currently foreseen is ~10-50TB. HN resources are treated as an
extension of CERN resources. Any data stored in the HN cloud will be automatically
erased at the end of 2018.
 This project does not provide a guarantee of service but will make opportunistic use of
resources if feasible.
Synthetic Benchmarks
 Network – iperf, ping
 TCP connection, 128K buffer, 60s test (5s binning)
 Average bandwidth: ~7.8 Gb/s (both directions)
 Very low latency (~0.3ms) and jitter (sdev ~0.07ms)
[Figure: iperf bandwidth [Gb/s] and data transferred [GB] per 5 s interval over the 60 s test (0-5 s through 55-60 s). Avg: 7.82 Gb/s. Total: 54.6 GB.]
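A quick cross-check of the two headline numbers (7.82 Gb/s average over 60 s, 54.6 GB total). The sketch assumes a constant rate and 1 Gb = 1e9 bits. The reported total matches if "GB" is read as binary gigabytes (2^30 bytes); that reading is an inference from the arithmetic, not something stated on the slide.

```python
def transferred_bytes(bandwidth_gbit_s, duration_s):
    """Bytes moved at a constant rate, taking 1 Gb/s as 1e9 bits per second."""
    return bandwidth_gbit_s * 1e9 / 8 * duration_s


total = transferred_bytes(7.82, 60)
total_decimal_gb = total / 1e9    # SI gigabytes (10^9 bytes)
total_binary_gb = total / 2**30   # binary gigabytes (2^30 bytes)

print(f"decimal: {total_decimal_gb:.2f} GB, binary: {total_binary_gb:.2f} GB")
```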
Synthetic Benchmarks (cont’d)
 Block Devices – fio
 Three types of VDBs: Common-IO, High-IO, and UltraHigh-IO
 No disk encryption, No disk sharing
 Size from 10GB to 1TB (not related to performance)
Sequential Read, 256KB
              Bandwidth [MB/s]   IOPS   Latency [ms]
CommonIO                    92    359             44
HighIO                     153    600             26
UltraHighIO                358   1399             11

Random Read, 4KB
              Bandwidth [kB/s]   IOPS   Latency [ms]
CommonIO                  3450    862             18
HighIO                    6316   1579             10
UltraHighIO               8938   2234              7

(Both charts use a logarithmic axis from 1 to 10000.)
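The fio figures above can be checked for internal consistency: bandwidth should equal IOPS times the request size, and Little's law (requests in flight = throughput x latency) gives a rough idea of how many requests were outstanding. The sketch below assumes 1 MB = 1000 kB on the sequential-read axis. The slide does not state the fio job parameters, so the in-flight figure is only an estimate from rounded latencies.

```python
# Values as read from the charts: (bandwidth in kB/s, IOPS, latency in ms).
# Assumption: 1 MB = 1000 kB on the sequential-read bandwidth axis.
SEQ_256K = {
    "CommonIO": (92_000, 359, 44),
    "HighIO": (153_000, 600, 26),
    "UltraHighIO": (358_000, 1399, 11),
}
RAND_4K = {
    "CommonIO": (3450, 862, 18),
    "HighIO": (6316, 1579, 10),
    "UltraHighIO": (8938, 2234, 7),
}


def implied_block_kb(bandwidth_kb_s, iops):
    """Average request size implied by bandwidth = IOPS x block size."""
    return bandwidth_kb_s / iops


def implied_in_flight(iops, latency_ms):
    """Little's law: requests in flight = throughput x latency."""
    return iops * latency_ms / 1000


for label, table in (("seq 256KB", SEQ_256K), ("rand 4KB", RAND_4K)):
    for volume, (bw, iops, lat) in table.items():
        print(f"{label:10s} {volume:12s} "
              f"block ~{implied_block_kb(bw, iops):6.1f} kB, "
              f"in flight ~{implied_in_flight(iops, lat):4.1f}")
```

The implied request sizes come out close to 256 kB and 4 kB, which supports reading each chart as one (bandwidth, IOPS, latency) triple per volume type.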
Synthetic Benchmarks (cont’d)
 CPU and Memory – sysbench
 CPU - Total events: 10’000
o Avg. time per event: 1.11ms
o 95th percentile: 1.12ms
 Memory – Transfer (write) 10GB
o IOPS: 1’614’325
o Throughput: 1576.49 MB/s
 Two (main) flavors of VMs:
 4CPUs, 8GB memory – General purpose VM
 4CPUs, 32GB memory – Spark workers
 Performance is consistent across VM sizes
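The two memory figures are related by the size of each write operation. The sketch below assumes 1 KiB per operation (the sysbench default memory block size) and reads "10GB" and "MB/s" as binary units; both are assumptions, since the slide does not give the sysbench options used.

```python
# Figures from the slide.
mem_ops_per_s = 1_614_325
mem_total_gib = 10  # assumption: "10GB" means 10 GiB

# Assumption: 1 KiB per memory operation (sysbench's default block size).
block_bytes = 1024

# Throughput in MiB/s follows from operations per second times block size.
mem_throughput_mib_s = mem_ops_per_s * block_bytes / 2**20

# Time needed to write the whole 10 GiB at that rate.
mem_duration_s = mem_total_gib * 1024 / mem_throughput_mib_s

print(f"throughput: {mem_throughput_mib_s:.2f} MiB/s, duration: {mem_duration_s:.2f} s")
```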
Synthetic Benchmarks (cont’d)
 Spark Cluster – TPC-DS Benchmark
 Small case test: Dataset size 20GB
 5 executors (2 cores, 7GB memory), Query Set v1.4, Spark 2.2.1
[Figure: Query execution time (latency) [s] per TPC-DS query, q1 through q99, showing minimum, maximum, and average; axis range 0 to 80 s.]
[Figure: CDF of the average query execution time (latency) [s]; x-axis 0 to 70 s.]
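For readers who want to reproduce a CDF like this from their own per-query timings, a minimal sketch follows. The latencies in the example are hypothetical placeholders, not the measured TPC-DS values, and the function names are chosen here for illustration.

```python
def empirical_cdf(samples):
    """Return (value, cumulative fraction) pairs over the sorted samples."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


def fraction_at_most(samples, threshold):
    """Fraction of samples that are less than or equal to the threshold."""
    return sum(1 for x in samples if x <= threshold) / len(samples)


# Hypothetical average latencies in seconds (placeholders, not measured data).
avg_latency_s = [4.2, 6.8, 3.1, 12.5, 65.0, 8.9, 5.4, 21.7]

cdf = empirical_cdf(avg_latency_s)
for value, fraction in cdf:
    print(f"{value:5.1f} s  {fraction:.3f}")
print("fraction of queries at or below 10 s:", fraction_at_most(avg_latency_s, 10))
```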
Data Analysis with RDataFrame
import ROOT  # assumes a ROOT release with Python bindings
df = ROOT.RDataFrame('t', 'f.root')   # build data-frame object (tree 't' in file 'f.root')
hist = (df.Filter('theta > 0')        # apply transformation (cut)
          .Histo1D('pt'))             # apply action (book a histogram)
hist.Draw()
 Columnar dataset: ROOT, other formats
 Declarative analysis: what, not how
 Transformations: cuts, definition of new columns
 Actions: return a result (e.g., histogram)
 Implicit parallelisation
 Multi-threaded, distributed
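The split between transformations and actions can be illustrated with a toy model in plain Python. This is not ROOT's implementation: the class LazyFrame is hypothetical, it takes Python callables instead of expression strings, and its actions run immediately, whereas RDataFrame also defers actions until a result is first accessed. The method names Filter, Define, Count, and Sum mirror RDataFrame only for readability.

```python
class LazyFrame:
    """Toy stand-in for a declarative data frame; not ROOT's implementation."""

    def __init__(self, rows, steps=()):
        self._rows = rows      # list of dicts, one per event
        self._steps = steps    # recorded transformations, not yet executed

    def Filter(self, predicate):
        """Transformation: record a cut and return a new frame."""
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def Define(self, name, func):
        """Transformation: record a new column and return a new frame."""
        return LazyFrame(self._rows, self._steps + (("define", (name, func)),))

    def _run(self):
        """Event loop: apply the recorded steps to each row, in order."""
        for row in self._rows:
            row = dict(row)
            for kind, arg in self._steps:
                if kind == "filter":
                    if not arg(row):
                        break          # row rejected by the cut
                else:
                    name, func = arg
                    row[name] = func(row)
            else:
                yield row              # row passed every cut

    def Count(self):
        """Action: run the event loop and count surviving rows."""
        return sum(1 for _ in self._run())

    def Sum(self, column):
        """Action: run the event loop and sum one column."""
        return sum(row[column] for row in self._run())


events = [
    {"theta": 0.3, "pt": 1.5},
    {"theta": -0.1, "pt": 2.0},
    {"theta": 0.8, "pt": 0.5},
]
df = LazyFrame(events)
selected = df.Filter(lambda e: e["theta"] > 0).Define("pt2", lambda e: e["pt"] ** 2)
print(selected.Count(), selected.Sum("pt2"))
```

Nothing is computed when Filter and Define are called; the work happens only when an action walks the rows, which is what lets the same description run on one core, many threads, or a cluster.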
RDataFrame: Distributed Execution
 RDataFrame supports distributed execution on top of Spark
 No changes in the app will be required to go distributed!
RDataFrame: Spark Monitoring
 Automatic monitoring when an RDataFrame Spark job is spawned
 Job progress, task distribution, resource utilization
 Dataset split in ranges under the hood, one task per range
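The "one task per range" idea can be sketched in plain Python. The functions split_ranges and process_range are hypothetical names for illustration, not the actual RDataFrame or Spark code: the dataset is cut into contiguous entry ranges, each range is processed independently, and the partial results are merged.

```python
def split_ranges(n_entries, n_ranges):
    """Split [0, n_entries) into n_ranges contiguous half-open ranges of near-equal size."""
    if n_ranges <= 0:
        raise ValueError("n_ranges must be positive")
    base, extra = divmod(n_entries, n_ranges)
    ranges = []
    start = 0
    for i in range(n_ranges):
        # The first `extra` ranges take one additional entry each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def process_range(values, entry_range):
    """One task: count the entries in the range that pass the cut value > 0."""
    start, end = entry_range
    return sum(1 for v in values[start:end] if v > 0)


# Hypothetical column values standing in for a dataset.
values = [0.3, -0.1, 0.8, -0.5, 1.2, 0.0, 2.1, -0.7, 0.4, 0.9]

ranges = split_ranges(len(values), 3)
partials = [process_range(values, r) for r in ranges]  # would run as separate tasks
total = sum(partials)                                  # merge step
print(ranges, partials, total)
```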

SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 

Helix Nebula Science Cloud Pilot Phase Status

  • 1. Helix Nebula Science Cloud Pilot Phase
      14th June 2018, CERN, Geneva
      Interactive Data Analysis for End Users on HN Science Cloud
      Status Report on Totem Test
      Enrico Bocchi, CERN IT, Storage
  • 2. Contributors
      • TOTEM Experiment // AGH Krakow: V. Avati, M. Blaszkiewicz, L. Grzanka, M. Malawski, A. Mnich
      • EP, Software Development for Experiments: D. Castro, D. Piparo, E. Tejedor
      • IT, Database Services: L. Canali, P. Kothuri
      • IT, Storage: E. Bocchi, J.T. Moscicki
  • 3. What is the Totem Test about
      • Goal: Enable end users to run data analysis interactively
      • Analysis scenario: Totem experiment
      • End users: Totem collaborators and scientists
      • How: Deployment of “Science Box” on Helix Nebula Cloud
          • EOS, CERNBox, SWAN + SPARK
          • Data transfer to import relevant datasets
      • Measure of success: Quantification of analysis done with SWAN
          • Real physics analysis with ROOT (RDataFrame) + SPARK
          • Functional tests + synthetic benchmarks via specialized tools
  • 4. First Tests (2017)
      • “Science Box” on a single VM
      • Streamlined installation – No configuration required
      • Automated functional tests for validation
      [Figure labels: Open Source disk-based storage; Synchronization and Sharing; Interactive Notebooks in a Web Browser. Logos: IBM, RHEA, T-Systems]
  • 5. Current Tests: Science Box
      • Scalable deployment with Kubernetes
      [Architecture diagram labels]
          • SWAN: JupyterHub, Single-user Server, CVMFS Client, EOS Fuse Mount (scale-out capacity with user population)
          • EOS: Management Node, File Storage Servers … (scale-out storage)
          • CERNBox: CERNBox Gateway, CERNBox User Sync Client
  • 6. Current Tests: Science Box
      • Scalable deployment with Kubernetes
      • SPARK as a computing engine
      [Architecture diagram labels]
          • SWAN: JupyterHub, Single-user Server, CVMFS Client, EOS Fuse Mount (scale-out capacity with user population)
          • Spark: AppMaster, Spark Worker, Python Tasks
          • EOS: Management Node, File Storage Servers … (scale-out storage)
  • 7. Current Tests: Spark Cluster
      • Hadoop and Spark installed via yum on plain VMs
      • Configuration via Cloudera Manager
      • Validation with industry standard TPC-DS benchmark + user workloads
      • Current Installation
          • 6 nodes
          • ECS Type: s2.xlarge.8
          • Hadoop: CDH_5.14
          • Spark: 2.2.1
          • JAVA: 1.8.0_161-b12
  • 8. Deployment Status in HN Cloud
      • Deployment OK on OTC
      • All resources managed as sub-tenant “TOTEM_Test”
      • Small scale for now
          • 22 VMs, 41 Block Devices
          • 128 CPUs, 488GB Memory
          • 22 System Volumes (950GB)
          • 19 Data Volumes (~12TB)
      • Functional tests successful
          • Storage, Sync, Analysis
          • SWAN to Spark connection
          • Import small dataset
  • 9. Deployment Status in HN Cloud
      • Demo at Lunchtime: CERN, Restaurant 2
  • 10. Experience with OTC as a User
      • Incidents:
          • May 7th outage: 3 containers lost due to VM reboots; manual intervention required
          • May 14th: a VM reported I/O errors on its system disk; the VM was rebuilt
          • Early June: OTC web console unavailable for short periods
      • Requests:
          • Streamline VM name <==> hostname <==> IP address resolution
              o E.g., internal DNS to resolve a VM name to its private IP address
              o Worked around with entries in /etc/hosts + dnsmasq
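The /etc/hosts part of that workaround can be sketched as below. This is only an illustration: the VM names and private addresses are hypothetical placeholders, not the TOTEM_Test inventory, and the slides do not say how the entries were actually generated. By default dnsmasq answers queries for names listed in /etc/hosts on the machine it runs on, which is what makes a single maintained hosts file usable as a small internal DNS.

```python
# Hypothetical inventory: VM name -> private IP address.
# These names and addresses are placeholders, not real hosts.
VM_INVENTORY = {
    "sciencebox-swan": "192.168.0.20",
    "sciencebox-eos-mgm": "192.168.0.10",
    "sciencebox-eos-fst1": "192.168.0.11",
}


def render_hosts_entries(inventory):
    """Return /etc/hosts lines ("<IP><TAB><name>"), sorted by VM name."""
    return [f"{ip}\t{name}" for name, ip in sorted(inventory.items())]


if __name__ == "__main__":
    # Print the block that would be appended to /etc/hosts.
    print("\n".join(render_hosts_entries(VM_INVENTORY)))
```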
  • 11. Next Steps
      • Application layer optimization for TOTEM use-case
          • Performance tuning, caching of frequently-used software packages, …
          • Focus on relevant metrics for users, e.g., total execution time
      • Scale-out storage and computing resources
          • Leverage the scalable architecture design (EOS, SWAN, Spark)
          • Storage:
              o Import larger dataset (10-50TB); additional file servers and block devices
              o Investigate caching layer for Spark for improved performance
          • Computing:
              o Experiment and identify best VM flavor for Spark computing nodes
              o Additional Spark workers to crunch bigger datasets
  • 14. Project Description
      • This is a deployment test of the ScienceBox IT services (EOS, SWAN, CERNBox, and Spark) in a particular analysis scenario for the TOTEM experiment (conducted jointly with the SFT group and interested TOTEM collaborators).
      • Phase 1 (April-September) is a feasibility study: (a) verification of the deployment of the services; (b) performance tuning and testing of a TOTEM data analysis example. If the feasibility tests are successful, Phase 2 (September-November) will use available resources opportunistically to understand the scaling limitations of the system and, if possible, to actually perform analysis for use-cases defined by the TOTEM collaborators.
      • The amount of data currently foreseen is ~10-50TB. The HN resource is treated as an extension of the CERN resource. Any data stored in the HN cloud will be automatically erased at the end of 2018.
      • This project does not provide a guarantee of service but will make opportunistic use of resources if feasible.
  • 15. Synthetic Benchmarks
      • Network – iperf, ping
          • TCP connection, 128K buffer, 60s test (5s binning)
          • Average bandwidth: ~7.8 Gb/s (both directions)
          • Very low latency (~0.3ms) and jitter (sdev ~0.07ms)
      [Chart: Bandwidth [Gb/s] and Transferred [GB] per 5s interval over Time [s], 0 to 60s. Avg: 7.82Gb/s, Total: 54.6GB]
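The two chart annotations (7.82 Gb/s average, 54.6 GB total) can be cross-checked with a few lines of arithmetic. The sketch below assumes the rate is in decimal gigabits per second and that the total is reported in binary units (GiB), which is an interpretation of the figures rather than something the slide states.

```python
AVG_BANDWIDTH_GBIT_S = 7.82  # average bandwidth from the slide
DURATION_S = 60              # test duration from the slide


def transferred_bytes(gbit_per_s, seconds):
    """Bytes moved at a constant rate given in Gb/s (1 Gb = 1e9 bits)."""
    return gbit_per_s * 1e9 * seconds / 8


total = transferred_bytes(AVG_BANDWIDTH_GBIT_S, DURATION_S)
total_gb = total / 1e9     # decimal gigabytes
total_gib = total / 2**30  # binary gibibytes (assumed unit of the "54.6GB" label)

print(f"{total_gb:.2f} GB (decimal), {total_gib:.2f} GiB (binary)")
```

Under these assumptions the binary figure comes out at about 54.6, matching the total on the chart, while the decimal figure is closer to 58.7.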
  • 16. Synthetic Benchmarks (cont’d)
      • Block Devices – fio
          • Three types of VDBs: Common-IO, High-IO, and UltraHigh-IO
          • No disk encryption, No disk sharing
          • Size from 10GB to 1TB (not related to performance)
      [Chart: Sequential Read, 256KB]
          • CommonIO: Bandwidth 92 MB/s, IOPS 359, Latency 44 ms
          • HighIO: Bandwidth 153 MB/s, IOPS 600, Latency 26 ms
          • UltraHighIO: Bandwidth 358 MB/s, IOPS 1399, Latency 11 ms
      [Chart: Random Read, 4KB]
          • CommonIO: Bandwidth 3450 kB/s, IOPS 862, Latency 18 ms
          • HighIO: Bandwidth 6316 kB/s, IOPS 1579, Latency 10 ms
          • UltraHighIO: Bandwidth 8938 kB/s, IOPS 2234, Latency 7 ms
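The fio numbers can be sanity-checked against each other, reading the chart values as (bandwidth, IOPS, latency) triples per device type. The sketch below assumes that reading, assumes 1 MB/s = 1000 kB/s when converting the sequential results, and uses two standard relations: bandwidth is roughly IOPS times block size, and (Little's law) the average number of outstanding requests is IOPS times latency. The slide does not state the queue depth used, so the second result is only an inference.

```python
# (reported bandwidth in kB/s, IOPS, latency in ms), as read from the charts.
# Sequential bandwidths are converted from MB/s assuming 1 MB/s = 1000 kB/s.
SEQ_READ_256K = {
    "CommonIO": (92_000, 359, 44),
    "HighIO": (153_000, 600, 26),
    "UltraHighIO": (358_000, 1399, 11),
}
RAND_READ_4K = {
    "CommonIO": (3450, 862, 18),
    "HighIO": (6316, 1579, 10),
    "UltraHighIO": (8938, 2234, 7),
}


def bandwidth_kb_s(iops, block_kb):
    """Bandwidth implied by an IOPS figure at a fixed block size."""
    return iops * block_kb


def outstanding_requests(iops, latency_ms):
    """Little's law: average requests in flight = rate x latency."""
    return iops * latency_ms / 1000.0


def check(results, block_kb):
    """Return (device, derived kB/s, relative error, requests in flight)."""
    rows = []
    for device, (reported, iops, latency_ms) in results.items():
        derived = bandwidth_kb_s(iops, block_kb)
        rel_err = abs(derived - reported) / reported
        rows.append((device, derived, rel_err, outstanding_requests(iops, latency_ms)))
    return rows


if __name__ == "__main__":
    for label, results, block_kb in (("seq 256KB", SEQ_READ_256K, 256),
                                     ("rand 4KB", RAND_READ_4K, 4)):
        for device, derived, rel_err, in_flight in check(results, block_kb):
            print(f"{label:10s} {device:12s} derived={derived} kB/s "
                  f"err={rel_err:.2%} in-flight={in_flight:.1f}")
```

With these inputs the derived bandwidths agree with the reported ones to within 1%, and the in-flight estimate sits between 15 and 16 for every device type, which would be consistent with a fixed queue depth of about 16.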
  • 17. Synthetic Benchmarks (cont’d)
      • CPU and Memory – sysbench
          • CPU - Total events: 10’000
              o Avg. time per event: 1.11ms
              o 95th percentile: 1.12ms
          • Memory – Transfer (write) 10GB
              o IOPS: 1’614’325
              o Throughput: 1576.49 MB/s
      • Two (main) flavors of VMs:
          • 4CPUs, 8GB memory – General purpose VM
          • 4CPUs, 32GB memory – Spark workers
      • Performance is consistent across VM sizes
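Two quantities follow directly from the sysbench figures: the cumulative CPU event time, and the block size implied by the memory test. The sketch below assumes the "IOPS" figure is memory operations per second and that "MB/s" means MiB/s (2^20 bytes); neither is stated on the slide.

```python
CPU_EVENTS = 10_000             # total events, from the slide
AVG_EVENT_MS = 1.11             # average time per event, from the slide
MEM_OPS_PER_S = 1_614_325       # reported as "IOPS" on the slide
MEM_THROUGHPUT_MIB_S = 1576.49  # reported as MB/s; MiB/s is assumed here

# Cumulative time spent executing events (events x average time per event).
cpu_total_s = CPU_EVENTS * AVG_EVENT_MS / 1000

# Bytes moved per memory operation = throughput / operation rate.
implied_block_bytes = MEM_THROUGHPUT_MIB_S * 2**20 / MEM_OPS_PER_S

print(f"CPU event time: {cpu_total_s:.1f} s")
print(f"Implied memory block size: {implied_block_bytes:.1f} bytes")
```

Under these assumptions the implied block size is very close to 1024 bytes, i.e. the memory test would have been run with 1 KiB blocks.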
  • 18. Synthetic Benchmarks (cont’d)
      • Spark Cluster – TPC-DS Benchmark
          • Small case test: Dataset size 20GB
          • 5 executors (2 cores, 7GB memory), Query Set v1.4, Spark 2.2.1
      [Chart: Query Execution Time (Latency) [s] per Query name, q1 to q99; series: Minimum, Maximum, Average]
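The executor sizing quoted for this test maps onto three standard Spark configuration properties. The sketch below only builds the property dictionary and the corresponding `--conf` arguments; how the benchmark was actually launched is not described in the slides, so the helper names and the launch style are assumptions.

```python
EXECUTORS = 5               # from the slide
CORES_PER_EXECUTOR = 2      # from the slide
MEMORY_PER_EXECUTOR_GB = 7  # from the slide


def spark_conf(executors, cores, memory_gb):
    """Executor sizing expressed as standard Spark properties."""
    return {
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{memory_gb}g",
    }


def to_submit_args(conf):
    """Flatten a property dict into repeated '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = spark_conf(EXECUTORS, CORES_PER_EXECUTOR, MEMORY_PER_EXECUTOR_GB)
total_cores = EXECUTORS * CORES_PER_EXECUTOR
total_memory_gb = EXECUTORS * MEMORY_PER_EXECUTOR_GB

print(" ".join(to_submit_args(conf)))
print(f"{total_cores} executor cores, {total_memory_gb} GB executor memory in total")
```

The totals (10 cores and 35 GB of executor memory) give a sense of how small this first test is relative to the 20GB dataset.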
  • 19. Synthetic Benchmarks (cont’d)
      • Spark Cluster – TPC-DS Benchmark
          • Small case test: Dataset size 20GB
          • 5 executors (2 cores, 7GB memory), Query Set v1.4, Spark 2.2.1
      [Chart: CDF of average Query Execution Time (Latency) [s], 0 to 70s]
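A CDF like the one on this slide can be derived from per-query average latencies as sketched below. The per-query values are not recoverable from the transcript, so the latencies used here are made-up placeholders chosen only to exercise the function.

```python
def fraction_within(samples, limit):
    """Fraction of samples with value <= limit."""
    return sum(1 for s in samples if s <= limit) / len(samples)


def empirical_cdf(samples):
    """Return (value, fraction of samples <= value) for each distinct value."""
    return [(value, fraction_within(samples, value)) for value in sorted(set(samples))]


# Placeholder latencies in seconds; NOT the measured TPC-DS results.
PLACEHOLDER_LATENCIES_S = [4.0, 6.5, 6.5, 12.0, 30.0]

if __name__ == "__main__":
    for value, fraction in empirical_cdf(PLACEHOLDER_LATENCIES_S):
        print(f"{value:5.1f} s -> {fraction:.2f}")
```

Reading such a curve at a given latency tells what share of the query set completes within that time, which is closer to the user-facing metric than the per-query bars.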
  • 20. Data Analysis with RDataFrame
          import ROOT
          df = ROOT.RDataFrame('t', 'f.root')           # Build data-frame object
          hist = df.Filter('theta > 0').Histo1D('pt')   # Apply transformation (Filter), then action (Histo1D)
          hist.Draw()                                   # Draw the resulting histogram
      • Columnar dataset: ROOT, other formats
      • Declarative analysis: what, not how
          • Transformations: cuts, definition of new columns
          • Actions: return a result (e.g., histogram)
      • Implicit parallelisation
          • Multi-threaded, distributed
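The split between transformations and actions can be illustrated without ROOT. The class below is a toy stand-in written for this purpose, not the RDataFrame API or its implementation: `filter` only records a cut, and `histo1d` runs the single loop over the rows. One simplification to note is that in RDataFrame the result of an action is itself computed lazily on first access, whereas this toy runs the loop immediately.

```python
class LazyFrame:
    """Toy declarative data frame (illustrative only, not the ROOT API)."""

    def __init__(self, rows, filters=()):
        self._rows = rows
        self._filters = tuple(filters)

    def filter(self, predicate):
        # Transformation: the cut is only recorded, nothing is computed yet.
        return LazyFrame(self._rows, self._filters + (predicate,))

    def histo1d(self, column, edges):
        # Action: one pass over the rows, applying all recorded cuts,
        # then counting values into half-open bins [edges[i], edges[i+1]).
        counts = [0] * (len(edges) - 1)
        for row in self._rows:
            if not all(pred(row) for pred in self._filters):
                continue
            value = row[column]
            for i in range(len(counts)):
                if edges[i] <= value < edges[i + 1]:
                    counts[i] += 1
                    break
        return counts


# Made-up rows standing in for the columns 'theta' and 'pt'.
rows = [
    {"theta": 0.2, "pt": 1.5},
    {"theta": -0.1, "pt": 2.5},
    {"theta": 0.4, "pt": 2.7},
]
hist = LazyFrame(rows).filter(lambda r: r["theta"] > 0).histo1d("pt", [0, 1, 2, 3])
print(hist)
```

Because the user states only which cuts and results are wanted, the framework is free to decide how the loop is executed, which is what makes the implicit multi-threaded or distributed execution possible.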
  • 21. RDataFrame: Distributed Execution
      • RDataFrame supports distributed execution on top of Spark
      • No changes in the app will be required to go distributed!
  • 22. RDataFrame: Spark Monitoring
      • Automatic monitoring when an RDataFrame Spark job is spawned
          • Job progress, task distribution, resource utilization
      • Dataset split in ranges under the hood, one task per range
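The "one task per range" idea can be sketched as follows. This is a generic even split of entry indices into contiguous ranges, written for illustration; the actual partitioning logic used by the distributed RDataFrame prototype is not described in the slides, and the function name is hypothetical.

```python
def split_ranges(n_entries, n_ranges):
    """Split [0, n_entries) into n_ranges contiguous half-open (start, end) ranges.

    Sizes differ by at most one entry; the first ranges take the remainder.
    """
    if n_ranges <= 0:
        raise ValueError("n_ranges must be positive")
    base, extra = divmod(n_entries, n_ranges)
    ranges = []
    start = 0
    for i in range(n_ranges):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # Each range would become one task; here the tasks just sum their entries
    # and the partial results are merged afterwards.
    data = list(range(10))
    partials = [sum(data[a:b]) for a, b in split_ranges(len(data), 3)]
    print(split_ranges(len(data), 3), partials, sum(partials))
```

Each range can be processed independently and the partial results merged at the end, which is why the number of ranges bounds the parallelism visible in the monitoring view.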