Benchmarking MPI Applications in Singularity Containers on Traditional HPC and Cloud Infrastructures
Andrei Plamadă, Jarunan Panyasantisuk
ETH Zürich – Scientific IT Services (ID | SIS)
2019 hpc-ch Forum – Cloud and Containers, 16.05.2019
Presenter: Andrei Plamadă
Outline
§ Motivation
§ User experience:
  § Traditional HPC vs HPC in the Public Cloud
  § Singularity v2.6
§ Benchmarking MPI Applications
  § OSU Micro-Benchmarks
  § Machine Learning: TensorFlow
Motivation – Public Cloud is growing rapidly
§ 2018-2022: 20.2% CAGR for IaaS (see Forbes – Gartner)
§ Expectations
  § More competitive prices
  § More regions
  § More heterogeneous
§ Available in Switzerland
  § 2019-03-12 Google Cloud Platform in Zurich
§ Announced in Switzerland
  § 2018-03-14 Azure Switzerland North and West

Worldwide Public Cloud SaaS and IaaS Revenue Forecast (Billions of U.S. Dollars)
Year    2018    2019    2020    2021    2022
SaaS    80.0    94.8   110.5   126.7   143.7
IaaS    30.5    38.9    49.1    61.9    76.7
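The growth figure can be checked against the chart values. The following is a minimal sketch, assuming the standard compound annual growth rate formula and the IaaS endpoints read off the chart; the function name `cagr` is illustrative. Under that assumption, five compounding periods give a value close to the quoted 20.2%, while the four year-over-year steps between 2018 and 2022 give roughly 26%.

```python
def cagr(start: float, end: float, periods: int) -> float:
    """Compound annual growth rate over `periods` compounding steps."""
    if start <= 0 or periods <= 0:
        raise ValueError("start and periods must be positive")
    return (end / start) ** (1.0 / periods) - 1.0


# IaaS revenue forecast from the chart, in billions of U.S. dollars.
iaas = {2018: 30.5, 2019: 38.9, 2020: 49.1, 2021: 61.9, 2022: 76.7}

# Four year-over-year steps lie between 2018 and 2022.
cagr_4 = cagr(iaas[2018], iaas[2022], periods=4)
# Counting the five listed years as periods lands near the quoted figure.
cagr_5 = cagr(iaas[2018], iaas[2022], periods=5)

print(f"4 periods: {cagr_4:.1%}")
print(f"5 periods: {cagr_5:.1%}")
```

Which period count the original forecast used is not stated on the slide, so both are shown.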
Motivation – HPC is in the Cloud as per Press Releases
§ Amazon EC2
  § 2018-11-26 c5n Instances
    § Intel Xeon Platinum ~3.0 GHz, 72 vCPUs, 2.6 GB/vCPU, 100 Gbps
§ Azure
  § 2017-10-23 Cray in Azure
    § Cray XC-series, Cray CS-series
  § 2018-11-14 New H-series in preview*
    § AMD EPYC 7551 ~3.0 GHz: 60 vCPUs, 4.0 GB/vCPU, 100 Gbps EDR InfiniBand (2019-05-14 available)
    § Intel Xeon Platinum 8168 ~3.4 GHz: 44 vCPUs, 8.0 GB/vCPU, 100 Gbps EDR InfiniBand
§ Google Cloud Platform
  § 2019-04-02 Compute-Optimized VMs (C2)
    § 2nd Gen Intel Xeon Scalable Processors ~3.8 GHz, 60 vCPUs, 4.0 GB/vCPU
Motivation – Singularity as the container solution for HPC
§ Containers improve portability and can address the reproducibility issue in research (EnhanceR Survey - Science IT Consultants)
§ EnhanceR Survey - Infrastructure Providers for Container Use
§ Singularity:
  § Initially developed at LBL (Berkeley Lab) for the HPC use case (multi-tenancy)
  § Open source under the BSD 3-Clause license: https://github.com/sylabs/singularity
  § Under active development, with 12 contributors having more than 100 commits
  § Also available with commercial support: Singularity Pro
  § Used worldwide and recommended by vendors, e.g. NVIDIA, Azure Batch
  § Big worldwide community (Google Groups, Slack)
  § Swiss community - EnhanceR
Motivation – Singularity as the container solution for HPC
§ Main idea
  § Native setup
    • Host OS+Drivers+Middleware (OSDM)
    • MPI: mpirun, MPI Library
    • SSH Server
    • App: Shared MPI Library
  § Container setup
    • Host OS+Drivers+Middleware (OSDM)
    • MPI: mpirun
    • SSH Server
    • Container OSDM: MPI, App, Shared MPI Library
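The split between host and container can be made concrete with the launch line. This is a minimal sketch, assuming the common pattern in which the host `mpirun` starts one `singularity exec` process per rank; the helper name `hybrid_mpi_command`, the image name `osu.simg`, and the binary path are hypothetical and not taken from the slides.

```python
import shlex
from typing import List, Sequence


def hybrid_mpi_command(n_ranks: int, image: str, app: str,
                       app_args: Sequence[str] = ()) -> List[str]:
    """Compose a launch line where the host mpirun starts one container per rank."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    # Host part: the launcher comes from the host MPI installation.
    host_part = ["mpirun", "-np", str(n_ranks)]
    # Container part: the app and its shared MPI library live inside the image.
    container_part = ["singularity", "exec", image, app, *app_args]
    return host_part + container_part


# Hypothetical image and binary names, used only for illustration.
cmd = hybrid_mpi_command(2, "osu.simg", "/opt/osu/osu_bw")
print(" ".join(shlex.quote(part) for part in cmd))
```

The helper only builds the argument list. Running it for real would additionally require a scheduler allocation and a host MPI that is compatible with the one in the image.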
Traditional HPC vs HPC in the Public Cloud
§ Traditional HPC (ETH – SIS – HPC)
  § Euler IV:
    § 2x18 core Intel Xeon Gold 6150 (2.7-3.7 GHz)
    § All cores available
    § HT available
    § 7.4 GB/core Memory
    § 100 Gbps InfiniBand
§ Public Cloud - Azure
  § In preview HC-Series – Standard_HC44rs
    § 2x24 core Intel Xeon Platinum 8168 (2.7-3.7 GHz)?
    § 2x2 cores used by the hypervisor?
    § HT disabled?
    § 8.0 GB/core Memory
    § 100 Gbps InfiniBand
User Experience – Traditional HPC vs HPC in the Public Cloud
§ Traditional HPC (ETH – SIS – HPC)
  § Ready to be used (LSF)
  § No maintenance / set-up
  § Login and Compute Nodes
  § Moderate flexibility regarding the software stack
  § Queue
  § It generally works as expected
§ Public Cloud - Azure
  § Needs to be set up (Slurm Cluster) via CycleCloud
  § As admin, you are fully responsible
  § Master and Execute Nodes
  § High flexibility (as the admin), e.g. OpenMPI, MPICH, MVAPICH2, Intel MPI
  § Queue (as admin: high availability)
  § Auto-scaling
  § https://github.com/Azure/cyclecloud-slurm/issues
User Experience on CentOS 7 – Singularity v2.6
§ Create
  • Docker
  • root access
  • on your PC
§ Run
  • Singularity
  • on your PC or HPC infrastructure
§ Multi-node: MPICH ABI Compatibility initiative
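For multi-node runs, the practical question is whether the MPI library on the host and the one in the container belong to the same ABI family. This is a simplified sketch, assuming that MPICH, MVAPICH2, and Intel MPI count as one family under the MPICH ABI Compatibility initiative and that version constraints are ignored; the function and set names are illustrative.

```python
# Implementations treated as one ABI family in this sketch (versions ignored).
MPICH_ABI_FAMILY = {"mpich", "mvapich2", "intelmpi"}


def normalize(name: str) -> str:
    """Lower-case the name and drop spaces and hyphens, e.g. 'Open MPI' -> 'openmpi'."""
    return name.lower().replace(" ", "").replace("-", "")


def abi_compatible(host_mpi: str, container_mpi: str) -> bool:
    """Return True if host and container MPI are expected to interoperate."""
    host, container = normalize(host_mpi), normalize(container_mpi)
    if host == container:
        # Same implementation on both sides; version skew is not modelled here.
        return True
    return host in MPICH_ABI_FAMILY and container in MPICH_ABI_FAMILY


for host, container in [("MPICH", "MVAPICH2"), ("OpenMPI", "MVAPICH2")]:
    print(host, container, abi_compatible(host, container))
```

A real check would also compare library versions and the interconnect drivers available on the host.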
OSU Micro-Benchmarks – osu_bw (Gbps), 1000 iterations
Abbreviations: Azure (A), Euler (E), MVAPICH2 (m2), Native (N), Container (C)

Bytes   EN m2 v2.2   EC m2 v2.2   EC m2 v2.3   AN m2 v2.3   AC m2 v2.3
8             0.16         0.15         0.16         0.16         0.08
64            1.30         1.27         1.29         1.28         1.25
512           8.27         8.21         8.14         7.87         7.65
4K           37.41        37.65        37.42        37.23        36.54
32K          88.89        89.25        89.43        83.50        82.47
2M           94.75        94.59        95.19        94.25        94.30
16M          94.95        94.75        95.50        91.49        89.99

§ A naïve MPICH v3.3 container (EC/AC) works, but only reaches up to 10/4 Gbps (no InfiniBand)
§ Host: AC MPICH v3.3, Container: m2 v2.3; results match AC m2 v2.3, up to 100 Gbps
§ OpenMPI is not compatible with MPICH-derived MPI implementations, so that combination does not work
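The container overhead can be read from the table as the relative difference between the container and native columns. The following is a minimal sketch using the osu_bw values above; the helper name `relative_change` is illustrative, and since the slide reports no run-to-run variance, small differences should not be over-interpreted.

```python
from typing import Dict, List

SIZES = ["8", "64", "512", "4K", "32K", "2M", "16M"]

# osu_bw bandwidth in Gbps, copied from the table (E/A = Euler/Azure, N/C = native/container).
BW: Dict[str, List[float]] = {
    "EN m2 v2.2": [0.16, 1.30, 8.27, 37.41, 88.89, 94.75, 94.95],
    "EC m2 v2.2": [0.15, 1.27, 8.21, 37.65, 89.25, 94.59, 94.75],
    "AN m2 v2.3": [0.16, 1.28, 7.87, 37.23, 83.50, 94.25, 91.49],
    "AC m2 v2.3": [0.08, 1.25, 7.65, 36.54, 82.47, 94.30, 89.99],
}


def relative_change(native: List[float], container: List[float]) -> List[float]:
    """Percent change of container bandwidth relative to native, per message size."""
    return [(c - n) / n * 100.0 for n, c in zip(native, container)]


euler = relative_change(BW["EN m2 v2.2"], BW["EC m2 v2.2"])
azure = relative_change(BW["AN m2 v2.3"], BW["AC m2 v2.3"])

for size, e, a in zip(SIZES, euler, azure):
    print(f"{size:>4}  Euler {e:+6.1f}%  Azure {a:+6.1f}%")
```

At two-decimal precision, the 8-byte row is dominated by rounding, so the large-message rows are the more meaningful ones.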
OSU Micro-Benchmarks – osu_latency (μs), 100000 iterations
Abbreviations: Azure (A), Euler (E), MVAPICH2 (m2), Native (N), Container (C)

Bytes   EN m2 v2.2   EC m2 v2.2   EC m2 v2.3   AN m2 v2.3   AC m2 v2.3
8             1.25         1.26         1.30         2.37         2.34
64            1.37         1.38         1.37         2.54         2.54
512           2.12         2.09         2.12         3.44         3.38
4K            3.44         3.34         3.63         5.16         5.30
32K           8.69         8.59         8.88        14.07        13.47
2M           28.46        28.39        28.54        39.62        38.71
16M         188.68       188.70       185.10       202.52       204.84
OSU Micro-Benchmarks – Dockerfile
Machine Learning – TensorFlow – on Azure (1 iteration – NO STATISTICS)
§ 2018-11-24: new N-Series Azure Virtual Machines (in preview)
  § Standard_ND40s_v2:
    § Intel Skylake: 40 vCPUs, 16.8 GB/vCPU
    § 8 x NVIDIA Tesla V100 NVLINK

Time to Solution (min)
No. of GPUs   CUDA 9   CUDA 10   Singularity CUDA 10
1                 87        63                    65
2                102        89                   59?
4                 66        46                    45
8                 28        19                    18
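The scaling behaviour in the table can be summarised as speedup and parallel efficiency relative to the single-GPU run. The following is a minimal sketch, assuming the usual definitions (speedup = T1 / Tn, efficiency = speedup / n); the helper name `scaling` is illustrative. The values come from a single run each, and the 2-GPU Singularity entry is marked as uncertain on the slide.

```python
from typing import Dict, List, Tuple

GPUS = [1, 2, 4, 8]

# Time to solution in minutes, copied from the table (single run, no statistics).
MINUTES: Dict[str, List[float]] = {
    "CUDA 9": [87, 102, 66, 28],
    "CUDA 10": [63, 89, 46, 19],
    "Singularity CUDA 10": [65, 59, 45, 18],  # the 2-GPU value is "59?" on the slide
}


def scaling(times: List[float]) -> List[Tuple[int, float, float]]:
    """Return (gpus, speedup, efficiency) relative to the 1-GPU time."""
    base = times[0]
    return [(n, base / t, base / t / n) for n, t in zip(GPUS, times)]


for name, times in MINUTES.items():
    print(name)
    for n, speedup, efficiency in scaling(times):
        print(f"  {n} GPU(s): speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

In all three columns, eight GPUs give a speedup of well under 8x over one GPU, and the 2-GPU native runs are slower than the 1-GPU runs.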
Machine Learning – TensorFlow – Dockerfile (1/2)
Machine Learning – TensorFlow – Dockerfile (2/2)
Conclusion
§ User experience on Azure - HPC in the cloud is catching up:
  § CycleCloud Slurm Cluster with compute-intensive VMs + 100 Gbps InfiniBand in preview
  § Big machine learning VMs (up to 8 x Tesla V100 NVLINK) in preview
§ Singularity Containers:
  § When the host is similar to the container, we did not observe any overhead
  § HPC partially breaks the portability of containers
    § The container should be compatible with the host infrastructure and the host MPI implementation
  § Updating CUDA drivers (9 to 10) might improve the time to solution
Contact
ETH Zürich
Andrei Plamadă
Scientific IT Services
Weinbergstrasse 11
8092 Zürich

Acknowledgements
§ SIS colleagues: Thomas Wüst, Urban Borstnik, Samuel Fux
§ EnhanceR colleagues: Alexander Kashev (UniBe)
§ Microsoft / Azure: Lukasz Miroslaw, Andy Howard

EnhanceR Survey - Infrastructure Providers for Container Use
https://forms.gle/JBW78qDPWabd4GDR8

Benchmarking MPI Applications in Singularity Containers on Traditional HPC and Cloud Infrastructures

  • 1. ||ID | SIS 2019 hpc-ch Forum – Cloud and Containers Andrei Plamadă, Jarunan Panyasantisuk ETH Zürich – Scientific IT Services 16.05.2019 1 Benchmarking MPI Applications in Singularity Containers on Traditional HPC and Cloud Infrastructures Andrei Plamadă
  • 2. ||ID | SIS § Motivation § User experience: § Traditional HPC vs HPC in the Public Cloud § Singularity v2.6 § Benchmarking MPI Applications § OSU Micro-Benchmarks § Machine Learning: TensorFlow 16.05.2019Andrei Plamadă 2 Outline
  • 3. ||ID | SIS § 2018-2022: 20.2% CAGR for IaaS (see Forbes – Gartner) 16.05.2019Andrei Plamadă 3 Motivation – Public Cloud is growing rapidly 80.0 94.8 110.5 126.7 143.7 30.5 38.9 49.1 61.9 76.7 2018 2019 2020 2021 2022 Worldwide Public Cloud SaaS and IaaS Revenue Forecast (Billions of U.S. Dollars) SaaS IaaS
  • 4. ||ID | SIS § 2018-2022: 20.2% CAGR for IaaS (see Forbes – Gartner) § Expectations § More competitive prices § More regions § More heterogeneous 16.05.2019Andrei Plamadă 4 Motivation – Public Cloud is growing rapidly 80.0 94.8 110.5 126.7 143.7 30.5 38.9 49.1 61.9 76.7 2018 2019 2020 2021 2022 Worldwide Public Cloud SaaS and IaaS Revenue Forecast (Billions of U.S. Dollars) SaaS IaaS
  • 5. ||ID | SIS § 2018-2022: 20.2% CAGR for IaaS (see Forbes – Gartner) § Expectations § More competitive prices § More regions § More heterogeneous 16.05.2019Andrei Plamadă 5 Motivation – Public Cloud is growing rapidly § Available in Switzerland § 2019-03-12 Google Cloud Platform in Zurich § Announced in Switzerland § 2018-03-14 Azure Switzerland North and West 80.0 94.8 110.5 126.7 143.7 30.5 38.9 49.1 61.9 76.7 2018 2019 2020 2021 2022 Worldwide Public Cloud SaaS and IaaS Revenue Forecast (Billions of U.S. Dollars) SaaS IaaS
  • 6. ||ID | SIS § Amazon EC2 § 2018-11-26 c5n Instances § Intel Xeon Platinum ~3.0 GHz, 72 vCPUs, 2.6 GB/vCPU, 100 Gbps § Azure § 2017-10-23 Cray in Azure § Cray XC-series, Cray CS-series § 2018-11-14 New H-series in preview* § AMD EPYC 7551 ~3.0 GHz: 60 vCPUs, 4.0 GB/vCPU, 100 Gbps EDR InfiniBand (2019-05-14 available) § Intel Xeon Platinum 8168 ~3.4 GHz: 44 vCPUs, 8.0 GB/vCPU, 100 Gbps EDR InfiniBand § Google Cloud Platform § 2019-04-02 Compute-Optimized VMs (C2) § 2nd Gen Intel Xeon Scalable Processors ~3.8 GHz, 60 vCPUs, 4.0 GB/vCPU 16.05.2019Andrei Plamadă 6 Motivation – HPC is in the Cloud as per Press Releases
  • 7. ||ID | SIS § Containers improve portability and can address the reproducibility issue in research (EnhanceR Survey - Science IT Consultants) § EnhanceR Survey - Infrastructure Providers for Container Use § Singularity: § Developed initially at LBL - Berkeley Lab - for HPC use case (multi-tenancy) § Open source with standard BSD 3 clause license https://github.com/sylabs/singularity § Under active development with 12 contributors with more than 100 commits § Available also with commercial support: Singularity Pro § Used world wide and recommended by vendors, e.g. NVIDIA, Azure Batch § Big worldwide community (google groups, slack) § Swiss community - EnhanceR 16.05.2019Andrei Plamadă 7 Motivation – Singularity as the container solution for HPC
  • 8. Motivation – Singularity as the container solution for HPC
      § Containers improve portability and can address the reproducibility issue in research (EnhanceR Survey - Science IT Consultants)
      § EnhanceR Survey - Infrastructure Providers for Container Use
      § Main idea
      [Diagram, without container: Host OS+Drivers+Middleware (OSDM) with MPI (mpirun, MPI library) and SSH server; App with shared MPI library]
      [Diagram, with container: Host OS+Drivers+Middleware (OSDM) with MPI (mpirun) and SSH server; Container with its own OSDM, MPI, and App with shared MPI library]
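The layout on this slide is commonly driven by letting the host's `mpirun` start one `singularity exec` process per rank, so that the MPI library inside the container talks to the host launcher. A minimal sketch of how such a command line could be assembled; the image name `osu.simg` and the benchmark path `/opt/osu/osu_bw` are hypothetical placeholders, not taken from the slides, and the command is only built here, not executed.

```python
import shlex


def hybrid_mpi_command(ranks, image, program, launcher="mpirun"):
    """Build the argv for a host-launched, containerized MPI run.

    The host launcher starts `ranks` processes; each one is a
    `singularity exec` that runs `program` inside `image`.
    """
    if ranks < 1:
        raise ValueError("ranks must be >= 1")
    return [launcher, "-np", str(ranks), "singularity", "exec", image, program]


# Hypothetical image and binary path, for illustration only.
cmd = hybrid_mpi_command(2, "osu.simg", "/opt/osu/osu_bw")
print(shlex.join(cmd))
```

Building the argument list separately keeps the launcher, rank count, and image easy to swap when the same container is moved between hosts.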
  • 9. Traditional HPC vs HPC in the Public Cloud
      § Traditional HPC (ETH – SIS – HPC)
          § Euler IV:
              § 2x18 core Intel Xeon Gold 6150 (2.7-3.7 GHz)
              § All cores available
              § HT available
              § 7.4 GB/core Memory
              § 100 Gbps InfiniBand
      § Public Cloud - Azure
          § In preview: HC-Series – Standard_HC44rs
              § 2x24 core Intel Xeon Platinum 8168 (2.7-3.7 GHz)?
              § 2x2 cores used by the hypervisor?
              § HT disabled?
              § 8.0 GB/core Memory
              § 100 Gbps InfiniBand
  • 10. User Experience – Traditional HPC vs HPC in the Public Cloud
      § Traditional HPC (ETH – SIS – HPC)
          § Ready to be used (LSF)
          § No maintenance / set-up
          § Login and Compute Nodes
          § Moderate flexibility regarding the software stack
          § Queue
          § It generally works as expected
      § Public Cloud - Azure
          § Needs to be set up (Slurm Cluster) via CycleCloud
          § As admin, you are fully responsible
          § Master and Execute Nodes
          § High flexibility (as the admin), e.g. OpenMPI, MPICH, MVAPICH2, Intel MPI
          § Queue (as admin, high availability)
          § Auto-scaling
          § https://github.com/Azure/cyclecloud-slurm/issues
  • 11. User Experience on CentOS 7 – Singularity v2.6
      Create: Docker, root access, on your PC
      Run: Singularity, on your PC or HPC infrastructure
      § Multi-node: MPICH ABI Compatibility initiative
  • 12. OSU Micro-Benchmarks – osu_bw (Gbps), 1000 iterations
      Bytes   EN m2 v2.2   EC m2 v2.2   EC m2 v2.3   AN m2 v2.3   AC m2 v2.3
      8             0.16         0.15         0.16         0.16         0.08
      64            1.30         1.27         1.29         1.28         1.25
      512           8.27         8.21         8.14         7.87         7.65
      4K           37.41        37.65        37.42        37.23        36.54
      32K          88.89        89.25        89.43        83.50        82.47
      2M           94.75        94.59        95.19        94.25        94.30
      16M          94.95        94.75        95.50        91.49        89.99
      Abbreviations: Azure (A), Euler (E), MVAPICH2 (m2), Native (N), Container (C)
      § Naïve EC/AC with MPICH v3.3 works, but only up to 10/4 Gbps (no InfiniBand)
      § Host: AC MPICH v3.3, Container: m2 v2.3; results as for AC m2 v2.3 (up to 100 Gbps)
      § OpenMPI is not compatible with MPICH-derived MPI implementations; that combination does not work
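To make the native-versus-container comparison in the osu_bw figures concrete, the relative change in bandwidth can be computed per message size. A minimal sketch using the values from this slide; the helper name `relative_change_percent` is illustrative, and because the inputs are single runs rounded to two decimals, the small-message percentages are coarse.

```python
# Message sizes and osu_bw results (Gbps) as reported on the slide.
SIZES = ["8", "64", "512", "4K", "32K", "2M", "16M"]
BANDWIDTH_GBPS = {
    "EN m2 v2.2": [0.16, 1.30, 8.27, 37.41, 88.89, 94.75, 94.95],
    "EC m2 v2.2": [0.15, 1.27, 8.21, 37.65, 89.25, 94.59, 94.75],
    "AN m2 v2.3": [0.16, 1.28, 7.87, 37.23, 83.50, 94.25, 91.49],
    "AC m2 v2.3": [0.08, 1.25, 7.65, 36.54, 82.47, 94.30, 89.99],
}


def relative_change_percent(container, native):
    """Percent change of the container value relative to the native one."""
    return (container - native) / native * 100.0


def compare(native_key, container_key):
    """Return {message size: percent change} for one native/container pair."""
    return {
        size: relative_change_percent(c, n)
        for size, n, c in zip(
            SIZES, BANDWIDTH_GBPS[native_key], BANDWIDTH_GBPS[container_key]
        )
    }


euler = compare("EN m2 v2.2", "EC m2 v2.2")
azure = compare("AN m2 v2.3", "AC m2 v2.3")
for size in SIZES:
    print(f"{size:>4}  Euler {euler[size]:+7.2f} %   Azure {azure[size]:+7.2f} %")
```

Negative values mean the container delivered less bandwidth than the native run for that message size.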
  • 13. OSU Micro-Benchmarks – osu_latency (μs), 100000 iterations
      Bytes   EN m2 v2.2   EC m2 v2.2   EC m2 v2.3   AN m2 v2.3   AC m2 v2.3
      8             1.25         1.26         1.30         2.37         2.34
      64            1.37         1.38         1.37         2.54         2.54
      512           2.12         2.09         2.12         3.44         3.38
      4K            3.44         3.34         3.63         5.16         5.30
      32K           8.69         8.59         8.88        14.07        13.47
      2M           28.46        28.39        28.54        39.62        38.71
      16M         188.68       188.70       185.10       202.52       204.84
      Abbreviations: Azure (A), Euler (E), MVAPICH2 (m2), Native (N), Container (C)
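For latency, the absolute difference in microseconds between container and native runs is often easier to read than a percentage. A minimal sketch using the osu_latency values from this slide; the function names are illustrative, and with a single measurement per cell, differences of a few hundredths of a microsecond may simply be noise.

```python
# Message sizes and osu_latency results (microseconds) as reported on the slide.
SIZES = ["8", "64", "512", "4K", "32K", "2M", "16M"]
LATENCY_US = {
    "EN m2 v2.2": [1.25, 1.37, 2.12, 3.44, 8.69, 28.46, 188.68],
    "EC m2 v2.2": [1.26, 1.38, 2.09, 3.34, 8.59, 28.39, 188.70],
    "AN m2 v2.3": [2.37, 2.54, 3.44, 5.16, 14.07, 39.62, 202.52],
    "AC m2 v2.3": [2.34, 2.54, 3.38, 5.30, 13.47, 38.71, 204.84],
}


def latency_delta_us(native_key, container_key):
    """Container minus native latency (microseconds) per message size."""
    pairs = zip(LATENCY_US[native_key], LATENCY_US[container_key])
    return {size: c - n for size, (n, c) in zip(SIZES, pairs)}


def largest_delta(deltas):
    """Return (size, delta) for the entry with the largest absolute difference."""
    size = max(deltas, key=lambda s: abs(deltas[s]))
    return size, deltas[size]


euler_delta = latency_delta_us("EN m2 v2.2", "EC m2 v2.2")
azure_delta = latency_delta_us("AN m2 v2.3", "AC m2 v2.3")
print("Euler largest delta:", largest_delta(euler_delta))
print("Azure largest delta:", largest_delta(azure_delta))
```

A positive delta means the container run was slower than the native run for that message size.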
  • 14. OSU Micro-Benchmarks – Dockerfile
  • 15. Machine Learning – TensorFlow on Azure (1 iteration – NO STATISTICS)
      § 2018-11-24: new N-Series Azure Virtual Machines (in preview)
      § Standard_ND40s_v2:
          § Intel Skylake: 40 vCPUs, 16.8 GB/vCPU
          § 8 x NVIDIA Tesla V100 NVLINK
      Time to Solution (min)
      No. of GPUs   CUDA 9   CUDA 10   Singularity CUDA 10
      1                 87        63                    65
      2                102        89                   59?
      4                 66        46                    45
      8                 28        19                    18
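The time-to-solution figures can be turned into speedup and parallel efficiency relative to the single-GPU run. A minimal sketch using the values from this slide; the function names are illustrative, the 2-GPU Singularity value was marked with a question mark by the author, and since each cell is one iteration without statistics, the derived numbers are indicative only.

```python
# Time to solution in minutes, keyed by configuration and GPU count.
TIME_MIN = {
    "CUDA 9": {1: 87, 2: 102, 4: 66, 8: 28},
    "CUDA 10": {1: 63, 2: 89, 4: 46, 8: 19},
    # The 2-GPU value (59) is flagged as uncertain ("59?") on the slide.
    "Singularity CUDA 10": {1: 65, 2: 59, 4: 45, 8: 18},
}


def speedup(config, gpus):
    """Single-GPU time divided by the time on `gpus` GPUs."""
    times = TIME_MIN[config]
    return times[1] / times[gpus]


def parallel_efficiency(config, gpus):
    """Speedup divided by the number of GPUs (1.0 would be ideal scaling)."""
    return speedup(config, gpus) / gpus


for config in TIME_MIN:
    for gpus in (2, 4, 8):
        print(
            f"{config:<20} {gpus} GPUs: "
            f"speedup {speedup(config, gpus):.2f}, "
            f"efficiency {parallel_efficiency(config, gpus):.2f}"
        )
```

The same table also supports comparing configurations at a fixed GPU count, for example dividing the CUDA 9 time by the CUDA 10 time on 8 GPUs.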
  • 16. Machine Learning – TensorFlow – Dockerfile (1/2)
  • 17. Machine Learning – TensorFlow – Dockerfile (2/2)
  • 18. Conclusion
      § User experience on Azure: HPC in the cloud is catching up
          § CycleCloud Slurm Cluster with compute-intensive VMs + 100 Gbps InfiniBand in preview
          § Big machine learning VMs (up to 8 x Tesla V100 NVLINK) in preview
      § Singularity Containers:
          § When the host is similar to the container, we did not observe any overhead
          § HPC partially breaks the portability of containers
              § The container should be compatible with the host infrastructure and the host MPI implementation
      § Updating CUDA drivers (9 to 10) might improve the time to solution
  • 19. Contact and Acknowledgements
      Contact
          ETH Zürich, Andrei Plamadă, Scientific IT Services, Weinbergstrasse 11, 8092 Zürich
      Acknowledgements
          SIS colleagues: Thomas Wüst, Urban Borstnik, Samuel Fux
          EnhanceR colleagues: Alexander Kashev (UniBe)
          Microsoft / Azure: Lukasz Miroslaw, Andy Howard
      EnhanceR Survey - Infrastructure Providers for Container Use: https://forms.gle/JBW78qDPWabd4GDR8