Increasing cluster performance by combining rCUDA with Slurm
Federico Silla
Technical University of Valencia, Spain
Outline
rCUDA … what’s that?
Basics of CUDA
(diagram: the application runs on the CPU and offloads its compute-intensive kernels to a GPU attached locally through PCIe)
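To make the diagram concrete, here is a minimal CUDA program of the kind rCUDA targets (standard CUDA, added here for illustration): the host allocates device memory, copies data over PCIe, launches a kernel and copies the result back.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Trivial kernel: double every element in place.
__global__ void scale(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                            // allocate on the GPU
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // H2D copy
    scale<<<(n + 255) / 256, 256>>>(d, n);                        // kernel launch
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // D2H copy
    printf("h[0] = %f\n", h[0]);                                  // expect 2.0

    cudaFree(d);
    free(h);
    return 0;
}

Each of these runtime calls normally crosses the local PCIe link; rCUDA's premise is that the very same unmodified program can run on a node with no GPU at all.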
rCUDA … remote CUDA
(diagram: the node running the application has no GPU; its CUDA calls are served by a GPU in a remote node)
rCUDA … remote CUDA
A software technology that enables a more flexible use of GPUs in computing facilities.
Basics of rCUDA
(diagrams: rCUDA follows a client-server architecture; a client library on the application node replaces the CUDA libraries and forwards each CUDA call over the network to an rCUDA server daemon running on the node that hosts the GPU)
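As a rough illustration of the client-server idea above, the sketch below mimics how a client library can expose a CUDA-like call, marshal its arguments, ship them to a server and hand the result back. This is a conceptual toy, not rCUDA's actual code: the in-process "server" stands in for the remote daemon, and all names here are invented for illustration.

/* Conceptual sketch of rCUDA-style call forwarding; NOT the real rCUDA code.
   The "network" is simulated in-process so the sketch runs standalone; in
   rCUDA the channel would be a TCP or InfiniBand verbs connection. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

enum { CALL_MALLOC = 1 };                 /* one opcode per forwarded CUDA call */

/* Server side: runs on the node that owns the GPU and executes the real call. */
static uint64_t server_handle(uint32_t op, uint64_t arg) {
    if (op == CALL_MALLOC)                /* stands in for a real cudaMalloc()  */
        return (uint64_t)(uintptr_t)malloc((size_t)arg);
    return 0;
}

/* Client side: exports the same API the application already uses. */
static int my_cudaMalloc(void **devPtr, size_t size) {
    /* 1. marshal opcode + arguments, 2. send to the server, 3. read the reply */
    uint64_t remote = server_handle(CALL_MALLOC, (uint64_t)size);
    *devPtr = (void *)(uintptr_t)remote;  /* opaque handle to remote GPU memory */
    return remote ? 0 : 2;                /* 2 ~ cudaErrorMemoryAllocation      */
}

int main(void) {
    void *d = NULL;
    int err = my_cudaMalloc(&d, 1024);
    printf("err=%d handle=%p\n", err, d);
    free(d);                              /* a real client would forward cudaFree */
    return 0;
}

Because the application simply links against the client library in place of the CUDA libraries, no source changes or recompilation are needed.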
Cluster vision with rCUDA
• rCUDA allows a new vision of a GPU deployment, moving from the usual cluster configuration, in which each node owns the GPUs attached to its PCIe bus (physical configuration), to one in which every GPU in the cluster is logically reachable from every node through the interconnection network (logical configuration).
(diagrams: nodes with CPU, main memory, network interface and PCIe-attached GPUs under the physical configuration, versus the logical configuration in which all GPUs appear connected to all nodes)
Outline
Two questions:
• Why would we need rCUDA?
• rCUDA … a slower CUDA?
Concern with rCUDA
The main concern with rCUDA is the reduced bandwidth to the remote GPU.
Using InfiniBand networks
Initial transfers within rCUDA
(plots: host-to-device and device-to-host bandwidth for pageable and pinned memory, comparing the original and optimized rCUDA implementations over EDR and FDR InfiniBand)
Performance depending on network
• CUDASW++: bioinformatics software for Smith-Waterman protein database searches
(plot: execution time and rCUDA overhead (%) versus sequence length, from 144 to 5478, for CUDA and for rCUDA over FDR InfiniBand, QDR InfiniBand and GbE)
Optimized transfers within rCUDA
(plots: host-to-device and device-to-host bandwidth for pageable and pinned memory; with the optimized implementation, pinned-memory transfers reach almost 100% of the available bandwidth)
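The pageable/pinned split in these plots is standard CUDA behavior: pageable buffers (plain malloc) need an intermediate staging copy, while pinned buffers (cudaMallocHost) can be transferred by DMA at full link speed, which is what the optimized rCUDA path approaches. A minimal timing sketch (plain CUDA, added for illustration):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one H2D copy of `bytes` from host buffer `src` and report GB/s.
static void time_h2d(const char *label, void *src, void *dst, size_t bytes) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%s: %.2f GB/s\n", label, bytes / ms / 1e6);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
}

int main() {
    const size_t bytes = 256 << 20;            // 256 MB
    void *dev, *pageable, *pinned;
    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);                  // ordinary pageable memory
    cudaMallocHost(&pinned, bytes);            // page-locked (pinned) memory

    time_h2d("H2D pageable", pageable, dev, bytes);
    time_h2d("H2D pinned  ", pinned,   dev, bytes);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}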
rCUDA optimizations on applications
• Several applications executed with CUDA and rCUDA
• K20 GPU and FDR InfiniBand
• K40 GPU and EDR InfiniBand
(plot: execution times with CUDA and rCUDA; lower is better)
Outline
rCUDA improves cluster performance
Test bench for studying rCUDA+Slurm
• Dual-socket Intel Xeon E5-2620 v2 nodes + 32 GB RAM + K20 GPU
• FDR InfiniBand based cluster
• Three cluster sizes: 4+1, 8+1 and 16+1 GPU nodes (the additional node runs the Slurm scheduler)
Applications for studying rCUDA+Slurm
• Applications used for tests:
  • GPU-Blast (21 seconds; 1 GPU; 1599 MB)
  • LAMMPS (15 seconds; 4 GPUs; 876 MB)
  • MCUDA-MEME (165 seconds; 4 GPUs; 151 MB)
  • GROMACS (non-GPU; 2 nodes) (167 seconds)
  • NAMD (non-GPU; 4 nodes) (11 minutes)
  • BarraCUDA (10 minutes; 1 GPU; 3319 MB)
  • GPU-LIBSVM (5 minutes; 1 GPU; 145 MB)
  • MUMmerGPU (5 minutes; 1 GPU; 2804 MB)
• Three workloads: Set 1, Set 2, and Set 1 + Set 2 (Set 1 gathers the short-running GPU applications, Set 2 the long-running ones; GROMACS and NAMD do not use GPUs)
Workloads for studying rCUDA+Slurm (I)
(figure)
Performance of rCUDA+Slurm (I)
(figure)
Workloads for studying rCUDA+Slurm (II)
(figure)
Performance of rCUDA+Slurm (II)
(figure)
Outline
Why does rCUDA improve cluster performance?
1st reason for improved performance
(diagram: nodes 1 to n, each with CPUs, RAM, a network interface and a PCIe-attached GPU, joined by the interconnection network)
• Non-accelerated applications keep GPUs idle in the nodes where they use all the cores
• Hybrid MPI + shared-memory non-accelerated applications usually span all the cores in a node (across n nodes)
• A CPU-only application spreading over these nodes will make their GPUs unavailable for accelerated applications
2nd reason for improved performance (I)
(diagram: same cluster)
• Accelerated applications keep CPUs idle in the nodes where they execute
• An accelerated application using just one CPU core may prevent other jobs from being dispatched to its node
2nd reason for improved performance (II)
(diagram: same cluster)
• Accelerated applications keep CPUs idle in the nodes where they execute
• An accelerated MPI application using just one CPU core per node may keep part of the cluster busy
3rd reason for improved performance
• Do applications completely squeeze the GPUs available in the cluster?
• When a GPU is assigned to an application, the computational resources inside the GPU may not be fully used:
  • Application presenting a low level of parallelism
  • CPU code being executed (GPU assigned ≠ GPU working)
  • GPU cores stalled due to lack of data
  • etc.
(diagram: same cluster)
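The "GPU assigned ≠ GPU working" gap shown in the next slides can be observed directly by sampling utilization with NVML while a job holds its allocation. A minimal sampler, added here for illustration (assumes the NVML headers are available and -lnvidia-ml at link time):

#include <cstdio>
#include <unistd.h>
#include <nvml.h>

// Sample the utilization of GPU 0 once per second.
// A job may hold the allocation while these numbers sit near zero.
int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    for (int s = 0; s < 60; ++s) {
        nvmlUtilization_t util;
        nvmlDeviceGetUtilizationRates(dev, &util);
        printf("t=%2ds  gpu=%3u%%  mem=%3u%%\n", s, util.gpu, util.memory);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}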
GPU usage of GPU-Blast
(trace: GPU utilization over time; during large parts of the execution the GPU is assigned but not used)
GPU usage of CUDA-MEME
(trace: GPU utilization stays far from its maximum throughout the execution)
GPU usage of LAMMPS
(trace: intervals where the GPU is assigned but not used)
GPU allocation vs GPU utilization
(plot: GPUs are assigned but not used for much of their allocation time)
Sharing a GPU among jobs: GPU-Blast
(traces: two concurrent instances of GPU-Blast sharing one GPU; one instance alone required about 51 seconds; the first and second instances interleave their use of the GPU)
Sharing a GPU among jobs
GPU memory footprint of each application on a K20 GPU:
• LAMMPS: 876 MB
• mCUDA-MEME: 151 MB
• BarraCUDA: 3319 MB
• MUMmerGPU: 2104 MB
• GPU-LIBSVM: 145 MB
Several of these footprints fit together in the 5 GB of a K20, so the corresponding jobs can share the GPU.
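Whether jobs can safely share a GPU is largely a memory-capacity question: the sum of their footprints must fit in device memory. A sketch of that check using the standard cudaMemGetInfo call (the footprint numbers come from the slide above; the helper is hypothetical):

#include <cstdio>
#include <cuda_runtime.h>

// Report free/total memory on device 0 and decide whether a job
// with the given footprint would still fit alongside current tenants.
static bool job_fits(size_t footprint_mb) {
    size_t free_b = 0, total_b = 0;
    cudaSetDevice(0);
    cudaMemGetInfo(&free_b, &total_b);
    printf("GPU 0: %zu MB free of %zu MB\n", free_b >> 20, total_b >> 20);
    return (footprint_mb << 20) <= free_b;
}

int main() {
    // Footprints from the slide: e.g. mCUDA-MEME (151 MB) and LAMMPS (876 MB)
    // together need about 1 GB, well within a 5 GB K20.
    printf("mCUDA-MEME fits: %s\n", job_fits(151) ? "yes" : "no");
    printf("LAMMPS fits:     %s\n", job_fits(876) ? "yes" : "no");
    return 0;
}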
Outline
Other reasons for using rCUDA?
Cheaper cluster upgrade
• Let's suppose that a cluster without GPUs needs to be upgraded to use GPUs
• GPUs require large power supplies: are the power supplies already installed in the nodes large enough?
• GPUs require large amounts of space: does the current form factor of the nodes allow GPUs to be installed?
• The answer to both questions is usually "NO"
Cheaper cluster upgrade
• Approach 1: augment the cluster with some CUDA GPU-enabled nodes → only those GPU-enabled nodes can execute accelerated applications
Cheaper cluster upgrade
• Approach 2: augment the cluster with some rCUDA servers → all nodes can execute accelerated applications
Cheaper cluster upgrade
• Dual-socket Intel Xeon E5-2620 v2 nodes + 32 GB RAM + K20 GPU
• FDR InfiniBand based cluster
• 16 nodes without GPU + 1 node with 4 GPUs
More workloads for studying rCUDA+Slurm
(figure)
Performance
(figure: annotated results of -68% and -60%, -63% and -56%, and +131% and +119%)
Outline
Additional reasons for using rCUDA?
#1: More GPUs for a single application
With rCUDA, a single application can use all the GPUs in the cluster: 64 GPUs!
#1: More GPUs for a single application
• MonteCarlo Multi-GPU (from the NVIDIA samples)
• Test bench: FDR InfiniBand + NVIDIA Tesla K20
(plots: execution time, lower is better, and throughput, higher is better, as GPUs are added)
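The MonteCarlo sample scales by iterating over devices with cudaSetDevice, and nothing in that pattern requires the GPUs to be local, which is exactly what rCUDA exploits to reach 64 of them. A stripped-down sketch of the pattern (not the NVIDIA sample itself):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *out) {            // stand-in for a MonteCarlo batch
    out[threadIdx.x] = threadIdx.x * 0.5f;
}

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);               // with rCUDA this can report
    printf("using %d GPUs\n", ngpus);         // every GPU in the cluster

    float *buf[64] = {0};
    for (int d = 0; d < ngpus && d < 64; ++d) {
        cudaSetDevice(d);                     // select (possibly remote) GPU d
        cudaMalloc(&buf[d], 256 * sizeof(float));
        work<<<1, 256>>>(buf[d]);             // async launches run concurrently
    }
    for (int d = 0; d < ngpus && d < 64; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();              // wait for each device's batch
        cudaFree(buf[d]);
    }
    return 0;
}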
#2: Virtual machines can share GPUs
• With PCI passthrough, the GPU is assigned exclusively to a single virtual machine
• Concurrent usage of the GPU is not possible
#2: Virtual machines can share GPUs
(diagrams: with rCUDA, the virtual machines in a box share remote GPU servers, both when a high performance network is available and when only a low performance network is available)
#3: GPU task migration
• Box A has 4 GPUs but only one is busy
• Box B has 8 GPUs but only two are busy
1. Move the jobs from Box B to Box A and switch off Box B
2. Migration should be transparent to applications (decided by the global scheduler)
Migration is performed at GPU granularity
#3: GPU task migration
Job granularity instead of GPU granularity
(figure: migration timeline)
Outline
… in summary …
Pros and cons of rCUDA
• Pros:
  1. Many GPUs for a single application
  2. Concurrent GPU access for virtual machines
  3. Increased cluster throughput
  4. Similar performance with a smaller investment
  5. Easier (cheaper) cluster upgrade
  6. Migration of GPU jobs
  7. Reduced energy consumption
  8. Increased GPU utilization
• Cons:
  1. Reduced bandwidth to the remote GPU (really a concern??)
Get a free copy of rCUDA at http://www.rcuda.net
@rcuda_
More than 650 requests worldwide
rCUDA is a development by Technical University of Valencia
Thanks!
Questions?