1. Sharing GPUs across a Cluster, also known as "The GPU as a Service"
2. Outline
   • Introduction (slide 3)
   • Facing the problems in GPGPU (slide 13)
   • rCUDA functionality (slide 22)
   • New rCUDA version (slide 29)
   • Getting rCUDA (slide 42)
   As this presentation contains a lot of information, readers can jump directly to the section that interests them by using the slide numbers above.
3. Improving application performance
   • The complexity of current applications makes their execution times extremely high
   • There is a trend toward accelerating parts of their code by using GPUs
4. GPU computing: the building block
   • The basic building block is a node with one or more GPUs
   [Figure: two node layouts; in each, a CPU with main memory and a network interface connects over PCIe to one or more GPUs, each with its own dedicated memory]
5. GPU computing: programmer view
   From the programming point of view:
   • A set of nodes, each one with:
     − one or more CPUs (several cores per CPU)
     − one or more GPUs (1-4)
   • An interconnection network
   [Figure: six nodes, each with a CPU, main memory, a network interface, and PCIe-attached GPUs with their own memory, joined by an interconnection network]
6. Not all kinds of code are eligible for GPUs
   • For the right kind of code, the use of GPUs brings huge benefits in terms of performance and energy
   • There must be data parallelism in the code: this is the only way to benefit from the hundreds of processors inside a GPU (a minimal kernel example follows this slide)
   • We can find different scenarios from the point of view of the application:
     − Low level of data parallelism
     − High level of data parallelism
     − Moderate level of data parallelism
     − Applications for multi-GPU computing
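To make the data-parallelism requirement concrete, the classic element-wise vector addition maps one GPU thread to one output element. This is a minimal illustrative sketch, not part of the original deck; the kernel and variable names (vecAdd, n, etc.) are our own.

    /* Minimal data-parallel CUDA kernel: each thread computes one element. */
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
        if (i < n)                                      /* guard against overrun */
            c[i] = a[i] + b[i];
    }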
7. Low and high levels of data parallelism
   • Low level of data parallelism. Regarding GPU computing? No GPU is needed; just proceed with the traditional HPC strategies
   • High level of data parallelism. Regarding GPU computing? Add one or more GPUs to every node in the system and rewrite the applications to use them
8. Moderate level of data parallelism
   • The application presents a data parallelism of around 40%-80%. This is the common case
   • Regarding GPU computing? The GPUs in the system are used only for some parts of the application, remaining idle the rest of the time
9. Money leaks in current clusters
   • The GPUs of a CUDA-enabled cluster may be idle for long periods of time
   • This wastes resources and energy
   • The total cost of ownership (TCO) is no longer dominated by acquisition costs: the electricity bill and rack space contribute more and more
10. Last scenario: multi-GPU computing
    • An application can use a large number of GPUs in parallel
    • Regarding GPU computing? The code running in a node can only access the GPUs in that node, but it would run faster if it could access more GPUs
11. GPU computing presents drawbacks
    • Although GPUs effectively accelerate applications, their use may bring additional concerns
12. Outline
    • Introduction (slide 3)
    • Facing the problems in GPGPU (slide 13)
    • rCUDA functionality (slide 22)
    • New rCUDA version (slide 29)
    • Getting rCUDA (slide 42)
13. Looking for an efficient solution
    • A way of avoiding the low-GPU-utilization inefficiency is to reduce the number of GPUs and share the remaining ones among the CPU nodes in the cluster
    • This would increase GPU utilization while also reducing power consumption
14. Saving costs by doing better
    • Doing better by spending less money on GPUs in the initial investment, therefore reducing TCO
    • Doing better by deploying rCUDA on your new cost-effective cluster
15. rCUDA adds value
    • rCUDA allows sharing GPUs among the nodes in the cluster
    • rCUDA allows having fewer GPUs than nodes in the cluster
    • rCUDA provides remote access from each node to any GPU in the system
    • rCUDA reduces costs without noticeably reducing performance
16. The main idea behind rCUDA
    • Add only the GPU computing nodes that give you the necessary computational power!
    • Make all the GPUs accessible from every node
17. rCUDA also extends CUDA's possibilities
    • rCUDA allows providing all the GPUs in the cluster to a single application
    • Current limitations in the number of GPUs per node are avoided
    • Useful for multi-GPU computing: now the only limit is the programmer's ability to accelerate her/his application
18. rCUDA for multi-GPU computing
    • All GPUs available to every node
    • Currently, from a given CPU it is only possible to access the GPUs in that very same node
    [Figure: five nodes, each with a CPU, main memory, and two PCIe-attached GPUs, joined by an interconnection network; no cross-node GPU access]
19. rCUDA for multi-GPU computing
    • All GPUs available to every node
    • rCUDA makes all GPUs accessible from every node
    [Figure: the same five nodes, now with logical connections from each CPU to the GPUs in remote nodes]
20. rCUDA for multi-GPU computing
    • All GPUs available to every node
    • rCUDA makes all GPUs accessible from every node and enables a CPU to access as many GPUs as required
    [Figure: the same five nodes, with one CPU holding logical connections to all ten GPUs in the cluster]
21. Outline
    • Introduction (slide 3)
    • Facing the problems in GPGPU (slide 13)
    • rCUDA functionality (slide 22)
    • New rCUDA version (slide 29)
    • Getting rCUDA (slide 42)
22. How rCUDA works
    • rCUDA is a middleware that enables seamless remote CUDA usage
    • rCUDA clusters are equipped with:
      − The rCUDA client at every node
      − The rCUDA server only in those nodes having a GPU
    • Client-server communication:
      − General TCP/IP communications
      − Or, alternatively, a highly efficient low-level communications library for InfiniBand networks
23. Seamless usage
    • The usual way to use GPUs currently:
    [Diagram: Application → CUDA library]
24. Seamless usage
    • rCUDA leverages a client and a server
    [Diagram: Client side (Application) | Server side (CUDA library)]
25. Seamless usage
    • rCUDA leverages a client and a server
    [Diagram: Client side: Application → rCUDA library → network interface; Server side: network interface → rCUDA daemon → CUDA library]
26. Seamless usage
    • rCUDA leverages a client and a server
    [Diagram: the same stack as slide 25, with a request flowing from client to server and a response flowing back]
27. rCUDA uses a proprietary communication protocol
    Example (the sketch below shows the CUDA calls that drive these steps):
    1) Initialization
    2) Memory allocation on the remote GPU
    3) CPU-to-GPU memory transfer of the input data
    4) Kernel execution
    5) GPU-to-CPU memory transfer of the results
    6) GPU memory release
    7) Communication channel closing and server process finalization
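For concreteness, here is a minimal sketch of the CUDA calls an unmodified application would issue; under rCUDA, each call is intercepted by the rCUDA library and forwarded to the server following the seven protocol steps above. The kernel, sizes, and names are our own illustrative assumptions, not part of the deck.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void scale(float *v, float f, int n)  /* illustrative kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes), *d;
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        /* Step 1 (initialization) happens implicitly on the first CUDA call. */
        cudaMalloc((void **)&d, bytes);                  /* step 2: allocation on the (remote) GPU */
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); /* step 3: CPU-to-GPU transfer of input  */
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);     /* step 4: kernel execution              */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); /* step 5: GPU-to-CPU transfer of results*/
        cudaFree(d);                                     /* step 6: GPU memory release            */
        cudaDeviceReset();                               /* step 7: channel closing/finalization  */

        printf("h[0] = %f\n", h[0]);                     /* expect 2.0 */
        free(h);
        return 0;
    }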
28. Outline
    • Introduction (slide 3)
    • Facing the problems in GPGPU (slide 13)
    • rCUDA functionality (slide 22)
    • New rCUDA version (slide 29)
    • Getting rCUDA (slide 42)
29. Features in the new rCUDA version
    • CUDA 5 support
    • Efficient InfiniBand support
    • Support for the CUDA extensions to C
    • Multithread support
    • Support for providing a single application with multiple GPUs across the cluster
30. New InfiniBand support
    • Why InfiniBand support? InfiniBand is the most widely used HPC network
      − Low latency and high bandwidth
    [Chart: interconnect share in the TOP500 list]
31. New InfiniBand support
    • Use of IB Verbs
      − All TCP/IP stack overhead is avoided
      − Bandwidth between the client and the remote GPU is near the peak InfiniBand network bandwidth
    • Use of GPUDirect
      − Reduces the number of intra-node data movements
    • Use of pipelined transfers (sketched below)
      − Overlaps intra-node data movements with network transfers
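The pipelining idea can be sketched with plain CUDA streams: split a large buffer into chunks so that staging chunk k+1 into pinned memory overlaps the in-flight DMA of chunk k. This is our own illustrative approximation of the technique, not rCUDA's internal code; rCUDA applies the same idea between network transfers and intra-node movements.

    #include <cuda_runtime.h>
    #include <string.h>

    /* Double-buffered pipelined host-to-GPU upload: while one chunk is being
     * DMA-transferred, the next chunk is staged into the other pinned buffer. */
    void pipelined_upload(const char *src, char *dst_dev, size_t total, size_t chunk)
    {
        char *pinned[2];
        cudaStream_t stream[2];
        for (int b = 0; b < 2; b++) {
            cudaMallocHost((void **)&pinned[b], chunk);  /* pinned memory enables async DMA */
            cudaStreamCreate(&stream[b]);
        }

        size_t off = 0;
        for (int k = 0; off < total; off += chunk, k++) {
            int b = k & 1;                       /* alternate between the two buffers */
            size_t len = (total - off < chunk) ? total - off : chunk;
            cudaStreamSynchronize(stream[b]);    /* wait until buffer b is reusable */
            memcpy(pinned[b], src + off, len);   /* staging overlaps the other stream's DMA */
            cudaMemcpyAsync(dst_dev + off, pinned[b], len,
                            cudaMemcpyHostToDevice, stream[b]);
        }

        for (int b = 0; b < 2; b++) {            /* drain and clean up */
            cudaStreamSynchronize(stream[b]);
            cudaStreamDestroy(stream[b]);
            cudaFreeHost(pinned[b]);
        }
    }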
32. Performance example for InfiniBand
    [Chart: remote GPU bandwidth in MB/s for synchronous transfers, to GPU and from GPU, comparing rCUDA over GigaE (1 Gbps Ethernet), rCUDA over IPoIB (IP over InfiniBand), rCUDA over IB Verbs (low-level InfiniBand library), and a local GPU; the IB Verbs version, whose internal algorithm makes use of pinned memory, approaches both the maximum InfiniBand QDR bandwidth and the maximum bandwidth of the NVIDIA Tesla C2050]
33. Performance example for InfiniBand
    [Chart: execution time in seconds of a 4096 x 4096 matrix-matrix product on an Intel Xeon E5645 CPU with MKL (100%), a local NVIDIA GeForce 9800 GTX with CUBLAS 3.2 (about 48%), and a remote GPU via rCUDA over 40G InfiniBand (about 50%)]
    1. Local GPU computation is much faster than the CPU
    2. Using a remote GPU through rCUDA is only slightly slower than a local GPU
    3. Therefore, employing a remote GPU is much faster than a local CPU
34. Performance example for InfiniBand
    [Chart: rCUDA overhead versus a local GPU for a matrix-matrix product as a function of matrix dimension (100 to 14600), on a Tesla C2050, an Intel Xeon E5645, and QDR InfiniBand; the overhead stays below 1%]
35. Performance example for InfiniBand
    [Chart: execution time for the LAMMPS application with the in.eam input script scaled by a factor of 5 in the three dimensions, on a Tesla C2050, an Intel Xeon E5520, and QDR InfiniBand]
36. Support for the CUDA extensions to C
    • Previously, rCUDA did not support the CUDA extensions to C
    • In order to execute a program with rCUDA, the CUDA extensions included in its code had to be "unextended" into the plain C API, because NVCC inserts calls to undocumented CUDA functions (see the example below)
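To make the "unextending" concrete: the triple-chevron launch is a CUDA extension to C that NVCC lowers to undocumented runtime calls, whereas the driver-API form expresses the same launch through documented plain C functions. This is our own illustration of the general idea; the deck does not specify which rewriting rCUDA users applied, and the names below (launch_scale, module, etc.) are hypothetical.

    #include <cuda.h>

    /* Plain C (driver API) version of a kernel launch: documented calls only.
     * Assumes 'module' was loaded beforehand with cuModuleLoad from compiled
     * PTX/cubin, and 'd' is a device buffer allocated with cuMemAlloc. */
    void launch_scale(CUmodule module, CUdeviceptr d, float factor, int n)
    {
        CUfunction fn;
        cuModuleGetFunction(&fn, module, "scale");   /* look the kernel up by name */
        void *args[] = { &d, &factor, &n };          /* kernel parameter pointers  */
        cuLaunchKernel(fn,
                       (n + 255) / 256, 1, 1,        /* grid dimensions  */
                       256, 1, 1,                    /* block dimensions */
                       0, NULL,                      /* shared memory, default stream */
                       args, NULL);
    }

    /* The CUDA-extension form of the same launch, which NVCC lowers to
     * undocumented runtime calls, would simply be:
     *     scale<<<(n + 255) / 256, 256>>>(d, factor, n);
     */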
37. Support for the CUDA extensions to C
    • The new rCUDA version to be released will support the CUDA extensions to C
    • The exact way we have achieved this goal will not be disclosed in this document. We ask for some patience ...
38. Multithread support
    • The new rCUDA version supports applications with multiple threads
    • All the threads of the application can access the remote GPU concurrently, in the same way as if the GPU were installed in the node executing the application
39. Multi-GPU support for a single application
    • The new rCUDA version is able to provide a single application with all the GPUs in the cluster
    • Accelerating an application no longer depends on the number of GPUs that fit into a node
40. Multi-GPU multithreaded support
    • The new rCUDA features can be granted to a single application, so that each thread of the application can access as many GPUs as it requires (see the sketch below)
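A minimal sketch of the programming model this enables: each host thread selects a device with cudaSetDevice and works independently. Under rCUDA the devices an application sees may live in different cluster nodes, but the code is the standard multi-GPU CUDA pattern; the thread structure and names below are our own assumptions.

    #include <cuda_runtime.h>
    #include <pthread.h>
    #include <stdio.h>

    /* Each worker thread binds to one device; with rCUDA those devices may be
     * GPUs in remote nodes, yet the application code is unchanged. */
    static void *worker(void *arg)
    {
        int dev = (int)(long)arg;
        cudaSetDevice(dev);                       /* select this thread's (remote) GPU */
        float *buf;
        cudaMalloc((void **)&buf, 1 << 20);       /* allocation lands on that GPU */
        /* ... launch kernels, copy data, etc., exactly as with a local GPU ... */
        cudaFree(buf);
        return NULL;
    }

    int main(void)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);                /* the GPUs visible to this node */
        pthread_t tid[16];
        for (int d = 0; d < ndev && d < 16; d++)
            pthread_create(&tid[d], NULL, worker, (void *)(long)d);
        for (int d = 0; d < ndev && d < 16; d++)
            pthread_join(tid[d], NULL);
        printf("used %d GPUs\n", ndev);
        return 0;
    }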
41. Outline
    • Introduction (slide 3)
    • Facing the problems in GPGPU (slide 13)
    • rCUDA functionality (slide 22)
    • New rCUDA version (slide 29)
    • Getting rCUDA (slide 42)
42. Getting rCUDA
    • The full InfiniBand version is freely available:
      − Enhanced client-server data transfers
      − High-performance InfiniBand communications library
      − TCP/IP-based communications also included for non-InfiniBand networks
    • http://www.rcuda.net
