1. Sharing GPUs across a Cluster, also known as "The GPU as a Service"
2. Outline
   • Introduction (slide 3)
   • Facing the problems in GPGPU (slide 13)
   • rCUDA functionality (slide 22)
   • New rCUDA version (slide 29)
   • Getting rCUDA (slide 42)
   As this presentation contains a lot of information, readers can jump directly to the section that interests them by using the slide numbers above.
3. Improving application performance
   • The complexity of current applications makes their execution times extremely high
   • There is a trend toward accelerating parts of their code by using GPUs
4. GPU computing: the building block
   • The basic building block is a node with one or more GPUs
   [Figure: two node layouts; in each, a CPU with main memory and a network interface connects over PCIe to one or more GPUs, each with its own dedicated memory]
5. GPU computing: programmer view
   From the programming point of view:
   • A set of nodes, each one with:
     − one or more CPUs (several cores per CPU)
     − one or more GPUs (1-4)
   • An interconnection network
   [Figure: six nodes, each with a CPU, main memory, a network interface, and PCIe-attached GPUs with their own memory, joined by an interconnection network]
6. Not all kinds of code are eligible for GPUs
   • For the right kind of code, the use of GPUs brings huge benefits in terms of performance and energy
   • There must be data parallelism in the code: this is the only way to benefit from the hundreds of processors inside a GPU (a minimal kernel example follows this slide)
   • We can find different scenarios from the point of view of the application:
     − Low level of data parallelism
     − High level of data parallelism
     − Moderate level of data parallelism
     − Applications for multi-GPU computing
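To make the data-parallelism requirement concrete, the classic element-wise vector addition maps one GPU thread to one output element. This is a minimal illustrative sketch, not part of the original deck; the kernel and variable names (vecAdd, n, etc.) are our own.

    /* Minimal data-parallel CUDA kernel: each thread computes one element. */
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
        if (i < n)                                      /* guard against overrun */
            c[i] = a[i] + b[i];
    }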
7. Low and high levels of data parallelism
   • Low level of data parallelism. Regarding GPU computing? No GPU is needed; just proceed with the traditional HPC strategies
   • High level of data parallelism. Regarding GPU computing? Add one or more GPUs to every node in the system and rewrite the applications to use them
8. Moderate level of data parallelism
   • The application presents a data parallelism of around 40%-80%. This is the common case
   • Regarding GPU computing? The GPUs in the system are used only for some parts of the application, remaining idle the rest of the time
9. Money leaks in current clusters
   • The GPUs of a CUDA-enabled cluster may be idle for long periods of time
   • This wastes resources and energy
   • The total cost of ownership (TCO) is no longer dominated by acquisition costs: the electricity bill and rack space contribute more and more
10. Last scenario: multi-GPU computing
    • An application can use a large number of GPUs in parallel
    • Regarding GPU computing? The code running in a node can only access the GPUs in that node, but it would run faster if it could access more GPUs
11. GPU computing presents drawbacks
    • Although GPUs effectively accelerate applications, their use may bring additional concerns
12. Outline
    • Introduction (slide 3)
    • Facing the problems in GPGPU (slide 13)
    • rCUDA functionality (slide 22)
    • New rCUDA version (slide 29)
    • Getting rCUDA (slide 42)
13. Looking for an efficient solution
    • A way of avoiding the low-GPU-utilization inefficiency is to reduce the number of GPUs and share the remaining ones among the CPU nodes in the cluster
    • This would increase GPU utilization while also reducing power consumption
14. Saving costs by doing better
    • Doing better by spending less money on GPUs in the initial investment, therefore reducing TCO
    • Doing better by deploying rCUDA on your new cost-effective cluster
15. rCUDA adds value
    • rCUDA allows sharing GPUs among the nodes in the cluster
    • rCUDA allows having fewer GPUs than nodes in the cluster
    • rCUDA provides remote access from each node to any GPU in the system
    • rCUDA reduces costs without noticeably reducing performance
16. The main idea behind rCUDA
    • Add only the GPU computing nodes that give you the necessary computational power!
    • Make all the GPUs accessible from every node
17. rCUDA also extends CUDA's possibilities
    • rCUDA allows providing all the GPUs in the cluster to a single application
    • Current limitations in the number of GPUs per node are avoided
    • Useful for multi-GPU computing: now the only limit is the programmer's ability to accelerate her/his application
18. rCUDA for multi-GPU computing
    • All GPUs available to every node
    • Currently, from a given CPU it is only possible to access the GPUs in that very same node
    [Figure: five nodes, each with a CPU, main memory, and two PCIe-attached GPUs, joined by an interconnection network; no cross-node GPU access]
19. rCUDA for multi-GPU computing
    • All GPUs available to every node
    • rCUDA makes all GPUs accessible from every node
    [Figure: the same five nodes, now with logical connections from each CPU to the GPUs in remote nodes]
20. rCUDA for multi-GPU computing
    • All GPUs available to every node
    • rCUDA makes all GPUs accessible from every node and enables a CPU to access as many GPUs as required
    [Figure: the same five nodes, with one CPU holding logical connections to all ten GPUs in the cluster]
21. Outline
    • Introduction (slide 3)
    • Facing the problems in GPGPU (slide 13)
    • rCUDA functionality (slide 22)
    • New rCUDA version (slide 29)
    • Getting rCUDA (slide 42)
22. How rCUDA works
    • rCUDA is a middleware that enables seamless remote CUDA usage
    • rCUDA clusters are equipped with:
      − The rCUDA client at every node
      − The rCUDA server only in those nodes having a GPU
    • Client-server communication:
      − General TCP/IP communications
      − Or, alternatively, a highly efficient low-level communications library for InfiniBand networks
23. Seamless usage
    • The usual way to use GPUs currently:
    [Diagram: Application → CUDA library]
24. Seamless usage
    • rCUDA leverages a client and a server
    [Diagram: Client side (Application) | Server side (CUDA library)]
25. Seamless usage
    • rCUDA leverages a client and a server
    [Diagram: Client side: Application → rCUDA library → network interface; Server side: network interface → rCUDA daemon → CUDA library]
26. Seamless usage
    • rCUDA leverages a client and a server
    [Diagram: the same stack as slide 25, with a request flowing from client to server and a response flowing back]
27. rCUDA uses a proprietary communication protocol
    Example (the sketch below shows the CUDA calls that drive these steps):
    1) Initialization
    2) Memory allocation on the remote GPU
    3) CPU-to-GPU memory transfer of the input data
    4) Kernel execution
    5) GPU-to-CPU memory transfer of the results
    6) GPU memory release
    7) Communication channel closing and server process finalization
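For concreteness, here is a minimal sketch of the CUDA calls an unmodified application would issue; under rCUDA, each call is intercepted by the rCUDA library and forwarded to the server following the seven protocol steps above. The kernel, sizes, and names are our own illustrative assumptions, not part of the deck.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void scale(float *v, float f, int n)  /* illustrative kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes), *d;
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        /* Step 1 (initialization) happens implicitly on the first CUDA call. */
        cudaMalloc((void **)&d, bytes);                  /* step 2: allocation on the (remote) GPU */
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); /* step 3: CPU-to-GPU transfer of input  */
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);     /* step 4: kernel execution              */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); /* step 5: GPU-to-CPU transfer of results*/
        cudaFree(d);                                     /* step 6: GPU memory release            */
        cudaDeviceReset();                               /* step 7: channel closing/finalization  */

        printf("h[0] = %f\n", h[0]);                     /* expect 2.0 */
        free(h);
        return 0;
    }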
28. Outline
    • Introduction (slide 3)
    • Facing the problems in GPGPU (slide 13)
    • rCUDA functionality (slide 22)
    • New rCUDA version (slide 29)
    • Getting rCUDA (slide 42)
29. Features in the new rCUDA version
    • CUDA 5 support
    • Efficient InfiniBand support
    • Support for the CUDA extensions to C
    • Multithread support
    • Support for providing a single application with multiple GPUs across the cluster
30. New InfiniBand support
    • Why InfiniBand support? InfiniBand is the most widely used HPC network
      − Low latency and high bandwidth
    [Chart: interconnect share in the TOP500 list]
31. New InfiniBand support
    • Use of IB Verbs
      − All TCP/IP stack overhead is avoided
      − Bandwidth between the client and the remote GPU is near the peak InfiniBand network bandwidth
    • Use of GPUDirect
      − Reduces the number of intra-node data movements
    • Use of pipelined transfers (sketched below)
      − Overlaps intra-node data movements with network transfers
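The pipelining idea can be sketched with plain CUDA streams: split a large buffer into chunks so that staging chunk k+1 into pinned memory overlaps the in-flight DMA of chunk k. This is our own illustrative approximation of the technique, not rCUDA's internal code; rCUDA applies the same idea between network transfers and intra-node movements.

    #include <cuda_runtime.h>
    #include <string.h>

    /* Double-buffered pipelined host-to-GPU upload: while one chunk is being
     * DMA-transferred, the next chunk is staged into the other pinned buffer. */
    void pipelined_upload(const char *src, char *dst_dev, size_t total, size_t chunk)
    {
        char *pinned[2];
        cudaStream_t stream[2];
        for (int b = 0; b < 2; b++) {
            cudaMallocHost((void **)&pinned[b], chunk);  /* pinned memory enables async DMA */
            cudaStreamCreate(&stream[b]);
        }

        size_t off = 0;
        for (int k = 0; off < total; off += chunk, k++) {
            int b = k & 1;                       /* alternate between the two buffers */
            size_t len = (total - off < chunk) ? total - off : chunk;
            cudaStreamSynchronize(stream[b]);    /* wait until buffer b is reusable */
            memcpy(pinned[b], src + off, len);   /* staging overlaps the other stream's DMA */
            cudaMemcpyAsync(dst_dev + off, pinned[b], len,
                            cudaMemcpyHostToDevice, stream[b]);
        }

        for (int b = 0; b < 2; b++) {            /* drain and clean up */
            cudaStreamSynchronize(stream[b]);
            cudaStreamDestroy(stream[b]);
            cudaFreeHost(pinned[b]);
        }
    }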
32. Performance example for InfiniBand
    [Chart: remote GPU bandwidth in MB/s for synchronous transfers, to GPU and from GPU, comparing rCUDA over GigaE (1 Gbps Ethernet), rCUDA over IPoIB (IP over InfiniBand), rCUDA over IB Verbs (low-level InfiniBand library), and a local GPU; the IB Verbs version, whose internal algorithm makes use of pinned memory, approaches both the maximum InfiniBand QDR bandwidth and the maximum bandwidth of the NVIDIA Tesla C2050]
33. Performance example for InfiniBand
    [Chart: execution time in seconds of a 4096 x 4096 matrix-matrix product on an Intel Xeon E5645 CPU with MKL (100%), a local NVIDIA GeForce 9800 GTX with CUBLAS 3.2 (about 48%), and a remote GPU via rCUDA over 40G InfiniBand (about 50%)]
    1. Local GPU computation is much faster than the CPU
    2. Using a remote GPU through rCUDA is only slightly slower than a local GPU
    3. Therefore, employing a remote GPU is much faster than a local CPU
34. Performance example for InfiniBand
    [Chart: rCUDA overhead versus a local GPU for a matrix-matrix product as a function of matrix dimension (100 to 14600), on a Tesla C2050, an Intel Xeon E5645, and QDR InfiniBand; the overhead stays below 1%]
35. Performance example for InfiniBand
    [Chart: execution time for the LAMMPS application with the in.eam input script scaled by a factor of 5 in the three dimensions, on a Tesla C2050, an Intel Xeon E5520, and QDR InfiniBand]
36. Support for the CUDA extensions to C
    • Previously, rCUDA did not support the CUDA extensions to C
    • In order to execute a program with rCUDA, the CUDA extensions included in its code had to be "unextended" into the plain C API, because NVCC inserts calls to undocumented CUDA functions (see the example below)
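To make the "unextending" concrete: the triple-chevron launch is a CUDA extension to C that NVCC lowers to undocumented runtime calls, whereas the driver-API form expresses the same launch through documented plain C functions. This is our own illustration of the general idea; the deck does not specify which rewriting rCUDA users applied, and the names below (launch_scale, module, etc.) are hypothetical.

    #include <cuda.h>

    /* Plain C (driver API) version of a kernel launch: documented calls only.
     * Assumes 'module' was loaded beforehand with cuModuleLoad from compiled
     * PTX/cubin, and 'd' is a device buffer allocated with cuMemAlloc. */
    void launch_scale(CUmodule module, CUdeviceptr d, float factor, int n)
    {
        CUfunction fn;
        cuModuleGetFunction(&fn, module, "scale");   /* look the kernel up by name */
        void *args[] = { &d, &factor, &n };          /* kernel parameter pointers  */
        cuLaunchKernel(fn,
                       (n + 255) / 256, 1, 1,        /* grid dimensions  */
                       256, 1, 1,                    /* block dimensions */
                       0, NULL,                      /* shared memory, default stream */
                       args, NULL);
    }

    /* The CUDA-extension form of the same launch, which NVCC lowers to
     * undocumented runtime calls, would simply be:
     *     scale<<<(n + 255) / 256, 256>>>(d, factor, n);
     */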
37. Support for the CUDA extensions to C
    • The new rCUDA version to be released will support the CUDA extensions to C
    • The exact way we have achieved this goal will not be disclosed in this document. We ask for some patience ...
38. Multithread support
    • The new rCUDA version supports applications with multiple threads
    • All the threads of the application can access the remote GPU concurrently, in the same way as if the GPU were installed in the node executing the application
39. Multi-GPU support for a single application
    • The new rCUDA version is able to provide a single application with all the GPUs in the cluster
    • Accelerating an application no longer depends on the number of GPUs that fit into a node
40. Multi-GPU multithreaded support
    • The new rCUDA features can be granted to a single application, so that each thread of the application can access as many GPUs as it requires (see the sketch below)
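A minimal sketch of the programming model this enables: each host thread selects a device with cudaSetDevice and works independently. Under rCUDA the devices an application sees may live in different cluster nodes, but the code is the standard multi-GPU CUDA pattern; the thread structure and names below are our own assumptions.

    #include <cuda_runtime.h>
    #include <pthread.h>
    #include <stdio.h>

    /* Each worker thread binds to one device; with rCUDA those devices may be
     * GPUs in remote nodes, yet the application code is unchanged. */
    static void *worker(void *arg)
    {
        int dev = (int)(long)arg;
        cudaSetDevice(dev);                       /* select this thread's (remote) GPU */
        float *buf;
        cudaMalloc((void **)&buf, 1 << 20);       /* allocation lands on that GPU */
        /* ... launch kernels, copy data, etc., exactly as with a local GPU ... */
        cudaFree(buf);
        return NULL;
    }

    int main(void)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);                /* the GPUs visible to this node */
        pthread_t tid[16];
        for (int d = 0; d < ndev && d < 16; d++)
            pthread_create(&tid[d], NULL, worker, (void *)(long)d);
        for (int d = 0; d < ndev && d < 16; d++)
            pthread_join(tid[d], NULL);
        printf("used %d GPUs\n", ndev);
        return 0;
    }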
41. Outline
    • Introduction (slide 3)
    • Facing the problems in GPGPU (slide 13)
    • rCUDA functionality (slide 22)
    • New rCUDA version (slide 29)
    • Getting rCUDA (slide 42)
42. Getting rCUDA
    • The full InfiniBand version is freely available:
      − Enhanced client-server data transfers
      − High-performance InfiniBand communications library
      − TCP/IP-based communications also included for non-InfiniBand networks
    • http://www.rcuda.net
