Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)

This report was presented at the Frontiers in Computational Astrophysics conference (Lyon, France, 11-15 October 2010). I give a brief and light introduction to the CUDA architecture and its benefits for scientific HPC. A brief description of the KIPT in-house package for N-body simulations is also given. This talk, with minor differences, was also presented at seminars at the Institute for Single Crystals (Kharkov) and the Kharkov Institute of Physics and Technology.

Transcript

  • 1. Computations on GPU: a road towards desktop supercomputing. Glib Ivashkevych, Institute of Theoretical Physics, NSC KIPT. November 24, 2010
  • 2–3. Quick outline. GPU – Graphics Processing Unit: programmable, manycore, multithreaded, with very high memory bandwidth. We are going to talk about: how the GPU became useful for scientific computations; GPU intrinsics and programming; how to get as much as possible from the GPU and survive :); the future of GPUs and GPU programming. Mostly CUDA (Nvidia's Compute Unified Device Architecture), plus OpenCL (Open Computing Language).
  • 4–7. Should we care? First of all: do we really need GPU computing? Short answer: yes! High performance, transparent scalability. More accurate answer: yes, for problems with high parallelism – large datasets where portions of data can be processed independently. Most accurate answer: yes, for problems with high data parallelism.
  • 8. For reference: GFLOPs – 10⁹ FLoating-point Operations Per second. ∼55 GFLOPs on Intel Core i7 Nehalem 975 (according to Intel); ∼125 GFLOPs on AMD Opteron Istanbul 2435; ∼500 GFLOPs in double and ∼2 TFLOPs in single precision on Nvidia Tesla C2050; ∼3.2·10³ GFLOPs on ASCI Red at Sandia National Laboratory – the fastest supercomputer as of November 1999; ∼87·10³ GFLOPs on the TSUBAME-1 grid cluster at Tokyo Institute of Technology – the first GPU-based supercomputer, №88 in the Top500 as of November 2010 (№56 in November 2009); ∼2.56·10⁶ GFLOPs on Tianhe-1A at the National Supercomputing Center in Tianjin – the fastest supercomputer as of November 2010, GPU-based.
  • 9. Examples: matrix and vector operations. CUBLAS¹ on Nvidia Tesla C2050 (CUDA 3.2) vs Intel MKL² 10.2 on Intel Core i7 Nehalem (4 threads): ∼8x in double precision. CULA³ on Nvidia Tesla C2050 (CUDA 3.2): up to ∼220 GFLOPs in double precision, up to ∼450 GFLOPs in single precision; vs Intel MKL 10.2: ∼4–6x speed-up. [¹ CUDA-accelerated Basic Linear Algebra Subprograms; ² Math Kernel Library; ³ LAPACK for heterogeneous systems]
  • 10. Examples: Fast Fourier Transform. CUFFT on Nvidia Tesla C2070 (CUDA 3.2): up to 65 GFLOPs in double precision, up to 220 GFLOPs in single precision; vs Intel MKL on Intel Core i7 Nehalem: ∼9x in double precision, ∼20x in single precision.
  • 11. Examples: physics – Computational Fluid Dynamics. Simulation of transition to turbulence¹. Nvidia Tesla S1070 vs quad-core Intel Xeon X5450 (3 GHz): ∼20x over serial code, ∼10x over the OpenMP implementation (2 threads), ∼5x over the OpenMP implementation (4 threads). [¹ A.S. Antoniou et al., American Institute of Aeronautics and Astronautics Paper 2010-0525]
  • 12. Examples: quantum chemistry. Calculation of molecular orbitals¹. Nvidia GeForce GTX 280 vs Intel Core2 Quad Q6600 (2.4 GHz): ∼173x over serial non-optimized code, ∼14x over parallel optimized code (4 threads). [¹ D.J. Hardy et al., GPGPU 2009]
  • 13. Examples: medical imaging. Isosurface reconstruction from scalar volumetric data¹. Nvidia GeForce GTX 285: ∼68x over optimized CPU code (baseline CPU not specified); nearly real-time processing of the data. [¹ T. Kalbe et al., Proceedings of the 5th International Symposium on Visual Computing (ISVC 2009)]
  • 14. Examples: GPUGrid.net. Biomolecular simulations accelerated by Nvidia CUDA boards and the Sony PlayStation. ∼8000 users from 101 countries; ∼145 TFLOPs on average – roughly №25 in the Top500; ∼50 GFLOPs from every active user.
  • 15. Examples: the ATLAS experiment at the Large Hadron Collider. Particle tracking, triggering, event simulation¹: possible Higgs events require tracking a large number of particles. ∼32x in tracking, ∼35x in triggering on Nvidia Tesla C1060. [¹ P.J. Clark et al., Processing Petabytes per Second with the ATLAS Experiment at the LHC (GTC 2010)]
  • 16. And even more examples in: N-body simulations, seismic simulations, molecular dynamics, SETI@Home & MilkyWay@Home, finance, neural networks, ... and, of course, graphics: VFX, rendering, image editing, video.
  • 17. GPU Technology Conference 2010 (September 20–23).
  • 18–29. Outline: 1 History; 2 CUDA: architecture overview and programming model; 3 Threads and memories hierarchy; 4 Toolbox; 5 PyCUDA & PyOpenCL; 6 EnSPy functionality; 7 EnSPy architecture; 8 Example: D5 potential; 9 Example: Hill problem; 10 Example: Hill problem, N-body version; 11 Performance results; 12 GPU computing prospects.
  • 30. Outline: section 1 – History.
  • 31. History in brief:
  • 32–33. GPGPU in 2001–2006: through a graphics API (OpenGL or DirectX); extremely hard; single precision only. GPGPU today: straightforward, easy, in double precision.
  • 34. Outline: section 2 – CUDA: architecture overview and programming model.
  • 35. Hardware model: GT200 architecture. The GPU consists of multiprocessors; each MP has 8 stream processors, 1 unit for double-precision operations, and shared memory; global memory is shared by all MPs.
  • 36. Hardware model: Fermi architecture. Each MP has 32 stream processors and 4 SFUs (Special Function Units); each SP has 1 FP unit & 1 INT unit.
  • 37. Hardware model: multiprocessors and threads. An MP can launch numerous threads; threads are "lightweight" – little creation and switching overhead; threads run the same code; thread synchronization within an MP; cooperation via shared memory; each thread has a unique identifier – the thread ID. Efficiency is achieved by hiding latency with calculation, not by cache usage as on a CPU.
  • 38. Software model: C for CUDA – a set of extensions to C plus a runtime library: function and variable type qualifiers; built-in vector types (float4, double2, etc.); built-in variables. Kernels map the parallel part of the program onto the GPU; execution: N times in parallel by N CUDA threads. CUDA Driver API: low-level control over the execution; no need for the nvcc compiler if kernels are precompiled – only the driver is needed.
  • 39–41. Software model: example.

        // Some function executed on the device (GPU)
        __device__ float DeviceFunction(float* A, float* B)
        {
            // Some math
            return smth;
        }

        // Kernel definition
        __global__ void SomeKernel(float* A, float* B, float C)
        {
            // Some math
            C = DeviceFunction(A, B);
        }

        // Host code
        int main()
        {
            // Kernel invocation
            SomeKernel<<<1, N>>>(A, B, C);
        }
  • 42–43. Software model: explanations. The __device__ qualifier defines a function that is executed on the device and callable from the device only. The __global__ qualifier defines a function that is executed on the device and callable from the host only.
  • 44. Execution model (diagram).
  • 45. Scalability. The underlying hardware architecture is hidden; threads can synchronize only within an MP ⇒ we do not need to know the exact number of MPs ⇒ scalable applications – from the GeForce 8800 GTX to Fermi.
  • 46. Outline: section 3 – Threads and memories hierarchy.
  • 47. Single threads. Each thread has private local memory; threads are identified by the built-in variable threadIdx (uint3 type): int idx = threadIdx.x + threadIdx.y + threadIdx.z; threads form a 1-, 2- or 3-dimensional array – a vector, matrix or field. Threads are organized into thread blocks.
  • 48. Thread blocks. Each block has shared memory visible to all threads within the block; blocks are identified by the built-in variable blockIdx (uint3 type): int b_idx = blockIdx.x + blockIdx.y; the dimension of the block is given by the built-in variable blockDim (dim3 type). Blocks are organized into a grid.
  • 49. Grid of thread blocks. Global device memory is accessible by all threads in the grid; the dimension of the grid is given by the built-in variable gridDim (dim3 type).
  • 50. Threads and memories hierarchy (diagram).
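As an illustration of the hierarchy above (not part of the original slides), the standard way to combine block and thread indices into a unique global index for a 1-D launch can be sketched in plain Python, mimicking the CUDA built-ins:

```python
# Minimal CPU-side model of CUDA's 1-D indexing scheme.
# block_idx, thread_idx and block_dim mimic the CUDA built-in
# variables blockIdx.x, threadIdx.x and blockDim.x.

def global_index(block_idx: int, thread_idx: int, block_dim: int) -> int:
    """Unique global thread index for a 1-D grid of 1-D blocks."""
    return block_idx * block_dim + thread_idx

# Enumerate every (block, thread) pair of a 4-block x 256-thread launch:
block_dim, grid_dim = 256, 4
indices = [global_index(b, t, block_dim)
           for b in range(grid_dim)
           for t in range(block_dim)]

# Each of the 1024 threads gets a distinct index 0..1023.
assert sorted(indices) == list(range(grid_dim * block_dim))
```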
  • 51. Example: vector addition (host code; note the original slide repeated d_A in all three cudaMalloc calls – corrected here).

        int main()
        {
            // Allocate vectors in device memory
            size_t size = N * sizeof(float);
            float* d_A; cudaMalloc((void**)&d_A, size);
            float* d_B; cudaMalloc((void**)&d_B, size);
            float* d_C; cudaMalloc((void**)&d_C, size);

            // Copy data from host memory to device memory
            cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
            cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

            // Prepare the kernel launch
            int threadsPerBlock = 256;
            int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
            VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);

            cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

            // Free device memory
            cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        }
  • 52. Example: vector addition (kernel; the index must include the block offset, since the host code launches a whole grid of blocks).

        // Kernel code
        __global__ void VecAdd(float* A, float* B, float* C)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N)
                C[i] = A[i] + B[i];
        }
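To see what this launch computes without a GPU, here is a pure-Python emulation (the names vec_add and launch are illustrative, not a real CUDA or PyCUDA API) that runs the kernel body once per (block, thread) pair:

```python
# Pure-Python emulation of the VecAdd kernel launch.
# "launch" plays the role of the <<<blocksPerGrid, threadsPerBlock>>> syntax.

def vec_add(A, B, C, N, block_idx, thread_idx, block_dim):
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < N:                                # guard against overrunning N
        C[i] = A[i] + B[i]

def launch(kernel, blocks, threads, *args):
    # On a GPU all these "threads" run in parallel; here we simply loop.
    for b in range(blocks):
        for t in range(threads):
            kernel(*args, b, t, threads)

N = 1000
A = [float(i) for i in range(N)]
B = [2.0 * i for i in range(N)]
C = [0.0] * N
threads_per_block = 256
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block  # 4 blocks

launch(vec_add, blocks_per_grid, threads_per_block, A, B, C, N)
assert C == [3.0 * i for i in range(N)]
```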
  • 53–55. Performance analysis and optimization. There must be enough thread blocks per MP to hide latency – try not to under-populate blocks. Use the memory bandwidth (∼100 GB/s!) efficiently: coalescing; non-optimized access to global memory can reduce performance by an order (or orders) of magnitude; try to achieve high arithmetic intensity. Never diverge threads within one warp: divergence → serialization → loss of parallelism.
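The "arithmetic intensity" mentioned above is simply floating-point operations per byte of memory traffic; a back-of-the-envelope estimate (my illustration, using the slide's ∼100 GB/s bandwidth figure) shows why vector addition is memory-bound:

```python
# Rough arithmetic-intensity estimate for single-precision vector addition:
# each element needs 1 add but moves 3 floats (load A[i], load B[i], store C[i]).

flops_per_element = 1
bytes_per_element = 3 * 4            # three float transfers, 4 bytes each

intensity = flops_per_element / bytes_per_element
assert abs(intensity - 1 / 12) < 1e-12   # ~0.083 FLOP per byte

# At ~100 GB/s of memory bandwidth, bandwidth alone caps VecAdd at roughly:
bandwidth_bytes_per_s = 100e9
peak_gflops = bandwidth_bytes_per_s * intensity / 1e9
assert abs(peak_gflops - 100 / 12) < 1e-9   # ~8.3 GFLOPs, far below ALU peak
```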
  • 56. Outline: section 4 – Toolbox.
  • 57. Start-up tools: drivers; CUDA Toolkit (nvcc compiler, runtime library, header files, CUBLAS, CUFFT, Visual Profiler, etc.); CUDA SDK (examples, Occupancy Calculator, etc.). Free download at http://developer.nvidia.com/object/cuda_3_2_downloads.html. Support for 32- and 64-bit Windows, Linux¹ & Mac OS X. [¹ Supported distros in CUDA 3.2: Fedora 13, RHEL 4.8 & 5.5, OpenSUSE 11.2, SLED 11.0, Ubuntu 10.04]
  • 58–59. Developer tools. CUDA-gdb: integration into gdb; CUDA C support; works on all 32/64-bit Linux distros; breakpoints and single-step execution. CUDA Visual Profiler: tracks events with hardware counters – global memory loads/stores; total branches and divergent branches taken by threads; instruction count; number of thread warps serialized due to address conflicts (shared and constant memory).
  • 60. Outline: section 5 – PyCUDA & PyOpenCL.
  • 61. Python: easy to learn; dynamically typed; rich built-in functionality; interpreted; very well documented; has a large and active community.
  • 62–64. Scientific tools. SciPy – modeling and simulation: Fourier transforms; ODEs; optimization; scipy.weave.inline – C inlining with little or no overhead; ... NumPy – arrays: flexible array-creation routines; sorting, random sampling and statistics; ... Python is a convenient way of interfacing C/C++ libraries.
  • 65. PyCUDA: provides complete access to CUDA features; automatically manages resources; handles errors and translates them into Python exceptions; convenient abstractions: GPUArray; metaprogramming: dynamic generation of CUDA source code; interactive! PyOpenCL is much the same in concept – but not only for Nvidia GPUs: also for ATI/AMD cards, AMD & Intel processors, etc. (IBM Cell?)
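The metaprogramming point above – generating CUDA source from Python at run time – can be sketched with nothing more than string templating (the kernel text and helper names here are my illustration; in PyCUDA such a string would then be handed to its compiler wrapper):

```python
# Generate CUDA C source for an element-wise kernel from a Python template.
# Only the string is built and inspected here, so no GPU is needed.

KERNEL_TEMPLATE = """
__global__ void {name}(float* A, float* B, float* C, int N)
{{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = {expression};
}}
"""

def make_kernel_source(name: str, expression: str) -> str:
    """Instantiate the template for a given element-wise expression."""
    return KERNEL_TEMPLATE.format(name=name, expression=expression)

src = make_kernel_source("SaxpyLike", "2.0f * A[i] + B[i]")
assert "__global__ void SaxpyLike" in src
assert "C[i] = 2.0f * A[i] + B[i];" in src
```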
  • 66. Python and CUDA. We could interface with: the Python C API – low-level approach: overkill; SWIG, Boost::Python – high-level approach: overkill; PyCUDA – the simplest and most straightforward way, for CUDA only; scipy.weave.inline – simple and straightforward for both CUDA and plain C/C++.
  • 67. Outline: section 6 – EnSPy functionality.
  • 68. Motivation: combine the flexibility of Python with the efficiency of C++ → CUDA for N-body simulation. The interface of EnSPy is written in Python; the core of EnSPy is written in C++; they are joined together by scipy.weave.inline. The C++ core can be used without Python – just include the header and link with the precompiled shared library. Easily extensible, both through the high-level Python interface and the low-level C++ core – new algorithms, initial distributions, etc. Multi-GPU parallelization. It's easy to experiment with EnSPy!
  • 69. EnSPy functionality. Types of ensembles: "simple" ensemble – no interaction, external potential only; N-body ensemble – both an external potential and gravitational interaction between particles. Current algorithms: 4th-order Runge–Kutta for the "simple" ensemble; Hermite scheme with shared time steps for the N-body ensemble.
  • 70. Predefined initial distributions: uniform, point and spherical for "simple" ensembles; a uniform sphere with 2T/|U| = 1 for the N-body ensemble; the user can supply functions (in Python) for initial ensemble generation. User-specified values and expressions: parameters of the initial distribution; potential, forces, parameters of the integration scheme; an arbitrary number of triggers – Nᵢ(t), the number of particles that have not crossed the given hypersurface Fᵢ(q, p) = 0 before time t; an arbitrary number of averages – F̄ᵢ(q, p, t) – quantities to be averaged over the ensembles.
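To make the trigger idea above concrete, here is a pure-Python sketch (the function name is mine, not EnSPy's): given each particle's first crossing time of the hypersurface F(q, p) = 0, N(t) counts how many particles have not yet crossed by time t:

```python
import math

# Hypothetical helper, not part of EnSPy: count the particles that have
# not crossed the trigger hypersurface F(q, p) = 0 by time t.

def trigger_count(crossing_times, t):
    """N(t): particles whose first crossing happens strictly after t
    (math.inf means the particle never crosses, i.e. stays trapped)."""
    return sum(1 for tc in crossing_times if tc > t)

# Toy data: five particles; two escape early, one late, two never escape.
crossings = [1.0, 2.5, 10.0, math.inf, math.inf]

assert trigger_count(crossings, 0.0) == 5    # nobody has escaped yet
assert trigger_count(crossings, 3.0) == 3    # two particles have crossed
assert trigger_count(crossings, 100.0) == 2  # only the trapped ones remain
```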
  • 71. Runtime generation and compilation of C and CUDA code: user-specified expressions (as Python strings) are wrapped by the EnSPy template subpackage into C functions and a CUDA module, and compiled at runtime. High usability and calculation efficiency: a flexible Python interface; all actual calculations are performed by the runtime-generated C extension and the precompiled shared library. Drawback: extra time for generation and compilation of new code.
  • 72. Outline: section 7 – EnSPy architecture.
  • 73. Execution flow and architecture: input parameters → ensemble population (predefined or user-specified distribution) → code generation and compilation → launching N_GPUs threads.
  • 74. GPU parallelization scheme for N-body simulations (diagram).
  • 75. Order of force calculation (diagram).
  • 76. Outline: section 8 – Example: D5 potential.
  • 77. Overview. Problem: escape from a potential well. Watched value (trigger): N(t) – the number of particles remaining in the well at time t. Potential: U_D5 = 2ay² − x² + xy² + x⁴/4. "Critical" energy: E_cr = E_S = 0.
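As a quick sanity check of the formula above (a is a free parameter of the potential; a = 1 is my illustrative choice), the origin is a saddle point of U_D5 with U = 0, which matches the critical energy quoted:

```python
# The D5 potential from the slide: U = 2*a*y^2 - x^2 + x*y^2 + x^4/4.
# a > 0 is a free parameter; a = 1 here is just an illustrative choice.

def U_D5(x, y, a=1.0):
    return 2.0 * a * y**2 - x**2 + x * y**2 + x**4 / 4.0

# The saddle point sits at the origin, and its energy is the
# critical (escape) energy E_cr = E_S = 0 quoted on the slide:
assert U_D5(0.0, 0.0) == 0.0

h = 1e-3
assert U_D5(h, 0.0) < 0.0   # U decreases along x: the escape direction
assert U_D5(0.0, h) > 0.0   # U increases along y: the confined direction
```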
  • 78. Potential and structure of phase space: level lines of the D5 potential in the (x, y) plane and the phase-space portrait in the (x, p_x) plane (figures).
  • 79. Calculation setup: "simple" ensemble; uniform initial distribution of N = 10240 particles in x > 0 ∩ U(x, y) < E; trigger: x = 0 → q0 = 0. 12 lines of simple Python code (examples/d5.py): specification of the integration parameters.
  • 80. Results: regular particles are trapped in the well → the initial "mixed state" splits (plot of N(t)/N(0) vs t for E = 0.1 and E = 0.9).
  • 81. Outline: section 9 – Example: Hill problem.
  • 82. Overview. Problem: toy model of escape from a star cluster – escape of a star from the potential of a point rotating star cluster M_c and a point galaxy core M_g. Watched value (trigger): N(t) – the number of particles remaining in the cluster at time t. "Potential" in the cluster frame of reference (tidal approximation): U_Hill = −(3/2)ω²x² − GM_c/r. "Critical" energy: E_cr = E_S = −4.5ω².
  • 83. Potential: Hill curves in the (x, y) plane (figure).
  • 84. Calculation setup: "simple" ensemble; uniform initial distribution of N = 10240 particles in |x| < r_t ∩ U(x, y) < E; ω = 1/√3 → r_t = 1; trigger: |x| − r_t = 0 → abs(q0) - 1. = 0. 12 lines of simple Python code (examples/hill_plain.py): specification of the integration parameters.
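The step from ω = 1/√3 to r_t = 1 follows from the standard tidal (Hill) radius relation r_t = (GM_c/(3ω²))^(1/3) – my reconstruction of the slide's setup, taking G = M_c = 1:

```python
import math

# Tidal (Hill) radius: r_t = (G*M_c / (3*omega^2))**(1/3).
# With G = M_c = 1 and omega = 1/sqrt(3), 3*omega^2 = 1, so r_t = 1,
# matching the trigger |x| - r_t = 0 used on the slide.

def tidal_radius(G, M_c, omega):
    return (G * M_c / (3.0 * omega**2)) ** (1.0 / 3.0)

omega = 1.0 / math.sqrt(3.0)
r_t = tidal_radius(1.0, 1.0, omega)
assert abs(r_t - 1.0) < 1e-12
```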
  • 85. Results: trapping of regular particles (some tricky physics here) – plot of N(t) vs the number of time steps for E = −1.3, −0.8 and −0.3.
  • 86. Outline: section 10 – Example: Hill problem, N-body version.
  • 87. Overview. Problem: simplified model of escape from a star cluster – escape of a star from the potential of a rotating star cluster with total mass M_c and the point potential of a galaxy core with mass M_g (2D). Watched value: the configuration of the cluster. Potential of the galaxy core in the cluster frame of reference (tidal approximation): U_HillNB = −(3/2)ω²x².
  • 88. "Toy" Hill model vs the N-body Hill model (comparison figure).
  • 89. Calculation setup: N-body ensemble; 2D (z = 0) initial distribution of N = 10240 particles inside a circle of radius R, with zero initial velocities; M_c = 1, R = 200, ω = 1/√3. 14 lines of simple Python code (examples/hill_nbody.py): specification of the integration parameters.
  • 90. Results: cluster configuration – (x, y) snapshots at steps 201, 401, 601, 801, 1001 and 1201 showing the evolution of the cluster.
  • 91. Performance results. OpenSUSE 11.2, GCC 4.4, CUDA 3.0; AMD Athlon X2 4400+ (2.3 GHz) / Intel Core2 Duo E8500 (3.16 GHz), Nvidia GeForce GTX 260. Not as good as it could be – subject to improvement. Estimate: ∼1 TFLOPs on 2x recent Fermi graphics processors. (Plots: speed-up vs N for the OpenMP, SSE-optimized and CUDA versions; GFLOP/s vs number of particles for the GTX 260 in double precision, N-body and "simple" ensembles.)
  • 92. Outline: section 12 – GPU computing prospects.
  • 93. Yesterday: uniform programming with OpenCL – no need to care about the concrete implementation; desktop supercomputers (full ATX form factor): Nvidia Tesla C1060 x4 (∼300 GFLOPs / 4 TFLOPs; Windows & Linux 32/64-bit support) vs ATI FireStream x4 (∼960 GFLOPs / 4.8 TFLOPs; Windows & Linux 32/64-bit support).
  • 94. Today: CUDA 3.2 → C++: classes, namespaces, default parameters, operator overloading. Nvidia Tesla C2050/C2070 x4 (∼2 TFLOPs / 4 TFLOPs; concurrent kernel execution; ∼8x in GFLOPs, ∼6x in GFLOPs/$, ∼5x in GFLOPs/W vs four Intel Xeon X5550 (85 GFLOPs/73 GFLOPs)) vs ATI FireStream 9350/9370 x4 (∼2 TFLOPs / 8 TFLOPs; stable double-precision support (12 August 2010); LOEWE-CSC (University of Frankfurt): №22 in the Top500). Tianhe-1A, Nebulae, Tsubame-2: №1, 3 and 4 supercomputers in the Top500.
  • 95. Tomorrow: OpenCL 1.2 (?) → matrix and "field", complex and real types; new libraries: GPU programming as simple as CPU programming. Nvidia GeForce GTX 580 (∼0.75 TFLOPs / 1.5 TFLOPs) vs ATI Radeon 6950 "Cayman" (∼0.75 TFLOPs / 3 TFLOPs).
  • 96. This presentation is available for download at http://www.scribd.com/doc/27751403