Computations on GPU: a road towards desktop supercomputing

                                              Glib Ivashkevych

                                Institute of Theoretical Physics, NSC KIPT

                                           November 24, 2010
Computations on GPU: a road towards desktop supercomputing
  Quick outline




      GPU – Graphic Processing Unit
              programmable
              manycore
              multithreaded
              with very high memory bandwidth

      We are going to talk about:
              how GPUs became useful for scientific computations
              GPU intrinsics and programming
              how to get as much as possible from a GPU and survive :)
              the future of GPUs and GPU programming
              CUDA (Nvidia's Compute Unified Device Architecture), most of the time, and OpenCL (Open Computing Language)
Computations on GPU: a road towards desktop supercomputing
  Should we care?




      But first of all: do we really need GPU computing?


      Short answer: yes!
              high performance
              transparent scalability

      More accurate answer: yes, for problems with high parallelism.
              large datasets
              portions of data can be processed independently

      Most accurate answer: yes, for problems with high data parallelism.
Computations on GPU: a road towards desktop supercomputing
  Reference




For reference
              GFLOPs – 10⁹ FLoating Point Operations Per second
              ∼ 55 GFLOPs on Intel Core i7 Nehalem 975 (according to Intel)
              ∼ 125 GFLOPs on AMD Opteron Istanbul 2435
              ∼ 500 GFLOPs in double and ∼ 2 TFLOPs in single precision on Nvidia Tesla C2050
              ∼ 3.2 · 10³ GFLOPs on ASCI Red at Sandia National Laboratories – the fastest supercomputer as of November 1999
              ∼ 87 · 10³ GFLOPs on the TSUBAME-1 grid cluster at Tokyo Institute of Technology – the first GPU-based supercomputer, №88 in the Top500 as of November 2010 (№56 in November 2009)
              ∼ 2.56 · 10⁶ GFLOPs on Tianhe-1A at the National Supercomputing Center in Tianjin – the fastest supercomputer as of November 2010 – GPU-based
Computations on GPU: a road towards desktop supercomputing
  Examples




Matrix and vector operations
      CUBLAS (CUDA-accelerated Basic Linear Algebra Subprograms) on Nvidia Tesla C2050 (CUDA 3.2) vs Intel MKL (Math Kernel Library) 10.2 on Intel Core i7 Nehalem (4 threads)

               ∼ 8x in double precision

      CULA (LAPACK for heterogeneous systems) on Nvidia Tesla C2050 (CUDA 3.2)

               up to ∼ 220 GFLOPs in double precision
               up to ∼ 450 GFLOPs in single precision

      vs Intel MKL 10.2

               ∼ 4 – 6x speed-up
Computations on GPU: a road towards desktop supercomputing
  Examples




Fast Fourier Transform
      CUFFT on Nvidia Tesla C2070 (CUDA 3.2)

              up to 65 GFLOPs in double precision
              up to 220 GFLOPs in single precision

      vs Intel MKL on Intel Core i7 Nehalem

              ∼ 9x in double precision
              ∼ 20x in single precision
Computations on GPU: a road towards desktop supercomputing
  Examples




Physics: Computational Fluid Dynamics
      Simulation of transition to turbulence¹
               Nvidia Tesla S1070 vs quad-core Intel Xeon X5450 (3GHz)
               ∼ 20x over serial code
               ∼ 10x over OpenMP implementation (2 threads)
               ∼ 5x over OpenMP implementation (4 threads)

           ¹ A.S. Antoniou et al., American Institute of Aeronautics and Astronautics Paper 2010-0525
Computations on GPU: a road towards desktop supercomputing
  Examples




Quantum chemistry
      Calculations of molecular orbitals¹
                 Nvidia GeForce GTX 280 vs Intel Core2 Quad Q6600 (2.4GHz)
                 ∼ 173x over serial non-optimized code
                 ∼ 14x over parallel optimized code (4 threads)

           ¹ D.J. Hardy et al., GPGPU 2009
Computations on GPU: a road towards desktop supercomputing
  Examples




Medical Imaging
      Isosurface reconstruction from scalar volumetric data¹
               Nvidia GeForce GTX 285 vs ?
               ∼ 68x over optimized CPU code
               nearly real-time processing of data

           ¹ T. Kalbe et al., Proceedings of the 5th International Symposium on Visual Computing (ISVC 2009)
Computations on GPU: a road towards desktop supercomputing
  Examples




GPUGrid.net
      Biomolecular simulations
              accelerated by Nvidia CUDA boards and Sony PlayStation consoles
              ∼ 8000 users from 101 countries
              ∼ 145 TFLOPs on average ≈ №25 in the Top500
              ∼ 50 GFLOPs from every active user
Computations on GPU: a road towards desktop supercomputing
  Examples




ATLAS experiment at the Large Hadron Collider
      Particle tracking, triggering, event simulation¹
               possible Higgs events – track a large number of particles
               ∼ 32x in tracking, ∼ 35x in triggering on Nvidia Tesla C1060

           ¹ P.J. Clark et al., Processing Petabytes per Second with the ATLAS Experiment at the LHC (GTC 2010)
Computations on GPU: a road towards desktop supercomputing
  Examples




And even more examples in:
              N–body simulations
              seismic simulations
              molecular dynamics
              SETI@Home & MilkyWay@Home
              finance
              neural networks
              ...
      and, of course, graphics

              VFX, rendering
              image editing, video
Computations on GPU: a road towards desktop supercomputing
  Examples




      GPU Technology Conference 2010 (September 20-23)
Computations on GPU: a road towards desktop supercomputing
  Outline




Outline
      1     History
      2     CUDA: architecture overview and programming model
      3     Threads and memories hierarchy
      4     Toolbox
      5     PyCUDA&PyOpenCL
      6     EnSPy functionality
      7     EnSPy architecture
      8     Example: D5 potential
      9     Example: Hill problem
      10    Example: Hill problem, N–body version
      11    Performance results
      12    GPU computing prospects
Computations on GPU: a road towards desktop supercomputing
  History




History in brief:
Computations on GPU: a road towards desktop supercomputing
  History




      GPGPU in 2001-2006:
              through graphics API (OpenGL or DirectX)
              extremely hard
              only in single precision

      GPGPU today:
              straightforward
              easy
              in double precision.
Computations on GPU: a road towards desktop supercomputing
  CUDA: architecture overview and programming model




Hardware model: GT200 architecture

          consists of multiprocessors (MPs)
          each MP has:
                  8 stream processors
                  1 unit for double precision operations
                  shared memory
          global memory
Computations on GPU: a road towards desktop supercomputing
  CUDA: architecture overview and programming model




Hardware model: Fermi architecture

          each MP has:
                  32 stream processors
                  4 SFUs (Special Function Units)
          each SP has:
                  1 FP unit & 1 INT unit
Computations on GPU: a road towards desktop supercomputing
  CUDA: architecture overview and programming model




Hardware model: Multiprocessors and threads
              an MP can launch numerous threads
              threads are "lightweight" – little creation and switching overhead
              threads run the same code
              thread synchronization within an MP
              cooperation via shared memory – see the sketch below
              each thread has a unique identifier – thread ID

      Efficiency is achieved by hiding latency with computation, not by cache usage as on a CPU
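
      The following minimal sketch (not from the original slides) illustrates the cooperation point above: threads of one block share data through shared memory and synchronize with __syncthreads(). The kernel name and the 256-thread block size are illustrative assumptions; the input length is assumed to be a multiple of the block size.

      // Illustrative kernel: threads of one block cooperate via shared memory
      // and synchronize with __syncthreads(). Launch with 256 threads per block.
      __global__ void BlockSum(const float* in, float* blockSums)
      {
          __shared__ float buf[256];            // one element per thread in the block

          unsigned int tid = threadIdx.x;
          buf[tid] = in[blockIdx.x * blockDim.x + tid];
          __syncthreads();                      // wait until every thread has written

          // Tree reduction inside the block
          for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
              if (tid < s)
                  buf[tid] += buf[tid + s];
              __syncthreads();                  // all partial sums ready before next step
          }

          if (tid == 0)
              blockSums[blockIdx.x] = buf[0];   // one result per block
      }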
Computations on GPU: a road towards desktop supercomputing
  CUDA: architecture overview and programming model




Software model: C for CUDA
              a set of extensions to C
              runtime library
              function and variable type qualifiers
              built–in vector types: float4, double2 etc.
              built–in variables
      Kernels
              maps parallel part of the program to the GPU
              execution: N times in parallel by N CUDA threads

      CUDA Driver API
              low-level control over the execution
              no need for the nvcc compiler if kernels are precompiled – only the driver is needed (see the sketch below)
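
      As an illustration of the Driver API point (not from the original slides), here is a minimal sketch of loading a precompiled kernel module. The file name "somekernel.ptx" and the kernel name "SomeKernel" are hypothetical, error checking is omitted, and the actual launch calls differ between CUDA versions.

      // Sketch: CUDA Driver API with a precompiled kernel; link against the driver (-lcuda).
      #include <cuda.h>

      int main()
      {
          CUdevice   dev;
          CUcontext  ctx;
          CUmodule   mod;
          CUfunction kernel;

          cuInit(0);                                   // initialize the driver API
          cuDeviceGet(&dev, 0);                        // first CUDA device
          cuCtxCreate(&ctx, 0, dev);                   // create a context on it

          cuModuleLoad(&mod, "somekernel.ptx");        // load precompiled PTX/cubin
          cuModuleGetFunction(&kernel, mod, "SomeKernel");

          // ... set up device memory and launch `kernel` with the driver launch
          // calls of the CUDA version in use, then clean up ...

          cuCtxDestroy(ctx);
          return 0;
      }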
Computations on GPU: a road towards desktop supercomputing
  CUDA: architecture overview and programming model




Software model: Example
      // Some function - executed on device (GPU)
      __device__ float DeviceFunction(float* A, float* B)
      {
        // Some math producing a result, e.g.
        return A[0] + B[0];
      }

      // Kernel definition
      __global__ void SomeKernel(float* A, float* B, float* C)
      {
         // Some math; store the result through the pointer
         *C = DeviceFunction(A, B);
      }

      // Host code
      int main()
      {
          // Kernel invocation
          SomeKernel<<<1, N>>>(A, B, C);
      }
Computations on GPU: a road towards desktop supercomputing
  CUDA: architecture overview and programming model




Software model: Explanations
         __device__ qualifier defines a function that is:
               executed on the device
               callable from the device only

         __global__ qualifier defines a function that is:
               executed on the device
               callable from the host only
Computations on GPU: a road towards desktop supercomputing
  CUDA: architecture overview and programming model




Execution model
Computations on GPU: a road towards desktop supercomputing
  CUDA: architecture overview and programming model




Scalability
              the underlying hardware architecture is hidden
              threads can synchronize only within an MP

                                                              ↓

              we do not need to know the exact number of MPs

                                                              ↓

              scalable applications – from the GeForce 8800 GTX to Fermi
Computations on GPU: a road towards desktop supercomputing
  Threads and memories hierarchy




Single threads
              each thread has private local memory
              threads are identified by the built-in variable threadIdx (uint3 type)
                      int idx = threadIdx.x + threadIdx.y + threadIdx.z;

              threads form a 1-, 2- or 3-dimensional array – a vector, matrix or field


   Threads are organized into
   thread blocks
Computations on GPU: a road towards desktop supercomputing
  Threads and memories hierarchy




Thread blocks
              each block has shared memory visible to all threads within the block
              blocks are identified by the built-in variable blockIdx (uint3 type)
                      int b_idx = blockIdx.x + blockIdx.y;

              the dimension of the block is given by the built-in variable blockDim (dim3 type)

   Blocks are organized into
   a grid
Computations on GPU: a road towards desktop supercomputing
  Threads and memories hierarchy




Grid of thread blocks
              global device memory is accessible by all threads in the grid
              the dimension of the grid is given by the built-in variable gridDim (dim3 type) – see the indexing sketch below
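
      A minimal sketch (not from the original slides) of how these built-in variables are typically combined into a global element index; the grid-stride loop is a common idiom and the kernel name is illustrative.

      // Illustrative kernel: global index from blockIdx, blockDim and threadIdx,
      // with gridDim used for a grid-stride loop over n elements.
      __global__ void Scale(float* data, float alpha, int n)
      {
          int idx    = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
          int stride = gridDim.x * blockDim.x;                 // total threads in the grid

          for (int i = idx; i < n; i += stride)                // handles n > total threads
              data[i] *= alpha;
      }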
Computations on GPU: a road towards desktop supercomputing
  Threads and memories hierarchy




Threads and memories hierarchy
Computations on GPU: a road towards desktop supercomputing
  Threads and memories hierarchy




Example: vector addition
       // N, h_A, h_B and h_C are assumed to be defined/allocated on the host
       int main()
       {
            // Allocate vectors in device memory
            size_t size = N * sizeof(float);
            float* d_A;
            cudaMalloc((void**)&d_A, size);
            float* d_B;
            cudaMalloc((void**)&d_B, size);
            float* d_C;
            cudaMalloc((void**)&d_C, size);

            // Copy data from host memory to device memory
            cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
            cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

            // Prepare the kernel launch
            int threadsPerBlock = 256;
            int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

            VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);
            cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

            // Free device memory
            cudaFree(d_A);
            cudaFree(d_B);
            cudaFree(d_C);
       }
Computations on GPU: a road towards desktop supercomputing
  Threads and memories hierarchy




Example: vector addition
      // Kernel code (N is assumed to be a global constant)
      __global__ void VecAdd(float* A, float* B, float* C)
      {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < N)
             C[i] = A[i] + B[i];
      }
Computations on GPU: a road towards desktop supercomputing
  Threads and memories hierarchy




Performance analysis and optimization
              there must be enough thread blocks per MP to hide latency
              try not to under-populate blocks
              use the memory bandwidth (∼ 100 GB/s!) efficiently
                     coalescing
                     non-optimized access to global memory can reduce performance by an order (or orders) of magnitude
              try to achieve high arithmetic intensity
              never diverge threads within one warp:
              divergence → serialization = loss of parallelism
              (see the access-pattern sketch below)
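
      To illustrate the coalescing point, a minimal sketch (not from the original slides) contrasting a coalesced access pattern with a strided one; the kernel names and the stride are illustrative.

      // Coalesced: consecutive threads of a warp read consecutive floats,
      // so the accesses combine into a few wide memory transactions.
      __global__ void CopyCoalesced(const float* in, float* out, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              out[i] = in[i];
      }

      // Strided: consecutive threads touch addresses `stride` floats apart,
      // so each thread causes its own memory transaction (much slower).
      __global__ void CopyStrided(const float* in, float* out, int n, int stride)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i * stride < n)
              out[i * stride] = in[i * stride];
      }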
Computations on GPU: a road towards desktop supercomputing
  Toolbox




Start-up tools
                drivers
                CUDA Toolkit
                     nvcc compiler, runtime library, header files, CUBLAS, CUFFT, Visual Profiler etc.
                CUDA SDK
                     examples, Occupancy Calculator etc.

      Free download at
      http://developer.nvidia.com/object/cuda 2 3 downloads.html

      Support for 32 and 64-bit Windows, Linux¹ & Mac OS X

            ¹ Supported distros in CUDA 3.2: Fedora 13, RH Enterprise 4.8 & 5.5, OpenSUSE 11.2, SLED 11.0, Ubuntu 10.04
Computations on GPU: a road towards desktop supercomputing
  Toolbox




Developer Tools
      CUDA-gdb
              integration into gdb
              CUDA C support
              works on all 32/64-bit Linux distros
              breakpoints and single-step execution

      CUDA Visual Profiler
              tracks events with hardware counters
                     global memory loads/stores
                     total branches and divergent branches taken by threads
                     instruction count
                     number of serialized thread warps due to address conflicts (shared and constant memory)
Computations on GPU: a road towards desktop supercomputing
  PyCUDA&PyOpenCL




Python
              easy to learn
              dynamically typed
              rich built-in functionality
              interpreted
              very well documented
              has a large and active community
Computations on GPU: a road towards desktop supercomputing
  PyCUDA&PyOpenCL




Scientific tools:
              Scipy – modeling and simulation
                     Fourier transforms
                     ODE
                     Optimization
                     scipy.weave.inline – C inlining with little or no
                     overhead
                     ···
              NumPy – arrays
                     flexible array creation routines
                     sorting, random sampling and statistics
                     ···

      Python is a convenient way of interfacing C/C++ libraries
Computations on GPU: a road towards desktop supercomputing
  PyCUDA&PyOpenCL




PyCUDA
              provides complete access to CUDA features
              automatically manages resources
              error handling and translation into Python exceptions
              convenient abstractions: GPUArray
              metaprogramming: dynamic creation of CUDA source code
              interactive!
      PyOpenCL is pretty much the same in concept – but not only for Nvidia GPUs.
      Also for ATI/AMD cards, AMD & Intel processors etc. (IBM Cell?)
Computations on GPU: a road towards desktop supercomputing
  PyCUDA&PyOpenCL




Python and CUDA
      We could interface with:
              Python C API – low-level approach: overkill
              SWIG, Boost::Python – high-level approach: overkill
              PyCUDA – the simplest and most straightforward way, for CUDA only
              scipy.weave.inline – simple and straightforward way for both CUDA and plain C/C++
Computations on GPU: a road towards desktop supercomputing
  EnSPy functionality




Motivation
      Combine the flexibility of Python with the efficiency of C++ and CUDA for N-body simulations
              the interface of EnSPy is written in Python
              the core of EnSPy is written in C++
              joined together by scipy.weave.inline
              the C++ core can be used without Python – just include the header and link against the precompiled shared library
              easily extensible, both through the high-level Python interface and the low-level C++ core – new algorithms, initial distributions etc.
              multi-GPU parallelization
              it's easy to experiment with EnSPy!
Computations on GPU: a road towards desktop supercomputing
  EnSPy functionality




EnSPy functionality
      Types of ensembles:
              "simple" ensemble – no interaction, only an external potential
              N-body ensemble – both an external potential and gravitational interaction between particles

      Current algorithms (see the sketch below):
              4th-order Runge–Kutta for the "simple" ensemble
              Hermite scheme with shared time steps for the N-body ensemble
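
      For reference, the classical 4th-order Runge–Kutta step for $\dot{y} = f(t, y)$ with step $h$ (standard textbook form, not taken from the slides):

              $k_1 = f(t_n,\; y_n)$,
              $k_2 = f(t_n + h/2,\; y_n + (h/2)\,k_1)$,
              $k_3 = f(t_n + h/2,\; y_n + (h/2)\,k_2)$,
              $k_4 = f(t_n + h,\; y_n + h\,k_3)$,
              $y_{n+1} = y_n + (h/6)\,(k_1 + 2k_2 + 2k_3 + k_4)$.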
Computations on GPU: a road towards desktop supercomputing
  EnSPy functionality




      Predefined initial distributions:
              uniform, point and spherical for "simple" ensembles
              uniform sphere with 2T/|U| = 1 for the N-body ensemble
              the user can supply functions (in Python) for initial ensemble generation

      User-specified values and expressions:
              parameters of the initial distribution
              potential, forces, parameters of the integration scheme
              arbitrary number of triggers – N_i(t), the number of particles which do not cross the given hypersurface F_i(q, p) = 0 before time t
              arbitrary number of averages – F̄_i(q, p, t) – quantities which should be averaged over the ensembles
Computations on GPU: a road towards desktop supercomputing
  EnSPy functionality




      Runtime generation and compilation of C and CUDA code:
              user-specified expressions (as Python strings) are wrapped by the EnSPy template subpackage into C functions and a CUDA module (see the sketch below)
              compiled at runtime

      High usability and calculation efficiency:
              flexible Python interface
              all actual calculations are performed by a runtime-generated C extension and a precompiled shared library

      Drawback:
              extra time for generation and compilation of new code
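
      As an illustration of what such runtime-generated code might look like (hypothetical, not EnSPy's actual output): a user-supplied potential expression wrapped into a __device__ function that the generated integrator kernel can call.

      // Hypothetical sketch of a runtime-generated CUDA fragment.
      // The expression below stands in for a user-supplied Python string
      // (here: the D5 potential from the example that follows).
      __device__ double Potential(double x, double y, double a)
      {
          return 2.*a*y*y - x*x + x*y*y + 0.25*x*x*x*x;
      }

      // The integration kernel (not shown) would be generated around calls like:
      //     double U = Potential(q[0], q[1], a);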
Computations on GPU: a road towards desktop supercomputing
  EnSPy architecture




Execution flow and architecture

          Input parameters

                             ↓

          Ensemble population
          (predefined or user-specified distribution)

                             ↓

          Code generation and compilation

                             ↓

          Launching N_GPUs threads
Computations on GPU: a road towards desktop supercomputing
  EnSPy architecture




GPU parallelization scheme for N–body simulations
Computations on GPU: a road towards desktop supercomputing
  EnSPy architecture




Order of force calculation
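
      The original slides present the parallelization scheme and the order of force calculation as figures. As a stand-in, here is a minimal sketch of the shared-memory tiling scheme commonly used for GPU N-body force evaluation (one thread per particle, bodies loaded tile by tile into shared memory); EnSPy's actual kernels may differ, and the names, TILE size and softening EPS2 are illustrative.

      // Sketch of the standard tiled N-body force loop (gravitational, softened, G = 1).
      #define TILE 128
      #define EPS2 1.0e-6f

      __global__ void Forces(const float4* pos, float3* acc, int n)
      {
          __shared__ float4 tile[TILE];
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          float4 pi = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
          float3 ai = make_float3(0.f, 0.f, 0.f);

          for (int base = 0; base < n; base += TILE) {
              int j = base + threadIdx.x;
              tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
              __syncthreads();                      // tile loaded by the whole block

              for (int k = 0; k < TILE && base + k < n; ++k) {
                  float4 pj = tile[k];              // pj.w holds the mass
                  float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
                  float r2 = dx*dx + dy*dy + dz*dz + EPS2;
                  float inv_r = rsqrtf(r2);
                  float s = pj.w * inv_r * inv_r * inv_r;
                  ai.x += s * dx;  ai.y += s * dy;  ai.z += s * dz;
              }
              __syncthreads();                      // done with this tile
          }

          if (i < n) acc[i] = ai;
      }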
Computations on GPU: a road towards desktop supercomputing
  Example: D5 potential




Overview
      Problem:
      Escape from a potential well.
      Watched values (trigger):
      N(t) – number of particles remaining in the well at time t

      Potential:

              $U_{D5} = 2ay^2 - x^2 + xy^2 + \frac{x^4}{4}$

              "Critical" energy: E_cr = E_S = 0
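
      For clarity (this step is implicit in the slides): the critical energy is the value of the potential at its saddle point, which for $U_{D5}$ sits at the origin:

              $\partial_x U_{D5} = -2x + y^2 + x^3 = 0$ and $\partial_y U_{D5} = 4ay + 2xy = 0$ at $(x, y) = (0, 0)$,
              so $E_S = U_{D5}(0, 0) = 0 = E_{cr}$.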
Computations on GPU: a road towards desktop supercomputing
  Example: D5 potential




Potential and structure of phase space:

      [Figure: level lines of the D5 potential in the (x, y) plane and the corresponding phase-space structure in the (x, p_x) plane.]
Computations on GPU: a road towards desktop supercomputing
  Example: D5 potential




Calculation setup:
              "simple" ensemble
              uniform initial distribution of N = 10240 particles in x > 0 ∩ U(x, y) < E
              trigger: x = 0 → q0 = 0.
              12 lines of simple Python code (examples/d5.py): specification of integration parameters
Computations on GPU: a road towards desktop supercomputing
  Example: D5 potential




Results:
      Regular particles are trapped in the well → the initial "mixed state" splits

      [Figure: N(t)/N(0) vs t for E = 0.1 and E = 0.9.]
Computations on GPU: a road towards desktop supercomputing
  Example: Hill problem




Overview
      Problem:
      Toy model of escape from a star cluster: escape of a star from the potential of a point rotating star cluster M_c and a point galaxy core M_g ≫ M_c
      Watched values (trigger):
      N(t) – number of particles remaining in the cluster at time t

      "Potential" in the cluster frame of reference (tidal approximation):

              $U_{Hill} = -\frac{3\omega^2 x^2}{2} - \frac{GM_c}{r}$

              "Critical" energy: E_cr = E_S = −4.5ω²
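
      For clarity (implicit in the slides): with the tidal radius $r_t = (GM_c/3\omega^2)^{1/3}$, the saddle points of $U_{Hill}$ lie at $x = \pm r_t$, and

              $E_S = U_{Hill}(\pm r_t, 0) = -\frac{3}{2}\omega^2 r_t^2 - \frac{GM_c}{r_t} = -\frac{9}{2}\omega^2 r_t^2$,

      which for the units used below ($GM_c = 1$, $\omega = 1/\sqrt{3}$, hence $r_t = 1$) gives $E_{cr} = -4.5\,\omega^2$.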
Computations on GPU: a road towards desktop supercomputing
  Example: Hill problem




Potential:

      [Figure: Hill curves in the (x, y) plane.]
Computations on GPU: a road towards desktop supercomputing
  Example: Hill problem




Calculation setup:
              "simple" ensemble
              uniform initial distribution of N = 10240 particles in |x| < r_t ∩ U(x, y) < E
              ω = 1/√3 → r_t = 1
              trigger: |x| − r_t = 0 → abs(q0) - 1. = 0.
              12 lines of simple Python code (examples/hill_plain.py): specification of integration parameters
Computations on GPU: a road towards desktop supercomputing
  Example: Hill problem




Results:
      Trapping of regular particles (some tricky physics here):

      [Figure: N(t) vs the number of time steps n_t for E = −1.3, E = −0.8 and E = −0.3.]
Computations on GPU: a road towards desktop supercomputing
  Example: Hill problem, N–body version




Overview
      Problem:
      Simplified model of escape from a star cluster: escape of a star from the potential of a rotating star cluster with total mass M_c and the point potential of a galaxy core with mass M_g ≫ M_c (2D)

      Watched values:
      Configuration of the cluster

      Potential of the galaxy core in the cluster frame of reference (tidal approximation):

              $U_{HillNB} = -3\omega^2 x^2$
Computations on GPU: a road towards desktop supercomputing
  Example: Hill problem, N–body version




”Toy” Hill model vs N–body Hill model:
Computations on GPU: a road towards desktop supercomputing
  Example: Hill problem, N–body version




Calculation setup:
              N-body ensemble
              2D (z = 0) initial distribution of N = 10240 particles inside a circle of radius R, with zero initial velocities
              14 lines of simple Python code (examples/hill_nbody.py): specification of integration parameters
              M_c = 1, R = 200, ω = 1/√3
Computations on GPU: a road towards desktop supercomputing
  Example: Hill problem, N–body version




Results: cluster configuration

      [Figure: snapshots of the cluster in the (x, y) plane at steps 201, 401, 601, 801, 1001 and 1201, with x and y ranging from −300 to 300.]
Computations on GPU: a road towards desktop supercomputing
  Performance results




            OpenSUSE 11.2, GCC 4.4, CUDA 3.0. AMD Athlon X2 4400+ (2.3GHz) / Intel Core2Duo E8500 (3.16GHz), Nvidia GeForce GTX 260.
            Not as good as it could be – subject to improvement.
            Estimate: ∼ 1 TFLOPs on two recent Fermi graphics processors

      [Figures: GFlop/s vs N on the GTX 260 in double precision for the N-body and "simple" ensembles (N from 1·10⁴ to 2·10⁵), and speed-up vs number of particles (up to 1·10⁶) for the OpenMP, SSE-optimized and CUDA versions.]
Computations on GPU: a road towards desktop supercomputing
  GPU computing prospects




Yesterday:
              uniform programming with OpenCL: no need to care about the concrete implementation
              desktop supercomputers (full ATX form factor):

      Nvidia Tesla C1060 x4
              ∼ 300 GFLOPs / 4 TFLOPs (double / single precision)
              Windows & Linux 32/64-bit support

      ATI FireStream x4
              ∼ 960 GFLOPs / 4.8 TFLOPs (double / single precision)
              Windows & Linux 32/64-bit support
Computations on GPU: a road towards desktop supercomputing
  GPU computing prospects




Today:
              CUDA 3.2 → C++: classes, namespaces, default parameters, operator overloading

      Nvidia Tesla C2050/C2070 x4
              ∼ 2 TFLOPs / 4 TFLOPs (double / single precision)
              concurrent kernel execution
              ∼ 8x in GFLOPs, ∼ 6x in GFLOPs/$, ∼ 5x in GFLOPs/W vs four Intel Xeon X5550 (85 GFLOPs / 73 GFLOPs)
              Tianhe-1A, Nebulae, Tsubame-2: №1, 3, 4 supercomputers in the Top500

      ATI FireStream 9350/9370 x4
              ∼ 2 TFLOPs / 8 TFLOPs (double / single precision)
              stable double-precision support (12 August 2010)
              LOEWE-CSC (University of Frankfurt): №22 in the Top500
Computations on GPU: a road towards desktop supercomputing
  GPU computing prospects




Tomorrow:
              OpenCL 1.2 (?) → matrix and "field" complex and real types
              new libraries: GPU programming as simple as CPU programming

      Nvidia GeForce GTX 580
              ∼ 0.75 TFLOPs / 1.5 TFLOPs (double / single precision)

      ATI Radeon 6950 "Cayman"
              ∼ 0.75 TFLOPs / 3 TFLOPs (double / single precision)
Computations on GPU: a road towards desktop supercomputing
  GPU computing prospects




      This presentation is available for download at
      http://www.scribd.com/doc/27751403

 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Matej Misik
 
GPU Computing for Data Science
GPU Computing for Data Science GPU Computing for Data Science
GPU Computing for Data Science Domino Data Lab
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010John Holden
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUsiguazio
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievVolodymyr Saviak
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelKoichi Shirahata
 
S1170143 2
S1170143 2S1170143 2
S1170143 2s1170143
 
Deep Learning on the SaturnV Cluster
Deep Learning on the SaturnV ClusterDeep Learning on the SaturnV Cluster
Deep Learning on the SaturnV Clusterinside-BigData.com
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Editor IJARCET
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Editor IJARCET
 

Similar to Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010) (20)

Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUs
 
Tesla personal super computer
Tesla personal super computerTesla personal super computer
Tesla personal super computer
 
HPC Top 5 Stories: September 22, 2017
HPC Top 5 Stories: September 22, 2017HPC Top 5 Stories: September 22, 2017
HPC Top 5 Stories: September 22, 2017
 
Introduction to GPUs for Machine Learning
Introduction to GPUs for Machine LearningIntroduction to GPUs for Machine Learning
Introduction to GPUs for Machine Learning
 
Revisiting Co-Processing for Hash Joins on the Coupled Cpu-GPU Architecture
Revisiting Co-Processing for Hash Joins on the CoupledCpu-GPU ArchitectureRevisiting Co-Processing for Hash Joins on the CoupledCpu-GPU Architecture
Revisiting Co-Processing for Hash Joins on the Coupled Cpu-GPU Architecture
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computing
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
 
GPU Computing for Data Science
GPU Computing for Data Science GPU Computing for Data Science
GPU Computing for Data Science
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUs
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 Kiev
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
S1170143 2
S1170143 2S1170143 2
S1170143 2
 
Deep Learning on the SaturnV Cluster
Deep Learning on the SaturnV ClusterDeep Learning on the SaturnV Cluster
Deep Learning on the SaturnV Cluster
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045
 
Dl2 computing gpu
Dl2 computing gpuDl2 computing gpu
Dl2 computing gpu
 

Recently uploaded

From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and businessFrancesco Corti
 
EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarThousandEyes
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.IPLOOK Networks
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...DianaGray10
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxNeo4j
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsDianaGray10
 
UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3DianaGray10
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInThousandEyes
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kitJamie (Taka) Wang
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Libraryshyamraj55
 
Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Muhammad Tiham Siddiqui
 
The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)IES VE
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTopCSSGallery
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIVijayananda Mohire
 
Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosErol GIRAUDY
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingMAGNIntelligence
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxNeo4j
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FESTBillieHyde
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdfThe Good Food Institute
 

Recently uploaded (20)

From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and business
 
EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? Webinar
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projects
 
UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Library
 
Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)
 
The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development Companies
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAI
 
Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenarios
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced Computing
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FEST
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf
 

Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)

  • 1. Computations on GPU: a road towards desktop supercomputing Computations on GPU: a road towards desktop supercomputing Glib Ivashkevych Institute of Theoretical Physics, NSC KIPT November 24, 2010
  • 2. Computations on GPU: a road towards desktop supercomputing Quick outline GPU – Graphic Processing Unit programmable manycore multithreaded with very high memory bandwidth
  • 3. Computations on GPU: a road towards desktop supercomputing Quick outline GPU – Graphic Processing Unit programmable manycore multithreaded with very high memory bandwidth We are going to talk about: how GPU became usefull for scientific computations GPU intrinsics and programming how to get as much as possible from GPU and survive:) the future of GPUs and GPU programming CUDA(Nvidia s Compute Unified Device Architecture) , most of the time, and OpenCL(Open Computing Language)
  • 4. Computations on GPU: a road towards desktop supercomputing Should we care? But first of all: do we really need GPU computing?
  • 5. Computations on GPU: a road towards desktop supercomputing Should we care? But first of all: do we really need GPU computing? Short answer: yes! high performance transparent scalability
  • 6. Computations on GPU: a road towards desktop supercomputing Should we care? But first of all: do we really need GPU computing? Short answer: yes! high performance transparent scalability More accurate answer: yes, for problems with high parallelism. large datasets portions of data could be processed independently
  • 7. Computations on GPU: a road towards desktop supercomputing Should we care? But first of all: do we really need GPU computing? Short answer: yes! high performance transparent scalability More accurate answer: yes, for problems with high parallelism. large datasets portions of data could be processed independently Most accurate answer: yes, for problems with high data parallelism.
  • 8. Computations on GPU: a road towards desktop supercomputing Reference For reference GFLOPs – 109 FLoating Point Operations Per second ∼ 55 GFLOPs on Intel Core i7 Nehalem 975 (according to Intel) ∼ 125 GFLOPs AMD Opteron Istanbul 2435 ∼ 500 GFLOPs in double and ∼ 2 TFLOPs in single precision on Nvidia Tesla C2050 ∼ 3.2 · 103 GFLOPs on ASCI Red in Sandia National Laboratory – the fastest supercomputer as for November 1999 ∼ 87 · 103 GFLOPs on TSUBAME-1 Grid Cluster in Tokyo Institute of Technology – first GPU based supercomputer, №88 in Top500, as for November 2010 (№56 – in November 2009) ∼ 2.56 · 106 GFLOPs on Tianhe-1-A at National Supercomputing Center in Tianjin – the fastest supercomputer as for November 2010 – GPU based
  • 9. Computations on GPU: a road towards desktop supercomputing Examples Matrix and vector operations CUBLAS¹ on Nvidia Tesla C2050 (CUDA 3.2) vs Intel MKL² 10.2 on Intel Core i7 Nehalem (4 threads): ∼ 8x in double precision CULA³ on Nvidia Tesla C2050 (CUDA 3.2): up to ∼ 220 GFLOPs in double precision, up to ∼ 450 GFLOPs in single precision; vs Intel MKL 10.2: ∼ 4–6x speed–up ¹ CUDA accelerated Basic Linear Algebra Subprograms ² Math Kernel Library ³ LAPACK for Heterogeneous systems
  • 10. Computations on GPU: a road towards desktop supercomputing Examples Fast Fourier Transform CUFFT on Nvidia Tesla C2070 (CUDA 3.2) up to 65 GFLOPs in double precision up to 220 GFLOPs in single precision vs Intel MKL on Intel Core i7 Nehalem ∼ 9x in double precision ∼ 20x in single precision
  • 11. Computations on GPU: a road towards desktop supercomputing Examples Physics: Computational Fluid Dynamics Simulation of transition to turbulence¹ Nvidia Tesla S1070 vs quad-core Intel Xeon X5450 (3GHz) ∼ 20x over serial code ∼ 10x over OpenMP implementation (2 threads) ∼ 5x over OpenMP implementation (4 threads) ¹ A.S. Antoniou et al., American Institute of Aeronautics and Astronautics Paper 2010 – 0525
  • 12. Computations on GPU: a road towards desktop supercomputing Examples Quantum chemistry Calculations of molecular orbitals¹ Nvidia GeForce GTX 280 vs Intel Core2Quad Q6600 (2.4GHz) ∼ 173x over serial non–optimized code ∼ 14x over parallel optimized code (4 threads) ¹ D.J. Hardy et al., GPGPU 2009
  • 13. Computations on GPU: a road towards desktop supercomputing Examples Medical Imaging Isosurface reconstruction from scalar volumetric data¹ Nvidia GeForce GTX 285 vs ? ∼ 68x over optimized CPU code nearly real-time processing of data ¹ T. Kalbe et al., Proceedings of 5th International Symposium on Visual Computing (ISVC 2009)
  • 14. Computations on GPU: a road towards desktop supercomputing Examples GPUGrid.net Biomolecular simulations accelerated by Nvidia CUDA boards and Sony PlayStation ∼ 8000 users from 101 countries ∼ 145 TFLOPs on average ≈ №25 in Top500 ∼ 50 GFLOPs from every active user
  • 15. Computations on GPU: a road towards desktop supercomputing Examples ATLAS experiment on Large Hadron Collider Particle tracking, triggering, event simulation¹ possible Higgs events – track a large number of particles ∼ 32x in tracking, ∼ 35x in triggering on Nvidia Tesla C1060 ¹ P.J. Clark et al., Processing Petabytes per Second with the ATLAS Experiment at the LHC (GTC 2010)
  • 16. Computations on GPU: a road towards desktop supercomputing Examples And even more examples in: N–body simulations seismic simulations molecular dynamics SETI@Home & MilkyWay@Home finance neural networks ... and, of course, graphics VFX, rendering image editing, video
  • 17. Computations on GPU: a road towards desktop supercomputing Examples GPU Technology Conference 2010 (September 20-23)
  • 18. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History
  • 19. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History 2 CUDA: architecture overview and programming model
  • 20. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy
  • 21. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox
  • 22. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL
  • 23. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality
  • 24. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture
  • 25. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential
  • 26. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem
  • 27. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version
  • 28. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results
  • 29. Computations on GPU: a road towards desktop supercomputing Outline Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 30. Computations on GPU: a road towards desktop supercomputing History Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 31. Computations on GPU: a road towards desktop supercomputing History History in brief:
  • 32. Computations on GPU: a road towards desktop supercomputing History GPGPU in 2001-2006: through graphics API (OpenGL or DirectX) extremely hard only in single precision
  • 33. Computations on GPU: a road towards desktop supercomputing History GPGPU in 2001-2006: through graphics API (OpenGL or DirectX) extremely hard only in single precision GPGPU today: straightforward easy in double precision.
  • 34. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 35. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Hardware model: GT200 architecture consists of multiprocessors each MP has: 8 stream processors 1 unit for double precision operations shared memory global memory
  • 36. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Hardware model: Fermi architecture each MP has: 32 stream processors 4 SFU’s (Special Function Unit) each SP has: 1 FP Unit & 1 INT Unit
  • 37. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Hardware model: Multiprocessors and threads MP can launch numerous threads threads are ”lightweight” – little creation and switching overhead threads run the same code thread synchronization within an MP cooperation via shared memory each thread has a unique identifier – thread ID Efficiency is achieved by hiding latency with computation, not by cache usage as on a CPU
  • 38. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Software model: C for CUDA a set of extensions to C runtime library function and variable type qualifiers built–in vector types: float4, double2 etc. built–in variables Kernels map the parallel part of the program to the GPU execution: N times in parallel by N CUDA threads CUDA Driver API low–level control over the execution no need for the nvcc compiler if kernels are precompiled – only the driver is needed
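A quick sketch of the built–in vector types mentioned above (the function and variable names are illustrative assumptions, not from the slides); the slide–by–slide kernel example follows.

    // Built-in vector types: float4 packs four floats, double2 packs two doubles
    __device__ float Sum4(void)
    {
        float4 p = make_float4(1.0f, 2.0f, 3.0f, 4.0f);   // components p.x, p.y, p.z, p.w
        double2 z = make_double2(0.0, 1.0);               // components z.x, z.y
        return p.x + p.y + p.z + p.w + (float)z.y;
    }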
  • 39. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Software model: Example
    // Some function - executed on device (GPU)
    __device__ float DeviceFunction(float* A, float* B)
    {
        // Some math
        return smth;
    }
  • 40. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Software model: Example
    // Some function - executed on device (GPU)
    __device__ float DeviceFunction(float* A, float* B)
    {
        // Some math
        return smth;
    }
    // Kernel definition
    __global__ void SomeKernel(float* A, float* B, float C)
    {
        // Some math
        C = DeviceFunction(A, B);
    }
  • 41. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Software model: Example
    // Some function - executed on device (GPU)
    __device__ float DeviceFunction(float* A, float* B)
    {
        // Some math
        return smth;
    }
    // Kernel definition
    __global__ void SomeKernel(float* A, float* B, float C)
    {
        // Some math
        C = DeviceFunction(A, B);
    }
    // Host code
    int main()
    {
        // Kernel invocation
        SomeKernel<<<1, N>>>(A, B, C);
    }
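The fragments above are schematic: A, B, C and N are never declared, and since C is passed to the kernel by value the result would not be visible outside it. A minimal self–contained sketch that compiles with nvcc could look as follows; the problem size, the stub arithmetic and the switch to a device pointer for C are my assumptions, not part of the original slides.

    #include <cuda_runtime.h>

    #define N 256   // illustrative problem size (assumption)

    // Some function - executed on device (GPU)
    __device__ float DeviceFunction(float* A, float* B)
    {
        return A[0] + B[0];   // stub math (assumption)
    }

    // Kernel definition: C is a device pointer here so that the result persists
    __global__ void SomeKernel(float* A, float* B, float* C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) C[i] = DeviceFunction(A, B);
    }

    // Host code
    int main()
    {
        size_t size = N * sizeof(float);
        float *d_A, *d_B, *d_C;
        cudaMalloc((void**)&d_A, size);
        cudaMalloc((void**)&d_B, size);
        cudaMalloc((void**)&d_C, size);
        SomeKernel<<<1, N>>>(d_A, d_B, d_C);   // one block of N threads, as on the slide
        cudaDeviceSynchronize();
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        return 0;
    }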
  • 42. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Software model: Explanations __device__ qualifier defines a function that is: executed on the device callable from the device only
  • 43. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Software model: Explanations __device__ qualifier defines a function that is: executed on the device callable from the device only __global__ qualifier defines a function that is: executed on the device callable from the host only
  • 44. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Execution model
  • 45. Computations on GPU: a road towards desktop supercomputing CUDA: architecture overview and programming model Scalability underlying hardware architecture is hidden threads can synchronize only within the MP ↓ we do not need to know the exact number of MPs ↓ scalable applications – from GTX8800 to Fermi
  • 46. Computations on GPU: a road towards desktop supercomputing Threads and memories hierarchy Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 47. Computations on GPU: a road towards desktop supercomputing Threads and memories hierarchy Single threads each thread has private local memory threads are identified by the built–in variable threadIdx (uint3 type): int idx = threadIdx.x + threadIdx.y + threadIdx.z; threads form a 1–, 2– or 3–dimensional array – a vector, matrix or field Threads are organized into thread blocks
  • 48. Computations on GPU: a road towards desktop supercomputing Threads and memories hierarchy Thread blocks each block has shared memory visible to all threads within the block blocks are identified by the built–in variable blockIdx (uint3 type): int b_idx = blockIdx.x + blockIdx.y; the dimension of the block is given by the built–in variable blockDim (dim3 type) Blocks are organized into a grid
  • 49. Computations on GPU: a road towards desktop supercomputing Threads and memories hierarchy Grid of thread blocks global device memory is accessible by all threads in the grid the dimension of the grid is given by the built–in variable gridDim (dim3 type)
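Putting threadIdx, blockIdx, blockDim and gridDim together, the usual idiom for a unique global index is sketched below; the kernel name and the grid–stride loop are illustrative additions, not from the slides.

    // Global 1D index built from the thread/block hierarchy
    __global__ void Fill(float* data, int n)
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;   // unique across the whole grid
        int stride = gridDim.x * blockDim.x;               // total number of launched threads
        for (int i = gid; i < n; i += stride)              // grid-stride loop covers any n
            data[i] = (float)i;
    }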
  • 50. Computations on GPU: a road towards desktop supercomputing Threads and memories hierarchy Threads and memories hierarchy
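To make the role of shared memory in this hierarchy concrete, here is an illustrative block–level sum (my sketch, not from the slides; it assumes blockDim.x is a power of two and the input holds gridDim.x · blockDim.x elements).

    // Each block reduces blockDim.x elements in the shared memory of its MP;
    // __syncthreads() synchronizes all threads of the block between steps.
    __global__ void BlockSum(const float* in, float* blockSums)
    {
        extern __shared__ float cache[];                   // shared memory of the block
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        cache[tid] = in[gid];
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // tree reduction
            if (tid < s) cache[tid] += cache[tid + s];
            __syncthreads();
        }
        if (tid == 0) blockSums[blockIdx.x] = cache[0];    // one partial sum per block
    }
    // launch: BlockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_sums);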
  • 51. Computations on GPU: a road towards desktop supercomputing Threads and memories hierarchy Example: vector addition
    int main()
    {
        // Allocate vectors in device memory
        size_t size = N * sizeof(float);
        float* d_A; cudaMalloc((void**)&d_A, size);
        float* d_B; cudaMalloc((void**)&d_B, size);
        float* d_C; cudaMalloc((void**)&d_C, size);
        // Copy data from host memory to device memory
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
        // Prepare the kernel launch
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
        // Free device memory
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
    }
  • 52. Computations on GPU: a road towards desktop supercomputing Threads and memories hierarchy Example: vector addition
    // Kernel code
    __global__ void VecAdd(float* A, float* B, float* C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index, matching the multi-block launch
        if (i < N) C[i] = A[i] + B[i];
    }
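Worth noting (my addition, not on the slides): every CUDA runtime call returns a cudaError_t, and checking it is the cheapest first debugging step before reaching for the tools discussed below.

    // Minimal error-checking sketch (assumes <cstdio> is included)
    cudaError_t err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));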
  • 53. Computations on GPU: a road towards desktop supercomputing Threads and memories hierarchy Performance analysis and optimization there must be enough thread blocks per MP to hide latency try not to under–populate blocks
  • 54. Computations on GPU: a road towards desktop supercomputing Threads and memories hierarchy Performance analysis and optimization there must be enough thread blocks per MP to hide latency try not to under–populate blocks use memory bandwidth (∼ 100GB/s!) efficiently: coalescing; non–optimized access to global memory can reduce performance by order(s) of magnitude try to achieve high arithmetic intensity
  • 55. Computations on GPU: a road towards desktop supercomputing Threads and memories hierarchy Performance analysis and optimization there must be enough thread blocks per MP to hide latency try not to under–populate blocks use memory bandwidth (∼ 100GB/s!) efficiently: coalescing; non–optimized access to global memory can reduce performance by order(s) of magnitude try to achieve high arithmetic intensity never diverge threads within one warp: divergence → serialization = no parallelism
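An illustrative sketch of the last point (mine, not from the slides): branching on an odd/even thread index splits every 32–thread warp into two serialized halves, while branching on a warp–aligned boundary keeps each warp on a single path.

    __global__ void Divergent(float* a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0) a[i] *= 2.0f;          // half of each warp idles here
        else                      a[i] += 1.0f;          // ...and the other half idles here
    }

    __global__ void Uniform(float* a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0) a[i] *= 2.0f;   // whole warps take one branch
        else                             a[i] += 1.0f;   // whole warps take the other
    }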
  • 56. Computations on GPU: a road towards desktop supercomputing Toolbox Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 57. Computations on GPU: a road towards desktop supercomputing Toolbox Start–up tools drivers CUDA Toolkit nvcc compiler, runtime library, header files, CUBLAS, CUFFT, Visual Profiler etc. CUDA SDK examples, Occupancy Calculator etc. Free download at http://developer.nvidia.com/object/cuda_2_3_downloads.html Support for 32 and 64-bit Windows, Linux¹ & Mac OS X ¹ Supported distros in CUDA 3.2: Fedora 13, RH Enterprise 4.8 & 5.5, OpenSUSE 11.2, SLED 11.0, Ubuntu 10.04
  • 58. Computations on GPU: a road towards desktop supercomputing Toolbox Developers Tools CUDA-gdb integration into gdb CUDA C support works on all 32/64–bit Linux distros breakpoints and single step execution
  • 59. Computations on GPU: a road towards desktop supercomputing Toolbox Developers Tools CUDA-gdb integration into gdb CUDA C support works on all 32/64–bit Linux distros breakpoints and single step execution CUDA Visual Profiler tracks events with hardware counters global memory loads/stores total branches and divergent branches taken by threads instruction count number of serialized thread warps due to address conflicts (shared and constant memory)
  • 60. Computations on GPU: a road towards desktop supercomputing PyCUDA&PyOpenCL Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 61. Computations on GPU: a road towards desktop supercomputing PyCUDA&PyOpenCL Python easy to learn dynamically typed rich built–in functionality interpreted very well documented has a large and active community
  • 62. Computations on GPU: a road towards desktop supercomputing PyCUDA&PyOpenCL Scientific tools: Scipy – modeling and simulation Fourier transforms ODE Optimization scipy.weave.inline – C inlining with little or no overhead ···
  • 63. Computations on GPU: a road towards desktop supercomputing PyCUDA&PyOpenCL Scientific tools: Scipy – modeling and simulation Fourier transforms ODE Optimization scipy.weave.inline – C inlining with little or no overhead ··· NumPy – arrays flexible array creation routines sorting, random sampling and statistics ···
  • 64. Computations on GPU: a road towards desktop supercomputing PyCUDA&PyOpenCL Scientific tools: Scipy – modeling and simulation Fourier transforms ODE Optimization scipy.weave.inline – C inlining with little or no overhead ··· NumPy – arrays flexible array creation routines sorting, random sampling and statistics ··· Python is a convenient way of interfacing C/C++ libraries
  • 65. Computations on GPU: a road towards desktop supercomputing PyCUDA&PyOpenCL PyCUDA provides complete access to CUDA features automatically manages resources error handling and translation into Python exceptions convenient abstractions: GPUArray metaprogramming: creation of CUDA source code dynamically interactive! PyOpenCL is pretty much the same in concept – but not only for Nvidia GPUs. Also for ATI/AMD cards, AMD & Intel processors etc. (IBM Cell?)
  • 66. Computations on GPU: a road towards desktop supercomputing PyCUDA&PyOpenCL Python and CUDA We could interface with: Python C API – low–level approach: overkill SWIG, Boost::Python – high–level approach: overkill PyCUDA – most simple and straightforward way for CUDA only scipy.weave.inline – simple and straightforward way for both CUDA and plain C/C++
  • 67. Computations on GPU: a road towards desktop supercomputing EnSPy functionality Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 68. Computations on GPU: a road towards desktop supercomputing EnSPy functionality Motivation Combine flexibility of Python with efficiency of C++ → CUDA for N–body sim interface of EnSPy is written in Python core of EnSPy is written in C++ joined together by scipy.weave.inline C++ core could be used without Python – just include header and link with precompiled shared library easily extensible: both through high–level Python interface and low–level C++ core – new algorithms, initial distributions etc. multi–GPU parallelization it’s easy to experiment with EnSPy!
  • 69. Computations on GPU: a road towards desktop supercomputing EnSPy functionality EnSPy functionality Types of ensembles: ”Simple” ensemble – without interaction, only external potential N–body ensemble – both external potential and gravitational interaction between particles Current algorithms: 4-th order Runge–Kutta for ”simple” ensemble Hermite scheme with shared time steps for N-body ensemble
  • 70. Computations on GPU: a road towards desktop supercomputing EnSPy functionality Predefined initial distributions: Uniform, point and spherical for ”simple” ensembles Uniform sphere with 2T /|U| = 1 for the N-body ensemble the user can supply functions (in Python) for initial ensemble generation User-specified values and expressions: parameters of the initial distribution potential, forces, parameters of the integration scheme arbitrary number of triggers – Ni(t), the number of particles which do not cross a given hypersurface Fi(q, p) = 0 before time t arbitrary number of averages – F̄i(q, p, t) – quantities which should be averaged over the ensembles
  • 71. Computations on GPU: a road towards desktop supercomputing EnSPy functionality Runtime generation and compilation of C and CUDA code: User-specified expressions (as Python strings) are wrapped by the EnSPy template subpackage into C functions and a CUDA module Compiled at runtime High usage and calculation efficiency: flexible Python interface all actual calculations are performed by the runtime-generated C extension and a precompiled shared library Drawback: extra time for generation and compilation of new code
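Purely as an illustration of this runtime–generation idea (a guess at the shape of the output, not EnSPy's actual template code): a user expression supplied as a Python string could be pasted into a __device__ function and compiled with nvcc on the fly.

    // Hypothetical result of wrapping the user string "-omega*omega*q[i]"
    __device__ double UserForce(const double* q, int i, double omega)
    {
        return -omega * omega * q[i];   // body generated from the user's expression
    }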
  • 72. Computations on GPU: a road towards desktop supercomputing EnSPy architecture Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 73. Computations on GPU: a road towards desktop supercomputing EnSPy architecture Execution flow and architecture Input parameters ↓ Ensemble population (predefined or user specified distribution) ↓ Code generation and compilation ↓ Launching NGPUs threads
  • 74. Computations on GPU: a road towards desktop supercomputing EnSPy architecture GPU parallelization scheme for N–body simulations
  • 75. Computations on GPU: a road towards desktop supercomputing EnSPy architecture Order of force calculation
  • 76. Computations on GPU: a road towards desktop supercomputing Example: D5 potential Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 77. Computations on GPU: a road towards desktop supercomputing Example: D5 potential Overview Problem: Escape from a potential well. Watched values (trigger): N(t) – number of particles remaining in the well at time t Potential: UD5 = 2ay² − x² + xy² + x⁴/4 ”Critical” energy: Ecr = ES = 0
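For reference (my derivation from the reconstructed potential above, not stated on the slides), the equations of motion integrated by the RK4 scheme are ẍ = −∂U/∂x = 2x − y² − x³ and ÿ = −∂U/∂y = −4ay − 2xy; the saddle point at the origin gives the critical energy ES = 0.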
  • 78. Computations on GPU: a road towards desktop supercomputing Example: D5 potential Potential and structure of phase space: [figures: level lines of the D5 potential in the x–y plane; phase–space structure in the x–px plane]
  • 79. Computations on GPU: a road towards desktop supercomputing Example: D5 potential Calculation setup: ”Simple ensemble” uniform initial distribution of N = 10240 particles in x > 0 ∩ U(x, y ) < E trigger: x = 0 → q0 = 0. 12 lines of simple Python code (examples/d5.py): specification of integration parameters
  • 80. Computations on GPU: a road towards desktop supercomputing Example: D5 potential Results: Regular particles are trapped in the well → the initial ”mixed state” splits [figure: N(t)/N(0) vs t for E = 0.1 and E = 0.9]
  • 81. Computations on GPU: a road towards desktop supercomputing Example: Hill problem Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 82. Computations on GPU: a road towards desktop supercomputing Example: Hill problem Overview Problem: Toy model of escape from a star cluster: escape of a star from the potential of a point rotating star cluster Mc and a point galaxy core Mg Watched values (trigger): N(t) – number of particles remaining in the cluster at time t ”Potential” in the cluster frame of reference (tidal approximation): UHill = −(3/2)ω²x² − GMc/r ”Critical” energy: Ecr = ES = −4.5ω²
  • 83. Computations on GPU: a road towards desktop supercomputing Example: Hill problem Potential: [figure: Hill curves in the x–y plane]
  • 84. Computations on GPU: a road towards desktop supercomputing Example: Hill problem Calculation setup: ”Simple ensemble” uniform initial distribution of N = 10240 particles in |x| < rt ∩ U(x, y) < E ω = 1/√3 → rt = 1 trigger: |x| − rt = 0 → abs(q0) - 1. = 0. 12 lines of simple Python code (examples/hill_plain.py): specification of integration parameters
  • 85. Computations on GPU: a road towards desktop supercomputing Example: Hill problem Results: Trapping of regular particles (some tricky physics here): [figure: N(t) vs the number of time steps nt for E = −1.3, E = −0.8 and E = −0.3]
  • 86. Computations on GPU: a road towards desktop supercomputing Example: Hill problem, N–body version Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 87. Computations on GPU: a road towards desktop supercomputing Example: Hill problem, N–body version Overview Problem: Simplified model of escape from a star cluster: escape of a star from the potential of a rotating star cluster with total mass Mc and the point potential of a galaxy core with mass Mg (2D) Watched values: Configuration of the cluster Potential of the galaxy core in the cluster frame of reference (tidal approximation): UHillNB = −(3/2)ω²x²
  • 88. Computations on GPU: a road towards desktop supercomputing Example: Hill problem, N–body version ”Toy” Hill model vs N–body Hill model:
  • 89. Computations on GPU: a road towards desktop supercomputing Example: Hill problem, N–body version Calculation setup: N–body ensemble 2D (z = 0) initial distribution of N = 10240 particles inside a circle of radius R with zero initial velocities 14 lines of simple Python code (examples/hill_nbody.py): specification of integration parameters Mc = 1, R = 200, ω = 1/√3
  • 90. Computations on GPU: a road towards desktop supercomputing Example: Hill problem, N–body version Results: cluster configuration [figure: six snapshots of the cluster in the x–y plane at steps 201, 401, 601, 801, 1001 and 1201]
  • 91. Computations on GPU: a road towards desktop supercomputing Performance results OpenSUSE 11.2, GCC 4.4, CUDA 3.0. AMD Athlon X2 4400+ (2.3GHz) / Intel Core2Duo E8500 (3.16GHz), Nvidia Geforce 260 GTX. Not as good as it could be – subject to improvement. Estimation: ∼ 1 TFLOPs on 2x recent Fermi graphic processors [figures: speed–up vs N for OpenMP, SSE–optimized and CUDA versions; GFlop/s vs number of particles for GTX260 in double precision, N–body and ”simple” ensemble]
  • 92. Computations on GPU: a road towards desktop supercomputing GPU computing prospects Outline 1 History 2 CUDA: architecture overview and programming model 3 Threads and memories hierarchy 4 Toolbox 5 PyCUDA&PyOpenCL 6 EnSPy functionality 7 EnSPy architecture 8 Example: D5 potential 9 Example: Hill problem 10 Example: Hill problem, N–body version 11 Performance results 12 GPU computing prospects
  • 93. Computations on GPU: a road towards desktop supercomputing GPU computing prospects Yesterday: uniform programming with OpenCL: no need to care about the concrete implementation desktop supercomputers (full ATX form–factor): Nvidia Tesla C1060 x4 – ∼ 300 GFLOPs / 4 TFLOPs, Windows & Linux 32/64–bit support; ATI FireStream x4 – ∼ 960 GFLOPs / 4.8 TFLOPs, Windows & Linux 32/64–bit support
  • 94. Computations on GPU: a road towards desktop supercomputing GPU computing prospects Today: CUDA 3.2 → C++: classes, namespaces, default parameters, operator overloading Nvidia Tesla C2050/2070 x4 – ∼ 2 TFLOPs / 4 TFLOPs, concurrent kernel execution, ∼ 8x in GFLOPs, ∼ 6x in GFLOPs/$, ∼ 5x in GFLOPs/W vs four Intel Xeon X5550 (85 GFLOPs/73 GFLOPs), Tianhe-1-A, Nebulae, Tsubame-2: №1, 3, 4 supercomputers in Top500; ATI FireStream 9350/9370 x4 – ∼ 2 TFLOPs / 8 TFLOPs, stable double–precision support (12 August 2010), LOEWE–CSC (University of Frankfurt): №22 in Top500
  • 95. Computations on GPU: a road towards desktop supercomputing GPU computing prospects Tomorrow: OpenCL 1.2 (?) → matrix and ”field”, complex and real types New libraries: GPU programming as simple as CPU programming Nvidia Geforce 580 GTX – ∼ 0.75 TFLOPs / 1.5 TFLOPs; ATI Radeon 6950 ”Cayman” – ∼ 0.75 TFLOPs / 3 TFLOPs
  • 96. Computations on GPU: a road towards desktop supercomputing GPU computing prospects This presentation is available for download at http://www.scribd.com/doc/27751403