Massively Parallel Computing
CS 264 / CSCI E-292
Lecture #6: CUDA Ninja Tricks | March 1st, 2011
GPU “Scripting”, Meta-programming, Auto-tuning

Nicolas Pinto (MIT, Harvard)
pinto@mit.edu
News
During this course, we’ll try to “ ” and use existing material ;-)
(adapted for CS264)
Today
yey!!
Outline

1. Scripting GPUs with PyCUDA
2. Meta-programming and RTCG
3. Case study in brain-inspired AI
Intro GPUs Scripting Hands-on         Intro Example Working with PyCuda A peek under the hood


Why do Scripting for GPUs?

     GPUs are everything that scripting
     languages are not.
            Highly parallel
            Very architecture-sensitive
            Built for maximum
            compute/memory throughput
     → complement each other
     CPU: largely restricted to control
     tasks (∼1000/sec)
            Scripting fast enough
     Realize a promise: Use Scripting. . .
            from first prototype
            to full-scale production code.


     slide by Andreas Klöckner (NYU)
     Nicolas Pinto (MIT) and Andreas Klöckner (Brown)       PyCuda Tutorial
GPU Scripting PyOpenCL News RTCG Showcase       Overview Being Productive


Why do Scripting for GPUs?


      GPUs are everything that scripting
      languages are not.
            Highly parallel
            Very architecture-sensitive
            Built for maximum FP/memory
            throughput
      → complement each other
      CPU: largely restricted to control
      tasks (∼1000/sec)
            Scripting fast enough
      Python + CUDA = PyCUDA
      Python + OpenCL = PyOpenCL


      slide by Andreas Klöckner (NYU)
      Andreas Klöckner       PyCUDA: Even Simpler GPU Programming with Python


How are High-Performance Codes constructed?



      “Traditional” Construction of
      High-Performance Codes:
            C/C++/Fortran
            Libraries
      “Alternative” Construction of
      High-Performance Codes:
            Scripting for ‘brains’
            GPUs for ‘inner loops’
      Play to the strengths of each
      programming environment.






Scripting: Python


   One example of a scripting language: Python

         Mature
         Large and active community
         Emphasizes readability
         Written in widely-portable C
         A ‘multi-paradigm’ language
         Rich ecosystem of sci-comp related
         software






Scripting Languages



   Python:
         is discoverable and interactive.
         has comprehensive built-in functionality.
         manages resources automatically.
         uses run-time typing.
         works well for “gluing” lower-level blocks together.






Scripting: Goals

   Scripting languages aim to reduce the load on the programmer:
         Reduce required knowledge
         Encourage experimentation
         Eliminate sources of error
         Encourage abstraction wherever possible
         Value programmer time over computer time

    Think about the tools you use.
                                                          Use the right tool for the job.






Scripting: Speed


        Usual answer to the “Speed
        Question”:
        Hybrid (“mixed”) Code.
        Plays to the strengths of each
        language.
        But: Introduces (some)
        complexity.

   Observation: GPU code is already hybrid.

   Consequence: No added complexity through hybrid code.





Whetting your appetite



    import pycuda.driver as cuda
    import pycuda.autoinit, pycuda.compiler
    import numpy

    # Create a 4x4 float32 array on the host and copy it to the GPU.
    a = numpy.random.randn(4, 4).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)


    [This is examples/demo.py in the PyCUDA distribution.]






Whetting your appetite

    # Compute kernel: each of the 4x4 threads doubles one array element.
    mod = pycuda.compiler.SourceModule("""
        __global__ void twice(float *a)
        {
          int idx = threadIdx.x + threadIdx.y*4;
          a[idx] *= 2;
        }
        """)

    func = mod.get_function("twice")
    func(a_gpu, block=(4, 4, 1))

    # Copy the result back to the host and compare with the input.
    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print(a_doubled)
    print(a)
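To see why a 4x4 thread block touches every element of the 4x4 array exactly once, the kernel's index arithmetic can be emulated on the CPU. This is a plain-Python sketch rather than PyCUDA code, and the helper name `flat_index` is made up for illustration.

```python
# CPU-side emulation of the kernel's index computation (illustrative only).
BLOCK = (4, 4, 1)  # same block shape as func(a_gpu, block=(4,4,1))

def flat_index(tx, ty, width=4):
    """Mirror `threadIdx.x + threadIdx.y*4` from the kernel."""
    return tx + ty * width

# Enumerate every (threadIdx.x, threadIdx.y) pair in the block.
indices = [flat_index(tx, ty)
           for ty in range(BLOCK[1])
           for tx in range(BLOCK[0])]

# Each of the 16 elements of the row-major 4x4 array is hit exactly once.
print(sorted(indices) == list(range(16)))
```

The hard-coded `4` in the kernel ties it to this array width; a different shape would need a different multiplier (or `blockDim.x`).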




Whetting your appetite, Part II




   Did somebody say “Abstraction is good”?






     import numpy
     import pycuda.autoinit
     from pycuda import gpuarray

     a_cpu = numpy.random.randn(4, 4).astype(numpy.float32)
     b_cpu = numpy.random.randn(4, 4).astype(numpy.float32)
     c_cpu = a_cpu * b_cpu

     # Same elementwise product on the GPU; .get() copies the result back.
     a_gpu = gpuarray.to_gpu(a_cpu)
     b_gpu = gpuarray.to_gpu(b_cpu)
     c_gpu = (a_gpu * b_gpu).get()

     print(c_cpu - c_gpu)






 Remember me?

     // trivia
     #include <stdio.h>
     #include <stdlib.h>

     #define CUDA_CHK(NAME, ARGS) { \
       cudaError_t cuda_err_code = NAME ARGS; \
       if (cuda_err_code != cudaSuccess) { \
         printf("%s failed with code %d\n", #NAME, cuda_err_code); \
         abort(); \
       } \
     }
     // end

     // kernel
     __global__ void square_array(float *a, float *b, int n)
     {
       int i = (blockIdx.x * blockDim.y + threadIdx.y)
           * blockDim.x + threadIdx.x;
       if (i < n)
         a[i] = a[i] * b[i];
     }
     // end

     // main1
     int main()
     {
       cudaSetDevice(0); // EDIT ME

       const int n = 4096;

       float *a_host = (float *) malloc(n*sizeof(float));
       float *b_host = (float *) malloc(n*sizeof(float));

       float *a_device, *b_device;
       CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
       CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
     // end

     // main2
       for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

       CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float),
           cudaMemcpyHostToDevice));
       CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float),
           cudaMemcpyHostToDevice));

       dim3 block_dim(16, 16);
       int block_size = block_dim.x*block_dim.y;
       int n_blocks = (n + block_size-1) / block_size;
       square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);
     // end

     // main3
       CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float),
           cudaMemcpyDeviceToHost));

       for (int i = 0; i < n; i++)
         printf("%.0f ", a_host[i]);
       puts("\n");

       // Release host and device buffers for both arrays.
       free(a_host);
       free(b_host);
       CUDA_CHK(cudaFree, (a_device));
       CUDA_CHK(cudaFree, (b_device));
     }
     // end
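The launch configuration in the C example relies on ceiling division for the block count and on a guard (`i < n`) in the kernel. The following plain-Python sketch replays that arithmetic on the CPU to check that every index below `n` is covered exactly once; the helper `global_index` is a hypothetical name used only here.

```python
# Emulate the launch configuration from the C example on the CPU.
n = 4096
block_dim = (16, 16)                            # dim3 block_dim(16, 16)
block_size = block_dim[0] * block_dim[1]        # 256 threads per block
n_blocks = (n + block_size - 1) // block_size   # ceiling division

def global_index(bx, tx, ty):
    """Mirror (blockIdx.x*blockDim.y + threadIdx.y)*blockDim.x + threadIdx.x."""
    return (bx * block_dim[1] + ty) * block_dim[0] + tx

covered = sorted(
    global_index(bx, tx, ty)
    for bx in range(n_blocks)
    for ty in range(block_dim[1])
    for tx in range(block_dim[0])
)
# Only indices below n pass the kernel's `if (i < n)` guard.
covered = [i for i in covered if i < n]
print(n_blocks, covered == list(range(n)))
```

When `n` is not a multiple of the block size, the last block has surplus threads, which is why the guard in the kernel matters.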






PyCUDA Philosophy



                                                Provide complete access
                                                Automatically manage resources
                                                Provide abstractions
                                                Check for and report errors
                                                automatically
                                                Full documentation
                                                Integrate tightly with numpy






PyCuda: Workflow



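   [Workflow diagram: Edit → Run → SourceModule("...") → Cache! (otherwise nvcc → .cubin) → Upload to GPU → Run on GPU; the compile, cache, and upload steps are grouped under “PyCuda”]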






Automatic Cleanup



          Reachable objects (memory,
          streams, . . . ) are never destroyed.
          Once unreachable, released at an
          unspecified future time.
          Scarce resources (memory) can be
          explicitly freed. (obj.free())
          Correctly deals with multiple
          contexts and dependencies.
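The cleanup policy above (release on garbage collection, with an explicit `free()` for scarce resources) can be sketched with an ordinary Python class. `DeviceBuffer` is a hypothetical stand-in that touches no GPU; it only illustrates the pattern, not PyCUDA's internals.

```python
class DeviceBuffer:
    """Hypothetical stand-in for a GPU allocation; no real GPU is touched."""
    live = 0  # number of buffers currently holding their resource

    def __init__(self, nbytes):
        self.nbytes = nbytes
        self._freed = False
        DeviceBuffer.live += 1

    def free(self):
        """Release the resource now; safe to call more than once."""
        if not self._freed:
            self._freed = True
            DeviceBuffer.live -= 1

    def __del__(self):
        # Fallback: release when the object is garbage-collected,
        # which may happen at an unspecified later time.
        self.free()

buf = DeviceBuffer(1024)
buf.free()          # explicit release of a scarce resource
buf.free()          # second call is a no-op
print(DeviceBuffer.live)
```

Making `free()` idempotent lets the explicit call and the garbage-collection fallback coexist without double-releasing.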






gpuarray: Simple Linear Algebra

 pycuda.gpuarray:
      Meant to look and feel just like numpy.
               gpuarray.to_gpu(numpy_array)
               numpy_array = gpuarray.get()
      No: nd indexing, slicing, etc. (yet!)
      Yes: +, -, *, /, fill, sin, exp, rand, take, ...
      Random numbers using pycuda.curandom
      Mixed types (int32 + float32 = float64)
      print gpuarray for debugging.
      Memory behind gpuarray available as .gpudata attribute.
             Use as kernel arguments, textures, etc.




What’s this “numpy”, anyway?

 Numpy: package for large,
 multi-dimensional arrays.
      Vectors, Matrices, . . .
      A+B, sin(A), dot(A,B)
      la.solve(A, b), la.eig(A)
      cube[:, :, n-k:n+k], cube+5
 All much faster than functional equivalents in
 Python.

 “Python’s MATLAB”:
 Basis for SciPy, plotting, . . .





gpuarray: Elementwise expressions

   Avoiding extra store-fetch cycles for elementwise math:
   import numpy.linalg as la
   import pycuda.autoinit
   from pycuda import gpuarray
   from pycuda.curandom import rand as curand
   a_gpu = curand((50,))
   b_gpu = curand((50,))

   # One fused kernel computes z = a*x + b*y without temporaries.
   from pycuda.elementwise import ElementwiseKernel
   lin_comb = ElementwiseKernel(
           "float a, float *x, float b, float *y, float *z",
           "z[i] = a*x[i] + b*y[i]")

   c_gpu = gpuarray.empty_like(a_gpu)
   lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

   assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5





gpuarray: Reduction made easy


  Example: A scalar product calculation
  import numpy
  import pycuda.autoinit
  from pycuda.reduction import ReductionKernel
  dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
          reduce_expr="a+b", map_expr="x[i]*y[i]",
          arguments="const float *x, const float *y")

  from pycuda.curandom import rand as curand
  x = curand((1000*1000), dtype=numpy.float32)
  y = curand((1000*1000), dtype=numpy.float32)

  # Compare the GPU reduction against numpy on the host.
  x_dot_y = dot(x, y).get()
  x_dot_y_cpu = numpy.dot(x.get(), y.get())
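The roles of `map_expr`, `reduce_expr`, and `neutral` may be easier to see in a sequential CPU emulation. This sketch is plain Python and says nothing about how the GPU kernel is actually scheduled; the function name `map_reduce` is made up for illustration.

```python
from functools import reduce

def map_reduce(xs, ys, map_expr, reduce_expr, neutral):
    """CPU emulation: apply map_expr per index, then fold with reduce_expr."""
    mapped = (map_expr(x, y) for x, y in zip(xs, ys))
    return reduce(reduce_expr, mapped, neutral)

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# Same roles as map_expr="x[i]*y[i]", reduce_expr="a+b", neutral="0".
dot = map_reduce(x, y,
                 map_expr=lambda xi, yi: xi * yi,
                 reduce_expr=lambda a, b: a + b,
                 neutral=0.0)
print(dot)  # 1*4 + 2*5 + 3*6 = 32.0
```

The neutral element is what an empty input reduces to, and on a parallel device it also gives idle threads a harmless value to contribute.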




GPU Scripting PyOpenCL News RTCG Showcase       Exciting Developments in GPU-Python


Step 3: Usage

                                                Complex numbers
                                                        . . . in GPUArray
                                                        . . . in user code
                                                        (pycuda-complex.hpp)
                                                If/then/else for GPUArrays
                                                Support for custom device pointers
                                                Smarter device picking/context
                                                creation
                                                PyFFT: FFT for PyOpenCL and
                                                PyCUDA
                                                scikits.cuda: CUFFT, CUBLAS,
                                                CULA




Sparse Matrix-Vector on the GPU


      New feature in 0.94:
      Sparse matrix-vector
      multiplication
      Uses “packeted format”
      by Garland and Bell (also
      includes parts of their code)
      Integrates with scipy.sparse.
      Conjugate-gradients solver
      included
            Deferred convergence
            checking





Kernel Invocation: Automatic Copies

   mod = pycuda.compiler.SourceModule(
       "__global__ void my_func(float *out, float *in){...}")
   func = mod.get_function("my_func")

   src = numpy.random.randn(400).astype(numpy.float32)
   dest = numpy.empty_like(src)

   # cuda.In/cuda.Out copy host arrays to and from the GPU around the call.
   func(
           cuda.Out(dest),
           cuda.In(src),
           block=(400, 1, 1))

         “InOut” exists, too.
         Only for immediate invocation style.




Step 4: Debugging


 New in 0.94.1: Support for CUDA gdb:

 $ cuda-gdb --args python -m pycuda.debug demo.py

 Automatically:
      Sets Compiler flags
      Retains source code
      Disables compiler cache






CUDA APIs


   CUDA has two Programming Interfaces:
          “Runtime” high-level (libcudart.so, in the “toolkit”)
          “Driver” low-level (libcuda.so, comes with GPU driver)
   (mutually exclusive)

   [Diagram: C/C++ → Runtime API → Driver API; Python → PyCuda → Driver API; Driver API → Kernel Driver → Hardware]






Runtime vs. Driver API


   Runtime ↔ Driver differences:
         Explicit initialization.
         Code objects (“Modules”) become programming language
         objects.
         Texture handling requires slightly more work.
         Only needs nvcc for compiling GPU code.
   Driver API:
         Conceptually cleaner
         Less sugar-coating (provide in Python)
         Not very different otherwise





PyCuda: API Tracing

   With ./configure --cuda-trace=1, running this script:

   import pycuda.driver as cuda
   import pycuda.autoinit
   from pycuda.compiler import SourceModule
   import numpy

   a = numpy.random.randn(4, 4).astype(numpy.float32)
   a_gpu = cuda.mem_alloc(a.nbytes)
   cuda.memcpy_htod(a_gpu, a)

   mod = SourceModule("""
       __global__ void doublify(float *a)
       {
         int idx = threadIdx.x + threadIdx.y*4;
         a[idx] *= 2;
       }
       """)

   func = mod.get_function("doublify")
   func(a_gpu, block=(4, 4, 1))

   a_doubled = numpy.empty_like(a)
   cuda.memcpy_dtoh(a_doubled, a_gpu)
   print(a_doubled)
   print(a)

   produces this sequence of driver API calls:

   cuInit
   cuDeviceGetCount
   cuDeviceGet
   cuCtxCreate
   cuMemAlloc
   cuMemcpyHtoD
   cuCtxGetDevice
   cuDeviceComputeCapability
   cuModuleLoadData
   cuModuleGetFunction
   cuFuncSetBlockShape
   cuParamSetv
   cuParamSetSize
   cuLaunchGrid
   cuMemcpyDtoH
   cuCtxPopCurrent
   cuCtxPushCurrent
   cuMemFree
   cuCtxPopCurrent
   cuCtxPushCurrent
   cuModuleUnload
   cuCtxPopCurrent
   cuCtxDestroy






PyCUDA: Vital Information



       http://mathema.tician.de/software/pycuda
       Complete documentation
       MIT License
       (no warranty, free for all use)
       Requires: numpy, Python 2.4+
       (Win/OS X/Linux)
       Support via mailing list




Sleepy?
Outline

1. Scripting GPUs with PyCUDA
2. Meta-programming and RTCG
3. Case study in brain-inspired AI
... too much ?

[Word cloud: bank conflicts, coalescing, caching, mixed precision, partition camping, clamping, broadcasting, zero-copy, streams]

can’t decide ?
GPU Scripting PyOpenCL News RTCG Showcase       Writing Code when the most Knowledge is Available


GPU Programming: Implementation Choices


          Many difficult questions
          Insufficient heuristics
          Answers are hardware-specific and
          have no lasting value




                               Proposed Solution: Tune automatically
                               for hardware at run time, cache tuning
                               results.
                                      Decrease reliance on knowledge of
                                      hardware internals
                                      Shift emphasis from
                                      tuning results to tuning ideas
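A minimal sketch of the "tune at run time, cache the results" idea follows. It is not the lecture's tuner: the function `autotune`, the cache key `"device-A"`, and the cost table are all hypothetical, and the fake costs stand in for timings that would be measured on real hardware.

```python
def autotune(candidates, measure, cache, key):
    """Return the fastest candidate for `key`, measuring only on a cache miss.

    `measure(candidate)` returns a cost (lower is better), e.g. a run time.
    """
    if key not in cache:
        cache[key] = min(candidates, key=measure)
    return cache[key]

# Hypothetical costs per block size; on real hardware these would be timings.
fake_timings = {64: 1.9, 128: 1.2, 256: 0.8, 512: 1.1}
measurements = []

def measure(block_size):
    measurements.append(block_size)
    return fake_timings[block_size]

tuning_cache = {}
best = autotune(sorted(fake_timings), measure, tuning_cache, key="device-A")
again = autotune(sorted(fake_timings), measure, tuning_cache, key="device-A")
print(best, len(measurements))
```

Keying the cache by device means a new GPU triggers a fresh search, so no hand-derived heuristic has to survive a hardware generation.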




Metaprogramming


        Idea

                                                            In GPU scripting,
   Python Code
                                                              GPU code does
                                                              not need to be
    GPU Code
                                                              a compile-time
                                                                 constant.
  GPU Compiler

    GPU Binary         Machine
                                                (Key: Code is data–it wants to be
        GPU                                       reasoned about at run time)

       Result


                                   slide by Andreas Klockner (NYU) Simpler GPU Programming with Python
                              Andreas Kl¨ckner
                                         o            PyCUDA: Even
GPU Scripting PyOpenCL News RTCG Showcase       Writing Code when the most Knowledge is Available


Metaprogramming


        Idea
                       Human                                In GPU scripting,
   Python Code
                                                              GPU code does
                                                              not need to be
    GPU Code
                                                              a compile-time
                                                                 constant.
  GPU Compiler

    GPU Binary
                                                (Key: Code is data–it wants to be
        GPU                                       reasoned about at run time)

       Result


                                   slide by Andreas Klockner (NYU) Simpler GPU Programming with Python
                              Andreas Kl¨ckner
                                         o            PyCUDA: Even
GPU Scripting PyOpenCL News RTCG Showcase       Writing Code when the most Knowledge is Available


Metaprogramming


        Idea

                          Good for code           In GPU scripting,
  Python Code
          News            generation                GPU code does
     The
                                                    not need ailabee
                                                              v to bl
    GPU Code              Gener  a t i on           d ge is A
                                               nowlea compile-time
                   e Code               most K
   4 R u n - T i m o d e w h e n th e                  constant.
        Writ
  GPU Compiler   ing C

    GPU Binaryase
        howc
         S
                                                (Key: Code is data–it wants to be
        GPU                                       reasoned about at run time)

       Result


                                   slide by Andreas Klockner (NYU) Simpler GPU Programming with Python
                              Andreas Kl¨ckner
                                         o            PyCUDA: Even
GPU Scripting PyOpenCL News RTCG Showcase       Writing Code when the most Knowledge is Available


Metaprogramming


        Idea

                             Good for code                  In GPUyCUDA
                                                                  P scripting,
   Python Code
                             generation                       GPU code does
                                                              not need to be
    GPU Code
                                                              a compile-time
                                                                 constant.
  GPU Compiler

    GPU Binary
                                                (Key: Code is data–it wants to be
        GPU                                       reasoned about at run time)

       Result


                                   slide by Andreas Klockner (NYU) Simpler GPU Programming with Python
                              Andreas Kl¨ckner
                                         o            PyCUDA: Even
GPU Scripting PyOpenCL News RTCG Showcase       Writing Code when the most Knowledge is Available


Metaprogramming


        Idea

                             Good for code                      PyOp UDA
                                                            In GPUyCenCL
                                                                  P scripting,
   Python Code
                             generation                       GPU code does
                                                              not need to be
    GPU Code
                                                              a compile-time
                                                                 constant.
  GPU Compiler

    GPU Binary
                                                (Key: Code is data–it wants to be
        GPU                                       reasoned about at run time)

       Result




Machine-generated Code



  Why machine-generate code?
       Automated Tuning
       (cf. ATLAS, FFTW)
       Data types
       Specialize code for given problem
       Constants faster than variables
       (→ register pressure)
       Loop Unrolling
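
The list above can be illustrated with a minimal sketch in plain Python, with no GPU required. It bakes the data type, a scalar constant, and the problem size into CUDA C source text at generation time. The function name `make_scale_add_kernel` and the kernel name `scale_add` are hypothetical, chosen for illustration only; the resulting string is what one would hand to a run-time compiler.

```python
def make_scale_add_kernel(dtype, alpha, n):
    """Return CUDA C source for y[i] = alpha*x[i] + y[i].

    dtype, alpha and n are fixed when the source is generated, so the
    compiler sees them as literals instead of kernel arguments.
    """
    if dtype not in ("float", "double"):
        raise ValueError("unsupported dtype: %r" % (dtype,))
    # Single-precision literals need an 'f' suffix in C.
    suffix = "f" if dtype == "float" else ""
    lines = [
        'extern "C" __global__ void scale_add(const %s *x, %s *y)' % (dtype, dtype),
        "{",
        "  const unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;",
        "  if (i < %du)" % n,
        "    y[i] = %r%s*x[i] + y[i];" % (float(alpha), suffix),
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_scale_add_kernel("float", 2.5, 1024))
```

Each (dtype, alpha, n) combination yields its own specialized kernel, which is the "Specialize code for given problem" and "Constants faster than variables" points in text form.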






PyCuda: Support for Metaprogramming



        Access properties of compiled code:
        func.{num_regs,shared_size_bytes,local_size_bytes}
        Exact GPU timing via events
        Can calculate hardware-dependent MP occupancy
        codepy (by Andreas):
                Build C syntax trees from Python
                Generates readable, indented C
        Or use a templating engine (many available, e.g. Cheetah)
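
As a rough illustration of the occupancy point above, here is a simplified estimate in plain Python. The per-multiprocessor limits are an assumption (values for a compute-capability-1.3 device), the calculation ignores allocation granularity, and the function name `estimate_occupancy` is hypothetical. In practice the register and shared-memory inputs would be read from the compiled function's `num_regs` and `shared_size_bytes` properties.

```python
# Per-multiprocessor limits. ASSUMPTION: compute capability 1.3;
# substitute the limits of the device you actually target.
CC13_LIMITS = {
    "max_threads": 1024,    # resident threads per multiprocessor
    "max_blocks": 8,        # resident blocks per multiprocessor
    "registers": 16384,     # 32-bit registers per multiprocessor
    "shared_bytes": 16384,  # shared memory per multiprocessor
}


def estimate_occupancy(threads_per_block, regs_per_thread,
                       shared_bytes_per_block, limits=CC13_LIMITS):
    """Rough fraction of the resident-thread limit that a kernel can reach.

    Simplification: register and shared-memory allocation granularity
    is ignored, so real occupancy may be lower.
    """
    candidates = [
        limits["max_blocks"],
        limits["max_threads"] // threads_per_block,
    ]
    if regs_per_thread > 0:
        candidates.append(
            limits["registers"] // (regs_per_thread * threads_per_block))
    if shared_bytes_per_block > 0:
        candidates.append(limits["shared_bytes"] // shared_bytes_per_block)
    resident_blocks = min(candidates)
    return resident_blocks * threads_per_block / float(limits["max_threads"])


if __name__ == "__main__":
    # 128 threads, 20 registers per thread, 131*5 floats of shared memory
    print(estimate_occupancy(128, 20, 131 * 5 * 4))
```

With these assumed limits, both registers and shared memory cap the example at 6 resident blocks of 128 threads, i.e. 768 of 1024 threads.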




Outline

1. Scripting GPUs with PyCUDA
2. Meta-programming and RTCG
3. Case study in brain-inspired AI (vision)
Motivation
The Problem:
Visual Object Recognition


               fast
               accurate
               tolerant to variations
               effortless
               critical to survival
The Approach
Reverse and Forward Engineering the Brain




     REVERSE                 FORWARD
       Study                       Build
    Natural System            Artificial System
Why is modeling challenging?

   The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run

   Neural data only provides weak constraints
➡ Lots of parameters – hard to explore


      Advice from Dave Cox:
      “Don’t run anything that takes longer than a
      week to complete, because it will just crash
      halfway through anyways (or you’ll discover
      a bug) and you’ll never finish your Ph.D.”
Visual Cortex




                brain = 20 petaflops!
GPUs (since 2006)




7800 GTX      Monster16GPU   Tesla Cluster
  (2006)         (2008)         (2009)

OpenGL/Cg       CUDA         CUDA/OpenCL
C++/Python      Python          Python
Build your own!
Cell Broadband Engine (since 2007)

         Teraflop Playstation3 clusters:




   DiCarlo Lab / MIT        Cox Lab / Harvard
A Match Made in Heaven
Brains are parallel, GPUs are parallel



                     ≈
   Multiple scales of parallelism:
     “Embarrassingly” parallel: video
     frames, regions
     Fine-grained: independent “neurons,”
     operating on overlapping inputs
A Match Made in Heaven
Images In, Images Out



                    ≈
   Image processing particularly well-suited
    Excellent Arithmetic Intensity: very
    natural to load image patches into
    shared memory
    Data: 2D / 3D locality
Fukushima (1980)
LeCun et al. (1989)
Riesenhuber & Poggio (1999)
Serre & Poggio (2007)
[Figure: model architecture, input → L1 → L2 → L3 → Read-out.
 Free parameters at each layer: kernel size, number of filters,
 thresh/sat, norm strength, normalization neighborhood, and
 Learning (Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...).]
Two conflicting requirements

   The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run
   (requirement: FAST)

   Neural data only provides weak constraints
➡ Lots of parameters – hard to explore
   (requirement: FLEXIBLE)




  How to optimize?
What’s the bottleneck?
3D Filterbank Convolutions!
Fast vs Flexible: what can you do?


 - Make your code accessible
 - No focus on raw performance

Examples:


               MATLAB/CUDA by Jim Mutch (2010)



                              by John Moore (1995)
Fast vs Flexible: what can you do?




 - Use standard libraries
   (e.g. CUBLAS, CUFFT, Jacket)


 - But: “remap” problem to fit?

 - Memory issues (not always optimal)
Fast vs Flexible: what can you do?

 - Fully optimized, by hand
 - But for only a few input configurations...
Fast vs Flexible: what can you do?


 - Focus on flexibility/accessibility first
 - But add strong foundations for raw
   performance from the beginning

Example:

                                      Python/C/CUDA
                                           (OpenCL*)

http://deeplearning.net
by James Bergstra & Yoshua Bengio (2010)
Our answer?
Meta-programming
       and
   Auto-tuning
What?
Meta-programming !


 Leave the grunt-programming to the
 computer (i.e. auto-tuning like ATLAS or FFTW)
 •   Dynamically compile specialized versions
     of the same kernel for different conditions
 •   Empirical run-time tuning
 •   For free: smooth syntactic ugliness: unroll
     loops, index un-indexable registers, etc.
Meta-programming !


“Instrument” your solutions:
•   Block size
•   Work size
•   Loop unrolling
•   Pre-fetching
•   Spilling
•   etc.
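
One way to read "instrument your solutions" is to expose each knob as a named parameter and enumerate the valid combinations. The sketch below does this in plain Python. The parameter names (`BLOCK_W`, `WORK_SIZE`, `UNROLL`), the validity rules, and the 512-threads-per-block limit are assumptions for illustration, not the settings used in the lecture.

```python
import itertools


def candidate_configs(block_sizes, work_sizes, unroll_factors,
                      max_threads_per_block=512):
    """Enumerate tuning configurations, dropping the invalid ones.

    ASSUMPTION: a block may not exceed max_threads_per_block threads,
    and the unroll factor must divide the per-thread work size evenly.
    """
    configs = []
    for block_w, work, unroll in itertools.product(
            block_sizes, work_sizes, unroll_factors):
        if block_w > max_threads_per_block:
            continue
        if work % unroll != 0:
            continue
        configs.append(
            {"BLOCK_W": block_w, "WORK_SIZE": work, "UNROLL": unroll})
    return configs


if __name__ == "__main__":
    for cfg in candidate_configs([64, 128, 1024], [4, 8], [1, 2, 3]):
        print(cfg)
```

Each dictionary can then be passed to a code generator or template, producing one kernel variant per configuration.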
Meta-programming !


 Let the computer generate and find the optimal
 code:
 •   brute-force search with a global objective
 •   machine-learning approach with local
     objectives and hidden variables (advanced)
     •   e.g. PyCuda makes this easy:
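
A brute-force search with a global objective can be sketched without a GPU. As a stand-in, the example below generates Python source (not CUDA) for a dot product at several unroll factors, compiles each variant at run time, times it with wall-clock measurements, and keeps the fastest. All function names are hypothetical. With PyCUDA, the compile step would be a run-time kernel compilation and the timing would come from GPU events.

```python
import timeit


def generate_dot(unroll):
    """Return Python source for a dot product unrolled `unroll` times."""
    body = " + ".join("a[i+%d]*b[i+%d]" % (k, k) for k in range(unroll))
    return (
        "def dot(a, b):\n"
        "    n = len(a)\n"
        "    main = n - n % {u}\n"
        "    s = 0.0\n"
        "    for i in range(0, main, {u}):\n"
        "        s += {body}\n"
        "    for i in range(main, n):\n"  # leftover elements
        "        s += a[i]*b[i]\n"
        "    return s\n"
    ).format(u=unroll, body=body)


def compile_variant(source):
    """Compile generated source at run time and return the function."""
    namespace = {}
    exec(source, namespace)
    return namespace["dot"]


def autotune(unroll_factors, a, b, repeats=3):
    """Time every variant on the given data and return the fastest."""
    best = None
    for unroll in unroll_factors:
        func = compile_variant(generate_dot(unroll))
        elapsed = min(timeit.repeat(lambda: func(a, b),
                                    number=5, repeat=repeats))
        if best is None or elapsed < best[0]:
            best = (elapsed, unroll, func)
    return best[1], best[2]


if __name__ == "__main__":
    a = [float(i) for i in range(10000)]
    b = [2.0] * 10000
    unroll, func = autotune([1, 2, 4, 8], a, b)
    print("fastest unroll factor on this machine:", unroll)
```

The winning factor depends on the machine and the input size, which is why the tuning is done empirically at run time.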
Basic GPU Meta-programming System




                           GPU Meta-Programming: A Case Study
                           in Biologically-Inspired Machine Vision
                           [GPU Computing Gems]

                           Pinto N, Cox DD
Cheetah template:

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {
#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
  {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
      {
	   input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
	   shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
	   shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
	   shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
conv_kernel_template.cu

 texture<float4, 1, cudaReadModeElementType> tex_float4;
 __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

 #define IMUL(a, b) __mul24(a, b)
 extern "C" {

 #for j in xrange($FILTER_H)

   __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
   {

 #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
     __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

     // -- input/output offsets
     const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
     const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
     float4 input_v4;

     // -- load input to shared memory
 #for i in xrange($LOAD_ITERATIONS)
 #if $i==($LOAD_ITERATIONS-1)
     if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
 #end if
       {
         input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
         shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
         shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
         shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
         shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
       }
 #end for

conv_kernel_4x4x4.cu

 #include <stdio.h>

 texture<float4, 1, cudaReadModeElementType> tex_float4;
 __constant__ float constant[4][4][4];

 #define IMUL(a, b) __mul24(a, b)
 extern "C" {

   __global__ void convolve_beta_j0(float4 *input, float4 *output)
   {

     __shared__ float shared_in[131][4+1];

     // -- input/output offsets
     const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
     const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
     float4 input_v4;

     // -- load input to shared memory
       {
         input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
         shared_in[threadIdx.x+128*0][0] = input_v4.x;
         shared_in[threadIdx.x+128*0][1] = input_v4.y;
         shared_in[threadIdx.x+128*0][2] = input_v4.z;
         shared_in[threadIdx.x+128*0][3] = input_v4.w;
       }
     if((threadIdx.x+128*1)<131)
       {
         input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
         shared_in[threadIdx.x+128*1][0] = input_v4.x;
         shared_in[threadIdx.x+128*1][1] = input_v4.y;
         shared_in[threadIdx.x+128*1][2] = input_v4.z;
         shared_in[threadIdx.x+128*1][3] = input_v4.w;
       }
     __syncthreads();

     // -- compute dot products
     float v, w;
     float sum0 = 0;
     float sum1 = 0;
     float sum2 = 0;
     float sum3 = 0;

     v = shared_in[threadIdx.x+0][0];
     w = constant[0][0][0];
     sum0 += v*w;
     w = constant[0][0][1];
     sum1 += v*w;
     w = constant[0][0][2];
     sum2 += v*w;
     w = constant[0][0][3];
     sum3 += v*w;
     v = shared_in[threadIdx.x+1][0];
     w = constant[0][1][0];
     sum0 += v*w;
     w = constant[0][1][1];
     sum1 += v*w;
     w = constant[0][1][2];
     sum2 += v*w;
     w = constant[0][1][3];
     sum3 += v*w;
     v = shared_in[threadIdx.x+2][0];
     w = constant[0][2][0];
     sum0 += v*w;
     w = constant[0][2][1];
     sum1 += v*w;
conv_kernel_template.cu   →   conv_kernel_4x4x4.cu   20 kB
                          →   conv_kernel_8x8x4.cu   64 kB
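
The expansion of the load stage from conv_kernel_template.cu into conv_kernel_4x4x4.cu can be reproduced with a small plain-Python stand-in for the template engine. The values BLOCK_W = 128 and FILTER_W = 4 are read off the generated listing (131 = 128 + 4 - 1). Computing LOAD_ITERATIONS as a ceiling division is an assumption, since the template receives it as a ready-made variable, and the function name `render_load_stage` is hypothetical.

```python
def render_load_stage(block_w, filter_w):
    """Emit the unrolled shared-memory load stage as CUDA C text."""
    # Same arithmetic as the template's "#set INPUT_BLOCK_W" line.
    input_block_w = block_w + filter_w - 1
    # ASSUMPTION: enough iterations for block_w threads to cover the tile.
    load_iterations = (input_block_w + block_w - 1) // block_w

    lines = ["__shared__ float shared_in[%d][4+1];" % input_block_w]
    for i in range(load_iterations):
        # Only the last iteration can run past the tile, so only it is guarded.
        if i == load_iterations - 1:
            lines.append("if((threadIdx.x+%d*%d)<%d)"
                         % (block_w, i, input_block_w))
        lines.append("  {")
        lines.append("    input_v4 = tex1Dfetch(tex_float4, in_idx+%d*%d);"
                     % (block_w, i))
        for c, field in enumerate("xyzw"):
            lines.append("    shared_in[threadIdx.x+%d*%d][%d] = input_v4.%s;"
                         % (block_w, i, c, field))
        lines.append("  }")
    lines.append("__syncthreads();")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_load_stage(128, 4))
```

For 128 threads and a filter width of 4, this gives two load iterations: one unguarded, one guarded by the `<131` test, as in the generated file.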
Benefits?
Smooth syntactic ugliness
  Manipulations that are not easily
  accessible in CUDA C code:
  • variable-length argument lists
  • syntax-level code control (e.g. conditionals)
  • loop unrolling (possibly fine-controlled)

  v = shared_in[threadIdx.x+0][0];
  w = constant[0][0][0];
  sum0 += v*w;
  w = constant[0][0][1];
  sum1 += v*w;
  w = constant[0][0][2];
  sum2 += v*w;
  w = constant[0][0][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+1][0];
  w = constant[0][1][0];
  sum0 += v*w;
  w = constant[0][1][1];
  sum1 += v*w;
  w = constant[0][1][2];
  sum2 += v*w;
  w = constant[0][1][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+2][0];
  w = constant[0][2][0];
  sum0 += v*w;
  w = constant[0][2][1];
  sum1 += v*w;
  w = constant[0][2][2];
  sum2 += v*w;
  w = constant[0][2][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+3][0];
  w = constant[0][3][0];
  sum0 += v*w;
  w = constant[0][3][1];
  sum1 += v*w;
  w = constant[0][3][2];
  sum2 += v*w;
  w = constant[0][3][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+0][1];
  w = constant[1][0][0];
  sum0 += v*w;
  w = constant[1][0][1];
  sum1 += v*w;
  w = constant[1][0][2];
  sum2 += v*w;
  w = constant[1][0][3];
  sum3 += v*w;
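
The fully unrolled dot products above follow a regular pattern, so a short generator can produce them for any filter shape. This plain-Python sketch mirrors the index order seen in the listing (`shared_in[threadIdx.x+x][d]` paired with `constant[d][x][n]`). The function name `render_dot_products` is hypothetical.

```python
def render_dot_products(filter_d, filter_w, n_filters):
    """Return the unrolled multiply-accumulate lines as a list of strings."""
    lines = ["float v, w;"]
    # One accumulator per filter.
    lines += ["float sum%d = 0;" % n for n in range(n_filters)]
    for d in range(filter_d):
        for x in range(filter_w):
            # Load one input value, then reuse it for every filter.
            lines.append("v = shared_in[threadIdx.x+%d][%d];" % (x, d))
            for n in range(n_filters):
                lines.append("w = constant[%d][%d][%d];" % (d, x, n))
                lines.append("sum%d += v*w;" % n)
    return lines


if __name__ == "__main__":
    print("\n".join(render_dot_products(4, 4, 4)))
```

Changing the three arguments regenerates the whole block, which is the kind of fine-controlled unrolling that is awkward to express directly in CUDA C.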
How about #pragma unroll ?
   (why don’t you trust the compiler?)
we are not alone....

    Using GPUs for Signal Correlation
    Michael Clark
    with Daniel A. Mitchell
    The Murchison Widefield Array

    Don’t trust compilers

    •   Compare these “identical” code fragments

          a += b*c + d*c + e*f + g*h;         770 GFLOPS

          a += b*c;                          1020 GFLOPS
          a += d*c;
          a += e*f;
                                                                          art.
                                                                                          tests, of w
                                                                                                             a += g*h;
                                                            n integral p
                                                will form a  eenhill
                                                   Lincoln Gr
                               Paul La Pla
                                          nte and ces
                                              Referen                                          t Boolard
                                                                                                              a +=
                                                                                                          y, EDGES
                                                                                                                        Memo, 058
                                                                                                                                      , 2010.
                                                                                                                                               R.J. Cappal
                                                                                                                                                           lo, M.F. M
                                                                                                                                                                        orales, and
                                                                                         ics a                                           ale,                             d Topics
                                                                            RFI Statist                                    , C.J. Lonsd                      l of Selecte
                                                   [1] A.E   .E. Rogers,                                     , R.J. Sault                     IE EE Journa
                                                                                              R.B. Wayth                         eld Array,
                                                                                . Greenhill,                      hison Widefi                      ].
                                                                   itchell, L.J                    of the Murc                        07.1912                                 E, 97
                                                    [2] D.A. M                Time Calib
                                                                                           ration
                                                                                                               , [astro-
                                                                                                                             ph/08                               s of the IEE
                                                         S.M. O    rd, Real-                       7 17, 2008                                      , Proceeding
                                                                                     2 (5), 707–                                     n Overview
                          1
              nuary 201
sday, 27 Ja                                                             rocessing,                                     rray: Desig
                                                         in Signal P                                 on Widefield A
                                                                                       he Murchis                        8].                                            , Graphics
                                                                        ale, et al., T                      903.182                                        R.G. Edgar
                                                     [3]  C.J. Lonsd                    [ast  ro-ph/0                                     H. Pfister, and                   Series,
                                                                          506, 2009,                                      ell, K. Dale,                     Conference
                                                           (8), 1497–1                                    , D.A. Mitch                       d Array, ASP
                                                                                            R.B. Wayth                        on Wide-fiel
                                                                               Greenhill,                      the Murchis


     IICS‘2011                                        [4] S.M    . Ord, L.J.             ata Pro  cessing in                                                                 cal
                                                                           Units for D                                                                           Mathemati
                                                            Processing                                                                1 radio pola
                                                                                                                                                     rimetry. I.
                                                                           009.                                              aa d
                                                                                                                       nderstryn20 ing
                                                                                                                                    1
                                                            411, 127, 2                                  .J. Sault, U Janu                 6.
                                                                                   . Breg  man, and R ursday,.,2117, 137–147, 199
                                                                                                                     7
                                                                                                                                                                        alar
                                                                     amaker, J.D                       Th pl. Ser
                                                                                                       up                                                alogue of sc
                                                        [5] J.P. H                        strophys. S                                  ll-co herency an                rophys. Su
                                                                                                                                                                                   ppl.
                                                                            s, Astron. A                                  . IV. The fu                   Astron. Ast
                                                              foundation                                    polarimetry                     ric fidelity,
                                                                                                  g radio               ge and pola
                                                                                                                                      rimet
                                                                                    derstandin
Smooth syntactic ugliness

  Manipulations that are not easily
  accessible in CUDA C code:
  • index un-indexable resources (e.g. regs)
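
One way to read the bullet above: registers cannot be addressed with a run-time index in CUDA C, but a generator script can emit one named scalar per slot and unroll every access, so the "index" only exists while the code is being generated. A minimal sketch in plain Python, assuming a hypothetical helper `unrolled_accumulate` and a hypothetical naming scheme (`x0`, `w0`, ...) that are not taken from the slides:

```python
def unrolled_accumulate(n_terms, acc="a", lhs="x", rhs="w"):
    """Emit CUDA C text that accumulates n_terms products into scalar `acc`.

    Every operand is a separately named scalar (x0, x1, ...), so no array
    indexing is left in the generated code.
    """
    lines = ["float %s = 0.0f;" % acc]
    for i in range(n_terms):
        # The index i exists only at generation time.
        lines.append("%s += %s%d * %s%d;" % (acc, lhs, i, rhs, i))
    return "\n".join(lines)


if __name__ == "__main__":
    print(unrolled_accumulate(3))
```

The output is ordinary source text; it would still have to be pasted into a kernel body and compiled.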
Explore design decision
  space more freely
Basic GPU Meta-programming System




                           GPU Meta-Programming: A Case Study
                           in Biologically-Inspired Machine Vision
                           [GPU Computing Gems]

                           Pinto N, Cox DD
Exploring design decision space more freely

  Meta-programming:


  • enables efficient learning of the GPU
    hardware/software


  • allows full exploitation of the GPU
    architecture
conv_kernel_beta_template.cu

 texture<float4, 1, cudaReadModeElementType> tex_float4;
 __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

 #define IMUL(a, b) __mul24(a, b)
 extern "C" {

 #for j in xrange($FILTER_H)
   __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
   {

 #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
     __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

     // -- input/output offsets
     const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
     const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
     float4 input_v4;

     // -- load input to shared memory
 #for i in xrange($LOAD_ITERATIONS)
 #if $i==($LOAD_ITERATIONS-1)
     if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
 #end if
       {
         input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
         shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
         shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
         shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
         shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
       }
 #end for

version A
     ...
     mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
     mov.b32 $r1, c0[$ofs2+0x0008]
     mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
     mov.b32 $r1, c0[$ofs2+0x000c]
     mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
     mov.b32 $r1, c0[$ofs2+0x0010]
     mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
     ...

version B
     ...
     mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
     mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
     mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
     mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
     mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
     ...

2x faster... Why?

 using decuda by Wladimir J. van der Laan
Exploring design decision space more freely

  When USE_THREAD_PER_FILTER is True
  • each thread will access different cmem
     locations (in order)




using the decuda disassembler by Wladimir J. van der Laan
         (Python-based)
Exploring design decision space more freely

  When USE_THREAD_PER_FILTER is False
  • each thread will access the same cmem
     locations (broadcast)
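
The two settings of USE_THREAD_PER_FILTER can be pictured as one template line whose constant-memory index is chosen at generation time. The sketch below is an assumption-laden illustration, not the authors' template: the fragment, the 2-D `constant` indexing, the helper `render_fragment`, and the `FILTER_ID` macro are all hypothetical.

```python
from string import Template

# Hypothetical one-line kernel fragment; only the constant-memory index changes.
FRAGMENT = Template("sum += shared_in[$k] * constant[$k][$filter_index];")


def render_fragment(k, use_thread_per_filter):
    """Return the multiply-add line for filter tap k under one of the two forks."""
    if use_thread_per_filter:
        # Each thread reads its own filter: the address depends on threadIdx.x.
        filter_index = "threadIdx.x"
    else:
        # All threads read the same filter: the address is uniform (broadcast).
        filter_index = "FILTER_ID"
    return FRAGMENT.substitute(k=k, filter_index=filter_index)


if __name__ == "__main__":
    print(render_fragment(2, True))
    print(render_fragment(2, False))
```

Both variants come from the same source line, so comparing their disassembly only requires flipping one flag and regenerating.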




Exploring design decision space more freely

  more registers  vs.  thread-dependent data movement

  2x faster... Why?
Strategy


• intermediate design decisions can be made
  explicit


• multiple “forks” in the path can be kept in place

• frees up the developer to revisit past choices
  (without incurring a combinatorial explosion of separate pieces of code)


• retesting sets of assumptions can be done
  frequently and programmatically from the
  “outer” framework of code
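
The strategy above can be sketched without a GPU by using Python itself as the generated language. The example keeps two hypothetical forks (`unroll` and `accumulate_in_float`, both invented for illustration) in one generator and retests every combination against a plain reference from the outer script:

```python
import itertools

# Hypothetical design forks; each key is one decision, each value its options.
FORKS = {
    "unroll": [1, 2, 4],
    "accumulate_in_float": [False, True],
}


def generate_dot_source(unroll, accumulate_in_float):
    """Generate Python source for a dot product under one set of choices."""
    init = "0.0" if accumulate_in_float else "0"
    body = ["def dot(x, y):",
            "    acc = %s" % init,
            "    n = len(x) - len(x) %% %d" % unroll,
            "    for i in range(0, n, %d):" % unroll]
    for u in range(unroll):
        body.append("        acc += x[i + %d] * y[i + %d]" % (u, u))
    # Remainder loop for lengths that are not a multiple of the unroll factor.
    body += ["    for i in range(n, len(x)):",
             "        acc += x[i] * y[i]",
             "    return acc"]
    return "\n".join(body)


def retest_all_forks(x, y):
    """Build every variant and check it against a plain reference result."""
    expected = sum(a * b for a, b in zip(x, y))
    names = sorted(FORKS)
    checked = 0
    for values in itertools.product(*(FORKS[name] for name in names)):
        namespace = {}
        exec(generate_dot_source(**dict(zip(names, values))), namespace)
        assert namespace["dot"](x, y) == expected
        checked += 1
    return checked


if __name__ == "__main__":
    print(retest_all_forks([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))
```

Adding a third fork means adding one entry to `FORKS` and one branch in the generator, rather than doubling the number of hand-written variants.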
Toy Example: Matmul




http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah
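
For readers without access to the linked demo, here is a rough stand-in. It is not the code from that page: it uses Python's standard `string.Template` instead of Cheetah, and the kernel text, `MATMUL_TEMPLATE`, and `render_matmul` are assumptions made for illustration. The block size is fixed at generation time and the inner product is emitted fully unrolled.

```python
from string import Template

# Hypothetical kernel template; $block_size is fixed at generation time so the
# shared-memory tiles get compile-time dimensions.
MATMUL_TEMPLATE = Template("""
#define BLOCK_SIZE $block_size

__global__ void matmul(const float *a, const float *b, float *c, int n)
{
    __shared__ float a_tile[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float b_tile[BLOCK_SIZE][BLOCK_SIZE];
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float acc = 0.0f;
    // Assumes n is a multiple of BLOCK_SIZE.
    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        a_tile[threadIdx.y][threadIdx.x] = a[row * n + t * BLOCK_SIZE + threadIdx.x];
        b_tile[threadIdx.y][threadIdx.x] = b[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();
$inner_products
        __syncthreads();
    }
    c[row * n + col] = acc;
}
""")


def render_matmul(block_size):
    """Render the kernel with the inner product fully unrolled."""
    inner = "\n".join(
        "        acc += a_tile[threadIdx.y][%d] * b_tile[%d][threadIdx.x];" % (k, k)
        for k in range(block_size))
    return MATMUL_TEMPLATE.substitute(block_size=block_size, inner_products=inner)


if __name__ == "__main__":
    print(render_matmul(4))
```

On a machine with PyCUDA installed, the rendered string could be handed to `pycuda.compiler.SourceModule` for compilation; that step is left out here so the sketch runs anywhere.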
Summary

 Meta-programming:

 • can assist exploration and manual
   optimization
 • can de-clutter code
 • is easy and flexible with the right tools
   (e.g. Python, Py{CUDA,CL}, Cheetah, decuda)


 ➡ facilitates auto-tuning!
Need a pause?
How to get to the ninja level?
Practice, practice, practice ...
Auto-tuning


The goal is to empirically optimize execution
time given:


• the environment
 - hardware (GPU, CPU, Memory, Mobo)
 - software (SDK, Compiler suite)


• the data (input dimensions, repetitions, etc.)
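
"Empirically" here means measuring, not modelling. A minimal timing harness might look like the sketch below; the name `time_candidate` and the choice of mean and standard deviation as summary statistics are assumptions, not something the slides specify.

```python
import statistics
import time


def time_candidate(run, repetitions=5):
    """Call `run` repeatedly; return (mean, stdev) of wall-clock seconds.

    `repetitions` must be at least 2 so a standard deviation is defined.
    """
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    print(time_candidate(lambda: sum(range(100000))))
```

For a GPU kernel, `run` would have to block until the kernel has finished, since kernel launches return before the work is done.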
Basic auto-tuning: pseudo-code (1/3)
                 Filter-bank Convolution / Correlation




                        Scripting, Py{CUDA,CL}




                   NoSQL (CouchDB, MongoDB) ?
Basic auto-tuning: pseudo-code (2/3)




                                       Cheetah,
                                       Jinja, Mako



                                       PyCUDA/CL
Basic auto-tuning: pseudo-code (3/3)




                                   PyCUDA/CL




                                   NoSQL
                                   (CouchDB,
                                   MongoDB)
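
The pseudo-code on these three slides is an image and did not survive extraction. The annotations (templating engine, PyCUDA/CL, NoSQL store) suggest a loop of roughly the following shape. This is a guess at the structure, with every name (`autotune`, `build`, `benchmark`, `store`, `key`) hypothetical; a plain dict stands in for the database and a made-up cost model stands in for the GPU.

```python
import itertools


def autotune(param_grid, build, benchmark, store, key):
    """Try every parameter combination and remember the fastest one.

    param_grid: dict mapping parameter name -> list of candidate values
    build:      callable(params) -> candidate (e.g. generated kernel source)
    benchmark:  callable(candidate) -> runtime in seconds, or None on failure
    store:      dict-like results store, standing in for a NoSQL database
    key:        hashable description of environment and input data
    """
    if key in store:                      # reuse an earlier tuning run
        return store[key]
    names = sorted(param_grid)
    best = None
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        runtime = benchmark(build(params))
        if runtime is None:               # e.g. variant failed to compile or launch
            continue
        if best is None or runtime < best["runtime"]:
            best = {"params": params, "runtime": runtime}
    store[key] = best
    return best


if __name__ == "__main__":
    def fake_cost(params):
        # Made-up cost model so the sketch runs without a GPU.
        work = params["block_w"] * params["unroll"]
        return None if work > 256 else 1.0 / work

    grid = {"block_w": [32, 64, 128], "unroll": [1, 2, 4]}
    print(autotune(grid, build=lambda p: p, benchmark=fake_cost, store={},
                   key=("GTX480", "CUDA3.1", (256, 256, 8))))
```

Keying the store on hardware, SDK, and input shape reflects the goal stated earlier: the best variant is only valid for the environment and data it was measured on.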
Optimizing what?
Optimizing strategy


• Like many operations, filter-bank convolution is
  usually “communication bound” on the GPU:
 -   compute is cheap
 -   communication is expensive
• We must take advantage of all types of memory:
 -   explicit: gmem (global), smem (shared), cmem
     (constant), tmem (texture)
 -   implicit: rmem (registers), bmem (bin-code?) *
• Different optimal access patterns
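
A back-of-the-envelope count shows why memory traffic is the thing to manage. The sketch assumes a "valid" correlation (output shrinks by filter size minus one), 4-byte values, and reads the shapes used later in the results table as height x width x depth and filters x height x width x depth; those readings, and the helper name `filterbank_cost`, are assumptions.

```python
def filterbank_cost(in_h, in_w, depth, n_filters, f_h, f_w, bytes_per_value=4):
    """Return (flops, min_bytes, flops_per_byte) for a 'valid' correlation."""
    out_h, out_w = in_h - f_h + 1, in_w - f_w + 1
    # One multiply and one add per filter tap per output value.
    flops = 2 * out_h * out_w * n_filters * f_h * f_w * depth
    # Lower bound on traffic: every input, filter and output value moved once.
    values = (in_h * in_w * depth
              + n_filters * f_h * f_w * depth
              + out_h * out_w * n_filters)
    min_bytes = values * bytes_per_value
    return flops, min_bytes, flops / min_bytes


if __name__ == "__main__":
    print(filterbank_cost(256, 256, 8, 64, 9, 9))
```

The ratio is an upper bound reached only if each value crosses the memory bus once. A kernel that re-reads its inputs from global memory for every filter tap performs far fewer flops per byte, which is where shared, constant, and texture memory come in.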
Example: thread gmem output size




                            stupid float4 xyzw trick
Example: multiple smem loads
Example: using texture fetches
Example: register spilling
Example: register pressure (nvcc)
Example: capitalizing on bmem (bin code) ??

                                 multiple versions of
                               the same function with
                                different input offsets

                                input offset in cubin
                                       code?
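
"Multiple versions of the same function with different input offsets" is straightforward to produce from a template: the offset becomes a literal in each generated kernel rather than a run-time argument. The kernel below is a simplified, hypothetical stand-in (`copy_row_j<N>`), not the convolution kernel from the slides, and whether the literal ends up encoded in the binary is exactly the open question the slide raises.

```python
from string import Template

# Hypothetical, simplified kernel; $j is baked into both the name and the address.
KERNEL = Template("""\
__global__ void copy_row_j$j(const float *input, float *output, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < width)
        output[blockIdx.y * width + x] = input[(blockIdx.y + $j) * width + x];
}
""")


def render_offset_variants(n_offsets):
    """Return one kernel definition per input row offset 0..n_offsets-1."""
    return "\n".join(KERNEL.substitute(j=j) for j in range(n_offsets))


if __name__ == "__main__":
    print(render_offset_variants(2))
```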
Results

GPU / SDK            Input         Filter-bank   Meta-prog default    Meta-prog auto-tuned   Boost
                                                 (gflops)             (gflops)
9600M GT / CUDA3.1   256x256x8     64x9x9x8        6.710 ±  0.005      36.584 ±  0.023       445.2 %
                     512x512x4     32x13x13x4     13.606 ±  0.002      35.582 ±  0.003       161.5 %
                     1024x1024x8   16x5x5x8       20.034 ±  0.113      26.084 ±  6.243        30.2 %
                     2048x2048x4   4x8x8x4        25.781 ±  0.044      46.945 ±  0.100        82.1 %
C1060 / CUDA2.3      256x256x8     64x9x9x8      104.188 ±  0.051     168.083 ±  0.372        61.3 %
                     512x512x4     32x13x13x4    125.739 ±  0.109     234.053 ±  0.266        86.1 %
                     1024x1024x8   16x5x5x8      144.279 ±  0.764     243.697 ±  0.346        68.9 %
                     2048x2048x4   4x8x8x4       180.060 ±  0.018     322.328 ±  0.348        79.0 %
GTX285 / CUDA2.3     256x256x8     64x9x9x8      123.396 ±  0.016     197.006 ±  0.219        59.7 %
                     512x512x4     32x13x13x4    143.277 ±  0.044     270.206 ±  0.209        88.6 %
                     1024x1024x8   16x5x5x8      148.841 ±  0.465     310.276 ±  0.538       108.5 %
                     2048x2048x4   4x8x8x4       205.152 ±  0.015     376.685 ±  0.070        83.6 %
GTX480 / CUDA3.1     256x256x8     64x9x9x8      467.631 ± 19.100     471.902 ± 11.419         0.9 %
                     512x512x4     32x13x13x4    834.838 ±  8.275     974.266 ±  3.809        16.7 %
                     1024x1024x8   16x5x5x8      542.808 ±  1.135     614.019 ±  0.904        13.1 %
                     2048x2048x4   4x8x8x4       378.165 ±  0.537     806.628 ±  0.168       113.3 %
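
The Boost column is consistent with the relative gain of the auto-tuned figure over the default one. The slide does not state the formula, so the helper below (`boost_percent`, a hypothetical name) is an inference from the numbers:

```python
def boost_percent(default_gflops, tuned_gflops):
    """Relative gain of the auto-tuned kernel over the default one, in percent."""
    return (tuned_gflops / default_gflops - 1.0) * 100.0


if __name__ == "__main__":
    # First 9600M GT row and last GTX480 row of the table.
    print(round(boost_percent(6.710, 36.584), 1))
    print(round(boost_percent(378.165, 806.628), 1))
```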
Analysis
Empirical results...

                              Performance (gflops)
 Q9450 (Matlab/C) [2008]        0.3
 Q9450 (C/SSE) [2008]           9.0
 7900GTX (Cg) [2006]           68.2
 PS3/Cell (C/ASM) [2007]      111.4
 8800GTX (CUDA1.x) [2007]     192.7
 GTX280 (CUDA2.x) [2008]      339.3
 GTX480 (CUDA3.x) [2010]      974.3

 >1000X speedup is game changing...
Summary



 • Meta-programming makes developing
   high-performing code for GPU easier
 • Fantastic tools exist (e.g. PyCUDA) to help
 • Interesting way to explore/learn about
   GPUs (hw/sw)
 • Coarse auto-tuning yields good results
Future




   • More Fermi optimizations
     (L1 cache, concurrent kernels)


   • OpenCL to optimize across vendors

   • Smarter auto-tuning techniques (ML)
     -   (boosted) decision trees
     -   evolutionary programming strategies
More ?
•   Thu 3/31/11:
    PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU)
•   Tue 3/29/11:
    Algorithm Strategies (W. Hwu, UIUC)
•   Tue 4/5/11:
    Analysis-driven Optimization (C. Woolley, NVIDIA)
•   Thu 4/7/11:
    Irregular Parallelism & Efficient Data Structures (J. Owens, UCDavis)
•   Thu 4/14/11:
    Optimization for Ninjas (D. Merrill, UVirg)
•   ...
one more thing
or two...
Life/Code Hacking #2.x
                Speed {listen,read,writ}ing




accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.2b
                Speed writing

RSI?
Life/Code Hacking #2.3
                Speed reading

1. Collect many papers, docs, chapters, etc. (100)
2. Skim through them quickly / select (50)
3. Read w/o full understanding / select (25)
4. Read completely w/ full understanding / select (10)
5. Complete mastery + reproduction (5)

http://readerssoft.com/speed_reading_obstacles.php

normal reading vs. speed reading

like David Guetta, use one finger!
CO ME

More Related Content

What's hot

P2P Container Image Distribution on IPFS With containerd and nerdctl
P2P Container Image Distribution on IPFS With containerd and nerdctlP2P Container Image Distribution on IPFS With containerd and nerdctl
P2P Container Image Distribution on IPFS With containerd and nerdctlKohei Tokunaga
 
Java applications containerized and deployed
Java applications containerized and deployedJava applications containerized and deployed
Java applications containerized and deployedAnthony Dahanne
 
DockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐるDockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐるKohei Tokunaga
 
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ..."Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...Edge AI and Vision Alliance
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
Startup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image DistributionStartup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image DistributionKohei Tokunaga
 
Concurrency in Python
Concurrency in PythonConcurrency in Python
Concurrency in Pythonkonryd
 
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...Shinya Takamaeda-Y
 
Deep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with TheanoDeep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with TheanoVincenzo Lomonaco
 
OpenCL Programming 101
OpenCL Programming 101OpenCL Programming 101
OpenCL Programming 101Yoss Cohen
 
Sequence learning and modern RNNs
Sequence learning and modern RNNsSequence learning and modern RNNs
Sequence learning and modern RNNsGrigory Sapunov
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
Engineer Engineering Software
Engineer Engineering SoftwareEngineer Engineering Software
Engineer Engineering SoftwareYung-Yu Chen
 
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP..."Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...Edge AI and Vision Alliance
 
containerdの概要と最近の機能
containerdの概要と最近の機能containerdの概要と最近の機能
containerdの概要と最近の機能Kohei Tokunaga
 

What's hot (20)

P2P Container Image Distribution on IPFS With containerd and nerdctl
P2P Container Image Distribution on IPFS With containerd and nerdctlP2P Container Image Distribution on IPFS With containerd and nerdctl
P2P Container Image Distribution on IPFS With containerd and nerdctl
 
Java applications containerized and deployed
Java applications containerized and deployedJava applications containerized and deployed
Java applications containerized and deployed
 
DockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐるDockerとKubernetesをかけめぐる
DockerとKubernetesをかけめぐる
 
Cuda materials
Cuda materialsCuda materials
Cuda materials
 
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ..."Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...
 
Introduction to GPUs in HPC
Introduction to GPUs in HPCIntroduction to GPUs in HPC
Introduction to GPUs in HPC
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Startup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image DistributionStartup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image Distribution
 
Concurrency in Python
Concurrency in PythonConcurrency in Python
Concurrency in Python
 
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
 
Deep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with TheanoDeep Learning libraries and first experiments with Theano
Deep Learning libraries and first experiments with Theano
 
OpenCL Programming 101
OpenCL Programming 101OpenCL Programming 101
OpenCL Programming 101
 
Building custom kernels for IPython
Building custom kernels for IPythonBuilding custom kernels for IPython
Building custom kernels for IPython
 
2. Cnnecst-Why the use of FPGA?
2. Cnnecst-Why the use of FPGA? 2. Cnnecst-Why the use of FPGA?
2. Cnnecst-Why the use of FPGA?
 
Sequence learning and modern RNNs
Sequence learning and modern RNNsSequence learning and modern RNNs
Sequence learning and modern RNNs
 
Manycores for the Masses
Manycores for the MassesManycores for the Masses
Manycores for the Masses
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Engineer Engineering Software
Engineer Engineering SoftwareEngineer Engineering Software
Engineer Engineering Software
 
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP..."Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
 
containerdの概要と最近の機能
containerdの概要と最近の機能containerdの概要と最近の機能
containerdの概要と最近の機能
 

Viewers also liked

CSTalks - GPGPU - 19 Jan
CSTalks  -  GPGPU - 19 JanCSTalks  -  GPGPU - 19 Jan
CSTalks - GPGPU - 19 Jancstalks
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

  • 1. Massively Parallel Computing CS 264 / CSCI E-292 Lecture #6: CUDA Ninja Tricks | March 1st, 2011 Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  • 3. Massively Parallel Computing CS 264 / CSCI E-292 Lecture #6: CUDA Ninja Tricks | February 29th, 2011 GPU “Scripting”, Meta-programming, Auto-tuning Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  • 5. During this course, we’ll try to “ ” and use existing material ;-) (adapted for CS264)
  • 7. Outline 1. Scripting GPUs with PyCUDA 2. Meta-programming and RTCG 3. Case study in brain-inspired AI
  • 9. Why do Scripting for GPUs? GPUs are everything that scripting languages are not: highly parallel, very architecture-sensitive, built for maximum compute/memory throughput. → They complement each other. CPU: largely restricted to control tasks (∼1000/sec); scripting is fast enough. Realize a promise: use scripting from first prototype to full-scale production code. (Slide by Andreas Klöckner (NYU); Nicolas Pinto (MIT) and Andreas Klöckner (Brown), PyCuda Tutorial.)
  • 12. Why do Scripting for GPUs? GPUs are everything that scripting languages are not: highly parallel, very architecture-sensitive, built for maximum FP/memory throughput → they complement each other. CPU: largely restricted to control tasks (∼1000/sec); scripting is fast enough. Python + CUDA = PyCUDA. Python + OpenCL = PyOpenCL. (Slide by Andreas Klöckner (NYU), “PyCUDA: Even Simpler GPU Programming with Python”.)
  • 13. How are High-Performance Codes constructed? “Traditional” construction of high-performance codes: C/C++/Fortran, libraries. “Alternative” construction of high-performance codes: scripting for ‘brains’, GPUs for ‘inner loops’. Play to the strengths of each programming environment.
  • 14. Scripting: Python. One example of a scripting language: Python. Mature; large and active community; emphasizes readability; written in widely-portable C; a ‘multi-paradigm’ language; rich ecosystem of sci-comp related software.
  • 15. Scripting Languages. Python: is discoverable and interactive; has comprehensive built-in functionality; manages resources automatically; uses run-time typing; works well for “gluing” lower-level blocks together.
  • 16. Scripting: Goals. Scripting languages aim to reduce the load on the programmer: reduce required knowledge; encourage experimentation; eliminate sources of error; encourage abstraction wherever possible; value programmer time over computer time. Think about the tools you use. Use the right tool for the job.
  • 19. Scripting: Speed. Usual answer to the “Speed Question”: hybrid (“mixed”) code. Plays to the strengths of each language. But: introduces (some) complexity. Observation: GPU code is already hybrid. Consequence: no added complexity through hybrid code.
  • 20. Whetting your appetite
        import pycuda.driver as cuda
        import pycuda.autoinit, pycuda.compiler
        import numpy

        a = numpy.random.randn(4,4).astype(numpy.float32)
        a_gpu = cuda.mem_alloc(a.nbytes)   # allocate device memory
        cuda.memcpy_htod(a_gpu, a)         # copy host -> device
    [This is examples/demo.py in the PyCUDA distribution.]
  • 21. Whetting your appetite
        mod = pycuda.compiler.SourceModule("""
            __global__ void twice(float *a)
            {
              int idx = threadIdx.x + threadIdx.y*4;
              a[idx] *= 2;
            }
            """)

        func = mod.get_function("twice")
        func(a_gpu, block=(4,4,1))

        a_doubled = numpy.empty_like(a)
        cuda.memcpy_dtoh(a_doubled, a_gpu)  # copy device -> host
        print(a_doubled)
        print(a)
  • 22. Whetting your appetite (continued): the CUDA C source string passed to SourceModule is the compute kernel.
  • 23. Whetting your appetite, Part II. Did somebody say “Abstraction is good”?
  • 24. Whetting your appetite, Part II
        import numpy
        import pycuda.autoinit
        from pycuda import gpuarray

        a_cpu = numpy.random.randn(4,4).astype(numpy.float32)
        b_cpu = numpy.random.randn(4,4).astype(numpy.float32)
        c_cpu = a_cpu * b_cpu

        a_gpu = gpuarray.to_gpu(a_cpu)
        b_gpu = gpuarray.to_gpu(b_cpu)
        c_gpu = (a_gpu * b_gpu).get()

        print(c_cpu - c_gpu)
  • 25. Remember me?
        // trivia
        #include <stdio.h>
        #include <stdlib.h>   // malloc, free, abort

        #define CUDA_CHK(NAME, ARGS) { \
          cudaError_t cuda_err_code = NAME ARGS; \
          if (cuda_err_code != cudaSuccess) { \
            printf("%s failed with code %d\n", #NAME, cuda_err_code); \
            abort(); \
          } \
        }
        // end

        // kernel
        __global__ void square_array(float *a, float *b, int n)
        {
          int i = (blockIdx.x * blockDim.y + threadIdx.y)
            * blockDim.x + threadIdx.x;
          if (i < n)
            a[i] = a[i] * b[i];
        }
        // end

        // main1
        int main()
        {
          cudaSetDevice(0); // EDIT ME

          const int n = 4096;

          float *a_host = (float *) malloc(n*sizeof(float));
          float *b_host = (float *) malloc(n*sizeof(float));

          float *a_device, *b_device;
          CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
          CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
        // end

        // main2
          for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

          CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float),
                cudaMemcpyHostToDevice));
          CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float),
                cudaMemcpyHostToDevice));

          dim3 block_dim(16, 16);
          int block_size = block_dim.x*block_dim.y;
          int n_blocks = (n + block_size-1) / block_size;
          square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);
        // end

        // main3
          CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float),
                cudaMemcpyDeviceToHost));

          for (int i = 0; i < n; i++)
            printf("%.0f ", a_host[i]);
          puts("\n");

          free(a_host);
          free(b_host);
          CUDA_CHK(cudaFree, (a_device));
          CUDA_CHK(cudaFree, (b_device));
        }
        // end
  • 26. PyCUDA Philosophy: provide complete access; automatically manage resources; provide abstractions; check for and report errors automatically; full documentation; integrate tightly with numpy.
  • 27. PyCuda: Workflow [diagram]: Edit → Run → SourceModule("...") → Cache! → nvcc → .cubin → Upload to GPU → Run on GPU
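The “Cache!” step in this workflow can be pictured as a lookup keyed by a hash of the kernel source, so that unchanged source skips the compiler. The sketch below is not PyCUDA's implementation: `SourceCache` and `fake_compile` are hypothetical names, `fake_compile` stands in for the expensive nvcc call, and an in-memory dict stands in for an on-disk cache.

```python
import hashlib


class SourceCache:
    """Memoize an expensive compile step by hashing the source text."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn   # hypothetical stand-in for invoking nvcc
        self.binaries = {}             # source hash -> compiled artifact
        self.misses = 0                # how many times we really compiled

    def get(self, source):
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self.binaries:   # unseen source: compile once
            self.misses += 1
            self.binaries[key] = self.compile_fn(source)
        return self.binaries[key]


def fake_compile(source):
    # Placeholder "compiler": just returns a tagged copy of the source.
    return "cubin:" + source


cache = SourceCache(fake_compile)
kernel = "__global__ void twice(float *a) { a[threadIdx.x] *= 2; }"
first = cache.get(kernel)                      # compiles
second = cache.get(kernel)                     # served from the cache
edited = cache.get(kernel.replace("2", "3"))   # edited source compiles again
```

Keying on the full source text means any edit, including a changed constant in generated code, is treated as a new kernel.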
  • 28. Automatic Cleanup. Reachable objects (memory, streams, ...) are never destroyed. Once unreachable, they are released at an unspecified future time. Scarce resources (memory) can be explicitly freed (obj.free()). Correctly deals with multiple contexts and dependencies.
  • 29. gpuarray: Simple Linear Algebra. pycuda.gpuarray: meant to look and feel just like numpy. gpuarray.to_gpu(numpy_array); numpy_array = gpuarray.get(). No: nd indexing, slicing, etc. (yet!). Yes: +, -, *, /, fill, sin, exp, rand, take, ... Random numbers using pycuda.curandom. Mixed types (int32 + float32 = float64). print gpuarray for debugging. Memory behind gpuarray available as .gpudata attribute; use as kernel arguments, textures, etc.
  • 30. What’s this “numpy”, anyway? Numpy: package for large, multi-dimensional arrays. Vectors, matrices, ...: A+B, sin(A), dot(A,B); la.solve(A, b), la.eig(A); cube[:, :, n-k:n+k], cube+5. All much faster than functional equivalents in Python. “Python’s MATLAB”: basis for SciPy, plotting, ...
  • 31. gpuarray: Elementwise expressions. Avoiding extra store-fetch cycles for elementwise math:
        import numpy.linalg as la
        import pycuda.autoinit
        import pycuda.gpuarray as gpuarray

        from pycuda.curandom import rand as curand
        a_gpu = curand((50,))
        b_gpu = curand((50,))

        from pycuda.elementwise import ElementwiseKernel
        lin_comb = ElementwiseKernel(
            "float a, float *x, float b, float *y, float *z",
            "z[i] = a*x[i] + b*y[i]")

        c_gpu = gpuarray.empty_like(a_gpu)
        lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

        assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
  • 32. gpuarray: Reduction made easy. Example: a scalar product calculation
        import numpy
        import pycuda.autoinit

        from pycuda.reduction import ReductionKernel
        dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
            reduce_expr="a+b", map_expr="x[i]*y[i]",
            arguments="const float *x, const float *y")

        from pycuda.curandom import rand as curand
        x = curand((1000*1000), dtype=numpy.float32)
        y = curand((1000*1000), dtype=numpy.float32)

        x_dot_y = dot(x, y).get()
        x_dot_y_cpu = numpy.dot(x.get(), y.get())
  • 33. Step 3: Usage. Complex numbers: ... in GPUArray, ... in user code (pycuda-complex.hpp). If/then/else for GPUArrays. Support for custom device pointers. Smarter device picking/context creation. PyFFT: FFT for PyOpenCL and PyCUDA. scikits.cuda: CUFFT, CUBLAS, CULA.
  • 34. Sparse Matrix-Vector on the GPU. New feature in 0.94: sparse matrix-vector multiplication. Uses “packeted format” by Garland and Bell (also includes parts of their code). Integrates with scipy.sparse. Conjugate-gradients solver included, with deferred convergence checking.
  • 35. Kernel Invocation: Automatic Copies
        mod = pycuda.driver.SourceModule(
            "__global__ void my_func(float *out, float *in){...}")
        func = mod.get_function("my_func")

        src = numpy.random.randn(400).astype(numpy.float32)
        dest = numpy.empty_like(src)

        # cuda.In / cuda.Out copy the host arrays to and from the device
        func(cuda.Out(dest), cuda.In(src), block=(400,1,1))
    “InOut” exists, too. Only for immediate invocation style.
  • 36. Step 4: Debugging. New in 0.94.1: support for CUDA gdb: $ cuda-gdb --args python -m pycuda.debug demo.py. Automatically: sets compiler flags; retains source code; disables compiler cache.
  • 37. CUDA APIs. CUDA has two programming interfaces: “Runtime”, high-level (libcudart.so, in the “toolkit”), and “Driver”, low-level (libcuda.so, comes with GPU driver). They are mutually exclusive. [Diagram: C/C++ → Runtime API → Driver API → Kernel Driver → Hardware; Python → PyCuda → Driver API.]
  • 38. Runtime vs. Driver API. Runtime ↔ Driver differences: explicit initialization; code objects (“Modules”) become programming language objects; texture handling requires slightly more work; only needs nvcc for compiling GPU code. Driver API: conceptually cleaner; less sugar-coating (provide in Python); not very different otherwise.
  • 39. PyCuda: API Tracing. With ./configure --cuda-trace=1, this script:
        import pycuda.driver as cuda
        import pycuda.autoinit
        import numpy

        a = numpy.random.randn(4,4).astype(numpy.float32)
        a_gpu = cuda.mem_alloc(a.nbytes)
        cuda.memcpy_htod(a_gpu, a)

        mod = cuda.SourceModule("""
            __global__ void doublify(float *a)
            {
              int idx = threadIdx.x + threadIdx.y*4;
              a[idx] *= 2;
            }
            """)

        func = mod.get_function("doublify")
        func(a_gpu, block=(4,4,1))

        a_doubled = numpy.empty_like(a)
        cuda.memcpy_dtoh(a_doubled, a_gpu)
        print(a_doubled)
        print(a)
    is shown next to this driver API call trace: cuInit, cuDeviceGetCount, cuDeviceGet, cuCtxCreate, cuMemAlloc, cuMemcpyHtoD, cuCtxGetDevice, cuDeviceComputeCapability, cuModuleLoadData, cuModuleGetFunction, cuFuncSetBlockShape, cuParamSetv, cuParamSetSize, cuLaunchGrid, cuMemcpyDtoH, cuCtxPopCurrent, cuCtxPushCurrent, cuMemFree, cuCtxPopCurrent, cuCtxPushCurrent, cuModuleUnload, cuCtxPopCurrent, cuCtxDestroy.
  • 40. PyCUDA: Vital Information. http://mathema.tician.de/software/pycuda. Complete documentation. MIT License (no warranty, free for all use). Requires: numpy, Python 2.4+ (Win/OS X/Linux). Support via mailing list.
  • 43. Outline 1. Scripting GPUs with PyCUDA 2. Meta-programming and RTCG 3. Case study in brain-inspired AI
  • 44. ... too much? bank conflicts, coalescing, mixed precision, caching, partition camping, clamping, broadcasting, streams, zero-copy
  • 45. can’t decide?
  • 47. GPU Programming: Implementation Choices. Many difficult questions; insufficient heuristics; answers are hardware-specific and have no lasting value. Proposed solution: tune automatically for hardware at run time, and cache tuning results. Decrease reliance on knowledge of hardware internals. Shift emphasis from tuning results to tuning ideas.
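“Tune at run time, cache tuning results” can be sketched as a small persistent lookup. Because the answers are hardware-specific, the cache key below includes a device name. Everything here is an illustrative assumption: `load_or_tune`, the JSON file layout, the device names, and `fake_tune` (a stand-in for a real empirical search) are not part of PyCUDA.

```python
import json
import os
import tempfile


def load_or_tune(cache_path, device_name, problem_key, tune_fn):
    """Return cached tuning parameters, running tune_fn only on a miss."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    key = device_name + "|" + problem_key    # results are hardware-specific
    if key not in cache:
        cache[key] = tune_fn()                # expensive empirical search
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return cache[key]


calls = []


def fake_tune():
    # Hypothetical tuner; a real one would time candidate kernels on the GPU.
    calls.append(1)
    return {"block_w": 128, "unroll": 4}


path = os.path.join(tempfile.mkdtemp(), "tuning_cache.json")
p1 = load_or_tune(path, "GPU-A", "conv_4x4x4", fake_tune)   # miss: tunes
p2 = load_or_tune(path, "GPU-A", "conv_4x4x4", fake_tune)   # hit: no tuning
p3 = load_or_tune(path, "GPU-B", "conv_4x4x4", fake_tune)   # new device: tunes
```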
  • 48. Metaprogramming. Idea: in GPU scripting, GPU code does not need to be a compile-time constant. (Key: code is data; it wants to be reasoned about at run time.)
  • 56. Metaprogramming [diagram]: Human → Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result, with the steps from GPU Code onward handled by the machine. Python is good for code generation; the pipeline is provided by PyCUDA and PyOpenCL.
  • 57. Machine-generated Code. Why machine-generate code? Automated tuning (cf. ATLAS, FFTW); data types; specialize code for given problem; constants faster than variables (→ register pressure); loop unrolling.
  • 58. PyCuda: Support for Metaprogramming. Access properties of compiled code: func.{num_regs,shared_size_bytes,local_size_bytes}. Exact GPU timing via events. Can calculate hardware-dependent MP occupancy. codepy (by Andreas): build C syntax trees from Python; generates readable, indented C. Or use a templating engine (many available, e.g. Cheetah).
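As a minimal illustration of “specialize code for given problem” and “constants faster than variables”, the sketch below bakes the problem size and a scale factor into CUDA C source as literals instead of passing them as kernel arguments. It uses Python's standard `string.Template` rather than any particular GPU templating tool; the `scale` kernel and the `specialize` helper are made up for this example.

```python
from string import Template

# CUDA C source with two placeholders that become literals at generation time.
KERNEL_TEMPLATE = Template("""
__global__ void scale(float *a)
{
    // the bound and the factor below are baked in as literals
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < $n)
        a[idx] *= ${factor}f;
}
""")


def specialize(n, factor):
    """Return CUDA C source with the problem size and factor hard-coded."""
    return KERNEL_TEMPLATE.substitute(n=int(n), factor=repr(float(factor)))


source = specialize(n=4096, factor=2)
print(source)
```

The resulting string is ordinary CUDA C, so it could be handed to `pycuda.compiler.SourceModule` in the same way as the hand-written kernels earlier in the deck.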
  • 59. Outline 1. Scripting GPUs with PyCUDA 2. Meta-programming and RTCG 3. Case study in brain-inspired AI (vision)
  • 61. The Problem: Visual Object Recognition. Fast; accurate; tolerant to variations; effortless; critical to survival.
  • 62. The Approach: Reverse and Forward Engineering the Brain
  • 63. The Approach: Reverse and Forward Engineering the Brain. REVERSE: study the natural system. FORWARD: build an artificial system.
  • 64. Why is modeling challenging? The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run Neural data only provides weak constraints ➡ Lots of parameters – hard to explore Advice from Dave Cox: “Don’t run anything that takes longer than a week to complete, because it will just crash halfway through anyways (or you’ll discover a bug) and you’ll never finish your Ph.D.”
  • 66. Visual Cortex: brain = 20 petaflops!
  • 67. GPUs (since 2006): 7800 GTX (2006), OpenGL/Cg, C++/Python; Monster 16-GPU (2008), CUDA, Python; Tesla Cluster (2009), CUDA/OpenCL, Python.
  • 68. Build your own!
  • 69. Cell Broadband Engine (since 2007). Teraflop Playstation3 clusters: DiCarlo Lab / MIT, Cox Lab / Harvard.
  • 70. A Match Made in Heaven. Brains are parallel, GPUs are parallel ≈ Multiple scales of parallelism: “Embarrassingly” parallel: video frames, regions. Fine-grained: independent “neurons,” operating on overlapping inputs.
  • 71. A Match Made in Heaven Images In, Images Out ≈ Image processing particularly well-suited Excellent Arithmetic Intensity: very natural to load image patches into shared memory Data: 2D / 3D locality
  • 72. Why is modeling challenging? The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run Neural data only provides weak constraints ➡ Lots of parameters – hard to explore
  • 74. LeCun et al. (1989)
  • 76. Serre & Poggio (2007)
  • 77. [Figure: three-layer model, input → L1 → L2 → L3 → Read-out. Per-layer parameters: kernel size, number of filters, thresh/sat, normalization neighborhood, norm strength. Learning parameters: Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...]
  • 79. Two conflicting requirements. The brain is a massively parallel computer ➡ big models are paralyzingly slow to run (FAST). Neural data only provides weak constraints ➡ lots of parameters, hard to explore (FLEXIBLE). How to optimize?
  • 81. 3D Filterbank Convolutions!
  • 82. Fast vs Flexible: what can you do? - Make your code accessible - No focus on raw performance Examples: MATLAB/CUDA by Jim Mutch (2010) by John Moore (1995)
  • 83. Fast vs Flexible: what can you do? - Use standard libraries (e.g. CUBLAS, CUFFT, Jacket) - But: “remap” problem to fit? - Memory issues (not always optimal)
  • 84. Fast vs Flexible: what can you do? - Fully optimized, by hand - But for only a few input configurations...
  • 85. Fast vs Flexible: what can you do? - Focus on flexibility/accessibility first - But add strong foundations for raw performance from the beginning Example: Python/C/CUDA (OpenCL*) http://deeplearning.net by James Bergstra & Yoshua Bengio (2010)
  • 87. Meta-programming and Auto-tuning
  • 88. What?
  • 89. Meta-programming ! Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW) • Dynamically compile specialized versions of the same kernel for different conditions • Empirical run-time tuning • For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.
  • 90. Meta-programming ! “Instrument” your solutions: • Block size • Work size • Loop unrolling • Pre-fetching • Spilling • etc.
  • 91. Meta-programming ! Let the computer generate and find the optimal code: • brute-force search with a global objective • machine-learning approach with local objectives and hidden variables (advanced) • e.g. PyCuda makes this easy:
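The “brute-force search with a global objective” bullet can be sketched as an exhaustive loop over instrumented parameters. This is only an outline under stated assumptions: `autotune`, `timed`, and `toy_cost` are hypothetical names, and `toy_cost` is a made-up deterministic cost model so the example runs without a GPU. On real hardware the cost function would generate, compile, and time a kernel variant (the deck mentions exact GPU timing via events).

```python
import itertools
import time


def autotune(cost_fn, search_space):
    """Evaluate every parameter combination; return (best_params, best_cost)."""
    names = sorted(search_space)
    best = None
    for values in itertools.product(*(search_space[n] for n in names)):
        params = dict(zip(names, values))
        cost = cost_fn(**params)
        if best is None or cost < best[1]:
            best = (params, cost)
    return best


def timed(fn, repeats=3):
    """Wrap fn so that calling the wrapper returns fn's best wall-clock time."""
    def cost_fn(**params):
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(**params)
            times.append(time.perf_counter() - start)
        return min(times)
    return cost_fn


def toy_cost(block_size, unroll):
    # Made-up cost model with its minimum at block_size=128, unroll=4.
    return abs(block_size - 128) + abs(unroll - 4)


space = {"block_size": [32, 64, 128, 256], "unroll": [1, 2, 4, 8]}
best_params, best_cost = autotune(toy_cost, space)

# Timing-based variant on a CPU stand-in workload (the result varies by machine).
cpu_cost = timed(lambda block_size, unroll: sum(range(block_size * unroll)))
measured = cpu_cost(block_size=32, unroll=1)
```

Exhaustive search grows multiplicatively with each added parameter, which is one reason the slide lists a machine-learning approach as the advanced alternative.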
  • 92. Basic GPU Meta-programming System. “GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision” [GPU Computing Gems], Pinto N, Cox DD
  • 93. Cheetah template:
        texture<float4, 1, cudaReadModeElementType> tex_float4;
        __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

        #define IMUL(a, b) __mul24(a, b)
        extern "C" {

        #for j in xrange($FILTER_H)

          __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
          {

        #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
            __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

            // -- input/output offsets
            const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
            const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
            float4 input_v4;

            // -- load input to shared memory
        #for i in xrange($LOAD_ITERATIONS)
        #if $i==($LOAD_ITERATIONS-1)
            if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
        #end if
              {
                input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
                shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
                shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
                shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
  • 94. conv_kernel_template.cu (the template of slide 93, whose load loop continues with "shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w; } #end for") shown next to the generated conv_kernel_4x4x4.cu:
        #include <stdio.h>

        texture<float4, 1, cudaReadModeElementType> tex_float4;
        __constant__ float constant[4][4][4];

        #define IMUL(a, b) __mul24(a, b)
        extern "C" {

          __global__ void convolve_beta_j0(float4 *input, float4 *output)
          {
            __shared__ float shared_in[131][4+1];

            // -- input/output offsets
            const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
            const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
            float4 input_v4;

            // -- load input to shared memory
            {
              input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
              shared_in[threadIdx.x+128*0][0] = input_v4.x;
              shared_in[threadIdx.x+128*0][1] = input_v4.y;
              shared_in[threadIdx.x+128*0][2] = input_v4.z;
              shared_in[threadIdx.x+128*0][3] = input_v4.w;
            }
            if((threadIdx.x+128*1)<131)
            {
              input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
              shared_in[threadIdx.x+128*1][0] = input_v4.x;
              shared_in[threadIdx.x+128*1][1] = input_v4.y;
              shared_in[threadIdx.x+128*1][2] = input_v4.z;
              shared_in[threadIdx.x+128*1][3] = input_v4.w;
            }
            __syncthreads();

            // -- compute dot products
            float v, w;

            float sum0 = 0;
            float sum1 = 0;
            float sum2 = 0;
            float sum3 = 0;

            v = shared_in[threadIdx.x+0][0];
            w = constant[0][0][0]; sum0 += v*w;
            w = constant[0][0][1]; sum1 += v*w;
            w = constant[0][0][2]; sum2 += v*w;
            w = constant[0][0][3]; sum3 += v*w;
            v = shared_in[threadIdx.x+1][0];
            w = constant[0][1][0]; sum0 += v*w;
            w = constant[0][1][1]; sum1 += v*w;
            w = constant[0][1][2]; sum2 += v*w;
            w = constant[0][1][3]; sum3 += v*w;
            v = shared_in[threadIdx.x+2][0];
            w = constant[0][2][0]; sum0 += v*w;
            w = constant[0][2][1]; sum1 += v*w;
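The literals in the generated file follow from a few template inputs. The template itself defines INPUT_BLOCK_W = BLOCK_W + FILTER_W - 1; with BLOCK_W = 128 and FILTER_W = 4 that gives the 131 seen in `shared_in[131][4+1]`. The slides do not show how LOAD_ITERATIONS is computed, so the ceiling division below is an assumption that happens to reproduce the two load blocks in the generated code. The `derived_params` helper is hypothetical.

```python
def derived_params(block_w, filter_w):
    """Compute template values derived from the block and filter widths."""
    # From the template: #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    input_block_w = block_w + filter_w - 1
    # Assumed formula: enough passes of block_w threads to load the whole row.
    load_iterations = -(-input_block_w // block_w)   # ceiling division
    return {"INPUT_BLOCK_W": input_block_w, "LOAD_ITERATIONS": load_iterations}


params_4 = derived_params(block_w=128, filter_w=4)
params_8 = derived_params(block_w=128, filter_w=8)
print(params_4)
print(params_8)
```

Only the last load iteration needs the bounds check `threadIdx.x + BLOCK_W*i < INPUT_BLOCK_W`, which is why the template emits the `if` solely when `i == LOAD_ITERATIONS-1`.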
  • 95. conv_kernel_template.cu (the same template as on slides 93 and 94) generates conv_kernel_4x4x4.cu (20 kB) and conv_kernel_8x8x4.cu (64 kB).
  • 98. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • variable-length argument lists
  • 99. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • syntax-level code control (e.g. conditionals)
  • 100. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • loop unrolling (possibly fine-controlled)
  • 101. Smooth syntactic ugliness. Manipulations that are not easily accessible in CUDA C code: fine-controlled loop unrolling
        (...)
        v = shared_in[threadIdx.x+0][0];
        w = constant[0][0][0]; sum0 += v*w;
        w = constant[0][0][1]; sum1 += v*w;
        w = constant[0][0][2]; sum2 += v*w;
        w = constant[0][0][3]; sum3 += v*w;
        v = shared_in[threadIdx.x+1][0];
        w = constant[0][1][0]; sum0 += v*w;
        w = constant[0][1][1]; sum1 += v*w;
        w = constant[0][1][2]; sum2 += v*w;
        w = constant[0][1][3]; sum3 += v*w;
        v = shared_in[threadIdx.x+2][0];
        w = constant[0][2][0]; sum0 += v*w;
        w = constant[0][2][1]; sum1 += v*w;
        w = constant[0][2][2]; sum2 += v*w;
        w = constant[0][2][3]; sum3 += v*w;
        v = shared_in[threadIdx.x+3][0];
        w = constant[0][3][0]; sum0 += v*w;
        w = constant[0][3][1]; sum1 += v*w;
        w = constant[0][3][2]; sum2 += v*w;
        w = constant[0][3][3]; sum3 += v*w;
        v = shared_in[threadIdx.x+0][1];
        w = constant[1][0][0]; sum0 += v*w;
        w = constant[1][0][1]; sum1 += v*w;
        w = constant[1][0][2]; sum2 += v*w;
        w = constant[1][0][3]; sum3 += v*w;
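A fully unrolled listing of this shape is tedious to write by hand but short to generate. The sketch below reproduces the statement pattern from the slide with plain Python string formatting; the function name and the loop order (depth, then filter position, then filter index) are inferred from the listing rather than taken from the authors' template.

```python
def unrolled_dot_products(filter_d, filter_w, n_filters):
    """Emit the fully unrolled multiply-accumulate statements as CUDA C text."""
    lines = []
    for d in range(filter_d):            # input depth (channel) index
        for x in range(filter_w):        # horizontal offset within the filter
            lines.append("v = shared_in[threadIdx.x+%d][%d];" % (x, d))
            for f in range(n_filters):   # one accumulator per filter
                lines.append("w = constant[%d][%d][%d];" % (d, x, f))
                lines.append("sum%d += v*w;" % f)
    return "\n".join(lines)


code = unrolled_dot_products(filter_d=4, filter_w=4, n_filters=4)
lines = code.split("\n")
print("\n".join(lines[:9]))   # first (v, w, sum) group, as on the slide
```

Because the loop bounds are ordinary Python integers, partial unrolling or a different accumulator count is a one-line change in the generator rather than a rewrite of the kernel.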
  • 102. How about #pragma unroll ? (why don’t you trust the compiler?)
  • 103. we are not alone.... Using GPUs for Signal Correlation (Michael Clark, with Daniel A. Mitchell, Paul La Plante and Lincoln Greenhill; The Murchison Widefield Array). Don’t trust compilers • Compare these “identical” code fragments: a += b*c + d*c + e*f + g*h; (770 GFLOPS) vs. a += b*c; a += d*c; a += e*f; a += g*h; ([…]20 GFLOPS)
  • 104. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • index un-indexable resources (e.g. regs)
  • 105. Explore design decision space more freely
  • 106. Basic GPU Meta-programming System. GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision [GPU Computing Gems] Pinto N, Cox DD
  • 107. Exploring design decision space more freely Meta-programming: • enables efficient learning of the GPU hardware/software • allows full exploitation of the GPU architecture
  • 108. conv_kernel_beta_template.cu
      texture<float4, 1, cudaReadModeElementType> tex_float4;
      __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];
      #define IMUL(a, b) __mul24(a, b)
      extern "C" {
      #for j in xrange($FILTER_H)
        __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
        {
      #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
          __shared__ float shared_in[$INPUT_BLOCK_W][4+1];
          // -- input/output offsets
          const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
          const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
          float4 input_v4;
          // -- load input to shared memory
      #for i in xrange($LOAD_ITERATIONS)
      #if $i==($LOAD_ITERATIONS-1)
          if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
      #end if
          {
            input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
            shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
            shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
            shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
            shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
          }
      #end for
      version A:
      ...
      mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
      mov.b32 $r1, c0[$ofs2+0x0008]
      mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
      mov.b32 $r1, c0[$ofs2+0x000c]
      mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
      mov.b32 $r1, c0[$ofs2+0x0010]
      mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
      ...
      version B:
      ...
      mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
      mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
      mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
      mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
      mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
      ...
      2x faster... Why? (using decuda by Wladimir J. van der Laan)
  • 109. Exploring design decision space more freely
  • 110. Exploring design decision space more freely When USE_THREAD_PER_FILTER is True • each thread will access different cmem locations (in order) using the decuda disassembler by Wladimir J. van der Laan (Python-based)
  • 111. Exploring design decision space more freely When USE_THREAD_PER_FILTER is False • each thread will access the same cmem locations (broadcast) using the decuda disassembler by Wladimir J. van der Laan (Python-based)
  • 112. Exploring design decision space more freely: more registers vs. thread-dependent data movement. 2x faster... Why?
  • 113. Strategy • intermediate design decisions can be made explicit • multiple “forks” in the path can be kept in place • frees up the developer to revisit past choices (without incurring a combinatorial explosion of separate pieces of code) • retesting sets of assumptions can be done frequently and programmatically from the “outer” framework of code
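One way to keep every "fork" in place is to record each design decision as a named parameter and let the outer framework enumerate the combinations. The sketch below assumes a small, made-up design space: `BLOCK_W` and `USE_THREAD_PER_FILTER` appear in the slides, `UNROLL_FILTER_W` and all the listed values are hypothetical.

```python
import itertools

# Illustrative design space; the values (and UNROLL_FILTER_W) are assumptions.
design_space = {
    "BLOCK_W": [16, 32, 64],
    "USE_THREAD_PER_FILTER": [True, False],
    "UNROLL_FILTER_W": [True, False],
}


def enumerate_variants(space):
    """Yield one dict per combination of design decisions."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


variants = list(enumerate_variants(design_space))
print(len(variants), "variants, e.g.", variants[0])
```

Each dict can then be handed to a template renderer, so one template plus one parameter table replaces many hand-maintained copies of the kernel.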
  • 114. Toy Example: Matmul http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah
  • 115. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter code • is easy and flexible with the right tools (e.g. Python, Py{CUDA,CL}, Cheetah, decuda) ➡ facilitates auto-tuning!
  • 117. How to get to the ninja level?
  • 118. Practice, practice, practice ...
  • 120. Basic GPU Meta-programming System. GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision [GPU Computing Gems] Pinto N, Cox DD
  • 121. Auto-tuning The goal is to empirically optimize execution time given: • the environment - hardware (GPU, CPU, Memory, Mobo) - software (SDK, Compiler suite) • the data (input dimensions, repetitions, etc.)
  • 122. Basic auto-tuning: pseudo-code (1/3) Filter-bank Convolution / Correlation Scripting, Py{CUDA,CL} NoSQL (CouchDB, MongoDB) ?
  • 123. Basic auto-tuning: pseudo-code (2/3) Cheetah, Jinja, Mako PyCUDA/CL
  • 124. Basic auto-tuning: pseudo-code (3/3) PyCUDA/CL NoSQL (CouchDB, MongoDB)
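A minimal sketch of a basic (exhaustive) auto-tuning loop, under stated assumptions: `build` stands in for "render the template and compile it" (PyCUDA/CL in the slides), `benchmark` times the resulting callable, and results are kept in a Python list rather than a NoSQL store. All names here are hypothetical, and the toy "kernel" is ordinary CPU code so the example runs anywhere.

```python
import itertools
import time


def benchmark(run, repetitions=3):
    """Return the best wall-clock time (seconds) over several calls to run()."""
    best = float("inf")
    for _ in range(repetitions):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def autotune(build, space, repetitions=3):
    """Time every parameter combination; return (best_params, all_results)."""
    keys = sorted(space)
    results = []
    for values in itertools.product(*(space[k] for k in keys)):
        params = dict(zip(keys, values))
        try:
            run = build(params)  # stand-in for: render template + compile
        except ValueError:
            continue  # skip combinations the builder rejects
        results.append((benchmark(run, repetitions), params))
    _, best_params = min(results, key=lambda r: r[0])
    return best_params, results


def build_toy(params):
    """Toy builder: chunked summation; rejects chunk sizes above 1024."""
    chunk = params["chunk"]
    if chunk > 1024:
        raise ValueError("chunk too large for this toy 'device'")
    data = list(range(4096))

    def run():
        return sum(sum(data[i:i + chunk]) for i in range(0, len(data), chunk))

    return run


best, results = autotune(build_toy, {"chunk": [1, 64, 4096]})
print("best parameters:", best)
```

With a real GPU backend, `build` would compile the rendered source and `run` would launch the kernel and synchronize before the timer stops; which variant wins depends on the environment and the data, which is the point of measuring.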
  • 126. Optimizing strategy • Like many operations, filter-bank convolution is usually “communication bound” on the GPU: - compute is cheap - communication is expensive • We must take advantage of all types of memory: - explicit: gmem (global), smem (shared), cmem (constant), tmem (texture) - implicit: rmem (registers), bmem (bin-code?) * • Different optimal access patterns
  • 127. Example: thread gmem output size stupid float4 xyzw trick
  • 132. Example: capitalizing on bmem (bin code) ?? multiple versions of the same function with different input offsets input offset in cubin code?
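A sketch of the "multiple versions of the same function" idea, assuming it corresponds to the `#for j in xrange($FILTER_H)` loop in the kernel template: each generated kernel has its row offset `j` written into the source as a literal, so the offset can end up in the compiled code rather than being passed or computed at run time. Python's `string.Template` is used here in place of Cheetah, and the kernel body, `render_versions`, and the sizes are placeholders.

```python
from string import Template

# Simplified, hypothetical kernel; only the offset-baking pattern matters.
KERNEL = Template("""\
__global__ void convolve_j${j}(const float *input, float *output)
{
    const unsigned int in_idx  = (blockIdx.y + ${j}) * ${input_w} + blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int out_idx = blockIdx.y * ${output_w} + blockIdx.x * blockDim.x + threadIdx.x;
    output[out_idx] += input[in_idx];  // placeholder body
}
""")


def render_versions(filter_h, input_w, output_w):
    """Return CUDA source with one kernel per filter row offset j."""
    parts = ['extern "C" {']
    for j in range(filter_h):
        parts.append(KERNEL.substitute(j=j, input_w=input_w, output_w=output_w))
    parts.append("}")
    return "\n".join(parts)


source = render_versions(filter_h=3, input_w=256, output_w=248)
print(source)
```

The host code would then launch `convolve_j0`, `convolve_j1`, ... in turn; whether this beats a single kernel with a runtime offset is something to measure rather than assume.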
  • 134. Results
      GPU / SDK           Input        Filter-bank  Meta-prog default (gflops)  Meta-prog auto-tuned (gflops)  Boost
      9600M GT / CUDA3.1  256x256x8    64x9x9x8     6.710 ± 0.005               36.584 ± 0.023                 445.2 %
      9600M GT / CUDA3.1  512x512x4    32x13x13x4   13.606 ± 0.002              35.582 ± 0.003                 161.5 %
      9600M GT / CUDA3.1  1024x1024x8  16x5x5x8     20.034 ± 0.113              26.084 ± 6.243                 30.2 %
      9600M GT / CUDA3.1  2048x2048x4  4x8x8x4      25.781 ± 0.044              46.945 ± 0.100                 82.1 %
      C1060 / CUDA2.3     256x256x8    64x9x9x8     104.188 ± 0.051             168.083 ± 0.372                61.3 %
      C1060 / CUDA2.3     512x512x4    32x13x13x4   125.739 ± 0.109             234.053 ± 0.266                86.1 %
      C1060 / CUDA2.3     1024x1024x8  16x5x5x8     144.279 ± 0.764             243.697 ± 0.346                68.9 %
      C1060 / CUDA2.3     2048x2048x4  4x8x8x4      180.060 ± 0.018             322.328 ± 0.348                79.0 %
      GTX285 / CUDA2.3    256x256x8    64x9x9x8     123.396 ± 0.016             197.006 ± 0.219                59.7 %
      GTX285 / CUDA2.3    512x512x4    32x13x13x4   143.277 ± 0.044             270.206 ± 0.209                88.6 %
      GTX285 / CUDA2.3    1024x1024x8  16x5x5x8     148.841 ± 0.465             310.276 ± 0.538                108.5 %
      GTX285 / CUDA2.3    2048x2048x4  4x8x8x4      205.152 ± 0.015             376.685 ± 0.070                83.6 %
      GTX480 / CUDA3.1    256x256x8    64x9x9x8     467.631 ± 19.100            471.902 ± 11.419               0.9 %
      GTX480 / CUDA3.1    512x512x4    32x13x13x4   834.838 ± 8.275             974.266 ± 3.809                16.7 %
      GTX480 / CUDA3.1    1024x1024x8  16x5x5x8     542.808 ± 1.135             614.019 ± 0.904                13.1 %
      GTX480 / CUDA3.1    2048x2048x4  4x8x8x4      378.165 ± 0.537             806.628 ± 0.168                113.3 %
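The "Boost" column appears consistent with the relative gain of the auto-tuned throughput over the default one, i.e. (auto-tuned / default − 1) × 100. The slides do not state the formula, so treat it as an inference from the numbers; `boost_percent` is a hypothetical helper, and the two rows below are copied from the table.

```python
def boost_percent(default_gflops, tuned_gflops):
    """Relative throughput gain of the auto-tuned kernel, in percent."""
    return (tuned_gflops / default_gflops - 1.0) * 100.0


# (GPU, input, default gflops, auto-tuned gflops), taken from the results table.
rows = [
    ("9600M GT", "256x256x8", 6.710, 36.584),
    ("GTX480", "2048x2048x4", 378.165, 806.628),
]

for gpu, shape, default, tuned in rows:
    print(f"{gpu} {shape}: {boost_percent(default, tuned):.1f} %")
```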
  • 137. Empirical results... Performance (gflops):
      Q9450 (Matlab/C) [2008]: 0.3
      Q9450 (C/SSE) [2008]: 9.0
      7900GTX (Cg) [2006]: 68.2
      PS3/Cell (C/ASM) [2007]: 111.4
      8800GTX (CUDA1.x) [2007]: 192.7
      GTX280 (CUDA2.x) [2008]: 339.3
      GTX480 (CUDA3.x) [2010]: 974.3
      >1000X speedup is game changing...
  • 139. Summary • Meta-programming makes developing high-performing code for GPU easier • Fantastic tools exist (e.g. PyCUDA) to help • Interesting way to explore/learn about GPUs (hw/sw) • Coarse auto-tuning yields good results
  • 140. Future • More Fermi optimizations (L1 cache, concurrent kernels) • OpenCL to optimize across vendors • Smarter auto-tuning techniques (ML) - (boosted) decision trees - evolutionary programming strategies
  • 141. More ? • Thu 3/31/11: PyOpenCL (A.Klöckner, NYU), ahh (C.Omar, CMU) • Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC) • Tue 4/5/11: Analysis-driven Optimization (C.Woolley, NVIDIA) • Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J.Owens, UCDavis) • Thu 4/14/11: Optimization for Ninjas (D.Merrill, UVirg) • ...
  • 142. one more thing or two...
  • 143. Life/Code Hacking #2.x Speed {listen,read,writ}ing accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 144. Life/Code Hacking #2.2b Speed writing
  • 145. Life/Code Hacking #2.2b Speed writing: RSI?
  • 146. Life/Code Hacking #2.2b Speed writing: RSI?
  • 147. Life/Code Hacking #2.2b Speed writing
  • 148. Life/Code Hacking #2.3 Speed reading
  • 149. Life/Code Hacking #2.3 Speed reading 1. Collect many papers, docs, chapters, etc. (100) 2. Skim through them quickly / select (50) 3. Read w/o full understanding / select (25) 4. Read completely w/ full understanding / select (10) 5. Complete mastery + reproduction (5)
  • 150. Life/Code Hacking #2.3 Speed reading http://readerssoft.com/speed_reading_obstacles.php
  • 151. Life/Code Hacking #2.3 Speed reading http://readerssoft.com/speed_reading_obstacles.php normal reading vs. speed reading
  • 152. Life/Code Hacking #2.3 Speed reading like David Guetta, use one finger!
  • 153. CO ME