[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning
Published on http://cs264.org
Transcript of "[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning"

  1. 1. Massively Parallel Computing, CS 264 / CSCI E-292. Lecture #6: CUDA Ninja Tricks | March 1st, 2011. Nicolas Pinto (MIT, Harvard), pinto@mit.edu
  2. 2. Massively Parallel Computing, CS 264 / CSCI E-292. Lecture #6: CUDA Ninja Tricks ("Scripting" GPUs, Meta-programming, Auto-tuning) | March 1st, 2011. Nicolas Pinto (MIT, Harvard), pinto@mit.edu
  3. 3. News
  4. 4. During this course, we'll try to "..." and use existing material (adapted for CS264) ;-)
  5. 5. Today, yey!!
  6. 6. Outline: 1. Scripting GPUs with PyCUDA; 2. Meta-programming and RTCG; 3. Case study in brain-inspired AI
  8. 8. Why do Scripting for GPUs? GPUs are everything that scripting languages are not: highly parallel; very architecture-sensitive; built for maximum compute/memory throughput. → They complement each other. CPU: largely restricted to control tasks (~1000/sec); scripting is fast enough. Realize a promise: use scripting from first prototype to full-scale production code. (slide by Andreas Klöckner, NYU)
  11. 11. Why do Scripting for GPUs? GPUs are everything that scripting languages are not: highly parallel; very architecture-sensitive; built for maximum FP/memory throughput. → They complement each other. CPU: largely restricted to control tasks (~1000/sec); scripting is fast enough. Python + CUDA = PyCUDA. Python + OpenCL = PyOpenCL. (slide by Andreas Klöckner, NYU)
  12. 12. How are High-Performance Codes constructed? "Traditional" construction of high-performance codes: C/C++/Fortran, libraries. "Alternative" construction: scripting for 'brains', GPUs for 'inner loops'. Play to the strengths of each programming environment. (slide by Andreas Klöckner, NYU)
  13. 13. Scripting: Python. One example of a scripting language: Python. Mature; large and active community; emphasizes readability; written in widely-portable C; a 'multi-paradigm' language; rich ecosystem of sci-comp related software. (slide by Andreas Klöckner, NYU)
  14. 14. Scripting Languages. Python: is discoverable and interactive; has comprehensive built-in functionality; manages resources automatically; uses run-time typing; works well for "gluing" lower-level blocks together. (slide by Andreas Klöckner, NYU)
  15. 15. Scripting: Goals. Scripting languages aim to reduce the load on the programmer: reduce required knowledge; encourage experimentation; eliminate sources of error; encourage abstraction wherever possible; value programmer time over computer time. Think about the tools you use. Use the right tool for the job. (slide by Andreas Klöckner, NYU)
  18. 18. Scripting: Speed. Usual answer to the "Speed Question": hybrid ("mixed") code. Plays to the strengths of each language. But: introduces (some) complexity. Observation: GPU code is already hybrid. Consequence: no added complexity through hybrid code. (slide by Andreas Klöckner, NYU)
  19. 19. Whetting your appetite [this is examples/demo.py in the PyCUDA distribution]
      import pycuda.driver as cuda
      import pycuda.autoinit, pycuda.compiler
      import numpy

      a = numpy.random.randn(4,4).astype(numpy.float32)
      a_gpu = cuda.mem_alloc(a.nbytes)
      cuda.memcpy_htod(a_gpu, a)
      (slide by Andreas Klöckner, NYU)
  20. 20. Whetting your appetite (continued; the string is the compute kernel)
      mod = pycuda.compiler.SourceModule("""
      __global__ void twice(float *a)
      {
        int idx = threadIdx.x + threadIdx.y*4;
        a[idx] *= 2;
      }
      """)

      func = mod.get_function("twice")
      func(a_gpu, block=(4,4,1))

      a_doubled = numpy.empty_like(a)
      cuda.memcpy_dtoh(a_doubled, a_gpu)
      print a_doubled
      print a
      (slide by Andreas Klöckner, NYU)
  22. 22. Whetting your appetite, Part II. Did somebody say "Abstraction is good"? (slide by Andreas Klöckner, NYU)
  23. 23. Whetting your appetite, Part II
      import numpy
      import pycuda.autoinit
      from pycuda import gpuarray

      a_cpu = numpy.random.randn(4,4).astype(numpy.float32)
      b_cpu = numpy.random.randn(4,4).astype(numpy.float32)
      c_cpu = a_cpu * b_cpu

      a_gpu = gpuarray.to_gpu(a_cpu)
      b_gpu = gpuarray.to_gpu(b_cpu)
      c_gpu = (a_gpu * b_gpu).get()

      print c_cpu - c_gpu
      (slide by Andreas Klöckner, NYU)
  24. 24. Remember me? (the same idea in raw CUDA C)
      // trivia
      #include <stdio.h>

      #define CUDA_CHK(NAME, ARGS) { \
        cudaError_t cuda_err_code = NAME ARGS; \
        if (cuda_err_code != cudaSuccess) { \
          printf("%s failed with code %d\n", #NAME, cuda_err_code); \
          abort(); \
        } \
      }
      // end

      // kernel
      __global__ void square_array(float *a, float *b, int n)
      {
        int i = (blockIdx.x * blockDim.y + threadIdx.y)
                * blockDim.x + threadIdx.x;
        if (i < n)
          a[i] = a[i] * b[i];
      }
      // end

      // main1
      int main()
      {
        cudaSetDevice(0); // EDIT ME

        const int n = 4096;

        float *a_host = (float *) malloc(n*sizeof(float));
        float *b_host = (float *) malloc(n*sizeof(float));

        float *a_device, *b_device;
        CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
        CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
        // end

        // main2
        for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

        CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float),
                              cudaMemcpyHostToDevice));
        CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float),
                              cudaMemcpyHostToDevice));

        dim3 block_dim(16, 16);
        int block_size = block_dim.x*block_dim.y;
        int n_blocks = (n + block_size-1) / block_size;
        square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);
        // end

        // main3
        CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float),
                              cudaMemcpyDeviceToHost));

        for (int i = 0; i < n; i++)
          printf("%.0f ", a_host[i]);
        puts("\n");

        free(a_host);
        CUDA_CHK(cudaFree, (a_device));
      }
      // end
      (slide by Andreas Klöckner, NYU)
  25. 25. PyCUDA Philosophy. Provide complete access. Automatically manage resources. Provide abstractions. Check for and report errors automatically. Full documentation. Integrate tightly with numpy. (slide by Andreas Klöckner, NYU)
  26. 26. PyCuda: Workflow. Edit → Run → SourceModule("...") → nvcc → .cubin → cache → upload to GPU → run on GPU. (slide by Andreas Klöckner, NYU)
  27. 27. Automatic Cleanup. Reachable objects (memory, streams, ...) are never destroyed. Once unreachable, they are released at an unspecified future time. Scarce resources (memory) can be explicitly freed (obj.free()). Correctly deals with multiple contexts and dependencies. (slide by Andreas Klöckner, NYU)
  28. 28. gpuarray: Simple Linear Algebra. pycuda.gpuarray: meant to look and feel just like numpy. gpuarray.to_gpu(numpy_array); numpy_array = gpuarray.get(). No: nd indexing, slicing, etc. (yet!). Yes: +, -, *, /, fill, sin, exp, rand, take, ... Random numbers using pycuda.curandom. Mixed types (int32 + float32 = float64). print gpuarray for debugging. Memory behind a gpuarray is available as the .gpudata attribute; use it as a kernel argument, texture, etc. (See the sketch below.) (slide by Andreas Klöckner, NYU)
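A small sketch exercising a few bullets from the list above; this is illustrative code written from the slide (assuming a PyCUDA of that era), not code from the deck:

      import numpy
      import pycuda.autoinit
      from pycuda import gpuarray
      from pycuda.curandom import rand as curand

      a = gpuarray.to_gpu(numpy.arange(16, dtype=numpy.int32))
      b = curand((16,))           # random numbers, already on the GPU (float32)

      c = a + b                   # mixed types: int32 + float32 = float64
      print c                     # print gpuarray for debugging
      print c.dtype

      print int(c.gpudata)        # raw device memory behind the array,
                                  # usable as a kernel argument or texture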
  29. 29. What's this "numpy", anyway? Numpy: package for large, multi-dimensional arrays. Vectors, matrices, ... A+B, sin(A), dot(A,B); la.solve(A, b), la.eig(A); cube[:, :, n-k:n+k], cube+5. All much faster than functional equivalents in Python. "Python's MATLAB": basis for SciPy, plotting, ... (slide by Andreas Klöckner, NYU)
  30. 30. gpuarray: Elementwise expressions. Avoiding extra store-fetch cycles for elementwise math:
      from pycuda.curandom import rand as curand
      a_gpu = curand((50,))
      b_gpu = curand((50,))

      from pycuda.elementwise import ElementwiseKernel
      lin_comb = ElementwiseKernel(
          "float a, float *x, float b, float *y, float *z",
          "z[i] = a*x[i] + b*y[i]")

      c_gpu = gpuarray.empty_like(a_gpu)
      lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

      assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
      (slide by Andreas Klöckner, NYU)
  31. 31. gpuarray: Reduction made easy. Example: a scalar product calculation
      from pycuda.reduction import ReductionKernel
      dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
              reduce_expr="a+b", map_expr="x[i]*y[i]",
              arguments="const float *x, const float *y")

      from pycuda.curandom import rand as curand
      x = curand((1000*1000), dtype=numpy.float32)
      y = curand((1000*1000), dtype=numpy.float32)

      x_dot_y = dot(x, y).get()
      x_dot_y_cpu = numpy.dot(x.get(), y.get())
      (slide by Andreas Klöckner, NYU)
  32. 32. Step 3: Usage. Complex numbers: in GPUArray, and in user code (pycuda-complex.hpp). If/then/else for GPUArrays. Support for custom device pointers. Smarter device picking/context creation. PyFFT: FFT for PyOpenCL and PyCUDA. scikits.cuda: CUFFT, CUBLAS, CULA. (slide by Andreas Klöckner, NYU)
  33. 33. Sparse Matrix-Vector on the GPU. New feature in 0.94: sparse matrix-vector multiplication. Uses the "packeted format" by Garland and Bell (also includes parts of their code). Integrates with scipy.sparse. Conjugate-gradients solver included, with deferred convergence checking. (slide by Andreas Klöckner, NYU)
  34. 34. Kernel Invocation: Automatic Copies
      mod = pycuda.driver.SourceModule(
          "__global__ void my_func(float *out, float *in){...}")
      func = mod.get_function("my_func")

      src = numpy.random.randn(400).astype(numpy.float32)
      dest = numpy.empty_like(src)

      func(cuda.Out(dest), cuda.In(src), block=(400,1,1))
      "InOut" exists, too. Only for immediate invocation style. (slide by Andreas Klöckner, NYU)
  34. 35. Step 4: Debugging. New in 0.94.1: support for CUDA gdb: $ cuda-gdb --args python -m pycuda.debug demo.py. Automatically sets compiler flags, retains source code, and disables the compiler cache. (slide by Andreas Klöckner, NYU)
  36. 36. CUDA APIs. CUDA has two programming interfaces: the high-level "Runtime" API (libcudart.so, in the "toolkit"), used from C/C++, and the low-level "Driver" API (libcuda.so, comes with the GPU driver); the two are mutually exclusive. PyCuda sits on the Driver API, which talks to the kernel driver and the hardware. (slide by Andreas Klöckner, NYU)
  37. 37. Runtime vs. Driver API. Runtime ↔ Driver differences: explicit initialization; code objects ("Modules") become programming-language objects; texture handling requires slightly more work; only needs nvcc for compiling GPU code. Driver API: conceptually cleaner; less sugar-coating (provide it in Python); not very different otherwise. (slide by Andreas Klöckner, NYU)
  38. 38. PyCuda: API Tracing. With ./configure --cuda-trace=1, each driver call is logged; running the doublify demo prints: cuInit, cuDeviceGetCount, cuDeviceGet, cuCtxCreate, cuMemAlloc, cuMemcpyHtoD, cuCtxGetDevice, cuDeviceComputeCapability, cuModuleLoadData, cuModuleGetFunction, cuFuncSetBlockShape, cuParamSetv, cuParamSetSize, cuLaunchGrid, cuMemcpyDtoH, cuCtxPopCurrent, cuCtxPushCurrent, cuMemFree, cuModuleUnload, cuCtxDestroy. (slide by Andreas Klöckner, NYU)
  39. 39. PyCUDA: Vital Information. http://mathema.tician.de/software/pycuda. Complete documentation. MIT License (no warranty, free for all use). Requires: numpy, Python 2.4+ (Win/OS X/Linux). Support via mailing list. (slide by Andreas Klöckner, NYU)
  40. 40. Sleepy?
  41. 41. Outline: 1. Scripting GPUs with PyCUDA; 2. Meta-programming and RTCG; 3. Case study in brain-inspired AI
  42. 42. ... too much? (bank conflicts, coalescing, mixed precision, caching, partition camping, clamping, broadcasting, zero-copy, streams, ...)
  43. 43. can't decide?
  45. 45. GPU Programming: Implementation Choices. Many difficult questions; insufficient heuristics; answers are hardware-specific and have no lasting value. Proposed solution: tune automatically for the hardware at run time, and cache the tuning results. Decrease reliance on knowledge of hardware internals; shift emphasis from tuning results to tuning ideas. (slide by Andreas Klöckner, NYU)
  46. 46. Metaprogramming. In GPU scripting, GPU code does not need to be a compile-time constant. (Key: code is data; it wants to be reasoned about at run time.) (slide by Andreas Klöckner, NYU)
  49. 49. Metaprogramming Idea: Human → Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result. Good for code generation: Run-Time Code Generation (RTCG) means writing code when the most knowledge is available; PyCUDA and PyOpenCL support it. (slide by Andreas Klöckner, NYU)
  55. 55. Machine-generated Code. Why machine-generate code? Automated tuning (cf. ATLAS, FFTW); data types; specializing code for a given problem; constants are faster than variables (→ register pressure); loop unrolling. (slide by Andreas Klöckner, NYU)
  56. 56. PyCuda: Support for Metaprogramming. Access properties of compiled code: func.{num_regs, shared_size_bytes, local_size_bytes}. Exact GPU timing via events. Can calculate hardware-dependent MP occupancy. codepy (by Andreas): build C syntax trees from Python; generates readable, indented C. Or use a templating engine (many available, e.g. Cheetah). (See the sketch below.) (slide by Andreas Klöckner, NYU)
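To make this concrete, here is a minimal hedged sketch (written for this transcript, not taken from the deck; the kernel and parameter names scale, BLOCK_SIZE, and SCALE are invented) that renders a Cheetah template, compiles it, reads the compiled function's properties, and times it with events:

      import numpy
      import pycuda.autoinit
      import pycuda.driver as cuda
      from pycuda.compiler import SourceModule
      from Cheetah.Template import Template

      # render GPU source at run time from a template
      src = str(Template("""
      __global__ void scale(float *a)
      {
          int idx = threadIdx.x + blockIdx.x * $BLOCK_SIZE;
          a[idx] *= $SCALE;
      }
      """, searchList=[{"BLOCK_SIZE": 128, "SCALE": 2.0}]))

      func = SourceModule(src).get_function("scale")

      # properties of the compiled code
      print func.num_regs, func.shared_size_bytes, func.local_size_bytes

      # exact GPU timing via events
      a_gpu = cuda.to_device(numpy.ones(1024, dtype=numpy.float32))
      start, end = cuda.Event(), cuda.Event()
      start.record()
      func(a_gpu, block=(128, 1, 1), grid=(8, 1))
      end.record()
      end.synchronize()
      print "%.3f ms" % start.time_till(end)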
  57. 57. Outline: 1. Scripting GPUs with PyCUDA; 2. Meta-programming and RTCG; 3. Case study in brain-inspired AI (vision)
  58. 58. Motivation
  59. 59. The Problem: Visual Object Recognition. Fast, accurate, tolerant to variations, effortless, critical to survival.
  60. 60. The Approach: Reverse and Forward Engineering the Brain
  61. 61. The Approach: Reverse and Forward Engineering the Brain. REVERSE: study the natural system. FORWARD: build an artificial system.
  62. 62. Why is modeling challenging? The brain is a massively parallel computer ➡ big models are paralyzingly slow to run. Neural data only provides weak constraints ➡ lots of parameters, hard to explore. Advice from Dave Cox: "Don't run anything that takes longer than a week to complete, because it will just crash halfway through anyways (or you'll discover a bug) and you'll never finish your Ph.D."
  64. 64. Visual Cortex (brain = 20 petaflops!)
  65. 65. GPUs (since 2006): 7800 GTX (2006: OpenGL/Cg, C++/Python) → Monster16GPU (2008: CUDA, Python) → Tesla Cluster (2009: CUDA/OpenCL, Python)
  66. 66. Build your own!
  67. 67. Cell Broadband Engine (since 2007): Teraflop Playstation3 clusters at DiCarlo Lab / MIT and Cox Lab / Harvard
  68. 68. A Match Made in Heaven: Brains are parallel, GPUs are parallel ≈ multiple scales of parallelism. "Embarrassingly" parallel: video frames, regions. Fine-grained: independent "neurons", operating on overlapping inputs.
  69. 69. A Match Made in Heaven: Images In, Images Out ≈ image processing is particularly well-suited. Excellent arithmetic intensity: very natural to load image patches into shared memory. Data: 2D / 3D locality.
  70. 70. Why is modeling challenging? The brain is a massively parallel computer ➡ big models are paralyzingly slow to run. Neural data only provides weak constraints ➡ lots of parameters, hard to explore.
  71. 71. Fukushima (1980)
  72. 72. LeCun et al. (1989)
  73. 73. Riesenhuber & Poggio (1999)
  74. 74. Serre & Poggio (2007)
  75. 75. [Architecture diagram] A read-out stage on top of a stack of layers L1, L2, L3; each layer exposes tunable parameters: thresh/sat, norm strength, normalization neighborhood, learning (rate, trace, "Temp. Adv.", "auto-reset"), number of filters, kernel size; plus the input kernel size.
  76. 76. [Architecture diagram, zoom] The same parameter set repeats at L1 and L2: thresh/sat, norm strength, normalization neighborhood, learning (rate, trace, "Temp. Adv.", "auto-reset"), number of filters, kernel size.
  77. 77. Two conflicting requirements. The brain is a massively parallel computer ➡ big models are paralyzingly slow to run (need FAST). Neural data only provides weak constraints ➡ lots of parameters, hard to explore (need FLEXIBLE). How to optimize?
  78. 78. What’s the bottleneck?
  79. 79. 3D Filter-bank Convolutions!
  80. 80. Fast vs Flexible: what can you do? - Make your code accessible - No focus on raw performanceExamples: MATLAB/CUDA by Jim Mutch (2010) by John Moore (1995)
  81. 81. Fast vs Flexible: what can you do? - Use standard libraries (e.g. CUBLAS, CUFFT, Jacket) - But: “remap” problem to fit? - Memory issues (not always optimal)
  82. 82. Fast vs Flexible: what can you do? - Fully optimized, by hand - But for only a few input configurations...
  83. 83. Fast vs Flexible: what can you do? - Focus on flexibility/accessibility first - But add strong foundations for raw performance from the beginningExample: Python/C/CUDA (OpenCL*)http://deeplearning.netby James Bergstra & Yoshua Bengio (2010)
  84. 84. Our answer?
  85. 85. Meta-programming and Auto-tuning
  86. 86. What?
  87. 87. Meta-programming ! Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW) • Dynamically compile specialized versions of the same kernel for different conditions • Empirical run-time tuning • For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.
  88. 88. Meta-programming! "Instrument" your solutions: block size, work size, loop unrolling, pre-fetching, spilling, etc.
  89. 89. Meta-programming ! Let the computer generate and find the optimal code: • brute-force search with a global objective • machine-learning approach with local objectives and hidden variables (advanced) • e.g. PyCuda makes this easy:
  90. 90. Basic GPU Meta-programming System. A case study: "GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision" [GPU Computing Gems], Pinto N, Cox DD.
  91. 91. conv_kernel_template.cu (a Cheetah template; the $VARs and the #for/#if/#set lines are Cheetah directives):
      texture<float4, 1, cudaReadModeElementType> tex_float4;
      __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

      #define IMUL(a, b) __mul24(a, b)
      extern "C" {

      #for j in xrange($FILTER_H)
      __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
      {
      #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
        __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

        // -- input/output offsets
        const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
        const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
        float4 input_v4;

        // -- load input to shared memory
      #for i in xrange($LOAD_ITERATIONS)
      #if $i==($LOAD_ITERATIONS-1)
        if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
      #end if
        {
          input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
          shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
          shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
          shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
          ...
  92. 92. conv_kernel_template.cu → conv_kernel_4x4x4.cu: rendering the template for a 4x4x4 filter bank expands the #for loops into straight-line CUDA C (constant[4][4][4], shared_in[131][4+1]); the shared-memory loads become explicit (shared_in[threadIdx.x+128*0][0] = input_v4.x; ...) and the dot products are fully unrolled (v = shared_in[threadIdx.x+0][0]; w = constant[0][0][0]; sum0 += v*w; w = constant[0][0][1]; sum1 += v*w; ...).
  93. 93. One template, many kernels: conv_kernel_template.cu renders to conv_kernel_4x4x4.cu (20 kB of generated source), conv_kernel_8x8x4.cu (64 kB), etc.
  94. 94. Benefits?
  95. 95. Smooth syntactic ugliness
  96. 96. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • variable-length argument lists
  97. 97. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • syntax-level code control (e.g. conditionals)
  98. 98. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • loop unrolling (possibly fine-controlled)
  99. 99. Smooth syntactic ugliness. Manipulations that are not easily accessible in CUDA C code: fine-controlled loop unrolling, e.g. the generated inner product:
      v = shared_in[threadIdx.x+0][0];
      w = constant[0][0][0]; sum0 += v*w;
      w = constant[0][0][1]; sum1 += v*w;
      w = constant[0][0][2]; sum2 += v*w;
      w = constant[0][0][3]; sum3 += v*w;
      v = shared_in[threadIdx.x+1][0];
      w = constant[0][1][0]; sum0 += v*w;
      w = constant[0][1][1]; sum1 += v*w;
      w = constant[0][1][2]; sum2 += v*w;
      w = constant[0][1][3]; sum3 += v*w;
      ...
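For flavor, a tiny sketch of how such straight-line code can be emitted from Python (plain string building here; the deck itself uses a Cheetah template, and the helper name emit_unrolled_dots is made up):

      def emit_unrolled_dots(filter_w, n_filters, d=0):
          # generate the fully unrolled multiply-accumulate block for row d
          lines = []
          for i in xrange(filter_w):
              lines.append("v = shared_in[threadIdx.x+%d][%d];" % (i, d))
              for f in xrange(n_filters):
                  lines.append("w = constant[%d][%d][%d]; sum%d += v*w;"
                               % (d, i, f, f))
          return "\n".join(lines)

      print emit_unrolled_dots(filter_w=4, n_filters=4)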
  100. 100. How about #pragma unroll ? (why don’t you trust the compiler?)
  101. 101. We are not alone... From "Using GPUs for Signal Correlation" (Daniel A. Mitchell et al., The Murchison Widefield Array): don't trust compilers. Comparing "identical" hand-written multiply-accumulate fragments (a += b*c; a += d*c; a += e*f; a += g*h;), the compiled code ranged from 770 GFLOPS down to 20 GFLOPS.
  102. 102. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • index un-indexable resources (e.g. regs)
  103. 103. Explore design decision space more freely
  104. 104. Basic GPU Meta-programming System. A case study: "GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision" [GPU Computing Gems], Pinto N, Cox DD.
  105. 105. Exploring design decision space more freely Meta-programming: • enables efficient learning of the GPU hardware/software • allows full exploitation of the GPU architecture
  106. 106. version A vs. version B: two renderings of conv_kernel_beta_template.cu, disassembled with decuda (by Wladimir J. van der Laan). Version A issues separate loads (mov.b32 $r1, c0[$ofs2+...]; mad.rn.f32 $r4, s[$ofs3+...], $r1, $r4); version B folds the constant directly into the mad (mad.rn.f32 $r1, s[$ofs1+...], c0[$ofs1+...], $r1). One is 2x faster. Why?
  107. 107. Exploring design decision space more freely
  108. 108. Exploring design decision space more freely. When USE_THREAD_PER_FILTER is True, each thread will access different cmem locations (in order). (using the decuda disassembler by Wladimir J. van der Laan, Python-based)
  109. 109. Exploring design decision space more freely. When USE_THREAD_PER_FILTER is False, each thread will access the same cmem locations (broadcast). (using the decuda disassembler by Wladimir J. van der Laan, Python-based)
  110. 110. Exploring design decision space more freely: more registers and thread-dependent data movement vs. broadcast. One is 2x faster. Why?
  111. 111. Strategy • intermediate design decisions can be made explicit • multiple "forks" in the path can be kept in place • frees up the developer to revisit past choices (without incurring a combinatoric explosion of separate pieces of code) • retesting sets of assumptions can be done frequently and programmatically from the "outer" framework of code (see the sketch below)
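A hedged sketch of keeping such a fork explicit, in the spirit of the USE_THREAD_PER_FILTER flag from the preceding slides (the template body below is invented for illustration):

      from Cheetah.Template import Template

      body = """
      #if $USE_THREAD_PER_FILTER
      // each thread reads its own filter coefficient (in order)
      w = constant[$d][$i][threadIdx.z];
      #else
      // every thread reads the same coefficient (broadcast)
      w = constant[$d][$i][$f];
      #end if
      """

      # both forks stay in the template; flipping one flag retests the assumption
      for flag in (True, False):
          print str(Template(body, searchList=[
              {"USE_THREAD_PER_FILTER": flag, "d": 0, "i": 0, "f": 0}]))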
  112. 112. Toy Example: MatMul. http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah
  113. 113. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter code • is easy and flexible with the right tools (e.g. Python, Py{CUDA,CL}, Cheetah, decuda) ➡ facilitates auto-tuning!
  114. 114. Need a pause?
  115. 115. How to get to the ninja level?
  116. 116. Practice, practice, practice...
  117. 117. Auto-tuning
  118. 118. Basic GPU Meta-programming System. A case study: "GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision" [GPU Computing Gems], Pinto N, Cox DD.
  119. 119. Auto-tuning. The goal is to empirically optimize execution time given: the environment: hardware (GPU, CPU, memory, mobo) and software (SDK, compiler suite); and the data (input dimensions, repetitions, etc.)
  120. 120. Basic auto-tuning: pseudo-code (1/3) Filter-bank Convolution / Correlation Scripting, Py{CUDA,CL} NoSQL (CouchDB, MongoDB) ?
  121. 121. Basic auto-tuning: pseudo-code (2/3) Cheetah, Jinja, Mako PyCUDA/CL
  122. 122. Basic auto-tuning: pseudo-code (3/3) PyCUDA/CL NoSQL (CouchDB, MongoDB)
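The pseudo-code on these three slides survives only as an image; as a minimal stand-in, here is a sketch of the outer auto-tuning loop with a plain JSON file in place of the CouchDB/MongoDB store named on the slides (the function names and the key scheme are invented; benchmark(p) would render, compile, and time one candidate as in the earlier sketches):

      import json, os

      DB = "autotune_results.json"

      def load_db():
          return json.load(open(DB)) if os.path.exists(DB) else {}

      def autotune(env_key, data_key, candidates, benchmark):
          db = load_db()
          key = "%s|%s" % (env_key, data_key)  # environment + data identify a result
          if key in db:                        # tuning results are cached
              return db[key]
          timings = [(benchmark(p), p) for p in candidates]
          best = min(timings)[1]
          db[key] = best
          json.dump(db, open(DB, "w"))
          return best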
  123. 123. Optimizing what?
  124. 124. Optimizing strategy. Like many operations, filter-bank convolution is usually "communication bound" on the GPU: compute is cheap, communication is expensive. We must take advantage of all types of memory, explicit: gmem (global), smem (shared), cmem (constant), tmem (texture); and implicit: rmem (registers), bmem (bin-code?). Each has different optimal access patterns.
  125. 125. Example: thread gmem output size (the "stupid float4 xyzw trick")
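A guess at what the "stupid float4 xyzw trick" amounts to in practice (not the chapter's actual kernel): have each thread produce four outputs in registers and emit them as a single vectorized 16-byte store to gmem.

      import pycuda.autoinit
      from pycuda.compiler import SourceModule

      mod = SourceModule("""
      __global__ void write4(float4 *out, int n4)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= n4) return;
          float4 v;               // compute four results in registers...
          v.x = 4*i + 0.0f;
          v.y = 4*i + 1.0f;
          v.z = 4*i + 2.0f;
          v.w = 4*i + 3.0f;
          out[i] = v;             // ...then one 16-byte store instead of four
      }
      """)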
  126. 126. Example: multiple smem loads
  127. 127. Example: using texture fetches
  128. 128. Example: register spilling
  129. 129. Example: register pressure (nvcc)
  130. 130. Example: capitalizing on bmem (bin code)?? Multiple versions of the same function with different input offsets; the input offset ends up in the cubin code?
  131. 131. Results
  132. 132. Results
      GPU / SDK           Input         Filter-bank   Meta-prog default (gflops)   Meta-prog auto-tuned (gflops)   Boost
      9600M GT / CUDA3.1  256x256x8     64x9x9x8        6.710 ±  0.005               36.584 ±  0.023              445.2 %
                          512x512x4     32x13x13x4     13.606 ±  0.002               35.582 ±  0.003              161.5 %
                          1024x1024x8   16x5x5x8       20.034 ±  0.113               26.084 ±  6.243               30.2 %
                          2048x2048x4   4x8x8x4        25.781 ±  0.044               46.945 ±  0.100               82.1 %
      C1060 / CUDA2.3     256x256x8     64x9x9x8      104.188 ±  0.051              168.083 ±  0.372               61.3 %
                          512x512x4     32x13x13x4    125.739 ±  0.109              234.053 ±  0.266               86.1 %
                          1024x1024x8   16x5x5x8      144.279 ±  0.764              243.697 ±  0.346               68.9 %
                          2048x2048x4   4x8x8x4       180.060 ±  0.018              322.328 ±  0.348               79.0 %
      GTX285 / CUDA2.3    256x256x8     64x9x9x8      123.396 ±  0.016              197.006 ±  0.219               59.7 %
                          512x512x4     32x13x13x4    143.277 ±  0.044              270.206 ±  0.209               88.6 %
                          1024x1024x8   16x5x5x8      148.841 ±  0.465              310.276 ±  0.538              108.5 %
                          2048x2048x4   4x8x8x4       205.152 ±  0.015              376.685 ±  0.070               83.6 %
      GTX480 / CUDA3.1    256x256x8     64x9x9x8      467.631 ± 19.100              471.902 ± 11.419                0.9 %
                          512x512x4     32x13x13x4    834.838 ±  8.275              974.266 ±  3.809               16.7 %
                          1024x1024x8   16x5x5x8      542.808 ±  1.135              614.019 ±  0.904               13.1 %
                          2048x2048x4   4x8x8x4       378.165 ±  0.537              806.628 ±  0.168              113.3 %
  133. 133. Analysis
  135. 135. Empirical results... Performance (gflops):
      Q9450 (Matlab/C)    [2008]    0.3
      Q9450 (C/SSE)       [2008]    9.0
      7900GTX (Cg)        [2006]   68.2
      PS3/Cell (C/ASM)    [2007]  111.4
      8800GTX (CUDA1.x)   [2007]  192.7
      GTX280 (CUDA2.x)    [2008]  339.3
      GTX480 (CUDA3.x)    [2010]  974.3
      Speedup is >1000X... game changing!
  136. 136. Summary
  137. 137. Summary • Meta-programming makes developing high-performing code for GPUs easier • Fantastic tools exist (e.g. PyCUDA) to help • An interesting way to explore/learn about GPUs (hw/sw) • Coarse auto-tuning yields good results
  138. 138. Future • More Fermi optimizations (L1 cache, concurrent kernels) • OpenCL to optimize across vendors • Smarter auto-tuning techniques (ML): (boosted) decision trees, evolutionary programming strategies
  139. 139. More? • Thu 3/31/11: PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU) • Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC) • Tue 4/5/11: Analysis-driven Optimization (C. Woolley, NVIDIA) • Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UC Davis) • Thu 4/14/11: Optimization for Ninjas (D. Merrill, UVirg) • ...
  140. 140. one more thing or two...
  141. 141. Life/Code Hacking #2.x: Speed {listen,read,writ}ing. accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  142. 142. Life/Code Hacking #2.2b: Speed writing. accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  143. 143. Life/Code Hacking #2.2b: Speed writing. RSI?
  145. 145. Life/Code Hacking #2.2b: Speed writing.
  146. 146. Life/Code Hacking #2.3: Speed reading. accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  147. 147. Life/Code Hacking #2.3: Speed reading. 1. Collect many papers, docs, chapters, etc. (100). 2. Skim through them quickly / select (50). 3. Read w/o full understanding / select (25). 4. Read completely w/ full understanding / select (10). 5. Complete mastery + reproduction (5).
  148. 148. Life/Code Hacking #2.3: Speed reading. http://readerssoft.com/speed_reading_obstacles.php
  149. 149. Life/Code Hacking #2.3: Speed reading. http://readerssoft.com/speed_reading_obstacles.php (normal reading vs. speed reading)
  150. 150. Life/Code Hacking #2.3: Speed reading. Like David Guetta, use one finger!
  151. 151. COME