[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning
Presentation Transcript

  • Massively Parallel Computing, CS 264 / CSCI E-292. Lecture #6: CUDA Ninja Tricks: "Scripting" GPUs, Meta-programming, Auto-tuning | March 1st, 2011. Nicolas Pinto (MIT, Harvard), pinto@mit.edu
  • News
  • During this course, we'll try to "adapt for CS264" and use existing material ;-)
  • Today, yey!!
  • Outline: 1. Scripting GPUs with PyCUDA; 2. Meta-programming and RTCG; 3. Case study in brain-inspired AI
  • Why do Scripting for GPUs? GPUs are everything that scripting languages are not: highly parallel, very architecture-sensitive, built for maximum compute/memory throughput → they complement each other. The CPU is largely restricted to control tasks (∼1000/sec), and scripting is fast enough for that. Realize a promise: use scripting from first prototype to full-scale production code. (slide by Andreas Klöckner, NYU)
  • Why do Scripting for GPUs? (recap) GPUs are built for maximum FP/memory throughput; scripting and GPUs complement each other. Python + CUDA = PyCUDA; Python + OpenCL = PyOpenCL. (slide by Andreas Klöckner, NYU)
  • How are High-Performance Codes constructed? "Traditional" construction: C/C++/Fortran plus libraries. "Alternative" construction: scripting for the 'brains', GPUs for the 'inner loops'. Play to the strengths of each programming environment. (slide by Andreas Klöckner, NYU)
  • Scripting: Python. One example of a scripting language: Python. Mature; large and active community; emphasizes readability; written in widely-portable C; a 'multi-paradigm' language; rich ecosystem of sci-comp related software. (slide by Andreas Klöckner, NYU)
  • Scripting Languages. Python: is discoverable and interactive; has comprehensive built-in functionality; manages resources automatically; uses run-time typing; works well for "gluing" lower-level blocks together. (slide by Andreas Klöckner, NYU)
  • Scripting: Goals. Scripting languages aim to reduce the load on the programmer: reduce required knowledge, encourage experimentation, eliminate sources of error, encourage abstraction wherever possible, value programmer time over computer time. Think about the tools you use. Use the right tool for the job. (slide by Andreas Klöckner, NYU)
  • Scripting: Speed. The usual answer to the "speed question": hybrid ("mixed") code. It plays to the strengths of each language, but introduces (some) complexity. Observation: GPU code is already hybrid. Consequence: no added complexity through hybrid code. (slide by Andreas Klöckner, NYU)
  • Whetting your appetite:

        import pycuda.driver as cuda
        import pycuda.autoinit, pycuda.compiler
        import numpy

        a = numpy.random.randn(4, 4).astype(numpy.float32)
        a_gpu = cuda.mem_alloc(a.nbytes)
        cuda.memcpy_htod(a_gpu, a)

    [This is examples/demo.py in the PyCUDA distribution.] (slide by Andreas Klöckner, NYU)
  • Whetting your appetite (continued):

        mod = pycuda.compiler.SourceModule("""
        __global__ void twice(float *a)   // compute kernel
        {
            int idx = threadIdx.x + threadIdx.y*4;
            a[idx] *= 2;
        }
        """)

        func = mod.get_function("twice")
        func(a_gpu, block=(4, 4, 1))

        a_doubled = numpy.empty_like(a)
        cuda.memcpy_dtoh(a_doubled, a_gpu)
        print a_doubled
        print a

    (slide by Andreas Klöckner, NYU)
  • Whetting your appetite, Part II. Did somebody say "Abstraction is good"? (slide by Andreas Klöckner, NYU)
  • Whetting your appetite, Part II:

        import numpy
        import pycuda.autoinit
        from pycuda import gpuarray

        a_cpu = numpy.random.randn(4, 4).astype(numpy.float32)
        b_cpu = numpy.random.randn(4, 4).astype(numpy.float32)
        c_cpu = a_cpu * b_cpu

        a_gpu = gpuarray.to_gpu(a_cpu)
        b_gpu = gpuarray.to_gpu(b_cpu)
        c_gpu = (a_gpu * b_gpu).get()

        print c_cpu - c_gpu

    (slide by Andreas Klöckner, NYU)
  • Remember me? The same program in plain CUDA C:

        // trivia
        #include <stdio.h>

        #define CUDA_CHK(NAME, ARGS) { \
          cudaError_t cuda_err_code = NAME ARGS; \
          if (cuda_err_code != cudaSuccess) { \
            printf("%s failed with code %d\n", #NAME, cuda_err_code); \
            abort(); \
          } \
        }
        // end

        // kernel
        __global__ void square_array(float *a, float *b, int n)
        {
            int i = (blockIdx.x * blockDim.y + threadIdx.y)
                    * blockDim.x + threadIdx.x;
            if (i < n)
                a[i] = a[i] * b[i];
        }
        // end

        // main1
        int main()
        {
            cudaSetDevice(0); // EDIT ME

            const int n = 4096;

            float *a_host = (float *) malloc(n*sizeof(float));
            float *b_host = (float *) malloc(n*sizeof(float));

            float *a_device, *b_device;
            CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
            CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
            // end

            // main2
            for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

            CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float),
                                  cudaMemcpyHostToDevice));
            CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float),
                                  cudaMemcpyHostToDevice));

            dim3 block_dim(16, 16);
            int block_size = block_dim.x*block_dim.y;
            int n_blocks = (n + block_size-1) / block_size;
            square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);
            // end

            // main3
            CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float),
                                  cudaMemcpyDeviceToHost));

            for (int i = 0; i < n; i++)
                printf("%.0f ", a_host[i]);
            puts("\n");

            free(a_host);
            CUDA_CHK(cudaFree, (a_device));
        }
        // end

    (slide by Andreas Klöckner, NYU)
  • PyCUDA Philosophy: provide complete access; automatically manage resources; provide abstractions; check for and report errors automatically; full documentation; integrate tightly with numpy. (slide by Andreas Klöckner, NYU)
  • PyCuda: Workflow. Edit → SourceModule("...") → nvcc → .cubin → upload to GPU → run on GPU, with the compiler cache short-circuiting the nvcc step on repeat runs. (slide by Andreas Klöckner, NYU)
  • Automatic Cleanup. Reachable objects (memory, streams, ...) are never destroyed. Once unreachable, they are released at an unspecified future time. Scarce resources (memory) can be explicitly freed (obj.free()). Correctly deals with multiple contexts and dependencies. (slide by Andreas Klöckner, NYU)
  • gpuarray: Simple Linear Algebra. pycuda.gpuarray is meant to look and feel just like numpy: gpuarray.to_gpu(numpy_array) and numpy_array = gpuarray.get(). No nd-indexing or slicing (yet!); yes to +, -, *, /, fill, sin, exp, rand, take, ... Random numbers via pycuda.curandom. Mixed types (int32 + float32 = float64). print gpuarray for debugging. The memory behind a gpuarray is available as its .gpudata attribute; use it as a kernel argument, texture, etc. (slide by Andreas Klöckner, NYU)
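A minimal sketch of the gpuarray workflow just listed (the calls follow the PyCUDA API named on the slide; the shapes and tolerance are arbitrary choices):

    import numpy
    import numpy.linalg as la
    import pycuda.autoinit                    # creates a context on the first GPU
    import pycuda.gpuarray as gpuarray
    from pycuda.curandom import rand as curand

    a_gpu = gpuarray.to_gpu(numpy.random.randn(4, 4).astype(numpy.float32))
    b_gpu = curand((4, 4))                    # random numbers generated on the GPU

    c_gpu = a_gpu * b_gpu + 2                 # elementwise math stays on the GPU
    print(c_gpu)                              # printable, handy for debugging

    c_cpu = c_gpu.get()                       # copy back into a numpy array
    assert la.norm(c_cpu - (a_gpu.get() * b_gpu.get() + 2)) < 1e-4

    print(c_gpu.gpudata)                      # raw device pointer: kernel arg, texture, ...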
  • What's this "numpy", anyway? Numpy: a package for large, multi-dimensional arrays (vectors, matrices, ...): A+B, sin(A), dot(A,B); la.solve(A, b), la.eig(A); cube[:, :, n-k:n+k], cube+5. All much faster than functional equivalents in pure Python. "Python's MATLAB": the basis for SciPy, plotting, ... (slide by Andreas Klöckner, NYU)
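A quick, self-contained illustration of the numpy operations named above (shapes and values are made up for the example):

    import numpy
    import numpy.linalg as la

    A = numpy.random.randn(4, 4)
    B = numpy.random.randn(4, 4)

    C = A + B                        # elementwise arithmetic
    S = numpy.sin(A)                 # elementwise functions
    P = numpy.dot(A, B)              # matrix product

    b = numpy.random.randn(4)
    x = la.solve(A, b)               # solve A x = b
    w, v = la.eig(A)                 # eigenvalues / eigenvectors

    cube = numpy.zeros((8, 8, 8))
    n, k = 4, 2
    slab = cube[:, :, n-k:n+k] + 5   # slicing views and broadcasting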
  • gpuarray: Elementwise expressions. Avoiding extra store-fetch cycles for elementwise math:

        from pycuda.curandom import rand as curand
        a_gpu = curand((50,))
        b_gpu = curand((50,))

        from pycuda.elementwise import ElementwiseKernel
        lin_comb = ElementwiseKernel(
            "float a, float *x, float b, float *y, float *z",
            "z[i] = a*x[i] + b*y[i]")

        c_gpu = gpuarray.empty_like(a_gpu)
        lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

        assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5

    (slide by Andreas Klöckner, NYU)
  • gpuarray: Reduction made easy. Example: a scalar product calculation:

        from pycuda.reduction import ReductionKernel
        dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
            reduce_expr="a+b", map_expr="x[i]*y[i]",
            arguments="const float *x, const float *y")

        from pycuda.curandom import rand as curand
        x = curand((1000*1000), dtype=numpy.float32)
        y = curand((1000*1000), dtype=numpy.float32)

        x_dot_y = dot(x, y).get()
        x_dot_y_cpu = numpy.dot(x.get(), y.get())

    (slide by Andreas Klöckner, NYU)
  • Step 3: Usage. Complex numbers: in GPUArray and in user code (pycuda-complex.hpp). If/then/else for GPUArrays. Support for custom device pointers. Smarter device picking/context creation. PyFFT: FFT for PyOpenCL and PyCUDA. scikits.cuda: CUFFT, CUBLAS, CULA. (slide by Andreas Klöckner, NYU)
  • Sparse Matrix-Vector on the GPU. New feature in 0.94: sparse matrix-vector multiplication. Uses the "packeted format" by Garland and Bell (also includes parts of their code). Integrates with scipy.sparse. Conjugate-gradients solver included, with deferred convergence checking. (slide by Andreas Klöckner, NYU)
  • Kernel Invocation: Automatic Copies.

        mod = pycuda.driver.SourceModule(
            "__global__ void my_func(float *out, float *in){...}")
        func = mod.get_function("my_func")

        src = numpy.random.randn(400).astype(numpy.float32)
        dest = numpy.empty_like(src)

        func(cuda.Out(dest), cuda.In(src), block=(400, 1, 1))

    "InOut" exists, too. Only for the immediate invocation style. (slide by Andreas Klöckner, NYU)
  • Step 4: Debugging. New in 0.94.1: support for CUDA gdb: $ cuda-gdb --args python -m pycuda.debug demo.py. This automatically sets compiler flags, retains source code, and disables the compiler cache. (slide by Andreas Klöckner, NYU)
  • CUDA APIs. CUDA has two (mutually exclusive) programming interfaces: the high-level "runtime" API (libcudart.so, ships with the "toolkit") and the low-level "driver" API (libcuda.so, comes with the GPU driver). C/C++ code typically sits on the runtime API; PyCuda sits on the driver API, directly above the kernel driver and the hardware. (slide by Andreas Klöckner, NYU)
  • Runtime vs. Driver API. Runtime ↔ Driver differences: explicit initialization; code objects ("modules") become programming-language objects; texture handling requires slightly more work; nvcc is needed only for compiling GPU code. The driver API is conceptually cleaner, with less sugar-coating (PyCuda provides that in Python), and not very different otherwise. (slide by Andreas Klöckner, NYU)
  • PyCuda: API Tracing. With ./configure --cuda-trace=1, running the demo

        import pycuda.driver as cuda
        import pycuda.autoinit
        import numpy

        a = numpy.random.randn(4,4).astype(numpy.float32)
        a_gpu = cuda.mem_alloc(a.nbytes)
        cuda.memcpy_htod(a_gpu, a)

        mod = cuda.SourceModule("""
        __global__ void doublify(float *a)
        {
            int idx = threadIdx.x + threadIdx.y*4;
            a[idx] *= 2;
        }
        """)

        func = mod.get_function("doublify")
        func(a_gpu, block=(4,4,1))

        a_doubled = numpy.empty_like(a)
        cuda.memcpy_dtoh(a_doubled, a_gpu)
        print a_doubled
        print a

    logs every driver call it makes: cuInit, cuDeviceGetCount, cuDeviceGet, cuCtxCreate, cuMemAlloc, cuMemcpyHtoD, cuCtxGetDevice, cuDeviceComputeCapability, cuModuleLoadData, cuModuleGetFunction, cuFuncSetBlockShape, cuParamSetv, cuParamSetSize, cuLaunchGrid, cuMemcpyDtoH, cuCtxPopCurrent, cuCtxPushCurrent, cuMemFree, cuModuleUnload, cuCtxDestroy. (slide by Andreas Klöckner, NYU)
  • PyCUDA: Vital Information. http://mathema.tician.de/software/pycuda. Complete documentation. MIT license (no warranty, free for all use). Requires numpy and Python 2.4+ (Win/OS X/Linux). Support via the mailing list. (slide by Andreas Klöckner, NYU)
  • Sleepy?
  • Outline: 1. Scripting GPUs with PyCUDA; 2. Meta-programming and RTCG; 3. Case study in brain-inspired AI
  • ... too much? bank conflicts, coalescing, mixed precision, caching, clamping, partition camping, broadcasting, zero-copy, streams, ...
  • ... can't decide?
  • GPU Programming: Implementation Choices. Many difficult questions; insufficient heuristics; answers are hardware-specific and have no lasting value. Proposed solution: tune automatically for the hardware at run time and cache the tuning results. Decrease reliance on knowledge of hardware internals; shift emphasis from tuning results to tuning ideas. (slide by Andreas Klöckner, NYU)
  • Metaprogramming Idea. In GPU scripting, GPU code does not need to be a compile-time constant. (Key: code is data; it wants to be reasoned about at run time.) The pipeline: a human, a machine, or both produce Python code, which generates GPU code; the GPU compiler turns that into a GPU binary, the GPU runs it, and the result feeds back into the Python code. PyCUDA and PyOpenCL are good for this kind of run-time code generation. (slides by Andreas Klöckner, NYU)
  • Machine-generated Code. Why machine-generate code? Automated tuning (cf. ATLAS, FFTW); data types; specializing code for a given problem; constants are faster than variables (→ register pressure); loop unrolling. (slide by Andreas Klöckner, NYU)
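As a hedged sketch of the "constants faster than variables" and unrolling points: the loop bound and scale factor below are baked into the kernel source before compilation (the kernel name and values are invented for illustration):

    import numpy
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    source = """
    __global__ void scale(float *a)
    {
        int idx = threadIdx.x;
        #pragma unroll
        for (int i = 0; i < %(N)d; i++)   // %(N)d becomes a compile-time constant
            a[idx] *= %(FACTOR)ff;
    }
    """

    mod = SourceModule(source % {"N": 4, "FACTOR": 2.0})
    scale = mod.get_function("scale")

    a = numpy.ones(128, dtype=numpy.float32)
    scale(cuda.InOut(a), block=(128, 1, 1))
    print(a[:4])   # each element multiplied by 2.0 four times -> 16.0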
  • PyCuda: Support for Metaprogramming. Access properties of compiled code: func.{num_regs, shared_size_bytes, local_size_bytes}. Exact GPU timing via events. Can calculate hardware-dependent MP occupancy. codepy (by Andreas): build C syntax trees from Python; generates readable, indented C. Or use a templating engine (many available, e.g. Cheetah). (slide by Andreas Klöckner, NYU)
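A minimal sketch of the introspection and timing hooks just listed (the kernel is invented; the attributes and event calls are the PyCUDA ones named on the slide):

    import numpy
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void bump(float *a) { a[threadIdx.x] += 1.0f; }
    """)
    func = mod.get_function("bump")

    # properties of the compiled code, usable to steer tuning decisions
    print(func.num_regs, func.shared_size_bytes, func.local_size_bytes)

    # exact GPU timing via events
    a = numpy.zeros(256, dtype=numpy.float32)
    start, end = cuda.Event(), cuda.Event()
    start.record()
    func(cuda.InOut(a), block=(256, 1, 1))
    end.record()
    end.synchronize()
    print("%.3f ms" % start.time_till(end))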
  • Outline: 1. Scripting GPUs with PyCUDA; 2. Meta-programming and RTCG; 3. Case study in brain-inspired AI (vision)
  • Motivation
  • The Problem: Visual Object Recognition. Fast, accurate, tolerant to variations, effortless, critical to survival.
  • The Approach: Reverse and Forward Engineering the Brain. REVERSE: study the natural system. FORWARD: build an artificial system.
  • Why is modeling challenging? The brain is a massively parallel computer ➡ big models are paralyzingly slow to run. Neural data only provides weak constraints ➡ lots of parameters, hard to explore. Advice from Dave Cox: "Don't run anything that takes longer than a week to complete, because it will just crash halfway through anyways (or you'll discover a bug) and you'll never finish your Ph.D."
  • Visual Cortex: brain = 20 petaflops!
  • GPUs (since 2006): 7800 GTX (2006; OpenGL/Cg; C++/Python) → Monster GPU (2008; CUDA; Python) → 16-GPU Tesla Cluster (2009; CUDA/OpenCL; Python)
  • Build your own!
  • Cell Broadband Engine (since 2007). Teraflop Playstation3 clusters: DiCarlo Lab / MIT, Cox Lab / Harvard.
  • A Match Made in Heaven: Brains are parallel, GPUs are parallel. Multiple scales of parallelism: "embarrassingly" parallel (video frames, regions) and fine-grained (independent "neurons" operating on overlapping inputs).
  • A Match Made in Heaven: Images In, Images Out. Image processing is particularly well-suited: excellent arithmetic intensity, very natural to load image patches into shared memory, 2D/3D data locality.
  • Why is modeling challenging? (recap) The brain is a massively parallel computer ➡ big models are paralyzingly slow to run. Neural data only provides weak constraints ➡ lots of parameters, hard to explore.
  • Fukushima (1980)
  • LeCun et al. (1989)
  • Riesenhuber & Poggio (1999)
  • Serre & Poggio (2007)
  • [Model architecture figure] A three-layer hierarchy (L1, L2, L3) feeding a read-out, with each layer exposing many parameters: threshold/saturation, normalization strength, normalization neighborhood, learning rate, trace, "temp. adv.", "auto-reset", kernel size, number of filters; plus the input kernel size.
  • Two conflicting requirements: the brain is a massively parallel computer ➡ big models are paralyzingly slow to run, so we need FAST; neural data only provides weak constraints ➡ lots of parameters, hard to explore, so we need FLEXIBLE. How to optimize?
  • What’s the bottleneck?
  • 3D Filter-bank Convolutions!
  • Fast vs Flexible: what can you do? Make your code accessible, with no focus on raw performance. Examples: MATLAB/CUDA by Jim Mutch (2010); John Moore (1995).
  • Fast vs Flexible: what can you do? Use standard libraries (e.g. CUBLAS, CUFFT, Jacket). But: "remap" the problem to fit? Memory issues (not always optimal).
  • Fast vs Flexible: what can you do? Fully optimized, by hand. But only for a few input configurations...
  • Fast vs Flexible: what can you do? Focus on flexibility/accessibility first, but add strong foundations for raw performance from the beginning. Example: Python/C/CUDA (OpenCL*), http://deeplearning.net, by James Bergstra & Yoshua Bengio (2010).
  • Our answer?
  • Meta-programming and Auto-tuning
  • What?
  • Meta-programming! Leave the grunt-programming to the computer (i.e. auto-tuning, like ATLAS or FFTW): dynamically compile specialized versions of the same kernel for different conditions; empirical run-time tuning; for free, smooth over syntactic ugliness: unroll loops, index un-indexable registers, etc. A minimal sketch follows.
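A hedged sketch of "dynamically compile specialized versions of the same kernel for different conditions": an invented saxpy-style kernel is rendered once per unroll factor, and empirical run-time timing picks the winner:

    import numpy
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    template = """
    __global__ void saxpy(float a, float *x, float *y, int n)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * %(UNROLL)d;
        #pragma unroll
        for (int u = 0; u < %(UNROLL)d; u++)
            if (i + u < n) y[i + u] += a * x[i + u];
    }
    """

    n = 1 << 20
    x = numpy.random.randn(n).astype(numpy.float32)
    y = numpy.zeros_like(x)

    timings = {}
    for unroll in (1, 2, 4, 8):
        # one specialized compilation per condition
        func = SourceModule(template % {"UNROLL": unroll}).get_function("saxpy")
        threads = 256
        blocks = (n + threads * unroll - 1) // (threads * unroll)
        start, end = cuda.Event(), cuda.Event()
        start.record()
        func(numpy.float32(2.0), cuda.In(x), cuda.InOut(y), numpy.int32(n),
             block=(threads, 1, 1), grid=(blocks, 1))
        end.record()
        end.synchronize()
        timings[unroll] = start.time_till(end)

    print("fastest unroll factor:", min(timings, key=timings.get))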
  • Meta-programming! "Instrument" your solutions: block size, work size, loop unrolling, pre-fetching, spilling, etc.
  • Meta-programming! Let the computer generate and find the optimal code: brute-force search with a global objective, or a machine-learning approach with local objectives and hidden variables (advanced). E.g. PyCuda makes this easy, as in the sketch below:
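The code slide that followed here is an image in the original deck; in its spirit, a minimal sketch of a brute-force search with a global objective (total kernel time) over an invented grid of block sizes. The array stays resident on the GPU so the search measures kernel time only:

    import numpy
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void twice(float *a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    }
    """)
    twice = mod.get_function("twice")

    n = 1 << 20
    a = numpy.random.randn(n).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    twice(a_gpu, numpy.int32(n), block=(256, 1, 1),
          grid=((n + 255) // 256, 1))           # warm-up launch

    best = (float("inf"), None)
    for threads in (32, 64, 128, 256, 512):
        grid = ((n + threads - 1) // threads, 1)
        start, end = cuda.Event(), cuda.Event()
        start.record()
        twice(a_gpu, numpy.int32(n), block=(threads, 1, 1), grid=grid)
        end.record()
        end.synchronize()
        best = min(best, (start.time_till(end), threads))

    print("best block size:", best[1], "(%.3f ms)" % best[0])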
  • Basic GPU Meta-programming System: a case study. "GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision" [GPU Computing Gems], Pinto N, Cox DD.
  • conv_kernel_template.cu, a Cheetah template:

        texture<float4, 1, cudaReadModeElementType> tex_float4;
        __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

        #define IMUL(a, b) __mul24(a, b)
        extern "C" {

        #for j in xrange($FILTER_H)
        __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
        {
        #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
            __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

            // -- input/output offsets
            const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
            const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
            float4 input_v4;

            // -- load input to shared memory
        #for i in xrange($LOAD_ITERATIONS)
        #if $i==($LOAD_ITERATIONS-1)
            if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
        #end if
            {
                input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
                shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
                shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
                shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
                shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
            }
        #end for
        ...
  • conv_kernel_template.cu vs. the generated conv_kernel_4x4x4.cu: the #for loops over $FILTER_H and $LOAD_ITERATIONS expand into straight-line code with every size baked in ($FILTER_D, $FILTER_W, $N_FILTERS → 4, $BLOCK_W → 128, $INPUT_BLOCK_W → 131), and the dot-product loop unrolls into long runs of the form:

        v = shared_in[threadIdx.x+0][0];
        w = constant[0][0][0]; sum0 += v*w;
        w = constant[0][0][1]; sum1 += v*w;
        w = constant[0][0][2]; sum2 += v*w;
        w = constant[0][0][3]; sum3 += v*w;
        ...
  • From the one template: the generated conv_kernel_4x4x4.cu weighs 20 kB of source, conv_kernel_8x8x4.cu weighs 64 kB.
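A sketch of how such a template becomes a concrete kernel, assuming the Cheetah package and the file names from the slides (the parameter values are one illustrative choice):

    from Cheetah.Template import Template

    # render conv_kernel_template.cu with one concrete parameter set;
    # each parameter set yields a different specialized kernel source
    params = {
        "FILTER_H": 4, "FILTER_W": 4, "FILTER_D": 4, "N_FILTERS": 4,
        "BLOCK_W": 128, "LOAD_ITERATIONS": 2,
    }
    kernel_src = str(Template(file="conv_kernel_template.cu",
                              searchList=[params]))

    open("conv_kernel_4x4x4.cu", "w").write(kernel_src)

In the real pipeline the rendered string would go straight to pycuda.compiler.SourceModule rather than to disk.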
  • Benefits?
  • Smooth syntactic ugliness
  • Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • variable-length argument lists
  • Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • syntax-level code control (e.g. conditionals)
  • Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • loop unrolling (possibly fine-controlled)
  • Smooth syntactic ugliness. Manipulations that are not easily accessible in CUDA C code: fine-controlled loop unrolling, e.g.:

        v = shared_in[threadIdx.x+0][0];
        w = constant[0][0][0]; sum0 += v*w;
        w = constant[0][0][1]; sum1 += v*w;
        w = constant[0][0][2]; sum2 += v*w;
        w = constant[0][0][3]; sum3 += v*w;

        v = shared_in[threadIdx.x+1][0];
        w = constant[0][1][0]; sum0 += v*w;
        w = constant[0][1][1]; sum1 += v*w;
        ...
  • How about #pragma unroll? (Why don't you trust the compiler?)
  • We are not alone... "Don't trust compilers": a slide by Daniel A. Mitchell ("Using GPUs for Signal Correlations", Murchison Widefield Array real-time system, with Lincoln Greenhill and Paul La Plante) compares these "identical" code fragments: the fused form a += b*c + d*c + e*f + g*h; sustains 770 GFLOPS, while the split form a += b*c; a += d*c; a += e*f; a += g*h; sustains 20 GFLOPS. [The slide also shows MWA images of the J2107-2526 field before and after RFI blanking and peeling, plus references on MWA real-time calibration and imaging.]
  • Smooth syntactic ugliness. Manipulations that are not easily accessible in CUDA C code: index un-indexable resources (e.g. registers).
  • Explore design decision space more freely
  • Basic GPU Meta-programming System: a case study. "GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision" [GPU Computing Gems], Pinto N, Cox DD.
  • Exploring design decision space more freely Meta-programming: • enables efficient learning of the GPU hardware/software • allows full exploitation of the GPU architecture
  • Version A vs. version B: the same conv_kernel_beta_template.cu compiled under two parameter sets, disassembled with decuda (by Wladimir J. van der Laan). Version A moves constants into registers before each mad (mov.b32 $r1, c0[$ofs2+0x0008]; mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4; ...), while version B folds the constant-memory operands straight into the mads (mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1; ...). One is 2x faster... why?
  • Exploring design decision space more freely
  • Exploring design decision space more freely. When USE_THREAD_PER_FILTER is True, each thread accesses different cmem locations (in order). (Disassembled using decuda by Wladimir J. van der Laan, Python-based.)
  • Exploring design decision space more freely. When USE_THREAD_PER_FILTER is False, each thread accesses the same cmem locations (broadcast). (Disassembled using decuda by Wladimir J. van der Laan, Python-based.) A sketch of comparing the two variants follows.
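The two variants above are shown as disassembly images in the deck; as a hedged stand-in, this sketch compiles an invented broadcast variant and an invented thread-per-filter variant of a cmem read and compares their register counts:

    import pycuda.autoinit
    from pycuda.compiler import SourceModule

    N_FILTERS = 4
    template = """
    __constant__ float constant[%(N_FILTERS)d];

    __global__ void apply(float *out, float v)
    {
        %(LOAD)s
        out[blockIdx.x * blockDim.x + threadIdx.x] = v * w;
    }
    """

    variants = {
        # every thread reads the same cmem word -> hardware broadcast
        False: "float w = constant[0];",
        # thread-dependent cmem addressing -> one filter per thread
        True:  "float w = constant[threadIdx.x %% %d];" % N_FILTERS,
    }

    for use_thread_per_filter, load in variants.items():
        func = SourceModule(template % {"N_FILTERS": N_FILTERS,
                                        "LOAD": load}).get_function("apply")
        print("USE_THREAD_PER_FILTER=%s -> %d registers"
              % (use_thread_per_filter, func.num_regs))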
  • Exploring design decision space more freely: more registers vs. thread-dependent data movement. One is 2x faster... why?
  • Strategy: intermediate design decisions can be made explicit; multiple "forks" in the path can be kept in place; the developer is freed up to revisit past choices (without incurring a combinatoric explosion of separate pieces of code); retesting sets of assumptions can be done frequently and programmatically from the "outer" framework of code.
  • Toy Example: Matmul. http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah
  • Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter code • is easy and flexible with the right tools (e.g. Python, Py{CUDA,CL}, Cheetah, decuda) ➡ facilitates auto-tuning!
  • Need a pause?
  • How to get to the ninja level?
  • Practice, practice, practice...
  • Auto-tuning
  • Basic GPU Meta-programming System: a case study. "GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision" [GPU Computing Gems], Pinto N, Cox DD.
  • Auto-tuning. The goal is to empirically optimize execution time given: the environment (hardware: GPU, CPU, memory, mobo; software: SDK, compiler suite) and the data (input dimensions, repetitions, etc.).
  • Basic auto-tuning: pseudo-code (1/3). The target: filter-bank convolution / correlation. The tools: scripting with Py{CUDA,CL}; results stored in NoSQL (CouchDB, MongoDB)?
  • Basic auto-tuning: pseudo-code (2/3). Generate kernel variants with a templating engine (Cheetah, Jinja, Mako) and compile them with PyCUDA/CL.
  • Basic auto-tuning: pseudo-code (3/3). Benchmark each candidate with PyCUDA/CL and persist the results in NoSQL (CouchDB, MongoDB), as in the sketch below.
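The pseudo-code on these three slides is shown as images; as one hedged example of the "store results" step, assuming the python-couchdb package, a local CouchDB server, and invented database and field names:

    import couchdb
    import pycuda.autoinit

    server = couchdb.Server("http://localhost:5984/")
    db = (server.create("convolution_tuning")
          if "convolution_tuning" not in server
          else server["convolution_tuning"])

    # one document per (hardware, data, parameters) benchmark
    db.save({
        "gpu": pycuda.autoinit.device.name(),
        "input": [256, 256, 8],
        "filters": [64, 9, 9, 8],
        "params": {"block_w": 128, "unroll": 4},
        "gflops": 36.584,
    })

Keying the store on both the environment and the data is what lets cached tuning results be reused safely on the same machine, as the previous slide's goal statement requires.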
  • Optimizing what?
  • Optimizing strategy. Like many operations, filter-bank convolution is usually "communication bound" on the GPU: compute is cheap, communication is expensive. We must take advantage of all types of memory, explicit (gmem global, smem shared, cmem constant, tmem texture) and implicit (rmem registers, bmem bin-code?*), each with different optimal access patterns.
  • Example: one thread per gmem output, and the "stupid float4 xyzw trick".
  • Example: multiple smem loads
  • Example: using texture fetches
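The code for this example is an image in the original deck; here is a minimal texture-fetch sketch patterned on PyCUDA's own 2D-texture example (kernel and sizes are illustrative):

    import numpy
    import numpy.linalg as la
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    texture<float, 2, cudaReadModeElementType> mtx_tex;

    __global__ void copy_texture(float *dest)
    {
        int row = threadIdx.x;
        int col = threadIdx.y;
        int w = blockDim.y;
        // reads go through the cached, 2D-local texture unit
        dest[row*w + col] = tex2D(mtx_tex, row, col);
    }
    """)

    mtx_tex = mod.get_texref("mtx_tex")
    a = numpy.random.randn(16, 16).astype(numpy.float32)
    cuda.matrix_to_texref(a, mtx_tex, order="F")

    dest = numpy.zeros_like(a)
    copy_texture = mod.get_function("copy_texture")
    copy_texture(cuda.Out(dest), block=(16, 16, 1), texrefs=[mtx_tex])
    assert la.norm(dest - a) < 1e-6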
  • Example: register spilling
  • Example: register pressure (nvcc)
  • Example: capitalizing on bmem (bin code)?? Generate multiple versions of the same function with different input offsets, so the input offset lives in the cubin code itself. A sketch follows.
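As a hedged sketch of that idea: emit several copies of one function with the input offset baked in as an immediate, so the offset sits in the instruction stream rather than in a register (all names and sizes are invented):

    import numpy
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    n_versions = 4
    src = "".join("""
    __global__ void shift_%(OFS)d(float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i + %(OFS)d];   // offset is an immediate in the binary
    }
    """ % {"OFS": ofs} for ofs in range(n_versions))

    mod = SourceModule(src)
    funcs = [mod.get_function("shift_%d" % ofs) for ofs in range(n_versions)]

    a = numpy.arange(128 + n_versions, dtype=numpy.float32)
    out = numpy.empty(128, dtype=numpy.float32)
    funcs[3](cuda.In(a), cuda.Out(out), block=(128, 1, 1))
    assert out[0] == 3   # read started at the baked-in offset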
  • Results
  • Results:

        GPU / SDK          Input         Filter-bank  Meta-prog default   Meta-prog auto-tuned   Boost
                                                      (gflops)            (gflops)
        9600M GT, CUDA3.1  256x256x8     64x9x9x8       6.710 ±  0.005     36.584 ±  0.023      445.2 %
                           512x512x4     32x13x13x4    13.606 ±  0.002     35.582 ±  0.003      161.5 %
                           1024x1024x8   16x5x5x8      20.034 ±  0.113     26.084 ±  6.243       30.2 %
                           2048x2048x4   4x8x8x4       25.781 ±  0.044     46.945 ±  0.100       82.1 %
        C1060, CUDA2.3     256x256x8     64x9x9x8     104.188 ±  0.051    168.083 ±  0.372       61.3 %
                           512x512x4     32x13x13x4   125.739 ±  0.109    234.053 ±  0.266       86.1 %
                           1024x1024x8   16x5x5x8     144.279 ±  0.764    243.697 ±  0.346       68.9 %
                           2048x2048x4   4x8x8x4      180.060 ±  0.018    322.328 ±  0.348       79.0 %
        GTX285, CUDA2.3    256x256x8     64x9x9x8     123.396 ±  0.016    197.006 ±  0.219       59.7 %
                           512x512x4     32x13x13x4   143.277 ±  0.044    270.206 ±  0.209       88.6 %
                           1024x1024x8   16x5x5x8     148.841 ±  0.465    310.276 ±  0.538      108.5 %
                           2048x2048x4   4x8x8x4      205.152 ±  0.015    376.685 ±  0.070       83.6 %
        GTX480, CUDA3.1    256x256x8     64x9x9x8     467.631 ± 19.100    471.902 ± 11.419        0.9 %
                           512x512x4     32x13x13x4   834.838 ±  8.275    974.266 ±  3.809       16.7 %
                           1024x1024x8   16x5x5x8     542.808 ±  1.135    614.019 ±  0.904       13.1 %
                           2048x2048x4   4x8x8x4      378.165 ±  0.537    806.628 ±  0.168      113.3 %
  • Analysis
  • Empirical results... Performance (gflops): Q9450 (Matlab/C) [2008]: 0.3; Q9450 (C/SSE) [2008]: 9.0; 7900GTX (Cg) [2006]: 68.2; PS3/Cell (C/ASM) [2007]: 111.4; 8800GTX (CUDA1.x) [2007]: 192.7; GTX280 (CUDA2.x) [2008]: 339.3; GTX480 (CUDA3.x) [2010]: 974.3. The game is changing... a >1000X speedup!
  • Summary
  • Summary: meta-programming makes developing high-performing code for GPUs easier; fantastic tools exist (e.g. PyCUDA) to help; it is an interesting way to explore and learn about GPUs (hw/sw); even coarse auto-tuning yields good results.
  • Future: more Fermi optimizations (L1 cache, concurrent kernels); OpenCL to optimize across vendors; smarter auto-tuning techniques (ML): (boosted) decision trees, evolutionary programming strategies.
  • More? Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC); Thu 3/31/11: PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU); Tue 4/5/11: Analysis-driven Optimization (C. Woolley, NVIDIA); Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UC Davis); Thu 4/14/11: Optimization for Ninjas (D. Merrill, UVa); ...
  • one more thing or two...
  • Life/Code Hacking #2.x: Speed {listen,read,writ}ing. accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • Life/Code Hacking #2.2b: Speed writing. RSI? accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • Life/Code Hacking #2.3: Speed reading. accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • Life/Code Hacking #2.3: Speed reading. 1. Collect many papers, docs, chapters, etc. (100); 2. Skim through them quickly / select (50); 3. Read w/o full understanding / select (25); 4. Read completely w/ full understanding / select (10); 5. Complete mastery + reproduction (5). accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • Life/Code Hacking #2.3: Speed reading. http://readerssoft.com/speed_reading_obstacles.php (normal reading vs. speed reading). accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • Life/Code Hacking #2.3: Speed reading. Like David Guetta, use one finger! accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • COME