[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)


http://cs264.org

Abstract:

High-level scripting languages are in many ways polar opposites to
GPUs. GPUs are highly parallel, subject to hardware subtleties, and
designed for maximum throughput, and they offer a tremendous advance
in the performance achievable for a significant number of
computational problems. On the other hand, scripting languages such as
Python favor ease of use over computational speed and do not generally
emphasize parallelism. PyOpenCL and PyCUDA are two packages that
attempt to join the two. Using concrete examples, from toy programs
to whole applications, this talk aims to demonstrate that combining
these opposites yields a programming environment that is greater than
the sum of its parts.

Speaker biography:

Andreas Klöckner obtained his PhD degree working with Jan Hesthaven at
the Department of Applied Mathematics at Brown University. He worked
on a variety of topics, all aimed at broadening the utility of
discontinuous Galerkin (DG) methods. This included their use in the
simulation of plasma physics and the demonstration of their particular
suitability for computation on throughput-oriented graphics processors
(GPUs). He also worked on multi-rate time stepping methods and shock
capturing schemes for DG.

In the fall of 2010, he joined the Courant Institute of Mathematical
Sciences at New York University as a Courant Instructor. There, he is
working on problems in computational electromagnetics with Leslie
Greengard.

His research interests include:

- Discontinuous Galerkin and integral equation methods for wave
propagation

- Programming tools for parallel architectures

- High-order unstructured particle-in-cell methods for plasma simulation


Transcript

  • 1. Intro PyOpenCL RTCG Perspectives Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA. Andreas Klöckner, Courant Institute of Mathematical Sciences, New York University. March 31, 2011.
  • 2. Intro PyOpenCL RTCG PerspectivesThanks Jan Hesthaven (Brown) Tim Warburton (Rice) Leslie Greengard (NYU) PyOpenCL, PyCUDA contributors Nvidia Corp., AMD Corp. Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 3. Intro PyOpenCL RTCG PerspectivesOutline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 4. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOutline 1 Introduction A Common Theme Intro to OpenCL 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 5. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOutline 1 Introduction A Common Theme Intro to OpenCL 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 6. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLHow are High-Performance Codes constructed? “Traditional” Construction of High-Performance Codes: C/C++/Fortran Libraries “Alternative” Construction of High-Performance Codes: Scripting for ‘brains’ GPUs for ‘inner loops’ Play to the strengths of each programming environment. Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 7. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOutline 1 Introduction A Common Theme Intro to OpenCL 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 8. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLWhat is OpenCL? OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors. [OpenCL 1.1 spec] Device-neutral (Nv GPU, AMD GPU, Intel/AMD CPU) Vendor-neutral Comes with RTCG Defines: Host-side programming interface (library) Device-side programming language (!) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
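The "host-side programming interface" above is an ordinary library from Python's point of view; a minimal sketch (not from the slides) that enumerates the available platforms and devices with PyOpenCL:

import pyopencl as cl

for platform in cl.get_platforms():
    print("%s (%s)" % (platform.name, platform.vendor))
    for dev in platform.get_devices():
        print("  %s [%s]" % (dev.name, cl.device_type.to_string(dev.type)))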
  • 9. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLWhat is OpenCL? OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors. [OpenCL 1.1 spec] Device-neutral (Nv GPU, AMD GPU, Big deal? Intel/AMD CPU) Vendor-neutral Comes with RTCG Defines: Host-side programming interface (library) Device-side programming language (!) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 10. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLWhat is OpenCL? OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors. [OpenCL 1.1 spec] Big deal! Device-neutral (Nv GPU, AMD GPU, Big deal? Intel/AMD CPU) Vendor-neutral Comes with RTCG Defines: Host-side programming interface (library) Device-side programming language (!) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 11. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLWho?OpenCL Working Group• Diverse industry participation - Processor vendors, system OEMs, middleware vendors, application developers• Many industry-leading experts involved in OpenCL’s design - A healthy diversity of industry perspectives• Apple made initial proposal and is very active in the working group - Serving as specification editor © Copyright Khronos Group, 2010 - Page 4 Credit: Khronos Group Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 12. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLWhen? OpenCL Timeline • Six months from proposal to released OpenCL 1.0 specification - Due to a strong initial proposal and a shared commercial incentive • Multiple conformant implementations shipping - Apple’s Mac OS X Snow Leopard now ships with OpenCL • 18 month cadence between OpenCL 1.0 and OpenCL 1.1 - Backwards compatibility protect software investment Khronos publicly Multiple conformant releases OpenCL 1.0 as implementations ship royalty-free across diverse OS specification and platforms Jun08 May09 Jun10 Dec08 2H09 Apple proposes OpenCL Khronos releases OpenCL OpenCL 1.1 working group and 1.0 conformance tests to Specification released and contributes draft specification ensure high-quality first implementations ship to Khronos implementations © Copyright Khronos Group, 2010 - Page 5 Credit: Khronos Group Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 13. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLWhy? Processor Parallelism CPUs GPUs Multiple cores driving Emerging Increasingly general performance increases purpose data-parallel Intersection computing Multi- Heterogeneous Graphics processor Computing APIs and programming Shading – e.g. OpenMP Languages OpenCL is a programming framework for heterogeneous compute resources © Copyright Khronos Group, 2010 - Page 3 Credit: Khronos Group Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 14. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLCL vs CUDA side-by-side CUDA source code: OpenCL source code: global void transpose( void transpose( float ∗A t, float ∗A, global float ∗a t, global float ∗a, int a width, int a height ) unsigned a width, unsigned a height ) { { int base idx a = int base idx a = blockIdx .x ∗ BLK SIZE + get group id (0) ∗ BLK SIZE + blockIdx .y ∗ A BLOCK STRIDE; get group id (1) ∗ A BLOCK STRIDE; int base idx a t = int base idx a t = blockIdx .y ∗ BLK SIZE + get group id (1) ∗ BLK SIZE + blockIdx .x ∗ A T BLOCK STRIDE; get group id (0) ∗ A T BLOCK STRIDE; int glob idx a = int glob idx a = base idx a + threadIdx.x base idx a + get local id (0) + a width ∗ threadIdx.y; + a width ∗ get local id (1); int glob idx a t = int glob idx a t = base idx a t + threadIdx.x base idx a t + get local id (0) + a height ∗ threadIdx .y; + a height ∗ get local id (1); shared float A shared[BLK SIZE][BLK SIZE+1]; local float a local [BLK SIZE][BLK SIZE+1]; A shared[ threadIdx .y ][ threadIdx .x] = a local [ get local id (1)∗BLK SIZE+get local id(0)] = A[ glob idx a ]; a[ glob idx a ]; syncthreads (); barrier (CLK LOCAL MEM FENCE); A t[ glob idx a t ] = a t [ glob idx a t ] = A shared[ threadIdx .x ][ threadIdx .y ]; a local [ get local id (0)∗BLK SIZE+get local id(1)]; } } Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 15. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL ↔ CUDA: A dictionary OpenCL CUDA Grid Grid Work Group Block Work Item Thread kernel global global device local shared private local imagend t texture<type, n, ...> barrier(LMF) syncthreads() get local id(012) threadIdx.xyz get group id(012) blockIdx.xyz get global id(012) – (reimplement) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 16. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCL. OpenCL: Execution Model. Two-tiered parallelism over an nD grid: the grid consists of Nx × Ny × Nz work groups, each work group of Sx × Sy × Sz work items, for a total of ∏ i∈{x,y,z} Si Ni work items. Communication and synchronization happen only within a work group; a work group maps to a compute unit. Grid and group indices play the role of the outer loops in an algorithm. Device language: get_{global,group,local}_{id,size}(axis). [Diagram: a 3×2 grid of work groups, with one group expanded into a 4×4 array of work items.] Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
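To make the index functions on this slide concrete, a short device-language sketch (not from the slides): with the default zero global offset, a work item's global index decomposes into its group and local indices.

kernel_src = """
__kernel void write_global_ids(__global int *out)
{
    /* get_global_id(0) == get_group_id(0)*get_local_size(0) + get_local_id(0)
       (assuming the default zero global offset) */
    out[get_global_id(0)] =
        get_group_id(0) * get_local_size(0) + get_local_id(0);
}
"""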
  • 17. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service Host (CPU) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 18. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 19. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· Memory ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 20. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Memory Compute Device 1 (Platform 0) ··· Host ··· ··· Memory Compute Device 0 (Platform 1) (CPU) ··· Memory ··· ··· Memory Compute Device 1 (Platform 1) ··· ··· ··· Memory Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 21. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 22. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service Platform 0 (e.g. CPUs) Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 23. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Platform 1 (e.g. GPUs) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 24. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 25. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service (think “chip”, Compute Device 0 (Platform 0) has memory ··· interface) ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 26. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service (think “chip”, Compute Device 0 (Platform 0) has memory ··· interface) ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) Compute Unit ··· ··· (think “processor”, ··· has insn. fetch) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 27. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service (think “chip”, Compute Device 0 (Platform 0) has memory ··· interface) ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) Compute Unit ··· ··· (think “processor”, ··· has insn. fetch) Processing Element (think “SIMD lane”) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 28. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 29. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) Python ··· ··· ··· Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 30. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL: Computing as a Service Compute Device 0 (Platform 0) ··· ··· ··· Compute Device 1 (Platform 0) ··· ··· Host ··· Compute Device 0 (Platform 1) (CPU) ··· ··· ··· Compute Device 1 (Platform 1) Python ··· ··· ··· Device Language: ∼ C99 Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 31. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLOpenCL Object Diagram Figure 2.1 - OpenCL UML Class Diagram Credit: Khronos Group Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 32. Intro PyOpenCL RTCG Perspectives A Common Theme OpenCLWhy do Scripting for GPUs? GPUs are everything that scripting languages are not. Highly parallel Very architecture-sensitive Built for maximum FP/memory throughput → complement each other CPU: largely restricted to control tasks (∼1000/sec) Scripting fast enough Python + CUDA = PyCUDA Python + OpenCL = PyOpenCL Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 33. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLOutline 1 Introduction 2 Programming with PyOpenCL First Contact About PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 34. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLOutline 1 Introduction 2 Programming with PyOpenCL First Contact About PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 35. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLDive into PyOpenCL 1 import pyopencl as cl , numpy 2 3 a = numpy.random.rand(256∗∗3).astype(numpy.float32) 4 5 ctx = cl. create some context () 6 queue = cl.CommandQueue(ctx) 7 8 a dev = cl. Buffer (ctx , cl .mem flags.READ WRITE, size=a.nbytes) 9 cl . enqueue write buffer (queue, a dev, a)1011 prg = cl.Program(ctx, ”””12 kernel void twice( global float ∗a)13 { a[ get global id (0)] ∗= 2; }14 ”””). build ()1516 prg. twice(queue, a.shape, (1,), a dev) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 36. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLDive into PyOpenCL 1 import pyopencl as cl , numpy 2 3 a = numpy.random.rand(256∗∗3).astype(numpy.float32) 4 5 ctx = cl. create some context () 6 queue = cl.CommandQueue(ctx) 7 8 a dev = cl. Buffer (ctx , cl .mem flags.READ WRITE, size=a.nbytes) 9 cl . enqueue write buffer (queue, a dev, a)1011 prg = cl.Program(ctx, ”””12 kernel void twice( global float ∗a)13 { a[ get global id (0)] ∗= 2; } Compute kernel14 ”””). build ()1516 prg. twice(queue, a.shape, (1,), a dev) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 37. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLDive into PyOpenCL 8 a dev = cl. Buffer (ctx , cl .mem flags.READ WRITE, size=a.nbytes) 9 cl . enqueue write buffer (queue, a dev, a)1011 prg = cl.Program(ctx, ”””12 kernel void twice( global float ∗a)13 { a[ get local id (0)+ get local size (0)∗ get group id (0)] ∗= 2; }14 ”””). build ()1516 prg. twice(queue, a.shape, (256,), a dev)1718 result = numpy.empty like(a)19 cl . enqueue read buffer (queue, a dev, result ). wait()20 import numpy.linalg as la21 assert la .norm(result − 2∗a) == 0 Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
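The code on slides 35-37 is garbled by the transcript (line numbers fused into identifiers, leading underscores dropped); a cleaned-up, runnable reconstruction, using cl.enqueue_copy in place of the older enqueue_write_buffer / enqueue_read_buffer calls shown on the slides:

import numpy as np
import numpy.linalg as la
import pyopencl as cl

a = np.random.rand(256**3).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_copy(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    { a[get_global_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (256,), a_dev)    # work group size 256, as on slide 37

result = np.empty_like(a)
cl.enqueue_copy(queue, result, a_dev).wait()
assert la.norm(result - 2*a) == 0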
  • 38. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLOutline 1 Introduction 2 Programming with PyOpenCL First Contact About PyOpenCL 3 Run-Time Code Generation 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 39. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLPyOpenCL: Completeness PyOpenCL exposes all of OpenCL. For example: Every GetInfo() query Images and Samplers Memory Maps Profiling and Synchronization GL Interop Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 40. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLPyOpenCL: Completeness PyOpenCL supports (nearly) every OS that has an OpenCL implementation. Linux OS X Windows Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 41. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLAutomatic Cleanup Reachable objects (memory, streams, . . . ) are never destroyed. Once unreachable, released at an unspecified future time. Scarce resources (memory) can be explicitly freed. (obj.release()) Correctly deals with multiple contexts and dependencies. (based on OpenCL’s reference counting) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
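A one-line illustration of the explicit-free escape hatch mentioned above (the buffer here is hypothetical; ctx is as in the reconstructed demo earlier):

scratch = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=1 << 20)
# ... use scratch ...
scratch.release()   # return the device memory now instead of waiting for GC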
  • 42. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLPyOpenCL: Documentation Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 43. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLPyOpenCL Philosophy Provide complete access Automatically manage resources Provide abstractions Allow interactive use Check for and report errors automatically Integrate tightly with numpy Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 44. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCLPyOpenCL, PyCUDA: Vital Information http://mathema.tician.de/ software/pyopencl (or /pycuda) Complete documentation X Consortium License (no warranty, free for all use) Convenient abstractions Arrays, Elementwise op., Reduction, Scan Require: numpy, Python 2.4+ (Win/OS X/Linux) Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, . . . ) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 45. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL. Capturing Dependencies. [Diagram: dependency graph of queued operations: B = f(A), C = g(B), E = f(C), F = h(C), G = g(E,F), P = p(B), Q = q(B), R = r(G,P,Q).] Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 46. Intro PyOpenCL RTCG Perspectives First Contact About PyOpenCL. Capturing Dependencies. Switch the queue to out-of-order mode! Specify dependencies as a list of events, using the optional wait_for keyword to the enqueue_XXX calls; can also enqueue a barrier. Common use case: transmit/receive from other MPI ranks. Possible on Nvidia Fermi: submit parallel work to increase machine use. Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
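A sketch of the event machinery this slide describes (not from the slides; note that not every OpenCL implementation supports out-of-order queues): an out-of-order queue plus explicit wait_for lists reproduce a small piece of the dependency graph from slide 45.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx,
    properties=cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)

prg = cl.Program(ctx, """
    __kernel void f(__global float *x) { x[get_global_id(0)] += 1; }
    __kernel void g(__global float *x) { x[get_global_id(0)] *= 2; }
    """).build()

a = np.zeros(1024, dtype=np.float32)
a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)

evt_up = cl.enqueue_copy(queue, a_dev, a)                      # A
evt_b = prg.f(queue, a.shape, None, a_dev, wait_for=[evt_up])  # B = f(A)
evt_c = prg.g(queue, a.shape, None, a_dev, wait_for=[evt_b])   # C = g(B)
cl.enqueue_copy(queue, a, a_dev, wait_for=[evt_c]).wait()
assert (a == 2).all()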
  • 47. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionOutline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation The Idea RTCG in Action 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 48. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionOutline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation The Idea RTCG in Action 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 49. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionMetaprogramming In GPU scripting, GPU code does not need to be a compile-time constant. Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 50. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionMetaprogramming In GPU scripting, GPU code does not need to be a compile-time constant. (Key: Code is data–it wants to be reasoned about at run time) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 51. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionMetaprogramming Idea In GPU scripting, GPU code does not need to be a compile-time constant. (Key: Code is data–it wants to be reasoned about at run time) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 52. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionMetaprogramming Idea In GPU scripting, Python Code GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 53. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionMetaprogramming Idea In GPU scripting, Python Code GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary Machine (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 54. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionMetaprogramming Idea Human In GPU scripting, Python Code GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 55. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionMetaprogramming Idea Good for code In GPU scripting, Python Code generation GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 56. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionMetaprogramming Idea Good for code In GPUyCUDA P scripting, Python Code generation GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 57. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionMetaprogramming Idea Good for code PyOp UDA In GPUyCenCL P scripting, Python Code generation GPU code does not need to be GPU Code a compile-time constant. GPU Compiler GPU Binary (Key: Code is data–it wants to be GPU reasoned about at run time) Result Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 58. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionMachine-generated Code Why machine-generate code? Automated Tuning (cf. ATLAS, FFTW) Data types Specialize code for given problem Constants faster than variables (→ register pressure) Loop Unrolling Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 59. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionPyOpenCL: Support for Metaprogramming Three (main) ways of generating code: Simple %-operator substitution Combine with C preprocessor: simple, often sufficient Use a templating engine (Mako works very well) codepy: Build C syntax trees from Python Generates readable, indented C Many ways of evaluating code–most important one: Exact device timing via events Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
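The "exact device timing via events" bullet in practice, as a short sketch reusing ctx, prg, a and a_dev from the reconstructed demo earlier (profiling has to be requested when the queue is created):

profiling_queue = cl.CommandQueue(ctx,
    properties=cl.command_queue_properties.PROFILING_ENABLE)

evt = prg.twice(profiling_queue, a.shape, (256,), a_dev)
evt.wait()
# Event.profile timestamps are in nanoseconds.
print("kernel time: %g ms" % ((evt.profile.end - evt.profile.start) * 1e-6))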
  • 60. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionOutline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation The Idea RTCG in Action 4 Perspectives Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 61. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionPyOpenCL Arrays: General Usage Remember your first PyOpenCL program? Abstraction is good: 1 import numpy 2 import pyopencl as cl 3 import pyopencl.array as cl array 4 5 ctx = cl. create some context () 6 queue = cl.CommandQueue(ctx) 7 8 a gpu = cl array . to device ( 9 ctx , queue, numpy.random.randn(4,4).astype(numpy.float32))10 a doubled = (2∗a gpu).get()11 print a doubled12 print a gpu Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 62. Intro PyOpenCL RTCG Perspectives Idea RTCG in Actionpyopencl.array: Simple Linear Algebra pyopencl.array.Array: Meant to look and feel just like numpy. p.a.to device(ctx, queue, numpy array) numpy array = ary.get() +, -, ∗, /, fill, sin, arange, exp, rand, . . . Mixed types (int32 + float32 = float64) print cl array for debugging. Allows access to raw bits Use as kernel arguments, memory maps Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
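The pyopencl.array example on slide 61 is garbled in the transcript; a cleaned-up version follows. Note that current PyOpenCL's to_device takes just the queue; the (ctx, queue, ...) form on the slide is the older interface.

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_gpu = cl_array.to_device(queue,
    np.random.randn(4, 4).astype(np.float32))
a_doubled = (2 * a_gpu).get()
print(a_doubled)
print(a_gpu)   # device arrays print like numpy arrays, handy for debugging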
  • 63. Intro PyOpenCL RTCG Perspectives Idea RTCG in Actionpyopencl.elementwise: Elementwise expressions Avoiding extra store-fetch cycles for elementwise math: n = 10000 a gpu = cl array . to device ( ctx , queue, numpy.random.randn(n).astype(numpy.float32)) b gpu = cl array . to device ( ctx , queue, numpy.random.randn(n).astype(numpy.float32)) from pyopencl.elementwise import ElementwiseKernel lin comb = ElementwiseKernel(ctx, ” float a, float ∗x, float b, float ∗y, float ∗z”, ”z[ i ] = a∗x[i ] + b∗y[i]”) c gpu = cl array . empty like (a gpu) lin comb(5, a gpu, 6, b gpu, c gpu) import numpy.linalg as la assert la .norm((c gpu − (5∗a gpu+6∗b gpu)).get()) < 1e−5 Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
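Slide 63's ElementwiseKernel example, reconstructed into runnable form (the transcript drops underscores and spacing):

import numpy as np
import numpy.linalg as la
import pyopencl as cl
import pyopencl.array as cl_array
from pyopencl.elementwise import ElementwiseKernel

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n = 10000
a_gpu = cl_array.to_device(queue, np.random.randn(n).astype(np.float32))
b_gpu = cl_array.to_device(queue, np.random.randn(n).astype(np.float32))

# One fused kernel for z = a*x + b*y, avoiding intermediate temporaries.
lin_comb = ElementwiseKernel(ctx,
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = cl_array.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5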
  • 64. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionRTCG via Substitution source = (””” kernel void %(name)s(%(arguments)s) { unsigned lid = get local id (0); unsigned gsize = get global size (0); unsigned work item start = get local size (0)∗ get group id (0); for (unsigned i = work item start + lid ; i < n; i += gsize) { %(operation)s; } } ””” % { ”arguments”: ”, ”. join (arg . declarator () for arg in arguments), ”operation”: operation , ”name”: name, ”loop prep”: loop prep , }) prg = cl.Program(ctx, source ). build () Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
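The slide's code is a fragment (arguments, operation, name and loop_prep are defined elsewhere). A self-contained sketch of the same %-substitution idea, with a made-up kernel name and operation:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

source = """
    __kernel void %(name)s(%(arguments)s)
    {
        unsigned gid = get_global_id(0);
        %(operation)s;
    }
    """ % {
        "name": "scale",
        "arguments": "__global float *x",
        "operation": "x[gid] *= 3",
    }

prg = cl.Program(ctx, source).build()

x = np.ones(1024, dtype=np.float32)
x_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                  hostbuf=x)
prg.scale(queue, x.shape, None, x_dev)
cl.enqueue_copy(queue, x, x_dev).wait()
assert (x == 3).all()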
  • 65. Intro PyOpenCL RTCG Perspectives Idea RTCG in ActionRTCG via Templates from mako.template import Template tpl = Template(””” kernel void add( global ${ type name } ∗tgt, global const ${ type name } ∗op1, global const ${ type name } ∗op2) { int idx = get local id (0) + ${ local size } ∗ ${ thread strides } ∗ get group id (0); % for i in range( thread strides ): <% offset = i∗ local size %> tgt [ idx + ${ offset }] = op1[idx + ${ offset }] + op2[idx + ${ offset } ]; % endfor }”””) rendered tpl = tpl . render(type name=”float”, local size = local size , thread strides = thread strides ) knl = cl.Program(ctx, str ( rendered tpl )). build (). add Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
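A self-contained version of slide 65's Mako-templated kernel (the ${...} substitutions and underscores are mangled in the transcript); local_size and thread_strides are chosen arbitrarily here:

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
from mako.template import Template

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

local_size = 256
thread_strides = 4

tpl = Template("""
    __kernel void add(
        __global ${type_name} *tgt,
        __global const ${type_name} *op1,
        __global const ${type_name} *op2)
    {
        int idx = get_local_id(0)
            + ${local_size} * ${thread_strides} * get_group_id(0);
        % for i in range(thread_strides):
            <% offset = i*local_size %>
            tgt[idx + ${offset}] = op1[idx + ${offset}] + op2[idx + ${offset}];
        % endfor
    }""")

rendered = tpl.render(type_name="float",
                      local_size=local_size,
                      thread_strides=thread_strides)
knl = cl.Program(ctx, str(rendered)).build().add

n = local_size * thread_strides * 32          # 32 work groups
a = cl_array.to_device(queue, np.random.randn(n).astype(np.float32))
b = cl_array.to_device(queue, np.random.randn(n).astype(np.float32))
c = cl_array.empty_like(a)

knl(queue, (local_size * 32,), (local_size,), c.data, a.data, b.data)
assert np.allclose(c.get(), a.get() + b.get())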
  • 66. Intro PyOpenCL RTCG Perspectives Idea RTCG in Actionpyopencl.reduction: Reduction made easy Example: A dot product calculation from pyopencl.reduction import ReductionKernel dot = ReductionKernel(ctx, dtype out=numpy.float32, neutral=”0”, reduce expr=”a+b”, map expr=”x[i]∗y[i]”, arguments=” global const float ∗x, global const float ∗y”) import pyopencl.clrandom as cl rand x = cl rand .rand(ctx , queue, (1000∗1000), dtype=numpy.float32) y = cl rand .rand(ctx , queue, (1000∗1000), dtype=numpy.float32) x dot y = dot(x, y ). get() x dot y cpu = numpy.dot(x.get(), y. get ()) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
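Slide 66's dot-product ReductionKernel, reconstructed. The random input is created on the host here rather than with pyopencl.clrandom, to stay independent of the exact clrandom interface shown on the slide:

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
from pyopencl.reduction import ReductionKernel

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

dot = ReductionKernel(ctx, dtype_out=np.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="__global const float *x, __global const float *y")

n = 1000 * 1000
x = cl_array.to_device(queue, np.random.rand(n).astype(np.float32))
y = cl_array.to_device(queue, np.random.rand(n).astype(np.float32))

x_dot_y = dot(x, y).get()
x_dot_y_cpu = np.dot(x.get(), y.get())
assert abs(x_dot_y - x_dot_y_cpu) / abs(x_dot_y_cpu) < 1e-4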
  • 67. Intro PyOpenCL RTCG Perspectives Idea RTCG in Actionpyopencl.scan: Scan made easy Example: A cumulative sum computation from pyopencl.scan import InclusiveScanKernel knl = InclusiveScanKernel(ctx , np.int32 , ”a+b”) n = 2∗∗20−2∗∗18+5 host data = np.random.randint(0, 10, n). astype(np.int32) dev data = cl array . to device (queue, host data) knl(dev data) assert (dev data.get() == np.cumsum(host data, axis=0)).all() Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
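Slide 67's cumulative-sum scan, reconstructed into runnable form:

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
from pyopencl.scan import InclusiveScanKernel

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

knl = InclusiveScanKernel(ctx, np.int32, "a+b")

n = 2**20 - 2**18 + 5             # deliberately not a power of two
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = cl_array.to_device(queue, host_data)

knl(dev_data)                     # in-place inclusive scan
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()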
  • 68. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsOutline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives PyCUDA DG-FEM on the GPU “Automatic” GPU Programming Conclusions Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 69. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsOutline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives PyCUDA DG-FEM on the GPU “Automatic” GPU Programming Conclusions Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 70. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsWhetting your appetite1 import pycuda.driver as cuda2 import pycuda.autoinit , pycuda.compiler3 import numpy45 a = numpy.random.randn(4,4).astype(numpy.float32)6 a gpu = cuda.mem alloc(a.nbytes)7 cuda.memcpy htod(a gpu, a) [This is examples/demo.py in the PyCUDA distribution.] Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 71. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsWhetting your appetite 1 mod = pycuda.compiler.SourceModule(””” 2 global void twice( float ∗a) 3 { 4 int idx = threadIdx.x + threadIdx.y∗4; 5 a[ idx ] ∗= 2; 6 } 7 ”””) 8 9 func = mod.get function(”twice”)10 func(a gpu, block=(4,4,1))1112 a doubled = numpy.empty like(a)13 cuda.memcpy dtoh(a doubled, a gpu)14 print a doubled15 print a Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 72. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsWhetting your appetite 1 mod = pycuda.compiler.SourceModule(””” 2 global void twice( float ∗a) 3 { 4 int idx = threadIdx.x + threadIdx.y∗4; 5 a[ idx ] ∗= 2; 6 } Compute kernel 7 ”””) 8 9 func = mod.get function(”twice”)10 func(a gpu, block=(4,4,1))1112 a doubled = numpy.empty like(a)13 cuda.memcpy dtoh(a doubled, a gpu)14 print a doubled15 print a Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
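The PyCUDA demo from slides 70-72 (examples/demo.py in the PyCUDA distribution), with the transcript's garbled spacing and dropped underscores repaired:

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit            # creates a CUDA context on import
import pycuda.compiler

a = np.random.randn(4, 4).astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

mod = pycuda.compiler.SourceModule("""
    __global__ void twice(float *a)
    {
        int idx = threadIdx.x + threadIdx.y*4;
        a[idx] *= 2;
    }
    """)

func = mod.get_function("twice")
func(a_gpu, block=(4, 4, 1))

a_doubled = np.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)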
  • 73. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsPyOpenCL ↔ PyCUDA: A (rough) dictionary PyOpenCL PyCUDA Context Context CommandQueue Stream Buffer mem alloc / DeviceAllocation Program SourceModule Kernel Function Event (eg. enqueue marker) Event Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 74. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsOutline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives PyCUDA DG-FEM on the GPU “Automatic” GPU Programming Conclusions Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 75. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Discontinuous Galerkin Method. Let Ω := ∪i Dk ⊂ Rd. Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 76. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Discontinuous Galerkin Method. Let Ω := ∪i Dk ⊂ Rd. Goal: solve a conservation law on Ω, ut + ∇ · F(u) = 0. Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 77. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Discontinuous Galerkin Method. Let Ω := ∪i Dk ⊂ Rd. Goal: solve a conservation law on Ω, ut + ∇ · F(u) = 0. Example, Maxwell's equations: the EM field E(x,t), H(x,t) on Ω is governed by ∂t E − (1/ε) ∇ × H = −j/ε, ∂t H + (1/µ) ∇ × E = 0, ∇ · E = ρ/ε, ∇ · H = 0. Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 78. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Metaprogramming DG: Flux Terms. 0 = ∫_Dk ( ut φ + [∇ · F(u)] φ ) dx − ∫_∂Dk [ n̂ · F − (n̂ · F)* ] φ dSx, where the surface integral is the flux term. Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 79. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Metaprogramming DG: Flux Terms. 0 = ∫_Dk ( ut φ + [∇ · F(u)] φ ) dx − ∫_∂Dk [ n̂ · F − (n̂ · F)* ] φ dSx. Flux terms: vary by problem; expression specified by user; evaluated pointwise. Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 80. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Metaprogramming DG: Flux Terms Example. Example, fluxes for Maxwell's equations: n̂ · (F − F*)_E := (1/2) [ n̂ × ( ⟦H⟧ − α n̂ × ⟦E⟧ ) ]. Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 81. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Metaprogramming DG: Flux Terms Example. Example, fluxes for Maxwell's equations: n̂ · (F − F*)_E := (1/2) [ n̂ × ( ⟦H⟧ − α n̂ × ⟦E⟧ ) ]. User writes a vectorial statement in mathematical notation: flux = 1/2*cross(normal, h.int - h.ext - alpha*cross(normal, e.int - e.ext)). Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 82. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Metaprogramming DG: Flux Terms Example. Example, fluxes for Maxwell's equations: n̂ · (F − F*)_E := (1/2) [ n̂ × ( ⟦H⟧ − α n̂ × ⟦E⟧ ) ]. We generate a scalar evaluator in C (6×): a_flux += ( (((val_a_field5 - val_b_field5)*fpair->normal[2] - (val_a_field4 - val_b_field4)*fpair->normal[0]) + val_a_field0 - val_b_field0)*fpair->normal[0] - (((val_a_field4 - val_b_field4)*fpair->normal[1] - (val_a_field1 - val_b_field1)*fpair->normal[2]) + val_a_field3 - val_b_field3)*fpair->normal[1] )*value_type(0.5); Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 83. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsLoop Slicing for element-local parts of GPU DG Per Block: KL element-local mat.mult. + matrix load Preparation Question: How should one assign work to threads? Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 84. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsLoop Slicing for element-local parts of GPU DG Per Block: KL element-local mat.mult. + matrix load Preparation Question: How should one assign work to threads? ws : in sequence wi : “inline-parallel” wp : in parallel Thread Thread Thread t t t (amortize preparation) (exploit register space) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 85. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Loop Slicing for Differentiation. [Plot: local differentiation, matrix-in-shared, order 4, with microblocking; execution time [ms] versus wp for various ws, point size denotes wi ∈ {1, ..., 4}.] Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 86. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Nvidia GTX280 vs. single core of Intel Core 2 Duo E8400. [Plot: GFlops/s versus polynomial order N, GPU vs. CPU.] Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 87. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions Memory Bandwidth on a GTX 280. [Plot: global memory bandwidth [GB/s] versus polynomial order N for the Gather, Lift, Diff, and Assy. stages, compared against peak.] Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 88. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions GPU DG Showcase: Electromagnetism. Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 89. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions GPU DG Showcase: Electromagnetism, Poisson. Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 90. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py Conclusions GPU DG Showcase: Electromagnetism, Poisson, CFD. Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 91. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsOutline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives PyCUDA DG-FEM on the GPU “Automatic” GPU Programming Conclusions Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 92. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsAutomating GPU Programming GPU programming can be time-consuming, unintuitive and error-prone. Obvious idea: Let the computer do it. One way: Smart compilers Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 93. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsAutomating GPU Programming GPU programming can be time-consuming, unintuitive and error-prone. Obvious idea: Let the computer do it. One way: Smart compilers GPU programming requires complex tradeoffs Tradeoffs require heuristics Heuristics are fragile Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 94. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsAutomating GPU Programming GPU programming can be time-consuming, unintuitive and error-prone. Obvious idea: Let the computer do it. One way: Smart compilers GPU programming requires complex tradeoffs Tradeoffs require heuristics Heuristics are fragile Another way: Dumb enumeration Enumerate loop slicings Enumerate prefetch options Choose by running resulting code on actual hardware Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 95. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsLoo.py Example Empirical GPU loop optimization: a, b, c, i , j , k = [var(s) for s in ” abcijk ”] n = 500 k = make loop kernel([ LoopDimension(”i”, n), LoopDimension(”j”, n), LoopDimension(”k”, n), ], [ (c[ i +n∗j], a[ i +n∗k]∗b[k+n∗j]) ]) gen kwargs = { ”min threads”: 128, ”min blocks”: 32, } → Ideal case: Finds 160 GF/s kernel without human intervention. Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 96. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsLoo.py Status Limited scope: Require input/output separation Kernels must be expressible using “loopy” model (i.e. indices decompose into “output” and “reduction”) Enough for DG, LA, FD, . . . Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 97. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsLoo.py Status Limited scope: Require input/output separation Kernels must be expressible using “loopy” model (i.e. indices decompose into “output” and “reduction”) Enough for DG, LA, FD, . . . Kernel compilation limits trial rate Non-Goal: Peak performance Good results currently for dense linear algebra and (some) DG subkernels Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 98. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsOutline 1 Introduction 2 Programming with PyOpenCL 3 Run-Time Code Generation 4 Perspectives PyCUDA DG-FEM on the GPU “Automatic” GPU Programming Conclusions Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 99. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsWhere to from here? PyCUDA, PyOpenCL, hedge → http://www.cims.nyu.edu/~kloeckner/ GPU RTCG AK, N. Pinto et al. PyCUDA: GPU Run-Time Code Generation for High-Performance Computing, submitted. GPU-DG Article AK, T. Warburton, J. Bridge, J.S. Hesthaven, “Nodal Discontinuous Galerkin Methods on Graphics Processors”, J. Comp. Phys., 228 (21), 7863–7882. Also: Intro in GPU Computing Gems Vol 2 Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 100. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsConclusions GPUs to me: architecture choice now widely available Fun time to be in computational science GPUs and scripting work surprisingly well together Exploit a natural task decomposition in computational codes RTCG: Crucial tool GPU Scripting great for prototyping . . . and just as suitable for production code Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 101. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsQuestions? ? Thank you for your attention! http://www.cims.nyu.edu/~kloeckner/ image credits Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 102. Intro PyOpenCL RTCG Perspectives PyCUDA GPU-DG Loo.py ConclusionsImage Credits Dictionary: sxc.hu/topfer C870 GPU: Nvidia Corp. OpenCL Logo: Apple Corp./Ars Technica OS Platforms: flickr.com/aOliN.Tk Old Books: flickr.com/ppdigital Floppy disk: flickr.com/ethanhein Machine: flickr.com/13521837@N00 Adding Machine: flickr.com/thomashawk Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 103. Implementations Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs. [Plot: flop rates, 16 GPUs vs. 64 CPU cores; GFlops/s versus polynomial order N.] Andreas Klöckner GPU-Python with PyOpenCL and PyCUDA
  • 104. ImplementationsOutline 5 OpenCL implementations Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 105. ImplementationsThe Nvidia CL implementation Targets only GPUs Notes: Nearly identical to CUDA No native C-level JIT in CUDA (→ PyCUDA) Page-locked memory: Use CL MEM ALLOC HOST PTR. Careful: double meaning Need page-locked memory for genuinely overlapped transfers. No linear memory texturing CUDA device emulation mode deprecated → Use AMD CPU CL (faster, too!) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
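A minimal sketch (not from the slides) of requesting page-locked memory via CL_MEM_ALLOC_HOST_PTR as suggested above; whether the allocation is actually pinned is implementation-defined, and the unmap call at the end assumes the MemoryMap object returned by current PyOpenCL versions:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

nbytes = 1024 * np.dtype(np.float32).itemsize
pinned = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=nbytes)

# Map the buffer into host address space and fill it without a staging copy.
ary, evt = cl.enqueue_map_buffer(queue, pinned, cl.map_flags.WRITE,
                                 0, (1024,), np.float32)
ary[:] = np.arange(1024, dtype=np.float32)
ary.base.release(queue)   # unmap; assumes ary.base is the MemoryMap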
  • 106. ImplementationsThe Apple CL implementation Targets CPUs and GPUs General notes: Different header name OpenCL/cl.h instead of CL/cl.h Use -framework OpenCL for C access. Beware of imperfect compiler cache implementation (ignores include files) CPU notes: One work item per processor GPU similar to hardware vendor implementation. (New: Intel w/ Sandy Bridge) Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  • 107. ImplementationsThe AMD CL implementation Targets CPUs and GPUs (from both AMD and Nvidia) GPU notes: Wide SIMD groups (64) Native 4/5-wide vectors But: very flop-heavy machine, may ignore vectors for memory-bound workloads → Both implicit and explicit SIMD CPU notes: Many work items per processor (emulated) General: cl amd printf Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA