[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)


Published on http://cs264.org

Abstract:

High-level scripting languages are in many ways polar opposites to
GPUs. GPUs are highly parallel, subject to hardware subtleties, and
designed for maximum throughput, and they offer a tremendous advance
in the performance achievable for a significant number of
computational problems. On the other hand, scripting languages such as
Python favor ease of use over computational speed and do not generally
emphasize parallelism. PyOpenCL and PyCUDA are two packages that
attempt to join the two together. By showing concrete examples, both
at the toy and the whole-application level, this talk aims to
demonstrate that by combining these opposites, a programming
environment is created that is greater than just the sum of its two
parts.

Speaker biography:

Andreas Klöckner obtained his PhD degree working with Jan Hesthaven at
the Department of Applied Mathematics at Brown University. He worked
on a variety of topics all aiming to broaden the utility of
discontinuous Galerkin (DG) methods. This included their use in the
simulation of plasma physics and the demonstration of their particular
suitability for computation on throughput-oriented graphics processors
(GPUs). He also worked on multi-rate time stepping methods and shock
capturing schemes for DG.

In the fall of 2010, he joined the Courant Institute of Mathematical
Sciences at New York University as a Courant Instructor. There, he is
working on problems in computational electromagnetics with Leslie
Greengard.

His research interests include:

- Discontinuous Galerkin and integral equation methods for wave
propagation

- Programming tools for parallel architectures

- High-order unstructured particle-in-cell methods for plasma simulation


Transcript of "[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA (Andreas Kloeckner, NYU)"

  1. Intro | PyOpenCL | RTCG | Perspectives
Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA
Andreas Klöckner
Courant Institute of Mathematical Sciences, New York University
March 31, 2011
Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  2. Thanks
Jan Hesthaven (Brown)
Tim Warburton (Rice)
Leslie Greengard (NYU)
PyOpenCL, PyCUDA contributors
Nvidia Corp., AMD Corp.
  3. Outline
1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation
4 Perspectives
  4. Outline
1 Introduction: A Common Theme; Intro to OpenCL
2 Programming with PyOpenCL
3 Run-Time Code Generation
4 Perspectives
  6. How are High-Performance Codes constructed?
"Traditional" construction of high-performance codes: C/C++/Fortran, libraries.
"Alternative" construction of high-performance codes: scripting for "brains", GPUs for "inner loops".
Play to the strengths of each programming environment.
  8. What is OpenCL?
OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors. [OpenCL 1.1 spec]
Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU), vendor-neutral, comes with RTCG.
Defines: a host-side programming interface (library) and a device-side programming language (!).
  9.-10. What is OpenCL? (builds of slide 8, adding the annotations "Big deal?" and "Big deal!")
  11. Who? OpenCL Working Group
Diverse industry participation: processor vendors, system OEMs, middleware vendors, application developers.
Many industry-leading experts involved in OpenCL's design, with a healthy diversity of industry perspectives.
Apple made the initial proposal and is very active in the working group, serving as specification editor.
Credit: Khronos Group
  12. When? OpenCL Timeline
Six months from proposal to the released OpenCL 1.0 specification, due to a strong initial proposal and a shared commercial incentive.
Multiple conformant implementations shipping; Apple's Mac OS X Snow Leopard now ships with OpenCL.
18-month cadence between OpenCL 1.0 and OpenCL 1.1; backwards compatibility protects software investment.
Timeline:
Jun 08: Apple proposes the OpenCL working group and contributes a draft specification to Khronos.
Dec 08: Khronos publicly releases OpenCL 1.0 as a royalty-free specification.
May 09: Khronos releases OpenCL 1.0 conformance tests to ensure high-quality implementations.
2H 09: Multiple conformant implementations ship across diverse OSes and platforms.
Jun 10: OpenCL 1.1 specification released and first implementations ship.
Credit: Khronos Group
  13. Why? Processor Parallelism
CPUs: multiple cores driving performance increases; multi-processor programming, e.g. OpenMP.
GPUs: increasingly general-purpose data-parallel computing; graphics APIs and shading languages.
The emerging intersection: heterogeneous computing.
OpenCL is a programming framework for heterogeneous compute resources.
Credit: Khronos Group
  14. CL vs CUDA side-by-side

CUDA source code:

    __global__ void transpose(
      float *A_t, float *A,
      int a_width, int a_height)
    {
      int base_idx_a =
        blockIdx.x * BLK_SIZE +
        blockIdx.y * A_BLOCK_STRIDE;
      int base_idx_a_t =
        blockIdx.y * BLK_SIZE +
        blockIdx.x * A_T_BLOCK_STRIDE;
      int glob_idx_a =
        base_idx_a + threadIdx.x + a_width * threadIdx.y;
      int glob_idx_a_t =
        base_idx_a_t + threadIdx.x + a_height * threadIdx.y;

      __shared__ float A_shared[BLK_SIZE][BLK_SIZE+1];
      A_shared[threadIdx.y][threadIdx.x] = A[glob_idx_a];
      __syncthreads();
      A_t[glob_idx_a_t] = A_shared[threadIdx.x][threadIdx.y];
    }

OpenCL source code:

    __kernel void transpose(
      __global float *a_t, __global float *a,
      unsigned a_width, unsigned a_height)
    {
      int base_idx_a =
        get_group_id(0) * BLK_SIZE +
        get_group_id(1) * A_BLOCK_STRIDE;
      int base_idx_a_t =
        get_group_id(1) * BLK_SIZE +
        get_group_id(0) * A_T_BLOCK_STRIDE;
      int glob_idx_a =
        base_idx_a + get_local_id(0) + a_width * get_local_id(1);
      int glob_idx_a_t =
        base_idx_a_t + get_local_id(0) + a_height * get_local_id(1);

      __local float a_local[BLK_SIZE][BLK_SIZE+1];
      a_local[get_local_id(1)][get_local_id(0)] = a[glob_idx_a];
      barrier(CLK_LOCAL_MEM_FENCE);
      a_t[glob_idx_a_t] = a_local[get_local_id(0)][get_local_id(1)];
    }
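The tiled indexing in these kernels is easy to get wrong; as a sanity check, here is a minimal pure-NumPy emulation of the transpose kernel's work-group and work-item loops. The block size and array shape are illustrative choices, not values from the slides.

```python
import numpy as np

BLK = 4                       # illustrative block size
a_height, a_width = 8, 12     # illustrative matrix shape
a = np.arange(a_height * a_width, dtype=np.float32)   # flat, row-major
a_t = np.empty(a_width * a_height, dtype=np.float32)
A_BLOCK_STRIDE = BLK * a_width      # one block-row down in a
A_T_BLOCK_STRIDE = BLK * a_height   # one block-row down in a_t

for gx in range(a_width // BLK):          # get_group_id(0)
    for gy in range(a_height // BLK):     # get_group_id(1)
        base_a = gx * BLK + gy * A_BLOCK_STRIDE
        base_at = gy * BLK + gx * A_T_BLOCK_STRIDE
        loc = np.empty((BLK, BLK), dtype=np.float32)  # the __local tile
        for ly in range(BLK):             # get_local_id(1)
            for lx in range(BLK):         # get_local_id(0)
                loc[ly, lx] = a[base_a + lx + a_width * ly]
        # barrier(CLK_LOCAL_MEM_FENCE): all loads finish before any store
        for ly in range(BLK):
            for lx in range(BLK):
                a_t[base_at + lx + a_height * ly] = loc[lx, ly]

# The block-wise shuffle really is a transpose:
assert (a_t.reshape(a_width, a_height) == a.reshape(a_height, a_width).T).all()
```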
  15. OpenCL ↔ CUDA: A dictionary

    OpenCL                          CUDA
    Grid                            Grid
    Work Group                      Block
    Work Item                       Thread
    __kernel                        __global__
    __global                        __device__
    __local                         __shared__
    __private                       __local__
    image2d_t / image3d_t           texture<type, n, ...>
    barrier(CLK_LOCAL_MEM_FENCE)    __syncthreads()
    get_local_id(0/1/2)             threadIdx.x/y/z
    get_group_id(0/1/2)             blockIdx.x/y/z
    get_global_id(0/1/2)            - (reimplement)
  16. OpenCL: Execution Model
Two-tiered parallelism: an nD grid of work groups, each containing a grid of work items.
Grid = Nx × Ny × Nz work groups; work group = Sx × Sy × Sz work items; total: ∏ over i ∈ {x,y,z} of Si·Ni work items.
Communication/synchronization only within a work group.
A work group maps to a compute unit.
Grid/group ≈ outer loops in an algorithm.
Device language: get_{global,group,local}_{id,size}(axis).
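The two-tiered model boils down to simple index arithmetic; a minimal sketch for a 1D launch (sizes are illustrative, not from the slides):

```python
# Two-tiered index arithmetic for a 1D launch (illustrative sizes).
N = (3,)  # work groups per axis (the grid)
S = (4,)  # work items per group per axis (the work group size)

def get_global_id(axis, group_id, local_id):
    # mirrors get_group_id(axis) * get_local_size(axis) + get_local_id(axis)
    return group_id * S[axis] + local_id

# Enumerating every (group, item) pair covers each global id exactly once:
ids = [get_global_id(0, g, l) for g in range(N[0]) for l in range(S[0])]
assert ids == list(range(N[0] * S[0]))  # S_x * N_x work items in total
```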
  17.-30. OpenCL: Computing as a Service (diagram, built up over several slides)
The host (CPU) connects to several platforms, each containing compute devices; e.g. Platform 0: CPUs, Platform 1: GPUs.
Compute Device: think "chip", has a memory interface; each device has its own memory.
Compute Unit: think "processor", has instruction fetch.
Processing Element: think "SIMD lane".
Python runs on the host; the device language is ~C99.
  31. OpenCL Object Diagram
[Figure 2.1: OpenCL UML class diagram]
Credit: Khronos Group
  32. Why do Scripting for GPUs?
GPUs are everything that scripting languages are not: highly parallel, very architecture-sensitive, built for maximum FP/memory throughput.
→ The two complement each other.
CPU: largely restricted to control tasks (~1000/sec); scripting is fast enough for that.
Python + CUDA = PyCUDA; Python + OpenCL = PyOpenCL.
  33. Outline
1 Introduction
2 Programming with PyOpenCL: First Contact; About PyOpenCL
3 Run-Time Code Generation
4 Perspectives
  35. Dive into PyOpenCL

    import pyopencl as cl
    import numpy

    a = numpy.random.rand(256**3).astype(numpy.float32)

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
    cl.enqueue_write_buffer(queue, a_dev, a)

    prg = cl.Program(ctx, """
        __kernel void twice(__global float *a)
        { a[get_global_id(0)] *= 2; }
        """).build()

    prg.twice(queue, a.shape, (1,), a_dev)
  36. Dive into PyOpenCL (same code as slide 35, with the embedded __kernel source annotated as the compute kernel)
  37. Dive into PyOpenCL (continued)

    a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
    cl.enqueue_write_buffer(queue, a_dev, a)

    prg = cl.Program(ctx, """
        __kernel void twice(__global float *a)
        { a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2; }
        """).build()

    prg.twice(queue, a.shape, (256,), a_dev)

    result = numpy.empty_like(a)
    cl.enqueue_read_buffer(queue, a_dev, result).wait()
    import numpy.linalg as la
    assert la.norm(result - 2*a) == 0
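The only change from the version on slide 35 is the index computation inside the kernel. A pure-Python/NumPy sketch shows how get_local_id/get_local_size/get_group_id tile the array in this version; sizes are illustrative, and no GPU is needed to follow the arithmetic:

```python
import numpy as np

# Illustrative sizes: 4 work groups of 256 work items each.
local_size = 256
n_groups = 4
a = np.random.rand(n_groups * local_size).astype(np.float32)

out = a.copy()
for group_id in range(n_groups):          # get_group_id(0)
    for local_id in range(local_size):    # get_local_id(0)
        # get_local_id(0) + get_local_size(0)*get_group_id(0)
        i = local_id + local_size * group_id
        out[i] *= 2

assert np.allclose(out, 2 * a)  # every entry is touched exactly once
```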
  38. Outline
1 Introduction
2 Programming with PyOpenCL: First Contact; About PyOpenCL
3 Run-Time Code Generation
4 Perspectives
  39. PyOpenCL: Completeness
PyOpenCL exposes all of OpenCL. For example:
Every GetInfo() query
Images and Samplers
Memory Maps
Profiling and Synchronization
GL Interop
  40. PyOpenCL: Completeness
PyOpenCL supports (nearly) every OS that has an OpenCL implementation: Linux, OS X, Windows.
Automatic Cleanup

• Reachable objects (memory, streams, ...) are never destroyed.
• Once unreachable, they are released at an unspecified future time.
• Scarce resources (memory) can be explicitly freed: obj.release()
• Correctly deals with multiple contexts and dependencies
  (based on OpenCL's reference counting).
PyOpenCL: Documentation
PyOpenCL Philosophy

• Provide complete access
• Automatically manage resources
• Provide abstractions
• Allow interactive use
• Check for and report errors automatically
• Integrate tightly with numpy
PyOpenCL, PyCUDA: Vital Information

• http://mathema.tician.de/software/pyopencl (or /pycuda)
• Complete documentation
• X Consortium License (no warranty, free for all use)
• Convenient abstractions: Arrays, Elementwise op., Reduction, Scan
• Requires: numpy, Python 2.4+ (Win/OS X/Linux)
• Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, ...)
Capturing Dependencies

    B = f(A)        P = p(B)
    C = g(B)        Q = q(B)
    E = f(C)
    F = h(C)
    G = g(E, F)     R = r(G, P, Q)

(The slide shows these assignments as a dependency graph.)
Capturing Dependencies

• Switch the queue to out-of-order mode!
• Specify dependencies as a list of events, using the optional wait_for
  keyword to enqueue_XXX.
• Can also enqueue_barrier.
• Common use case: transmit/receive from other MPI ranks.
• Possible on Nvidia Fermi: submit parallel work to increase machine use.
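The payoff of an out-of-order queue is that independent enqueued work may overlap. A pure-Python model of the dependency graph from the slide (illustration only; no OpenCL device is involved) shows which computations become ready together, mirroring what wait_for event lists would allow the runtime to run concurrently:

```python
# Each task lists the events (prior results) it must wait for, like
# PyOpenCL's wait_for= keyword. Tasks within one "wave" are mutually
# independent and could be overlapped by an out-of-order queue.
deps = {
    "B": ["A"], "C": ["B"], "E": ["C"], "F": ["C"],
    "G": ["E", "F"], "P": ["B"], "Q": ["B"], "R": ["G", "P", "Q"],
}

done = {"A"}          # the input A is already available
waves = []
while len(done) < len(deps) + 1:
    ready = [t for t in deps
             if t not in done and all(d in done for d in deps[t])]
    waves.append(sorted(ready))
    done.update(ready)

print(waves)
```

After B completes, C, P and Q are all ready at once; after C, both E and F are.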
Outline

1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation
    The Idea
    RTCG in Action
4 Perspectives
Metaprogramming

Idea: In GPU scripting, GPU code does not need to be a compile-time
constant. (Key: Code is data; it wants to be reasoned about at run time.)

The slide builds up a pipeline: a human writes Python code, the Python code
generates GPU code, the GPU compiler turns it into a GPU binary, and the GPU
runs it to produce the result. Feedback from the machine can flow back into
the Python code, which makes the loop good for code generation. PyCUDA and
PyOpenCL both support exactly this.
Machine-generated Code

Why machine-generate code?
• Automated tuning (cf. ATLAS, FFTW)
• Data types
• Specialize code for given problem
• Constants faster than variables (→ register pressure)
• Loop unrolling
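Two of these points, baking constants into the source and loop unrolling, can be sketched with nothing but string substitution. This is a minimal pure-Python illustration (the kernel name accum and the unroll factor are made up for this example, not taken from the slides):

```python
# Generate OpenCL C source with a compile-time-constant unroll factor.
# The generated kernel body repeats the update `unroll` times, so the
# constant never has to live in a register at run time.
unroll = 4
body = "\n".join(
    "    dest[i + %d] += src[i + %d];" % (j, j) for j in range(unroll))

source = """__kernel void accum(__global float *dest,
                    __global const float *src)
{
    int i = %(unroll)d * get_global_id(0);
%(body)s
}""" % {"unroll": unroll, "body": body}

print(source)
```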
PyOpenCL: Support for Metaprogramming

Three (main) ways of generating code:
• Simple %-operator substitution
    Combine with C preprocessor: simple, often sufficient
• Use a templating engine (Mako works very well)
• codepy: build C syntax trees from Python
    Generates readable, indented C

Many ways of evaluating code; the most important one:
• Exact device timing via events
Outline

1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation
    The Idea
    RTCG in Action
4 Perspectives
PyOpenCL Arrays: General Usage

Remember your first PyOpenCL program? Abstraction is good:

    import numpy
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
    a_doubled = (2*a_gpu).get()
    print a_doubled
    print a_gpu
pyopencl.array: Simple Linear Algebra

pyopencl.array.Array: meant to look and feel just like numpy.
• p.a.to_device(ctx, queue, numpy_array)
• numpy_array = ary.get()
• +, -, *, /, fill, sin, arange, exp, rand, ...
• Mixed types (int32 + float32 = float64)
• print cl_array for debugging
• Allows access to raw bits: use as kernel arguments, memory maps
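The mixed-type rule follows numpy's promotion behavior, which can be checked in plain numpy without an OpenCL device (illustration only):

```python
import numpy

# int32 + float32 promotes to float64, since float32 cannot represent
# every int32 exactly; pyopencl.array mirrors this numpy rule.
a = numpy.array([1, 2], dtype=numpy.int32)
b = numpy.array([1.5, 2.5], dtype=numpy.float32)
assert (a + b).dtype == numpy.float64
```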
pyopencl.elementwise: Elementwise expressions

Avoiding extra store-fetch cycles for elementwise math:

    n = 10000
    a_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(n).astype(numpy.float32))
    b_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(n).astype(numpy.float32))

    from pyopencl.elementwise import ElementwiseKernel
    lin_comb = ElementwiseKernel(ctx,
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]")

    c_gpu = cl_array.empty_like(a_gpu)
    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

    import numpy.linalg as la
    assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
RTCG via Substitution

    source = ("""
        __kernel void %(name)s(%(arguments)s)
        {
            unsigned lid = get_local_id(0);
            unsigned gsize = get_global_size(0);
            unsigned work_item_start = get_local_size(0)*get_group_id(0);
            for (unsigned i = work_item_start + lid; i < n; i += gsize)
            {
                %(operation)s;
            }
        }
        """ % {
            "arguments": ", ".join(arg.declarator() for arg in arguments),
            "operation": operation,
            "name": name,
            "loop_prep": loop_prep,
            })

    prg = cl.Program(ctx, source).build()
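Filled in with stand-in values (the kernel name, argument list, and operation below are hypothetical inputs chosen for this illustration, not from the slide), the substitution yields ordinary OpenCL C source:

```python
# The same grid-stride-loop template as above, with concrete values
# substituted via the %-operator; pure string manipulation, no GPU needed.
source = """
__kernel void %(name)s(%(arguments)s)
{
    unsigned lid = get_local_id(0);
    unsigned gsize = get_global_size(0);
    unsigned work_item_start = get_local_size(0)*get_group_id(0);
    for (unsigned i = work_item_start + lid; i < n; i += gsize)
        %(operation)s;
}
""" % {
    "name": "twice",
    "arguments": "__global float *a, unsigned n",
    "operation": "a[i] *= 2",
}

print(source)
```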
RTCG via Templates

    from mako.template import Template

    tpl = Template("""
        __kernel void add(
                __global ${ type_name } *tgt,
                __global const ${ type_name } *op1,
                __global const ${ type_name } *op2)
        {
          int idx = get_local_id(0)
            + ${ local_size } * ${ thread_strides }
            * get_group_id(0);

          % for i in range(thread_strides):
              <% offset = i*local_size %>
              tgt[idx + ${ offset }] =
                  op1[idx + ${ offset }]
                  + op2[idx + ${ offset }];
          % endfor
        }""")

    rendered_tpl = tpl.render(type_name="float",
        local_size=local_size, thread_strides=thread_strides)

    knl = cl.Program(ctx, str(rendered_tpl)).build().add
pyopencl.reduction: Reduction made easy

Example: a dot product calculation

    from pyopencl.reduction import ReductionKernel
    dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="__global const float *x, __global const float *y")

    import pyopencl.clrandom as cl_rand
    x = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)
    y = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)

    x_dot_y = dot(x, y).get()
    x_dot_y_cpu = numpy.dot(x.get(), y.get())
pyopencl.scan: Scan made easy

Example: a cumulative sum computation

    from pyopencl.scan import InclusiveScanKernel
    knl = InclusiveScanKernel(ctx, np.int32, "a+b")

    n = 2**20 - 2**18 + 5
    host_data = np.random.randint(0, 10, n).astype(np.int32)
    dev_data = cl_array.to_device(queue, host_data)

    knl(dev_data)
    assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
Outline

1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation
4 Perspectives
    PyCUDA
    DG-FEM on the GPU
    "Automatic" GPU Programming
    Conclusions
Whetting your appetite

    import pycuda.driver as cuda
    import pycuda.autoinit, pycuda.compiler
    import numpy

    a = numpy.random.randn(4,4).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]
Whetting your appetite

    mod = pycuda.compiler.SourceModule("""
        __global__ void twice(float *a)      // compute kernel
        {
          int idx = threadIdx.x + threadIdx.y*4;
          a[idx] *= 2;
        }
        """)

    func = mod.get_function("twice")
    func(a_gpu, block=(4,4,1))

    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print a_doubled
    print a
PyOpenCL ↔ PyCUDA: A (rough) dictionary

    PyOpenCL                      PyCUDA
    Context                       Context
    CommandQueue                  Stream
    Buffer                        mem_alloc / DeviceAllocation
    Program                       SourceModule
    Kernel                        Function
    Event (e.g. enqueue_marker)   Event
Outline

1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation
4 Perspectives
    PyCUDA
    DG-FEM on the GPU
    "Automatic" GPU Programming
    Conclusions
Discontinuous Galerkin Method

Let $\Omega := \bigcup_k D_k \subset \mathbb{R}^d$.

Goal: solve a conservation law on $\Omega$:

    $u_t + \nabla \cdot F(u) = 0$

Example: Maxwell's equations. EM field $E(x,t)$, $H(x,t)$ on $\Omega$
governed by

    $\partial_t E - \frac{1}{\varepsilon} \nabla \times H
        = -\frac{j}{\varepsilon},
     \qquad
     \partial_t H + \frac{1}{\mu} \nabla \times E = 0,$

    $\nabla \cdot E = \frac{\rho}{\varepsilon},
     \qquad
     \nabla \cdot H = 0.$
Metaprogramming DG: Flux Terms

    $0 = \int_{D_k} u_t \varphi + [\nabla \cdot F(u)]\,\varphi \, dx
         - \int_{\partial D_k}
           [\hat{n} \cdot F - (\hat{n} \cdot F)^*]\,\varphi \, dS_x$

The surface integral is the flux term. Flux terms:
• vary by problem
• expression specified by user
• evaluated pointwise

Example: fluxes for Maxwell's equations

    $\hat{n} \cdot (F - F^*)_E :=
        \tfrac{1}{2}\, \hat{n} \times
        \bigl( [H] - \alpha\, \hat{n} \times [E] \bigr)$

where $[\,\cdot\,]$ denotes the jump across the element face
(interior minus exterior value).

User writes a vectorial statement in mathematical notation:

    flux = 1/2*cross(normal,
        h.int - h.ext - alpha*cross(normal, e.int - e.ext))

We generate a scalar evaluator in C (6×):

    a_flux += (
        (((val_a_field5 - val_b_field5)*fpair->normal[2]
            - (val_a_field4 - val_b_field4)*fpair->normal[0])
          + val_a_field0 - val_b_field0)*fpair->normal[0]
        - (((val_a_field4 - val_b_field4)*fpair->normal[1]
            - (val_a_field1 - val_b_field1)*fpair->normal[2])
          + val_a_field3 - val_b_field3)*fpair->normal[1]
        )*value_type(0.5);
Loop Slicing for element-local parts of GPU DG

Per block: $K_L$ element-local matrix multiplications + matrix load
(preparation).

Question: how should one assign work to threads?
• $w_s$: in sequence (amortize preparation)
• $w_i$: "inline-parallel" (exploit register space)
• $w_p$: in parallel
Loop Slicing for Differentiation

[Figure: local differentiation, matrix-in-shared, order 4, with
microblocking. Execution time (ms) as a function of $w_p$ and $w_s$;
point size denotes $w_i \in \{1, \ldots, 4\}$.]
Nvidia GTX 280 vs. single core of Intel Core 2 Duo E8400

[Figure: GFlops/s vs. polynomial order N, GPU vs. CPU.]
Memory Bandwidth on a GTX 280

[Figure: global memory bandwidth (GB/s) vs. polynomial order N for the
Gather, Lift, Diff, and Assy. stages, compared with peak.]
GPU DG Showcase

• Electromagnetism
• Poisson
• CFD
Outline

1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation
4 Perspectives
    PyCUDA
    DG-FEM on the GPU
    "Automatic" GPU Programming
    Conclusions
Automating GPU Programming

GPU programming can be time-consuming, unintuitive and error-prone.
Obvious idea: let the computer do it.

One way: smart compilers
• GPU programming requires complex tradeoffs
• Tradeoffs require heuristics
• Heuristics are fragile

Another way: dumb enumeration
• Enumerate loop slicings
• Enumerate prefetch options
• Choose by running resulting code on actual hardware
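The "dumb enumeration" loop can be sketched in pure Python. The parameter grid and the cost function below are made-up stand-ins for illustration; in a real tuner, measured_time would build the kernel variant for a given slicing and time it on the device (e.g. via profiling events):

```python
from itertools import product

def measured_time(wp, ws):
    # Hypothetical stand-in: a real implementation would generate the
    # kernel for this (wp, ws) slicing, run it on the hardware, and
    # return the measured execution time.
    return abs(wp - 16) + abs(ws - 4)

# Enumerate candidate loop slicings and keep the empirically fastest one.
candidates = list(product([4, 8, 16, 32], [1, 2, 4, 8]))
best = min(candidates, key=lambda c: measured_time(*c))
print(best)
```

No heuristic is involved: the winner is simply whichever variant measures fastest.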
Loo.py Example

Empirical GPU loop optimization:

    a, b, c, i, j, k = [var(s) for s in "abcijk"]
    n = 500

    k = make_loop_kernel([
        LoopDimension("i", n),
        LoopDimension("j", n),
        LoopDimension("k", n),
        ], [
        (c[i+n*j], a[i+n*k]*b[k+n*j])
        ])

    gen_kwargs = {
        "min_threads": 128,
        "min_blocks": 32,
        }

→ Ideal case: finds a 160 GF/s kernel without human intervention.
Loo.py Status

Limited scope:
• Require input/output separation
• Kernels must be expressible using "loopy" model
  (i.e. indices decompose into "output" and "reduction")
• Enough for DG, LA, FD, ...

• Kernel compilation limits trial rate
• Non-goal: peak performance
• Good results currently for dense linear algebra and (some) DG subkernels
Outline

1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation
4 Perspectives
    PyCUDA
    DG-FEM on the GPU
    "Automatic" GPU Programming
    Conclusions
Where to from here?

PyCUDA, PyOpenCL, hedge → http://www.cims.nyu.edu/~kloeckner/

• GPU RTCG: AK, N. Pinto et al., "PyCUDA: GPU Run-Time Code Generation
  for High-Performance Computing", submitted.
• GPU-DG article: AK, T. Warburton, J. Bridge, J. S. Hesthaven, "Nodal
  Discontinuous Galerkin Methods on Graphics Processors", J. Comp. Phys.
  228 (21), 7863-7882.
• Also: intro in GPU Computing Gems, Vol. 2.
Conclusions

• GPUs: an architecture choice that is now widely available
• Fun time to be in computational science
• GPUs and scripting work surprisingly well together:
  exploit a natural task decomposition in computational codes
• RTCG: crucial tool
• GPU scripting great for prototyping ... and just as suitable for
  production code
Questions?

Thank you for your attention!

http://www.cims.nyu.edu/~kloeckner/
Image Credits

• Dictionary: sxc.hu/topfer
• C870 GPU: Nvidia Corp.
• OpenCL Logo: Apple Corp./Ars Technica
• OS Platforms: flickr.com/aOliN.Tk
• Old Books: flickr.com/ppdigital
• Floppy disk: flickr.com/ethanhein
• Machine: flickr.com/13521837@N00
• Adding Machine: flickr.com/thomashawk
Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs

[Figure: flop rates for 16 GPUs vs. 64 CPU cores, GFlops/s vs.
polynomial order N.]
Outline

5 OpenCL implementations
The Nvidia CL implementation

Targets only GPUs.

Notes:
• Nearly identical to CUDA
• No native C-level JIT in CUDA (→ PyCUDA)
• Page-locked memory: use CL_MEM_ALLOC_HOST_PTR
    Careful: double meaning
    Need page-locked memory for genuinely overlapped transfers
• No linear memory texturing
• CUDA device emulation mode deprecated
  → use AMD CPU CL (faster, too!)
The Apple CL implementation

Targets CPUs and GPUs.

General notes:
• Different header name: OpenCL/cl.h instead of CL/cl.h
• Use -framework OpenCL for C access
• Beware of imperfect compiler cache implementation
  (ignores include files)

CPU notes:
• One work item per processor

GPU: similar to the hardware vendor's implementation.
(New: Intel, with Sandy Bridge)
The AMD CL implementation

Targets CPUs and GPUs (from both AMD and Nvidia).

GPU notes:
• Wide SIMD groups (64)
• Native 4/5-wide vectors
• But: very flop-heavy machine; may ignore vectors for memory-bound
  workloads
• → Both implicit and explicit SIMD

CPU notes:
• Many work items per processor (emulated)

General: cl_amd_printf