A beginner’s guide to programming GPUs with CUDA

                               Mike Peardon

                            School of Mathematics
                            Trinity College Dublin


                               April 24, 2009




 Mike Peardon (TCD)   A beginner’s guide to programming GPUs with CUDA   April 24, 2009   1 / 20
What is a GPU?



                         Graphics Processing Unit
   Processor dedicated to rapid rendering of polygons - texturing,
   shading
   They are mass-produced, so very cheap: 1 Tflop peak ≈ EUR 1k
   They have lots of compute cores, but a simpler architecture
   compared to a standard CPU
   The “shader pipeline” can be used to do floating-point calculations
   −→ cheap scientific/technical computing




What is a GPU? (2)




What is CUDA?



                        Compute Unified Device Architecture
   Extension to the C programming language
   Adds library functions to access the GPU
   Adds directives to translate C into instructions that run on the host
   CPU or on the GPU, as needed
   Allows easy multi-threading - parallel execution on all thread
   processors on the GPU




Will CUDA work on my PC/laptop?




   CUDA works on modern nVidia cards (Quadro, GeForce, Tesla)
   See
   http://www.nvidia.com/object/cuda_learn_products.html




nVidia’s compiler - nvcc



    CUDA code must be compiled using nvcc
    nvcc generates instructions for both the host and the GPU (the PTX
    instruction set), as well as the instructions that move data back and
    forth between them
    Standard CUDA install: /usr/local/cuda/bin/nvcc
    The shell executing compiled code needs the dynamic linker path -
    set the LD_LIBRARY_PATH environment variable to include
    /usr/local/cuda/lib




Simple overview
     PC Motherboard                              GPU
    +--------------------------+             +------------------+
    | Network, Disk, etc       |   PCI bus   |  Multiprocessors |
    |       |                  |<===========>|        |         |
    |      CPU --- Memory      |             |      Memory      |
    +--------------------------+             +------------------+
   GPU can’t directly access main memory
   CPU can’t directly access GPU memory
   Need to explicitly copy data
   No printf!
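
   Since neither processor can reach the other's memory, every transfer must
   be spelled out with the runtime's copy calls. A minimal sketch (variable
   names are illustrative, not from the slides):

   ```cuda
   #include <stdlib.h>
   #include <cuda_runtime.h>

   int main(void)
   {
       const int n = 1024;
       size_t bytes = n * sizeof(float);

       float *host_a = (float *) malloc(bytes);  /* ordinary host memory  */
       float *gpu_a;                             /* pointer to GPU memory */
       cudaMalloc((void **) &gpu_a, bytes);      /* allocate on the GPU   */

       /* the GPU cannot read host_a directly - copy across the PCI bus */
       cudaMemcpy(gpu_a, host_a, bytes, cudaMemcpyHostToDevice);

       /* ... launch kernels on gpu_a here ... */

       /* results must come back the same way */
       cudaMemcpy(host_a, gpu_a, bytes, cudaMemcpyDeviceToHost);

       cudaFree(gpu_a);
       free(host_a);
       return 0;
   }
   ```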
   Mike Peardon (TCD)       A beginner’s guide to programming GPUs with CUDA   April 24, 2009   7 / 20
Writing some code (1) - specifying where code runs


   CUDA provides function type qualifiers (not in standard C/C++) that
   let the programmer define where a function should run:
   __host__ : the code runs on the host CPU (redundant on its
   own - it is the default)
   __device__ : the code runs on the GPU, and the function can only
   be called by code running on the GPU
   __global__ : the code runs on the GPU, but is called from the
   host - this is the access point for starting multi-threaded code
   running on the GPU
   The device can't execute code on the host!
   CUDA imposes some restrictions: device code is C-only (host
   code can be C++), device code can't be called recursively...
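
   The three qualifiers in use, as a minimal sketch (the function names here
   are illustrative):

   ```cuda
   #include <cuda_runtime.h>

   __host__ void on_cpu(void)           /* runs on the CPU (the default, */
   {                                    /* so __host__ is optional)      */
   }

   __device__ float square(float x)     /* runs on the GPU; callable     */
   {                                    /* only from GPU code            */
       return x * x;
   }

   __global__ void kernel(float *a)     /* runs on the GPU; launched     */
   {                                    /* from the host                 */
       a[threadIdx.x] = square(a[threadIdx.x]);
   }
   ```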


Code execution




Writing some code (2) - launching a __global__ function
   All calls to a __global__ function must specify how many threaded
   copies to launch and in what configuration.
   CUDA syntax: <<< >>>
   Threads are grouped into thread blocks, then into a grid of blocks
   This defines a memory hierarchy (important for performance)




The thread/block/grid model




Writing some code (3) - launching a __global__ function


   Inside the <<< >>>, at least two arguments are needed (there can be
   two more, which have default values)
   A call looks e.g. like my_func<<<bg, tb>>>(arg1, arg2)
   bg specifies the dimensions of the block grid and tb specifies the
   dimensions of each thread block
   bg and tb are both of type dim3 (a new datatype defined by CUDA:
   three unsigned ints, where any unspecified component defaults to 1)
   dim3 has struct-like access - its members are x, y and z
   CUDA provides a constructor: dim3 mygrid(2,2); sets mygrid.x=2,
   mygrid.y=2 and mygrid.z=1
   1-d syntax is allowed: my_func<<<5, 6>>>() makes 5 blocks (in a linear
   array) with 6 threads each and runs my_func on them all.
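
   Both launch forms side by side, as a sketch (my_func here is an empty
   placeholder kernel):

   ```cuda
   #include <cuda_runtime.h>

   __global__ void my_func(void) { }

   int main(void)
   {
       dim3 bg(2, 2);            /* grid of 2x2 blocks (bg.z defaults to 1) */
       dim3 tb(8, 8);            /* each block holds 8x8 threads            */

       my_func<<<bg, tb>>>();    /* 4 blocks x 64 threads = 256 threads     */

       my_func<<<5, 6>>>();      /* 1-d shorthand: 5 blocks of 6 threads    */

       cudaDeviceSynchronize();  /* wait for the launches to finish         */
       return 0;
   }
   ```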


Writing some code (4) - built-in variables on the GPU



    For code running on the GPU (__device__ and __global__), some
    variables are predefined, allowing threads to locate themselves inside
    their blocks and grids:
    dim3 gridDim - dimensions of the grid
    uint3 blockIdx - location of this block in the grid
    dim3 blockDim - dimensions of the blocks
    uint3 threadIdx - location of this thread in its block
    int warpSize - number of threads in a warp
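
    The usual idiom built from these variables is a unique global index per
    thread - a sketch (kernel and array names are illustrative):

    ```cuda
    __global__ void fill(float *a, int n)
    {
        /* unique global index for this thread: which block we are in,
           how wide blocks are, and where we sit within our block */
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n)           /* guard: the grid may overshoot n */
            a[i] = (float) i;
    }
    ```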




Writing some code (5) - where variables are stored



    For code running on the GPU (__device__ and __global__), the
    memory used to hold a variable can be specified:
    __device__ : the variable resides in the GPU’s global memory and is
    defined while the code runs
    __constant__ : the variable resides in the constant memory space of
    the GPU and is defined while the code runs
    __shared__ : the variable resides in the shared memory of the thread
    block and has the same lifespan as the block
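
    The three variable qualifiers in one sketch - a per-block sum using
    shared memory (names and sizes are illustrative; the block size must
    match the buf array):

    ```cuda
    __device__   float d_scale = 2.0f;  /* global memory, visible to all threads */
    __constant__ float c_table[16];     /* constant memory, read-only in kernels */

    __global__ void block_sum(float *in, float *out)
    {
        __shared__ float buf[128];      /* shared memory, one copy per block */

        buf[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
        __syncthreads();                /* wait until the whole block has written */

        if (threadIdx.x == 0) {         /* one thread sums the block's values */
            float s = 0.0f;
            for (int j = 0; j < blockDim.x; j++)
                s += buf[j];
            out[blockIdx.x] = s * d_scale;
        }
    }
    ```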




Example - vector adder
Start:
#include <stdlib.h>
#include <stdio.h>

#define N 1000
#define NBLOCK 10
#define NTHREAD 10


      Define the kernel to execute on the GPU
__global__
void adder(int n, float* a, float *b)
// a=a+b - thread code - add n numbers per thread
{
  int i,off = (N * blockIdx.x ) / NBLOCK +
    (threadIdx.x * N) / (NBLOCK * NTHREAD);

    for (i=off;i<off+n;i++)
    {
      a[i] = a[i] + b[i];
    }
}
Example - vector adder (2)


    Call using

  cudaMemcpy(gpu_a, host_a, sizeof(float) * n,
      cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_b, host_b, sizeof(float) * n,
      cudaMemcpyHostToDevice);

  adder<<<NBLOCK, NTHREAD>>>(n / (NBLOCK * NTHREAD), gpu_a, gpu_b);

  cudaMemcpy(host_c, gpu_a, sizeof(float) * n,
      cudaMemcpyDeviceToHost);


    Need the cudaMemcpy calls to push/pull the data on/off the GPU.
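
    The slides omit the surrounding allocation; a sketch of the full host
    side (the host_c result buffer and initial values are assumptions, not
    from the slides):

    ```cuda
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N 1000
    #define NBLOCK 10
    #define NTHREAD 10

    __global__ void adder(int n, float *a, float *b);  /* kernel from above */

    int main(void)
    {
        int n = N;
        float *host_a = (float *) malloc(sizeof(float) * n);
        float *host_b = (float *) malloc(sizeof(float) * n);
        float *host_c = (float *) malloc(sizeof(float) * n);
        float *gpu_a, *gpu_b;

        for (int i = 0; i < n; i++) {         /* some test data */
            host_a[i] = (float) i;
            host_b[i] = 2.0f * i;
        }

        cudaMalloc((void **) &gpu_a, sizeof(float) * n);
        cudaMalloc((void **) &gpu_b, sizeof(float) * n);

        cudaMemcpy(gpu_a, host_a, sizeof(float) * n, cudaMemcpyHostToDevice);
        cudaMemcpy(gpu_b, host_b, sizeof(float) * n, cudaMemcpyHostToDevice);

        adder<<<NBLOCK, NTHREAD>>>(n / (NBLOCK * NTHREAD), gpu_a, gpu_b);

        cudaMemcpy(host_c, gpu_a, sizeof(float) * n, cudaMemcpyDeviceToHost);

        cudaFree(gpu_a); cudaFree(gpu_b);
        free(host_a); free(host_b); free(host_c);
        return 0;
    }
    ```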



arXiv:0810.5365 Barros et al.

             “Blasting through lattice calculations using CUDA”
    An implementation of an important compute kernel for lattice QCD -
    the Wilson-Dirac operator. This is a sparse linear operator that
    represents the kinetic energy operator in a discrete version of the
    quantum field theory of relativistic quarks (interacting with gluons).
    Usually, performance is limited by memory bandwidth (and
    inter-processor communications).
    Data is stored in the GPU’s memory.
    The “atom” of data is the spinor of the field on one site. This is 12
    complex numbers (3 colours for 4 spins).
    They use the float4 CUDA data primitive, which packs four floating-
    point numbers efficiently. An array of 6 float4 types then holds one
    lattice site of the quark field.


arXiv:0810.5365 Barros et al. (2)

Performance issues:
  1   16 threads can read 16 contiguous memory elements very efficiently -
      their implementation with 6 arrays for the spinor allows this
      contiguous access
  2   GPUs do not have caches; rather, they have a small but fast shared
      memory. Access is managed by software instructions.
  3   The GPU has a very efficient thread manager, which can schedule
      multiple threads to run within the cores of a multiprocessor. Best
      performance comes when the number of threads is (much) larger than
      the number of cores.
  4   The local shared memory space is 16 kB - not enough! Barros et al.
      also use the registers on the multiprocessors (8,192 of them).
      Unfortunately, this means they have to hand-unroll all their loops!


arXiv:0810.5365 Barros et al. (3)

Performance: (even-odd) Wilson operator




arXiv:0810.5365 Barros et al. (4)

Performance: Conjugate Gradient solver:




Conclusions


   The GPU offers a very impressive architecture for scientific computing
   on a single chip.
   Peak performance is now close to 1 TFlop for less than EUR 1,000.
   CUDA is an extension to C that allows multi-threaded software to
   execute on modern nVidia GPUs. There are alternatives for other
   manufacturers’ hardware, and proposed architecture-independent
   schemes (like OpenCL).
   Efficient use of the hardware is challenging; threads must be
   scheduled efficiently and synchronisation is slow. Memory access must
   be organised very carefully.
   The (near) future will be very interesting...




More Related Content

What's hot

Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesSubhajit Sahu
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Angela Mendoza M.
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009Randall Hand
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with RubyShin Yee Chung
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011Raymond Tay
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDASavith Satheesh
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
How to Burn Multi-GPUs using CUDA stress test memo
How to Burn Multi-GPUs using CUDA stress test memoHow to Burn Multi-GPUs using CUDA stress test memo
How to Burn Multi-GPUs using CUDA stress test memoNaoto MATSUMOTO
 

What's hot (18)

GPU: Understanding CUDA
GPU: Understanding CUDAGPU: Understanding CUDA
GPU: Understanding CUDA
 
Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : Notes
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
Cuda
CudaCuda
Cuda
 
Cuda
CudaCuda
Cuda
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
 
Cuda tutorial
Cuda tutorialCuda tutorial
Cuda tutorial
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with Ruby
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
CUDA Architecture
CUDA ArchitectureCUDA Architecture
CUDA Architecture
 
Introduction to cuda geek camp singapore 2011
Introduction to cuda   geek camp singapore 2011Introduction to cuda   geek camp singapore 2011
Introduction to cuda geek camp singapore 2011
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
Debugging CUDA applications
Debugging CUDA applicationsDebugging CUDA applications
Debugging CUDA applications
 
How to Burn Multi-GPUs using CUDA stress test memo
How to Burn Multi-GPUs using CUDA stress test memoHow to Burn Multi-GPUs using CUDA stress test memo
How to Burn Multi-GPUs using CUDA stress test memo
 
Cuda intro
Cuda introCuda intro
Cuda intro
 

Similar to A beginner’s guide to programming GPUs with CUDA

Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rFerdinand Jamitzky
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilersAnastasiaStulova
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to AcceleratorsDilum Bandara
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with GpuRohit Khatana
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAnirudhGarg35
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010John Holden
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 
GPU programming and Its Case Study
GPU programming and Its Case StudyGPU programming and Its Case Study
GPU programming and Its Case StudyZhengjie Lu
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
Share the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development BoardShare the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development BoardJian-Hong Pan
 

Similar to A beginner’s guide to programming GPUs with CUDA (20)

Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Tech Talk NVIDIA CUDA
Tech Talk NVIDIA CUDATech Talk NVIDIA CUDA
Tech Talk NVIDIA CUDA
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilers
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with Gpu
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
Hpc4
Hpc4Hpc4
Hpc4
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 
Cuda materials
Cuda materialsCuda materials
Cuda materials
 
GPU programming and Its Case Study
GPU programming and Its Case StudyGPU programming and Its Case Study
GPU programming and Its Case Study
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Share the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development BoardShare the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development Board
 

More from Piyush Mittal

More from Piyush Mittal (20)

Power mock
Power mockPower mock
Power mock
 
Design pattern tutorial
Design pattern tutorialDesign pattern tutorial
Design pattern tutorial
 
Reflection
ReflectionReflection
Reflection
 
Gpu archi
Gpu archiGpu archi
Gpu archi
 
Intel open mp
Intel open mpIntel open mp
Intel open mp
 
Intro to parallel computing
Intro to parallel computingIntro to parallel computing
Intro to parallel computing
 
Cuda toolkit reference manual
Cuda toolkit reference manualCuda toolkit reference manual
Cuda toolkit reference manual
 
Matrix multiplication using CUDA
Matrix multiplication using CUDAMatrix multiplication using CUDA
Matrix multiplication using CUDA
 
Channel coding
Channel codingChannel coding
Channel coding
 
Basics of Coding Theory
Basics of Coding TheoryBasics of Coding Theory
Basics of Coding Theory
 
Java cheat sheet
Java cheat sheetJava cheat sheet
Java cheat sheet
 
Google app engine cheat sheet
Google app engine cheat sheetGoogle app engine cheat sheet
Google app engine cheat sheet
 
Git cheat sheet
Git cheat sheetGit cheat sheet
Git cheat sheet
 
Vi cheat sheet
Vi cheat sheetVi cheat sheet
Vi cheat sheet
 
Css cheat sheet
Css cheat sheetCss cheat sheet
Css cheat sheet
 
Cpp cheat sheet
Cpp cheat sheetCpp cheat sheet
Cpp cheat sheet
 
Ubuntu cheat sheet
Ubuntu cheat sheetUbuntu cheat sheet
Ubuntu cheat sheet
 
Php cheat sheet
Php cheat sheetPhp cheat sheet
Php cheat sheet
 
oracle 9i cheat sheet
oracle 9i cheat sheetoracle 9i cheat sheet
oracle 9i cheat sheet
 
Open ssh cheet sheat
Open ssh cheet sheatOpen ssh cheet sheat
Open ssh cheet sheat
 

Recently uploaded

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

A beginner’s guide to programming GPUs with CUDA

  • 1. A beginner’s guide to programming GPUs with CUDA Mike Peardon School of Mathematics Trinity College Dublin April 24, 2009 Mike Peardon (TCD) A beginner’s guide to programming GPUs with CUDA April 24, 2009 1 / 20
  • 2. What is a GPU? Graphics Processing Unit Processor dedicated to rapid rendering of polygons - texturing, shading They are mass-produced, so very cheap 1 Tflop peak ≈ EU 1k. They have lots of compute cores, but a simpler architecture cf a standard CPU The “shader pipeline” can be used to do floating point calculations −→ cheap scientific/technical computing Mike Peardon (TCD) A beginner’s guide to programming GPUs with CUDA April 24, 2009 2 / 20
  • 3. What is a GPU? (2) Mike Peardon (TCD) A beginner’s guide to programming GPUs with CUDA April 24, 2009 3 / 20
  • 4. What is CUDA? Compute Unified Device Architecture Extension to C programming language Adds library functions to access to GPU Adds directives to translate C into instructions that run on the host CPU or the GPU when needed Allows easy multi-threading - parallel execution on all thread processors on the GPU Mike Peardon (TCD) A beginner’s guide to programming GPUs with CUDA April 24, 2009 4 / 20
  • 5. Will CUDA work on my PC/laptop? CUDA works on modern nVidia cards (Quadro, GeForce, Tesla). See http://www.nvidia.com/object/cuda_learn_products.html
  • 6. nVidia’s compiler - nvcc. CUDA code must be compiled using nvcc. nvcc generates instructions for both the host CPU and the GPU (PTX instruction set), as well as instructions to send data back and forth between them. Standard CUDA install: /usr/local/cuda/bin/nvcc. The shell executing the compiled code needs the dynamic linker path: set the LD_LIBRARY_PATH environment variable to include /usr/local/cuda/lib.
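As a sketch, building and running a CUDA program from the shell might look like this (the source file name vecadd.cu is made up for illustration):

```
# Compile a .cu source with nvcc (assumed to be on the PATH)
nvcc -o vecadd vecadd.cu

# Let the dynamic linker find the CUDA runtime, then run
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib
./vecadd
```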
  • 7. Simple overview. [Diagram: PC motherboard with CPU and main memory; GPU with multiprocessors and its own memory; the two connected by the PCI bus; network, disk, etc. attached on the CPU side.] The GPU can’t directly access main memory, and the CPU can’t directly access GPU memory, so data must be copied explicitly. No printf!
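Since neither side can dereference the other’s pointers, every transfer is an explicit copy through the runtime API. A minimal sketch (array name and size are illustrative):

```cuda
#include <stdio.h>

int main(void)
{
  const int n = 1024;
  float host_x[1024];
  for (int i = 0; i < n; i++) host_x[i] = 1.0f;

  // Allocate a buffer in the GPU's memory ...
  float *gpu_x;
  cudaMalloc((void**)&gpu_x, sizeof(float) * n);

  // ... push the data across the PCI bus ...
  cudaMemcpy(gpu_x, host_x, sizeof(float) * n, cudaMemcpyHostToDevice);

  /* (kernel launches would go here) */

  // ... and pull results back; the host can't just read gpu_x directly.
  cudaMemcpy(host_x, gpu_x, sizeof(float) * n, cudaMemcpyDeviceToHost);
  cudaFree(gpu_x);
  return 0;
}
```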
  • 8. Writing some code (1) - specifying where code runs. CUDA provides function type qualifiers (not in C/C++) that let the programmer define where a function should run. __host__ : the code runs on the host CPU (redundant on its own - it is the default). __device__ : the code runs on the GPU, and the function can only be called by code running on the GPU. __global__ : the code runs on the GPU but is called from the host - this is the access point to start multi-threaded codes running on the GPU. The device can’t execute code on the host! CUDA imposes some restrictions, such as: device code is C-only (host code can be C++), device code can’t be called recursively, ...
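A sketch of the three qualifiers in use (function names are made up for illustration):

```cuda
// Runs on the GPU, callable only from other GPU code.
__device__ float square(float x) { return x * x; }

// Runs on the GPU, launched from the host: the entry point for threads.
__global__ void square_all(float *a)
{
  a[threadIdx.x] = square(a[threadIdx.x]);
}

// Runs on the CPU; __host__ is the default and could be omitted.
__host__ void launch(float *gpu_a, int nthreads)
{
  square_all<<<1, nthreads>>>(gpu_a);
}
```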
  • 9. Code execution [figure]
  • 10. Writing some code (2) - launching a __global__ function. All calls to a __global__ function must specify how many threaded copies to launch and in what configuration, using the CUDA syntax <<< >>>. Threads are grouped into thread blocks, then into a grid of blocks. This defines a memory hierarchy (important for performance).
  • 11. The thread/block/grid model [figure]
  • 12. Writing some code (3) - launching a __global__ function. Inside the <<< >>>, at least two arguments are needed (there can be two more, which have default values). A call looks e.g. like my_func<<<bg, tb>>>(arg1, arg2). bg specifies the dimensions of the block grid and tb specifies the dimensions of each thread block. bg and tb are both of type dim3 (a new datatype defined by CUDA: three unsigned ints, where any unspecified component defaults to 1). dim3 has struct-like access - members are x, y and z. CUDA provides a constructor: dim3 mygrid(2,2); sets mygrid.x=2, mygrid.y=2 and mygrid.z=1. 1d syntax is allowed: myfunc<<<5, 6>>>() makes 5 blocks (in a linear array) with 6 threads each and runs myfunc on them all.
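Putting the launch syntax together (the kernel and dimensions here are illustrative, and gpu_a is assumed to have been allocated with cudaMalloc):

```cuda
__global__ void my_kernel(float *a) { /* ... thread code ... */ }

int main(void)
{
  dim3 bg(2, 2);   // grid of 2x2 = 4 blocks (bg.z defaults to 1)
  dim3 tb(8, 8);   // each block holds 8x8 = 64 threads

  float *gpu_a;
  cudaMalloc((void**)&gpu_a, sizeof(float) * 256);

  my_kernel<<<bg, tb>>>(gpu_a);  // launches 4 * 64 = 256 threads

  my_kernel<<<5, 6>>>(gpu_a);    // 1d shorthand: 5 blocks of 6 threads

  cudaFree(gpu_a);
  return 0;
}
```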
  • 13. Writing some code (4) - built-in variables on the GPU. For code running on the GPU (__device__ and __global__), some variables are predefined, which allow threads to locate themselves inside their blocks and grids. dim3 gridDim : dimensions of the grid. uint3 blockIdx : location of this block in the grid. dim3 blockDim : dimensions of the blocks. uint3 threadIdx : location of this thread in the block. int warpSize : number of threads in a warp.
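A common idiom built from these variables is a unique global index per thread, so that each thread handles one array element. A sketch for a 1d launch (kernel name and arguments are illustrative):

```cuda
__global__ void scale(int n, float *a, float s)
{
  // blockIdx.x picks the block, threadIdx.x the thread within it;
  // together they give a unique index into the array.
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  if (i < n)        // guard: the last block may be only partly used
    a[i] = s * a[i];
}
```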
  • 14. Writing some code (5) - where variables are stored. For code running on the GPU (__device__ and __global__), the memory used to hold a variable can be specified. __device__ : the variable resides in the GPU’s global memory and is defined while the code runs. __constant__ : the variable resides in the constant memory space of the GPU and is defined while the code runs. __shared__ : the variable resides in the shared memory of the thread block and has the same lifespan as the block.
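A sketch of __shared__ in use: each block stages a tile of data in the fast per-block memory before operating on it (the kernel and tile size are made up for illustration):

```cuda
#define TILE 64

__global__ void reverse_tile(float *a)
{
  // One tile per block, visible to every thread in the block.
  __shared__ float tmp[TILE];

  int t = threadIdx.x;
  tmp[t] = a[blockIdx.x * TILE + t];
  __syncthreads();   // wait until the whole tile has been loaded

  // Write the tile back in reversed order.
  a[blockIdx.x * TILE + t] = tmp[TILE - 1 - t];
}
```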
  • 15. Example - vector adder. Start:
      #include <stdlib.h>
      #include <stdio.h>
      #define N 1000
      #define NBLOCK 10
      #define NTHREAD 10
    Define the kernel to execute on the GPU:
      __global__ void adder(int n, float* a, float *b)
      // a=a+b - thread code - add n numbers per thread
      {
        int i, off = (N * blockIdx.x) / NBLOCK + (threadIdx.x * N) / (NBLOCK * NTHREAD);
        for (i=off; i<off+n; i++)
        {
          a[i] = a[i] + b[i];
        }
      }
  • 16. Example - vector adder (2). Call using:
      cudaMemcpy(gpu_a, host_a, sizeof(float) * n, cudaMemcpyHostToDevice);
      cudaMemcpy(gpu_b, host_b, sizeof(float) * n, cudaMemcpyHostToDevice);
      adder<<<NBLOCK, NTHREAD>>>(n / (NBLOCK * NTHREAD), gpu_a, gpu_b);
      cudaMemcpy(host_c, gpu_a, sizeof(float) * n, cudaMemcpyDeviceToHost);
    Need the cudaMemcpy’s to push/pull the data on/off the GPU.
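One detail the slides gloss over: a kernel launch returns immediately and reports errors lazily. A hedged sketch of checking the launch above (continuing the slide’s example, with gpu_a and gpu_b as before; cudaThreadSynchronize is the runtime call of this CUDA generation):

```cuda
adder<<<NBLOCK, NTHREAD>>>(n / (NBLOCK * NTHREAD), gpu_a, gpu_b);

// The launch is asynchronous; errors surface on a later API call.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
  fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

// Block until the kernel has finished before copying results back.
cudaThreadSynchronize();
```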
  • 17. arXiv:0810.5365, Barros et al., “Blasting through lattice calculations using CUDA”. An implementation of an important compute kernel for lattice QCD - the Wilson-Dirac operator. This is a sparse linear operator that represents the kinetic energy operator in a discrete version of the quantum field theory of relativistic quarks (interacting with gluons). Usually, performance is limited by memory bandwidth (and inter-processor communications). Data is stored in the GPU’s memory. The “atom” of data is the spinor of a field on one site: 12 complex numbers (3 colours for 4 spins). They use the float4 CUDA data primitive, which packs four floating-point numbers efficiently; an array of 6 float4 types then holds one lattice site of the quark field.
  • 18. arXiv:0810.5365, Barros et al. (2). Performance issues: (1) 16 threads can read 16 contiguous memory elements very efficiently - their implementation of 6 arrays for the spinor allows this contiguous access. (2) GPUs do not have caches; rather they have a small but fast shared memory, with access managed by software instructions. (3) The GPU has a very efficient thread manager which can schedule multiple threads to run within the cores of a multiprocessor; best performance comes when the number of threads is (much) larger than the number of cores. (4) The local shared memory space is 16k - not enough! Barros et al. also use the registers on the multiprocessors (8,192 of them). Unfortunately, this means they have to hand-unroll all their loops!
  • 19. arXiv:0810.5365, Barros et al. (3). Performance: (even-odd) Wilson operator [figure: performance plot]
  • 20. arXiv:0810.5365, Barros et al. (4). Performance: Conjugate Gradient solver [figure: performance plot]
  • 21. Conclusions. The GPU offers a very impressive architecture for scientific computing on a single chip; peak performance is now close to 1 Tflop for less than EUR 1,000. CUDA is an extension to C that allows multi-threaded software to execute on modern nVidia GPUs. There are alternatives for other manufacturers’ hardware, and proposed architecture-independent schemes (like OpenCL). Efficient use of the hardware is challenging: threads must be scheduled efficiently, synchronisation is slow, and memory access must be defined very carefully. The (near) future will be very interesting...