This document provides an introduction and overview of NVIDIA's CUDA C programming language for GPU computing. It begins with basic terminology such as host/device and a simple "Hello, World!" example, then demonstrates how to write and launch CUDA kernels, manage GPU memory, and perform parallel vector addition using both thread blocks and threads. It closes with a dot-product example that introduces shared memory, __syncthreads() barriers, race conditions, and atomic operations. The key aspects covered are declaring device functions, the memory management APIs, launching kernels across blocks and threads, and using block/thread indices to index arrays in parallel.
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA and AMD GPUs, by Stefano Di Carlo
These slides have been presented by Dr. Alessandro Vallero at the IEEE VLSI Test Symposium, San Francisco, CA, USA (April 22-25, 2018).
General-purpose computing on graphics processing units (GPGPU) offers a remarkable speedup for data-parallel workloads by leveraging the GPU's computational power. Unlike graphics computing, however, it requires highly reliable operation in most application domains.
This presentation covers "Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA and AMD GPUs". The work is the outcome of a collaboration between the TestGroup of Politecnico di Torino (http://www.testgroup.polito.it) and the Computer Architecture Lab of the University of Athens (dscal.di.uoa.gr), started under the FP7 Clereco Project (http://www.clereco.eu). It presents an extended study, based on a consolidated workflow, of the reliability of four GPU architectures and corresponding chips (AMD Southern Islands and NVIDIA G80/GT200/Fermi) in correlation with their performance. We obtained reliability measurements (AVF and FIT) employing both fault injection and ACE analysis based on microarchitecture-level simulators. Beyond the reliability-only and performance-only measurements, we propose combined performance-reliability metrics (quantifying instruction throughput or task-execution throughput between failures) that assist comparisons of the same application among GPU chips of different ISAs and vendors, as well as among benchmarks on the same GPU chip.
Watch the presentation at: https://youtu.be/GV5xRDgfCw4
Paper Information:
Alessandro Vallero§ , Sotiris Tselonis, Dimitris Gizopoulos* and Stefano Di Carlo§, “Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA and AMD GPUs”, IEEE VLSI Test Symposium 2018 (VTS 2018), San Francisco, CA (USA), April 22-25, 2018.
§Politecnico di Torino, Italy. Email: {stefano.dicarlo, alessandro.vallero}@polito.it. *University of Athens, Greece. Email: dgizop@di.uoa.gr
This is a presentation I gave at our last GPGPU workshop, held in April 2013.
The use of GPGPU is expanding, creating a continuum from mobile to HPC. At the same time, the question is whether the GPGPU languages are the right ones (well, no), and whether we are wasting resources re-developing the same software stack instead of converging.
The presentation will introduce Nvidia and the concept of GPU computing in the context of Financial Services industry. Customer successes are referenced where dramatic speed-ups in performance have been achieved.
This presentation describes the components of the GPU compute ecosystem, provides an overview of existing ecosystems, and contains a case study on NVIDIA Nsight.
This presentation accompanies the webinar replay located here: http://bit.ly/1zmvlkL
AMD Media SDK Software Architect Mikhail Mironov shows you how to leverage an AMD platform for multimedia processing using the new Media Software Development Kit. He discusses how to use a new set of C++ interfaces for easy access to AMD hardware blocks, and shows you how to leverage the Media SDK in the development of video conferencing, wireless display, remote desktop, video editing, transcoding, and more.
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabilities, from AMD Developer Central
Presentation MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabilities, by Srikanth Gollapudi at the AMD Developer Summit (APU13) November 11-13, 2013.
http://cs264.org
Abstract:
High-level scripting languages are in many ways polar opposites to GPUs. GPUs are highly parallel, subject to hardware subtleties, and designed for maximum throughput, and they offer a tremendous advance in the performance achievable for a significant number of computational problems. On the other hand, scripting languages such as Python favor ease of use over computational speed and do not generally emphasize parallelism. PyOpenCL and PyCUDA are two packages that attempt to join the two together. By showing concrete examples, both at the toy and the whole-application level, this talk aims to demonstrate that by combining these opposites, a programming environment is created that is greater than just the sum of its two parts.
Speaker biography:
Andreas Klöckner obtained his PhD degree working with Jan Hesthaven at the Department of Applied Mathematics at Brown University. He worked on a variety of topics, all aiming to broaden the utility of discontinuous Galerkin (DG) methods. This included their use in the simulation of plasma physics and the demonstration of their particular suitability for computation on throughput-oriented graphics processors (GPUs). He also worked on multi-rate time stepping methods and shock capturing schemes for DG.
In the fall of 2010, he joined the Courant Institute of Mathematical Sciences at New York University as a Courant Instructor. There, he is working on problems in computational electromagnetics with Leslie Greengard.
His research interests include:
- Discontinuous Galerkin and integral equation methods for wave
propagation
- Programming tools for parallel architectures
- High-order unstructured particle-in-cell methods for plasma simulation
A graphics processing unit, or GPU (also occasionally called a visual processing unit, or VPU), is a specialized microprocessor that offloads and accelerates graphics rendering from the central processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a CPU, only a fraction of the chip does computation, whereas a GPU devotes more of its transistors to data processing.
GPGPU is a programming methodology based on modifying algorithms to run on existing GPU hardware for increased performance. Unfortunately, GPGPU programming is significantly more complex than traditional programming for several reasons.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2023/07/a-new-open-standards-based-open-source-programming-model-for-all-accelerators-a-presentation-from-codeplay-software/
Charles Macfarlane, Chief Business Officer at Codeplay Software, presents the “New, Open-standards-based, Open-source Programming Model for All Accelerators” tutorial at the May 2023 Embedded Vision Summit.
As demand for AI grows, developers are attempting to squeeze more and more performance from accelerators. Ideally, developers would choose the accelerators best suited to their applications. Unfortunately, today many developers are locked into limited hardware choices because they use proprietary programming models like NVIDIA’s CUDA. The oneAPI project was launched to create an open specification and open-source software that enables developers to write software using standard C++ code and deploy to GPUs from multiple vendors.
OneAPI is an open-source ecosystem based on the Khronos open-standard SYCL with libraries for enabling AI and HPC applications. OneAPI-enabled software is currently deployed on numerous supercomputers, with plans to extend into other market segments. OneAPI is evolving rapidly and the whole community of hardware and software developers is invited to contribute. In this presentation, Macfarlane introduces how oneAPI enables developers to write multi-target software and highlights opportunities for developers to contribute to making oneAPI available for all accelerators.
CUDA by Example: The Final Countdown: Notes, by Subhajit Sahu
Highlighted notes of:
Chapter 12: The Final Countdown
Book:
CUDA by Example
An Introduction to General Purpose GPU Computing
Authors:
Jason Sanders
Edward Kandrot
“This book is required reading for anyone working with accelerator-based computing systems.”
–From the Foreword by Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory
CUDA is a computing architecture designed to facilitate the development of parallel programs. In conjunction with a comprehensive software platform, the CUDA Architecture enables programmers to draw on the immense power of graphics processing units (GPUs) when building high-performance applications. GPUs, of course, have long been available for demanding graphics and game applications. CUDA now brings this valuable resource to programmers working on applications in other domains, including science, engineering, and finance. No knowledge of graphics programming is required–just the ability to program in a modestly extended version of C.
CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. The authors introduce each area of CUDA development through working examples. After a concise introduction to the CUDA platform and architecture, as well as a quick-start guide to CUDA C, the book details the techniques and trade-offs associated with each key CUDA feature. You’ll discover when to use each CUDA C extension and how to write CUDA software that delivers truly outstanding performance.
Table of Contents
Why CUDA? Why Now?
Getting Started
Introduction to CUDA C
Parallel Programming in CUDA C
Thread Cooperation
Constant Memory and Events
Texture Memory
Graphics Interoperability
Atomics
Streams
CUDA C on Multiple GPUs
The Final Countdown
All the CUDA software tools you’ll need are freely available for download from NVIDIA.
Jason Sanders is a senior software engineer in NVIDIA’s CUDA Platform Group. He helped develop early releases of CUDA system software and contributed to the OpenCL 1.0 Specification, an industry standard for heterogeneous computing. He has held positions at ATI Technologies, Apple, and Novell.
Edward Kandrot is a senior software engineer on NVIDIA’s CUDA Algorithms team. He has more than twenty years of industry experience optimizing code performance for firms including Adobe, Microsoft, Google, and Autodesk.
1. Introduction to CUDA C
Samuel Gateau | Seoul | December 16, 2010
NVIDIA GPU Technology, Siggraph Asia 2010
2. Who should you thank for this talk?
Jason Sanders
Senior Software Engineer, NVIDIA
Co-author of CUDA by Example
3. What is CUDA?
CUDA Architecture
— Expose general-purpose GPU computing as first-class capability
— Retain traditional DirectX/OpenGL graphics performance
CUDA C
— Based on industry-standard C
— A handful of language extensions to allow heterogeneous programs
— Straightforward APIs to manage devices, memory, etc.
This talk will introduce you to CUDA C
4. What will you learn today?
— Start from "Hello, World!"
— Write and launch CUDA C kernels
— Manage GPU memory
— Run parallel kernels in CUDA C
— Parallel communication and synchronization
— Race conditions and atomic operations
5. CUDA C Prerequisites
You (probably) need experience with C or C++
You do not need any GPU experience
You do not need any graphics experience
You do not need any parallel programming experience
6. CUDA C: The Basics
Terminology
Host – The CPU and its memory (host memory)
Device – The GPU and its memory (device memory)
[Figure: Host (CPU and host memory) next to Device (GPU and device memory); not to scale]
7. Hello, World!
int main( void ) {
printf( "Hello, World!\n" );
return 0;
}
This basic program is just standard C that runs on the host
NVIDIA’s compiler (nvcc) will not complain about CUDA programs with no device code
At its simplest, CUDA C is just C!
8. Hello, World! with Device Code
__global__ void kernel( void ) {
}
int main( void ) {
kernel<<<1,1>>>();
printf( "Hello, World!\n" );
return 0;
}
Two notable additions to the original "Hello, World!"
9. Hello, World! with Device Code
__global__ void kernel( void ) {
}
CUDA C keyword __global__ indicates that a function
— Runs on the device
— Called from host code
nvcc splits source file into host and device components
— NVIDIA’s compiler handles device functions like kernel()
— Standard host compiler handles host functions like main() (e.g., gcc or Microsoft Visual C)
10. Hello, World! with Device Code
int main( void ) {
kernel<<< 1, 1 >>>();
printf( "Hello, World!\n" );
return 0;
}
Triple angle brackets mark a call from host code to device code
— Sometimes called a "kernel launch"
— We’ll discuss the parameters inside the angle brackets later
This is all that’s required to execute a function on the GPU!
The function kernel() does nothing, so this is fairly anticlimactic…
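For reference (not on the slide), the complete example can be saved as hello.cu and built with NVIDIA's nvcc compiler; a minimal sketch:
#include <stdio.h>
__global__ void kernel( void ) {
}
int main( void ) {
    kernel<<< 1, 1 >>>();           // launch the (empty) kernel on the device
    printf( "Hello, World!\n" );    // runs on the host
    return 0;
}
// build and run:
//   nvcc -o hello hello.cu
//   ./hello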
11. A More Complex Example
A simple kernel to add two integers:
__global__ void add( int *a, int *b, int *c ) {
*c = *a + *b;
}
As before, __global__ is a CUDA C keyword meaning
— add() will execute on the device
— add() will be called from the host
12. A More Complex Example
Notice that we use pointers for our variables:
__global__ void add( int *a, int *b, int *c ) {
*c = *a + *b;
}
add() runs on the device… so a, b, and c must point to device memory
How do we allocate memory on the GPU?
13. Memory Management
Host and device memory are distinct entities
— Device pointers point to GPU memory
May be passed to and from host code
May not be dereferenced from host code
— Host pointers point to CPU memory
May be passed to and from device code
May not be dereferenced from device code
Basic CUDA API for dealing with device memory
— cudaMalloc(), cudaFree(), cudaMemcpy()
— Similar to their C equivalents, malloc(), free(), memcpy()
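One practical note the slides leave out: every CUDA API call returns a cudaError_t that can be checked. A minimal sketch (assuming <stdio.h> is included and size is defined as on the next slide):
cudaError_t err = cudaMalloc( (void**)&dev_a, size );
if( err != cudaSuccess ) {
    // report the failure in human-readable form
    fprintf( stderr, "cudaMalloc failed: %s\n", cudaGetErrorString( err ) );
    return 1;
}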
14. A More Complex Example: add()
Using our add() kernel:
__global__ void add( int *a, int *b, int *c ) {
*c = *a + *b;
}
Let’s take a look at main()…
15. A More Complex Example: main()
int main( void ) {
int a, b, c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
int size = sizeof( int ); // we need space for an integer
// allocate device copies of a, b, c
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, size );
a = 2;
b = 7;
16. A More Complex Example: main() (cont)
// copy inputs to device
cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice );
// launch add() kernel on GPU, passing parameters
add<<< 1, 1 >>>( dev_a, dev_b, dev_c );
// copy device result back to host copy of c
cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost );
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
return 0;
}
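Not shown on the slide: after the device-to-host copy, the host variable c holds the result and can be verified, e.g.:
printf( "%d + %d = %d\n", a, b, c );   // prints "2 + 7 = 9"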
17. Parallel Programming in CUDA C
But wait…GPU computing is about massive parallelism
So how do we run code in parallel on the device?
Solution lies in the parameters between the triple angle brackets:
add<<< 1, 1 >>>( dev_a, dev_b, dev_c );
add<<< N, 1 >>>( dev_a, dev_b, dev_c );
Instead of executing add() once, add() is executed N times in parallel
18. Parallel Programming in CUDA C
With add() running in parallel…let’s do vector addition
Terminology: Each parallel invocation of add() is referred to as a block
Kernel can refer to its block’s index with the variable blockIdx.x
Each block adds a value from a[] and b[], storing the result in c[]:
__global__ void add( int *a, int *b, int *c ) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
By using blockIdx.x to index arrays, each block handles different indices
19. Parallel Programming in CUDA C
We write this code:
__global__ void add( int *a, int *b, int *c ) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
This is what runs in parallel on the device:
Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];
20. Parallel Addition: add()
Using our newly parallelized add() kernel:
__global__ void add( int *a, int *b, int *c ) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
Let’s take a look at main()…
21. Parallel Addition: main()
#define N 512
int main( void ) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
int size = N * sizeof( int ); // we need space for 512 integers
// allocate device copies of a, b, c
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, size );
a = (int*)malloc( size );
b = (int*)malloc( size );
c = (int*)malloc( size );
random_ints( a, N );
random_ints( b, N );
22. Parallel Addition: main() (cont)
// copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
// launch add() kernel with N parallel blocks
add<<< N, 1 >>>( dev_a, dev_b, dev_c );
// copy device result back to host copy of c
cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
free( a ); free( b ); free( c );
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
return 0;
}
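The slides call random_ints() but never define it; a minimal host-side version (a hypothetical helper matching the usage above) might be:
#include <stdlib.h>
// fill an n-element array with small random integers
void random_ints( int *p, int n ) {
    for( int i = 0; i < n; i++ )
        p[i] = rand() % 100;
}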
23. Review
Difference between "host" and "device"
— Host = CPU
— Device = GPU
Using __global__ to declare a function as device code
— Runs on device
— Called from host
Passing parameters from host code to a device function
24. Review (cont)
Basic device memory management
— cudaMalloc()
— cudaMemcpy()
— cudaFree()
Launching parallel kernels
— Launch N copies of add() with: add<<< N, 1 >>>();
— Used blockIdx.x to access block’s index
25. Threads
Terminology: A block can be split into parallel threads
Let’s change vector addition to use parallel threads instead of parallel blocks:
__global__ void add( int *a, int *b, int *c ) {
c[ threadIdx.x ] = a[ threadIdx.x ] + b[ threadIdx.x ];
}
We use threadIdx.x instead of blockIdx.x in add()
main() will require one change as well…
26. Parallel Addition (Threads): main()
#define N 512
int main( void ) {
int *a, *b, *c; //host copies of a, b, c
int *dev_a, *dev_b, *dev_c; //device copies of a, b, c
int size = N * sizeof( int ); //we need space for 512 integers
// allocate device copies of a, b, c
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, size );
a = (int*)malloc( size );
b = (int*)malloc( size );
c = (int*)malloc( size );
random_ints( a, N );
random_ints( b, N );
27. Parallel Addition (Threads): main() (cont)
// copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
// launch add() kernel with N threads (previously: add<<< N, 1 >>> with N blocks)
add<<< 1, N >>>( dev_a, dev_b, dev_c );
// copy device result back to host copy of c
cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
free( a ); free( b ); free( c );
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
return 0;
}
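One caveat not mentioned on the slide: a block can hold only a limited number of threads (512 on GPUs of that era, 1024 on later hardware), so <<< 1, N >>> works only for small N. The limit can be queried at run time; a sketch:
cudaDeviceProp prop;
cudaGetDeviceProperties( &prop, 0 );   // properties of device 0
printf( "max threads per block: %d\n", prop.maxThreadsPerBlock );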
28. Using Threads And Blocks
We’ve seen parallel vector addition using
— Many blocks with 1 thread apiece
— 1 block with many threads
Let’s adapt vector addition to use lots of both blocks and threads
After using threads and blocks together, we’ll talk about why threads
First let’s discuss data indexing…
29. Indexing Arrays With Threads And Blocks
No longer as simple as just using threadIdx.x or blockIdx.x as indices
To index an array with one thread per entry (using 8 threads/block):
[Figure: four blocks of 8 threads each; threadIdx.x runs 0–7 within each block, blockIdx.x runs 0–3]
If we have M threads/block, a unique array index for each entry is given by:
int index = threadIdx.x + blockIdx.x * M;
(the same pattern as 2D indexing: int index = x + y * width;)
30. Indexing Arrays: Example
In this example, the red entry would have an index of 21:
[Figure: array entries 0–21, with the red entry 21 falling in block 2]
With M = 8 threads/block and blockIdx.x = 2:
int index = threadIdx.x + blockIdx.x * M
          = 5 + 2 * 8
          = 21;
31. Addition with Threads and Blocks
blockDim.x is a built-in variable giving the number of threads per block:
int index= threadIdx.x + blockIdx.x * blockDim.x;
A combined version of our vector addition kernel to use blocks and threads:
__global__ void add( int *a, int *b, int *c ) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
c[index] = a[index] + b[index];
}
So what changes in main() when we use both blocks and threads?
32. Parallel Addition (Blocks/Threads): main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main( void ) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
int size = N * sizeof( int ); // we need space for N integers
// allocate device copies of a, b, c
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, size );
a = (int*)malloc( size );
b = (int*)malloc( size );
c = (int*)malloc( size );
random_ints( a, N );
random_ints( b, N );
33. Parallel Addition (Blocks/Threads): main() (cont)
// copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
// launch add() kernel with blocks and threads
add<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );
// copy device result back to host copy of c
cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
free( a ); free( b ); free( c );
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
return 0;
}
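The launch above assumes N is an exact multiple of THREADS_PER_BLOCK. When it is not, a common pattern (not shown in this deck) rounds the block count up and guards against out-of-range indices; a sketch:
__global__ void add( int *a, int *b, int *c, int n ) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if( index < n )   // threads past the end of the arrays do nothing
        c[index] = a[index] + b[index];
}
// round the block count up so every element is covered
add<<< (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c, N );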
34. Why Bother With Threads?
Threads seem unnecessary
— Added a level of abstraction and complexity
— What did we gain?
Unlike parallel blocks, parallel threads have mechanisms to
— Communicate
— Synchronize
Let’s see how…
35. Dot Product
Unlike vector addition, dot product is a reduction from vectors to a scalar
[Figure: the pairwise products a0 * b0, a1 * b1, a2 * b2, a3 * b3 are summed to produce c]
c = a ∙ b
c = (a0, a1, a2, a3) ∙ (b0, b1, b2, b3)
c = a0 b0 + a1 b1 + a2 b2 + a3 b3
36. Dot Product
Parallel threads have no problem computing the pairwise products:
[Figure: each thread computes one pairwise product ai * bi]
So we can start a dot product CUDA kernel by doing just that:
__global__ void dot( int *a, int *b, int *c ) {
// Each thread computes a pairwise product
int temp = a[threadIdx.x] * b[threadIdx.x];
37. Dot Product
But we need to share data between threads to compute the final sum:
[Figure: the pairwise products must be combined by a single sum to produce c]
__global__ void dot( int *a, int *b, int *c ) {
// Each thread computes a pairwise product
int temp = a[threadIdx.x] * b[threadIdx.x];
// Can’t compute the final sum
// Each thread’s copy of ‘temp’ is private
}
38. Sharing Data Between Threads
Terminology: A block of threads shares memory called…shared memory
Extremely fast, on-chip memory (user-managed cache)
Declared with the __shared__ CUDA keyword
Not visible to threads in other blocks running in parallel
[Figure: Block 0, Block 1, Block 2, …; each block's threads share that block's own shared memory]
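A side note not covered in the deck: the shared arrays used here have sizes fixed at compile time, but CUDA also supports shared memory sized at launch time via a third launch parameter. A sketch with a hypothetical kernel:
__global__ void scale( int *a ) {
    extern __shared__ int temp[];   // size determined by the launch configuration
    temp[threadIdx.x] = a[threadIdx.x];
    __syncthreads();
    a[threadIdx.x] = 2 * temp[threadIdx.x];
}
// the third launch parameter is the number of shared-memory bytes per block:
// scale<<< 1, 512, 512 * sizeof( int ) >>>( dev_a );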
39. Parallel Dot Product: dot()
We perform parallel multiplication, serial addition:
#define N 512
__global__ void dot( int *a, int *b, int *c ) {
// Shared memory for results of multiplication
__shared__ int temp[N];
temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
// Thread 0 sums the pairwise products
if( 0 == threadIdx.x ) {
int sum = 0;
for( int i = 0; i < N; i++ )
sum += temp[i];
*c = sum;
}
}
40. Parallel Dot Product Recap
We perform parallel, pairwise multiplications
Shared memory stores each thread’s result
We sum these pairwise products from a single thread
Sounds good…but we’ve made a huge mistake
41. Faulty Dot Product Exposed!
Step 1: In parallel, each thread writes its pairwise product into __shared__ int temp[]
Step 2: Thread 0 reads and sums the products from temp[]
But there’s an assumption hidden in Step 1…
42. Read-Before-Write Hazard
Suppose thread 0 finishes its write in step 1
Then thread 0 reads index 12 in step 2…
…but what if thread 0 reads before thread 12 writes to index 12 in step 1?
This read returns garbage!
43. Synchronization
We need threads to wait between the sections of dot():
__global__ void dot( int *a, int *b, int *c ) {
__shared__ int temp[N];
temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
// * NEED THREADS TO SYNCHRONIZE HERE *
// No thread can advance until all threads
// have reached this point in the code
// Thread 0 sums the pairwise products
if( 0 == threadIdx.x ) {
int sum = 0;
for( int i = 0; i < N; i++ )
sum += temp[i];
*c = sum;
}
}
44. __syncthreads()
We can synchronize threads with the function __syncthreads()
Threads in the block wait until all threads have hit the __syncthreads()
[Figure: Threads 0–4 each arrive at __syncthreads() at different times; none continues until all have arrived]
Threads are only synchronized within a block
45. Parallel Dot Product: dot()
__global__ void dot( int *a, int *b, int *c ) {
__shared__ int temp[N];
temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
__syncthreads();
if( 0 == threadIdx.x ) {
int sum = 0;
for( int i = 0; i < N; i++ )
sum += temp[i];
*c = sum;
}
}
With a properly synchronized dot() routine, let’s look at main()
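An aside before main(): the thread-0 loop serializes the final sum. A common refinement, not covered in this deck, is a tree reduction in shared memory that sums the N products in log2(N) synchronized steps (assuming N, the threads per block, is a power of two):
__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    __syncthreads();
    // halve the number of active threads each step
    for( int stride = N / 2; stride > 0; stride /= 2 ) {
        if( threadIdx.x < stride )
            temp[threadIdx.x] += temp[threadIdx.x + stride];
        __syncthreads();   // all threads reach this barrier each iteration
    }
    if( 0 == threadIdx.x )
        *c = temp[0];
}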
46. Parallel Dot Product: main()
#define N 512
int main( void ) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
int size = N * sizeof( int ); // we need space for 512 integers
// allocate device copies of a, b, c
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, sizeof( int ) );
a = (int *)malloc( size );
b = (int *)malloc( size );
c = (int *)malloc( sizeof( int ) );
random_ints( a, N );
random_ints( b, N );
47. Parallel Dot Product: main() (cont)
// copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
// launch dot() kernel with 1 block and N threads
dot<<< 1, N >>>( dev_a, dev_b, dev_c );
// copy device result back to host copy of c
cudaMemcpy( c, dev_c, sizeof( int ) , cudaMemcpyDeviceToHost );
free( a ); free( b ); free( c );
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
return 0;
}
48. Review
Launching kernels with parallel threads
— Launch add() with N threads: add<<< 1, N >>>();
— Used threadIdx.x to access thread’s index
Using both blocks and threads
— Used (threadIdx.x + blockIdx.x * blockDim.x) to index input/output
— N/THREADS_PER_BLOCK blocks and THREADS_PER_BLOCK threads gave us N threads total
49. Review (cont)
Using __shared__ to declare memory as shared memory
— Data shared among threads in a block
— Not visible to threads in other parallel blocks
Using __syncthreads() as a barrier
— No thread executes instructions after __syncthreads() until all threads have reached the __syncthreads()
— Needs to be used to prevent data hazards
50. Multiblock Dot Product
Recall our dot product launch:
// launch dot() kernel with 1 block and N threads
dot<<< 1, N >>>( dev_a, dev_b, dev_c );
Launching with one block will not utilize much of the GPU
Let’s write a multiblock version of dot product
51. Multiblock Dot Product: Algorithm
Each block computes a sum of its pairwise products like before:
[Figure: Block 0 computes sum = a0*b0 + a1*b1 + a2*b2 + a3*b3 + …; Block 1 computes sum = a512*b512 + a513*b513 + a514*b514 + a515*b515 + …]
52. Multiblock Dot Product: Algorithm
And then contributes its sum to the final result:
[Figure: each block's sum (Block 0, Block 1, …) is then added into the single result c]
53. Multiblock Dot Product: dot()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
__global__ void dot( int *a, int *b, int *c ) {
__shared__ int temp[THREADS_PER_BLOCK];
int index = threadIdx.x + blockIdx.x * blockDim.x;
temp[threadIdx.x] = a[index] * b[index];
__syncthreads();
if( 0 == threadIdx.x ) {
int sum = 0;
for( int i = 0; i < THREADS_PER_BLOCK; i++ )
sum += temp[i];
*c += sum;
}
}
But we have a race condition…
We can fix it with one of CUDA’s atomic operations
54. Race Conditions
Terminology: A race condition occurs when program behavior depends upon the relative timing of two (or more) event sequences
What actually takes place to execute the line in question, *c += sum;
— Read value at address c
— Add sum to value
— Write result to address c
Terminology: Read-Modify-Write
What if two threads are trying to do this at the same time?
Thread 0 of Block 0 and Thread 0 of Block 1 each: read the value at address c, add sum to it, and write the result back to address c
55. Global Memory Contention
Read-Modify-Write, serialized:
Block 0 (sum = 3): reads c = 0 → computes 0 + 3 = 3 → writes c = 3
Block 1 (sum = 4): reads c = 3 → computes 3 + 4 = 7 → writes c = 7
Value of c over time: 0 → 3 → 7 (correct)
56. Global Memory Contention
Read-Modify-Write, interleaved (a race):
Block 0 (sum = 3): reads c = 0 → computes 0 + 3 = 3 → writes c = 3
Block 1 (sum = 4): reads c = 0 → computes 0 + 4 = 4 → writes c = 4
Value of c over time: 0 → 0 → 3 → 4 (wrong: Block 1 read c before Block 0 wrote its result)
57. Atomic Operations
Terminology: A read-modify-write is uninterruptible when it is atomic
Many atomic operations on memory are available with CUDA C:
atomicAdd(), atomicSub(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicExch(), atomicCAS()
These give a predictable result when simultaneous access to memory is required
We need to atomically add sum to c in our multiblock dot product
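Of these, atomicCAS() (atomic compare-and-swap) is the most general: the CUDA programming guide documents a loop pattern that builds atomic operations the hardware does not provide directly. A sketch, here recast for a hypothetical atomic multiply:
__device__ int atomicMul( int *address, int val ) {
    int old = *address, assumed;
    do {
        assumed = old;
        // swap in assumed * val only if *address still equals assumed
        old = atomicCAS( address, assumed, assumed * val );
    } while( assumed != old );
    return old;
}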
58. Multiblock Dot Product: dot()
__global__ void dot( int *a, int *b, int *c ) {
__shared__ int temp[THREADS_PER_BLOCK];
int index = threadIdx.x + blockIdx.x * blockDim.x;
temp[threadIdx.x] = a[index] * b[index];
__syncthreads();
if( 0 == threadIdx.x ) {
int sum = 0;
for( int i = 0; i < THREADS_PER_BLOCK; i++ )
sum += temp[i];
atomicAdd( c , sum );
}
}
Now let’s fix up main() to handle a multiblock dot product
59. Parallel Dot Product: main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main( void ) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
int size = N * sizeof( int ); // we need space for N ints
// allocate device copies of a, b, c
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, sizeof( int ) );
a = (int *)malloc( size );
b = (int *)malloc( size );
c = (int *)malloc( sizeof( int ) );
random_ints( a, N );
random_ints( b, N );
60. Parallel Dot Product: main() (cont)
// copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
// zero the result so each block's atomicAdd() accumulates from 0
// (not on the original slide, but required for a correct result)
cudaMemset( dev_c, 0, sizeof( int ) );
// launch dot() kernel
dot<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );
// copy device result back to host copy of c
cudaMemcpy( c, dev_c, sizeof( int ) , cudaMemcpyDeviceToHost );
free( a ); free( b ); free( c );
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
return 0;
}
61. Review
Race conditions
— Behavior depends upon relative timing of multiple event sequences
— Can occur when an implied read-modify-write is interruptible
Atomic operations
— CUDA provides read-modify-write operations guaranteed to be atomic
— Atomics ensure correct results when multiple threads modify memory
62. To Learn More CUDA C
Check out CUDA by Example
— Parallel Programming in CUDA C
— Thread Cooperation
— Constant Memory and Events
— Texture Memory
— Graphics Interoperability
— Atomics
— Streams
— CUDA C on Multiple GPUs
— Other CUDA Resources
For sale here at GTC
http://developer.nvidia.com/object/cuda-by-example.html