NVIDIA CUDA
The Compute Unified Device Architecture

Jens Rühmkorf
Tech Talk, DLR Köln-Porz, July 22nd 2009

References

  University of Illinois at Urbana-Champaign, Wen-Mei Hwu & David Kirk,
  course ECE 498 AL, Spring 2009: http://courses.ece.illinois.edu/ece498/al/
  Website about General-Purpose Computation on Graphics Hardware:
  http://gpgpu.org/developer/cuda
  ACM Queue, Vol. 6 No. 2, March/April 2008 (issue on GPGPU):
  http://mags.acm.org/queue/20080304/
  Dr. Dobb's: CUDA, Supercomputing for the Masses, Part 1-13:
  http://www.ddj.com/architect/207200659
  NVIDIA CUDA Best Practices Guide:
  http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide_2.3.pdf
  Hubert Nguyen (ed.), GPU Gems 3, Addison-Wesley, 2007, online:
  http://developer.nvidia.com/object/gpu-gems-3.html

Multi- and Manycore Architectures
A Difficult Road Lies Ahead

  Don Knuth on multicore architectures:
  “[…] my personal unhappiness with the current trend toward multicore
  architecture. To me, it looks more or less like the hardware designers have
  run out of ideas, and that they’re trying to pass the blame for the future
  demise of Moore’s Law to the software writers by giving us machines that work
  faster only on a few key benchmarks! I won’t be surprised at all if the whole
  multithreading idea turns out to be a flop.”
  In: InformIT, April 25th 2008,
  http://www.informit.com/articles/article.aspx?p=1193856

Overview

  A high-level view on CUDA
  CUDA programming model
  CUDA memory model
  CUDA application programming interface
  Simple CUDA example

Multicore and Manycore (1)
Structural Differences

  Multicore: yoke of oxen
    Each core optimized for executing a single thread
  Manycore: flock of chickens
    Cores optimized for aggregate throughput, deemphasizing individual performance

Multicore and Manycore (2)
Technical Characteristics

  Specification             Core i7 960                      GTX285
  Processing elements       4 cores, 4-way SIMD @ 3.2 GHz    30 cores, 8-way SIMD @ 1.5 GHz
  Resident threads (max)    4 cores, 2 threads, 4-wide       30 cores, 32 SIMD vectors, 32-wide
                            SIMD: 32 strands                 SIMD: 30720 strands
  SP GFLOP/s                102                              1080
  Memory bandwidth          25.6 GB/s                        159 GB/s
  Register file             -                                1.875 MB
  Local store               -                                480 kB

Multicore and Manycore (3)
Performance Comparison: CPU vs. GPU

  [Chart: CPU vs. GPU; y-axis: floating-point operations per second, single precision]

An Example of the Physical Reality Behind CUDA

  [Figure: CPU (host) next to a GPU with local DRAM (device)]

CUDA Processing Flow

  [Figure: CUDA processing flow between host and device]

CUDA in a Nutshell
Key Characteristics

  CUDA is designed for wide SIMD/SPMD parallelism & scalability
  CUDA provides 3 key abstractions, i.e. a hierarchy:
    of thread groups,
    of shared memories, and
    of barrier synchronization
  CUDA programs are written in C + extensions
  OpenCL is inspired by CUDA, but HW & SW vendor neutral
    Programming model essentially identical

Hello World

  // hello-world.cu
  // Compute vector sum C = A + B
  // Each thread performs one pair-wise addition
  __global__ void vecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      C[i] = A[i] + B[i];
  }

  int main()
  {
      // Run N/256 blocks of 256 threads each
      vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
  }

Overview

  A high-level view on CUDA
  CUDA programming model
  CUDA memory model
  CUDA application programming interface
  Simple CUDA example

CUDA Programming Model
Structure of a CUDA Application

  Integrated host + device application C program
    Serial or modestly parallel parts in host C code
    Highly parallel parts in device SPMD kernel C code

  Serial code (host)
  Parallel kernel (device):  KernelA<<< nBlk, nTid >>>(args);
  Serial code (host)
  Parallel kernel (device):  KernelB<<< nBlk, nTid >>>(args);

CUDA Programming Model
CUDA Devices and Threads

  A compute device
    Is a coprocessor to the CPU or host
    Has its own DRAM (device memory)
    Runs many threads in parallel
    Is typically a GPU but can also be another type of parallel processing device
  Express data-parallel portions as device kernels (which run on many threads)

CUDA Programming Model
Arrays of Parallel Threads

  Execute a kernel by specifying arrays of threads
  All threads run the same code (SPMD)
  Use the thread ID to compute memory addresses & make control decisions

  threadID:  0 1 2 3 4 5 6 7 …

      float x = input[threadID];
      float y = func(x);
      output[threadID] = y;

CUDA Programming Model
Use Thread Blocks for (Scalable) Cooperation

  Divide the monolithic thread array into multiple blocks
    Threads within a block can cooperate via shared memory, atomic operations,
    and barrier synchronization
    Threads in different blocks cannot cooperate

  Thread Block 0 … Thread Block N-1, each with threadIDs 0 1 2 3 4 5 6 7 …,
  each running:

      float x = input[threadID];
      float y = func(x);
      output[threadID] = y;

CUDA Programming Model
Organisation of Thread Blocks

  Thread blocks can be one-, two- or three-dimensional arrays
  The host issues a sequence of kernel invocations (kernel 1, kernel 2) to the device
  Each kernel is executed as a batch of threads
  This batch is organized as a grid of thread blocks
  (figure shows 2-dimensional thread blocks)

CUDA Programming Model
Block IDs and Thread IDs

  Each thread uses IDs to decide what data to work on
    Block ID: 1D or 2D
    Thread ID: 1D, 2D, or 3D
  Simplifies memory addressing when processing multidimensional data

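As a concrete illustration (not from the deck), a minimal sketch of how block and thread IDs combine to index a row-major 2D array; the kernel name and layout are assumptions:

    // Sketch: mapping 2D block/thread IDs onto a row-major 2D array.
    // scaleKernel and the flat layout are illustrative assumptions.
    __global__ void scaleKernel(float* data, int width, int height, float factor)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // x position in the grid
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // y position in the grid
        if (col < width && row < height)                  // guard against partial blocks
            data[row * width + col] *= factor;
    }
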
Overview

  A high-level view on CUDA
  CUDA programming model
  CUDA memory model
  CUDA application programming interface
  Simple CUDA example

CUDA Memory Model
Overview

  Global memory
    Main means of communicating R/W data between host and device
    Contents visible to all threads
    Long latency access
  We will focus on global memory for now
    Constant and texture memory will not be covered here

  [Figure: grid with blocks (0,0) and (1,0), each with shared memory, registers
  and threads; the host accesses global, constant and texture memory]

CUDA Memory Model
CUDA Device Memory Allocation

  cudaMalloc()
    Allocates object in the device global memory
    Requires two parameters:
      Address of a pointer to the allocated object
      Size of allocated object
  cudaFree()
    Frees object from the device global memory
    Pointer to freed object

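A minimal sketch of this pair in use (error handling elided; buffer name and size are illustrative):

    // Sketch: allocate and release 256 floats of device global memory.
    float* d_buf = 0;
    size_t size = 256 * sizeof(float);
    cudaMalloc((void**)&d_buf, size);   // parameter 1: address of the pointer, parameter 2: size in bytes
    // ... use d_buf in kernels ...
    cudaFree(d_buf);                    // release the allocation
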
CUDA Memory Model
CUDA Host-Device Data Transfer

  cudaMemcpy()
    Memory data transfer
    Requires four parameters:
      Pointer to destination
      Pointer to source
      Number of bytes to copy
      Type of transfer
  Type of transfer is one of:
    Host to Host
    Host to Device
    Device to Host
    Device to Device
  Asynchronous transfer

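A short sketch of the round trip described above (array name and size are illustrative):

    // Sketch: copy a host array to the device and back again.
    float h_data[256];
    float* d_data = 0;
    size_t size = sizeof(h_data);
    cudaMalloc((void**)&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);   // dest, src, bytes, direction
    // ... launch kernels that read/write d_data ...
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // bring results back
    cudaFree(d_data);
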
Overview

  A high-level view on CUDA
  CUDA programming model
  CUDA memory model
  CUDA application programming interface
  Simple CUDA example

CUDA API
Extended C

  [Diagram: the integrated source (foo.cu) is processed by cudacc (EDG C/C++
  frontend, Open64 global optimizer); the resulting GPU assembly (foo.s) is
  compiled by OCG into G80 SASS (foo.sass), while the CPU host code (foo.cpp)
  is compiled with gcc / cl]

  Mark Murphy, “NVIDIA’s Experience with Open64”,
  www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

CUDA API
C for CUDA

  Function type specifiers: __global__, __device__, __host__
  Variable type specifiers: __device__, __shared__, __constant__
  Keywords: threadIdx, blockIdx
  Intrinsics / builtin functions: __syncthreads()
  Runtime API: memory, symbol, execution management
  Function launch

  // image-convolution.cu
  __device__ float filter[N];

  __global__ void convolve(float *image)
  {
      __shared__ float region[M];
      ...
      region[threadIdx] = image[i];
      __syncthreads();
      ...
      image[j] = result;
  }

  // Allocate GPU memory
  void *myimage = cudaMalloc(bytes)

  // 100 blocks, 10 threads per block
  convolve<<<100, 10>>>(myimage);

CUDA API
CUDA Function Type Qualifiers (1)

                                      executed on:   only callable from:
  __device__ float deviceFunc()       device         device
  __global__ void  kernelFunc()       device         host
  __host__   float hostFunc()         host           host

  __global__: defines a kernel function
    Must return void
  __device__ and __host__ can be used together

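A hedged sketch showing the three qualifiers side by side (function names are illustrative, not from the deck):

    __device__ float square(float x)          // runs on the device, callable from device code only
    {
        return x * x;
    }

    __global__ void squareAll(float* data, int n)   // kernel: runs on the device, launched from the host
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
            data[i] = square(data[i]);
    }

    __host__ __device__ float twice(float x)  // compiled for both host and device
    {
        return 2.0f * x;
    }
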
CUDA API
CUDA Function Type Qualifiers (2)

  __device__ functions cannot have their address taken
  For functions executed on the device:
    No recursion
    No static variable declarations inside the function
    No variable number of arguments

CUDA API
CUDA Variable Type Qualifiers

  __device__
    Resides in global memory space,
    Has the lifetime of an application,
    Is accessible from all the threads within the grid and from the host
    through the runtime library.
  __shared__ (optionally used together with __device__)
    Resides in the shared memory space of a thread block,
    Has the lifetime of the block,
    Is only accessible from all the threads within the block.
  Not covered here: __constant__ (optionally used together with __device__)
    Resides in constant memory space,
    Has the lifetime of an application,
    Is accessible from all the threads within the grid and from the host
    through the runtime library.

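A small sketch (not from the deck) of where each qualifier typically appears; the names, sizes, and a 128-thread block are assumptions:

    __device__   float globalTable[256];     // global memory, lives for the whole application
    __constant__ float coefficients[16];     // constant memory, read-only from kernels

    // Assumes a launch with 128 threads per block so threadIdx.x stays within tile[].
    __global__ void useQualifiers(float* out)
    {
        __shared__ float tile[128];          // shared memory, one copy per thread block
        tile[threadIdx.x] = coefficients[threadIdx.x % 16] * globalTable[threadIdx.x];
        __syncthreads();                     // make the block's writes to tile visible
        out[threadIdx.x + blockIdx.x * blockDim.x] = tile[threadIdx.x];
    }
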
CUDA API
Calling a Kernel Function – Execution Configuration

  A kernel function (== __global__ function) must be called with an execution
  configuration:

    __global__ void kernelFunc(...) {...};

    dim3 dimGrid(100, 50);        // 5000 thread blocks
    dim3 dimBlock(4, 8, 8);       // 256 threads per block
    size_t sharedMemBytes = 64;   // 64 bytes of shared memory
    kernelFunc<<< dimGrid, dimBlock, sharedMemBytes >>>(...);

  Any call to a kernel function is asynchronous from CUDA 1.0 on
  Explicit synchronization needed for blocking

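A hedged sketch of the explicit synchronization mentioned above, reusing the names from the Hello World slide; cudaThreadSynchronize() is the CUDA 2.x-era runtime call and is an addition, not part of the deck:

    // Sketch: blocking on an asynchronous kernel launch.
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);  // returns to the host immediately
    cudaThreadSynchronize();                  // block until the kernel has finished
    // Note: a cudaMemcpy() to/from the device also waits for preceding kernels.
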
Overview

  A high-level view on CUDA
  CUDA programming model
  CUDA memory model
  CUDA application programming interface
  Simple CUDA example

A Simple CUDA Example
Matrix Multiplication

  A simple matrix multiplication example that illustrates the basic features of
  memory and thread management in CUDA programs
    Leave shared memory usage until later
    Local, register usage
    Thread ID usage
    Memory data transfer API between host and device
    Assume square matrix for simplicity

Simple CUDA Example
Square Matrix Multiplication

  P = M * N of size WIDTH x WIDTH
  Here: without tiling!
    One thread calculates one element of P
    M and N are loaded WIDTH times from global memory

  [Figure: matrices M, N and P, each WIDTH x WIDTH]

Memory Layout of a Matrix in C

  Logical 2D view:

      M0,0 M1,0 M2,0 M3,0
      M0,1 M1,1 M2,1 M3,1
      M0,2 M1,2 M2,2 M3,2
      M0,3 M1,3 M2,3 M3,3

  Linear (row-major) layout of M in memory:

      M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3

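With this row-major layout, element (x, y) of a WIDTH x WIDTH matrix stored in a flat array sits at index y * WIDTH + x. A tiny helper capturing that formula might look as follows (hypothetical, not used in the deck's code, which spells the index out inline):

    // Sketch: row-major index helper (name is illustrative).
    __host__ __device__ inline int idx(int x, int y, int width)
    {
        return y * width + x;   // row y, column x
    }
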
Step 1: Matrix Multiplication
A Simple Host Version in C

  // Matrix multiplication on the (CPU) host in double precision
  void matrixMulOnHost(float* M, float* N, float* P, int width)
  {
      for (int i = 0; i < width; ++i)
          for (int j = 0; j < width; ++j) {
              double sum = 0;
              for (int k = 0; k < width; ++k) {
                  double a = M[i * width + k];
                  double b = N[k * width + j];
                  sum += a * b;
              }
              P[i * width + j] = sum;
          }
  }

Step 2: Input Matrix Data Transfer
(Host-sided Code)

  void matrixMulOnDevice(float* M, float* N, float* P, int width)
  {
      int size = width * width * sizeof(float);
      float *Md, *Nd, *Pd;
      …

  1.  // Allocate and load M, N to device memory
      cudaMalloc(&Md, size);
      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
      cudaMalloc(&Nd, size);
      cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

      // Allocate P on the device
      cudaMalloc(&Pd, size);

Step 3: Output Matrix Data Transfer
(Host-sided Code)

  2.  // Kernel invocation code – to be shown later
      …

  3.  // Read P from the device
      cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

      // Free device matrices
      cudaFree(Md);
      cudaFree(Nd);
      cudaFree(Pd);
  }

Step 4: Kernel Function (1)

  // Matrix multiplication kernel – per thread code
  __global__ void matrixMulKernel(float* Md, float* Nd, float* Pd, int width)
  {
      // Pvalue is used to store the element of the matrix
      // that is computed by the thread
      float Pvalue = 0;

      // see next page…

Step 4: Kernel Function (2)

      for (int k = 0; k < width; ++k) {
          float Melement = Md[k + threadIdx.y * width];
          float Nelement = Nd[threadIdx.x + k * width];
          Pvalue += Melement * Nelement;
      }

      int i = threadIdx.x + threadIdx.y * width;
      Pd[i] = Pvalue;
  }

  [Figure: Md, Nd and Pd; the element at (tx, ty) is computed by this thread]

Step 5: Kernel Invocation
(Host-sided Code)

  // Insert into step 2. from before
  // Setup the execution configuration
  dim3 dimGrid(1, 1);
  dim3 dimBlock(width, width);

  // Launch the device computation threads!
  matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

Example Far from Ideal
Only One Thread Block Used

  One block of threads computes matrix Pd
    Each thread computes one element of Pd
  Each thread
    Loads a row of matrix Md
    Loads a column of matrix Nd
    Performs one multiply and addition for each pair of Md and Nd elements
  Compute to off-chip memory access ratio close to 1:1 (not very high)
  Size of matrix limited by the number of threads allowed in a thread block

  [Figure: Grid 1 with a single Block 1; thread (2, 2) computes one element of Pd]

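The deck stops at the single-block version. As a hedged sketch (not from the slides), here is how the kernel and launch might change so that each block covers a TILE_WIDTH x TILE_WIDTH patch of Pd, lifting the one-block size limit; width is assumed to be a multiple of TILE_WIDTH:

    // Sketch: many blocks, one tile of Pd per block (no shared-memory tiling yet).
    #define TILE_WIDTH 16

    __global__ void matrixMulMultiBlock(float* Md, float* Nd, float* Pd, int width)
    {
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;

        float Pvalue = 0;
        for (int k = 0; k < width; ++k)
            Pvalue += Md[row * width + k] * Nd[k * width + col];
        Pd[row * width + col] = Pvalue;
    }

    // Launch: one block per tile instead of a single block for the whole matrix.
    // dim3 dimGrid(width / TILE_WIDTH, width / TILE_WIDTH);
    // dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    // matrixMulMultiBlock<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
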
CUDA: A Bright Future?

  “Let your word be ‘Yes, yes’ or ‘No, no’; anything more than this comes from
  evil.” (Matthew 5:37)

NVIDIA CUDA: Appendix
Best Practices & Things to Watch Out For

Appendix
Best Practices & Things to Watch Out For

  Obtain relevant hardware data
  Compiling a CUDA program
  Linking
  Debugging
  C for CUDA vs. CUDA Driver API
  Watch out: floating point computations
  Unsupported C language elements
  Branching of code
  Coalesced access to device global memory
  Access patterns to avoid bank conflicts

Obtain Relevant Hardware Data

  Make sure to obtain relevant additional hardware data
  Call cudaGetDeviceProperties()

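A minimal sketch of such a query; the fields printed are a small subset of cudaDeviceProp:

    // Sketch: query the first device's properties and print a few fields.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 // device 0
        printf("Name:                 %s\n",   prop.name);
        printf("Compute capability:   %d.%d\n", prop.major, prop.minor);
        printf("Multiprocessors:      %d\n",   prop.multiProcessorCount);
        printf("Shared mem per block: %u bytes\n", (unsigned)prop.sharedMemPerBlock);
        printf("Max threads / block:  %d\n",   prop.maxThreadsPerBlock);
        return 0;
    }
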
Compiling a CUDA Program (1)

  Parallel Thread eXecution (PTX)
    Virtual machine and ISA (Instruction Set Architecture)
    Programming model
    Execution resources and state

  [Diagram: NVCC compiles the C/C++ CUDA application (e.g. "float4 me = gx[gtid];
  me.x += me.y * me.z;") into CPU code plus virtual PTX code; a PTX-to-target
  compiler then emits physical target code for the GPU (G80, …), e.g.
  "ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0]; mad.f32 $f1, $f5, $f3, $f1;"]

Compiling a CUDA Program (2)
NVCC as a Compiler Driver

  Any source file containing CUDA language extensions must be compiled with NVCC
  NVCC is a compiler driver
    Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
  NVCC outputs:
    C code (host CPU code)
      Must then be compiled with the rest of the application using another tool
    PTX
      Object code directly
      Or, PTX source, interpreted at runtime

Linking

  Any executable with CUDA code requires two dynamic libraries:
    The CUDA runtime library (cudart)
    The CUDA core library (cuda)

Debugging Using the Device Emulation Mode

  An executable compiled in device emulation mode (enabled via nvcc -deviceemu)
  runs completely on the host using the CUDA runtime
    No need of any device and CUDA driver
    Each device thread is emulated with a host thread
  Running in device emulation mode, one can:
    Use host native debug support (breakpoints, inspection, etc.)
    Access any device-specific data from host code and vice versa
    Call any host function from device code (e.g. printf) and vice versa
    Detect deadlock situations caused by improper usage of __syncthreads()

Device Emulation Mode Pitfalls

  Emulated device threads execute sequentially, so simultaneous accesses of the
  same memory location by multiple threads could produce different results
  Dereferencing device pointers on the host or host pointers on the device can
  produce correct results in device emulation mode, but will generate an error
  in device execution mode

CUDA Driver API vs. C for CUDA (1)
Extended C

  [Diagram: same compilation flow as on the earlier "CUDA API / Extended C"
  slide: integrated source (foo.cu), cudacc (EDG C/C++ frontend, Open64 global
  optimizer), GPU assembly (foo.s) compiled by OCG into G80 SASS (foo.sass),
  CPU host code (foo.cpp) compiled with gcc / cl]

  Mark Murphy, “NVIDIA’s Experience with Open64”,
  www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

CUDA Driver API vs. C for CUDA (2)
Mutually Exclusive: Choose One or the Other

  The C runtime for CUDA handles kernel loading and setting kernels before they
  are launched. The implicit code initialization, CUDA context management, CUDA
  module management (cubin and function mapping), kernel configuration, and
  parameter passing are all performed by the C runtime for CUDA.
  It comprises two principal parts:
    The low-level functions (cuda_runtime_api.h) have a C-style interface that
    does not require compilation with nvcc.
    The high-level functions (cuda_runtime.h) have a C++-style interface built
    on top of the low-level functions.
  Of these, the high-level functions are the most commonly used. They wrap some
  of the low-level functions, using overloading, references, and default
  arguments. These wrappers can be used from C++ code and can be compiled with
  any C++ compiler.

  The driver API is a lower-level API than the runtime API. When compared with
  the runtime API, the driver API has these advantages:
    No dependency on the runtime library
    More control over devices (for example, only the driver API enables one CPU
    thread to control multiple GPUs)
    No C extensions in the host code, so compilers other than the default CPU
    compiler can be used
  Its primary disadvantages:
    Verbose code
    Greater difficulty in debugging
    No device emulation
  A key point is that for every runtime API function, there is an equivalent
  driver API function. The driver API, however, includes other functions missing
  in the runtime API, such as those for migrating a context from one host thread
  to another.

CUDA Driver API vs. C for CUDA (3)
Example: Vector Addition Using C for CUDA

  const unsigned int cnBlockSize = 512;
  const unsigned int cnBlocks    = 3;
  const unsigned int cnDimension = cnBlocks * cnBlockSize;

  // create CUDA device & context
  cudaSetDevice( 0 ); // pick first device

  // allocate host vectors
  float * pA = new float[cnDimension];
  float * pB = new float[cnDimension];
  float * pC = new float[cnDimension];

  // initialize host memory
  randomInit(pA, cnDimension);
  randomInit(pB, cnDimension);

  // allocate device memory
  float *pDeviceMemA, *pDeviceMemB, *pDeviceMemC;
  cudaMalloc((void **)&pDeviceMemA, cnDimension * sizeof(float));
  cudaMalloc((void **)&pDeviceMemB, cnDimension * sizeof(float));
  cudaMalloc((void **)&pDeviceMemC, cnDimension * sizeof(float));

  // copy host vectors to device
  cudaMemcpy(pDeviceMemA, pA, cnDimension * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(pDeviceMemB, pB, cnDimension * sizeof(float), cudaMemcpyHostToDevice);

  vectorAdd<<<cnBlocks, cnBlockSize>>>(pDeviceMemA, pDeviceMemB, pDeviceMemC);

  // copy result from device to host
  cudaMemcpy((void *) pC, pDeviceMemC, cnDimension * sizeof(float), cudaMemcpyDeviceToHost);

  delete[] pA;
  delete[] pB;
  delete[] pC;
  cudaFree(pDeviceMemA);
  cudaFree(pDeviceMemB);
  cudaFree(pDeviceMemC);

CUDA Driver API vs. C for CUDA (4)
Example: Vector Addition Using CUDA Driver API

  const unsigned int cnBlocks    = 3;
  const unsigned int cnDimension = cnBlocks * cnBlockSize;

  CUdevice   hDevice;
  CUcontext  hContext;
  CUmodule   hModule;
  CUfunction hFunction;

  // create CUDA device & context
  cuInit(0);
  cuDeviceGet(&hDevice, 0); // pick first device
  cuCtxCreate(&hContext, 0, hDevice);
  cuModuleLoad(&hModule, "vectorAdd.cubin");
  cuModuleGetFunction(&hFunction, hModule, "vectorAdd");

  // allocate host vectors
  float * pA = new float[cnDimension];
  float * pB = new float[cnDimension];
  float * pC = new float[cnDimension];

  // initialize host memory
  randomInit(pA, cnDimension);
  randomInit(pB, cnDimension);

  // allocate memory on the device
  CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
  cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
  cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
  cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));

  // copy host vectors to device
  cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
  cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));

  // set up parameter values
  cuFuncSetBlockShape(hFunction, cnBlockSize, 1, 1);

  #define ALIGN_UP(offset, alignment) \
      (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

  int offset = 0;
  void* ptr;

  ptr = (void*)(size_t)pDeviceMemA;
  ALIGN_UP(offset, __alignof(ptr));
  cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
  offset += sizeof(ptr);

  ptr = (void*)(size_t)pDeviceMemB;
  ALIGN_UP(offset, __alignof(ptr));
  cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
  offset += sizeof(ptr);

  ptr = (void*)(size_t)pDeviceMemC;
  ALIGN_UP(offset, __alignof(ptr));
  cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
  offset += sizeof(ptr);

  cuParamSetSize(hFunction, offset);

  // execute kernel
  cuLaunchGrid(hFunction, cnBlocks, 1);

  // copy the result from device back to host
  cuMemcpyDtoH((void *) pC, pDeviceMemC, cnDimension * sizeof(float));

  delete[] pA;
  delete[] pB;
  delete[] pC;
  cuMemFree(pDeviceMemA);
  cuMemFree(pDeviceMemB);
  cuMemFree(pDeviceMemC);

Watch Out: Floating Point Computations
Differing Results of FP Computations

  Results of floating-point computations will slightly differ because of:
    Different compiler outputs, instruction sets
    Use of extended precision for intermediate results
  There are various options to force strict single precision on the host

Watch Out: Floating Point Computations
Single and Double Precision Operations

  Double precision
    No deviations from the IEEE 754 standard
  Single precision
    Denormals and signalling NaNs are not supported;
    Only two IEEE rounding modes are supported (chop and round-to-nearest even); and
    The precision of division/square root is slightly lower than single precision.

Limitations (1)
Only a Subset of C Available

  C for CUDA offers only a subset of the C language:
    Recursion-free
    Function-pointer-free
  Functions reside in the global device memory, therefore we cannot obtain
  their addresses

Limitations (2)
Branching in Program Code

  For best performance
    Threads should be running in groups of 32 threads (32 threads = 1 warp)
    All threads of a warp should take the same execution path
  Otherwise, branching will probably hurt

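A hedged sketch (not from the deck) contrasting a warp-divergent branch with one that keeps every warp on a single path; kernel names are illustrative:

    __global__ void divergentKernel(float* data)
    {
        // Threads within one warp take different paths: the warp executes
        // both branches one after the other (divergence).
        if (threadIdx.x % 2 == 0)
            data[threadIdx.x] *= 2.0f;
        else
            data[threadIdx.x] += 1.0f;
    }

    __global__ void uniformKernel(float* data)
    {
        // All 32 threads of a warp fall on the same side of the condition,
        // so no divergence occurs (the branch granularity is a whole warp).
        if ((threadIdx.x / 32) % 2 == 0)
            data[threadIdx.x] *= 2.0f;
        else
            data[threadIdx.x] += 1.0f;
    }
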
Coalesced Access to Device Global Memory

  High priority: ensure global memory accesses are coalesced whenever possible
  Global memory loads and stores by threads of a half warp (16 threads) are
  coalesced by the device in as few as one transaction (or two transactions in
  the case of 128-bit words)
  But: certain access requirements have to be met

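A hedged sketch of an access pattern that coalesces and one that typically does not; kernels and names are illustrative, one float element per thread:

    __global__ void coalescedCopy(float* out, const float* in)
    {
        // Consecutive threads of a half warp read consecutive 32-bit words:
        // the hardware can merge these into very few memory transactions.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    __global__ void stridedCopy(float* out, const float* in, int stride)
    {
        // Consecutive threads touch addresses that are 'stride' elements apart,
        // so the half warp's accesses fall into separate transactions.
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];
    }
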
Coalesced Access (Reading Floats)

  [Figure: coalesced access pattern when a half warp reads floats]

Uncoalesced Access (Reading Floats)

  [Figure: uncoalesced access patterns when a half warp reads floats]

Shared Memory – Bank Conflicts

  Shared memory
    16 KB
    Organized in 16 banks, 1 KB each
  Shared memory is as fast as a register …
    … if no bank conflicts occur!
  Bank conflict:
    More than one thread in the same half-warp accesses the same bank
    Access needs to be serialized
    Cost = max (# of concurrent accesses)

Shared Memory – No Bank Conflicts

  [Figure: conflict-free access patterns]
    Linear addressing, step size = 1 word
    Random permutation
    Linear addressing, step size = 3 words
    Broadcast

Shared Memory – Bank Conflicts

  [Figure: conflicting access patterns]
    Linear addressing, step size = 2 words
    Linear addressing, step size = 8 words
    No conflict or 5-way conflict

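A hedged sketch of the stride patterns behind these figures, for 32-bit data and 16 banks; the kernel and array names are illustrative:

    __global__ void strideExamples(float* out)
    {
        __shared__ float shared[512];

        // Stride 1: thread i of a half warp hits bank (i % 16) -> no conflict.
        float a = shared[threadIdx.x];

        // Stride 2: threads i and i+8 of the same half warp hit the same bank
        // -> 2-way bank conflict, the two accesses are serialized.
        float b = shared[2 * threadIdx.x];

        out[threadIdx.x] = a + b;
    }
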
