3. Graphics Processing Unit (GPU)
A GPU is typically a separate card installed in a PCI Express slot
Market leaders: NVIDIA, Intel, AMD (ATI)
NVIDIA GPUs at UoM
Intel MIC (Many Integrated Core)
(Pictured: GeForce GTX 480, Tesla 2070)
4. Example Specifications

                                   GTX 480         Tesla 2070      Tesla K80
Peak double-precision FP perf.     650 Gigaflops   515 Gigaflops   2.91 Teraflops
Peak single-precision FP perf.     1.3 Teraflops   1.03 Teraflops  8.74 Teraflops
CUDA cores                         480             448             4992
Frequency of CUDA cores            1.40 GHz        1.15 GHz        560/875 MHz
Memory size (GDDR5)                1536 MB         6 GB            24 GB
Memory bandwidth                   177.4 GB/sec    150 GB/sec      480 GB/sec
ECC memory                         No              Yes             Yes
5. GPUs (Cont.)
Originally designed to accelerate the large number of computations performed in graphics rendering
Offloaded numerically intensive computation from the CPU
GPUs grew with the demand for high-performance graphics
GPUs eventually became much more powerful than CPUs for many computations
Cost-power-performance advantage
6. GPU Basics
Today's GPUs
High-performance, many-core processors that can be used to accelerate a wide range of applications
GPUs have led the race in floating-point performance since the start of the 21st century
GPGPU
GPUs are being used as parallel processors for general-purpose computation
7. CPU vs. GPU Architecture
The GPU devotes more transistors to computation
8. FLOPS & GFLOPS
FLOPS = floating-point operations per second
Example: GFLOPS = number of cores x FLOPS per core per cycle x clock (GHz)

                       CPU    GPU
Number of cores        4      448
FLOPS per core         4      1
Clock speed (GHz)      2.5    1.15
Performance (GFLOPS)   40     515
12. Applications of GPGPU
Computational Structural Mechanics
Bio-Informatics and Life Sciences
Computational Electromagnetics & Electrodynamics
Computational Finance
Computational Fluid Dynamics
Data Mining, Analytics, & Databases
Imaging & Computer Vision
Medical Imaging
Molecular Dynamics
Numerical Analytics
Weather, Atmospheric, Ocean Modeling & Space Sciences
13. Programming GPUs
CUDA language for NVIDIA GPU products
Compute Unified Device Architecture
Based on C
nvcc compiler
Lots of tools for analysis, debugging, profiling, …
OpenCL – Open Computing Language
Based on C
Supports both GPU & CPU programming
Support for Java, Python, MATLAB, etc.
Lots of active research
e.g., automatic code generation for GPUs
15. Caution!
The GPU is designed as a numeric computing engine
It will not perform as well as CPUs on some tasks
Most applications will use both CPUs & GPUs
For some computations, the cost of transferring data between CPU & GPU can be high
SIMD-style data parallelism is key to benefiting from GPUs
… and enough of it (relative to the total computation)
16. CUDA Architecture
CUDA is NVIDIA's solution for accessing the GPU
Can be seen as an extension to C/C++
(Figure: CUDA Software Stack)
17. CUDA Architecture (Cont.)
2 main parts
1. Host (CPU part)
• Single Program, Single Data
• Launches kernel on GPU
2. Device (GPU part)
• Single Program, Multiple Data
• Runs kernel
A function executed on the GPU (device) is called a "kernel"
18. CUDA Architecture (Cont.)
Grid Architecture
Grid
• A group of threads all running the same kernel
• Multiple grids can run at once
Block
• Grids are composed of blocks
• Each block is a logical unit containing a number of coordinating threads & some amount of shared memory
19. Example Program

#include <cuda.h>
#include <stdio.h>

__global__ void kernel(void)
{ }

int main(void)
{
    kernel<<<1, 1>>>();
    printf("Hello World!\n");
    return 0;
}

"__global__" says the function is to be compiled to run on the "device" (GPU), not the "host" (CPU)
The angle brackets "<<<" & ">>>" pass launch parameters to the runtime
20. Thread Blocks
Within host (CPU) code, call the kernel using <<< & >>>, specifying the grid size (number of blocks) & the block size (number of threads per block)
21. Grids, Blocks & Threads
(Figure: grid of size 6 (3x2 blocks); each block has 12 threads (4x3))
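The 3x2 grid of 4x3 blocks in the figure could be launched as sketched below. `dim3` and the built-in `blockIdx`, `blockDim`, and `threadIdx` variables are standard CUDA; the kernel name is our own.

```cuda
#include <cstdio>

// Each thread computes its global (x, y) position within the grid.
__global__ void whoami(void)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // 0..11 (3 blocks of 4 threads)
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // 0..5  (2 blocks of 3 threads)
    printf("thread (%d, %d)\n", x, y);
}

int main(void)
{
    dim3 grid(3, 2);            // grid of size 6: 3x2 blocks
    dim3 block(4, 3);           // 12 threads per block: 4x3
    whoami<<<grid, block>>>();  // 6 x 12 = 72 threads in total
    cudaDeviceSynchronize();    // wait for device printf output to flush
    return 0;
}
```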
23. CUDA Device Memory Model
The host & devices have separate memory spaces
e.g., hardware cards with their own DRAM
To execute a kernel on a device
Need to allocate memory on the device
Transfer data: host memory → device memory
After device execution
Transfer results: device memory → host memory
Free device memory no longer needed
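The steps above map directly onto the CUDA runtime calls `cudaMalloc`, `cudaMemcpy`, and `cudaFree`; a minimal sketch (the kernel, array size, and variable names are ours):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: double every element of an array.
__global__ void double_all(int *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2;
}

int main(void)
{
    const int n = 256;
    int h[n];                                 // host buffer
    for (int i = 0; i < n; i++) h[i] = i;

    int *d;
    cudaMalloc(&d, n * sizeof(int));          // 1. allocate memory on device
    cudaMemcpy(d, h, n * sizeof(int),
               cudaMemcpyHostToDevice);       // 2. host memory -> device memory

    double_all<<<(n + 127) / 128, 128>>>(d, n);  // 3. run the kernel

    cudaMemcpy(h, d, n * sizeof(int),
               cudaMemcpyDeviceToHost);       // 4. device memory -> host memory
    cudaFree(d);                              // 5. free device memory

    printf("h[3] = %d\n", h[3]);              // 6 after doubling
    return 0;
}
```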
31. Intel Xeon Phi (Cont.)
More (but simpler) cores, many threads, & wider vector units
Same programming model across host & device
Runs Linux on the device
Remote login
Access to network file systems
High compute density & energy efficiency