Newbie’s guide to the GPGPU universe
Ofer Rosenberg

Agenda
• GPU History
• Anatomy of a Modern GPU
• Typical GPGPU Models
• The GPGPU universe

GPU History
A GPGPU perspective

From Shaders to Compute (1)
In the beginning, GPU HW was fixed & optimized for Graphics…
Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008:

• GPUs evolved to become programmable
(which made gaming companies very happy…)
Shader:
A simple program that runs on a graphics processing unit and describes the traits of either a vertex or a pixel.

The birth of GPGPU (1)
• Interest from the academic world
Pixel shader = run the same program on every pixel of every frame (1024 × 768 pixels × 60 frames per second)
= a highly efficient SPMD (Single Program, Multiple Data) machine
• Use of a fictitious graphics pipeline to solve problems
– Advanced Graphics problems
– General Computational problems

The birth of GPGPU (2)
• In 2002, Mark Harris (later of NVIDIA) coined the term GPGPU: “General-Purpose computation on Graphics Processing Units”
• Used a graphics language for general computation
• Highly effective, but:
– The developer needs to learn another (not intuitive) language
– The developer was limited by the graphics language

• GPUs needed one more evolutionary step → Unified Shaders

Rise of modern GPGPU
• Unified Architecture paved the way for modern GPGPU languages
– ATI X1900 (R580) was released in Jan 2006; CTM was released in Nov 2006
– GeForce 8800 GTX (G80) was released in Nov 2006; CUDA 0.8 (first official Beta) was released in Feb 2007

Evolution of Compute APIs (GPGPU)
• CUDA & CTM led to two compute standards: DirectCompute & OpenCL
• DirectCompute is a Microsoft standard
– Released as part of Win7/DX11, a.k.a. Compute Shaders
– Runs only on Windows
– Microsoft C++ AMP maps to DirectCompute
• OpenCL is a cross-OS / cross-Vendor standard
– Managed by a working group in Khronos
– Apple is the spec editor & conformance owner
– Work can be scheduled on both GPUs and CPUs
Release timeline:
Nov 2006 – CTM SDK
June 2007 – CUDA 1.0
Aug 2008 – CUDA 2.0
Dec 2008 – OpenCL 1.0
Oct 2009 – DirectX 11
Mar 2010 – CUDA 3.0
June 2010 – OpenCL 1.1
May 2011 – CUDA 4.0
Nov 2011 – OpenCL 1.2
Jan 2012 – CUDA 4.1
April 2012 – CUDA 4.2
Aug 2012 – C++ AMP 1.0
Oct 2012 – CUDA 5.0
July 2013 – CUDA 5.5
July 2013 – OpenCL 2.0 (Provisional)

GPGPU Evolution
2004 – Stanford University: Brook for GPUs
2006 – AMD releases CTM; NVIDIA releases CUDA
2008 – OpenCL 1.0 released
G80 – 346 GFLOPS; R580 – 375 GFLOPS

GPGPU Evolution
Nov 2009 – First hybrid supercomputer (SC) in the Top 10: Chinese Tianhe-1
1,024 Intel Xeon E5450 CPUs
5,120 Radeon 4870 X2 GPUs
Nov 2010 – First Hybrid SC reaches #1 on Top500 list: Tianhe-1A
14,336 Xeon X5670 CPUs
7,168 Nvidia Tesla M2050 GPUs
Source: http://www.top500.org/lists/

GPGPU Evolution
2013 – OpenCL on: Nexus 4 (Qualcomm Adreno 320), Nexus 10 (ARM Mali T604)
Android 4.2 adds GPU support for Renderscript
2014 – NVIDIA Tegra 5 will support CUDA
2013 – GPGPU Continuum becomes a reality

The GPGPU Continuum
Apple A6 GPU: 25 GFLOPS, < 2 W
AMD G-T16R: 46 GFLOPS*, 4.5 W
Intel i7-3770: 511 GFLOPS*, 77 W
NVIDIA GTX Titan: 4500 GFLOPS, 250 W
ORNL TITAN SC: 27 PFLOPS, 8200 kW
* GFLOPS of CPU+GPU

Anatomy of a Modern GPU
GPGPU Perspective

Massive Parallelism
From a GPGPU perspective, the GPU is a highly multi-threaded, wide vector machine

Parallelism detailed
• Multi (Many) Cores
– NVIDIA K20: 14 SMXs
– AMD HD7970: 32 Compute Units
– Intel Xeon Phi 5110P: 60 Cores
• Wide Vector Unit
– NVIDIA K20: Warp = 32 floats, 6 Warps per SMX
– AMD HD7970: Wavefront = 64 floats, 4 Wavefronts per CU
– Intel Xeon Phi 5110P: VPU = 16 floats, 1 VPU per Core
• Multi-threaded (latency/stalls hiding)
– NVIDIA K20: 64 Warps per SMX
– AMD HD7970: 40 Wavefronts per CU
Figure: NVIDIA GK110 SMX

Typical GPU Caveats
• Wide vectors = SIMD (SIMT) execution
– Conditional code has to be executed “vector wide”
– Mitigation: Predication (execute all code using masks on parts)
– Performance hit on mixed execution, up to 1/N efficiency (where N is
vector width)
• Many Cores & Small caches = High percentage of Stalls
– Mitigation:
• Hold multiple in-flight contexts (aka Warps/Wavefronts) per core
• Stall = fast context switch between in-flight context and active context
• Requires huge register bank (NV & AMD: 256KB per SMX/CU)
– Latency hiding depends on having enough in-flight contexts
A Must Read (the images on this slide are taken from this talk):
“From Shader Code to a Teraflop: How GPU Shader Cores Work”, by Kayvon Fatahalian, Stanford University, and Mike Houston, Fellow, AMD
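
The "up to 1/N efficiency" caveat above can be made concrete with a small host-side model. This is only a sketch in plain C++ (so it runs without a GPU) and it rests on a simplifying assumption that is not from the slides: every distinct control-flow path taken inside one warp costs one full pass over all lanes, with the other lanes masked off. The names simt_efficiency and make_warp are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical cost model (an assumption, not a vendor formula):
// each distinct path taken inside one warp is issued once for the whole
// warp, and lanes that are not on that path are masked off (idle).
double simt_efficiency(const std::vector<int>& path_of_lane) {
    if (path_of_lane.empty()) return 1.0;
    const std::set<int> distinct(path_of_lane.begin(), path_of_lane.end());
    // Each lane does useful work in exactly one of the passes.
    const double useful = static_cast<double>(path_of_lane.size());
    const double issued = useful * static_cast<double>(distinct.size());
    return useful / issued;
}

// Helper: a warp of `width` lanes in which lane i takes path (i % paths).
std::vector<int> make_warp(std::size_t width, int paths) {
    std::vector<int> lanes(width);
    for (std::size_t i = 0; i < width; ++i) {
        lanes[i] = static_cast<int>(i) % paths;
    }
    return lanes;
}
```

Under this model a 32-lane warp with no divergence has efficiency 1, a plain if/else split halves it, and the worst case, where every lane takes its own path, gives 1/32.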

Typical GPGPU Models
This section describes some general GPGPU models, which apply
to a wide range of languages

Simplified System Model
• Host runs the OS, Application, Drivers, etc.
• GPU is connected to the Host through PCIe, Shared
Memory, etc.
• The Application code contains API calls*, which use a Runtime environment, which provides GPU access
• The Application code contains “kernels”, which are short programs/functions that are loaded and executed on the GPU
* In some languages the API calls are abstracted through special syntax or directives
Figure: Host (Application on top of the Runtime) connected to a GPU that runs several Kernels
GPGPU Execution Model (1)
• A “kernel” is executed on a grid (1D/2D/3D)
• Each point in the grid executes one instance of the kernel, independently of the others*
• Per-instance read/write is accomplished by using
the instance’s index
* There are sync primitives on a group/block level (or whole device)
OpenCL
CUDA
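#include <cuda_runtime.h>

#define N 1024  // matrix size: an assumed value, the slide does not define N

// Kernel definition: one thread per matrix element
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)  // the grid is rounded up, so guard the edges
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Device buffers (allocation is assumed here; the slide elides it)
    float (*A)[N], (*B)[N], (*C)[N];
    size_t bytes = sizeof(float) * N * N;
    cudaMalloc((void**)&A, bytes);
    cudaMalloc((void**)&B, bytes);
    cudaMalloc((void**)&C, bytes);
    cudaMemset(A, 0, bytes);
    cudaMemset(B, 0, bytes);

    // Kernel invocation: round the grid up so it covers all N x N elements
    dim3 dimBlock(16, 16);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
    cudaDeviceSynchronize();  // the launch is asynchronous; wait for it

    cudaFree(A); cudaFree(B); cudaFree(C);
}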
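
To see why the kernel needs its bounds check, the launch can be emulated on the CPU. The sketch below is plain C++, not CUDA: nested loops stand in for blockIdx and threadIdx, and the matrices are flattened into row-major vectors. The names blocks_for, LaunchStats and emulate_mat_add are hypothetical, and real hardware would run the instances in parallel rather than in loop order.

```cpp
#include <cassert>
#include <vector>

// Blocks needed to cover n elements with blocks of `block` threads
// (the same round-up used for dimGrid in the CUDA example).
int blocks_for(int n, int block) { return (n + block - 1) / block; }

struct LaunchStats {
    long launched = 0;  // kernel instances started
    long active = 0;    // instances that passed the bounds check
};

// CPU emulation of a 2D launch over an n x n matrix (row-major vectors).
LaunchStats emulate_mat_add(int n, int block,
                            const std::vector<float>& a,
                            const std::vector<float>& b,
                            std::vector<float>& c) {
    LaunchStats stats;
    const int grid = blocks_for(n, block);
    for (int by = 0; by < grid; ++by)
        for (int bx = 0; bx < grid; ++bx)
            for (int ty = 0; ty < block; ++ty)
                for (int tx = 0; tx < block; ++tx) {
                    const int i = bx * block + tx;  // blockIdx.x * blockDim.x + threadIdx.x
                    const int j = by * block + ty;  // blockIdx.y * blockDim.y + threadIdx.y
                    ++stats.launched;
                    if (i < n && j < n) {           // same guard as the kernel
                        c[i * n + j] = a[i * n + j] + b[i * n + j];
                        ++stats.active;
                    }
                }
    return stats;
}
```

For a 20 x 20 matrix with 16 x 16 blocks, the grid is 2 x 2 blocks, so 1024 instances are launched but only 400 of them touch memory; without the guard the other 624 would write out of bounds.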

GPGPU Execution Model (2)
• GPU execution model is asynchronous
– Commands are sent down the stack
– Kernels are executed based on GPU load & status (the GPU serves several apps)
– Application code may wait on completion
• Queueing Model
– Explicit (OpenCL)
– Default is implicit, advanced usage is explicit (CUDA)
• SPMD → MPMD
– GPUs used to be able to execute only one kernel at a time
– Modern languages support multiple simultaneous kernels
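
The asynchronous, queue-based model can be sketched with a toy in-order command queue. This is a hypothetical host-only C++ model, not the OpenCL or CUDA API: enqueue() only records work and returns immediately, and finish() is where the application waits. A real runtime drains the queue in the background; here draining happens inside finish() to keep the sketch deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical in-order command queue (names are illustrative only).
class CommandQueue {
public:
    // Record a command; nothing is executed yet.
    void enqueue(std::function<void()> cmd) { pending_.push(std::move(cmd)); }

    // Number of commands that have been submitted but not yet executed.
    std::size_t pending() const { return pending_.size(); }

    // Block until all submitted commands have run, in submission order.
    void finish() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
    }

private:
    std::queue<std::function<void()>> pending_;
};
```

The point to notice is that results are only guaranteed to be visible after the wait: code that reads an output right after enqueue(), without finish(), sees stale data.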

GPGPU Memory Model
Basically, a distributed memory system:
• Separated Host memory / Device memory
– Create a buffer/image on the host
– Create a buffer/image on the device
• Opaque handle (OpenCL) or device-side pointer (CUDA)
• Sync operations between memories:
– Read / Write
– Map / Unmap (marshalling)
• Pinned memory for faster sync
• GPU can access Host mapped memory (CUDA)
Figure: the Application on the Host uses the Runtime to create a Buffer and write it to a matching Buffer on the GPU
GPU Memory Model
• Few types, GPU architecture driven
• Affects performance – use the right type
• Watch out for coherency issues
– Not your typical MESI architecture…

Compilation Model
• Most GPGPU languages use dynamic compilation
– A common practice in the world of GPUs
– Different GPU architectures : no common ISA
– ISA varies even between generations of the same vendor
• Front-End converts High-level language to IR
(Intermediate Representation)
– Assembly of a virtual machine
– LLVM is very common in this world
– In some languages, this happens at application compile time
• Back-End(s) converts from IR to Binary
– Some Vendors use additional intermediate-to-intermediate stages
• Most languages enable storing of IR & IL
– Some do it implicitly (CUDA)
Figure: compilation flow on NVIDIA
Source languages (OpenCL C, C for CUDA, Fortran, OpenACC) → LLVM* IR → PTX IL → GK110 Binary / GF104 Binary
* NVIDIA has “NVVM”, which is LLVM with a set of restrictions

GPGPU usages
CUDA usages:
• Advanced Graphics
• Game Physics
• Computer Vision
• Cluster/HPC
• Finance
• Scientific
• Media Processing
CUDA Community Showcase:
• ~900 applications from Academia
• http://www.nvidia.com/object/cuda-apps-flash-new.html#
Examples shown: Johannes Gutenberg University Mainz; Imperial College London; UC Davis, California; TU Darmstadt

GPGPU Languages
• Welcome to the jungle…

Vendor overview: NVIDIA
GeForce:
• GPU for Gaming
• GTX680
Tesla:
• GPU Accelerators
• K10 / K20
Quadro:
• Professional GFX
• K5000
All running the same cores (Kepler GK104 or GK110)

Vendor overview: AMD
Radeon:
• GPU for Gaming
• HD7970
FirePro:
• Professional GFX
• W9000
All running the same cores (GCN)
APU:
• CPU+GPU on same die
• A10

Vendor overview: Intel
Xeon Phi:
• Accelerator Card
• 5110P
CPU:
• CPU+GPU on same die
• Haswell Core i7-4xxx

Leading Mobile GPU Vendors
Vivante GC4000
• Unified Shaders
• 4 Cores, SIMD4 each
• Supports OpenCL 1.2
• 48 GFLOPS
NVIDIA Tegra 4
• 6 X 4-wide Vertex shaders
• 4 X 4-wide Pixel Shaders
• No GPGPU support
• 74 GFLOPS
ARM Mali T604
• 4 Cores
• Multiple “pipes” per core
• Supports OpenCL 1.1
• 68 GFLOPS
Imagination PowerVR 5xx
• Apple, Samsung, Motorola,
Intel
• Unified Shaders
• Supports OpenCL 1.1 EP (543)
• 38 GFLOPS (Apple’s MP4 version)
Qualcomm Adreno 320
• Part of Snapdragon S4
• Unified Shader
• Supports OpenCL 1.1 EP
• 50 GFLOPS
