OpenHPI - Parallel Programming Concepts - Week 4
Week 4 in the OpenHPI course on parallel programming concepts is about GPU-based parallelism.

Find the whole course at http://bit.ly/1l3uD4h.

Published in: Education, Technology
Transcript

  • 1. Parallel Programming Concepts OpenHPI Course Week 4 : Accelerators Unit 4.1: Accelerate Now! Frank Feinbube + Teaching Team
  • 2. Summary: Week 3 ■ Short overview of shared memory parallel programming ideas ■ Different levels of abstractions □ Process model, thread model, task model ■ Threads for concurrency and parallelization □ Standardized POSIX interface □ Java / .NET concurrency functionality ■ Tasks for concurrency and parallelization □ OpenMP for C / C++, Java, .NET, Cilk, … ■ Functional language constructs for implicit parallelism ■ PGAS languages for NUMA optimization 2 Specialized languages help the programmer to achieve speedup. What about accordingly specialized parallel hardware? OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 3. What is specialized hardware? OpenHPI | Parallel Programming Concepts | Frank Feinbube 3
  • 4. OpenHPI | Parallel Programming Concepts | Frank Feinbube 4 What is specialized hardware? ■ Graphics cards: speed up rendering tasks ■ Sound cards: play sounds ■ Physics cards ? What else could be sped up by hardware? ■ Encryption ■ Compression ■ Codec parsing ■ Protocols and formats: XML ■ Text, regular expression matching ■ …
  • 5. Wide Variety of Accelerators OpenHPI | Parallel Programming Concepts | Frank Feinbube 5 Best supercomputer: Tianhe-2 (MilkyWay-2); 2nd: Titan
  • 6. Wide Variety of Applications OpenHPI | Parallel Programming Concepts | Frank Feinbube 6 Fluids NBody RadixSort
  • 7. Wide Variety of Application Domains OpenHPI | Parallel Programming Concepts | Frank Feinbube 7 BioInformatics Computational Chemistry Computational Finance Computational Fluid Dynamics Computational Structural Mechanics Data Science Defense Electronic Design Automation Imaging & Computer Vision Medical Imaging Numerical Analytics Weather and Climate http://www.nvidia.com/object/gpu-applications-domain.html
  • 8. What’s in it for me? OpenHPI | Parallel Programming Concepts | Frank Feinbube 8
  • 9. Short Term View: Cheap Performance ■ Energy / price: cheap to buy and to maintain ■ GFLOPS per watt: Fermi 1.5 / Kepler 5 / Maxwell 15 (2014) [Chart: execution time in milliseconds (lower means faster) vs. problem size (number of Sudoku places) for Intel E8500 CPU, AMD R800 GPU, NVIDIA GT200 GPU] GPU: Graphics Processing Unit (the CPU of a graphics card) OpenHPI | Parallel Programming Concepts | Frank Feinbube 9
  • 10. What is 15 GFLOPS? 15,000,000,000 floating point operations per second ■ A single Maxwell GPU will have more performance than the fastest supercomputer of 2001 ■ Your computer / notebook (thanks to GPUs) OpenHPI | Parallel Programming Concepts | Frank Feinbube 10
  • 11. Middle Term View: Even More Performance OpenHPI | Parallel Programming Concepts | Frank Feinbube 11
  • 12. Middle Term View: Even More Performance OpenHPI | Parallel Programming Concepts | Frank Feinbube 12
  • 13. Long Term View: Acceleration Everywhere Dealing with massively multi-core: ■ Accelerators (APUs) that accompany common general purpose CPUs (Hybrid Systems) Hybrid Systems (Accelerators + CPUs) ■ GPU Compute Devices: High Performance Computing (top 2 supercomputers are accelerator-based!), Business Servers, Home/Desktop Computers, Mobile and Embedded Systems ■ Special-Purpose Accelerators: (de)compression, XML parsing, (en|de)cryption, regular expression matching OpenHPI | Parallel Programming Concepts | Frank Feinbube 13
  • 14. How do they get so fast? OpenHPI | Parallel Programming Concepts | Frank Feinbube 14
  • 15. Three Ways Of Doing Anything Faster [Pfister] ■ Work harder (clock speed) → power wall problem, memory wall problem ■ Work smarter (optimization, caching) → ILP wall problem, memory wall problem ■ Get help (parallelization) □ More cores per single CPU □ Software needs to exploit them in the right way → memory wall problem [Diagram: one problem split across many CPU cores] 15 OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
  • 16. Accelerators bypass the walls ■ CPUs are general purpose □ Need to support all types of software / programming models □ Need to support a large variety of legacy software  That makes it hard to do something against the walls ■ Accelerators are special purpose □ Need to support only a limited subset of software / programming models □ Legacy software is usually even more limited and strict  That lessens the impact of the walls (memory & power wall)  Stronger parallelization and better speedup possible OpenHPI | Parallel Programming Concepts | Frank Feinbube 16
  • 17. CPU and Accelerator Trends ■ CPU: evolving towards throughput computing, motivated by energy-efficient performance ■ Accelerator: evolving towards general-purpose computing, motivated by higher-quality graphics and data-parallel programming [Diagram: throughput performance vs. programmability — multi-threading → multi-core → many-core; fixed function → partially programmable → fully programmable; NVIDIA Kepler, Intel MIC] OpenHPI | Parallel Programming Concepts | Frank Feinbube 17
  • 18. Task Parallelism and Data Parallelism OpenHPI | Parallel Programming Concepts | Frank Feinbube 18 Input Data Parallel Processing Result Data „CPU-style“ „GPU-style“
  • 19. Flynn's Taxonomy (1966) ■ Classifies parallel hardware architectures according to their capabilities in the instruction and data processing dimensions: Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); Multiple Instruction, Multiple Data (MIMD) [Diagrams: each class shown as instruction(s) and data item(s) feeding a processing step that produces output] OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 19
  • 20. Parallel Programming Concepts OpenHPI Course Week 4 : Accelerators Unit 4.2: Accelerator Technology Frank Feinbube + Teaching Team
  • 21. Flynn's Taxonomy (1966) (slide repeated from Unit 4.1) OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger 21
  • 22. Why SIMD? OpenHPI | Parallel Programming Concepts | Frank Feinbube 22
  • 23. History of GPU Computing ■ Fixed Function Graphic Pipelines: 1980s-1990s; configurable, not programmable; first APIs (DirectX, OpenGL); vertex processing ■ Programmable Real-Time Graphics: since 2001; APIs for vertex shading, pixel shading and access to textures; DirectX 9 ■ Unified Graphics and Computing Processors: 2006; NVIDIA's G80; unified processor arrays; three programmable shading stages; DirectX 10 OpenHPI | Parallel Programming Concepts | Frank Feinbube 23
  • 24. From Fixed Function Pipeline To Programmable Shading Stages GPUs pre 2006: DirectX 10 Geometry Shader Idea: NVIDIA G80 Solution: OpenHPI | Parallel Programming Concepts | Frank Feinbube 24 Vertex Shader Pixel Shader Vertex Shader Geometry Shader Pixel Shader Programmable Shader (Vertex/Geometry/Pixel) We would need a bigger card :/ This design allows: Smaller card / more power
  • 25. History of GPU Computing ■ Fixed Function Graphic Pipelines: 1980s-1990s; configurable, not programmable; first APIs (DirectX, OpenGL); vertex processing ■ Programmable Real-Time Graphics: since 2001; APIs for vertex shading, pixel shading and access to textures; DirectX 9 ■ Unified Graphics and Computing Processors: 2006; NVIDIA's G80; unified processor arrays; three programmable shading stages; DirectX 10 ■ General Purpose GPU (GPGPU): compute problems expressed as native graphic operations; algorithms as shaders; data in textures OpenHPI | Parallel Programming Concepts | Frank Feinbube 25
  • 26. General Purpose GPU (GPGPU) ■ Graphics Processing Unit (GPU): Data Texture → Formula Shader → Result Texture, on the Programmable Shader (Vertex/Geometry/Pixel) ■ Example formula: is each value greater than the mean, x_i > (1/n) * sum_{j=1..n} x_j ? ■ Data texture 2.7 1.8 2.8 1.8 2.8 4.5 9.0 4.5 → Result texture 0 0 0 0 0 1 1 1 OpenHPI | Parallel Programming Concepts | Frank Feinbube 26
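The thresholding example on slide 26 can be spelled out in plain C to show what the "formula shader" computes per element (a sketch; the function name is ours, not from the deck):

```c
#include <assert.h>
#include <stddef.h>

/* The "formula shader" from slide 26, expressed in plain C: each result
 * element is 1 if the corresponding input element exceeds the mean of all
 * inputs, 0 otherwise. On a GPGPU, the inputs would live in a texture and
 * each output element would be computed by an independent shader instance. */
static void threshold_above_mean(const float *x, int *result, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    float mean = sum / (float)n;
    for (size_t i = 0; i < n; i++)   /* one "shader invocation" per element */
        result[i] = x[i] > mean ? 1 : 0;
}
```

With the slide's data texture, the mean is about 3.74, so only the last three elements produce a 1.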
  • 27. History of GPU Computing ■ Fixed Function Graphic Pipelines: 1980s-1990s; configurable, not programmable; first APIs (DirectX, OpenGL); vertex processing ■ Programmable Real-Time Graphics: since 2001; APIs for vertex shading, pixel shading and access to textures; DirectX 9 ■ Unified Graphics and Computing Processors: 2006; NVIDIA's G80; unified processor arrays; three programmable shading stages; DirectX 10 ■ General Purpose GPU (GPGPU): compute problems expressed as native graphic operations; algorithms as shaders; data in textures ■ GPU Computing: programming with CUDA; programmable shaders; load and store instructions; barriers; atomics OpenHPI | Parallel Programming Concepts | Frank Feinbube 27
  • 28. What’s the difference to CPUs? OpenHPI | Parallel Programming Concepts | Frank Feinbube 28
  • 29. Task Parallelism and Data Parallelism (slide repeated from Unit 4.1) OpenHPI | Parallel Programming Concepts | Frank Feinbube 29
  • 30. CPU vs. GPU Architecture ■ CPU („multi-core“): a few heavyweight threads; branch prediction ■ GPU („many-core“): 1000+ lightweight threads; memory latency hiding [Diagram: CPU with control logic, cache and DRAM vs. GPU with many processing elements and DRAM] OpenHPI | Parallel Programming Concepts | Frank Feinbube 30
  • 31. GPU Threads are different GPU threads ■ Execute exactly the same instruction (line of code) □ They share the program counter  No branching! Memory Latency Hiding ■ GPU holds thousands of threads ■ If some access the slow memory, others will run while they wait ■ No context switch □ Enough registers to store all the threads all the time OpenHPI | Parallel Programming Concepts | Frank Feinbube 31
  • 32. What does it look like? OpenHPI | Parallel Programming Concepts | Frank Feinbube 32
  • 33. GPU Hardware in Detail: GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 33
  • 34. GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 34 GF100 L2 Cache
  • 35. GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 35 GF100
  • 36. GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 36 GF100
  • 37. GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 37 … GF100
  • 38. GF100 OpenHPI | Parallel Programming Concepts | Frank Feinbube 38 … GF100
  • 39. It’s a jungle out there! OpenHPI | Parallel Programming Concepts | Frank Feinbube 39
  • 40. GPU Computing Platforms (Excerpt) AMD R700, R800, R900, HD 7000, HD 8000, Rx 200 NVIDIA G80, G92, GT200, GF100, GK110 Geforce, Quadro, Tesla, ION OpenHPI | Parallel Programming Concepts | Frank Feinbube 40
  • 41. Compute Capability by version □ Double-precision floating point operations: no (1.0-1.2) / yes (1.3 and later) □ Caches: no (1.x) / yes (2.x and later) □ Max # concurrent kernels: 1 (1.x) / 8 (2.x and later) □ Dynamic Parallelism: 3.5 only □ Max # threads per block: 512 (1.x) / 1024 (2.x and later) □ Max # warps per MP: 24 / 32 / 48 / 64 □ Max # threads per MP: 768 / 1024 / 1536 / 2048 □ Register count (32 bit): 8192 / 16384 / 32768 / 65536 □ Max shared memory per MP: 16 KB / 16/48 KB / 16/32/48 KB □ # shared memory banks: 16 / 32 Plus: varying amounts of cores, global memory sizes, bandwidth, clock speeds (core, memory), bus width, memory access penalties … OpenHPI | Parallel Programming Concepts | Frank Feinbube 41
  • 42. Intel Xeon Phi: Hardware 60 cores based on the P54C architecture (Pentium) ■ > 1.0 GHz clock speed; 64-bit x86 instructions + SIMD ■ 25 MB L2 cache (= 512 KB per core) + 64 KB L1 (cache coherency) ■ 8 (to 32) GB of GDDR5 ■ 4 hardware threads per core (240 logical cores) □ Not like multicore Hyper-Threading □ Think graphics-card hardware threads □ Only one runs at a time = memory latency hiding □ Switched after each instruction!! → use 120 or 240 threads for the 60 cores ■ 512-bit wide VPU with the new ISA KCi ■ No support for MMX, SSE or AVX ■ Can handle 8 double-precision / 16 single-precision floats ■ Always structured in vectors with 16 elements OpenHPI | Parallel Programming Concepts | Frank Feinbube 42
  • 43. Intel Xeon Phi: Operating System ■ Runs a minimal, embedded Linux ■ Provides the Linux Standard Base (LSB) core libraries ■ Implements a BusyBox minimal shell environment OpenHPI | Parallel Programming Concepts | Frank Feinbube 43
  • 44. Parallel Programming Concepts OpenHPI Course Week 4 : Accelerators Unit 4.3: Open Compute Language (OpenCL) Frank Feinbube + Teaching Team
  • 45. Simple Example: Vector Addition OpenHPI | Parallel Programming Concepts | Frank Feinbube 45 c = (a1, a2, a3, a4) + (b1, b2, b3, b4) = ?
  • 46. Open Compute Language (OpenCL) ■ AMD / ATI: merged, needed commonality across products ■ NVIDIA: GPU vendor – wants to steal market share from the CPU ■ Intel: CPU vendor – wants to steal market share from the GPU ■ Apple: was tired of recoding for many-core and GPUs; pushed vendors to standardize; wrote a draft straw man API ■ Khronos Compute Group formed, including Ericsson, Nokia, IBM, Sony, Blizzard, Texas Instruments, … OpenHPI | Parallel Programming Concepts | Frank Feinbube 46
  • 47. OpenCL Platform Model ■ OpenCL exposes CPUs, GPUs, and other Accelerators as “devices” ■ Each “device” contains one or more “compute units”, i.e. cores, SMs,... ■ Each “compute unit” contains one or more SIMD “processing elements” OpenHPI | Parallel Programming Concepts | Frank Feinbube 47
  • 48. Terminology (CPU ↔ OpenCL, by platform / memory / execution level) □ SMP system ↔ Compute Device; Main Memory ↔ Global and Constant Memory; Process ↔ Index Range (NDRange) □ Processor ↔ Compute Unit; – ↔ Local Memory; – ↔ Work Group □ Core ↔ Processing Element; Registers / Thread Local Storage ↔ Registers / Private Memory; Thread ↔ Work Items (+ Kernels) OpenHPI | Parallel Programming Concepts | Frank Feinbube 48
  • 49. The BIG idea behind OpenCL OpenCL execution model … execute a kernel at each point in a problem domain. E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions [Diagram: traditional loops vs. data-parallel OpenCL] OpenHPI | Parallel Programming Concepts | Frank Feinbube 49
  • 50. Building and Executing OpenCL Code [Diagram: the code of one or more kernels is compiled, per device, into GPU and CPU binary representations] OpenCL code must be prepared to deal with much greater hardware diversity (features are optional and may not be supported on all devices) → compile code that is tailored to the device configuration OpenHPI | Parallel Programming Concepts | Frank Feinbube 50
  • 51. OpenCL Execution Model An OpenCL kernel is executed by an array of work items (threads). ■ All work items run the same code (SPMD) ■ Each work item has an index that it uses to compute memory addresses and make control decisions [Work item ids: 0 1 2 3 4 5 6 7] OpenHPI | Parallel Programming Concepts | Frank Feinbube 51
  • 52. Work Groups: Scalable Cooperation Divide the monolithic work item array into work groups ■ Work items within a work group cooperate via shared memory, atomic operations and barrier synchronization ■ Work items in different work groups cannot cooperate [Work item ids 0-7 form work group 0; ids 8-15 form work group 1] OpenHPI | Parallel Programming Concepts | Frank Feinbube 52
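The relation between global, group and local IDs on slide 52 can be sketched in C (a hypothetical helper of our own, not an OpenCL API; inside a kernel, OpenCL provides get_global_id, get_group_id and get_local_id instead):

```c
#include <assert.h>

/* With a global size of 16 and a work-group size of 8 (the layout on
 * slide 52), work item 11 belongs to group 1 and has local id 3. These
 * identities hold per dimension for OpenCL's ID functions. */
typedef struct { int group_id; int local_id; } ids_t;

static ids_t split_global_id(int global_id, int local_size) {
    ids_t ids;
    ids.group_id = global_id / local_size;  /* which work group */
    ids.local_id = global_id % local_size;  /* position within the group */
    return ids;
}
```

The inverse always holds: global_id == group_id * local_size + local_id.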
  • 53. OpenCL Execution Model ■ Parallel work is submitted to devices by launching kernels ■ Kernels run over global dimension index ranges (NDRange), broken up into “work groups”, and “work items” ■ Work items executing within the same work group can synchronize with each other with barriers or memory fences ■ Work items in different work groups can’t sync with each other, except by launching a new kernel OpenHPI | Parallel Programming Concepts | Frank Feinbube 53
  • 54. OpenCL Execution Model An example of an NDRange index space showing work-items, their global IDs and their mapping onto the pair of work-group and local IDs. OpenHPI | Parallel Programming Concepts | Frank Feinbube 54
  • 55. Terminology (repeat of slide 48) OpenHPI | Parallel Programming Concepts | Frank Feinbube 55
  • 56. OpenCL Memory Architecture ■ Private: per work item ■ Local: shared within a work group ■ Global / Constant: visible to all work groups ■ Host Memory: on the CPU OpenHPI | Parallel Programming Concepts | Frank Feinbube 56
  • 57. Terminology (repeat of slide 48) OpenHPI | Parallel Programming Concepts | Frank Feinbube 57
  • 58. OpenCL Memory Architecture ■ Memory management is explicit: you must move data from host → global → local … and back □ Global Memory (__global): shared by all work items; read/write; may be cached (modern GPUs), else slow; huge □ Private Memory (__private): for local variables; per work item; may be mapped onto global memory (arrays on GPUs) □ Local Memory (__local): shared between work items of a work group; may be mapped onto global memory (non-GPU devices), else fast; small □ Constant Memory (__constant): read-only, cached; additionally a special kind for GPUs: texture memory OpenHPI | Parallel Programming Concepts | Frank Feinbube 58
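A typical use of these memory types is a work-group reduction that stages data in fast __local memory (a sketch of our own, not from the deck, assuming a power-of-two group size; the plain-C function below emulates the lockstep execution so the logic can be checked):

```c
#include <assert.h>

/* Hypothetical OpenCL C kernel using __local memory (sketch):
 *
 *   __kernel void sum(__global const float *in, __local float *tmp,
 *                     __global float *out) {
 *       int lid = get_local_id(0), n = get_local_size(0);
 *       tmp[lid] = in[get_global_id(0)];
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *       for (int s = n / 2; s > 0; s /= 2) {
 *           if (lid < s) tmp[lid] += tmp[lid + s];
 *           barrier(CLK_LOCAL_MEM_FENCE);
 *       }
 *       if (lid == 0) out[get_group_id(0)] = tmp[0];
 *   }
 *
 * The same tree reduction, executed sequentially in plain C: */
static float tree_reduce(float *tmp, int n) {   /* n must be a power of 2 */
    for (int s = n / 2; s > 0; s /= 2)
        for (int lid = 0; lid < s; lid++)   /* all "work items" in lockstep */
            tmp[lid] += tmp[lid + s];
    return tmp[0];
}
```

The barrier between rounds is what makes the shared __local buffer safe; only work items of the same group can synchronize this way (slide 52).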
  • 59. OpenCL Work Item Code A subset of ISO C99 - without some C99 features ■ headers, function pointers, recursion, variable length arrays, and bit fields A superset of ISO C99 with additions for ■ Work-items and workgroups ■ Vector types (2,4,8,16): endian safe, aligned at vector length ■ Image types mapped to texture memory ■ Synchronization ■ Address space qualifiers Also includes a large set of built-in functions for image manipulation, work-item manipulation, specialized math routines, vectors, etc. OpenHPI | Parallel Programming Concepts | Frank Feinbube 59
  • 60. Vector Addition: Kernel ■ The kernel body is instantiated once for each work item, each getting a unique index » Code that actually executes on the target devices OpenHPI | Parallel Programming Concepts | Frank Feinbube 60
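The kernel source on slide 60 appears only as an image in the deck; a vector-addition kernel in OpenCL C conventionally looks like the sketch in the comment below, followed by a plain-C emulation that runs one kernel-body instance per index:

```c
#include <assert.h>

/* Conventional OpenCL C vector-addition kernel (sketch, not the slide's
 * verbatim code):
 *
 *   __kernel void vadd(__global const float *a,
 *                      __global const float *b,
 *                      __global float *c) {
 *       int i = get_global_id(0);   // each work item computes one element
 *       c[i] = a[i] + b[i];
 *   }
 *
 * Plain-C equivalent of launching n work items: */
static void vadd_emulated(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)   /* the loop the GPU replaces with work items */
        c[i] = a[i] + b[i];
}
```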
  • 61. Vector Addition: Host Program OpenHPI | Parallel Programming Concepts | Frank Feinbube 61 [5]
  • 62. Vector Addition: Host Program OpenHPI | Parallel Programming Concepts | Frank Feinbube 62 [5] „standard“ overhead for an OpenCL program
  • 63. Development Support Software development kits: NVIDIA and AMD; Windows and Linux Special libraries: AMD Core Math Library, BLAS and FFT libraries by NVIDIA, OpenNL for numerics and CULA for linear algebra; NVIDIA Performance Primitives library: a collection of common GPU-accelerated algorithms Profiling and debugging tools: ■ NVIDIA's Parallel Nsight for Microsoft Visual Studio ■ AMD's ATI Stream Profiler ■ AMD's Stream KernelAnalyzer: displays GPU assembler code, detects execution bottlenecks ■ gDEBugger (platform-independent) Big knowledge bases with tutorials, examples, articles, show cases, and developer forums OpenHPI | Parallel Programming Concepts | Frank Feinbube 63
  • 64. Nsight OpenHPI | Parallel Programming Concepts | Frank Feinbube 64
  • 65. Parallel Programming Concepts OpenHPI Course Week 4 : Accelerators Unit 4.4: Optimizations Frank Feinbube + Teaching Team
  • 66. The Power of GPU Computing [Chart: execution time in milliseconds (less is better) vs. problem size (number of Sudoku places), up to ~50,000 places, for Intel E8500 CPU, AMD R800 GPU, NVIDIA GT200 GPU] Big performance gains for small problem sizes OpenHPI | Parallel Programming Concepts | Frank Feinbube 66
  • 67. The Power of GPU Computing [Chart: execution time in milliseconds (less is better) vs. problem size (number of Sudoku places), up to ~600,000 places, for Intel E8500 CPU, AMD R800 GPU, NVIDIA GT200 GPU] Small/moderate performance gains for large problem sizes → further optimizations needed OpenHPI | Parallel Programming Concepts | Frank Feinbube 67
  • 68. Best Practices for Performance Tuning ■ Algorithm Design: asynchronous, recompute, simple ■ Memory Transfer: chaining, overlap transfer & compute ■ Control Flow: divergent branching, predication ■ Memory Types: local memory as cache, rare resource ■ Memory Access: coalescing, bank conflicts ■ Sizing: execution size, evaluation ■ Instructions: shifting, fused multiply, vector types ■ Precision: native math functions, build options OpenHPI | Parallel Programming Concepts | Frank Feinbube 68
  • 69. Divergent Branching and Predication Divergent Branching ■ Flow control instruction (if, switch, do, for, while) can result in different execution paths  Data parallel execution → varying execution paths will be serialized  Threads converge back to same execution path after completion Branch Predication ■ Instructions are associated with a per-thread condition code (predicate) □ All instructions are scheduled for execution □ Predicate true: executed normally □ Predicate false: do not write results, do not evaluate addresses, do not read operands ■ Compiler may use branch predication for if or switch statements ■ Unroll loops yourself (or use #pragma unroll for NVIDIA) OpenHPI | Parallel Programming Concepts | Frank Feinbube 69
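The predication idea can be illustrated in C: both sides of the conditional are computed, and a per-element predicate merely selects which result is kept, so no divergent control flow remains (an illustrative sketch; the function names are ours):

```c
#include <assert.h>

/* Divergent form: threads whose predicate differs take different paths,
 * which a SIMD machine must serialize. */
static int abs_branching(int x) {
    if (x < 0) return -x;
    return x;
}

/* Predicated form: every "thread" evaluates both candidates; the
 * per-thread condition code only selects which value is written back. */
static int abs_predicated(int x) {
    int neg = -x;
    int pred = x < 0;       /* per-thread condition code */
    return pred ? neg : x;  /* select, no divergent control flow */
}
```

Both compute the same function; the predicated form trades extra arithmetic for uniform execution.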
  • 70. Use Caching: Local, Texture, Constant Local Memory ■ Memory latency roughly 100x lower than global memory latency ■ Small, no coalescing problems, prone to memory bank conflicts Texture Memory ■ 2-dimensionally cached, read-only ■ Can be used to avoid uncoalesced loads from global memory ■ Used with the image data type Constant Memory ■ Linear cache, read-only, 64 KB ■ As fast as reading from a register for the same address ■ Can be used for big lists of input arguments [Diagram: 2D-cached texture memory layout] OpenHPI | Parallel Programming Concepts | Frank Feinbube 70
  • 71. Sizing: What is the right execution layout? ■ Local work item count should be a multiple of native execution size (NVIDIA 32, AMD 64, MIC 16), but not too big ■ Number of work groups should be multiple of the number of multiprocessors (hundreds or thousands of work groups) ■ Can be configured in 1-, 2- or 3-dimensional layout: consider access patterns and caching ■ Balance between latency hiding and resource utilization ■ Experimenting is required! OpenHPI | Parallel Programming Concepts | Frank Feinbube 71 [4]
  • 72. Instructions and Precision ■ Single precision floats provide best performance ■ Use shift operations to avoid expensive division and modulo calculations ■ Special compiler flags ■ AMD has a native vector type implementation; NVIDIA is scalar ■ Use the native math library whenever speed trumps precision □ Single-precision floating-point add, multiply, and multiply-add: 8 operations per clock cycle □ Single-precision reciprocal, reciprocal square root, and native_logf(x): 2 operations per clock cycle □ native_sin, native_cos, native_exp: 1 operation per clock cycle OpenHPI | Parallel Programming Concepts | Frank Feinbube 72
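The shift trick mentioned above, for unsigned values and power-of-two divisors (a minimal sketch):

```c
#include <assert.h>

/* For unsigned integers and power-of-two divisors, division and modulo
 * can be replaced by a cheap shift and mask: x / 16 == x >> 4 and
 * x % 16 == x & 15. */
static unsigned div16(unsigned x) { return x >> 4; }   /* x / 16 */
static unsigned mod16(unsigned x) { return x & 15u; }  /* x % 16 */
```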
  • 73. Coalesced Memory Accesses Simple Access Pattern ■ Can be fetched in a single 64-byte transaction (red rectangle) ■ Could also be permuted * Sequential but Misaligned Access ■ If all accesses fall into a single 128-byte segment: one 128-byte transaction; else: one 64-byte transaction + one 32-byte transaction * Strided Accesses ■ Depending on the stride, from 1 (here) up to 16 transactions * (* 16 transactions with compute capability 1.1) OpenHPI | Parallel Programming Concepts | Frank Feinbube 73 [6]
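One way to reason about the transaction counts above is to count how many distinct 64-byte segments a half-warp of 16 work items touches; each touched segment costs at least one transaction. This helper is our own illustration, not an OpenCL API, and assumes the base address is segment-aligned and the stride positive:

```c
#include <assert.h>

/* Number of distinct 64-byte segments touched by n_items consecutive work
 * items reading elements of elem_bytes at a fixed stride (in elements),
 * starting at a segment-aligned base address. Addresses grow monotonically,
 * so the touched segments form one contiguous range. */
static int segments_touched(int n_items, int stride_elems, int elem_bytes) {
    int first = 0;                                          /* segment of item 0 */
    int last = (n_items - 1) * stride_elems * elem_bytes / 64;
    return last - first + 1;
}
```

With 4-byte floats: stride 1 touches a single segment (fully coalesced), stride 4 touches 4 segments, and stride 16 touches 16, matching the "from 1 up to 16 transactions" range on the slide.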
  • 74. Intel Xeon Phi: OpenCL ■ What's different from GPUs? □ Thread granularity is bigger + no need for local shared memory ■ Work groups are mapped to threads □ 240 OpenCL hardware threads handle work groups □ More than 1000 work groups recommended □ Each thread executes one work group ■ Implicit vectorization by the compiler of the innermost loop (dimension zero of the NDRange) □ 16 elements per vector → dimension zero must be divisible by 16, otherwise scalar execution (good work group size = 16) __kernel ABC() for (int i = 0; i < get_local_size(2); i++) for (int j = 0; j < get_local_size(1); j++) for (int k = 0; k < get_local_size(0); k++) Kernel_Body; OpenHPI | Parallel Programming Concepts | Frank Feinbube 74
  • 75. Intel Xeon Phi: OpenCL ■ Non uniform branching in a work group has significant overhead ■ Non vector size aligned memory access has significant overhead ■ Non linear access patterns have significant overhead ■ Manual prefetching can be advantageous ■ No hardware support for barriers ■ Further reading: Intel® SDK for OpenCL Applications XE 2013 Optimization Guide for Linux OS OpenHPI | Parallel Programming Concepts | Frank Feinbube 75 Especially for dimension zero
  • 76. Parallel Programming Concepts OpenHPI Course Week 4 : Accelerators Unit 4.5: Future Trends Frank Feinbube + Teaching Team
  • 77. Towards new Platforms WebCL [Draft] http://www.khronos.org/webcl/ ■ JavaScript binding to OpenCL ■ Heterogeneous Parallel Computing (CPUs + GPU) within Web Browsers ■ Enables compute intense programs like physics engines, video editing… ■ Currently only available with add-ons (Node.js, Firefox, WebKit) Android installable client driver extension (ICD) ■ Enables OpenCL implementations to be discovered and loaded as a shared object on Android systems. OpenHPI | Parallel Programming Concepts | Frank Feinbube 77
  • 78. Towards new Applications: Dealing with Unstructured Grids [Figure: a grid that is too coarse vs. one that is too fine] OpenHPI | Parallel Programming Concepts | Frank Feinbube 78
  • 79. Towards new Applications: Dealing with Unstructured Grids Fixed Grid Dynamic Grid OpenHPI | Parallel Programming Concepts | Frank Feinbube 79
  • 80. Towards new Applications: Dynamic Parallelism OpenHPI | Parallel Programming Concepts | Frank Feinbube 80 CPU manages execution GPU manages execution
  • 81. OpenHPI | Parallel Programming Concepts | Frank Feinbube 81
  • 82. Towards new Programming Models: OpenACC GPU Computing OpenHPI | Parallel Programming Concepts | Frank Feinbube 82 Copy arrays to GPU Parallelize loop with GPU kernels Data automatically copied back at end of region
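The OpenACC pattern described on this slide can be sketched as a SAXPY loop (our example, not the code from the slide; compilers without OpenACC support simply ignore the pragmas and run the loop sequentially):

```c
#include <assert.h>

/* OpenACC sketch: the data clauses copy the arrays to the GPU, the
 * parallel loop directive turns the loop into GPU kernels, and y is
 * automatically copied back when the region ends. */
static void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The directive-based style keeps the loop itself unchanged, in contrast to rewriting it as an OpenCL kernel plus host boilerplate.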
  • 83. Towards new Programming Models: OpenACC GPU Computing OpenHPI | Parallel Programming Concepts | Frank Feinbube 83
  • 84. Hybrid System [Diagram: CPUs (multiple cores, connected via QPI, each with RAM) combined with accelerators — GPUs and a MIC — each with their own RAM] OpenHPI | Parallel Programming Concepts | Frank Feinbube 84
  • 85. Summary: Week 4 ■ Accelerators promise big speedups for data parallel applications □ SIMD execution model (no branching) □ Memory latency hiding with 1000s of light-weight threads ■ Enormous diversity -> OpenCL as a standard programming model for all □ Uniform terms: Compute Device, Compute Unit, Processing Element □ Idea: Loop parallelism with index ranges □ Kernels are written in C; compiled at runtime; executed in parallel □ Complex memory hierarchy, overhead to copy data from CPU ■ Getting fast is easy, getting faster is hard □ Best practices for accelerators □ Knowledge about hardware characteristics necessary ■ Future: faster, more features, more platforms, better programmability 85 Multi-core, Many-Core, .. What if my computational problem still demands more power? OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
