Exploring Compiler Optimization Opportunities
for the OpenMP 4.x Accelerator Model
on a POWER8+GPU Platform

Akihiro Hayashi (Rice), Jun Shirako (Rice)
Ettore Tiotto (IBM), Robert Ho (IBM)
Vivek Sarkar (Rice)

3rd Workshop on Accelerator Programming Using Directives (WACCPD) @ SC16
GPUs Everywhere

Personal Computers, Smartphones, Supercomputers
✔ 100+ accelerated systems on the Top500 list
✔ GPU acceleration for 2D/3D rendering, video encoding/decoding, and so on

This talk = XL Compilers + GPGPU

Pictures borrowed from: commons.wikimedia.org, www.olcf.ornl.gov, www.openmp.org
A gap between domain experts and hardware

Application Domain (Domain Experts): want to get significant performance improvement "easily"
Prog. Lang. / Compilers / Runtime
Hardware: GPUs (Concurrency Experts): hard to exploit the full capability of the hardware
A gap between domain experts and hardware (cont'd)

Application Domain (Domain Experts): want to get significant performance improvement easily
Prog. Lang. / Compilers / Runtime: we believe languages and compilers are very important!
Hardware: GPUs (Concurrency Experts): hard to exploit the full capability of the hardware
Directive-based Accelerator Programming Models

High-level abstraction of GPU parallelism:
• OpenACC
• OpenMP 4.0/4.5 target

#pragma omp target map(tofrom: C[0:N]) map(to: B[0:N], A[0:N])
#pragma omp teams num_teams(N/1024) thread_limit(1024)
#pragma omp distribute parallel for
for (int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
From the perspective of compilers…

Directive-based programming models demand additional optimizations from compilers:
• Kernel optimizations
• Data transfer optimizations

Credits: dragon by Cassie McKown from the Noun Project, crossed swords by anbileru adaleru from the Noun Project, https://en.wikipedia.org/
Additional dragon: GPUs
Where are we so far?

Two OpenMP compilers:
• Clang+LLVM compiler
• IBM XL C/C++/Fortran compiler

Speedup over naïve CUDA (kernel only, GPU: Tesla K80), higher is better:

Matrix Multiplication: naïve CUDA 1.00, hand-tuned CUDA 3.16, clang (OpenMP) 1.41, XL C (OpenMP) 1.45
MRI Reconstruction:    naïve CUDA 1.00, clang (OpenMP) 0.67, XL C (OpenMP) 0.82
Open Question:
How to optimize OpenMP programs?

Problem:
• OpenMP versions are in general slower than well-tuned CUDA versions

This talk is:
• About studying potential performance improvements by OpenMP compilers, with detailed performance analyses
• Current focus: GPU kernels only
• NOT about proposing a new GPU optimization
• NOT about pushing a specific compiler
Compiling OpenMP to GPUs

Clang+LLVM:
  C/C++ (OpenMP) → LLVM IR → LLVM optimization passes
    → CPU: PowerPC backend
    → GPU: NVPTX backend → PTX
  → CPU + GPU executable

IBM XL Compiler:
  C/C++/Fortran (OpenMP) → Wcode → High-level Optimizer + CPU/GPU Partitioner
    → CPU: Wcode → POWER low-level optimizer
    → GPU: Wcode → NVVM IR → libNVVM → PTX
  → CPU + GPU executable
OpenMP Threading Model on GPUs

#pragma omp target teams
{
  // sequential region 1
  if (...) {
    // parallel region 1
    #pragma omp parallel for
    for (...) {}
  }
}

Each team maps to one SMX on the GPU (Team 0 → SMX0, Team 1 → SMX1); outside parallel regions, only the master thread of each team executes.
State Machine Execution on GPUs

Requires state machine execution in general:
• increases register pressure
• increases branch divergence

(Each team runs on one SMX; sequential regions execute on the master thread only, parallel regions on all threads.)

if (threadIdx.x == MASTER) {
  labelForMaster = SEQUENTIAL_REGION1;
  labelForOthers = IDLE;
}
bool finished = false;
while (!finished) {
  // each thread selects its next state (sketch)
  int labelNext = (threadIdx.x == MASTER) ? labelForMaster : labelForOthers;
  switch (labelNext) {
  case IDLE:
    break;
  case SEQUENTIAL_REGION1:
    // code for sequential region 1
    ...
    labelForMaster = ...;
    labelForOthers = ...;
    break;
  case PARALLEL_REGION1:
    // code for parallel region 1 ...
    if (threadIdx.x == MASTER) {
      // update labelForMaster and labelForOthers
    }
    break;
  // other cases, including one that sets finished = true
  }
}

Note: the latest clang and XL no longer use this scheme.
Optimization Example:
Simplifying State Machine Execution on GPUs

What programmers can do:
• Use "combined" directives if possible

What compilers can do:
• Remove state machine execution if possible
• The IBM XL compiler performs such an optimization

#pragma omp target teams distribute parallel for …
for (int i = 0; i < N; i++) {
  C[i] = A[i] + B[i];
}
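
For contrast, here is a minimal sketch of the non-combined form of the same loop (my illustration; the "non-combined" variants evaluated later are defined this way, with the directives on separate constructs). On compilers of this era, this nesting triggered the state-machine scheme above:

// Non-combined form: target, teams, and distribute parallel for are
// separate constructs, so the teams region may contain sequential code
// and older compilers fell back to state-machine execution.
#pragma omp target map(tofrom: C[0:N]) map(to: A[0:N], B[0:N])
#pragma omp teams
{
  // sequential region: master thread of each team only
  #pragma omp distribute parallel for
  for (int i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
  }
}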
Performance Evaluation & Analysis

Platform:
• CPUs: IBM POWER8 (S824)
  • 2 x 12-core IBM POWER8 running at up to 3.52 GHz
  • 1 TB of main memory
• GPU: NVIDIA Tesla K80
  • 13 SMXs, each of which has 192 CUDA cores running at up to 875 MHz
  • 12 GB of global memory

Compilers:
• clang+LLVM (as of October 2015): clang 3.8, -O3
• XL C compiler (as of June 2016): a beta version of IBM XL C/C++ 13.1.4, -O3
Benchmarks

Benchmark               | Type                | Size
------------------------|---------------------|--------------
Vector Addition         | Memory bounded      | 67,108,864
SAXPY                   | Memory bounded      | 67,108,864
Matrix Multiplication   | Memory bounded      | 2,048 x 2,048
BlackScholes            | Computation bounded | 4,194,304
SPEC ACCEL SP-XSOLVE3   | Memory bounded      | 5x255x256x256
SPEC ACCEL OMRIQ        | Computation bounded | 262,144

Variants:
• CUDA: CUDA-baseline, CUDA + read-only cache (ROC)
• OMP4: clang+llvm, XL C (both utilize ROC by …)
Insight #1: Minimize OpenMP runtime overheads when possible

• Combined: use combined constructs (no state machine)
• Non-combined: put OMP directives on separate lines (state machine)

Speedup over CUDA-baseline for the 1LV benchmarks (VecAdd, Saxpy, MM, BS, OMRIQ, and geomean; series: clang-combined, clang-non-combined, XLC-combined, XLC-non-combined; higher is better; see Overall Results in the backup for the per-benchmark numbers):

• State-machine (non-combined) versions are slower than non-state-machine (combined) versions
• XL C optimizes the state machine away, so its combined and non-combined versions match (1LV only)
Insight #2: Utilize read-only cache carefully

The read-only cache does not always contribute to performance improvements.

Speedup of CUDA-baseline + read-only cache over CUDA-baseline (higher is better):

VecAdd: 0.97, Saxpy: 0.97, MM: 0.98, BS: 0.96, OMRIQ: 1.01, SP: 1.12, Geomean: 1.00
(VecAdd through OMRIQ: 1LV, combined and non-combined; SP: 2LV, non-combined only)
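
As a concrete illustration (my sketch, not code from the slides), the usual CUDA idioms for engaging the read-only data cache on a Kepler-class GPU such as the K80 are `const ... __restrict__` pointer qualifiers and explicit `__ldg()` loads:

// Hypothetical vector-add kernel. `const float* __restrict__` lets the
// compiler prove the loads are read-only and route them through the
// read-only data cache (ld.global.nc); __ldg() requests such a load
// explicitly.
__global__ void vecAddROC(const float* __restrict__ a,
                          const float* __restrict__ b,
                          float* __restrict__ c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    c[i] = __ldg(&a[i]) + __ldg(&b[i]);
}

Per the numbers above, this paid off only on OMRIQ and SP and was slightly counterproductive on the streaming kernels, hence "carefully".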
Insight #3: Improve math function code generation

BS and OMRIQ use math functions heavily.
Execution-time ordering: CUDA < XL C < clang (CUDA fastest, clang slowest).

Speedup over CUDA-baseline (higher is better):

                 BS    OMRIQ  Geomean
clang-combined   0.59  0.67   0.63
XLC-combined     0.72  0.82   0.77
Insight #3: Improve math function code generation (cont'd)

Synthetic benchmark:

#pragma omp target teams distribute parallel for
for (int i = 0; i < N; i++) {
  float T = exp(randArray[i]);
  call[i] = (float)log(randArray[i]) / T;
}

Kernel time: CUDA 515 us, clang 933 us, xlc 922 us
(721 us when using expf and logf manually)

Generated calls, libdevice.bc vs. the Math API (the Math API is better than libdevice):

        libdevice.bc  Math API
clang:  exp(double)   expf(float)
        log(double)   logf(float)
xlc:    exp(double)   expf(float)
        log(double)   log(double)
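
A minimal sketch of the manual single-precision rewrite mentioned above (the 721 us variant), assuming the same randArray and call buffers:

// Call expf/logf directly so the compiler does not generate
// double-precision exp/log plus float<->double conversions
// for float operands.
#pragma omp target teams distribute parallel for
for (int i = 0; i < N; i++) {
  float T = expf(randArray[i]);
  call[i] = logf(randArray[i]) / T;
}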
Insight #4: Other important insights

Generate fused multiply-add (FMA) if possible:
• fadd + fmul = fma
• e.g., 1.5% performance improvement in SAXPY (clang)

Use schedule(static, 1) for better memory coalescing (a sketch of both ideas follows):

Speedup over chunk=1 (clang, higher is better):
chunk=1: 1.00, chunk=2: 0.48, chunk=4: 0.27, chunk=8: 0.14
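
A minimal sketch of both ideas on SAXPY (my illustration, assuming scalar a and arrays x, y of length N): schedule(static, 1) hands consecutive iterations to consecutive threads so adjacent threads touch adjacent elements, and fmaf() makes the multiply-add contraction explicit:

#include <math.h>

// SAXPY with chunk-1 static scheduling and an explicit fused
// multiply-add. schedule(static, 1) assigns iteration i to thread
// i % nthreads, so adjacent threads access adjacent elements
// (coalesced global-memory accesses on the GPU).
void saxpy(int N, float a, float *x, float *y) {
  #pragma omp target teams distribute parallel for schedule(static, 1) \
      map(to: x[0:N]) map(tofrom: y[0:N])
  for (int i = 0; i < N; i++) {
    // fmaf() is C99's single-precision fused multiply-add; compilers
    // can also contract a * x[i] + y[i] into an fma automatically.
    y[i] = fmaf(a, x[i], y[i]);
  }
}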
Insight #5: Loop transformation is much more important (high-level restructuring)

Performing loop transformations by hand on SPEC ACCEL SP.

Speedup over CUDA-baseline (higher is better):

                 clang  XLC
OMP-naïve        1.15   1.02
OMP-transformed  3.04   3.18

Significant performance improvement from loop distribution + permutation + tiling (an illustrative sketch follows).
Future work: a high-level loop transformation framework.
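
To make the three transformations concrete, here is an illustrative sketch on a generic loop nest (my example with hypothetical arrays a, b, c, d, not the actual SP code):

// Illustrative only: generic N x N row-major arrays.
#define TILE 32
void restructure(int N, double a[N][N], double b[N][N],
                 double c[N][N], double d[N]) {
  // Original fused nest (for reference):
  //   for (j) for (i) { b[i][j] = a[i][j]*2.0; c[j][i] = b[i][j] + d[j]; }

  // 1. Loop distribution: split the two statements into separate nests.
  // 2. Loop permutation: give each nest its own best loop order
  //    (here, the one that makes its store stream contiguous).
  // 3. Loop tiling: block the iteration space for cache locality.
  for (int ii = 0; ii < N; ii += TILE)
    for (int jj = 0; jj < N; jj += TILE)
      for (int i = ii; i < ii + TILE && i < N; i++)
        for (int j = jj; j < jj + TILE && j < N; j++)
          b[i][j] = a[i][j] * 2.0;        // stores contiguous in j

  for (int jj = 0; jj < N; jj += TILE)
    for (int ii = 0; ii < N; ii += TILE)
      for (int j = jj; j < jj + TILE && j < N; j++)
        for (int i = ii; i < ii + TILE && i < N; i++)
          c[j][i] = b[i][j] + d[j];       // stores contiguous in i
}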
Insight #5: Loop transformation is much more important (parameter tuning)

Matrix multiplication inner loop:

…
for (int k = 0; k < N; k++) {
  sum += A[i*N+k] * B[k*N+threadIdx.x];
}

The unrolling factor determines how many strided (A) and offset (B) accesses the threads of a block issue simultaneously:
• unrolling factor = 2: 2 strided + 2 offset accesses simultaneously, L2 hit rate 92.02%
• unrolling factor = 8: 8 strided + 8 offset accesses simultaneously, L2 hit rate 14.46%

Tuning the unrolling factor yields a 1.41x speedup (a sketch of the unrolled loop follows).
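
A minimal sketch of the unroll-by-2 variant of the inner loop above (assuming N is even; illustration only, using the same A, B, sum, i, and N):

// Unroll-by-2: two strided A loads and two offset B loads in flight
// per iteration. Larger factors put more simultaneous strided
// accesses in flight, which hurt the L2 hit rate in this experiment.
for (int k = 0; k < N; k += 2) {
  sum += A[i*N + k]     * B[k*N     + threadIdx.x];
  sum += A[i*N + k + 1] * B[(k+1)*N + threadIdx.x];
}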
Summary

• OpenMP versions are in general slower than well-tuned CUDA versions
• For future high-level compiler optimizations:
  • Insight #1: Minimize OpenMP overheads when possible
  • Insight #2: Utilize the read-only cache carefully
  • Insight #3: Improve math function code generation
  • Insight #4: Other important insights (e.g., FMA contraction, scheduling clause)
  • Insight #5: Loop transformation is much more important
• The paper includes more detailed information, including hardware counter numbers
Ongoing & Future Work

• Minimizing OpenMP runtime overhead
  • The latest versions of clang and XL C no longer generate the state machine
  • Plan to redo the performance evaluation & analysis
• High-level loop transformations for OpenMP programs
  • For better memory coalescing and read-only cache / shared memory exploitation
  • High-level restructuring: loop permutation, loop fusion, loop distribution, loop tiling
  • Tuning parameters: tile size, unrolling factor
  • Using the polyhedral model + DL model [SC'14]
  • Taking additional information from users [PACT'15]
• Automatic device selection [PACT'15, PPPJ'15]
Acknowledgements

• IBM CAS
• Dr. Wayne Joubert
• WACCPD reviewers and committee members
Backup
Overall Results

Speedup relative to CUDA-baseline (higher is better).
VecAdd through OMRIQ: 1LV (combined and non-combined); SP: 2LV (non-combined only).

                    VecAdd  Saxpy  MM    BS    OMRIQ  SP
CUDA-ROC            0.97    0.97   0.98  0.96  1.01   1.12
clang-combined      0.97    0.95   1.38  0.59  0.67   n/a
clang-non-combined  0.75    0.75   1.41  0.38  0.56   1.28
XLC-combined        0.97    0.97   1.45  0.72  0.82   n/a
XLC-non-combined    0.97    0.97   1.45  0.72  0.82   1.14