GPU programming made easy
with OpenMP
Aditya Nitsure (anitsure@in.ibm.com)
Pidad D’souza (pidsouza@in.ibm.com)
19/03/20 IBM 2
OpenMP (Open Multi-Processing)
(https://www.openmp.org)
Takeaways
● What is OpenMP?
● How does the OpenMP programming model work?
● How to write, compile and run an OpenMP program?
● What are the various OpenMP directives, runtime library APIs and environment variables?
● How does an OpenMP program execute on accelerators (GPUs)?
● What are the best practices for writing OpenMP programs?
Introduction (1)
● Governed by the OpenMP Architecture Review Board (ARB)
● Open, simple and up to date with the latest hardware developments
● A specification of a shared-memory parallel programming model for the Fortran and C/C++ programming languages, consisting of:
– compiler directives
– library routines
– environment variables
● Widely accepted by the community – academia and industry
Introduction (2)
● Shorter learning curve
● Supported by many compiler vendors
– GNU, LLVM, AMD, IBM, Cray, Intel, PGI, OpenUH, ARM
– Compilers from research labs, e.g. LLNL and Barcelona Supercomputing Centre
● Easy to maintain sequential, CPU-parallel and GPU-parallel versions of the code
● Used in application development across a variety of fields and systems
– Aeronautics, automotive, pharmaceutics, finance
– Supercomputers, accelerators, embedded multi-core systems
Specification History
➢ OpenMP 1.0 (1997 Fortran / 1998 C/C++)
● First release; two separate specifications for Fortran and C/C++
➢ OpenMP 2.0 (2000 Fortran / 2002 C/C++)
● Added timing routines and various new clauses – firstprivate, lastprivate, copyprivate, etc.
➢ OpenMP 2.5 (2005)
● Combined standard for C/C++ and Fortran
● Introduced the notion of internal control variables (ICVs) and new constructs – single, sections, etc.
➢ OpenMP 3.0/3.1 (2008 – 2011)
● Added the concept of tasks to the OpenMP execution model
➢ OpenMP 4.0 (2013)
● Accelerator (GPU) support
● SIMD construct for vectorization of loops
● Support for Fortran 2003
➢ OpenMP 4.5 (2015)
● Significant improvements for devices (GPU programming) and Fortran 2003
● New taskloop construct
➢ OpenMP 5.0 (2018) – compilers do not yet support the full feature set of this standard
● Extended the memory model to distinguish different types of flush operations
● Added the target-offload-var ICV and the OMP_TARGET_OFFLOAD environment variable
● Extended the teams construct to execute on the host device
OpenMP Programming (1)
● The OpenMP API uses the fork-join model of parallel execution.
● The OpenMP API provides a relaxed-consistency, shared-memory model.
● An OpenMP program begins as a single thread of execution, called the initial thread. The initial thread executes sequentially.
● When any thread encounters a parallel construct, the thread creates a team consisting of itself and zero or more additional threads and becomes the master of the new team.
OpenMP Programming (2)
● Compiler directives
– Specified with the pragma preprocessing directive
– Start with #pragma omp (for C/C++)
– Case sensitive
– A directive may be followed by zero or more clauses
– Syntax: #pragma omp directive-name [clause[ [,] clause] ... ] new-line
● Runtime library routines
– The library routines are external functions with "C" linkage
– omp.h provides the prototype definitions of all library routines
● Environment variables
– Set at program start-up
– Specify the settings of the internal control variables (ICVs) that affect the execution of OpenMP programs
– Always in upper case
OpenMP Hello World

helloWorld.c:

#include <stdio.h>
#include <omp.h>

int main()
{
    int numThreads = 0;
    // call to the OpenMP runtime library
    // (returns 1 here, since we are outside a parallel region)
    numThreads = omp_get_num_threads();

    // OpenMP directive
    #pragma omp parallel
    {
        // call to the OpenMP runtime library
        int threadNum = omp_get_thread_num();
        printf("Hello World from thread %d\n", threadNum);
    }
    return 0;
}

// Set the OpenMP environment variable
export OMP_NUM_THREADS=4
// Compile the helloWorld program
xlc -qsmp=omp helloWorld.c -o helloWorld
// Run the helloWorld program
./helloWorld

Output:
[aditya@hpcwsw7 openmp_tutorial]$ ./helloWorld
Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
[aditya@hpcwsw7 openmp_tutorial]$ ./helloWorld
Hello World from thread 0
Hello World from thread 1
Hello World from thread 2
Hello World from thread 3

Note that the thread ordering is non-deterministic, so the output order can differ from run to run.
Compilation of OpenMP programs

XL
● CPU: -qsmp=omp
● GPU: -qsmp=omp -qoffload -qtgtarch=sm_70 (V100 GPU)

GNU
● CPU: -fopenmp
● GPU: -fopenmp -foffload=nvptx-none

NVIDIA CUDA linking
-L/usr/local/cuda/lib64 -lcudart
OpenMP Directives
● Parallel construct
● SIMD construct
● Combined constructs
● Worksharing constructs
● Master and synchronization constructs
● Tasking construct
● Device constructs
Parallel construct

#pragma omp parallel [clause[ [,] clause] ... ] new-line
{
}

● The fundamental construct that starts parallel execution
● A team of threads is created to execute the parallel region
● The original thread becomes the master of the new team
● All threads in the team execute the parallel region
● There is an implicit barrier at the end of the parallel region

#pragma omp parallel
{
    int threadNum = omp_get_thread_num();
    printf("Hello World from thread %d\n", threadNum);
}
SIMD construct
● Applied to a loop to transform its iterations so they execute concurrently using SIMD instructions
● When any thread encounters a simd construct, the iterations of the loop associated with the construct may be executed concurrently using the SIMD lanes that are available to the thread.

#pragma omp simd [clause[ [,] clause] ... ] new-line
// for-loop

#pragma omp parallel for simd
for (i=0; i<N; i++)
{
    a[i] = b[i] * c[i];
}

(Diagram: each thread processes its iterations on the vector unit, e.g. b0..b3 multiplied by c0..c3 in a single SIMD instruction; implicit barrier at the end of the parallel region.)
Combined constructs
● Combination of more than one construct
– Specifies one construct immediately nested inside another construct
● Clauses from both constructs are permitted
– With some exceptions, e.g. the nowait clause cannot be specified on parallel for or parallel sections
● Examples
– #pragma omp parallel for
– #pragma omp parallel for simd
– #pragma omp parallel sections
– #pragma omp target parallel for simd
Worksharing constructs
● Distribute the execution of the associated region among the members of the team
● Implicit barrier at the end
● Types
– Loop construct
– Sections construct
– Single construct
– Workshare construct (Fortran only)
Worksharing : Loop construct
● Iterations of the for-loop are distributed among the threads
● Each thread executes one or more iterations in parallel
● Implicit barrier at the end unless nowait is specified

#pragma omp for [clause[ [,] clause] ... ] new-line
// for-loop

int N = 200;
#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<N; i++)
    {
        a[i] = b[i] * c[i];
    }
}

(With 5 threads and N = 200, the iteration space is divided among the threads, e.g. 0-39, 40-79, 80-119, 120-159, 160-199; implicit barrier at the end of the for-loop.)
Worksharing : Sections construct
● Non-iterative worksharing construct
● Each section's structured block is executed once
● The section structured blocks are executed in parallel (one thread per section)
● Implicit barrier at the end unless nowait is specified

#pragma omp sections [clause[ [,] clause] ... ] new-line
{
    #pragma omp section new-line
    // structured block
    #pragma omp section new-line
    // structured block
}

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            // do_work1
        }
        #pragma omp section
        {
            // do_work2
        }
    }
}
Worksharing : Single construct
● Only one thread in the team executes the associated structured block
● Implicit barrier at the end
● All other threads wait until the single block is finished
● The nowait clause removes the barrier at the end of the single construct

#pragma omp single [clause[ [,] clause] ... ] new-line
{
}

#pragma omp parallel
{
    // do_some_work()
    #pragma omp single
    {
        // write result to file
    }
    // do_post_work()
}
Tasking construct
● Defines an explicit task
● A task and its corresponding data environment are generated from the associated structured block
● Execution of a task may be immediate or deferred for later execution
● Typical uses: while-loops, recursion, sorting algorithms

#pragma omp task [clause[ [,] clause] ... ] new-line
{
}

#pragma omp parallel
{
    #pragma omp single
    {
        while ( ptr != NULL )
        {
            #pragma omp task
            {
                // do_some_work();
            }
            ptr = ptr->next;
        } // end of while
    } // end of single
} // end of parallel

A single thread creates the tasks; the task scheduler runs them on the available threads of the team.
Synchronization construct (1)

Critical construct – one thread at a time executes the critical section; the threads pass through it sequentially:

#pragma omp parallel shared(sum) private(localSum)
{
    localSum = 0;  // private variables are not initialized by default
    #pragma omp for
    for (i=0; i<N; i++)
    {
        localSum += a[i];
    }
    #pragma omp critical (global_sum)
    {
        sum += localSum;
    }
}

Master construct – only the master thread executes the section; there is no entry barrier and no exit barrier:

#pragma omp parallel
{
    // do_pre_work()
    #pragma omp master
    {
        // write result to file
    }
    // do_post_work()
}

Ordered construct – the ordered region is executed in the order of the loop iterations:

#pragma omp parallel
{
    #pragma omp for ordered
    for (i=0; i<N; i++)
    {
        // do_some_work
        #pragma omp ordered
        {
            sum += a[i];
        }
        // do_post_work
    }
}
Synchronization construct (2)
● Barrier construct
– #pragma omp barrier new-line
– Stand-alone directive
– Must be executed by all threads in the parallel region
– No thread continues until all threads reach the barrier
● Taskwait construct
– #pragma omp taskwait new-line
– Stand-alone directive
– Waits until all child tasks generated before the taskwait region complete execution
● Atomic construct
– #pragma omp atomic [atomic-clause] new-line
– Ensures that a specific storage location is accessed atomically
– Enforces atomicity for read, write, update or capture
● Flush construct
– #pragma omp flush [(list)] new-line
– Stand-alone directive
– Makes the thread's temporary view of memory consistent with main memory
– When a list is specified, the flush operation applies to the items in the list; otherwise it applies to all thread-visible data items
GPU offloading using OpenMP
Device construct (1)

CPU (no threading):

for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}

CPU OpenMP:

#pragma omp parallel for
for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}

GPU OpenMP:

#pragma omp target
#pragma omp parallel for
for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}

Target construct
● Executes the construct on a device (GPU)
● Maps variables to the device data environment

Note
● No thread migration from one device to another.
● In the absence of a device, or if the implementation does not support the target device, all target regions execute on the host.
Device Construct (2)

The host-centric execution model:

1. A thread starts on the host, and the host creates the data environments on the device(s).
2. The host then maps data to the device data environment (data moves to the device).
3. The host offloads target regions to the target devices (code executes on the device).
4. After execution, the host updates the data between the host and the device (data transfers from device to host).
5. The host destroys the data environment on the device.

#pragma omp target
#pragma omp parallel for
for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}

Teams execute asynchronously, while threads within a team synchronise; a CUDA thread block corresponds to an OpenMP team.
Device construct (3)

Teams construct
● Creates a league of thread teams
● The master thread of each team executes the region
● No implicit barrier at the end of the teams construct

#pragma omp target teams
#pragma omp parallel for
for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}

Distribute construct
● Distributes iterations among the master threads of all teams
● No implicit barrier at the end of the distribute construct
● Always associated with loops and binds to the teams region

#pragma omp target
#pragma omp teams distribute parallel for
for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}
Device construct (4)
● Target data
– #pragma omp target data [clause] new-line
  { structured block }
– Maps variables to the device data environment; the task encountering the construct executes the region
● Target enter data / target exit data
– #pragma omp target enter data [clause] new-line
– #pragma omp target exit data [clause] new-line
– Stand-alone directives
– Variables are mapped/unmapped to/from the device data environment
– Generates a new task that encloses the target data region
Device Data Mapping
● Map clause
– Maps variables from the host data environment to corresponding variables in the device data environment
– map([ [map-type-modifier[,]] map-type : ] list)
– map-type
● to (initializes the device variables from the corresponding host variables)
● from (writes data back from the device to the host)
● tofrom – the default map type (combination of to and from)
● alloc (allocates device memory for the variables)
● release (decrements the reference count by one)
● delete (deletes the device memory of the variables)
Device construct (5)
● if clause
– if (directive-name-modifier : scalar-expression)
– Used on target data, target enter/exit data, and combined constructs
– For target constructs, if the if clause expression evaluates to false, the target region is executed on the host
● device (n)
– Specifies the device (GPU) on which the target region executes; 'n' is the device id
– In the absence of the device clause, the target region executes on the default device
OpenMP Clauses (1)
● schedule([modifier :] kind[, chunk_size])
– Kind
● static : iterations are divided into chunks and assigned to threads in round-robin fashion
● dynamic : iterations are divided into chunks and assigned to threads dynamically (first-come-first-served)
● guided : threads grab chunks of iterations, and the chunk size gradually shrinks down to chunk_size as execution proceeds
● auto : the compiler or OpenMP runtime decides the optimal schedule from static, dynamic or guided
● runtime : the decision is deferred until runtime; the schedule type and chunk size are taken from the OMP_SCHEDULE environment variable
– Modifier (new in the OpenMP 4.5 specification)
● simd : the chunk size is adjusted to the SIMD width; works only with the SIMD construct, e.g. schedule(simd:static, 5)
● monotonic : chunks are assigned to each thread in increasing logical iteration order
● nonmonotonic : the assignment order of chunks to threads is unspecified
OpenMP Clauses (2)
● collapse (n)
– Collapses two or more loops to create a larger iteration space
– The parameter n specifies the number of loops associated with the loop construct to be collapsed

// the i and j loops are collapsed into one iteration space; k stays inner
#pragma omp parallel for collapse (2)
for (i=0; i<N; i++) {
    for (j=0; j<M; j++){
        for (k=0; k<M; k++){
            foo(i, j, k);
}}}

● nowait
– Allows asynchronous thread progress in worksharing constructs, i.e. threads that finish early proceed to the next instruction without waiting for the other threads
– Used in loop, sections, single, target and so on
Data Sharing Attribute Clauses (1)
● private (list)
– Declares variables to be private to a task or thread
– All references to the original variable are replaced by references to the new variable when the respective directive is encountered
– Private variables are not initialized by default
● shared (list)
– Declares variables to be shared by tasks or threads
– All references to the variable point to the original variable when the respective directive is encountered
● firstprivate (list)
– Initializes the private variable with the value of the original variable prior to execution of the construct
– Used with parallel, task, target, teams, etc.
● lastprivate (list)
– Writes the value of the private variable back to the original variable on exiting the respective construct
– The updated value comes from the sequentially last iteration or section of the worksharing construct
Data Sharing Attribute Clauses (2)
● reduction (reduction-identifier : list)
– Updates the original variable with the final value of each of the private copies, using the combiner – an operator or identifier
– The reduction identifier is either an identifier (max, min) or an operator (+, -, *, &, |, ^, && and ||)
● threadprivate (list)
– Makes global variables (file scope, static) private to each thread
– Each thread keeps its own copy of the global variable
Data Copying Clauses
Copy the value of a private or threadprivate variable from one thread to another
● copyin(list)
– Allowed on parallel and combined worksharing constructs
– Copies the value of the master thread's threadprivate variable to the threadprivate variables of the other threads before the start of execution
● copyprivate(list)
– Allowed on the single construct
– Broadcasts the value of a private variable from one thread to the other threads in the team
OpenMP Runtime library (1)
● Runtime library definitions
– omp.h provides the prototypes of all OpenMP routines and types
● Execution environment routines (35+ routines)
– Routines that affect and monitor threads, processors and the parallel environment
● void omp_set_num_threads(int num_threads);
● int omp_get_thread_num(void);
● int omp_get_num_procs(void);
● void omp_set_schedule(omp_sched_t kind, int chunk_size);
● Timing routines
– Routines that support a portable wall-clock timer
– double omp_get_wtime(void);
– double omp_get_wtick(void);
OpenMP Runtime library (2)
● Lock routines (6 routines)
– Routines for thread synchronization
– OpenMP locks are represented by OpenMP lock variables
– 2 types of locks
● Simple : can be set only once
● Nestable : can be set multiple times by the same task
– Example APIs
● void omp_init_lock(omp_lock_t *lock); void omp_destroy_lock(omp_lock_t *lock);
● void omp_set_nest_lock(omp_nest_lock_t *lock); void omp_unset_nest_lock(omp_nest_lock_t *lock);
● int omp_test_lock(omp_lock_t *lock);
● Device memory routines (7 routines)
– Routines that support allocation of memory and management of pointers in the data environments of target devices
● void* omp_target_alloc(size_t size, int device_num);
● void omp_target_free(void *device_ptr, int device_num);
● int omp_target_memcpy(void *dst, void *src, size_t length, size_t dst_offset, size_t src_offset, int dst_device_num, int src_device_num);
OpenMP Environment Variables

1. OMP_SCHEDULE – sets the run-sched-var ICV that specifies the runtime schedule type and chunk size. It can be set to any of the valid OpenMP schedule types.
2. OMP_NUM_THREADS – sets the nthreads-var ICV that specifies the number of threads to use for parallel regions.
3. OMP_DYNAMIC – sets the dyn-var ICV that specifies the dynamic adjustment of the number of threads to use for parallel regions.
4. OMP_PROC_BIND – sets the bind-var ICV that controls the OpenMP thread affinity policy.
5. OMP_PLACES – sets the place-partition-var ICV that defines the OpenMP places that are available to the execution environment.
6. OMP_NESTED – sets the nest-var ICV that enables or disables nested parallelism.
7. OMP_STACKSIZE – sets the stacksize-var ICV that specifies the size of the stack for threads created by the OpenMP implementation.
8. OMP_WAIT_POLICY – sets the wait-policy-var ICV that controls the desired behavior of waiting threads.
9. OMP_MAX_ACTIVE_LEVELS – sets the max-active-levels-var ICV that controls the maximum number of nested active parallel regions.
10. OMP_THREAD_LIMIT – sets the thread-limit-var ICV that controls the maximum number of threads participating in a contention group.
11. OMP_CANCELLATION – sets the cancel-var ICV that enables or disables cancellation.
12. OMP_DISPLAY_ENV – instructs the runtime to display the OpenMP version number and the initial values of the ICVs, once, during initialization of the runtime.
13. OMP_DEFAULT_DEVICE – sets the default-device-var ICV that controls the default device number.
14. OMP_MAX_TASK_PRIORITY – sets the max-task-priority-var ICV that specifies the maximum value that can be specified in the priority clause of the task construct.
Best practices
● Always initialize data (arrays) within parallel code to benefit from the first-touch placement policy on NUMA nodes
● Bind OpenMP threads to cores (see the OpenMP thread affinity policy)
● Watch for false sharing and add padding to arrays if necessary
● Collapse loops to increase the available parallelism
● Always use teams and distribute to expose all available parallelism on GPUs
● Avoid small parallel blocks on GPUs and unnecessary data transfer between host and device
● Overlap data transfer with compute using asynchronous transfers
Hands-on

Example : matrix multiplication

A × B = C

Matrix multiplication implementation:

for(i=0;i<ROWSIZE;i++)
{
    for(j=0;j<COLSIZE;j++)
    {
        for(k=0;k<ROWSIZE;k++)
        {
            pC(i,j)+=pA(i,k)*pB(k,j);
        }
    }
}
Example 1 – GPU offloading
This example demonstrates offloading CPU functionality to a single GPU using OpenMP pragmas. The notebook facilitates compilation, running, monitoring and profiling of the example program.
Example : matrix multiplication
(domain decomposition for a parallel solution)

(Diagram: A × B = C, with each matrix divided into blocks that are distributed across compute units CU0–CU3 within nodes N0–N3.)

CU – compute unit
N – node
Example 2 – GPU scaling
This example demonstrates multi-GPU scaling using OpenMP pragmas. The notebook facilitates compilation, running, monitoring and profiling of the example program.

Omp tutorial cpugpu_programming_cdac

  • 1.
    GPU programming madeeasy with OpenMP Aditya Nitsure (anitsure@in.ibm.com) Pidad D’souza (pidsouza@in.ibm.com)
  • 2.
    19/03/20 IBM 2 OpenMP(Open Multi-Processing)OpenMP (Open Multi-Processing) (https://www.openmp.org) OpenMP (Open Multi-Processing)OpenMP (Open Multi-Processing) (https://www.openmp.org)
  • 3.
    19/03/20 IBM 3 Takeaway ā— What is OpenMP? ā— How OpenMP programming model works ? ā— How to write, compile and run OpenMP program? ā— What are various OpenMP directives, runtime library API & Environment variables? ā— How OpenMP program executes on accelerators (GPUs)? ā— What are the best practices to write OpenMP program?
  • 4.
    19/03/20 IBM 4 Introduction(1) ā— Governed by OpenMP Architecture Review Board (ARB) ā— Open source, simple and up to date with the latest hardware development ā— Specification for shared memory parallel programming model for Fortran and C/C++ programming languages ī€Œ compiler directives ī€Œ library routines ī€Œ environment variables ā— Widely accepted by community – academician & industry
  • 5.
    19/03/20 IBM 5 Introduction(2) ā— Shorter learning curve ā— Supported by many compiler vendors ī€Œ GNU, LLVM, AMD, IBM, Cray, Intel, PGI, OpenUH, ARM ī€Œ Compilers from research labs e.g. LLNL, Barcelona Supercomputing Centre ā— Easy to maintain sequential, CPU parallel and GPU parallel versions of code. ā— Used in application development from variety of fields and systems ī€Œ Aeronautics, Automotive, Pharmaceutics, Finance ī€Œ Supercomputers, Accelerators, Embedded multi-core systems
  • 6.
    19/03/20 IBM 6 SpecificationHistory āž¢ OpenMP 1.0 (1997 (Fortran) / 1998 (C/C++)) ā— First release of Fortran & C/C++ specification (2 separate specifications) āž¢ OpenMP 2.0 (2000 (Fortran) / 2002 (C/C++)) ā— Addition of timing routines and various new clauses – firstprivate, lastprivate, copyprivate etc. āž¢ OpenMP 2.5 (2005) ā— combined standard for C/C++ & Fortran ā— Introduce notion of internal control variables (ICVs) and new constructs – single, sections etc. āž¢ OpenMP 3.0/3.1 (2008 – 2011) ā— Concept of task added in OpenMP execution model āž¢ OpenMP 4.0 (2013) ā— Accelerator (GPU) support ā— SIMD construct for vectorization of loops ā— Support for FORTRAN 2003 āž¢ OpenMP 4.5 (2015) ā— Significant improvement for devices (GPU programming) & Fortran 2003 ā— New taskloop construct āž¢ OpenMP 5.0 (2018) – compilers are not yet support full set of features of this standard ! ā— Extended memory model to distinguish different types of flush operations ā— Added target-offload-var ICV and OMP_TARGET_OFFLOAD environment variable ā— Teams construct extended to execute on host device
  • 7.
    19/03/20 IBM 7 OpenMPProgramming (1) ā— The OpenMP API uses the fork-join model of parallel execution. ā— The OpenMP API provides a relaxed-consistency, shared- memory model. ā— An OpenMP program begins as a single thread of execution, called an initial thread. An initial thread executes sequentially. ā— When any thread encounters a parallel construct, the thread creates a team of itself and zero or more additional threads and becomes the master of the new team.
  • 8.
    19/03/20 IBM 8 OpenMPProgramming (2) ā— Compiler directives – Specified with pragma preprocessing – Starts with #pragma omp (For C/C++) – Case sensitive – Any directive is followed by one or more clauses – Syntax #pragma omp directive-name [clause[ [,] clause] ... ] new-line ā— Runtime library routines – The library routines are external functions with ā€œCā€ linkage – omp.h provides prototype definition of all library routines ā— Environment variables – Set at the program start up – Specifies the settings of the Internal Control Variables (ICVs) that affect the execution of OpenMP programs – Always in upper case
  • 9.
    19/03/20 IBM 9 OpenMPHello World #include <stdio.h> #include <omp.h> int main() { int i = 0; int numThreads = 0; // call to OpenMP runtime library numThreads = omp_get_num_threads(numThreads); // OpenMP directive #pragma omp parallel { // call to OpenMP runtime library int threadNum = omp_get_thread_num(); printf(ā€œHello World from thread %d nā€, threadNum); } return 0; } Output [aditya@hpcwsw7 openmp_tutorial]$ ./helloWorld Hello World from thread 0 Hello World from thread 3 Hello World from thread 1 Hello World from thread 2 [aditya@hpcwsw7 openmp_tutorial]$ ./helloWorld Hello World from thread 0 Hello World from thread 1 Hello World from thread 2 Hello World from thread 3 // Setting OpenMP environment variable export OMP_NUM_THREADS=4 //compile helloWorld program xlc -qsmp=omp helloWorld.c -o helloWorld // Run helloWorld program ./helloWorld helloWorld.c
  • 10.
    19/03/20 IBM 10 Compilationof OpenMP program XL ā— CPU -qsmp=omp ā— GPU -qsmp=omp -qoffload -qtgtarch=sm_70 (V100 GPU) GNU ā— CPU -fopenmp ā— GPU -fopenmp -foffload=nvptx-none NVIDIA CUDA linking -lcudart -L/usr/local/cuda/lib64
  • 11.
    19/03/20 IBM 11 OpenMPDirectives ā— Parallel construct ā— SIMD construct ā— Combined construct ā— Work sharing construct ā— Master and synchronization construct ā— Tasking construct ā— Device construct
  • 12.
    19/03/20 IBM 12 Parallelconstruct #pragma omp parallel [clause[ [,] clause] ... ] new-line { } ā— The fundamental construct that starts parallel execution ā— A team of threads is created to execute parallel region ā— Original thread becomes master of the new team ā— All threads in the team executes parallel region #pragma omp parallel Implicit barrier at the end of parallel region Original thread becomes master thread in parallel region #pragma omp parallel { int threadNum = omp_get_thread_num(); printf(ā€œHello World from thread %d nā€, threadNum); }
  • 13.
    19/03/20 IBM 13 SIMDconstruct ā— Applied on loop to transform loop iterations to execute concurrently using SIMD instructions ā— When any thread encounters a simd construct, the iterations of the loop associated with the construct may be executed concurrently using the SIMD lanes that are available to the thread. #pragma omp simd [clause[ [,] clause] ... ] new-line { } #pragma omp parallel for simd for (i=0; i<N; i++) { a[i] = b[i] * c[i] } #pragma omp parallel for simd Implicit barrier at the end of parallel region b0 b1 b2 b3 c0 c1 c2 c3 * vector unitEach iteration of each thread executed concurrently
  • 14.
    19/03/20 IBM 14 Combinedconstruct ā— Combination of more than one construct – Specifies one construct immediately nested inside another construct ā— Clauses from both constructs are permitted – With some exceptions e.g. nowait clause cannot be specified in parallel for or parallel sections ā— Examples – #pragma omp parallel for – #pragma omp parallel for simd – #pragma omp parallel sections – #pragma omp target parallel for simd
  • 15.
    19/03/20 IBM 15 Worksharingconstruct ā— Distributes the execution of the associated parallel region among the members of the team ā— Implicit barrier at the end ā— Types – Loop construct – Sections construct – Single construct – Workshare construct (only in Fortran)
  • 16.
    19/03/20 IBM 16 Worksharing: Loop construct ā— Iterations of for-loop distributed among the threads ā— One or more iterations executed in parallel ā— Implicit barrier at the end unless nowait #pragma omp for [clause[ [,] clause] ... ] new-line // for-loop { } int N = 200; #pragma omp parallel { #pragma omp for for (i=0; i<N; i++) { a[i] = b[i] * c[i] } } #pragma omp parallel #pragma omp for Implicit barrier at the end of for-loop Iteration space is divided among threads (0-39) (40-79) (80-119) (120-159) (160-199)
  • 17.
    19/03/20 IBM 17 Worksharing: Section construct ā— Non iterative worksharing construct ā— Each section structured block is executed once ā— The section structured blocks are executed in parallel (one thread per section) ā— Implicit barrier at the end unless nowait specified #pragma omp sections [clause[ [,] clause] ... ] new-line { #pragma omp section new-line // structured block #pragma omp section new-line // structured block } #pragma omp parallel { #pragma omp sections { #pragma omp section { // do_work1 } #pragma omp section { // do_work2 } } } #pragma omp parallel #pragma omp sections Implicit barrier at the end of sections clause section divided among threads section1 section2 section3
  • 18.
    19/03/20 IBM 18 Worksharing: Single construct ā— Only one thread in the team executes the associated structured block ā— Implicit barrier at the end ā— All other threads wait until single block is finished ā— nowait clause removes barrier at the end of single construct #pragma omp single [clause[ [,] clause] ... ] new-line { } #pragma omp parallel end of parallel region Single section Only one thread executes this section Barrier at the end of single section #pragma omp single #pragma omp parallel { // do_some_work() #pragma omp single { //write result to file } // do_post_work() }
  • 19.
Tasking construct
● Defines an explicit task
● A task is generated, and the corresponding data environment is created, from the associated structured block
● Execution of a task may be immediate or deferred for later execution
● Typical uses: while loops, recursion, sorting algorithms

Syntax:

    #pragma omp task [clause[ [,] clause] ... ] new-line
    { ... }

Example:

    #pragma omp parallel
    {
        #pragma omp single
        {
            while (ptr != NULL) {
                #pragma omp task
                {
                    // do_some_work();
                }
                ptr = ptr->next;
            }  // end of while
        }  // end of single
    }  // end of parallel

(Diagram: a single thread creates the tasks; the task scheduler schedules them onto the available threads.)
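The while-loop pattern above becomes runnable with a small linked list (the node type and process_list name are illustrative, not from the original). One thread walks the list spawning a task per node, and taskwait ensures all tasks finish before the result is read; serially (no -fopenmp) the behavior is identical:

```c
#include <stdlib.h>

typedef struct node { int value; struct node *next; } node;

/* One thread traverses the list and generates a task per node.
   Each task writes a distinct node, so the tasks need no mutual
   synchronization; taskwait orders them before the final sum. */
int process_list(node *head) {
    #pragma omp parallel
    #pragma omp single
    {
        for (node *p = head; p != NULL; p = p->next) {
            #pragma omp task firstprivate(p)
            { p->value *= 2; }   /* stand-in for do_some_work() */
        }
        #pragma omp taskwait    /* wait for all child tasks */
    }
    int sum = 0;
    for (node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```

firstprivate(p) gives each task its own copy of the traversal pointer captured at task-creation time.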
Synchronization constructs (1)

Critical construct — one thread at a time executes the critical section, so with 4 threads it executes sequentially:

    #pragma omp parallel shared(sum) private(localSum)
    {
        localSum = 0;                  // private variables are uninitialized
        #pragma omp for
        for (i = 0; i < N; i++) {
            localSum += a[i];
        }
        #pragma omp critical (global_sum)
        {
            sum += localSum;
        }
    }

Master construct — only the master thread executes the section; no entry barrier, no exit barrier:

    #pragma omp parallel
    {
        #pragma omp for
        // do_pre_work()
        #pragma omp master
        {
            // write result to file
        }
        #pragma omp for
        // do_post_work()
    }

Ordered construct — iteration order is preserved inside the ordered region (the loop must carry the ordered clause):

    #pragma omp parallel
    {
        #pragma omp for ordered
        for (i = 0; i < N; i++) {
            // do_some_work
            #pragma omp ordered
            {
                sum += a[i];
            }
            // do_post_work
        }
    }
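The critical-section pattern from the slide as a complete function (name is illustrative); localSum is made private simply by declaring it inside the parallel region, and it is initialized before use. Compiled without OpenMP the pragmas are ignored and the serial result is the same:

```c
/* Per-thread partial sums combined under a named critical section. */
double critical_sum(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel shared(sum)
    {
        double localSum = 0.0;   /* private: block-scoped in the region */
        #pragma omp for
        for (int i = 0; i < n; i++)
            localSum += a[i];

        #pragma omp critical (global_sum)
        { sum += localSum; }     /* one thread at a time updates sum */
    }
    return sum;
}
```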
Synchronization constructs (2)
● Barrier construct
    – #pragma omp barrier new-line
    – Stand-alone directive
    – Must be encountered by all threads in the parallel region
    – No thread continues until all threads reach the barrier
● Taskwait construct
    – #pragma omp taskwait new-line
    – Stand-alone directive
    – Waits until all child tasks generated before the taskwait region complete execution
● Atomic construct
    – #pragma omp atomic [atomic-clause] new-line
    – Ensures that a specific storage location is accessed atomically
    – Enforces atomicity for read, write, update or capture
● Flush construct
    – #pragma omp flush [(list)] new-line
    – Stand-alone directive
    – Makes a thread's temporary view of memory consistent with main memory
    – When a list is specified, the flush operation applies to the items in the list; otherwise it applies to all thread-visible data items
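A small sketch of the atomic construct (function name is illustrative): for a single read-modify-write of one storage location, atomic is typically cheaper than a critical section. Serially the pragmas are ignored and the count is unchanged:

```c
/* Count elements above a threshold; the increment of the shared
   counter is protected by an atomic update. */
int count_above(const int *a, int n, int threshold) {
    int count = 0;
    #pragma omp parallel for shared(count)
    for (int i = 0; i < n; i++) {
        if (a[i] > threshold) {
            #pragma omp atomic
            count++;            /* atomic update of one location */
        }
    }
    return count;
}
```

For this particular pattern a reduction(+: count) clause would scale better; atomic is shown here to match the construct being introduced.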
GPU offloading using OpenMP
Device construct (1)

CPU OpenMP:

    #pragma omp parallel for
    for (i=0; i<N; i++) {
        a[i] = b[i] + s * c[i];
    }

GPU OpenMP:

    #pragma omp target
    #pragma omp parallel for
    for (i=0; i<N; i++) {
        a[i] = b[i] + s * c[i];
    }

Target construct
● Executes the construct on a device (GPU)
● Maps variables to the device data environment
Note
● No thread migration from one device to another
● In the absence of a device, or of a supported implementation for the target device, all target regions execute on the host
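The slide's loop as a complete function (the name saxpy and the explicit map clauses are additions for clarity; map is covered in detail a few slides later). With offloading support the region runs on the GPU; otherwise, per the fallback rule above, it executes on the host, and without -fopenmp the pragmas are simply ignored:

```c
/* a[i] = b[i] + s * c[i], offloaded to the default device if one
   is available, with explicit data mapping. */
void saxpy(double *a, const double *b, const double *c,
           double s, int n) {
    #pragma omp target map(from: a[0:n]) map(to: b[0:n], c[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```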
Device construct (2) — the host-centric execution model
1. A thread starts on the host, and the host creates the data environment(s) on the device(s).
2. The host maps data to the device data environment (data moves to the device).
3. The host offloads target regions to the target devices (code executes on the device).
4. After execution, the host updates the data between host and device (data transfers from device to host).
5. The host destroys the data environment on the device.

    #pragma omp target
    #pragma omp parallel for
    for (i=0; i<N; i++) {
        a[i] = b[i] + s * c[i];
    }

(Diagram: teams execute asynchronously; threads within a team synchronise. A CUDA thread block corresponds to a team.)
Device construct (3)

Teams construct
● Creates a league of thread teams
● The master thread of each team executes the region
● No implicit barrier at the end of the teams construct

    #pragma omp target teams
    #pragma omp parallel for
    for (i=0; i<N; i++) {
        a[i] = b[i] + s * c[i];
    }

Distribute construct
● Distributes iterations among the master threads of all teams
● No implicit barrier at the end of the distribute construct
● Always associated with loops and binds to the teams region

    #pragma omp target
    #pragma omp teams distribute parallel for
    for (i=0; i<N; i++) {
        a[i] = b[i] + s * c[i];
    }
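The combined form above in a self-contained function (name and map clauses are illustrative): teams distribute spreads iterations across teams (CUDA blocks) and parallel for splits each team's share across its threads. On a host without offload support, or without -fopenmp, the loop runs on the CPU with the same result:

```c
/* Same computation as the plain target version, but exposing both
   levels of GPU parallelism: teams (blocks) and threads. */
void saxpy_teams(double *a, const double *b, const double *c,
                 double s, int n) {
    #pragma omp target teams distribute parallel for \
            map(from: a[0:n]) map(to: b[0:n], c[0:n])
    for (int i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```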
Device construct (4)
● Target data
    – #pragma omp target data [clause] new-line {structured block}
    – Maps variables to the device data environment; the task encountering the construct executes the region
● Target enter data / target exit data
    – #pragma omp target enter data [clause] new-line
    – #pragma omp target exit data [clause] new-line
    – Stand-alone directives
    – Variables are mapped to / unmapped from the device data environment
    – Generates a new task that encloses the target data region
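A sketch of why target data matters (function name is illustrative): the data region keeps x resident on the device across two kernels, avoiding a host-device round trip between them. Without a device, or without -fopenmp, both loops run on the host and the result is identical:

```c
/* x stays mapped for the whole structured block, so the second
   kernel reuses the device copy produced by the first. */
void scale_then_shift(double *x, int n, double s, double t) {
    #pragma omp target data map(tofrom: x[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            x[i] *= s;            /* kernel 1: scale */

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            x[i] += t;            /* kernel 2: shift, no re-transfer */
    }
    /* x is copied back to the host here (tofrom) */
}
```

target enter data / target exit data achieve the same persistence without the lexical scoping, which is useful when allocation and release happen in different functions.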
Device Data Mapping
● Map clause
    – Maps variables from the host data environment to corresponding variables in the device data environment
    – map([ [map-type-modifier[,]] map-type : ] list)
    – map-type
        ● to (initializes device variables from the corresponding host variables)
        ● from (writes data back from device to host)
        ● tofrom – default map type (combination of to and from)
        ● alloc (allocates device memory for the variables)
        ● release (decrements the reference count by one)
        ● delete (deletes the device memory of the variables)
Device construct (5)
● if clause
    – if ([directive-name-modifier :] scalar-expression)
    – Used on target, target data, target enter/exit data and combined constructs
    – On target constructs, if the expression evaluates to false, the target region executes on the host
● device(n) clause
    – Specifies the device (GPU) on which the target region executes; 'n' is the device id
    – In the absence of a device clause, the target region executes on the default device
OpenMP Clauses (1)
● schedule([modifier :] kind[, chunk_size])
    – Kind
        ● static: iterations are divided into chunks and assigned to threads in round-robin fashion
        ● dynamic: iterations are divided into chunks and assigned to threads dynamically (first-come, first-served)
        ● guided: threads grab chunks of iterations, and the chunk size gradually shrinks toward chunk_size as execution proceeds
        ● auto: the compiler or OpenMP runtime chooses the optimum schedule among static, dynamic and guided
        ● runtime: the decision is deferred until runtime; schedule type and chunk size are taken from an environment variable
    – Modifier (new addition in the OpenMP 4.5 specification)
        ● simd: the chunk size is adjusted to the SIMD width; works only with the simd construct, e.g. schedule(simd:static, 5)
        ● monotonic: chunks are assigned to each thread in increasing logical iteration order
        ● nonmonotonic: the order in which chunks are assigned to threads is unspecified
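A small illustration of when dynamic scheduling pays off (function name and workload are illustrative): the cost of iteration i grows with i, so static chunks would be unbalanced, while dynamic hands out chunks of 4 iterations on a first-come basis. Serially (pragma ignored) the result is the same:

```c
/* Unbalanced loop: iteration i does O(i) inner work, so a dynamic
   schedule keeps the threads evenly loaded. */
long triangular_work(int n) {
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+: total)
    for (int i = 0; i < n; i++) {
        long w = 0;
        for (int j = 0; j <= i; j++)   /* cost grows with i */
            w += j;
        total += w;
    }
    return total;
}
```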
OpenMP Clauses (2)
● collapse(n)
    – Collapses two or more loops to create a larger iteration space
    – The parameter n specifies the number of loops associated with the loop construct to be collapsed

    #pragma omp parallel for collapse (2)
    for (i=0; i<N; i++) {
        for (j=0; j<M; j++) {
            for (k=0; k<M; k++) {
                foo(i, j, k);
            }
        }
    }

● nowait
    – Enables asynchronous thread progress in worksharing constructs, i.e. threads that finish early proceed to the next instruction without waiting for the other threads
    – Used on loop, sections, single, target and other constructs
Data-Sharing Attribute Clauses (1)
● private(list)
    – Declares variables to be private to a task or thread
    – All references to the original variable are replaced by references to a new variable when the respective directive is encountered
    – Private variables are not initialized by default
● shared(list)
    – Declares variables to be shared by tasks or threads
    – All references to the variable point to the original variable when the respective directive is encountered
● firstprivate(list)
    – Initializes each private variable with the value of the original variable prior to execution of the construct
    – Used with parallel, task, target, teams, etc.
● lastprivate(list)
    – Writes the value of the private variable back to the original variable on exiting the respective construct
    – The updated value comes from the sequentially last iteration or section of the worksharing construct
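firstprivate and lastprivate in one compilable sketch (function name is illustrative): base is read inside the region, so each thread gets an initialized private copy; last receives the value written by the sequentially final iteration. Serially the behavior is unchanged:

```c
/* base: firstprivate, initialized from the original value.
   last:  lastprivate, holds the value from iteration n-1 on exit. */
int scale_array(int *a, int n, int base) {
    int last = 0;
    #pragma omp parallel for firstprivate(base) lastprivate(last)
    for (int i = 0; i < n; i++) {
        a[i] = base + i;
        last = a[i];
    }
    return last;   /* value from the sequentially last iteration */
}
```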
Data-Sharing Attribute Clauses (2)
● reduction(reduction-identifier : list)
    – Updates the original variable with the final values of the private copies, using the combiner operator or identifier
    – The reduction identifier is either an identifier (max, min) or an operator (+, -, *, &, |, ^, && and ||)
● threadprivate(list)
    – Makes global variables (file scope, static) private to each thread
    – Each thread keeps its own copy of the global variable
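Two reductions on one loop, as a compilable sketch (function name is illustrative; the max identifier for C reductions requires OpenMP 3.1 or later). Without -fopenmp the pragma is ignored and the serial loop produces the same values:

```c
/* Combined max and + reductions: each thread keeps private copies
   of m and s, combined with the originals at the end of the loop. */
void stats(const int *a, int n, int *maxval, long *sum) {
    int m = a[0];
    long s = 0;
    #pragma omp parallel for reduction(max: m) reduction(+: s)
    for (int i = 0; i < n; i++) {
        if (a[i] > m) m = a[i];
        s += a[i];
    }
    *maxval = m;
    *sum = s;
}
```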
Data Copying Clauses
Copy the value of a private or threadprivate variable from one thread to another.
● copyin(list)
    – Allowed on parallel and combined worksharing constructs
    – Copies the value of the master thread's threadprivate variable to the threadprivate variables of the other threads before execution starts
● copyprivate(list)
    – Allowed on the single construct
    – Broadcasts the value of a private variable from one thread to the other threads in the team
OpenMP Runtime Library (1)
● Runtime library definitions
    – omp.h declares the prototypes of all OpenMP routines and their types
● Execution environment routines (35+ routines)
    – Routines that affect and monitor threads, processors and the parallel environment
        ● void omp_set_num_threads(int num_threads);
        ● int omp_get_thread_num(void);
        ● int omp_get_num_procs(void);
        ● void omp_set_schedule(omp_sched_t kind, int chunk_size);
● Timing routines
    – Routines that support a portable wall-clock timer
        – double omp_get_wtime(void);
        – double omp_get_wtick(void);
OpenMP Runtime Library (2)
● Lock routines (6 routines)
    – Routines for thread synchronization
    – OpenMP locks are represented by OpenMP lock variables
    – 2 types of locks
        ● Simple: can be set only once
        ● Nestable: can be set multiple times by the same task
    – Example APIs
        ● void omp_init_lock(omp_lock_t *lock); void omp_destroy_lock(omp_lock_t *lock);
        ● void omp_set_nest_lock(omp_nest_lock_t *lock); void omp_unset_nest_lock(omp_nest_lock_t *lock);
        ● int omp_test_lock(omp_lock_t *lock);
● Device memory routines (7 routines)
    – Routines that support allocation of memory and management of pointers in the data environments of target devices
        ● void *omp_target_alloc(size_t size, int device_num);
        ● void omp_target_free(void *device_ptr, int device_num);
        ● int omp_target_memcpy(void *dst, void *src, size_t length, size_t dst_offset, size_t src_offset, int dst_device_num, int src_device_num);
OpenMP Environment Variables
1. OMP_SCHEDULE – sets the run-sched-var ICV that specifies the runtime schedule type and chunk size; it can be set to any of the valid OpenMP schedule types.
2. OMP_NUM_THREADS – sets the nthreads-var ICV that specifies the number of threads to use for parallel regions.
3. OMP_DYNAMIC – sets the dyn-var ICV that specifies the dynamic adjustment of the number of threads to use for parallel regions.
4. OMP_PROC_BIND – sets the bind-var ICV that controls the OpenMP thread affinity policy.
5. OMP_PLACES – sets the place-partition-var ICV that defines the OpenMP places that are available to the execution environment.
6. OMP_NESTED – sets the nest-var ICV that enables or disables nested parallelism.
7. OMP_STACKSIZE – sets the stacksize-var ICV that specifies the size of the stack for threads created by the OpenMP implementation.
8. OMP_WAIT_POLICY – sets the wait-policy-var ICV that controls the desired behavior of waiting threads.
9. OMP_MAX_ACTIVE_LEVELS – sets the max-active-levels-var ICV that controls the maximum number of nested active parallel regions.
10. OMP_THREAD_LIMIT – sets the thread-limit-var ICV that controls the maximum number of threads participating in a contention group.
11. OMP_CANCELLATION – sets the cancel-var ICV that enables or disables cancellation.
12. OMP_DISPLAY_ENV – instructs the runtime to display the OpenMP version number and the initial values of the ICVs, once, during initialization of the runtime.
13. OMP_DEFAULT_DEVICE – sets the default-device-var ICV that controls the default device number.
14. OMP_MAX_TASK_PRIORITY – sets the max-task-priority-var ICV that specifies the maximum value that can be specified in the priority clause of the task construct.
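A hypothetical launch configuration combining several of the variables above (the binary name ./a.out and the chosen values are illustrative, not from the original):

```shell
# Run an OpenMP binary with a few common ICVs set.
export OMP_NUM_THREADS=8          # threads per parallel region
export OMP_SCHEDULE="dynamic,16"  # consumed by schedule(runtime) loops
export OMP_PROC_BIND=close        # keep threads close to the master's place
export OMP_PLACES=cores           # one place per physical core
export OMP_DISPLAY_ENV=true       # print ICV values at startup
# ./a.out                         # then launch the program
```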
Best practices
● Always initialize data (arrays) in parallel code to benefit from the first-touch placement policy on NUMA nodes
● Bind OpenMP threads to cores (see the OpenMP thread affinity policy)
● Watch for false sharing and add padding to arrays if necessary
● Collapse loops to increase available parallelism
● Always use teams and distribute on GPUs to expose all available parallelism
● Avoid small parallel blocks on GPUs and unnecessary data transfers between host and device
● Overlap data transfer with compute using asynchronous transfers
Example: matrix multiplication

    A Ɨ B = C

Matrix multiplication implementation:

    for (i=0; i<ROWSIZE; i++) {
        for (j=0; j<COLSIZE; j++) {
            for (k=0; k<ROWSIZE; k++) {
                pC(i,j) += pA(i,k) * pB(k,j);
            }
        }
    }
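A parallel version of the triple loop above as a self-contained function (the flat row-major indexing and the accumulator variable are additions; the slide's pA/pB/pC macros are replaced by explicit index arithmetic). collapse(2) merges the i and j loops into one iteration space, and the scalar accumulator avoids repeated stores to C. Serially the pragma is ignored and the result is identical:

```c
/* Square matrix multiply C = A * B, row-major, with the two outer
   loops collapsed for a larger parallel iteration space. */
void matmul(const double *A, const double *B, double *C, int n) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;                 /* private accumulator */
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}
```

Prefixing the pragma with target teams distribute (and appropriate map clauses) turns the same loop nest into the GPU-offloaded version used in the notebook examples.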
Example 1 – GPU offloading
This example demonstrates offloading CPU functionality onto a single GPU using OpenMP pragmas. The notebook facilitates compilation, running, monitoring and profiling of the example program.
Example: matrix multiplication (domain decomposition for parallel execution)

    A Ɨ B = C

(Diagram: A, B and C are decomposed across nodes N0–N3 and compute units CU0–CU3. CU – compute unit; N – node.)
Example 2 – GPU scaling
This example demonstrates multi-GPU scaling using OpenMP pragmas. The notebook facilitates compilation, running, monitoring and profiling of the example program.