Parallel Processing
with OpenMP
Doug Sondak
Boston University
Scientific Computing and Visualization
Office of Information Technology
sondak@bu.edu
Outline
• Introduction
• Basics
• Data Dependencies
• A Few More Basics
• Caveats & Compilation
• Coarse-Grained Parallelization
• Thread Control Directives
• Some Additional Functions
• Some Additional Clauses
• Nested Parallelism
• Locks
Introduction
• Types of parallel machines
– distributed memory
• each processor has its own memory address space
• variable values are independent
x = 2 on one processor, x = 3 on a different processor
• examples: linux clusters, Blue Gene/L
– shared memory
• also called Symmetric Multiprocessing (SMP)
• single address space for all processors
– If one processor sets x = 2 , x will also equal 2 on other
processors (unless specified otherwise)
• examples: IBM p-series, SGI Altix
Shared vs. Distributed Memory
[Figure: shared memory — CPU 0 through CPU 3 all connected to a single MEM; distributed memory — CPU 0 through CPU 3, each with its own MEM 0 through MEM 3]
Shared vs. Distributed Memory (cont’d)
• Multiple processes
– Each processor (typically) performs
independent task
• Multiple threads
– A process spawns additional tasks (threads)
with same memory address space
What is OpenMP?
• Application Programming Interface (API)
for multi-threaded parallelization consisting
of
– Source code directives
– Functions
– Environment variables
What is OpenMP? (cont’d)
• Advantages
– Easy to use
– Incremental parallelization
– Flexible
• Loop-level or coarse-grain
– Portable
• Since there’s a standard, will work on any SMP machine
• Disadvantage
– Shared-memory systems only
Basics
• Goal – distribute work among threads
• Two methods will be discussed here
– Loop-level
• Specified loops are parallelized
• This is approach taken by automatic parallelization tools
– Parallel regions
• Sometimes called “coarse-grained”
• Don’t know good term; good way to start argument with
semantically precise people
• Usually used in message-passing (MPI)
Basics (cont’d)
[Figure: loop-level approach — serial code alternating with individual parallelized loops; parallel-regions approach — one large parallel region containing the loops]
parallel do & parallel for
• parallel do (Fortran) and parallel for (C)
directives parallelize subsequent loop
#pragma omp parallel for
for(i = 1; i <= maxi; i++){
a[i] = b[i] + c[i];
}
!$omp parallel do
do i = 1, maxi
a(i) = b(i) + c(i)
enddo
Use “c$” for fixed-format Fortran
parallel do & parallel for (cont’d)
• Suppose maxi = 1000
Thread 0 gets i = 1 to 250
Thread 1 gets i = 251 to 500
etc.
• Barrier implied at end of loop
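• For reference, a minimal self-contained C version of this loop (array size and initialization illustrative), compiled with an OpenMP flag:
#include <stdio.h>
#include <omp.h>
#define MAXI 1000
int main(void){
  int i;
  static float a[MAXI], b[MAXI], c[MAXI];
  for(i = 0; i < MAXI; i++){ b[i] = i; c[i] = 2*i; }
  /* iterations are split among threads; implied barrier at loop end */
  #pragma omp parallel for
  for(i = 0; i < MAXI; i++){
    a[i] = b[i] + c[i];
  }
  printf("a[%d] = %f\n", MAXI-1, a[MAXI-1]);
  return 0;
}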
workshare
• For Fortran 90/95 array syntax, the parallel
workshare directive is analogous to
parallel do
• Previous example would be:
• Also works for forall and where statements
!$omp parallel workshare
a = b + c
!$omp end parallel workshare
Shared vs. Private
• In parallel region, default behavior is that
all variables are shared except loop index
– All threads read and write the same memory
location for each variable
– This is ok if threads are accessing different
elements of an array
– Problem if threads write same scalar or array
element
– Loop index is private, so each thread has its
own copy
Shared vs. Private (cont’d)
• Here’s an example where a shared
variable, i2, could cause a problem
ifirst = 10
do i = 1, imax
i2 = 2*i
j(i) = ifirst + i2
enddo
ifirst = 10;
for(i = 1; i <= imax; i++){
i2 = 2*i;
j[i] = ifirst + i2;
}
Shared vs. Private (3)
• OpenMP clauses modify the behavior of
directives
• Private clause creates separate memory
location for specified variable for each
thread
Shared vs. Private (4)
ifirst = 10
!$omp parallel do private(i2)
do i = 1, imax
i2 = 2*i
j(i) = ifirst + i2
enddo
ifirst = 10;
#pragma omp parallel for private(i2)
for(i = 1; i <= imax; i++){
i2 = 2*i;
j[i] = ifirst + i2;
}
Shared vs. Private (5)
• Look at memory just before and after
parallel do/parallel for statement
– Let’s look at two threads
[Figure: before — one thread, with ifirst = 10 and thread 0's i2 undefined; after — an additional thread has been spawned, ifirst = 10 is shared, and each thread has its own undefined copy of i2]
Shared vs. Private (6)
• Both threads access the same shared
value of ifirst
• They have their own private copies of i2
[Figure: threads 0 and 1 each see the shared value ifirst = 10 and hold their own undefined private copies of i2]
Data Dependencies
Data Dependencies
• Data on one thread can be dependent on
data on another thread
• This can result in wrong answers
– thread 0 may require a variable that is
calculated on thread 1
– answer depends on timing – When thread 0
does the calculation, has thread 1 calculated
its value yet?
Data Dependencies (cont’d)
• Example – Fibonacci Sequence
0, 1, 1, 2, 3, 5, 8, 13, …
a(1) = 0
a(2) = 1
do i = 3, 100
a(i) = a(i-1) + a(i-2)
enddo
a[1] = 0;
a[2] = 1;
for(i = 3; i <= 100; i++){
a[i] = a[i-1] + a[i-2];
}
Data Dependencies (3)
• parallelize on 2 threads
– thread 0 gets i = 3 to 51
– thread 1 gets i = 52 to 100
– look carefully at calculation for i = 52 on
thread 1
• what will the values of a(i-1) and a(i-2) be?
Data Dependencies (4)
• A test for dependency:
– if serial loop is executed in reverse order, will
it give the same result?
– if so, it’s ok
– you can test this on your serial code
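• A small self-contained C sketch of this test, applied to the Fibonacci loop above (shortened to 50 terms so the values fit in 64 bits); the forward and reversed loops disagree, flagging the dependency:
#include <stdio.h>
int main(void){
  long long f[51] = {0}, r[51] = {0};
  int i;
  f[1] = 0; f[2] = 1;
  for(i = 3; i <= 50; i++)      /* forward: the original loop */
    f[i] = f[i-1] + f[i-2];
  r[1] = 0; r[2] = 1;
  for(i = 50; i >= 3; i--)      /* reversed: reads r[i-1], r[i-2] before they are set */
    r[i] = r[i-1] + r[i-2];
  printf("forward: %lld  reversed: %lld\n", f[50], r[50]);
  return 0;
}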
Data Dependencies (5)
• What about subprogram calls?
• Does the subprogram write x or y to memory?
– If so, they need to be private
• Variables local to subprogram are local to each
thread
• Be careful with global variables and common
blocks
do i = 1, 100
call mycalc(i,x,y)
enddo
for(i = 0; i < 100; i++){
mycalc(i,x,y);
}
A Few More Basics
More clauses
• Can make private default rather than
shared
– Fortran only
– handy if most of the variables are private
– can use continuation characters for long lines
ifirst = 10
!$omp parallel do &
!$omp default(private) &
!$omp shared(ifirst,imax,j)
do i = 1, imax
i2 = 2*i
j(i) = ifirst + i2
enddo
More clauses (cont’d)
• Can use default none
– declare all variables (except loop variables)
as shared or private
– If you don’t declare a variable, the compiler flags it, so you get a handy list of all variables in the loop
More clauses (3)
ifirst = 10
!$omp parallel do &
!$omp default(none) &
!$omp shared(ifirst,imax,j) &
!$omp private(i2)
do i = 1, imax
i2 = 2*i
j(i) = ifirst + i2
enddo
ifirst = 10;
#pragma omp parallel for \
default(none) \
shared(ifirst,imax,j) \
private(i2)
for(i = 0; i < imax; i++){
i2 = 2*i;
j[i] = ifirst + i2;
}
Firstprivate
• Suppose we need a running index total for
each index value on each thread
• if iper were declared private, the initial
value would not be carried into the loop
iper = 0
do i = 1, imax
iper = iper + 1
j(i) = iper
enddo
iper = 0;
for(i = 0; i < imax; i++){
iper = iper + 1;
j[i] = iper;
}
Firstprivate (cont’d)
• Solution – firstprivate clause
• Creates private memory location for each
thread
• Copies value from master thread (thread
0) to each memory location
iper = 0
!$omp parallel do &
!$omp firstprivate(iper)
do i = 1, imax
iper = iper + 1
j(i) = iper
enddo
iper = 0;
#pragma omp parallel for \
firstprivate(iper)
for(i = 0; i < imax; i++){
iper = iper + 1;
j[i] = iper;
}
Lastprivate
• saves value corresponding to the last loop
index
– "last" in the serial sense
!$omp parallel do lastprivate(i)
do i = 1, maxi-1
a(i) = b(i)
enddo
a(i) = b(1)
#pragma omp parallel for \
lastprivate(i)
for(i = 0; i < maxi-1; i++){
a[i] = b[i];
}
a[i] = b[0];
Reduction
• following example won’t parallelize
correctly with parallel do/parallel for
– different threads may try to write to sum1
simultaneously
sum1 = 0.0
do i = 1, maxi
sum1 = sum1 + a(i)
enddo
sum1 = 0.0;
for(i = 0; i < maxi; i++){
sum1 = sum1 + a[i];
}
Reduction (cont’d)
• Solution? – Reduction clause
• each thread performs its own reduction (sum, in
this case)
• results from all threads are automatically
reduced (summed) at the end of the loop
sum1 = 0.0
!$omp parallel do &
!$omp reduction(+:sum1)
do i = 1, maxi
sum1 = sum1 + a(i)
enddo
sum1 = 0;
#pragma omp parallel for \
reduction(+:sum1)
for(i = 0; i < maxi; i++){
sum1 = sum1 + a[i];
}
Reduction (3)
• Fortran operators/intrinsics: MAX, MIN,
IAND, IOR, IEOR,
+, *, -, .AND., .OR., .EQV., .NEQV.
• C operators: +, *, -, /, &, ^, |, &&, ||
• roundoff error may differ from the serial case
num_threads
• Number of threads for subsequent loop
can be specified with num_threads(n)
clause
– n is number of threads
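• For example, to run the earlier loop on 4 threads (variables as in the previous examples):
#pragma omp parallel for num_threads(4) private(i2)
for(i = 0; i < imax; i++){
  i2 = 2*i;
  j[i] = ifirst + i2;
}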
Conditional Compilation
• Fortran: if compiled without OpenMP, directives are
treated as comments
– Great for portability
• !$ (c$ for fixed format) can be used for conditional
compilation for any source lines
!$ print*, 'number of procs =', nprocs
• C or C++: conditional compilation can be performed with
the _OPENMP macro name.
#ifdef _OPENMP
… do stuff …
#endif
Basic OpenMP Functions
• omp_get_thread_num()
– returns ID of current thread
• omp_set_num_threads(nthreads)
– subroutine in Fortran
– sets number of threads in next parallel region to
nthreads
– alternative to OMP_NUM_THREADS environment
variable
– overrides OMP_NUM_THREADS
• omp_get_num_threads()
– returns number of threads in current parallel region
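• A small self-contained C sketch exercising all three functions (output wording illustrative):
#include <stdio.h>
#include <omp.h>
int main(void){
  int myid;
  omp_set_num_threads(4);             /* overrides OMP_NUM_THREADS */
  #pragma omp parallel private(myid)
  {
    myid = omp_get_thread_num();      /* ID of current thread */
    if(myid == 0)
      printf("running %d threads\n", omp_get_num_threads());
    printf("hello from thread %d\n", myid);
  }
  return 0;
}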
Caveats and Compilation
Some Hints
• OpenMP will do what you tell it to do
– If you try to parallelize a loop with a dependency, it will go
ahead and do it!
• Do not parallelize small loops
– Overhead will be greater than speedup
– How small is “small”?
• Answer depends on processor speed and other system-
dependent parameters
• Try it!
• Maximize number of operations performed in
parallel
– parallelize outer loops where possible
Some Hints (cont’d)
• Always profile your code before parallelizing!
– tells where most time is being used
• Portland Group profiler:
– compile with -Mprof=func flag
– run code
• file with profile data will appear in working directory
– run profiler:
• pgprof -exe executable-name
• GUI will pop up with profiling information
Compile and Run
• Portland Group compilers (katana):
– Compile with -mp flag
– Can use up to 8 threads
• AIX compilers (twister, etc.):
– Use an _r suffix on the compiler name, e.g., cc_r,
xlf90_r
– Compile with -qsmp=omp flag
• Intel compilers (skate, cootie):
– Compile with -openmp and -fpp flags
– Can only use 2 threads
Compile and Run (cont’d)
• GNU compilers (all machines):
– Compile with -fopenmp flag
• Blue Gene does not accommodate OpenMP
• environment variable OMP_NUM_THREADS
sets number of threads
setenv OMP_NUM_THREADS 4
• for C, include omp.h
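• A typical GNU compile-and-run sequence (file names illustrative):
gcc -fopenmp myprog.c -o myprog
setenv OMP_NUM_THREADS 4
./myprog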
Exercise
• Copy files from /scratch/sondak/gt
– both C and Fortran versions are available for your
programming pleasure
• compile code with pgcc or pgf90
– include profiling flag -Mprof=func
– call executable “gt”
• Run code batch
– qsub rungt
– qstat -u username
Exercise (cont’d)
• When run completes, look at profile
– pgprof -exe gt
– Where is most time being used?
– How can we tackle this with OpenMP?
• Recompile without profiler flag
• Re-run and note timing from batch output file
– this is the base timing
• We will decide on an OpenMP strategy
• Add OpenMP directive(s)
• Re-run on 1, 2, and 4 procs. and check new
timings
Parallel
• parallel and do/for can be separated into two directives; the combined form below is the same as the separated one that follows:
!$omp parallel do
do i = 1, maxi
a(i) = b(i)
enddo
#pragma omp parallel for
for(i=0; i<maxi; i++){
a[i] = b[i];
}
!$omp parallel
!$omp do
do i = 1, maxi
a(i) = b(i)
enddo
!$omp end parallel
#pragma omp parallel
{
#pragma omp for
for(i=0; i<maxi; i++){
a[i] = b[i];
}
}
Parallel (cont’d)
• Note that an end parallel directive is required in Fortran; in C the parallel region is enclosed in braces.
• Everything within the parallel region will be run
in parallel (surprise!).
• The do/for directive indicates that the loop
indices will be distributed among threads rather
than duplicating every index on every thread.
Parallel (3)
• Multiple loops in parallel region:
• parallel directive has a significant overhead associated
with it.
• The above example has the potential to be faster than
using two parallel do/parallel for directives.
!$omp parallel
!$omp do
do i = 1, maxi
a(i) = b(i)
enddo
!$omp do
do i = 1, maxi
c(i) = a(2)
enddo
!$omp end parallel
#pragma omp parallel
{
#pragma omp for
for(i=0; i<maxi; i++){
a[i] = b[i];
}
#pragma omp for
for(i=0; i<maxi; i++){
c[i] = a[2];
}
}
Coarse-Grained
• OpenMP is not restricted to loop-level (fine-
grained) parallelism.
• The !$omp parallel or #pragma omp parallel
directive duplicates subsequent code on all
threads until a !$omp end parallel or #pragma
omp end parallel directive is encountered.
• Allows parallelization similar to “MPI paradigm.”
Coarse-Grained (cont’d)
!$omp parallel &
!$omp private(myid,istart,iend,nthreads,nper)
nthreads = omp_get_num_threads()
nper = imax/nthreads
myid = omp_get_thread_num()
istart = myid*nper + 1
iend = istart + nper - 1
call do_work(istart,iend)
do i = istart, iend
a(i) = b(i)*c(i) + ...
enddo
!$omp end parallel
#pragma omp parallel \
private(myid,istart,iend,nthreads,nper)
{
nthreads = omp_get_num_threads();
nper = imax/nthreads;
myid = omp_get_thread_num();
istart = myid*nper;
iend = istart + nper - 1;
do_work(istart,iend);
for(i=istart; i<=iend; i++){
a[i] = b[i]*c[i] + ...
}
}
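• The split above assumes imax divides evenly among the threads; a self-contained C sketch that also handles the remainder (the first few threads each take one extra index):
#include <omp.h>
#define IMAX 1000
float a[IMAX], b[IMAX], c[IMAX];
void coarse_work(void){
  #pragma omp parallel
  {
    int nthreads = omp_get_num_threads();
    int myid = omp_get_thread_num();
    int nper = IMAX/nthreads;
    int nextra = IMAX%nthreads;
    /* threads 0..nextra-1 get nper+1 indices, the rest get nper */
    int istart = myid*nper + (myid < nextra ? myid : nextra);
    int iend = istart + nper + (myid < nextra ? 1 : 0) - 1;
    int i;
    for(i = istart; i <= iend; i++){
      a[i] = b[i]*c[i];
    }
  }
}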
Thread Control Directives
Barrier
• barrier synchronizes threads
• Here barrier assures that a(1) or a[0] is available before computing myval
!$omp parallel private(myid,istart,iend)
call myrange(myid,istart,iend)
do i = istart, iend
a(i) = a(i) - b(i)
enddo
!$omp barrier
myval(myid+1) = a(istart) + a(1)
#pragma omp parallel private(myid,istart,iend)
{
myrange(myid,istart,iend);
for(i=istart; i<=iend; i++){
a[i] = a[i] - b[i];
}
#pragma omp barrier
myval[myid] = a[istart] + a[0];
}
Master
• if you want part of code to be executed only on master thread, use master directive
• “non-master” threads will skip over master region and continue
Master Example - Fortran
!$OMP PARALLEL PRIVATE(myid,istart,iend)
call myrange(myid,istart,iend)
do i = istart, iend
a(i) = a(i) - b(i)
enddo
!$OMP BARRIER
!$OMP MASTER
write(21) a
!$OMP END MASTER
call do_work(istart,iend)
!$OMP END PARALLEL
Master Example - C
#pragma omp parallel private(myid,istart,iend)
{
myrange(myid,istart,iend);
for(i=istart; i<=iend; i++){
a[i] = a[i] - b[i];
}
#pragma omp barrier
#pragma omp master
fwrite(a,sizeof(float),iend-istart+1,fid);
do_work(istart,iend);
}
Single
• want part of code to be executed only by a single thread
• don’t care whether or not it’s the master thread
• use single directive
• Unlike the end master directive, end single implies a barrier.
Single Example - Fortran
!$OMP PARALLEL PRIVATE(myid,istart,iend)
call myrange(myid,istart,iend)
do i = istart, iend
a(i) = a(i) - b(i)
enddo
!$OMP BARRIER
!$OMP SINGLE
write(21) a
!$OMP END SINGLE
call do_work(istart,iend)
!$OMP END PARALLEL
Single Example - C
#pragma omp parallel private(myid,istart,iend)
{
myrange(myid,istart,iend);
for(i=istart; i<=iend; i++){
a[i] = a[i] - b[i];
}
#pragma omp barrier
#pragma omp single
fwrite(a,sizeof(float),nvals,fid);
do_work(istart,iend);
}
Critical
• have section of code that:
1. must be executed by every thread
2. threads may execute in any order
3. threads must not execute simultaneously
• use critical directive
• no implied barrier
Critical Example - Fortran
!$OMP PARALLEL PRIVATE(myid,istart,iend)
call myrange(myid,istart,iend)
do i = istart, iend
a(i) = a(i) - b(i)
enddo
!$OMP CRITICAL
call mycrit(myid,a)
!$OMP END CRITICAL
call do_work(istart,iend)
!$OMP END PARALLEL
Critical Example - C
#pragma omp parallel private(myid,istart,iend)
{
myrange(myid,istart,iend);
for(i=istart; i<=iend; i++){
a[i] = a[i] - b[i];
}
#pragma omp critical
mycrit(myid,a);
do_work(istart,iend);
}
Ordered
• Suppose you want to write values in a loop:
• If loop were parallelized, could write out of order
• ordered directive forces serial order
do i = 1, nproc
call do_lots_of_work(result(i))
write(21,101) i, result(i)
enddo
for(i = 0; i < nproc; i++){
do_lots_of_work(result[i]);
fprintf(fid, "%d %f\n", i, result[i]);
}
Ordered (cont’d)
!$omp parallel do ordered
do i = 1, nproc
call do_lots_of_work(result(i))
!$omp ordered
write(21,101) i, result(i)
!$omp end ordered
enddo
#pragma omp parallel for ordered
for(i = 0; i < nproc; i++){
do_lots_of_work(result[i]);
#pragma omp ordered
fprintf(fid, "%d %f\n", i, result[i]);
}
Ordered (3)
• since do_lots_of_work takes a lot of time,
most parallel benefit will be realized
• ordered is helpful for debugging
– Is result same with and without ordered
directive?
Sections
• In parallel region, have several independent tasks
• Divide code into sections
• Each section is executed on a different thread.
• Implied barrier
• Fixes number of threads
• Note that the sections directive delineates the region of code containing the individual section directives.
Sections Example - Fortran
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
call init_field(a)
!$OMP SECTION
call check_grid(x)
!$OMP END SECTIONS
!$OMP END PARALLEL
Sections Example - C
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
init_field(a);
#pragma omp section
check_grid(x);
}
}
Parallel Sections
• Analogous to do/parallel do and for/parallel for, there is a parallel sections directive that combines parallel and sections.
Parallel Sections Example - Fortran
!$OMP PARALLEL SECTIONS
!$OMP SECTION
call init_field(a)
!$OMP SECTION
call check_grid(x)
!$OMP END PARALLEL SECTIONS
Parallel Sections Example - C
#pragma omp parallel sections
{
#pragma omp section
init_field(a);
#pragma omp section
check_grid(x);
}
Task
• Parallel sections are established upon compilation
– Number of threads is fixed
• Sometimes more flexibility is needed, such as parallelism within an if or while block
• Task directive will assign tasks to threads as needed
Task Example - Fortran
!$OMP PARALLEL
!$OMP SINGLE
p = listhead
do while(p)
!$OMP TASK
call process(p)
!$OMP END TASK
p = next(p)
enddo
!$OMP END SINGLE
!$OMP END PARALLEL
Task Example - C
#pragma omp parallel
{
#pragma omp single
{
p = listhead;
while(p) {
#pragma omp task
process(p);
p = next(p);
}
}
}
Threadprivate
• Only needed for Fortran
– Usually legacy code
• common block variables are inherently shared, may want them to be private
• threadprivate gives each thread its own private copy of variables in specified common block.
– Keep memory use in mind
• Place threadprivate after common block declaration.
Threadprivate (cont’d)
common /work1/ work(1000)
!$omp threadprivate(/work1/)
Copyin
• threadprivate renders all values in common block private
• may want to use some current values of variables on all threads
• copyin initializes each thread’s copy of specified common-block variables with values from the master thread
Copyin Example - Fortran
integer, parameter :: nvals = 501
real, dimension(nvals) :: x, y, xmet, ymet
common /work1/ work(nvals), istart, iend
!$OMP THREADPRIVATE(/work1/)
istart = 2; iend = 500
call read_coords(x,y)
!$OMP PARALLEL SECTIONS &
!$OMP COPYIN(istart,iend)
!$OMP SECTION
call metric(x)
xmet = work
!$OMP SECTION
call metric(y)
ymet = work
!$OMP END PARALLEL SECTIONS
subroutine metric(c)
real, dimension(:) :: c
common /work1/ work(nvals), istart, iend
!$OMP THREADPRIVATE(/work1/)
do i = istart, iend
work(i) = 0.5*(c(i+1)-c(i-1))
enddo
end subroutine metric
Atomic
• Similar to critical
– Allows greater optimization than critical
• Applies to single line following the atomic directive
Atomic Example - Fortran
do i = 1, 10
!$OMP ATOMIC
x(j(i)) = x(j(i)) + 1.0
enddo
Atomic Example - C
for(i = 0; i<10; i++){
#pragma omp atomic
x[j[i]] = x[j[i]] + 1.0;
}
Flush
• Different threads can have different values for the same shared variable
– example – value in register
• flush assures that the calling thread has a consistent view of memory
• upcoming example is from OpenMP spec
Flush Example - Fortran
integer isync(num_threads)
!$omp parallel default(private) shared(isync)
iam = omp_get_thread_num()
isync(iam) = 0
!$omp barrier
call work()
isync(iam) = 1
! let other threads know I finished work
!$omp flush(isync)
do while(isync(neigh) == 0)
!$omp flush(isync)
enddo
!$omp end parallel
Flush Example - C
int isync[num_threads];
#pragma omp parallel private(iam,neigh) shared(isync)
{
iam = omp_get_thread_num();
isync[iam] = 0;
#pragma omp barrier
work();
isync[iam] = 1;
/* let other threads know I finished work */
#pragma omp flush(isync)
while(isync[neigh] == 0){
#pragma omp flush(isync)
}
}
Some Additional Functions
Dynamic Thread Assignment
• On some systems the number of threads available to you may be adjusted dynamically.
• The number of threads requested can be construed as the maximum number of threads available.
• Dynamic threading can be turned on or off:
logical :: mydyn = .false.
call omp_set_dynamic(mydyn)
int mydyn = 0;
omp_set_dynamic(mydyn);
Dynamic Thread Assignment (cont’d)
• Dynamic threading can also be turned on or off by setting the environment variable OMP_DYNAMIC to true or false
– call to omp_set_dynamic overrides the environment variable
• The function omp_get_dynamic() returns a value of “true” or “false,” indicating whether dynamic threading is turned on or off.
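• A small C sketch (printout wording illustrative): turn dynamic threading off, verify with omp_get_dynamic(), then request a fixed thread count:
#include <stdio.h>
#include <omp.h>
int main(void){
  omp_set_dynamic(0);   /* ask for exactly the number of threads requested */
  printf("dynamic threading is %s\n", omp_get_dynamic() ? "on" : "off");
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    #pragma omp master
    printf("running %d threads\n", omp_get_num_threads());
  }
  return 0;
}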
OMP_GET_MAX_THREADS
• integer function
• returns maximum number of threads available in current parallel region
• same result as omp_get_num_threads if dynamic threading is turned off
OMP_GET_NUM_PROCS
• integer function
• returns the number of processors in the system
– indicates amount of hardware, not number of available processors
• could be used to make sure enough processors are available for specified number of sections
OMP_IN_PARALLEL
• logical function
• tells whether or not function was called from a parallel region
• useful debugging device when using “orphaned” directives
– example of orphaned directive: omp parallel in “main,” omp do or omp for in routine called from main
Some Additional Clauses
Collapse
• collapse(n) allows parallelism over n
contiguously nested loops
Collapse Example - Fortran
!$omp parallel do collapse(2)
do j = 1, nval
do i = 1, nval
ival(i,j) = i + j
enddo
enddo
Collapse Example - C
#pragma omp parallel for collapse(2)
for(j=0; j<nval; j++){
for(i=0; i<nval; i++){
ival[i][j] = i + j;
}
}
Schedule
• schedule refers to the way in which loop
indices are distributed among threads
• default is static, which we saw in an earlier example
– each thread is assigned a contiguous range of
indices in order of thread number
• called round robin
– number of indices assigned to each thread is
as equal as possible
Schedule (cont’d)
• static example
– loop runs from 1 to 51, 4 threads
thread   indices   no. indices
0        1-13      13
1        14-26     13
2        27-39     13
3        40-51     12
Schedule (3)
• number of indices doled out at a time to each
thread is called the chunk size
• can be modified with the SCHEDULE clause
!$omp do schedule(static,5)
#pragma omp for schedule(static,5)
Schedule (4)
• example – same as previous example (loop runs
from 1-51, 4 threads), but set chunk size to 5
thread   chunk 1 indices   chunk 2 indices   chunk 3 indices   no. indices
0        1-5               21-25             41-45             15
1        6-10              26-30             46-50             15
2        11-15             31-35             51                11
3        16-20             36-40             -                 10
Dynamic Schedule
• schedule(dynamic) clause assigns chunks to
threads dynamically as the threads become
available for more work
• default chunk size is 1
• higher overhead than STATIC
!$omp do schedule(dynamic,5)
#pragma omp for schedule(dynamic,5)
Guided Schedule
• schedule(guided) clause assigns chunks
automatically, exponentially decreasing chunk
size with each assignment
• specified chunk size is the minimum chunk size
except for the last chunk, which can be of any
size
• default chunk size is 1
Runtime Schedule
• schedule can be specified through the OMP_SCHEDULE environment variable
setenv OMP_SCHEDULE "dynamic,5"
• the schedule(runtime) clause tells the runtime to take the schedule from the environment variable
!$omp do schedule(runtime)
#pragma omp for schedule(runtime)
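• As a sketch (re-using do_lots_of_work from the ordered example): when iterations do unequal amounts of work, schedule(runtime) lets you try static, dynamic, and guided through OMP_SCHEDULE without recompiling:
/* iteration cost is assumed to vary with i, so the best schedule
   is not obvious in advance */
#pragma omp parallel for schedule(runtime)
for(i = 0; i < nproc; i++){
  do_lots_of_work(result[i]);
}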
Nested Parallelism
• parallel construct within parallel construct:
!$omp parallel do
do j = 1, jmax
!$omp parallel do
do i = 1, imax
call do_work(i,j)
enddo
enddo
#pragma omp parallel for
for(j=0; j<jmax; j++){
#pragma omp parallel for
for(i=0; i<imax; i++){
do_work(i,j);
}
}
Nested Parallelism (cont’d)
• OpenMP standard allows but does not require nested parallelism
• if nested parallelism is not implemented, previous examples are legal, but the inner loop will be serial
• logical function omp_get_nested() returns .true. or .false. (1 or 0) to indicate whether or not nested parallelism is enabled in the current region
Nested Parallelism (3)
• omp_set_nested(nest) enables/disables nested parallelism if argument is true/false
– subroutine in Fortran (.true./.false.)
– function in C (1/0)
• nested parallelism can also be turned on or off by setting the environment variable OMP_NESTED to true or false
– calls to omp_set_nested override the environment variable
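• A minimal C sketch, assuming the implementation supports nesting: enable nested parallelism, then spawn 2 outer and 2 inner threads:
#include <stdio.h>
#include <omp.h>
int main(void){
  omp_set_nested(1);   /* request nested parallelism; may be ignored */
  #pragma omp parallel num_threads(2)
  {
    int outer = omp_get_thread_num();
    #pragma omp parallel num_threads(2)
    {
      /* prints 4 lines if nesting is enabled, 2 if not */
      printf("outer %d, inner %d\n", outer, omp_get_thread_num());
    }
  }
  return 0;
}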
Locks
• Locks offer additional control of threads
• a thread can take/release ownership of a lock
over a specified region of code
• when a lock is owned by a thread, other
threads cannot execute the locked region
• can be used to perform useful work by one or
more threads while another thread is engaged
in a serial task
Lock Routines
• Lock routines each have a single argument
– argument type is an integer (Fortran) or a pointer to
an integer (C) long enough to hold an address
– OpenMP pre-defines required types
integer(omp_lock_kind) :: mylock
omp_lock_t mylock;
Lock Routines (cont’d)
• omp_init_lock(mylock)
– initializes mylock
– must be called before mylock is used
– subroutine in Fortran
• omp_set_lock(mylock)
– gives ownership of mylock to calling thread
– other threads cannot execute code following call
until mylock is released
– subroutine in Fortran
Lock Routines (3)
• omp_test_lock(mylock)
– logical function
– if mylock is presently owned by another
thread, returns false
– if mylock is available, returns true and sets
lock, i.e., gives ownership to calling
thread
– be careful: omp_test_lock(mylock) does
more than just test the lock
Lock Routines (4)
• omp_unset_lock(mylock)
– releases ownership of mylock
– subroutine in Fortran
• omp_destroy_lock(mylock)
– call when you’re done with the lock
– complement to omp_init_lock(mylock)
– subroutine in Fortran
Lock Example
• an independent task takes a significant
amount of time, and it must be performed
serially
• another task, independent of the first,
may be performed in parallel
• improve parallel efficiency by doing both
tasks at the same time
• one thread locks serial task
• other threads work on parallel task until
serial task is complete
Lock Example - Fortran
call omp_init_lock(mylock)
!$omp parallel
if(omp_test_lock(mylock)) then
call long_serial_task
call omp_unset_lock(mylock)
else
do while (.not. omp_test_lock(mylock))
call short_parallel_task
enddo
call omp_unset_lock(mylock)
endif
!$omp end parallel
call omp_destroy_lock(mylock)
Lock Example - C
omp_init_lock(&mylock);
#pragma omp parallel
{
if(omp_test_lock(&mylock)){
long_serial_task();
omp_unset_lock(&mylock);
}else{
while(! omp_test_lock(&mylock)){
short_parallel_task();
}
omp_unset_lock(&mylock);
}
}
omp_destroy_lock(&mylock);
Survey
• Please fill out evaluation survey for this tutorial at http://scv.bu.edu/survey/tutorial_evaluation.html
