Parallel Processing
with OpenMP
Doug Sondak
Boston University
Scientific Computing and Visualization
Office of Information Technology
sondak@bu.edu
Outline
• Introduction
• Basics
• Data Dependencies
• A Few More Basics
• Caveats & Compilation
• Coarse-Grained Parallelization
• Thread Control Directives
• Some Additional Functions
• Some Additional Clauses
• Nested Parallelism
• Locks
Introduction
Introduction
• Types of parallel machines
– distributed memory
• each processor has its own memory address space
• variable values are independent
x = 2 on one processor, x = 3 on a different processor
• examples: linux clusters, Blue Gene/L
– shared memory
• also called Symmetric Multiprocessing (SMP)
• single address space for all processors
– If one processor sets x = 2 , x will also equal 2 on other
processors (unless specified otherwise)
• examples: IBM p-series, SGI Altix
Shared vs. Distributed Memory
[Diagram: shared memory – four CPUs (CPU 0-3) all attached to a single MEM; distributed memory – each of the four CPUs has its own memory (MEM 0-3)]
Shared vs. Distributed Memory (cont’d)
• Multiple processes
– Each processor (typically) performs
independent task
• Multiple threads
– A process spawns additional tasks (threads)
with same memory address space
What is OpenMP?
• Application Programming Interface (API)
for multi-threaded parallelization consisting
of
– Source code directives
– Functions
– Environment variables
What is OpenMP? (cont’d)
• Advantages
– Easy to use
– Incremental parallelization
– Flexible
• Loop-level or coarse-grain
– Portable
• Since there’s a standard, will work on any SMP machine
• Disadvantage
– Shared-memory systems only
Basics
Basics
• Goal – distribute work among threads
• Two methods will be discussed here
– Loop-level
• Specified loops are parallelized
• This is the approach taken by automatic parallelization tools
– Parallel regions
• Sometimes called “coarse-grained”
• Don’t know good term; good way to start argument with
semantically precise people
• Usually used in message-passing (MPI)
Basics (cont’d)
[Diagram: loop-level – serial code in which individual loops are executed in parallel; parallel regions – whole blocks of code between serial sections are executed in parallel]
parallel do & parallel for
• parallel do (Fortran) and parallel for (C)
directives parallelize subsequent loop
#pragma omp parallel for
for(i = 1; i <= maxi; i++){
a[i] = b[i] + c[i];
}
!$omp parallel do
do i = 1, maxi
a(i) = b(i) + c(i)
enddo
Use "c$omp" (instead of "!$omp") for fixed-format Fortran
parallel do & parallel for (cont’d)
• Suppose maxi = 1000
Thread 0 gets i = 1 to 250
Thread 1 gets i = 251 to 500
etc.
• Barrier implied at end of loop
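Putting the pieces together, here is a minimal compilable C sketch of the loop above (array contents and the printf are just illustrative); compile with -mp (PGI) or -fopenmp (GNU):

#include <stdio.h>

#define MAXI 1000

int main(void) {
    double a[MAXI], b[MAXI], c[MAXI];
    int i;

    for (i = 0; i < MAXI; i++) {   /* serial setup */
        b[i] = i;
        c[i] = 2.0 * i;
    }

    /* iterations are divided among the threads; an implicit
       barrier at the end of the loop synchronizes them */
    #pragma omp parallel for
    for (i = 0; i < MAXI; i++) {
        a[i] = b[i] + c[i];
    }

    printf("a[%d] = %f\n", MAXI - 1, a[MAXI - 1]);
    return 0;
}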
workshare
• For Fortran 90/95 array syntax, the parallel
workshare directive is analogous to
parallel do
• Previous example would be:
• Also works for forall and where statements
!$omp parallel workshare
a = b + c
!$omp end parallel workshare
Shared vs. Private
• In parallel region, default behavior is that
all variables are shared except loop index
– All threads read and write the same memory
location for each variable
– This is ok if threads are accessing different
elements of an array
– Problem if threads write same scalar or array
element
– Loop index is private, so each thread has its
own copy
Shared vs. Private (cont’d)
• Here’s an example where a shared
variable, i2, could cause a problem
ifirst = 10
do i = 1, imax
i2 = 2*i
j(i) = ifirst + i2
enddo
ifirst = 10;
for(i = 1; i <= imax; i++){
i2 = 2*i;
j[i] = ifirst + i2;
}
Shared vs. Private (3)
• OpenMP clauses modify the behavior of
directives
• Private clause creates separate memory
location for specified variable for each
thread
Shared vs. Private (4)
ifirst = 10
!$omp parallel do private(i2)
do i = 1, imax
i2 = 2*i
j(i) = ifirst + i2
enddo
ifirst = 10;
#pragma omp parallel for private(i2)
for(i = 1; i <= imax; i++){
i2 = 2*i;
j[i] = ifirst + i2;
}
Shared vs. Private (5)
• Look at memory just before and after
parallel do/parallel for statement
– Let’s look at two threads
before (one thread): ifirst = 10; i2 (thread 0) = ???
(spawn additional thread)
after: ifirst = 10; i2 (thread 0) = ???; i2 (thread 1) = ???
Shared vs. Private (6)
• Both threads access the same shared
value of ifirst
• They have their own private copies of i2
Thread 0 and Thread 1 both see ifirst = 10 (the same shared value); i2 (thread 0) = ??? and i2 (thread 1) = ??? are separate private copies
Data Dependencies
Data Dependencies
• Data on one thread can be dependent on
data on another thread
• This can result in wrong answers
– thread 0 may require a variable that is
calculated on thread 1
– answer depends on timing – When thread 0
does the calculation, has thread 1 calculated
its value yet?
Data Dependencies (cont’d)
• Example – Fibonacci Sequence
0, 1, 1, 2, 3, 5, 8, 13, …
a(1) = 0
a(2) = 1
do i = 3, 100
a(i) = a(i-1) + a(i-2)
enddo
a[1] = 0;
a[2] = 1;
for(i = 3; i <= 100; i++){
a[i] = a[i-1] + a[i-2];
}
Data Dependencies (3)
• parallelize on 2 threads
– thread 0 gets i = 3 to 51
– thread 1 gets i = 52 to 100
– look carefully at calculation for i = 52 on
thread 1
• what will the values at i-1 and i-2 be?
Data Dependencies (4)
• A test for dependency:
– if serial loop is executed in reverse order, will
it give the same result?
– if so, it’s ok
– you can test this on your serial code
Data Dependencies (5)
• What about subprogram calls?
• Does the subprogram write x or y to memory?
– If so, they need to be private
• Variables local to subprogram are local to each
thread
• Be careful with global variables and common
blocks
do i = 1, 100
call mycalc(i,x,y)
enddo
for(i = 0; i < 100; i++){
mycalc(i,x,y);
}
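If the subprogram does write through its arguments, a private clause makes the loop safe. A small C sketch (the mycalc shown here is only a stand-in; the real routine's interface may differ):

#include <stdio.h>

/* stand-in for mycalc: writes results back through x and y */
void mycalc(int i, double *x, double *y) {
    *x = 2.0 * i;
    *y = (double)i * i;
}

int main(void) {
    double x, y, results[100];
    int i;

    /* x and y are written on every iteration, so each thread
       needs its own private copies */
    #pragma omp parallel for private(x, y)
    for (i = 0; i < 100; i++) {
        mycalc(i, &x, &y);
        results[i] = x + y;
    }

    printf("results[99] = %f\n", results[99]);
    return 0;
}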
A Few More Basics
More clauses
• Can make private default rather than
shared
– Fortran only
– handy if most of the variables are private
– can use continuation characters for long lines
ifirst = 10
!$omp parallel do &
!$omp default(private) &
!$omp shared(ifirst,imax,j)
do i = 1, imax
i2 = 2*i
j(i) = ifirst + i2
enddo
More clauses (cont’d)
• Can use default none
– declare all variables (except loop variables)
as shared or private
– If you don’t declare any variables, you get a
handy list of all variables in loop
More clauses (3)
ifirst = 10
!$omp parallel do &
!$omp default(none) &
!$omp shared(ifirst,imax,j) &
!$omp private(i2)
do i = 1, imax
i2 = 2*i
j(i) = ifirst + i2
enddo
ifirst = 10;
#pragma omp parallel for \
default(none) \
shared(ifirst,imax,j) \
private(i2)
for(i = 0; i < imax; i++){
i2 = 2*i;
j[i] = ifirst + i2;
}
Firstprivate
• Suppose we need a running index total for
each index value on each thread
• if iper were declared private, the initial
value would not be carried into the loop
iper = 0
do i = 1, imax
iper = iper + 1
j(i) = iper
enddo
iper = 0;
for(i = 0; i < imax; i++){
iper = iper + 1;
j[i] = iper;
}
Firstprivate (cont’d)
• Solution – firstprivate clause
• Creates private memory location for each
thread
• Copies value from master thread (thread
0) to each memory location
iper = 0
!$omp parallel do &
!$omp firstprivate(iper)
do i = 1, imax
iper = iper + 1
j(i) = iper
enddo
iper = 0;
#pragma omp parallel for firstprivate(iper)
for(i = 0; i < imax; i++){
iper = iper + 1;
j[i] = iper;
}
Lastprivate
• saves value corresponding to the last loop
index
– "last" in the serial sense
!$omp parallel do lastprivate(i)
do i = 1, maxi-1
a(i) = b(i)
enddo
a(i) = b(1)
#pragma omp parallel for lastprivate(i)
for(i = 0; i < maxi-1; i++){
a[i] = b[i];
}
a[i] = b[1];
Reduction
• following example won’t parallelize
correctly with parallel do/parallel for
– different threads may try to write to sum1
simultaneously
sum1 = 0.0
do i = 1, maxi
sum1 = sum1 + a(i)
enddo
sum1 = 0.0;
for(i = 0; i < maxi; i++){
sum1 = sum1 + a[i];
}
Reduction (cont’d)
• Solution? – Reduction clause
• each thread performs its own reduction (sum, in
this case)
• results from all threads are automatically
reduced (summed) at the end of the loop
sum1 = 0.0
!$omp parallel do &
!$omp reduction(+:sum1)
do i = 1, maxi
sum1 = sum1 + a(i)
enddo
sum1 = 0;
#pragma omp parallel for reduction(+:sum1)
for(i = 0; i < maxi; i++){
sum1 = sum1 + a[i];
}
Reduction (3)
• Fortran operators/intrinsics: MAX, MIN,
IAND, IOR, IEOR,
+, *, -, .AND., .OR., .EQV., .NEQV.
• C operators: +, *, -, &, ^, |, &&, ||
• roundoff error may be different from the serial
case
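A short C sketch combining two reductions on one loop (the data are made up); because of the roundoff caveat above, the parallel sum may differ slightly from the serial sum:

#include <stdio.h>

int main(void) {
    double a[1000], sum1 = 0.0;
    int i, all_positive = 1;

    for (i = 0; i < 1000; i++)
        a[i] = 0.001 * (i + 1);

    /* each thread keeps private partial results, which are combined
       (summed / ANDed) automatically at the end of the loop */
    #pragma omp parallel for reduction(+:sum1) reduction(&&:all_positive)
    for (i = 0; i < 1000; i++) {
        sum1 = sum1 + a[i];
        all_positive = all_positive && (a[i] > 0.0);
    }

    printf("sum = %f  all positive = %d\n", sum1, all_positive);
    return 0;
}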
num_threads
• Number of threads for subsequent loop
can be specified with num_threads(n)
clause
– n is number of threads
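For example (a sketch; the work inside the loop is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double a[100];
    int i, nt = 0;

    /* this loop runs on 4 threads regardless of OMP_NUM_THREADS */
    #pragma omp parallel for num_threads(4)
    for (i = 0; i < 100; i++) {
        a[i] = 2.0 * i;
        if (i == 0) nt = omp_get_num_threads();
    }

    printf("loop used %d threads, a[99] = %f\n", nt, a[99]);
    return 0;
}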
Conditional Compilation
• Fortran: if compiled without OpenMP, directives are
treated as comments
– Great for portability
• !$ (c$ for fixed format) can be used for conditional
compilation for any source lines
!$ print*, 'number of procs = ', nprocs
• C or C++: conditional compilation can be performed with
the _OPENMP macro name.
#ifdef _OPENMP
… do stuff …
#endif
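A common idiom (a sketch) is to guard the omp.h include and the API calls the same way, so one source file builds both serially and with OpenMP:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
#ifdef _OPENMP
    printf("compiled with OpenMP, max threads = %d\n",
           omp_get_max_threads());
#else
    printf("compiled without OpenMP (serial build)\n");
#endif
    return 0;
}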
Basic OpenMP Functions
• omp_get_thread_num()
– returns ID of current thread
• omp_set_num_threads(nthreads)
– subroutine in Fortran
– sets number of threads in next parallel region to
nthreads
– alternative to OMP_NUM_THREADS environment
variable
– overrides OMP_NUM_THREADS
• omp_get_num_threads()
– returns number of threads in current parallel region
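A minimal sketch putting these three functions together:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);               /* overrides OMP_NUM_THREADS */

    #pragma omp parallel
    {
        int myid = omp_get_thread_num();       /* ID of this thread      */
        int nthreads = omp_get_num_threads();  /* threads in this region */
        printf("thread %d of %d\n", myid, nthreads);
    }
    return 0;
}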
Caveats and Compilation
Some Hints
• OpenMP will do what you tell it to do
– If you try to parallelize a loop with a dependency, it will go
ahead and do it!
• Do not parallelize small loops
– Overhead will be greater than speedup
– How small is “small”?
• Answer depends on processor speed and other system-
dependent parameters
• Try it!
• Maximize number of operations performed in
parallel
– parallelize outer loops where possible
Some Hints (cont’d)
• Always profile your code before parallelizing!
– tells where most time is being used
• Portland Group profiler:
– compile with the -Mprof=func flag
– run code
• file with profile data will appear in working directory
– run profiler:
• pgprof -exe executable-name
• GUI will pop up with profiling information
Compile and Run
• Portland Group compilers (katana):
– Compile with -mp flag
– Can use up to 8 threads
• AIX compilers (twister, etc.):
– Use an _r suffix on the compiler name, e.g., cc_r,
xlf90_r
– Compile with -qsmp=omp flag
• Intel compilers (skate, cootie):
– Compile with the -openmp and -fpp flags
– Can only use 2 threads
Compile and Run (cont’d)
• GNU compilers (all machines):
– Compile with -fopenmp flag
• Blue Gene does not accommodate OpenMP
• environment variable OMP_NUM_THREADS
sets number of threads
setenv OMP_NUM_THREADS 4
• for C, include omp.h
Exercise
• Copy files from /scratch/sondak/gt
– both C and Fortran versions are available for your
programming pleasure
• compile code with pgcc or pgf90
– include the profiling flag -Mprof=func
– call executable “gt”
• Run code batch
– qsub rungt
– qstat -u username
Exercise (cont’d)
• When run completes, look at profile
– pgprof -exe gt
– Where is most time being used?
– How can we tackle this with OpenMP?
• Recompile without profiler flag
• Re-run and note timing from batch output file
– this is the base timing
• We will decide on an OpenMP strategy
• Add OpenMP directive(s)
• Re-run on 1, 2, and 4 procs. and check new
timings
Parallel
• parallel and do/for can be separated into two
directives; the two forms below are equivalent
!$omp parallel do
do i = 1, maxi
a(i) = b(i)
enddo
#pragma omp parallel for
for(i=0; i<maxi; i++){
a[i] = b[i];
}
!$omp parallel
!$omp do
do i = 1, maxi
a(i) = b(i)
enddo
!$omp end parallel
#pragma omp parallel
{
#pragma omp for
for(i=0; i<maxi; i++){
a[i] = b[i];
}
}
Parallel (cont’d)
• Note that in Fortran an end parallel directive is required; in C the parallel region is the brace-delimited block that follows the directive.
• Everything within the parallel region will be run
in parallel (surprise!).
• The do/for directive indicates that the loop
indices will be distributed among threads rather
than duplicating every index on every thread.
Parallel (3)
• Multiple loops in parallel region:
• parallel directive has a significant overhead associated
with it.
• The above example has the potential to be faster than
using two parallel do/parallel for directives.
!$omp parallel
!$omp do
do i = 1, maxi
a(i) = b(i)
enddo
!$omp do
do i = 1, maxi
c(i) = a(2)
enddo
!$omp end parallel
#pragma omp parallel
{
#pragma omp for
for(i=0; i<maxi; i++){
a[i] = b[i];
}
#pragma omp for
for(i=0; i<maxi; i++){
c[i] = a[2];
}
}
Coarse-Grained
• OpenMP is not restricted to loop-level (fine-
grained) parallelism.
• The !$omp parallel or #pragma omp parallel
directive duplicates subsequent code on all
threads until a !$omp end parallel or #pragma
omp end parallel directive is encountered.
• Allows parallelization similar to “MPI paradigm.”
Coarse-Grained (cont’d)
!$omp parallel &
!$omp private(myid,istart,iend,nthreads,nper)
nthreads = omp_get_num_threads()
nper = imax/nthreads
myid = omp_get_thread_num()
istart = myid*nper + 1
iend = istart + nper - 1
call do_work(istart,iend)
do i = istart, iend
a(i) = b(i)*c(i) + ...
enddo
!$omp end parallel
#pragma omp parallel \
private(myid,istart,iend,nthreads,nper)
{
nthreads = omp_get_num_threads();
nper = imax/nthreads;
myid = omp_get_thread_num();
istart = myid*nper;
iend = istart + nper - 1;
do_work(istart,iend);
for(i=istart; i<=iend; i++){
a[i] = b[i]*c[i] + ...
}
}
Thread Control Directives
• barrier synchronizes threads
• Here barrier assures that a(1) or a[0] is available before
computing myval
Barrier
!$omp parallel private(myid,istart,iend)
call myrange(myid,istart,iend)
do i = istart, iend
a(i) = a(i) - b(i)
enddo
!$omp barrier
myval(myid+1) = a(istart) + a(1)
!$omp end parallel
#pragma omp parallel private(myid,istart,iend)
{
myrange(myid,istart,iend);
for(i=istart; i<=iend; i++){
a[i] = a[i] - b[i];
}
#pragma omp barrier
myval[myid] = a[istart] + a[0];
}
• if you want part of code to be executed
only on master thread, use master
directive
• “non-master” threads will skip over master
region and continue
Master
Master Example - Fortran
!$OMP PARALLEL PRIVATE(myid,istart,iend)
call myrange(myid,istart,iend)
do i = istart, iend
a(i) = a(i) - b(i)
enddo
!$OMP BARRIER
!$OMP MASTER
write(21) a
!$OMP END MASTER
call do_work(istart,iend)
!$OMP END PARALLEL
Master Example - C
#pragma omp parallel private(myid,istart,iend)
{
myrange(myid,istart,iend);
for(i=istart; i<=iend; i++){
a[i] = a[i] - b[i];
}
#pragma omp barrier
#pragma omp master
fwrite(a,sizeof(float),iend-istart+1,fid);
do_work(istart,iend);
}
• want part of code to be executed only by a
single thread
• don’t care whether or not it’s the master
thread
• use single directive
• Unlike the end master directive, end single
implies a barrier.
Single
Single Example - Fortran
!$OMP PARALLEL PRIVATE(myid,istart,iend)
call myrange(myid,istart,iend)
do i = istart, iend
a(i) = a(i) - b(i)
enddo
!$OMP BARRIER
!$OMP SINGLE
write(21) a
!$OMP END SINGLE
call do_work(istart,iend)
!$OMP END PARALLEL
Single Example - C
#pragma omp parallel private(myid,istart,iend)
{
myrange(myid,istart,iend);
for(i=istart; i<=iend; i++){
a[i] = a[i] - b[i];
}
#pragma omp barrier
#pragma omp single
fwrite(a,sizeof(float),nvals,fid);
do_work(istart,iend);
}
• have section of code that:
1. must be executed by every thread
2. threads may execute in any order
3. threads must not execute simultaneously
• use critical directive
• no implied barrier
Critical
Critical Example - Fortran
!$OMP PARALLEL PRIVATE(myid,istart,iend)
call myrange(myid,istart,iend)
do i = istart, iend
a(i) = a(i) - b(i)
enddo
!$OMP CRITICAL
call mycrit(myid,a)
!$OMP END CRITICAL
call do_work(istart,iend)
!$OMP END PARALLEL
Critical Example - C
#pragma omp parallel private(myid,istart,iend)
{
myrange(myid,istart,iend);
for(i=istart; i<=iend; i++){
a[i] = a[i] - b[i];
}
#pragma omp critical
mycrit(myid,a);
do_work(istart,iend);
}
Ordered
• Suppose you want to write values in a loop:
• If loop were parallelized, could write out of order
• ordered directive forces serial order
do i = 1, nproc
call do_lots_of_work(result(i))
write(21,101) i, result(i)
enddo
for(i = 0; i < nproc; i++){
do_lots_of_work(result[i]);
fprintf(fid, "%d %f\n", i, result[i]);
}
Ordered (cont’d)
!$omp parallel do ordered
do i = 1, nproc
call do_lots_of_work(result(i))
!$omp ordered
write(21,101) i, result(i)
!$omp end ordered
enddo
#pragma omp parallel for ordered
for(i = 0; i < nproc; i++){
do_lots_of_work(result[i]);
#pragma omp ordered
fprintf(fid, "%d %f\n", i, result[i]);
}
Ordered (3)
• since do_lots_of_work takes a lot of time,
most parallel benefit will be realized
• ordered is helpful for debugging
– Is result same with and without ordered
directive?
• In parallel region, have several independent
tasks
• Divide code into sections
• Each section is executed on a different thread.
• Implied barrier
• Fixes number of threads
• Note that the sections directive delineates the region
of code that contains the individual section directives.
Sections
Sections Example - Fortran
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
call init_field(a)
!$OMP SECTION
call check_grid(x)
!$OMP END SECTIONS
!$OMP END PARALLEL
Sections Example - C
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
init_field(a);
#pragma omp section
check_grid(x);
}
}
• Just as do and for can be combined with parallel into
parallel do and parallel for, sections can be combined
into a parallel sections directive.
Parallel Sections
Parallel Sections Example - Fortran
!$OMP PARALLEL SECTIONS
!$OMP SECTION
call init_field(a)
!$OMP SECTION
call check_grid(x)
!$OMP END PARALLEL SECTIONS
Parallel Sections Example - C
#pragma omp parallel sections
{
#pragma omp section
init_field(a);
#pragma omp section
check_grid(x);
}
• Parallel sections are established upon
compilation
– Number of threads is fixed
• Sometimes more flexibility is needed, such as
parallelism within if or while block
• Task directive will assign tasks to threads as
needed
Task
Task Example - Fortran
!$OMP PARALLEL
!$OMP SINGLE
p = listhead
do while(p)
!$OMP TASK
call process(p)
!$OMP END TASK
p = next(p)
enddo
!$OMP END SINGLE
!$OMP END PARALLEL
Task Example - C
#pragma omp parallel
{
#pragma omp single
{
p = listhead;
while(p) {
#pragma omp task
process(p);
p = next(p);
}
}
}
• Only needed for Fortran
– Usually legacy code
• common block variables are inherently shared,
may want them to be private
• threadprivate gives each thread its own private
copy of variables in specified common block.
– Keep memory use in mind
• Place threadprivate after common block
declaration.
Threadprivate
common /work1/ work(1000)
!$omp threadprivate(/work1/)
Threadprivate (cont’d)
• threadprivate renders all values in common
block private
• may want to use some current values of
variables on all threads
• copyin initializes each thread’s copy of
specified common-block variables with values
from the master thread
Copyin
Copyin Example - Fortran
integer, parameter :: nvals = 501
real, dimension(nvals) :: x, y, xmet, ymet
common /work1/ work(nvals), istart, iend
!$OMP THREADPRIVATE(/work1/)
istart = 2; iend = 500
call read_coords(x,y)
!$OMP PARALLEL SECTIONS &
!$OMP COPYIN(istart,iend)
!$OMP SECTION
call metric(x)
xmet = work
!$OMP SECTION
call metric(y)
ymet = work
!$OMP END PARALLEL SECTIONS
subroutine metric(c)
real, dimension(:) :: c
common /work1/ work(nvals), istart, iend
!$OMP THREADPRIVATE(/work1/)
do i = istart, iend
work(i) = 0.5*(c(i+1)-c(i-1))
enddo
end subroutine metric
• Similar to critical
– Allows greater optimization than critical
• Applies to single line following the atomic
directive
Atomic
Atomic Example - Fortran
do i = 1, 10
!$OMP ATOMIC
x(j(i)) = x(j(i)) + 1.0
enddo
Atomic Example - C
for(i = 0; i<10; i++){
#pragma omp atomic
x[j[i]]= x[j[i]] + 1.0;
}
• Different threads can have different values for
the same shared variable
– example – value in register
• flush assures that the calling thread has a
consistent view of memory
• upcoming example is from OpenMP spec
Flush
Flush Example - Fortran
integer isync(num_threads)
!$omp parallel default(private) shared(isync)
iam = omp_get_thread_num()
isync(iam) = 0
!$omp barrier
call work()
isync(iam) = 1
! let other threads know I finished work
!$omp flush(isync)
do while(isync(neigh) == 0)
!$omp flush(isync)
enddo
!$omp end parallel
Flush Example - C
int isync[num_threads];
#pragma omp parallel default(private) shared(isync)
iam = omp_get_thread_num();
isync[iam] = 0;
#pragma omp barrier
work();
isync[iam] = 1;
/* let other threads know I finished work */
#pragma omp flush(isync)
while(isync[neigh] == 0)
#pragma omp flush(isync)
#pragma omp end parallel
Some Additional Functions
• On some systems the number of threads
available to you may be adjusted dynamically.
• The number of threads requested can be
construed as the maximum number of threads
available.
• Dynamic threading can be turned on or off:
Dynamic Thread Assignment
logical :: mydyn = .false.
call omp_set_dynamic(mydyn)
int mydyn = 0;
omp_set_dynamic(mydyn);
• Dynamic threading can also be turned on or off
by setting the environment variable
OMP_DYNAMIC to true or false
– call to omp_set_dynamic overrides the environment
variable
• The function omp_get_dynamic() returns a
value of “true” or “false,” indicating whether
dynamic threading is turned on or off.
Dynamic Thread Assignment (cont’d)
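A short C sketch of the calls above (what the runtime does when dynamic adjustment is on is implementation dependent):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_dynamic(0);   /* 0 = off: use exactly the number requested */
    printf("dynamic adjustment is %s\n",
           omp_get_dynamic() ? "on" : "off");

    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp single
        printf("parallel region has %d threads\n", omp_get_num_threads());
    }
    return 0;
}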
• integer function
• returns maximum number of threads available
in current parallel region
• same result as omp_get_num_threads if
dynamic threading is turned off
OMP_GET_MAX_THREADS
• integer function
• returns maximum number of processors in the
system
– indicates amount of hardware, not number of
available processors
• could be used to make sure enough
processors are available for specified number
of sections
OMP_GET_NUM_PROCS
• logical function
• tells whether or not function was called from a
parallel region
• useful debugging device when using
“orphaned” directives
– example of orphaned directive: omp parallel in
“main,” omp do or omp for in routine called from main
OMP_IN_PARALLEL
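A sketch of an orphaned for directive, with omp_in_parallel used as a sanity check (the routine scale and its contents are illustrative):

#include <stdio.h>
#include <omp.h>

/* contains an "orphaned" for directive: it binds to whatever
   parallel region is active in the caller */
void scale(double *a, int n) {
    int i;
    if (!omp_in_parallel())
        printf("warning: scale() called outside a parallel region\n");

    #pragma omp for
    for (i = 0; i < n; i++)
        a[i] = 2.0 * a[i];
}

int main(void) {
    double a[100];
    int i;
    for (i = 0; i < 100; i++) a[i] = i;

    #pragma omp parallel
    scale(a, 100);        /* the orphaned for splits the work here */

    printf("a[99] = %f\n", a[99]);
    return 0;
}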
Some Additional Clauses
Collapse
• collapse(n) allows parallelism over n
contiguously nested loops
Collapse Example - Fortran
!$omp parallel do collapse(2)
do j = 1, nval
do i = 1, nval
ival(i,j) = i + j
enddo
enddo
Collapse Example - C
#pragma omp parallel for collapse(2)
for(j=0; j<nval; j++){
for(i=0; i<nval; i++){
ival[i][j] = i + j;
}
}
Schedule
• schedule refers to the way in which loop
indices are distributed among threads
• default is static, which we had seen in an
earlier example
– each thread is assigned a contiguous range of
indices in order of thread number
• called round robin
– number of indices assigned to each thread is
as equal as possible
Schedule (cont’d)
• static example
– loop runs from 1 to 51, 4 threads
thread indices no. indices
0 1-13 13
1 14-26 13
2 27-39 13
3 40-51 12
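A small C sketch that prints which thread handles which index for the 51-iteration, 4-thread case above; with the default static schedule each thread reports one contiguous block:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;

    /* 51 iterations on 4 threads; static scheduling gives each
       thread one contiguous block, as in the table above */
    #pragma omp parallel for schedule(static) num_threads(4)
    for (i = 1; i <= 51; i++) {
        printf("i = %2d handled by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}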
Schedule (3)
• number of indices doled out at a time to each
thread is called the chunk size
• can be modified with the SCHEDULE clause
!$omp do schedule(static,5)
#pragma omp for schedule(static,5)
Schedule (4)
• example – same as previous example (loop runs
from 1-51, 4 threads), but set chunk size to 5
thread    chunk 1 indices    chunk 2 indices    chunk 3 indices    no. indices
0 1-5 21-25 41-45 15
1 6-10 26-30 46-50 15
2 11-15 31-35 51 11
3 16-20 36-40 - 10
Dynamic Schedule
• schedule(dynamic) clause assigns chunks to
threads dynamically as the threads become
available for more work
• default chunk size is 1
• higher overhead than STATIC
!$omp do schedule(dynamic,5)
#pragma omp for schedule(dynamic,5)
Guided Schedule
• schedule(guided) clause assigns chunks
automatically, exponentially decreasing chunk
size with each assignment
• specified chunk size is the minimum chunk size
except for the last chunk, which can be of any
size
• default chunk size is 1
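The form mirrors the other schedules; for example, with a minimum chunk size of 5 (the !$omp do form is analogous):

#pragma omp for schedule(guided,5)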
Runtime Schedule
• schedule can be specified through
the OMP_SCHEDULE environment variable
setenv OMP_SCHEDULE "dynamic,5"
• the schedule(runtime) clause tells it to set the
schedule using the environment variable
!$omp do schedule(runtime)
#pragma omp for schedule(runtime)
Nested Parallelism
• parallel construct within parallel construct:
Nested Parallelism
!$omp parallel do
do j = 1, jmax
!$omp parallel do
do i = 1, imax
call do_work(i,j)
enddo
enddo
#pragma omp parallel for
for(j=0; j<jmax; j++){
#pragma omp parallel for
for(i=0; i<imax; i++){
do_work(i,j);
}
}
• OpenMP standard allows but does not require
nested parallelism
• if nested parallelism is not implemented,
previous examples are legal, but the inner loop
will be serial
• logical function omp_get_nested() returns
.true. or .false. (1 or 0) to indicate whether or
not nested parallelism is enabled in the current
region
Nested Parallelism (cont’d)
• omp_set_nested(nest) enables/disables
nested parallelism if argument is true/false
– subroutine in Fortran (.true./.false.)
– function in C (1/0)
• nested parallelism can also be turned on or off
by setting the environment variable
OMP_NESTED to true or false
– calls to omp_set_nested override the environment
variable
Nested Parallelism (3)
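A C sketch that turns nested parallelism on and checks it; whether the inner region really gets extra threads is up to the implementation:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_nested(1);    /* request nested parallelism (1 = true) */
    printf("nested parallelism %s\n",
           omp_get_nested() ? "enabled" : "disabled");

    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(2)
        printf("outer thread %d, inner thread %d\n",
               outer, omp_get_thread_num());
    }
    return 0;
}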
Locks
• Locks offer additional control of threads
• a thread can take/release ownership of a lock
over a specified region of code
• when a lock is owned by a thread, other
threads cannot execute the locked region
• can be used to perform useful work by one or
more threads while another thread is engaged
in a serial task
Locks
• Lock routines each have a single argument
– argument type is an integer (Fortran) or a pointer to
an integer (C) long enough to hold an address
– OpenMP pre-defines required types
integer(omp_lock_kind) :: mylock
omp_lock_t *mylock;
Lock Routines
• omp_init_lock(mylock)
– initializes mylock
– must be called before mylock is used
– subroutine in Fortran
• omp_set_lock(mylock)
– gives ownership of mylock to calling thread
– other threads cannot execute code following call
until mylock is released
– subroutine in Fortran
Lock Routines (cont’d)
• omp_test_lock(mylock)
– logical function
– if mylock is presently owned by another
thread, returns false
– if mylock is available, returns true and sets
lock, i.e., gives ownership to calling
thread
– be careful: omp_test_lock(mylock) does
more than just test the lock
Lock Routines (3)
• omp_unset_lock(mylock)
– releases ownership of mylock
– subroutine in Fortran
• omp_destroy_lock(mylock)
– call when you’re done with the lock
– complement to omp_init_lock(mylock)
– subroutine in Fortran
Lock Routines (4)
• an independent task takes a significant
amount of time, and it must be performed
serially
• another task, independent of the first,
may be performed in parallel
• improve parallel efficiency by doing both
tasks at the same time
• one thread locks serial task
• other threads work on parallel task until
serial task is complete
Lock Example
Lock Example - Fortran
call omp_init_lock(mylock)
!$omp parallel
if(omp_test_lock(mylock)) then
call long_serial_task
call omp_unset_lock(mylock)
else
do while(.not. omp_test_lock(mylock))
call short_parallel_task
enddo
call omp_unset_lock(mylock)
endif
!$omp end parallel
call omp_destroy_lock(mylock)
Lock Example C
omp_init_lock(mylock);
#pragma omp parallel
{
if(omp_test_lock(mylock)){
long_serial_task();
omp_unset_lock(mylock);
}else{
while(! omp_test_lock(mylock)){
short_parallel_task();
}
omp_unset_lock(mylock);
}
}
omp_destroy_lock(mylock);
• Please fill out evaluation survey for this
tutorial at
http://scv.bu.edu/survey/tutorial_evaluation.html
Survey
More Related Content

Similar to openmp.ppt

Similar to openmp.ppt (20)

ppt7
ppt7ppt7
ppt7
 
ppt2
ppt2ppt2
ppt2
 
name name2 n
name name2 nname name2 n
name name2 n
 
name name2 n2
name name2 n2name name2 n2
name name2 n2
 
test ppt
test ppttest ppt
test ppt
 
name name2 n
name name2 nname name2 n
name name2 n
 
ppt21
ppt21ppt21
ppt21
 
name name2 n
name name2 nname name2 n
name name2 n
 
ppt17
ppt17ppt17
ppt17
 
ppt30
ppt30ppt30
ppt30
 
name name2 n2.ppt
name name2 n2.pptname name2 n2.ppt
name name2 n2.ppt
 
ppt18
ppt18ppt18
ppt18
 
Ruby for Perl Programmers
Ruby for Perl ProgrammersRuby for Perl Programmers
Ruby for Perl Programmers
 
ppt9
ppt9ppt9
ppt9
 
Lecture7
Lecture7Lecture7
Lecture7
 
Introduction to OpenMP
Introduction to OpenMPIntroduction to OpenMP
Introduction to OpenMP
 
Parallel and Distributed Computing Chapter 5
Parallel and Distributed Computing Chapter 5Parallel and Distributed Computing Chapter 5
Parallel and Distributed Computing Chapter 5
 
Some Rough Fibrous Material
Some Rough Fibrous MaterialSome Rough Fibrous Material
Some Rough Fibrous Material
 
openmp.New.intro-unc.edu.ppt
openmp.New.intro-unc.edu.pptopenmp.New.intro-unc.edu.ppt
openmp.New.intro-unc.edu.ppt
 
07 control+structures
07 control+structures07 control+structures
07 control+structures
 

Recently uploaded

VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 

openmp.ppt

  • 1. Parallel Processing with OpenMP Doug Sondak Boston University Scientific Computing and Visualization Office of Information Technology sondak@bu.edu
  • 2. Outline • Introduction • Basics • Data Dependencies • A Few More Basics • Caveats & Compilation • Coarse-Grained Parallelization • Thread Control Diredtives • Some Additional Functions • Some Additional Clauses • Nested Paralleliism • Locks
  • 4. Introduction • Types of parallel machines – distributed memory • each processor has its own memory address space • variable values are independent x = 2 on one processor, x = 3 on a different processor • examples: linux clusters, Blue Gene/L – shared memory • also called Symmetric Multiprocessing (SMP) • single address space for all processors – If one processor sets x = 2 , x will also equal 2 on other processors (unless specified otherwise) • examples: IBM p-series, SGI Altix
  • 5. Shared vs. Distributed Memory CPU 0 CPU 1 CPU 2 CPU 3 CPU 0 CPU 1 CPU 2 CPU 3 MEM 0 MEM 1 MEM 2 MEM 3 MEM shared distributed
  • 6. Shared vs. Distributed Memory (cont’d) • Multiple processes – Each processor (typically) performs independent task • Multiple threads – A process spawns additional tasks (threads) with same memory address space
  • 7. What is OpenMP? • Application Programming Interface (API) for multi-threaded parallelization consisting of – Source code directives – Functions – Environment variables
  • 8. What is OpenMP? (cont’d) • Advantages – Easy to use – Incremental parallelization – Flexible • Loop-level or coarse-grain – Portable • Since there’s a standard, will work on any SMP machine • Disadvantage – Shared-memory systems only
  • 10. Basics • Goal – distribute work among threads • Two methods will be discussed here – Loop-level • Specified loops are parallelized • This is approach taken by automatic parallelization tools – Parallel regions • Sometimes called “coarse-grained” • Don’t know good term; good way to start argument with semantically precise people • Usually used in message-passing (MPI)
  • 11. Basics (cont’d) loop loop Loop-level Parallel regions serial serial serial
  • 12. parallel do & parallel for • parallel do (Fortran) and parallel for (C) directives parallelize subsequent loop #pragma omp parallel for for(i = 1; i <= maxi; i++){ a[i] = b[i] = c[i]; } !$omp parallel do do i = 1, maxi a(i) = b(i) + c(i) enddo Use “c$” for fixed-format Fortran
  • 13. parallel do & parallel for (cont’d) • Suppose maxi = 1000 Thread 0 gets i = 1 to 250 Thread 1 gets i = 251 to 500 etc. • Barrier implied at end of loop
  • 14. workshare • For Fortran 90/95 array syntax, the parallel workshare directive is analogous to parallel do • Previous example would be: • Also works for forall and where statements !$omp parallel workshare a = b + c !$omp end parallel workshare
  • 15. Shared vs. Private • In parallel region, default behavior is that all variables are shared except loop index – All threads read and write the same memory location for each variable – This is ok if threads are accessing different elements of an array – Problem if threads write same scalar or array element – Loop index is private, so each thread has its own copy
  • 16. Shared vs. Private (cont’d) • Here’s an example where a shared variable, i2, could cause a problem ifirst = 10 do i = 1, imax i2 = 2*i j(i) = ifirst + i2 enddo ifirst = 10; for(i = 1; i <= imax; i++){ i2 = 2*i; j[i] = ifirst + i2; }
  • 17. Shared vs. Private (3) • OpenMP clauses modify the behavior of directives • Private clause creates separate memory location for specified variable for each thread
  • 18. Shared vs. Private (4) ifirst = 10 !$omp parallel do private(i2) do i = 1, imax i2 = 2*i j(i) = ifirst + i2 enddo ifirst = 10; #pragma omp parallel for private(i2) for(i = 1; i <= imax; i++){ i2 = 2*i; j[i] = ifirst + i2; }
  • 19. Shared vs. Private (5) • Look at memory just before and after parallel do/parallel for statement – Let’s look at two threads ifirst = 10 … i2thread 0 = ??? spawn additional thread ifirst = 10 … i2thread 0 = ??? i2thread 1 = ??? … before after
  • 20. Shared vs. Private (6) • Both threads access the same shared value of ifirst • They have their own private copies of i2 ifirst = 10 … i2thread 0 = ??? i2thread 1 = ??? … Thread 0 Thread 1
  • 22. Data Dependencies • Data on one thread can be dependent on data on another thread • This can result in wrong answers – thread 0 may require a variable that is calculated on thread 1 – answer depends on timing – When thread 0 does the calculation, has thread 1 calculated it’s value yet?
  • 23. Data Dependencies (cont’d) • Example – Fibonacci Sequence 0, 1, 1, 2, 3, 5, 8, 13, … a(1) = 0 a(2) = 1 do i = 3, 100 a(i) = a(i-1) + a(i-2) enddo a[1] = 0; a[2] = 1; for(i = 3; i <= 100; i++){ a[i] = a[i-1] + a[i-2]; }
  • 24. Data Dependencies (3) • parallelize on 2 threads – thread 0 gets i = 3 to 51 – thread 1 gets i = 52 to 100 – look carefully at calculation for i = 52 on thread 1 • what will be values of for i -1 and i - 2 ?
  • 25. Data Dependencies (4) • A test for dependency: – if serial loop is executed in reverse order, will it give the same result? – if so, it’s ok – you can test this on your serial code
  • 26. Data Dependencies (5) • What about subprogram calls? • Does the subprogram write x or y to memory? – If so, they need to be private • Variables local to subprogram are local to each thread • Be careful with global variables and common blocks do i = 1, 100 call mycalc(i,x,y) enddo for(i = 0; i < 100; i++){ mycalc(i,x,y); }
  • 27. A Few More Basics
  • 28. More clauses • Can make private default rather than shared – Fortran only – handy if most of the variables are private – can use continuation characters for long lines ifirst = 10 !$omp parallel do & !$omp default(private) & !$omp shared(ifirst,imax,j) do i = 1, imax i2 = 2*i j(i) = ifirst + i2 enddo
  • 29. More clauses (cont’d) • Can use default none – declare all variables (except loop variables) as shared or private – If you don’t declare any variables, you get a handy list of all variables in loop
  • 30. More clauses (3) ifirst = 10 !$omp parallel do & !$omp default(none) & !$omp shared(ifirst,imax,j) & !$omp private(i2) do i = 1, imax i2 = 2*i j(i) = ifirst + i2 enddo ifirst = 10; #pragma omp parallel for default(none) shared(ifirst,imax,j) private(i2) for(i = 0; i < imax; i++){ i2 = 2*i; j[i] = ifirst + i2; }
  • 31. Firstprivate • Suppose we need a running index total for each index value on each thread • if iper were declared private, the initial value would not be carried into the loop iper = 0 do i = 1, imax iper = iper + 1 j(i) = iper enddo iper = 0; for(i = 0; i < imax; i++){ iper = iper + 1; j[i] = iper; }
  • 32. Firstprivate (cont’d) • Solution – firstprivate clause • Creates private memory location for each thread • Copies value from master thread (thread 0) to each memory location iper = 0 !$omp parallel do & !$omp firstprivate(iper) do i = 1, imax iper = iper + 1 j(i) = iper enddo iper = 0; #pragma omp parallel for firstprivate(iper) for(i = 0; i < imax; i++){ iper = iper + 1; j[i] = iper; }
  • 33. Lastprivate • saves value corresponding to the last loop index – "last" in the serial sense !$omp parallel do lastprivate(i) do i = 1, maxi-1 a(i) = b(i) enddo a(i) = b(1) #pragma omp parallel for lastprivate(i) for(i = 0; i < maxi-1; i++){ a[i] = b[i]; } a(i) = b(1);
  • 34. Reduction • following example won’t parallelize correctly with parallel do/parallel for – different threads may try to write to sum1 simultaneously sum1 = 0.0 do i = 1, maxi sum1 = sum1 + a(i) enddo sum1 = 0.0; for(i = 0; i < imaxi; i++){ sum1 = sum1 + a[i]; }
  • 35. Reduction (cont’d) • Solution? – Reduction clause • each thread performs its own reduction (sum, in this case) • results from all threads are automatically reduced (summed) at the end of the loop sum1 = 0.0 !$omp parallel do & !$reduction(+:sum1) do i = 1, maxi sum1 = sum1 + a(i) enddo sum1 = 0; #pragma omp parallel for reduction(+:sum1) for(i = 0; i < imaxi; i++){ sum1 = sum1 + a[i]; }
  • 36. Reduction (3) • Fortran operators/intrinsics: MAX, MIN, IAND, IOR, IEOR, +, *, -, .AND., .OR., .EQV., .NEQV. • C operators: +, *, -, /, &, ^, |, &&, || • roundoff error may be different than serial case
  • 37. num_threads • Number of threads for subsequent loop can be specified with num_threads(n) clause – n is number of threads
  • 38. Conditional Compilation • Fortran: if compiled without OpenMP, directives are treated as comments – Great for portability • !$ (c$ for fixed format) can be used for conditional compilation for any source lines !$ print*, ‘number of procs =', nprocs • C or C++: conditional compilation can be performed with the _OPENMP macro name. #ifdef _OPENMP … do stuff … #endif
  • 39. Basic OpenMP Functions • omp_get_thread_num() – returns ID of current thread • omp_set_num_threads(nthreads) – subroutine in Fortran – sets number of threads in next parallel region to nthreads – alternative to OMP_NUM_THREADS environment variable – overrides OMP_NUM_THREADS • omp_get_num_threads() – returns number of threads in current parallel region
  • 41. Some Hints • OpenMP will do what you tell it to do – If you try parallelize a loop with a dependency, it will go ahead and do it! • Do not parallelize small loops – Overhead will be greater than speedup – How small is “small”? • Answer depends on processor speed and other system- dependent parameters • Try it! • Maximize number of operations performed in parallel – parallelize outer loops where possible
  • 42. Some Hints (cont’d) • Always profile your code before parallelizing! – tells where most time is being used • Portland Group profiler: – compile with –Mprof=func flag – run code • file with profile data will appear in working directory – run profiler: • pgprof –exe executable-name • GUI will pop up with profiling information
  • 43. Compile and Run • Portland Group compilers (katana): – Compile with -mp flag – Can use up to 8 threads • AIX compilers (twister, etc.): – Use an _r suffix on the compiler name, e.g., cc_r, xlf90_r – Compile with -qsmp=omp flag • Intel compilers (skate, cootie): – Compile with -openmp and –fpp flags – Can only use 2 threads
  • 44. Compile and Run (cont’d) • GNU compilers (all machines): – Compile with -fopenmp flag • Blue Gene does not accomodate OpenMP • environment variable OMP_NUM_THREADS sets number of threads setenv OMP_NUM_THREADS 4 • for C, include omp.h
  • 45. Exercise • Copy files from /scratch/sondak/gt – both C and Fortran versions are available for your programming pleasure • compile code with pgcc or pgf90 – include profiling flag –Mprof=func – call executable “gt” • Run code batch – qsub rungt – qstat –u username
  • 46. Exercise (cont’d) • When run completes, look at profile – pgprof –exe gt – Where is most time being used? – How can we tackle this with OpenMP? • Recompile without profiler flag • Re-run and note timing from batch output file – this is the base timing • We will decide on an OpenMP strategy • Add OpenMP directive(s) • Re-run on 1, 2, and 4 procs. and check new timings
  • 47. Parallel • parallel and do/for can be separated into two directives. is the same as !$omp parallel do do i = 1, maxi a(i) = b(i) enddo #pragma omp parallel for for(i=0; i<maxi; i++){ a[i] = b[i]; } !$omp parallel !$omp do do i = 1, maxi a(i) = b(i) enddo !$omp end parallel #pragma omp parallel #pragma omp for for(i=0; i<maxi; i++){ a[i] = b[i]; } #pragma omp end parallel
  • 48. Parallel (cont’d) • Note that an end parallel directive is required. • Everything within the parallel region will be run in parallel (surprise!). • The do/for directive indicates that the loop indices will be distributed among threads rather than duplicating every index on every thread.
  • 49. Parallel (3) • Multiple loops in parallel region: • parallel directive has a significant overhead associated with it. • The above example has the potential to be faster than using two parallel do/parallel for directives. !$omp parallel !$omp do do i = 1, maxi a(i) = b(i) enddo !$omp do do i = 1, maxi c(i) = a(2) enddo !$omp end parallel #pragma omp parallel #pragma omp for for(i=0; i<maxi; i++){ a[i] = b[i]; } #pragma omp for for(i=0; i<maxi; i++){ c[i] = a[2]; } #pragma omp end parallel
  • 50. Coarse-Grained • OpenMP is not restricted to loop-level (fine- grained) parallelism. • The !$omp parallel or #pragma omp parallel directive duplicates subsequent code on all threads until a !$omp end parallel or #pragma omp end parallel directive is encountered. • Allows parallelization similar to “MPI paradigm.”
  • 51. Coarse-Grained (cont’d) !$omp parallel & !$omp private(myid,istart,iend,nthreads,nper) nthreads = omp_get_num_threads() nper = imax/nthreads myid = omp_get_thread_num() istart = myid*nper + 1 iend = istart + nper – 1 call do_work(istart,iend) do i = istart, iend a(i) = b(i)*c(i) + ... enddo !$omp end parallel #pragma omp parallel #pragma omp private(myid,istart,iend,nthreads,nper) nthreads = OMP_GET_NUM_THREADS(); nper = imax/nthreads; myid = OMP_GET_THREAD_NUM(); istart = myid*nper; iend = istart + nper – 1; do_work(istart,iend); for(i=istart; i<=iend; i++){ a[i] = b[i]*c[i] + ... } #pragma omp end parallel
  • 53. • barrier synchronizes threads • Here barrier assures that a(1) or a[0] is available before computing myval Barrier $omp parallel private(myid,istart,iend) call myrange(myid,istart,iend) do i = istart, iend a(i) = a(i) - b(i) enddo !$omp barrier myval(myid+1) = a(istart) + a(1) #pragma omp parallel private(myid,istart,iend) myrange(myid,istart,iend); for(i=istart; i<=iend; i++){ a[i] = a[i] – b[i]; } #pragma omp barrier myval[myid] = a[istart] + a[0]
  • 54. • if you want part of code to be executed only on master thread, use master directive • “non-master” threads will skip over master region and continue Master
  • 55. Master Example - Fortran !$OMP PARALLEL PRIVATE(myid,istart,iend) call myrange(myid,istart,iend) do i = istart, iend a(i) = a(i) - b(i) enddo !$OMP BARRIER !$OMP MASTER write(21) a !$OMP END MASTER call do_work(istart,iend) !$OMP END PARALLEL
  • 56. Master Example - C #pragma omp parallel private(myid,istart,iend) myrange(myid,istart,iend); for(i=istart; i<=iend; i++){ a[i] = a[i] – b[i]; } #pragma omp barrier #pragma omp master fwrite(fid,sizeof(float),iend-istart+1,a); #pragma omp end master do_work(istart,iend); #pragma omp end parallel
  • 57. • want part of code to be executed only by a single thread • don’t care whether or not it’s the master thread • use single directive • Unlike the end master directive, end single implies a barrier. Single
  • 58. Single Example - Fortran !$OMP PARALLEL PRIVATE(myid,istart,iend) call myrange(myid,istart,iend) do i = istart, iend a(i) = a(i) - b(i) enddo !$OMP BARRIER !$OMP SINGLE write(21) a !$OMP END SINGLE call do_work(istart,iend) !$OMP END PARALLEL
  • 59. Single Example - C #pragma omp parallel private(myid,istart,iend) myrange(myid,istart,iend); for(i=istart; i<=iend; i++){ a[i] = a[i] – b[i]; } #pragma omp barrier #pragma omp single fwrite(fid,sizeof(float),nvals,a); #pragma omp end single do_work(istart,iend); #pragma omp end parallel
  • 60. • have section of code that: 1. must be executed by every thread 2. threads may execute in any order 3. threads must not execute simultaneously • use critical directive • no implied barrier Critical
  • 61. Critical Example - Fortran !$OMP PARALLEL PRIVATE(myid,istart,iend) call myrange(myid,istart,iend) do i = istart, iend a(i) = a(i) - b(i) enddo !$OMP CRITICAL call mycrit(myid,a) !$OMP END CRITICAL call do_work(istart,iend) !$OMP END PARALLEL
  • 62. Critical Example - C #pragma omp parallel private(myid,istart,iend) myrange(myid,istart,iend); for(i=istart; i<=iend; i++){ a[i] = a[i] – b[i]; } #pragma omp critical mycrit(myid,a); #pragma omp end critical do_work(istart,iend); #pragma omp end parallel
  • 63. Ordered • Suppose you want to write values in a loop: • If loop were parallelized, could write out of order • ordered directive forces serial order do i = 1, nproc call do_lots_of_work(result(i)) write(21,101) i, result(i) enddo for(i = 0; i < nproc; i++){ do_lots_of_work(result[i]); fprintf(fid,”%d %fn,”i,result[i]”); }
  • 64. Ordered (cont’d) !$omp parallel do do i = 1, nproc call do_lots_of_work(result(i)) !$omp ordered write(21,101) i, result(i) !$omp end ordered enddo #pragma omp parallel for for(i = 0; i < nproc; i++){ do_lots_of_work(result[i]); #pragma omp ordered fprintf(fid,”%d %fn,”i,result[i]”); #pragma omp end ordered }
  • 65. Ordered (3) • since do_lots_of_work takes a lot of time, most parallel benefit will be realized • ordered is helpful for debugging – Is result same with and without ordered directive?
  • 66. • In parallel region, have several independent tasks • Divide code into sections • Each section is executed on a different thread. • Implied barrier • Fixes number of threads • Note that sections directive delineates region of code which includes section directives. Sections
  • 67. Sections Example - Fortran !$OMP PARALLEL !$OMP SECTIONS !$OMP SECTION call init_field(a) !$OMP SECTION call check_grid(x) !$OMP END SECTIONS !$OMP END PARALLEL
  • 68. Sections Example - C #pragma omp parallel #pragma omp sections #pragma omp section init_field(a); #pragma omp section check_grid(x); #pragma omp end sections #pragma omp end parallel
  • 69. • Analogous to do and parallel do or for and parallel for, along with sections there exists a parallel sections directive. Parallel Sections
  • 70. Parallel Sections Example - Fortran !$OMP PARALLEL SECTIONS !$OMP SECTION call init_field(a) !$OMP SECTION call check_grid(x) !$OMP END PARALLEL SECTIONS
  • 71. Parallel Sections Example - C #pragma omp parallel sections #pragma omp section init_field(a); #pragma omp section check_grid(x); #pragma omp end parallel sections
  • 72. • Parallel sections are established upon compilation – Number of threads is fixed • Sometimes more flexibility is needed, such as parallelism within if or while block • Task directive will assign tasks to threads as needed Task
  • 73. Task Example - Fortran !$OMP PARALLEL !$OMP SINGLE p = listhead do while(p) !$OMP TASK process(p) !$OMP END TASK p = next(p) enddo !$OMP END SINGLE !$OMP END
  • 74. Task Example - C #pragma omp parallel { #pragma omp single { p = listhead; while(p) { #pragma omp task process(p); p = next(p); } } }
  • 75. • Only needed for Fortran – Usually legacy code • common block variables are inherently shared, may want them to be private • threadprivate gives each thread its own private copy of variables in specified common block. – Keep memory use in mind • Place threadprivate after common block declaration. Threadprivate
  • 76. common /work1/ work(1000) !$omp threadprivate(/work1/) Threadprivate (cont’d)
  • 77. • threadprivate renders all values in common block private • may want to use some current values of variables on all threads • copyin initializes each thread’s copy of specified common-block variables with values form the master thread Copyin
  • 78. Copyin Example - Fortran real, parameter :: nvals = 501 real, dimension(nvals) :: x, y, xmet, ymet common /work1/ work(nvals), istart, iend !$OMP THREADPRIVATE(/work1/) istart = 2; iend = 500 call read_coords(x,y) !$OMP PARALLEL SECTIONS COPYIN(istart,iend) !$OMP SECTION call metric(x) xmet = work !$OMP SECTION call metric(y) ymet = work !$OMP END PARALLEL SECTIONS subroutine metric(c) real, dimension(:) :: c common /work1/ work(nvals), istart, iend !$OMP THREADPRIVATE(/work1/) do i = istart, iend work(i) = 0.5*(c(i+1)-c(i-1)) enddo end subroutine metric
  • 79. • Similar to critical – Allows greater optimization than critical • Applies to single line following the atomic directive Atomic
• 80. Atomic Example - Fortran
do i = 1, 10
!$OMP ATOMIC
   x(j(i)) = x(j(i)) + 1.0
enddo
• 81. Atomic Example - C
for(i = 0; i < 10; i++){
   #pragma omp atomic
   x[j[i]] = x[j[i]] + 1.0;
}
• 82. Flush • Different threads can have different values for the same shared variable – example: a value held in a register • flush assures that the calling thread has a consistent view of memory • the upcoming example is from the OpenMP spec
• 83. Flush Example - Fortran
integer isync(num_threads)
!$omp parallel default(private) shared(isync)
iam = omp_get_thread_num()
isync(iam) = 0
!$omp barrier
call work()
isync(iam) = 1   ! let other threads know I finished work
!$omp flush(isync)
do while(isync(neigh) == 0)
!$omp flush(isync)
enddo
!$omp end parallel
• 84. Flush Example - C
int isync[num_threads];
#pragma omp parallel default(private) shared(isync)
{
   iam = omp_get_thread_num();
   isync[iam] = 0;
   #pragma omp barrier
   work();
   isync[iam] = 1;   /* let other threads know I finished work */
   #pragma omp flush(isync)
   while(isync[neigh] == 0){
      #pragma omp flush(isync)
   }
}
• 86. Dynamic Thread Assignment • On some systems the number of threads available to you may be adjusted dynamically. • The number of threads requested can be construed as the maximum number of threads available. • Dynamic threading can be turned on or off:
logical :: mydyn = .false.
call omp_set_dynamic(mydyn)

int mydyn = 0;
omp_set_dynamic(mydyn);
• 87. Dynamic Thread Assignment (cont’d) • Dynamic threading can also be turned on or off by setting the environment variable OMP_DYNAMIC to true or false – a call to omp_set_dynamic overrides the environment variable • The function omp_get_dynamic() reports whether dynamic threading is turned on or off (.true./.false. in Fortran, nonzero/zero in C); see the sketch below.
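A minimal C sketch, for illustration only, of disabling dynamic adjustment and reading the setting back:

   #include <stdio.h>
   #include <omp.h>

   int main(void){
       omp_set_dynamic(0);     /* ask the runtime not to adjust the thread count */
       printf("dynamic adjustment on? %d\n", omp_get_dynamic());   /* prints 0 */

       omp_set_dynamic(1);     /* allow the runtime to adjust the thread count */
       printf("dynamic adjustment on? %d\n", omp_get_dynamic());   /* 1 if supported */
       return 0;
   }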
• 88. OMP_GET_MAX_THREADS • integer function • returns the maximum number of threads that can be used for a parallel region (the team size that would be created) • gives the same result as omp_get_num_threads inside a parallel region if dynamic threading is turned off; a short sketch follows
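A small C sketch, assuming dynamic threading is left at its default, comparing the reported maximum with the actual team size:

   #include <stdio.h>
   #include <omp.h>

   int main(void){
       /* maximum team size the runtime would use for the next parallel region */
       printf("omp_get_max_threads = %d\n", omp_get_max_threads());

       #pragma omp parallel
       {
           #pragma omp single
           printf("omp_get_num_threads = %d\n", omp_get_num_threads());
       }
       return 0;
   }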
• 89. OMP_GET_NUM_PROCS • integer function • returns the number of processors in the system – indicates the amount of hardware, not the number of threads your job will actually use • could be used to make sure enough processors are available for a specified number of sections (see the sketch below)
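A C sketch of that check; nsections is an illustrative variable, not something defined elsewhere in this tutorial:

   #include <stdio.h>
   #include <omp.h>

   int main(void){
       int nsections = 8;                    /* desired parallelism (illustrative) */
       int nprocs    = omp_get_num_procs();  /* processors visible to the program */

       if(nsections > nprocs)
           printf("only %d processors for %d sections; some will share\n",
                  nprocs, nsections);

       /* don't request more threads than there are processors */
       omp_set_num_threads(nsections < nprocs ? nsections : nprocs);
       return 0;
   }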
• 90. OMP_IN_PARALLEL • logical function (returns nonzero/zero in C) • tells whether or not the function was called from a parallel region • useful debugging device when using “orphaned” directives – example of an orphaned directive: omp parallel in “main,” omp do or omp for in a routine called from main; a sketch follows
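A C sketch of that debugging use; do_rows is a hypothetical routine containing an orphaned worksharing directive:

   #include <stdio.h>
   #include <omp.h>

   /* hypothetical routine with an orphaned worksharing directive */
   void do_rows(double *a, int n){
       if(!omp_in_parallel())
           printf("note: do_rows called outside a parallel region\n");

       #pragma omp for    /* orphaned: binds to the caller's parallel region */
       for(int i = 0; i < n; i++)
           a[i] = 2.0*a[i];
   }

   int main(void){
       double a[100] = {0.0};

       #pragma omp parallel
       do_rows(a, 100);   /* parallel call: the loop is split among threads */

       do_rows(a, 100);   /* serial call: omp_in_parallel() returns 0 */
       return 0;
   }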
• 92. Collapse • collapse(n) allows parallelism over n perfectly (tightly) nested loops – there must be no intervening code between the loops
• 93. Collapse Example - Fortran
!$omp parallel do collapse(2)
do j = 1, nval
   do i = 1, nval
      ival(i,j) = i + j
   enddo
enddo
• 94. Collapse Example - C
#pragma omp parallel for collapse(2)
for(j=0; j<nval; j++){
   for(i=0; i<nval; i++){
      ival[i][j] = i + j;
   }
}
• 95. Schedule • schedule refers to the way in which loop indices are distributed among threads • the default is static, which we saw in an earlier example – each thread is assigned one contiguous range of indices, in order of thread number – the number of indices assigned to each thread is as equal as possible – when a chunk size is specified, the chunks are handed out to the threads round robin
• 96. Schedule (cont’d) • static example – loop runs from 1 to 51, 4 threads

thread   indices   no. indices
  0       1-13         13
  1      14-26         13
  2      27-39         13
  3      40-51         12
• 97. Schedule (3) • the number of indices doled out at a time to each thread is called the chunk size • it can be modified with the SCHEDULE clause
!$omp do schedule(static,5)
#pragma omp for schedule(static,5)
• 98. Schedule (4) • example – same as previous example (loop runs from 1-51, 4 threads), but set chunk size to 5

thread   chunk 1   chunk 2   chunk 3   no. indices
  0       1-5      21-25     41-45        15
  1       6-10     26-30     46-50        15
  2      11-15     31-35     51           11
  3      16-20     36-40     -            10
• 99. Dynamic Schedule • the schedule(dynamic) clause assigns chunks to threads dynamically as the threads become available for more work • default chunk size is 1 • higher overhead than static
!$omp do schedule(dynamic,5)
#pragma omp for schedule(dynamic,5)
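A self-contained C sketch of where dynamic scheduling pays off – an unbalanced loop; the expensive() routine is hypothetical, written only so that later iterations cost more:

   #include <math.h>
   #include <omp.h>

   #define N 1000

   /* hypothetical routine whose cost grows with i, so iterations are unbalanced */
   double expensive(int i){
       double s = 0.0;
       for(int k = 0; k < i; k++) s += sin((double)k);
       return s;
   }

   int main(void){
       static double a[N];

       /* dynamic schedule: idle threads grab the next chunk of 5 iterations */
       #pragma omp parallel for schedule(dynamic,5)
       for(int i = 0; i < N; i++)
           a[i] = expensive(i);

       return 0;
   }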
  • 100. Guided Schedule • schedule(guided) clause assigns chunks automatically, exponentially decreasing chunk size with each assignment • specified chunk size is the minimum chunk size except for the last chunk, which can be of any size • default chunk size is 1
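A small C sketch, for illustration, that records which thread executes each iteration; printing the result makes the shrinking guided chunks visible:

   #include <stdio.h>
   #include <omp.h>

   #define N 100

   int main(void){
       int owner[N];

       /* guided schedule, minimum chunk size 4: early chunks are large,
          later chunks shrink toward 4 */
       #pragma omp parallel for schedule(guided,4)
       for(int i = 0; i < N; i++)
           owner[i] = omp_get_thread_num();

       /* print which thread executed each iteration to see the chunk pattern */
       for(int i = 0; i < N; i++)
           printf("%3d  thread %d\n", i, owner[i]);
       return 0;
   }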
• 101. Runtime Schedule • the schedule can be specified through the OMP_SCHEDULE environment variable
setenv OMP_SCHEDULE "dynamic,5"
• the schedule(runtime) clause tells the runtime to take the schedule from the environment variable
!$omp do schedule(runtime)
#pragma omp for schedule(runtime)
• 103. Nested Parallelism • parallel construct within a parallel construct:
!$omp parallel do
do j = 1, jmax
!$omp parallel do
   do i = 1, imax
      call do_work(i,j)
   enddo
enddo

#pragma omp parallel for
for(j=0; j<jmax; j++){
   #pragma omp parallel for
   for(i=0; i<imax; i++){
      do_work(i,j);
   }
}
• 104. Nested Parallelism (cont’d) • the OpenMP standard allows but does not require nested parallelism • if nested parallelism is not implemented, the previous examples are still legal, but the inner loop is executed serially (by a team of one thread) • the logical function omp_get_nested() returns .true. or .false. (1 or 0 in C) to indicate whether or not nested parallelism is enabled
• 105. Nested Parallelism (3) • omp_set_nested(nest) enables/disables nested parallelism when its argument is true/false – subroutine in Fortran (.true./.false.) – function in C (1/0) • nested parallelism can also be turned on or off by setting the environment variable OMP_NESTED to true or false – calls to omp_set_nested override the environment variable; a sketch follows
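A C sketch, assuming the implementation supports nesting, that enables nested parallelism and makes both team levels visible:

   #include <stdio.h>
   #include <omp.h>

   int main(void){
       omp_set_nested(1);                          /* request nested parallelism */
       printf("nested enabled? %d\n", omp_get_nested());

       #pragma omp parallel num_threads(2)
       {
           int outer = omp_get_thread_num();

           #pragma omp parallel num_threads(2)
           {
               /* with nesting supported, four combinations print;
                  otherwise each inner region runs with a single thread */
               printf("outer thread %d, inner thread %d\n",
                      outer, omp_get_thread_num());
           }
       }
       return 0;
   }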
  • 106. Locks
• 107. Locks • Locks offer additional control of threads • a thread can take/release ownership of a lock over a specified region of code • when a lock is owned by a thread, other threads cannot execute the locked region • can be used to perform useful work by one or more threads while another thread is engaged in a serial task
• 108. Lock Routines • Lock routines each have a single argument – in Fortran the argument is an integer of a special kind, long enough to hold an address; in C the routines are passed a pointer to the lock variable (&mylock) – OpenMP pre-defines the required types
integer(omp_lock_kind) :: mylock
omp_lock_t mylock;
• 109. Lock Routines (cont’d) • omp_init_lock(mylock) – initializes mylock – must be called before mylock is used – subroutine in Fortran • omp_set_lock(mylock) – gives ownership of mylock to the calling thread – other threads that try to set mylock wait until it is released – subroutine in Fortran
• 110. Lock Routines (3) • omp_test_lock(mylock) – logical function (returns nonzero/zero in C) – if mylock is presently owned by another thread, returns false – if mylock is available, returns true and sets the lock, i.e., gives ownership to the calling thread – be careful: omp_test_lock(mylock) does more than just test the lock
• 111. Lock Routines (4) • omp_unset_lock(mylock) – releases ownership of mylock – subroutine in Fortran • omp_destroy_lock(mylock) – call when you’re done with the lock – complement to omp_init_lock(mylock) – subroutine in Fortran
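A minimal C sketch of the basic init/set/unset/destroy pattern with a simple lock; the protected counter is purely illustrative:

   #include <stdio.h>
   #include <omp.h>

   int main(void){
       omp_lock_t lck;
       int count = 0;

       omp_init_lock(&lck);             /* create the lock before first use */

       #pragma omp parallel
       {
           omp_set_lock(&lck);          /* one thread at a time past this point */
           count++;                     /* protected update */
           printf("thread %d raised count to %d\n",
                  omp_get_thread_num(), count);
           omp_unset_lock(&lck);        /* release so other threads may proceed */
       }

       omp_destroy_lock(&lck);          /* finished with the lock */
       return 0;
   }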
• 112. Lock Example • an independent task takes a significant amount of time, and it must be performed serially • another task, independent of the first, may be performed in parallel • improve parallel efficiency by doing both tasks at the same time • one thread locks the serial task • the other threads work on the parallel task until the serial task is complete
• 113. Lock Example - Fortran
call omp_init_lock(mylock)
!$omp parallel
if(omp_test_lock(mylock)) then
   call long_serial_task
   call omp_unset_lock(mylock)
else
   do while(.not. omp_test_lock(mylock))
      call short_parallel_task
   enddo
   call omp_unset_lock(mylock)
endif
!$omp end parallel
call omp_destroy_lock(mylock)
• 114. Lock Example - C
omp_lock_t mylock;
omp_init_lock(&mylock);
#pragma omp parallel
{
   if(omp_test_lock(&mylock)){
      long_serial_task();
      omp_unset_lock(&mylock);
   }else{
      while(!omp_test_lock(&mylock)){
         short_parallel_task();
      }
      omp_unset_lock(&mylock);
   }
}
omp_destroy_lock(&mylock);
  • 115. • Please fill out evaluation survey for this tutorial at http://scv.bu.edu/survey/tutorial_evaluation.html Survey