This document provides an overview of writing OpenMP programs on multi-core machines. It discusses:
1) Why OpenMP is useful for parallel programming and its main components like compiler directives and library routines.
2) Elements of OpenMP like parallel regions, work sharing constructs, data scoping, and synchronization methods.
3) Achieving scalable speedup through techniques like breaking data dependencies, avoiding synchronization overheads, and improving data locality with cache and page placement.
1. Writing OpenMP Programs on
Many and Multi Core Machines
Prof NB Venkateswarlu
ISTE Visiting Professor 2010-11
CSE, AITAM, Tekkali
venkat_ritch@yahoo.com
www.ritchcenter.com/nbv
2. Agenda
• Why OpenMP ?
• Elements of OpenMP
• Scalable Speedup and Data Locality
• Parallelizing Sequential Programs
• Breaking data dependencies
• Avoiding synchronization overheads
• Achieving Cache and Page Locality
• SGI Tools for Performance Analysis
and Tuning
3. Why OpenMP ?
• Parallel programming is more difficult
than sequential programming
• OpenMP is a scalable, portable,
incremental approach to designing
shared memory parallel programs
• OpenMP supports
– fine and coarse grained parallelism
– data and control parallelism
4. What is OpenMP ?
Three components:
• Set of compiler directives for
– creating teams of threads
– sharing the work among threads
– synchronizing the threads
• Library routines for setting and querying
thread attributes
• Environment variables for controlling run-
time behavior of the parallel program
5. Elements of OpenMP
• Parallel regions and work sharing
• Data scoping
• Synchronization
• Compiling and running OpenMP
programs
6. Parallelism in OpenMP
• The parallel region is the construct for
creating multiple threads in an
OpenMP program
• A team of threads is created at run
time for a parallel region
• A nested parallel region is allowed,
but may contain a team of one thread
• Nested parallelism is enabled with
setenv OMP_NESTED TRUE
8. Hello World in OpenMP
#include <omp.h>
int main() {
int iam =0, np = 1;
#pragma omp parallel private(iam, np)
{
#if defined (_OPENMP)
np = omp_get_num_threads();
iam = omp_get_thread_num();
#endif
printf("Hello from thread %d out of %d\n", iam, np);
}
}
parallel region directive
with data scoping clause
9. Specifying Parallel Regions
• Fortran
!$OMP PARALLEL [clause [clause...]]
! Block of code executed by all threads
!$OMP END PARALLEL
• C and C++
#pragma omp parallel [clause [clause...]]
{
/* Block executed by all threads */
}
11. Work sharing in OpenMP
• Two ways to specify parallel work:
– Explicitly coded in parallel regions
– Work-sharing constructs
» DO and for constructs: parallel loops
» sections
» single
• SPMD type of parallelism supported
12. Work and Data Partitioning
Loop parallelization
• distribute the work among the threads,
without explicitly distributing the data.
• scheduling determines which thread
accesses which data
• communication between threads is
implicit, through data sharing
• synchronization via parallel constructs
or is explicitly inserted in the code
13. Data Partitioning & SPMD
• Data is distributed explicitly among
processes
• With message passing, e.g., MPI,
where no data is shared, data is
explicitly communicated
• Synchronization is explicit or
embedded in communication
• With parallel regions in OpenMP, both
SPMD and data sharing are supported
14. Pros and Cons of SPMD
» Pros:
– Potentially higher parallel fraction
than with loop parallelism
– The fewer parallel regions, the less
overhead
» Cons:
– More explicit synchronization needed
than for loop parallelization
– Does not promote incremental
parallelization and requires manually
assigning data subsets to threads
15. SPMD Example
program mat_init
implicit none
integer, parameter::N=1024
real A(N,N)
integer :: iam, np
iam = 0
np = 1
!$omp parallel private(iam,np)
np = omp_get_num_threads()
iam = omp_get_thread_num()
! Each thread calls work
call work(N, A, iam, np)
!$omp end parallel
end
subroutine work(n, A, iam, np)
integer n, iam, np
real A(n,n)
integer :: chunk,low,high,i,j
chunk = (n + np - 1)/np
low = 1 + iam*chunk
high=min(n,(iam+1)*chunk)
do j = low, high
do I=1,n
A(I,j)=3.14 + &
sqrt(real(i*i*i+j*j+i*j*j))
enddo
enddo
return
end
A single parallel region, no scheduling needed,
each thread explicitly determines its work
16. Extent of directives
Most directives have as extent a
structured block, or basic block, i.e., a
sequence of statements with a flow of
control that satisfies:
• there is only one entry point in the
block, at the beginning of the block
• there is only one exit point, at the end
of the block; the exceptions are that
exit() in C and stop in Fortran are
allowed
17. Work Sharing Constructs
• DO and for : parallelizes a loop,
dividing the iterations among the
threads
• sections : defines a sequence of
contiguous blocks, the beginning of
each bock being marked by a section
directive. The block within each
section is assigned to one thread
• single: assigns a block of a parallel
region to a single thread
18. Specialized Parallel
Regions
Work-sharing can be specified
combined with a parallel region
• parallel DO and parallel for : a
parallel region which contains a
parallel loop
• parallel sections, a parallel region
that contains a number of section
constructs
19. Scheduling
• Scheduling assigns the iterations of a
parallel loop to the team threads
• The directives [parallel] do and
[parallel] for take the clause
schedule(type [,chunk])
• The optional chunk is a loop-invariant
positive integer specifying the number
of contiguous iterations assigned to a
thread
20. Scheduling
The type can be one of
• static threads are statically assigned
chunks of size chunk in a round-robin
fashion. The default for chunk is
ceiling(N/p) where N is the number of
iterations and p is the number of
threads in the team
• dynamic threads are dynamically
assigned chunks of size chunk, i.e.,
21. Scheduling
when a thread is ready to receive new
work, it is assigned the next pending
chunk. Default value for chunk is 1.
• guided a variant of dynamic
scheduling in which the size of the
chunk decreases exponentially from
chunk to 1. Default value for chunk is
ceiling(N/p)
22. Scheduling
• runtime indicates that the schedule
type and chunk are specified by the
environment variable OMP_SCHEDULE. A
chunk cannot be specified with
runtime.
• Example of run-time specified
scheduling
setenv OMP_SCHEDULE “dynamic,2”
23. Scheduling
• If the schedule clause is missing, an
implementation dependent schedule is
selected. MIPSpro selects by default
the static schedule
• Static scheduling has low overhead
and provides better data locality
• Dynamic and guided scheduling may
provide better load balancing
24. Work Sharing Constructs
A motivating example
Sequential code:
for(i=0;i<N;i++) { a[i] = a[i] + b[i]; }
OpenMP parallel region:
#pragma omp parallel
{
int id, i, Nthrds, istart, iend;
id = omp_get_thread_num();
Nthrds = omp_get_num_threads();
istart = id * N / Nthrds;
iend = (id+1) * N / Nthrds;
for(i=istart;i<iend;i++) { a[i] = a[i] + b[i]; }
}
OpenMP parallel region and a work-sharing for construct:
#pragma omp parallel
#pragma omp for schedule(static)
for(i=0;i<N;i++) { a[i] = a[i] + b[i]; }
25. Work-sharing Construct
Threads are assigned an independent set of iterations
Threads must wait at the end of the work-sharing construct (implicit barrier)
#pragma omp parallel
#pragma omp for
for(i = 1; i < 13; i++)
c[i] = a[i] + b[i];
(figure: the iterations i = 1 through i = 12 are divided among the threads of the team)
26. Combining pragmas
These two code segments are equivalent
#pragma omp parallel
{
#pragma omp for
for (i=0; i < MAX; i++) {
res[i] = huge();
}
}
#pragma omp parallel for
for (i=0; i < MAX; i++) {
res[i] = huge();
}
27. Types of Extents
Two types for the extent of a directive:
• static or lexical extent: the code
textually enclosed between the
beginning and the end of the
structured block following the
directive
• dynamic extent: static extent as well
as the procedures called from within
the static extent
28. Orphaned Directives
A directive which is in the dynamic
extent of another directive but not in
its static extent is said to be orphaned
• Work sharing directives can be
orphaned
• This allows a work-sharing construct
to occur in a subroutine which can be
called both by serial and parallel code,
improving modularity
29. Directive Binding
• Work sharing directives (do, for,
sections, and single) as well as
master and barrier bind to the
dynamically closest parallel directive,
if one exists, and have no effect when
they are not in the dynamic extent of a
parallel region
• The ordered directive binds to the
enclosing do or for directive having
the ordered clause
30. Directive Binding
• critical (and atomic) provide
mutual exclusive execution (and
update) with respect to all the
threads in the program
31. Data Scoping
• Work-sharing and parallel
directives accept data scoping clauses
• Scope clauses apply to the static
extent of the directive and to variables
passed as actual arguments
• The shared clause applied to a
variable means that all threads will
access the single copy of that variable
created in the master thread
32. Data Scoping
• The private clause applied to a
variable means that an uninitialized
private copy of the variable is created
for each thread
• Semi-private data for parallel loops:
– reduction: variable that is the target of a reduction
operation performed by the loop, e.g., sum
– firstprivate: initialize the private copy from the
value of the shared variable
– lastprivate: upon loop exit, master thread holds
the value seen by the thread assigned the last loop
iteration
33. Threadprivate Data
• The threadprivate directive is
associated with the declaration of a
static variable (C) or common block
(Fortran) and specifies persistent data
(spans parallel regions) cloned, but
not initialized, for each thread
• To guarantee persistence, the dynamic
threads feature must be disabled
setenv OMP_DYNAMIC FALSE
34. Threadprivate Data
• threadprivate data can be initialized
in a thread using the copyin clause
associated with the parallel,
parallel do/for, and parallel
sections directives
• the value stored in the master thread
is copied into each team thread
• Syntax: copyin (name [,name])
where name is a variable or (in Fortran)
a named common block
35. Scoping Rules
• Data declared outside a parallel
region is shared by default, except for
– loop index variable of parallel do
– data declared as threadprivate
• Local data in the dynamic extent of a
parallel region is private:
– subroutine local variables, and
– C/C++ blocks within a parallel region
36. Scoping Restrictions
• The private clause for a directive in
the dynamic extent of a parallel region
can be specified only for variables that
are shared in the enclosing parallel
region
– That is, a privatized variable cannot
be privatized again
• The shared clause is not allowed for
the DO (Fortran) or for (C) directive
37. Shared Data
• Access to shared data must be
mutually exclusive: a thread at a time
• For shared arrays, when different
threads access mutually exclusive
subscripts, synchronization is not
needed
• For shared scalars, critical sections or
atomic updates must be used
• Consistency operation: flush directive
38. Synchronization
Explicit, via directives:
• critical, implements the critical
sections, providing mutual exclusion
• atomic, implements atomic update of
a shared variable
• barrier, a thread waits at the point
where the directive is placed until all
other threads reach the barrier
39. Synchronization
• ordered, preserves the order of the
sequential execution; can occur at
most once inside a parallel loop
• flush, creates consistent view of
thread-visible data
• master, block in a parallel region that
is executed by the master thread and
skipped by the other threads; unlike
single, there is no implied barrier
40. Implicit Synchronization
• There is an implied barrier at the
end of a parallel region, and of a work-
sharing construct for which a nowait
clause is not specified
• A flush is implied by an explicit or
implicit barrier as well as upon
entry and exit of a critical or
ordered block
41. Directive Nesting
• A parallel directive can appear in
the dynamic extent of another
parallel, i.e., parallel regions can be
nested
• Work-sharing directives binding to the
same parallel directive cannot be
nested
• An ordered directive cannot appear
in the dynamic extent of a critical
directive
42. Directive Nesting
• A barrier or master directive
cannot appear in the dynamic extent of
a work-sharing region ( DO or for,
sections, and single) or ordered
block
• In addition, a barrier directive cannot
appear in the dynamic extent of a
critical or master block
44. Library Routines
OpenMP defines library routines that
can be divided in three categories
1. Query and set multithreading
• get/set number of threads or processors
omp_set_num_threads,
omp_get_num_threads,
omp_in_parallel, …
• get thread ID:
omp_get_thread_num
45. Library Routines
2. Set and get execution environment
• Inquire/set nested parallelism:
omp_get_nested
omp_set_nested
• Inquire/set dynamic number of threads in
different parallel regions:
omp_set_dynamic
omp_get_dynamic
46. Library Routines
3. API for manipulating locks
• A lock variable provides thread
synchronization, has C type omp_lock_t
and Fortran type integer*8, and holds
a 64-bit address
• Locking routines: omp_init_lock,
omp_set_lock, omp_unset_lock, …
Man pages: omp_threads, omp_lock
47. Reality Check
Irregular and ambiguous aspects are
sources of language- and
implementation dependent behavior:
• nowait clause is allowed at the
beginning of [parallel] for (C/C++)
but at the end of [parallel] DO
(Fortran)
• default clause can specify private
scope in Fortran, but not in C/C++
48. Reality Check
• Can only privatize full objects, not
array elements or fields of data
structures
• For a threadprivate variable or
block one cannot specify any clause
except for the copyin clause
• In MIPSpro 7.3.1 one cannot specify in
the same directive both the
firstprivate and lastprivate
clauses for a variable
49. Reality Check
• With MIPSpro 7.3.1, when a loop is
parallelized with the do (Fortran) or for
(C/C++) directive, the indexes of the nested
loops are, by default, private in Fortran, but
shared in C/C++
Probably, this is a compiler issue
• Fortunately, the compiler warns about
unsynchronized accesses to shared
variables
• This does not occur for parallel do or
parallel for
50. Compiling and Running
• Use MIPSpro with the option -mp both for
compiling and linking
(the default -MP:open_mp=ON must be in effect)
• Fortran:
f90 [-freeform] [-cpp] -mp prog.f
-freeform needed for free-form source
-cpp needed when using #ifdefs
• C/C++:
cc -mp -O3 prog.c
CC -mp -O3 prog.C
51. Setting the Number of Threads
• Environment variables:
setenv OMP_NUM_THREADS 8
if OMP_NUM_THREADS is not set, but
MP_SET_NUMTHREADS is set, the latter
defines the number of threads
• Environment variables can be
overridden by the programmer:
omp_set_num_threads(int n)
52. Scalable Speedup
• Most often the memory is the limit to
the performance of a shared memory
program
• On scalable architectures, the latency
and bandwidth of memory accesses
depend on the locality of accesses
• In achieving good speedup of a shared
memory program, data locality is an
essential element
53. What Determines Data Locality
• Initial data distribution determines on
which node the memory is placed
– first touch or round-robin system policies
– data distribution directives
– explicit page placement
• Work sharing, e.g., loop scheduling,
determines which thread accesses
which data
• Cache friendliness determines how
often main memory is accessed
54. Cache Friendliness
For both serial loops and parallel loops
• locality of references
– spatial locality: use adjacent cache lines and all
items in a cache line
– temporal locality: reuse same cache line; may
employ techniques such as cache blocking
• low cache contention
– avoid the sharing of cache lines among different
objects; may resort to array padding or increasing
the rank of an array
55. Cache Friendliness
• Contention is an issue specific to
parallel loops, e.g., false sharing of
cache lines
cache friendliness =
high locality of references
+
low contention
56. NUMA machines
• Memory hierarchies exist in single-CPU
computers and Symmetric
Multiprocessors (SMPs)
• Distributed shared memory (DSM)
machines based on Non-Uniform
Memory Architecture (NUMA) add
levels to the hierarchy:
– local memory has low latency
– remote memory has high latency
57. Origin2000 memory hierarchy
Level                          Latency (cycles)
register                       0
primary cache                  2–3
secondary cache                8–10
local main memory & TLB hit    75
remote main memory & TLB hit   250
main memory & TLB miss         2000
page fault                     10^6
58. Page Level Locality
• An ideal application has full page
locality: pages accessed by a
processor are on the same node as the
processor, and no page is accessed by
more than one processor (no page
sharing)
• Twofold benefit:
» low memory latency
» scalability of memory bandwidth
59. Page Level Locality
• The benefits brought about by page
locality are more important for
programs that are not cache friendly
• We look at several data placement
strategies for improving page locality
» system based placement
» data initialization and directives
» combination of system and program
directed data placement
60. Page sharing due to alignment
• Consider an array whose size is twice the
size of a page, and which is distributed
between two nodes
• Pages 1 and 2 are located on node 1,
page 3 is on node 2
• Page 2 is shared by the two processors,
because the array does not start on a
page boundary
(Figure: the array layout spans pages 1–3; processor 1 accesses the first half, processor 2 the second, and both touch page 2.)
61. Achieving Page Locality
IRIX has two page placement policies:
• first-touch: the process which first
references a virtual address causes that
address to be mapped to a page on the
node where the process runs
• round-robin: pages allocated to a job are
selected from nodes traversed in round-
robin order
• IRIX uses first-touch, unless
setenv _DSM_ROUND_ROBIN
62. Achieving Page Locality
IRIX allows pages to be migrated between
nodes, to adjust the page placement
• a page is migrated based on the affinity of
data accesses to that page, which is
derived at run-time from the per-process
cache-miss pattern
• page migration follows the page affinity
with a delay whose magnitude depends
on the aggressiveness of migration
63. Achieving Page Locality
• To enable data migration, except for
explicitly placed data
setenv _DSM_MIGRATION ON
• To enable migration of all data
setenv _DSM_MIGRATION ALL_ON
• To set the aggressiveness of migration
setenv _DSM_MIGRATION_LEVEL n
where n is an integer between 0 (least
aggressive, disables migration) and 100 (most
aggressive, the default)
64. Achieving Page Locality
Methods, from best to worst
• Parallel data initialization, using
OpenMP parallel work constructs such
as parallel do, combined with
operating system’s first-touch
placement policy
» works with heap, local, global arrays
» no data distribution directives needed
» can be used with page migration
65. Achieving Page Locality
• IRIX round-robin page placement
» improves application’s memory
bandwidth
» no change of code needed
» allows both serial and parallel
initialization of data in the program
66. Achieving Page Locality
• Regular distribution directive
» allows serial initialization of data
» data has same layout as in a serial
program
» page granularity of distribution
» cannot distribute heap allocated and
assumed-size arrays
67. Achieving Page Locality
• Page Migration:
» makes initial data placement less
important, e.g., allows sequential data
initialization
» improves locality of a computation
whose data access pattern changes
during the computations
» it is useful for programs that have
stable affinity for long time intervals
68. Achieving Page Locality
» page migration can be combined
with other techniques such as first-
touch or round-robin
» page migration is expensive
» page migration implements CPU
affinity with a delay
69. Reshaped Distribution
• Reshaped distribution directive
» no page granularity limitation
» data layout is most likely different
from the layout in a serial program
» code bloating: each routine that is
passed a reshaped parameter must
have a version specialized for handling
reshaped arrays
70. Reshaped Distribution
» cannot reshape initialized data, heap-
allocated and assumed-size arrays
» overhead of indirect addressing
» side effect: a global structure or
Fortran common block that contains a
reshaped array cannot be declared
threadprivate, and cannot be
localized with the -Wl,-Xlocal option
71. SGI Data Placement Directives
• Regular distribution
!$SGI distribute a(d1[,d2]) [onto(p1[,p2])]
#pragma distribute a[d1][[d2]] [onto(p1[,p2])]
• Reshaped distribution
!$SGI distribute_reshape a(d1[,d2]) [onto(p1[,p2])]
#pragma distribute_reshape a[d1][[d2]]
• Distribution methods are denoted by d1, d2
• Optional clause onto specifies a processor
grid n1 x n2, such that n1/n2 = p1/p2
72. Distribution Specification
Three distribution methods:
• * means no distribution along the
direction in which it appears
• block distributes the elements of an
array in p contiguous chunks of size
ceiling(N/p), where N is the extent in
the distributed direction and p is the
number of processors
73. Distribution Specification
• cyclic(k) distributes the elements of
an array in chunks of size k, assigned to
the p processors in round-robin fashion:
– the first processor gets elements 1..k,
p*k+1..(p+1)*k, … (in Fortran), or
0..k-1, p*k..(p+1)*k-1, … (in C/C++)
– the second processor gets the next
chunk of k elements, and so on
• Interleaved distribution is obtained for
k=1 (the default) and block-cyclic
distribution for k>1
74. Regular Distribution Tip
For regular distribution, one should
distribute the outermost dimension of an
array, to minimize the effect of page
granularity:
• Distribute columns in Fortran
!$SGI distribute A(*, block)
• Distribute rows in C/C++
#pragma distribute a(block,*)
75. Distributed Arrays as Formal Parameters
• Assumed size is not allowed for array
formal parameters which are declared
as distributed
• Specify the array size:
void foo(int n, double a[n])
{
#pragma distribute_reshape a(block)
…
}
76. Reshaped Array Pitfall
If a reshaped array is declared as
threadprivate, the compiler silently
ignores the threadprivate directive:
double a[n];
#pragma omp threadprivate(a)
#pragma distribute_reshape a(block)
77. Parallelizing Code
• Optimize single-CPU performance
– maximize cache reuse
– eliminate cache misses
– compiler flags: -LNO:cache_size2=4m
-OPT:IEEE_arithmetic=3 -Ofast=ip27
• Parallelize as high a fraction of the
work as possible
– preserve cache friendliness
78. Parallelizing Code
– avoid synchronization and scheduling
overhead: partition in few parallel regions,
avoid reduction, single and critical
sections, make the code loop fusion
friendly, use static scheduling
– partition work to achieve load balancing
• Check correctness of parallel code
– run OpenMP compiled code first on one
thread, then on several threads
79. Synchronization Overhead
• Parallel regions, work-sharing, and
synchronization incur overhead
• Edinburgh OpenMP Microbenchmarks,
version 1.0, by J. Mark Bull, are used
to measure the cost of synchronization
on a 32 processor Origin 2000, with
300 MHz R12000 processors, and
compiling the benchmarks with the
MIPSpro Fortran 90 compiler, version
7.3.1.1m
82. Insights
• cost (DO) ~ cost(barrier)
• cost (parallel DO) ~ 2 * cost(barrier)
• cost (parallel) > cost (parallel DO)
• atomic is less expensive than critical
• bad scalability for
– reduction
– mutual exclusion: critical, (un)lock
– single
83. Loop Parallelization
• Identify the loops that are bottleneck to
performance
• Parallelize the loops, and ensure that
– no data races are created
– cache friendliness is preserved
– page locality is achieved
– synchronization and scheduling
overheads are minimized
84. Hurdles to Loop
Parallelization
• Data dependencies among iterations
caused by shared variables
• Input/Output operations inside the loop
• Calls to thread-unsafe code, e.g., the
intrinsic function rtc
• Branches out of the loop
• Insufficient work in the loop body
• The MIPSpro auto-parallelizer helps in
identifying these hurdles
85. Auto-Parallelizer
• The MIPSpro auto-parallelizer (APO)
can be used both for automatically
parallelizing loops and for determining
the reasons which prevent a loop from
being parallelized
• The auto-parallelizer is activated using
command line option apo to the f90,
f77, cc, and CC compilers
• Other auto-parallelizer options: apo
list and mplist
86. Auto-Parallelizer
• Example:
f90 -apo list -mplist myprog.f
• apo list enables APO and generates
the file myprog.list which describes
which loops have been parallelized,
which have not, and why not
• mplist generates the parallelized
source program myprog.w2f.f
(myprog.w2c.c for C) equivalent to the
original code myprog.f
87. For More Information
About the Auto-Parallelizer
• The ProMP Parallel Analyzer View
product consists of the program cvpav
• Cvpav analyzes files created by
compiling with the option –apo keep
• Try out the tutorial examples:
cd /usr/demos/ProMP/omp_tutorial
make
cvpav -f omp_demo.f
88. Data Races
• Parallelizing a loop with data
dependencies causes data races:
unordered or interfering accesses by
multiple threads to shared variables,
which make the values of these
variables different from the values
assumed in a serial execution
• A program with data races produces
unpredictable results, which depend on
thread scheduling and speed.
89. Types of Data Dependencies
• Reduction operations:
const int n = 4096;
int a[n], i, sum=0;
for (i = 0; i < n; i++) {
sum += a[i];
}
– Easy to parallelize using reduction
variables
90. Types of Data Dependencies
– Auto-parallelizer is able to detect
reduction and parallelize it
const int n = 4096;
int a[n], i, sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < n; i++) {
sum += a[i];
}
91. Types of Data Dependencies
• Carried dependence on a shared
array, e.g., recurrence:
const int n = 4096;
int a[n], i;
for (i = 0; i < n-1; i++) {
a[i] = a[i+1];
}
– Non-trivial to eliminate, the auto-
parallelizer cannot do it
92. Parallelizing the Recurrence
Idea: Segregate even and odd indices
#define N 16384
int a[N], work[N];
// Save the even-indexed elements at the
// odd indices of work, before the even
// update below overwrites them
#pragma omp parallel for
for ( i = 2; i < N; i+=2 )
{
  work[i-1] = a[i];
}
// Update even indices from odd
#pragma omp parallel for
for ( i = 0; i < N-1; i+=2 )
{
  a[i] = a[i+1];
}
// Update odd indices with the saved even
// values; a[N-1] keeps its value, as in
// the serial loop
#pragma omp parallel for
for ( i = 1; i < N-1; i+=2 )
{
  a[i] = work[i];
}
93. Performing Reduction
The bad scalability of the reduction clause
affects its usefulness, e.g., bad speedup
when summing the elements of a matrix:
#define N (1<<12)
#define M 16
int i, j;
double a[N][M], sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++)
for (j = 0; j < M; j++)
sum += a[i][j];
94. Parallelizing the Sum
Idea: Use explicit partial sums and combine
them atomically
#define N (1<<12)
#define M 16
int main() {
  double a[N][M], sum = 0.0;
  #pragma distribute a[block][*]
  int i, j;
  // initialization of a not shown
  #pragma omp parallel private(i,j)
  {
    double mysum = 0.0;
    // compute the thread's partial sum
    #pragma omp for nowait
    for (i = 0; i < N; i++)
      for (j = 0; j < M; j++)
        mysum += a[i][j];
    // each thread adds its partial sum
    #pragma omp atomic
    sum += mysum;
  }
}
96. Loop Fusion
• Increases the work in the loop body
• Better serial programs: fusion
promotes software pipelining and
reduces the frequency of branches
• Better OpenMP programs: fusion
reduces synchronization and
scheduling overhead
– fewer parallel regions and work-
sharing constructs
97. Promoting Loop Fusion
• Loop fusion inhibited by statements
between loops which may have
dependencies with data accessed by
the loops
• Promote fusion: reorder the code to get
loops which are not separated by
statements creating data dependencies
• Use one parallel do construct for
several adjacent loops; may leave it to
the compiler to actually perform fusion
98. Fusion-friendly code
Unfriendly:
integer,parameter::n=4096
real :: sum, a(n)
do i=1,n
  a(i) = sqrt(dble(i*i+1))
enddo
sum = 0.d0
do i=1,n
  sum = sum + a(i)
enddo
Friendly (the loops are now adjacent and can be fused):
integer,parameter::n=4096
real :: sum, a(n)
sum = 0.d0
do i=1,n
  a(i) = sqrt(dble(i*i+1))
enddo
do i=1,n
  sum = sum + a(i)
enddo
99. Tradeoffs in Parallelization
• To increase parallel fraction of work
when parallelizing loops, it is best to
parallelize the outermost loop of a
nested loop
• However, doing so may require loop
transformations such as loop
interchanges, which can destroy cache
friendliness, e.g., defeat cache blocking
100. Tradeoffs in Parallelization
• Static loop scheduling in large chunks
per thread promotes cache and page
locality but may not achieve load
balancing
• Dynamic and interleaved scheduling
achieve good load balancing but cause
poor locality of data references
101. Tuning the Parallel Code
• Examine resource usage, e.g.,
execution time, number of floating
point operations, primary, secondary,
and TLB cache misses and identify
– the performance bottleneck
– the routines generating the
bottleneck
Useful SGI tools: perfex, ssrun, prof
• Correct the performance problem and
verify the desired speedup.
102. Investigating Data Placement
• In C, use the SGI function syssgi
#include <sys/syssgi.h>
ptrdiff_t syssgi (int request, …)
with a request value of SGI_PHYSP
• In Fortran, use the intrinsic function
integer dsm_home_threadnum
thread = dsm_home_threadnum(arr(i))
– see lab exercise
103. SGI Performance Tools
• MIPSpro compilers and libraries
• ProDev WorkShop (formerly CASE
Vision): cvd, cvperf, cvstatic
• ProMP: parallel analyzer: cvpav
• SpeedShop: profiling execution and
reporting profile data
• Perfex: per process event count statistics
• dprof: address space profiling
104. Performance Data
• Timing: time spent in various sections of
the program
• Events captured by the performance
counters of the R1X000 CPU.
- 32 events, divided in two equal sets
- Examples: clock cycles, L1 and L2 cache,
and TLB misses, floating point operations,
number of instructions
• I/O system calls, heap malloc and
free, floating point exceptions
105. Speedshop Profiling
• Speedshop is a tool package which
supports profiling at the function and
source-line level
• Uses several methods for collecting
information
– PC and call stack sampling
– basic block counting
– exception tracing
106. Perfex Tool
• Provides event statistics at the
process level
• Reports the number of occurrences of
the events captured by the R1X000
hardware counters in each process of
a parallel program
• In addition, reports information
derived from the event counts, e.g.
MFLOPS, memory bandwidth
107. Perfex Tool
perfex -mp [other options] a.out
• To profile secondary cache misses in the
data cache (event 26) and instruction
cache (event 10):
perfex -mp -e 26 -e 10 a.out
• To multiplex all 32 events (-a) , get time
estimates (-y) and trace exceptions (-x)
perfex -a -x -y a.out
108. Speedshop
• Data Collection
– ssrun main data collection tool.
Running it on a.out creates the files
a.out.experiment.mPID and a.out.experiment.pPID
– ssusage summary of resources
used, similar to the time commands
– ssapi API for caliper points
• Data Analysis
– prof
109. Ssrun sampling Experiments
Statistical sampling, triggered by a preset
time base or by overflow of hardware
counters
-pcsamp PC sampling gives user CPU
time
-usertime call stack sampling, gives
user and system CPU time
-totaltime call stack sampling, gives
walltime
110. Ssrun sampling
Sampling triggered by overflow of R1X000
hardware counters
-gi_hwc Graduated instructions
-gfp_hwc Floating point instructions
-ic_hwc Misses in L1 I-cache
-dc_hwc Misses in L1 D-cache
-dsc_hwc Data misses in L2 cache
-tlb_hwc TLB misses
-prof_hwc User selected event
111. User selected sampling
• Select a hardware counter, say
secondary cache misses (26), and an
overflow value
setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 26
setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 99
• Run the experiment
ssrun -prof_hwc a.out
• Default counter is L1 I-cache misses
(9) and default overflow is 2,053
112. Ssrun ideal and tracing
experiments
• Ideal Experiment: basic block counting
-ideal counts the number of times
each basic block is executed and
estimates the time. Descendant of pixie
• Tracing
-fpe floating point exceptions
-io file open, read, write, close
-heap malloc and free
113. Prof Tool
• Display event counts or time in routines
sorted in descending order of the counts
• Source line granularity with command line
option -h or -l
• For ideal and usertime experiments get
call hierarchy with -butterfly option
• For ideal experiment can get architecture
information with the -archinfo option
• Cut off report at top 100-p% with -quit p%
114. Address Space Profiling: dprof
• Gives per process histograms of page
accesses
• Sampling with a specified time base
– the current instruction is interrupted
– the address of the operand referenced by the
interrupted instruction is recorded
• Time base is either the interval timer or
an R1X000 hardware counter overflow
• R1X000 counters: man r10k_counters
115. Data Profiling: dprof
• Syntax
dprof [-hwpc [-cntr n] [-ovfl m]]
[-itimer [-ms t]] [-out profile_file]
a.out
• Default is interval timer ( -itimer )
with t=100 ms
• Can select hardware counter (-hwpc)
which has the defaults
n = 0 is the R1X000 cycle counter
m=10000 is the counter’s overflow value
116. The Future of OpenMP
• Data placement directives will
become part of OpenMP
– affinity scheduling may be a useful
feature
• It is desirable to add parallel
input/output to OpenMP
• Java binding of OpenMP
117. Image class
class Image {
public:
  short* mData;
  int mWidth, mHeight, mDepth;
  int mVoxelsPerSlice;
  int mVoxelsPerVolume;
  short** mSlicePointers; // Pointers to the start of each slice
  short getVoxel ( int x, int y, int z ) {...}
  void setVoxel ( int x, int y, int z, short v ) {...}
};
118. Threshold – OpenMP #1
void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
      for ( int x = 0; x < in->mWidth; x++ ) {
        if ( in->getVoxel(x,y,z) > 100 ) {
          out->setVoxel(x,y,z,1);
        } else {
          out->setVoxel(x,y,z,0);
        }
      }
    }
  }
}
// NB: can loop over slices, rows or columns by moving the
// pragma, but must choose at compile time
119. Threshold – OpenMP #2
void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) {
      out->mData[s] = 1;
    } else {
      out->mData[s] = 0;
    }
  }
}
// Likely a lot faster than the previous code
120. References
Lawrence Livermore National Laboratory
www.llnl.gov/computing/tutorials/workshops/workshop/openMP/MAIN.html
Ohio Supercomputing Center
oscinfo.osc.edu/training/openmp/big
Minnesota Supercomputing Institute
www.msi.umn.edu/tutorials/shared_tutorials/openMP
Edinburgh OpenMP Microbenchmarks
www.epcc.ed.ac.uk/research/openmpbench
Mattson and Eigenmann Tutorial
dynamo.ecn.purdue.edu/~eigenman/EE563/Handouts/BWsc00introOMP.pdf
Mattson and Eigenmann Advanced OpenMP
dynamo.ecn.purdue.edu/~eigenman/EE563/Handouts/BWsc00advancedOMP.pdf